A one-stop solution to build your recommendation models, train them, and deploy them in a privacy-preserving manner -- right on your users' devices.
We integrate the phenomenal work by OpenMined and FedML so you can easily explore new federated learning algorithms and deploy them into production.
The steps to building an awesome recommendation system:
- 🔩 Standard ML training: Pick up any ML model and benchmark it using BaseTrainer
- 🎮 Federated Learning Simulation: Once you are satisfied with your model, explore a host of FL algorithms with FederatedWorker
- 🏭 Industrial Deployment: After all the testing and simulation, deploy easily using PySyft from OpenMined
- 🚀 Edge Computing: Integrate with NimbleEdge to improve FL training times by over 100x
Let's train Facebook AI's DLRM on the edge. DLRM has been a standard baseline for neural-network-based recommendation models.
Download and extract the Kaggle Criteo dataset from Google Drive
mkdir criteo; cd criteo
gdown https://drive.google.com/file/d/17K5ntN30LbMWJ2gHHSkwCGHEcAjShm2_
tar -xf dac.tar.gz
Clone this repo and change the argument datafile in configs/dlrm.yml to the above path.
git clone https://github.com/NimbleEdge/RecoEdge
model :
  name : 'dlrm'
  ...
preproc :
  datafile : "<Path to Criteo>/criteo/train.txt"
Install the dependencies with conda or pip
conda create --name recsys python=3.8 --file conda_requirements.txt
conda activate recsys
Run data preprocessing with preprocess_data and supply the config file. You should be able to generate a per-day split from the entire dataset as well as a processed data file.
python preprocess_data.py --config configs/dlrm.yml --logdir $HOME/logs/kaggle_criteo/exp_1
Begin Training
python train.py --config configs/dlrm.yml --logdir $HOME/logs/kaggle_criteo/exp_3 --num_eval_batches 1000 --devices 0
Run TensorBoard to view the training loss and validation metrics at localhost:8888
tensorboard --logdir $HOME/logs/kaggle_criteo --port 8888
This section is still a work in progress. Reach out to us directly if you need help with FL deployment.
Now we will simulate DLRM in a federated setting. Create a data split to mimic your users. We use Dirichlet sampling to create non-IID datasets for the model, as sketched below.
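As a rough illustration of Dirichlet sampling, here is a minimal, self-contained sketch of partitioning a labelled dataset into non-IID user shards (the function name, signature, and alpha value are illustrative, not part of the library):

import numpy as np

def dirichlet_split(labels, num_users, alpha=0.5, seed=0):
    """Partition sample indices into non-IID shards, one per user.

    For every class, proportions drawn from Dirichlet(alpha) decide how
    that class's samples are divided among users; smaller alpha gives a
    more skewed (less IID) split.
    """
    rng = np.random.default_rng(seed)
    user_indices = [[] for _ in range(num_users)]
    for cls in np.unique(labels):
        cls_idx = np.where(labels == cls)[0]
        rng.shuffle(cls_idx)
        proportions = rng.dirichlet(alpha * np.ones(num_users))
        # Cut this class's samples at the cumulative proportion boundaries.
        cuts = (np.cumsum(proportions)[:-1] * len(cls_idx)).astype(int)
        for user, shard in enumerate(np.split(cls_idx, cuts)):
            user_indices[user].extend(shard.tolist())
    return user_indices

# e.g. two highly skewed "users" over a toy binary-labelled dataset
shards = dirichlet_split(np.array([0, 1] * 50), num_users=2, alpha=0.1)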
Adjust the parameters for distributed training like MPI in the config file
communications:
  gpu_map:
    host1: [0, 2]
    host2: [1, 0, 1]
    host3: [1, 1, 0, 1]
    host4: [0, 1, 0, 0, 0, 1, 0, 2]
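How to read the map is an assumption on our part: in the usual MPI-style convention, each list entry counts the processes pinned to the corresponding GPU index on that host. Under that reading, the map expands into one (host, gpu) slot per process:

def expand_gpu_map(gpu_map):
    """Flatten {host: [procs_on_gpu0, procs_on_gpu1, ...]} into an
    ordered list of (host, gpu_index) slots, one per worker process."""
    slots = []
    for host, procs_per_gpu in gpu_map.items():
        for gpu_index, num_procs in enumerate(procs_per_gpu):
            slots.extend([(host, gpu_index)] * num_procs)
    return slots

gpu_map = {
    'host1': [0, 2],            # two processes, both on host1's GPU 1
    'host2': [1, 0, 1],
    'host3': [1, 1, 0, 1],
    'host4': [0, 1, 0, 0, 0, 1, 0, 2],
}
print(expand_gpu_map(gpu_map))  # 11 slots in total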
Implement your own federated learning algorithm. In the demo we are using Federated Averaging. You just need to subclass FederatedWorker and implement the run() method.
import logging

import numpy as np

# registry, FederatedWorker, RandomContext and InvalidStateError are
# assumed to be imported from the fedrec package.


@registry.load('fl_algo', 'fed_avg')
class FedAvgWorker(FederatedWorker):
    def __init__(self, ...):
        super().__init__(...)

    async def run(self):
        '''
        `run()` updates the local model. Implement this method to determine
        how the roles interact with each other to produce the final updated
        model. For example, a worker which has both the `aggregator` and
        `trainer` roles might first train locally and then run a discounted
        `aggregate()` to get the final updated model.

        In the following example,
        1. The aggregator requests models from the trainers before
           aggregating and updating its model.
        2. The trainer responds to the aggregators' requests after updating
           its own model by local training.

        Since standard FL forces updates from the central entity before each
        cycle, trainers always start from the global (aggregator's) model.
        '''
        # A worker may hold several roles; act on each one in turn.
        for role in self.roles:
            assert role in ('aggregator', 'trainer'), InvalidStateError(
                "unknown role for worker")
            if role == 'aggregator':
                # client_num_per_round is assumed to come from the worker config.
                neighbours = await self.request_models_suspendable(
                    self.sample_neighbours(self.round_idx,
                                           self.client_num_per_round))
                weighted_params = self.aggregate(neighbours)
                self.update_model(weighted_params)
            elif role == 'trainer':
                # The aggregator acts as the central server in this case.
                aggregators = list(self.out_neighbours.values())
                global_models = await self.request_models_suspendable(aggregators)
                self.update_model(global_models[0])
                await self.train(model_dir=self.persistent_storage)
        self.round_idx += 1

    # Your aggregation strategy
    def aggregate(self, neighbour_ids):
        model_list = [
            (self.in_neighbours[id].sample_num, self.in_neighbours[id].model)
            for id in neighbour_ids
        ]
        # Weight every neighbour by its share of the total training samples.
        training_num = sum(sample_num for sample_num, _ in model_list)
        (_, averaged_params) = model_list[0]
        for k in averaged_params.keys():
            for i, (local_sample_number, local_model_params) in enumerate(model_list):
                w = local_sample_number / training_num
                if i == 0:
                    averaged_params[k] = local_model_params[k] * w
                else:
                    averaged_params[k] += local_model_params[k] * w
        return averaged_params

    # Your sampling strategy
    def sample_neighbours(self, round_idx, client_num_per_round):
        num_neighbours = len(self.in_neighbours)
        if num_neighbours <= client_num_per_round:
            selected_neighbours = list(self.in_neighbours)
        else:
            # Seeding with the round index keeps the sample reproducible
            # across workers in the same round.
            with RandomContext(round_idx):
                selected_neighbours = np.random.choice(
                    list(self.in_neighbours),
                    min(client_num_per_round, num_neighbours),
                    replace=False)
        logging.info("worker_indexes = %s" % str(selected_neighbours))
        return selected_neighbours
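To see the sample-weighted average in isolation, here is a toy, library-free check of the same arithmetic that aggregate() performs (plain dicts of floats stand in for model state):

# Toy stand-ins: (sample_count, {param_name: value}) per neighbour.
model_list = [
    (100, {'w': 1.0}),
    (300, {'w': 2.0}),
]
training_num = sum(n for n, _ in model_list)

averaged = {
    k: sum(n / training_num * params[k] for n, params in model_list)
    for k in model_list[0][1]
}

# 0.25 * 1.0 + 0.75 * 2.0
assert averaged['w'] == 1.75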
Begin the FL simulation by running:
mpirun -np 20 python -m mpi4py.futures train_fl.py --num_workers 1000
Deploy with PySyft
There are two ways to adjust training hyperparameters:
- Set values in configs/*.yml: persistent settings that are necessary for reproducibility, e.g. the randomization seed
- Pass them as CLI arguments: good for non-persistent and dynamic settings like the GPU device
In case of a conflict, the CLI argument supersedes the config file parameter. For further reference, check out the training config flags.
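For instance, you might keep the seed and architecture pinned in configs/dlrm.yml while choosing the GPU and evaluation size per run (the flag values here are just an example):

python train.py --config configs/dlrm.yml --logdir $HOME/logs/kaggle_criteo/exp_4 --num_eval_batches 500 --devices 1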
Any parameter needed to instantiate the PyTorch module can be supplied by simply creating a key-value pair in the config file. For example, DLRM requires arch_sparse_feature_size, arch_mlp_bot, etc.:
model:
  name : 'dlrm'
  arch_sparse_feature_size : 16
  arch_mlp_bot : [13, 512, 256, 64]
  arch_mlp_top : [367, 256, 1]
  arch_interaction_op : "dot"
  arch_interaction_itself : False
  sigmoid_bot : "relu"
  sigmoid_top : "sigmoid"
  loss_function: "mse"
Model architecture can only be changed via the configs/*.yml files. Every model declaration is tagged with an appropriate name and loaded into the registry.
@registry.load('model', '<model_name>')
class My_Model(torch.nn.Module):
    def __init__(self, num):
        ...
You can define your own modules and add them under fedrec/modules. Finally, set the name flag of the model tag in the config file:
model :
  name : "<model name>"
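The fedrec registry itself is not reproduced here, but a decorator-based registry generally works along these lines (a minimal sketch under our own names; construct() in particular is illustrative, not fedrec's actual API):

class Registry:
    def __init__(self):
        self._store = {}

    def load(self, kind, name):
        """Class decorator: file the class under the (kind, name) tag."""
        def decorator(cls):
            self._store[(kind, name)] = cls
            return cls
        return decorator

    def construct(self, kind, config):
        """Look up the class tagged by config['name'] and instantiate it
        with the remaining key-value pairs as constructor arguments."""
        params = dict(config)
        cls = self._store[(kind, params.pop('name'))]
        return cls(**params)

registry = Registry()

@registry.load('model', 'my_model')
class MyModel:
    def __init__(self, hidden_size):
        self.hidden_size = hidden_size

model = registry.construct('model', {'name': 'my_model', 'hidden_size': 64})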
- Star, fork, and clone the repo.
- Do your work.
- Push to your fork.
- Submit a PR to NimbleEdge/RecoEdge
We welcome you to our Slack for queries related to the library and contributions in general. See you there!