# FedDropoutAvg Tutorial using OpenFL Workflow Interface - PyTorch CIFAR10

*  This notebook provides an implementation of the __"FedDropoutAvg" algorithm__ __[[arXiv link]](https://arxiv.org/abs/2111.13230)__, together with the ResNet18 model with GroupNorm layers used in the paper. <br>

    * In a nutshell, FedDropoutAvg uses dropout mechanisms when aggregating the parameters of deep neural network models trained at different client sites into a federated model. 

    * Dropout is applied in two places: 
        1. __client selection__: random dropout of clients in each round of federated training,
        2. __federated averaging (aggregation)__: random dropout of parameters of the locally trained models when they are aggregated into a federated model.
    
    <br>
    
    * FedDropoutAvg is designed to mitigate the effects of heterogeneity in real-world multi-institutional histopathological datasets. This tutorial, however, uses a toy dataset (CIFAR10) and divides the data randomly between collaborators.  
    

<br>


* This tutorial is adapted from the OpenFL tutorial __["Workflow_Interface_101_MNIST.ipynb"](https://github.com/securefederatedai/openfl/blob/develop/openfl-tutorials/experimental/Workflow_Interface_101_MNIST.ipynb)__
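
To make the parameter-dropout idea above concrete, here is a minimal plain-Python sketch on flat lists of numbers. The function name `dropout_average` and the toy values are illustrative assumptions, not part of the paper or of OpenFL; the class defined later in this notebook does the same thing on full PyTorch `state_dict`s.

```python
import random

def dropout_average(client_params, client_weights, fdr, rng):
    """Average flat parameter lists, dropping each entry with probability `fdr`."""
    n_params = len(client_params[0])
    aggregated = []
    for j in range(n_params):
        # Keep client k's value for parameter j with probability 1 - fdr.
        kept = [k for k in range(len(client_params)) if rng.random() >= fdr]
        total = sum(client_weights[k] for k in kept)
        if total == 0:
            # Every client was dropped for this parameter: fall back to 0,
            # which is also what the notebook's implementation does.
            aggregated.append(0.0)
            continue
        # Renormalize the surviving weights so they sum to 1.
        aggregated.append(
            sum(client_weights[k] * client_params[k][j] for k in kept) / total
        )
    return aggregated

rng = random.Random(0)
params = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]  # two clients, three parameters each
weights = [0.5, 0.5]

print(dropout_average(params, weights, fdr=0.0, rng=rng))  # [2.0, 3.0, 4.0]
print(dropout_average(params, weights, fdr=0.3, rng=rng))  # some entries dropped
```

With `fdr=0.0` nothing is dropped and the result is the ordinary weighted mean, which is why FedDropoutAvg with both rates set to zero behaves like FedAvg.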



## Getting Started 
First, we install the dependencies needed for the Workflow Interface.

In [1]:
!pip install git+https://github.com/intel/openfl.git
!pip install -r requirements_workflow_interface.txt

# Uncomment this if running in Google Colab
#!pip install -r https://raw.githubusercontent.com/intel/openfl/develop/openfl-tutorials/experimental/requirements_workflow_interface.txt
#import os
#os.environ["USERNAME"] = "colab"

Collecting git+https://github.com/intel/openfl.git
  Cloning https://github.com/intel/openfl.git to /tmp/pip-req-build-dv8v4swi
  Running command git clone --filter=blob:none --quiet https://github.com/intel/openfl.git /tmp/pip-req-build-dv8v4swi
  Resolved https://github.com/intel/openfl.git to commit ed501ebbd6ffab6d10b4347a8c34369564a373b2
  Preparing metadata (setup.py) ... done


## Defining the dataloaders, model, optimizer, helper functions, and the _`cdr` (client dropout rate)_ and _`fdr` (federated dropout rate)_ parameters used by FedDropoutAvg

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torchvision import models
import numpy as np

n_rounds = 2 # number of rounds
batch_size_train = 256
batch_size_test = 1000
learning_rate = 0.01
momentum = 0.5
log_interval = 1

# FedDropoutAvg parameters; with fdr == 0 and cdr == 0 the algorithm is the same as FedAvg
fdr = 0.3 # federated dropout rate
cdr = 0.2 # client dropout rate

random_seed = 1
torch.backends.cudnn.enabled = True 
torch.manual_seed(random_seed)




transforms_train = torchvision.transforms.Compose([torchvision.transforms.RandomHorizontalFlip(),
                                        torchvision.transforms.ToTensor(),
                                        torchvision.transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
                                      ])

transforms_test = torchvision.transforms.Compose([ torchvision.transforms.ToTensor(),
                                       torchvision.transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
                                     ])


cifar10_train = torchvision.datasets.CIFAR10('files/', train=True, download=True, transform= transforms_train ) 
cifar10_test = torchvision.datasets.CIFAR10('files/', train=False, download=True, transform=transforms_test)
 

class GroupNorm32(nn.GroupNorm):
    def __init__(self, num_channels, num_groups=32, **kargs):
        super().__init__(num_groups, num_channels, **kargs)
        

class ResNet18(nn.Module):
    def __init__(self, norm = 'gn'):
        # Default norm layer is GroupNorm32. If norm == 'bn', BatchNorm2d is used instead (it does not perform well with FL).
        super(ResNet18, self).__init__()
       
        if norm == 'gn':
            norm_layer = GroupNorm32
        elif norm == 'bn':
            norm_layer = nn.BatchNorm2d
        else:
            # Fail early instead of hitting an undefined norm_layer below
            raise ValueError("norm must be 'gn' or 'bn'")

        self.model_ft = models.resnet18(pretrained = False, norm_layer = norm_layer, num_classes = 10)
             
        self.model_ft = nn.Sequential(self.model_ft)

    def forward(self, x):
        x = self.model_ft(x)
        return F.log_softmax(x, dim=1)  # dim=1 is the class dimension; omitting it triggers a warning



    
def inference(network,test_loader):
    network.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
      for data, target in test_loader:
        output = network(data)
        test_loss += F.nll_loss(output, target, reduction='sum').item()  # reduction='sum' replaces the deprecated size_average=False
        pred = output.data.max(1, keepdim=True)[1]
        correct += pred.eq(target.data.view_as(pred)).sum()
    test_loss /= len(test_loader.dataset)
    print('\nTest set: Avg. loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
      test_loss, correct, len(test_loader.dataset),
      100. * correct / len(test_loader.dataset)))
    accuracy = float(correct / len(test_loader.dataset))
    return accuracy



Files already downloaded and verified
Files already downloaded and verified
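
The `inference` helper above sums the negative log-likelihood of every test sample, divides by the dataset size, and counts how often the arg-max class matches the target. The sketch below reproduces that bookkeeping in plain Python on three made-up samples. The names `log_softmax` and `evaluate` and the toy logits are assumptions for illustration only; they are not the PyTorch functions used in the cell.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of class scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]

def evaluate(batch_logits, targets):
    """Return (mean NLL, accuracy), accumulated the way `inference` does."""
    total_loss, correct = 0.0, 0
    for logits, target in zip(batch_logits, targets):
        log_probs = log_softmax(logits)
        total_loss += -log_probs[target]                # summed, not averaged, per batch
        predicted = log_probs.index(max(log_probs))     # arg-max class
        correct += int(predicted == target)
    return total_loss / len(targets), correct / len(targets)

logits = [[2.0, 0.5, 0.1], [0.2, 0.3, 3.0], [1.0, 1.5, 0.0]]
targets = [0, 2, 0]
loss, acc = evaluate(logits, targets)
print(f"loss={loss:.4f}, accuracy={acc:.2f}")
```

Two of the three toy samples are classified correctly, so the reported accuracy is 2/3.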


## Implementation of the FedDropoutAvg class

In [3]:
import copy
#from copy import deepcopy


# FedAvg Algo from the original tutorial
def FedAvg(models, weights=None):
    new_model = models[0]
    state_dicts = [model.state_dict() for model in models]
    state_dict = new_model.state_dict()
    for key in state_dicts[0]:  # models[0], so that a single-model list also works
        state_dict[key] = torch.from_numpy(np.average([state[key].numpy() for state in state_dicts],
                                                      axis=0, 
                                                      weights=weights))
    new_model.load_state_dict(state_dict)
    return new_model



# FedDropoutAvg class, implementing random client selection and model aggregation with dropout. 
class FedDropoutAvg():   
    def __init__(self, workers_dataset_sizes=None, fdr=0.3, cdr=0.2):
        
        self.workers_dataset_sizes = workers_dataset_sizes
        self.simple_average = (workers_dataset_sizes is None) # Simple unweighted average
        self.fdr = fdr # federated dropout rate
        self.cdr = cdr # client dropout rate
        print('* fed_drop_avg init *')
        print("workers_dataset_sizes : {}".format(workers_dataset_sizes))
        print()
        
    def get_fed_avg_weights(self, selected_worker_ids):
        
        n_clients = len(selected_worker_ids)
        
        if(self.simple_average): 
            # No dataset sizes were given, so every selected client contributes equally
            fed_avg_weights = [1 /n_clients for r in range(n_clients)]
        else: # Weighted according to number of samples
            size_list = [self.workers_dataset_sizes[id] for id in selected_worker_ids]
            total_data_points = np.asarray(size_list).sum()
            fed_avg_weights = [size_list[r] / total_data_points for r in range(n_clients)]


        print('* get_fed_avg_weights *')
        print("FedAvg Weights: {}".format(fed_avg_weights))
        
        return fed_avg_weights  
        
    def aggregate(self, models, selected_worker_ids): # Updates model using state_dicts

        print("FedDropoutAvg aggregation step. # of models to aggregate = ", len(models))
        new_model = models[0]
        state_dicts = [model.state_dict() for model in models]

        dr_rate = self.fdr
        new_state_dict = {}

        fed_avg_weights = self.get_fed_avg_weights(selected_worker_ids) # contribution weights 

        keys = state_dicts[0].keys()

        for key in keys:
            curr_shape = state_dicts[0][key].shape
            selection_shape = np.asarray(list(curr_shape) + [len(state_dicts)])
            selection_arr = (np.random.random(selection_shape) >= dr_rate).astype(int) 
            #print('selection_arr : ', selection_arr.shape)
            #print(fed_avg_weights)

            curr_sum = np.asarray([fed_avg_weights[i] * selection_arr[...,i] for i in range(len(state_dicts))])
            curr_sum = sum(curr_sum)

            for r in range(len(state_dicts)):

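                # Recalculating the contribution weights for each parameter of each model after parameter dropout.
                # Where every model was dropped, curr_sum is 0 and the division yields NaN; the warning is
                # silenced here and those entries are zeroed below.
                with np.errstate(divide='ignore', invalid='ignore'):
                    curr_weights = (selection_arr[...,r] * fed_avg_weights[r] / curr_sum)

                curr_weights = np.asarray(curr_weights) # for some cases (i.e., with bn layers) where 'curr_weights' becomes a 'numpy.float64' object
                curr_weights[np.isnan(curr_weights)] = 0 # for rare cases where curr_sum was 0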
                
                if(key not in new_state_dict.keys()):     
                    new_state_dict[key] = copy.deepcopy(state_dicts[r][key]) * curr_weights
                else:
                    new_state_dict[key] += copy.deepcopy(state_dicts[r][key]) * curr_weights

        # Load new model weights
        new_model.load_state_dict(new_state_dict)
        return new_model


    def select_random_clients(self, worker_ids):

        # Random worker (collaborator) selection for the round 
        # Uses random choice, so always same number of clients each round

        num_selected = int((1-self.cdr)*len(worker_ids))
        if(num_selected == 0):
            print("ERR: num_selected == 0")
            return None
        selected_workers_this_round = np.concatenate([np.ones(num_selected, dtype=bool), np.zeros(len(worker_ids) - num_selected, dtype=bool)])
        np.random.shuffle(selected_workers_this_round)

        selected_worker_ids_this_round = []
        for ind in range(len(worker_ids)):
            if(selected_workers_this_round[ind]):
                selected_worker_ids_this_round += [worker_ids[ind]]

        print()
        print('client_dropout_rate = ', self.cdr)
        print('selected_workers_this_round = ', selected_workers_this_round)
        print('selected_worker_ids_this_round = ', selected_worker_ids_this_round)
        print()

        return selected_worker_ids_this_round

        # # # # choice updated
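
Read as a formula, `aggregate` computes the following for each scalar parameter $j$. The symbols are introduced here for illustration and are not taken from the notebook: $K$ is the number of models passed in, $\theta_{k,j}$ is parameter $j$ of model $k$, $w_k$ is the corresponding entry of `fed_avg_weights`, and $m_{k,j}$ is the corresponding entry of `selection_arr`.

$$
\theta^{\text{fed}}_{j} = \frac{\sum_{k=1}^{K} m_{k,j}\, w_k\, \theta_{k,j}}{\sum_{k=1}^{K} m_{k,j}\, w_k},
\qquad m_{k,j} \sim \mathrm{Bernoulli}(1 - \texttt{fdr})
$$

If every mask entry for a parameter is zero, the denominator vanishes and the code sets $\theta^{\text{fed}}_{j} = 0$. With `fdr = 0` every $m_{k,j}$ equals 1 and the expression reduces to the usual weighted average.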

Next we import the `FLSpec`, `LocalRuntime`, and placement decorators.

- `FLSpec` – Defines the flow specification. User-defined flows are subclasses of this.
- `Runtime` – Defines where the flow runs and provides the infrastructure for task transitions (how information gets sent). The `LocalRuntime` runs the flow on a single node.
- `aggregator` / `collaborator` – Placement decorators that define where each task is assigned.

In [4]:
from openfl.experimental.interface import FLSpec, Aggregator, Collaborator
from openfl.experimental.runtime import LocalRuntime
from openfl.experimental.placement import aggregator, collaborator

* Now we come to the flow definition. The OpenFL Workflow Interface adopts the conventions set by Metaflow: every workflow begins with the `start` task and concludes with the `end` task. The aggregator begins with an optionally passed-in model and optimizer. It starts the flow with the `start` task, where the list of collaborators is extracted and then used as the list of participants for the task named in `self.next`, `aggregated_model_validation`. The model, optimizer, and anything else not explicitly excluded in the `self.next` call are passed from `start` on the aggregator to `aggregated_model_validation` on the collaborator. Where each task runs is determined by the placement decorator that precedes its definition (`@aggregator` or `@collaborator`). Once each collaborator (defined in the runtime) completes `aggregated_model_validation`, it passes its current state on to `train`, then from `train` to `local_model_validation`, and finally to `join` at the aggregator. In `join` the model weights are averaged, and the next round can begin.

* In the __`FederatedDropoutAvgFlow`__ defined here:
    * In the `start` task (at the start of the flow) and in the `join` task (at the end of each round), a random subset of collaborators is selected for the next round from `self.runtime.collaborators` using the `FedDropoutAvg.select_random_clients` method. As a result, not every collaborator participates in training in every round.
    * In the `join` task, model aggregation is done using the `FedDropoutAvg.aggregate` method.
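
The per-round client selection described above can be sketched in plain Python as follows. This is a simplified stand-in for `select_random_clients`, not the method itself: the name `select_clients` and the client labels are made up, and it raises an error where the notebook's method prints a message and returns `None`.

```python
import random

def select_clients(client_ids, cdr, rng):
    """Keep int((1 - cdr) * n) clients, chosen uniformly at random, in their original order."""
    num_selected = int((1 - cdr) * len(client_ids))
    if num_selected == 0:
        raise ValueError("cdr is too high: no clients would be selected")
    chosen = set(rng.sample(range(len(client_ids)), num_selected))
    return [cid for i, cid in enumerate(client_ids) if i in chosen]

rng = random.Random(0)
clients = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
for round_idx in range(3):
    # A fresh subset of 8 out of 10 clients is drawn every round.
    print(round_idx, select_clients(clients, cdr=0.2, rng=rng))
```

Because the subset size is fixed by `cdr`, the number of participants is the same in every round; only their identity changes.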


In [5]:
class FederatedDropoutAvgFlow(FLSpec):

    def __init__(self, model = None, optimizer = None, rounds=3, fdr=0.3, cdr=0.2, train_set_sizes=None, **kwargs):
        super().__init__(**kwargs)
        if model is not None:
            self.model = model
            self.optimizer = optimizer
        else:
            self.model = ResNet18(norm = 'gn') 
            self.optimizer = optim.SGD(self.model.parameters(), lr=learning_rate,
                                   momentum=momentum)
        
        self.rounds = rounds
        
        #FedDropoutAvg 
        self.FDRaggregator = FedDropoutAvg(workers_dataset_sizes=train_set_sizes, fdr=fdr, cdr=cdr)

    @aggregator
    def start(self):
        print(f'Performing initialization for model')

        # FedDropoutAvg random collaborator selection for the first round
        self.collaborators = self.FDRaggregator.select_random_clients(self.runtime.collaborators) 

        self.private = 10
        self.current_round = 0
        self.next(self.aggregated_model_validation,foreach='collaborators',exclude=['private']) #

    @collaborator
    def aggregated_model_validation(self):
        print(f'Performing aggregated model validation for collaborator {self.input}')
        self.agg_validation_score = inference(self.model,self.test_loader)
        print(f'{self.input} value of {self.agg_validation_score}')
        self.next(self.train)

    @collaborator
    def train(self):
        self.model.train()
        self.optimizer = optim.SGD(self.model.parameters(), lr=learning_rate,
                                   momentum=momentum)
        train_losses = []
        for batch_idx, (data, target) in enumerate(self.train_loader):
          self.optimizer.zero_grad()
          output = self.model(data)
          loss = F.nll_loss(output, target)
          loss.backward()
          self.optimizer.step()
          if batch_idx % log_interval == 0:
            print('Train Epoch: 1 [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
               batch_idx * len(data), len(self.train_loader.dataset),
              100. * batch_idx / len(self.train_loader), loss.item()))
            self.loss = loss.item()
            torch.save(self.model.state_dict(), 'model.pth')
            torch.save(self.optimizer.state_dict(), 'optimizer.pth')
        self.training_completed = True
        self.next(self.local_model_validation)

    @collaborator
    def local_model_validation(self):
        self.local_validation_score = inference(self.model,self.test_loader)
        print(f'Doing local model validation for collaborator {self.input}: {self.local_validation_score}')
        self.next(self.join, exclude=['training_completed'])

    @aggregator
    def join(self,inputs):
        self.average_loss = sum(input.loss for input in inputs)/len(inputs)
        self.aggregated_model_accuracy = sum(input.agg_validation_score for input in inputs)/len(inputs)
        self.local_model_accuracy = sum(input.local_validation_score for input in inputs)/len(inputs)
        
        print(f'\n* Ending round = {self.current_round}')
        print(f'Average aggregated model validation values = {self.aggregated_model_accuracy}')
        print(f'Average training loss = {self.average_loss}')
        print(f'Average local model validation values = {self.local_model_accuracy}')
        
        models = [input.model for input in inputs]
        
        #self.model = FedAvg(models)
        self.model = self.FDRaggregator.aggregate(models, self.collaborators)

        self.optimizer = [input.optimizer for input in inputs][0]
        self.current_round += 1

        if self.current_round < self.rounds: 
            # FedDropoutAvg random collaborator selection for the next round
            self.collaborators = self.FDRaggregator.select_random_clients(self.runtime.collaborators) 
            self.next(self.aggregated_model_validation, foreach='collaborators', exclude=['private'])
        else:
            self.next(self.end)
        
    @aggregator
    def end(self):
        print(f'This is the end of the flow')  

Aggregator step "start" registered
Collaborator step "aggregated_model_validation" registered
Collaborator step "train" registered
Collaborator step "local_model_validation" registered
Aggregator step "join" registered
Aggregator step "end" registered


Below, we segment shards of the CIFAR10 dataset for **ten collaborators**. Each has their own slice of the dataset that's accessible via the `train_loader` or `test_loader` attribute.

In [6]:
# Setup participants
aggregator = Aggregator()
aggregator.private_attributes = {}

# Setup collaborators with private attributes
collaborator_names = ['Portland', 'Seattle', 'Tokyo', 'New York', 'Mumbai', 'Budapest', 'Vienna', 'London', 'York', 'Istanbul'] 

collaborators = [Collaborator(name=name) for name in collaborator_names]
train_set_sizes = {} 

for idx, collaborator in enumerate(collaborators):
    local_train = copy.deepcopy(cifar10_train)
    local_test = copy.deepcopy(cifar10_test)

    local_train.data = cifar10_train.data[idx::len(collaborators)]
    local_train.targets = cifar10_train.targets[idx::len(collaborators)]
    train_set_sizes[collaborator_names[idx]] = len(local_train.data)

    local_test.data = cifar10_test.data[idx::len(collaborators)]
    local_test.targets = cifar10_test.targets[idx::len(collaborators)]
    collaborator.private_attributes = {
            'train_loader': torch.utils.data.DataLoader(local_train,batch_size=batch_size_train, shuffle=True),
            'test_loader': torch.utils.data.DataLoader(local_test,batch_size=batch_size_test, shuffle=True)  # use the test batch size defined earlier
    }

local_runtime = LocalRuntime(aggregator=aggregator, collaborators=collaborators, backend='single_process')
print(f'Local runtime collaborators = {local_runtime.collaborators}')
print(f'train_set_sizes = {train_set_sizes}')

Local runtime collaborators = ['Portland', 'Seattle', 'Tokyo', 'New York', 'Mumbai', 'Budapest', 'Vienna', 'London', 'York', 'Istanbul']
train_set_sizes = {'Portland': 5000, 'Seattle': 5000, 'Tokyo': 5000, 'New York': 5000, 'Mumbai': 5000, 'Budapest': 5000, 'Vienna': 5000, 'London': 5000, 'York': 5000, 'Istanbul': 5000}
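
The strided slice `data[idx::len(collaborators)]` used above gives each collaborator every tenth sample, starting at its own offset. A small stand-alone sketch of that pattern, using a list of integers as a stand-in for the 50,000 CIFAR10 training images (the helper name `shard` is hypothetical):

```python
def shard(items, n_shards):
    """Split `items` into `n_shards` interleaved slices, as `items[idx::n_shards]` does."""
    return [items[idx::n_shards] for idx in range(n_shards)]

samples = list(range(50_000))        # stand-in for the CIFAR10 training set
shards = shard(samples, 10)

print([len(s) for s in shards])      # ten shards of 5000 samples each
print(shards[0][:3], shards[1][:3])  # [0, 10, 20] [1, 11, 21]
```

The shards are disjoint and together cover the whole dataset, which matches the `train_set_sizes` of 5000 per collaborator printed above.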


Now that we have our flow and runtime defined, let's run the experiment! 

In [7]:
model = None
best_model = None
optimizer = None

flflow = FederatedDropoutAvgFlow(model,optimizer,rounds=n_rounds,fdr=fdr,cdr=cdr,train_set_sizes=train_set_sizes,checkpoint=True)

flflow.runtime = local_runtime
flflow.run()

* fed_drop_avg init *
workers_dataset_sizes : {'Portland': 5000, 'Seattle': 5000, 'Tokyo': 5000, 'New York': 5000, 'Mumbai': 5000, 'Budapest': 5000, 'Vienna': 5000, 'London': 5000, 'York': 5000, 'Istanbul': 5000}





Created flow FederatedDropoutAvgFlow

Calling start
Performing initialization for model

client_dropout_rate =  0.2
selected_workers_this_round =  [ True  True  True  True  True False  True  True  True False]
selected_worker_ids_this_round =  ['Portland', 'Seattle', 'Tokyo', 'New York', 'Mumbai', 'Vienna', 'London', 'York']

Saving data artifacts for start
Saved data artifacts for start
Sending state from aggregator to collaborators

Calling aggregated_model_validation
Performing aggregated model validation for collaborator Portland





Test set: Avg. loss: 2.5109, Accuracy: 95/1000 (10%)

Portland value of 0.0949999988079071
Saving data artifacts for aggregated_model_validation
Saved data artifacts for aggregated_model_validation

Calling train
Saving data artifacts for train
Saved data artifacts for train

Calling local_model_validation

Test set: Avg. loss: 2.1073, Accuracy: 205/1000 (20%)

Doing local model validation for collaborator Portland: 0.20499999821186066
Saving data artifacts for local_model_validation
Saved data artifacts for local_model_validation
Should transfer from local_model_validation to join

Calling aggregated_model_validation
Performing aggregated model validation for collaborator Seattle

Test set: Avg. loss: 2.5109, Accuracy: 94/1000 (9%)

Seattle value of 0.09399999678134918
Saving data artifacts for aggregated_model_validation
Saved data artifacts for aggregated_model_validation

Calling train
Saving data artifacts for train
Saved data artifacts for train

Calling local_model_validation






client_dropout_rate =  0.2
selected_workers_this_round =  [False  True  True  True False  True  True  True  True  True]
selected_worker_ids_this_round =  ['Seattle', 'Tokyo', 'New York', 'Budapest', 'Vienna', 'London', 'York', 'Istanbul']

Saving data artifacts for join
Saved data artifacts for join
Sending state from aggregator to collaborators

Calling aggregated_model_validation
Performing aggregated model validation for collaborator Seattle

Test set: Avg. loss: 2.0144, Accuracy: 279/1000 (28%)

Seattle value of 0.27900001406669617
Saving data artifacts for aggregated_model_validation
Saved data artifacts for aggregated_model_validation

Calling train
Saving data artifacts for train
Saved data artifacts for train

Calling local_model_validation

Test set: Avg. loss: 2.0156, Accuracy: 248/1000 (25%)

Doing local model validation for collaborator Seattle: 0.24799999594688416
Saving data artifacts for local_model_validation
Saved data artifacts for local_model_validation
Should trans

In [8]:
# the collaborators from the last round:
flflow.collaborators

['Seattle',
 'Tokyo',
 'New York',
 'Budapest',
 'Vienna',
 'London',
 'York',
 'Istanbul']

In [9]:
# All collaborators available in runtime:
flflow.runtime.collaborators

['Portland',
 'Seattle',
 'Tokyo',
 'New York',
 'Mumbai',
 'Budapest',
 'Vienna',
 'London',
 'York',
 'Istanbul']

Now that the flow has completed, we can retrieve the final model and any other aggregator attributes from the flow object.

Let's print the final aggregated model accuracy:

In [10]:
print(f'\nFinal aggregated model accuracy for {flflow.rounds} rounds of training: {flflow.aggregated_model_accuracy}')


Final aggregated model accuracy for 2 rounds of training: 0.27787499874830246


## This is the end of the FedDropoutAvg tutorial. 

## Feel free to change the _`cdr` (client dropout rate)_ and _`fdr` (federated dropout rate)_ parameters of the algorithm, and/or try it on different datasets.