Federated Learning with Heterogeneous Clients and Trusted FedAvg using PyTorch and PySyft on DNS traffic datasets.

Trusted FedAvg paper: https://arxiv.org/pdf/2104.07853.pdf

Heterogeneous clients paper: https://arxiv.org/pdf/2010.01264.pdf

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Add libraries and load data

In [None]:
!pip install syft==0.2.9

In [None]:
import copy
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import torch
import torch.nn as nn
import torch.optim as optim
import torch.optim.lr_scheduler as sched
from torch.nn import BCELoss
import torch.utils.data as tud
from statistics import median, mean
import syft as sy

# hook PyTorch to PySyft, i.e. add extra functionalities to support Federated Learning and other private AI tools
hook = sy.TorchHook(torch)

In [None]:
# define number of clients
num_of_clients = 7

In [None]:
# load clients' datasets and test sets files
clients_datasets = []
booters_tests = []

for i in range(num_of_clients):
    clients_datasets.append(pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Diploma thesis preprocessed datasets/client' + str(i+1) + '.csv').astype('float32'))

for i in range(7):
    booters_tests.append(pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Diploma thesis preprocessed datasets/booter_test' + str(i+1) + '.csv').astype('float32'))

general_benign_test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Diploma thesis preprocessed datasets/general_benign_test.csv').astype('float32')

## Building and normalizing the test sets

Make test sets to compare the performance of each client on each attack and on general benign traffic with and without federated learning.

In [None]:
nonfed_attacks = [copy.deepcopy(booters_tests) for i in range(num_of_clients)]
nonfed_benign = [copy.copy(general_benign_test) for i in range(num_of_clients)]
fed_attacks = copy.deepcopy(booters_tests)
fed_benign = copy.copy(general_benign_test)

Without-FL test sets need to have the same features as the clients' datasets they correspond to.

In [None]:
for client in range(num_of_clients):
    to_drop = []
    for feature in general_benign_test.columns:
        if feature not in clients_datasets[client].columns:
            to_drop.append(feature)
    for booter in range(7):
        nonfed_attacks[client][booter].drop(columns=to_drop, inplace=True)
    nonfed_benign[client].drop(columns=to_drop, inplace=True)

In [None]:
nonfed_attacks[0][0].head()

Unnamed: 0,ip.len,udp.length,dns.flags.recdesired,dns.flags.recavail,dns.count.answers,dns.count.auth_rr,dns.count.add_rr,dns.qry.name,dns.qry.type,target
0,1054.0,1034.0,1.0,1.0,14.0,13.0,23.0,61960000.0,255.0,1.0
1,1054.0,1034.0,1.0,1.0,14.0,13.0,23.0,61960000.0,255.0,1.0
2,1054.0,1034.0,1.0,1.0,14.0,13.0,23.0,61960000.0,255.0,1.0
3,1500.0,2139.0,1.0,1.0,21.0,13.0,23.0,61960000.0,255.0,1.0
4,1054.0,1034.0,1.0,1.0,14.0,13.0,23.0,61960000.0,255.0,1.0


In [None]:
nonfed_benign[0].head()

Unnamed: 0,ip.len,udp.length,dns.flags.recdesired,dns.flags.recavail,dns.count.answers,dns.count.auth_rr,dns.count.add_rr,dns.qry.name,dns.qry.type,target
0,121.0,101.0,1.0,1.0,1.0,0.0,0.0,70249224.0,12.0,0.0
1,126.0,106.0,0.0,0.0,1.0,0.0,1.0,64076216.0,1.0,0.0
2,197.0,177.0,1.0,1.0,4.0,0.0,0.0,48069392.0,1.0,0.0
3,581.0,561.0,0.0,0.0,0.0,4.0,1.0,65501532.0,43.0,0.0
4,124.0,104.0,1.0,1.0,1.0,0.0,0.0,84075200.0,12.0,0.0


All test and train sets need to be normalized. Each client normalizes its own train set privately, and the scaler fitted on that train set also normalizes the corresponding without-FL test sets. The with-FL test sets, however, must be normalized with a scaler aggregated from all clients, so each client sends the per-feature minimum and maximum of its scaler to the central entity of the federated learning. Each client contributes to the aggregated scaler only for the features it kept after the feature selection step.

In [None]:
scaling_results = []
for client in range(num_of_clients):
    scaler = MinMaxScaler()
    clients_datasets[client].iloc[:, :-1] = scaler.fit_transform(clients_datasets[client].iloc[:, :-1])
    # use trainset-fitted scaler for without-FL test sets
    for booter in range(7):
        nonfed_attacks[client][booter].iloc[:, :-1] = scaler.transform(nonfed_attacks[client][booter].iloc[:, :-1])
    nonfed_benign[client].iloc[:, :-1] = scaler.transform(nonfed_benign[client].iloc[:, :-1])
    # each client sends a dictionary with keys being features names
    # and values being tuples of min and max of features 
    scaling_results.append(dict(zip(clients_datasets[client].columns[:-1], zip(scaler.data_min_, scaler.data_max_))))

In [None]:
clients_datasets[0].head()

Unnamed: 0,ip.len,udp.length,dns.flags.recdesired,dns.flags.recavail,dns.count.answers,dns.count.auth_rr,dns.count.add_rr,dns.qry.name,dns.qry.type,target
0,0.353265,0.180414,0.0,0.0,0.0,0.210526,0.025,0.692536,0.165354,0.0
1,0.208935,0.106704,0.0,0.0,0.0,0.210526,0.025,0.491246,0.043307,0.0
2,0.58488,0.298701,1.0,1.0,0.5,0.684211,0.575,0.619595,1.0,1.0
3,0.58488,0.298701,1.0,1.0,0.5,0.684211,0.575,0.619595,1.0,1.0
4,0.54433,0.277992,1.0,1.0,0.538462,0.684211,0.45,0.619595,1.0,1.0


In [None]:
nonfed_attacks[0][0].head()

Unnamed: 0,ip.len,udp.length,dns.flags.recdesired,dns.flags.recavail,dns.count.answers,dns.count.auth_rr,dns.count.add_rr,dns.qry.name,dns.qry.type,target
0,0.693471,0.354159,1.0,1.0,0.538462,0.684211,0.575,0.619595,1.0,1.0
1,0.693471,0.354159,1.0,1.0,0.538462,0.684211,0.575,0.619595,1.0,1.0
2,0.693471,0.354159,1.0,1.0,0.538462,0.684211,0.575,0.619595,1.0,1.0
3,1.0,0.742015,1.0,1.0,0.807692,0.684211,0.575,0.619595,1.0,1.0
4,0.693471,0.354159,1.0,1.0,0.538462,0.684211,0.575,0.619595,1.0,1.0


In [None]:
nonfed_benign[0].head()

Unnamed: 0,ip.len,udp.length,dns.flags.recdesired,dns.flags.recavail,dns.count.answers,dns.count.auth_rr,dns.count.add_rr,dns.qry.name,dns.qry.type,target
0,0.052234,0.026676,1.0,1.0,0.038462,0.0,0.0,0.70249,0.043307,0.0
1,0.05567,0.028431,0.0,0.0,0.038462,0.0,0.025,0.640758,0.0,0.0
2,0.104467,0.053352,1.0,1.0,0.153846,0.0,0.0,0.480682,0.0,0.0
3,0.368385,0.188136,0.0,0.0,0.0,0.210526,0.025,0.655011,0.165354,0.0
4,0.054296,0.027729,1.0,1.0,0.038462,0.0,0.0,0.840756,0.043307,0.0


In [None]:
scaling_results[0]

{'dns.count.add_rr': (0.0, 40.0),
 'dns.count.answers': (0.0, 26.0),
 'dns.count.auth_rr': (0.0, 19.0),
 'dns.flags.recavail': (0.0, 1.0),
 'dns.flags.recdesired': (0.0, 1.0),
 'dns.qry.name': (3232.0, 99998890.0),
 'dns.qry.type': (1.0, 255.0),
 'ip.len': (45.0, 1500.0),
 'udp.length': (25.0, 2874.0)}

For each feature, the aggregated scaler is computed from the minimum and maximum values of that feature over the union of all clients' datasets.

In [None]:
# initialize scaler and useful dictionaries
scaler = MinMaxScaler()
scaler.fit(general_benign_test.iloc[:, :-1])
scaler.min_ = []
scaler.scale_ = []
mins = {i:float('Inf') for i in general_benign_test.columns[:-1]}
maxes = {i:-float('Inf') for i in general_benign_test.columns[:-1]}

# compute appropriate min and max values of features
for client in range(num_of_clients):
    for feature, t in scaling_results[client].items():
        if t[0] < mins[feature]:
            mins[feature] = t[0]
        if t[1] > maxes[feature]:
            maxes[feature] = t[1]

# if mins[i] is still Inf, feature i was dropped by all clients
# assign mins[i] = 0 and maxes[i] = 1, which leaves feature i unscaled
for feature in mins:
    if mins[feature] == float('Inf'):
        mins[feature] = 0
        maxes[feature] = 1

# pass values to the scaler
for feature in general_benign_test.columns[:-1]:
    scaler.min_.append(-mins[feature]/(maxes[feature]-mins[feature]))
    scaler.scale_.append(1/(maxes[feature]-mins[feature]))

In [None]:
print(scaler.min_)
print(scaler.scale_)

[-0.030927835, -0.00464684, 0.0, 0.0, -0.0, 0.0, -0.0, -0.0, 0.0, -0.0, 0.0, -0.0, -0.0, -0.0, -2.1030677e-05, -0.003937008]
[0.0006872852233676976, 0.0001858736059479554, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.4091741351064854e-05, 0.05263157894736842, 0.025, 1.0000321610342989e-08, 0.003937007874015748]
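
As a sanity check on the numbers printed above, the following sketch reproduces how a (min, max) pair turns into the `scale_` and `min_` values that `MinMaxScaler` applies as `x * scale_ + min_`. The client-side pairs are taken from `scaling_results[0]`. The aggregated `udp.length` maximum of 5405 is inferred from the printed `scale_` value (1/5380) and is an assumption, not a value shown in the notebook. The helper names `minmax_params` and `scale_value` are hypothetical.

```python
def minmax_params(lo, hi):
    """Return (scale, offset) so that x * scale + offset maps [lo, hi] onto [0, 1]."""
    scale = 1.0 / (hi - lo)
    offset = -lo * scale
    return scale, offset


def scale_value(x, params):
    """Apply the affine min-max transform to a single value."""
    scale, offset = params
    return x * scale + offset


# (min, max) pairs of the first client, as printed in scaling_results[0]
client_ip_len = minmax_params(45.0, 1500.0)
client_udp_len = minmax_params(25.0, 2874.0)

# aggregated udp.length range; the maximum 5405 is inferred from the printed
# scale_ value and is an assumption
global_udp_len = minmax_params(25.0, 5405.0)

# first row of booter 1: ip.len = 1054, udp.length = 1034
ip_len_scaled = scale_value(1054.0, client_ip_len)
udp_client_scaled = scale_value(1034.0, client_udp_len)
udp_global_scaled = scale_value(1034.0, global_udp_len)

print(round(ip_len_scaled, 6), round(udp_client_scaled, 6), round(udp_global_scaled, 6))
```

The same raw `udp.length` value is mapped to a smaller number by the aggregated scaler than by the first client's own scaler, because the aggregated range is wider.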


Normalize the with-FL test sets using the aggregated scaler calculated above.

In [None]:
for booter in range(7):
    fed_attacks[booter].iloc[:, :-1] = scaler.transform(fed_attacks[booter].iloc[:, :-1])
fed_benign.iloc[:, :-1] = scaler.transform(fed_benign.iloc[:, :-1])

In [None]:
fed_attacks[0].head()

Unnamed: 0,ip.len,udp.length,dns.flags.response,dns.flags.opcode,dns.flags.authoritative,dns.flags.truncated,dns.flags.recdesired,dns.flags.recavail,dns.flags.authenticated,dns.flags.checkdisable,dns.count.queries,dns.count.answers,dns.count.auth_rr,dns.count.add_rr,dns.qry.name,dns.qry.type,target
0,0.693471,0.187546,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.000337,0.684211,0.575,0.619599,1.0,1.0
1,0.693471,0.187546,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.000337,0.684211,0.575,0.619599,1.0,1.0
2,0.693471,0.187546,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.000337,0.684211,0.575,0.619599,1.0,1.0
3,1.0,0.392937,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.000506,0.684211,0.575,0.619599,1.0,1.0
4,0.693471,0.187546,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.000337,0.684211,0.575,0.619599,1.0,1.0


In [None]:
fed_benign.head()

Unnamed: 0,ip.len,udp.length,dns.flags.response,dns.flags.opcode,dns.flags.authoritative,dns.flags.truncated,dns.flags.recdesired,dns.flags.recavail,dns.flags.authenticated,dns.flags.checkdisable,dns.count.queries,dns.count.answers,dns.count.auth_rr,dns.count.add_rr,dns.qry.name,dns.qry.type,target
0,0.052234,0.014126,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,2.4e-05,0.0,0.0,0.702494,0.043307,0.0
1,0.05567,0.015056,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,2.4e-05,0.0,0.025,0.640762,0.0,0.0
2,0.104467,0.028253,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,9.6e-05,0.0,0.0,0.480688,0.0,0.0
3,0.368385,0.099628,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.210526,0.025,0.655015,0.165354,0.0
4,0.054296,0.014684,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,2.4e-05,0.0,0.0,0.840758,0.043307,0.0


**Note:** Some features of the with-FL test sets are indeed not scaled, because all clients happened to have dropped these features. The code is general, taking into account that there could be a client who would keep a feature that others have dropped. This will also be taken into account in the learning process, where the central model's weights will all be initialized to zero, so that unused features do not affect the testing performed by the central entity.
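
The effect of the zero initialization can be illustrated with a single hidden neuron. In this minimal sketch (plain Python, with made-up weights and inputs), the weight attached to a feature that no client selected stays at zero, so the neuron's pre-activation does not change whatever unscaled value that feature takes.

```python
def pre_activation(weights, bias, x):
    """Weighted sum computed by one hidden neuron before the ReLU."""
    return sum(w * v for w, v in zip(weights, x)) + bias


# hypothetical neuron over three features; the middle feature was dropped by
# every client, so aggregation never writes to its weight and it stays 0
weights = [0.8, 0.0, -0.3]
bias = 0.1

sample_a = [0.5, 0.0, 0.2]      # unused feature happens to be 0
sample_b = [0.5, 1234.0, 0.2]   # same sample, unused feature left unscaled

same_output = pre_activation(weights, bias, sample_a) == pre_activation(weights, bias, sample_b)
print(same_output)
```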

Split the with-FL test sets into validation and test sets.

In [None]:
fed_attacks_val = [fed_attacks[b].iloc[:len(fed_attacks[b])//2, :] for b in range(7)]
fed_attacks_test = [fed_attacks[b].iloc[len(fed_attacks[b])//2:, :] for b in range(7)]

fed_benign_val = fed_benign.iloc[:len(fed_benign)//2, :]
fed_benign_test = fed_benign.iloc[len(fed_benign)//2:, :]

## Define training parameters and models, transform datasets to tensors, send data to clients, create dataloaders

In [None]:
# create clients
clients = [sy.VirtualWorker(hook, id='client'+str(i+1)) for i in range(num_of_clients)]

In [None]:
# define the args
args = {
    'use_cuda' : True,
    'batch_size' : 128,
    'test_batch_size' : 1000,
    'lr' : 0.01,
    'log_interval' : 500,
    'epochs' : 7
}

# check to use GPU or not
use_cuda = args['use_cuda'] and torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

In [None]:
# create a simple feedforward network
# n features as input, 2*n+1 hidden layer neurons, 1 output for binary classification
class MLP(nn.Module):
    
    def __init__(self, n):
        super(MLP, self).__init__()
        self.n = n
        
        self.layers = nn.Sequential(
            nn.Linear(in_features=n, out_features=2*n+1),
            nn.ReLU(),
            nn.Linear(in_features=2*n+1, out_features=1),
            nn.Sigmoid()
        )
            
    def forward(self, x):
        return self.layers(x)
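
Since the hidden layer width depends on the number of input features, clients that kept different feature subsets end up with models of different sizes. The sketch below only counts parameters for the architecture defined above; the helpers `mlp_shapes` and `num_parameters` are hypothetical and not part of the notebook. It uses 9 features (as in the first client's dataset shown earlier) and 16 features (the full feature set of the with-FL test sets).

```python
def mlp_shapes(n):
    """Shapes of the parameters of MLP(n): n inputs, 2n+1 hidden units, 1 output."""
    hidden = 2 * n + 1
    return {
        "hidden_weight": (hidden, n),
        "hidden_bias": (hidden,),
        "output_weight": (1, hidden),
        "output_bias": (1,),
    }


def num_parameters(n):
    """Total number of trainable scalars in MLP(n)."""
    total = 0
    for shape in mlp_shapes(n).values():
        size = 1
        for dim in shape:
            size *= dim
        total += size
    return total


client_shapes = mlp_shapes(9)     # a client that kept 9 features
central_shapes = mlp_shapes(16)   # central model over all 16 features

print(client_shapes["hidden_weight"], central_shapes["hidden_weight"])
print(num_parameters(9), num_parameters(16))
```

A client's 19-by-9 hidden weight matrix is therefore smaller than the central 33-by-16 matrix in both dimensions, which is why the aggregation code later indexes the central weights by `[:rows, index]`.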

In [None]:
# transform to tensors
train_features = [torch.tensor(cd.iloc[:, :-1].to_numpy()) for cd in clients_datasets]
train_target = [torch.tensor(cd['target'].to_numpy()) for cd in clients_datasets]

nonfed_attacks_features = [[torch.tensor(nonfed_attacks[c][b].iloc[:, :-1].to_numpy()) for b in range(7)] for c in range(num_of_clients)]
nonfed_attacks_target = [[torch.tensor(nonfed_attacks[c][b]['target'].to_numpy()) for b in range(7)] for c in range(num_of_clients)]

nonfed_benign_features = [torch.tensor(nonfed_benign[c].iloc[:, :-1].to_numpy()) for c in range(num_of_clients)]
nonfed_benign_target = [torch.tensor(nonfed_benign[c]['target'].to_numpy()) for c in range(num_of_clients)]

fed_attacks_val_features = [torch.tensor(fed_attacks_val[b].iloc[:, :-1].to_numpy()) for b in range(7)]
fed_attacks_val_target = [torch.tensor(fed_attacks_val[b]['target'].to_numpy()) for b in range(7)]

fed_attacks_test_features = [torch.tensor(fed_attacks_test[b].iloc[:, :-1].to_numpy()) for b in range(7)]
fed_attacks_test_target = [torch.tensor(fed_attacks_test[b]['target'].to_numpy()) for b in range(7)]

fed_benign_val_features = torch.tensor(fed_benign_val.iloc[:, :-1].to_numpy())
fed_benign_val_target = torch.tensor(fed_benign_val['target'].to_numpy())

fed_benign_test_features = torch.tensor(fed_benign_test.iloc[:, :-1].to_numpy())
fed_benign_test_target = torch.tensor(fed_benign_test['target'].to_numpy())

In [None]:
# distribute data across workers
# normally there is no need to distribute data, since it is already at the clients
# this is more of a simulation of federated learning
train_datasets = [sy.BaseDataset(train_features[i].send(c), train_target[i].send(c)) for i, c in enumerate(clients)]
federated_dataset = sy.FederatedDataset(train_datasets)
federated_train_loader = sy.FederatedDataLoader(federated_dataset, batch_size=args['batch_size'], shuffle=True)

# test data remains at the central entity
nonfed_attacks_datasets = [[tud.TensorDataset(nonfed_attacks_features[c][b], nonfed_attacks_target[c][b]) for b in range(7)] for c in range(num_of_clients)]
nonfed_attacks_loaders = [[tud.DataLoader(nonfed_attacks_datasets[c][b], batch_size=args['test_batch_size'], shuffle=True) for b in range(7)] for c in range(num_of_clients)]

nonfed_benign_datasets = [tud.TensorDataset(nonfed_benign_features[c], nonfed_benign_target[c]) for c in range(num_of_clients)]
nonfed_benign_loaders = [tud.DataLoader(nonfed_benign_datasets[c], batch_size=args['test_batch_size'], shuffle=True) for c in range(num_of_clients)]

fed_attacks_val_datasets = [tud.TensorDataset(fed_attacks_val_features[b], fed_attacks_val_target[b]) for b in range(7)]
fed_attacks_val_loaders = [tud.DataLoader(fed_attacks_val_datasets[b], batch_size=args['test_batch_size'], shuffle=True) for b in range(7)]

fed_attacks_test_datasets = [tud.TensorDataset(fed_attacks_test_features[b], fed_attacks_test_target[b]) for b in range(7)]
fed_attacks_test_loaders = [tud.DataLoader(fed_attacks_test_datasets[b], batch_size=args['test_batch_size'], shuffle=True) for b in range(7)]

fed_benign_val_dataset = tud.TensorDataset(fed_benign_val_features, fed_benign_val_target)
fed_benign_val_loader = tud.DataLoader(fed_benign_val_dataset, batch_size=args['test_batch_size'], shuffle=True)

fed_benign_test_dataset = tud.TensorDataset(fed_benign_test_features, fed_benign_test_target)
fed_benign_test_loader = tud.DataLoader(fed_benign_test_dataset, batch_size=args['test_batch_size'], shuffle=True)

## Train, test, aggregation, trust computation functions

In [None]:
# classic torch code for training, except for the federated part
def train(args, models, device, train_loader, optimizers, epoch, view_log=False):
    for c, m in models.items():
        m.train()
        # send models to workers
        m.send(c)

    # iterate over federated data client by client
    # of course, in reality all clients would train their models at the same time
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)

        optimizers[data.location].zero_grad()
        output = models[data.location](data)

        # loss is a ptr to the tensor loss at the remote location
        loss = BCELoss()(output, target.view_as(output))
        # call backward() on the loss ptr, that will send the command to call
        # backward on the actual loss tensor present on the remote machine
        loss.backward()
        optimizers[data.location].step()

        if view_log and batch_idx % args['log_interval'] == 0:
            # get back loss, that was created at remote worker
            loss = loss.get()
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}\tWorker: {}'.format(
                    epoch, 
                    batch_idx * args['batch_size'], # number of packets done
                    len(train_loader) * args['batch_size'], # total packets
                    100. * batch_idx / len(train_loader), # percentage of batches done
                    loss,
                    data.location.id
                )
            )

    # get back models for aggregation
    # get() retrieves the model in place, so no reassignment is needed
    for m in models.values():
        m.get()

In [None]:
# classic torch code for testing
def test(model, device, test_loader, testType='Validation'):
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)

            # add losses together
            test_loss += BCELoss(reduction='sum')(output, target.view_as(output)).item()

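            # favour class 0: subtracting 0.2 before rounding means a sample is
            # classified as an attack only if the predicted probability exceeds 0.7
            output = torch.max(output-0.2, torch.zeros(size=output.shape).to(device))
            
            # round to the predicted class (0 or 1) and count correctly classified samples
            pred = torch.round(output)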
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    accuracy = 100. * correct / len(test_loader.dataset)
    print(testType + ' set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)'.format(
        test_loss, correct, len(test_loader.dataset),
        accuracy))
    
    return accuracy
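
The "favour class 0" step in `test` amounts to raising the decision threshold. A minimal plain-Python sketch of the same arithmetic (the function name `predict` is hypothetical): subtracting 0.2, clamping at zero and rounding labels a packet as an attack only when the sigmoid output is above roughly 0.7 instead of 0.5.

```python
def predict(probability, margin=0.2):
    """Mirror of the thresholding in test(): shift, clamp at 0, then round."""
    shifted = max(probability - margin, 0.0)
    return round(shifted)


predictions = [predict(p) for p in (0.10, 0.55, 0.69, 0.71, 0.95)]
print(predictions)
```

Outputs of 0.55 and 0.69, which a plain 0.5 threshold would label as attacks, are classified as benign here, reducing false positives at the cost of some missed attacks.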

In [None]:
def aggregate(central_model, models, weights, trust, num_of_clients_in_weights):
    with torch.no_grad():
        # dataXtrust values needed for normalization later
        dataXtrust_hidden_weight = torch.zeros(size=central_model.layers[0].weight.shape).to(device)
        dataXtrust_hidden_bias = torch.zeros(size=central_model.layers[0].bias.shape).to(device)
        dataXtrust_output_weight = torch.zeros(size=central_model.layers[2].weight.shape).to(device)
        dataXtrust_output_bias = 0
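        # reset the accumulators so that sums from the previous round do not leak into this one
        for w in weights.values():
            w.zero_()
        # firstly compute new aggregated weight values
        # to do so we start by taking the sum of the weights of all clients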
        for i, c in enumerate(clients):
            # each client only contributes to chosen features (i.e. columns of weights arrays)
            # for each of these features (columns), the aggregation uses the first x elements (rows) of central model weights
            # where x is the number of hidden layer neurons of client and is equal to 2*(number_of_features_of_client)+1
            rows = 2*models[c].n+1
            for j, feature in enumerate(clients_datasets[i].columns[:-1]):
                # find the index of feature in the central_model
                index = general_benign_test.columns[:-1].get_loc(feature)
                weights['hidden_mean_weight'][:rows, index] += models[c].layers[0].weight.data[:, j].clone()*len(clients_datasets[i])*trust[c]
                dataXtrust_hidden_weight[:rows, index] += len(clients_datasets[i])*trust[c]
            # the rest of the weights don't have to be calculated feature-wise
            weights['hidden_mean_bias'][:rows] += models[c].layers[0].bias.data.clone()*len(clients_datasets[i])*trust[c]
            dataXtrust_hidden_bias[:rows] += len(clients_datasets[i])*trust[c]
            weights['output_mean_weight'][0, :rows] += models[c].layers[2].weight.data[0, :].clone()*len(clients_datasets[i])*trust[c]
            dataXtrust_output_weight[0, :rows] += len(clients_datasets[i])*trust[c]
            weights['output_mean_bias'] += models[c].layers[2].bias.data.clone()*len(clients_datasets[i])*trust[c]
            dataXtrust_output_bias += len(clients_datasets[i])*trust[c]

        # diminish influence of rare weights (i.e. weights of features that few clients have)
        dataXtrust_hidden_weight[num_of_clients_in_weights['hidden'] == 3] *= 2
        dataXtrust_hidden_weight[num_of_clients_in_weights['hidden'] == 2] *= 3
        dataXtrust_hidden_weight[num_of_clients_in_weights['hidden'] == 1] *= 4
        dataXtrust_output_weight[num_of_clients_in_weights['output'].reshape(1, 1, central_model.layers[2].weight.shape[1]) == 3] *= 2
        dataXtrust_output_weight[num_of_clients_in_weights['output'].reshape(1, 1, central_model.layers[2].weight.shape[1]) == 2] *= 3
        dataXtrust_output_weight[num_of_clients_in_weights['output'].reshape(1, 1, central_model.layers[2].weight.shape[1]) == 1] *= 4

        # change zero dataXtrust values to ones
        dataXtrust_hidden_weight[dataXtrust_hidden_weight == 0] = 1
        dataXtrust_hidden_bias[dataXtrust_hidden_bias == 0] = 1
        dataXtrust_output_weight[(dataXtrust_output_weight == 0).reshape(1, 1, central_model.layers[2].weight.shape[1])] = 1

        # and then we normalize the sum taking into account number of data and trust value for each client
        # again parts of weights' arrays are normalized with respect only to clients that contributed to these parts
        weights['hidden_mean_weight'] /= dataXtrust_hidden_weight
        weights['hidden_mean_bias'] /= dataXtrust_hidden_bias
        weights['output_mean_weight'] /= dataXtrust_output_weight
        weights['output_mean_bias'] /= dataXtrust_output_bias

        # secondly copy new weight values to the local models of all clients
        for i, c in enumerate(clients):
            rows = 2*models[c].n+1
            for j, feature in enumerate(clients_datasets[i].columns[:-1]):
                index = general_benign_test.columns[:-1].get_loc(feature)
                models[c].layers[0].weight.data[:, j] = weights['hidden_mean_weight'][:rows, index].clone()
            # the rest of the weights don't have to be copied feature-wise
            models[c].layers[0].bias.data = weights['hidden_mean_bias'][:rows].clone()
            models[c].layers[2].weight.data[0, :] = weights['output_mean_weight'][0, :rows].clone()
            models[c].layers[2].bias.data = weights['output_mean_bias'].clone()

        # and finally copy to the central model for the test set
        central_model.layers[0].weight.data = weights['hidden_mean_weight'].clone()
        central_model.layers[0].bias.data = weights['hidden_mean_bias'].clone()
        central_model.layers[2].weight.data = weights['output_mean_weight'].clone()
        central_model.layers[2].bias.data = weights['output_mean_bias'].clone()
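
The core of `aggregate` is a weighted average in which each client's contribution is scaled by its dataset size times its trust value, and each weight is averaged only over the clients that actually have it. The following is a simplified sketch of that idea using plain dictionaries keyed by feature name. The clients, sizes and weight values are made up, the function name `aggregate_by_feature` is hypothetical, and the extra down-weighting of rare weights applied in the notebook is left out.

```python
def aggregate_by_feature(client_weights, sizes, trust):
    """Trust- and size-weighted mean per feature, over the clients that have it."""
    sums, norms = {}, {}
    for client, weights in client_weights.items():
        coeff = sizes[client] * trust[client]
        for feature, value in weights.items():
            sums[feature] = sums.get(feature, 0.0) + coeff * value
            norms[feature] = norms.get(feature, 0.0) + coeff
    # a zero normaliser (all contributing clients have zero trust) falls back to 0
    return {f: (sums[f] / norms[f] if norms[f] else 0.0) for f in sums}


# hypothetical example: client A kept two features, client B only one
client_weights = {
    "A": {"ip.len": 1.0, "dns.qry.type": 2.0},
    "B": {"dns.qry.type": 4.0},
}
sizes = {"A": 100, "B": 300}
trust = {"A": 0.5, "B": 0.5}

central = aggregate_by_feature(client_weights, sizes, trust)
print(central)
```

The `ip.len` weight comes from client A alone, while the `dns.qry.type` weight is pulled towards client B because B holds three times as much data at equal trust.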

In [None]:
def computeTrust(models, trust, r, s, num_of_clients_in_weights):
    # dev[i] shows how much the weights of model of client i differ from the models of all other clients
    # it is calculated in accordance with the relevant paper, but also taking into account the heterogeneity of models 
    dev = [0 for i in clients]
    for i, c in enumerate(clients):
        for j, cc in enumerate(clients):
            # the smallest model defines the number of weights of rows (neurons) that will be compared
            rows = min(2*models[c].n+1, 2*models[cc].n+1)
            # between 2 clients, only weights of features that both have chosen are compared
            for indexi, feature in enumerate(clients_datasets[i].columns[:-1]): 
                try:
                    # find the index of the column of feature in cc client, provided that cc has chosen this feature
                    indexj = clients_datasets[j].columns[:-1].get_loc(feature)
                except KeyError:
                    # go to the next feature, if current feature not chosen by cc
                    continue
                # for hidden layer, add to dev the sum of squared differences of weights of models divided by the number of clients which have each weight
                to_divide = num_of_clients_in_weights['hidden'][:rows, general_benign_test.columns[:-1].get_loc(feature)]
                difference = models[cc].layers[0].weight.data[:rows, indexj].cpu() - models[c].layers[0].weight.data[:rows, indexi].cpu()
                dev[i] += np.sum(difference.numpy()**2 / to_divide)
            # output layer weights don't have to be compared feature-wise
            # same as above for the output layer
            difference = models[cc].layers[2].weight.data[0, :rows].cpu() - models[c].layers[2].weight.data[0, :rows].cpu()
            dev[i] += np.sum(difference.numpy()**2 / num_of_clients_in_weights['output'][0, :rows])

    # I[i] = 1 if client i acts normally and 0 if malicious or malfunctions
    I = [1 if d <= 1.5*median(sorted(dev)) else 0 for d in dev]
    print("dev: ",dev) # testing
    print("median*1.5: ", 1.5*median(sorted(dev))) # testing
    print("I: ", I) # testing
 
    # compute new r, s values for every client
    for i in range(len(clients)):
        p1 = 0.2
        #p2 = lambda x: x/median(sorted(dev)) if x/median(sorted(dev)) > 3 and x > 30 else (x/1000 if x > 1000 else (0.01 if I[i] == 1 and s[i] > 10 else 0.7))
        p2 = lambda x: 0.8
        r[i] = p1*r[i] + I[i]
        s[i] = p2(dev[i])*s[i] + 1 - I[i]

    # compute new trust value of every client
    for i, c in enumerate(clients):
        trust[c] = (r[i]+1)/(r[i]+s[i]+2)
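
To make the trust update concrete, here is a small worked sketch in plain Python that follows the same formulas as `computeTrust`, with `p1 = 0.2` and the constant `p2 = 0.8`. It uses the `dev` values printed for the first FL round later in this notebook; the helper names `flag_clients` and `update_trust` are hypothetical.

```python
from statistics import median


def flag_clients(dev, factor=1.5):
    """1 for clients whose deviation is within factor * median, else 0."""
    threshold = factor * median(dev)
    return [1 if d <= threshold else 0 for d in dev]


def update_trust(r, s, indicator, p1=0.2, p2=0.8):
    """One round of the r/s update followed by trust = (r + 1) / (r + s + 2)."""
    r = p1 * r + indicator
    s = p2 * s + 1 - indicator
    return r, s, (r + 1) / (r + s + 2)


# deviations reported in the first FL round
dev = [20.562236351710002, 19.8977922389483, 20.512924887844736,
       18.435506562278576, 41.65606992719319, 16.846430829426772,
       20.370607854314333]
flags = flag_clients(dev)

# follow one well-behaved and one flagged client for two rounds, starting at r = s = 0
r_good, s_good, trust_good = update_trust(0, 0, 1)
r_bad, s_bad, trust_bad = update_trust(0, 0, 0)
r_good, s_good, trust_good2 = update_trust(r_good, s_good, 1)
r_bad, s_bad, trust_bad2 = update_trust(r_bad, s_bad, 0)

print(flags)
print(trust_good, trust_bad, trust_good2, trust_bad2)
```

After two rounds the well-behaved client's trust rises from 2/3 to about 0.69, while the flagged client's trust drops from 1/3 to about 0.26.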

## Results without FL

In [None]:
# clients' models, optimizers and schedulers for learning rate
models = {clients[c]:MLP(len(clients_datasets[c].columns[:-1])).to(device) for c in range(num_of_clients)}
optimizers = {c:optim.SGD(models[c].parameters(), lr=args['lr']) for c in clients}
# decreasing learning rate
lamda = lambda epoch: 1 if epoch < 1 else 0.1
schedulers = {i:sched.LambdaLR(optimizers[i], lr_lambda=lamda) for i in clients}

for epoch in range(1, args['epochs'] + 1):
    train(args, models, device, federated_train_loader, optimizers, epoch)
    for scheduler in schedulers.values():
        scheduler.step()
    for c in range(num_of_clients):
        print('Epoch ' + str(epoch) + ', Client ' + str(c+1) + ':')
        for b in range(7):
            print('\tBooter ' + str(b+1) + ': ', end='')
            test(models[clients[c]], device, nonfed_attacks_loaders[c][b])
        print('\tBenign traffic: ', end='')
        test(models[clients[c]], device, nonfed_benign_loaders[c])
    print()

Epoch 1, Client 1:
	Booter 1: Validation set: Average loss: 0.1197, Accuracy: 45510/50000 (91%)
	Booter 2: Validation set: Average loss: 0.1511, Accuracy: 41653/50000 (83%)
	Booter 3: Validation set: Average loss: 0.1158, Accuracy: 48477/50000 (97%)
	Booter 4: Validation set: Average loss: 0.0615, Accuracy: 45336/50000 (91%)
	Booter 5: Validation set: Average loss: 0.4314, Accuracy: 28/50000 (0%)
	Booter 6: Validation set: Average loss: 0.0607, Accuracy: 49466/50000 (99%)
	Booter 7: Validation set: Average loss: 0.0305, Accuracy: 49870/50000 (100%)
	Benign traffic: Validation set: Average loss: 0.5102, Accuracy: 283897/300000 (95%)
Epoch 1, Client 2:
	Booter 1: Validation set: Average loss: 0.0896, Accuracy: 50000/50000 (100%)
	Booter 2: Validation set: Average loss: 0.0971, Accuracy: 50000/50000 (100%)
	Booter 3: Validation set: Average loss: 0.0954, Accuracy: 50000/50000 (100%)
	Booter 4: Validation set: Average loss: 0.1498, Accuracy: 46506/50000 (93%)
	Booter 5: Validation set: Ave

## FL training with 7 clients

In [None]:
# central model
central_model = MLP(len(general_benign_test.columns[:-1])).to(device)
# initialize weights of central model to zero,
# so that features which are dropped by all clients do not affect testing
central_model.layers[0].weight.data.fill_(0)
central_model.layers[0].bias.data.fill_(0)
central_model.layers[2].weight.data.fill_(0)
central_model.layers[2].bias.data.fill_(0)

# clients' models, optimizers and schedulers for learning rate
# note that central entity knows the chosen features of each client from the preprocessing procedure
models = {clients[c]:MLP(len(clients_datasets[c].columns[:-1])).to(device) for c in range(num_of_clients)}
optimizers = {c:optim.SGD(models[c].parameters(), lr=args['lr']) for c in clients}
# some clients may work better with another learning rate value
optimizers[clients[5]] = optim.SGD(models[clients[5]].parameters(), lr=0.5*args['lr'])
optimizers[clients[6]] = optim.SGD(models[clients[6]].parameters(), lr=0.5*args['lr'])
# decreasing learning rate
lamda = lambda epoch: 1 if epoch < 1 else 0.1
schedulers = {i:sched.LambdaLR(optimizers[i], lr_lambda=lamda) for i in clients}

# initialization of dictionary for models aggregation
weights = {'hidden_mean_weight' : torch.zeros(size=central_model.layers[0].weight.shape).to(device),
           'hidden_mean_bias' : torch.zeros(size=central_model.layers[0].bias.shape).to(device),
           'output_mean_weight' : torch.zeros(size=central_model.layers[2].weight.shape).to(device),
           'output_mean_bias' : torch.zeros(size=central_model.layers[2].bias.shape).to(device)}

# trust values
trust = {i:0 for i in clients}
r = [0 for i in clients]
s = [0 for i in clients]

# for each weight of central_model, count the number of clients which contain this weight in their models
# needed to compute the trust value of each client
num_of_clients_in_weights = {'hidden' : np.zeros(central_model.layers[0].weight.shape),
                             'output' : np.zeros(central_model.layers[2].weight.shape)}
for i, c in enumerate(clients):
    rows = 2*models[c].n+1
    num_of_clients_in_weights['output'][0, :rows] += 1
    for j, feature in enumerate(clients_datasets[i].columns[:-1]):
        index = general_benign_test.columns[:-1].get_loc(feature)
        num_of_clients_in_weights['hidden'][:rows, index] += 1

# choose best model of all epochs (initialization)
best_model = copy.deepcopy(central_model)
best_accuracies = [0 for i in range(8)]

for epoch in range(1, args['epochs'] + 1):
    train(args, models, device, federated_train_loader, optimizers, epoch, view_log=True)
    for scheduler in schedulers.values():
        scheduler.step()

    # we scale the weights of one client by 2 to simulate untrustworthy behavior
    models[clients[4]].layers[0].weight.data *= 2
    models[clients[4]].layers[2].weight.data *= 2

    computeTrust(models, trust, r, s, num_of_clients_in_weights)
    aggregate(central_model, models, weights, trust, num_of_clients_in_weights)

    print()
    accuracies = []
    for b in range(7):
        print('Booter ' + str(b+1) + ': ', end='')
        accuracies.append(test(central_model, device, fed_attacks_val_loaders[b]))
    print('Benign traffic: ', end='')
    accuracies.append(test(central_model, device, fed_benign_val_loader))
    print()

    # update best model
    if min(accuracies[:4] + accuracies[5:]) > 75:
        mean_attacks = mean(accuracies[:4] + accuracies[5:])
        best_mean_attacks = mean(best_accuracies[:4] + best_accuracies[5:])
        # false positives are more harmful than false negatives
        if mean_attacks + 1.2*accuracies[7] > best_mean_attacks + 1.2*best_accuracies[7]:
            best_model = copy.deepcopy(central_model)
            best_accuracies = accuracies

# final testings
print('Final testing of selected model:')
for b in range(7):
    print('Booter ' + str(b+1) + ': ', end='')
    test(best_model, device, fed_attacks_test_loaders[b], testType='Test')
print('Benign traffic: ', end='')
test(best_model, device, fed_benign_test_loader, testType='Test')
print()

dev:  [20.562236351710002, 19.8977922389483, 20.512924887844736, 18.435506562278576, 41.65606992719319, 16.846430829426772, 20.370607854314333]
median*1.5:  30.5559117814715
I:  [1, 1, 1, 1, 0, 1, 1]

Booter 1: Validation set: Average loss: 0.6895, Accuracy: 0/25000 (0%)
Booter 2: Validation set: Average loss: 0.6949, Accuracy: 0/25000 (0%)
Booter 3: Validation set: Average loss: 0.6868, Accuracy: 0/25000 (0%)
Booter 4: Validation set: Average loss: 0.7348, Accuracy: 0/25000 (0%)
Booter 5: Validation set: Average loss: 0.7653, Accuracy: 1/25000 (0%)
Booter 6: Validation set: Average loss: 0.6831, Accuracy: 0/25000 (0%)
Booter 7: Validation set: Average loss: 0.6807, Accuracy: 0/25000 (0%)
Benign traffic: Validation set: Average loss: 0.5333, Accuracy: 150000/150000 (100%)

dev:  [0.2734111657421503, 0.2777438428465505, 0.2368459984581621, 0.22610689287735697, 1.3748570654339511, 0.1779196145067418, 0.21140758705901405]
median*1.5:  0.3552689976872431
I:  [1, 1, 1, 1, 0, 1, 1]

Booter 1
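
The `dev`, `median*1.5` and `I` lines in the log above are consistent with a median-based outlier rule: a client keeps indicator 1 while its deviation stays within 1.5 times the median deviation of all clients. The sketch below reproduces the printed indicators from the printed deviations. It is an illustration only: `trust_indicators` is a hypothetical helper, the non-strict comparison is an assumption, and the notebook's own `computeTrust` (defined earlier) may compute and use these values differently.

```python
from statistics import median

def trust_indicators(dev, factor=1.5):
    """Return (indicators, threshold): 1 if a deviation is within factor * median, else 0."""
    threshold = factor * median(dev)
    # assumption: a deviation exactly equal to the threshold still counts as trusted
    indicators = [1 if d <= threshold else 0 for d in dev]
    return indicators, threshold

# deviations copied from the first log block above
dev = [20.562236351710002, 19.8977922389483, 20.512924887844736,
       18.435506562278576, 41.65606992719319, 16.846430829426772,
       20.370607854314333]

indicators, threshold = trust_indicators(dev)
print('median*1.5: ', threshold)
print('I: ', indicators)  # only the fifth client (index 4) is flagged with 0
```

The median is used rather than the mean because a single client with inflated weights barely moves it, so the threshold stays close to the typical deviation of the honest clients.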

## FL training 7 clients without trust
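
With every trust value fixed to 1, the tampered client contributes to the aggregate like any other client. The toy sketch below illustrates this for heterogeneous clients: each weight of the central model is averaged over only the clients that hold it, and weights dropped by all clients stay at zero. One scalar per feature stands in for a column of the weight matrix. The function `aggregate_plain`, the feature names and the values are all hypothetical, and the notebook's own `aggregate` function (defined earlier) may differ.

```python
def aggregate_plain(client_weights, all_features):
    """Per-feature mean over the clients that hold the feature (trust = 1 for all)."""
    totals = {f: 0.0 for f in all_features}
    counts = {f: 0 for f in all_features}  # plays the role of num_of_clients_in_weights
    for weights in client_weights.values():
        for feature, value in weights.items():
            totals[feature] += value
            counts[feature] += 1
    # features dropped by every client keep the zero initial value
    return {f: (totals[f] / counts[f] if counts[f] else 0.0) for f in all_features}

all_features = ['f1', 'f2', 'f3']          # hypothetical feature names
honest = {
    'client1': {'f1': 1.0, 'f2': 2.0},
    'client2': {'f1': 1.0},
    'client3': {'f1': 1.0, 'f2': 2.0},
}
# client3 doubles its weights before sending them, as in the simulation below
tampered = dict(honest, client3={k: 2 * v for k, v in honest['client3'].items()})

honest_result = aggregate_plain(honest, all_features)
tampered_result = aggregate_plain(tampered, all_features)
print(honest_result)
print(tampered_result)
```

In this toy case the tampered run moves `f2` from 2.0 to 3.0, because only two clients hold that feature; the fewer clients share a weight, the more a single scaled update distorts it.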

In [None]:
# central model
central_model = MLP(len(general_benign_test.columns[:-1])).to(device)
# initialize weights of central model to zero,
# so that features which are dropped by all clients do not affect testing
central_model.layers[0].weight.data.fill_(0)
central_model.layers[0].bias.data.fill_(0)
central_model.layers[2].weight.data.fill_(0)
central_model.layers[2].bias.data.fill_(0)

# clients' models, optimizers and schedulers for learning rate
# note that the central entity knows the chosen features of each client from the preprocessing procedure
models = {clients[c]:MLP(len(clients_datasets[c].columns[:-1])).to(device) for c in range(num_of_clients)}
optimizers = {c:optim.SGD(models[c].parameters(), lr=args['lr']) for c in clients}
# some clients may work better with another learning rate value
optimizers[clients[5]] = optim.SGD(models[clients[5]].parameters(), lr=0.5*args['lr'])
optimizers[clients[6]] = optim.SGD(models[clients[6]].parameters(), lr=0.5*args['lr'])
# decreasing learning rate: full rate during the first epoch, then 0.1 times the initial rate
lamda = lambda epoch: 1 if epoch < 1 else 0.1
schedulers = {i:sched.LambdaLR(optimizers[i], lr_lambda=lamda) for i in clients}

# initialization of dictionary for models aggregation
weights = {'hidden_mean_weight' : torch.zeros(size=central_model.layers[0].weight.shape).to(device),
           'hidden_mean_bias' : torch.zeros(size=central_model.layers[0].bias.shape).to(device),
           'output_mean_weight' : torch.zeros(size=central_model.layers[2].weight.shape).to(device),
           'output_mean_bias' : torch.zeros(size=central_model.layers[2].bias.shape).to(device)}

# trust values (fixed to 1 for every client, since computeTrust is not called in this run)
trust = {i:1 for i in clients}

# for each weight of central_model, count the number of clients which contain this weight in their models
# passed to aggregate below (trust values are not recomputed in this run)
num_of_clients_in_weights = {'hidden' : np.zeros(central_model.layers[0].weight.shape),
                             'output' : np.zeros(central_model.layers[2].weight.shape)}
for i, c in enumerate(clients):
    rows = 2*models[c].n+1
    num_of_clients_in_weights['output'][0, :rows] += 1
    for feature in clients_datasets[i].columns[:-1]:
        index = general_benign_test.columns[:-1].get_loc(feature)
        num_of_clients_in_weights['hidden'][:rows, index] += 1

# choose best model of all epochs (initialization)
best_model = copy.deepcopy(central_model)
best_accuracies = [0 for i in range(8)]

for epoch in range(1, args['epochs'] + 1):
    train(args, models, device, federated_train_loader, optimizers, epoch, view_log=True)
    for scheduler in schedulers.values():
        scheduler.step()

    # scale the weights of one client (clients[4]) to simulate untrustworthy behavior
    models[clients[4]].layers[0].weight.data *= 2
    models[clients[4]].layers[2].weight.data *= 2

    aggregate(central_model, models, weights, trust, num_of_clients_in_weights)

    print()
    accuracies = []
    for b in range(7):
        print('Booter ' + str(b+1) + ': ', end='')
        accuracies.append(test(central_model, device, fed_attacks_val_loaders[b]))
    print('Benign traffic: ', end='')
    accuracies.append(test(central_model, device, fed_benign_val_loader))
    print()

    # update best model
    # the slices below skip index 4 (Booter 5); [5:] also includes index 7, the benign accuracy
    if min(accuracies[:4] + accuracies[5:]) > 75:
        mean_attacks = mean(accuracies[:4] + accuracies[5:])
        best_mean_attacks = mean(best_accuracies[:4] + best_accuracies[5:])
        # false positives are more harmful than false negatives, so benign accuracy gets an extra weight of 1.2
        if mean_attacks + 1.2*accuracies[7] > best_mean_attacks + 1.2*best_accuracies[7]:
            best_model = copy.deepcopy(central_model)
            best_accuracies = accuracies

# final testing
print('Final testing of selected model:')
for b in range(7):
    print('Booter ' + str(b+1) + ': ', end='')
    test(best_model, device, fed_attacks_test_loaders[b], testType='Test')
print('Benign traffic: ', end='')
test(best_model, device, fed_benign_test_loader, testType='Test')
print()


Booter 1: Validation set: Average loss: 0.7118, Accuracy: 0/25000 (0%)
Booter 2: Validation set: Average loss: 0.7170, Accuracy: 0/25000 (0%)
Booter 3: Validation set: Average loss: 0.7078, Accuracy: 0/25000 (0%)
Booter 4: Validation set: Average loss: 0.7598, Accuracy: 0/25000 (0%)
Booter 5: Validation set: Average loss: 0.7487, Accuracy: 1/25000 (0%)
Booter 6: Validation set: Average loss: 0.7018, Accuracy: 0/25000 (0%)
Booter 7: Validation set: Average loss: 0.6923, Accuracy: 0/25000 (0%)
Benign traffic: Validation set: Average loss: 0.5196, Accuracy: 150000/150000 (100%)


Booter 1: Validation set: Average loss: 0.5387, Accuracy: 0/25000 (0%)
Booter 2: Validation set: Average loss: 0.5487, Accuracy: 0/25000 (0%)
Booter 3: Validation set: Average loss: 0.5303, Accuracy: 0/25000 (0%)
Booter 4: Validation set: Average loss: 0.6019, Accuracy: 0/25000 (0%)
Booter 5: Validation set: Average loss: 0.6050, Accuracy: 1/25000 (0%)
Booter 6: Validation set: Average loss: 0.5193, Accuracy: 0/
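
The model-selection rule used in the training loops above can be restated as a small standalone function. The sketch below mirrors the slicing and the 75 and 1.2 constants from those cells; the function names and the accuracy lists are made up for illustration and are not results from this notebook.

```python
from statistics import mean

MIN_ACCURACY = 75    # threshold from the training cells
BENIGN_WEIGHT = 1.2  # extra weight on benign accuracy from the training cells

def kept(accuracies):
    """Drop index 4 (Booter 5); index 7 (benign) stays in the kept list."""
    return accuracies[:4] + accuracies[5:]

def score(accuracies):
    """Mean of the kept accuracies plus the weighted benign accuracy."""
    return mean(kept(accuracies)) + BENIGN_WEIGHT * accuracies[7]

def should_replace(accuracies, best_accuracies):
    """True if this epoch passes the threshold and beats the best score so far."""
    return min(kept(accuracies)) > MIN_ACCURACY and score(accuracies) > score(best_accuracies)

best = [0] * 8                                  # same initialization as best_accuracies
epoch_a = [90, 92, 88, 91, 10, 95, 93, 99]      # hypothetical accuracies, in percent
epoch_b = [99, 99, 99, 99, 50, 99, 99, 70]      # hypothetical: benign accuracy too low

print(should_replace(epoch_a, best))
print(should_replace(epoch_b, best))
```

Here `epoch_a` would be selected despite its low Booter 5 accuracy, because index 4 is excluded from the check, while `epoch_b` is rejected because its benign accuracy of 70 falls below the threshold.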

## FL 7 clients without feature selection

In [None]:
# results of the same procedure, but with every client's dataset containing all 16 features
# performance is the same, but this is time- and memory-inefficient at big-data scale

dev:  [24.80002375103197, 23.703931983341672, 25.69374205702227, 24.605259497612394, 27.013491889629435, 26.37263026885325, 25.556488338435923]
median*1.5:  38.33473250765388
I:  [1, 1, 1, 1, 1, 1, 1]

Booter 1: Test set: Average loss: 0.7168, Accuracy: 0/50000 (0%)
Booter 2: Test set: Average loss: 0.7283, Accuracy: 0/50000 (0%)
Booter 3: Test set: Average loss: 0.7090, Accuracy: 0/50000 (0%)
Booter 4: Test set: Average loss: 0.6839, Accuracy: 0/50000 (0%)
Booter 5: Test set: Average loss: 0.7411, Accuracy: 1/50000 (0%)
Booter 6: Test set: Average loss: 0.6861, Accuracy: 0/50000 (0%)
Booter 7: Test set: Average loss: 0.6692, Accuracy: 0/50000 (0%)
Benign traffic: Test set: Average loss: 0.5099, Accuracy: 300000/300000 (100%)

dev:  [0.012548735305858138, 0.016829866409386687, 0.012988741356689247, 0.017730614939185166, 0.03220635562238399, 0.015051668095656888, 0.013780966551152763]
median*1.5:  0.02257750214348533
I:  [1, 1, 1, 1, 0, 1, 1]

Booter 1: Test set: Average loss: 0.5563, A