# Quantum Federated Learning with Genomic Data

This Jupyter Notebook demonstrates the use of Quantum Federated Learning (QFL) with genomic data. Quantum computing has the potential to revolutionize machine learning by offering unique computational advantages. In this notebook, we'll use the Qiskit and Genomic Benchmarks libraries to explore the concept of federated learning, which is a decentralized approach to machine learning.

> The focus of this notebook is to address and resolve encoding issues encountered in the files within the `encoding_technique_1` folder.

## Required Dependencies

Before proceeding with the code execution, it is essential to ensure that you have the necessary libraries installed. The following commands will help you install these libraries:

The line `%%capture` prevents any pip logs from being displayed here.




In [1]:
%%capture
!pip install genomic-benchmarks
!pip install qiskit qiskit_machine_learning qiskit_algorithms
!pip install qiskit-aer

## Data Collection

In this section, our main objective is to gather the necessary data for our Quantum Federated Learning experiment. We will use the `genomic_benchmarks` library to work with a dataset designed for classifying DNA sequences as either human or worm.

To start collecting our data, we'll import the required dataset using the `DemoHumanOrWorm` class from the `genomic_benchmarks.dataset_getters.pytorch_datasets` module. This dataset comes with both a training set and a test set. However, during testing, we noticed some issues with the `test_set` variable in fetching the correct data. For our current purpose, we'll focus solely on the `train_set` variable, which holds a substantial 75,000 samples.

> If your specific use case requires it, you can include the test set as well by uncommenting the relevant line of code.




In [2]:
from genomic_benchmarks.dataset_getters.pytorch_datasets import DemoHumanOrWorm

test_set = DemoHumanOrWorm(split='test', version=0)
train_set = DemoHumanOrWorm(split='train', version=0)

data_set = train_set
# data_set = train_set + test_set
len(data_set)

  from tqdm.autonotebook import tqdm
Downloading...
From: https://drive.google.com/uc?id=1JW0-eTB-rJXvFcglqBo3pFZi1kyIWC3X
To: /root/.genomic_benchmarks/demo_human_or_worm.zip
100%|██████████| 28.9M/28.9M [00:00<00:00, 36.2MB/s]


75000

## Testing Set Size

Before we move further, let's check the size of the testing and training set variables. We have previously mentioned that there were some issues with the `test_set` variable during data collection. We'll assess its current state.

In [3]:

print(f"Number of samples in the test set: {len(test_set)}")
print(f"Number of samples in the training set: {len(train_set)}")

Number of samples in the test set: 25000
Number of samples in the training set: 75000


## Genomics Data

Now, let's take a closer look at what the genomic data looks like. The data consists of DNA sequences, each represented as a string with a length of 200 characters, and an associated label, which can be either 0 or 1. In this context, 0 typically represents human DNA, while 1 corresponds to worm DNA.

For our specific use case, we need to reduce the dimensionality of this data. One approach is to encode the DNA sequence characters as follows:
- 'A' as 1
- 'T' as 2
- 'C' as 3
- 'G' as 4
- 'N' as 5

DNA sequences in this dataset contain only these characters, so the mapping covers every position. However, working with 200 features for each sequence might be too complex. In the next step, we'll reduce the dimensionality of this data from 200 features down to a single-digit number of features (eight in the preprocessing that follows).
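As a minimal sketch of the character-level encoding listed above, each base can be mapped to one integer. The names `CHAR_TO_INT` and `encode_sequence` are ours for illustration and are not part of `genomic_benchmarks`.

```python
# Character-to-integer mapping taken from the list above.
CHAR_TO_INT = {'A': 1, 'T': 2, 'C': 3, 'G': 4, 'N': 5}

def encode_sequence(seq):
    """Encode a DNA string as a list of integers, one per character."""
    return [CHAR_TO_INT[ch] for ch in seq]

encoded = encode_sequence("AAGTGN")
print(encoded)  # [1, 1, 4, 2, 4, 5]
```

With this encoding a 200-character sequence still yields 200 features, which is why a more compact representation is needed.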


In [6]:
print("One sample from the data_set variable: ")
data_set[0]

One sample from the data_set variable: 


('AAGTGAAGTAGGACTCAGACGTAACGGCAGATGACAGAGATGAGAGCTGACTGTGTGCCAGGTATGGCCCCATGCACCCTATGAGAGCATTTTCCCAGGAGGAGATGGCACAGGGAGGCGGAGGGCCTTGCACAAGGCCACACAGTCAGGAGCTGATAGGGTTGGGGATCAAGCCCAGGCTGCCGGGCCCAGGACTGGGG',
 0)

## DNA Sequence Preprocessing
In this code snippet, we perform preprocessing on a dataset containing DNA sequences. The primary steps include:

1. Word Conversion and Filtering:
    * The initial dataset consists of samples with 200-character DNA sequences, including characters A, T, C, G, and N.
    * We exclude samples containing the character 'N' from further processing.

2. Breaking Down Sequences:
    * The remaining sequences are divided into sets of 25-character words, resulting in eight sets for each original sequence.

3. Numerical Conversion:
    * Each 25-character word is then converted into a numerical representation.
    * The characters A, T, C, G are mapped to numerical values 0, 1, 2, and 3, respectively.
    * The 25-character word is treated as a base-4 number, and the `base4_to_decimal` function converts it to decimal form.
    * The resulting numerical values form a list of features for each sequence.

4. Dataset Creation:

   * The processed data is stored in a new dataset (np_data_set), where each data point consists of a list of numerical features ('sequence') and the corresponding label ('label').

5. Analysis and Output:

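    * The code flags any sequence containing 'N' by inserting a 0 into its numerical representation and excludes such sequences from the dataset. Because 0 is also the encoding of a word made up entirely of 'A', a sequence containing such a word is excluded as well.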
    * The length of the resulting dataset is printed.
    * The first five samples of the converted data are displayed.
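The numerical conversion in step 3 can be illustrated on a short word. This is a simplified sketch that uses Python's built-in `int(text, 4)` instead of the notebook's `base4_to_decimal`; the helper name `word_to_int` is hypothetical.

```python
# Same mapping as the preprocessing step: A, T, C, G -> 0, 1, 2, 3.
BASE4_DIGITS = {'A': '0', 'T': '1', 'C': '2', 'G': '3'}

def word_to_int(word):
    """Read a DNA word as a base-4 number and return its decimal value."""
    digits = ''.join(BASE4_DIGITS[ch] for ch in word)
    return int(digits, 4)

# 'TCG' -> digits '123' in base 4 -> 1*16 + 2*4 + 3 = 27
print(word_to_int('TCG'))                  # 27
# The largest 25-character word is 'G' * 25, i.e. 4**25 - 1.
print(word_to_int('G' * 25) == 4**25 - 1)  # True
```

Every 25-character word therefore maps to an integer below 4**25, which is why the features produced by the cell below are very large numbers.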

In [7]:
import sys

def base4_to_decimal(base4_str):
    decimal_value = 0
    base = 4

    for digit in str(base4_str):
        decimal_value = decimal_value * base + int(digit)

    return decimal_value

char_dict = {'A': 0, 'T': 1, 'C': 2, 'G': 3}

word_length = 25

# Convert a DNA sequence into a list of integers, one per word_length-character word
def convert_word_to_numbers(word):
    num_list = []

    # Process the sequence in chunks of word_length characters
    # (8 chunks of 25 characters for a 200-character sequence)
    for i in range(len(word) // word_length):
        part = word[i * word_length: (i + 1) * word_length]
        number = 0
        for char in part:
            if char == 'N':
                # Mark the sequence as invalid with a 0 sentinel; the caller drops it
                num_list.append(0)
                return num_list
            # Build the base-4 digits as a decimal integer (e.g. 'TCG' -> 123)
            number = number * 10 + char_dict.get(char, 0)
        num_list.append(base4_to_decimal(number))

    return num_list


# Collect the numbers for each word in the dataset
np_data_set = []

for word, label in data_set:
    if len(word) == 200:
        num_list = convert_word_to_numbers(word)
        if 0 in num_list:
          print(f"At least one number in the list for '{word}' is 0. Label: {label}")
        else:
          data_point = {'sequence': num_list, 'label': label}
          np_data_set.append(data_point)
    else:
      print(f"word of unexpected size found: {word}")

print(f"Length of the dataset np_data_set is {len(np_data_set)}")

print("First 5 samples of converted data:")
np_data_set[:5]

At least one number in the list for 'NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN' is 0. Label: 0
At least one number in the list for 'NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN' is 0. Label: 0
At least one number in the list for 'NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN' is 0. Label: 0
At least one number in the list for 'NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN'

[{'sequence': [60529990748994,
   1094136690586524,
   694316848622110,
   185941181688380,
   1069267071266623,
   729561373904271,
   254072917530682,
   631647932500479],
  'label': 0},
 {'sequence': [452181908103151,
   383713384982329,
   386903805088975,
   213126865323367,
   395158524539076,
   1070127763740754,
   664265500883404,
   796615475938622],
  'label': 0},
 {'sequence': [2211918762217,
   1002228117149257,
   650051951709683,
   392325146055957,
   375350478894976,
   229065861898112,
   47765561756170,
   282658361413663],
  'label': 0},
 {'sequence': [29197813616243,
   913953510980156,
   940299582169284,
   970996166762496,
   92360379982325,
   132523176382024,
   629168945446014,
   408644451057272],
  'label': 0},
 {'sequence': [406829193085002,
   509589158306665,
   720282528559074,
   624330955609743,
   528427940862627,
   326517985704158,
   910518586052635,
   1004181103862019],
  'label': 0}]

## Shuffling Data for Balanced Distribution


In the subsequent steps of this code, we'll divide the data into portions for each of our clients, and it's crucial to ensure that each client receives a balanced mix of data from both classes (0 and 1). Therefore, we need to shuffle the `np_data_set` variable.

Shuffling the dataset randomizes the order of samples, which makes it very unlikely that any single client receives data from only one class. This is essential for a more representative and fair distribution of data among clients.
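The effect of shuffling on class balance can be checked on toy data. The sketch below uses the standard library's `random` module with a fixed seed instead of `np.random.shuffle`, and `toy_data` is a made-up stand-in for `np_data_set`.

```python
import random
from collections import Counter

# Toy stand-in for np_data_set: 50 samples of each class, grouped by label.
toy_data = [{'label': 0}] * 50 + [{'label': 1}] * 50

rng = random.Random(42)  # fixed seed so the shuffle is reproducible
rng.shuffle(toy_data)

# Split into two equally sized "clients" and count the labels in each.
halves = [toy_data[:50], toy_data[50:]]
counts = [Counter(d['label'] for d in half) for half in halves]
print(counts)
```

Without the shuffle, the first half would contain only label 0 and the second only label 1. After shuffling, each half should contain a mix of both labels.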


In [8]:
import numpy as np
np.random.shuffle(np_data_set)
print("First 5 samples of encoded shuffled data:")
np_data_set[:5]

First 5 samples of encoded shuffled data:


[{'sequence': [984872014116412,
   940299582800415,
   493241989484552,
   598139938506570,
   510629620628729,
   912855596101803,
   1072560001479584,
   874910111558140],
  'label': 0},
 {'sequence': [703715448471728,
   3023678494017,
   563156556644769,
   281573760958785,
   410461435199494,
   705061881667605,
   396223659901461,
   397267305660417],
  'label': 1},
 {'sequence': [376094337037911,
   69282349434460,
   916751589951246,
   695730855849001,
   1122201099528530,
   11120222362712,
   419984190184867,
   704202945238796],
  'label': 0},
 {'sequence': [127565277242708,
   375864430031953,
   313576101737940,
   373856639009793,
   183996472952935,
   531130928971890,
   519470023600356,
   670811074619226],
  'label': 0},
 {'sequence': [972334626580833,
   798728392939967,
   90290234381900,
   234636505134083,
   412327068287322,
   488476333558553,
   844718486258119,
   312391766229568],
  'label': 1}]


## Scaling the Data with Robust Scaling

In the code below, we apply robust scaling to normalize the numerical values in the dataset. `RobustScaler` centers each feature on its median and divides by its interquartile range, so outliers have little influence on the scaling and all features end up on a comparable scale.
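To make the idea concrete, here is a standard-library sketch of robust scaling for a single feature: subtract the median and divide by the interquartile range. It assumes the default `RobustScaler` settings (centering on the median, 25th to 75th percentile range); `robust_scale` is an illustrative helper and not the scikit-learn implementation.

```python
from statistics import median, quantiles

def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    q1, _, q3 = quantiles(values, n=4, method='inclusive')
    med = median(values)
    return [(v - med) / (q3 - q1) for v in values]

# The outlier 100 stays large but does not distort the scaling of the other values.
print(robust_scale([1, 2, 3, 4, 100]))  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

Note that the scaled values are not confined to a fixed interval; outliers remain outliers after scaling.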

In [10]:
from sklearn.preprocessing import RobustScaler



sequences = np.array([item['sequence'] for item in np_data_set])
sequences = np.vstack(sequences)

scaler = RobustScaler()
sequences_scaled = scaler.fit_transform(sequences)

for i, item in enumerate(np_data_set):
    item['sequence'] = sequences_scaled[i]

print("First 5 samples of scaled encoded shuffled data:")
np_data_set[:5]

First 5 samples of scaled encoded shuffled data:


[{'sequence': array([1.12313285, 1.03521293, 0.13125814, 0.34099392, 0.16684049,
         0.98048499, 1.29933256, 0.88321131]),
  'label': 0},
 {'sequence': array([ 0.55707553, -0.85223864,  0.27206074, -0.29973812, -0.03567302,
          0.56210429, -0.06422195, -0.05702146]),
  'label': 1},
 {'sequence': array([-0.10252965, -0.71880938,  0.98417414,  0.53851859,  1.40327588,
         -0.83510735, -0.0163186 ,  0.54717679]),
  'label': 0},
 {'sequence': array([-0.60289745, -0.10142572, -0.23057524, -0.11295696, -0.49352513,
          0.21190431,  0.18425373,  0.48144539]),
  'label': 0},
 {'sequence': array([ 1.09789111,  0.7501221 , -0.68025593, -0.39473943, -0.03190121,
          0.12602174,  0.8399837 , -0.2240977 ]),
  'label': 1}]

## Splitting the Dataset and Preparing Test Data

In this section, we divide the `np_data_set` variable into two subsets: the first 70,000 samples are earmarked for training and the last 3,106 samples are reserved for testing. This division is crucial for the development and evaluation of our Quantum Federated Learning model, ensuring that we have separate datasets for these purposes.

Following the split, we proceed to prepare the test data for further analysis and evaluation. We extract the sequences and labels from the testing dataset. This separation is essential as it allows us to analyze the data and labels separately, facilitating model evaluation and performance assessment.

At this point, the test data is organized into two variables:
- `test_sequences`: An array containing the sequences from the test data.
- `test_labels`: An array containing the corresponding labels from the test data.

These variables will be used in subsequent steps to evaluate the model's performance on the testing data.


In [12]:
np_train_data = np_data_set[:70000]
np_test_data = np_data_set[-3106:]

print(f"Length of np_train_data: {len(np_train_data)}")
print(f"Length of np_test_data: {len(np_test_data)}")

test_sequences = [data_point["sequence"] for data_point in np_test_data]
test_labels = [data_point["label"] for data_point in np_test_data]
test_sequences = np.array(test_sequences)
test_labels = np.array(test_labels)


Length of np_train_data: 70000
Length of np_test_data: 3106


## Configuring the Federated Learning Setup

In this code section, we establish essential variables and settings for our Federated Learning setup. These variables play a crucial role in shaping how the Federated Learning process unfolds and offer the flexibility to customize the experiment to meet specific requirements.

Here, we outline the key variables that we define:

- `num_clients`: The number of clients participating in the Federated Learning setup.
- `num_epochs`: The number of Federated Learning epochs (rounds). In each epoch, every client trains on its own slice of the training data.
- `max_train_iterations`: The maximum number of optimizer iterations each client performs in each epoch; it is passed to the `COBYLA` optimizer as `maxiter`.
- `samples_per_epoch`: The number of samples each client processes in each epoch.
- `backend`: The choice of backend, specified as 'aer_simulator' in this code, determines the quantum simulator used for the Federated Learning setup. If you intend to work with a real quantum device, you can replace this backend with a real quantum device backend provided by IBM Quantum.
- `fl_avg_weight_range`: The minimum and maximum averaging weight. When client models are combined by weighted averaging, their test scores are rescaled into this range.
- `ansatz_reps`: The number of repetitions of the ansatz circuit in the Variational Quantum Classifier (VQC) model, which determines the depth of the circuit.
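These settings interact with the size of the training set: every client needs a fresh slice of data in every epoch. A quick sanity check, using the values from the configuration cell and the 70,000 training samples from the split above, might look like this:

```python
# Values from the configuration cell in this notebook.
num_clients = 3
num_epochs = 200
samples_per_epoch = 100
train_size = 70000  # length of np_train_data

samples_needed = num_clients * num_epochs * samples_per_epoch
print(samples_needed)                # 60000
print(samples_needed <= train_size)  # True
```

If `samples_needed` exceeded `train_size`, the later slices would be shorter than expected or empty, so it is worth re-running this check whenever the settings change.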

If you wish to work with a real quantum backend, the following code snippet shows how to load the IBM Quantum account and set the appropriate backend. It uses the legacy `IBMQ` interface, which is only available in older Qiskit releases:
```python
from qiskit import Aer, IBMQ

# Load your IBM Quantum account
IBMQ.load_account()

# Access the provider with the desired backend
provider = IBMQ.get_provider(hub='ibm-q', group='open', project='main')

# List available backends
provider.backends()

# Define the backend for your quantum computations
backend = provider.get_backend('ibm_nairobi')
```


In [14]:
import time
from qiskit.circuit.library import ZZFeatureMap, RealAmplitudes
from qiskit_algorithms.optimizers import COBYLA
from qiskit_machine_learning.algorithms.classifiers import VQC
from qiskit.primitives import BackendSampler
from functools import partial
from qiskit_aer import Aer  # Aer is provided by the qiskit-aer package installed above


num_clients = 3
num_epochs = 200
max_train_iterations = 20
samples_per_epoch= 100
backend = Aer.get_backend('aer_simulator')

fl_avg_weight_range = [0.1, 1]

ansatz_reps = 50


## Defining the Client Class and Splitting the Training Data

In the code below, we take two significant actions: defining a `Client` class and splitting the training data into multiple clients.

**Client Class**:
- The `Client` class is introduced to encapsulate essential information for each client in our Federated Learning setup. Each client has attributes like `models`, `primary_model`, `data`, `test_scores`, and `train_scores`. These attributes are crucial for managing and tracking the client's involvement in the Federated Learning process.

**Data Splitting Function**:
- A function named `split_dataset` is defined to split the training data into segments, with each segment designated for a specific client and epoch. The function takes parameters such as `num_clients`, `num_epochs`, and `samples_per_epoch` to control the data splitting process.

**Creating Client Instances**:
- After defining the class and data splitting function, we proceed to create an array called `clients`. This array is populated with instances of the `Client` class, and each instance contains the relevant data segments for each epoch. This division ensures that each client has access to its designated training data.

This code sets the foundation for managing clients and their data within the Federated Learning framework.
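The index arithmetic behind the data split can be sketched in isolation. The helper `slice_bounds` below is hypothetical; it computes the same start and end indices as the `split_dataset` function in the next cell, written in a factored form.

```python
def slice_bounds(client, epoch, num_epochs, samples_per_epoch):
    """Return (start, end) indices of the training slice for one client and epoch."""
    start = (client * num_epochs + epoch) * samples_per_epoch
    return start, start + samples_per_epoch

# Small toy configuration: 2 clients, 3 epochs, 4 samples per epoch.
bounds = [slice_bounds(c, e, 3, 4) for c in range(2) for e in range(3)]
print(bounds)
# [(0, 4), (4, 8), (8, 12), (12, 16), (16, 20), (20, 24)]
```

The slices are contiguous and never overlap, so no sample is shared between clients or reused across epochs.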


In [15]:
class Client:
    def __init__(self, data):
        self.models = []
        self.primary_model = None
        self.data = data
        self.test_scores = []
        self.train_scores = []

def split_dataset(num_clients, num_epochs, samples_per_epoch):
  clients = []
  for i in range(num_clients):
    client_data = []
    for j in range(num_epochs):
      start_idx = (i*num_epochs*samples_per_epoch)+(j*samples_per_epoch)
      end_idx = (i*num_epochs*samples_per_epoch)+((j+1)*samples_per_epoch)
      client_data.append(np_train_data[start_idx:end_idx])
    clients.append(Client(client_data))
  return clients

clients = split_dataset(num_clients, num_epochs, samples_per_epoch)


## Examining Client Data

The expression `clients[0].data[0][:9]` displays the first nine samples of the first client's data for its first epoch.



In [16]:
clients[0].data[0][:9]

[{'sequence': array([1.12313285, 1.03521293, 0.13125814, 0.34099392, 0.16684049,
         0.98048499, 1.29933256, 0.88321131]),
  'label': 0},
 {'sequence': array([ 0.55707553, -0.85223864,  0.27206074, -0.29973812, -0.03567302,
          0.56210429, -0.06422195, -0.05702146]),
  'label': 1},
 {'sequence': array([-0.10252965, -0.71880938,  0.98417414,  0.53851859,  1.40327588,
         -0.83510735, -0.0163186 ,  0.54717679]),
  'label': 0},
 {'sequence': array([-0.60289745, -0.10142572, -0.23057524, -0.11295696, -0.49352513,
          0.21190431,  0.18425373,  0.48144539]),
  'label': 0},
 {'sequence': array([ 1.09789111,  0.7501221 , -0.68025593, -0.39473943, -0.03190121,
          0.12602174,  0.8399837 , -0.2240977 ]),
  'label': 1},
 {'sequence': array([ 0.53146172, -0.09209762,  0.17938519,  0.90324327,  0.95807872,
          0.17297506, -0.39045009, -0.23867421]),
  'label': 0},
 {'sequence': array([-0.78697007,  0.97931277, -0.30861881,  0.39782531, -0.31971517,
          1.2115

## Model Accuracy and Creation Functions

In the provided code, two essential functions are defined, each with a specific role.

**`getAccuracy` Function**:
- The `getAccuracy` function calculates and returns the accuracy of a model with the given weights. It builds a Variational Quantum Classifier (`VQC`) with those weights as its initial point. Although the function calls `vqc.fit()`, no actual training occurs because the optimizer's maximum iteration count is set to 0. This is a workaround: the `.score` function cannot be used on a new `VQC` instance before `.fit` has been executed. The function then computes the accuracy on the first `test_num` test sequences and labels (200 by default).

**`create_model` Function**:
- The `create_model` function creates a new `VQC` model, optionally with its initial point set to the given weights. It is used to create a global model from global model weights during the Federated Learning training process.


In [None]:
itr = 0
def training_callback(weights, obj_func_eval):
        global itr
        itr += 1
        print(f"{itr} {obj_func_eval}", end=' | ')


def getAccuracy(weights, test_num = 200):
        num_features = len(test_sequences[0])
        feature_map = ZZFeatureMap(feature_dimension=num_features, reps=1)
        ansatz = RealAmplitudes(num_qubits=num_features, reps=ansatz_reps)
        optimizer = COBYLA(maxiter=0)
        vqc = VQC(
            feature_map=feature_map,
            ansatz=ansatz,
            optimizer=optimizer,
            sampler=BackendSampler(backend=backend),
            initial_point = weights
        )
        vqc.fit(test_sequences[:25], test_labels[:25])
        return vqc.score(test_sequences[:test_num], test_labels[:test_num])


def create_model(weights=None):
    num_features = len(test_sequences[0])
    feature_map = ZZFeatureMap(feature_dimension=num_features, reps=1)
    ansatz = RealAmplitudes(num_qubits=num_features, reps=ansatz_reps)
    optimizer = COBYLA(maxiter=max_train_iterations)
    # initial_point=None is the VQC default, so one constructor call
    # covers both the "with weights" and "without weights" cases
    return VQC(
        feature_map=feature_map,
        ansatz=ansatz,
        optimizer=optimizer,
        sampler=BackendSampler(backend=backend),
        warm_start=True,
        initial_point=weights,
        callback=training_callback,
    )


In [None]:
import warnings

# Temporary code to suppress all FutureWarnings for a cleaner output
warnings.simplefilter("ignore", FutureWarning)


## Global Model Update Functions
The following code defines a set of functions that play a pivotal role in calculating and updating global models across multiple clients in each epoch. Three distinct techniques are employed, each serving a unique purpose in refining the global model.

### simple_averaging Function:
The simple_averaging function performs a straightforward averaging of the weights across all client models, including the global weights from the previous epoch if available. The key steps are as follows:

1. Initialization: If global weights from the previous epoch exist, they are appended to the list of client weights along with their corresponding test scores.

2. Averaging: The function iterates through the weights of each client at the same position and calculates the average for that position across all clients. This process is repeated for all positions, resulting in a set of averaged weights.

3. Result: The function returns a list of averaged weights, representing the updated global model for the current epoch.
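As a compact illustration of the averaging step, the element-wise mean of several weight vectors can be written with `zip`. The helper `average_weights` is a sketch with made-up inputs, not the notebook's `simple_averaging`.

```python
def average_weights(weight_lists):
    """Element-wise mean of several equally long weight vectors."""
    return [sum(column) / len(weight_lists) for column in zip(*weight_lists)]

client_weights = [[0.0, 1.0], [1.0, 2.0], [2.0, 6.0]]
print(average_weights(client_weights))  # [1.0, 3.0]
```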

### weighted_average Function:
The weighted_average function introduces a more nuanced approach, where the weights of each client contribute to the global model based on their test accuracy. The steps are outlined below:

1. Initialization: Similar to simple_averaging, global weights from the previous epoch are appended to the list of client weights along with their test scores.

2. Sorting and Scaling: The function sorts the client weights by their corresponding test scores. It then rescales the test scores into `fl_avg_weight_range`, so that models with better test accuracy receive a larger averaging weight.

3. Weighted Average Calculation: The function calculates the weighted average of the weights, where the contribution of each client is proportional to its scaled test accuracy. This process results in a refined set of global weights for the current epoch.

4. Result: The function returns the weighted average weights, representing the updated global model.
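The scaling and weighted-average steps can be sketched on toy numbers. The helpers `scale_scores` and `weighted_mean` are hypothetical simplifications; the default range of 0.1 to 1 matches `fl_avg_weight_range` in this notebook.

```python
def scale_scores(scores, low=0.1, high=1.0):
    """Min-max rescale test scores into [low, high]; scores must not all be equal."""
    lo, hi = min(scores), max(scores)
    return [low + (high - low) * (s - lo) / (hi - lo) for s in scores]

def weighted_mean(weight_lists, coeffs):
    """Element-wise weighted mean of weight vectors."""
    total = sum(coeffs)
    return [sum(w * c for w, c in zip(column, coeffs)) / total
            for column in zip(*weight_lists)]

scores = [0.5, 0.75, 1.0]
coeffs = scale_scores(scores)
print(coeffs)                                         # approximately [0.1, 0.55, 1.0]
print(weighted_mean([[0.0], [1.0], [2.0]], coeffs))   # approximately [1.545]
```

The worst model still contributes with coefficient 0.1 rather than 0, which is the purpose of the lower bound of the range.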

### weighted_average_best_pick Function:
The weighted_average_best_pick function refines the weighted averaging technique further by considering only the best-performing models for the final weight calculations. The key steps are as follows:

1. Initialization: Similar to the other functions, global weights from the previous epoch are appended to the list of client weights along with their test scores.

2. Sorting and Scaling: The function sorts and scales the client weights based on their test scores, similar to weighted_average.

3. Best Pick Selection: Only weights corresponding to models with scaled test accuracy above a specified cutoff value are retained for the final calculations.

4. Result: The function returns the weighted average weights, considering only the best-performing models above the cutoff, representing the updated global model for the current epoch.
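The selection step amounts to a filter on the scaled scores. The sketch below uses a hypothetical helper `best_pick` and assumes the scores have already been rescaled into the range 0.1 to 1.

```python
def best_pick(weight_lists, coeffs, cutoff=0.5):
    """Keep only the models whose scaled score is at or above the cutoff."""
    kept = [(w, c) for w, c in zip(weight_lists, coeffs) if c >= cutoff]
    return [w for w, _ in kept], [c for _, c in kept]

weights, coeffs = best_pick([[0.0], [1.0], [2.0]], [0.1, 0.55, 1.0])
print(weights, coeffs)  # [[1.0], [2.0]] [0.55, 1.0]
```

Because the best model is always rescaled to the upper end of the range, at least one model survives the filter as long as the cutoff does not exceed that upper bound.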

### create_new_client_model Function:
The create_new_client_model function takes a primary client model, the current epoch, and the global model weights, and blends the client's weights with the global weights for personalized learning. The global weights receive a coefficient of 1/(current_epoch + 2) and the client's own weights receive (current_epoch + 1)/(current_epoch + 2), so the client's own model dominates more strongly as training progresses. The result is a new **personalized** client model for the next epoch.
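The epoch-dependent blending can be sketched without any quantum code. The helpers `mixing_coefficients` and `mix` are illustrative, and the example assumes epochs are counted from 0, which the notebook does not state explicitly.

```python
def mixing_coefficients(current_epoch):
    """Return (global_coeff, client_coeff) for a given epoch."""
    return 1 / (current_epoch + 2), (current_epoch + 1) / (current_epoch + 2)

def mix(client_w, global_w, current_epoch):
    """Blend client and global weight vectors element by element."""
    g, c = mixing_coefficients(current_epoch)
    return [c * cw + g * gw for cw, gw in zip(client_w, global_w)]

print(mixing_coefficients(0))  # (0.5, 0.5)
print(mixing_coefficients(8))  # (0.1, 0.9)
print(mix([1.0], [0.0], 0))    # [0.5]
```

The two coefficients always sum to 1, and the share of the global model shrinks as the epoch number grows.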

In [None]:
import numpy as np

def create_new_client_model(primary_model, current_epoch, global_model_weights):
  assigned_weights = [1/(current_epoch+2), (current_epoch+1)/(current_epoch+2)]
  primary_model_weights = primary_model.weights
  new_client_weights = []
  for index, _ in enumerate(primary_model_weights):
    new_client_weights.append(assigned_weights[1]*primary_model_weights[index] + assigned_weights[0]*global_model_weights[index])
  return create_model(new_client_weights)

def sort_epoch_results(epoch_results):
    # Pair weights and test_scores together
    pairs = zip(epoch_results['weights'], epoch_results['test_scores'])

    # Sort the pairs based on test_scores
    sorted_pairs = sorted(pairs, key=lambda x: x[1])

    # Unzip the sorted pairs back into separate arrays
    sorted_weights, sorted_test_scores = zip(*sorted_pairs)

    # Create a new sorted dictionary
    sorted_epoch_results = {
        'weights': list(sorted_weights),
        'test_scores': list(sorted_test_scores)
    }

    return sorted_epoch_results

fl_avg_weight_range = [0.1, 1]

def scale_test_scores(sorted_epoch_results):
    min_test_score = sorted_epoch_results['test_scores'][0]
    max_test_score = sorted_epoch_results['test_scores'][-1]
    min_weight, max_weight = fl_avg_weight_range
    scaled_weights = [
        min_weight + (max_weight - min_weight) * (test_score - min_test_score) / (max_test_score - min_test_score)
        for test_score in sorted_epoch_results['test_scores']
    ]
    sorted_epoch_results['fl_avg_weights'] = scaled_weights
    return sorted_epoch_results

def calculate_weighted_average(model_weights, fl_avg_weights):
    weighted_sum_weights = []
    for index in range(len(model_weights[0])):
      weighted_sum_weights.append(0)
      weighted_sum_weights[index] = sum([(weights_array[index]* avg_weight) for weights_array, avg_weight  in zip(model_weights, fl_avg_weights)])/sum(fl_avg_weights)
    return weighted_sum_weights

def weighted_average(epoch_results, global_model_weights_last_epoch = None, global_model_accuracy_last_epoch = None):
  if global_model_weights_last_epoch is not None:
    epoch_results['weights'].append(global_model_weights_last_epoch)
    epoch_results['test_scores'].append(global_model_accuracy_last_epoch)

  if all(epoch_results['test_scores'][0] == x for x in epoch_results['test_scores']):
    # All values in the array are equal
      print("Equal test scores received")
      return simple_averaging(epoch_results)

  epoch_results = sort_epoch_results(epoch_results)
  epoch_results = scale_test_scores(epoch_results)
  print(epoch_results)
  weighted_average_weights_curr_epoch = calculate_weighted_average(epoch_results['weights'], epoch_results['fl_avg_weights'])
  return weighted_average_weights_curr_epoch



def weighted_average_best_pick(epoch_results, global_model_weights_last_epoch = None, global_model_accuracy_last_epoch = None, best_pick_cutoff = 0.5):

  if global_model_weights_last_epoch is not None:
    epoch_results['weights'].append(global_model_weights_last_epoch)
    epoch_results['test_scores'].append(global_model_accuracy_last_epoch)

  if all(epoch_results['test_scores'][0] == x for x in epoch_results['test_scores']):
    # All values in the array are equal
      print("Equal test scores received")
      return simple_averaging(epoch_results)

  epoch_results = sort_epoch_results(epoch_results)
  epoch_results = scale_test_scores(epoch_results)

  new_weights = []
  new_test_scores = []
  new_fl_avg_weights = []

  for index, fl_avg_weight in enumerate(epoch_results['fl_avg_weights']):
      if fl_avg_weight >= best_pick_cutoff:
          new_weights.append(epoch_results['weights'][index])
          new_test_scores.append(epoch_results['test_scores'][index])
          new_fl_avg_weights.append(fl_avg_weight)

  # Update the epoch_results dictionary with the new lists
  epoch_results['weights'] = new_weights
  epoch_results['test_scores'] = new_test_scores
  epoch_results['fl_avg_weights'] = new_fl_avg_weights

  print(epoch_results)
  weighted_average_weights_curr_epoch = calculate_weighted_average(epoch_results['weights'], epoch_results['fl_avg_weights'])
  return weighted_average_weights_curr_epoch

def simple_averaging(epoch_results, global_model_weights_last_epoch = None, global_model_accuracy_last_epoch = None):
  if global_model_weights_last_epoch is not None:
    epoch_results['weights'].append(global_model_weights_last_epoch)
    epoch_results['test_scores'].append(global_model_accuracy_last_epoch)

  epoch_weights = epoch_results['weights']
  averages = []
  # Iterate through the columns (i.e., elements at the same position) of the arrays
  for col in range(len(epoch_weights[0])):
      # Initialize a variable to store the sum of elements at the same position
      col_sum = 0
      for row in range(len(epoch_weights)):
          col_sum += epoch_weights[row][col]

      # Calculate the average for this column and append it to the averages list
      col_avg = col_sum / len(epoch_weights)
      averages.append(col_avg)

  return averages



## Training Function for Federated Learning

In the following code, we define a fundamental function used during the Federated Learning phase for each client in each epoch. This function, named `train`, takes the client's data for a specific epoch and trains a model accordingly. It returns the training score, the testing score, and the trained model, in that order.

**Function Explanation**:
- The `train` function begins by checking whether a model has been provided as an argument. If not (which happens only in the first epoch, when a client does not yet have its own model), it creates one with `create_model`. The model is a Variational Quantum Classifier (`VQC`) built from a feature map, an ansatz, an optimizer, and a callback function for tracking the training progress.

- The function then processes the training data by extracting the sequences and labels from the provided data. These sequences and labels are organized into NumPy arrays for compatibility with the model.

- The training process is initiated, and the function measures the time taken for training. Upon completion, it prints the time elapsed during training.

- The model's performance is evaluated by scoring it on the training data and on the first 200 samples of the testing data. The training score and testing score are computed and returned.

This `train` function plays a central role in the Federated Learning process, enabling clients to train models and evaluate their performance for each epoch.


In [None]:
import time

def train(data, model = None):
  if model is None:
    model = create_model()

  train_sequences = [data_point["sequence"] for data_point in data]
  train_labels = [data_point["label"] for data_point in data]

  # Convert the lists to NumPy arrays
  train_sequences = np.array(train_sequences)
  train_labels = np.array(train_labels)

  # Print the shapes
  print("Train Sequences Shape:", train_sequences.shape)
  print("Train Labels Shape:", train_labels.shape)

  print("Training Started")
  start_time = time.time()
  model.fit(train_sequences, train_labels)
  end_time = time.time()
  elapsed_time = end_time - start_time
  print(f"\nTraining complete. Time taken: {elapsed_time} seconds.")

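  print("SCORING MODEL")
  train_score_q = model.score(train_sequences, train_labels)
  # test_sequences and test_labels are globals defined earlier in the notebook;
  # only the first 200 samples are scored to keep evaluation time manageable
  test_score_q = model.score(test_sequences[:200], test_labels[:200])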
  return train_score_q, test_score_q, model
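
To check the data-shaping and timing logic without waiting on a quantum backend, the same flow can be exercised with a stand-in model. The sketch below is an assumption-laden illustration: `StubModel`, `train_sketch`, and the toy data are hypothetical names introduced here, and the stub simply predicts the majority training label rather than doing anything quantum.

```python
import time


class StubModel:
    """Hypothetical stand-in exposing fit/score like the real classifier."""

    def fit(self, sequences, labels):
        # Remember the most common training label
        self.majority = max(set(labels), key=labels.count)

    def score(self, sequences, labels):
        # Fraction of labels that match the remembered majority label
        hits = sum(1 for label in labels if label == self.majority)
        return hits / len(labels)


def train_sketch(data, test_data, model=None):
    # Mirror the notebook's pattern: build a model only if none was passed in
    if model is None:
        model = StubModel()

    sequences = [point["sequence"] for point in data]
    labels = [point["label"] for point in data]

    start = time.perf_counter()
    model.fit(sequences, labels)
    elapsed = time.perf_counter() - start
    print(f"Training complete. Time taken: {elapsed:.6f} seconds.")

    train_score = model.score(sequences, labels)
    test_score = model.score(
        [point["sequence"] for point in test_data],
        [point["label"] for point in test_data],
    )
    return train_score, test_score, model


toy_train = [
    {"sequence": [0.1, 0.2], "label": 1},
    {"sequence": [0.3, 0.4], "label": 1},
    {"sequence": [0.5, 0.6], "label": 0},
]
toy_test = [
    {"sequence": [0.7, 0.8], "label": 1},
    {"sequence": [0.9, 1.0], "label": 0},
]
train_score, test_score, model = train_sketch(toy_train, toy_test)
```

Unlike `train`, the sketch takes the test data as an argument instead of reading module-level globals, which makes it easier to run in isolation.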





## Client Initialization for Federated Learning Techniques
In this code block, we initialize a set of clients for each federated learning technique. The `fl_techniques` dictionary maps technique names to their corresponding aggregation functions. So that each technique operates on its own set of clients, we create new `Client` objects from the original `clients` list for each technique. The result is `clients_2d_array`, a nested list in which each sublist corresponds to one technique and holds that technique's client copies. This lets the techniques run independently, without sharing models or scores.

In [None]:
# Create a dictionary mapping function names to their corresponding functions
fl_techniques = {
    'Best_Pick_Weighted_Averaging': weighted_average_best_pick,
    'Weighted_Averaging': weighted_average,
    'Averaging': simple_averaging
}
# One independent list of clients per technique
clients_2d_array = [[] for _ in range(len(fl_techniques))]

for index, technique_name in enumerate(fl_techniques):
  for client in clients:
    # A new Client object keeps each technique's models and scores separate
    client_copy = Client(client.data)
    clients_2d_array[index].append(client_copy)

In [None]:
clients_2d_array

[[<__main__.Client at 0x7fdafa638130>,
  <__main__.Client at 0x7fdafa638310>,
  <__main__.Client at 0x7fdafa638370>]]
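
The effect of creating separate client objects per technique can be seen in a small sketch. `SketchClient` is a hypothetical, minimal stand-in that carries only the attributes used in this notebook's training loop; the real `Client` class may hold more state.

```python
class SketchClient:
    """Hypothetical minimal client with the attributes the training loop uses."""

    def __init__(self, data):
        self.data = data
        self.models = []
        self.train_scores = []
        self.test_scores = []
        self.primary_model = None


technique_names = ["Best_Pick_Weighted_Averaging", "Weighted_Averaging", "Averaging"]
base_clients = [SketchClient(data=[f"shard-{i}"]) for i in range(3)]

# One fresh list of client objects per technique
clients_by_technique = [
    [SketchClient(client.data) for client in base_clients]
    for _ in technique_names
]

# Recording a score under one technique leaves the other techniques untouched
clients_by_technique[0][0].test_scores.append(0.51)
print(len(clients_by_technique[0][0].test_scores))  # 1
print(len(clients_by_technique[1][0].test_scores))  # 0

# The underlying data is shared by reference, not duplicated
print(clients_by_technique[0][0].data is clients_by_technique[1][0].data)  # True
```

Sharing the data by reference is fine as long as no technique mutates it; only the per-client training state needs to be separate.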

## Federated Learning Training Loop with Different Techniques
In this section, we implement the training loop for Federated Learning across multiple epochs and clients, considering different federated learning techniques - specifically, Average, Weighted Average, and Best Pick Average. The key steps and processes are as follows:

1. Technique-specific Iteration: The code iterates through each federated learning technique - Average, Weighted Average, and Best Pick Average.

2. Epoch-by-Epoch Training: Within each technique, the code proceeds with epoch-by-epoch training. It starts with epoch 0 and prepares to train client models for that epoch.

3. Client Training: For each epoch and each client within the selected technique, the code checks if the client has a primary model. If not (the first epoch), it creates a new model; otherwise it continues training the client's primary model. In both cases the model is trained on the data specific to that client and epoch and appended to the client's `models` list, and the training and testing scores are stored on the client.

4. Global Model Aggregation: After training the models for all clients in a given epoch and technique, the code collects the trained model weights and test scores in a dictionary called `epoch_results`. From the second epoch onward, the previous epoch's global weights and accuracy are passed to the technique function along with it.

5. Global Model Update Using Technique: The code uses the specified federated learning technique (Average, Weighted Average, or Best Pick Average) to calculate new global weights based on the collected `epoch_weights`. This ensures that the global model is updated according to the chosen technique for the current epoch.

6. Client Model Update: The new global model is assigned to each client within the selected technique as its primary model for the next epoch. This ensures that all clients within the technique train using the same global model in the following epoch.

7. Global Model Evaluation: The code evaluates the accuracy of the global model on a subset of the testing data (200 samples). The accuracy is stored in the `global_model_accuracy` list for tracking the performance of the global model over different epochs within the selected federated learning technique.

8. Technique-specific Iteration (Continued): Steps 2-7 are repeated for the specified number of epochs within the chosen federated learning technique.
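
The control flow of steps 2 to 6 can be sketched without any quantum components. In the toy version below everything is an assumption made for illustration: `local_update` stands in for client training by moving weights halfway toward a client-specific target, `average` is plain averaging, and `run_rounds` is a hypothetical driver. It is not the notebook's implementation, only the same loop shape.

```python
def local_update(weights, target):
    """Stub for client training: move each weight halfway toward a target."""
    return [w + 0.5 * (t - w) for w, t in zip(weights, target)]


def average(weight_vectors):
    """Plain position-by-position average of several weight vectors."""
    return [sum(column) / len(column) for column in zip(*weight_vectors)]


def run_rounds(client_targets, num_epochs):
    dim = len(client_targets[0])
    # Starting weights per client (plays the role of each client's primary model)
    client_start = [[0.0] * dim for _ in client_targets]
    global_history = []

    for epoch in range(num_epochs):
        # Step 3: every client trains from its current starting weights
        epoch_weights = [
            local_update(start, target)
            for start, target in zip(client_start, client_targets)
        ]
        # Step 4: from the second epoch on, include the previous global weights
        if global_history:
            epoch_weights.append(global_history[-1])
        # Step 5: aggregate into the new global weights
        new_global = average(epoch_weights)
        global_history.append(new_global)
        # Step 6: every client starts the next epoch from the global weights
        client_start = [list(new_global) for _ in client_targets]

    return global_history


# Two hypothetical clients whose local optima differ
history = run_rounds(client_targets=[[1.0, 1.0], [3.0, 3.0]], num_epochs=5)
for epoch, weights in enumerate(history):
    print(epoch, [round(w, 4) for w in weights])
```

With these two targets the global weights drift toward the midpoint `[2.0, 2.0]` over the epochs, which is the behavior one would hope for from averaging when clients pull in different directions.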



In [None]:
global_model_weights = []
global_model_accuracy = []



for outer_idx, clients in enumerate(clients_2d_array):
  technique_name = list(fl_techniques.keys())[outer_idx]
  technique_function = list(fl_techniques.values())[outer_idx]
  print(f"Technique Name: {technique_name}")
  global_model_weights.append([])
  global_model_accuracy.append([])
  for epoch in range(num_epochs):
    epoch_results = {
        'weights': [],
        'test_scores': []
    }
    print(f"epoch: {epoch}")

    for index, client in enumerate(clients):
      print(f"Index: {index}, Client: {client}")

      # train() creates a fresh model when primary_model is None (first epoch);
      # otherwise it continues from the client's current primary model
      train_score_q, test_score_q, model = train(data = client.data[epoch], model = client.primary_model)
      client.models.append(model)
      client.test_scores.append(test_score_q)
      client.train_scores.append(train_score_q)
      print("Train Score:", train_score_q)
      print("Test Score:", test_score_q)
      print("\n\n")
      epoch_results['weights'].append(model.weights)
      epoch_results['test_scores'].append(test_score_q)

    # The first epoch has no previous global model to pass to the technique
    if epoch == 0:
      new_global_weights = technique_function(epoch_results)
    else:
      new_global_weights = technique_function(epoch_results, global_model_weights[outer_idx][epoch - 1], global_model_accuracy[outer_idx][epoch - 1])
    print(new_global_weights)
    global_model_weights[outer_idx].append(new_global_weights)
    new_model_with_global_weights = create_model(weights = global_model_weights[outer_idx][epoch])

    for index, client in enumerate(clients):
      client.primary_model = create_new_client_model(client.models[-1], epoch, global_model_weights[outer_idx][epoch])

    global_accuracy = getAccuracy(global_model_weights[outer_idx][epoch], len(test_sequences[:200]))
    global_model_accuracy[outer_idx].append(global_accuracy)
    print(f"Technique Name: {technique_name}")
    print(f"Global Model Accuracy In Epoch {epoch}: {global_accuracy}")
    print("----------------------------------------------------------")



Technique Name: Best_Pick_Weighted_Averaging
epoch: 0
Index: 0, Client: <__main__.Client object at 0x7fdafa638130>
Train Sequences Shape: (100, 8)
Train Labels Shape: (100,)
Training Started
1 1.013228731031383 | 2 1.0203079335848908 | 3 1.0098716836340433 | 4 1.0096250775594726 | 5 0.9983751116596715 | 6 1.0164612702410925 | 7 1.0112291733702814 | 8 0.9951007300854181 | 9 0.9976328305510538 | 10 1.0049889608337572 | 11 0.9949900501986059 | 12 0.9905166175912984 | 13 0.997354170523139 | 14 1.001772258859375 | 15 1.001634984928471 | 16 0.9945097669808797 | 17 0.9947425378159089 | 18 0.9956759789411005 | 19 0.986458874628245 | 20 0.9954153126833348 | 
Training complete. Time taken: 80.92854499816895 seconds.
SCORING MODEL
Train Score: 0.53
Test Score: 0.51



Index: 1, Client: <__main__.Client object at 0x7fdafa638310>
Train Sequences Shape: (100, 8)
Train Labels Shape: (100,)
Training Started
21 1.0102926694881251 | 22 1.0002913654229417 | 23 1.0178297448898774 | 24 0.991801131965994 | 

## Visualization of Client Training Scores

In this code, we create visualizations to track the training and testing scores of each client across different epochs.


In [None]:
import matplotlib.pyplot as plt

# For each technique, draw one figure for train scores and one for test scores

for idx, clients in enumerate(clients_2d_array):

  technique_name = list(fl_techniques.keys())[idx]
  # Create a new figure for train scores
  plt.figure(figsize=(8, 6))

  # Plot train scores for all clients
  for client_idx, client in enumerate(clients):
      plt.plot(client.train_scores, label=f'Client {client_idx + 1}')

  plt.xlabel('Epochs')
  plt.ylabel('Train Score')
  plt.title(f"Train Scores for All Clients ({technique_name})")
  plt.legend()

  # Show the train scores plot
  plt.show()

  # Create a new figure for test scores
  plt.figure(figsize=(8, 6))

  # Plot test scores for all clients
  for client_idx, client in enumerate(clients):
      plt.plot(client.test_scores, label=f'Client {client_idx + 1}')

  plt.xlabel('Epochs')
  plt.ylabel('Test Score')
  plt.title(f"Test Scores for All Clients ({technique_name})")
  plt.legend()

  # Show the test scores plot
  plt.show()


## Visualization of Client Test Scores and Global Model Accuracy

In this code, we plot the testing scores of each client together with the global model's accuracy across epochs.


In [None]:
import matplotlib.pyplot as plt


for idx, clients in enumerate(clients_2d_array):

  technique_name = list(fl_techniques.keys())[idx]
  # Create a new figure for test scores
  plt.figure(figsize=(8, 6))


  # Plot test scores for all clients
  for client_idx, client in enumerate(clients):
      plt.plot(client.test_scores, label=f'Client {client_idx + 1}')

  # Plot global model accuracy
  plt.plot(global_model_accuracy[idx], label='Global Model Accuracy', linestyle='--', color='black')

  plt.xlabel('Epochs')
  plt.ylabel('Scores')
  plt.title(f"Test Scores and Global Model Accuracy ({technique_name})")
  plt.legend()

  # Show the combined graph
  plt.show()


In [None]:
# Hold out every sample from index 50000 onward for the final evaluation
np_final_test_data = np_data_set[50000:]

final_test_sequences = [data_point["sequence"] for data_point in np_final_test_data]
final_test_labels = [data_point["label"] for data_point in np_final_test_data]
final_test_sequences = np.array(final_test_sequences)
final_test_labels = np.array(final_test_labels)

def getFinalAccuracy(weights):
  # Rebuild a VQC from the given global weights and score it on the held-out data
  num_features = len(test_sequences[0])
  feature_map = ZZFeatureMap(feature_dimension=num_features, reps=1)
  ansatz = RealAmplitudes(num_qubits=num_features, reps=ansatz_reps)
  # maxiter=0 is meant to keep the supplied weights from being optimized further
  optimizer = COBYLA(maxiter=0)
  vqc = VQC(
      feature_map=feature_map,
      ansatz=ansatz,
      optimizer=optimizer,
      initial_point = weights,
      sampler=BackendSampler(backend=backend)
  )
  # fit() has to run before score() can be called, so fit on a few samples only
  vqc.fit(test_sequences[:25], test_labels[:25])
  return vqc.score(final_test_sequences, final_test_labels)

In [None]:
final_results = []
for idx, row in enumerate(global_model_weights):
  final_results.append([])
  for global_model_weight in row:
    final_results[idx].append(getFinalAccuracy(global_model_weight))
print(final_results)

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Rows of final_results are techniques; columns are epochs
final_results = np.array(final_results)  # Convert the nested list to a NumPy array

plt.figure(figsize=(10, 8))

# Iterate through each row and plot it
for row_index in range(final_results.shape[0]):
    data_row = final_results[row_index]
    technique_name = list(fl_techniques.keys())[row_index]
    plt.plot(data_row, label=f'Global Model Accuracy For: {technique_name}')

plt.xlabel('Epoch')
plt.ylabel('Accuracy on Held-Out Test Data')
plt.title('Global Model Accuracy')
plt.legend()
plt.show()
