<h1>Federated Learning - GTEx_V8 Example</h1>
<h2>Populate remote PyGrid nodes with labeled tensors </h2>
In this notebook, we will populate our PyGrid nodes with labeled data so that it will be used later by people interested in train models.

**NOTE:** At the time of running this notebook, we were running the grid components in background mode.  

Components:
 - PyGrid Network (http://localhost:5000)
 - PyGrid Node h1 (http://localhost:3000)
 - PyGrid Node h2 (http://localhost:3001)
 
Code implementation for this notebook has been referred from <a href="https://github.com/OpenMined/PySyft/blob/master/examples/tutorials/grid/federated_learning/mnist/Fed.Learning%20MNIST%20%5B%20Part-1%20%5D%20-%20Populate%20a%20Grid%20Network%20(%20Dataset%20).ipynb">Fed.Learning MNIST [ Part-1 ] - Populate a Grid Network ( Dataset )</a> tutorial

<h2>Import dependencies</h2>

In [None]:
#dependencies for helper functions/classes
import pandas as pd
import pyarrow.parquet as pq
from typing import NamedTuple
from dataclasses import *
import os.path as path
import os
import progressbar
import requests
import numpy as np

#keras for ML
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dropout, Input, Dense
from tensorflow.keras.models import Sequential, load_model, Model
from tensorflow.keras.utils import plot_model, normalize
from tensorflow.keras import regularizers
from tensorflow.keras.optimizers import SGD, Adam, Nadam, Adadelta
from tensorflow.keras.activations import relu, elu, sigmoid

#sklearn for preprocessing the data and train-test split
from sklearn.utils import class_weight
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, LabelEncoder
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, accuracy_score, classification_report
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, r2_score, mean_squared_error, mean_absolute_error

#for plots
import matplotlib
import matplotlib.pyplot as plt

### # Parameter cell -->

In [2]:
seed = 7
dataset_size = 1800
n_classes = 6
no_samples_to_take = dataset_size // n_classes

# Connect directly to grid nodes
nodes = ["ws://0.0.0.0:3000/",
         "ws://0.0.0.0:3001/"]

In [3]:
import syft as sy

#########<syft==0.2.8>#######################
# # Dynamic FL -->
from syft.grid.clients.data_centric_fl_client import DataCentricFLClient

# #Static FL -->
from syft.grid.clients.model_centric_fl_client import ModelCentricFLClient
#############################################

import torch
import pickle
import time
import numpy as np
import torchvision
from torchvision import datasets, transforms

In [4]:
from pathlib import Path
import sys
if (Path("..") / "src").resolve().exists():
  sys.path.insert(0, Path("..").as_posix())

In [5]:
# Importing from src folder
from src.data_splitter import Genes, Labels
from src.data_splitter import ClientGenerator 

### Loading data

In [6]:
# resolving base folder
wd = Path(".").resolve()
base = wd if (wd / "data").exists() else Path("..")

In [7]:
data = base / "data"
samples_path = str(data / 'gtex' / 'v8_samples.parquet')
expressions_path = str(data / 'gtex' / 'v8_expressions.parquet')

In [8]:
def Huber(yHat, y, delta=1.):
    return np.where(np.abs(y-yHat) < delta,.5*(y-yHat)**2 , delta*(np.abs(y-yHat)-0.5*delta))

def transform_to_probas(age_intervals):
    class_names = ['20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
    res = []
    for a in age_intervals:
        non_zero_index = class_names.index(a)
        res.append([0 if i != non_zero_index else 1 for i in range(len(class_names))])
    return np.array(res)
    
def transform_to_interval(age_probas):
    class_names = ['20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
    return np.array(list(map(lambda p: class_names[np.argmax(p)], age_probas)))        

### Preprocessing for Classification model

In [9]:
genes = Genes(samples_path, expressions_path, problem_type="classification")
X = genes.get_features_dataframe().values
Y = genes.Y

In [10]:
X.shape, Y.shape

((17382, 18420), (17382, 6))

In [11]:
a = transform_to_interval(Y)
unique, counts = np.unique(a, return_counts=True)
dict(zip(unique, counts))

{'20-29': 1320,
 '30-39': 1323,
 '40-49': 2702,
 '50-59': 5615,
 '60-69': 5821,
 '70-79': 601}

In [12]:
b = [np.where(r==1)[0][0] for r in Y]
unique, counts = np.unique(b, return_counts=True)
dict(zip(unique, counts))

{0: 1320, 1: 1323, 2: 2702, 3: 5615, 4: 5821, 5: 601}

In [13]:
df_pie = pd.DataFrame(columns = ['age_group','label'])
df_pie['age_group'] = a
df_pie['label'] = b

# Balancing the data below

In [14]:
res = ClientGenerator().balanced_sample_maker(X,np.asarray(df_pie['age_group'].values), no_samples_to_take)

In [15]:
unique, counts = np.unique(res[1], return_counts=True)
dict(zip(unique, counts))

{'20-29': 300,
 '30-39': 300,
 '40-49': 300,
 '50-59': 300,
 '60-69': 300,
 '70-79': 300}

# Label encoding -->

In [16]:
le = LabelEncoder()
le.fit(res[1])
y = le.transform(res[1])

# Splitting data into different shards --> 

In [17]:
from sklearn import model_selection
X_1, X_2, y_1, y_2 = model_selection.train_test_split(res[0], y, test_size=0.5, random_state=seed, stratify=res[1])

In [18]:
_dtype = np.float32
y_1 = np.vstack(y_1).astype(np.uint8)
y_2 = np.vstack(y_2).astype(np.uint8)
X_1.dtype, y_1.dtype,X_2.dtype, y_2.dtype

(dtype('float32'), dtype('uint8'), dtype('float32'), dtype('uint8'))

In [19]:
unique, counts = np.unique(y_1, return_counts=True)
print(dict(zip(unique, counts)))
unique, counts = np.unique(y_2, return_counts=True)
print(dict(zip(unique, counts)))

{0: 150, 1: 150, 2: 150, 3: 150, 4: 150, 5: 150}
{0: 150, 1: 150, 2: 150, 3: 150, 4: 150, 5: 150}


# One-hot encode the labels

In [20]:
Y[0]
y_1 = y_1.reshape(y_1.shape[0])
y_2 = y_2.reshape(y_2.shape[0])
y_2.shape

(900,)

In [21]:
sy.version.__version__

'0.2.8'

<h2>Setup config</h2>
Init hook, connect with grid nodes, etc...

In [22]:
hook = sy.TorchHook(torch)

compute_nodes = []
for node in nodes:
    # For syft 0.2.8 --> replace DynamicFLClient with DataCentricFLClient
    compute_nodes.append( DataCentricFLClient(hook, node) )

In [23]:
compute_nodes

[<Federated Worker id:h1>, <Federated Worker id:h2>]

## 1 - Conversion to Tensor

The code below will convert GTEx data samples to tensors.

In [24]:
shared_x1, shared_x2, shared_y1, shared_y2 = X_1, X_2, y_1, y_2

# Convert numpy array to torch tensors -->
shared_x1 = torch.from_numpy(shared_x1)
shared_x2 = torch.from_numpy(shared_x2)
shared_y1 = torch.from_numpy(shared_y1)
shared_y2 = torch.from_numpy(shared_y2)

shared_x1 = torch.tensor(shared_x1, dtype=torch.float32)
shared_x2 = torch.tensor(shared_x2, dtype=torch.float32)
shared_y1 = torch.tensor(shared_y1, dtype=torch.int64)
shared_y2 = torch.tensor(shared_y2, dtype=torch.int64)

datasets  = [shared_x1, shared_x2]
labels = [shared_y1, shared_y2]

  current_tensor = hook_self.torch.native_tensor(*args, **kwargs)


# Below using centralized way (full data) --->

In [29]:
# Concatenate 
X = torch.cat((shared_x1, shared_x2), dim=0)
Y = torch.cat((shared_y1, shared_y2), dim=0)

from torch import nn, optim
import torch.nn.functional as F

# TODO: Define your network architecture here
class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(18420, 512)
        self.fc2 = nn.Linear(512, 64)
        self.fc3 = nn.Linear(64, 6)
        
    def forward(self, x):
        # make sure input tensor is flattened
        x = x.view(x.shape[0], -1)
        
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.softmax(self.fc3(x), dim=1)
        
        return x
    
# Create the network, define the criterion and optimizer
model = Classifier()
# criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0001)
    
epochs = 50

for e in range(epochs):

    epoch_loss = 0.0
    epoch_acc = 0.0
    optimizer.zero_grad()
    
    pred = model(X)
    loss = F.cross_entropy(pred, Y)

    loss.backward()
    optimizer.step()
    
    # statistics
    #prob = F.softmax(pred, dim=1)
    top1 = torch.argmax(pred, dim=1)
    ncorrect = torch.sum(top1 == Y)

    epoch_loss += loss.item()
    
    epoch_acc += ncorrect.item()

    epoch_loss /= Y.shape[0]
    epoch_acc /= Y.shape[0]

    print(f"Epoch: {e}",f"Training loss: {epoch_loss}", f" | Training Accuracy: {epoch_acc}")

Epoch: 0 Training loss: 0.00010541586023679214  | Training Accuracy: 0.16666666666666666
Epoch: 1 Training loss: 0.0001053553438843075  | Training Accuracy: 0.16666666666666666
Epoch: 2 Training loss: 0.00010528027527415622  | Training Accuracy: 0.17084362866219555
Epoch: 3 Training loss: 0.00010519656648017586  | Training Accuracy: 0.18696317213789856
Epoch: 4 Training loss: 0.00010510910565831519  | Training Accuracy: 0.19702317919755266
Epoch: 5 Training loss: 0.00010502268278061579  | Training Accuracy: 0.20390634192257912
Epoch: 6 Training loss: 0.00010493484324939897  | Training Accuracy: 0.2156724320508295
Epoch: 7 Training loss: 0.00010484467535695549  | Training Accuracy: 0.22679138722202613
Epoch: 8 Training loss: 0.00010475039776668421  | Training Accuracy: 0.23367454994705258
Epoch: 9 Training loss: 0.00010465371467068778  | Training Accuracy: 0.23432168490410638
Epoch: 10 Training loss: 0.00010455445074056061  | Training Accuracy: 0.23755735968937522
Epoch: 11 Training los

<h2>2 - Tagging tensors</h2>
The code below will add a tag (of your choice) to the data that will be sent to grid nodes. This tag is important as the network will need it to retrieve this data later.

In [25]:
tag_input = []
tag_label = []

for i in range(len(compute_nodes)):
    tag_input.append(datasets[i].tag("#X", "#gtex_v8", "#dataset","#balanced").describe("The input datapoints to the GTEx_V8 dataset."))
    tag_label.append(labels[i].tag("#Y", "#gtex_v8", "#dataset","#balanced").describe("The input labels to the GTEx_V8 dataset."))

<h2> 3 - Sending our tensors to grid nodes</h2>

In [26]:
shared_x1 = tag_input[0].send(compute_nodes[0]) # First chunk of dataset to h1
shared_x2 = tag_input[1].send(compute_nodes[1]) # Second chunk of dataset to h2

shared_y1 = tag_label[0].send(compute_nodes[0]) # First chunk of labels to h1
shared_y2 = tag_label[1].send(compute_nodes[1]) # Second chunk of labels to h2

In [27]:
print("X tensor pointers: ", shared_x1, shared_x2)
print("Y tensor pointers: ", shared_y1, shared_y2)

X tensor pointers:  (Wrapper)>[PointerTensor | me:70361333315 -> h1:35881159115]
	Tags: #balanced #dataset #gtex_v8 #X 
	Shape: torch.Size([900, 18420])
	Description: The input datapoints to the GTEx_V8 dataset.... (Wrapper)>[PointerTensor | me:49684275468 -> h2:52723085225]
	Tags: #balanced #dataset #gtex_v8 #X 
	Shape: torch.Size([900, 18420])
	Description: The input datapoints to the GTEx_V8 dataset....
Y tensor pointers:  (Wrapper)>[PointerTensor | me:49137384384 -> h1:4180671457]
	Tags: #balanced #Y #gtex_v8 #dataset 
	Shape: torch.Size([900])
	Description: The input labels to the GTEx_V8 dataset.... (Wrapper)>[PointerTensor | me:10561419581 -> h2:1629165329]
	Tags: #balanced #Y #gtex_v8 #dataset 
	Shape: torch.Size([900])
	Description: The input labels to the GTEx_V8 dataset....


<h2>Disconnect nodes</h2>

In [28]:
for i in range(len(compute_nodes)):
    compute_nodes[i].close()

### Go to the following address to search available tags:
http://0.0.0.0:5000/search-available-tags