<h1>Federated Learning - GTEx_V8 Example</h1>
<h2>Populate remote PyGrid nodes with labeled tensors</h2>
In this notebook, we populate our PyGrid nodes with labeled data so that people interested in training models can use it later.

**NOTE:** At the time of running this notebook, we were running the grid components in background mode.  

Components:
 - PyGrid Network (http://localhost:5000)
 - PyGrid Node h1 (http://localhost:3000)
 - PyGrid Node h2 (http://localhost:3001)
 
The code in this notebook is adapted from the <a href="https://github.com/OpenMined/PySyft/blob/master/examples/tutorials/grid/federated_learning/mnist/Fed.Learning%20MNIST%20%5B%20Part-1%20%5D%20-%20Populate%20a%20Grid%20Network%20(%20Dataset%20).ipynb">Fed.Learning MNIST [ Part-1 ] - Populate a Grid Network ( Dataset )</a> tutorial.

<h2>Import dependencies</h2>

In [1]:
#dependencies for helper functions/classes
import pandas as pd
import pyarrow.parquet as pq
from typing import NamedTuple
from dataclasses import dataclass
import os.path as path
import os
import progressbar
import requests
import numpy as np

#keras for ML
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dropout, Input, Dense
from tensorflow.keras.models import Sequential, load_model, Model
from tensorflow.keras.utils import plot_model, normalize
from tensorflow.keras import regularizers
from tensorflow.keras.optimizers import SGD, Adam, Nadam, Adadelta
from tensorflow.keras.activations import relu, elu, sigmoid

#sklearn for preprocessing the data and train-test split
from sklearn.utils import class_weight
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, LabelEncoder
from sklearn.metrics import (
    r2_score, mean_squared_error, mean_absolute_error,
    accuracy_score, classification_report,
    f1_score, precision_score, recall_score,
)

#for plots
import matplotlib
import matplotlib.pyplot as plt

seed = 7

#%matplotlib inline

In [2]:
import syft as sy

#########<syft==0.2.8>#######################
# # Dynamic FL -->
from syft.grid.clients.data_centric_fl_client import DataCentricFLClient

# #Static FL -->
from syft.grid.clients.model_centric_fl_client import ModelCentricFLClient
#############################################

#########<syft==0.2.6>#######################
# Dynamic FL -->
# from syft.grid.clients.dynamic_fl_client import DynamicFLClient

#Static FL -->
# from syft.grid.clients.static_fl_client import StaticFLClient
#############################################

import torch
import pickle
import time
import numpy as np
import torchvision
from torchvision import datasets, transforms
# import tqdm
# from ipywidgets import IntProgress

In [3]:
@dataclass
class Labels(NamedTuple):
    '''
    One-hot labeled data
    '''
    tissue: np.ndarray
    sex: np.ndarray
    age: np.ndarray
    death: np.ndarray
        

class Genes:
    '''
    Class to load GTEX samples and gene expressions data
    '''
    def __init__(self, samples_path: str = '', expressions_path: str = '', problem_type: str = "classification"):
        self.__set_samples(samples_path)
        self.__set_labels(problem_type)
        if expressions_path != '':
            self.expressions = self.get_expressions(expressions_path)

    def __set_samples(self, sample_path: str) -> pd.DataFrame:
        self.samples: pd.DataFrame = pq.read_table(sample_path).to_pandas()
        self.samples["Death"].fillna(-1.0, inplace = True)
        self.samples: pd.DataFrame = self.samples.set_index("Name")
        self.samples["Sex"].replace([1, 2], ['male', 'female'], inplace=True)
        self.samples["Death"].replace([-1,0,1,2,3,4], ['alive/NA', 'ventilator case', '<10 min.', '<1 hr', '1-24 hr.', '>1 day'], inplace=True)
    
        return self.samples

    def __set_labels(self, problem_type: str = "classification") -> Labels:
        self.labels_list = ["Tissue", "Sex", "Age", "Death"]
        self.labels: pd.DataFrame = self.samples[self.labels_list]
        self.drop_list = self.labels_list + ["Subtissue", "Avg_age"]
        
        if problem_type == "classification":
            dummies_df = pd.get_dummies(self.labels["Age"])
            print(dummies_df.columns.tolist())
            self.Y = dummies_df.values
        
        if problem_type == "regression":
            self.Y = self.samples["Avg_age"].values
        
        return self.Y
    
    def delete_particular_age_examples(self):
        df_series = pd.DataFrame(self.labels["Age"])
        indexes_of_50 = np.where(df_series["Age"] == '50-59')[0].tolist()[300:]
        indexes_of_60 = np.where(df_series["Age"] == '60-69')[0].tolist()[300:]
        indexes_of_20 = np.where(df_series["Age"] == '20-29')[0].tolist()[300:]
        indexes_of_30 = np.where(df_series["Age"] == '30-39')[0].tolist()[300:]
        indexes_of_40 = np.where(df_series["Age"] == '40-49')[0].tolist()[300:]
        indexes_to_delete = indexes_of_50 + indexes_of_60 + indexes_of_20 + indexes_of_30 + indexes_of_40
        
        return indexes_to_delete

    def sex_output(self, model):
        return Dense(units=self.Y.sex.shape[1], activation='softmax', name='sex_output')(model)

    def tissue_output(self, model):
        return Dense(units=self.Y.tissue.shape[1], activation='softmax', name='tissue_output')(model)

    def death_output(self, model):
        return Dense(units=self.Y.death.shape[1], activation='softmax', name='death_output')(model)

    def age_output(self, model):
        '''
        Creates an output layer for the Keras model
        :param model: keras model
        :return: keras Dense layer
        '''
        return Dense(units=self.Y.age.shape[1], activation='softmax', name='age_output')(model)


    def get_expressions(self, expressions_path: str)->pd.DataFrame:
        '''
        load gene expressions DataFrame
        :param expressions_path: path to file with expressions
        :return: pandas dataframe with expression
        '''
        if expressions_path.endswith(".parquet"):
            return pq.read_table(expressions_path).to_pandas().set_index("Name")
        else:
            separator = "," if expressions_path.endswith(".csv") else "\t"
            return pd.read_csv(expressions_path, sep=separator).set_index("Name")

    def prepare_data(self, normalize_expressions: bool = True)-> np.ndarray:
        '''
        :param normalize_expressions: if keras should normalize gene expressions
        :return: X array to be used as input data by keras
        '''
        data = self.samples.join(self.expressions, on = "Name", how="inner")
        ji = data.columns.drop(self.drop_list)
        x = data[ji]
        
        # adding one-hot-encoded tissues and sex
        x = pd.concat([x,pd.get_dummies(data['Tissue'], prefix='tissue'), pd.get_dummies(data['Sex'], prefix='sex')],axis=1)
        x = x.values
        
        return normalize(x, axis=0) if normalize_expressions else x
    
    def get_features_dataframe(self, add_tissues=True):
        data = self.samples.join(self.expressions, on = "Name", how="inner")
        ji = data.columns.drop(self.drop_list)
        df = data[ji]
        if add_tissues:
            df = pd.concat([df,pd.get_dummies(data['Tissue'], prefix='tissue'), pd.get_dummies(data['Sex'], prefix='sex')],axis=1)
        x = df.values
        min_max_scaler = MinMaxScaler()
        x_scaled = min_max_scaler.fit_transform(x)
        df_normalized = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)
        return df_normalized
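
`get_features_dataframe` rescales every feature column to the [0, 1] range with `MinMaxScaler`. As a rough illustration of the arithmetic involved, the sketch below applies the column-wise formula (x - min) / (max - min) with plain NumPy on a made-up 3x2 matrix. The helper name `min_max_scale` and the toy values are assumptions for this example, and constant columns are mapped to 0 here to avoid a division by zero.

```python
import numpy as np

def min_max_scale(x: np.ndarray) -> np.ndarray:
    """Scale each column of x to [0, 1] (hypothetical helper, for illustration)."""
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    # Avoid division by zero for constant columns: they end up as all zeros.
    col_range = np.where(col_range == 0, 1.0, col_range)
    return (x - col_min) / col_range

# Toy matrix: the second column is constant.
toy = np.array([[1.0, 10.0],
                [2.0, 10.0],
                [5.0, 10.0]])
scaled = min_max_scale(toy)
print(scaled)
```

Because the minimum and maximum are taken per column, a gene with a large dynamic range and a gene with a small one both end up on the same scale.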


### Loading data

In [4]:
from pathlib import Path

# resolving base folder
wd = Path(".").resolve()
base = wd if (wd / "data").exists() else Path("..")

In [5]:
data = base / "data"
samples_path = str(data / 'gtex' / 'v8_samples.parquet')
expressions_path = str(data / 'gtex' / 'v8_expressions.parquet')

# samples_path = '../data/gtex/v8_samples.parquet'
# expressions_path = '../data/gtex/v8_expressions.parquet'

In [6]:
def Huber(yHat, y, delta=1.):
    return np.where(np.abs(y-yHat) < delta,.5*(y-yHat)**2 , delta*(np.abs(y-yHat)-0.5*delta))

def transform_to_probas(age_intervals):
    class_names = ['20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
    res = []
    for a in age_intervals:
        non_zero_index = class_names.index(a)
        res.append([0 if i != non_zero_index else 1 for i in range(len(class_names))])
    return np.array(res)
    
def transform_to_interval(age_probas):
    class_names = ['20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
    return np.array(list(map(lambda p: class_names[np.argmax(p)], age_probas)))        
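
The two helpers `transform_to_probas` and `transform_to_interval` are inverses of each other on one-hot rows. The sketch below shows the round trip on three made-up ages; `to_one_hot` and `to_interval` are simplified stand-ins written for this example, not the functions defined above.

```python
import numpy as np

CLASS_NAMES = ['20-29', '30-39', '40-49', '50-59', '60-69', '70-79']

def to_one_hot(age_intervals):
    # Row i has a single 1 in the column of its age interval.
    idx = [CLASS_NAMES.index(a) for a in age_intervals]
    return np.eye(len(CLASS_NAMES), dtype=int)[idx]

def to_interval(age_probas):
    # argmax picks the most probable class for each row.
    return np.array([CLASS_NAMES[i] for i in np.argmax(age_probas, axis=1)])

ages = ['30-39', '70-79', '20-29']
one_hot = to_one_hot(ages)
recovered = to_interval(one_hot)
print(one_hot)
print(recovered)

# The decoder also works on soft predictions, not only on one-hot rows.
soft = np.array([[0.1, 0.2, 0.05, 0.5, 0.1, 0.05]])
soft_label = to_interval(soft)
print(soft_label)
```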

### Preprocessing for Classification model

In [7]:
genes = Genes(samples_path, expressions_path, problem_type="classification")
X = genes.get_features_dataframe().values
Y = genes.Y

['20-29', '30-39', '40-49', '50-59', '60-69', '70-79']


In [8]:
X.shape, Y.shape

((17382, 18420), (17382, 6))

In [9]:
a = transform_to_interval(Y)
unique, counts = np.unique(a, return_counts=True)
dict(zip(unique, counts))

{'20-29': 1320,
 '30-39': 1323,
 '40-49': 2702,
 '50-59': 5615,
 '60-69': 5821,
 '70-79': 601}

In [10]:
b = [np.where(r==1)[0][0] for r in Y]
unique, counts = np.unique(b, return_counts=True)
dict(zip(unique, counts))

{0: 1320, 1: 1323, 2: 2702, 3: 5615, 4: 5821, 5: 601}

In [11]:
df_pie = pd.DataFrame(columns = ['age_group','label'])
df_pie['age_group'] = a
df_pie['label'] = b

# Balancing the data below

In [12]:
import numpy as np
def balanced_sample_maker(X, y, sample_size, random_seed=None):
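    """Return a balanced data set by drawing sample_size rows from every class.

    Rows are drawn with replacement, so classes with fewer than sample_size
    rows are oversampled and duplicate rows are possible in any class.

    Parameters:
    ===========
    X: {numpy.ndarray}
    y: {numpy.ndarray}
    sample_size: number of rows to draw per class
    random_seed: optional seed for numpy's random generator
    """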
    uniq_levels = np.unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}

    if random_seed is not None:
        np.random.seed(random_seed)

    # find observation index of each class levels
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx
    # oversampling on observations of each label
    balanced_copy_idx = []
    for gb_level, gb_idx in groupby_levels.items():
        over_sample_idx = np.random.choice(gb_idx, size=sample_size, replace=True).tolist()
        balanced_copy_idx+=over_sample_idx
    np.random.shuffle(balanced_copy_idx)

    return (X[balanced_copy_idx, :], y[balanced_copy_idx], balanced_copy_idx)
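
The core of `balanced_sample_maker` is drawing the same number of row indices from every class with replacement. The toy example below sketches that step on a made-up, imbalanced label vector. The seed, the labels, and the variable names are assumptions for illustration, and it uses NumPy's `default_rng` generator rather than the global `np.random` state used above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for the example

# Toy labels: class 'a' is common, class 'b' is rare.
y = np.array(['a'] * 8 + ['b'] * 2)
X = np.arange(len(y)).reshape(-1, 1)  # one feature: the original row index

sample_size = 5
picked = []
for level in np.unique(y):
    level_idx = np.flatnonzero(y == level)
    # replace=True lets a class with fewer than sample_size rows be oversampled.
    picked.extend(rng.choice(level_idx, size=sample_size, replace=True))
picked = np.array(picked)
rng.shuffle(picked)

X_bal, y_bal = X[picked], y[picked]
counts = dict(zip(*np.unique(y_bal, return_counts=True)))
distinct_b = len(set(picked[y[picked] == 'b'].tolist()))
print(counts)
print("distinct rows used for class 'b':", distinct_b)
```

Since sampling is done with replacement, the same row can appear more than once even in a class that has more than `sample_size` rows.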

In [13]:
no_samples_to_take = 200
res = balanced_sample_maker(X,np.asarray(df_pie['age_group'].values), no_samples_to_take)

In [14]:
unique, counts = np.unique(res[1], return_counts=True)
dict(zip(unique, counts))

{'20-29': 200,
 '30-39': 200,
 '40-49': 200,
 '50-59': 200,
 '60-69': 200,
 '70-79': 200}

# Label encoding -->

In [15]:
le = LabelEncoder()
le.fit(res[1])
y = le.transform(res[1])
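
`LabelEncoder` maps each distinct label to an integer code based on the sorted order of the labels. The sketch below reproduces that mapping for a few made-up age intervals with `np.unique`; it is a NumPy stand-in used for illustration, not a call into scikit-learn.

```python
import numpy as np

ages = np.array(['50-59', '20-29', '70-79', '20-29', '40-49'])

# classes: sorted unique labels; codes: position of each label in classes.
classes, codes = np.unique(ages, return_inverse=True)
print(classes)
print(codes)

# Decoding is an index lookup back into classes.
decoded = classes[codes]
print(decoded)
```

Because the age intervals sort lexicographically in the same order as the ages they describe, the integer codes keep the natural ordering of the age groups.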

# Splitting data into different shards --> 

In [16]:
from sklearn import model_selection
X_1, X_2, y_1, y_2 = model_selection.train_test_split(res[0], y, test_size=0.5, random_state=seed, stratify=res[1])
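
Passing `stratify` to `train_test_split` keeps the class proportions the same in both shards. A minimal sketch of that idea, assuming a 50/50 split and made-up labels, splits each class separately and sends half of its rows to each shard; the seed and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed for the example

# Toy encoded labels: 3 classes with 4 rows each.
y = np.repeat([0, 1, 2], 4)

shard_1, shard_2 = [], []
for level in np.unique(y):
    idx = rng.permutation(np.flatnonzero(y == level))
    half = len(idx) // 2
    shard_1.extend(idx[:half].tolist())  # first half of this class goes to shard 1
    shard_2.extend(idx[half:].tolist())  # the rest goes to shard 2

counts_1 = np.bincount(y[shard_1])
counts_2 = np.bincount(y[shard_2])
print(counts_1, counts_2)
```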

In [17]:
_dtype = np.float32
y_1 = np.vstack(y_1).astype(np.uint8)
y_2 = np.vstack(y_2).astype(np.uint8)
X_1.dtype, y_1.dtype,X_2.dtype, y_2.dtype

(dtype('float32'), dtype('uint8'), dtype('float32'), dtype('uint8'))

In [18]:
unique, counts = np.unique(y_1, return_counts=True)
print(dict(zip(unique, counts)))
unique, counts = np.unique(y_2, return_counts=True)
print(dict(zip(unique, counts)))

{0: 100, 1: 100, 2: 100, 3: 100, 4: 100, 5: 100}
{0: 100, 1: 100, 2: 100, 3: 100, 4: 100, 5: 100}


# Flatten the label arrays

In [19]:
Y[0]
y_1 = y_1.reshape(y_1.shape[0])
y_2 = y_2.reshape(y_2.shape[0])
y_2.shape

(600,)

In [20]:
sy.version.__version__

'0.2.8'

<h2>Setup config</h2>
Initialize the hook and connect to the grid nodes.

In [21]:
hook = sy.TorchHook(torch)

# Connect directly to grid nodes
nodes = ["ws://0.0.0.0:3000/",
         "ws://0.0.0.0:3001/"]

compute_nodes = []
for node in nodes:
    # For syft 0.2.8 --> replace DynamicFLClient with DataCentricFLClient
    compute_nodes.append( DataCentricFLClient(hook, node) )

In [22]:
compute_nodes

[<Federated Worker id:h1>, <Federated Worker id:h2>]

## 1 - Conversion to Tensor

The code below will convert GTEx data samples to tensors.

In [23]:
# DATA_PATH = '../data/balanced/numpy_files/'
# shared_x1 = np.load(DATA_PATH + 'shared_x1.npy') # First chunk of dataset 
# shared_x2 = np.load(DATA_PATH + 'shared_x2.npy') # Second chunk of dataset 

# shared_y1 = np.load(DATA_PATH + 'shared_y1.npy') # First chunk of labels 
# shared_y2 = np.load(DATA_PATH + 'shared_y2.npy') # Second chunk of labels 

shared_x1, shared_x2, shared_y1, shared_y2 = X_1, X_2, y_1, y_2

# Convert numpy arrays to torch tensors with the dtypes used for training -->
# (.float()/.long() avoid the copy-construct warning raised by torch.tensor(tensor))
shared_x1 = torch.from_numpy(shared_x1).float()
shared_x2 = torch.from_numpy(shared_x2).float()
shared_y1 = torch.from_numpy(shared_y1).long()
shared_y2 = torch.from_numpy(shared_y2).long()

datasets  = [shared_x1, shared_x2]
labels = [shared_y1, shared_y2]



# Centralized baseline: training on the full data --->

In [24]:
# Concatenate 
X = torch.cat((shared_x1, shared_x2), dim=0)
Y = torch.cat((shared_y1, shared_y2), dim=0)

from torch import nn, optim
import torch.nn.functional as F

# Network architecture: three hidden layers and a six-class output
class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(18420, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, 6)
        
    def forward(self, x):
        # make sure input tensor is flattened
        x = x.view(x.shape[0], -1)
        
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
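        # Note: F.cross_entropy (used in the training loop) applies log-softmax
        # itself, so the loss sees a softmax of a softmax here; returning the
        # raw fc4 outputs is the usual choice.
        x = F.softmax(self.fc4(x), dim=1)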
        
        return x
    
# Create the network, define the criterion and optimizer
model = Classifier()
# criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)
    
epochs = 20

for e in range(epochs):

    epoch_loss = 0.0
    epoch_acc = 0.0
    optimizer.zero_grad()
    
    pred = model(X)
    loss = F.cross_entropy(pred, Y)

    loss.backward()
    optimizer.step()
    
    # statistics
    #prob = F.softmax(pred, dim=1)
    top1 = torch.argmax(pred, dim=1)
    ncorrect = torch.sum(top1 == Y)

    epoch_loss += loss.item()
    
    epoch_acc += ncorrect.item()

    epoch_loss /= Y.shape[0]
    epoch_acc /= Y.shape[0]

    print(f"Epoch: {e}",f"Training loss: {epoch_loss}", f" | Training Accuracy: {epoch_acc}")

Epoch: 0 Training loss: 0.0014931774139404297  | Training Accuracy: 0.15666666666666668
Epoch: 1 Training loss: 0.0014906399448712667  | Training Accuracy: 0.16666666666666666
Epoch: 2 Training loss: 0.0014876883228619893  | Training Accuracy: 0.17
Epoch: 3 Training loss: 0.001481928825378418  | Training Accuracy: 0.22166666666666668
Epoch: 4 Training loss: 0.0014999624093373616  | Training Accuracy: 0.1975
Epoch: 5 Training loss: 0.0015007805824279784  | Training Accuracy: 0.18
Epoch: 6 Training loss: 0.0014997453490893046  | Training Accuracy: 0.1875
Epoch: 7 Training loss: 0.0014792474110921223  | Training Accuracy: 0.22333333333333333
Epoch: 8 Training loss: 0.0014908491571744282  | Training Accuracy: 0.1875
Epoch: 9 Training loss: 0.0014751755197842916  | Training Accuracy: 0.21833333333333332
Epoch: 10 Training loss: 0.0014737178881963095  | Training Accuracy: 0.2275
Epoch: 11 Training loss: 0.0014729687571525573  | Training Accuracy: 0.24
Epoch: 12 Training loss: 0.0014643908540
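
Two details help when reading the log above. The printed loss is the batch-mean cross-entropy divided once more by the number of samples (1200), so the first value, about 0.001493, corresponds to a mean loss of about 1.79, which is ln 6, the loss of a uniform guess over six classes. In addition, `F.cross_entropy` applies a log-softmax itself, so feeding it outputs that have already passed through `F.softmax` bounds the loss from below. The sketch uses only the standard library to check both numbers; `cross_entropy_from_scores` is a hypothetical helper written for this example.

```python
import math

def cross_entropy_from_scores(scores, target):
    """Cross-entropy of softmax(scores) against the class index target."""
    log_norm = math.log(sum(math.exp(s) for s in scores))
    return log_norm - scores[target]

n_samples = 1200  # 600 rows per shard, two shards
logged_first_epoch = 0.0014931774139404297  # first value in the log above
mean_loss = logged_first_epoch * n_samples
print(mean_loss, math.log(6))

# Best case when the scores are already probabilities: a perfect one-hot row.
perfect_probs = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
floor = cross_entropy_from_scores(perfect_probs, target=0)
print(floor)
```

With probabilities as inputs, the loss can only move between roughly 1.04 and ln 6, which leaves a small gradient signal compared with training on raw scores.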

<h2>2 - Tagging tensors</h2>
The code below adds tags (of your choice) and a description to each tensor before it is sent to the grid nodes. The tags matter because the network uses them to find this data later.

In [25]:
tag_input = []
tag_label = []

for i in range(len(compute_nodes)):
    tag_input.append(datasets[i].tag("#X", "#gtex_v8", "#dataset","#balanced").describe("The input datapoints to the GTEx_V8 dataset."))
    tag_label.append(labels[i].tag("#Y", "#gtex_v8", "#dataset","#balanced").describe("The input labels to the GTEx_V8 dataset."))
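
To see why consistent tags matter for later retrieval, the toy registry below models a search as "return every entry whose tag set contains all queried tags". It is a plain-Python illustration under that assumption, not PyGrid's implementation; `Entry`, `search`, and the registry contents are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    node: str            # id of the node holding the tensor
    tags: frozenset      # tags attached before sending
    description: str

ALL = {"#gtex_v8", "#dataset", "#balanced"}
registry = [
    Entry("h1", frozenset(ALL | {"#X"}), "inputs"),
    Entry("h1", frozenset(ALL | {"#Y"}), "labels"),
    Entry("h2", frozenset(ALL | {"#X"}), "inputs"),
    Entry("h2", frozenset(ALL | {"#Y"}), "labels"),
]

def search(entries, *query_tags):
    """Return entries that carry every queried tag (assumed matching rule)."""
    wanted = set(query_tags)
    return [e for e in entries if wanted <= e.tags]

inputs = search(registry, "#X", "#gtex_v8")
print([(e.node, e.description) for e in inputs])
```

Under this model, a typo in a single tag on one node would make that node's shard invisible to a query, which is why both shards are tagged in the same loop.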

<h2>3 - Sending our tensors to grid nodes</h2>

In [26]:
shared_x1 = tag_input[0].send(compute_nodes[0]) # First chunk of dataset to h1
shared_x2 = tag_input[1].send(compute_nodes[1]) # Second chunk of dataset to h2

shared_y1 = tag_label[0].send(compute_nodes[0]) # First chunk of labels to h1
shared_y2 = tag_label[1].send(compute_nodes[1]) # Second chunk of labels to h2

In [27]:
print("X tensor pointers: ", shared_x1, shared_x2)
print("Y tensor pointers: ", shared_y1, shared_y2)

X tensor pointers:  (Wrapper)>[PointerTensor | me:28619960005 -> h1:40335857618]
	Tags: #gtex_v8 #X #balanced #dataset 
	Shape: torch.Size([600, 18420])
	Description: The input datapoints to the GTEx_V8 dataset.... (Wrapper)>[PointerTensor | me:64079951987 -> h2:67976316706]
	Tags: #gtex_v8 #X #balanced #dataset 
	Shape: torch.Size([600, 18420])
	Description: The input datapoints to the GTEx_V8 dataset....
Y tensor pointers:  (Wrapper)>[PointerTensor | me:83086150454 -> h1:76247224541]
	Tags: #Y #gtex_v8 #balanced #dataset 
	Shape: torch.Size([600])
	Description: The input labels to the GTEx_V8 dataset.... (Wrapper)>[PointerTensor | me:44335625920 -> h2:69428496864]
	Tags: #Y #gtex_v8 #balanced #dataset 
	Shape: torch.Size([600])
	Description: The input labels to the GTEx_V8 dataset....


<h2>Disconnect nodes</h2>

In [28]:
for i in range(len(compute_nodes)):
    compute_nodes[i].close()

### Go to the following address to search available tags:
http://0.0.0.0:5000/search-available-tags