<a name="top"></a>
<br/>
# Using `medGAN` to try to perform data augmentation on the MIMIC-III dataset with binary values

Author: [Sylvain Combettes](https://github.com/sylvaincom). <br/>
Last update: Sep 3, 2019. Creation: Aug 29, 2019. <br/>
My own medGAN repository (that is based on Edward Choi's work): [medgan](https://github.com/sylvaincom/medgan-tips). <br/>
Edward Choi's original repository: [medgan](https://github.com/mp2893/medgan).

Before reading this notebook, make sure that you have read my [medGAN repository](https://github.com/sylvaincom/medgan-tips)'s table of contents.

> **Using `medGAN` to try to perform data augmentation** <br/> <br/>
With `medGAN`, we want to generate realistic (fake) patient data that can then enrich the initial training dataset.
For example, suppose our training dataset $A$ is not large enough (say 500 samples with 50 features), and we use `medGAN` to generate a new dataset $B$ of 1000 fake samples (also with 50 features). By adding $B$ to $A$, we get a new training dataset $C$ with 1500 patients. The hope is that models trained on $C$ make better predictions than models trained on $A$. <br/> <br/>
I asked Edward Choi what he thought about using GANs for data augmentation. Trying to generate realistic fake patients with `medGAN` from a dataset of 500 samples with 250 variables seems suboptimal: there seem to be too many variables and not enough samples. There is no definite number of variables to remove: it depends on the variance of each variable and on the correlations between variables. For example, if there is a variable named "gender" and all 500 samples are men (hence zero variance), then it would be very easy for `medGAN` to replicate that variable (by assigning "male" to every generated sample).
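
To make the sizes in this example concrete, here is a minimal sketch in which random binary placeholders stand in for $A$ and $B$. The names `A`, `B`, `C` and the random data are illustrative only; no `medGAN` output is involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for the real training set A: 500 samples, 50 binary features.
A = rng.integers(0, 2, size=(500, 50))
# Placeholder for the generated set B: 1000 fake samples, same 50 features.
B = rng.integers(0, 2, size=(1000, 50))

# Augmented training set C: the rows of A followed by the rows of B.
C = np.vstack([A, B])
print(C.shape)  # (1500, 50)
```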

We will use the MIMIC-III dataset and process it so that we only have binary values.

From now on, "input" and "output" refer to the input and output of `medgan.py` unless specified otherwise: "input" is the original real-life dataset and "output" is the (fake) generated dataset.

Warning: the computing time is _very_ long.

---
### Table of contents

- [Loading the data](#load)
- [Predicting the column of index `10` (called `target`) with (only) the original real-life dataset](#pred1)
- [Predicting the column of index `10` (called `target`) with (only) the (fake) generated dataset](#pred2)
- [Predicting the column of index `10` (called `target`) with data augmentation](#pred3)

---
### Imports

In [None]:
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
from time import process_time
import datetime

from sklearn import datasets, model_selection, linear_model, neighbors, neural_network, naive_bayes
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

---
# Loading the data <a name="load"></a>

## Loading the real-life original dataset

We refer to the real-life original dataset as `df_input`.

In [None]:
with open('training-data.matrix', 'rb') as f:  # the file is closed once the data is loaded
    input_data_array = pickle.load(f)
df_input = pd.DataFrame(input_data_array)
print('The shape of the real-life original dataset is :', df_input.shape)
df_input.head(5)

## Loading the (fake) generated dataset

We refer to the (fake) generated dataset as `df_output`.

In [None]:
output = np.load('gen-samples.npy')
df_output = pd.DataFrame(output).round(0)
print('The shape of the (fake) generated dataset is :', df_output.shape)
df_output.head(5)
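
Rounding is what turns the generator's output into binary values. As a minimal sketch, assuming the generated values lie in $[0, 1]$ (the small array below is made up for illustration), `.round(0)` behaves like a threshold at 0.5. A value of exactly 0.5 goes to 0 because NumPy rounds halves to the nearest even number.

```python
import numpy as np
import pandas as pd

# Made-up generator outputs, assumed to lie in [0, 1].
raw = pd.DataFrame([[0.02, 0.49, 0.50], [0.51, 0.97, 1.00]])

rounded = raw.round(0)                    # same operation as in the cell above
thresholded = (raw > 0.5).astype(float)   # explicit threshold at 0.5

print(np.array_equal(rounded.values, thresholded.values))
```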

## Choosing the feature we are going to try to predict

Which feature are we going to try to predict? We want one that does not contain only zeros (so that it is harder to predict).

In [None]:
plt.plot(df_input.sum()/df_input.shape[0], 'o')
plt.xlabel('Index of feature')
plt.ylabel('Proportion of 1s')
plt.show()

In [None]:
target = 10
print('Approx. proportion of 1s of target :', round(df_input[target].sum()/df_input.shape[0], 4))
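
One possible way to shortlist candidate targets is to keep the columns whose proportion of 1s is far from both 0 and 1. The sketch below uses a random placeholder dataset and arbitrary bounds of 0.2 and 0.8. Both are assumptions for illustration, not values taken from MIMIC-III.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Placeholder binary dataset: column j contains 1s with probability p[j].
p = np.array([0.0, 0.01, 0.3, 0.5, 0.95])
df_demo = pd.DataFrame((rng.random((1000, p.size)) < p).astype(int))

# For a binary column, the mean is the proportion of 1s.
prevalence = df_demo.mean()

# Keep columns that are neither almost always 0 nor almost always 1.
candidates = prevalence[(prevalence > 0.2) & (prevalence < 0.8)].index.tolist()
print(candidates)
```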

---
# Predicting the column of index `10` (called `target`) with (only) the original real-life dataset <a name="pred1"></a>

## Preparing the data

In [None]:
df = df_input
X_dataset = df.loc[:, df.columns != target].values
y_dataset = np.ravel(df.loc[:, df.columns == target].values)
print(X_dataset.shape, y_dataset.shape)

## Benchmarking some models according to their score

In [None]:
def score_and_time(model, X_dataset, y_dataset, cv):
    """
    When there are no hyper-parameters.
    This function returns a list with the scores and processing times of model.
    The scores are calculated with cross_val_score (with K-Fold equal to cv).
    """
    t_start = process_time()
    scores = model_selection.cross_val_score(model, X_dataset, y_dataset, cv=cv)
    t_stop = process_time()
    part_l = [round(scores.mean(), 3), round(scores.std()*2, 3), str(datetime.timedelta(seconds=t_stop-t_start))]
    return part_l

def score_and_time_hyp(model, parameters, X_dataset, y_dataset, cv):
    """
    When there are hyper-parameters.
    This function returns a list with the scores and processing times of model.
    The scores are calculated with RandomizedSearchCV (with K-Fold equal to cv).
    """
    t_start = process_time()
    clf_grid = RandomizedSearchCV(model, parameters, cv=cv, n_jobs=-1)  
    clf_grid.fit(X_dataset, y_dataset)
    scores = clf_grid.best_score_ # mean cross-validated score of the best_estimator
    t_stop = process_time()
    part_l = [round(scores.mean(), 3), '-', str(datetime.timedelta(seconds=t_stop-t_start))]
    return part_l

def ml_benchmark(X_dataset, y_dataset, cv):
    """
    This function returns a pandas dataframe with the scores and processing times of some classic machine learning models
    applied to X_dataset and y_dataset.
    The scores are calculated with cross_val_score (with K-Fold equal to cv).
    If there are hyper-parameters, they are tuned with RandomizedSearchCV.
    """
    
    print('The shape of X_dataset is :', X_dataset.shape)
    print('The shape of y_dataset is :', y_dataset.shape)
     
    rows_name = ["Ridge", "Lasso", "Logistic Regression", "Nearest Neighbors", "Naive Bayes",
                  "Perceptron", "Random Forest", "Multi-Layer Perceptron"]
    
    # The second column holds 2 * standard deviation of the scores (not their variance).
    columns_name = ['Approx. mean of scores', 'Approx. 2*std of scores', 'Processing time']
    
    l = []
        
    model = Ridge()
    l.append(score_and_time(model, X_dataset, y_dataset, cv))
    
    model = Lasso()
    l.append(score_and_time(model, X_dataset, y_dataset, cv))
    
    model = linear_model.LogisticRegression()
    parameters = {'solver': ['lbfgs','liblinear','sag','saga'], 'multi_class': ['auto'],
                 'warm_start': [True, False], 'C': [0.01,0.1,1,10,100]}
    l.append(score_and_time_hyp(model, parameters, X_dataset, y_dataset, cv))
    
    model = neighbors.KNeighborsClassifier()
    parameters = {'n_neighbors': [1,2,3,5,8,10,20], 'algorithm': ['ball_tree', 'kd_tree', 'brute']}
    l.append(score_and_time_hyp(model, parameters, X_dataset, y_dataset, cv))
    
    model = naive_bayes.GaussianNB()
    l.append(score_and_time(model, X_dataset, y_dataset, cv))
    
    model = linear_model.Perceptron()
    l.append(score_and_time(model, X_dataset, y_dataset, cv))
    
    model = RandomForestClassifier()
    parameters = {'n_estimators': [1000], 'max_depth': [1,10,25,50], "bootstrap": [True, False],
                  "max_features": [1, 3, 10], "min_samples_split": [2, 3, 10],
                  "criterion": ["gini", "entropy"], 'random_state': [0]}
    l.append(score_and_time_hyp(model, parameters, X_dataset, y_dataset, cv))
    
    model = neural_network.MLPClassifier()
    parameters = {'solver': ['lbfgs'], 'max_iter': [1,500,1000,1500,2000], 'alpha': 10.0**-np.arange(1,5),
                  'hidden_layer_sizes': np.arange(1,15,2),'activation': ['relu','tanh']}
    l.append(score_and_time_hyp(model, parameters, X_dataset, y_dataset, cv))
    
    out = pd.DataFrame(l, index = rows_name, columns = columns_name)
    
    return out
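
A note on the processing times reported by these helpers: `process_time` counts only the CPU time of the current process. Time spent waiting, or spent in worker processes (which `n_jobs=-1` may start, depending on the backend), is therefore not included. `time.perf_counter` measures elapsed wall-clock time instead. Here is a minimal sketch of the difference, using `time.sleep` as a stand-in for work done outside the current process:

```python
import time

cpu_start, wall_start = time.process_time(), time.perf_counter()
time.sleep(0.2)  # stand-in for waiting on work done elsewhere
cpu_elapsed = time.process_time() - cpu_start
wall_elapsed = time.perf_counter() - wall_start

print(f"process_time: {cpu_elapsed:.3f} s, perf_counter: {wall_elapsed:.3f} s")
```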

In [None]:
input_benchmark = ml_benchmark(X_dataset, y_dataset, 5)
input_benchmark.sort_values(by=['Approx. mean of scores'], ascending=False)
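
A note on reading this table: when no `scoring` argument is given, `cross_val_score` falls back on each estimator's own `score` method. For the regressors `Ridge` and `Lasso` that is the coefficient of determination $R^2$, whereas for the classifiers it is the accuracy, so the two groups of rows are not on the same scale. A minimal sketch on synthetic data (the dataset below is a made-up placeholder):

```python
from sklearn.base import is_classifier, is_regressor
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem, for illustration only.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

results = {}
for model in (Ridge(), LogisticRegression()):
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    metric = "R^2" if is_regressor(model) else "accuracy"
    results[type(model).__name__] = (metric, scores.mean())
    print(type(model).__name__, metric, round(scores.mean(), 3))
```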

We export the results into a csv file:

In [None]:
input_benchmark.to_csv('input_benchmark.csv', sep=';')

---
# Predicting the column of index `10` (called `target`) with (only) the (fake) generated dataset <a name="pred2"></a>

## Preparing the data

In [None]:
df = df_output
X_dataset = df.loc[:, df.columns != target].values
y_dataset = np.ravel(df.loc[:, df.columns == target].values)
print(X_dataset.shape, y_dataset.shape)

## Benchmarking some models according to their score

In [None]:
output_benchmark = ml_benchmark(X_dataset, y_dataset, 5)
output_benchmark.sort_values(by=['Approx. mean of scores'], ascending=False)

We export the results into a csv file:

In [None]:
output_benchmark.to_csv('output_benchmark.csv', sep=';')

---
# Predicting the column of index `10` (called `target`) with data augmentation <a name="pred3"></a>

## Preparing the data

We concatenate the real-life dataset `df_input` and the (fake) generated dataset `df_output` into `df_aug`:

In [None]:
df_aug = pd.concat([df_input, df_output])  # DataFrame.append was removed in pandas 2.0
print(df_aug.shape)

In [None]:
df = df_aug
X_dataset = df.loc[:, df.columns != target].values
y_dataset = np.ravel(df.loc[:, df.columns == target].values)
print(X_dataset.shape, y_dataset.shape)

## Benchmarking some models according to their score

In [None]:
aug_benchmark = ml_benchmark(X_dataset, y_dataset, 5)
aug_benchmark.sort_values(by=['Approx. mean of scores'], ascending=False)
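
Because the cross-validation above is run on `df_aug`, the held-out folds contain generated rows as well as real ones. A possible complementary check, sketched below under stated assumptions, trains on the real training rows plus all generated rows but scores on held-out real rows only. The helper name `score_on_real_only` and the synthetic data are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def score_on_real_only(model, X_real, y_real, X_fake, y_fake, cv=5):
    """Train on real training folds plus all fake rows; score on real test folds."""
    scores = []
    folds = KFold(n_splits=cv, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(X_real):
        X_train = np.vstack([X_real[train_idx], X_fake])
        y_train = np.concatenate([y_real[train_idx], y_fake])
        fitted = clone(model).fit(X_train, y_train)
        scores.append(fitted.score(X_real[test_idx], y_real[test_idx]))
    return np.array(scores)

# Synthetic placeholders: the first 200 rows play the role of real data,
# the remaining 400 rows play the role of generated data.
X_all, y_all = make_classification(n_samples=600, n_features=10, random_state=0)
scores = score_on_real_only(LogisticRegression(max_iter=1000),
                            X_all[:200], y_all[:200], X_all[200:], y_all[200:])
print(round(scores.mean(), 3))
```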

We export the results into a csv file:

In [None]:
aug_benchmark.to_csv('aug_benchmark.csv', sep=';')

---
# Comparison: does data augmentation help boost the score?

In [None]:
xaxis = input_benchmark['Approx. mean of scores'].values
yaxis = aug_benchmark['Approx. mean of scores'].values

start = min(np.min(xaxis), np.min(yaxis))
stop = max(np.max(xaxis), np.max(yaxis))
p = len(xaxis)
X = np.linspace(start, stop, num=p+1)

plt.plot(xaxis, yaxis, 'ok', X, X, '-g');

plt.legend(['Approx. mean of scores', 'Equal approx. mean of scores'])
plt.title('Mean score per model: real vs. augmented dataset')
plt.xlabel('For the real dataset')
plt.ylabel('For the augmented dataset')  # yaxis comes from aug_benchmark
plt.savefig('comparison.png') # to save the figure
plt.show()
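
The same comparison can also be read as a table of per-model differences. Here is a minimal sketch, using two small made-up benchmark frames in place of `input_benchmark` and `aug_benchmark` (the numbers and model names are placeholders, not results):

```python
import pandas as pd

col = 'Approx. mean of scores'
# Made-up placeholder scores, NOT actual results.
input_demo = pd.DataFrame({col: [0.80, 0.75]}, index=['Model A', 'Model B'])
aug_demo = pd.DataFrame({col: [0.82, 0.70]}, index=['Model A', 'Model B'])

comparison = pd.DataFrame({'real': input_demo[col], 'augmented': aug_demo[col]})
# Positive gain: the score is higher on the augmented dataset.
comparison['gain'] = comparison['augmented'] - comparison['real']
print(comparison.sort_values(by='gain', ascending=False))
```

Points above the green diagonal in the plot correspond to rows with a positive `gain`.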

---
Back to [top](#top).