<a name="top"></a>
<br/>
# Using `medGAN` to try to perform data augmentation on the MIMIC-III dataset with binary values

Author: [Sylvain Combettes](https://github.com/sylvaincom). <br/>
Last update: Aug 29, 2019. Created: Aug 29, 2019. <br/>
My own medGAN repository (that is based on Edward Choi's work): [medgan](https://github.com/sylvaincom/medgan-tips). <br/>
Edward Choi's original repository: [medgan](https://github.com/mp2893/medgan).

Before reading this notebook, make sure that you have read my [medGAN repository](https://github.com/sylvaincom/medgan-tips)'s table of contents.

> **Using `medGAN` to try to perform data augmentation** <br/> <br/>
With `medGAN`, we want to generate (fake) realistic patient data, which can then enrich the initial training database.
For example, suppose our training dataset $A$ is not large enough (say 500 samples with 50 features) and we want to use `medGAN` to generate a new dataset $B$ of 1000 fake samples (with 50 features as well). By adding $B$ to $A$, we get a new training dataset $C$ that has 1500 patients. We can hope that $C$ helps algorithms (any of them) make better predictions than $A$. <br/> <br/>
I asked Edward Choi what he thought about using GANs as generative models for data augmentation. Trying to generate fake realistic patients with `medGAN` from a dataset of 500 samples with 250 variables seems suboptimal: there seem to be too many variables and not enough samples. There is no definite number of variables to delete: it depends on the variance of each variable and the correlation between variables. For example, if there is a variable named "gender" and all 500 samples are from men (thus zero variance), then it would be very easy for `medGAN` to replicate that variable (by setting the gender of each generated sample to male).
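
The arithmetic above, and the point about low-variance variables, can be sketched with NumPy. The arrays below are random placeholders (not MIMIC-III data and not `medGAN` output), and the names `A`, `B` and `C` simply mirror the notation above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder binary datasets: A stands for real data, B for generated data.
A = rng.integers(0, 2, size=(500, 50))
B = rng.integers(0, 2, size=(1000, 50))

# Make column 0 constant in A, like a "gender" column where every patient is male.
A[:, 0] = 1

# Augmented dataset C: the rows of A followed by the rows of B.
C = np.vstack([A, B])
print(C.shape)

# Per-feature variance of A; a constant column has variance 0.
variances = A.var(axis=0)
constant_columns = np.flatnonzero(variances == 0)
print(constant_columns)
```

A column flagged this way carries no information for a predictor, which is one reason it is trivial for a generator to reproduce.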

We will use the MIMIC-III dataset and process it so that we only have binary values.

From now on, "input" and "output" refer to the input and output of `medgan.py` (unless specified otherwise): "input" is the original real-life dataset and "output" is the fake realistic generated dataset.

Warning: the computing time is very long.

---
### Table of contents

- [Loading the data](#load)
- [Predicting the column of index `10` (called `target`) with (only) the original real-life dataset](#pred1)
- [Predicting the column of index `10` (called `target`) with (only) the (fake) generated dataset](#pred2)
- [Predicting the column of index `10` (called `target`) with data augmentation](#pred3)

---
### Imports

In [None]:
import numpy as np
import os
import pickle
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import model_selection, neural_network, preprocessing, neighbors, naive_bayes, linear_model
from sklearn.linear_model import LinearRegression, LogisticRegressionCV, LogisticRegression, RidgeCV, LassoCV
from sklearn.ensemble import RandomForestClassifier

---
# Loading the data <a name="load"></a>

## Loading the real-life original dataset

We call the real-life original dataset `df_input`.

In [None]:
# Use a context manager so the file handle is closed after loading
with open('training-data.matrix', 'rb') as f:
    input_data_array = pickle.load(f)
df_input = pd.DataFrame(input_data_array)
print('The shape of the input dataset is :', df_input.shape)
df_input.head(5)

## Loading the (fake) generated dataset

We call the (fake) generated dataset `df_output`.

In [None]:
output = np.load('gen-samples.npy')
df_output = pd.DataFrame(output).round(0)
print('The shape of the output dataset is :', df_output.shape)
df_output.head(5)
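
Rounding with `round(0)` maps each generated value to the nearest integer, which amounts to a 0.5 threshold if the generated values lie in $[0, 1]$ (an assumption about the contents of `gen-samples.npy`). A minimal sketch on a hand-made array, with illustrative values rather than actual `medGAN` output:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for generated samples; assumed to lie in [0, 1].
fake = np.array([[0.02, 0.97, 0.40],
                 [0.61, 0.10, 0.88]])

# What the notebook does: round every entry to the nearest integer.
rounded = pd.DataFrame(fake).round(0)

# The same binarization written as an explicit 0.5 threshold.
thresholded = pd.DataFrame((fake > 0.5).astype(float))

# The two agree as long as no value is exactly 0.5.
print(rounded.equals(thresholded))
```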

## Choosing the feature we are going to try to predict

Which feature are we going to try to predict? For example, we want one that does not contain only zeros (so that it is harder to predict).

In [None]:
plt.plot(df_input.sum()/df_input.shape[0], 'o')
plt.xlabel('Index of feature')
plt.ylabel('Proportion of 1s')
plt.show()

In [None]:
target = 10
print('Approx. proportion of 1s of', target, ':', round(df_input[target].sum()/df_input.shape[0], 4))
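
Instead of reading a suitable feature off the plot, the candidates can also be listed programmatically. The helper `candidate_targets` and its bounds `low` and `high` are hypothetical names introduced for this sketch. It relies on the fact that, for 0/1 data, the column mean equals the proportion of 1s.

```python
import pandas as pd

def candidate_targets(df, low=0.2, high=0.8):
    """Return the column labels whose proportion of 1s lies in [low, high]."""
    proportions = df.mean()  # for 0/1 data, the mean is the proportion of 1s
    return proportions[(proportions >= low) & (proportions <= high)].index.tolist()

# Toy binary dataset with integer column labels, like df_input.
toy = pd.DataFrame([[0, 1, 1],
                    [0, 1, 0],
                    [0, 1, 1],
                    [0, 0, 0]])

# Column 0 is all zeros, so it is excluded from the candidates.
print(candidate_targets(toy))
```

On the real data, the call would be `candidate_targets(df_input)`, with the bounds adjusted to how imbalanced a target is acceptable.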

---
# Predicting the column of index `10` (called `target`) with (only) the original real-life dataset <a name="pred1"></a>

## Preparing the data

In [None]:
df = df_input
X_dataset = df.loc[:, df.columns != target].values
y_dataset = np.ravel(df.loc[:, df.columns == target].values)
print(X_dataset.shape, y_dataset.shape)

## Benchmarking some models according to their score

In [None]:
def prediction_benchmark(X_dataset, y_dataset):
    
    print('The shape of X_dataset is :', X_dataset.shape)
    print('The shape of y_dataset is :', y_dataset.shape)
    
    rows_name = ["1-Nearest Neighbors", "5-Nearest Neighbors", "10-Nearest Neighbors", "Naive Bayes",
                 "Ridge", "Lasso", "Logistic Regression", "Perceptron", "Multi-Layer Perceptron"]
    
    columns_name = ['Approx. mean of scores', 'Approx. std of scores']
    
    l = []
    
    model = neighbors.KNeighborsClassifier(n_neighbors=1)
    scores = model_selection.cross_val_score(model, X_dataset, y_dataset, cv=5)
    l.append([round(scores.mean(), 3), round(scores.std(), 3)])
    
    model = neighbors.KNeighborsClassifier(n_neighbors=5)
    scores = model_selection.cross_val_score(model, X_dataset, y_dataset, cv=5)
    l.append([round(scores.mean(), 3), round(scores.std(), 3)])
    
    model = neighbors.KNeighborsClassifier(n_neighbors=10)
    scores = model_selection.cross_val_score(model, X_dataset, y_dataset, cv=5)
    l.append([round(scores.mean(), 3), round(scores.std(), 3)])
    
    model = naive_bayes.GaussianNB()
    scores = model_selection.cross_val_score(model, X_dataset, y_dataset, cv=5)
    l.append([round(scores.mean(), 3), round(scores.std(), 3)])
    
    # RidgeCV is a regressor: its cross-validation score is R^2 on the 0/1 target, not accuracy
    model = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1])
    scores = model_selection.cross_val_score(model, X_dataset, y_dataset, cv=5)
    l.append([round(scores.mean(), 3), round(scores.std(), 3)])
    
    # LassoCV is also a regressor, so its score is R^2 as well
    model = LassoCV(cv=5, random_state=0)
    scores = model_selection.cross_val_score(model, X_dataset, y_dataset, cv=5)
    l.append([round(scores.mean(), 3), round(scores.std(), 3)])
    
    model = linear_model.LogisticRegression()
    scores = model_selection.cross_val_score(model, X_dataset, y_dataset, cv=5)
    l.append([round(scores.mean(), 3), round(scores.std(), 3)])
    
    model = linear_model.Perceptron()
    scores = model_selection.cross_val_score(model, X_dataset, y_dataset, cv=5)
    l.append([round(scores.mean(), 3), round(scores.std(), 3)])
    
    model = neural_network.MLPClassifier(hidden_layer_sizes=(6,),max_iter=1000,solver='lbfgs',alpha=.01)
    scores = model_selection.cross_val_score(model, X_dataset, y_dataset, cv=5)
    l.append([round(scores.mean(), 3), round(scores.std(), 3)])
    
    out = pd.DataFrame(l, index = rows_name, columns = columns_name)
    
    return out
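
The function above repeats the same three lines for every model. A more compact variant, given here only as a sketch, loops over a dictionary of named models. `benchmark_models`, the short column names and the synthetic data are assumptions made for illustration, and only three of the classifiers are included to keep it quick.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model, model_selection, naive_bayes, neighbors

def benchmark_models(models, X, y, cv=5):
    """Cross-validate each named model and tabulate the mean and std of its scores."""
    rows = []
    for name, model in models.items():
        scores = model_selection.cross_val_score(model, X, y, cv=cv)
        rows.append([round(scores.mean(), 3), round(scores.std(), 3)])
    return pd.DataFrame(rows, index=list(models), columns=['mean', 'std'])

# Synthetic binary data, only to exercise the helper.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(200, 10))
y_toy = X_toy[:, 0] | X_toy[:, 1]  # label depends on the first two features

toy_models = {
    "5-Nearest Neighbors": neighbors.KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": naive_bayes.GaussianNB(),
    "Logistic Regression": linear_model.LogisticRegression(),
}
toy_benchmark = benchmark_models(toy_models, X_toy, y_toy)
print(toy_benchmark)
```

Adding or removing a model then only means editing the dictionary.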

In [None]:
input_benchmark = prediction_benchmark(X_dataset, y_dataset)
input_benchmark.sort_values(by=['Approx. mean of scores'], ascending=False)

---
# Predicting the column of index `10` (called `target`) with (only) the (fake) generated dataset <a name="pred2"></a>

## Preparing the data

In [None]:
df = df_output
X_dataset = df.loc[:, df.columns != target].values
y_dataset = np.ravel(df.loc[:, df.columns == target].values)
print(X_dataset.shape, y_dataset.shape)

## Benchmarking some models according to their score

In [None]:
output_benchmark = prediction_benchmark(X_dataset, y_dataset)
output_benchmark.sort_values(by=['Approx. mean of scores'], ascending=False)

---
# Predicting the column of index `10` (called `target`) with data augmentation <a name="pred3"></a>

## Preparing the data

We concatenate the real-life dataset `df_input` and the (fake) generated dataset `df_output` into `df_aug`:

In [None]:
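# DataFrame.append was removed in pandas 2.0; pd.concat is the supported way to stack rows
df_aug = pd.concat([df_input, df_output], ignore_index=True)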
print(df_aug.shape)

In [None]:
df = df_aug
X_dataset = df.loc[:, df.columns != target].values
y_dataset = np.ravel(df.loc[:, df.columns == target].values)
print(X_dataset.shape, y_dataset.shape)

## Benchmarking some models according to their score

In [None]:
aug_benchmark = prediction_benchmark(X_dataset, y_dataset)
aug_benchmark.sort_values(by=['Approx. mean of scores'], ascending=False)
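
Cross-validating on `df_aug` also scores each model on generated rows, so these numbers are not directly comparable with the benchmark on the real data alone. One possible alternative, sketched below, is to add the generated rows to the training folds only and keep the test folds purely real. The helper `real_only_cv_scores` and the synthetic arrays are hypothetical, introduced for this sketch and not part of the original notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def real_only_cv_scores(model, X_real, y_real, X_fake, y_fake, n_splits=5):
    """Train on real training folds plus all generated rows; score on real held-out folds."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in folds.split(X_real, y_real):
        # Generated rows are only ever used for training, never for testing.
        X_train = np.vstack([X_real[train_idx], X_fake])
        y_train = np.concatenate([y_real[train_idx], y_fake])
        fitted = clone(model).fit(X_train, y_train)
        scores.append(fitted.score(X_real[test_idx], y_real[test_idx]))
    return np.array(scores)

# Synthetic stand-ins for the real and generated datasets.
rng = np.random.default_rng(0)
X_r = rng.integers(0, 2, size=(100, 8))
y_r = X_r[:, 0]
X_f = rng.integers(0, 2, size=(200, 8))
y_f = X_f[:, 0]

toy_scores = real_only_cv_scores(LogisticRegression(), X_r, y_r, X_f, y_f)
print(toy_scores.mean())
```

With the notebook's data, `X_real` and `y_real` would come from `df_input` and `X_fake` and `y_fake` from `df_output`, split on `target` as in the cells above.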

---
Back to [top](#top).