<a name="top"></a>
<br/>
# Understanding how `medGAN` works on the MIMIC-III dataset of shape `(1000, 100)` with binary values

Author: [Sylvain Combettes](https://github.com/sylvaincom). <br/>
Last update: Sep 4, 2019. Creation: Aug 12, 2019. <br/>
My own medGAN repository (that is based on Edward Choi's work): [medgan-tips](https://github.com/sylvaincom/medgan-tips). <br/>
Edward Choi's original repository: [medgan](https://github.com/mp2893/medgan).

The final goal of my project is to use `medGAN` on my own dataset (electronic health records). Hence, I first need to understand how the `medGAN` program works. In this notebook, I provide a few code cells and explanations to help better understand and run `medGAN`. Because there are some confidentiality issues with the MIMIC-III dataset, I cleared the output of the cells.

Before reading this notebook, make sure that you have read my [medGAN repository](https://github.com/sylvaincom/medgan-tips)'s table of contents.

We will use the MIMIC-III dataset and process it so that we only have binary values.

---
### Table of Contents

- [1) Using `process_mimic.py` and `medgan.py` to generate the fake realistic data](#run)
- [2) How to interpret `gen-samples.npy`?](#gen-samples)
- [3) Comparing the fictitious generated samples to the real-life original ones](#comparison)

---
### Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os
import pickle

---
<a name="run"></a>
# 1) Using `process_mimic.py` and `medgan.py` to generate the fake realistic data 

This step is detailed in [A few additional tips on how to run Edward Choi's medGAN](https://github.com/sylvaincom/medgan/blob/master/tips-for-medgan.md).

In short, in the Anaconda prompt, we run:
```
cd C:\Users\<username>\Documents\mimic_binary_small
python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv training-data "binary"
mkdir generated
```

From now on, whenever we refer to input or output, we refer to the input and output of `medgan.py` (unless specified otherwise).

`process_mimic.py` outputs `training-data.matrix`, which contains 46 520 samples and 1 071 features. Data augmentation is mostly needed when the dataset is small, so we assume that we only have 1 000 samples. Learning 1 071 features from only 1 000 samples is quite difficult, so we also keep only the first 100 features.

We keep only the first 1 000 samples:

In [None]:
with open('training-data.matrix', 'rb') as f:
    real_data_array = pickle.load(f) # real-life dataset
df_real = pd.DataFrame(real_data_array)
print(df_real.shape)

df_real_small = df_real.head(1000).copy() # first 1 000 samples; the copy keeps later edits from touching df_real
print(df_real_small.shape)

We keep only the first 100 features:

In [None]:
col = list(df_real_small.columns)[100:] # every column after the first 100
df_real_small = df_real_small.drop(columns=col) # reassign instead of modifying a slice in place
print(df_real_small.shape)
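
The two selection steps above can also be written as a single positional slice on the underlying array. The sketch below is only an illustration: it uses a small synthetic binary matrix (not MIMIC-III data), and the names `rng`, `full` and `small` are made up for the example.

```python
import numpy as np

# Synthetic stand-in for the processed matrix (assumption: NOT real MIMIC-III data).
rng = np.random.default_rng(0)
full = rng.integers(0, 2, size=(2000, 300)).astype("float32")

# Keep the first 1 000 rows (samples) and the first 100 columns (features).
small = full[:1000, :100]

print(full.shape)   # (2000, 300)
print(small.shape)  # (1000, 100)
```

With a `DataFrame`, the equivalent positional selection would be `df.iloc[:1000, :100]`.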

We can now export our dataset:

In [None]:
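matrix = df_real_small.values # `as_matrix` was removed in pandas 1.0; `.values` returns the NumPy array
with open('training-data-small.matrix', 'wb') as f:
    pickle.dump(matrix, f, -1) # -1 selects the highest pickle protocol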

We check if it was saved correctly:

In [None]:
with open('training-data-small.matrix', 'rb') as f:
    real_data_array_small = pickle.load(f)
df_real_small = pd.DataFrame(real_data_array_small)
print(df_real_small.shape)
df_real_small.head(5)
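
Since `medgan.py` is run with `--data_type="binary"`, it can be worth checking that the exported matrix really contains only zeros and ones. The helper below is a minimal sketch of such a check; `is_binary` is a hypothetical name (it is not part of `medgan`), and the two small arrays are toy data.

```python
import numpy as np

def is_binary(matrix):
    """Return True if every entry of `matrix` is exactly 0 or 1."""
    return bool(np.isin(matrix, [0, 1]).all())

# Toy matrices (not MIMIC-III data).
ok = np.array([[0, 1, 1], [1, 0, 0]], dtype="float32")
bad = np.array([[0, 2, 1], [1, 0, 0]], dtype="float32")

print(is_binary(ok))   # True
print(is_binary(bad))  # False
```

In the notebook, the same function could be applied to `real_data_array_small`.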

In the Anaconda prompt, we run:
```
python medgan.py training-data-small.matrix ./generated/samples --data_type="binary" --n_epoch=1000 --n_pretrain_epoch=100 --batch_size=100
python medgan.py training-data-small.matrix gen-samples --model_file=./generated/samples-999 --generate_data=True --data_type="binary"
```
Some default values are `n_epoch=1000`, `n_pretrain_epoch=100` and `batch_size=1000`, so the first command above only changes `batch_size` (set to 100). We set `nSamples=1000` on line 406 of `medgan.py`.

---
<a name="gen-samples"></a>
# 2) How to interpret `gen-samples.npy`?

We load the `gen-samples.npy` file which is `medgan.py`'s output:

In [None]:
fict = np.load('gen-samples.npy') # fictitious generated dataset
df_fict = pd.DataFrame(fict)
print(df_fict.shape)
df_fict.head()

The output of `medgan.py` has no missing values.

We need to round the values ourselves to obtain binary values:

In [None]:
df_fict = df_fict.round(0)
df_fict.head()
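
One detail about rounding: NumPy rounds values that are exactly halfway to the nearest even number, so `0.5` becomes `0`. If an explicit cutoff is preferred, a threshold comparison makes the rule visible. The sketch below compares both on toy values; the 0.5 cutoff is an assumption chosen for illustration, not something prescribed by `medgan`.

```python
import numpy as np

# Toy generated values (not actual medGAN output).
values = np.array([0.02, 0.49, 0.5, 0.51, 0.97])

rounded = np.round(values)                 # half-to-even: 0.5 -> 0.0
thresholded = (values >= 0.5).astype(int)  # explicit cutoff: 0.5 -> 1

print(rounded)      # [0. 0. 0. 1. 1.]
print(thresholded)  # [0 0 1 1 1]
```

The two approaches only differ for values equal to the cutoff, which should be rare for continuous outputs.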

As a precaution, we drop the rows with missing values (if there are any):

In [None]:
df_fict = df_fict.dropna()
print(df_fict.shape)

---
<a name="comparison"></a>
# 3) Comparing the fictitious generated samples to the real-life original ones

In this section, we assess how closely the fictitious generated dataset matches the original one. As in Choi's paper, we use dimension-wise probability (because the variables are binary).

Given that our data is binary, for each feature (dimension), we treat `1` as a success and `0` as a failure. Hence the proportion of `1`s observed for a feature is its Bernoulli success probability _p_.
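
For a binary matrix with $n$ rows, the estimate for feature $j$ is $\hat{p}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$, which is simply the column mean. The toy example below (made-up values, with the illustrative names `toy` and `p_hat`) shows the computation on a 4 × 3 matrix.

```python
import numpy as np

# Toy binary dataset: 4 samples (rows), 3 features (columns).
toy = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
])

# Proportion of ones per column = Bernoulli success probability per feature.
p_hat = toy.mean(axis=0)
print(p_hat)  # [0.75 0.25 0.25]
```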

## 3.1) Probability distribution of real-life data

In [None]:
df_real = df_real_small
n_real, p_real = df_real.shape
print(n_real, p_real)

proba_real = [sum(df_real[f])/n_real for f in list(df_real)]

plt.plot(proba_real, 'o')
plt.xlabel('Index of feature')
plt.ylabel('Bernoulli success probability')
plt.title('For the real-life dataset')
plt.show()

## 3.2) Probability distribution of fictitious generated data

In [None]:
n_fict, p_fict = df_fict.shape
print(n_fict, p_fict)

proba_fict = [sum(df_fict[f])/n_fict for f in list(df_fict)]

plt.plot(proba_fict, 'o')
plt.xlabel('Index of feature')
plt.ylabel('Bernoulli success probability')
plt.title('For the fictitious generated dataset')
plt.show()

## 3.3) Comparison: dimension-wise probability

In [None]:
xaxis = proba_real
yaxis = proba_fict

start = min(np.min(xaxis), np.min(yaxis))
stop = max(np.max(xaxis), np.max(yaxis))
p = len(xaxis)
X = np.linspace(start, stop, num=p+1)

plt.plot(xaxis, yaxis, 'ok', X, X, '-g');

plt.legend(['Bernoulli success probability', 'ideal Bernoulli success probability'])
plt.title('Dimension-wise probability performance of medGAN')
plt.xlabel('For the real dataset')
plt.ylabel('For the fictitious generated dataset')
plt.savefig('accuracy_mimic_binary_small.png', dpi=120) # to save the image in high resolution
plt.show()

The values on the $x$-axis and the $y$-axis are paired by feature: each point compares the Bernoulli success probability of a given variable in both datasets.

We have 100 features, hence 100 scatter points. Note that once we have learned the distribution of our real-life original data, we can generate as many (fake) samples as we want, for example 1 000.

The diagonal green line indicates the ideal performance, where the real and the fictitious generated data have identical Bernoulli success probabilities for every feature. Since the dots lie close to the diagonal green line, we can say that `medGAN` reproduces the dimension-wise probabilities of the real data well.
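
To complement the visual check, the distance to the diagonal can be summarised with a couple of numbers. The sketch below is a suggestion rather than a procedure from Choi's paper: it computes the mean absolute error and the Pearson correlation between two probability vectors. The function name `dimension_wise_summary` and the toy vectors are hypothetical.

```python
import numpy as np

def dimension_wise_summary(p_real, p_fict):
    """Return (mean absolute error, Pearson correlation) between two
    vectors of per-feature Bernoulli success probabilities."""
    p_real = np.asarray(p_real, dtype=float)
    p_fict = np.asarray(p_fict, dtype=float)
    mae = float(np.mean(np.abs(p_real - p_fict)))
    corr = float(np.corrcoef(p_real, p_fict)[0, 1])
    return mae, corr

# Toy probabilities for 4 features (not results obtained on MIMIC-III).
toy_real = [0.10, 0.40, 0.75, 0.05]
toy_fict = [0.12, 0.35, 0.80, 0.05]

mae, corr = dimension_wise_summary(toy_real, toy_fict)
print(f"MAE = {mae:.3f}, Pearson r = {corr:.3f}")
```

In the notebook, the function could be called with `proba_real` and `proba_fict`; a small error together with a correlation close to 1 corresponds to dots lying near the diagonal.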

---
Back to [top](#top).