<a name="top"></a>
<br/>
# Using `medGAN` on the MIMIC-III dataset of shape (1000, 100) with binary values

Author: [Sylvain Combettes](https://github.com/sylvaincom). <br/>
Last update: Sep 11, 2019. Creation: Aug 12, 2019. <br/>
My own `medGAN` repository: [medgan-tips](https://github.com/sylvaincom/medgan-tips) (based on Edward Choi's work). <br/>
Edward Choi's original repository: [medgan](https://github.com/mp2893/medgan).

Before reading this notebook, make sure that you have read my [medGAN repository](https://github.com/sylvaincom/medgan-tips)'s table of contents.

---
### Table of Contents

- [1) Using `process_mimic.py` and `medgan.py` to generate the fake realistic data](#run)
- [2) Processing `gen-samples.npy`](#gen-samples)
- [3) Comparing the fictitious generated samples to the real-life original ones](#comparison)

---
### Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os
import pickle

---
<a name="run"></a>
# 1) Using `process_mimic.py` and `medgan.py` to generate the fake realistic data 

This step is detailed in my tutorial [A few additional tips on how to run Edward Choi's medGAN](https://github.com/sylvaincom/medgan/blob/master/tips-for-medgan.md).

In short, in the Anaconda prompt, we run:
```
cd C:\Users\<username>\Documents\mimic_binary_small_rd
python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv training-data "binary"
mkdir generated
```

`process_mimic.py` outputs `training-data.matrix`, which contains 46 520 samples and 1 071 features. Data augmentation makes little sense on a dataset that already has this many samples. To put ourselves in a situation where samples are scarce, we randomly select 1 000 samples and 100 features from the MIMIC-III dataset, using a fixed random seed for reproducibility. This is our `df_real_small` dataset, on which we will try to perform data augmentation (in another notebook following this one).

In [None]:
real_data = pickle.load(open('training-data.matrix', 'rb')) # real-life dataset
df_real = pd.DataFrame(real_data)
print(df_real.shape)

We randomly select 1 000 samples:

In [None]:
df_real_small = df_real.sample(1000, random_state=1)
print(df_real_small.shape)

We randomly select 100 features:

In [None]:
df_real_small = df_real_small.sample(100, axis=1, random_state=1)
print(df_real_small.shape)

We can now export our dataset:

In [None]:
matrix = df_real_small.to_numpy() # DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() requires pandas >= 0.24
pickle.dump(matrix, open('training-data-small.matrix', 'wb'), -1)

We check if it was saved correctly:

In [None]:
real_data_array_small = pickle.load(open('training-data-small.matrix', 'rb'))
df_real_small = pd.DataFrame(real_data_array_small)
print(df_real_small.shape)
df_real_small.head(5)
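
Beyond checking the shape, it can help to confirm that the reloaded matrix equals what was written and still contains only 0s and 1s. Here is a minimal sketch of such a check. It uses a random binary matrix and a temporary file as stand-ins (`demo_matrix` and `demo-small.matrix` are hypothetical names, not part of the notebook), so it runs without the MIMIC-III files:

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical stand-in for the real subsample: a random binary matrix of shape (1000, 100).
rng = np.random.default_rng(1)
demo_matrix = rng.integers(0, 2, size=(1000, 100)).astype('float32')

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo-small.matrix')
    # Write with the highest pickle protocol (-1), as in the export cell above.
    with open(demo_path, 'wb') as f:
        pickle.dump(demo_matrix, f, -1)
    # Read the file back while the temporary directory still exists.
    with open(demo_path, 'rb') as f:
        reloaded = pickle.load(f)

same_shape = reloaded.shape == demo_matrix.shape
same_values = bool(np.array_equal(reloaded, demo_matrix))
only_binary = bool(np.isin(reloaded, [0, 1]).all())
print(same_shape, same_values, only_binary)
```

The same three checks could be applied to `real_data_array_small` and `matrix` in this notebook.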

Now, in the Anaconda prompt, we run:
```
python medgan.py training-data-small.matrix ./generated/samples --data_type="binary" --n_epoch=1000 --n_pretrain_epoch=100 --batch_size=100
python medgan.py training-data-small.matrix gen-samples --model_file=./generated/samples-999 --generate_data=True --data_type="binary"
```
Some default values in `medgan.py` are `n_epoch=1000`, `n_pretrain_epoch=100` and `batch_size=1000`; in the command above, we override `batch_size` with 100. We also set `nSamples=1000` on line 406 of `medgan.py`.

---
<a name="gen-samples"></a>
# 2) Processing `gen-samples.npy`

We load the `gen-samples.npy` file which is `medgan.py`'s output:

In [None]:
fict = np.load('gen-samples.npy') # fictitious generated dataset
df_fict = pd.DataFrame(fict)
print(df_fict.shape)
df_fict.head()

The output of `medgan.py` has no missing values.

We need to round the values ourselves:

In [None]:
df_fict = df_fict.round(0)
df_fict.head()
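
One detail of `round(0)` worth knowing: pandas relies on NumPy rounding, which rounds exact halves to the nearest even value, so a generated value of exactly 0.5 becomes 0. If an explicit cut-off is preferred, thresholding is an alternative. The sketch below compares the two on a tiny made-up frame (`demo_fict` is hypothetical, not real `medGAN` output):

```python
import pandas as pd

# Hypothetical generator outputs in [0, 1]; not real medGAN output.
demo_fict = pd.DataFrame({'a': [0.10, 0.50, 0.90], 'b': [0.49, 0.51, 1.00]})

# Same call as in the cell above: exact halves are rounded to the even neighbour (0.5 -> 0.0).
rounded = demo_fict.round(0)

# Explicit threshold: every value >= 0.5 becomes 1, everything else 0.
thresholded = (demo_fict >= 0.5).astype(int)

print(rounded)
print(thresholded)
```

The two approaches only differ for values that are exactly 0.5, which are rare in practice, so this is a matter of making the convention explicit rather than a correction.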

---
<a name="comparison"></a>
# 3) Comparing the fictitious generated samples to the real-life original ones

Here is a recap of our parameters for `medGAN`:

| dataset | number of samples | number of features |
|---|---|---|
|`df_real_small` | 1 000 | 100 |
|`df_fict` | 1 000 | 100 |

| `n_epoch` | `n_pretrain_epoch` | `batch_size` | `nSamples` |
|---|---|---|---|
| 1 000 | 100 | 100 | 1 000 |

## 3.1) Probability distribution of the real-life data

In [None]:
df_real = df_real_small
n_real, p_real = df_real.shape
print(n_real, p_real)

proba_real = [sum(df_real[f])/n_real for f in list(df_real)]

plt.plot(proba_real, 'o')
plt.xlabel('Index of feature')
plt.ylabel('Bernoulli success probability')
plt.title('For the real-life dataset')
plt.show()
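
For binary data, the Bernoulli success probability of a feature is simply the mean of its column, so the list comprehension above can also be written as a column-wise mean. The following sketch checks that both give the same result on a random binary frame (`demo` is a hypothetical stand-in for `df_real`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_real: a random binary frame of shape (1000, 100).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 2, size=(1000, 100)))
n_demo = demo.shape[0]

# Loop version, as in the cell above: proportion of 1s in each column.
proba_loop = [sum(demo[f]) / n_demo for f in list(demo)]

# Vectorized version: the column mean of a 0/1 column is the same proportion.
proba_vec = demo.mean(axis=0).to_numpy()

print(np.allclose(proba_loop, proba_vec))
```

The vectorized form is usually faster on large frames, but both are fine at this size.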

## 3.2) Probability distribution of the fictitious generated data

In [None]:
n_fict, p_fict = df_fict.shape
print(n_fict, p_fict)

proba_fict = [sum(df_fict[f])/n_fict for f in list(df_fict)]

plt.plot(proba_fict, 'o')
plt.xlabel('Index of feature')
plt.ylabel('Bernoulli success probability')
plt.title('For the fictitious generated dataset')
plt.show()

## 3.3) Comparison: dimension-wise probability

In [None]:
xaxis = proba_real
yaxis = proba_fict

start = min(np.min(xaxis), np.min(yaxis))
stop = max(np.max(xaxis), np.max(yaxis))
p = len(xaxis)
X = np.linspace(start, stop, num=p+1)

plt.plot(xaxis, yaxis, 'ok', X, X, '-g');

plt.legend(['Bernoulli success probability', 'ideal Bernoulli success probability'])
plt.title('Dimension-wise probability performance of medGAN')
plt.xlabel('For the real-life dataset')
plt.ylabel('For the fictitious generated dataset')
plt.savefig('accuracy_mimic_binary_small.png', dpi=120) # to save the image in high resolution
plt.show()
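
The scatter plot gives a visual impression: the closer the points lie to the diagonal, the better. To complement it with numbers, one option is to compute the mean absolute error and the Pearson correlation between the two probability vectors. This is only a sketch of one possible summary, not a metric taken from the `medGAN` code, and it uses synthetic vectors (`demo_proba_real` and `demo_proba_fict` are hypothetical) in place of `proba_real` and `proba_fict`:

```python
import numpy as np

# Hypothetical probability vectors standing in for proba_real and proba_fict.
rng = np.random.default_rng(0)
demo_proba_real = rng.uniform(0.0, 0.3, size=100)
noise = rng.normal(0.0, 0.01, size=100)
demo_proba_fict = np.clip(demo_proba_real + noise, 0.0, 1.0)

# Mean absolute error: average vertical distance of the points to the diagonal.
mae = np.mean(np.abs(demo_proba_real - demo_proba_fict))

# Pearson correlation: close to 1 when the points follow a straight line.
corr = np.corrcoef(demo_proba_real, demo_proba_fict)[0, 1]

print(f'mean absolute error: {mae:.4f}')
print(f'Pearson correlation: {corr:.4f}')
```

A high correlation alone does not guarantee that the points lie on the diagonal (a systematic bias would go unnoticed), which is why the mean absolute error is reported alongside it.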

---
Back to [top](#top).