<a name="top"></a>
# Understanding how medGAN works

Author: [Sylvain Combettes](https://github.com/sylvaincom).

Edward Choi's original repository: [medgan](https://github.com/mp2893/medgan). <br/>
My own medGAN repository (that is based on Edward Choi's work): [medgan](https://github.com/sylvaincom/medgan).

The final goal of my project is to use medGAN on my own dataset (electronic health records). For that, I first need to understand how medGAN works. In this notebook, I provide a few code cells and explanations that help to understand and run medGAN. Because of confidentiality constraints on the MIMIC-III dataset, I cleared the output of the cells.

Before reading this notebook, be sure to have read [A few additional tips on how to run Edward Choi's medGAN](https://github.com/sylvaincom/medgan/blob/master/tips-for-medgan.md).

---
### Table of Contents

- [Loading the MIMIC-III dataset](#load-mimic)
- [Using `process_mimic.py` and `medgan.py` to generate the fake realistic data](#run)
- [How to interpret `gen-samples.npy`?](#gen-samples)
- [Comparing the (fake) generated samples to the real-life original ones](#comparison)

---
### Imports

In [None]:
import numpy as np
import os
import pickle
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import matplotlib.transforms as mtransforms

---
# Loading the MIMIC-III dataset <a name="load-mimic"></a>

## `ADMISSIONS.csv` file

In [None]:
df_adm = pd.read_csv("ADMISSIONS.csv")
print(df_adm.shape)
df_adm.head()

Do we have a lot of missing values?

In [None]:
n,p = df_adm.shape
for f in df_adm:
    percentage = sum(df_adm[f].isna())*100/n
    if percentage>0:
        print('Missing values in {}: {}%'.format(f, percentage))
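
The same check can be written without an explicit loop. The following is a minimal sketch: the helper name `missing_percentages` and the toy values are my own assumptions, and only the column names are borrowed from `ADMISSIONS.csv` for illustration.

```python
import numpy as np
import pandas as pd

def missing_percentages(df):
    """Return the percentage of missing values for each column that has any."""
    # The mean of a boolean mask is the fraction of True values, i.e. of NaN.
    pct = df.isna().mean() * 100
    return pct[pct > 0].sort_values(ascending=False)

# Toy frame standing in for ADMISSIONS.csv (values are made up).
toy = pd.DataFrame({
    "SUBJECT_ID": [1, 2, 3, 4],
    "DEATHTIME": [np.nan, np.nan, "2101-10-20", np.nan],
    "LANGUAGE": ["ENGL", np.nan, "ENGL", "SPAN"],
})
result = missing_percentages(toy)
print(result)
```

Columns without any missing value are left out of the result, as in the loop above.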

## `DIAGNOSES_ICD.csv` file

In [None]:
df_ICD = pd.read_csv('DIAGNOSES_ICD.csv')
print(df_ICD.shape)
df_ICD.head()

Do we have a lot of missing values?

In [None]:
n,p = df_ICD.shape
for f in df_ICD:
    percentage = sum(df_ICD[f].isna())*100/n
    if percentage>0:
        print('Missing values in {}: {}%'.format(f, percentage))

We check whether our dataset is balanced: does any single `ICD9_CODE` account for a distinctly larger share of the rows than the others?

In [None]:
df_ICD['ICD9_CODE'].value_counts(normalize=True).head()
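
To make the reading of `value_counts(normalize=True)` concrete, here is a small sketch on made-up codes (the toy series is an assumption, not MIMIC-III data). It shows the share of each code and the cumulative share of the most frequent ones.

```python
import pandas as pd

# Toy stand-in for the ICD9_CODE column (values are made up).
codes = pd.Series(["4019", "4019", "4280", "42731", "4019", "4280"], name="ICD9_CODE")

# Share of rows per code, sorted from most to least frequent.
shares = codes.value_counts(normalize=True)
top_share = shares.iloc[0]

# Cumulative share: how much of the data the k most frequent codes cover.
cumulative = shares.cumsum()
print(shares)
print(cumulative)
```

If the first share is close to 1, a single code dominates the dataset.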

---
# Using `process_mimic.py` and `medgan.py` to generate the fake realistic data <a name="run"></a>

This step is detailed in [A few additional tips on how to run Edward Choi's medGAN](https://github.com/sylvaincom/medgan/blob/master/tips-for-medgan.md).

In short, in the Anaconda prompt, we run:
```
cd C:\Users\<username>\Documents\medgan-master
python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv training-data "binary"
mkdir generated
python medgan.py training-data.matrix ./generated/samples --data_type="binary"
python medgan.py training-data.matrix gen-samples --model_file=./generated/samples-999 --generate_data=True
```
Two of the default values are `n_epoch=1000` and `n_pretrain_epoch=100`. With these settings, the computation took a few hours on my machine.

From now on, whenever we refer to input or output, we refer to the input and output of `medgan.py` (unless specified otherwise).

---
# How to interpret `gen-samples.npy`? <a name="gen-samples"></a>

We load the `gen-samples.npy` file which is `medgan.py`'s output:

In [None]:
output = np.load('gen-samples.npy')
df_output = pd.DataFrame(output)
print(df_output.shape)
df_output.head()

Some questions about this data frame:
* What do the columns correspond to? They look like neither `ADMISSIONS.csv` nor `DIAGNOSES_ICD.csv`.
* What do the rows correspond to?
* Why are the values not binary?

We can find some answers in an issue opened in Edward Choi's GitHub: [How to interpret the samples?](https://github.com/mp2893/medgan/issues/3). In order to understand the output `gen-samples.npy` of `medgan.py`, we are going to go back to the input of `medgan.py`: the output of `process_mimic.py`.

As in the `.matrix` file, each row of `gen-samples.npy` corresponds to a single patient (here a synthetic one) and each column corresponds to a specific ICD9 diagnosis code. We can use the `.types` file created by `process_mimic.py` to map each column to a specific ICD9 diagnosis code. See the header comment at the top of `process_mimic.py` for more information about the `.types` file:
```python
# Output files
# <output file>.pids: cPickled Python list of unique Patient IDs. Used for intermediate processing
# <output file>.matrix: Numpy float32 matrix. Each row corresponds to a patient. Each column corresponds to a ICD9 diagnosis code.
# <output file>.types: cPickled Python dictionary that maps string diagnosis codes to integer diagnosis codes.
```
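
The column-to-code mapping described above can be sketched as follows. This is only an illustration: the three dictionary keys and the sample values are made up, and the names `toy_types`, `index_to_code` and `df_labeled` are my own.

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt of a `.types` dictionary: string code -> column index.
# The real keys are produced by process_mimic.py; these three are made up.
toy_types = {"D_401": 0, "D_428": 1, "D_250": 2}

# Invert the mapping so that a column index gives back its diagnosis code.
index_to_code = {idx: code for code, idx in toy_types.items()}

# Toy stand-in for gen-samples.npy: 2 synthetic patients x 3 codes.
toy_samples = np.array([[0.9, 0.1, 0.0],
                        [0.2, 0.8, 0.7]], dtype=np.float32)

# Label each column with the code it represents.
df_labeled = pd.DataFrame(
    toy_samples,
    columns=[index_to_code[i] for i in range(toy_samples.shape[1])],
)
print(df_labeled)
```

With the real files, `toy_types` would be replaced by the dictionary loaded from the `.types` file and `toy_samples` by the array loaded from `gen-samples.npy`.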

What is ICD-9? See [ICD-9](https://en.wikipedia.org/wiki/International_Statistical_Classification_of_Diseases_and_Related_Health_Problems#ICD-9) and [List of ICD-9 codes](https://en.wikipedia.org/wiki/List_of_ICD-9_codes).

We need to round the values ourselves:

In [None]:
df_output = df_output.round(0)
df_output.head()
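
Rounding follows NumPy's round-half-to-even rule, so a value of exactly 0.5 is rounded down to 0. If we prefer to control that cut-off ourselves, an explicit threshold is an alternative. The sketch below is an assumption on my part (the helper name `binarize` and the 0.5 threshold are not taken from `medgan.py`).

```python
import numpy as np

def binarize(samples, threshold=0.5):
    """Map every value >= threshold to 1 and every other value to 0."""
    return (np.asarray(samples) >= threshold).astype(np.int8)

# Toy generated values in [0, 1] (made up).
toy = np.array([[0.02, 0.5, 0.97],
                [0.49, 0.51, 0.0]])

binary = binarize(toy)
print(binary)

# For comparison: np.round sends exactly 0.5 to 0 (round half to even).
print(np.round(toy))
```

The two approaches only differ for values that are exactly at the threshold.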

If any rows contain missing values, we drop them:

In [None]:
df_output = df_output.dropna()
print(df_output.shape)

If no row is dropped, the number of rows is 10000, since line 406 of `medgan.py` sets `nSamples=10000`.

## Understanding the `.types` file (an output of `process_mimic.py`)

_cPickled Python dictionary that maps string diagnosis codes to integer diagnosis codes._

In [None]:
# Open the file in a context manager so that it is closed after loading
with open('training-data.types', 'rb') as f:
    map_dict = pickle.load(f)
print(type(map_dict))
print('An excerpt is:', dict(list(map_dict.items())[0:5]))

Thus, as its name suggests, `process_mimic.py` is specific to the MIMIC-III dataset. We will probably not use `process_mimic.py` on our own dataset and will only run `medgan.py`. From `process_mimic.py`, we only need to understand how the generated `.matrix` file is constructed (lines 109 to 119).
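
For a dataset other than MIMIC-III, a matrix with the same layout (one row per patient, one binary column per code) could be built along these lines. This is a minimal sketch under my own assumptions, not the code of `process_mimic.py`: the function name `build_binary_matrix` and the toy input are hypothetical.

```python
import numpy as np

def build_binary_matrix(patient_codes):
    """Build a binary patient-by-code matrix.

    patient_codes: list with one iterable of string codes per patient.
    Returns the float32 matrix and the code -> column index dictionary.
    """
    # Assign a column index to each code in order of first appearance.
    types = {}
    for codes in patient_codes:
        for code in codes:
            if code not in types:
                types[code] = len(types)

    # Set a 1 wherever a patient has the code at least once.
    matrix = np.zeros((len(patient_codes), len(types)), dtype=np.float32)
    for row, codes in enumerate(patient_codes):
        for code in codes:
            matrix[row, types[code]] = 1.0
    return matrix, types

# Toy input: three patients with made-up codes.
toy_patients = [["A", "B"], ["B"], ["C", "A", "A"]]
matrix, types = build_binary_matrix(toy_patients)
print(types)
print(matrix)
```

A repeated code (like `"A"` for the third patient) still gives a 1, which is what a binary matrix requires.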

## Understanding the `.pids` file (an output of `process_mimic.py`)

_cPickled Python list of unique Patient IDs. Used for intermediate processing_

In [None]:
with open('training-data.pids', 'rb') as f:
    id_list = pickle.load(f)
print(type(id_list))
print('An excerpt is:', id_list[:10])

## Understanding the `.matrix` file (an output of `process_mimic.py` and the input of `medgan.py`)

_Numpy float32 matrix. Each row corresponds to a patient. Each column corresponds to a ICD9 diagnosis code._

In [None]:
with open('training-data.matrix', 'rb') as f:
    input_data_array = pickle.load(f)
print(type(input_data_array))
input_data_array

In [None]:
df_input = pd.DataFrame(input_data_array)
print(df_input.shape)
df_input.head(10)

As requested with the `"binary"` argument of `process_mimic.py`, the input data is binary.

We can note that the input of `medgan.py` and the [output](#gen-samples) of `medgan.py` have the same number of columns and, once the output is rounded, the same type of values (binary). Thus, `gen-samples.npy` is a (fake) realistic generated dataset corresponding to the `.matrix` file.
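
This observation can be turned into a quick sanity check. The helper below is a sketch of my own (`check_same_layout` is a hypothetical name) and runs on toy arrays rather than on the real files.

```python
import numpy as np

def check_same_layout(real, generated):
    """Return True if both arrays are binary and have the same number of columns."""
    real = np.asarray(real)
    generated = np.asarray(generated)
    same_columns = real.shape[1] == generated.shape[1]
    both_binary = np.isin(real, [0, 1]).all() and np.isin(generated, [0, 1]).all()
    return bool(same_columns and both_binary)

# Toy stand-ins: 2 real patients and 3 generated patients, 3 codes each.
toy_real = np.array([[1, 0, 0], [0, 1, 1]], dtype=np.float32)
toy_generated = np.array([[0, 0, 1], [1, 1, 0], [0, 0, 0]], dtype=np.float32)

layout_ok = check_same_layout(toy_real, toy_generated)
# Values that were not rounded are not binary, so the check fails.
unrounded_ok = check_same_layout(toy_real, toy_generated * 0.3)
print(layout_ok, unrounded_ok)
```

The number of rows is allowed to differ, since the number of generated samples does not depend on the number of real patients.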

---
# Comparing the (fake) generated samples to the real-life original ones  <a name="comparison"></a>

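In this section, we assess how closely the (fake) generated dataset matches the original one. As in Choi's paper, we use dimension-wise probability: for each column (diagnosis code), we compare the proportion of `1`s in the real data with the proportion in the generated data.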

## Probability distribution of input data

In [None]:
n_input, p_input = df_input.shape
print(n_input, p_input)

In [None]:
input_freq_list = df_input.sum().tolist()

plt.plot(input_freq_list)
plt.xlabel('Index of variable')
plt.ylabel('Frequency of 1')
plt.title('input_data_pd')
plt.show()

In [None]:
proba_input = [sum(df_input[f])/n_input for f in list(df_input)]

For each feature (dimension), we take the proportion of `1`s as the Bernoulli success probability _p_.
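
For a binary column, this proportion is simply the column mean. The toy example below (made-up values) illustrates that the sum divided by the number of rows and the pandas `mean` give the same result.

```python
import pandas as pd

# Toy binary data: 4 patients x 3 codes (made up).
toy = pd.DataFrame([[1, 0, 0],
                    [1, 1, 0],
                    [0, 1, 0],
                    [1, 0, 0]])
n_rows = toy.shape[0]

# Proportion of 1s per column, computed two equivalent ways.
p_by_sum = [sum(toy[f]) / n_rows for f in list(toy)]
p_by_mean = toy.mean().tolist()
print(p_by_sum)
print(p_by_mean)
```

On a large matrix, the vectorized `mean` is usually much faster than summing each column in a Python loop.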

In [None]:
plt.plot(proba_input)
plt.xlabel('Index of variable')
plt.ylabel('Bernoulli success probability')
plt.title('df_input')
plt.show()

## Probability distribution of output data

In [None]:
n_output, p_output = df_output.shape
print(n_output, p_output)

In [None]:
proba_output = [sum(df_output[f])/n_output for f in list(df_output)]

In [None]:
plt.plot(proba_output)
plt.xlabel('Index of variable')
plt.ylabel('Bernoulli success probability')
plt.title('df_output')
plt.show()

## Comparison: dimension-wise probability

In [None]:
fig, ax = plt.subplots()
ax.scatter(proba_input, proba_output, c='black', label='Bernoulli success probability')
# Draw the line y = x in data coordinates, so that it stays the true diagonal
# even when the x and y axes end up with different limits
line = mlines.Line2D([0, 1], [0, 1], color='red')
ax.add_line(line)

plt.title('dimension-wise probability performance of medGAN')
plt.xlabel('for the real data')
plt.ylabel('for the (fake) generated data')
plt.legend()
plt.show()

The diagonal red line indicates the ideal performance, where the real and the (fake) generated data have identical dimension-wise probabilities. The closer the points lie to this line, the better. Based on the previous graph, we can say that medGAN performs really well on this metric.
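
To complement the visual impression, the two lists of probabilities can also be summarized with a few numbers. The sketch below is my own addition, not a metric taken from Choi's paper: the function name `dimension_wise_summary` and the toy values are assumptions.

```python
import numpy as np

def dimension_wise_summary(p_real, p_fake):
    """Summarize how close two lists of per-dimension probabilities are."""
    p_real = np.asarray(p_real, dtype=float)
    p_fake = np.asarray(p_fake, dtype=float)
    abs_error = np.abs(p_real - p_fake)
    return {
        # Pearson correlation between real and generated probabilities.
        "pearson_r": float(np.corrcoef(p_real, p_fake)[0, 1]),
        # Average and worst-case vertical distance to the diagonal.
        "mean_abs_error": float(abs_error.mean()),
        "max_abs_error": float(abs_error.max()),
    }

# Toy probabilities for 4 codes (made up).
toy_real = [0.10, 0.30, 0.05, 0.60]
toy_fake = [0.12, 0.27, 0.05, 0.58]
summary = dimension_wise_summary(toy_real, toy_fake)
print(summary)
```

With the real data, `toy_real` and `toy_fake` would be replaced by `proba_input` and `proba_output`. A correlation close to 1 together with small absolute errors corresponds to points lying close to the red diagonal.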

Back to [top](#top).