<a name="top"></a>
<br/>
# Understanding how `medGAN` works (on the MIMIC-III dataset with binary values)

Author: [Sylvain Combettes](https://github.com/sylvaincom). <br/>
Last update: Aug 29, 2019. Created: Aug 12, 2019. <br/>
My own medGAN repository (that is based on Edward Choi's work): [medgan](https://github.com/sylvaincom/medgan-tips). <br/>
Edward Choi's original repository: [medgan](https://github.com/mp2893/medgan).

The final goal of my project is to use `medGAN` on my own dataset (electronic health records). Hence, I first need to understand how the `medGAN` program works. In this notebook, I provide a few code cells and explanations to help better understand and run `medGAN`. Because there are some confidentiality issues with the MIMIC-III dataset, I cleared the output of the cells.

Before reading this notebook, make sure that you have read the table of contents of my [medGAN repository](https://github.com/sylvaincom/medgan-tips).

We will use the MIMIC-III dataset and process it so that we only have binary values.

---
### Table of contents

- [Loading the MIMIC-III dataset](#load-mimic)
- [Using `process_mimic.py` and `medgan.py` to generate the fake realistic data](#run)
- [How can one interpret the output of `medgan.py`?](#gen-samples)
- [How can one interpret the output of `process_mimic.py`?](#input)
- [Comparing the (fake) generated samples to the real-life original ones](#comparison)

---
### Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os
import pickle

---
# Loading the MIMIC-III dataset <a name="load-mimic"></a>

## `ADMISSIONS.csv` file

In [None]:
df_adm = pd.read_csv("ADMISSIONS.csv")
print(df_adm.shape)
df_adm.head()

Do we have a lot of missing values?

In [None]:
def observe_missing_values(df):
    # This function prints a summary and shows a plot. It does not return any values.
    
    dico = {}
    c_index = [] # index of columns of features with missing values
    n = df.shape[0]
    features = list(df) # list of features
    
    # We compute the percentage of missing values for each column:
    for f in features:
        n_missing = df[f].isna().sum() # number of missing values
        if n_missing > 0: # we test the count so that tiny percentages are not rounded away
            dico[f] = [round(n_missing*100/n, 2)] # percentage of missing values
            c_index.append(df.columns.get_loc(f))
    
    if not c_index:
        print('There are no missing values!')
        return
    
    # We construct the DataFrame ordered by Approx. missing values (%):
    df_mv = pd.DataFrame(data=dico) # mv for missing values
    idx_rename = {df_mv.index.tolist()[0]:'Approx. missing values (%)'}
    df_mv = df_mv.rename(index=idx_rename)
    
    # We print the features with a decreasing Approx. missing values (%) value:
    n_f = min(len(dico), 10) # number of features we choose to print
    df_mv_ordered = df_mv.sort_values(by=['Approx. missing values (%)'], axis=1, ascending=False)
    print('The', n_f, 'features with the most missing values are:')
    print(df_mv_ordered.iloc[0].head(n_f))
    
    # We plot the Approx. missing values (%) given the index of the feature:
    plt.plot(c_index, df_mv.iloc[0], 'o')
    plt.xlabel('Index of feature')
    plt.ylabel('Approx. missing values (%)')
    plt.title('Observing missing values')
    plt.show()

In [None]:
observe_missing_values(df_adm)
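
For a quick check without the plot, the same percentages can be obtained in a couple of lines. The following is only a minimal sketch on a toy DataFrame (the column names and values are hypothetical, not taken from MIMIC-III); it relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy DataFrame (hypothetical values, not taken from MIMIC-III).
df_toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 0.0],
    "c": [1, 2, 3, 4],
})

# isna() gives a boolean frame; its column-wise mean is the fraction of missing values.
missing_pct = (df_toy.isna().mean() * 100).round(2)

# Keep only the columns that actually have missing values, largest first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Here, column `c` is dropped from the summary because it has no missing values.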

## `DIAGNOSES_ICD.csv` file

In [None]:
df_ICD = pd.read_csv('DIAGNOSES_ICD.csv')
print(df_ICD.shape)
df_ICD.head()

Do we have a lot of missing values?

In [None]:
observe_missing_values(df_ICD)

We check whether the dataset is balanced: does any single `ICD9_CODE` account for a much larger share of the rows than the others?

In [None]:
df_ICD['ICD9_CODE'].value_counts(normalize=True).head()

---
# Using `process_mimic.py` and `medgan.py` to generate the fake realistic data <a name="run"></a>

This step is detailed in [A few additional tips on how to run Edward Choi's medGAN](https://github.com/sylvaincom/medgan/blob/master/tips-for-medgan.md).

In short, in the Anaconda prompt, we run:
```
cd C:\Users\<username>\Documents\mimic_binary
python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv training-data "binary"
mkdir generated
python medgan.py training-data.matrix ./generated/samples --data_type="binary"
python medgan.py training-data.matrix gen-samples --model_file=./generated/samples-999 --generate_data=True --data_type="binary"
```
Some default values are `n_epoch=1000`, `n_pretrain_epoch=100` and `batch_size=1000`. We set `nSamples=10000` on line 405 of `medgan.py`.
The computation took more than 5 hours on my laptop.

From now on, whenever we refer to "input" or "output", we refer to the input and output of `medgan.py` (unless specified otherwise). "input" is the original real-life dataset and "output" is the fake realistic generated dataset.

---
# How can one interpret the output of `medgan.py`? <a name="gen-samples"></a>

We load the `gen-samples.npy` file which is `medgan.py`'s output:

In [None]:
output = np.load('gen-samples.npy')
df_output = pd.DataFrame(output)
print(df_output.shape)
df_output.head()

Does the output of `medgan.py` have missing values?

In [None]:
observe_missing_values(df_output)

The output of `medgan.py` has no missing values!

Some questions about this data frame:
* What do the columns correspond to? They are neither those of `ADMISSIONS.csv` nor those of `DIAGNOSES_ICD.csv`.
* What do the rows correspond to?
* Why are the values not binary?

We can find some answers in an issue opened on Edward Choi's GitHub repository: [How to interpret the samples?](https://github.com/mp2893/medgan/issues/3). In order to understand the output `gen-samples.npy` of `medgan.py`, we go back to the input of `medgan.py`, which is the output of `process_mimic.py`: `training-data.matrix`.

Actually, in `gen-samples.npy`, as in the `training-data.matrix` file, each row corresponds to a single patient and each column corresponds to a specific ICD9 diagnosis code. We can use the `training-data.types` file created by `process_mimic.py` to map each column to a specific ICD9 diagnosis code. Read the beginning part of the source code of `process_mimic.py` for more information about these files:
```python
# Output files
# <output file>.pids: cPickled Python list of unique Patient IDs. Used for intermediate processing
# <output file>.matrix: Numpy float32 matrix. Each row corresponds to a patient. Each column corresponds to a ICD9 diagnosis code.
# <output file>.types: cPickled Python dictionary that maps string diagnosis codes to integer diagnosis codes.
```
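
To make the column mapping concrete, here is a minimal sketch of how a `.types`-like dictionary could be inverted to label the columns of a matrix. The dictionary keys (`code_A`, ...) and the matrix values are hypothetical placeholders; the actual keys stored in `training-data.types` may look different.

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt of a `.types`-like dictionary: code string -> column index.
toy_types = {"code_A": 0, "code_B": 1, "code_C": 2}

# Invert the dictionary: column index -> code string.
index_to_code = {idx: code for code, idx in toy_types.items()}

# Toy matrix with one row per patient and one column per code.
toy_matrix = np.array([[1, 0, 1],
                       [0, 0, 1]], dtype=np.float32)

# Label the columns in index order.
columns = [index_to_code[i] for i in range(toy_matrix.shape[1])]
df_toy = pd.DataFrame(toy_matrix, columns=columns)
print(df_toy)
```

The same inversion could then be applied to the real dictionary and to the generated samples, since both matrices share the same column order.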

What is ICD-9? See [ICD-9](https://en.wikipedia.org/wiki/International_Statistical_Classification_of_Diseases_and_Related_Health_Problems#ICD-9) and [List of ICD-9 codes](https://en.wikipedia.org/wiki/List_of_ICD-9_codes).

We need to round the values ourselves:

In [None]:
df_output = df_output.round(0)
print(df_output.shape)
df_output.head()

The output has 10,000 rows because we set `nSamples=10000` on line 405 of `medgan.py`. It is important to note that once medGAN has learned an estimate of the distribution of the real-life original samples, we can generate as many fake realistic samples as we want.
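
Regarding the rounding step above, one detail is worth knowing: NumPy (and therefore pandas) rounds exact ties to the nearest even integer, so a value of exactly `0.5` becomes `0`. An explicit threshold makes that choice visible. The sketch below uses hypothetical toy values, not actual medGAN outputs, and the threshold of `0.5` is an assumption for illustration.

```python
import numpy as np

# Toy generator outputs in [0, 1] (hypothetical values).
toy_output = np.array([[0.02, 0.50, 0.97],
                       [0.49, 0.51, 0.00]])

# Rounding sends the exact tie 0.5 to the nearest even integer, i.e. 0.
rounded = np.round(toy_output)

# Explicit threshold: the tie is decided by the comparison operator we choose.
thresholded = (toy_output >= 0.5).astype(int)

print(rounded)
print(thresholded)
```

In practice exact ties are rare for floating-point outputs, so both approaches should give nearly identical matrices.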

---
# How can one interpret the output of `process_mimic.py`? <a name="input"></a>

## Understanding `training-data.types`

_cPickled Python dictionary that maps string diagnosis codes to integer diagnosis codes._

In [None]:
with open('training-data.types', 'rb') as f: # the file is closed automatically
    map_dict = pickle.load(f)
print(type(map_dict))
print('An excerpt is:', dict(list(map_dict.items())[0:5]))

Thus, as its name suggests, `process_mimic.py` is really dependent on the MIMIC-III dataset. We probably will not use `process_mimic.py` on our own dataset and only run `medgan.py`. Out of `process_mimic.py`, we only need to understand how the output file `training-data.matrix` is constructed.
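
Since `process_mimic.py` is tied to MIMIC-III, an equivalent matrix would have to be built by hand for another dataset. The following is only a sketch under stated assumptions: it supposes a long table with hypothetical columns `patient_id` and `code`, and it uses `pd.crosstab` rather than the logic of `process_mimic.py`, which it does not reproduce.

```python
import pandas as pd

# Toy long-format records (hypothetical): one row per (patient, code) event.
records = pd.DataFrame({
    "patient_id": [10, 10, 11, 12, 12, 12],
    "code":       ["A", "B", "A", "B", "C", "C"],
})

# Count occurrences per patient and code, then reduce the counts to 0/1.
counts = pd.crosstab(records["patient_id"], records["code"])
binary = (counts > 0).astype("float32")

# These three objects play the same roles as the .pids, .types and .matrix files.
pids = binary.index.tolist()
types = {code: i for i, code in enumerate(binary.columns)}
matrix = binary.to_numpy()

print(pids)
print(types)
print(matrix)
```

Patient `12` has code `C` twice, but the binary matrix only records that the code is present.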

## Understanding `training-data.pids`

_cPickled Python list of unique Patient IDs. Used for intermediate processing_

In [None]:
with open('training-data.pids', 'rb') as f: # the file is closed automatically
    id_list = pickle.load(f)
print(type(id_list))
print('An excerpt is:', id_list[:10])

## Understanding `training-data.matrix`

_Numpy float32 matrix. Each row corresponds to a patient. Each column corresponds to a ICD9 diagnosis code._

`training-data.matrix` is an output of `process_mimic.py` and the input of `medgan.py`.

In [None]:
with open('training-data.matrix', 'rb') as f: # the file is closed automatically
    input_data_array = pickle.load(f)
print(type(input_data_array))
input_data_array

In [None]:
df_input = pd.DataFrame(input_data_array)
print(df_input.shape)
df_input.head(5)

As we chose, the input data is binary.

Note that the input and the [output](#gen-samples) of `medgan.py` have the same number of columns and, once the output has been rounded, the same type of values (binary). Thus, `gen-samples.npy` is a (fake) realistic generated dataset that tries to resemble the `training-data.matrix` file.
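
Both observations can also be checked programmatically instead of by eye. Here is a minimal sketch on toy arrays; the helper `is_binary` and the array values are hypothetical and only serve as an illustration.

```python
import numpy as np

# Toy stand-ins for the input matrix and the raw (unrounded) generated samples.
toy_input = np.array([[1, 0, 0],
                      [0, 1, 1]], dtype=np.float32)
toy_output = np.array([[0.9, 0.1, 0.0],
                       [0.2, 0.7, 1.0],
                       [0.0, 0.0, 0.6]])

def is_binary(arr):
    """Return True if every entry of arr is exactly 0 or 1."""
    return bool(np.isin(arr, [0, 1]).all())

# The number of rows may differ, but the number of columns should match.
same_n_columns = toy_input.shape[1] == toy_output.shape[1]

print(same_n_columns)
print(is_binary(toy_input), is_binary(toy_output), is_binary(np.round(toy_output)))
```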

Does the input of `medgan.py` have missing values?

In [None]:
observe_missing_values(df_input)

The input of `medgan.py` has no missing values!

---
# Comparing the (fake) generated samples to the real-life original ones  <a name="comparison"></a>

In this section, we assess how closely the (fake) generated dataset matches the original one. As in Choi's paper, we use dimension-wise probability (because the variables are binary).

Given that our data is binary, for each feature (dimension), we treat `1` as a success and `0` as a failure. Hence the proportion of `1` in a column is an estimate of the Bernoulli success probability _p_ of that feature.

## Probability distribution of the input data

In [None]:
n_input, p_input = df_input.shape
print(n_input, p_input)

proba_input = [sum(df_input[f])/n_input for f in list(df_input)]

plt.plot(proba_input, 'o')
plt.xlabel('Index of feature')
plt.ylabel('Bernoulli success probability')
plt.title('For the real-life dataset')
plt.show()

## Probability distribution of the output data

In [None]:
n_output, p_output = df_output.shape
print(n_output, p_output)

proba_output = [sum(df_output[f])/n_output for f in list(df_output)]

plt.plot(proba_output, 'o')
plt.xlabel('Index of feature')
plt.ylabel('Bernoulli success probability')
plt.title('For the (fake) generated dataset')
plt.show()

## Comparison: dimension-wise probability

In [None]:
start = min(np.min(proba_input), np.min(proba_output))
stop = max(np.max(proba_input), np.max(proba_output))
p = df_input.shape[1]
X = np.linspace(start, stop, num=p+1)

plt.plot(proba_input, proba_output, 'ok', X, X, '-g');

plt.legend(['Bernoulli success probability', 'ideal Bernoulli success probability'])
plt.title('Dimension-wise probability performance of medGAN')
plt.xlabel('For the real dataset')
plt.ylabel('For the (fake) generated dataset')
plt.show()

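Each dot corresponds to one feature (ICD9 code): its $x$-coordinate is the Bernoulli success probability in the real dataset and its $y$-coordinate is the one in the (fake) generated dataset.

The diagonal green line indicates the ideal performance where the real and the (fake) realistic generated data show identical quality. Based on the previous graph, we can say that medGAN performs really well on this criterion.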
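
To complement the visual check, the distance to the diagonal can be summarized with a few numbers. The sketch below is an addition for illustration, not a metric taken from this notebook: it computes the dimension-wise probabilities of two hypothetical toy matrices, then their mean absolute difference and Pearson correlation.

```python
import numpy as np

# Toy binary matrices (hypothetical): rows are patients, columns are codes.
real = np.array([[1, 0, 0],
                 [1, 1, 0],
                 [0, 1, 0],
                 [1, 0, 0]])
fake = np.array([[1, 0, 0],
                 [1, 0, 0],
                 [0, 1, 0],
                 [1, 1, 1]])

# For a 0/1 column, the mean is the proportion of ones.
p_real = real.mean(axis=0)  # [0.75, 0.5, 0.0]
p_fake = fake.mean(axis=0)  # [0.75, 0.5, 0.25]

# Mean absolute difference: 0 would mean that every dot lies on the diagonal.
mae = np.abs(p_real - p_fake).mean()

# Pearson correlation between the two probability vectors.
corr = np.corrcoef(p_real, p_fake)[0, 1]

print(mae, corr)
```

A small mean absolute difference together with a correlation close to 1 would agree with dots lying near the green line.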

---
Back to [top](#top).