# CTGAN for Synthetic Data Generation using MIMIC-III

## Steps:
1. Load & Preprocess MIMIC-III Data

2. Train CTGAN on MIMIC-III Data

3. Generate & Evaluate Synthetic Data

Install and import libraries

In [1]:
!pip install --upgrade ctgan sdv

Collecting ctgan
  Downloading ctgan-0.11.0-py3-none-any.whl.metadata (10 kB)
Collecting sdv
  Downloading sdv-1.18.0-py3-none-any.whl.metadata (13 kB)

In [2]:
# Dataframe
import pandas as pd

# MetaData
from sdv.metadata import Metadata
# Model
from sdv.single_table import CTGANSynthesizer

# Evaluation
from sdv.evaluation.single_table import run_diagnostic
from sdv.evaluation.single_table import evaluate_quality

## Step 1: Load & Preprocess MIMIC-III Data

*Note: Make sure the following files are uploaded to Google Colab: ADMISSIONS.csv.gz, PATIENTS.csv.gz, ICUSTAYS.csv.gz*

In [3]:
# Load MIMIC-III CSV.GZ files
admissions = pd.read_csv("ADMISSIONS.csv.gz", compression="gzip")
patients = pd.read_csv("PATIENTS.csv.gz", compression="gzip")
icustays = pd.read_csv("ICUSTAYS.csv.gz", compression="gzip")

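# Sample a subset for the assignment.
# NOTE: each table is sampled independently, so most SUBJECT_ID / HADM_ID keys
# will not find a match in the left merges below. Unmatched rows end up with
# missing AGE / ICU_LOS, which are later filled with the -1 placeholder.
admissions = admissions.sample(2000)
patients = patients.sample(2000)
icustays = icustays.sample(2000)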

# Select relevant columns
admissions = admissions[["SUBJECT_ID", "HADM_ID", "ADMISSION_TYPE", "INSURANCE", "ETHNICITY", "ADMITTIME"]]
patients = patients[["SUBJECT_ID", "DOB"]]
icustays = icustays[["HADM_ID", "LOS"]]

# Convert date columns to datetime
admissions["ADMITTIME"] = pd.to_datetime(admissions["ADMITTIME"])
patients["DOB"] = pd.to_datetime(patients["DOB"])

# Compute approximate age at admission (difference in calendar years only)
admissions = admissions.merge(patients, on="SUBJECT_ID", how="left")
admissions["AGE"] = admissions["ADMITTIME"].dt.year - admissions["DOB"].dt.year

# Merge with ICU stays to get Length of Stay (LOS)
df = admissions.merge(icustays, on="HADM_ID", how="left").rename(columns={"LOS": "ICU_LOS"})

# Keep only relevant columns (.copy() so later assignments modify an independent frame)
df = df[["AGE", "ICU_LOS", "ADMISSION_TYPE", "INSURANCE", "ETHNICITY"]].copy()

# Handle missing values: assign the result back rather than calling
# fillna(inplace=True) on a column selection, which pandas 3.0 will no longer support
for col in df.columns:
    if df[col].dtype == "object":  # Categorical columns
        df[col] = df[col].fillna("Unknown")
    else:  # Numerical columns (AGE, ICU_LOS)
        df[col] = df[col].fillna(-1)  # -1 is a placeholder for missing values

# Convert categorical variables to strings
categorical_columns = ["ADMISSION_TYPE", "INSURANCE", "ETHNICITY"]
df[categorical_columns] = df[categorical_columns].astype(str)

# Save processed data
df.to_csv("mimic_ctgan_data.csv", index=False)

print("MIMIC-III data preprocessed and saved for CTGAN training!")
print(df.head())

MIMIC-III data preprocessed and saved for CTGAN training!
   AGE  ICU_LOS ADMISSION_TYPE INSURANCE ETHNICITY
0 -1.0     -1.0      EMERGENCY   Private     WHITE
1 -1.0     -1.0      EMERGENCY   Private     WHITE
2 -1.0     -1.0      EMERGENCY   Private     WHITE
3 -1.0     -1.0        NEWBORN   Private     WHITE
4 -1.0     -1.0      EMERGENCY  Medicare     WHITE
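
In the preprocessing cell above, each table is sampled before the merges, which is the likely reason every previewed row shows the -1 placeholder for AGE and ICU_LOS. The sketch below illustrates one alternative: join the full tables first, then sample. It uses small made-up tables in place of the MIMIC-III files, and the inner joins, the birthday-adjusted age, and `random_state=0` are assumptions added for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the MIMIC-III tables (values are made up for illustration).
admissions = pd.DataFrame({
    "SUBJECT_ID": [1, 2, 3, 4],
    "HADM_ID": [10, 20, 30, 40],
    "ADMITTIME": pd.to_datetime(["2150-03-01", "2151-07-15", "2149-01-20", "2152-11-05"]),
})
patients = pd.DataFrame({
    "SUBJECT_ID": [1, 2, 3, 4],
    "DOB": pd.to_datetime(["2100-06-10", "2090-07-15", "2080-12-31", "2110-01-01"]),
})
icustays = pd.DataFrame({"HADM_ID": [10, 20, 30, 40], "LOS": [1.5, 3.2, 0.8, 7.1]})

# Join the full tables first so every admission can find its patient and ICU stay.
merged = (
    admissions
    .merge(patients, on="SUBJECT_ID", how="inner")
    .merge(icustays, on="HADM_ID", how="inner")
    .rename(columns={"LOS": "ICU_LOS"})
)

# Age in whole years: subtract one if the birthday has not yet occurred
# in the admission year.
before_birthday = (
    (merged["ADMITTIME"].dt.month < merged["DOB"].dt.month)
    | (
        (merged["ADMITTIME"].dt.month == merged["DOB"].dt.month)
        & (merged["ADMITTIME"].dt.day < merged["DOB"].dt.day)
    )
)
merged["AGE"] = (
    merged["ADMITTIME"].dt.year - merged["DOB"].dt.year - before_birthday.astype(int)
)

# Sample only after joining; random_state makes the subset reproducible.
subset = merged.sample(n=3, random_state=0)
print(subset[["AGE", "ICU_LOS"]])
```

Sampling after the join keeps AGE and ICU_LOS populated for every sampled row, so the -1 placeholder is only needed for values that are genuinely missing in the source tables.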




## Step 2: Train CTGAN on MIMIC-III Data

*Note: The code below trains the GAN and may take some time. If it runs too slowly, switch the runtime to GPU or reduce the number of epochs.*

In [4]:
# Load preprocessed data (same relative path it was saved to in Step 1)
df = pd.read_csv("mimic_ctgan_data.csv")

# Define metadata (categorical vs. numerical)
metadata = Metadata.detect_from_dataframe(data=df, table_name='ctgan_data')

# Initialize and train CTGAN
ctgan = CTGANSynthesizer(metadata, enforce_rounding=False, epochs=70, verbose=True)
print("Training CTGAN...")
ctgan.fit(df)

# Save trained model
import pickle
with open("ctgan_model.pkl", "wb") as f:
    pickle.dump(ctgan, f)

print("CTGAN training complete! Model saved.")



Training CTGAN...


Gen. (0.43) | Discrim. (-0.07): 100%|██████████| 70/70 [00:07<00:00,  9.28it/s]

CTGAN training complete! Model saved.





In [5]:
ctgan.get_loss_values_plot()

## Step 3: Generate & Evaluate Synthetic Data

### Generate synthetic data

In [6]:
# Generate synthetic data
synthetic_data = ctgan.sample(num_rows=1000)
synthetic_data.to_csv("synthetic_mimic_data.csv", index=False)
print("Synthetic data generated and saved!")

Synthetic data generated and saved!


In [7]:
print(synthetic_data)
synthetic_data.head()

          AGE   ICU_LOS ADMISSION_TYPE   INSURANCE  \
0   -1.000000 -0.999069      EMERGENCY    Medicare   
1   -0.406962 -0.980822       ELECTIVE     Private   
2   -0.605185 -0.971455      EMERGENCY  Government   
3   -0.439996 -0.988964      EMERGENCY    Medicaid   
4   -1.000000 -1.000000      EMERGENCY  Government   
..        ...       ...            ...         ...   
995 -0.641055 -0.883165      EMERGENCY    Medicaid   
996 -1.000000 -1.000000      EMERGENCY    Medicare   
997 -0.842612 -0.919330      EMERGENCY    Self Pay   
998 -1.000000 -0.961767      EMERGENCY     Private   
999  0.356325 -0.982706      EMERGENCY     Private   

                          ETHNICITY  
0                             WHITE  
1                             WHITE  
2                             WHITE  
3             UNKNOWN/NOT SPECIFIED  
4                             WHITE  
..                              ...  
995           UNKNOWN/NOT SPECIFIED  
996            

        AGE   ICU_LOS ADMISSION_TYPE   INSURANCE              ETHNICITY
0 -1.000000 -0.999069      EMERGENCY    Medicare                  WHITE
1 -0.406962 -0.980822       ELECTIVE     Private                  WHITE
2 -0.605185 -0.971455      EMERGENCY  Government                  WHITE
3 -0.439996 -0.988964      EMERGENCY    Medicaid  UNKNOWN/NOT SPECIFIED
4 -1.000000 -1.000000      EMERGENCY  Government                  WHITE


The records above are synthetic rather than real patient records, which helps mitigate data privacy concerns.
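
The previewed rows all carry negative AGE and ICU_LOS values, which suggests the model has learned the -1 placeholder used for missing values. A quick way to measure how much of a sample is affected is sketched below. It uses only the five preview rows shown above, copied by hand, and `negative_share` is a hypothetical helper name introduced for this example.

```python
import pandas as pd

# The five preview rows shown above, copied by hand (numeric columns only).
preview = pd.DataFrame({
    "AGE": [-1.0, -0.406962, -0.605185, -0.439996, -1.0],
    "ICU_LOS": [-0.999069, -0.980822, -0.971455, -0.988964, -1.0],
})

def negative_share(frame, columns):
    """Return the fraction of rows with a negative value, per column."""
    return {col: float((frame[col] < 0).mean()) for col in columns}

print(negative_share(preview, ["AGE", "ICU_LOS"]))
```

Running the same helper on the full `synthetic_data` frame would show what share of the 1000 sampled rows falls in the physically impossible negative range.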

### Evaluate data

In [8]:
def evaluate_synthetic_data(real_data, synthetic_data, metadata):
    """Evaluate the quality of generated synthetic data."""
    diagnostic = run_diagnostic(real_data=real_data, synthetic_data=synthetic_data, metadata=metadata)
    quality_report = evaluate_quality(real_data, synthetic_data, metadata)
    return diagnostic, quality_report

In [9]:
diagnostic, quality_report = evaluate_synthetic_data(df, synthetic_data, metadata)

Generating report ...

(1/2) Evaluating Data Validity: |██████████| 5/5 [00:00<00:00, 573.49it/s]|
Data Validity Score: 100.0%

(2/2) Evaluating Data Structure: |██████████| 1/1 [00:00<00:00, 433.25it/s]|
Data Structure Score: 100.0%

Overall Score (Average): 100.0%

Generating report ...

(1/2) Evaluating Column Shapes: |██████████| 5/5 [00:00<00:00, 103.93it/s]|
Column Shapes Score: 66.72%

(2/2) Evaluating Column Pair Trends: |██████████| 10/10 [00:00<00:00, 146.58it/s]|
Column Pair Trends Score: 79.18%

Overall Score (Average): 72.95%



In [10]:
quality_report.get_details('Column Shapes')

           Column        Metric   Score
0             AGE  KSComplement  0.4790
1         ICU_LOS  KSComplement  0.3935
2  ADMISSION_TYPE  TVComplement  0.9035
3       INSURANCE  TVComplement  0.8345
4       ETHNICITY  TVComplement  0.7255
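
To make the scores above easier to read, the following is a from-scratch sketch of the two metrics as they are commonly defined: KSComplement is one minus the two-sample Kolmogorov-Smirnov statistic, and TVComplement is one minus the total variation distance between category frequencies. The function names and toy inputs are assumptions for illustration, and this is not the SDMetrics implementation. The last lines check that averaging the per-column scores reproduces the reported Column Shapes score, and that averaging the two property scores reproduces the overall score.

```python
from collections import Counter

def ks_complement(real, synthetic):
    """1 minus the largest gap between the two empirical CDFs."""
    points = sorted(set(real) | set(synthetic))

    def ecdf(sample, x):
        # Fraction of values in the sample that are <= x.
        return sum(v <= x for v in sample) / len(sample)

    ks_stat = max(abs(ecdf(real, x) - ecdf(synthetic, x)) for x in points)
    return 1.0 - ks_stat

def tv_complement(real, synthetic):
    """1 minus the total variation distance between category frequencies."""
    p, q = Counter(real), Counter(synthetic)
    categories = set(p) | set(q)
    tvd = 0.5 * sum(
        abs(p[c] / len(real) - q[c] / len(synthetic)) for c in categories
    )
    return 1.0 - tvd

# Toy inputs: identical samples would score 1.0, disjoint ones 0.0.
print(ks_complement([1, 2, 3, 4], [3, 4, 5, 6]))            # 0.5
print(tv_complement(["A", "A", "B", "B"], ["A", "B", "B", "B"]))  # 0.75

# Per-column scores from the report above.
column_scores = [0.479, 0.3935, 0.9035, 0.8345, 0.7255]
column_shapes = sum(column_scores) / len(column_scores)
print(round(column_shapes, 4))              # 0.6672, i.e. 66.72%

# Overall quality score is the average of the two property scores.
print(round((66.72 + 79.18) / 2, 2))        # 72.95
```

Read this way, the low Column Shapes score is driven by the two numerical columns, while the three categorical columns score between 0.73 and 0.90.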


In [11]:
# Visualization
from sdv.evaluation.single_table import get_column_plot

def plot_column_distributions(real_data, synthetic_data, metadata, column_names):
    """Plot distributions of specified columns between real and synthetic data."""
    for col in column_names:
        fig = get_column_plot(real_data=real_data, synthetic_data=synthetic_data, metadata=metadata, column_name=col)
        fig.show()

plot_column_distributions(df, synthetic_data, metadata, df.columns)

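Note: The synthetic data only partly matches the real MIMIC-III distributions. The categorical columns score well (TVComplement between 0.73 and 0.90), while the numerical columns AGE and ICU_LOS score poorly (KSComplement of 0.48 and 0.39).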

## Conclusion

To address the data privacy challenges of healthcare datasets like MIMIC-III, we can generate a synthetic dataset with CTGAN and evaluate its quality. The synthetic dataset can then be shared and used with fewer privacy restrictions, easing some of the challenges of working with healthcare data.