# CTGAN for Synthetic Data Generation using MIMIC-III

## Steps:
1. Load & Preprocess MIMIC-III Data

2. Train CTGAN on MIMIC-III Data

3. Generate & Evaluate Synthetic Data

Install and import libraries

In [None]:
!pip install --upgrade ctgan sdv kaleido



In [None]:

# Dataframe
import pandas as pd
import pickle

# MetaData
from sdv.metadata import Metadata
# Model
from sdv.single_table import CTGANSynthesizer

# Evaluation
from sdv.evaluation.single_table import run_diagnostic
from sdv.evaluation.single_table import evaluate_quality

## Step 1: Load & Preprocess MIMIC-III Data

*Note: Make sure the following files are uploaded to Google Colab: ADMISSIONS.csv.gz, PATIENTS.csv.gz, ICUSTAYS.csv.gz.*
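Before loading, it can help to confirm that the three files are actually present in the working directory. The following is a minimal sketch using only the standard library; the helper name `find_missing_files` and the `REQUIRED_FILES` list are illustrative and not part of the original notebook.

```python
from pathlib import Path

# File names taken from the note above
REQUIRED_FILES = ["ADMISSIONS.csv.gz", "PATIENTS.csv.gz", "ICUSTAYS.csv.gz"]

def find_missing_files(filenames, directory="."):
    """Return the names in `filenames` that do not exist in `directory`."""
    base = Path(directory)
    return [name for name in filenames if not (base / name).is_file()]

missing = find_missing_files(REQUIRED_FILES)
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All MIMIC-III files found.")
```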

In [None]:
# Load MIMIC-III CSV.GZ files
admissions = pd.read_csv("ADMISSIONS.csv.gz", compression="gzip")
patients = pd.read_csv("PATIENTS.csv.gz", compression="gzip")
icustays = pd.read_csv("ICUSTAYS.csv.gz", compression="gzip")

# # Sample subset for Assignment
# admissions = admissions.sample(2000)
# patients = patients.sample(2000)
# icustays = icustays.sample(2000)

# Select relevant columns
admissions = admissions[["SUBJECT_ID", "HADM_ID", "ADMISSION_TYPE", "INSURANCE", "ETHNICITY", "ADMITTIME"]]
patients = patients[["SUBJECT_ID", "DOB"]]
icustays = icustays[["HADM_ID", "LOS"]]

# Convert date columns to datetime
admissions["ADMITTIME"] = pd.to_datetime(admissions["ADMITTIME"])
patients["DOB"] = pd.to_datetime(patients["DOB"])

# Compute age at admission
admissions = admissions.merge(patients, on="SUBJECT_ID", how="left")
admissions["AGE"] = admissions["ADMITTIME"].dt.year - admissions["DOB"].dt.year

# Merge with ICU stays to get Length of Stay (LOS)
df = admissions.merge(icustays, on="HADM_ID", how="left").rename(columns={"LOS": "ICU_LOS"})

# Keep only relevant columns
df = df[["AGE", "ICU_LOS", "ADMISSION_TYPE", "INSURANCE", "ETHNICITY"]]

# Handle missing values
# Assign the filled column back to df; calling fillna(..., inplace=True) on df[col]
# operates on a copy and raises a pandas FutureWarning
for col in df.columns:
    if df[col].dtype == "object":  # Categorical columns
        df[col] = df[col].fillna("Unknown")
    else:  # Numerical columns (e.g., ICU_LOS)
        df[col] = df[col].fillna(-1)  # Use -1 as a placeholder for missing values

# Convert categorical variables to strings
categorical_columns = ["ADMISSION_TYPE", "INSURANCE", "ETHNICITY"]
df[categorical_columns] = df[categorical_columns].astype(str)

# Save processed data
df.to_csv("mimic_ctgan_data.csv", index=False)

print("MIMIC-III data preprocessed and saved for CTGAN training!")

MIMIC-III data preprocessed and saved for CTGAN training!
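One thing to watch in the merge step: an admission can have more than one ICU stay, so a left merge on `HADM_ID` repeats that admission once per stay. The sketch below uses small made-up frames (not MIMIC-III data) to show the effect and one possible remedy, summing `LOS` per admission before merging. Summing is an assumption made for illustration; taking the first or longest stay would be equally defensible.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are invented)
admissions_demo = pd.DataFrame({
    "HADM_ID": [100, 101, 102],
    "ADMISSION_TYPE": ["EMERGENCY", "ELECTIVE", "EMERGENCY"],
})
icustays_demo = pd.DataFrame({
    "HADM_ID": [100, 100, 101],  # admission 100 has two ICU stays
    "LOS": [1.5, 2.0, 0.5],
})

# Direct merge: admission 100 appears twice
direct = admissions_demo.merge(icustays_demo, on="HADM_ID", how="left")

# Aggregate to one row per admission first, then merge
icu_per_admission = (
    icustays_demo.groupby("HADM_ID", as_index=False)["LOS"]
    .sum()
    .rename(columns={"LOS": "ICU_LOS"})
)
aggregated = admissions_demo.merge(
    icu_per_admission, on="HADM_ID", how="left", validate="one_to_one"
)

print(len(direct), len(aggregated))  # 4 3
```

The `validate="one_to_one"` argument makes pandas raise an error if the aggregation ever fails to produce unique keys.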


In [None]:
df.head()

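|   | AGE | ICU_LOS | ADMISSION_TYPE | INSURANCE | ETHNICITY |
|---|-----|---------|----------------|-----------|-----------|
| 0 | 65 | 1.1438 | EMERGENCY | Private | WHITE |
| 1 | 71 | 1.2641 | ELECTIVE | Medicare | WHITE |
| 2 | 75 | 1.1862 | EMERGENCY | Medicare | WHITE |
| 3 | 39 | 0.5124 | EMERGENCY | Private | WHITE |
| 4 | 59 | 3.5466 | EMERGENCY | Private | WHITE |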
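The `AGE` column above is a difference of calendar years, so it can be one year too high when the admission falls before the patient's birthday. In addition, MIMIC-III shifts the date of birth of patients older than 89, which makes their computed age come out at roughly 300. The sketch below shows one way to handle both cases; the function name `age_at_admission` and the cap of 90 are assumptions for illustration, not part of the original preprocessing. It works on date components rather than subtracting timestamps, because a 300-year span overflows pandas' nanosecond timedelta range.

```python
import pandas as pd

def age_at_admission(admit, dob, cap=90):
    """Age in completed years, capped at `cap` (hypothetical choice)."""
    years = admit.dt.year - dob.dt.year
    # Subtract one year if the birthday has not yet occurred in the admission year
    before_birthday = (admit.dt.month < dob.dt.month) | (
        (admit.dt.month == dob.dt.month) & (admit.dt.day < dob.dt.day)
    )
    age = years - before_birthday.astype(int)
    return age.clip(upper=cap)

# Invented example dates, in the shifted style MIMIC-III uses
demo = pd.DataFrame({
    "ADMITTIME": pd.to_datetime(["2150-03-01", "2150-03-01", "2150-03-01"]),
    "DOB": pd.to_datetime(["2100-06-15", "2100-02-15", "1850-01-01"]),
})
demo["AGE"] = age_at_admission(demo["ADMITTIME"], demo["DOB"])
print(demo["AGE"].tolist())  # [49, 50, 90]
```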


## Step 2: Train CTGAN on MIMIC-III Data

*Note: The code below trains the GAN and may take some time. If it runs too long, switch the runtime to GPU or reduce the number of epochs.*

In [19]:
# Load preprocessed data
df = pd.read_csv("mimic_ctgan_data.csv")  # Same relative path the file was saved to in Step 1

# Define metadata (categorical vs. numerical)
metadata = Metadata.detect_from_dataframe(data=df, table_name='ctgan_data')

# Initialize and train CTGAN
# epochs=1 keeps this demo fast; the SDV default is 300, and more epochs typically improve quality
ctgan = CTGANSynthesizer(metadata, enforce_rounding=False, epochs=1, verbose=True)
print("Training CTGAN...")
ctgan.fit(df)


We strongly recommend saving the metadata using 'save_to_json' for replicability in future SDV versions.



Training CTGAN...


Gen. (1.36) | Discrim. (0.13): 100%|██████████| 1/1 [00:03<00:00,  3.09s/it]


In [None]:
# Save trained model
import pickle
with open("ctgan_model.pkl", "wb") as f:
    pickle.dump(ctgan, f)

print("CTGAN training complete! Model saved.")

In [20]:
# Load trained model
with open("ctgan_model.pkl", "rb") as f:
    ctgan = pickle.load(f)

In [21]:
ctgan.get_loss_values_plot()

## Step 3: Generate & Evaluate Synthetic Data

### Generate synthetic data

In [None]:
# Generate synthetic data
synthetic_data = ctgan.sample(num_rows=1000)
synthetic_data.to_csv("synthetic_mimic_data.csv", index=False)
print("Synthetic data generated and saved!")

Synthetic data generated and saved!


In [None]:
print(synthetic_data.info())
synthetic_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   AGE             1000 non-null   int64  
 1   ICU_LOS         1000 non-null   float64
 2   ADMISSION_TYPE  1000 non-null   object 
 3   INSURANCE       1000 non-null   object 
 4   ETHNICITY       1000 non-null   object 
dtypes: float64(1), int64(1), object(3)
memory usage: 39.2+ KB
None


|   | AGE | ICU_LOS | ADMISSION_TYPE | INSURANCE | ETHNICITY |
|---|-----|---------|----------------|-----------|-----------|
| 0 | 39 | 0.686706 | EMERGENCY | Private | ASIAN |
| 1 | 84 | 4.257914 | NEWBORN | Medicare | UNKNOWN/NOT SPECIFIED |
| 2 | 58 | 0.604498 | EMERGENCY | Medicaid | WHITE |
| 3 | 88 | 12.693638 | EMERGENCY | Private | WHITE |
| 4 | 54 | 7.499186 | EMERGENCY | Private | WHITE - OTHER EUROPEAN |


The records above are synthetic: they are sampled from the trained model rather than copied from real patients, which helps us work around data privacy issues.
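A quick sanity check on the privacy claim is to count how many synthetic rows reproduce a real row exactly. The sketch below is a basic check only, not a privacy guarantee; the helper name `exact_match_rate` is hypothetical and the demo frames contain invented values. In the notebook it could be called as `exact_match_rate(df, synthetic_data)`.

```python
import pandas as pd

def exact_match_rate(real, synthetic):
    """Fraction of synthetic rows that appear verbatim in the real data."""
    real_rows = real.drop_duplicates()  # avoid duplicating synthetic rows in the merge
    merged = synthetic.merge(
        real_rows, how="left", on=list(synthetic.columns), indicator=True
    )
    return (merged["_merge"] == "both").mean()

# Invented example data
real_demo = pd.DataFrame({
    "AGE": [65, 71],
    "INSURANCE": ["Private", "Medicare"],
})
synthetic_demo = pd.DataFrame({
    "AGE": [65, 40, 71, 71],
    "INSURANCE": ["Private", "Private", "Medicaid", "Medicare"],
})

print(exact_match_rate(real_demo, synthetic_demo))  # 0.5
```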

### Evaluate data

In [None]:
def evaluate_synthetic_data(real_data, synthetic_data, metadata):
    """Evaluate the quality of generated synthetic data."""
    diagnostic = run_diagnostic(real_data=real_data, synthetic_data=synthetic_data, metadata=metadata)
    quality_report = evaluate_quality(real_data, synthetic_data, metadata)
    return diagnostic, quality_report

In [None]:
diagnostic, quality_report = evaluate_synthetic_data(df, synthetic_data, metadata)

Generating report ...

(1/2) Evaluating Data Validity: |██████████| 5/5 [00:00<00:00, 195.50it/s]|
Data Validity Score: 100.0%

(2/2) Evaluating Data Structure: |██████████| 1/1 [00:00<00:00, 523.24it/s]|
Data Structure Score: 100.0%

Overall Score (Average): 100.0%

Generating report ...

(1/2) Evaluating Column Shapes: |██████████| 5/5 [00:00<00:00, 113.53it/s]|
Column Shapes Score: 84.19%

(2/2) Evaluating Column Pair Trends: |██████████| 10/10 [00:00<00:00, 47.41it/s]|
Column Pair Trends Score: 71.85%

Overall Score (Average): 78.02%



In [None]:
quality_report.get_details('Column Shapes')

|   | Column | Metric | Score |
|---|--------|--------|-------|
| 0 | AGE | KSComplement | 0.897257 |
| 1 | ICU_LOS | KSComplement | 0.86495 |
| 2 | ADMISSION_TYPE | TVComplement | 0.869376 |
| 3 | INSURANCE | TVComplement | 0.818241 |
| 4 | ETHNICITY | TVComplement | 0.759642 |
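To make these scores easier to interpret, here is a minimal standard-library sketch of the two metrics as they are usually defined: KSComplement is one minus the Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs), and TVComplement is one minus the total variation distance between category frequencies. The function names are illustrative re-implementations, not the library's own code, and the sample inputs are invented.

```python
from collections import Counter

def ks_complement(real, synthetic):
    """1 - max gap between the empirical CDFs of two numeric samples."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    points = sorted(set(real) | set(synthetic))
    ks_stat = max(abs(ecdf(real, x) - ecdf(synthetic, x)) for x in points)
    return 1 - ks_stat

def tv_complement(real, synthetic):
    """1 - total variation distance between category frequencies."""
    real_freq, synth_freq = Counter(real), Counter(synthetic)
    categories = set(real_freq) | set(synth_freq)
    tvd = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )
    return 1 - tvd

print(ks_complement([1, 2, 3, 4], [3, 4, 5, 6]))                 # 0.5
print(tv_complement(["A", "A", "B", "B"], ["A", "A", "A", "B"]))  # 0.75

# The Column Shapes score reported earlier is the mean of the per-column scores
column_scores = [0.897257, 0.86495, 0.869376, 0.818241, 0.759642]
column_shapes = sum(column_scores) / len(column_scores)
print(round(100 * column_shapes, 2))  # 84.19
```

Both metrics range from 0 to 1, with 1 meaning the real and synthetic distributions of that column are identical.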


In [None]:
# Visualization
from sdv.evaluation.single_table import get_column_plot

def plot_column_distributions(real_data, synthetic_data, metadata, column_names):
    """Plot distributions of specified columns between real and synthetic data."""
    for col in column_names:
        fig = get_column_plot(real_data=real_data, synthetic_data=synthetic_data, metadata=metadata, column_name=col)
        fig.write_image(f"{col}_distribution.png")  # Save plot as PNG (requires kaleido)
        fig.show()

plot_column_distributions(df, synthetic_data, metadata, df.columns)

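Note: The synthetic data broadly follows the distributions of the real MIMIC-III data (Column Shapes score: 84.19%), although relationships between pairs of columns are captured less well (Column Pair Trends score: 71.85%).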

## Conclusion

To address the data privacy challenges of healthcare datasets like MIMIC-III, we can train CTGAN on the real data, generate a synthetic dataset, and evaluate how well it matches the original. The resulting synthetic dataset can then be used more freely, overcoming some of the challenges of working with healthcare data.