# HALO Synthetic Data Generation for MIMIC-III

This notebook trains the HALO (Hierarchical Autoregressive Language mOdel) on your MIMIC-III data and generates synthetic patients.

## What You'll Need

1. **MIMIC-III Access**: Download these files from PhysioNet:
   - `ADMISSIONS.csv`
   - `DIAGNOSES_ICD.csv`
   - `PATIENTS.csv`
   - `patient_ids.txt` (list of patient IDs, one per line; this file is not part of the PhysioNet download, so you supply it yourself)

2. **Google Colab**: The free tier works, but a GPU is recommended (Runtime → Change runtime type → GPU)

3. **Time**:
   - Demo (5 epochs, 1K samples): ~20-30 min on GPU
   - Production (80 epochs, 10K samples): ~6-10 hours on GPU

## How It Works

1. **Setup**: Install PyHealth and mount Google Drive
2. **Upload Data**: Upload your MIMIC-III CSV files
3. **Configure**: Set hyperparameters (epochs, batch size, etc.)
4. **Train**: Train HALO model (checkpoints saved to Drive)
5. **Generate**: Create synthetic patients using trained model
6. **Download**: Get CSV file with synthetic data

## Important Notes

⚠️ **Colab Timeout**: Free Colab sessions time out after at most 12 hours. For production training (80 epochs), consider:
- Colab Pro for longer sessions
- Running on your own GPU cluster using `examples/slurm/train_halo_mimic3.slurm`
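
Because a session can end mid-run, it helps to resume from the most recent checkpoint saved to Drive rather than start over. The following is a minimal sketch, assuming checkpoints are written as individual files into a single directory. The `*.pt` pattern and the helper name `latest_checkpoint` are hypothetical and not taken from PyHealth.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(checkpoint_dir: str, pattern: str = "*.pt") -> Optional[Path]:
    """Return the most recently modified file matching `pattern`, or None.

    Assumption: each checkpoint is a single file; the "*.pt" default is
    illustrative and should be adjusted to the real checkpoint file names.
    """
    candidates = [p for p in Path(checkpoint_dir).glob(pattern) if p.is_file()]
    if not candidates:
        return None
    # Pick the file with the newest modification time
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

If the helper returns a path, load that file before continuing training; if it returns `None`, no checkpoint exists yet and training starts from scratch.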

📊 **Demo vs Production**:
- Demo defaults (5 epochs, 1K samples) let you test the pipeline quickly
- Production settings (80 epochs, 10K samples) match the published HALO results
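
One way to keep the two modes from drifting apart is to define both in one place and switch with a single flag. This is a sketch only: `RunConfig`, `n_samples`, and `USE_DEMO` are hypothetical names, and it assumes "samples" means the number of synthetic patients to generate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    epochs: int     # number of training epochs
    n_samples: int  # assumed: number of synthetic patients to generate


# Values taken from the demo and production settings listed above
DEMO = RunConfig(epochs=5, n_samples=1_000)
PRODUCTION = RunConfig(epochs=80, n_samples=10_000)

USE_DEMO = True  # flip to False for a full production run
config = DEMO if USE_DEMO else PRODUCTION
print(f"Training for {config.epochs} epochs, generating {config.n_samples} samples")
```

Freezing the dataclass prevents a later cell from silently changing the settings during a long run.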

## References

- [HALO Paper](https://arxiv.org/abs/2406.16061)
- [PyHealth Documentation](https://pyhealth.readthedocs.io/)
- [MIMIC-III Access](https://physionet.org/content/mimiciii/)

---
# 1. Setup & Installation

In [None]:
# Install PyHealth from GitHub (gets latest HALO implementation)
# For development/CI: set BRANCH to a specific branch name (e.g., 'halo-pr-528')
# For production: leave BRANCH as None to use main branch
BRANCH = None  # Change to 'halo-pr-528' or other branch name for development

if BRANCH:
    install_url = f"git+https://github.com/sunlabuiuc/PyHealth.git@{BRANCH}"
    print(f"Installing PyHealth from branch '{BRANCH}'...")
else:
    install_url = "git+https://github.com/sunlabuiuc/PyHealth.git"
    print("Installing PyHealth from main branch...")

!pip install -q {install_url}
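# `!pip` does not raise on failure, so confirm the package can be found
import importlib
import importlib.util
importlib.invalidate_caches()
if importlib.util.find_spec("pyhealth") is None:
    raise RuntimeError("PyHealth installation failed; check the pip output above.")
print("✓ PyHealth installed successfully!")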

In [None]:
# Import required libraries
import os
import sys
import torch
import pickle
import pandas as pd
import shutil
from google.colab import drive, files
import ipywidgets as widgets
from IPython.display import display, Markdown, HTML

print("✓ All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

In [None]:
# Mount Google Drive for persistent storage
print("Mounting Google Drive...")
drive.mount('/content/drive')
print("✓ Google Drive mounted at /content/drive")

# Create directory structure in Drive
base_dir = '/content/drive/MyDrive/HALO_Training'
data_dir = f'{base_dir}/data'
checkpoint_dir = f'{base_dir}/checkpoints'
pkl_data_dir = f'{base_dir}/pkl_data'
output_dir = f'{base_dir}/output'

for dir_path in [base_dir, data_dir, checkpoint_dir, pkl_data_dir, output_dir]:
    os.makedirs(dir_path, exist_ok=True)

print("\n✓ Directory structure created:")
print(f"  Base: {base_dir}")
print(f"  Data: {data_dir}")
print(f"  Checkpoints: {checkpoint_dir}")
print(f"  Vocabulary: {pkl_data_dir}")
print(f"  Output: {output_dir}")
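
With the directories in place, it can help to check that the four input files listed under "What You'll Need" are present before training starts. The helper below is a minimal sketch; the name `missing_files` is hypothetical, and it assumes the files sit directly in the data directory under exactly these names.

```python
from pathlib import Path

# File names as listed in "What You'll Need"
REQUIRED_FILES = [
    "ADMISSIONS.csv",
    "DIAGNOSES_ICD.csv",
    "PATIENTS.csv",
    "patient_ids.txt",
]


def missing_files(directory: str, required=REQUIRED_FILES) -> list:
    """Return the required file names that are not present in `directory`."""
    return [name for name in required if not (Path(directory) / name).is_file()]
```

Calling `missing_files(data_dir)` after the upload step returns an empty list when everything is in place, or the names that still need to be uploaded.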

---
# 2. Configuration

---
# 3. Data Upload

---
# 4. Training

---
# 5. Generation

---
# 6. Results & Download