# HW5: PyHealth

## Overview

PyHealth is an open-source Python package that provides an end-to-end pipeline for clinical predictive modeling.

In PyHealth, each machine learning pipeline consists of five stages:

> Dataset Processing → Define Healthcare Task → Build ML Model → Train ML Model → Inference & Evaluation

These stages correspond to the following PyHealth modules: `pyhealth.datasets`, `pyhealth.tasks`, `pyhealth.models`, `pyhealth.trainer`, and `pyhealth.metrics`.

This assignment will go through each of these modules and then integrate them into a complete five-stage ML pipeline.

By completing this assignment, you will gain hands-on experience in building a full ML pipeline with PyHealth. You may also use this five-stage framework as a foundation for your final project.

---

In [54]:
import random
import numpy as np
import torch
import os


seed = 1
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
os.environ["PYTHONHASHSEED"] = str(seed)

In [55]:
# Set the MEMORY_FS_ROOT environment variable to a directory with sufficient space
os.environ['MEMORY_FS_ROOT'] = "/tmp/pandarallel_memory"

## 1. Dataset Processing

### The `pyhealth.datasets` Module

The `pyhealth.datasets` module provides processing functions for various public EHR datasets, including MIMIC-III, MIMIC-IV, eICU, and all OMOP-CDM-based datasets. It transforms unstructured data into a unified structured object. Refer to the documentation: [PyHealth Datasets API](https://pyhealth.readthedocs.io/en/latest/api/datasets.html).

#### Available Datasets

The following dataset classes are currently supported:
  - [MIMIC3Dataset](https://pyhealth.readthedocs.io/en/latest/api/datasets/pyhealth.datasets.MIMIC3Dataset.html)
  - [MIMIC4Dataset](https://pyhealth.readthedocs.io/en/latest/api/datasets/pyhealth.datasets.MIMIC4Dataset.html)
  - [eICUDataset](https://pyhealth.readthedocs.io/en/latest/api/datasets/pyhealth.datasets.eICUDataset.html)
  - [OMOPDataset](https://pyhealth.readthedocs.io/en/latest/api/datasets/pyhealth.datasets.OMOPDataset.html): for any OMOP-CDM-based database.

#### Arguments
  - `root`: Specifies the path to the data folder, e.g., `"mimiciii/1.4/"`.
  - `tables`: A list of table names from the raw database, defining the data used to construct your dataset.
  - `code_mapping` *(default: None)*: A dictionary specifying the new coding system for each data table.
    - Example:
      
      `{"ICD9CM": "CCSCM"}`
      
      This converts ICD-9-CM codes to CCS-CM codes.
      
      Check supported code transformations [here](https://pyhealth.readthedocs.io/en/latest/api/medcode.html#diagnosis-codes).
  - `dev` *(default: False)*: Enables dev mode (uses a small subset of the data for testing).
  - `refresh_cache` *(default: False)*: If True, the dataset is reprocessed from scratch, updating the cache.
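
Conceptually, a code mapping is a lookup from a fine-grained code to a coarser category. The following is a minimal sketch of that idea in plain Python. The table `TOY_ICD9CM_TO_CCSCM` and its entries are made-up placeholders rather than real ICD-9-CM or CCS-CM codes, and dropping unmapped codes and duplicates is a choice of this sketch, not a statement about how PyHealth performs the conversion.

```python
# Hypothetical lookup table: the codes below are placeholders, not real medical codes.
TOY_ICD9CM_TO_CCSCM = {
    "A01": "G1",
    "A02": "G1",
    "B07": "G2",
}


def map_codes(codes, table):
    """Map each source code to its target category.

    Unmapped codes are dropped, and each category is kept only once,
    in order of first appearance.
    """
    mapped = []
    for code in codes:
        target = table.get(code)
        if target is not None and target not in mapped:
            mapped.append(target)
    return mapped


visit_codes = ["A01", "A02", "B07", "Z99"]
print(map_codes(visit_codes, TOY_ICD9CM_TO_CCSCM))  # ['G1', 'G2']
```

Grouping several fine-grained codes under one category shrinks the vocabulary the model has to learn, which is the motivation for mappings such as `{"ICD9CM": "CCSCM"}`.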

### Example

In the following example, we initialize a `MIMIC3Dataset` object using the synthetic MIMIC-III dataset hosted on Google Cloud Storage. Let’s break down each argument:

- `root`: The path to the dataset, pointing to the publicly available synthetic MIMIC-III dataset.
- `tables`: Specifies that we are only loading the `DIAGNOSES_ICD`, `PROCEDURES_ICD`, and `PRESCRIPTIONS` tables.
- `code_mapping`: Converts ICD-9-CM codes to CCS-CM to group similar diagnoses.

#### What Happens Internally?
When you execute this code, `pyhealth.datasets.MIMIC3Dataset` performs the following steps:

1. Loads the dataset from the given `root` directory.
2. Extracts the specified tables (`DIAGNOSES_ICD`, `PROCEDURES_ICD`, `PRESCRIPTIONS`).
3. Applies the code mapping transformation, converting ICD-9-CM codes into CCS-CM codes.
4. Stores the processed dataset in a structured format that can be used for downstream machine learning tasks.
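
The structured format in step 4 can be pictured as patients that contain visits, which in turn contain per-table event lists. Below is a rough sketch of that nesting with plain dataclasses. `ToyPatient`, `ToyVisit`, and the sample rows are hypothetical stand-ins for illustration, not PyHealth classes or real data.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVisit:
    visit_id: str
    # table name -> list of codes recorded in that table during the visit
    events: dict = field(default_factory=dict)

    def get_event_list(self, table):
        """Return the codes recorded for one table (empty list if none)."""
        return self.events.get(table, [])


@dataclass
class ToyPatient:
    patient_id: str
    visits: dict = field(default_factory=dict)  # visit_id -> ToyVisit


# Flat rows as they might appear in raw tables: (patient_id, visit_id, table, code)
rows = [
    ("p1", "v1", "DIAGNOSES_ICD", "G1"),
    ("p1", "v1", "PRESCRIPTIONS", "D5"),
    ("p1", "v2", "DIAGNOSES_ICD", "G2"),
]

# Group the flat rows into the nested patient -> visit -> events structure.
patients = {}
for patient_id, visit_id, table, code in rows:
    patient = patients.setdefault(patient_id, ToyPatient(patient_id))
    visit = patient.visits.setdefault(visit_id, ToyVisit(visit_id))
    visit.events.setdefault(table, []).append(code)

print(patients["p1"].visits["v1"].get_event_list("DIAGNOSES_ICD"))  # ['G1']
```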

In [56]:
from pyhealth.datasets import MIMIC3Dataset


example_dataset = MIMIC3Dataset(
    root="https://storage.googleapis.com/pyhealth/Synthetic_MIMIC-III_subset/",
    tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
    code_mapping={"ICD9CM": "CCSCM"},
)

In [57]:
# Print the statistics of the dataset
example_dataset.stat()

In [58]:
# Access a specific patient by their patient id
example_dataset.patients["10025"]

In [59]:
# Access a specific visit by its visit id
example_dataset.patients["10025"].visits["110360"]

In [60]:
# Access all events of a certain type
example_dataset.patients["10025"].visits["110360"].get_event_list("DIAGNOSES_ICD")

### TODO 1: Process the Synthetic MIMIC-III Data

1. Load the synthetic MIMIC-III data from `https://storage.googleapis.com/pyhealth/Synthetic_MIMIC-III_subset/`.
2. Extract the `DIAGNOSES_ICD`, `PROCEDURES_ICD`, `PRESCRIPTIONS` tables.
3. Convert ICD-9-CM to CCS-CM, ICD-9-PROC to CCS-PROC, NDC to ATC codes.

In [61]:
from pyhealth.datasets import MIMIC3Dataset


"""
TODO 1: Process the Synthetic MIMIC-III Data [20 points]
"""
mimic3_dataset = None
# your code here
mimic3_dataset = MIMIC3Dataset(
    root="https://storage.googleapis.com/pyhealth/Synthetic_MIMIC-III_subset/",
    tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
    code_mapping={"ICD9CM": "CCSCM", "ICD9PROC": "CCSPROC", "NDC": "ATC"},
)

In [62]:
'''
AUTOGRADER CELL. DO NOT MODIFY THIS.
'''

assert set(mimic3_dataset.available_tables) == set(["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"])
num_of_patients = len(mimic3_dataset.patients)
num_of_visits = sum([len(patient.visits) for idx, patient in mimic3_dataset.patients.items()])
assert num_of_patients == 500
assert num_of_visits == 1091
assert mimic3_dataset.code_vocs == {"conditions": "CCSCM", "procedures": "CCSPROC", "drugs": "ATC"}



## 2. Define Healthcare Task

### The `pyhealth.tasks` Module

The `pyhealth.tasks` module provides functions to define various healthcare ML tasks from EHR data. It transforms structured datasets into samples (features and labels) ready for model training. Refer to the documentation: [PyHealth Tasks API](https://pyhealth.readthedocs.io/en/latest/api/tasks.html).

#### Available Tasks

Some example predefined task functions include:

- `mortality_prediction_mimic3_fn`: Predicts in-hospital mortality.
- `length_of_stay_prediction_mimic3_fn`: Predicts the length of stay (LOS).
- `drug_recommendation_mimic3_fn`: Recommends medications based on diagnoses and procedures.

#### Arguments

- `dataset.set_task()`: Converts the dataset into samples (features and labels).
- `task_function`: A callable that defines how features and labels are generated from patient records.

### Example

The mortality prediction task is a common benchmark in clinical machine learning. It predicts whether a patient will survive or die during their hospital stay based on their medical records.

- Input: Patient diagnoses, procedures, and medications from EHRs.
- Output: Binary label (1 for deceased, 0 for survived).
- Task Type: Binary Classification.

#### What Happens Internally?

The `mortality_prediction_mimic3_fn()` function generates labeled samples for a mortality prediction task using MIMIC-III patient records. It processes each patient's visits, using the current visit's diagnoses, procedures, and prescriptions to predict mortality at the next visit. The mortality label is determined from the next visit's discharge status (1 for deceased, 0 for survived). Visits without any medical codes are excluded. The function returns a list of labeled samples (patient_id, visit_id, conditions, procedures, drugs, and label) for binary classification (survival vs. death).
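
The next-visit labeling described above can be sketched in plain Python. This is a simplified illustration, not PyHealth's implementation: the input schema (a list of visit dicts with a hypothetical `died` flag) and the name `toy_mortality_fn` are assumptions made for the example, and "without any medical codes" is read here as "all three code lists are empty".

```python
def toy_mortality_fn(patient_id, visits):
    """Build one sample per visit that has a following visit.

    `visits` is a chronologically ordered list of dicts with the keys
    visit_id, conditions, procedures, drugs, and died (hypothetical schema).
    """
    samples = []
    for current, following in zip(visits, visits[1:]):
        codes = (current["conditions"], current["procedures"], current["drugs"])
        # Skip visits that carry no medical codes at all.
        if not any(codes):
            continue
        samples.append({
            "patient_id": patient_id,
            "visit_id": current["visit_id"],
            "conditions": current["conditions"],
            "procedures": current["procedures"],
            "drugs": current["drugs"],
            # The label comes from the outcome of the NEXT visit.
            "label": int(following["died"]),
        })
    return samples


visits = [
    {"visit_id": "v1", "conditions": ["G1"], "procedures": ["P1"], "drugs": ["D1"], "died": False},
    {"visit_id": "v2", "conditions": [], "procedures": [], "drugs": [], "died": False},
    {"visit_id": "v3", "conditions": ["G2"], "procedures": [], "drugs": ["D2"], "died": False},
    {"visit_id": "v4", "conditions": ["G3"], "procedures": ["P2"], "drugs": [], "died": True},
]
samples = toy_mortality_fn("p1", visits)
print([(s["visit_id"], s["label"]) for s in samples])  # [('v1', 0), ('v3', 1)]
```

Note that the last visit never produces a sample, because there is no following visit to supply its label.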

In [63]:
from pyhealth.tasks import mortality_prediction_mimic3_fn


example_samples = example_dataset.set_task(mortality_prediction_mimic3_fn)

In [64]:
# Print the statistics of the samples
example_samples.stat()

In [65]:
# Access a specific sample by index
print(example_samples[0])

In [66]:
# All information available for a specific sample
print(example_samples[0].keys())

### TODO 2: Set the Readmission Prediction Task

Call `set_task()` with the `mimic3_dataset` object to prepare samples for the readmission prediction task.

In [67]:
from pyhealth.tasks import readmission_prediction_mimic3_fn


"""
TODO 2: Set the Readmission Prediction Task [20 points]
"""
mimic3_samples = None
# your code here
mimic3_samples = mimic3_dataset.set_task(readmission_prediction_mimic3_fn)

In [68]:
'''
AUTOGRADER CELL. DO NOT MODIFY THIS.
'''

assert len(mimic3_samples) == 409
assert "conditions" in mimic3_samples[0].keys()
assert "procedures" in mimic3_samples[0].keys()
assert "drugs" in mimic3_samples[0].keys()
assert "visit_id" in mimic3_samples[0].keys()
assert "patient_id" in mimic3_samples[0].keys()
num_positive_samples = sum([v["label"] for v in mimic3_samples])
assert num_positive_samples == 42



### Data Splitting and Loading

We will split the dataset into train, validation, and test sets. For this problem, we simply split by samples. Other splitting functions can be found [here](https://pyhealth.readthedocs.io/en/latest/api/datasets/pyhealth.datasets.splitter.html#).

- The dataset is split into 60% training, 20% validation, and 20% test.

- After splitting, we use the `get_dataloader` function to convert the datasets into PyTorch DataLoaders.

- The training DataLoader enables shuffling to ensure better generalization, while validation and test DataLoaders do not shuffle the data.
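
To make the 60/20/20 split concrete, here is a minimal sketch of splitting by sample in plain Python: shuffle the sample indices, then cut them at the cumulative ratios. The helper `toy_split_by_sample` is hypothetical, and the exact rounding and shuffling used by PyHealth's `split_by_sample` may differ.

```python
import random


def toy_split_by_sample(num_samples, ratios, seed=None):
    """Shuffle sample indices and cut them into train/val/test index lists."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    train_end = int(num_samples * ratios[0])
    val_end = int(num_samples * (ratios[0] + ratios[1]))
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]


train_idx, val_idx, test_idx = toy_split_by_sample(409, [0.6, 0.2, 0.2], seed=1)
print(len(train_idx), len(val_idx), len(test_idx))  # 245 82 82
```

Splitting by sample means that visits from the same patient can land in different subsets. When that kind of leakage matters, a patient-level split is the usual alternative.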

In [69]:
from pyhealth.datasets import split_by_sample, get_dataloader


# Data split
example_train_samples, example_val_samples, example_test_samples = split_by_sample(example_samples, [0.6, 0.2, 0.2])

# Create dataloaders
example_train_loader = get_dataloader(example_train_samples, batch_size=64, shuffle=True)
example_val_loader = get_dataloader(example_val_samples, batch_size=64, shuffle=False)
example_test_loader = get_dataloader(example_test_samples, batch_size=64, shuffle=False)

Similarly, we split and load the `mimic3_samples` we just created.

In [70]:
# Data split
mimic3_train_samples, mimic3_val_samples, mimic3_test_samples = split_by_sample(mimic3_samples, [0.6, 0.2, 0.2])

# Create dataloaders
mimic3_train_loader = get_dataloader(mimic3_train_samples, batch_size=64, shuffle=True)
mimic3_val_loader = get_dataloader(mimic3_val_samples, batch_size=64, shuffle=False)
mimic3_test_loader = get_dataloader(mimic3_test_samples, batch_size=64, shuffle=False)

## 3. Build ML Model

### The `pyhealth.models` Module

The `pyhealth.models` module provides common deep learning models such as RNN, CNN, and Transformer, along with specialized healthcare models like RETAIN, SafeDrug, and GAMENet. Most models can be applied to a wide range of healthcare prediction tasks, except for a few task-specific models (e.g., GAMENet, SafeDrug, and MICRON, which are designed exclusively for drug recommendation). Refer to the documentation: [PyHealth Models API](https://pyhealth.readthedocs.io/en/latest/api/models.html).

#### Arguments
Each deep learning model in `pyhealth.models` follows a standard set of arguments:

- `dataset`: A `pyhealth.datasets.SampleDataset` object (the output of `set_task()` in the task definition step).
- `feature_keys`: A list of feature variables (strings), defined in the task function (e.g., "conditions").
- `label_key`: The target variable, defined in the task function (e.g., "label").
- `mode`: Specifies the type of prediction task:
  - `"multiclass"` (e.g., length-of-stay classification)
  - `"multilabel"` (e.g., drug recommendation)
  - `"binary"` (e.g., mortality prediction)
- Other model-specific arguments, such as `dropout`, `num_layers`, and hidden-layer dimensions.
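
The three modes differ mainly in how raw model outputs (logits) are turned into predictions. The sketch below shows the usual convention in plain Python. It illustrates the general idea only. That PyHealth applies exactly these activations is an assumption here, and the helper `predict` is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    # Subtract the maximum for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, mode):
    """Convert a list of logits into probabilities according to the task mode."""
    if mode == "binary":
        # One logit -> probability of the positive class.
        return sigmoid(logits[0])
    if mode == "multiclass":
        # One logit per class -> probabilities that sum to 1.
        return softmax(logits)
    if mode == "multilabel":
        # One logit per label -> independent probability for each label.
        return [sigmoid(x) for x in logits]
    raise ValueError(f"unknown mode: {mode}")


print(predict([0.0], "binary"))           # 0.5
print(predict([0.0, 0.0], "multilabel"))  # [0.5, 0.5]
```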

### Example

Recurrent Neural Networks (RNNs) are widely used in clinical machine learning due to their ability to capture temporal dependencies in sequential patient data. In this example, we use an RNN model to predict a patient’s mortality based on their past medical history.

#### What Happens Internally?

The RNN model processes each patient record by applying a separate RNN layer to each feature type (e.g., conditions, procedures, drugs). The final hidden states from each RNN are concatenated into a patient representation vector. A fully connected layer takes the patient representation and predicts patient mortality.
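
The architecture described above can be sketched in a few lines of PyTorch. This is a simplified illustration under stated assumptions, not the source of `pyhealth.models.RNN`: the class name `PerFeatureRNN`, the use of GRU cells, the sum-pooling of codes within a visit, and the dimensions are all choices made for this example, and padding is not masked.

```python
import torch
import torch.nn as nn


class PerFeatureRNN(nn.Module):
    """One embedding + GRU per feature type, concatenated into a patient vector."""

    def __init__(self, vocab_sizes, embedding_dim=16, hidden_dim=16):
        super().__init__()
        self.feature_keys = list(vocab_sizes)
        self.embeddings = nn.ModuleDict({
            key: nn.Embedding(size, embedding_dim, padding_idx=0)
            for key, size in vocab_sizes.items()
        })
        self.rnns = nn.ModuleDict({
            key: nn.GRU(embedding_dim, hidden_dim, batch_first=True)
            for key in vocab_sizes
        })
        self.fc = nn.Linear(hidden_dim * len(vocab_sizes), 1)

    def forward(self, batch):
        states = []
        for key in self.feature_keys:
            x = self.embeddings[key](batch[key])  # (batch, visits, codes, emb)
            x = x.sum(dim=2)                      # pool codes -> (batch, visits, emb)
            _, h = self.rnns[key](x)              # h: (1, batch, hidden)
            states.append(h[-1])                  # final hidden state: (batch, hidden)
        patient = torch.cat(states, dim=1)        # (batch, hidden * num_features)
        return self.fc(patient).squeeze(-1)       # one logit per patient


# Toy vocabulary sizes and random code indices (index 0 is reserved for padding).
model = PerFeatureRNN({"conditions": 10, "procedures": 8, "drugs": 12})
batch = {key: torch.randint(1, 8, (4, 3, 5)) for key in model.feature_keys}
logits = model(batch)
print(logits.shape)  # torch.Size([4])
```

Keeping a separate RNN per feature type lets each code vocabulary have its own embedding space, at the cost of one extra recurrent layer per feature.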

In [71]:
# Each sample in the dataset contains the following information
print(example_samples[0].keys())

We set `feature_keys` to `["conditions", "procedures", "drugs"]`, which are the input features.

We set `label_key` to `"label"`, which represents the patient mortality.

In [72]:
from pyhealth.models import RNN


example_rnn = RNN(
    dataset=example_samples,
    feature_keys=["conditions", "procedures", "drugs"],
    label_key="label",
    mode="binary",
)

In [73]:
# Architecture of RNN
example_rnn

### TODO 3: Build an RNN Model

1. Initialize an `RNN` model from `pyhealth.models`.
2. Use the `mimic3_samples` dataset we just created.
3. Predict readmission from conditions, procedures, and drugs.

In [74]:
"""
TODO 3: Build A RNN Model [20 Points]
"""
mimic3_rnn = None
# your code here
mimic3_rnn = RNN(
    dataset=mimic3_samples,
    feature_keys=["conditions", "procedures", "drugs"],
    label_key="label",
    mode="binary",
)

# Print the architecture
print(mimic3_rnn)

In [75]:
'''
AUTOGRADER CELL. DO NOT MODIFY THIS.
'''

assert str(type(mimic3_rnn)) == "<class 'pyhealth.models.rnn.RNN'>"
assert set(mimic3_rnn.feature_keys) == set(['conditions', 'procedures', 'drugs'])
assert mimic3_rnn.label_key == "label"
assert mimic3_rnn.mode == "binary"
assert mimic3_rnn.embeddings["conditions"].num_embeddings == 223
assert mimic3_rnn.embeddings["procedures"].num_embeddings == 131
assert mimic3_rnn.embeddings["drugs"].num_embeddings == 391



## 4. Train Model

### The `pyhealth.trainer` Module
The `pyhealth.trainer` module serves as the training handler for deep learning models. It provides a streamlined workflow for model training, including logging, checkpointing, and performance monitoring. Refer to the documentation: [pyhealth.trainer](https://pyhealth.readthedocs.io/en/latest/api/trainer.html).

#### Arguments
To initialize a `Trainer` instance, the following parameters can be specified:
- `model`: The PyHealth model to train (e.g., an instance of `pyhealth.models.RNN`).
- `checkpoint_path` (optional): Path to save intermediate model checkpoints.
- `metrics` (optional): A list of metrics to track during training (e.g., `"pr_auc"`, `"roc_auc"`).
- `device` (optional): Specifies whether to train on CPU or GPU.
- `enable_logging` (optional): Enables logging for monitoring training progress.
- `output_path` (optional): Directory to store training outputs and logs.
- `exp_name` (optional): Name of the experiment or task for tracking purposes.

#### Trainer Functionality
The core training process is handled by the `Trainer.train()` method, which accepts the following parameters:

- Data Loaders:
  - `train_dataloader`: The training dataset loader.
  - `val_dataloader`: The validation dataset loader.
- Training Hyperparameters:
  - `epochs`: Number of epochs to train the model.
  - `optimizer_class` (optional): The optimizer (e.g., `torch.optim.Adam`).
  - `optimizer_params` (optional): Parameters for the optimizer, including:
    - `lr` (optional): Learning rate.
    - `weight_decay` (optional): Weight decay for regularization.
  - `max_grad_norm` (optional): Maximum gradient norm for gradient clipping.
- Monitoring & Model Selection:
  - `monitor`: Metric to monitor (e.g., `"roc_auc"`).
  - `monitor_criterion` (optional): Whether to maximize or minimize the monitored metric (`"max"` by default).
  - `load_best_model_at_last` (optional): Whether to automatically load the best-performing model at the end of training.
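
The interplay of `monitor` and `monitor_criterion` amounts to tracking the best validation score seen so far. The following sketch shows that bookkeeping in plain Python. The helper `select_best_epoch` is hypothetical and the scores are made-up numbers. A real trainer would also save a checkpoint whenever the score improves and reload it at the end.

```python
def select_best_epoch(scores, criterion="max"):
    """Return (epoch_index, score) of the best monitored value.

    Ties keep the earliest epoch.
    """
    if criterion not in ("max", "min"):
        raise ValueError("criterion must be 'max' or 'min'")
    best_epoch, best_score = None, None
    for epoch, score in enumerate(scores):
        is_better = (
            best_score is None
            or (criterion == "max" and score > best_score)
            or (criterion == "min" and score < best_score)
        )
        if is_better:
            best_epoch, best_score = epoch, score
    return best_epoch, best_score


val_roc_auc = [0.52, 0.61, 0.58, 0.61]  # made-up validation scores per epoch
print(select_best_epoch(val_roc_auc, "max"))  # (1, 0.61)
```

For a metric such as ROC-AUC the criterion is `"max"`, while for a monitored loss it would be `"min"`.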

### Example

We create a `Trainer` instance to train the RNN model we just built. Then, we train the model using its `train()` method.

#### What Happens Internally?
1. Iterates over the training and validation data loaders for batch-wise training and validation.
2. Tracks performance using PR-AUC and ROC-AUC.
3. Saves model checkpoints during training.
4. Logs training details for reproducibility.
5. Loads the model with the best ROC-AUC at the end of training.

In [76]:
from pyhealth.trainer import Trainer


# Initialize the Trainer
example_trainer = Trainer(
    model=example_rnn,
    metrics=["pr_auc", "roc_auc"],
    device="cpu",
)

In [77]:
# Train the model
example_trainer.train(
    train_dataloader=example_train_loader,  # Training data
    val_dataloader=example_val_loader,  # Validation data
    epochs=20,  # Number of training epochs
    optimizer_class=torch.optim.Adam,  # Optimizer choice
    optimizer_params={"lr": 0.001, "weight_decay": 1e-5},  # Optimizer parameters
    max_grad_norm=5.0,  # Gradient clipping
    monitor="roc_auc",  # Monitor AUC-ROC for best model selection
    monitor_criterion="max",  # Maximize AUC-ROC during training
    load_best_model_at_last=True,  # Automatically load the best-performing model
)

### TODO 4: Train the RNN Model

Let us train the `mimic3_rnn` model we created on the readmission task. 

1. Set Up the Trainer
   - Use `pyhealth.trainer.Trainer` for model training.
   - Monitor the `roc_auc`, `pr_auc`, and `f1` scores to track training performance.

2. Train the Model
   - Call `train()` on the trainer with `mimic3_train_loader` and `mimic3_val_loader`.
   - Train for 20 epochs.
   - Use the Adam optimizer.
   - Set `lr` to 0.001.
   - Monitor the `roc_auc` score for model selection.
   - Automatically load the best-performing model.

In [78]:
"""
TODO 4: Train the RNN Model [20 Points]
"""
# Set up the Trainer
mimic3_trainer = Trainer(
    model=mimic3_rnn,
    metrics=["pr_auc", "roc_auc", "f1"],
    device="cpu",
)

In [86]:
"""
TODO 4: Train the RNN Model [20 Points]
"""
# Train the model
mimic3_trainer.train(
    train_dataloader=mimic3_train_loader,  # Training data
    val_dataloader=mimic3_val_loader,  # Validation data
    epochs=20,  # Number of training epochs
    optimizer_class=torch.optim.Adam,  # Optimizer choice
    optimizer_params={"lr": 0.001},  # Optimizer parameters
    max_grad_norm=5.0,  # Gradient clipping
    monitor="roc_auc",  # Monitor AUC-ROC for best model selection
    monitor_criterion="max",  # Maximize AUC-ROC during training
    load_best_model_at_last=True,  # Automatically load the best-performing model
)

In [87]:
'''
AUTOGRADER CELL. DO NOT MODIFY THIS.
'''

assert str(type(mimic3_trainer.model)) == "<class 'pyhealth.models.rnn.RNN'>"
assert set(mimic3_trainer.metrics) == set(["roc_auc", "pr_auc", "f1"])



## 5. Model Evaluation

### One-line Evaluation

The `Trainer` class provides the `.evaluate(data_loader)` method to compute performance metrics on any data loader, such as the test set. It returns a dictionary containing the evaluation metrics (e.g., `"pr_auc"`, `"roc_auc"`, `"loss"`).
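
To build intuition for what a value such as `"roc_auc"` means, the metric can be computed by hand as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The sketch below is a plain-Python illustration of that definition, not the routine PyHealth uses internally. The labels and scores are made up.

```python
def roc_auc(labels, scores):
    """Pairwise definition of ROC-AUC for binary labels (0 or 1)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.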

### Example

In [88]:
example_results = example_trainer.evaluate(example_test_loader)
print(example_results)

### TODO 5: Evaluate the RNN Model

In this step, we will evaluate the trained RNN model on the test dataset using the `.evaluate()` method from `Trainer`. This will compute key binary classification metrics to assess the model’s performance.

In [89]:
"""
TODO 5: Evaluate the RNN Model [20 points]
"""

mimic3_results = None
# your code here
mimic3_results = mimic3_trainer.evaluate(mimic3_test_loader)

# Print evaluation metrics
print("MIMIC-III Test Set Evaluation Metrics:", mimic3_results)

In [90]:
'''
AUTOGRADER CELL. DO NOT MODIFY THIS.
'''

assert "roc_auc" in mimic3_results
assert "pr_auc" in mimic3_results
assert "f1" in mimic3_results
assert mimic3_results['roc_auc'] > 0.55

