# Benchmarking EHRShot with XGBoost
 
This notebook demonstrates how to benchmark XGBoost on the EHRShot dataset (https://som-shahlab.github.io/ehrshot-website/) across a range of clinical prediction tasks. EHRShot contains patient records drawn from multiple OMOP tables and defines benchmark tasks in three categories:

- Operational outcomes (ICU admission, length of stay, readmission)
- Lab value predictions (anemia, hyperkalemia, hypoglycemia, etc.)
- New diagnosis predictions (hypertension, hyperlipidemia, cancer, etc.)
 
We'll use XGBoost as our baseline model to:
1. Process the EHRShot data into a feature matrix
2. Train and evaluate models for each prediction task
3. Compare performance across different OMOP tables
 
The results will help establish baseline performance metrics for these clinical prediction tasks.


In [1]:
import os
import sys

pyhealth_path = os.path.dirname(os.getcwd())
if pyhealth_path not in sys.path:
    print(f"Adding PyHealth to sys.path: {pyhealth_path}")
    sys.path.insert(0, pyhealth_path)

Adding PyHealth to sys.path: /home/REDACTED_USER/PyHealth


In [2]:
import numpy as np
import xgboost as xgb
from sklearn.metrics import roc_auc_score

from pyhealth.datasets import EHRShotDataset
from pyhealth.tasks import BenchmarkEHRShot

## Load and Prepare the Dataset

First, we'll load the EHRShot dataset and prepare it for our benchmarking tasks. Each benchmark task ships as its own label table alongside the base event data. We'll load the following tables:

- Base tables: `ehrshot` and `splits`
- Operational outcomes: `guo_icu`, `guo_los`, `guo_readmission`
- Lab value predictions: `lab_anemia`, `lab_hyperkalemia`, `lab_hypoglycemia`, etc.
- New diagnosis predictions: `new_acutemi`, `new_celiac`, `new_hyperlipidemia`, etc.

After loading, we'll examine the dataset statistics to understand its size and composition.


In [3]:
dataset = EHRShotDataset(
    root="/shared/eng/EHRSHOT_ASSETS",
    tables=[
        "ehrshot",
        "splits",
        "guo_icu",
        "guo_los",
        "guo_readmission",
        "lab_anemia",
        "lab_hyperkalemia",
        "lab_hypoglycemia",
        "lab_hyponatremia",
        "lab_thrombocytopenia",
        "new_acutemi",
        "new_celiac",
        "new_hyperlipidemia",
        "new_hypertension",
        "new_lupus",
        "new_pancan",
    ],
)
dataset.stats()

No config path provided, using default config
Initializing ehrshot dataset from /shared/eng/EHRSHOT_ASSETS (dev mode: False)
Scanning table: ehrshot from /shared/eng/EHRSHOT_ASSETS/data/ehrshot.csv
Scanning table: splits from /shared/eng/EHRSHOT_ASSETS/splits/person_id_map.csv
Scanning table: guo_icu from /shared/eng/EHRSHOT_ASSETS/benchmark/guo_icu/labeled_patients.csv
Scanning table: guo_los from /shared/eng/EHRSHOT_ASSETS/benchmark/guo_los/labeled_patients.csv
Scanning table: guo_readmission from /shared/eng/EHRSHOT_ASSETS/benchmark/guo_readmission/labeled_patients.csv
Scanning table: lab_anemia from /shared/eng/EHRSHOT_ASSETS/benchmark/lab_anemia/labeled_patients.csv
Scanning table: lab_hyperkalemia from /shared/eng/EHRSHOT_ASSETS/benchmark/lab_hyperkalemia/labeled_patients.csv
Scanning table: lab_hypoglycemia from /shared/eng/EHRSHOT_ASSETS/benchmark/lab_hypoglycemia/labeled_patients.csv
Scanning table: lab_hyponatremia from /shared/eng/EHRSHOT_ASSETS/benchmark/lab_hyponatremia/la

## Set Up Benchmark Tasks
 
Next, we'll define the benchmark tasks we want to evaluate. The EHRShot dataset supports multiple clinical prediction tasks across different OMOP tables. We'll set up:
 
- Operational outcomes: ICU admission, length of stay, and readmission prediction
- Lab value predictions: anemia, hyperkalemia, hypoglycemia, hyponatremia, and thrombocytopenia
- New diagnosis predictions: acute MI, celiac disease, hyperlipidemia, hypertension, lupus, and pancreatic cancer
 
For each task, we'll select a subset of OMOP tables from which to build features; the labels come from the task's own table.
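
To sweep over more than one configuration, the (task, table subset) pairs can be enumerated up front. The sketch below is only an illustration: `enumerate_configs`, `example_tasks`, `example_tables`, and the `max_tables` cap are hypothetical names introduced here, and the lists are shortened stand-ins for the full ones defined in this notebook.

```python
from itertools import combinations

# Shortened, illustrative lists; the notebook defines the full ones.
example_tasks = ["guo_icu", "lab_anemia"]
example_tables = ["condition_occurrence", "procedure_occurrence", "measurement"]


def enumerate_configs(task_names, table_names, max_tables=2):
    """Yield (task, table_subset) pairs for subsets of size 1..max_tables."""
    for name in task_names:
        for size in range(1, max_tables + 1):
            for subset in combinations(table_names, size):
                yield name, list(subset)


configs = list(enumerate_configs(example_tasks, example_tables))
for name, subset in configs[:3]:
    print(name, subset)
```

Capping the subset size keeps the number of runs manageable, since the count of subsets grows quickly with the number of tables.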


In [4]:
tasks = [
    "guo_icu",
    "guo_los",
    "guo_readmission",
    "lab_anemia",
    "lab_hyperkalemia",
    "lab_hypoglycemia",
    "lab_hyponatremia",
    "lab_thrombocytopenia",
    "new_acutemi",
    "new_celiac",
    "new_hyperlipidemia",
    "new_hypertension",
    "new_lupus",
    "new_pancan"
]
omop_tables = [
    "device_exposure",
    "person",
    "visit_detail",
    "visit_occurrence",
    "condition_occurrence",
    "procedure_occurrence",
    "note",
    "drug_exposure",
    "observation",
    "measurement",
]

In [5]:
task = "guo_icu"
omop_table = ["condition_occurrence", "procedure_occurrence"]

In [6]:
task_fn = BenchmarkEHRShot(task=task, omop_tables=omop_table)
samples = dataset.set_task(task_fn)

Setting task BenchmarkEHRShot/guo_icu for ehrshot base dataset...
Generating samples with 8 worker(s)...
Generating samples for BenchmarkEHRShot/guo_icu
Label label vocab: {'False': 0, 'True': 1}


Processing samples: 100%|██████████| 6491/6491 [00:01<00:00, 6456.67it/s]

Generated 6491 samples for task BenchmarkEHRShot/guo_icu





In [7]:
print(samples.input_processors["feature"])
print(samples.output_processors["label"])

SequenceProcessor(code_vocab_size=15148)
BinaryLabelProcessor(label_vocab_size=2)


## Build Features and Labels

In this step, we prepare the data for XGBoost by transforming the clinical samples into feature matrices and label vectors.

For features, we create multi-hot encoded vectors in which each position corresponds to one unique clinical event code (e.g., a diagnosis or procedure).
The vector length equals the number of unique codes (`num_features`), and a 1 marks that the code appears at least once in the sample; repeat occurrences are not counted.

For labels, we extract the target values from each sample. For lab-related tasks, we convert multi-class labels to binary by
setting any positive class (>=1) to 1.

The data is then split into training, validation, and test sets based on the predefined splits in the samples.
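
The three steps above (multi-hot encoding, label binarization, and split selection) can be seen on a toy example. Everything here is made up for illustration: `toy_samples` stands in for the processed samples, and the codes, labels, and `toy_num_features` are assumptions rather than values from EHRShot.

```python
import numpy as np

# Toy stand-ins for processed samples: integer event codes, a label,
# and a split name. These values are invented for illustration.
toy_samples = [
    {"feature": [1, 3, 3], "label": 0, "split": "train"},
    {"feature": [0, 2], "label": 2, "split": "train"},
    {"feature": [4], "label": 1, "split": "test"},
]
toy_num_features = 5

# Multi-hot encoding: one row per sample, one column per event code.
X_toy = np.zeros((len(toy_samples), toy_num_features), dtype=np.float32)
for row, s in enumerate(toy_samples):
    X_toy[row, s["feature"]] = 1  # a repeated code still yields a single 1

# Binarize labels: any class >= 1 becomes the positive class.
y_toy = np.array([s["label"] for s in toy_samples])
y_toy_binary = (y_toy >= 1).astype(int)

# Select rows by split name using a boolean mask.
split_toy = np.array([s["split"] for s in toy_samples])
train_mask = split_toy == "train"
X_toy_train, y_toy_train = X_toy[train_mask], y_toy_binary[train_mask]
print(X_toy_train.shape, y_toy_train)
```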

In [8]:
# Determine the maximum index to size the multi-hot vectors
num_features = samples.input_processors["feature"]._next_index

# Prepare feature matrix and labels
X, y, splits = [], [], []

for sample in samples:
    vec = np.zeros(num_features, dtype=np.float32)  # compact dtype keeps the dense matrix smaller
    vec[sample["feature"].numpy()] = 1  # multi-hot encoding
    X.append(vec)
    y.append(sample["label"].item())
    splits.append(sample["split"])

X = np.array(X)
y = np.array(y)
splits = np.array(splits)

if task.startswith("lab_"):
    print("Converting multi-class to binary")
    # convert multiclass to binary
    y[y >= 1] = 1

X_train = X[splits == "train"]
y_train = y[splits == "train"]
X_val = X[splits == "val"]
y_val = y[splits == "val"]
X_test = X[splits == "test"]
y_test = y[splits == "test"]
print(f"Train set: {X_train.shape}, {y_train.shape}")
print(f"Val set: {X_val.shape}, {y_val.shape}")
print(f"Test set: {X_test.shape}, {y_test.shape}")

Train set: (2402, 15147), (2402,)
Val set: (2052, 15147), (2052,)
Test set: (2037, 15147), (2037,)


## Train and Evaluate XGBoost Model
 
In this step, we train an XGBoost classifier on the prepared feature matrix and labels.

We fit the model on the training set and evaluate it on the test set using the area under the ROC curve (AUC). The validation set is not used in this baseline.

The model uses 100 trees, a learning rate of 0.1, and log loss as the evaluation metric; a fixed `random_state` keeps runs repeatable.
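
As a reading aid for the metric, ROC AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way in plain Python on made-up scores; `pairwise_auc` is a hypothetical helper for illustration, not the routine `roc_auc_score` uses internally.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair. The cost is
    O(n_pos * n_neg), so this is for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and scores: 3 of the 4 positive/negative pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

Because AUC depends only on how samples are ranked, it is unaffected by class prevalence in the way accuracy is, which makes it a convenient metric for comparing tasks with different positive rates.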


In [9]:
model = xgb.XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    eval_metric="logloss",
    random_state=42,
    n_jobs=-1,  # use all available CPU cores
)
model.fit(X_train, y_train, verbose=True)

# Test set evaluation: score the predicted probability of the positive class
y_pred_prob = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_prob)
print(f"Test AUC for task {task} with input {omop_table}: {auc:.4f}")

Test AUC for task guo_icu with input ['condition_occurrence', 'procedure_occurrence']: 0.7543
