**Basic XGBoost example with MIMIC-IV clinical data demo**

This is a short example of how to fit an XGBoost model using the actual MIMIC-IV data. I use only the demo dataset and a few covariates, so the model itself is not particularly interesting, but it shows the overall structure in Python.

In this case I define the outcome for each patient as being transferred to an ICU at some point after hospital admission ("transferred_to_icu" == 1) or never being transferred to an ICU after admission ("transferred_to_icu" == 0). The features are two categorical covariates: the type of hospital admission and race.

In [3]:
# If you do not already have these packages installed you'll need to install:
# numpy, pandas, scikit-learn, xgboost

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.metrics import average_precision_score, precision_recall_curve

Preparing the data:

In [4]:
# Replace these file paths with your own
transfers_df = pd.read_csv("/Users/jacobsussman/Desktop/mimic-iv-clinical-database-demo-2.2/hosp/transfers.csv.gz")
admission_df = pd.read_csv("/Users/jacobsussman/Desktop/mimic-iv-clinical-database-demo-2.2/hosp/admissions.csv.gz")

In [5]:
# The "transfers" csv file has a column called "careunit"; here I list the ICU units
icu_units = ["Trauma SICU (TSICU)", "Medical Intensive Care Unit (MICU)", 
             "Surgical Intensive Care Unit (SICU)", "Medical/Surgical Intensive Care Unit (MICU/SICU)", 
             "PACU", "Neuro Surgical Intensive Care Unit (Neuro SICU)"]

# Define a new dataframe with the subject id
model_data = transfers_df[["subject_id"]].copy()

# Add the outcome column, which is a 1 if "careunit" is in "icu_units", and 0 if not
model_data["transferred_to_icu"] = transfers_df["careunit"].isin(icu_units).astype(int)

# Some patients have multiple transfers; this condenses them to
# one row per patient that is 1 if they ever had an ICU transfer
model_data = model_data.groupby("subject_id", as_index = False)["transferred_to_icu"].max()

# This merges the admissions data, which is in a separate csv file
# I keep each patient's first admission entry
# (note: groupby(...).first() takes the first non-null value in each column, in file order)
admission_first = admission_df.groupby("subject_id", as_index = False).first()
model_data = model_data.merge(admission_first, on = "subject_id", how = "left")
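
To make the outcome construction concrete, here is a minimal sketch on a made-up transfers table. The subject ids, rows, and the shortened `toy_icu_units` list are invented for illustration and are not MIMIC data. It shows `isin` flagging the ICU rows and `groupby(...).max()` collapsing them to one row per patient.

```python
import pandas as pd

# Toy stand-in for the transfers table; these rows are made up for illustration
toy_transfers = pd.DataFrame({
    "subject_id": [1, 1, 2, 3, 3],
    "careunit": [
        "Emergency Department",
        "Medical Intensive Care Unit (MICU)",
        "Medicine",
        "PACU",
        "Medicine",
    ],
})

# Shortened, hypothetical version of the icu_units list
toy_icu_units = ["Medical Intensive Care Unit (MICU)", "PACU"]

# Flag each transfer row: 1 if the care unit is in the ICU list, 0 otherwise
toy_outcome = toy_transfers[["subject_id"]].copy()
toy_outcome["transferred_to_icu"] = (
    toy_transfers["careunit"].isin(toy_icu_units).astype(int)
)

# Collapse to one row per patient: max is 1 if any of their rows was flagged
toy_outcome = toy_outcome.groupby("subject_id", as_index=False)["transferred_to_icu"].max()
print(toy_outcome)
```

Patients 1 and 3 each have at least one flagged row and end up with 1; patient 2 has none and ends up with 0.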

In [6]:
model_data.head()

,subject_id,transferred_to_icu,hadm_id,admittime,dischtime,deathtime,admission_type,admit_provider_id,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag
0,10000032,1,22595853,2180-05-06 22:23:00,2180-05-07 17:15:00,,URGENT,P874LG,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,WIDOWED,WHITE,2180-05-06 19:17:00,2180-05-06 23:30:00,0
1,10001217,1,24597018,2157-11-18 22:56:00,2157-11-25 18:00:00,,EW EMER.,P4645A,EMERGENCY ROOM,HOME HEALTH CARE,Other,?,MARRIED,WHITE,2157-11-18 17:38:00,2157-11-19 01:24:00,0
2,10001725,1,25563031,2110-04-11 15:08:00,2110-04-14 15:00:00,,EW EMER.,P35SU0,PACU,HOME,Other,ENGLISH,MARRIED,WHITE,,,0
3,10002428,1,28662225,2156-04-12 14:16:00,2156-04-29 16:26:00,,EW EMER.,P64TOH,EMERGENCY ROOM,SKILLED NURSING FACILITY,Medicare,ENGLISH,WIDOWED,WHITE,2156-04-12 09:56:00,2156-04-12 17:11:00,0
4,10002495,1,24982426,2141-05-22 20:17:00,2141-05-29 17:41:00,,URGENT,P79SJ2,TRANSFER FROM HOSPITAL,SKILLED NURSING FACILITY,Medicare,ENGLISH,MARRIED,UNKNOWN,,,0
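
One detail about the admissions step is worth a quick check: `groupby(...).first()` returns the first non-null value in each column, in row order, so it can combine values from different admissions and does not sort by `admittime`. Whether this matters for the demo file depends on its row order and missing values, which is not verified here. The sketch below uses made-up rows to show the difference and one possible alternative (sort by `admittime`, then keep the first row per patient).

```python
import pandas as pd

# Made-up admissions: patient 1's later admission is listed first
toy_adm = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "admittime": pd.to_datetime(["2150-03-01", "2150-01-01", "2160-05-05"]),
    "admission_type": ["URGENT", "EW EMER.", "URGENT"],
    "edregtime": pd.to_datetime([None, "2150-01-01 08:00", None]),
})

# first(): first non-null value per column, so patient 1 gets admittime from
# the first listed row but edregtime from the second listed row
by_first = toy_adm.groupby("subject_id", as_index=False).first()

# Alternative: keep the whole row of the chronologically earliest admission
earliest = (
    toy_adm.sort_values("admittime")
    .drop_duplicates("subject_id", keep="first")
    .sort_values("subject_id")
    .reset_index(drop=True)
)

print(by_first)
print(earliest)
```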


Setting up the model:

In [40]:
# Define a dataframe of features X, and outcomes y
X = model_data[["admission_type", "race"]]
y = model_data["transferred_to_icu"]

# This is just one way to convert the categorical data to 0s and 1s
encoder = OneHotEncoder(drop="first", sparse_output=False)
X_encoded = encoder.fit_transform(X)
X_encoded = pd.DataFrame(X_encoded, columns=encoder.get_feature_names_out(X.columns))


In [59]:
# Defines the training features, training outcome, testing features, and testing outcome
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2
)
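
Because the split above is random, the metrics will change from run to run. As an optional variation (not what the cell above does), `train_test_split` accepts `random_state` to make the split repeatable and `stratify` to keep the class balance the same in both halves. The sketch below uses synthetic labels, and the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 30% positive class
toy_features = np.arange(100).reshape(-1, 1)
toy_labels = np.array([1] * 30 + [0] * 70)

# random_state fixes the shuffle; stratify preserves the 30/70 class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

# Fraction of positives in each half
print(y_tr.mean(), y_te.mean())
```

In the notebook itself this would amount to adding `random_state=...` and `stratify=y` to the existing call.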

In [60]:
# This defines an XGBoost classifier with all default hyperparameters
model = XGBClassifier()

# This fits the model we defined to our training features and outcomes
model.fit(X_train, y_train)

Parameter,Value
objective,'binary:logistic'
base_score,
booster,
callbacks,
colsample_bylevel,
colsample_bynode,
colsample_bytree,
device,
early_stopping_rounds,
enable_categorical,False


In [64]:
# This block uses our model to predict y from our test set of features
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Prints evaluation metrics
print("Accuracy:", accuracy_score(y_test, y_pred))
print("AUPRC:", average_precision_score(y_test, y_pred_proba))


Accuracy: 0.75
AUPRC: 0.856827731092437
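
Accuracy and AUPRC are easier to interpret next to a baseline. A minimal sketch, using made-up labels and scores rather than the model above: the positive-class prevalence is roughly the AUPRC expected from uninformative scores, and `roc_auc_score` (imported at the top but not used) gives the AUROC.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Made-up true labels and predicted probabilities
toy_y_true = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])

# Baseline for AUPRC: the fraction of positives in the evaluation set
prevalence = toy_y_true.mean()

auprc = average_precision_score(toy_y_true, toy_scores)
auroc = roc_auc_score(toy_y_true, toy_scores)

print("Prevalence (AUPRC baseline):", prevalence)
print("AUPRC:", auprc)
print("AUROC:", auroc)
```

On the real split, the same comparison would use `y_test.mean()`, `y_test`, and `y_pred_proba`.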
