### Training a model in scikit-learn

We demonstrate how the outputs of the extraction procedure can be used to train a predictive model with standard scikit-learn tools. Here, we train a logistic regression model with `C=0.001` (stronger L2 regularization than the scikit-learn default of `C=1.0`) and all other parameters left at their defaults. If you are instead interested in training models with PyTorch, you may skip this vignette.

In [1]:
import os
import numpy as np
import pandas as pd
import joblib
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [2]:
data_path = "/share/pi/nigam/projects/prediction_utils/scratch/"
merged_name = "merged_features_binary"
label_col = "LOS_7" # use length of stay >= 7 days as the outcome

Let's load the relevant data: the cohort table, the sparse feature matrix, the row ID map, and the vocabulary.

In [3]:
cohort = pd.read_parquet(
    os.path.join(data_path, 'cohort', 'cohort.parquet')
)
features = joblib.load(
    os.path.join(data_path, merged_name, 'features_sparse', 'features.gz')
)

row_id_map = pd.read_parquet(
    os.path.join(data_path, merged_name, 'features_sparse', 'features_row_id_map.parquet')
)

vocab = pd.read_parquet(
    os.path.join(data_path, merged_name, 'vocab', 'vocab.parquet')
)

The `row_id_map` can be used to determine which rows in `features` correspond to which `prediction_id`s in `cohort`.
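As a minimal sketch of how this mapping works, the toy example below joins a small cohort table to a row ID map on `prediction_id`. All values and the `toy_*` names are made up for illustration. The `validate="one_to_one"` argument is an optional safeguard (an assumption about what is useful here, not part of the original workflow) that makes pandas raise an error if either table contains duplicate keys.

```python
import pandas as pd

# Toy stand-ins for `cohort` and `row_id_map`; all values are made up.
toy_cohort = pd.DataFrame({
    "prediction_id": [30, 10, 20],
    "LOS_7": [1, 0, 0],
})
toy_row_id_map = pd.DataFrame({
    "features_row_id": [0, 1, 2],
    "prediction_id": [10, 20, 30],
})

# Join on the shared key. `validate` raises pandas.errors.MergeError
# if `prediction_id` is duplicated on either side.
toy_merged = toy_cohort.merge(
    toy_row_id_map, on="prediction_id", how="inner", validate="one_to_one"
)
print(toy_merged)
```

Each cohort row now carries the index of its row in the feature matrix, in the order of the cohort table rather than the order of the map.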

In [4]:
cohort.head()

Unnamed: 0,prediction_id,person_id,admit_date,discharge_date,hospital_mortality,month_mortality,LOS_days,LOS_7,readmission_30,age_in_years,age_group,race_eth,gender_concept_name,fold_id
0,2150893537413223336,29928031,2016-04-26,2016-05-05,0,0,9,1,0,23,[18-30),Other,FEMALE,1
1,-9172456414774063774,29928277,2016-04-17,2016-04-18,0,0,1,0,0,34,[30-45),Hispanic or Latino,FEMALE,3
2,9182809047481350174,29928369,2019-06-01,2019-06-04,0,0,3,0,0,18,[18-30),Hispanic or Latino,FEMALE,8
3,5941383439057324436,29928716,2010-11-22,2010-11-23,0,0,1,0,0,45,[45-55),Other,FEMALE,8
4,7118399359492822295,29929199,2017-10-05,2017-10-08,0,0,3,0,1,45,[45-55),White,FEMALE,3


In [5]:
row_id_map.head()

Unnamed: 0,features_row_id,prediction_id
0,0,-9218020972023615613
1,1,-9172456414774063774
2,2,-9162815047257347747
3,3,-9160944683684581999
4,4,-9159427746772318041


In [6]:
# Join on the shared key so each cohort row gets its row index in `features`
cohort_with_row_id = cohort.merge(row_id_map, on='prediction_id')

In [7]:
cohort_with_row_id.head()

Unnamed: 0,prediction_id,person_id,admit_date,discharge_date,hospital_mortality,month_mortality,LOS_days,LOS_7,readmission_30,age_in_years,age_group,race_eth,gender_concept_name,fold_id,features_row_id
0,2150893537413223336,29928031,2016-04-26,2016-05-05,0,0,9,1,0,23,[18-30),Other,FEMALE,1,994
1,-9172456414774063774,29928277,2016-04-17,2016-04-18,0,0,1,0,0,34,[30-45),Hispanic or Latino,FEMALE,3,1
2,9182809047481350174,29928369,2019-06-01,2019-06-04,0,0,3,0,0,18,[18-30),Hispanic or Latino,FEMALE,8,1563
3,5941383439057324436,29928716,2010-11-22,2010-11-23,0,0,1,0,0,45,[45-55),Other,FEMALE,8,1296
4,7118399359492822295,29929199,2017-10-05,2017-10-08,0,0,3,0,1,45,[45-55),White,FEMALE,3,1400


Now we split the data into training and test sets using the `fold_id` column.

In [8]:
cohort_dict = {
    'train': cohort_with_row_id.query('fold_id != "test"'),
    'test': cohort_with_row_id.query('fold_id == "test"')
}
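The split above compares `fold_id` to the string `"test"`, which suggests the column holds string labels. The toy sketch below assumes that is the case (the `toy_*` names and values are hypothetical) and shows how the two `query` calls partition the rows.

```python
import pandas as pd

# Toy cohort; `fold_id` is assumed to be stored as strings, with "test"
# marking held-out rows and other values marking training folds.
toy_cohort = pd.DataFrame({
    "features_row_id": [0, 1, 2, 3],
    "fold_id": ["1", "test", "3", "test"],
})

toy_cohort_dict = {
    "train": toy_cohort.query('fold_id != "test"'),
    "test": toy_cohort.query('fold_id == "test"'),
}

# Every row should land in exactly one split.
split_sizes = {key: len(value) for key, value in toy_cohort_dict.items()}
print(split_sizes)
```

Checking that the split sizes sum to the cohort size, and that neither split is empty, is a cheap way to catch a mismatch between the column's type and the query string.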

In [9]:
row_ids_dict = {
    key: value['features_row_id'].values
    for key, value in cohort_dict.items()
}

In [10]:
labels_dict = {
    key: value[label_col].values
    for key, value in cohort_dict.items()
}

In [11]:
features_dict = {
    key: features[row_ids_dict[key]]
    for key in row_ids_dict.keys()
}
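The row selection above can be sketched on a toy matrix. This example assumes `features` is a SciPy CSR matrix, which supports indexing by an integer array of row positions. The `toy_*` names and values are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy 4x3 sparse feature matrix standing in for `features`
# (assumed here to be a SciPy CSR matrix).
toy_features = csr_matrix(np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
]))

# Row positions per split, as would come from `features_row_id`.
toy_row_ids = {"train": np.array([3, 0]), "test": np.array([2])}

# Integer-array indexing returns the selected rows in the order requested.
toy_features_dict = {key: toy_features[ids] for key, ids in toy_row_ids.items()}
print({key: value.shape for key, value in toy_features_dict.items()})
```

Because rows come back in the order of the index array, the features stay aligned with labels taken from the same cohort subset.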

Now we train the model and evaluate it on both splits using the area under the ROC curve (AUC-ROC).

In [12]:
model = LogisticRegression(C=0.001).fit(features_dict['train'], labels_dict['train'])

In [13]:
auc_dict = {
    key: roc_auc_score(labels_dict[key], model.predict_proba(features_dict[key])[:, 1])
    for key in features_dict.keys()
}
print(auc_dict)

{'train': 0.7153090361087426, 'test': 0.7031039136302294}
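For readers who want to try the fit-and-evaluate step without access to the extracted data, here is a minimal sketch on synthetic sparse features. The data, the seed, and the names `X`, `y`, and `toy_model` are all assumptions for illustration; the resulting AUC says nothing about the real cohort.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic binary features stored as a sparse matrix (about 20% nonzero).
dense = (rng.random((200, 20)) < 0.2).astype(float)
X = csr_matrix(dense)

# Synthetic labels that depend on the first feature plus noise.
y = (dense[:, 0] + rng.normal(scale=0.5, size=200) > 0.5).astype(int)

# LogisticRegression accepts sparse input directly.
toy_model = LogisticRegression(C=0.001).fit(X, y)

# predict_proba returns one column per class, ordered as in `classes_`;
# column 1 is the probability of the positive class (label 1).
proba = toy_model.predict_proba(X)
auc = roc_auc_score(y, proba[:, 1])
print(toy_model.classes_, auc)
```

Passing the positive-class probabilities rather than hard predictions to `roc_auc_score` matters, since the AUC is computed from the ranking of scores.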
