### Training a model in scikit-learn

We demonstrate how the outputs of the extraction procedure can be used to train a predictive model with standard tools in scikit-learn. Here, we train a logistic regression model with scikit-learn's default settings, except for a stronger L2 penalty (`C=0.001`). If you are instead interested in training models with PyTorch, you may skip this vignette.

In [4]:
import os
import numpy as np
import pandas as pd
import joblib
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [5]:
data_path = "/share/pi/nigam/projects/prediction_utils/scratch/"
merged_name = "merged_features_binary"
label_col = "LOS_7" # use length of stay >= 7 days as the outcome

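Let's load the relevant data: the cohort table, the sparse feature matrix, the map from feature rows to `prediction_id`s, and the feature vocabulary.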

In [6]:
cohort = pd.read_parquet(
    os.path.join(data_path, 'cohort', 'cohort.parquet')
)
features = joblib.load(
    os.path.join(data_path, merged_name, 'features_sparse', 'features.gz')
)

row_id_map = pd.read_parquet(
    os.path.join(data_path, merged_name, 'features_sparse', 'features_row_id_map.parquet')
)

vocab = pd.read_parquet(
    os.path.join(data_path, merged_name, 'vocab', 'vocab.parquet')
)

The `row_id_map` can be used to determine which rows in `features` correspond to which `prediction_id`s in `cohort`.
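
As a rough illustration of this lookup, the sketch below joins two made-up tables on `prediction_id`. The `toy_` names and all values are hypothetical stand-ins, not the real data; `validate="one_to_one"` is an optional pandas check that raises an error if either table has duplicate keys.

```python
import pandas as pd

# Hypothetical stand-ins for `cohort` and `row_id_map`; all values are made up.
toy_cohort = pd.DataFrame({
    "prediction_id": [101, 102, 103],
    "LOS_7": [0, 1, 0],
})
toy_row_id_map = pd.DataFrame({
    "features_row_id": [0, 1, 2],
    "prediction_id": [103, 101, 102],
})

# Join on prediction_id so each cohort row learns its row in the feature matrix.
# validate="one_to_one" raises MergeError if a prediction_id is duplicated on either side.
toy_merged = toy_cohort.merge(
    toy_row_id_map, on="prediction_id", validate="one_to_one"
)
print(toy_merged)
```

Note that the row order of the cohort and the row order of the feature matrix need not agree, which is why the join is required.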

In [7]:
cohort.head()

Unnamed: 0,prediction_id,person_id,admit_date,discharge_date,hospital_mortality,month_mortality,LOS_days,LOS_7,readmission_30,age_in_years,age_group,race_eth,gender_concept_name,fold_id
0,-2634508823241925258,29927078,2016-08-15,2016-08-19,0,0,4,0,0,38,[30-45),Asian,FEMALE,test
1,1780316215979524235,29927087,2019-08-08,2019-08-11,0,0,3,0,0,31,[30-45),Asian,FEMALE,7
2,4456189537238262902,29927561,2015-07-24,2015-07-26,0,0,2,0,0,45,[45-55),Asian,FEMALE,7
3,922196558384555124,29927632,2011-04-17,2011-04-19,0,0,2,0,0,49,[45-55),Asian,FEMALE,10
4,6633284178958689091,29928326,2014-09-02,2014-09-04,0,0,2,0,0,23,[18-30),Hispanic or Latino,FEMALE,10


In [8]:
row_id_map.head()

Unnamed: 0,features_row_id,prediction_id
0,0,3616212052160199172
1,1,2053957947170892260
2,2,-9204411047439856242
3,3,-6129485265169826587
4,4,-4401831474135185752


In [9]:
# Join on prediction_id explicitly rather than relying on shared column names
cohort_with_row_id = cohort.merge(row_id_map, on='prediction_id')

In [10]:
cohort_with_row_id.head()

Unnamed: 0,prediction_id,person_id,admit_date,discharge_date,hospital_mortality,month_mortality,LOS_days,LOS_7,readmission_30,age_in_years,age_group,race_eth,gender_concept_name,fold_id,features_row_id
0,-2634508823241925258,29927078,2016-08-15,2016-08-19,0,0,4,0,0,38,[30-45),Asian,FEMALE,test,821
1,1780316215979524235,29927087,2019-08-08,2019-08-11,0,0,3,0,0,31,[30-45),Asian,FEMALE,7,817
2,4456189537238262902,29927561,2015-07-24,2015-07-26,0,0,2,0,0,45,[45-55),Asian,FEMALE,7,844
3,922196558384555124,29927632,2011-04-17,2011-04-19,0,0,2,0,0,49,[45-55),Asian,FEMALE,10,849
4,6633284178958689091,29928326,2014-09-02,2014-09-04,0,0,2,0,0,23,[18-30),Hispanic or Latino,FEMALE,10,912


Next, we split the data into training and test sets: rows with `fold_id == "test"` form the test set, and all remaining folds form the training set.

In [11]:
cohort_dict = {
    'train': cohort_with_row_id.query('fold_id != "test"'),
    'test': cohort_with_row_id.query('fold_id == "test"')
}

In [12]:
row_ids_dict = {
    key: value['features_row_id'].values
    for key, value in cohort_dict.items()
}

In [13]:
labels_dict = {
    key: value[label_col].values
    for key, value in cohort_dict.items()
}

In [14]:
features_dict = {
    key: features[row_ids_dict[key]]
    for key in row_ids_dict.keys()
}
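
The row selection above can be sketched on a small matrix. This assumes `features` is a SciPy CSR sparse matrix (the source only says it is stored under `features_sparse`); the `toy_` names and values are hypothetical. Indexing a CSR matrix with an integer array returns the requested rows in the requested order, so labels and features stay aligned as long as both come from the same cohort rows.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical 4 x 3 feature matrix; assumed to be CSR like the real `features`.
toy_features = sp.csr_matrix(np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
]))

# Made-up row ids and labels per split, mirroring row_ids_dict and labels_dict.
toy_row_ids = {"train": np.array([3, 0]), "test": np.array([2])}
toy_labels = {"train": np.array([1, 0]), "test": np.array([0])}

# Integer-array indexing selects rows in the order given.
toy_features_dict = {key: toy_features[ids] for key, ids in toy_row_ids.items()}

# Sanity check: one label per selected feature row in every split.
for key in toy_features_dict:
    assert toy_features_dict[key].shape[0] == toy_labels[key].shape[0]
```

A shape check like the one at the end is a cheap way to catch a mismatched or incomplete merge before fitting a model.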

Now we train the model and evaluate it on both splits using the area under the ROC curve (AUC-ROC).

In [15]:
model = LogisticRegression(C=0.001).fit(features_dict['train'], labels_dict['train'])

In [16]:
auc_dict = {
    key: roc_auc_score(labels_dict[key], model.predict_proba(features_dict[key])[:, 1])
    for key in features_dict.keys()
}
print(auc_dict)

{'train': 0.7935258949690188, 'test': 0.7024897573274503}
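
The gap between training and test AUC depends in part on the regularization strength `C`, where smaller values mean a stronger penalty. The following is a minimal sketch on synthetic data, not on the cohort above, showing one way to compare a few values of `C`. The dataset, the candidate values, and `max_iter=1000` are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data; unrelated to the real cohort.
X, y = make_classification(
    n_samples=500, n_features=50, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit one model per candidate C and record the AUC-ROC on each split.
toy_auc = {}
for C in [0.001, 0.01, 0.1, 1.0]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    toy_auc[C] = {
        "train": roc_auc_score(y_train, clf.predict_proba(X_train)[:, 1]),
        "test": roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]),
    }
print(toy_auc)
```

On the real data, a comparison like this would normally use held-out folds within the training data (for example, those identified by `fold_id`) rather than the test set, so that the test AUC remains an unbiased estimate.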
