# Predict ED admission probability

This notebook demonstrates the first stage of prediction, to generate a probability of admission for each patient in the ED. 

As one of the modelling decisions is to send predictions at specified times of day, we tailor the models to these times and train one model for each time. The dataset used for this modelling is derived from snapshots of visits at each time of day. The times of day are defined in the config.yaml file in the root directory of this repo.

A patient episode (visit) may well span more than one of these times, so we need to consider how we will deal with the occurrence of multiple snapshots per episode. At each of these times of day, we will use only one training sample from each hospital episode.

The visits will be separated chronologically into training, validation and test sets.

Evaluation of individual level models includes: 
- feature importance plots
- calibration plot
- MADCAP overall, plus breakdown by age category and length of stay


## Set up the notebook environment

In [66]:
# Reload functions every time
%load_ext autoreload 
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [79]:
from pathlib import Path
import sys
import json
import pandas as pd

PROJECT_ROOT = Path().home() 
USER_ROOT = Path().home() / 'work'

sys.path.append(str(USER_ROOT / 'patientflow' / 'src' / 'patientflow'))
sys.path.append(str(USER_ROOT / 'patientflow' / 'functions'))



In [74]:
PROJECT_ROOT

PosixPath('/home/jovyan')

In [77]:
model_file_path = PROJECT_ROOT /'data' / 'ed-predictor' / 'trained-models'
model_file_path

data_file_path = USER_ROOT / 'ed-predictor' / 'data-raw'
data_file_path

PosixPath('/home/jovyan/work/ed-predictor/data-raw')

In [78]:
model_file_path

PosixPath('/home/jovyan/data/ed-predictor/trained-models')

## Load parameters

These are set in config.yaml. You can change them for your own purposes, but the times of day must match those in the provided dataset if you want to run this notebook successfully.

In [145]:
# Load the times of day
import yaml

config_path = Path(USER_ROOT / 'patientflow')

with open(config_path / 'config.yaml', 'r') as file:
    config = yaml.safe_load(file)
    
# Convert list of times of day at which predictions will be made (currently stored as lists) to list of tuples
prediction_times = [tuple(item) for item in config['prediction_times']]

# See the times of day at which predictions will be made
prediction_times

# Load the dates defining the beginning and end of training, validation and test sets
start_training_set, start_validation_set, start_test_set, end_test_set = [item for item in config['modelling_dates']]
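
The conversion from lists to tuples in the cell above matters because YAML sequences load as Python lists, and lists are not hashable, so they cannot be used as set members or dictionary keys when matching against the dataset. A minimal sketch of the difference, assuming the parsed config has the shape below (the `config` dict here is a stand-in for the parsed YAML, with values taken from the prediction times printed later in this notebook):

```python
# Stand-in for the parsed YAML; YAML sequences load as Python lists.
config = {"prediction_times": [[6, 0], [9, 30], [12, 0], [15, 30], [22, 0]]}

# Lists are unhashable, so building a set of them raises TypeError.
try:
    {item for item in config["prediction_times"]}
    lists_are_hashable = True
except TypeError:
    lists_are_hashable = False

# Tuples are hashable, so they support fast membership checks.
prediction_times = [tuple(item) for item in config["prediction_times"]]
lookup = set(prediction_times)

print(lists_are_hashable)  # False
print((9, 30) in lookup)   # True
```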


## Load data

In [81]:
from ed_admissions_data_retrieval import ed_admissions_get_data

csv_filename = 'ed_visits.csv'
full_path = data_file_path / csv_filename

df = ed_admissions_get_data(full_path)

In [82]:
df['snapshot_date'] = pd.to_datetime(df['snapshot_date']).dt.date

In [83]:
# print start and end dates
print(df.snapshot_date.min())
print(df.snapshot_date.max())

2030-04-01
2032-04-30


See how many visits there are at each time of day in the dataset. The number of visits represented is greatest at the midday, afternoon and evening prediction times, and lowest at 06:00.

In [84]:
print(df.prediction_time.value_counts())

prediction_time
(15, 30)    72696
(12, 0)     64177
(22, 0)     59466
(9, 30)     46094
(6, 0)      29253
Name: count, dtype: int64


We will confirm that the dataset aligns with the specified times of day set in the parameters file config.yaml. That is because, later, we will use these times of day to evaluate the predictions. The evaluation will fail if the data loaded does not match. 

In [85]:
print("\nTimes of day at which predictions will be made")
print(prediction_times)
print("\nNumber of rows in dataset that are not in these times of day")
print(len(df[~df.prediction_time.isin(prediction_times)]))


Times of day at which predictions will be made
[(6, 0), (9, 30), (12, 0), (15, 30), (22, 0)]

Number of rows in dataset that are not in these times of day
0


## Set an index column in df

Setting snapshot_id as the index before subsetting means that we retain the same values of snapshot_id throughout the entire process, so they stay consistent across the original dataset df and its training, validation and test subsets.
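
As a small illustration of this point, the sketch below uses a toy frame (column names follow this notebook, values are made up) to show that a subset taken after `set_index` carries the original snapshot_id labels with it:

```python
import pandas as pd

# Toy frame; column names follow the notebook, values are made up.
df_demo = pd.DataFrame({
    "snapshot_id": [10, 11, 12, 13],
    "training_validation_test": ["train", "train", "valid", "test"],
    "is_admitted": [False, True, False, True],
})

indexed = df_demo.set_index("snapshot_id")
test_subset = indexed[indexed.training_validation_test == "test"]

# The subset is labelled by snapshot_id rather than by a fresh 0-based position.
print(test_subset.index.tolist())

# The same label retrieves the same row from the full frame.
print(indexed.loc[13, "is_admitted"])
```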

In [86]:
df.head()

Unnamed: 0,snapshot_id,snapshot_date,prediction_time,visit_number,training_validation_test,elapsed_los_td,sex,arrival_method,current_location_type,total_locations_visited,...,latest_lab_results_k,latest_lab_results_lac,latest_lab_results_na,latest_lab_results_pco2,latest_lab_results_ph,latest_lab_results_wcc,latest_lab_results_hco3,has_consultation,is_admitted,age_group
0,0,2030-04-09,"(12, 0)",1,train,3420.0,F,Walk-in,waiting,2,...,,,,,,,,False,False,45-54
1,1,2030-04-09,"(15, 30)",1,train,16020.0,F,Walk-in,majors,5,...,4.2,0.5,141.0,6.84,7.371,5.28,,False,False,45-54
2,2,2030-08-08,"(15, 30)",2,train,29760.0,M,,majors,3,...,3.8,0.9,142.0,6.31,7.361,5.53,,True,False,65-74
3,3,2030-08-03,"(12, 0)",3,train,106800.0,M,Walk-in,majors,5,...,,1.3,,5.83,7.434,,,False,False,65-74
4,4,2030-04-24,"(12, 0)",4,train,6600.0,F,Walk-in,sdec_waiting,4,...,,,,,,,,True,False,35-44


After executing the code below, the snapshot_id has been set as the index column.

In [87]:
if df.index.name != 'snapshot_id':
    df = df.set_index('snapshot_id')
df.head()

snapshot_id,snapshot_date,prediction_time,visit_number,training_validation_test,elapsed_los_td,sex,arrival_method,current_location_type,total_locations_visited,num_obs,...,latest_lab_results_k,latest_lab_results_lac,latest_lab_results_na,latest_lab_results_pco2,latest_lab_results_ph,latest_lab_results_wcc,latest_lab_results_hco3,has_consultation,is_admitted,age_group
0,2030-04-09,"(12, 0)",1,train,3420.0,F,Walk-in,waiting,2,14,...,,,,,,,,False,False,45-54
1,2030-04-09,"(15, 30)",1,train,16020.0,F,Walk-in,majors,5,30,...,4.2,0.5,141.0,6.84,7.371,5.28,,False,False,45-54
2,2030-08-08,"(15, 30)",2,train,29760.0,M,,majors,3,67,...,3.8,0.9,142.0,6.31,7.361,5.53,,True,False,65-74
3,2030-08-03,"(12, 0)",3,train,106800.0,M,Walk-in,majors,5,405,...,,1.3,,5.83,7.434,,,False,False,65-74
4,2030-04-24,"(12, 0)",4,train,6600.0,F,Walk-in,sdec_waiting,4,14,...,,,,,,,,True,False,35-44


## Separate into training, validation and test sets

As part of preparing the data, each visit has already been allocated to one of three sets: training, validation or test. This has been done chronologically, as shown by the output below. A chronological approach is appropriate for tasks where the model needs to be validated on unseen, future data.
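
The allocation itself happened upstream and is not shown in this notebook. As a rough sketch of what a chronological allocation could look like, the hypothetical helper `assign_set` below labels each snapshot by date cutoff, using the validation and test start dates visible in the output that follows. It is a simplification: it labels individual snapshots, whereas the real preparation ensures that no visit appears in more than one set.

```python
from datetime import date

import pandas as pd


def assign_set(snapshot_date, start_validation, start_test):
    """Hypothetical helper: label a date as train, valid or test by cutoff."""
    if snapshot_date < start_validation:
        return "train"
    if snapshot_date < start_test:
        return "valid"
    return "test"


start_validation = date(2031, 9, 1)
start_test = date(2031, 11, 1)

demo = pd.DataFrame({
    "snapshot_date": [
        date(2030, 4, 1),
        date(2031, 8, 31),
        date(2031, 9, 1),
        date(2031, 10, 31),
        date(2031, 11, 1),
    ]
})
demo["set"] = [
    assign_set(d, start_validation, start_test) for d in demo.snapshot_date
]
print(demo)
```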


In [88]:
for value in df.training_validation_test.unique():
    subset = df[df.training_validation_test == value]
    counts = subset.training_validation_test.value_counts().values[0]
    min_date = subset.snapshot_date.min()
    max_date = subset.snapshot_date.max()
    print(f"Set: {value}\nNumber of rows: {counts}\nMin Date: {min_date}\nMax Date: {max_date}\n")



Set: train
Number of rows: 190761
Min Date: 2030-04-01
Max Date: 2031-08-31

Set: test
Number of rows: 63685
Min Date: 2031-11-01
Max Date: 2032-04-30

Set: valid
Number of rows: 17240
Min Date: 2031-09-01
Max Date: 2031-10-31



In [89]:
train_df = df[df.training_validation_test == 'train'].drop(columns='training_validation_test')
valid_df = df[df.training_validation_test == 'valid'].drop(columns='training_validation_test')
test_df = df[df.training_validation_test == 'test'].drop(columns='training_validation_test')


We can see below that some visits appear more than once in each of these sets. (No visit appears in more than one set.)

In [90]:
train_df.visit_number.value_counts()

visit_number
56473     16
37682     16
93534     16
11970     16
125346    16
          ..
52539      1
52538      1
52537      1
52536      1
215462     1
Name: count, Length: 131837, dtype: int64

For example, the patient below has 16 episode slices. It is quite possible that this patient had already left the ED but the discharge had not been updated on the patient record. While it is tempting to remove such slices, in real time these patients would still be picked up, so a model should ideally be trained on this data too. We therefore include them in our training set.

In [72]:
train_df[train_df.visit_number == 21947].head()

Unnamed: 0,snapshot_id,snapshot_date,prediction_time,visit_number,elapsed_los_td,sex,arrival_method,current_location_type,total_locations_visited,num_obs,...,latest_lab_results_K,latest_lab_results_Lac,latest_lab_results_NA,latest_lab_results_pCO2,latest_lab_results_pH,latest_lab_results_WCC,latest_lab_results_HCO3,has_consultation,is_admitted,age_group
29982,32290,2030-06-29,"(6, 0)",21947,3 days 00:01:00,F,Walk-in,sdec,7,57,...,4.4,0.8,138.0,6.49,7.357,13.07,,True,False,75-102
29981,32289,2030-06-28,"(22, 0)",21947,2 days 16:01:00,F,Walk-in,sdec,7,57,...,4.4,0.8,138.0,6.49,7.357,13.07,,True,False,75-102
29980,32288,2030-06-28,"(15, 30)",21947,2 days 09:31:00,F,Walk-in,sdec,7,57,...,4.4,0.8,138.0,6.49,7.357,13.07,,True,False,75-102
29979,32287,2030-06-28,"(12, 0)",21947,2 days 06:01:00,F,Walk-in,sdec,7,57,...,4.4,0.8,138.0,6.49,7.357,13.07,,True,False,75-102
29978,32286,2030-06-28,"(9, 30)",21947,2 days 03:31:00,F,Walk-in,sdec,7,57,...,4.4,0.8,138.0,6.49,7.357,13.07,,True,False,75-102


In [101]:
# Inspect snapshots with a recorded temperature above 110
train_df[train_df.latest_obs_temperature > 110]

snapshot_id,snapshot_date,prediction_time,visit_number,elapsed_los_td,sex,arrival_method,current_location_type,total_locations_visited,num_obs,num_obs_events,...,latest_lab_results_k,latest_lab_results_lac,latest_lab_results_na,latest_lab_results_pco2,latest_lab_results_ph,latest_lab_results_wcc,latest_lab_results_hco3,has_consultation,is_admitted,age_group
32083,2030-06-25,"(15, 30)",21819,5700.0,F,Walk-in,utc,3,29,2,...,,,,,,,,False,False,25-34
37252,2030-07-09,"(9, 30)",25358,53880.0,M,Walk-in,sdec,7,54,6,...,3.9,2.4,137.0,4.73,7.44,9.14,,True,True,55-64
37253,2030-07-09,"(12, 0)",25358,62880.0,M,Walk-in,sdec,7,54,6,...,3.9,2.4,137.0,4.73,7.44,9.14,,True,True,55-64
48945,2030-08-07,"(15, 30)",33455,12540.0,F,,sdec_waiting,2,21,1,...,4.3,1.6,136.0,5.04,7.44,7.44,,True,False,65-74
52182,2030-08-15,"(15, 30)",35758,66300.0,F,Public Trans,sdec,5,78,7,...,3.9,1.6,141.0,7.38,7.322,10.3,,True,False,25-34
70982,2030-09-27,"(15, 30)",49023,12000.0,F,Walk-in,sdec_waiting,3,24,3,...,,,,,,,,True,False,18-24
89047,2030-11-11,"(6, 0)",61181,5460.0,F,,majors,4,35,5,...,,1.3,,4.87,7.402,9.2,,False,False,25-34
104086,2030-12-20,"(9, 30)",71422,1980.0,F,Walk-in,rat,2,15,2,...,,,,,,,,False,False,35-44
120102,2031-01-25,"(9, 30)",82388,1980.0,M,Walk-in,utc,4,27,2,...,,,,,,,,False,False,55-64
128674,2031-02-15,"(15, 30)",88411,2580.0,F,Public Trans,rat,4,30,3,...,,,,,,,,False,False,25-34


## Train an XGBoost classifier for each time of day, and save the best model

The first step is to load a transformer that turns the ML training data into a format our ML classifier can read. This is done using a function called create_column_transformer(), which calls ColumnTransformer, a standard class in scikit-learn. This function could be changed for different input data.

The ColumnTransformer in scikit-learn is a tool that applies different transformations or preprocessing steps to different columns of a dataset in a single operation. OneHotEncoder converts categorical data into a format that can be provided to machine learning algorithms; without this, the model might interpret the categorical data as numerical, which would lead to incorrect results. OrdinalEncoder converts categories into ordered numerical values, reflecting the inherent order in variables such as the age groups.

We can also specify a grid of hyperparameters, so that the classifier will iterate through them to find the best fitting model.
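
To see what iterating through a grid involves, the sketch below expands a grid with scikit-learn's `ParameterGrid`. The values are the ones that appear (partly commented out) in the grid defined later in this notebook; with all three options enabled for each hyperparameter, there would be 27 candidate models per time of day.

```python
from sklearn.model_selection import ParameterGrid

full_grid = {
    "n_estimators": [30, 40, 50],
    "subsample": [0.7, 0.8, 0.9],
    "colsample_bytree": [0.7, 0.8, 0.9],
}

# ParameterGrid yields one dict per combination of values.
combinations = list(ParameterGrid(full_grid))
print(len(combinations))  # 27
print(combinations[0])
```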

We are interested in predictions at different times of day. So we will train a model for each time of day. We will filter each visit so that it only appears once in the training data. A random number has already been included in the dataset to facilitate this.
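
The filtering itself happens inside preprocess_data(), whose code is not shown here. One plausible way to use a pre-assigned random number for this, sketched below on toy data, is to keep the snapshot with the smallest random number within each visit. The column name `random_number` is an assumption for illustration, not necessarily the name used in the dataset.

```python
import pandas as pd

# Toy snapshots, already filtered to a single prediction time.
demo = pd.DataFrame({
    "visit_number": [1, 1, 2, 3, 3, 3],
    "snapshot_date": ["2030-06-27", "2030-06-28", "2030-06-28",
                      "2030-07-01", "2030-07-02", "2030-07-03"],
    "random_number": [5, 2, 9, 7, 1, 4],  # hypothetical column name
})

# For each visit, find the row label holding the smallest random number,
# then keep only those rows.
keep = demo.groupby("visit_number")["random_number"].idxmin()
one_per_visit = demo.loc[keep]
print(one_per_visit)
```

Because the random number is stored with the data, the same snapshot is selected on every run, which keeps training reproducible.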

We then iterate through the grid to find the best model for each time of day, keeping track of the best model and its results. 

The best model is saved, plus a dictionary of its metadata, including

* how many visits were in training, validation and test sets
* Area under ROC curve and log loss (performance metrics) for training (based on 5-fold cross validation), validation and test sets
* List of features and their importances in the model
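
For readers less familiar with the two performance metrics listed above, here is a small worked example in plain Python. Log loss is the mean negative log-likelihood of the observed outcomes under the predicted probabilities (lower is better), and the area under the ROC curve equals the probability that a randomly chosen admitted patient receives a higher predicted probability than a randomly chosen non-admitted patient. The labels and probabilities are made up.

```python
import math


def binary_log_loss(y_true, y_prob):
    """Mean negative log-likelihood of the observed labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def auc(y_true, y_prob):
    """Share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [p for y, p in zip(y_true, y_prob) if y == 1]
    negatives = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]

print(round(binary_log_loss(y_true, y_prob), 4))  # 0.5403
print(auc(y_true, y_prob))                        # 0.75
```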


#### Function for cross validation

The ML models will be trained across a range of different hyperparameter options. When evaluating the models, we will save common ML metrics (AUC and log loss) and select the model with the best (lowest) log loss. Applying a chronological approach to the cross-validation split is appropriate for tasks where the model needs to be validated on unseen, future data.
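
The chronological_cross_validation() function is imported from the project's own module and its code is not shown here. The sketch below is a hypothetical stand-in illustrating the idea with scikit-learn's `TimeSeriesSplit`, where every validation fold lies strictly after its training fold. It assumes the rows are already sorted in time order, uses synthetic data, and uses a logistic regression only to keep the example light; the returned keys mirror those read from `cv_results` in the training loop.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import TimeSeriesSplit


def chronological_cv_sketch(model, X, y, n_splits=5):
    """Hypothetical stand-in: mean AUC and log loss over forward-chaining folds."""
    scores = {"train_auc": [], "valid_auc": [],
              "train_logloss": [], "valid_logloss": []}
    for train_idx, valid_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        # Fit on the earlier rows only, then score both parts of the fold.
        model.fit(X[train_idx], y[train_idx])
        for name, idx in (("train", train_idx), ("valid", valid_idx)):
            prob = model.predict_proba(X[idx])[:, 1]
            scores[f"{name}_auc"].append(roc_auc_score(y[idx], prob))
            scores[f"{name}_logloss"].append(log_loss(y[idx], prob))
    return {key: float(np.mean(values)) for key, values in scores.items()}


# Synthetic data, assumed to be sorted in time order.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=600) > 0).astype(int)

results = chronological_cv_sketch(LogisticRegression(), X, y)
print(results)
```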

In [142]:

# import xgboost as xgb

from sklearn.model_selection import ParameterGrid, cross_validate
from sklearn.pipeline import Pipeline  # needed for the Pipeline built in the loop below
# from sklearn.compose import ColumnTransformer
# from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import (
    confusion_matrix,
    ConfusionMatrixDisplay,
    log_loss,
    roc_auc_score,
)
from joblib import dump, load

from ed_admissions_utils import get_model_name, preprocess_data
from ed_admissions_machine_learning import chronological_cross_validation, create_column_transformer, initialise_model

# initialize a dict to save information about the best models for each time of day
best_model_results_dict = {}

# Option to iterate through different hyperparameters for XGBoost
grid = {
    'n_estimators':[30], #, 40, 50],
    'subsample':[0.7], #,0.8,0.9],
    'colsample_bytree': [0.7] #,0.8,0.9]
}

# certain columns are not used in training
exclude_from_training_data = [
    "visit_number",
    "snapshot_date",
    "prediction_time"]


# category orderings for the ordinal variables
# (backslashes are escaped explicitly to avoid invalid-escape warnings)
ordinal_mappings = {
    "age_group": ["0-17", "18-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75-102"],
    "latest_acvpu": ["A", "C", "V", "P", "U"],
    "latest_manch_triage": ["Blue", "Green", "Yellow", "Orange", "Red"],
    "latest_pain_objective": ["Nil", "Mild", "Moderate", "Severe\\E\\Very Severe", "Severe\\Very Severe"]
}


# Process each time of day
for prediction_time_ in prediction_times:

    print("\nProcessing :" + str(prediction_time_))

    # create a name for the model based on the time of day it is trained for
    MODEL__ED_ADMISSIONS__NAME = get_model_name('ed_admission', prediction_time_)

    # use this name in the path for saving best model
    full_path = model_file_path / MODEL__ED_ADMISSIONS__NAME 
    full_path = full_path.with_suffix('.joblib')

    # initialise data used for saving attributes of the model
    best_model_results_dict[MODEL__ED_ADMISSIONS__NAME] = {}
    best_valid_logloss = float('inf')
    results_dict = {}
    
    # get visits that were in at the time of day in question and preprocess the training, validation and test sets 
    X_train, y_train = preprocess_data(train_df, prediction_time_, exclude_from_training_data)
    X_valid, y_valid = preprocess_data(valid_df, prediction_time_, exclude_from_training_data)
    X_test, y_test = preprocess_data(test_df, prediction_time_, exclude_from_training_data)
    
    # save size of each set
    best_model_results_dict[MODEL__ED_ADMISSIONS__NAME]['train_valid_test_set_no'] = {
        'train_set_no' : len(X_train),
        'valid_set_no' : len(X_valid),
        'test_set_no' : len(X_test),
    }

    # iterate through the grid of hyperparameters
    for g in ParameterGrid(grid):
        model = initialise_model(g)
        
        # define a column transformer for the ordinal and categorical variables
        column_transformer = create_column_transformer(X_test, ordinal_mappings)
        
        # create a pipeline with the feature transformer and the model
        pipeline = Pipeline([
            ('feature_transformer', column_transformer),
            ('classifier', model)
        ])

        # cross-validate on the training set using the imported chronological_cross_validation function
        cv_results = chronological_cross_validation(pipeline, X_train, y_train, n_splits=5)

        # Store results for this set of parameters in the results dictionary
        results_dict[str(g)] = {
            'train_auc': cv_results['train_auc'],
            'valid_auc': cv_results['valid_auc'],
            'train_logloss': cv_results['train_logloss'],
            'valid_logloss': cv_results['valid_logloss'],
        }
        
        # Update and save best model if current model is better on validation set
        if cv_results['valid_logloss'] < best_valid_logloss:

            # save the details of the best model
            best_model = str(g)
            best_valid_logloss = cv_results['valid_logloss']

            # save the best model params
            best_model_results_dict[MODEL__ED_ADMISSIONS__NAME]['best_params'] = str(g)

            # save the model metrics on training and validation set
            best_model_results_dict[MODEL__ED_ADMISSIONS__NAME]['train_valid_set_results'] = results_dict

            # score the model's performance on the test set  
            y_test_pred_proba = pipeline.predict_proba(X_test)[:, 1]
            test_auc = roc_auc_score(y_test, y_test_pred_proba)
            test_logloss = log_loss(y_test,y_test_pred_proba)
        
            best_model_results_dict[MODEL__ED_ADMISSIONS__NAME]['test_set_results'] = {
                'test_auc' : test_auc,
                'test_logloss' : test_logloss
            }

            # save the best features
            # To access transformed feature names:
            transformed_cols = pipeline.named_steps['feature_transformer'].get_feature_names_out()
            transformed_cols = [col.split('__')[-1] for col in transformed_cols]
            best_model_results_dict[MODEL__ED_ADMISSIONS__NAME]['best_model_features'] = {
                    'feature_names': transformed_cols,
                    'feature_importances': pipeline.named_steps['classifier'].feature_importances_.tolist()
                }

            # save the best model
            dump(pipeline, full_path)

# save the results dictionary      
filename_results_dict = 'best_model_results_dict.json'
full_path_results_dict = model_file_path / filename_results_dict

with open(full_path_results_dict, 'w') as f:
    json.dump(best_model_results_dict, f)  


Processing :(6, 0)

Processing :(9, 30)

Processing :(12, 0)

Processing :(15, 30)

Processing :(22, 0)


In [144]:
for key, value in best_model_results_dict.items():
    print(f"Model: {key}; AUC: {round(value['test_set_results']['test_auc'],3)}; log loss {round(value['test_set_results']['test_logloss'],3)}")

Model: ed_admission_0600; AUC: 0.862; log loss 0.364
Model: ed_admission_0930; AUC: 0.896; log loss 0.261
Model: ed_admission_1200; AUC: 0.88; log loss 0.25
Model: ed_admission_1530; AUC: 0.869; log loss 0.276
Model: ed_admission_2200; AUC: 0.882; log loss 0.313
