# Modelling probability of admission to specialty, if admitted

This notebook demonstrates the second stage of prediction, to generate a probability of admission to a specialty for each patient in the ED if they are admitted. 

Here, consult sequences provide the input to prediction, and the model is trained only on visits by adult patients that ended in admission. Patients under 18 at the time of arrival at the ED are assumed to be admitted to paediatric wards. This assumption could be relaxed by including children in the training data and changing how the inference stage is done.

This approach assumes that, if admitted, a patient's probability of admission to any particular specialty is independent of their probability of admission to hospital. 
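Under this independence assumption, the two stages can be combined by multiplication. The following is a minimal sketch with made-up numbers; the function name `combine_stages` and the values are hypothetical and are not taken from the trained models.

```python
def combine_stages(prob_admission, specialty_probs):
    """Multiply P(admitted) by P(specialty | admitted) for each specialty."""
    return {spec: prob_admission * p for spec, p in specialty_probs.items()}


# Illustrative values only (assumptions, not model output)
prob_admission = 0.4
specialty_probs = {"medical": 0.6, "surgical": 0.3, "haem/onc": 0.1}

joint = combine_stages(prob_admission, specialty_probs)
print(joint)
```

The joint probabilities sum to the probability of admission, since the specialty probabilities sum to one.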

## Set up the notebook environment

In [1]:
# Reload functions every time
%load_ext autoreload 
%autoreload 2

In [2]:
from pathlib import Path
import sys
import json
import pandas as pd

root = Path().resolve().parent

sys.path.append(str(root/ 'src'))




## Load parameters and set file paths

For more information about parameters and file paths, see notebook [4a_Predict_probability_of_admission_from_ED.ipynb](4a_Predict_probability_of_admission_from_ED.ipynb)

In [3]:
# indicate whether the notebook is being run locally for UCLH or with public datasets
uclh = False
from patientflow.load import set_file_paths
from patientflow.load import load_config_file

# set file location
data_folder_name = 'data-uclh' if uclh else 'data-public'
data_file_path, media_file_path, model_file_path, config_path = set_file_paths(
        train_dttm = None, data_folder_name = data_folder_name, uclh = uclh, from_notebook=True, inference_time = False)



# load params
params = load_config_file(config_path)

prediction_times = params["prediction_times"]
start_training_set, start_validation_set, start_test_set, end_test_set = params["start_training_set"], params["start_validation_set"], params["start_test_set"], params["end_test_set"]
# x1, y1, x2, y2 = params["x1"], params["y1"], params["x2"], params["y2"]
# prediction_window = params["prediction_window"]
# epsilon = float(params["epsilon"])
# yta_time_interval = params["yta_time_interval"]

print(f'\nTraining set starts {start_training_set} and ends on {start_validation_set - pd.Timedelta(days=1)} inclusive')
print(f'Validation set starts on {start_validation_set} and ends on {start_test_set - pd.Timedelta(days=1)} inclusive' )
print(f'Test set starts on {start_test_set} and ends on {end_test_set- pd.Timedelta(days=1)} inclusive' )

Configuration will be loaded from: /home/jovyan/work/patientflow/config.yaml
Data files will be loaded from: /home/jovyan/work/patientflow/data-public
Trained models will be saved to: /home/jovyan/work/patientflow/trained-models
Images will be saved to: /home/jovyan/work/patientflow/notebooks/img

Training set starts 2031-03-01 and ends on 2031-08-31 inclusive
Validation set starts on 2031-09-01 and ends on 2031-09-30 inclusive
Test set starts on 2031-10-01 and ends on 2031-12-31 inclusive


## Load data

In [4]:
import pandas as pd
from patientflow.load import set_data_file_names
from patientflow.load import data_from_csv

if uclh:
    visits_path, visits_csv_path, yta_path, yta_csv_path = set_data_file_names(uclh, data_file_path, config_path)
else:
    visits_csv_path, yta_csv_path = set_data_file_names(uclh, data_file_path)

visits = data_from_csv(visits_csv_path, index_column = 'snapshot_id',
                            sort_columns = ["visit_number", "snapshot_date", "prediction_time"], 
                            eval_columns = ["prediction_time", "consultation_sequence", "final_sequence"])

visits['snapshot_date'] = pd.to_datetime(visits['snapshot_date']).dt.date

## Train the model

This is the function that trains the specialty model, loaded from a file. Below we will break it down step-by-step.

In [5]:
from patientflow.train import train_specialty_model, get_default_visits
??train_specialty_model

Signature:
train_specialty_model(
    visits,
    model_name,
    model_metadata,
    model_file_path,
    uclh,
)
Docstring: <no docstring>
Source:
def train_specialty_model(visits, model_name, model_metadata, model_file_path, uclh):
    # Select one snapshot per visit
    visits_single = select_one_snapshot_per_visit(visits, visit_col="visit_number")


The first step in the function above is to handle the fact that there are multiple snapshots per visit and we only want one for each visit in the training set. 

In [6]:
from patientflow.prepare import select_one_snapshot_per_visit

visits_single = select_one_snapshot_per_visit(visits, visit_col = 'visit_number')

print(visits.shape)
print(visits_single.shape)

(79802, 69)
(64456, 68)
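To illustrate the idea of keeping one snapshot per visit, here is a minimal sketch on toy data. The selection rule (shuffle, then keep the first row per visit) is an assumption for illustration; the actual logic inside `select_one_snapshot_per_visit` may differ.

```python
import pandas as pd

# Toy data: visit 101 has two snapshots, visit 102 has one
toy_visits = pd.DataFrame(
    {
        "visit_number": [101, 101, 102],
        "prediction_time": [(9, 30), (12, 0), (9, 30)],
    },
    index=pd.Index([0, 1, 2], name="snapshot_id"),
)

# Shuffle the rows reproducibly, then keep the first row seen for each visit
one_per_visit = (
    toy_visits.sample(frac=1, random_state=42)
    .drop_duplicates(subset="visit_number")
    .sort_index()
)
print(one_per_visit.shape)
```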


To train the specialty model, we use only a subset of the columns. The relevant columns for the UCLH or public data are shown below.

In [7]:
if uclh:
    display(visits_single[['consultation_sequence', 'final_sequence', 'specialty', 'is_admitted', 'age_on_arrival']].head(10))
else:
    display(visits_single[['consultation_sequence', 'final_sequence', 'specialty', 'is_admitted', 'age_group']].head(10))


snapshot_id,consultation_sequence,final_sequence,specialty,is_admitted,age_group
0,[],[],medical,False,55-64
2,[],[],surgical,False,75-102
3,[],[],medical,False,35-44
5,['haem_onc'],['haem_onc'],haem/onc,False,65-74
7,['surgical'],['surgical'],surgical,False,25-34
10,[],['haem_onc'],medical,False,65-74
11,['haem_onc'],['haem_onc'],medical,False,75-102
12,['haem_onc'],['haem_onc'],haem/onc,False,75-102
13,[],[],haem/onc,False,75-102
15,['ambulatory'],['ambulatory'],,False,0-17


We filter down to include only admitted patients, and remove any with a null value in the specialty column, since this is what the model aims to predict.

In [8]:
admitted = visits_single[
    (visits_single.is_admitted) & ~(visits_single.specialty.isnull())
]

A function called `get_default_visits` handles the next step. It uses a function defined in `prepare.py`, called `create_special_category_objects`, to identify any visit characteristics that will be handled differently when predicting specialty. At UCLH, we don't use a model for patients under 18; instead we assume that all under-18s will be admitted to a paediatric specialty. Their visits are therefore not relevant for model training, and we remove them here.

In [9]:
filtered_admitted = get_default_visits(admitted, uclh=uclh)
print(visits_single.shape)
print(filtered_admitted.shape)

(64456, 68)
(8022, 68)
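The effect of this step can be sketched on toy data. This is not the implementation of `get_default_visits`; it assumes that, in the public data, children are identified by the `age_group` value `0-17`, and the toy rows are invented.

```python
import pandas as pd

# Invented rows for illustration
toy_admitted = pd.DataFrame(
    {
        "age_group": ["0-17", "18-24", "65-74"],
        "specialty": ["paediatric", "medical", "surgical"],
    }
)

# Assumption: under-18s are flagged by age_group == "0-17"
is_child = toy_admitted["age_group"] == "0-17"

# Keep only adult visits for model training
adults_only = toy_admitted[~is_child]
print(adults_only)
```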


The consultation sequence (which is captured at the snapshot) and the final sequence have been loaded from CSV, and need to be converted to tuples. 

In [10]:
# convert consults data format from list to tuple (required input for SequencePredictor)
filtered_admitted.loc[:, "consultation_sequence"] = filtered_admitted[
    "consultation_sequence"
].apply(lambda x: tuple(x) if x else ())
filtered_admitted.loc[:, "final_sequence"] = filtered_admitted[
    "final_sequence"
].apply(lambda x: tuple(x) if x else ())
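One practical difference between lists and tuples is that tuples are hashable, so they can serve as dictionary keys. Whether this is the reason `SequencePredictor` requires tuples is an assumption. The sketch below also assumes a CSV cell holds the list as a string, parsed here with `ast.literal_eval`.

```python
from ast import literal_eval

raw = "['surgical', 'surgical']"  # how a list might appear in a CSV cell

as_list = literal_eval(raw)  # parse the string into a Python list
as_tuple = tuple(as_list) if as_list else ()

# A tuple can be used as a dictionary key
counts = {as_tuple: 1}

# A list cannot: using it as a key raises TypeError
try:
    {as_list: 1}
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

print(as_tuple, list_is_hashable)
```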

Note that some visits that ended in admission had no consult request at the time they were sampled. These visits have an empty tuple in the `consultation_sequence` column, as we can see below.

In [11]:
if uclh:
    display(filtered_admitted[['consultation_sequence', 'final_sequence', 'specialty', 'is_admitted', 'age_on_arrival']].head(10))
else:
    display(filtered_admitted[['consultation_sequence', 'final_sequence', 'specialty', 'is_admitted', 'age_group']].head(10))
    


snapshot_id,consultation_sequence,final_sequence,specialty,is_admitted,age_group
20,"('surgical',)","('surgical', 'surgical')",surgical,True,45-54
58,"('surgical',)","('surgical',)",surgical,True,35-44
75,(),"('acute',)",medical,True,65-74
115,"('surgical',)","('surgical',)",surgical,True,35-44
121,"('surgical',)","('surgical',)",surgical,True,25-34
123,(),"('surgical',)",surgical,True,75-102
137,(),"('surgical',)",surgical,True,65-74
164,"('acute',)","('acute',)",medical,True,65-74
177,(),"('surgical',)",medical,True,75-102
250,(),"('acute',)",medical,True,18-24


The UCLH data (not shared publicly) includes more detailed data on consult type, as shown in the `code` column in the dataset below. The public data has been simplified to a higher level (identified in the mapping below as `type`). 

In [12]:
model_input_path = root / 'src' / 'patientflow' / 'model-input'
name_mapping = pd.read_csv(model_input_path / 'consults-mapping.csv')
name_mapping

Unnamed: 0,id,code,name,type
0,1,CON124,Inpatient consult to Neuro Ophthalmology,neuro
1,2,CON9,Inpatient consult to Neurology,neuro
2,3,CON34,Inpatient consult to Dietetics (N&D) - Not TPN,allied
3,4,CON134,Inpatient consult to PERRT,icu
4,5,CON163,IP Consult to MCC Complementary Therapy Team,pain
...,...,...,...,...
111,112,CON77,Inpatient consult to Paediatric Allergy,paeds
112,113,CON168,Inpatient consult to Acute Oncology Service,haem_onc
113,114,CON84,Inpatient consult to Paediatric Hematology - C...,haem_onc
114,115,CON122,Inpatient consult to Paediatric Epilepsy Service,paeds


For example, the code for a consult with Acute Medicine is converted to a more general category in the public dataset.

In [13]:
name_mapping[name_mapping.code == 'CON157']

Unnamed: 0,id,code,name,type
14,15,CON157,Inpatient consult to Acute Medicine,acute


The medical group includes many of the more specific consult types.

In [14]:
name_mapping[name_mapping.type == 'medical']

Unnamed: 0,id,code,name,type
7,8,CON165,Inpatient consult to Nutrition Team (TPN),medical
10,11,CON54,Inpatient consult to Respiratory Medicine,medical
12,13,CON43,Inpatient consult to Cardiology,medical
15,16,CON5,Inpatient consult to Infectious Diseases,medical
17,18,CON132,Inpatient consult to Adult Diabetes CNS,medical
33,34,CON68,Inpatient consult to Gastroenterology,medical
37,38,CON60,Inpatient consult to Endocrinology,medical
48,49,CON156,Inpatient consult to Adult Endocrine & Diabetes,medical
62,63,CON44,Inpatient consult to Rheumatology,medical
66,67,CON147,Inpatient consult to Cardiac Rehabilitation,medical
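The simplification from detailed codes to higher-level types can be sketched as a dictionary lookup. The three code-to-type pairs are copied from the mapping tables above; the helper `simplify_sequence` is hypothetical and is not part of `patientflow`.

```python
# A few code-to-type pairs copied from the mapping tables above
code_to_type = {"CON157": "acute", "CON9": "neuro", "CON43": "medical"}


def simplify_sequence(codes):
    """Map each detailed consult code to its higher-level type."""
    return tuple(code_to_type[code] for code in codes)


simplified = simplify_sequence(("CON157", "CON43"))
print(simplified)
```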


## Separate into training, validation and test sets

As part of preparing the data, each visit has already been allocated to one of three sets: training, validation or test.


In [15]:
# select the rows belonging to each set using a boolean mask
train_visits = filtered_admitted[filtered_admitted.training_validation_test == 'train']
valid_visits = filtered_admitted[filtered_admitted.training_validation_test == 'valid']
test_visits = filtered_admitted[filtered_admitted.training_validation_test == 'test']


assert train_visits.snapshot_date.min() == start_training_set
assert train_visits.snapshot_date.max() < start_validation_set
assert valid_visits.snapshot_date.min() == start_validation_set
assert valid_visits.snapshot_date.max() < start_test_set
assert test_visits.snapshot_date.min() == start_test_set
assert test_visits.snapshot_date.max() < end_test_set
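The notebook does not show how the `training_validation_test` column was populated. The sketch below is one way a date-based allocation consistent with the assertions above could work. The helper `assign_set` is hypothetical; the boundary dates match those printed earlier, with `end_test_set` taken as the day after the last test date.

```python
from datetime import date

# Boundaries matching the dates printed earlier in this notebook
start_training_set = date(2031, 3, 1)
start_validation_set = date(2031, 9, 1)
start_test_set = date(2031, 10, 1)
end_test_set = date(2032, 1, 1)


def assign_set(snapshot_date):
    """Label a date as train, valid or test; None if outside all ranges."""
    if start_training_set <= snapshot_date < start_validation_set:
        return "train"
    if start_validation_set <= snapshot_date < start_test_set:
        return "valid"
    if start_test_set <= snapshot_date < end_test_set:
        return "test"
    return None


dates = [date(2031, 3, 1), date(2031, 9, 15), date(2031, 12, 31)]
labels = [assign_set(d) for d in dates]
print(labels)
```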

## Train the model

Here, we load `SequencePredictor`, a class that takes a sequence as input (in this case `consultation_sequence`), a grouping variable (in this case `final_sequence`) and an outcome variable (in this case `specialty`), and uses the grouping variable to create a rooted directed tree. Each new consult in the sequence is a branching node of the tree. The grouping variable, `final_sequence`, provides the terminal nodes of the tree. The model maps each partially complete sequence of consults, via each `final_sequence`, to a probability of ending in each specialty of admission.

In [17]:
from patientflow.predictors.sequence_predictor import SequencePredictor

In [18]:
spec_model = SequencePredictor(
    input_var="consultation_sequence",
    grouping_var="final_sequence",
    outcome_var="specialty",
)
spec_model.fit(train_visits)
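The two-step mapping described above (input sequence to final sequence, then final sequence to specialty) can be illustrated with simple counts. This is a simplified sketch on invented rows and is not the implementation of `SequencePredictor`, which may treat prefixes and unseen sequences differently. All names in the block are hypothetical.

```python
from collections import Counter, defaultdict

# Toy training rows: (consultation_sequence, final_sequence, specialty)
rows = [
    ((), ("acute",), "medical"),
    ((), ("acute",), "medical"),
    ((), ("surgical",), "surgical"),
    (("acute",), ("acute",), "medical"),
    (("surgical",), ("surgical",), "surgical"),
    (("surgical",), ("surgical", "surgical"), "surgical"),
]

# Count how often each input sequence leads to each final sequence,
# and how often each final sequence leads to each specialty
input_to_final = defaultdict(Counter)
final_to_specialty = defaultdict(Counter)
for input_seq, final_seq, specialty in rows:
    input_to_final[input_seq][final_seq] += 1
    final_to_specialty[final_seq][specialty] += 1


def normalise(counter):
    """Convert counts to proportions that sum to one."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}


def predict(input_seq):
    """P(specialty | input) = sum over final of P(final | input) * P(specialty | final)."""
    result = defaultdict(float)
    for final_seq, p_final in normalise(input_to_final[input_seq]).items():
        for specialty, p_spec in normalise(final_to_specialty[final_seq]).items():
            result[specialty] += p_final * p_spec
    return dict(result)


no_consult_probs = predict(())
print(no_consult_probs)
```

In this toy data, two of the three visits with no consult at the snapshot ended with an acute consult, and every visit with that final sequence was admitted to a medical specialty, so the medical probability for an empty input is two thirds.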



Passing an empty tuple to the trained model shows the probability of ending in each specialty, if a visit has had no consults yet. 

In [19]:
print("For a visit which has no consult at the time of a snapshot, the probabilities of ending up under a medical, surgical or haem/onc specialty are shown below")
print({k: round(v, 3) for k, v in spec_model.predict(tuple()).items()})

For a visit which has no consult at the time of a snapshot, the probabilities of ending up under a medical, surgical or haem/onc specialty are shown below
{'surgical': 0.27, 'medical': 0.631, 'haem/onc': 0.099}


The probabilities for each consult sequence ending in a given observed specialty have been saved in the model. These can be accessed as follows: 

In [20]:
weights = spec_model.weights
print("For a visit which has one consult to acute medicine at the time of a snapshot, the probabilities of ending up under a medical, surgical or haem/onc specialty are shown below")
if uclh:
    print({k: round(v, 3) for k, v in weights[tuple(['CON157'])].items()})
else:
    print({k: round(v, 3) for k, v in weights[tuple(['acute'])].items()})

For a visit which has one consult to acute medicine at the time of a snapshot, the probabilities of ending up under a medical, surgical or haem/onc specialty are shown below
{'surgical': 0.014, 'medical': 0.946, 'haem/onc': 0.04}


The intermediate mapping of `consultation_sequence` to `final_sequence` can be accessed from the trained model as shown below. The first row shows the probability of a null sequence (i.e. no consults yet) ending in each of the `final_sequence` options.

In [21]:
spec_model.input_to_grouping_probs

final_sequence,(),"('acute',)","('acute', 'acute')","('acute', 'acute', 'discharge')","('acute', 'acute', 'icu')","('acute', 'acute', 'medical')","('acute', 'acute', 'medical', 'surgical')","('acute', 'acute', 'mental_health')","('acute', 'acute', 'palliative')","('acute', 'acute', 'surgical')",...,"('surgical', 'surgical')","('surgical', 'surgical', 'acute')","('surgical', 'surgical', 'acute', 'mental_health', 'discharge', 'discharge')","('surgical', 'surgical', 'acute', 'surgical')","('surgical', 'surgical', 'icu')","('surgical', 'surgical', 'medical')","('surgical', 'surgical', 'obs_gyn')","('surgical', 'surgical', 'other')","('surgical', 'surgical', 'surgical')",probability_of_grouping_sequence
consultation_sequence,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
(),0.010216,0.457056,0.013621,0.000000,0.000000,0.000757,0.000378,0.000378,0.000000,0.000378,...,0.008324,0.000757,0.000000,0.000378,0.000378,0.000000,0.000000,0.000000,0.000000,0.533616
"('acute',)",0.000000,0.829932,0.005831,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.207753
"('acute', 'acute')",0.000000,0.000000,0.851852,0.037037,0.037037,0.037037,0.000000,0.000000,0.037037,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.005451
"('acute', 'allied')",0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000202
"('acute', 'ambulatory')",0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000404
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"('surgical', 'icu')",0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000202
"('surgical', 'medical')",0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000404
"('surgical', 'obs_gyn')",0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000606
"('surgical', 'surgical')",0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.807692,0.000000,0.038462,0.000000,0.000000,0.038462,0.038462,0.038462,0.038462,0.005249


In [22]:
from joblib import dump, load

MODEL__ED_SPECIALTY__NAME = 'ed_specialty'

# use this name in the path for saving the model
full_path = model_file_path / MODEL__ED_SPECIALTY__NAME 
full_path = full_path.with_suffix('.joblib')

# save the model
dump(spec_model, full_path)

['/home/jovyan/work/patientflow/trained-models/ed_specialty.joblib']