# Modelling probability of admission to specialty, if admitted

This notebook demonstrates the second stage of prediction: generating, for each patient in the ED, a probability of admission to each specialty, conditional on the patient being admitted.

The input is each patient's sequence of consult requests, taken from a dataset that contains only patients who were later admitted.


## Set up the notebook environment

In [1]:
# Reload functions every time
%load_ext autoreload 
%autoreload 2

In [12]:
from pathlib import Path
import sys
import json
import pickle


PROJECT_ROOT = Path().home() 

# Patient flow package
USER_ROOT = Path().home() / 'work'
sys.path.append(str(USER_ROOT / 'patientflow' / 'src' / 'patientflow'))
sys.path.append(str(USER_ROOT / 'patientflow' / 'functions'))





In [13]:
model_file_path = PROJECT_ROOT /'data' / 'ed-predictor' / 'trained-models'
model_file_path

data_file_path = USER_ROOT / 'ed-predictor' / 'data-raw'
data_file_path

# Folder for saved probability distributions (used by the cells that pickle prob_dist_dict_all)
prob_dist_file_path = PROJECT_ROOT / 'dissemination' / 'model-output' / 'probability-distributions'
prob_dist_file_path.mkdir(parents=True, exist_ok=True)


PosixPath('/home/jovyan/work/ed-predictor/data-raw')

In [16]:
data_file_path

PosixPath('/home/jovyan/work/ed-predictor/data-raw')

## Load parameters

These are set in config.yaml. You can change them for your own purposes, but the times of day must match those in the provided dataset if you want to run this notebook successfully.

In [6]:
# Load the times of day
import yaml

config_path = Path(USER_ROOT / 'patientflow')

with open(config_path / 'config.yaml', 'r') as file:
    config = yaml.safe_load(file)
    
# Convert list of times of day at which predictions will be made (currently stored as lists) to list of tuples
prediction_times = [tuple(item) for item in config['prediction_times']]


[(6, 0), (9, 30), (12, 0), (15, 30), (22, 0)]

## Load data

In [9]:
str(data_file_path) + '/ED_visits.csv'

'/home/jovyan/ed-predictor/data-raw/ED_visits.csv'

In [40]:
from ed_admissions_data_retrieval import ed_admissions_get_data
path_admission_data = str(data_file_path) + '/ed_visits.csv'
path_spec_data = str(data_file_path) + '/specialty.csv'

df = ed_admissions_get_data(path_admission_data)
df_spec = ed_admissions_get_data(path_spec_data)

In [48]:
if df.index.name != 'snapshot_id':
    df = df.set_index('snapshot_id')
df.head()

if df_spec.index.name != 'snapshot_id':
    df_spec = df_spec.set_index('snapshot_id')
df_spec.head()

snapshot_id,visit_number,consultation_sequence,final_sequence,specialty
7,6,[CON47],[CON47],Oncology
8,7,[CON47],[CON47],Oncology
11,9,[CON47],[CON47],Oncology
12,12,[CON57],[CON57],Haematology
17,323,[CON15],[CON15],General Paediatrics


Note that some visits that ended in admission had no consult request at the time they were sampled, as shown below.

In [42]:
df_spec[df_spec.consultation_sequence.apply(lambda x: x == [])]


Unnamed: 0,snapshot_id,visit_number,consultation_sequence,final_sequence,specialty
8,21,325,[],[CON4],Gastrointestinal Surgery
15,44,345,[],[CON157],Acute Medicine
20,49,346,[],[CON4],Gastrointestinal Surgery
19,48,346,[],[CON4],Gastrointestinal Surgery
22,69,362,[],[CON157],Acute Medicine
...,...,...,...,...,...
42035,271636,215422,[],[CON157],
42039,271644,215433,[],"[CON157, CON250, CON158]",
42043,271648,215434,[],[CON176],
42042,271647,215434,[],[CON176],


The summary below shows that some visits have many records in the specialty dataset.

In [37]:
df_spec['visit_number'].value_counts().to_frame()

visit_number,count
63099,15
168295,15
63149,15
56563,14
193236,14
...,...
98077,1
98078,1
98079,1
98080,1


To handle this, we apply the same selection used for the admission models, which keeps only one episode slice per visit, and train the specialty model on those same episode slices.

In [49]:
from ed_admissions_utils import select_one_snapshot_per_visit

df_single = select_one_snapshot_per_visit(df)

print(df.shape)
print(df_single.shape)

(271686, 68)
(185339, 67)


Now select only those df_spec rows that are in df_single.

In [55]:
print(df_spec.shape)
df_spec_single = df_spec.loc[df_spec.index.isin(df_single.index)].copy()
print(df_spec_single.shape)

(42051, 4)
(17878, 4)
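
The two steps above (keep one snapshot per visit, then keep only the matching specialty rows) can be illustrated without pandas. This is a minimal sketch on made-up identifiers; it assumes the snapshot is chosen at random, which may not be how `select_one_snapshot_per_visit` actually chooses. The function `one_snapshot_per_visit` and the toy dictionaries are hypothetical.

```python
import random

def one_snapshot_per_visit(snapshot_to_visit, seed=42):
    """Pick a single snapshot_id at random for every visit_number (assumed rule)."""
    rng = random.Random(seed)
    by_visit = {}
    for snapshot_id, visit_number in snapshot_to_visit.items():
        by_visit.setdefault(visit_number, []).append(snapshot_id)
    return {rng.choice(sorted(ids)) for ids in by_visit.values()}

# Hypothetical snapshot_id -> visit_number mapping for the admissions data
admissions = {7: 6, 8: 7, 9: 7, 11: 9, 12: 12, 13: 12, 14: 12}
kept = one_snapshot_per_visit(admissions)

# Specialty rows keyed by snapshot_id; keep only the sampled snapshots
specialty_rows = {7: "Oncology", 8: "Oncology", 11: "Oncology", 13: "Haematology"}
specialty_single = {sid: spec for sid, spec in specialty_rows.items() if sid in kept}

print(sorted(kept))
print(specialty_single)
```

Filtering by membership of the index, as the cell above does with `isin`, guarantees that the specialty model only sees snapshots that the admission model could also have seen.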


As the sequence predictor requires tuples as input, we will convert the lists to tuples

In [60]:
df_spec_single['consultation_sequence'] = df_spec_single['consultation_sequence'].apply(lambda x: tuple(x))
df_spec_single['final_sequence'] = df_spec_single['final_sequence'].apply(lambda x: tuple(x))
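
One likely reason for requiring tuples is that a sequence has to serve as a dictionary key, and Python lists are not hashable. The sketch below uses made-up sequences to show the difference; it does not depend on the predictor itself.

```python
# Toy consult sequences (hypothetical values, same shape as the columns above)
sequences = [["CON157"], ["CON157", "CON250"], []]

# A list cannot be used as a dictionary key because it is mutable
try:
    lookup = {sequences[0]: 1}
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# Tuples are immutable and hashable, so they work as keys
as_tuples = [tuple(seq) for seq in sequences]
counts = {}
for seq in as_tuples:
    counts[seq] = counts.get(seq, 0) + 1

print(list_is_hashable)
print(counts)
```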


Drop any rows with no specialty data

In [78]:
print(len(df_spec_single[df_spec_single.specialty.isnull()]))
df_spec_single = df_spec_single[~df_spec_single.specialty.isnull()]

196


## Separate into training, validation and test sets

As part of preparing the data, each visit has already been allocated to one of three sets: training, validation and test. This has been done chronologically, as shown by the output below. A chronological split is appropriate for tasks where the model needs to be validated on unseen, future data.


In [79]:
train_df_spec = df_spec_single.loc[df_spec_single.index.isin(df[df.training_validation_test == 'train'].index)]
valid_df_spec = df_spec_single.loc[df_spec_single.index.isin(df[df.training_validation_test == 'valid'].index)]
test_df_spec = df_spec_single.loc[df_spec_single.index.isin(df[df.training_validation_test == 'test'].index)]
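
For readers who want to reproduce a chronological allocation on their own data, the idea can be sketched as below. The cut-off dates and the function `allocate_set` are hypothetical; the boundaries actually used for the `training_validation_test` column are not stated in this notebook.

```python
from datetime import date

# Hypothetical cut-off dates, for illustration only
TRAIN_END = date(2031, 9, 1)
VALID_END = date(2031, 11, 1)

def allocate_set(visit_date):
    """Assign a visit to train, valid or test purely by its date."""
    if visit_date < TRAIN_END:
        return "train"
    if visit_date < VALID_END:
        return "valid"
    return "test"

# Toy visit_number -> arrival date mapping
visits = {101: date(2031, 4, 12), 102: date(2031, 9, 30), 103: date(2031, 12, 5)}
allocation = {visit: allocate_set(d) for visit, d in visits.items()}
print(allocation)
```

Because every test visit is later than every training visit, the evaluation mimics how the model would be used in practice: trained on the past and applied to the future.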


## Train a rooted directed tree

In [58]:
from predict.emergency_demand.specialty_of_admission import SequencePredictor

In [None]:
model = SequencePredictor(input_var = 'consultation_sequence',
                       grouping_var = 'final_sequence',
                       outcome_var = 'specialty')
model.fit(train_df_spec)



In [87]:
model.predict(tuple(['CON157'])) 

{'Oncology': 0.03337282780410743,
 'Haematology': 0.002567140600315956,
 'General Paediatrics': 0.00039494470774091627,
 'Acute Medicine': 0.7571090047393365,
 'Gastrointestinal Surgery': 0.004936808846761454,
 'Urology': 0.0007898894154818326,
 'Gastrointestinal Medicine': 0.05134281200631913,
 'Adult ENT': 0.0003949447077409163,
 'Gynaecology': 0.0003949447077409163,
 'Care Of the Elderly': 0.027053712480252772,
 'Infection': 0.04680094786729859,
 'Head and Neck': 0.00019747235387045816,
 'Trauma & Orthopaedics': 0.0017772511848341231,
 'Respiratory Medicine': 0.03080568720379147,
 'Clinical Pharmacology': 0.018562401263823063,
 'Paediatric Surgery': 0.00019747235387045813,
 'Neurology': 0.00217219589257504,
 'Accident & Emergency': 0.005331753554502369,
 'Rheumatology': 0.003554502369668247,
 'Rehab & Stroke': 0.006911532385466036,
 'Neurosurgery': 0.0027646129541864144,
 'Paediatric ENT': 0.0,
 'Maternity': 0.0003949447077409163,
 'Children & Young Peoples Cancer': 0.00019747235387

The probabilities for each consult sequence ending in a given observed specialty have been saved in the model. These can be accessed as follows: 

In [88]:
weights = model.weights
weights[tuple(['CON157'])]

{'Oncology': 0.03337282780410743,
 'Haematology': 0.002567140600315956,
 'General Paediatrics': 0.00039494470774091627,
 'Acute Medicine': 0.7571090047393365,
 'Gastrointestinal Surgery': 0.004936808846761454,
 'Urology': 0.0007898894154818326,
 'Gastrointestinal Medicine': 0.05134281200631913,
 'Adult ENT': 0.0003949447077409163,
 'Gynaecology': 0.0003949447077409163,
 'Care Of the Elderly': 0.027053712480252772,
 'Infection': 0.04680094786729859,
 'Head and Neck': 0.00019747235387045816,
 'Trauma & Orthopaedics': 0.0017772511848341231,
 'Respiratory Medicine': 0.03080568720379147,
 'Clinical Pharmacology': 0.018562401263823063,
 'Paediatric Surgery': 0.00019747235387045813,
 'Neurology': 0.00217219589257504,
 'Accident & Emergency': 0.005331753554502369,
 'Rheumatology': 0.003554502369668247,
 'Rehab & Stroke': 0.006911532385466036,
 'Neurosurgery': 0.0027646129541864144,
 'Paediatric ENT': 0.0,
 'Maternity': 0.0003949447077409163,
 'Children & Young Peoples Cancer': 0.00019747235387

In [None]:
model.input_to_grouping_probs
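
To make the stored quantities more concrete, here is a minimal sketch of one way such weights could be computed. It is an assumption about the approach, not the `SequencePredictor` source: it estimates P(final sequence | input sequence) and P(specialty | final sequence) from counts, combines them, and falls back towards the root of the tree by dropping the last consult when a sequence has not been seen. The functions `fit_sequence_model` and `predict_from_weights` and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict

def fit_sequence_model(rows):
    """rows: iterable of (input_sequence, final_sequence, specialty) tuples.

    Returns {input_sequence: {specialty: probability}} using
    P(specialty | input) = sum over final of P(final | input) * P(specialty | final).
    """
    final_given_input = defaultdict(Counter)
    specialty_given_final = defaultdict(Counter)
    for input_seq, final_seq, specialty in rows:
        final_given_input[input_seq][final_seq] += 1
        specialty_given_final[final_seq][specialty] += 1

    weights = {}
    for input_seq, finals in final_given_input.items():
        n_input = sum(finals.values())
        probs = defaultdict(float)
        for final_seq, n_final in finals.items():
            spec_counts = specialty_given_final[final_seq]
            n_spec = sum(spec_counts.values())
            for specialty, n in spec_counts.items():
                probs[specialty] += (n_final / n_input) * (n / n_spec)
        weights[input_seq] = dict(probs)
    return weights

def predict_from_weights(weights, input_seq):
    """Drop the last consult until a known sequence (or the empty root) is reached."""
    seq = tuple(input_seq) if input_seq else ()
    while seq not in weights and seq:
        seq = seq[:-1]
    return weights.get(seq, {})

toy_rows = [
    ((), ("CON157",), "Acute Medicine"),
    ((), ("CON4",), "Gastrointestinal Surgery"),
    (("CON157",), ("CON157",), "Acute Medicine"),
    (("CON157",), ("CON157", "CON250"), "Infection"),
]
toy_weights = fit_sequence_model(toy_rows)
print(predict_from_weights(toy_weights, ("CON157",)))
print(predict_from_weights(toy_weights, ("CON157", "CON999")))  # unseen, falls back
print(predict_from_weights(toy_weights, None))                  # no consult yet
```

The fallback is what makes the structure a rooted tree: every sequence has a parent obtained by removing its last element, and the empty sequence is the root.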

In [89]:
from joblib import dump, load

MODEL__ED_SPECIALTY__NAME = 'ed_specialty'

# use this name in the path for saving the model
full_path = model_file_path / MODEL__ED_SPECIALTY__NAME 
full_path = full_path.with_suffix('.joblib')

# save the model
dump(model, full_path)

['/home/jovyan/data/ed-predictor/trained-models/ed_specialty.joblib']

## Do inference on the test set

In [90]:
from ed_admissions_helper_functions import prepare_for_inference
model = prepare_for_inference(model_file_path, 'ed_specialty', model_only = True)

In [91]:
model.predict(None)

{'Oncology': 0.05328484465175114,
 'Haematology': 0.046881176377579256,
 'General Paediatrics': 0.10649063167048779,
 'Acute Medicine': 0.3605028065459721,
 'Gastrointestinal Surgery': 0.09067910506759429,
 'Urology': 0.045695311882362255,
 'Gastrointestinal Medicine': 0.026009961261759827,
 'Adult ENT': 0.043244525258913744,
 'Gynaecology': 0.05320578701873667,
 'Care Of the Elderly': 0.012174875484228005,
 'Infection': 0.039607874140248245,
 'Head and Neck': 0.01517906553877777,
 'Trauma & Orthopaedics': 0.024349750968456006,
 'Respiratory Medicine': 0.01383508577753182,
 'Clinical Pharmacology': 0.009882204126808443,
 'Paediatric Surgery': 0.008221993833504625,
 'Neurology': 0.002608901889477429,
 'Accident & Emergency': 0.009882204126808444,
 'Rheumatology': 0.0026089018894774295,
 'Rehab & Stroke': 0.01454660447466203,
 'Neurosurgery': 0.006482725907186339,
 'Paediatric ENT': 0.0024507866234484946,
 'Maternity': 0.0022136137244050916,
 'Children & Young Peoples Cancer': 0.00086963

In [92]:
# Work on an explicit copy so the new column is not assigned to a slice of df_spec_single
test_df_spec = test_df_spec.copy()
test_df_spec['predicted_specialty'] = test_df_spec['consultation_sequence'].apply(lambda x: model.predict(x)).apply(lambda x: max(x, key=x.get))


In [94]:
test_df_spec

snapshot_id,visit_number,consultation_sequence,final_sequence,specialty,predicted_specialty
207972,158657,"(CON15, CON13)","(CON15, CON13)",General Paediatrics,General Paediatrics
207984,158666,(),"(CON13,)",Adult ENT,Acute Medicine
208001,158678,(),"(CON157,)",Acute Medicine,Acute Medicine
208065,158731,"(CON4,)","(CON4,)",Gastrointestinal Surgery,Gastrointestinal Surgery
208084,158750,"(CON4,)","(CON4,)",Gastrointestinal Surgery,Gastrointestinal Surgery
...,...,...,...,...,...
271512,207500,"(CON30404001, CON157)","(CON30404001, CON157, 30413CON7502)",Acute Medicine,Acute Medicine
271520,207511,(),"(CON157,)",Acute Medicine,Acute Medicine
271548,207543,"(CON157,)","(CON157,)",Acute Medicine,Acute Medicine
271558,207554,(),"(CON157, CON251)",Acute Medicine,Acute Medicine
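
The `predicted_specialty` column takes the most probable specialty from each predicted dictionary. The sketch below shows that step on made-up, truncated distributions and computes a simple top-1 accuracy. The helper `most_likely` and the toy examples are hypothetical, and the resulting number says nothing about the real model.

```python
def most_likely(prob_dict):
    """Return the key with the highest probability (ties go to the first key seen)."""
    return max(prob_dict, key=prob_dict.get)

# Hypothetical (observed specialty, predicted distribution) pairs
examples = [
    ("Acute Medicine", {"Acute Medicine": 0.76, "Infection": 0.05, "Oncology": 0.03}),
    ("Adult ENT", {"Acute Medicine": 0.36, "Adult ENT": 0.04, "General Paediatrics": 0.11}),
    ("Oncology", {"Oncology": 0.60, "Haematology": 0.30, "Acute Medicine": 0.10}),
]

hits = sum(most_likely(probs) == observed for observed, probs in examples)
accuracy = hits / len(examples)
print(f"top-1 accuracy: {accuracy:.2f}")
```

Note that a patient with an empty consult sequence always receives the most common specialty overall, which is why the Adult ENT row in the table above is predicted as Acute Medicine.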


## Generate a probability distribution for the number of admissions to each specialty

Here we follow the same approach as for predicting overall admission numbers by time of day (see notebook). However, we now want a separate prediction for each specialty. To do this, we iterate through each specialty, retrieving each patient's probability of being admitted to that specialty, if admitted.
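
As an illustration of the underlying idea, suppose each patient is admitted to a given specialty independently, with probability equal to their admission probability multiplied by their specialty weight. The number of beds needed then follows a Poisson binomial distribution, which can be built up one patient at a time. This is a sketch under that assumption; whether `get_prob_dist` is implemented this way is not confirmed here, and `weighted_admission_distribution` is a hypothetical name.

```python
def weighted_admission_distribution(admit_probs, specialty_weights):
    """Distribution of the number of admissions to one specialty.

    Each patient is an independent Bernoulli trial with success probability
    admit_prob * specialty_weight. Element k of the result is P(exactly k admissions).
    """
    dist = [1.0]  # with no patients, zero admissions is certain
    for p_admit, w_spec in zip(admit_probs, specialty_weights):
        p = p_admit * w_spec
        new_dist = [0.0] * (len(dist) + 1)
        for k, prob_k in enumerate(dist):
            new_dist[k] += prob_k * (1 - p)   # this patient is not admitted here
            new_dist[k + 1] += prob_k * p     # this patient is admitted here
        dist = new_dist
    return dist

# Made-up probabilities for three patients currently in the ED
admit_probs = [0.9, 0.5, 0.2]
medical_weights = [0.8, 0.5, 0.1]
dist = weighted_admission_distribution(admit_probs, medical_weights)
for k, prob in enumerate(dist):
    print(k, round(prob, 4))
```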


In [None]:

## New function - not yet run
## STILL TO DO - test this against all other notebooks that use this approach
## STILL TO DO - need to include dates and prediction_times without patients in ED at that time

import pandas as pd

from ed_admissions_helper_functions import prepare_for_inference, get_model_name
from ed_admissions_helper_functions import get_specialty_probs, prepare_episode_slices_dict
from predict.emergency_demand.from_individual_probs import get_prob_dist



child_age_group = '0-17'
child_dict = {
    'medical': 0.0,
    'surgical': 0.0,
    'haem_onc': 0.0,
    'paediatric': 1.0
}

# Function to determine if the patient is a child
# This can be customized to any complex logic necessary
def is_child_func(row):
    return row['age_group'] == child_age_group  # or row['age'] <= 17



prob_dist_dict_all = {}

for prediction_time_ in prediction_times:

    print("\nProcessing :" + str(prediction_time_))
    
    # get model name for this time of day
    MODEL__ED_ADMISSIONS__NAME = get_model_name('ed_admission', prediction_time_)
    
    # initialise a dictionary to save specialty predictions
    prob_dist_dict_all[MODEL__ED_ADMISSIONS__NAME] = {}

    # prepare data 
    X_test, y_test, model = prepare_for_inference(model_file_path, 'ed_admission', prediction_time = prediction_time_, data_path = path_admission_data, single_episode_slice_per_visit = False)
    
    # get data on probability of admission to each specialty
    # df_spec as loaded above has a 'specialty' column, not 'observed_specialty'
    X_test_spec = pd.merge(X_test[['age_group']], df_spec[['consultation_sequence', 'specialty']], left_index=True, right_index=True, how='left')
    
    # this function will return a dictionary of probabilities for each 
    X_test_spec['specialty_prob'] = get_specialty_probs(model_file_path, X_test_spec, special_category_func=is_child_func, special_category_dict=child_dict)
    
    for spec_ in ['medical', 'surgical', 'haem_onc', 'paediatric']:
        print("\nProcessing probability of admission to:" + spec_ )
        
        # get the probability of admission to this specialty for all patients
        weights = X_test_spec['specialty_prob'].apply(lambda x: x.get(spec_))
        
        # select only the episode slices that pertain to children or adults, as appropriate
        if spec_ == 'paediatric':
            episode_slices_dict = prepare_episode_slices_dict(df[(df.training_validation_test == 'test') & (df.prediction_time == prediction_time_) & (df.age_group == '0-17')])
        else:
            episode_slices_dict = prepare_episode_slices_dict(df[(df.training_validation_test == 'test') & (df.prediction_time == prediction_time_) & (df.age_group != '0-17')])
            
        # get probability distribution for this time of day
        prob_dist_dict_all[MODEL__ED_ADMISSIONS__NAME][spec_] = get_prob_dist(
            episode_slices_dict, X_test, y_test, model, weights
        )
        
    # use model name in the path for saving the prob dist
    full_path = prob_dist_file_path / str(MODEL__ED_ADMISSIONS__NAME + '_with_spec') 
    full_path = full_path.with_suffix('.pickle')
        
    with open(full_path, 'wb') as f:  # Note the 'wb' mode for binary writing
        pickle.dump(prob_dist_dict_all[MODEL__ED_ADMISSIONS__NAME], f)
    
            

In [None]:
prob_dist_dict_all[MODEL__ED_ADMISSIONS__NAME]['medical']['2023-01-26']['pred_demand'].head(10)

In [None]:
prob_dist_dict_all[MODEL__ED_ADMISSIONS__NAME]['surgical']['2023-01-26']['pred_demand'].head(10)
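
The loop above passes `special_category_func` and `special_category_dict` to `get_specialty_probs`. A minimal sketch of how such an override might behave is shown below: patients flagged as children receive the fixed distribution in `child_dict`, and everyone else receives the sequence model's prediction. The function `specialty_probs_for_patient`, the stand-in `toy_predict`, and the adult age group label are hypothetical; the real helper may differ.

```python
def specialty_probs_for_patient(row, predict_fn, is_special, special_probs):
    """Return the special-category distribution if the patient qualifies,
    otherwise the sequence model's prediction for their consult sequence."""
    if is_special(row):
        return dict(special_probs)
    return predict_fn(row["consultation_sequence"])

child_dict = {"medical": 0.0, "surgical": 0.0, "haem_onc": 0.0, "paediatric": 1.0}

def is_child(row):
    return row["age_group"] == "0-17"

def toy_predict(sequence):
    # Stand-in for a trained model; the values are made up
    return {"medical": 0.7, "surgical": 0.2, "haem_onc": 0.1, "paediatric": 0.0}

patients = [
    {"age_group": "0-17", "consultation_sequence": ("CON15",)},
    {"age_group": "65-74", "consultation_sequence": ("CON157",)},
]
probs = [specialty_probs_for_patient(p, toy_predict, is_child, child_dict) for p in patients]
print(probs[0]["paediatric"], probs[1]["medical"])
```

Handling children by rule rather than by model keeps paediatric demand separate from the adult specialties, which matches the way the loop splits the episode slices by age group.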

## Plot one horizon date as an example

In [None]:
from viz.prob_dist_plot import prob_dist_plot

# horizon_dts is not defined in this notebook, so the example date is set explicitly
horizon_date = '2023-01-26'

for spec_ in ['medical', 'surgical', 'haem_onc', 'paediatric']:

    title_ = f'Probability distribution for beds needed in {spec_} specialties\n for patients in ED at {horizon_date} {MODEL__ED_ADMISSIONS__NAME[-4:]}'
    prob_dist_plot(prob_dist_data=prob_dist_dict_all[MODEL__ED_ADMISSIONS__NAME][spec_][horizon_date]['pred_demand'], title_=title_,  include_titles=True)

## Adding in a time window
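
The cell below multiplies each patient's specialty weight by the probability that, if admitted, they are admitted within the time window. The sketch here illustrates that calculation with a deliberately simplified curve: straight lines through (0, 0), (4, 0.76) and (12, 0.99), using the same `x1`, `y1`, `x2`, `y2` values as the cell. The shape of the real aspirational curve and the internals of `calculate_probability` are assumptions; `curve` and `prob_admitted_in_window` are hypothetical names.

```python
def curve(hours, x1=4.0, y1=0.76, x2=12.0, y2=0.99):
    """Cumulative probability of having been admitted by `hours` after arrival.

    Simplified stand-in: piecewise linear through (0, 0), (x1, y1) and (x2, y2),
    held flat at y2 afterwards.
    """
    if hours <= 0:
        return 0.0
    if hours <= x1:
        return y1 * hours / x1
    if hours <= x2:
        return y1 + (y2 - y1) * (hours - x1) / (x2 - x1)
    return y2

def prob_admitted_in_window(elapsed_hrs, window_hrs):
    """P(admitted within the window | still in the ED after elapsed_hrs)."""
    already = curve(elapsed_hrs)
    by_end = curve(elapsed_hrs + window_hrs)
    if already >= 1.0:
        return 0.0
    return (by_end - already) / (1.0 - already)

# A patient who has been in the ED for 2 hours, with an 8 hour window
spec_weight = 0.7  # made-up probability of admission to one specialty group
window_weight = prob_admitted_in_window(elapsed_hrs=2.0, window_hrs=8.0)
combined = window_weight * spec_weight
print(round(window_weight, 4), round(combined, 4))
```

Multiplying the two weights treats the timing of admission and the destination specialty as independent, which is the same simplification made by `weights = time_window_weights*spec_weights` in the cell below.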

In [None]:
## NOT RUN

import pandas as pd

from ed_admissions_helper_functions import prepare_for_inference, get_model_name
from ed_admissions_helper_functions import get_specialty_probs, prepare_episode_slices_dict
from predict.emergency_demand.from_individual_probs import get_prob_dist
from predict.emergency_demand.admission_in_time_window_using_aspirational_curve import calculate_probability



child_age_group = '0-17'
child_dict = {
    'medical': 0.0,
    'surgical': 0.0,
    'haem_onc': 0.0,
    'paediatric': 1.0
}

# Function to determine if the patient is a child
# This can be customized to any complex logic necessary
def is_child_func(row):
    return row['age_group'] == child_age_group  # or row['age'] <= 17



prob_dist_dict_all = {}

for prediction_time_ in prediction_times:

    print("\nProcessing :" + str(prediction_time_))
    
    # get model name for this time of day
    MODEL__ED_ADMISSIONS__NAME = get_model_name('ed_admission', prediction_time_)
    
    # initialise a dictionary to save specialty predictions
    prob_dist_dict_all[MODEL__ED_ADMISSIONS__NAME] = {}

    # prepare data 
    X_test, y_test, model = prepare_for_inference(model_file_path, 'ed_admission', prediction_time = prediction_time_, data_path = path_admission_data, single_episode_slice_per_visit = False)
    
    # get data on probability of admission to each specialty
    # df_spec as loaded above has a 'specialty' column, not 'observed_specialty'
    X_test_spec = pd.merge(X_test[['age_group']], df_spec[['consultation_sequence', 'specialty']], left_index=True, right_index=True, how='left')
    
    # this function will return a dictionary of probabilities for each 
    X_test_spec['specialty_prob'] = get_specialty_probs(model_file_path, X_test_spec, special_category_func=is_child_func, special_category_dict=child_dict)
    
    # get probability of admission in time window
    X_test_admission_in_window_prob = X_test[['elapsed_los_td']].copy()
    time_window_hrs = config['time_window']/60
    X_test_admission_in_window_prob['elapsed_los_td_hrs'] = X_test_admission_in_window_prob['elapsed_los_td']/3600
    time_window_weights = X_test_admission_in_window_prob.apply(lambda row: calculate_probability(row['elapsed_los_td_hrs'], time_window_hrs, x1 = 4, y1 = 0.76, x2 = 12, y2 = .99), axis=1)
    
    for spec_ in ['medical', 'surgical', 'haem_onc', 'paediatric']:
        print("\nProcessing probability of admission to:" + spec_ )
        
        # get the probability of admission to this specialty for all patients
        spec_weights = X_test_spec['specialty_prob'].apply(lambda x: x.get(spec_))
        
        # multiply the weights
        weights = time_window_weights*spec_weights
        
        # select only the episode slices that pertain to children or adults, as appropriate
        if spec_ == 'paediatric':
            episode_slices_dict = prepare_episode_slices_dict(df[(df.training_validation_test == 'test') & (df.prediction_time == prediction_time_) & (df.age_group == '0-17')])
        else:
            episode_slices_dict = prepare_episode_slices_dict(df[(df.training_validation_test == 'test') & (df.prediction_time == prediction_time_) & (df.age_group != '0-17')])
            
        # get probability distribution for this time of day
        prob_dist_dict_all[MODEL__ED_ADMISSIONS__NAME][spec_] = get_prob_dist(
            episode_slices_dict, X_test, y_test, model, weights
        )
        
    # use model name in the path for saving the prob dist
    full_path = prob_dist_file_path / str(MODEL__ED_ADMISSIONS__NAME + '_in_time_window_with_spec') 
    full_path = full_path.with_suffix('.pickle')
        
    with open(full_path, 'wb') as f:  # Note the 'wb' mode for binary writing
        pickle.dump(prob_dist_dict_all[MODEL__ED_ADMISSIONS__NAME], f)
    
            