# Modelling probability of admission to specialty, if admitted

This notebook demonstrates the second stage of prediction: generating, for each patient in the ED, a probability of admission to each specialty if they are admitted.

The input is each patient's sequence of consult requests, taken from a dataset that contains only patients who were later admitted.


## Set up the notebook environment

In [None]:
# Reload functions every time
%load_ext autoreload 
%autoreload 2

In [None]:
from pathlib import Path
import sys
import json
import pickle


PROJECT_ROOT = Path.home() / 'HyMind'

# Patient flow package
USER_ROOT = Path.home()
sys.path.append(str(USER_ROOT / 'patientflow' / 'patientflow'))

# Functions that sit outside the package
sys.path.append(str(USER_ROOT / 'patientflow' / 'functions'))




In [None]:
model_file_path = PROJECT_ROOT / 'dissemination' / 'model-output' / 'trained-models'
model_file_path

prob_dist_file_path = PROJECT_ROOT / 'dissemination' / 'model-output' / 'probability-distributions'
prob_dist_file_path.mkdir(parents=True, exist_ok=True)

data_file_path = PROJECT_ROOT / 'dissemination' / 'data-raw'
data_file_path

## Load parameters

These are set in config.yaml. You can change them for your own purposes, but the times of day must match those in the provided dataset if you want to run this notebook successfully.

In [None]:
# Load the times of day
import yaml

config_path = Path(PROJECT_ROOT / 'dissemination')

with open(config_path / 'config.yaml', 'r') as file:
    config = yaml.safe_load(file)
    
# Convert the times of day at which predictions will be made (stored as lists in the config) to a list of tuples
prediction_times = [tuple(item) for item in config['prediction_times']]

# See the times of day at which predictions will be made
prediction_times
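
Converting each time to a tuple makes it hashable, so it can be used as a dictionary key, which a list cannot. A minimal sketch of the conversion, using made-up times rather than the values in config.yaml (`raw_times`, `prediction_times_demo` and `results_by_time` are illustrative names only):

```python
# Hypothetical prediction times as they might look after parsing a YAML file
# (YAML sequences load as Python lists); these are not the values in config.yaml
raw_times = [[6, 0], [9, 30], [12, 0]]

# Same conversion as in the cell above: one (hour, minute) tuple per time of day
prediction_times_demo = [tuple(item) for item in raw_times]

# Tuples are hashable, so they can key a dictionary of per-time results
results_by_time = {time_: None for time_ in prediction_times_demo}

print(prediction_times_demo)
```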

## Load data

In [None]:
from ed_admissions_data_retrieval import ed_admissions_get_data
path_admission_data = 'HyMind/dissemination/data-raw/ED_visits.csv'
path_spec_data = 'HyMind/dissemination/data-raw/specialty.csv'

df = ed_admissions_get_data(path_admission_data)
df_spec = ed_admissions_get_data(path_spec_data)

Note that many visits had no consult request at the time they were sampled, as we can see just from the first 10 rows.

In [None]:
df_spec.head(10)


Also, there are more admitted patient records in the main dataset than in the specialty dataset. 

In [None]:
print("Number of episode slices in main dataset involving visits by patients that were later admitted")
print(len(df[(df.is_admitted) & (df.age_group != '0-17')]))

print("\nNumber of unique visits in main dataset involving patients that were later admitted")
print(len(df.loc[(df.is_admitted) & (df.age_group != '0-17'), 'visit_number'].unique()))

print("\nNumber of records in specialty dataset")
print(len(df_spec))

print("\nNumber of unique visits in specialty dataset")
print(len(df_spec['visit_number'].unique()))

From the summary below, there are some patients with many records in the specialty dataset.

In [None]:
df_spec['visit_number'].value_counts().to_frame()

To handle this, we'll apply the same selection used for the admission models, which were trained on only one episode slice per visit, and train the specialty model on the same episode slices.

In [None]:
from ed_admissions_utils import select_one_episode_slice_per_visit

df = ed_admissions_get_data(path_admission_data)
df_spec = ed_admissions_get_data(path_spec_data)

df_single = select_one_episode_slice_per_visit(df)

print(df.shape)
print(df_single.shape)
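
The body of `select_one_episode_slice_per_visit` is not shown in this notebook. As a rough sketch of the idea only, assuming the selection is a random draw of one slice per visit (an assumption, as are the function name `pick_one_slice_per_visit` and the toy records), it could look like this:

```python
import random
from collections import defaultdict

# Hypothetical records: (episode_slice_id, visit_number)
records = [(1, 'A'), (2, 'A'), (3, 'A'), (4, 'B'), (5, 'C'), (6, 'C')]


def pick_one_slice_per_visit(records, seed=42):
    """Return one randomly chosen episode slice id for each visit."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    slices_by_visit = defaultdict(list)
    for episode_slice_id, visit_number in records:
        slices_by_visit[visit_number].append(episode_slice_id)
    return {visit: rng.choice(ids) for visit, ids in slices_by_visit.items()}


selected = pick_one_slice_per_visit(records)

# Six episode slices reduce to three, one per visit
print(len(records), len(selected))
```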

Now select only those rows of df_spec that are also in df_single.

In [None]:
print(df_spec.shape)
df_spec_single = df_spec[df_spec.episode_slice_id.isin(df_single.episode_slice_id)].copy()
print(df_spec_single.shape)

## Set an index column in df

Setting episode_slice_id as the index before subsetting retains the same values of episode_slice_id throughout the process, so they stay consistent across the original dataset df and its training, validation and test subsets.

In [None]:
if df.index.name != 'episode_slice_id':
    df = df.set_index('episode_slice_id')
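
To illustrate why the index is set before subsetting, here is a small sketch on a made-up frame (`df_demo` and its values are hypothetical). The subset keeps the original identifier rather than being renumbered from zero:

```python
import pandas as pd

# Hypothetical data with the same column names as the notebook's datasets
df_demo = pd.DataFrame({
    'episode_slice_id': [101, 102, 103, 104],
    'training_validation_test': ['train', 'train', 'valid', 'test'],
})

# Set the identifier as the index once, before any subsetting
df_demo = df_demo.set_index('episode_slice_id')

# Boolean filtering preserves index labels, so the test row is still 104
test_demo = df_demo[df_demo.training_validation_test == 'test']
print(list(test_demo.index))
```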


After executing the code below, the episode_slice_id has been set as the index column.

In [None]:
if df_spec.index.name != 'episode_slice_id':
    df_spec = df_spec.set_index('episode_slice_id')
df_spec.head()

## Separate into training, validation and test sets

As part of preparing the data, each visit has already been allocated to one of three sets: training, validation and test. The allocation is chronological, which is appropriate for tasks where the model needs to be validated on unseen, future data.


In [None]:
train_df_spec = df_spec[df_spec.training_validation_test == 'train'].drop(columns='training_validation_test')
valid_df_spec = df_spec[df_spec.training_validation_test == 'valid'].drop(columns='training_validation_test')
test_df_spec = df_spec[df_spec.training_validation_test == 'test'].drop(columns='training_validation_test')
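
The allocation itself was done upstream and is not shown here. A sketch of how a chronological allocation might work, with hypothetical cut-off dates (`VALIDATION_START`, `TEST_START`) and a hypothetical helper `allocate_set`:

```python
from datetime import date

# Hypothetical cut-off dates; the dataset's real cut-offs are not shown in this notebook
VALIDATION_START = date(2023, 1, 1)
TEST_START = date(2023, 3, 1)


def allocate_set(visit_date):
    """Assign a visit to a set purely by date, so later sets hold later visits."""
    if visit_date < VALIDATION_START:
        return 'train'
    if visit_date < TEST_START:
        return 'valid'
    return 'test'


visit_dates = [date(2022, 11, 5), date(2023, 1, 20), date(2023, 4, 2)]
print([allocate_set(d) for d in visit_dates])
```

Because no test visit precedes a training visit, evaluation on the test set mimics predicting for future patients.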


## Train a rooted directed tree

In [None]:
from predict.emergency_demand.specialty_of_admission import SequencePredictor

In [None]:
model = SequencePredictor(input_var = 'consultation_sequence',
                       grouping_var = 'final_sequence',
                       outcome_var = 'observed_specialty')
model.fit(train_df_spec)
model.predict(tuple(['surgical'])) 


The probabilities for each consult sequence ending in a given observed specialty have been saved in the model. These can be accessed as follows: 

In [None]:
weights = model.weights
weights[tuple(['surgical'])]

In [None]:
model.input_to_grouping_probs
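
The internals of `SequencePredictor` are not shown in this notebook. As a simplified sketch of the underlying idea only, the probability of each specialty given a consult sequence can be estimated from relative frequencies in the training data. This sketch ignores the grouping variable (`final_sequence`) that the real model uses, and `fit_sequence_probs` and the toy rows are hypothetical:

```python
from collections import Counter, defaultdict


def fit_sequence_probs(rows):
    """Estimate P(specialty | consult sequence) from (sequence, specialty) pairs."""
    counts = defaultdict(Counter)
    for sequence, specialty in rows:
        counts[tuple(sequence)][specialty] += 1
    return {
        sequence: {spec: n / sum(counter.values()) for spec, n in counter.items()}
        for sequence, counter in counts.items()
    }


# Hypothetical training rows: (consult sequence at sampling time, observed specialty)
rows = [
    (('surgical',), 'surgical'),
    (('surgical',), 'surgical'),
    (('surgical',), 'medical'),
    ((), 'medical'),
    ((), 'medical'),
    ((), 'haem_onc'),
    ((), 'surgical'),
]

probs = fit_sequence_probs(rows)
print(probs[('surgical',)])
print(probs[()])
```

In this toy data, an empty sequence is most likely to end in a medical admission simply because medical is the most common outcome among visits with no consult yet.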

In [None]:
from joblib import dump, load

MODEL__ED_SPECIALTY__NAME = 'ed_specialty'

# use this name in the path for saving the model
full_path = model_file_path / MODEL__ED_SPECIALTY__NAME 
full_path = full_path.with_suffix('.joblib')

# save the model
dump(model, full_path)

## Do inference on the test set

In [None]:
from ed_admissions_helper_functions import prepare_for_inference
model = prepare_for_inference(model_file_path, 'ed_specialty', model_only = True)

In [None]:
model.predict(None)

In [None]:
test_df_spec['predicted_specialty'] = test_df_spec['consultation_sequence'].apply(lambda x: model.predict(x)).apply(lambda x: max(x, key=x.get))
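
The second `apply` above reduces each dictionary of probabilities to the single most likely specialty. A minimal illustration with made-up probabilities:

```python
# Hypothetical output of model.predict for one patient
specialty_probs = {'medical': 0.6, 'surgical': 0.3, 'haem_onc': 0.1, 'paediatric': 0.0}

# max over the keys, ranked by their probability, returns the most likely specialty
predicted = max(specialty_probs, key=specialty_probs.get)
print(predicted)
```

If two specialties tie, `max` returns whichever appears first in the dictionary.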

The plot below shows that this approach commonly predicts medical admissions for patients who end up as haem_onc or surgical. This is not surprising: many visits are sampled before any consult is requested, and these are all predicted to be medical admissions because medical is the dominant class.

In [None]:
from matplotlib import pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay


def plot_confusion_matrix(df):
    """Plot observed against predicted specialty for the rows of df."""
    fig, ax = plt.subplots()

    y_pred = df['predicted_specialty']
    y_test = df['observed_specialty']

    # Pass the labels explicitly so the matrix rows and columns match the display labels
    classes = sorted(set(y_test) | set(y_pred))
    cm = confusion_matrix(y_test, y_pred, labels=classes)

    # Display the confusion matrix
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classes)
    disp.plot(cmap='Blues', ax=ax)
    ax.set_title('Confusion Matrix for Specialty Prediction')

    plt.tight_layout()
    plt.show()


plot_confusion_matrix(test_df_spec)
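
For readers without a plotting backend, the same tally can be sketched with the standard library. The observed and predicted labels below are made up to mimic the pattern described above, where patients with no consult yet default to the dominant class:

```python
from collections import Counter

# Hypothetical labels, not taken from the notebook's test set
observed = ['medical', 'medical', 'surgical', 'haem_onc', 'surgical']
predicted = ['medical', 'medical', 'medical', 'medical', 'surgical']

# Count each (observed, predicted) pair; a Counter returns 0 for unseen pairs
pair_counts = Counter(zip(observed, predicted))
labels = sorted(set(observed) | set(predicted))

# One row per observed specialty, one column per predicted specialty
for obs in labels:
    row = [pair_counts[(obs, pred)] for pred in labels]
    print(f'{obs:>10}', row)
```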

## Generate a probability distribution for the number of admissions to each specialty

Here we follow the same approach as for predicting overall admission numbers by time of day (see notebook). However, we now want a separate prediction for each specialty. To do this, we iterate through each specialty, retrieving each patient's probability of being admitted to that specialty, if admitted.
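
The body of `get_prob_dist` is not shown in this notebook. As a sketch of the general idea, assuming patients are admitted independently of one another, each patient's probability of admission can be multiplied by their probability of going to the specialty, and the resulting per-patient probabilities combined into a distribution over the number of beds needed. The function `count_distribution` and all values below are hypothetical:

```python
def count_distribution(probabilities):
    """Distribution of the number of successes among independent Bernoulli trials.

    Returns a list where element k is the probability of exactly k successes.
    """
    dist = [1.0]  # with no patients, zero admissions is certain
    for p in probabilities:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1 - p)   # this patient is not admitted to the specialty
            new[k + 1] += mass * p     # this patient is admitted to the specialty
        dist = new
    return dist


# Hypothetical values for three patients in the ED
p_admit = [0.9, 0.5, 0.2]      # probability of admission (first-stage model)
p_medical = [0.7, 0.6, 1.0]    # probability of medical specialty, if admitted

weighted = [a * s for a, s in zip(p_admit, p_medical)]
dist = count_distribution(weighted)

for k, prob in enumerate(dist):
    print(k, round(prob, 4))
```

The expected number of beds under this distribution equals the sum of the weighted probabilities, which is a quick sanity check on any implementation.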


In [None]:

## New function - not yet run
## STILL TO DO - test this against all other notebooks that use this approach
## STILL TO DO - need to include dates and prediction_times without patients in ED at that time

import pandas as pd

from ed_admissions_helper_functions import prepare_for_inference, get_model_name
from ed_admissions_helper_functions import get_specialty_probs, prepare_episode_slices_dict
from predict.emergency_demand.from_individual_probs import get_prob_dist



child_age_group = '0-17'
child_dict = {
    'medical': 0.0,
    'surgical': 0.0,
    'haem_onc': 0.0,
    'paediatric': 1.0
}

# Function to determine if the patient is a child
# This can be customized to any complex logic necessary
is_child_func = lambda row: row['age_group'] == child_age_group # or row['age'] <= 17



prob_dist_dict_all = {}

for prediction_time_ in prediction_times:

    print("\nProcessing :" + str(prediction_time_))
    
    # get model name for this time of day
    MODEL__ED_ADMISSIONS__NAME = get_model_name('ed_admission', prediction_time_)
    
    # initialise a dictionary to save specialty predictions
    prob_dist_dict_all[MODEL__ED_ADMISSIONS__NAME] = {}

    # prepare data 
    X_test, y_test, model = prepare_for_inference(model_file_path, 'ed_admission', prediction_time = prediction_time_, data_path = path_admission_data, single_episode_slice_per_visit = False)
    
    # get data on probability of admission to each specialty
    X_test_spec = pd.merge(X_test[['age_group']], df_spec[['consultation_sequence', 'observed_specialty']], left_index=True, right_index=True, how='left')
    
    # this function will return a dictionary of specialty probabilities for each row
    X_test_spec['specialty_prob'] = get_specialty_probs(model_file_path, X_test_spec, special_category_func=is_child_func, special_category_dict=child_dict)
    
    for spec_ in ['medical', 'surgical', 'haem_onc', 'paediatric']:
        print("\nProcessing probability of admission to:" + spec_ )
        
        # get the probability of admission to this specialty for all patients
        weights = X_test_spec['specialty_prob'].apply(lambda x: x.get(spec_))
        
        # select only the episode slices that pertain to children or adults, as appropriate
        if spec_ == 'paediatric':
            episode_slices_dict = prepare_episode_slices_dict(df[(df.training_validation_test == 'test') & (df.prediction_time == prediction_time_) & (df.age_group == '0-17')])
        else:
            episode_slices_dict = prepare_episode_slices_dict(df[(df.training_validation_test == 'test') & (df.prediction_time == prediction_time_) & (df.age_group != '0-17')])
            
        # get probability distribution for this time of day
        prob_dist_dict_all[MODEL__ED_ADMISSIONS__NAME][spec_] = get_prob_dist(
            episode_slices_dict, X_test, y_test, model, weights
        )
        
    # use model name in the path for saving the prob dist
    full_path = prob_dist_file_path / str(MODEL__ED_ADMISSIONS__NAME + '_with_spec') 
    full_path = full_path.with_suffix('.pickle')
        
    with open(full_path, 'wb') as f:  # Note the 'wb' mode for binary writing
        pickle.dump(prob_dist_dict_all[MODEL__ED_ADMISSIONS__NAME], f)
    
            

In [None]:
prob_dist_dict_all[MODEL__ED_ADMISSIONS__NAME]['medical']['2023-01-26']['pred_demand'].head(10)

In [None]:
prob_dist_dict_all[MODEL__ED_ADMISSIONS__NAME]['surgical']['2023-01-26']['pred_demand'].head(10)

## Plot one horizon date as an example

In [None]:
from viz.prob_dist_plot import prob_dist_plot

# Date to plot; this must be a key in the probability distribution dictionaries
horizon_date = '2023-01-26'

for spec_ in ['medical', 'surgical', 'haem_onc', 'paediatric']:

    title_ = f'Probability distribution for beds needed in {spec_} specialties\n for patients in ED at {horizon_date} {MODEL__ED_ADMISSIONS__NAME[-4:]}'
    prob_dist_plot(prob_dist_data=prob_dist_dict_all[MODEL__ED_ADMISSIONS__NAME][spec_][horizon_date]['pred_demand'], title_=title_,  include_titles=True)

## Adding in a time window

In [None]:
## NOT RUN

import pandas as pd

from ed_admissions_helper_functions import prepare_for_inference, get_model_name
from ed_admissions_helper_functions import get_specialty_probs, prepare_episode_slices_dict
from predict.emergency_demand.from_individual_probs import get_prob_dist
from predict.emergency_demand.admission_in_time_window_using_aspirational_curve import calculate_probability



child_age_group = '0-17'
child_dict = {
    'medical': 0.0,
    'surgical': 0.0,
    'haem_onc': 0.0,
    'paediatric': 1.0
}

# Function to determine if the patient is a child
# This can be customized to any complex logic necessary
is_child_func = lambda row: row['age_group'] == child_age_group # or row['age'] <= 17



prob_dist_dict_all = {}

for prediction_time_ in prediction_times:

    print("\nProcessing :" + str(prediction_time_))
    
    # get model name for this time of day
    MODEL__ED_ADMISSIONS__NAME = get_model_name('ed_admission', prediction_time_)
    
    # initialise a dictionary to save specialty predictions
    prob_dist_dict_all[MODEL__ED_ADMISSIONS__NAME] = {}

    # prepare data 
    X_test, y_test, model = prepare_for_inference(model_file_path, 'ed_admission', prediction_time = prediction_time_, data_path = path_admission_data, single_episode_slice_per_visit = False)
    
    # get data on probability of admission to each specialty
    X_test_spec = pd.merge(X_test[['age_group']], df_spec[['consultation_sequence', 'observed_specialty']], left_index=True, right_index=True, how='left')
    
    # this function will return a dictionary of specialty probabilities for each row
    X_test_spec['specialty_prob'] = get_specialty_probs(model_file_path, X_test_spec, special_category_func=is_child_func, special_category_dict=child_dict)
    
    # get probability of admission in time window
    X_test_admission_in_window_prob = X_test[['elapsed_los_td']].copy()
    # dividing by 60 gives hours, which implies the config stores the time window in minutes
    time_window_hrs = config['time_window']/60
    # dividing by 3600 gives hours, which implies elapsed length of stay is stored in seconds
    X_test_admission_in_window_prob['elapsed_los_td_hrs'] = X_test_admission_in_window_prob['elapsed_los_td']/3600
    time_window_weights = X_test_admission_in_window_prob.apply(lambda row: calculate_probability(row['elapsed_los_td_hrs'], time_window_hrs, x1 = 4, y1 = 0.76, x2 = 12, y2 = .99), axis=1)
    
    for spec_ in ['medical', 'surgical', 'haem_onc', 'paediatric']:
        print("\nProcessing probability of admission to:" + spec_ )
        
        # get the probability of admission to this specialty for all patients
        spec_weights = X_test_spec['specialty_prob'].apply(lambda x: x.get(spec_))
        
        # multiply the weights
        weights = time_window_weights*spec_weights
        
        # select only the episode slices that pertain to children or adults, as appropriate
        if spec_ == 'paediatric':
            episode_slices_dict = prepare_episode_slices_dict(df[(df.training_validation_test == 'test') & (df.prediction_time == prediction_time_) & (df.age_group == '0-17')])
        else:
            episode_slices_dict = prepare_episode_slices_dict(df[(df.training_validation_test == 'test') & (df.prediction_time == prediction_time_) & (df.age_group != '0-17')])
            
        # get probability distribution for this time of day
        prob_dist_dict_all[MODEL__ED_ADMISSIONS__NAME][spec_] = get_prob_dist(
            episode_slices_dict, X_test, y_test, model, weights
        )
        
    # use model name in the path for saving the prob dist
    full_path = prob_dist_file_path / str(MODEL__ED_ADMISSIONS__NAME + '_in_time_window_with_spec') 
    full_path = full_path.with_suffix('.pickle')
        
    with open(full_path, 'wb') as f:  # Note the 'wb' mode for binary writing
        pickle.dump(prob_dist_dict_all[MODEL__ED_ADMISSIONS__NAME], f)
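
The cell above weights each patient by the probability that, if admitted, they are admitted within the time window. The body of `calculate_probability` is not shown in this notebook, so what follows is only a sketch of one way such a weight could be computed. It assumes a cumulative curve that passes through the two points used above, (4 hours, 0.76) and (12 hours, 0.99). The curve shape, the function names `curve` and `prob_admitted_in_window`, and the parameter `a` are all assumptions:

```python
import math


def curve(x, x1=4.0, y1=0.76, x2=12.0, y2=0.99, a=0.01):
    """Hypothetical cumulative probability of admission within x hours of arrival.

    Power-law growth up to (x1, y1), then exponential approach to 1 through (x2, y2).
    """
    if x <= 0:
        return 0.0
    if x <= x1:
        gamma = math.log(y1 / a) / math.log(x1)
        return a * x ** gamma
    lam = math.log((1 - y1) / (1 - y2)) / (x2 - x1)
    return y1 + (1 - y1) * (1 - math.exp(-lam * (x - x1)))


def prob_admitted_in_window(elapsed_hrs, window_hrs):
    """P(admitted within the window | not yet admitted after elapsed_hrs)."""
    prior = curve(elapsed_hrs)
    if prior >= 1.0:
        return 1.0
    by_end = curve(elapsed_hrs + window_hrs)
    return (by_end - prior) / (1 - prior)


# A patient who has just arrived, and one who has already waited 4 hours
print(round(prob_admitted_in_window(0, 4), 4))
print(round(prob_admitted_in_window(4, 8), 4))
```

Multiplying this weight by the specialty probability, as the cell above does, gives each patient's probability of needing a bed in that specialty within the window, conditional on admission.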
    
            