## Predictive Models

In this notebook, we will try multiple methods to predict A&E demand.

In [None]:
#!pip install dask[dataframe]
#!pip install graphviz
#!pip install --upgrade pip
#!pip install dask-ml 


In [17]:
import pandas as pd
import geopandas as gpd
import dask.dataframe as dd
from datetime import datetime

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.svm import SVC

In [2]:
url1 = 'https://www.opendata.nhs.scot/dataset/0d57311a-db66-4eaa-bd6d-cc622b6cbdfa/resource/a5f7ca94-c810-41b5-a7c9-25c18d43e5a4/download/weekly_ae_activity_20240714.csv'
df_week_AE = pd.read_csv(url1)

url2 = 'https://www.opendata.nhs.scot/dataset/997acaa5-afe0-49d9-b333-dcf84584603d/resource/022c3b27-6a58-48dc-8038-8f1f93bb0e78/download/opendata_monthly_ae_when_202405.csv'
df_month_AE = pd.read_csv(url2)

url3 = 'https://www.opendata.nhs.scot/dataset/997acaa5-afe0-49d9-b333-dcf84584603d/resource/c4622324-f59c-4011-a67b-83b59c59ca94/download/opendata_monthly_ae_discharge_202405.csv'
df_discharge = pd.read_csv(url3)

url4= 'https://www.opendata.nhs.scot/dataset/997acaa5-afe0-49d9-b333-dcf84584603d/resource/37ba17b1-c323-492c-87d5-e986aae9ab59/download/monthly_ae_activity_202405.csv'
df_month_attendances= pd.read_csv(url4)

url_hospital = 'https://www.opendata.nhs.scot/dataset/cbd1802e-0e04-4282-88eb-d7bdcfb120f0/resource/c698f450-eeed-41a0-88f7-c1e40a568acc/download/hospitals.csv'
df_hospital = pd.read_csv(url_hospital)
#df_hospital : A list of all NHS hospitals across Scotland and associated geographic information. It should be noted that this list contains all hospitals in Scotland, not only acute hospitals.

url_demographics= 'https://www.opendata.nhs.scot/dataset/997acaa5-afe0-49d9-b333-dcf84584603d/resource/6abbf8e4-e4e0-4a56-a7b9-f7c7b4171ff3/download/opendata_monthly_ae_demographics_202405.csv'
df_demographics= pd.read_csv(url_demographics)

url_multiple_attendance= 'https://www.opendata.nhs.scot/dataset/997acaa5-afe0-49d9-b333-dcf84584603d/resource/0ca3b959-b758-4532-bb55-aa86da28679e/download/opendata_monthly_ae_multiple_attendances_202405.csv'
df_multi_attendance= pd.read_csv(url_multiple_attendance)
#This data resource contains multiple attendances statistics on new and unplanned return attendances at Accident and Emergency (A&E) services across Scotland for the latest 12 month period.

In [None]:
shapefile_path = "SG_NHS_HealthBoards_2019"

#Reading the shapefile into a GeoDataFrame
gdf = gpd.read_file(shapefile_path)

#Information about locations
location_df= gdf[["HBCode", "HBName"]]

In [None]:
df_week_AE.head()

In [None]:
df_month_AE.head()

In [None]:
df_month_attendances.head()

In [None]:
df_discharge.head()

In [None]:
df_hospital.head()

In [None]:
df_demographics.head()

In [None]:
df_multi_attendance.head()

**Objective:** To build a model that can predict the discharge outcome (Admission to same Hospital / Discharged Home or to usual Place of Residence / Transferred to Other Hospital/Service) based on the hour and day of arrival.

The motivation is that, as the exploratory analysis showed, demand is not evenly distributed across the health boards, so it would be useful to predict the outcome of an emergency visit. For example, a board that is more likely to transfer patients may lack resources. Similarly, a board that admits most of its patients may have sufficient capacity to deal with emergency patients, and a board that discharges most patients to their residence may be handling its resources more efficiently.

The factors to be considered include

#### Data Preparation:

- Merge datasets to include all relevant features for each record.
- Clean the data by handling missing values and encoding categorical variables.
- Create new features if necessary, such as day and hour transformations.

To train the model, we must first create a combined dataset containing all the relevant features.

In order to save memory for merging, we shall first filter the required columns from each dataset. 

In [3]:
#print(df_demographics.columns)
#print(df_discharge.columns)
#print(df_hospital.columns)
#print(df_month_AE.columns)
#print(df_week_AE.columns)

df_demographics= df_demographics[['Month','HBT', 'DepartmentType', 'Age','Sex','Deprivation','NumberOfAttendances']]
df_discharge= df_discharge[['Month','HBT','Discharge', 'NumberOfAttendances']]
df_month_AE= df_month_AE[['Month','HBT','DepartmentType', 'Day','Week', 'Hour', 'InOut', 'NumberOfAttendances']]

# Filter each DataFrame to include only rows with Month >= 202301
df_demographics= df_demographics[df_demographics['Month'] >= 202301]
df_month_AE= df_month_AE[df_month_AE['Month'] >= 202301]
df_discharge= df_discharge[df_discharge['Month'] >= 202301]

In [None]:
df_discharge.head()

In [None]:
df_month_AE.head()

In [4]:
# Merge demographics and discharge data
Demo_and_discharge = pd.merge(df_demographics, df_discharge, on=["Month", "HBT"])
Demo_and_discharge.head()

Unnamed: 0,Month,HBT,DepartmentType,Age,Sex,Deprivation,NumberOfAttendances_x,Discharge,NumberOfAttendances_y
0,202301,S08000015,Emergency Department,18-24,Female,1.0,113,Admission to same Hospital,24
1,202301,S08000015,Emergency Department,18-24,Female,1.0,113,Discharged Home or to usual Place of Residence,246
2,202301,S08000015,Emergency Department,18-24,Female,1.0,113,Transferred to Other Hospital/Service,6
3,202301,S08000015,Emergency Department,18-24,Female,1.0,113,,28
4,202301,S08000015,Emergency Department,18-24,Female,1.0,113,Admission to same Hospital,91
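
The output above pairs one demographic row with several discharge rows, which suggests that `Month` and `HBT` are not unique in either table and that the merge is many-to-many. The sketch below uses small made-up frames (not the NHS data) to show how such a merge multiplies rows and how the `validate` argument of `pd.merge` can flag it. The frame names `demo` and `discharge` and all values are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for the demographics and discharge tables (illustrative values only)
demo = pd.DataFrame({
    "Month": [202301, 202301],
    "HBT": ["S08000015", "S08000015"],
    "Age": ["18-24", "25-39"],
    "NumberOfAttendances": [113, 240],
})
discharge = pd.DataFrame({
    "Month": [202301, 202301, 202301],
    "HBT": ["S08000015"] * 3,
    "Discharge": [
        "Admission to same Hospital",
        "Discharged Home or to usual Place of Residence",
        "Transferred to Other Hospital/Service",
    ],
    "NumberOfAttendances": [24, 246, 6],
})

# Merging on keys that repeat in both tables yields every pairing of rows
merged = pd.merge(demo, discharge, on=["Month", "HBT"])
print(len(demo), len(discharge), len(merged))

# Each demographic group ends up attached to every discharge label
labels_per_group = merged.groupby("Age")["Discharge"].nunique()
print(labels_per_group)

# validate= asks pandas to raise if the keys are not unique as expected
try:
    pd.merge(demo, discharge, on=["Month", "HBT"], validate="one_to_one")
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False
print("keys unique:", keys_unique)
```

If every feature row is attached to every label within a month and board, the features cannot separate the labels within that group, so it may be worth checking the merge keys before training.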


In [21]:
Demo_and_discharge= Demo_and_discharge.rename(columns={'NumberOfAttendances_y': 'Discharge Frequency', 'NumberOfAttendances_x': 'Demographic frequency'})

# Handle missing values
Demo_and_discharge = Demo_and_discharge.dropna(subset=['Discharge'])
Demo_and_discharge.head()

Unnamed: 0,Month,HBT,DepartmentType,Age,Sex,Deprivation,Demographic frequency,Discharge,Discharge Frequency
0,202301,S08000015,Emergency Department,18-24,Female,1.0,113,Admission to same Hospital,24
1,202301,S08000015,Emergency Department,18-24,Female,1.0,113,Discharged Home or to usual Place of Residence,246
2,202301,S08000015,Emergency Department,18-24,Female,1.0,113,Transferred to Other Hospital/Service,6
4,202301,S08000015,Emergency Department,18-24,Female,1.0,113,Admission to same Hospital,91
5,202301,S08000015,Emergency Department,18-24,Female,1.0,113,Discharged Home or to usual Place of Residence,545


In [6]:
# Merge the discharge data with the monthly A&E arrival data
Arrival_and_discharge_df = pd.merge(df_discharge, df_month_AE, on=["Month", "HBT"])

In [19]:
Arrival_and_discharge_df = Arrival_and_discharge_df.rename(columns={'NumberOfAttendances_y': 'Arrival frequency', 'NumberOfAttendances_x': 'Discharge frequency'})
Arrival_and_discharge_df = Arrival_and_discharge_df.dropna(subset=['Discharge'])

#### Feature Engineering:

- Identify key features that are likely to impact the discharge outcome.
- Consider interactions between features, such as combining day and hour into a single feature if it makes sense for your model.
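
As a minimal sketch of the day and hour idea, the example below builds a combined day-hour category and a cyclic (sine/cosine) encoding of the hour, so that 23:00 and 00:00 end up close together. It assumes the hour is available as an integer from 0 to 23 in a hypothetical column `HourOfDay`; the actual `Hour` column in the dataset may need parsing first.

```python
import numpy as np
import pandas as pd

# Toy arrivals table; 'HourOfDay' is a hypothetical integer column (0-23)
arrivals = pd.DataFrame({
    "Day": ["Monday", "Monday", "Saturday"],
    "HourOfDay": [0, 23, 12],
})

# Option 1: a single combined categorical feature, e.g. "Monday_00"
arrivals["DayHour"] = (
    arrivals["Day"] + "_" + arrivals["HourOfDay"].astype(str).str.zfill(2)
)

# Option 2: cyclic encoding, which places the hour on a circle
angle = 2 * np.pi * arrivals["HourOfDay"] / 24
arrivals["hour_sin"] = np.sin(angle)
arrivals["hour_cos"] = np.cos(angle)

print(arrivals)
```

The combined category lets a model learn a separate effect for each day-hour slot, at the cost of many one-hot columns. The cyclic encoding adds only two numeric columns.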


In [8]:
# Encode categorical features
encoder = OneHotEncoder()
categorical_features = encoder.fit_transform(Demo_and_discharge[['DepartmentType', 'Age', 'Sex', 'HBT']])

In [9]:
# Scale numerical features
scaler = StandardScaler()
numerical_features = scaler.fit_transform(Demo_and_discharge[['Deprivation', 'Demographic frequency']])

In [10]:
# Combine features
X = pd.concat([pd.DataFrame(categorical_features.toarray()), pd.DataFrame(numerical_features)], axis=1)
y = Demo_and_discharge['Discharge']

In [11]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
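
An alternative arrangement, sketched here on made-up data, wraps the encoder and scaler in a `ColumnTransformer` inside a `Pipeline`, so that they are fitted on the training split only and reapplied unchanged to the test split. This is not the approach used in the cells above; the toy frame and its values are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with the same kind of columns as the merged data (illustrative values)
toy = pd.DataFrame({
    "DepartmentType": ["Emergency Department", "Minor Injury Unit"] * 10,
    "Sex": ["Female", "Male", "Male", "Female"] * 5,
    "Deprivation": [1.0, 2.0, 3.0, 4.0, 5.0] * 4,
    "Discharge": [
        "Admission to same Hospital",
        "Discharged Home or to usual Place of Residence",
    ] * 10,
})

categorical = ["DepartmentType", "Sex"]
numerical = ["Deprivation"]

# Encode categorical columns and scale numerical ones in a single step
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numerical),
])
clf = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=20, random_state=42)),
])

X_toy = toy[categorical + numerical]
y_toy = toy["Discharge"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Fitting the pipeline fits the encoder and scaler on the training rows only
clf.fit(X_tr, y_tr)
toy_pred = clf.predict(X_te)
print(toy_pred)
```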

#### Model Training and Selection:

- Choose a classification model suitable for multi-class outcomes.
- Candidate models include Random Forest, Gradient Boosting, Support Vector Machines, and Neural Networks; Random Forest and a linear SVM are tried below.

In [12]:
# Train model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

In [16]:
# Predict and evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))



                                                precision    recall  f1-score   support

                    Admission to same Hospital       0.21      0.10      0.14     85642
Discharged Home or to usual Place of Residence       0.34      0.71      0.46    115024
                                         Other       0.00      0.00      0.00     26672
         Transferred to Other Hospital/Service       0.21      0.09      0.12     93083

                                      accuracy                           0.31    320421
                                     macro avg       0.19      0.22      0.18    320421
                                  weighted avg       0.24      0.31      0.24    320421

[[ 8699 66780     0 10163]
 [16969 81394     0 16661]
 [ 3372 20152     0  3148]
 [11483 73677     0  7923]]
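
To help read the confusion matrix, the sketch below recomputes per-class recall and overall accuracy from the printed counts and compares the accuracy with a baseline that always predicts the most frequent class. The counts are copied from the output above; the variable names are illustrative.

```python
import numpy as np

labels = [
    "Admission to same Hospital",
    "Discharged Home or to usual Place of Residence",
    "Other",
    "Transferred to Other Hospital/Service",
]

# Rows are true classes, columns are predicted classes (copied from the output above)
cm = np.array([
    [8699, 66780, 0, 10163],
    [16969, 81394, 0, 16661],
    [3372, 20152, 0, 3148],
    [11483, 73677, 0, 7923],
])

support = cm.sum(axis=1)                    # true examples per class
recall = cm.diagonal() / support            # fraction of each class recovered
accuracy = cm.diagonal().sum() / cm.sum()   # overall fraction correct
majority_baseline = support.max() / cm.sum()  # always predict the largest class

for name, r in zip(labels, recall):
    print(f"{name}: recall {r:.2f}")
print(f"accuracy {accuracy:.2f}, majority-class baseline {majority_baseline:.2f}")
```

On these counts the accuracy (about 0.31) is below the majority-class baseline (about 0.36), and the "Other" column is all zeros, meaning that class is never predicted.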


In [22]:
# Train SVM model
model = SVC(kernel='linear', C=1, gamma='auto', random_state=42)
model.fit(X_train, y_train)

ValueError: Input X contains NaN.
SVC does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
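
One way to address this error, following the suggestion in the message itself, is to impute missing values before the SVM sees them. The sketch below does this with `SimpleImputer` in a pipeline on a tiny made-up array; the median strategy and the toy values are assumptions, not a choice made in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy numeric features with missing values (illustrative only)
X_num = np.array([
    [1.0, 113.0],
    [np.nan, 90.0],
    [3.0, 40.0],
    [5.0, np.nan],
    [2.0, 75.0],
    [4.0, 20.0],
])
y_lab = np.array(["A", "B", "A", "B", "A", "B"])

# Impute NaNs with the column median, then scale, then fit a linear SVM
svm_clf = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    SVC(kernel="linear", C=1),
)
svm_clf.fit(X_num, y_lab)
svm_pred = svm_clf.predict(X_num)
print(svm_pred)
```

Dropping rows with a missing `Deprivation` value is another option. Note also that `SVC` training time grows quickly with the number of samples, so on a dataset of this size a subsample or `LinearSVC` may be more practical.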

In [None]:
# Predict and evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

#### Model Training and Evaluation:

- Split the data into training and testing sets.
- Train the model using the training set and evaluate its performance on the test set.
- Use metrics like accuracy, precision, recall, and F1-score to assess model performance.
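
The steps above can be sketched on synthetic data as follows. The example uses a stratified split so each class keeps its share in the test set, and a `DummyClassifier` as a reference point; a class that is never predicted has undefined precision, which `zero_division=0` reports as 0. The data and class names are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced three-class data (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array(["Home"] * 150 + ["Admitted"] * 40 + ["Other"] * 10)

# stratify= keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
report = classification_report(
    y_te, baseline.predict(X_te), zero_division=0, output_dict=True
)

print("accuracy:", report["accuracy"])
print("macro F1:", report["macro avg"]["f1-score"])
```

Comparing a real model against this kind of baseline, and looking at macro-averaged scores as well as accuracy, helps show whether the model has learned anything beyond the class frequencies.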

#### Hyperparameter Tuning:

Use techniques like Grid Search or Random Search to find the best hyperparameters for your model.
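
A minimal grid search sketch is shown below, using `GridSearchCV` with a Random Forest on synthetic data. The parameter grid, the macro F1 scoring choice, and the data are assumptions for illustration rather than tuned settings for the A&E data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class problem standing in for the real features
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)

# Small illustrative grid; real searches usually cover more values
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                 # 3-fold cross-validation on the training data
    scoring="f1_macro",   # treats all classes equally, useful when imbalanced
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`RandomizedSearchCV` has a similar interface and samples a fixed number of parameter combinations, which can be cheaper on large grids.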
#### Model Deployment:

Once satisfied with the model's performance, prepare it for deployment.
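
As one possible first step towards deployment, a fitted model can be saved to disk and reloaded with `joblib`. The sketch below uses synthetic data and a hypothetical file name, and writes to a temporary directory.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Fit a small model on synthetic data (illustrative only)
X_syn, y_syn = make_classification(n_samples=100, n_features=5, random_state=42)
fitted = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_syn, y_syn)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "discharge_model.joblib")  # hypothetical file name
    joblib.dump(fitted, path)      # serialise the fitted model
    restored = joblib.load(path)   # load it back, e.g. in a serving process

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X_syn) == fitted.predict(X_syn)).all())
print(same_predictions)
```

Saved models should generally be loaded with the same scikit-learn version that produced them, and only from trusted sources.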

#### A model that predicts the nature of demand