# Predicting COVID19 Hospital Stay Length

The following project was completed by jaketuricchi as part of IBM's deep learning course. Below is an excerpt from Kaggle describing the data, which can be found at: https://www.kaggle.com/nehaprabhavalkar/av-healthcare-analytics-ii

Recent Covid-19 Pandemic has raised alarms over one of the most overlooked area to focus: Healthcare Management. While healthcare management has various use cases for using data science, patient length of stay is one critical parameter to observe and predict if one wants to improve the efficiency of the healthcare management in a hospital.<br>
This parameter helps hospitals to identify patients of high LOS risk (patients who will stay longer) at the time of admission. Once identified, patients with high LOS risk can have their treatment plan optimized to minimize LOS and lower the chance of staff/visitor infection. Also, prior knowledge of LOS can aid in logistics such as room and bed allocation planning.<br>
Suppose you have been hired as Data Scientist of HealthMan – a not for profit organization dedicated to manage the functioning of Hospitals in a professional and optimal manner.<br>
The task is to accurately predict the Length of Stay for each patient on case by case basis so that the Hospitals can use this information for optimal resource allocation and better functioning. 

The length of stay is divided into 11 different classes ranging from 0-10 days to more than 100 days.

# Data dictionary:

In [None]:
import pandas as pd
import os

In [None]:
data_dir = '../input/av-healthcare-analytics-ii/healthcare'  # avoid shadowing the built-in dir()
os.chdir(data_dir)

In [None]:
dictionary=pd.read_csv('train_data_dictionary.csv')
print(dictionary)

# Import Packages

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import warnings 
import seaborn as sns
import random
warnings.filterwarnings('ignore')
%matplotlib inline
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

# Load data

In [None]:
data=pd.read_csv('train_data.csv')

# Data Exploration<br>
Because this is a modelling task I skip visualisation, but I must still explore the data enough to understand the preprocessing needs.<br>
What's the shape of the data?

In [None]:
print(data.shape)

What columns do we have?

In [None]:
print(data.columns.tolist())

What type are the columns?

In [None]:
print(data.dtypes)
print(data.dtypes.value_counts())

Quickly describe the data.

In [None]:
print(data.describe(include='object').T)

From here we can see that the object columns have relatively few unique values. This is good as it will make encoding simpler.

Is the outcome data balanced?

In [None]:
data.Stay.value_counts()

The data is not balanced. We'll need to pay attention to the scoring metrics and use stratified splitting. I won't bother with up/down sampling for this exercise.

Do any columns have missing values?

In [None]:
print(data.isna().sum().sort_values(ascending=False))

Let's train on complete rows only, since the missing data is only a small fraction.

In [None]:
data=data.dropna()
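
To put a number on "a small fraction", one option is to measure the share of rows with at least one missing value before dropping them. Below is a minimal sketch on a made-up frame; the values are hypothetical and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 10 rows, one of which has a missing Bed Grade.
toy = pd.DataFrame({
    'Bed Grade': [1.0] * 9 + [np.nan],
    'Admission_Deposit': [4000.0 + 100 * i for i in range(10)],
})

# Share of rows that contain at least one missing value.
frac_incomplete = toy.isna().any(axis=1).mean()

# dropna() removes exactly those rows.
complete = toy.dropna()
print(f'{frac_incomplete:.1%} of rows dropped, {len(complete)} rows kept')
```

If the share turned out to be large, imputation would probably be worth considering instead of dropping rows.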

# Labelling and Sorting<br>
Let's ensure all columns are correctly typed. I think some categorical variables are registering as numeric.

In [None]:
data=data.drop(['case_id'], axis=1) # This is unique so not useful
categorical_vars=['Hospital_code', 'City_Code_Hospital', 'Bed Grade', 'City_Code_Patient',
                  'Hospital_type_code', 'Hospital_region_code', 'Department', 'Ward_Type', 
                  'Ward_Facility_Code', 'Type of Admission', 'Severity of Illness', 'Age',
                  'Stay'] 
data[categorical_vars] = data[categorical_vars].astype('category')

In [None]:
print(data.describe(include='category').T)

We have a patient ID column (`patientid`), which has many unique values but is not continuous. Multiple occurrences of an ID suggest multiple hospital trips.<br>
To deal with this, let's convert it to the number of hospital trips per patient.

In [None]:
data['patient_visits'] = data.groupby(['patientid'])['patientid'].transform('count')
data=data.drop(['patientid'], axis=1) 
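
As a small illustration of this transformation, the sketch below applies the same `groupby(...).transform('count')` pattern to a toy frame. The `patientid` values are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: patient 101 appears three times, 102 once, 103 twice.
toy = pd.DataFrame({'patientid': [101, 102, 101, 103, 103, 101]})

# transform('count') returns one value per row (the size of that row's group),
# so the result lines up with the original index and can be assigned directly.
toy['patient_visits'] = toy.groupby('patientid')['patientid'].transform('count')

# The id itself is no longer needed once the visit count has been derived.
toy = toy.drop(columns=['patientid'])
print(toy['patient_visits'].tolist())
```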

Now, instead of an ID, each row carries the patient's number of hospital visits, which has far fewer unique values.<br>
Let's now check the number of unique values in each categorical column.

In [None]:
cat_uniques = pd.DataFrame([[i, len(data[i].unique())] for i in data[categorical_vars].columns], columns=['Variable', 'Unique Values']).set_index('Variable')
print(cat_uniques)

Some variables, such as hospital code and patient city code, have too many unique values. Let's group the less frequent values (those below 4% of rows) into a single category.

In [None]:
hospital_codes = data['Hospital_code'].value_counts(normalize=True)
print(hospital_codes)
minorities_hospitals =hospital_codes.where(hospital_codes<0.04).dropna().index.values
data['Hospital_code']=np.where(data['Hospital_code'].isin(minorities_hospitals), '0',  data['Hospital_code'])

In [None]:
city_code_patients = data['City_Code_Patient'].value_counts(normalize=True)
print(city_code_patients)
minorities_ccp =city_code_patients.where(city_code_patients<0.04).dropna().index.values
data['City_Code_Patient']=np.where(data['City_Code_Patient'].isin(minorities_ccp), '0',  data['City_Code_Patient'])
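
The two cells above repeat the same pattern, so it could be wrapped in a small helper. The sketch below is one way to do that; `group_rare` and its parameters are hypothetical names introduced here for illustration, and the toy codes are made up.

```python
import pandas as pd

def group_rare(series, threshold=0.04, other_label='0'):
    """Replace categories whose relative frequency is below `threshold`.

    Hypothetical helper, not part of the original notebook.
    """
    freqs = series.value_counts(normalize=True)
    rare = freqs[freqs < threshold].index
    # Cast to str first so the kept codes and the placeholder share one dtype.
    as_str = series.astype(str)
    # where() keeps values where the condition is True and substitutes elsewhere.
    return as_str.where(~series.isin(rare), other_label)

# Hypothetical codes: 'A' and 'B' dominate, 'C' and 'D' are each 2% of rows.
codes = pd.Series(['A'] * 60 + ['B'] * 36 + ['C'] * 2 + ['D'] * 2)
grouped = group_rare(codes)
print(grouped.value_counts().to_dict())
```

The 4% threshold mirrors the cells above; a different cut-off may suit other columns better.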

Recheck the unique category counts.

In [None]:
cat_uniques = pd.DataFrame([[i, len(data[i].unique())] for i in data[categorical_vars].columns], columns=['Variable', 'Unique Values']).set_index('Variable')
print(cat_uniques)

Now we have a reasonable number of categories

# Encoding<br>
Because most features are categorical, we must label-encode and one-hot encode them before training a model.

In [None]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
label_vars=['Hospital_type_code', 'Hospital_region_code', 'Department', 'Ward_Type',
            'Ward_Facility_Code', 'Type of Admission', 'Severity of Illness', 'Age', 'Stay']

Encode labels

In [None]:
for col in label_vars:
    data[col] = le.fit_transform(data[col])
    
# Reset index
data=data.reset_index(drop=True)
y=data.Stay

Get Dummies

In [None]:
categorical_vars=['Hospital_code', 'City_Code_Hospital', 'Bed Grade', 'City_Code_Patient',
                  'Hospital_type_code', 'Hospital_region_code', 'Department', 'Ward_Type', 
                  'Ward_Facility_Code', 'Type of Admission', 'Severity of Illness', 'Age'] 
X = pd.get_dummies(data, columns=categorical_vars).drop('Stay', axis=1)

# Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numeric_cols=['Available Extra Rooms in Hospital', 'Visitors with Patient',
              'Admission_Deposit', 'patient_visits']

In [None]:
# Scale all numeric columns in one call; StandardScaler standardises each column independently.
X[numeric_cols] = scaler.fit_transform(X[numeric_cols])
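
For reference, the arithmetic behind this scaling step can be sketched with plain NumPy. The deposit amounts below are hypothetical and serve only to show the calculation.

```python
import numpy as np

# Hypothetical deposit amounts, used only to illustrate the arithmetic.
deposits = np.array([3000.0, 4000.0, 5000.0, 6000.0, 7000.0])

# Standardisation: subtract the column mean and divide by the population
# standard deviation (ddof=0), which is what StandardScaler uses by default.
mean = deposits.mean()
std = deposits.std(ddof=0)
scaled = (deposits - mean) / std

# After scaling, the column has mean 0 and standard deviation 1.
print(scaled.round(3))
```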

# Data Splitting

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

In [None]:
strat_shuff_split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
X.columns = X.columns.str.replace(' ', '_')

Get the index values from the generator

In [None]:
train_idx, test_idx = next(strat_shuff_split.split(X, y))

Create the data sets

In [None]:
X_train = X.loc[train_idx, X.columns.values]
y_train = data.loc[train_idx, 'Stay']

In [None]:
X_test = X.loc[test_idx, X.columns.values]
y_test = data.loc[test_idx, 'Stay']

Check that the class proportions match across the train/test split. Train:

In [None]:
print(y_train.value_counts(normalize=True).sort_index())

Test

In [None]:
print(y_test.value_counts(normalize=True).sort_index())

In [None]:
del(data)

The class proportions in the train and test data are closely matched, so the stratified split worked as intended.

Now let's use Keras to one-hot encode the y data.

In [None]:
num_classes = 11
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
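
To make the target layout concrete, here is a NumPy sketch that should produce the same kind of one-hot matrix as `to_categorical`. The four labels are hypothetical.

```python
import numpy as np

num_classes = 11
# Hypothetical integer-encoded Stay labels for four patients.
labels = np.array([0, 2, 10, 2])

# Row i of the identity matrix is the one-hot vector for class i, so indexing
# the identity matrix with the labels gives one row per patient.
one_hot = np.eye(num_classes, dtype='float32')[labels]

print(one_hot.shape)
# argmax along the class axis recovers the original integer labels.
print(one_hot.argmax(axis=1))
```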

# Baseline Keras Model<br>
Now we will build a basic Keras model which will attempt to predict the length-of-stay class.

Build model base

In [None]:
model_base = Sequential()
model_base.add(Dense(96, input_dim=X_train.shape[1], activation='relu'))  # 78 features in the original run
model_base.add(Dense(11,activation='sigmoid'))
model_base.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
model_base.summary()

Fit the model

In [None]:
model_base.fit(X_train, y_train, batch_size=5, epochs=10,
              validation_data=(X_test, y_test),
              shuffle=True)

Calculate ROC AUC. This is more informative than accuracy because the outcome variable is imbalanced.

In [None]:
from sklearn.metrics import roc_auc_score

In [None]:
y_pred_prob = model_base.predict(X_test)
y_pred = np.argmax(y_pred_prob, axis=1)  # predict_classes() was removed from recent Keras/TensorFlow releases

In [None]:
print('roc-auc is {:.3f}'.format(roc_auc_score(y_test,y_pred_prob)))
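
With a one-hot `y_test`, `roc_auc_score` treats the problem as multilabel and, by default, macro-averages the per-class AUC. The sketch below reproduces that calculation with NumPy on a tiny made-up example. The function names `binary_auc` and `macro_auc` are hypothetical, and the pairwise comparison is quadratic in the number of samples, so it is only meant for small arrays.

```python
import numpy as np

def binary_auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every positive/negative pair
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def macro_auc(y_onehot, y_prob):
    """Unweighted mean of the one-vs-rest AUC over the class columns."""
    return float(np.mean([binary_auc(y_onehot[:, k], y_prob[:, k])
                          for k in range(y_onehot.shape[1])]))

# Hypothetical 3-class example with four samples.
y_onehot = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.3, 0.5, 0.2],
                   [0.2, 0.2, 0.6],
                   [0.4, 0.5, 0.1]])
print(round(macro_auc(y_onehot, y_prob), 3))
```

Because the macro average weights every class equally, rare stay-length classes count as much as common ones.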

The model achieves a high level of accuracy with only a very basic architecture. It's worth noting that increasing the number of epochs did not noticeably improve the loss or accuracy scores but did take considerably longer to compute. Further models may wish to reduce the epochs in favour of a more complex architecture.
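
One caveat on the accuracy figure, stated as an assumption about Keras rather than something verified in this notebook: when the loss is `binary_crossentropy`, the `'accuracy'` metric is, as far as I know, resolved to element-wise binary accuracy. With an 11-column one-hot target, that number can look high even for a useless model. The sketch below shows the arithmetic with made-up labels and an all-zero prediction.

```python
import numpy as np

num_classes = 11
# Hypothetical labels: 1100 patients spread evenly over the 11 classes.
labels = np.arange(1100) % num_classes
y_true = np.eye(num_classes)[labels]

# A useless "model" that outputs a probability of 0 for every class.
y_prob = np.zeros_like(y_true)

# Element-wise (binary) accuracy: threshold at 0.5 and compare every cell.
binary_acc = ((y_prob > 0.5) == (y_true == 1)).mean()

# Categorical accuracy: compare the arg-max class of each row with the label.
categorical_acc = (y_prob.argmax(axis=1) == labels).mean()

print(f'binary accuracy:      {binary_acc:.3f}')
print(f'categorical accuracy: {categorical_acc:.3f}')
```

If that assumption holds, a softmax output with `categorical_crossentropy` would be the more usual pairing for mutually exclusive classes, and the reported accuracy would then be per patient rather than per output unit.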

# Building a More Complex Keras Model

In [None]:
model_2 = Sequential()
model_2.add(Dense(256, input_dim=X_train.shape[1], activation='relu'))  # 78 features in the original run
model_2.add(Dense(128,activation = 'relu'))
model_2.add(Dropout(0.5))
model_2.add(Dense(64,activation = 'relu'))
model_2.add(Dropout(0.2))
model_2.add(Dense(32,activation = 'relu'))
model_2.add(Dense(11,activation='sigmoid'))
model_2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
model_2.summary()

Fit the model

In [None]:
model_2.fit(X_train, y_train, batch_size=5, epochs=3,  # Fewer epochs here; extra epochs didn't help the baseline.
              validation_data=(X_test, y_test),
              shuffle=True)

Calculate ROC AUC. 

In [None]:
y_pred_prob = model_2.predict(X_test)
y_pred = np.argmax(y_pred_prob, axis=1)  # predict_classes() was removed from recent Keras/TensorFlow releases

In [None]:
print('roc-auc is {:.3f}'.format(roc_auc_score(y_test,y_pred_prob)))

Improvement!

# Next Steps<br>
* Experiment with adding width (more nodes) or depth (more layers) to the NN model. My computer is struggling considerably on a basic model so I'll leave that out for now.<br>
* Tune parameters (e.g. try a different optimizer, lr or batch size)<br>
* Consider using a complex pre-designed model which is fit for this current problem. These often perform better than self-made models<br>
* Conduct more feature engineering. Currently I only encoded and sorted variables, but adding interactions, or perhaps polynomial features to numeric vars could be useful.<br>
* Calculate a more detailed scoring output for multiclass classification (note that some metrics don't work here)<br>
* See how the NN model compares to other ML algorithms.
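
As a starting point for the more detailed multiclass scoring mentioned in the list above, per-class precision and recall can be derived from a confusion matrix. This is a minimal NumPy sketch on made-up labels; `per_class_report` is a hypothetical helper, and in the notebook `y_pred` would be the arg-max of the predicted probabilities.

```python
import numpy as np

def per_class_report(y_true, y_pred, num_classes):
    """Return (precision, recall) arrays computed from a confusion matrix."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # rows: true class, columns: predicted
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)           # how often each class was predicted
    actual = cm.sum(axis=1)              # how often each class truly occurs
    # Classes that are never predicted (or never occur) get a score of 0.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    return precision, recall

# Hypothetical integer labels for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
precision, recall = per_class_report(y_true, y_pred, num_classes=3)
print(precision.round(2), recall.round(2))
```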

Great, we're working with complete data. Let's consider the distribution/skew of features.