# Overview

The objective of this notebook is to build predictive models for the readmission status of diabetic patients. The models are iteratively improved through grid search with cross-validation. Models are scored using recall (`recall_micro` for the multiclass target) to reduce false negatives, that is, patients who will be readmitted to the hospital but are predicted not to return.
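For a single-label multiclass target, micro-averaged recall pools true positives and false negatives across all classes, which makes it equal to overall accuracy. The sketch below illustrates this in plain Python; the labels are made up for illustration and are not data from this notebook.

```python
# Hypothetical labels for illustration only (0, 1, 2 mirror the readmission classes).
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]
y_pred = [0, 1, 1, 1, 0, 2, 2, 0, 2, 0]

classes = sorted(set(y_true))

# Per-class true positives and false negatives.
tp = {c: sum(t == c and p == c for t, p in zip(y_true, y_pred)) for c in classes}
fn = {c: sum(t == c and p != c for t, p in zip(y_true, y_pred)) for c in classes}

# Micro average: pool the counts over all classes before dividing.
micro_recall = sum(tp.values()) / (sum(tp.values()) + sum(fn.values()))

# Plain accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Per-class recall keeps each class separate instead of pooling.
per_class_recall = {c: tp[c] / (tp[c] + fn[c]) for c in classes}

print(micro_recall, accuracy, per_class_recall)
```

Because the two figures coincide, the per-class recall (for example for class 1, readmitted in <30 days) is the number that isolates missed readmissions for one group.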

## Method

- Upsample the minority classes by resampling to combat class imbalance in the target.
- Build a DummyClassifier using the `uniform` strategy as a baseline.
- Build baseline Decision Tree and Random Forest models.
- Hypertune the baseline models using grid search with cross-validation.
- Check feature importance on the Decision Tree model by selecting features above a certain weight from the hypertuned Decision Tree, then compare model improvement.
    - High scores on Random Forest made this step unnecessary for that model.
- All models were checked for overfitting by running 100 iterations of prediction and comparing the metric between train and test predictions.
    - All hypertuned models showed a very low chance of overfitting.

## Summary

- The hypertuned Decision Tree showed a ~10% reduction in overfitting.
- The hypertuned Decision Tree with select features showed a 4% increase in the metric.
- The hypertuned Random Forest averaged a 95.2% recall score on both train and test over 100 predictions.

## Authors

[Yung Han Jeong](https://github.com/yunghanjeong) <br>
[Malcolm Katzenbach](https://github.com/malcolm206)

# Import

## Library Import

In [182]:
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, recall_score, accuracy_score

from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import tree

import os
import numpy as np
import pickle
import json
import pandas as pd
pd.set_option('display.max_columns', 200) #set to show all columns
pd.set_option('display.max_rows', 200) 

## Cleaned Data Import

In [7]:
df = pd.read_csv(r"..\data\diabetic_data_dummy.csv", index_col=0)
df.head()

Unnamed: 0,gender,time_in_hospital,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,number_diagnoses,change,diabetesMed,readmitted,admission_type_id_2,admission_type_id_3,admission_type_id_4,admission_type_id_5,admission_type_id_6,admission_type_id_7,admission_type_id_8,discharge_disposition_id_2,discharge_disposition_id_3,discharge_disposition_id_4,discharge_disposition_id_5,discharge_disposition_id_6,discharge_disposition_id_7,discharge_disposition_id_8,discharge_disposition_id_9,discharge_disposition_id_10,discharge_disposition_id_11,discharge_disposition_id_12,discharge_disposition_id_13,discharge_disposition_id_14,discharge_disposition_id_15,discharge_disposition_id_16,discharge_disposition_id_17,discharge_disposition_id_18,discharge_disposition_id_19,discharge_disposition_id_20,discharge_disposition_id_22,discharge_disposition_id_23,discharge_disposition_id_24,discharge_disposition_id_25,discharge_disposition_id_27,discharge_disposition_id_28,admission_source_id_2,admission_source_id_3,admission_source_id_4,admission_source_id_5,admission_source_id_6,admission_source_id_7,admission_source_id_8,admission_source_id_9,admission_source_id_10,admission_source_id_11,admission_source_id_13,admission_source_id_14,admission_source_id_17,admission_source_id_20,admission_source_id_22,admission_source_id_25,age_[10-20),age_[20-30),age_[30-40),age_[40-50),age_[50-60),age_[60-70),age_[70-80),age_[80-90),age_[90-100),race_AfricanAmerican,race_Asian,race_Caucasian,race_Hispanic,race_Other,metformin_Down,metformin_Steady,metformin_Up,repaglinide_Down,repaglinide_Steady,repaglinide_Up,nateglinide_Down,nateglinide_Steady,nateglinide_Up,chlorpropamide_Down,chlorpropamide_Steady,chlorpropamide_Up,glimepiride_Down,glimepiride_Steady,glimepiride_Up,glipizide_Down,glipizide_Steady,glipizide_Up,glyburide_Down,glyburide_Steady,glyburide_Up,pioglitazone_Down,pioglitazone_Steady,pioglitazone_Up,rosiglitazone_Down,rosiglitazone_Steady,rosiglitazone_Up,acarbose_Down,acarbose_Steady,acarbose_Up,miglitol_Down,miglitol_Steady,miglitol_Up,tolazamide_Steady,tolazamide_Up,insulin_Down,insulin_Steady,insulin_Up,glyburide-metformin_Down,glyburide-metformin_Steady,glyburide-metformin_Up,max_glu_serum_>200,max_glu_serum_>300,max_glu_serum_Norm,A1Cresult_>7,A1Cresult_>8,A1Cresult_Norm,diag_1_circulatory,diag_1_diabetes,diag_1_digestive,diag_1_genitourinary,diag_1_injury,diag_1_musculoskeletal,diag_1_neoplasms,diag_1_respiratory,diag_2_circulatory,diag_2_diabetes,diag_2_digestive,diag_2_genitourinary,diag_2_injury,diag_2_musculoskeletal,diag_2_neoplasms,diag_2_respiratory,diag_3_circulatory,diag_3_diabetes,diag_3_digestive,diag_3_genitourinary,diag_3_injury,diag_3_musculoskeletal,diag_3_neoplasms,diag_3_respiratory
0,1,1,41,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,3,59,0,18,0,0,0,9,1,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,2,11,5,13,2,0,1,6,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,2,44,1,16,0,0,0,7,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0
4,0,1,51,0,8,0,0,0,5,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0


## Class Imbalance

In [8]:
df.groupby("readmitted").gender.count()
# 0 = Not Readmitted
# 1 = Readmitted in <30 days
# 2 = Readmitted in >30 days

readmitted
0    54861
1    11357
2    35545
Name: gender, dtype: int64

In [9]:
no_read = df[df.readmitted == 0] # 0 = Not Readmitted
read_1 = df[df.readmitted == 1] # 1 = Readmitted in <30 days
read_2 = df[df.readmitted == 2] # 2 = Readmitted in >30 days

In [10]:
read_1_resample = resample(read_1,
                          replace=True, # sample with replacement
                          n_samples=no_read.shape[0], # match number in majority class
                          random_state=42) # reproducible result

read_2_resample = resample(read_2,
                          replace=True, # sample with replacement
                          n_samples=no_read.shape[0], # match number in majority class
                          random_state=42) # reproducible result

In [11]:
resampled_df = pd.concat([no_read, read_1_resample, read_2_resample])

In [12]:
resampled_df.groupby("readmitted").gender.count()
# 0 = Not Readmitted
# 1 = Readmitted in <30 days
# 2 = Readmitted in >30 days

readmitted
0    54861
1    54861
2    54861
Name: gender, dtype: int64
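Because upsampling with replacement duplicates rows, resampling before the train/test split can place copies of the same record in both partitions, which tends to inflate test scores. The sketch below compares the two orderings on a toy DataFrame. The `row_id` column, the two-class setup, and the helper `upsample_minority` are assumptions made for illustration, not part of this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy data (hypothetical): 100 majority rows (class 0) and 20 minority rows (class 1).
toy = pd.DataFrame({"row_id": range(120), "readmitted": [0] * 100 + [1] * 20})


def upsample_minority(frame, seed=42):
    """Upsample class 1 with replacement to match the size of class 0."""
    majority = frame[frame.readmitted == 0]
    minority = frame[frame.readmitted == 1]
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=seed)
    return pd.concat([majority, minority_up])


# Order A: upsample first, then split. Duplicates can land on both sides.
upsampled = upsample_minority(toy)
train_a, test_a = train_test_split(upsampled, random_state=42,
                                   stratify=upsampled.readmitted)
shared_a = set(train_a.row_id) & set(test_a.row_id)

# Order B: split first, then upsample only the training partition.
train_b, test_b = train_test_split(toy, random_state=42, stratify=toy.readmitted)
train_b = upsample_minority(train_b)
shared_b = set(train_b.row_id) & set(test_b.row_id)

print("rows shared between train and test, order A:", len(shared_a))
print("rows shared between train and test, order B:", len(shared_b))
```

In order B the test partition keeps its original class balance and contains no row that the model could have seen during training.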

## Initial Data Split

In [13]:
# split features and target
y = resampled_df.readmitted
X = resampled_df.drop(columns = "readmitted")
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

## Baseline DummyClassifier

In [18]:
dummy_clf = DummyClassifier(strategy="uniform")

In [19]:
dummy_clf.fit(X_train, y_train)

DummyClassifier(strategy='uniform')

In [20]:
y_pred = dummy_clf.predict(X_test)

In [22]:
dummy_recall = recall_score(y_test, y_pred, average="micro")
print("Dummy Classifier recall: ", dummy_recall)

Dummy Classifier recall:  0.33220726194526806


With the `uniform` strategy, the dummy classifier predicts each class with equal probability, so on the balanced classes its recall is about one third.
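The baseline score of roughly 0.33 can be reproduced with a small simulation. This is a plain-Python sketch on a hypothetical balanced target of 30,000 rows; it mimics a uniform random guesser and does not call `DummyClassifier` itself.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
classes = [0, 1, 2]

# Balanced hypothetical target: 10,000 rows per class.
y_true = [c for c in classes for _ in range(10_000)]

# A uniform guesser ignores the features and picks each class with probability 1/3.
y_pred = [rng.choice(classes) for _ in y_true]

# Share of correct guesses, which equals micro-averaged recall here.
score = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(score, 3))  # close to 1/3
```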

***

## Decision Tree - Basic

In [148]:
tree_clf = DecisionTreeClassifier()

In [149]:
tree_clf.fit(X_train, y_train)

DecisionTreeClassifier()

In [150]:
y_tree_pred = tree_clf.predict(X_test)

In [26]:
tree_recall = recall_score(y_test, y_tree_pred, average="micro")
print("Decision Tree recall: ", tree_recall)

Decision Tree recall:  0.7866378262771594


In [29]:
# overfit check: compare recall on the training set against the test set
y_tree_train_pred = tree_clf.predict(X_train)
tree_train_recall = recall_score(y_train, y_tree_train_pred, average="micro")
print("Check for Overfitting")
print("Decision Tree Train recall: ", tree_train_recall)
print("Decision Tree Test recall: ", tree_recall)

Check for Overfitting
Decision Tree Train recall:  0.9999837974027237
Decision Tree Test recall:  0.7866378262771594


The basic decision tree model is highly overfit: train recall is nearly 1.00 while test recall is 0.79.

***

## Decision Tree - GridSearchCV

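Grid search with cross-validation should reduce overfitting: each combination of tuning parameters is fit on part of the training data and scored on a held-out fold, and the combination with the best average validation score is kept.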

In [33]:
resampled_df.shape[1]/2

72.5

In [30]:
tree_clf = DecisionTreeClassifier()

In [70]:
tree_param = {"max_depth":range(15,31,5),
              "min_samples_split":range(25,101,25),
              "max_features":range(10, 41, 5)
              }

In [71]:
grid_tree = GridSearchCV(tree_clf, param_grid=tree_param, cv=10, scoring="recall_micro", n_jobs=-1, verbose=1)

In [72]:
grid_tree.fit(X_train, y_train)

Fitting 10 folds for each of 112 candidates, totalling 1120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   11.7s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   52.6s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  4.0min
[Parallel(n_jobs=-1)]: Done 1120 out of 1120 | elapsed:  6.2min finished


GridSearchCV(cv=10, estimator=DecisionTreeClassifier(), n_jobs=-1,
             param_grid={'max_depth': range(15, 31, 5),
                         'max_features': range(10, 41, 5),
                         'min_samples_split': range(25, 101, 25)},
             scoring='recall_micro', verbose=1)
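The "112 candidates" and "1120 fits" in the log follow from the Cartesian product of the parameter ranges multiplied by the number of folds. A short sketch of that arithmetic, where `count_candidates` is a helper name introduced here for illustration:

```python
from itertools import product


def count_candidates(param_grid):
    """Number of parameter combinations an exhaustive grid search evaluates."""
    return len(list(product(*param_grid.values())))


tree_param = {
    "max_depth": range(15, 31, 5),            # 15, 20, 25, 30
    "min_samples_split": range(25, 101, 25),  # 25, 50, 75, 100
    "max_features": range(10, 41, 5),         # 10, 15, ..., 40
}

cv_folds = 10
candidates = count_candidates(tree_param)
print(candidates, candidates * cv_folds)  # 112 candidates, 1120 fits
```

The same calculation is a quick way to estimate the cost of a wider grid before launching it.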

In [73]:
grid_tree.best_params_ #checking best parameters of the decision tree

{'max_depth': 30, 'max_features': 40, 'min_samples_split': 25}

The hypertuned model has low interpretability with a max depth of 30, but its feature weights provide insight into which features indicate high-risk patients.

In [74]:
# Initial Check
y_tree_train_pred = grid_tree.best_estimator_.predict(X_train)
y_tree_test_pred = grid_tree.best_estimator_.predict(X_test)
grid_tree_train_recall = recall_score(y_train, y_tree_train_pred, average="micro")
grid_tree_test_recall = recall_score(y_test, y_tree_test_pred, average="micro")
print("Check for Overfitting")
print("Decision Tree Train recall: ", grid_tree_train_recall)
print("Decision Tree Test recall: ", grid_tree_test_recall)

Check for Overfitting
Decision Tree Train recall:  0.7177102489529071
Decision Tree Test recall:  0.6039226170223108


Hypertuning reduced overfitting, but the model is still slightly overfit.
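One way to quantify the change is the gap between train and test recall before and after tuning. The sketch below uses the recall values printed by the cells above; `gap` is a helper name introduced here for illustration.

```python
# Recall values copied from the printed outputs above.
baseline = {"train": 0.9999837974027237, "test": 0.7866378262771594}
tuned = {"train": 0.7177102489529071, "test": 0.6039226170223108}


def gap(scores):
    """Train-minus-test recall; a larger gap suggests more overfitting."""
    return scores["train"] - scores["test"]


print(f"baseline gap: {gap(baseline):.3f}")                # 0.213
print(f"tuned gap:    {gap(tuned):.3f}")                   # 0.114
print(f"reduction:    {gap(baseline) - gap(tuned):.3f}")   # 0.100
```

The gap narrows by about 0.10, while test recall itself drops from 0.787 to 0.604 on this split.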

### Checking for Overfitting Through Iteration

In [75]:
n = 100 # number of iterations
# value initialization
train_recall_sum = 0
test_recall_sum = 0
# f1 is the same as recall when using micro as the average value

for i in range(0, n): 
    # new random split (the already-fitted estimator is reused, not refit)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    # predict on new split
    y_tree_train_pred = grid_tree.best_estimator_.predict(X_train)
    y_tree_test_pred = grid_tree.best_estimator_.predict(X_test)
    
    # calculate recall score on new prediction
    train_recall_sum += recall_score(y_train, y_tree_train_pred, average="micro")
    test_recall_sum += recall_score(y_test, y_tree_test_pred, average="micro")
    # print("Predicted", i+1, "times") #sanity check

# output average
print(f"Check for Overfitting with {n} iterations")
print("Decision Tree Train recall: ", train_recall_sum/n)
print("Decision Tree Test recall: ", test_recall_sum/n)


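Averaging over 100 random train/test splits slightly reduced the train recall score and showed that the model is less overfit than the baseline decision tree model. Note that the fitted estimator is reused rather than refit on each split, so every random split mixes rows seen during training into both partitions, which pulls the averaged train and test scores together.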
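A variation on this check is to refit the estimator on every random split, so that the test partition never contains rows the model was trained on. The sketch below assumes synthetic data from `make_classification` and borrows the tuned `max_depth` and `min_samples_split` values. `ShuffleSplit` and `cross_validate` stand in for the manual loop and are not what this notebook ran.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data (an assumption; not the readmission dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Ten random 75/25 splits; the estimator is cloned and refit on each one.
splitter = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
model = DecisionTreeClassifier(max_depth=30, min_samples_split=25, random_state=0)

scores = cross_validate(
    model, X_demo, y_demo,
    cv=splitter, scoring="recall_micro", return_train_score=True,
)

print("mean train recall:", scores["train_score"].mean())
print("mean test recall: ", scores["test_score"].mean())
```

With refitting, a persistent difference between the two means reflects how well the model generalizes rather than how the rows happened to be shuffled.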

### Checking Feature Importance

In [195]:
feature_imp = dict(zip(X.columns, grid_tree.best_estimator_.feature_importances_)) #get feature importance
sorted_feature = {k: v for k, v in sorted(feature_imp.items(), key=lambda item: item[1], reverse=True)} #sort descending
grid_dt_feature_df = pd.DataFrame(sorted_feature, index=[0]) # push to dataframe for better visualization
grid_dt_feature_df = grid_dt_feature_df.transpose()
grid_dt_feature_df[:40]

Unnamed: 0,0
num_lab_procedures,0.103863
num_medications,0.085879
number_inpatient,0.067769
time_in_hospital,0.057316
number_diagnoses,0.035806
num_procedures,0.034211
number_outpatient,0.024633
discharge_disposition_id_11,0.021868
gender,0.01416
diag_3_circulatory,0.013535
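The same ranking can be built more directly with a pandas `Series`. This is a sketch, not the cell the notebook ran: the four column names and weights are copied from the table above and stand in for `X.columns` and `feature_importances_`.

```python
import pandas as pd

# A few names and weights from the table above, deliberately out of order.
columns = ["num_lab_procedures", "gender", "num_medications", "time_in_hospital"]
importances = [0.103863, 0.014160, 0.085879, 0.057316]

# Index by feature name, then sort descending by weight.
feature_rank = (
    pd.Series(importances, index=columns, name="importance")
    .sort_values(ascending=False)
)
print(feature_rank.head(3))
```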


In [116]:
n = 0.005 #set comparison threshold
feature_imp = dict(zip(X.columns, grid_tree.best_estimator_.feature_importances_))
nozero_feature = {k: v for k, v in sorted(feature_imp.items(), key=lambda item: item[1], reverse=True) if round(v,3) >=n}
print("dictionary length", len(nozero_feature)) #check how many there are in this dictionary
important_features = list(nozero_feature.keys())

dictionary length 62


Feature weights were extracted and filtered to create `important_features` (referred to as `select features`) to tune the decision tree model further.
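The filter keeps a feature when its weight, rounded to three decimals, is at least the threshold. A minimal sketch of that rule follows; the first three weights come from the table above, and the two `hypothetical_*` entries are invented to show the boundary behavior.

```python
threshold = 0.005

feature_imp = {
    "num_lab_procedures": 0.103863,
    "num_medications": 0.085879,
    "gender": 0.01416,
    "hypothetical_border_feature": 0.0046,  # rounds up to 0.005, so it is kept
    "hypothetical_rare_feature": 0.0004,    # rounds to 0.0, so it is dropped
}

# Sort by weight (descending) and keep features that pass the rounded threshold.
selected = [
    name
    for name, weight in sorted(feature_imp.items(), key=lambda kv: kv[1], reverse=True)
    if round(weight, 3) >= threshold
]
print(selected)
```

Rounding before the comparison means weights slightly below 0.005 can still pass, which is worth keeping in mind when reading the count of selected features.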

### Best Model Refit and Export

The best model was refit and exported along with its feature weight dataframe. 

In [155]:
best_grid_tree = grid_tree.best_estimator_.fit(resampled_df.drop(columns = "readmitted"), resampled_df.readmitted)

In [156]:
with open(r"..\model\best_tree.pickle", "wb") as best_tree:
    pickle.dump(best_grid_tree, best_tree)

In [196]:
grid_dt_feature_df.to_csv(r"..\model\decision_tree_feature_score.csv")

***

## GridSearchCV Decision Tree with Select Features

Using `important_features` from above, the data was reduced to the selected columns and the decision tree model was hypertuned again.

In [117]:
y = resampled_df.readmitted
X = resampled_df[important_features]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [118]:
tree_param = {"max_depth":range(15,31,5),
              "min_samples_split":range(25,101,25),
              "max_features":range(10, 41, 5)
              }

In [119]:
grid_select_tree = GridSearchCV(tree_clf, param_grid=tree_param, cv=10, scoring="recall_micro", n_jobs=-1, verbose=1)

In [120]:
grid_select_tree.fit(X_train, y_train)

Fitting 10 folds for each of 112 candidates, totalling 1120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    7.9s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   53.7s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  4.0min
[Parallel(n_jobs=-1)]: Done 1120 out of 1120 | elapsed:  5.9min finished


GridSearchCV(cv=10, estimator=DecisionTreeClassifier(), n_jobs=-1,
             param_grid={'max_depth': range(15, 31, 5),
                         'max_features': range(10, 41, 5),
                         'min_samples_split': range(25, 101, 25)},
             scoring='recall_micro', verbose=1)

In [121]:
grid_select_tree.best_params_

{'max_depth': 30, 'max_features': 40, 'min_samples_split': 25}

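The best parameters of this model were the same as those of the previous model, which is to be expected. Both searches selected values at the edge of the grid (the largest `max_depth` and `max_features`, the smallest `min_samples_split`).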

In [122]:
# Initial Check
y_tree_train_pred = grid_select_tree.best_estimator_.predict(X_train)
y_tree_test_pred = grid_select_tree.best_estimator_.predict(X_test)
grid_tree_train_recall = recall_score(y_train, y_tree_train_pred, average="micro")
grid_tree_test_recall = recall_score(y_test, y_tree_test_pred, average="micro")
print("Check for Overfitting")
print("Decision Tree Train recall: ", grid_tree_train_recall)
print("Decision Tree Test recall: ", grid_tree_test_recall)

Check for Overfitting
Decision Tree Train recall:  0.7627858745756945
Decision Tree Test recall:  0.6274972050746124


A single train/test split showed overfitting.

In [123]:
n = 100 # number of iterations
# value initialization
train_recall_sum = 0
test_recall_sum = 0
# f1 is the same as recall when using micro as the average value

for i in range(0, n): 
    # new random split (the already-fitted estimator is reused, not refit)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    # predict on new split
    y_tree_train_pred = grid_select_tree.best_estimator_.predict(X_train)
    y_tree_test_pred = grid_select_tree.best_estimator_.predict(X_test)
    
    # calculate recall score on new prediction
    train_recall_sum += recall_score(y_train, y_tree_train_pred, average="micro")
    test_recall_sum += recall_score(y_test, y_tree_test_pred, average="micro")
    # print("Predicted", i+1, "times") #sanity check

# output average
print(f"Check for Overfitting with {n} iterations")
print("Decision Tree Train recall: ", train_recall_sum/n)
print("Decision Tree Test recall: ", test_recall_sum/n)

Check for Overfitting with 100 iterations
Decision Tree Train recall:  0.7289588210990222
Decision Tree Test recall:  0.7289775433821025


However, 100 iterations of random train and test prediction showed a significant increase in the recall score of both train and test predictions. They also showed a reduction in overfitting.

### Fit Whole Data to Model

In [157]:
# refit on the selected features only, the same columns the grid search was tuned on
grid_tree_best_fit = grid_select_tree.best_estimator_.fit(resampled_df[important_features], resampled_df.readmitted)

### Export DecisionTree Related Information

All relevant information and the final fit model were exported in the same way as for the previous model.

#### Model Pickle

In [158]:
with open(r"..\model\best_feature_tree_fit.pickle", "wb") as feat_tree:
    pickle.dump(grid_tree_best_fit, feat_tree)

#### Feature Importance DataFrame

In [197]:
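# note: importances are read from grid_tree (not grid_select_tree), so this table matches the earlier one
grid_dt_imp_feature_imp = dict(zip(X.columns, grid_tree.best_estimator_.feature_importances_)) #get feature importance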
sorted_feature = {k: v for k, v in sorted(grid_dt_imp_feature_imp.items(), key=lambda item: item[1], reverse=True)} #sort descending
grid_dt_imp_feature_df = pd.DataFrame(sorted_feature, index=[0]) # push to dataframe for better visualization
grid_dt_imp_feature_df = grid_dt_imp_feature_df.transpose()
grid_dt_imp_feature_df[:40]

Unnamed: 0,0
num_lab_procedures,0.103863
num_medications,0.085879
number_inpatient,0.067769
time_in_hospital,0.057316
number_diagnoses,0.035806
num_procedures,0.034211
number_outpatient,0.024633
discharge_disposition_id_11,0.021868
gender,0.01416
diag_3_circulatory,0.013535


In [221]:
grid_dt_imp_feature_df.to_csv(r"..\model\grid_dt_imp_feature_df.csv")

#### Important Features List

In [128]:
dt_feature_dict = {"features":important_features}

In [130]:
with open(r"..\model\dt_features.json", "w") as dt_features:  
    json.dump(dt_feature_dict, dt_features) 
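The exported pickle and JSON files can be read back with the same standard-library modules. The sketch below round-trips stand-in objects through a temporary directory; the file names, the dictionary used in place of a fitted model, and the two-item feature list are all hypothetical.

```python
import json
import pickle
import tempfile
from pathlib import Path

# Stand-ins for the fitted model and feature list (hypothetical values).
model_stand_in = {"max_depth": 30, "max_features": 40, "min_samples_split": 25}
feature_payload = {"features": ["num_lab_procedures", "num_medications"]}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pickle"      # hypothetical file name
    features_path = Path(tmp) / "features.json"  # hypothetical file name

    # Write both artifacts, mirroring the export cells.
    with open(model_path, "wb") as fh:
        pickle.dump(model_stand_in, fh)
    with open(features_path, "w") as fh:
        json.dump(feature_payload, fh)

    # Read them back the way a downstream script would.
    with open(model_path, "rb") as fh:
        loaded_model = pickle.load(fh)
    with open(features_path) as fh:
        loaded_features = json.load(fh)["features"]

print(loaded_model, loaded_features)
```

With a real fitted estimator, the loaded feature list would be used to select columns before calling `predict`, provided the model was fit on exactly those columns.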

*** 

## Random Forest Classifier - Basic

The Random Forest model was chosen for its high out-of-the-box accuracy. Its performance will be compared to that of the decision tree models.

In [131]:
rf_clf = RandomForestClassifier()

In [132]:
# split features and target
y = resampled_df.readmitted
X = resampled_df.drop(columns = "readmitted")
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [136]:
rf_clf.fit(X_train, y_train)

RandomForestClassifier()

In [134]:
# overfit check
#make prediction
y_rf_train_pred = rf_clf.predict(X_train)
y_rf_test_pred = rf_clf.predict(X_test)
# get recall score
rf_train_recall = recall_score(y_train, y_rf_train_pred, average="micro")
rf_test_recall = recall_score(y_test, y_rf_test_pred, average="micro")

print("Check for Overfitting")
print("Random Forest Train recall: ", rf_train_recall)
print("Random Forest Test recall: ", rf_test_recall)

Check for Overfitting
Random Forest Train recall:  0.9999837974027237
Random Forest Test recall:  0.8611043600836047


Initial prediction showed overfitting, which is expected. 

In [137]:
n = 100 # number of iterations
avg_rf_train_recall = 0
avg_rf_test_recall = 0

for i in range(0, n):
    # new random split (the already-fitted estimator is reused, not refit)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    y_rf_train_pred = rf_clf.predict(X_train)
    y_rf_test_pred = rf_clf.predict(X_test)
    # accumulate recall scores for the train and test partitions
    avg_rf_train_recall += recall_score(y_train, y_rf_train_pred, average="micro")
    avg_rf_test_recall += recall_score(y_test, y_rf_test_pred, average="micro")
    # print(i) # progress check

print(f"Check for Overfitting with {n} iterations")
print("Random Forest Train recall: ", avg_rf_train_recall/n)
print("Random Forest Test recall: ", avg_rf_test_recall/n)

Check for Overfitting with 100 iterations
Random Forest Train recall:  0.9642632273953512
Random Forest Test recall:  0.9642793953239684


However, the iteration showed that the model performed well out of the box, with averaged train and test recall both near 0.964 and no sign of overfitting.

In [165]:
rf_feature_rank = dict(zip(resampled_df.drop(columns="readmitted").columns, rf_clf.feature_importances_))
rf_feature_sorted = {k: v for k, v in sorted(rf_feature_rank.items(), key=lambda item: item[1], reverse=True)}
rf_feature_df = pd.DataFrame(rf_feature_sorted, index=[0])
rf_feature_df.transpose()

Unnamed: 0,0
num_lab_procedures,0.09183
num_medications,0.082009
time_in_hospital,0.059196
number_diagnoses,0.041965
number_inpatient,0.041836
num_procedures,0.040635
gender,0.023135
diag_3_circulatory,0.018513
diag_2_circulatory,0.018278
number_outpatient,0.01795


The feature importances were ranked very similarly to those of the decision tree model.

***

## Random Forest Classifier - GridSearchCV

The random forest was hypertuned with the same parameters as the decision tree model, but over wider ranges.

In [159]:
grid_rf_clf = RandomForestClassifier()

In [170]:
rf_params =  {"max_depth":range(10,41,10),
              "min_samples_split":range(10,101,10),
              "max_features":range(5, 51, 5)
              }

In [171]:
grid_rf = GridSearchCV(grid_rf_clf, param_grid=rf_params, cv=10, scoring="recall_micro", n_jobs=-1, verbose=1)

In [172]:
# split features and target
y = resampled_df.readmitted
X = resampled_df.drop(columns = "readmitted")
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [173]:
grid_rf.fit(X_train, y_train)

Fitting 10 folds for each of 400 candidates, totalling 4000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  9.9min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 35.8min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed: 87.1min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed: 149.1min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed: 268.5min
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed: 403.2min
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed: 598.8min
[Parallel(n_jobs=-1)]: Done 4000 out of 4000 | elapsed: 833.4min finished


GridSearchCV(cv=10, estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'max_depth': range(10, 41, 10),
                         'max_features': range(5, 51, 5),
                         'min_samples_split': range(10, 101, 10)},
             scoring='recall_micro', verbose=1)

In [178]:
grid_rf.best_estimator_

RandomForestClassifier(max_depth=40, max_features=45, min_samples_split=10)

In [179]:
# overfit check
#make prediction
y_rf_grid_train_pred = grid_rf.best_estimator_.predict(X_train)
y_rf_grid_test_pred = grid_rf.best_estimator_.predict(X_test)
# get recall score
rf_grid_train_recall = recall_score(y_train, y_rf_grid_train_pred, average="micro")
rf_grid_test_recall = recall_score(y_test, y_rf_grid_test_pred, average="micro")

print("Check for Overfitting")
print("GridsearchCV Random Forest Train recall: ", rf_grid_train_recall)
print("GridsearchCV Random Forest Test recall: ", rf_grid_test_recall)

Check for Overfitting
GridsearchCV Random Forest Train recall:  0.951829678297431
GridsearchCV Random Forest Test recall:  0.9514655130510864


The hypertuned model showed no overfitting.

In [181]:
n = 100 # number of iterations
avg_rf_grid_train_recall = 0
avg_rf_grid_test_recall = 0

for i in range(0, n):
    # new random split (the already-fitted estimator is reused, not refit)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    y_rf_grid_train_pred = grid_rf.best_estimator_.predict(X_train)
    y_rf_grid_test_pred = grid_rf.best_estimator_.predict(X_test)
    # accumulate recall scores for the train and test partitions
    avg_rf_grid_train_recall += recall_score(y_train, y_rf_grid_train_pred, average="micro")
    avg_rf_grid_test_recall += recall_score(y_test, y_rf_grid_test_pred, average="micro")
    # print(i) # progress check

print(f"Check for Overfitting with {n} iterations")
print("GridsearchCV Random Forest Train recall: ", avg_rf_grid_train_recall/n)
print("GridsearchCV Random Forest Test recall: ", avg_rf_grid_test_recall/n)

Check for Overfitting with 100 iterations
GridsearchCV Random Forest Train recall:  0.9517438855448528
GridsearchCV Random Forest Test recall:  0.95172288922374


The 100 iterations of random train/test splits showed no reduction in performance and no sign of overfitting.

Accuracy was also very high, as expected, since micro-averaged recall equals accuracy for a single-label multiclass target.

In [183]:
accuracy_score(y_train, y_rf_grid_train_pred)

0.9518215769987929

In [184]:
accuracy_score(y_test, y_rf_grid_test_pred)

0.9514898167501094

Feature weights were similar to those of the untuned random forest model.

In [191]:
grid_rf_feature_rank = dict(zip(resampled_df.drop(columns="readmitted").columns, grid_rf.best_estimator_.feature_importances_))
sorted_grid_rf_feature = {k: v for k, v in sorted(grid_rf_feature_rank.items(), key=lambda item: item[1], reverse=True)} #sort descending
grid_rf_feature_df = pd.DataFrame(sorted_grid_rf_feature, index=[0]) # push to dataframe for better visualization
grid_rf_feature_df = grid_rf_feature_df.transpose()
grid_rf_feature_df[:40]

Unnamed: 0,0
num_lab_procedures,0.112865
num_medications,0.09134
time_in_hospital,0.058856
number_inpatient,0.045307
num_procedures,0.040027
number_diagnoses,0.038103
number_outpatient,0.016905
gender,0.016418
diag_3_circulatory,0.014962
diag_2_circulatory,0.014337


### GridSearchCV Random Forest Model Export

Final model was refit to whole data set and exported.

In [189]:
best_grid_rf_clf = RandomForestClassifier(max_depth=40, max_features=45, min_samples_split=10)
best_grid_rf_clf.fit(X, y)

RandomForestClassifier(max_depth=40, max_features=45, min_samples_split=10)

#### Pickle Export

In [190]:
#pickle

with open(r"..\model\best_grid_rf_clf.pickle", "wb") as best_grid_rf:
    pickle.dump(best_grid_rf_clf, best_grid_rf)

#### Export Feature Rank

In [192]:
grid_rf_feature_df.to_csv(r"..\model\grid_rf_feature_score.csv")

## Feature Rank Comparison Between Models

All previous feature weights were combined for feature importance analysis.

In [203]:
grid_dt_feature_df.columns = ["grid_decisiontree"]
grid_dt_feature_df.head()

Unnamed: 0,grid_decisiontree
num_lab_procedures,0.103863
num_medications,0.085879
number_inpatient,0.067769
time_in_hospital,0.057316
number_diagnoses,0.035806


In [205]:
grid_rf_feature_df.columns = ["grid_randomforest"]
grid_rf_feature_df.head()

Unnamed: 0,grid_randomforest
num_lab_procedures,0.112865
num_medications,0.09134
time_in_hospital,0.058856
number_inpatient,0.045307
num_procedures,0.040027


In [207]:
all_feature_grid_models = grid_dt_feature_df.join(grid_rf_feature_df)

In [212]:
grid_dt_imp_feature_df.columns = ["grid_dt_select_features"]

In [214]:
all_feature_grid_models  = all_feature_grid_models.join(grid_dt_imp_feature_df)

In [217]:
all_feature_grid_models = all_feature_grid_models[["grid_decisiontree", "grid_dt_select_features", "grid_randomforest"]]

Between the decision tree and random forest models, the following changes were observed:

- The ranks of time in hospital and number of inpatient visits were flipped.
- The ranks of number of diagnoses and number of procedures were flipped.
- Discharge disposition id 11 was less important for the random forest model than for the decision tree model.
- The random forest model put more emphasis on all circulatory diagnoses than the decision tree model.
- The number of emergency visits was more important in the random forest model than in the decision tree model.
- Besides insulin, drugs that manage type II diabetes were ranked higher.
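The rank flips can be checked programmatically. The sketch below uses the ten rows printed by `all_feature_grid_models.head(20)` in this section, so ranks are relative to these ten features only, not the full feature set. `ranks` and `moved` are names introduced here for illustration.

```python
# (decision tree weight, random forest weight), copied from the comparison table.
weights = {
    "num_lab_procedures": (0.103863, 0.112865),
    "num_medications": (0.085879, 0.09134),
    "number_inpatient": (0.067769, 0.045307),
    "time_in_hospital": (0.057316, 0.058856),
    "number_diagnoses": (0.035806, 0.038103),
    "num_procedures": (0.034211, 0.040027),
    "number_outpatient": (0.024633, 0.016905),
    "discharge_disposition_id_11": (0.021868, 0.012951),
    "gender": (0.01416, 0.016418),
    "diag_3_circulatory": (0.013535, 0.014962),
}


def ranks(column):
    """Map each feature to its 1-based rank for the given model column."""
    ordered = sorted(weights, key=lambda name: weights[name][column], reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


dt_rank, rf_rank = ranks(0), ranks(1)

# Keep only the features whose rank differs between the two models.
moved = {
    name: (dt_rank[name], rf_rank[name])
    for name in weights
    if dt_rank[name] != rf_rank[name]
}
for name, (dt, rf) in moved.items():
    print(f"{name}: decision tree #{dt} -> random forest #{rf}")
```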

In [220]:
all_feature_grid_models.head(20)

Unnamed: 0,grid_decisiontree,grid_dt_select_features,grid_randomforest
num_lab_procedures,0.103863,0.103863,0.112865
num_medications,0.085879,0.085879,0.09134
number_inpatient,0.067769,0.067769,0.045307
time_in_hospital,0.057316,0.057316,0.058856
number_diagnoses,0.035806,0.035806,0.038103
num_procedures,0.034211,0.034211,0.040027
number_outpatient,0.024633,0.024633,0.016905
discharge_disposition_id_11,0.021868,0.021868,0.012951
gender,0.01416,0.01416,0.016418
diag_3_circulatory,0.013535,0.013535,0.014962


In [223]:
all_feature_grid_models.to_csv(r"..\model\all_model_feature_weight.csv")