# Stroke Prediction Model (Binary Classification)

### Remark: The data is strongly imbalanced: 4861 patients have target=0 (no stroke) and only 249 (<5%) have target=1 (stroke). A trivial predictor that always returns 0 reaches an accuracy of 4861/5110 = 95.13%. That sounds good at first, but such a predictor is useless because it has no discriminative power at all. Accuracy is therefore not a useful metric for strongly imbalanced data. For the sake of completeness we still report accuracy below, but our focus is on AUC as the performance metric (the trivial predictor has an AUC of 0.5)!
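
The arithmetic in the remark can be reproduced with a few lines of plain Python. This is only a sketch: the helper names `accuracy` and `auc` are made up for this illustration, and the AUC is computed as the probability that a random positive gets a higher score than a random negative, with ties counted as one half.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels that are predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# class counts quoted in the remark
y_true = [0] * 4861 + [1] * 249

# trivial predictor: always class 0, with the same constant score for everyone
y_pred = [0] * len(y_true)
scores = [0.0] * len(y_true)

acc_trivial = accuracy(y_true, y_pred)
auc_trivial = auc(y_true, scores)
print("Accuracy:", round(100 * acc_trivial, 2), "%")
print("AUC     :", auc_trivial)
```

Because every patient receives the same score, each positive/negative pair is a tie, which is why the AUC ends up at exactly 0.5 regardless of the class balance.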


## Table of Contents
* [Import and first glance](#1)
* [Data Cleansing](#2)
* [Numerical Features](#3)
* [Categorical Features](#4)
* [Target](#5)
* [Build Model](#6)
* [Evaluate on Training Data](#7)
* [Evaluate on Test Set](#8)

In [None]:
# packages

# standard
import numpy as np
import pandas as pd
import time

# plot
import matplotlib.pyplot as plt
import seaborn as sns

# statistics tools
from statsmodels.graphics.mosaicplot import mosaic

# ML
import h2o
from h2o.estimators import H2ORandomForestEstimator
from h2o.estimators import H2OGradientBoostingEstimator

<a id='1'></a>
# Import and first glance

In [None]:
# load data
df = pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')

In [None]:
# dimensions of data
df.shape

In [None]:
# column names
print(df.columns.tolist())

<a id='2'></a>
# Data Cleansing

In [None]:
df.info()

We have missing values for BMI!
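
The next cell fills the gaps with a sentinel value. A possible variation, shown here only as a sketch on made-up values, is to keep an explicit 0/1 flag alongside the sentinel so that "BMI was missing" stays visible as its own feature. The function name `impute_with_flag` and the toy list are assumptions for this illustration; in pandas the same flag could be obtained with `df.bmi.isna().astype(int)` before calling `fillna`.

```python
import math

SENTINEL = -99.0  # same sentinel value the notebook uses for missing BMI


def impute_with_flag(values, sentinel=SENTINEL):
    """Return (imputed values, 0/1 missing flags) for a list that may contain None or NaN."""
    imputed, flags = [], []
    for v in values:
        missing = v is None or (isinstance(v, float) and math.isnan(v))
        imputed.append(sentinel if missing else float(v))
        flags.append(1 if missing else 0)
    return imputed, flags


bmi_toy = [36.6, None, 32.5, float("nan"), 24.0]  # made-up values, not taken from the data set
bmi_imputed, bmi_missing = impute_with_flag(bmi_toy)
print(bmi_imputed)
print(bmi_missing)
```

For tree-based models the sentinel alone is usually sufficient, since a value far outside the valid range can be isolated by a single split; the flag mainly helps readability and models that are sensitive to the numeric scale.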

In [None]:
# impute missing BMI with the sentinel value -99 (outside the valid BMI range)
df.bmi = df.bmi.fillna(-99)

In [None]:
# rename columns
df.rename(columns = {'Residence_type':'residence_type'}, inplace = True)

In [None]:
# define target variable
df['target'] = df.stroke
df = df.drop(['stroke'], axis=1) # remove stroke column

<a id='3'></a>
# Numerical Features

In [None]:
# select numerical features
features_num = ['age', 'avg_glucose_level','bmi']

In [None]:
# basic stats
df[features_num].describe(percentiles=[0.1,0.25,0.5,0.75,0.9])

In [None]:
# plot distribution of numerical features
for f in features_num:
    df[f].plot(kind='hist', bins=50)
    plt.title(f)
    plt.grid()
    plt.show()

In [None]:
# pairwise scatter plot
sns.pairplot(df[features_num], 
             kind='reg', 
             plot_kws={'line_kws':{'color':'magenta'}, 'scatter_kws': {'alpha': 0.1}})
plt.show()

In [None]:
# Spearman (Rank) correlation
corr_spearman = df[features_num].corr(method='spearman')

fig = plt.figure(figsize = (6,5))
sns.heatmap(corr_spearman, annot=True, cmap="RdYlGn", vmin=-1, vmax=+1)
plt.title('Spearman Correlation')
plt.show()

<a id='4'></a>
# Categorical Features

In [None]:
features_cat = ['gender','hypertension','heart_disease','ever_married',
                'work_type','residence_type','smoking_status']

In [None]:
for f in features_cat:
    df[f].value_counts().plot(kind='bar')
    plt.title(f)
    plt.grid()
    plt.show()

<a id='5'></a>
# Target

In [None]:
# calc frequencies
target_count = df.target.value_counts()
print(target_count)
print()
print('Percentage of strokes [1]:', np.round(100*target_count[1] / target_count.sum(),2), '%')

In [None]:
# plot target distribution
target_count.plot(kind='bar')
plt.title('Target = Stroke')
plt.grid()
plt.show()

### Target vs Numerical Features

In [None]:
# add binned version of numerical features

# quantile based:
df['age_bin'] = pd.qcut(df['age'], q=10, precision=1)
df['avg_glucose_level_bin'] = pd.qcut(df['avg_glucose_level'], q=10, precision=1)

# explicitly defined bins:
df['bmi_bin'] = pd.cut(df['bmi'], [-100,10,20,25,30,35,40,50,100])

In [None]:
# plot target vs features using mosaic plot
plt_para_save = plt.rcParams['figure.figsize'] # remember plot settings

for f in features_num:
    f_bin = f+'_bin'
    plt.rcParams["figure.figsize"] = (16,7) # increase plot size for mosaics
    mosaic(df, [f_bin, 'target'], title='Target vs ' + f + ' [binned]')
    plt.show()
    
# reset plot size again
plt.rcParams['figure.figsize'] = plt_para_save

#### "Naive" Interpretations based on those univariate plots:
* Risk increases with age and glucose level (diabetes).
* High BMI levels also indicate higher risk.
* A missing value for BMI (the leftmost column) seems to indicate a massively increased risk!?

In [None]:
# BMI - check cross table
ctab = pd.crosstab(df.bmi_bin, df.target)
ctab

In [None]:
# normalize each row to get row-wise target percentages
(ctab.transpose() / ctab.sum(axis=1)).transpose()

#### Almost 20% of the patients with a missing BMI had a stroke! This is far higher than for the other bins.
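
The row-wise normalization behind this statement can be sketched without pandas as follows. The counts are made up for illustration and are not the notebook's actual cross table; `row_rates` is a hypothetical helper name. In pandas, `pd.crosstab(..., normalize='index')` should give the same row-wise shares in one call.

```python
def row_rates(ctab):
    """Normalize each row of a {bin: [n_target0, n_target1]} table so that it sums to 1."""
    return {b: [c / sum(counts) for c in counts] for b, counts in ctab.items()}


# made-up counts for illustration only
ctab_toy = {
    "missing": [80, 20],
    "(20, 25]": [970, 30],
    "(25, 30]": [1900, 100],
}

rates = row_rates(ctab_toy)
for b, (p0, p1) in rates.items():
    print(f"{b:>10}  target=0: {p0:.3f}  target=1: {p1:.3f}")
```

Reading the `target=1` column row by row gives the stroke share per bin, which is the quantity compared in the statement above.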

### Target vs Categorical Features

In [None]:
# plot target vs features using mosaic plot
plt_para_save = plt.rcParams['figure.figsize'] # remember plot settings

for f in features_cat:
    plt.rcParams["figure.figsize"] = (8,7) # increase plot size for mosaics
    mosaic(df, [f, 'target'], title='Target vs ' + f)
    plt.show()
    
# reset plot size again
plt.rcParams['figure.figsize'] = plt_para_save

#### "Naive" Interpretations based on those univariate plots:
* Influence of gender seems surprisingly low
* Hypertension and heart disease massively increase risk of stroke
* "Ever married" too!?
* Work type: Higher risk for self-employed (more stress?)
* Residence type: Slightly higher risk for urban vs rural
* Smoking: Highest risk for *former* smokers. Not much difference between "smokes" and "never smoked"?
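
These interpretations are called "naive" because a univariate view cannot separate a feature's own effect from that of a correlated feature. One possible explanation for a surprising pattern such as "ever married" is confounding by age; this is an assumption and is not verified in this notebook. The sketch below uses purely synthetic counts, constructed so that the risk depends on the age group only, to show how a marginal difference can appear anyway.

```python
# synthetic counts: (age_group, ever_married) -> (n_patients, n_strokes)
# constructed so that the stroke risk depends on the age group only
toy = {
    ("young", "No"): (800, 8),    # 1% risk
    ("young", "Yes"): (200, 2),   # 1% risk
    ("old", "No"): (200, 20),     # 10% risk
    ("old", "Yes"): (800, 80),    # 10% risk
}


def rate(cells):
    """Pooled stroke rate over a list of (n_patients, n_strokes) cells."""
    n = sum(c[0] for c in cells)
    k = sum(c[1] for c in cells)
    return k / n


# marginal view: ignore age, compare married vs. not married
marginal = {m: rate([v for (_, mm), v in toy.items() if mm == m]) for m in ("No", "Yes")}

# stratified view: compare married vs. not married within each age group
stratified = {key: v[1] / v[0] for key, v in toy.items()}

print("Marginal  :", marginal)
print("Stratified:", stratified)
```

In this toy setup the marginal rates differ by almost a factor of three, although marriage has no effect within either age group. A multivariate model, as built in the next section, is better suited to disentangle such effects.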

In [None]:
# "ever married" - check cross table
ctab = pd.crosstab(df.ever_married, df.target)
ctab

In [None]:
# normalize each row
(ctab.transpose() / ctab.sum(axis=1)).transpose()

<a id='6'></a>
# Build Model

In [None]:
# select predictors
predictors = features_num + features_cat
print('Number of predictors: ', len(predictors))
print(predictors)

In [None]:
# start H2O
h2o.init(max_mem_size='12G', nthreads=4) # Use maximum of 12 GB RAM and 4 cores

In [None]:
# upload data frame in H2O environment
df_hex = h2o.H2OFrame(df)

df_hex['target'] = df_hex['target'].asfactor()

# train / test split (70/30)
train_hex, test_hex = df_hex.split_frame(ratios=[0.7], seed=999)

# pandas versions of train/test
df_train = train_hex.as_data_frame()
df_test = test_hex.as_data_frame()

In [None]:
# export for potential external processing
df_train.to_csv('df_train.csv')
df_test.to_csv('df_test.csv')

In [None]:
# define Gradient Boosting model
fit_1 = H2OGradientBoostingEstimator(ntrees = 100,
                                     max_depth=4,
                                     min_rows=10,
                                     learn_rate=0.01, # default: 0.1
                                     sample_rate=1,
                                     col_sample_rate=0.7,
                                     nfolds=5,
                                     score_each_iteration=True,
                                     stopping_metric='auto',
                                     stopping_rounds=10,
                                     seed=999)

In [None]:
# train model
t1 = time.time()
fit_1.train(x=predictors,
            y='target',
            training_frame=train_hex)
t2 = time.time()
print('Elapsed time [s]: ', np.round(t2-t1,2))

In [None]:
# show training scoring history
plt.rcParams['figure.figsize']=(7,4)
fit_1.plot()

In [None]:
# show cross validation metrics
fit_1.cross_validation_metrics_summary()

In [None]:
# show scoring history - training vs cross validations
for i in range(5):
    cv_model_temp = fit_1.cross_validation_models()[i]
    df_cv_score_history = cv_model_temp.scoring_history()  # score_history() is a deprecated alias
    my_title = 'CV ' + str(1+i) + ' - Scoring History [AUC]'
    plt.scatter(df_cv_score_history.number_of_trees,
                y=df_cv_score_history.training_auc, 
                c='blue', label='training')
    plt.scatter(df_cv_score_history.number_of_trees,
                y=df_cv_score_history.validation_auc, 
                c='darkorange', label='validation')
    plt.title(my_title)
    plt.xlabel('Number of Trees')
    plt.ylabel('AUC')
    plt.ylim(0.8,1)
    plt.legend()
    plt.grid()
    plt.show()

<a id='7'></a>
# Evaluate on Training Data

### ROC Curve - Training Data

In [None]:
# training performance
perf_train = fit_1.model_performance(train=True)
perf_train.plot()

### ROC Curve - Cross Validation

In [None]:
# cross validation performance
perf_cv = fit_1.model_performance(xval=True)
perf_cv.plot()

### Confusion Matrix

In [None]:
# on training data - automatic threshold (optimal F1 score)
conf_train = fit_1.confusion_matrix(train=True)
conf_train.show()

In [None]:
# corresponding accuracy for this threshold:
conf_list_temp = conf_train.to_list()
n_matrix = sum(conf_list_temp[0]) + sum(conf_list_temp[1])
acc_t0 = (conf_list_temp[0][0]+conf_list_temp[1][1]) / n_matrix
print('Accuracy:', np.round(acc_t0,6))

#### Selecting the threshold by optimal F1 is not really helpful here: there is a big gap between actual positives (184) and predicted positives (302). Let's try to improve by selecting the threshold manually:

In [None]:
# alternatively specify threshold manually - here we try to achieve a symmetric outcome
tt = 0.148
conf_train_man = fit_1.confusion_matrix(train=True, thresholds=tt)
conf_train_man.show()

In [None]:
# corresponding accuracy for manual threshold:
conf_list_temp = conf_train_man.to_list()
n_matrix = sum(conf_list_temp[0]) + sum(conf_list_temp[1]) 
acc_t1 = (conf_list_temp[0][0]+conf_list_temp[1][1]) / n_matrix
print('Accuracy:', np.round(acc_t1,6))

#### Much better: 184 actual positives vs. 185 predicted positives!
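
A threshold with this property does not have to be found by trial and error. A minimal sketch, assuming that a score greater than or equal to the threshold counts as a positive prediction: sort the scores in descending order and take the k-th largest one, where k is the number of actual positives. The function name and the toy scores below are made up for this illustration.

```python
def threshold_for_k_positives(scores, k):
    """Return t such that (score >= t) flags the k highest scores (more if there are ties at the cut)."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    return sorted(scores, reverse=True)[k - 1]


# made-up scores and labels for illustration
scores_toy = [0.02, 0.40, 0.07, 0.15, 0.03, 0.22, 0.01, 0.09, 0.31, 0.05]
y_toy = [0, 1, 0, 0, 0, 1, 0, 0, 1, 0]

k = sum(y_toy)  # number of actual positives
t = threshold_for_k_positives(scores_toy, k)
n_pred_pos = sum(s >= t for s in scores_toy)

print("Threshold          :", t)
print("Actual positives   :", k)
print("Predicted positives:", n_pred_pos)
```

Applied to the training predictions, the same idea should lead to a value close to the manually chosen threshold. Note that a threshold tuned this way on the training data is not guaranteed to give the same balance on cross validation or test data.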

In [None]:
# check on cross validation
conf_cv_man = fit_1.confusion_matrix(xval=True, thresholds=tt)
conf_cv_man.show()

In [None]:
# corresponding accuracy for our manual threshold:
conf_list_temp = conf_cv_man.to_list()
n_matrix = sum(conf_list_temp[0]) + sum(conf_list_temp[1])
acc_t1_CV = (conf_list_temp[0][0]+conf_list_temp[1][1]) / n_matrix
print('Accuracy:', np.round(acc_t1_CV,6))

### Variable Importance

In [None]:
# basic version
fit_1.varimp_plot()

In [None]:
# variable importance using SHAP values => shows direction as well as magnitude of feature impact
t1 = time.time()
fit_1.shap_summary_plot(train_hex);
t2 = time.time()
print('Elapsed time [s]: ', np.round(t2-t1,2))

#### The blue dots for BMI may look confusing. They stem from the strongly predictive missing values, which we encoded as -99!

### Predictions on training data

In [None]:
# predict on train set (extract probabilities only)
pred_train = fit_1.predict(train_hex)['p1']
pred_train = pred_train.as_data_frame().p1

# and plot
plt.hist(pred_train, bins=50)
plt.title('Predictions on Train Set')
plt.grid()
plt.show()

In [None]:
# check calibration: the sum of predicted probabilities should be close to the number of actual strokes
frequency_pred = sum(pred_train)
frequency_act = df_train.target.sum()
print('Predicted Frequency:', frequency_pred)
print('Actual Frequency   :', frequency_act)
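
The global comparison above can hide compensating errors, for example too high predictions at the low end and too low predictions at the high end. A finer check, sketched here on made-up values, sorts the patients by predicted probability, splits them into equally sized groups and compares predicted and actual counts per group. The helper `calibration_table` is a hypothetical name and not part of H2O.

```python
def calibration_table(y_true, p_pred, n_bins=4):
    """Sort by predicted probability, split into n_bins groups, sum predictions and labels per group."""
    pairs = sorted(zip(p_pred, y_true))
    n = len(pairs)
    rows = []
    for i in range(n_bins):
        chunk = pairs[i * n // n_bins:(i + 1) * n // n_bins]
        rows.append({
            "n": len(chunk),
            "predicted": sum(p for p, _ in chunk),  # expected number of cases
            "actual": sum(y for _, y in chunk),     # observed number of cases
        })
    return rows


# made-up probabilities and labels for illustration
p_toy = [0.01, 0.02, 0.03, 0.04, 0.10, 0.10, 0.30, 0.40]
y_toy = [0, 0, 0, 0, 0, 1, 0, 1]

table = calibration_table(y_toy, p_toy, n_bins=2)
for row in table:
    print(row)
```

With the notebook's objects, `df_train.target` and `pred_train` could be passed in directly; the "highest 20" and "lowest 20" checks on the test set later on follow the same idea for the two extreme groups.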

<a id='8'></a>
# Evaluate on Test Set

In [None]:
# calc performance on test set
perf_test = fit_1.model_performance(test_hex)

# ROC Curve - Test Set
perf_test.plot()

In [None]:
# confusion matrix using our manual threshold
conf_test_man = perf_test.confusion_matrix(thresholds=tt)
conf_test_man.show()

#### Quite good: 65 actual positives vs. 69 predicted positives.
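
Matching totals only show that the threshold is balanced; they do not show how many of the predicted positives are actual positives. Precision and recall answer that question. The sketch below assumes the 2x2 layout `[[TN, FP], [FN, TP]]` (rows = actual, columns = predicted), which is the layout the accuracy cells in this notebook rely on for `to_list()`. The counts are made up and `summarize` is a hypothetical helper.

```python
def summarize(cm):
    """Accuracy, precision and recall from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
    }


# made-up counts: 5 actual positives vs. 6 predicted positives, but only 1 true positive
cm_toy = [[90, 5], [4, 1]]
metrics = summarize(cm_toy)
for name, value in metrics.items():
    print(f"{name:>9}: {value:.3f}")
```

In this toy example the totals nearly match and the accuracy is 91%, yet precision and recall are both at or below 20%. Passing `conf_test_man.to_list()` to the same helper would give the corresponding values for the test set.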

In [None]:
# calc accuracy for manual threshold:
conf_list_temp = conf_test_man.to_list()
n_matrix = sum(conf_list_temp[0]) + sum(conf_list_temp[1]) 
acc_t1_test = (conf_list_temp[0][0]+conf_list_temp[1][1]) / n_matrix
print('Accuracy:', np.round(acc_t1_test,6))

In [None]:
# predict on test set (extract probabilities only)
pred_test = fit_1.predict(test_hex)['p1']
pred_test = pred_test.as_data_frame().p1

# and plot
plt.hist(pred_test, bins=50)
plt.title('Predictions on Test Set')
plt.grid()
plt.show()

In [None]:
# connect prediction with data frame
df_test['prediction'] = pred_test

### Show examples

In [None]:
# show most endangered patients (according to our model) in test set
df_high_20 = df_test.nlargest(20, columns='prediction')
df_high_20

#### Check calibration at high end:

In [None]:
print('Actual cases in highest 20    :', df_high_20.target.sum())
print('Predicted cases in highest 20 :', np.round(df_high_20.prediction.sum(),2))

In [None]:
# show least endangered patients (according to our model) in test set
df_low_20 = df_test.nsmallest(20, columns='prediction')
df_low_20

#### Check calibration at low end:

In [None]:
print('Actual cases in lowest 20    :', df_low_20.target.sum())
print('Predicted cases in lowest 20 :', np.round(df_low_20.prediction.sum(),2))