# Predicting stroke

* Stroke is a medical condition in which poor blood flow to the brain causes cell death. Sudden bleeding in the brain can also cause a stroke if it damages brain cells. (1)
* There are over 13.7 million new strokes each year. Globally, one in four people over age 25 will have a stroke in their lifetime. (2)
* Worldwide, cerebrovascular accidents (stroke) are the second leading cause of death and the third leading cause of disability. (3)
* Each year, 52% of all strokes occur in men and 48% in women. Metabolic factors (high systolic blood pressure, high BMI, high fasting plasma glucose, high total cholesterol, and low glomerular filtration rate) as well as behavioural factors (smoking, poor diet, and low physical activity) are risk factors for stroke. (2)


## Aim
Build a model that predicts whether a patient is likely to have a stroke, based on the following parameters: id, gender, age, hypertension, heart_disease, ever_married, work_type, residence_type, avg_glucose_level, bmi, smoking_status. This is a binary classification problem.


In [None]:
#Load libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import time
import warnings
warnings.filterwarnings("ignore")

#Load dataset
df = pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')

## EDA

### Get to know the dataset

In [None]:
#Get to know the dataset and display all columns
pd.set_option('display.max_columns', None)
df.head()

In [None]:
#Determine number of rows and columns
df.shape

In [None]:
#Check features, datatypes and null values
df.info()

Feature "bmi" contains 201 null values. Because this is only a small share of the dataset, the affected rows are dropped.

### Remove null values from dataset

In [None]:
#Remove rows with null values
df = df.dropna(how='any',axis=0) 

In [None]:
#Determine number of rows and columns
df.shape
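Before dropping rows, it can help to check how much data, and how many stroke cases, would be lost. The following is a minimal sketch on a small made-up frame; `toy` and its values are hypothetical and are not taken from the stroke dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "bmi": [22.0, np.nan, 31.5, 27.1, np.nan, 24.3],
    "stroke": [0, 1, 0, 0, 0, 1],
})

# Share of rows with a missing bmi, overall and per class
missing_overall = toy["bmi"].isna().mean()
missing_by_class = toy["bmi"].isna().groupby(toy["stroke"]).mean()

# Drop only the rows where bmi is missing
toy_clean = toy.dropna(subset=["bmi"])

print(f"rows missing bmi: {missing_overall:.1%}")
print(missing_by_class)
print(toy_clean.shape)
```

If the missing share is much higher in the stroke class than in the no stroke class, dropping rows removes a disproportionate part of the already small minority class.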

### Proportion of stroke patients in dataset

In [None]:
plt.figure(figsize=(3,5))
countplot_stroke = sns.countplot(data=df,x='stroke')
plt.title("Number of stroke and no stroke patients in dataset")
for p in countplot_stroke.patches: 
    countplot_stroke.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')
plt.ylim(0, 6000)

The dataset is imbalanced: it contains only a few stroke patients. This needs to be addressed during data preprocessing, before building the predictive models.
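To illustrate why the imbalance matters, the sketch below fits a baseline that always predicts the majority class. The labels are hypothetical (950 no stroke, 50 stroke) and only chosen to mimic a strongly imbalanced target; `DummyClassifier` is scikit-learn's baseline estimator.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 950 "no stroke" and 50 "stroke" cases
y_toy = np.array([0] * 950 + [1] * 50)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for this baseline

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

acc = accuracy_score(y_toy, pred)
rec = recall_score(y_toy, pred)
print(f"accuracy={acc:.2f}, recall for stroke={rec:.2f}")
```

The baseline reaches high accuracy without finding a single stroke case, which is why recall and precision for the stroke class are more informative here than accuracy.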

### Gender of stroke and no stroke patients

In [None]:
x,y = 'stroke', 'gender'
(df.groupby(x)[y].value_counts(normalize=True).mul(100).rename('percent').reset_index().pipe((sns.catplot,'data'), x=x,y='percent',hue=y,kind='bar'))

* approx. 800 more female patients than male patients in the dataset
* ratio female/male is smaller in stroke patients vs. non-stroke patients, suggesting that stroke is more prevalent in male patients.
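The chained `groupby(...).value_counts(normalize=True)` call used for these plots computes, within each stroke group, the percentage of each category. A sketch of the same table using `pd.crosstab`, on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "stroke": [0, 0, 0, 0, 1, 1],
    "gender": ["Female", "Female", "Female", "Male", "Female", "Male"],
})

# Row-normalised percentages: each stroke group sums to 100
pct = pd.crosstab(toy["stroke"], toy["gender"], normalize="index").mul(100)
print(pct)
```

Because each row sums to 100, the bars compare the composition of the two groups rather than their absolute sizes.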



### Hypertension in stroke patients

In [None]:
x,y = 'stroke', 'hypertension'
(df.groupby(x)[y].value_counts(normalize=True).mul(100).rename('percent').reset_index().pipe((sns.catplot,'data'), x=x,y='percent',hue=y,kind='bar'))

* hypertension more prevalent in stroke patients compared to no stroke patients

### Heart disease in stroke patients

In [None]:
x,y = 'stroke', 'heart_disease'
(df.groupby(x)[y].value_counts(normalize=True).mul(100).rename('percent').reset_index().pipe((sns.catplot,'data'), x=x,y='percent',hue=y,kind='bar'))

* heart disease more prevalent in stroke patients compared to no stroke patients

### Work type of stroke patients

In [None]:
x,y = 'stroke', 'work_type'
(df.groupby(x)[y].value_counts(normalize=True).mul(100).rename('percent').reset_index().pipe((sns.catplot,'data'), x=x,y='percent',hue=y,kind='bar'))

Difficult to draw conclusions due to small number of stroke patients:
* class "children" less prevalent in stroke patients
* work type "private" and "self-employed" more prevalent in stroke patients

### Residence type of stroke patients

In [None]:
x,y = 'stroke', 'Residence_type'
(df.groupby(x)[y].value_counts(normalize=True).mul(100).rename('percent').reset_index().pipe((sns.catplot,'data'), x=x,y='percent',hue=y,kind='bar'))

* residence type does not seem to be a relevant factor for stroke in this cohort of patients

### Smoking status and stroke

In [None]:
x,y = 'stroke', 'smoking_status'
(df.groupby(x)[y].value_counts(normalize=True).mul(100).rename('percent').reset_index().pipe((sns.catplot,'data'), x=x,y='percent',hue=y,kind='bar'))

Difficult to draw conclusions due to large number of "Unknown" and low number of stroke patients in dataset:
* former smokers more prevalent in stroke patients

### Age and stroke

In [None]:
plt.title("Age distribution of stroke and no stroke patients")
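#Pass the column by name; giving both a positional Series and data= raises an error in recent seaborn versions
sns.kdeplot(data=df, x='age', hue='stroke', fill=True)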

* dataset with good age distribution
* stroke is more frequent in older age

### Marital status of stroke patients

In [None]:
ax = sns.boxplot(x="ever_married", y="age", data=df)

* ever-married patients tend to be older than patients who never married

In [None]:
plt.title("Age distribution and marital status")
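#Pass the column by name; giving both a positional Series and data= raises an error in recent seaborn versions
sns.kdeplot(data=df, x='age', hue='ever_married', fill=True)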

* stroke is more prevalent in ever-married patients, but ever-married patients are also older, which by itself contributes to an increased risk of stroke

### BMI and stroke

In [None]:
ax = sns.boxplot(x="stroke", y="bmi", data=df)

* bmi of stroke patients is slightly higher compared to no stroke patients
* no stroke class with more outliers 

### Glucose level and stroke

In [None]:
ax = sns.boxplot(x="stroke", y="avg_glucose_level", data=df)

* average glucose level higher in stroke patients

### Pearson correlation matrix of the dichotomous variable stroke and the continuous variables age, bmi and average glucose level

In [None]:
#Correlation matrix of age, avg_glucose_level, bmi and stroke (libraries are already imported above)
df_corr = df.drop(columns=['id', 'hypertension', 'heart_disease']).select_dtypes(include=np.number)
df_corr = df_corr.corr()
plt.figure(figsize=(7,7))
sns.heatmap(df_corr,annot=True)

* age correlates with bmi and avg_glucose_level (hypertension is excluded from this matrix)
* stroke correlates weakly with age and average glucose level
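The Pearson coefficient between a dichotomous and a continuous variable is the point-biserial correlation coefficient. The sketch below checks this equivalence numerically on hypothetical data; `outcome` and `age` are made-up arrays, not columns of the stroke dataset.

```python
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(0)

# Hypothetical data: a binary outcome and an age-like variable
# that is on average higher when outcome == 1
outcome = rng.integers(0, 2, size=200)
age = 40 + 15 * outcome + rng.normal(0, 10, size=200)

# Point-biserial coefficient and plain Pearson coefficient
r_pb, p_value = pointbiserialr(outcome, age)
r_pearson = np.corrcoef(outcome, age)[0, 1]

print(round(r_pb, 4), round(r_pearson, 4))
```

Both values agree, so the stroke row of the heatmap can be read as point-biserial coefficients.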


## Data preprocessing

### Dropping feature 'id'

Feature will be dropped, because it has no predictive value.

In [None]:
df.drop('id', axis='columns', inplace=True)

### Addressing imbalance within dataset of classes stroke / no stroke

To address the imbalance within this dataset, oversampling will be used to increase the size of the minority class stroke.

In [None]:
df['stroke'].value_counts()

In [None]:
from sklearn.utils import resample

#Upsampling minority class: stroke = 1
df_majority = df[df['stroke']==0]
df_minority = df[df['stroke']==1]

#Draw with replacement until the minority class matches the size of the majority class
df_minority_oversampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=21)

df_oversampled = pd.concat([df_majority, df_minority_oversampled])

df_oversampled['stroke'].value_counts()
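One caveat of oversampling the full dataset is that copies of the same minority row can end up in both the training set and the test set after a later split, which tends to inflate test scores. A common alternative is to split first and oversample only the training part. The sketch below shows that order on hypothetical toy data; it is an illustration under these assumptions, not the procedure used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(21)

# Hypothetical toy data: 20 stroke cases, 380 no stroke cases
toy = pd.DataFrame({
    "age": rng.integers(20, 90, size=400),
    "stroke": [1] * 20 + [0] * 380,
})

# 1) Split first, keeping the class ratio in both parts
train, test = train_test_split(
    toy, test_size=0.3, stratify=toy["stroke"], random_state=0
)

# 2) Oversample the minority class in the training part only
train_major = train[train["stroke"] == 0]
train_minor = train[train["stroke"] == 1]
train_minor_up = resample(
    train_minor, replace=True, n_samples=len(train_major), random_state=21
)
train_balanced = pd.concat([train_major, train_minor_up])

print(train_balanced["stroke"].value_counts())
print(test["stroke"].value_counts())
```

The test set keeps the original class ratio and shares no rows with the balanced training set.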

### Deal with categorical data

Machine learning models require input and output variables to be numeric. The categorical features in this dataset will therefore be encoded to numbers before fitting and evaluating a model.

In [None]:
def create_dummies(df,column_name):
    dummies = pd.get_dummies(df[column_name],prefix=column_name)
    df = pd.concat([df,dummies],axis=1)
    return df

In [None]:
df_oversampled = create_dummies(df_oversampled,"gender")
df_oversampled = create_dummies(df_oversampled,"ever_married")
df_oversampled = create_dummies(df_oversampled,"work_type")
df_oversampled = create_dummies(df_oversampled,"Residence_type")
df_oversampled = create_dummies(df_oversampled,"smoking_status")
df_oversampled.head()


This is the list of columns, including the new columns created during encoding.

In [None]:
df_oversampled.columns
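As a possible shortcut, `pd.get_dummies` can encode several columns in one call through its `columns` argument. A small sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "age": [67, 35, 80],
    "gender": ["Male", "Female", "Female"],
    "ever_married": ["Yes", "No", "Yes"],
})

# Encode both categorical columns at once
encoded = pd.get_dummies(toy, columns=["gender", "ever_married"])
print(encoded.columns.tolist())
```

Unlike the `create_dummies` helper above, this call replaces the original categorical columns with their dummy columns instead of keeping both.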

### Define target variable

This is a binary classification problem and the classifier is supposed to output a stroke or no stroke event.

In [None]:
columns = ['age', 'hypertension', 'heart_disease', 'avg_glucose_level', 'bmi',
       'gender_Female', 'gender_Male', 'gender_Other',
       'ever_married_No', 'ever_married_Yes', 'work_type_Govt_job',
       'work_type_Never_worked', 'work_type_Private',
       'work_type_Self-employed', 'work_type_children', 'Residence_type_Rural',
       'Residence_type_Urban', 'smoking_status_Unknown',
       'smoking_status_formerly smoked', 'smoking_status_never smoked',
       'smoking_status_smokes']

X = df_oversampled[columns]
y = df_oversampled['stroke']


### Split dataset into training set and test set

The dataset is split into a training set (70%) to train the models and a test set (30%) to evaluate the performance of the models.

In [None]:
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3,random_state=0)

### Feature scaling to account for models sensitive to range of data

The dataset will be transformed using StandardScaler so that each feature has a mean of 0 and a standard deviation of 1. This prepares the data for models that are sensitive to the range of the features, such as logistic regression, support vector machine and k-nearest neighbor.

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_scaled = sc.fit_transform(train_X)
X_test_scaled = sc.transform(test_X)
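The scaler learns the mean and standard deviation from the training set only and reuses them for the test set, i.e. z = (x - mean_train) / std_train. A minimal sketch with hypothetical numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test values for two features
train = np.array([[50.0, 100.0], [60.0, 120.0], [70.0, 140.0]])
test = np.array([[60.0, 160.0]])

scaler = StandardScaler().fit(train)   # statistics come from the training data only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # reuses the training mean and std

# Manual version of the same transformation: z = (x - mean) / std
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled, manual)
```

Only the training data ends up with exactly mean 0 and standard deviation 1; the test data is merely expressed on the same scale.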


### Logistic regression (sensitive to range of data)

Fit the model with scaled data and default settings.

In [None]:
from sklearn.linear_model import LogisticRegression
lr_1 = LogisticRegression()
lr_1.fit(X_train_scaled, train_y)
prediction_lr_1 = lr_1.predict(X_test_scaled)


Get performance results of the logistic regression model.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
print(confusion_matrix(test_y, prediction_lr_1))
print(classification_report(test_y, prediction_lr_1))

In [None]:
roc_lr_1 = roc_auc_score(test_y, prediction_lr_1) 
roc_lr_1
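When `roc_auc_score` receives hard class labels, the ROC curve has a single threshold and the score reduces to balanced accuracy. Passing predicted probabilities evaluates all thresholds instead. The sketch below compares both on a synthetic dataset from `make_classification`; the data and variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X_toy, y_toy = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
labels = clf.predict(X_te)
scores = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

auc_from_labels = roc_auc_score(y_te, labels)   # one threshold only
auc_from_scores = roc_auc_score(y_te, scores)   # all thresholds
bal_acc = balanced_accuracy_score(y_te, labels)

print(auc_from_labels, bal_acc, auc_from_scores)
```

For models without `predict_proba`, such as `SVC` with default settings, `decision_function` can supply the scores.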

Trying to improve the performance of the logistic regression model using hyperparameter tuning. RandomizedSearchCV runs a randomized search over parameter combinations sampled from the parameter grid.

In [None]:
from sklearn.model_selection import RandomizedSearchCV

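#Note: 'l1' is only supported by the 'liblinear' and 'saga' solvers; other sampled combinations fail to fit and are scored as NaN
param_grid_lr = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'penalty': ['l1', 'l2'], 'max_iter': list(range(100,800,100)), 'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}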

lr_2 = LogisticRegression()
lr_2_model = RandomizedSearchCV(lr_2, param_grid_lr, cv = 5)
lr_2_model.fit(X_train_scaled, train_y)
prediction_lr_2 = lr_2_model.best_estimator_.predict(X_test_scaled)

print("Tuned Logistic Regression Parameters: {}".format(lr_2_model.best_params_)) 
print("Best score is {}".format(lr_2_model.best_score_))

print(confusion_matrix(test_y,prediction_lr_2))
print(classification_report(test_y,prediction_lr_2))



In [None]:
roc_lr_2 = roc_auc_score(test_y, prediction_lr_2) 
roc_lr_2

Hyperparameter tuning did not improve the performance of the logistic regression model.
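Two details can make a randomized search easier to interpret: fixing `random_state` so that the sampled candidates are reproducible, and pairing each penalty only with solvers that support it. `RandomizedSearchCV` accepts a list of dictionaries for this. The following sketch uses synthetic data and a smaller, hypothetical parameter space than the one above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# Each dict pairs a penalty only with solvers that support it
param_distributions = [
    {"penalty": ["l1"], "solver": ["liblinear", "saga"],
     "C": [0.01, 0.1, 1, 10]},
    {"penalty": ["l2"], "solver": ["lbfgs", "newton-cg", "liblinear", "sag", "saga"],
     "C": [0.01, 0.1, 1, 10]},
]

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=10,
    cv=5,
    random_state=0,        # makes the sampled candidates reproducible
    error_score="raise",   # surface invalid combinations instead of hiding them
)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

With `error_score="raise"`, an unsupported combination stops the search with an error instead of silently lowering the number of evaluated candidates.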

### Decision tree classifier (not sensitive to range of data)

Fit the model with default settings.

In [None]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn import metrics 

dtc_1 = DecisionTreeClassifier()
dtc_1 = dtc_1.fit(train_X, train_y)
prediction_dtc_1 = dtc_1.predict(test_X)

Get performance results of the decision tree model.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(test_y, prediction_dtc_1))
print(classification_report(test_y, prediction_dtc_1))

In [None]:
roc_dtc_1 = roc_auc_score(test_y, prediction_dtc_1) 
roc_dtc_1

Trying to improve the performance of the decision tree model using hyperparameter tuning with RandomizedSearchCV.

In [None]:
from scipy.stats import randint
param_grid_dtc = {"max_depth": [3,None], "max_features":randint(1,5), "min_samples_leaf":randint(1,9), "criterion": ["gini", "entropy"]}
dtc_2 = DecisionTreeClassifier()
dtc_2_model = RandomizedSearchCV(dtc_2, param_grid_dtc, cv = 5)
dtc_2_model.fit(train_X, train_y)
prediction_dtc_2 = dtc_2_model.best_estimator_.predict(test_X)

print("Tuned Decision Tree Parameters: {}".format(dtc_2_model.best_params_)) 
print("Best score is {}".format(dtc_2_model.best_score_))

print(confusion_matrix(test_y, prediction_dtc_2))
print(classification_report(test_y,prediction_dtc_2))


In [None]:
roc_dtc_2 = roc_auc_score(test_y, prediction_dtc_2) 
roc_dtc_2

Hyperparameter tuning slightly improved the performance of the decision tree classifier model.

### Random Forest Classifier (not sensitive to range of data)

Fit the model with default settings and get performance results.

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc_1=RandomForestClassifier()
rfc_1.fit(train_X, train_y)
prediction_rfc_1=rfc_1.predict(test_X)

print(confusion_matrix(test_y, prediction_rfc_1))
print(classification_report(test_y, prediction_rfc_1))


In [None]:
roc_rfc_1 = roc_auc_score(test_y, prediction_rfc_1) 
roc_rfc_1

Trying to improve the performance of the random forest model using hyperparameter tuning with RandomizedSearchCV.



In [None]:
#'auto' was removed from max_features in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
param_grid_rfc = {'bootstrap': [True, False], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'max_features': ['sqrt'], 'min_samples_leaf': [1, 2, 4], 'min_samples_split': [2, 5, 10], 'n_estimators': [130, 180, 230]}
rfc_2 = RandomForestClassifier()
rfc_2_model = RandomizedSearchCV(rfc_2, param_grid_rfc, cv = 5)
rfc_2_model.fit(train_X, train_y)
prediction_rfc_2 = rfc_2_model.best_estimator_.predict(test_X)

print("Tuned RFC Parameters: {}".format(rfc_2_model.best_params_)) 
print("Best score is {}".format(rfc_2_model.best_score_))

print(confusion_matrix(test_y, prediction_rfc_2))
print(classification_report(test_y,prediction_rfc_2))

In [None]:
roc_rfc_2 = roc_auc_score(test_y, prediction_rfc_2) 
roc_rfc_2

Hyperparameter tuning improved the performance of the random forest classifier.

### Support vector machine (sensitive to range of data)

Fit the model with scaled data and default settings and get performance results.

In [None]:
from sklearn import svm
svc_1 = svm.SVC()
svc_1.fit(X_train_scaled, train_y)
prediction_svc_1 = svc_1.predict(X_test_scaled)

print(confusion_matrix(test_y, prediction_svc_1))
print(classification_report(test_y,prediction_svc_1))

In [None]:
roc_svc_1 = roc_auc_score(test_y, prediction_svc_1) 
roc_svc_1

Trying to improve the model using hyperparameter tuning.

In [None]:
param_grid_svc = {'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001]} 

svc_2 = svm.SVC()
svc_2_model = RandomizedSearchCV(svc_2, param_grid_svc, cv = 5)
svc_2_model.fit(X_train_scaled, train_y)
prediction_svc_2 = svc_2_model.best_estimator_.predict(X_test_scaled)

print("Tuned SVC Parameters: {}".format(svc_2_model.best_params_))
print("Best score is {}".format(svc_2_model.best_score_))

print(confusion_matrix(test_y, prediction_svc_2))
print(classification_report(test_y,prediction_svc_2))

In [None]:
roc_svc_2 = roc_auc_score(test_y, prediction_svc_2) 
roc_svc_2

Hyperparameter tuning improved the performance of the support vector machine model.

### K-nearest neighbor (sensitive to range)

Fit the model with scaled data and default settings and get performance results.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn_1 = KNeighborsClassifier()
knn_1.fit(X_train_scaled, train_y)
prediction_knn_1 = knn_1.predict(X_test_scaled)

print(confusion_matrix(test_y, prediction_knn_1))
print(classification_report(test_y,prediction_knn_1))

In [None]:
roc_knn_1 = roc_auc_score(test_y, prediction_knn_1) 
roc_knn_1

Trying to improve performance of the knn model using hyperparameter tuning.

In [None]:
param_grid_knn = {'n_neighbors': [3, 5, 11, 13, 15, 17, 19], 'weights': ['uniform', 'distance'], 'metric': ['euclidean', 'manhattan']}
knn_2 = KNeighborsClassifier()
knn_2_model = RandomizedSearchCV(knn_2, param_grid_knn, cv = 5)
knn_2_model.fit(X_train_scaled, train_y)
prediction_knn_2 = knn_2_model.best_estimator_.predict(X_test_scaled)

print("Tuned KNN Parameters: {}".format(knn_2_model.best_params_)) 
print("Best score is {}".format(knn_2_model.best_score_))

print(confusion_matrix(test_y, prediction_knn_2))
print(classification_report(test_y,prediction_knn_2))

In [None]:
roc_knn_2 = roc_auc_score(test_y, prediction_knn_2) 
roc_knn_2

Hyperparameter tuning improved the k-nearest neighbor model.

### Gradient Boosting Classifier (not sensitive to range of data)

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
gbc_1 = GradientBoostingClassifier()
gbc_1.fit(train_X, train_y)
prediction_gbc_1 = gbc_1.predict(test_X)

print(confusion_matrix(test_y, prediction_gbc_1))
print(classification_report(test_y,prediction_gbc_1))

In [None]:
roc_gbc_1 = roc_auc_score(test_y, prediction_gbc_1) 
roc_gbc_1

In [None]:
param_grid_gbc = {'n_estimators':[10, 100, 1000], 'learning_rate': [0.001, 0.01, 0.1], 'subsample': [0.5, 0.7, 1.0], 'max_depth': [3, 7, 9]}
gbc_2 = GradientBoostingClassifier()
gbc_2_model = RandomizedSearchCV(gbc_2, param_grid_gbc, cv = 5)
gbc_2_model.fit(train_X, train_y)
prediction_gbc_2 = gbc_2_model.best_estimator_.predict(test_X)

print("Tuned GBC Parameters: {}".format(gbc_2_model.best_params_)) 
print("Best score is {}".format(gbc_2_model.best_score_))

print(confusion_matrix(test_y, prediction_gbc_2))
print(classification_report(test_y,prediction_gbc_2))

In [None]:
roc_gbc_2 = roc_auc_score(test_y, prediction_gbc_2) 
roc_gbc_2

Hyperparameter tuning improved the gradient boosting model.

### Best performing model

Both the random forest classifier and the gradient boosting classifier performed very well. Hyperparameter tuning further improved the performance of both models, with the random forest classifier providing the best performance in predicting stroke.
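To compare several models side by side, the repeated fit and report steps can be wrapped in a small helper that returns the metrics as a table. The sketch below is one possible way to do this; `evaluate` is a hypothetical helper, and the data is synthetic rather than the stroke dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit a model and return test-set metrics (hypothetical helper)."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    return {
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
    }


# Synthetic stand-in data
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# One row per model, one column per metric
results = pd.DataFrame(
    {name: evaluate(m, X_tr, y_tr, X_te, y_te) for name, m in models.items()}
).T
print(results.sort_values("roc_auc", ascending=False))
```

Setting `random_state` on the models keeps the ranking stable between runs.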

### Evaluation and importance of features for best performing model (Random Forest Classifier)

In [None]:
from sklearn.ensemble import RandomForestClassifier
#Random forests are not sensitive to feature range, so the model is fit on the unscaled data as above
rfc_best = RandomForestClassifier(n_estimators = 130, min_samples_split = 2, min_samples_leaf = 1, max_features = 'sqrt', max_depth = 20, bootstrap = False)
rfc_best.fit(train_X, train_y)
prediction_rfc_best = rfc_best.predict(test_X)

print(confusion_matrix(test_y, prediction_rfc_best))
print(classification_report(test_y, prediction_rfc_best))

In [None]:
roc_rfc_best = roc_auc_score(test_y, prediction_rfc_best) 
roc_rfc_best

Evaluating feature importance in the random forest model on original data. 

In [None]:
feature_importances = pd.DataFrame(rfc_best.feature_importances_, index = X.columns, columns=['importance']).sort_values('importance', ascending=False)
feature_importances

In [None]:
feature_importances.plot(kind='bar')

Age, average glucose level and bmi are the most important features for the random forest classifier model.
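The `feature_importances_` attribute is impurity-based: it is computed on the training data and can favour features with many distinct values, such as continuous variables, over one-hot encoded ones. Permutation importance on held-out data is a commonly used cross-check. The sketch below applies `sklearn.inspection.permutation_importance` to synthetic data in which only the first two features are informative; it is a swapped-in technique for illustration, not part of the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the first two columns are the informative ones
X_toy, y_toy = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Drop in test score when each feature is shuffled, averaged over repeats
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)

for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```

If both methods rank the same features at the top, the conclusion about the most important features is more robust.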

## Conclusions

The random forest classifier resulted in the most accurate prediction of stroke with highest recall and precision:
* using hyperparameters 'n_estimators': 230, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 60, 'bootstrap': False 
* and age, average glucose level and bmi being the most important features

Interestingly, the point-biserial correlation coefficient of bmi and stroke was very low.

Hyperparameter tuning using RandomizedSearchCV resulted in:
* better performance for Random Forest Classifier, Support Vector Machine, K-nearest neighbor and Gradient Boosting Classifier
* the same performance for Logistic Regression
* worse performance for Decision Tree Classifier



## Resources
1. NIH: National Heart, Lung, and Blood Institute: https://www.nhlbi.nih.gov/health-topics/stroke
2. World Stroke Organization (WSO): Global Stroke Fact Sheet 2019: https://www.world-stroke.org/assets/downloads/WSO_Fact-sheet_15.01.2020.pdf
3. Bulletin of the World Health Organization 2016;94:634-634A. doi: http://dx.doi.org/10.2471/BLT.16.181636
