# Stroke prediction

Description: According to the World Health Organization (WHO), stroke is
the 2nd leading cause of death globally, responsible for approximately 11% of
total deaths. This dataset is used to predict whether a patient is likely to have a
stroke based on input parameters such as gender, age, various diseases, and
smoking status. Each row in the data provides relevant information about one
patient.

Dataset: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset?select=healthcaredataset-stroke-data.csv

In [2]:
import pandas as pd

In [3]:
import warnings

In [4]:
warnings.filterwarnings('ignore')

In [5]:
raw_df = pd.read_csv('healthcare-dataset-stroke-data.csv')

In [6]:
raw_df

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0


In [7]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


The bmi column has missing values (4909 non-null out of 5110 rows). Let's fill them with the mean of the bmi column.

In [8]:
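# Fill only the bmi column; a bare raw_df.fillna(...) would put the bmi mean into NaNs of any column.
raw_df['bmi'] = raw_df['bmi'].fillna(raw_df['bmi'].mean())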

In [9]:
raw_df.isna().sum()

id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

Now there are no missing values in our dataset.
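
The mean above is computed over the full dataset, before any train/test split. As a minimal sketch of an alternative, assuming a toy frame with hypothetical bmi values, the mean can be learned from the training rows only and then reused for the test rows with scikit-learn's SimpleImputer:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values) standing in for the bmi column.
train = pd.DataFrame({"bmi": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"bmi": [np.nan, 25.0]})

# Learn the mean from the training rows only, then reuse it on the test rows.
imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train[["bmi"]])
test_filled = imputer.transform(test[["bmi"]])

print(imputer.statistics_)   # mean of the non-missing training values
print(test_filled.ravel())   # the missing test value is replaced by that mean
```

Whether this matters in practice depends on how much the test rows would shift the mean; with one column and about 4% missing values the difference is likely small.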

In [22]:
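# Restrict to numeric columns; pandas >= 2.0 raises an error on text columns otherwise.
raw_df.select_dtypes('number').corr()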

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
id,1.0,0.003538,0.00355,-0.001296,0.001092,0.002999,0.006388
age,0.003538,1.0,0.276398,0.263796,0.238171,0.325942,0.245257
hypertension,0.00355,0.276398,1.0,0.108306,0.174474,0.160189,0.127904
heart_disease,-0.001296,0.263796,0.108306,1.0,0.161857,0.038899,0.134914
avg_glucose_level,0.001092,0.238171,0.174474,0.161857,1.0,0.168751,0.131945
bmi,0.002999,0.325942,0.160189,0.038899,0.168751,1.0,0.038947
stroke,0.006388,0.245257,0.127904,0.134914,0.131945,0.038947,1.0


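From the correlation matrix above, age (0.25), heart_disease (0.13), avg_glucose_level (0.13), and hypertension (0.13) show the strongest linear correlation with stroke, although all of these correlations are weak. bmi (0.04) and id (0.01) are close to zero.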
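
Reading one column out of a full matrix is easy to get wrong by eye. The sketch below, on a toy frame with hypothetical values, shows one way to pull out just the correlations with the target and sort them:

```python
import pandas as pd

# Toy data (hypothetical values) with a binary target column.
df = pd.DataFrame({
    "age": [20, 35, 50, 65, 80, 70],
    "bmi": [22.0, 31.0, 27.0, 24.0, 29.0, 26.0],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "stroke": [0, 0, 0, 1, 1, 1],
})

# Keep numeric columns only, so the text column does not break corr().
corr = df.select_dtypes("number").corr()

# Correlation of each feature with the target, strongest first.
ranked = corr["stroke"].drop("stroke").sort_values(ascending=False)
print(ranked)
```

Note that this is Pearson correlation, which only measures linear association, so a low value does not rule out a useful feature.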

In [75]:
raw_df.columns

Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke'],
      dtype='object')

In [76]:
inputs = raw_df.columns[1:-1]

In [77]:
inputs

Index(['gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status'],
      dtype='object')

Above are the columns that will be given to our model as inputs.

In [78]:
target = raw_df.columns[-1]

In [79]:
target

'stroke'

target is the column we are going to predict.

In [80]:
X = raw_df[inputs]

X contains only the input columns that will be given to the model.

In [81]:
y = raw_df[target]

y contains the target values that the model will learn to predict.

Let's split the dataset into train and test sets using StratifiedShuffleSplit.

In [82]:
from sklearn.model_selection import StratifiedShuffleSplit

In [83]:
split = StratifiedShuffleSplit(n_splits=3,test_size=0.2,random_state=42)

In [84]:
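# Each iteration overwrites the previous one, so only the last of the 3 splits is kept.
for train_indx,test_indx in split.split(X,y):
    # .copy() gives independent frames, so new columns can be added to them later.
    train_X,train_y = X.loc[train_indx].copy(),y.loc[train_indx]
    test_X,test_y = X.loc[test_indx].copy(),y.loc[test_indx]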

In [85]:
print(len(train_X))
print(len(test_X))

4088
1022


Our training set has 4088 instances, whereas the test set has 1022 instances.
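
The point of a stratified split is that both parts keep roughly the same share of stroke cases. The sketch below checks this on toy labels (hypothetical, 5% positives). It uses train_test_split with stratify=y, which is a different helper from the StratifiedShuffleSplit used above but serves the same purpose when only one split is needed:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (hypothetical): 50 positives out of 1000 rows.
y = np.array([1] * 50 + [0] * 950)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y asks for the same class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Share of positives in each part; both should be close to 5%.
print(y_train.mean(), y_test.mean())
```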

Let's separate the numeric and categorical columns from the input columns.

In [86]:
numeric = train_X.select_dtypes('number').columns.tolist()

In [87]:
numeric

['age', 'hypertension', 'heart_disease', 'avg_glucose_level', 'bmi']

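Above are the numeric columns. Note that hypertension and heart_disease are 0/1 flags stored as integers, so they are treated as numeric here.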

In [88]:
categorical = train_X.select_dtypes('object').columns.tolist()

In [89]:
categorical

['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']

Above are the categorical columns.
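
As a small illustration of how select_dtypes separates columns, here is a sketch on a toy frame with hypothetical rows. It selects the non-numeric columns with exclude="number" instead of naming the object dtype, which is an assumption on my side to keep it independent of how text columns are stored:

```python
import pandas as pd

# Toy frame (hypothetical rows) with the same kinds of columns as the inputs.
df = pd.DataFrame({
    "age": [67.0, 61.0],
    "hypertension": [0, 1],
    "gender": ["Male", "Female"],
})

# Numeric columns: floats and integers, including 0/1 flags.
numeric_cols = df.select_dtypes("number").columns.tolist()

# Everything else is treated as categorical.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```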

Let's encode the categorical columns from our dataset using OneHotEncoder.

In [90]:
from sklearn.preprocessing import OneHotEncoder

In [91]:
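# Note: scikit-learn 1.2 renamed sparse to sparse_output; use sparse_output=False on newer versions.
encoder = OneHotEncoder(handle_unknown='ignore',sparse=False)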

In [92]:
encoder.fit(train_X[categorical])

OneHotEncoder(handle_unknown='ignore', sparse=False)

In [93]:
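# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out (available since 1.0) replaces it.
encoded = encoder.get_feature_names_out(categorical).tolist()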

In [94]:
encoded

['gender_Female',
 'gender_Male',
 'gender_Other',
 'ever_married_No',
 'ever_married_Yes',
 'work_type_Govt_job',
 'work_type_Never_worked',
 'work_type_Private',
 'work_type_Self-employed',
 'work_type_children',
 'Residence_type_Rural',
 'Residence_type_Urban',
 'smoking_status_Unknown',
 'smoking_status_formerly smoked',
 'smoking_status_never smoked',
 'smoking_status_smokes']

Above are the one-hot encoded column names generated from the categorical columns.

In [95]:
train_X[encoded] = encoder.transform(train_X[categorical])

In [96]:
test_X[encoded] = encoder.transform(test_X[categorical])
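
The encoder is fitted on the training set only, so the test set could contain a category it has never seen. With handle_unknown='ignore' such a value is encoded as all zeros instead of raising an error. A minimal sketch on toy data follows; the "Freelance" category is hypothetical and not part of the dataset:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"work_type": ["Private", "Govt_job", "Private"]})
# "Freelance" is a made-up category that does not appear in the training rows.
test = pd.DataFrame({"work_type": ["Govt_job", "Freelance"]})

# The default output is a sparse matrix; .toarray() makes it dense for printing.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["work_type"]])

encoded_test = encoder.transform(test[["work_type"]]).toarray()

print(encoder.categories_[0])   # categories learned from the training rows
print(encoded_test)             # the unseen category becomes a row of zeros
```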

In [97]:
train_X

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,...,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Rural,Residence_type_Urban,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
3517,Male,49.0,0,0,Yes,Private,Urban,193.87,41.0,Unknown,...,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
2951,Male,75.0,0,0,Yes,Private,Rural,70.73,26.7,smokes,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
377,Male,25.0,0,0,No,Private,Urban,138.29,27.3,Unknown,...,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
2737,Male,55.0,0,0,Yes,Self-employed,Rural,163.82,27.5,never smoked,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
24,Male,71.0,0,0,Yes,Private,Urban,102.87,27.2,formerly smoked,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4429,Female,40.0,0,0,Yes,Private,Urban,86.78,35.5,smokes,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4126,Female,42.0,0,0,Yes,Private,Urban,74.80,50.6,Unknown,...,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
1071,Female,48.0,0,0,Yes,Private,Rural,195.16,42.2,Unknown,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
4748,Female,28.0,0,0,Yes,Govt_job,Rural,86.91,21.1,formerly smoked,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


In [98]:
test_X

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,...,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Rural,Residence_type_Urban,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
4832,Female,20.0,0,0,No,Private,Urban,61.88,20.1,never smoked,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
90,Female,79.0,0,1,Yes,Private,Urban,226.98,29.8,never smoked,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4844,Female,4.0,0,0,No,children,Urban,72.49,16.9,Unknown,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0
39,Female,49.0,0,0,Yes,Private,Urban,60.91,29.9,never smoked,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
1715,Female,35.0,0,0,Yes,Private,Urban,86.87,43.2,Unknown,...,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4194,Female,34.0,0,0,Yes,Private,Urban,76.42,27.6,smokes,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1444,Female,24.0,0,0,No,Private,Rural,120.77,16.9,never smoked,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3515,Female,55.0,0,0,Yes,Private,Urban,102.10,22.5,formerly smoked,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
2107,Female,52.0,1,0,No,Private,Rural,170.22,27.2,formerly smoked,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


Let's scale the numeric values of our dataset using StandardScaler.

In [99]:
from sklearn.preprocessing import StandardScaler

In [100]:
scaler = StandardScaler()

In [101]:
train_X[numeric] = scaler.fit_transform(train_X[numeric])

In [102]:
test_X[numeric] = scaler.transform(test_X[numeric])
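
StandardScaler replaces each value x with (x - mean) / std, where mean and std come from the data it was fitted on. Because the scaler above is fitted on the training set, the test set is scaled with the training statistics. A small sketch with hypothetical ages:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy ages (hypothetical values).
train_age = np.array([[20.0], [40.0], [60.0], [80.0]])
test_age = np.array([[50.0], [80.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_age)   # learns mean and std, then scales
test_scaled = scaler.transform(test_age)         # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())

# Same result computed by hand from the formula.
manual = (test_age - scaler.mean_) / scaler.scale_
```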

In [108]:
from sklearn.metrics import classification_report

I am going to use DecisionTreeClassifier as my first model:

In [103]:
from sklearn.tree import DecisionTreeClassifier

In [104]:
dt_clf = DecisionTreeClassifier()

In [105]:
dt_clf.fit(train_X[numeric+encoded],train_y)

DecisionTreeClassifier()

In [107]:
pred_y = dt_clf.predict(test_X[numeric+encoded])

In [109]:
dt_clf.score(test_X[numeric+encoded],test_y)

0.9099804305283757

Using DecisionTreeClassifier, I get about 91% accuracy on the test set.

In [110]:
print(classification_report(test_y,pred_y))

              precision    recall  f1-score   support

           0       0.96      0.95      0.95       972
           1       0.12      0.14      0.13        50

    accuracy                           0.91      1022
   macro avg       0.54      0.54      0.54      1022
weighted avg       0.91      0.91      0.91      1022
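
To make the columns of the report concrete, the sketch below computes precision, recall, and accuracy for the positive class by hand from a confusion matrix. The labels are toy values (hypothetical), not the model's predictions:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (hypothetical): 8 negatives and 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are real
recall = tp / (tp + fn)      # of the real positives, how many were found
accuracy = (tp + tn) / len(y_true)

print(precision, recall, accuracy)
```

In this toy case accuracy is 0.8 even though only half of the positives are found, which is the same pattern the report above shows for class 1.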



I am using XGBClassifier as my second model.

In [111]:
from xgboost import XGBClassifier

In [112]:
xgb_clf = XGBClassifier()

In [113]:
xgb_clf.fit(train_X[numeric+encoded],train_y)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [114]:
pred_y = xgb_clf.predict(test_X[numeric+encoded])

In [115]:
xgb_clf.score(test_X[numeric+encoded],test_y)

0.9432485322896281

In [116]:
print(classification_report(test_y,pred_y))

              precision    recall  f1-score   support

           0       0.95      0.99      0.97       972
           1       0.21      0.06      0.09        50

    accuracy                           0.94      1022
   macro avg       0.58      0.52      0.53      1022
weighted avg       0.92      0.94      0.93      1022



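Using XGBClassifier, I get an accuracy of about 94%, which is higher than the previous model. However, recall for the stroke class (1) falls from 0.14 to 0.06, so the gain in accuracy comes from the majority class (0), not from detecting more strokes.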
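
To put the two accuracy figures in context, the sketch below compares them with a rule that always predicts "no stroke". It is plain arithmetic on numbers already shown: the class counts come from the support column of the reports, and the accuracies are the two score() outputs.

```python
# Class counts in the test set, taken from the "support" column of the reports.
n_no_stroke = 972
n_stroke = 50

# Accuracy of a rule that predicts "no stroke" for every patient.
baseline_accuracy = n_no_stroke / (n_no_stroke + n_stroke)

# Accuracies reported for the two models.
dt_accuracy = 0.9099804305283757
xgb_accuracy = 0.9432485322896281

print(f"always-0 baseline: {baseline_accuracy:.3f}")
print(f"decision tree:     {dt_accuracy:.3f}")
print(f"XGBoost:           {xgb_accuracy:.3f}")
```

Since the baseline is above both models, accuracy alone is not a good yardstick for this dataset; recall and precision for class 1 say more about how well strokes are detected.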