<h1 style="background-color:#A8A8A8; font-family:'Brush Script MT',cursive;color:white;font-size:200%; text-align:center;border-radius: 50% 20% / 10% 40%">Stroke Prediction</h1>




![picture](https://imgk.timesnownews.com/story/silent-stroke.gif?tr=w-400,h-300,fo-auto)
# Introduction
According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths. This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

Notebook summary
* [Data understanding](#1)
* [Exploratory Data Analysis](#2)
* [Re-sampling](#7)
* [Data Preprocessing](#3)
* [Modeling](#4)
* [Model Evaluation (k-fold cross-validation & ROC AUC curve)](#5)
* [Feature importance](#6)




In [None]:
#Importing Libraries

# linear algebra
import numpy as np 

# data processing
import pandas as pd

# data visualization(for EDA)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('ggplot')
sns.set(color_codes=True)
import plotly.express as px
import shap
import plotly.offline as pyo
import plotly.graph_objs as go
pyo.init_notebook_mode()
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

#ignore warnings
import warnings
warnings.filterwarnings('ignore')



# Importing sklearn methods

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import model_selection

# import labelencoder
from sklearn.preprocessing import LabelEncoder

#Feature Scaling
from sklearn.preprocessing import StandardScaler

<a id="1"></a> <br>
<h1 style="background-color:#A8A8A8; font-family:'Brush Script MT',cursive;color:white;font-size:200%; text-align:center;border-radius: 50% 20% / 10% 40%">Data understanding</h1>

In [None]:
df=pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
df.head()

In [None]:
print('Number of rows and number of columns in our dataset:',df.shape)

In [None]:
print(f'Categorical features in our data {df.columns[df.dtypes==object].tolist()}')
print(f'Numerical features in our data {df.columns[df.dtypes!=object].tolist()}')

In [None]:
# Count the null values in each column and draw them as a lollipop plot
missing = df.isnull().sum()
fig, ax = plt.subplots(figsize=(20,5))
x=missing.index
y=missing
ax.vlines(x, ymin=0, ymax=y, color='firebrick', alpha=0.7, linewidth=2)
ax.scatter(x, y, s=75, color='firebrick', alpha=0.7)


# Title, Label, Ticks and Ylim
ax.set_title('Missing values in the data', fontdict={'size':22})
ax.set_ylabel('Null Values', fontdict={'size':18})
ax.set_xlabel('Column Name', fontdict={'size':18})
# Categorical x values are placed at positions 0..n-1, so tick at those positions
ax.set_xticks(range(len(x)))
ax.set_xticklabels(df.columns.str.upper(),rotation=0, fontdict={'horizontalalignment': 'center', 'size':10})
ax.set_ylim(-10)

# Annotate

for index, value in enumerate(y):
    plt.text(index, value, str(value),horizontalalignment= 'center', verticalalignment='bottom', fontsize=17)
fig.tight_layout()
plt.show()

In [None]:
# Let's drop the null values
df = df.dropna()

<a id="2"></a> <br>
<h1 style="background-color:#A8A8A8; font-family:'Brush Script MT',cursive;color:white;font-size:200%; text-align:center;border-radius: 50% 20% / 10% 40%">Exploratory Data Analysis</h1>

The pie chart below shows that the target is heavily imbalanced: stroke-positive cases are a small minority. We address this imbalance after the EDA.

In [None]:
# Fix the order of the counts so that they match the labels below
value=df.stroke.value_counts().reindex([0, 1])
x=['Stroke -VE','Stroke +VE']
fig = go.Figure(data=[go.Pie(labels=x,values=value , name="Stroke",hole=0.4,pull=[0.1],
                     textinfo="label+percent")])
fig.update_layout(autosize=False,
                  title=dict(text='Target Feature Pie Chart', x=0.5))
fig.show()





The graph suggests that stroke is not strongly associated with smoking: the proportion of stroke-positive individuals is about the same across the smoking status categories.

In [None]:
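value=df.smoking_status.value_counts()
# Take the labels from the counts themselves so labels and values stay aligned
x=value.index
y1=df.query('stroke==0')['smoking_status'].value_counts().reindex(x, fill_value=0)
y2=df.query('stroke==1')['smoking_status'].value_counts().reindex(x, fill_value=0)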

fig = make_subplots(rows=1, cols=2, subplot_titles=("Smoking Status",'smoking status vs stroke'),specs=[[{'type':'domain'}, {"type": "bar"}]])
fig.add_trace(go.Pie(labels=x,values=value , name="Smoking Status",hole=0.4,pull=[0.02,0.02,0.02,0.02],
                     textinfo="label+percent"),
              1, 1)
fig.add_trace(go.Bar(name='Stroke -VE', x=x, y=y1),1, 2)
fig.add_trace(go.Bar(name='Stroke +VE', x=x, y=y2),1, 2)

fig.show()
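
The claim that the stroke proportion is similar across smoking categories can also be checked numerically. The sketch below is a minimal illustration on a small hypothetical frame (`toy`, with made-up rows and category values); only the column names `smoking_status` and `stroke` are taken from the notebook. On the real data, `df` would take the place of `toy`.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's `df`;
# the rows and category values are made up for illustration.
toy = pd.DataFrame({
    "smoking_status": ["smokes", "smokes", "never smoked", "never smoked",
                       "formerly smoked", "formerly smoked", "smokes", "never smoked"],
    "stroke": [1, 0, 0, 0, 1, 0, 0, 1],
})

# The mean of a 0/1 target within a group is that group's stroke rate.
stroke_rate = toy.groupby("smoking_status")["stroke"].mean()

# Row-normalised cross-tabulation: each row sums to 1.
shares = pd.crosstab(toy["smoking_status"], toy["stroke"], normalize="index")

print(stroke_rate)
print(shares)
```

Comparing rates rather than raw counts matters here because the categories have different sizes, so bar heights alone can be misleading.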

The residence type appears to be split roughly evenly between its categories and shows no apparent relationship with stroke-positive individuals.

In [None]:
value=df.Residence_type.value_counts()
# Take the labels from the counts themselves so labels and values stay aligned
x=value.index
y1=df.query('stroke==0')['Residence_type'].value_counts().reindex(x, fill_value=0)
y2=df.query('stroke==1')['Residence_type'].value_counts().reindex(x, fill_value=0)


fig = make_subplots(rows=1, cols=2,specs=[[{'type':'domain'}, {"type": "bar"}]],subplot_titles=("Residence type",'Residence type vs stroke'))
fig.add_trace(go.Pie(labels=x, values=value, name="Residence_type",hole=0.4,pull=[0.02,0.02],
                     textinfo="label+percent"),
              1, 1)
fig.add_trace(go.Bar(name='Stroke -VE', x=x, y=y1),1, 2)
fig.add_trace(go.Bar(name='Stroke +VE', x=x, y=y2),1, 2)



fig.show()

At first glance, the histogram indicates that older people are more likely to have a stroke.

In [None]:

fig = px.histogram(df, x="age", color="stroke",marginal="box",
                   hover_data=df.columns,title='Distribution of Age')
fig.update_layout(autosize=False,width=500,height=350,title=dict(x=0.5))
fig.show()
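
One way to put a number on the age pattern is to bin ages and compute the stroke rate per bin. This is a rough sketch on a hypothetical frame (`toy`); the cut points in `bins` are an assumption chosen for illustration, not values from the notebook.

```python
import pandas as pd

# Hypothetical toy data; only the column names match the notebook's `df`.
toy = pd.DataFrame({
    "age": [5, 22, 37, 48, 55, 63, 71, 79, 82, 90],
    "stroke": [0, 0, 0, 0, 0, 1, 0, 1, 1, 1],
})

# Assumed cut points; intervals are right-closed, e.g. (0, 40], (40, 60], ...
bins = [0, 40, 60, 80, 120]
toy["age_band"] = pd.cut(toy["age"], bins=bins)

# Share of stroke-positive rows within each age band.
rate_by_band = toy.groupby("age_band", observed=True)["stroke"].mean()
print(rate_by_band)
```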

From the figure, strokes occur across the whole range of glucose levels, so glucose level on its own does not clearly separate stroke-positive from stroke-negative individuals.

In [None]:
x1=df.query('stroke==0')['avg_glucose_level']
x2=df.query('stroke==1')['avg_glucose_level']

group_labels = ['Stroke -VE', 'Stroke +VE']

colors = ['slategray', 'magenta']

# Create distplot with curve_type set to 'normal'
fig = ff.create_distplot([x1,x2], group_labels, bin_size=5,curve_type='normal', colors=colors)

# Add title
fig.update_layout(title_text='Distribution of Glucose Level',autosize=False,
    width=500,
    height=400,title=dict(x=0.5))

fig.show()

Based on this plot, people with a low BMI appear to have a higher risk of stroke.

In [None]:
import plotly.figure_factory as ff
import numpy as np

x1=df.query('stroke==0')['bmi']
x2=df.query('stroke==1')['bmi']

group_labels = ['Stroke -VE', 'Stroke +VE']

colors = ['slategray', 'magenta']

# Create distplot with curve_type set to 'normal'
fig = ff.create_distplot([x1,x2], group_labels, bin_size=5,
                         curve_type='normal',
                         colors=colors)

# Add title
fig.update_layout(title_text='Distribution of BMI(Body mass index)',autosize=False,
    width=500,
    height=400,title=dict(x=0.5))
fig.show()


Findings:
* There are more women than men in our dataset.
* Married people outnumber unmarried people in our data.
* In both sexes, married people are more likely to have a stroke.

In [None]:
Data=df[['gender','ever_married','stroke']]
fig = px.parallel_categories(Data, dimensions=['gender','ever_married','stroke'])
fig.show()

<a id="7"></a> <br>
<h1 style="background-color:#A8A8A8; font-family:'Brush Script MT',cursive;color:white;font-size:200%; text-align:center;border-radius: 50% 20% / 10% 40%">Random Oversampling Imbalanced Datasets</h1>

Before applying any classification algorithm, we should balance our dataset. Imbalanced classes pose a challenge for predictive modeling because most machine learning algorithms used for classification were designed around the assumption of an equal number of examples for each class. Ignoring the imbalance results in models with poor predictive performance, specifically for the minority class. Applying a re-sampling strategy to obtain a more balanced data distribution is an effective solution to the imbalance problem, so we now randomly oversample the minority class.

In [None]:

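# Class count, looked up by label so the order of value_counts does not matter
class_counts = df.stroke.value_counts()
count_class_0, count_class_1 = class_counts[0], class_counts[1]

# Divide by class
df_class_0 = df[df['stroke'] == 0]
df_class_1 = df[df['stroke'] == 1]

# Randomly oversample the minority class (with replacement) up to the majority size
df_class_1_over = df_class_1.sample(count_class_0, replace=True, random_state=0)
df = pd.concat([df_class_0, df_class_1_over], axis=0)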
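
The cell above resamples the whole dataset before any train/test split, so copies of the same minority row can end up on both sides of a later split. A common alternative ordering, shown here only as a sketch on a hypothetical frame (`toy`) and not as the notebook's method, is to split first and oversample the training part only, leaving the test part untouched.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 90 negatives and 10 positives.
toy = pd.DataFrame({
    "feature": range(100),
    "stroke": [0] * 90 + [1] * 10,
})

# Split first, keeping the class ratio the same in both parts.
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["stroke"], random_state=0
)

# Oversample the minority class inside the training part only.
majority = train[train["stroke"] == 0]
minority = train[train["stroke"] == 1]
minority_over = minority.sample(len(majority), replace=True, random_state=0)
train_balanced = pd.concat([majority, minority_over], axis=0)

print(train_balanced["stroke"].value_counts())
print(test["stroke"].value_counts())
```

With this ordering the test scores are measured on the original class distribution, which is usually closer to what the model would face in practice.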



In [None]:
#Draw pie chart
value=df.stroke.value_counts()
x=['Stroke -VE','Stroke +VE']
fig = go.Figure(data=[go.Pie(labels=x,values=value , name="Stroke",hole=0.4,pull=[0.1],
                     textinfo="label+percent")])
fig.update_layout(title_text='Target Feature after Re-sampling',autosize=False,
                  title=dict(x=0.5))
fig.show()


<a id="3"></a> <br>
<h1 style="background-color:#A8A8A8; font-family:'Brush Script MT',cursive;color:white;font-size:200%; text-align:center;border-radius: 50% 20% / 10% 40%">Data Preprocessing</h1>

In [None]:
# Encode categorical columns as integers; a fresh LabelEncoder is fitted for each column
categorical_feature_mask = df.dtypes==object
categorical_features =df.columns[categorical_feature_mask].tolist()
df[categorical_features] =df[categorical_features].apply(lambda col: LabelEncoder().fit_transform(col))
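
If the integer codes need to be mapped back to the original labels later (for example when reading plots), one option is to keep one fitted encoder per column. The sketch below assumes a small hypothetical frame (`toy`) with made-up values; `encoders` is an illustrative name, not something defined in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "ever_married": ["Yes", "No", "Yes", "Yes"],
})

encoders = {}            # column name -> fitted LabelEncoder
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Classes are stored in sorted order; the code is the position in that array.
print(encoders["gender"].classes_)

# Map the integer codes back to the original labels.
decoded = encoders["gender"].inverse_transform(encoded["gender"])
print(list(decoded))
```

Note that `LabelEncoder` assigns codes in sorted label order, which implies an ordering that tree models can split on but that linear and distance-based models may misread as a magnitude.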

In [None]:
#Splitting the dataset into the Training set and Test set
X=df.drop(columns=['stroke','id'])
y=df['stroke']
x_train, x_test,  y_train, y_test = train_test_split(X, y,  random_state=0)

<a id="4"></a> <br>
<h1 style="background-color:#A8A8A8; font-family:'Brush Script MT',cursive;color:white;font-size:200%; text-align:center;border-radius: 50% 20% / 10% 40%">Modeling</h1>

In [None]:
#setting up models that we'll be testing out
models = [('LogReg', LogisticRegression()), 
          ('RF' , RandomForestClassifier(n_estimators=20, random_state=0)),
           ('DecTree', DecisionTreeClassifier()),
          ('KNN', KNeighborsClassifier()),
          ('LinDisc', LinearDiscriminantAnalysis()),
          ('GaussianNB', GaussianNB())]

In [None]:
#k-fold validation to evaluate each algorithm
scores=[]
cross_val_scores = []
model_names=[]
for model_name, model in models:
    # 5-fold cross-validated F1 on the full (resampled) data
    results = model_selection.cross_val_score(model, X, y, cv=5, scoring='f1')
    # F1 on the held-out test split; sklearn metrics expect (y_true, y_pred)
    model.fit(x_train, y_train)
    y_pred =model.predict(x_test)
    score = f1_score(y_test, y_pred)
    cross_val_scores.append(results)
    scores.append(score)
    model_names.append(model_name)
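
For a classifier, passing an integer `cv` to `cross_val_score` already uses stratified folds, but without shuffling. A minimal sketch of making the splitter explicit follows; the synthetic data from `make_classification` and the names `X_toy`, `y_toy` and `fold_scores` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the notebook's X and y.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, weights=[0.8, 0.2], random_state=0
)

# Explicit splitter: every fold keeps the class ratio, rows are shuffled once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0)

# One F1 score per fold.
fold_scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="f1")
print(fold_scores, fold_scores.mean())
```

Reporting the mean together with the spread across folds gives a fairer comparison between classifiers than a single split.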
    
    


<a id="5"></a> <br><h1 style="background-color:#A8A8A8; font-family:'Brush Script MT',cursive;color:white;font-size:200%; text-align:center;border-radius: 50% 20% / 10% 40%">Model Evaluation</h1>

## k-fold cross-validation scores of different classifiers

In [None]:
# Prepare Data: one row per fold, one column per classifier
data=np.array(cross_val_scores)
data = data.transpose()
Data=pd.DataFrame(data, columns=model_names)
#Draw graph
fig = px.line(Data,x=['one','Two','Three','Four','Five'],y=Data.columns,title='Comparing models k Cross validation scores', labels=dict(value="f1 Score", x="k(FOLD)", variable="Classifier"))
fig.update(layout=dict(title=dict(x=0.5)))
fig.update_traces(mode="markers+lines",marker=dict(
            color='Gray'))

## Comparing model performance on testing data
Findings: Random Forest has a higher F1 score than the other classifiers.

In [None]:
fig = px.line(x=model_names,y=scores,labels=dict(y="f1 Score", x="classifier"),title='Comparing all classifiers performance on testing dataset')
fig.update_traces(mode="markers+lines",marker=dict(
            color='Red'))

### ROC AUC Curve
Another way to evaluate and compare binary classifiers is the ROC curve. I plot the models with the highest and the lowest F1 scores to see how their curves differ. The black dashed line in the middle represents a purely random classifier, so a good classifier should lie as far above it as possible. Our Random Forest model seems to do a good job, as its curve is far from that line. The ROC AUC score summarizes the curve in a single number: it is the area under the curve (AUC). A classifier that is 100% correct has a ROC AUC score of 1, and a completely random classifier has a score of 0.5.

In [None]:
AUC_models = [('GaussianNB', GaussianNB()), 
          ('RF' , RandomForestClassifier())]


# Create an empty figure, and iteratively add new lines
fig = go.Figure()
fig.add_shape(
    type='line', line=dict(dash='dash',color='Black'),
    x0=0, x1=1, y0=0, y1=1
)

for model_name, model in AUC_models:

    model.fit(x_train, y_train)
    # Use the predicted probability of the positive class, not hard 0/1 labels,
    # so the ROC curve is traced over all thresholds
    y_score = model.predict_proba(x_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_score)
    auc_score = roc_auc_score(y_test, y_score)
    name = f"{model_name}(AUC={auc_score:.2f})"
    fig.add_trace(go.Scatter(x=fpr, y=tpr, name=name, mode='lines'))

fig.update_layout(title='ROC CURVE',
    xaxis_title='False Positive Rate',
    yaxis_title='True Positive Rate',
    yaxis=dict(scaleanchor="x", scaleratio=1),
    xaxis=dict(constrain='domain'),
    width=500, height=500
    
)
fig.update(layout=dict(title=dict(x=0.5)))
fig.show()
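
The ROC curve is built by sweeping a threshold over continuous scores, so it matters whether `roc_curve` receives probabilities or hard 0/1 predictions. The toy example below is a sketch with made-up labels and scores (`y_true`, `scores`) that shows the difference.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.9]      # assumed predicted probabilities of class 1
hard = [int(s >= 0.5) for s in scores]  # hard labels after thresholding at 0.5

# Every positive is scored above every negative, so the ranking is perfect.
auc_from_scores = roc_auc_score(y_true, scores)

# Hard labels discard the ranking among the rows predicted as 0.
auc_from_labels = roc_auc_score(y_true, hard)

fpr_h, tpr_h, _ = roc_curve(y_true, hard)

print(auc_from_scores, auc_from_labels)
print(len(fpr_h))  # number of points on the curve built from hard labels
```

Here the scores rank the classes perfectly (AUC of 1.0), while the thresholded labels give an AUC of about 0.67 and a curve with only three points.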

<a id="6"></a> <br><h1 style="background-color:#A8A8A8; font-family:'Brush Script MT',cursive;color:white;font-size:200%; text-align:center;border-radius: 50% 20% / 10% 40%">Feature Importance</h1>



The figure shows that age, glucose level and BMI are the most important features for predicting stroke.

In [None]:
rf=RandomForestClassifier().fit(x_train, y_train)
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(x_test)
shap.summary_plot(shap_values, x_test, plot_type="bar")
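
A feature ranking from SHAP can be cross-checked with a different technique. The sketch below uses scikit-learn's permutation importance, which is a swapped-in method and not what the notebook uses, on synthetic data (`X_toy`, `y_toy`) where by construction only the first column carries signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: the label depends on column 0 only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

# Fit on the first 300 rows, evaluate importance on the remaining 100.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_toy[:300], y_toy[:300])

# Shuffle one column at a time and measure how much the score drops.
result = permutation_importance(
    model, X_toy[300:], y_toy[300:], n_repeats=10, random_state=0
)

# Column indices ordered from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking, result.importances_mean)
```

If two methods agree on the top features, the ranking is less likely to be an artifact of one particular explanation technique.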