<h1 style="font-size:200%">Table of Contents</h1>

* [<a style="font-size:130%;color:green">Preparation](#0_bullet)
* [<a style="font-size:130%;color:green">1. Exploratory Data Analysis (EDA)](#1_bullet)
    * [<a style="font-size:130%;color:green"> 1.1 Smoking Status Analysis](#1.1_bullet)
    * [<a style="font-size:130%;color:green"> 1.2 Gender Analysis](#1.2_bullet)
    * [<a style="font-size:130%;color:green"> 1.3 Age Analysis](#1.3_bullet)
    * [<a style="font-size:130%;color:green"> 1.4 Hypertension Analysis](#1.4_bullet)
    * [<a style="font-size:130%;color:green"> 1.5 Ever Married Analysis](#1.5_bullet)
    * [<a style="font-size:130%;color:green"> 1.6 Avg Glucose Level Analysis](#1.6_bullet)
    * [<a style="font-size:130%;color:green"> 1.7 Work Type Analysis](#1.7_bullet)

* [<a style="font-size:130%;color:green">2. Feature Engineering](#2_bullet)
    * [<a style="font-size:130%;color:green"> 2.1 Filling None values](#2.1_bullet)
    * [<a style="font-size:130%;color:green"> 2.2 Encoding features and dropping unnecessary ones](#2.2_bullet)
* [<a style="font-size:130%;color:green">3. Modelling](#3_bullet)
    * [<a style="font-size:130%;color:green">3.1 Preparing Data](#3.1_bullet)
    * [<a style="font-size:130%;color:green">3.2 Model Selection](#3.2_bullet)

<a id="introduction"></a>
**INTRODUCTION**

A stroke is a serious life-threatening medical condition that happens when the blood supply to part of the brain is cut off.

The main symptoms of stroke can be remembered with the word FAST:

* Face – the face may have dropped on one side, the person may not be able to smile, or their mouth or eye may have dropped.
* Arms – the person with suspected stroke may not be able to lift both arms and keep them there because of weakness or numbness in one arm.
* Speech – their speech may be slurred or garbled, or the person may not be able to talk at all despite appearing to be awake; they may also have problems understanding what you're saying to them.
* Time – it's time to dial 999 immediately if you see any of these signs or symptoms.

![](https://wp02-media.cdn.ihealthspot.com/wp-content/uploads/sites/520/2020/04/08160759/iStock-1168179082.jpg)


<strong> Attribute Information </strong>
*  id: unique identifier
*  gender: "Male", "Female" or "Other"
*  age: age of the patient
*  hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
*  heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
*  ever_married: "No" or "Yes"
*  work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
*  Residence_type: "Rural" or "Urban"
*  avg_glucose_level: average glucose level in blood
*  bmi: body mass index
*  smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"
*  stroke: 1 if the patient had a stroke or 0 if not <br>

<a id="import"></a>
# Libraries

In [None]:
from matplotlib import pyplot as plt
import plotly.express as px
import scikitplot as skplt
import missingno as msno
import pandas as pd
import numpy as np
import os
import re
import warnings
warnings.filterwarnings("ignore")

In [None]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MinMaxScaler, LabelBinarizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from catboost import CatBoostClassifier
from sklearn.svm import LinearSVC, SVC
from xgboost import XGBClassifier

In [None]:
import seaborn as sns
cm = sns.light_palette("red", as_cmap=True)
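# stroke is cast to str before plotting, so cover both int-like ("1") and float-like ("1.0") labels
cm_discret = {"1":"red","0":"gray","1.0":"red","0.0":"gray","nan":"gainsboro"}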

<a id="data"></a>
# Importing & Data Check

In [None]:
path = "/kaggle/input/stroke-prediction-dataset/"
df = pd.read_csv(f"{path}healthcare-dataset-stroke-data.csv").set_index("id", drop=True)
df.head(10).style.background_gradient(cmap=cm)

In [None]:
msno.bar(df, figsize=(30,2), color='red')

# <a class="anchor" id="1_bullet" style="color:green"> 1. Exploratory Data Analysis (EDA)

## <a class="anchor" id="1.1_bullet" style="color:green"> 1.1 Smoking Status Analysis

I know that a patient's smoking status may affect the stroke rate.

So let's analyze smoking status in relation to stroke.



In [None]:
df["smoking_status"].unique()

In [None]:
dfplt = df.copy(deep=True)
dfplt["stroke"] = dfplt["stroke"].astype(str)
fig = px.histogram(dfplt, x="smoking_status",color="stroke",
                   color_discrete_map=cm_discret)

fig.show()


In [None]:
# Binary flag: 1 if the patient smokes or formerly smoked, else 0 (including "Unknown")
df["smoke"] = df["smoking_status"].isin(["formerly smoked", "smokes"]).astype("int")

In [None]:
df.groupby(["stroke","smoke"])["stroke"].count()

In [None]:
# Share of stroke patients who smoke or formerly smoked
stroke_smoke=df[(df["smoke"]==1)&(df["stroke"]==1)]["smoke"].sum()/df[df["stroke"]==1]["stroke"].sum()
# Overall stroke rate in the dataset
stroke=df[df["stroke"]==1]["stroke"].count()/df["stroke"].count()

print("Stroke Rate: ",'{:,.2%}'.format(stroke))
print("Stroke Smoking Rate: ",'{:,.2%}'.format(stroke_smoke))

We can see some connections in the data:
  - For example, the overall stroke rate is 4.2%, but almost half of the stroke patients smoke or formerly smoked.
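
The same question can also be asked the other way round: what is the stroke rate inside each smoking group? Here is a minimal sketch on a small made-up frame. The `toy` data and its values are assumptions for illustration only, not the real dataset.

```python
import pandas as pd

# Hypothetical toy data, NOT the real dataset: 1 = yes, 0 = no
toy = pd.DataFrame({
    "smoke":  [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "stroke": [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

# stroke is 0/1, so the group mean is the stroke rate within each group
rate_by_smoke = toy.groupby("smoke")["stroke"].mean()

# Share of smokers among stroke patients (the direction used in the cell above)
smokers_among_stroke = toy.loc[toy["stroke"] == 1, "smoke"].mean()

print(rate_by_smoke)
print(f"Smokers among stroke patients: {smokers_among_stroke:.2%}")
```

The two numbers answer different questions, so it helps to look at both before calling smoking a strong signal.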
 

## <a class="anchor" id="1.2_bullet" style="color:green"> 1.2 Gender Analysis

In [None]:
dfplt = df.copy(deep=True)
dfplt["stroke"] = dfplt["stroke"].astype(str)
fig = px.histogram(dfplt, x="gender",color="stroke",
                   color_discrete_map=cm_discret)
fig.show()

In [None]:
df.groupby(["gender","stroke"])["stroke"].count()

In [None]:
stroke_male=df[(df["gender"]=="Male")&(df["stroke"]==1)]["gender"].count()/df[df["gender"]=="Male"]["gender"].count()
stroke_female=df[(df["gender"]=="Female")&(df["stroke"]==1)]["gender"].count()/df[df["gender"]=="Female"]["gender"].count()

print("Stroke Male Rate: ",'{:,.2%}'.format(stroke_male))
print("Stroke Female Rate: ",'{:,.2%}'.format(stroke_female))

### **There seems to be no major difference by gender.**
 

> ## <a class="anchor" id="1.3_bullet" style="color:green"> 1.3 Age Analysis

In [None]:
dfplt = df.copy(deep=True)
dfplt["stroke"] = dfplt["stroke"].astype(str)
fig = px.histogram(dfplt, x="age",color="stroke",
                   color_discrete_map=cm_discret)

fig.show()

In [None]:
df['agebin'] = pd.cut(df.age,bins=3,labels=range(1, 4), retbins=False,include_lowest=True)
df['agebin']=df['agebin'].astype(int)
df.groupby(["agebin","stroke"])["age"].count()

* ###  **Binning age gives a clearer view of the stroke rate by age group**
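
To make the binning step easier to follow, here is a small sketch of what `pd.cut` with `bins=3` does. The ages and outcomes below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy ages and outcomes, for illustration only
toy = pd.DataFrame({
    "age":    [2, 10, 25, 30, 45, 50, 60, 70, 80, 82],
    "stroke": [0,  0,  0,  0,  0,  1,  0,  1,  1,  1],
})

# bins=3 splits the observed age range into three equal-width intervals
toy["agebin"] = pd.cut(toy["age"], bins=3, labels=[1, 2, 3],
                       include_lowest=True).astype(int)

# Stroke rate per age bin (mean of a 0/1 column)
rate_by_bin = toy.groupby("agebin")["stroke"].mean()
print(rate_by_bin)
```

Equal-width bins depend on the minimum and maximum age in the data, so the cut points move if the age range changes.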

> ## <a class="anchor" id="1.4_bullet" style="color:green"> 1.4 Hypertension Analysis

In [None]:
pd.crosstab(df["hypertension"], df["stroke"]).plot(kind="bar", color=["orange", "purple"]);

In [None]:
# Create another figure
plt.figure(figsize=(10,6))

# Start with the negative examples (no stroke)
plt.scatter(df.age[df.stroke==0], 
            df.bmi[df.stroke==0], 
            c="lightblue") # scatter takes the axes as (x, y)

# Now the positive examples (stroke); calling plt again draws on the same figure
plt.scatter(df.age[df.stroke==1], 
            df.bmi[df.stroke==1], 
            c="salmon")

# Add some helpful info
plt.title("Stroke as a function of Age and BMI")
plt.xlabel("Age")
plt.legend(["No Stroke", "Stroke"])
plt.ylabel("BMI");

plt.savefig('stroke_prediction_hyper_tension.png')
plt.show()

### BMI looks like a useful feature

> ## <a class="anchor" id="1.5_bullet" style="color:green"> 1.5 Ever Married Analysis

In [None]:
dfplt = df.copy(deep=True)
dfplt["stroke"] = dfplt["stroke"].astype(str)
fig = px.histogram(dfplt, x="ever_married",color="stroke",
                   color_discrete_map=cm_discret)
fig.show()

###  Ha, this result made me rethink marriage :)
###  Just joking. The main reason is age: the minimum age of ever-married people is 18, and the older you get, the higher the marriage rate.


In [None]:

df.groupby(["agebin","ever_married","stroke"])["age"].count()

In [None]:
dfplt = df.copy(deep=True)
dfplt = dfplt[~dfplt["stroke"].isna()]
dfplt["stroke"] = dfplt["stroke"].astype(str)
fig = px.scatter_3d(dfplt, x="ever_married", y="age", z= "smoking_status", color="stroke",
                    color_discrete_map=cm_discret, size_max=6, width=1000, height=1000)
fig.show()

> ## <a class="anchor" id="1.5_bullet" style="color:green"> 1.5 Heart Disease Analysis

In [None]:
dfplt = df.copy(deep=True)
dfplt["stroke"] = dfplt["stroke"].astype(str)
fig = px.histogram(dfplt, x="heart_disease",color="stroke",
                   color_discrete_map=cm_discret)

fig.show()

In [None]:

df.groupby(["heart_disease","stroke"])["age"].count()

### If you have heart disease, the probability of having a stroke is 16%; if you do not, it is 3.5%.
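
Conditional rates like these can be read straight from a normalized cross table. A minimal sketch follows, with toy values that are assumptions for illustration only (they will not reproduce the percentages above).

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "heart_disease": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "stroke":        [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
})

# normalize="index" divides each row by its total,
# so each row reads as P(stroke | heart_disease)
rates = pd.crosstab(toy["heart_disease"], toy["stroke"], normalize="index")
print(rates)
```

Each row sums to 1, which makes the two groups directly comparable even though their sizes differ.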

In [None]:
dfplt = df.copy(deep=True)
dfplt = dfplt[~dfplt["stroke"].isna()]
dfplt["stroke"] = dfplt["stroke"].astype(str)
fig = px.scatter_3d(dfplt, x="heart_disease", y="bmi", z= "smoking_status", color="stroke",
                    color_discrete_map=cm_discret, size_max=6, width=1000, height=1000)
fig.show()

### BMI and heart disease also look more informative together

## <a class="anchor" id="1.6_bullet" style="color:green"> 1.6 Avg Glucose Level Analysis

In [None]:
dfplt = df.copy(deep=True)
dfplt["stroke"] = dfplt["stroke"].astype(str)
fig = px.histogram(dfplt, x="avg_glucose_level",color="stroke",
                   color_discrete_map=cm_discret)
fig.show()

In [None]:
dfplt = df.copy(deep=True)
dfplt = dfplt[~dfplt["stroke"].isna()]
dfplt["stroke"] = dfplt["stroke"].astype(str)
fig = px.scatter_3d(dfplt, x="avg_glucose_level", y="bmi", z= "gender", color="stroke",
                    color_discrete_map=cm_discret, size_max=6, width=1000, height=1000)
fig.show()

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
df.plot(kind='scatter', x='age', y='avg_glucose_level', alpha=0.5, color='orange', ax=axes[0], title="Age vs. avg_glucose_level")
df.plot(kind='scatter', x='bmi', y='avg_glucose_level', alpha=0.5, color='purple', ax=axes[1], title="bmi vs. avg_glucose_level")
plt.savefig('stroke_prediction_avg_glucose.png')
plt.show()

> ## <a class="anchor" id="1.7_bullet" style="color:green"> 1.7 Work Type Analysis

In [None]:
dfplt = df.copy(deep=True)
dfplt["stroke"] = dfplt["stroke"].astype(str)
fig = px.histogram(dfplt, x="work_type",color="stroke",
                   color_discrete_map=cm_discret)
fig.show()

In [None]:
dfplt = df.copy(deep=True)
dfplt = dfplt[~dfplt["stroke"].isna()]
dfplt["stroke"] = dfplt["stroke"].astype(str)
fig = px.scatter_3d(dfplt, x="work_type", y="heart_disease", z= "age", color="stroke",
                    color_discrete_map=cm_discret, size_max=6, width=1000, height=1000)

fig.show()

<div class="alert alert-warning" role="alert">
  <h4 class="alert-heading">Observation 🔎🔎🔎.</h4>
  <p> 📌 
The data seems to have high explanatory power</p>
  <hr>
</div>


# <a class="anchor" id="2_bullet" style="color:green"> 2. Feature Engineering

## <a class="anchor" id="2.1_bullet" style="color:green"> 2.1 Filling None values

In [None]:
msno.bar(df, figsize=(30,2), color="red")

In [None]:
df_bmi_null=df[df["bmi"].isnull()]
df_bmi_null.groupby("stroke")["age"].count()

### Rows with missing BMI are important (their stroke ratio is 20%), so we cannot drop them
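
One way to check whether missingness itself carries signal is to flag the missing rows and compare stroke rates between the two groups. A rough sketch on made-up values; the `bmi_missing` column is a hypothetical helper, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "bmi":    [22.0, np.nan, 31.5, np.nan, 27.3, 24.1, np.nan, 29.8],
    "stroke": [0, 1, 0, 0, 0, 1, 1, 0],
})

# 1 where BMI is missing, 0 otherwise (hypothetical helper column)
toy["bmi_missing"] = toy["bmi"].isna().astype(int)

# Stroke rate among rows with and without a recorded BMI
rate_by_missing = toy.groupby("bmi_missing")["stroke"].mean()
print(rate_by_missing)
```

If the rate among the missing rows is clearly higher, dropping them would throw away a disproportionate share of the positive class.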

In [None]:

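# Correlate numeric columns only; text columns cannot be correlated
df_corr=df.select_dtypes(include="number").corr()
fig, ax = plt.subplots(figsize=(12, 10))
# mask (np.bool was removed from NumPy, so use the builtin bool)
mask = np.triu(np.ones_like(df_corr, dtype=bool))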
# adjust mask and df
mask = mask[1:, :-1]
corr = df_corr.iloc[1:,:-1].copy()
# color map
cmap = sns.diverging_palette(0, 230, 90, 60, as_cmap=True)
# plot heatmap
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", 
           linewidths=5, cmap=cmap, vmin=-1, vmax=1, 
           cbar_kws={"shrink": .8}, square=True)
# ticks
yticks = [i.upper() for i in corr.index]
xticks = [i.upper() for i in corr.columns]
plt.yticks(plt.yticks()[0], labels=yticks, rotation=0)
plt.xticks(plt.xticks()[0], labels=xticks)
# title
title = 'Stroke Prediction\nFirst Look\n'
plt.title(title, loc='center', fontsize=18)
plt.show()



### BMI and age seem highly related, so we can fill in the missing BMI values with the mean BMI of patients of the same age and stroke status.

In [None]:
add_bmi=df.groupby(["age","stroke"])[["bmi"]].mean().reset_index()
add_bmi=add_bmi.rename(columns={"bmi":"bmi_add"})
df_bmi_null=df_bmi_null.merge(add_bmi,how="left")
df_bmi_null[df_bmi_null["bmi_add"].isnull()]


<div class="alert alert-danger" role="alert">
  <h4 class="alert-heading">⛔️⛔️⛔️</h4>
  <p>A 1-year-old child has a stroke</p>
  <hr>
  <p class="mb-0">We can drop it as an outlier</p>
</div>

In [None]:
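df=df.merge(df_bmi_null,how="left")

# Fill missing BMI from the group mean; assign back instead of using inplace on a column
df["bmi"] = df["bmi"].fillna(df["bmi_add"])
df=df.drop("bmi_add",axis=1)
# Drop the rows that still have no BMI
df=df.dropna()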
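
The same group-mean fill can probably be written without the two merges by using `groupby(...).transform("mean")`. This is only a sketch of that alternative on made-up values, not the method used in the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "age":    [40, 40, 40, 65, 65, 65, 1],
    "stroke": [0, 0, 0, 1, 1, 1, 1],
    "bmi":    [24.0, 26.0, np.nan, 30.0, np.nan, 34.0, np.nan],
})

# Mean BMI of each (age, stroke) group, aligned back to the original rows
group_mean = toy.groupby(["age", "stroke"])["bmi"].transform("mean")

# Fill only the missing values; groups with no known BMI stay NaN
toy["bmi"] = toy["bmi"].fillna(group_mean)

# Drop rows that still have no BMI (here the single age-1 row)
toy = toy.dropna(subset=["bmi"])
print(toy)
```

A side effect of this variant is that the frame keeps its original index, whereas a column-based `merge` returns a fresh integer index.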

In [None]:
msno.bar(df, figsize=(30,2), color="red")
plt.savefig('stroke_prediction_no_missing.png')
plt.show()

<div class="alert alert-success" role="alert">
Great, there are no null values anymore
</div>


## <a class="anchor" id="2.2_bullet" style="color:green"> 2.2 Encoding features and dropping unnecessary ones

In [None]:
df['gender'] = np.where((df.gender == 'Male'),'0',df["gender"])
df['gender'] = np.where((df.gender == 'Female'),'1',df["gender"])
df['gender'] = np.where((df.gender == 'Other'),'1',df["gender"])
df["gender"]=df["gender"].astype("int")

 <div class="alert alert-success" role="alert">
  <p>💡 The "Other" segment is probably female, judging by age, work_type and Residence_type</p>
</div>



In [None]:
df['ever_married'] = np.where((df.ever_married == 'No'),'0',df["ever_married"])
df['ever_married'] = np.where((df.ever_married == 'Yes'),'1',df["ever_married"])
df["ever_married"] = df["ever_married"].astype('int')

In [None]:
df['work_type'] = np.where((df.work_type == 'Private'),'0',df["work_type"])
df['work_type'] = np.where((df.work_type == 'Self-employed'),'1',df["work_type"])
df['work_type'] = np.where((df.work_type == 'Govt_job'),'2',df["work_type"])
df['work_type'] = np.where((df.work_type == 'children'),'3',df["work_type"])
df['work_type'] = np.where((df.work_type == 'Never_worked'),'4',df["work_type"])
df["work_type"] = df["work_type"].astype('int')


In [None]:
df['Residence_type'] = np.where((df.Residence_type == 'Urban'),'0',df["Residence_type"])
df['Residence_type'] = np.where((df.Residence_type == 'Rural'),'1',df["Residence_type"])
df["Residence_type"] = df["Residence_type"].astype('int')


In [None]:
df['smoking_status'] = np.where((df.smoking_status == 'formerly smoked'),'0',df["smoking_status"])
df['smoking_status'] = np.where((df.smoking_status == 'never smoked'),'1',df["smoking_status"])
df['smoking_status'] = np.where((df.smoking_status == 'smokes'),'2',df["smoking_status"])
df['smoking_status'] = np.where((df.smoking_status == 'Unknown'),'3',df["smoking_status"])
df["smoking_status"] = df["smoking_status"].astype('int')

 <div class="alert alert-success" role="alert">
  <p>💡 We also have the "smoke" column as a binary flag (yes/no)</p>
</div>
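
The chained `np.where` calls above could also be written as lookup tables with `Series.map`. Here is a minimal sketch for two of the columns, assuming the same codes as in the cells above; the `toy` frame and the `encodings` dictionary are hypothetical names for illustration.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "ever_married":   ["Yes", "No", "Yes"],
    "Residence_type": ["Urban", "Rural", "Urban"],
})

# Same codes as the cells above, written as lookup tables
encodings = {
    "ever_married":   {"No": 0, "Yes": 1},
    "Residence_type": {"Urban": 0, "Rural": 1},
}

for column, mapping in encodings.items():
    # map() returns NaN for categories missing from the table,
    # so check for that before casting to int
    encoded = toy[column].map(mapping)
    if encoded.isna().any():
        raise ValueError(f"unmapped value in {column}")
    toy[column] = encoded.astype(int)

print(toy)
```

Keeping the codes in one dictionary makes it easier to see at a glance which category got which number.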


### Now we should check the correlation matrix again with the new features

In [None]:
df_corr=df.corr()
fig, ax = plt.subplots(figsize=(12, 10))
# mask
mask = np.triu(np.ones_like(df_corr, dtype=bool))
# adjust mask and df
mask = mask[1:, :-1]
corr = df_corr.iloc[1:,:-1].copy()
# color map
cmap = sns.diverging_palette(0, 230, 90, 60, as_cmap=True)
# plot heatmap
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", 
           linewidths=5, cmap=cmap, vmin=-1, vmax=1, 
           cbar_kws={"shrink": .8}, square=True)
# ticks
yticks = [i.upper() for i in corr.index]
xticks = [i.upper() for i in corr.columns]
plt.yticks(plt.yticks()[0], labels=yticks, rotation=0)
plt.xticks(plt.xticks()[0], labels=xticks)
# title
title = 'Stroke Prediction'
plt.title(title, loc='center', fontsize=18)

plt.savefig('stroke_prediction_heatmap.png')
plt.show()

# <a class="anchor" id="3_bullet" style="color:green"> 3. Modelling

### <a class="anchor" id="3.1_bullet" style="color:green"> 3.1 Preparing Data

In [None]:
from sklearn.model_selection import train_test_split
x = df.drop("stroke", axis=1)
y = df["stroke"].values

np.random.seed(42)

# Splitting the data into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(x,y,test_size=0.2)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from catboost import CatBoostClassifier
from sklearn.svm import LinearSVC, SVC
from xgboost import XGBClassifier

models = {"CatBoostClassifier": CatBoostClassifier(silent=True),
          "Logistic Regression": LogisticRegression(), 
          "Random Forest": RandomForestClassifier(),
          "XGBoost": XGBClassifier(objective= 'binary:logistic')}

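def fit_and_score(models, x_train, x_test, y_train, y_test):
    """Fit each model on the training data and return its test accuracy by name."""
    np.random.seed(42)
    model_scores = {}
    for name, model in models.items():
        model.fit(x_train, y_train)
        # score() returns the mean accuracy for classifiers
        model_scores[name] = model.score(x_test, y_test)

    return model_scores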

In [None]:
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from lightgbm import LGBMClassifier

kfold = StratifiedKFold(n_splits=10)

# Modeling step: test different algorithms
random_state = 42
classifiers = []
classifiers.append(SVC(random_state=random_state))
classifiers.append(DecisionTreeClassifier(random_state=random_state))
classifiers.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state),random_state=random_state))
classifiers.append(RandomForestClassifier(random_state=random_state))
classifiers.append(ExtraTreesClassifier(random_state=random_state))
classifiers.append(GradientBoostingClassifier(random_state=random_state))
classifiers.append(MLPClassifier(random_state=random_state))
classifiers.append(KNeighborsClassifier())
classifiers.append(LogisticRegression(random_state = random_state))
classifiers.append(LinearDiscriminantAnalysis())
classifiers.append(XGBClassifier(n_estimators= 200,objective= 'binary:logistic', random_state = random_state))
classifiers.append(LGBMClassifier(random_state = random_state))
classifiers.append(CatBoostClassifier())

cv_results = []
for classifier in classifiers :
    cv_results.append(cross_val_score(classifier, X_train, y = Y_train, scoring = "accuracy", cv = kfold, n_jobs=4))

cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())

cv_res = pd.DataFrame({"CrossValMeans":cv_means,"CrossValerrors": cv_std,"Algorithm":["SVC","DecisionTree","AdaBoost",
"RandomForest","ExtraTrees","GradientBoosting","MultipleLayerPerceptron","KNeighbors","LogisticRegression","LinearDiscriminantAnalysis",'XGBClassifier','LGBMClassifier','CatBoostClassifier']})


### <a class="anchor" id="3.2_bullet" style="color:green"> 3.2 Model Selection

In [None]:
plt.figure(figsize=(20,10))
# x and y must be passed as keywords in recent seaborn versions
g = sns.barplot(x="CrossValMeans",y="Algorithm",data = cv_res, palette="tab10",orient = "h",**{'xerr':cv_std})
plt.axvline(0.95)
plt.axvline(0.90)
g.set_xlabel("Mean Accuracy")
g = g.set_title("Cross validation scores")

plt.savefig('stroke_prediction_model_comparison.png')
plt.show()

In [None]:
cv_res.sort_values(by="CrossValMeans",ascending=False)

In [None]:
from sklearn.metrics import classification_report

bm = SVC(random_state=random_state)
# Fit on the training split only; the test split is kept for evaluation
bm.fit(X_train,Y_train)

y_pred=bm.predict(X_test)
y_true=pd.DataFrame(Y_test)
cr=classification_report(y_true,y_pred,output_dict=True)
pd.DataFrame(cr)

# Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn import tree
import graphviz
# DOT data

clf = DecisionTreeClassifier(random_state=1234)
model = clf.fit(X_train,Y_train)


feature_names=X_test.columns.values

dot_data = tree.export_graphviz(clf, out_file=None, 
                                feature_names=feature_names,  

                                filled=True)

# Draw graph
graph = graphviz.Source(dot_data, format="png") 
graph.render("decision_tree_graphivz")

graph