<h1>STROKE PREDICTION & ANALYSIS</h1>

![Stroke](https://blog.encompasshealth.com/wp-content/uploads/2020/09/did-you-have-a-stroke.jpg?w=700&h=400&crop=1)

<h3>In this notebook we'll analyze the factors that can lead to a stroke and compare which of them have the most impact. After that we'll build a model to predict whether a patient suffers a stroke. This is a classification problem, and later we'll see which models we are going to use.</h3>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        


# <span style="color:khaki">**Content**</span>
- [Data Cleansing](#Data-Cleansing) <a href = '#Data-Cleansing'></a>
- [Label Encoding](#Label-Encoding) <a href = '#Label-Encoding'></a>
- [Data Analysis](#Data-Analysis) <a href = '#Data-Analysis'></a>
- [Models](#Models) <a href = '#Models'></a>
- [Oversampling](#OVERSAMPLING) <a href = '#OVERSAMPLING'></a>
- [Grid Search CV KNN](#KNNCV) <a href = '#KNNCV'></a>
- [KNN](#KNN) <a href = '#KNN'></a>
- [Grid Search CV Random Forest Classifier](#RFCCV) <a href = '#RFCCV'></a>
- [Random Forest Classifier](#RFC) <a href = '#RFC'></a>







# <a id='Data-Cleansing' style="color:khaki" >**Data Cleansing**</a>



Let's read our dataset and see what we have.

In [None]:
df = pd.read_csv("/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")
df.head()

In [None]:
df.info()

1. First we'll fill the null values in the `bmi` column with the column mean.
2. Then we'll check the unique values in:
* gender
* work type
* residence type
* smoking status


In [None]:
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())

print("Smoking Status:\n{}".format(df["smoking_status"].value_counts()))
print("\n\nGender:\n{}".format(df["gender"].value_counts()))
print("\n\nWork Type:\n{}".format(df["work_type"].value_counts()))
print("\n\nResidence Type:\n{}".format(df["Residence_type"].value_counts()))
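
Before imputing, it can help to confirm how many values are actually missing and to compare the mean with the median, since the median may be a more robust fill value if the distribution is skewed. A minimal sketch on a made-up frame (the values are invented for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset; values are invented for illustration.
toy = pd.DataFrame({"bmi": [22.0, 27.5, np.nan, 31.0, np.nan, 48.0]})

missing_before = int(toy["bmi"].isna().sum())   # number of NaN entries
mean_bmi = toy["bmi"].mean()                    # NaN values are skipped by default
median_bmi = toy["bmi"].median()

# Same strategy as the notebook: fill NaN with the column mean.
toy["bmi_filled"] = toy["bmi"].fillna(mean_bmi)
missing_after = int(toy["bmi_filled"].isna().sum())

print(missing_before, missing_after, mean_bmi, median_bmi)
```

Here the single large value pulls the mean above the median, which is the kind of gap worth checking on the real `bmi` column before choosing a fill value.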


I'll remove the row where gender == "Other" because there is only one such record.

In [None]:
indexToDrop = df[df["gender"] == "Other"].index
df.drop(indexToDrop,inplace=True)
df["gender"].value_counts()

I'll also merge "formerly smoked" into the "smokes" group.

In [None]:
df["smoking_status"] = df["smoking_status"].apply(lambda x: x.replace("formerly smoked","smokes"))
df["smoking_status"].value_counts()


Now let's Label Encode our string columns.
* Gender
* Work type
* Residence type
* Ever Married
* Smoking Status

# <a id='Label-Encoding' style="color:khaki" >**Label Encoding**</a>



In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

to_encode = ["gender","work_type","Residence_type","ever_married","smoking_status"]
def encode(colName):
    newName = colName + "_encoded"
    df[newName] = le.fit_transform(df[colName])
    return df

for x in to_encode:
    encode(x)
    
df.head()
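
`LabelEncoder` assigns integers in sorted order of the labels, which gives nominal columns such as `work_type` an arbitrary numeric order. Distance-based models like KNN will treat that order as meaningful. One commonly used alternative is one-hot encoding; the sketch below uses `pd.get_dummies` on a toy frame (the rows are invented for illustration).

```python
import pandas as pd

# Invented rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "work_type": ["Private", "Self-employed", "Govt_job", "Private"],
    "stroke": [0, 1, 0, 0],
})

# One indicator column per category, so no artificial ordering between categories.
encoded = pd.get_dummies(toy, columns=["work_type"], dtype=int)

print(encoded.columns.tolist())
```

Each row has exactly one indicator set to 1, so no category ends up numerically "closer" to another.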

# <a id='Data-Analysis' style="color:khaki" >**Data Analysis**</a>



Let's import libraries for visualizations.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

Let's start with a basic countplot of the effect of marriage on stroke.

In [None]:
marriageStrokeAffect = sns.countplot(x="stroke",hue="ever_married",data=df,palette="husl")
marriageStrokeAffect.set_title("Effect of Marriage on Stroke")
plt.show()

Let's take a closer look at the marital status of the patients who had a stroke.

In [None]:
onlyStroke = df[df["stroke"] == 1]
marriageStrokeAffectOnlyStroke = sns.countplot(x="stroke",hue="ever_married",data=onlyStroke,palette="husl")
marriageStrokeAffectOnlyStroke.set_title("Effect of Marriage on Stroke (Stroke Patients Only)")
plt.show()
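
Countplots show absolute counts, so a larger group can look riskier simply because it contains more patients. Comparing the stroke rate per group can make the picture clearer. A small sketch with made-up rows:

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "ever_married": ["Yes", "Yes", "Yes", "Yes", "No", "No"],
    "stroke":       [1,     0,     0,     0,     0,    0],
})

# The mean of a 0/1 column is the share of rows equal to 1, i.e. the stroke rate.
rate = toy.groupby("ever_married")["stroke"].mean()
print(rate)
```

The same `groupby(...)["stroke"].mean()` pattern would apply to `gender`, `work_type`, or any other categorical column in `df`.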

Let's see if age has any impact on stroke.

In [None]:
f, ax = plt.subplots(figsize=(18, 7))

ageStroke = sns.countplot(x="age",hue="stroke",data=onlyStroke,palette="husl")
ageStroke.set_title("Effect of Age on Stroke")
plt.show()


The next plot shows which gender had more strokes.


In [None]:
genderStroke = sns.countplot(x="stroke",hue="gender",data=df,palette="Set2")
genderStroke.set_title("Effect of Gender on Stroke")
plt.show()

In [None]:
genderStroke = sns.countplot(x="stroke",hue="gender",data=onlyStroke,palette="Set2")
genderStroke.set_title("Effect of Gender on Stroke (Stroke Patients Only)")
plt.show()

In [None]:
ageGenderStrokeFig = px.bar(onlyStroke,x="age",y="stroke",color="gender",title="Effect of Age and Gender on Stroke")
ageGenderStrokeFig.show()

In [None]:
heartDiseaseStroke = px.bar(onlyStroke,x="heart_disease",y="stroke",title="Effect of Heart Disease on Stroke")
heartDiseaseStroke.show()

In [None]:
heartDiseaseStrokeGender = px.bar(onlyStroke,x="heart_disease",y="stroke",color="gender",title="Effect of Heart Disease and Gender on Stroke")
heartDiseaseStrokeGender.show()


In [None]:
df["stroke_str"] = df["stroke"].apply(str)
bmiAgeStroke = px.scatter(df,x="age",y="bmi",color="stroke_str",color_discrete_sequence=px.colors.qualitative.Set3,
                          title="Effect of Age and BMI on Stroke")
bmiAgeStroke.show()


In [None]:
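# Use the string-typed stroke column so Plotly treats it as a category and applies the discrete palette
glucoseAgeStroke = px.scatter(df,x="age",y="avg_glucose_level",color="stroke_str",
                              color_discrete_sequence=px.colors.qualitative.Set3,
                              title="Effect of Glucose Level and Age on Stroke")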
glucoseAgeStroke.show()

In [None]:
smokeStroke = px.bar(df,x="smoking_status",y="stroke",title="Effect of Smoking on Stroke")
smokeStroke.show()

In [None]:
hypertensionStroke = sns.countplot(x="hypertension",hue="stroke",data=df)
hypertensionStroke.set_title("Effect of Hypertension on Stroke")
plt.show()

In [None]:
workStroke = px.bar(df,x="work_type",y="stroke",color="gender",title="Effect of Work Type on Stroke")
workStroke.show()

In [None]:
residenceStroke = sns.countplot(x="Residence_type",data=df,hue="stroke")
plt.show()

In [None]:
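# numeric_only=True skips string columns such as gender, which newer pandas versions no longer drop silently
corr_df = df.corr(numeric_only=True)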
f, ax = plt.subplots(figsize=(10, 8))

corr_vis = sns.heatmap(corr_df,cmap="YlGnBu")
plt.show()
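
A heatmap without annotations makes it hard to read exact values. One option is to pull out just the column of correlations with the target and sort it. A minimal sketch on invented data (only the column names mirror the dataset):

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    "age": [20, 35, 50, 65, 80],
    "bmi": [30.0, 22.0, 27.0, 25.0, 29.0],
    "stroke": [0, 0, 0, 1, 1],
})

# Pearson correlation of every numeric column with the target, strongest first.
target_corr = (
    toy.corr(numeric_only=True)["stroke"]
    .drop("stroke")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strong negative correlations near the top as well.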

# <a id='Models' style="color:khaki" >**Models**</a>



Libraries

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import SMOTE

Features and target data

In [None]:
X = df[["age","hypertension","heart_disease","avg_glucose_level","bmi","gender_encoded","work_type_encoded","Residence_type_encoded","ever_married_encoded","smoking_status_encoded"]]
y = df["stroke"]

# <a id='OVERSAMPLING' style="color:khaki" >**Oversampling**</a>



Our data is imbalanced: there are far more no-stroke records than stroke records. So we'll need to oversample the minority class. I'll use SMOTE.

In [None]:
smk = SMOTE()
X_sam, y_sam = smk.fit_resample(X,y)

Shape of the data before oversampling

In [None]:
print(X.shape)
print(y.shape)


Shape of the data after oversampling

In [None]:
print(X_sam.shape)
print(y_sam.shape)

In [None]:
X = X_sam[["age","hypertension","heart_disease","avg_glucose_level","bmi","gender_encoded","work_type_encoded","Residence_type_encoded","ever_married_encoded","smoking_status_encoded"]]
y = y_sam

Train test split

In [None]:
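# A fixed random_state keeps the split, and therefore the scores below, reproducible between runs
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=42)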
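
Because `fit_resample` runs before the split above, synthetic samples interpolated from rows that end up in the test set can also influence the training set, which tends to make test scores look better than they would on new data. A common alternative is to split first and oversample only the training portion. The sketch below illustrates that order on synthetic data; to stay dependency-light it uses plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE, and all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([1] * 20 + [0] * 180)   # 10% positives, invented for the demo

# 1) Split first, stratifying so both parts keep the original class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# 2) Oversample the minority class in the training part only.
minority = X_tr[y_tr == 1]
majority = X_tr[y_tr == 0]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)

X_tr_bal = np.vstack([majority, minority_up])
y_tr_bal = np.array([0] * len(majority) + [1] * len(minority_up))

# The test set keeps its original, imbalanced distribution.
print(np.bincount(y_tr_bal), np.bincount(y_te))
```

With the notebook's own tooling, step 2 would correspond to calling `SMOTE().fit_resample(X_train, y_train)` after `train_test_split` rather than before it.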

# <a id='KNNCV' style="color:khaki" >**GRID SEARCH CV - KNN**</a>



In [None]:
knn = KNeighborsClassifier()
parameters_knn = {"n_neighbors" : range(1,100,5), "weights" : ("uniform","distance"), "leaf_size" : range(10,100,10)}
gsKnn = GridSearchCV(knn,parameters_knn,scoring="accuracy")
gsKnn.fit(X_train,y_train)
print(gsKnn.best_score_)
print(gsKnn.best_estimator_)
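
`GridSearchCV` stores the winning combination in `best_params_`, so the final model can be built from it directly instead of copying values by hand. The sketch below runs a deliberately small grid on synthetic data from `make_classification` (the data and grid values are illustrative, not tuned for this dataset). `leaf_size` is left out because, per the scikit-learn documentation, it affects the speed and memory of the tree-based neighbor search and should not change the predictions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

param_grid = {"n_neighbors": [1, 5, 11], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="accuracy", cv=5)
search.fit(X_demo, y_demo)

# best_params_ is a plain dict, so it can be unpacked into a fresh estimator.
best_knn = KNeighborsClassifier(**search.best_params_).fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Since `refit=True` is the default, `search.best_estimator_` is also already fitted on the full training data and can be used for prediction as-is.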

# <a id='KNN' style="color:khaki" >**KNN**</a>



Let's train our KNN with the best parameters.

In [None]:
knn = KNeighborsClassifier(leaf_size=10, n_neighbors=1)
knn.fit(X_train,y_train)
knnPred = knn.predict(X_test)

print(classification_report(y_test,knnPred))
print(confusion_matrix(y_test,knnPred))


In [None]:
knnCfMatrix = confusion_matrix(y_test,knnPred)
f, ax = plt.subplots(figsize=(10, 8))
knnHeat = sns.heatmap(knnCfMatrix,annot=True,cmap="Blues",fmt="g")
plt.show()

# <a id='RFCCV' style="color:khaki" >**GRID SEARCH CV - RANDOM FOREST CLASSIFIER**</a>



In [None]:
rfc = RandomForestClassifier()
# "auto" was an alias for "sqrt" in classifiers and is no longer accepted by recent scikit-learn releases
parameters = {"n_estimators" : range(10,300,10), "criterion" : ("gini","entropy"), "max_features" : ("sqrt", "log2")}
gs = GridSearchCV(rfc,parameters,scoring="accuracy")
gs.fit(X_train,y_train)
print(gs.best_score_)
print(gs.best_estimator_)

# <a id='RFC' style="color:khaki" >**RANDOM FOREST CLASSIFIER**</a>



Let's apply the best parameters to the Random Forest Classifier model.

In [None]:
rfc = RandomForestClassifier(max_features='sqrt', n_estimators=210)
rfc.fit(X_train,y_train)
rfcPred = rfc.predict(X_test)
print(classification_report(y_test,rfcPred))
print(confusion_matrix(y_test,rfcPred))

In [None]:
rfcCfMatrix = confusion_matrix(y_test,rfcPred)
f, ax = plt.subplots(figsize=(10, 8))
rfcHeat = sns.heatmap(rfcCfMatrix,annot=True,cmap="Blues",fmt="g")
plt.show()
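
A fitted random forest also exposes `feature_importances_`, which gives a rough ranking of how much each feature contributed to the splits and ties back to the question of which factors matter most. A minimal sketch on synthetic data (the feature names are placeholders, not the dataset's columns):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real features and target.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
names = ["f0", "f1", "f2", "f3"]  # placeholder names for the demo

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=names).sort_values(ascending=False)
print(importances)
```

On the notebook's model, the equivalent would be `pd.Series(rfc.feature_importances_, index=X.columns)`. Impurity-based importances can favor high-cardinality numeric features, so they are best read as a rough guide.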

<h3> Upvote if you want to see more editing/model improvement/tuning on this code. :) </h3>