# **Stroke Prediction**

According to the World Health Organization (WHO), stroke is responsible for approximately 11% of total deaths. This notebook uses a stacked ensemble of machine learning models to predict whether a person is likely to have a stroke.

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

**Importing the dataset**

The dataset contains 11 features which can be used to predict if a person is likely to have a stroke. This public dataset can be downloaded from this [link](https://www.kaggle.com/fedesoriano/stroke-prediction-dataset).

In [None]:
df= pd.read_csv('/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')

In [None]:
df.head()

In [None]:
df['gender']=df['gender'].map({'Male':1 ,'Female':0})
df['ever_married']=df['ever_married'].map({'Yes':1 ,'No':0})

In [None]:
df.head()

In [None]:
df['work_type'].unique()

In [None]:
df['smoking_status'].unique()

In [None]:
df['Residence_type'].unique()

In [None]:
df['work_type']=df['work_type'].map({'Private':0 ,'Self-employed':1, 'Govt_job':2, 'children':3, 'Never_worked':4})
df['smoking_status']=df['smoking_status'].map({'formerly smoked':0 ,'never smoked':1, 'smokes':2, 'Unknown':3})
df['Residence_type']=df['Residence_type'].map({'Urban':1 ,'Rural':0})

In [None]:
df.head()

**Handling the null values**

In [None]:
df.isnull().sum()

In [None]:
df=df.dropna(subset=['gender'])
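# Fill missing BMI values with the column mean (assignment avoids chained in-place modification)
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())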

In [None]:
df.isnull().sum()
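One detail of the encoding step is worth spelling out: `Series.map` with a dict returns `NaN` for any value that is not a key in the dict, which is presumably why rows with a missing `gender` are dropped above. The following is a minimal sketch on toy data; the `'Other'` category and all values are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy data; values are illustrative, not from the stroke dataset.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Other"],
    "bmi": [22.0, None, 30.0],
})

# Categories missing from the dict become NaN after map().
toy["gender"] = toy["gender"].map({"Male": 1, "Female": 0})
n_unmapped = int(toy["gender"].isna().sum())

# Drop the unmapped rows, then fill missing bmi with the column mean.
toy = toy.dropna(subset=["gender"])
toy["bmi"] = toy["bmi"].fillna(toy["bmi"].mean())
print(n_unmapped, toy["bmi"].tolist())
```

Checking `isnull().sum()` right after each `map` call is a cheap way to catch categories that were left out of a mapping.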

# Exploratory Data Analysis

In [None]:
df['stroke'].value_counts()

In [None]:
df=df.drop(['id'],axis=1)

Correlation Matrix

In [None]:
fig=plt.figure(figsize=(15,10))
corr= df.corr()
sns.heatmap(corr,annot=True)

**Relationship of stroke variable with continuous variables**

In [None]:
sns.violinplot(x='stroke', y='age',
           data=df)

In [None]:
sns.violinplot(x='stroke', y='avg_glucose_level',
           data=df)

In [None]:
sns.violinplot(x='stroke', y='bmi',
           data=df)

**Relationship of stroke variable with categorical variables**

In [None]:
sns.catplot(y="stroke",x='gender',data=df,kind='bar',ci=None)

In [None]:
sns.catplot(y="stroke",x='hypertension',data=df,kind='bar',ci=None)

In [None]:
sns.catplot(y="stroke",x='heart_disease',data=df,kind='bar',ci=None)

In [None]:
sns.catplot(y="stroke",x='ever_married',data=df,kind='bar',ci=None)

In [None]:
sns.catplot(y="stroke",x='work_type',data=df,kind='bar',ci=None)

In [None]:
sns.catplot(y="stroke",x='Residence_type',data=df,kind='bar',ci=None)

In [None]:
sns.catplot(y="stroke",x='smoking_status',data=df,kind='bar',ci=None)

# Prediction using Stacking

The variables 'gender' and 'Residence_type' have been removed as they do not have much effect on the prediction.

In [None]:
y=df['stroke']
X=df.drop(['stroke','gender','Residence_type'],axis=1)

Splitting the dataset into train and test sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_t, y_train, y_t = train_test_split(X,y,test_size=0.2,shuffle= True,random_state=69)
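Because the target is imbalanced, a plain random split can leave the test set with a different share of stroke cases than the training set. One possible variation, not used in this notebook, is to pass `stratify` to `train_test_split`. The sketch below uses synthetic data; `X_demo`, `y_demo` and the 5% positive rate are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 5% positives (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 50 + [0] * 950)

# stratify=y_demo keeps the class ratio (approximately) the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)
print(y_tr.mean(), y_te.mean())
```

In the notebook this would amount to adding `stratify=y` to the existing call.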

The stacking model uses Decision Tree, Random Forest and XGBoost classifiers as first-level estimators and Logistic Regression as the final estimator.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from xgboost import XGBClassifier

level0=list()
level0.append(('xgb', XGBClassifier()))
level0.append(('rf', RandomForestClassifier()))
level0.append(('dt', DecisionTreeClassifier()))

level1= LogisticRegression()

model = StackingClassifier(estimators=level0, final_estimator=level1, cv=5)
model.fit(X_train, y_train)
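As a rough illustration of what stacking with `cv=5` involves: each first-level estimator produces out-of-fold predictions through cross-validation, and the final estimator is trained on those predictions rather than on the raw features. The sketch below reproduces that idea by hand on synthetic data. It is a simplified approximation, not the exact internals of `StackingClassifier`. XGBoost is left out to keep the example dependent on scikit-learn only, and `X_demo`, `y_demo`, `base_models` and `meta_model` are names made up for this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for X_train / y_train (illustrative only).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    DecisionTreeClassifier(random_state=0),
]

# Out-of-fold probability of the positive class from each base model.
meta_features = np.column_stack([
    cross_val_predict(m, X_demo, y_demo, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# The final estimator learns how to combine the base-model outputs.
meta_model = LogisticRegression().fit(meta_features, y_demo)
print(meta_features.shape, meta_model.coef_.shape)
```

Using out-of-fold predictions means the final estimator never sees a base-model prediction for a row that the base model was trained on, which limits overfitting at the second level.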

In [None]:
model.score(X_t,y_t)

The model achieves an accuracy of 95.59% on the test set. However, the dataset contains very few examples of people who suffered a stroke, so the model's performance could likely be improved by addressing this class imbalance.
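To see why accuracy on its own can be hard to interpret with so few positive examples, consider a toy case where a "model" always predicts the majority class. The labels below are made up for illustration and are not the real class counts of this dataset.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Toy labels: 95 negatives, 5 positives (illustrative, not the real counts).
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)   # share of correct predictions
rec = recall_score(y_true, y_pred)     # share of true positives that were found
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
print(acc, rec)
print(cm)
```

Here accuracy is 0.95 while recall on the positive class is 0.0, so it may be worth reporting `recall_score` or `confusion_matrix` on `y_t` alongside `model.score` before drawing conclusions.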