## Stroke Prediction

This notebook predicts whether or not a person is likely to have a stroke. The dataset comes from Kaggle (https://www.kaggle.com/fedesoriano/stroke-prediction-dataset).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
df=pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
sns.heatmap(df.isna(), cmap='magma')

As we can see from the .info() output, the bmi column is missing roughly 200 values (about 4% of the rows). The heatmap suggests the gaps are small relative to the whole dataset, so we're just going to drop the rows that contain them.
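
To put a number on the missing values instead of reading them off `.info()`, a minimal sketch along these lines may help. The `toy` frame and its values are made up for illustration; on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the stroke dataset.
toy = pd.DataFrame({
    "age": [67, 61, 80, 49],
    "bmi": [36.6, np.nan, 32.5, np.nan],
    "stroke": [1, 1, 1, 0],
})

# Count and share of missing values per column.
missing_per_column = toy.isna().sum()
missing_share = toy.isna().mean()
print(missing_per_column)
print(missing_share)

# dropna() removes every row that has at least one missing value.
cleaned = toy.dropna()
print(len(toy), "->", len(cleaned), "rows")
```

If the share of missing rows were large, imputing `bmi` instead of dropping rows might be worth considering.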

In [None]:
df2=df.dropna()

In [None]:
df2.info()

In [None]:
df2.head()

## Some EDA

In [None]:
sns.countplot(x='gender', data=df2)
print(df2['gender'].value_counts())

In [None]:
sns.countplot(x='stroke', data=df2, palette='rocket', hue='gender')
plt.xlabel('Stroke')
plt.ylabel('Count')
print(df2['stroke'].value_counts())

As we can see, the data is heavily imbalanced. We're going to have to deal with this later on.

In [None]:
sns.countplot(x='Residence_type', data=df2, palette='magma', hue='gender')
plt.xlabel('Residence Type')
plt.ylabel('Count')
plt.title('Classification Based on Residence Type', fontsize=14)
print(df2['Residence_type'].value_counts())

In [None]:
sns.countplot(x='smoking_status', data=df2, palette='viridis', hue='gender')
plt.xlabel('Smoking Status')
plt.ylabel('Count')
plt.title('Classification Based on Smoking Status', fontsize=14)
print(df2['smoking_status'].value_counts())

In [None]:
sns.countplot(x='ever_married', data=df2, palette='mako', hue='gender')
plt.xlabel('Ever Married')
plt.ylabel('Count')
plt.title('Classification Based on Marriage', fontsize=14)
print(df2['ever_married'].value_counts())

In [None]:
sns.countplot(x='work_type', data=df2, palette='viridis', hue='gender')
plt.xlabel('Work Type')
plt.ylabel('Count')
plt.title('Classification Based on Work', fontsize=14)
print(df2['work_type'].value_counts())
plt.legend(loc='upper right')

## Turn Categorical Columns into Numerical Values

Using one-hot encoding, we transform the categorical columns (gender, ever_married, work_type, Residence_type and smoking_status) into numerical columns. Passing drop_first=True drops the first level of each column, since it is implied by the remaining ones.
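
As a small illustration of what `pd.get_dummies` with `drop_first=True` produces, here is a sketch on a made-up frame (`toy` and its values are assumptions, not rows from the dataset). It also shows the `columns=` form, which encodes several columns in one call and prefixes the new column names.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "smoking_status": ["smokes", "never smoked", "formerly smoked", "smokes"],
})

# One column at a time: the alphabetically first level is dropped.
dummies = pd.get_dummies(toy["smoking_status"], drop_first=True)
print(list(dummies.columns))

# Several columns at once: new columns are prefixed with the source column name.
encoded = pd.get_dummies(toy, columns=["gender", "smoking_status"], drop_first=True)
print(list(encoded.columns))
```

The prefixed form avoids name clashes if two categorical columns happen to share a level name such as "Other".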

In [None]:
gender=pd.get_dummies(df2['gender'], drop_first=True)
married=pd.get_dummies(df2['ever_married'], drop_first=True)
work=pd.get_dummies(df2['work_type'], drop_first=True)
reside=pd.get_dummies(df2['Residence_type'], drop_first=True)
smoke=pd.get_dummies(df2['smoking_status'], drop_first=True)

Concatenate the dummy columns with the dataframe (df2) to make a new dataframe (ndf)

In [None]:
ndf=pd.concat([df2,gender,married,work,reside,smoke], axis=1)

In [None]:
ndf.head()

Drop the 'id' column and the original categorical columns

In [None]:
ndf.drop(['id','gender','ever_married','work_type','Residence_type','smoking_status'], axis=1, inplace=True)

In [None]:
ndf.head()

## Deal with the imbalanced data

We're going to oversample the minority class (stroke=1) and undersample the majority class (stroke=0). <br>
<br>
We'll use SMOTE for the oversampling step and RandomUnderSampler for the undersampling step

First we separate the target and the features

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X=ndf.drop('stroke', axis=1)

In [None]:
y=ndf['stroke']

Import SMOTE and RandomUnderSampler

In [None]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

In [None]:
oversample = SMOTE()
undersample = RandomUnderSampler()
steps = [("o", oversample), ("u", undersample)]
pipeline = Pipeline(steps=steps)
# transform the dataset
X, y = pipeline.fit_resample(X, y)

In [None]:
y.value_counts()

Now we do a train/test split with a test size of 30%

In [None]:
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3)

The y.value_counts() output above shows that the two classes in y are now evenly distributed. Now we build the model.

## Building the models

For this project we're going to use Logistic Regression, Random Forest, KNN, and Gaussian Naive Bayes, and then compare the scores of each model.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

Elbow method to find a good value of n_neighbors for KNN

In [None]:
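# Error rate on the test set for each candidate value of n_neighbors
err_rate = []

for i in range(1, 50):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    # Share of test samples that were misclassified
    err_rate.append(np.mean(pred_i != y_test))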

In [None]:
plt.figure(figsize=(12,5))
sns.set_style('whitegrid')
plt.plot(range(1,50),err_rate, color='green', marker='d', ls='--')
plt.xticks(np.arange(1,50,1))
plt.xlabel('n_neighbors')
plt.ylabel('Error rate')

Based on the plot, we take n_neighbors = 2 as the optimal value
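
Choosing n_neighbors by its error on the test set means the test set is no longer a fully independent check. A hedged alternative is to run the same search with cross-validation on the training data only. The sketch below does this on synthetic data; `X_demo`, `y_demo`, `cv_error` and `best_k` are illustrative names, and the range of k is shortened to keep it quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

k_values = list(range(1, 21))
cv_error = []
for k in k_values:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validated accuracy; the error rate is its complement.
    acc = cross_val_score(knn_k, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_error.append(1 - acc.mean())

# Pick the k with the lowest cross-validated error.
best_k = k_values[int(np.argmin(cv_error))]
print("best k by cross-validated error:", best_k)
```

The `cv_error` list can be plotted against `k_values` exactly like `err_rate` in the elbow plot.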

In [None]:
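# max_iter is raised from the default of 100 because the features are unscaled,
# which can keep the solver from converging
lm=LogisticRegression(max_iter=1000)
rfc=RandomForestClassifier()
gnb=GaussianNB()
knn=KNeighborsClassifier(n_neighbors=2)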

In [None]:
lm.fit(X_train,y_train)
rfc.fit(X_train,y_train)
gnb.fit(X_train,y_train)
knn.fit(X_train,y_train)

In [None]:
lmpredict=lm.predict(X_test)
rfcpredict=rfc.predict(X_test)
gnbpredict=gnb.predict(X_test)
knnpredict=knn.predict(X_test)
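
The fit and predict steps for the four models can also be written as one loop over a dictionary, which makes a side-by-side comparison easier. This is only a sketch on synthetic data: `X_demo`, `y_demo`, `models` and `scores` are illustrative names, and F1 is used here as one possible summary score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the resampled stroke data.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=2),
    "Gaussian NB": GaussianNB(),
}

# Fit each model and record its F1 score on the held-out part.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, model.predict(X_te))

# Print the models from best to worst.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} F1 = {score:.3f}")
```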

## Comparison

Use the classification report to determine which model fits this project best.

In [None]:
from sklearn.metrics import classification_report

In [None]:
print('Classification report for Logistic Regression')
print(classification_report(y_test, lmpredict))  # true labels first, predictions second

In [None]:
print('Classification report for Random Forest Classifier')
print(classification_report(y_test, rfcpredict))  # true labels first, predictions second

In [None]:
print('Classification report for KNN')
print(classification_report(y_test, knnpredict))  # true labels first, predictions second

In [None]:
print('Classification report for GNB')
print(classification_report(y_test, gnbpredict))  # true labels first, predictions second
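
A note on argument order: scikit-learn's metric functions, including `classification_report`, expect the true labels first and the predictions second. Passing them the other way round swaps precision and recall. The tiny hand-made example below (the labels are made up) shows the effect with `precision_score` and `recall_score`.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 2 true positives in the data, 3 positive predictions.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# Correct order: (y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# Swapped order: precision and recall trade places.
precision_swapped = precision_score(y_pred, y_true)
recall_swapped = recall_score(y_pred, y_true)

print(f"correct order: precision={precision:.3f}, recall={recall:.3f}")
print(f"swapped order: precision={precision_swapped:.3f}, recall={recall_swapped:.3f}")
```

Accuracy and F1 are unaffected by the swap, but the per-class precision and recall columns of the report are, and those matter most for the stroke=1 class.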

## Thank You