In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Predicting Stroke Using Machine Learning

This notebook predicts whether or not someone is a stroke candidate based on their medical background, using selected data science libraries and machine learning classification models.

The notebook consists of 4 parts:

<a id="top"></a>
<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Table of Contents</h3>

* [1. Preparing the Tools](#1)
* [2. About the Dataset](#2)
* [3. Data Exploration (Exploratory Data Analysis or EDA)](#3)
    - [3.1 Focus on the categorical features](#3.1)
    - [3.2 Focus on the numerical features](#3.2)
    - [3.3 Focus on cross features](#3.3)
* [4. Detailed Analysis and Model Selection](#4) 
    - [4.1 Preparing the data and the Correlation Matrix](#4.1)
    - [4.2 Model Selection](#4.2)
    

------------------------------------------------------------------------------------------------------------------
First, let's look at our data in detail:
1. Problem definition
2. Data
3. Evaluation
4. Features


## 1. Problem Definition
According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.

This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

Our top priority in this health problem is to correctly identify the patients with a stroke.

## 2. Data

The dataset was downloaded from Kaggle: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset

## 3. Evaluation

Models are evaluated with the F1 score, because the output classes are heavily imbalanced and accuracy alone would be misleading.
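
As a reminder of what this metric measures, here is a minimal sketch of how F1 is derived from confusion-matrix counts. The function name `f1_from_counts` and the example counts are illustrative assumptions, not results from this dataset.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall, computed from raw counts."""
    if tp == 0:
        # Without true positives, precision and recall are 0 (or undefined)
        return 0.0
    precision = tp / (tp + fp)  # share of predicted strokes that are real
    recall = tp / (tp + fn)     # share of real strokes that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only
example_f1 = f1_from_counts(tp=30, fp=90, fn=20)
print(round(example_f1, 3))
```

Because true negatives never enter the formula, a model cannot score well on F1 simply by labelling everyone as healthy.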

## 4. Features

Information about the data :
 
1. id: unique identifier
2. gender: "Male", "Female" or "Other"
3. age: age of the patient
4. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5. heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6. ever_married: "No" or "Yes"
7. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8. Residence_type: "Rural" or "Urban"
9. avg_glucose_level: average glucose level in blood
10. bmi: body mass index
11. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
12. stroke: 1 if the patient had a stroke or 0 if not

\*Note: "Unknown" in smoking_status means that the information is unavailable for this patient

<a id="1"></a>
<font color="darkslateblue" size=+2.5><b>1. Preparing the Tools</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>


We're going to use pandas, Matplotlib, seaborn and NumPy for data analysis and manipulation, then scikit-learn to create the models.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid", color_codes=True)
plt.rc("font", size=14)

# We want our plots to appear inside the notebook
%matplotlib inline

# Models from scikit-learn
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model selection and evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import accuracy_score

# plot_roc_curve was removed in scikit-learn 1.2. RocCurveDisplay.from_estimator
# (available since 1.0) takes the same (estimator, X, y) arguments, so the old
# name is kept as an alias for the cells that call it.
from sklearn.metrics import RocCurveDisplay
plot_roc_curve = RocCurveDisplay.from_estimator

<a id="2"></a>
<font color="darkslateblue" size=+2.5><b>2. About the Dataset</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

In [None]:
df = pd.read_csv("/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")

#dimension of data
print(df.shape)

<a id="3"></a>
<font color="darkslateblue" size=+2.5><b>3. Data Exploration (exploratory data analysis or EDA)</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>


The goal here is to find out more about the data and become a subject matter expert on the dataset you're working with.

1. What is the shape of the target variable?
2. What kind of data do we have and how do we treat different types?
3. What's missing from the data and how do you deal with it?
4. Where are the outliers and why should you care about them?
5. How can you add, change or remove features to get more out of your data?

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
# Let's find out how many samples of each class there are
df["stroke"].value_counts().plot(kind="bar",color=["red", "blue"]);
df["stroke"].value_counts()

We can see that the target variable "stroke" is very imbalanced: 249 stroke patients versus 4861 healthy people.
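
To make the consequence of this imbalance concrete, the short sketch below uses the two class counts quoted above to score a trivial "model" that predicts "no stroke" for everyone. The variable names are illustrative.

```python
n_stroke, n_healthy = 249, 4861   # class counts reported above
n_total = n_stroke + n_healthy    # 5110 rows in total

# A trivial baseline that predicts "no stroke" for every patient
baseline_accuracy = n_healthy / n_total  # every healthy patient counts as correct
baseline_recall = 0 / n_stroke           # no stroke patient is ever found

print(f"stroke share:      {n_stroke / n_total:.2%}")
print(f"baseline accuracy: {baseline_accuracy:.2%}")
print(f"baseline recall:   {baseline_recall:.2%}")
```

The baseline is right for roughly 95% of the rows while finding no stroke patient at all, which is why accuracy is not used as the main metric here.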

In [None]:
df.info() 

In [None]:
# Percentage of missing values per column
round(df.isnull().sum() / df.shape[0] * 100, 2)  # 3.93% of the BMI values are missing

In [None]:
df.isnull().sum().sum()  # BMI is missing for 201 patients

In [None]:
# Let's impute the missing BMI values with the column mean
from sklearn.impute import SimpleImputer

val = ["bmi"]
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
df[val] = imputer.fit_transform(df[val])

In [None]:
# df now has imputed BMI values
df.describe()

In [None]:
# Let's drop the ID column, which carries no predictive information
data = df.drop(columns=["id"])

In [None]:
data.head()

In [None]:
print(data.shape)

<a id="3.1"></a>
<font color="dimgrey" size=+2.0><b>3.1 Focus on the categorical features</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>


In [None]:
cols = list(data.columns)
cols

In [None]:
cat_data = [x for x in data.columns if data[x].dtype == "object"]
num_data = [y for y in data.columns if data[y].dtype != "object"]

In [None]:
for i in cat_data:
    print(i," = ",data[i].unique())

In [None]:
cat_data

In [None]:
num_data

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
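# Count plot for each categorical feature
for i in cat_data:
    plt.figure(figsize=(8,5))
    # Pass the column by keyword; newer seaborn versions no longer accept it positionally
    sns.countplot(data=data, x=i, palette="pastel")
    plt.title(i, fontsize=15, color="b")
    plt.show()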

In [None]:
data['gender'].value_counts()

In [None]:
# Remove the single row whose gender is "Other"
data = data[data.gender != "Other"]

<a id="3.2"></a>
<font color="dimgrey" size=+2.0><b>3.2. Focus on the numerical features</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>


In [None]:
# Check the distribution of the age column with a histogram
data.age.plot.hist();

In [None]:
# Check the distribution of the age, hypertension, heart_disease, avg_glucose_level, bmi and stroke columns with histograms
for i in num_data:
    plt.figure(figsize=(8,5))
    sns.histplot(data[i], kde=True)
    plt.title(i, fontsize=12, color="r")
    plt.show()

<a id="3.3"></a>
<font color="dimgrey" size=+2.0><b>3.3. Focus on cross features</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

In [None]:
sns.displot(data, x='age', y='bmi', height=6, aspect=1)

BMI observations are densest at young ages and at ages above 60. The data also contains some outliers.
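
One common way to flag such outliers is the interquartile-range (IQR) rule. The sketch below is a minimal standard-library version; the helper name `iqr_outliers`, the multiplier `k=1.5` and the sample values are assumptions for illustration and are not taken from the dataset.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical BMI values, for illustration only
bmi_sample = [18.5, 21.0, 22.4, 24.9, 26.3, 27.8, 29.1, 31.5, 33.0, 92.0]
flagged = iqr_outliers(bmi_sample)
print(flagged)
```

In this notebook the same helper could be applied to `data["bmi"].tolist()`. Whether flagged values should be kept, capped or dropped is a separate modelling decision.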

In [None]:
g = sns.FacetGrid(data, row="heart_disease", col="hypertension", hue="gender", height=4, aspect=1.4, palette="viridis")
g.map(sns.scatterplot, "age", "bmi")
g.add_legend()

From this plot alone, we cannot really say that a high BMI combined with heart disease causes stroke.

In [None]:
g = sns.FacetGrid(data, row="ever_married", col="stroke", hue="work_type", height=4)
g.map(sns.scatterplot, "avg_glucose_level", "age")
g.add_legend()

According to the data at hand, strokes occur mostly among married people (interesting :)).

<a id="4"></a>
<font color="darkslateblue" size=+2.5><b>4. Detailed Analysis and Model Selection</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>


<a id="4.1"></a>
<font color="dimgrey" size=+2.0><b>4.1. Preparing the Data and the Correlation Matrix</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>


In [None]:
data_dummies = pd.get_dummies(data, columns=['gender','ever_married', 'Residence_type','work_type','smoking_status'], drop_first=True)

In [None]:
print(data_dummies.shape)
data_dummies.head()

In [None]:
# Let's make a pretty correlation matrix 
corr_matrix = data_dummies.corr()
fig, ax = plt.subplots(figsize=(15, 10))
ax = sns.heatmap(corr_matrix,
                 annot=True,
                 linewidths=0.5,
                 fmt=".2f",
                 cmap="YlGnBu");
# Workaround for matplotlib 3.1.1, which cut off the top and bottom rows of heatmaps
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)

<a id="4.2"></a>
<font color="dimgrey" size=+2.0><b>4.2. Model Selection</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>


In [None]:
from sklearn.model_selection import train_test_split,cross_val_predict,StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report,roc_auc_score,roc_curve
from imblearn.pipeline import Pipeline as imbPipe
from imblearn.over_sampling import SMOTE

In [None]:
# Split data into train and test sets
np.random.seed(42)

X = data_dummies.drop("stroke", axis=1)
y = data_dummies["stroke"]

# Stratify so that train and test keep the same (small) share of stroke cases
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=42)

In [None]:
X_train

In [None]:
y_train, len(y_train)

---
* **1. Logistic Regression**
---
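
Each pipeline in this section chains a scaler, SMOTE and a classifier. The sketch below illustrates, under simplifying assumptions, the core idea behind SMOTE: a synthetic minority sample is placed on the line segment between a real minority sample and one of its minority-class neighbours. It is not imbalanced-learn's implementation; `synthetic_point` and the sample values are hypothetical.

```python
import random


def synthetic_point(sample, neighbour, rng):
    """Interpolate between a minority sample and one of its neighbours."""
    lam = rng.random()  # position along the segment, in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbour)]


rng = random.Random(42)

# Two hypothetical, already scaled minority-class rows: (age, avg_glucose_level)
a = [1.2, 0.8]
b = [1.6, 1.4]

new_row = synthetic_point(a, b, rng)
print(new_row)
```

Because the sampler sits inside an imbalanced-learn pipeline, resampling is applied only while fitting, so each cross-validation fold is still evaluated on untouched data.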


In [None]:
# We build a pipeline with a standard scaler, SMOTE and the model.
# The scaler puts all features on a comparable scale;
# SMOTE oversamples the minority class because our data is imbalanced.
# (SMOTE's n_jobs argument is omitted: recent imbalanced-learn releases no longer accept it.)

Logistic_pipeline = imbPipe([
    ("scaler", StandardScaler()),
    ("smote", SMOTE(random_state=42)),
    ("logistic", LogisticRegression(solver='lbfgs', max_iter=1000))
])

# Out-of-fold predictions on the training set, so each row is scored by a model that never saw it
y_pred = cross_val_predict(Logistic_pipeline, X_train, y_train, cv=3)
print(classification_report(y_train, y_pred))

In [None]:
# Fit on the training set, then plot the ROC curve and AUC on the test set
Logistic_pipeline.fit(X_train, y_train)
plot_roc_curve(Logistic_pipeline, X_test, y_test)

In [None]:
# Class predictions on the held-out test set
y_pred = Logistic_pipeline.predict(X_test)

In [None]:
sns.set(font_scale=1.5)

def plot_conf_mat(y_test, y_pred):
    """
    Plots a nice looking confusion matrix using Seaborn's heatmap()
    """
    fig, ax = plt.subplots(figsize=(3, 3))
    ax = sns.heatmap(confusion_matrix(y_test, y_pred),
                     annot=True,
                     cbar=False,
                     fmt='g')
    # confusion_matrix puts true labels on rows (y-axis) and predictions on columns (x-axis)
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    
    # Workaround for matplotlib 3.1.1, which cut off the top and bottom rows of heatmaps
    bottom, top = ax.get_ylim()
    ax.set_ylim(bottom + 0.5, top - 0.5)
    
plot_conf_mat(y_test, y_pred)

---
* **2. KNN**
---

In [None]:
# Same scaler + SMOTE pipeline, with a k-nearest-neighbours classifier (default settings)
KNN_pipeline = imbPipe([
    ("scaler", StandardScaler()),
    ("smote", SMOTE(random_state=42)),
    ("KNN", KNeighborsClassifier())
])

y_pred = cross_val_predict(KNN_pipeline, X_train, y_train, cv=3)
print(classification_report(y_train, y_pred))

---
* **3. Random Forest**
---

In [None]:
# Same scaler + SMOTE pipeline, with a random forest classifier (default settings)
Random_Forest_pipeline = imbPipe([
    ("scaler", StandardScaler()),
    ("smote", SMOTE(random_state=42)),
    ("Random Forest", RandomForestClassifier(random_state=42))
])

y_pred = cross_val_predict(Random_Forest_pipeline, X_train, y_train, cv=3)
print(classification_report(y_train, y_pred))

Let's find the best hyperparameters for the Random Forest.
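
Before running an exhaustive search it helps to know how many models it will train. The sketch below counts the candidates for the grid used in this notebook's search cell (two values for each of four hyperparameters, with 3-fold cross-validation); the variable names are illustrative.

```python
from itertools import product

# Same grid values as in the search cell of this notebook
params = {
    "rfc__n_estimators": [100, 200],
    "rfc__max_features": [7, 8],
    "rfc__min_samples_leaf": [5, 6],
    "rfc__min_samples_split": [15, 20],
}
cv_folds = 3

# Every combination of one value per hyperparameter is one candidate model
combinations = list(product(*params.values()))
n_fits = len(combinations) * cv_folds

print(len(combinations), "candidates,", n_fits, "cross-validation fits")
```

With its default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full training set after the search.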

In [None]:
Random_Forest_pipeline = imbPipe([
    ("scaler", StandardScaler()),
    ("smote", SMOTE(random_state=42)),
    ("rfc", RandomForestClassifier(random_state=42))
])

# Hyperparameter grid; the "rfc__" prefix routes each setting to the pipeline's "rfc" step
params = {
    'rfc__n_estimators': [100, 200],
    'rfc__max_features': [7, 8],
    'rfc__min_samples_leaf': [5, 6],
    'rfc__min_samples_split': [15, 20]
}

# Exhaustive search over the grid, scored by F1 with 3-fold cross-validation
rfc_grid = GridSearchCV(Random_Forest_pipeline, params, cv=3, n_jobs=-1, scoring="f1")
rfc_grid.fit(X_train, y_train)
print("Best Parameters for Model:  ", rfc_grid.best_params_)

In [None]:
# Let's plug the best parameters into the Random Forest model
Random_Forest_pipeline = imbPipe([
    ("scaler", StandardScaler()),
    ("smote", SMOTE(random_state=42)),
    ("Random Forest", RandomForestClassifier(random_state=42, max_features=7, min_samples_leaf=5, min_samples_split=15, n_estimators=100))
])

y_pred = cross_val_predict(Random_Forest_pipeline, X_train, y_train, cv=3)
print(classification_report(y_train, y_pred))

With the best parameters, we have rerun the model. The classification results are still very poor, although the ROC curve looks good.
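
One possible explanation for this gap is that ROC AUC only measures how well the model ranks stroke cases above healthy ones across all thresholds, while F1 is computed at a single cut-off (0.5 by default). The sketch below illustrates this with hypothetical predicted probabilities; `roc_auc`, `f1_at` and all values are illustrative assumptions, not output of the models above.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_at(labels, scores, threshold):
    """F1 of the positive class when predicting 1 for scores >= threshold."""
    tp = sum(y == 1 and s >= threshold for y, s in zip(labels, scores))
    fp = sum(y == 0 and s >= threshold for y, s in zip(labels, scores))
    fn = sum(y == 1 and s < threshold for y, s in zip(labels, scores))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Hypothetical predicted probabilities: 2 stroke cases among 10 patients
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.90, 0.80, 0.85, 0.60, 0.55, 0.30, 0.20, 0.10, 0.10, 0.05]

print("AUC:", roc_auc(labels, scores))
print("F1 at 0.5:", round(f1_at(labels, scores, 0.5), 3))
print("F1 at 0.8:", round(f1_at(labels, scores, 0.8), 3))
```

In this toy case the ranking is nearly perfect, yet the default cut-off lets several healthy patients through as false positives. If the same pattern holds for the real model, tuning the decision threshold on validation data is one option worth trying.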

In [None]:
# Fit on the training set, then plot the ROC curve and AUC on the test set
Random_Forest_pipeline.fit(X_train, y_train)
plot_roc_curve(Random_Forest_pipeline, X_test, y_test)

In [None]:
# Class predictions and per-class metrics on the held-out test set
y_pred_test = Random_Forest_pipeline.predict(X_test)
print(classification_report(y_test, y_pred_test))

In [None]:
sns.set(font_scale=1.5)

def plot_conf_mat(y_test, y_pred):
    """
    Plots a nice looking confusion matrix using Seaborn's heatmap()
    """
    fig, ax = plt.subplots(figsize=(3, 3))
    ax = sns.heatmap(confusion_matrix(y_test, y_pred),
                     annot=True,
                     cbar=False,
                     fmt='g')
    # confusion_matrix puts true labels on rows (y-axis) and predictions on columns (x-axis)
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    
    # Workaround for matplotlib 3.1.1, which cut off the top and bottom rows of heatmaps
    bottom, top = ax.get_ylim()
    ax.set_ylim(bottom + 0.5, top - 0.5)
    
plot_conf_mat(y_test, y_pred_test)

We can see that the model is not overfitted, but more detailed work is needed to improve the F1 score.