<h1> <div align="center"> Stroke Predictions with Machine Learning</div> </h1>

![stroke](https://cms.sehatq.com/public/img/article_img/stroke-hemoragik-penyakit-mematikan-yang-bisa-datang-tiba-tiba-1557716574.jpg)

# Introduction

A stroke occurs when the blood supply to part of the brain is interrupted or reduced, which prevents brain tissue from getting oxygen and nutrients. A stroke is a medical emergency. According to the Centers for Disease Control and Prevention (CDC), stroke is the fifth-leading cause of death in the United States. Every year, more than 795,000 people in the United States have a stroke.

Strokes are commonly grouped into three types by cause: ischemic stroke, hemorrhagic stroke, and transient ischemic attack (TIA).

* Ischemic Stroke

    Ischemic stroke happens when the brain's blood vessels become narrowed or blocked, causing severely reduced blood flow (ischemia). It can also be caused by pieces of plaque from atherosclerosis breaking off and blocking a blood vessel.
    The two most common types of ischemic stroke are thrombotic and embolic.
    
* Hemorrhagic Stroke

    Hemorrhagic stroke occurs when a blood vessel in the brain leaks or ruptures. Brain hemorrhages can result from many conditions that affect the blood vessels. According to the American Heart Association, about 13 percent of strokes are hemorrhagic.
    
* Transient Ischemic Attack (TIA)

    A transient ischemic attack (TIA) is sometimes known as a ministroke. It is a temporary period of symptoms similar to those of a stroke and does not cause permanent damage. A TIA is caused by a temporary decrease in blood supply to part of the brain, which may last as little as five minutes.
    
This notebook aims to build and compare machine learning models that can help identify whether a patient is likely to suffer a stroke.

# Prepare the Packages

In [None]:
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report, f1_score, accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.utils import resample 

# Data Profiling

In [None]:
dataset = pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')

In [None]:
dataset.head()

In [None]:
dataset.info()

In [None]:
dataset.shape

In [None]:
dataset.describe()

# Data Cleansing and Preprocessing

In [None]:
dataset.isnull().sum()

In [None]:
dataset.bmi = dataset.bmi.fillna(dataset.bmi.mean())

In [None]:
dataset = dataset.drop('id', axis=1)

In [None]:
stroke_dataset = dataset.copy()
stroke_dataset['Stroke'] = stroke_dataset.stroke.copy()
stroke_dataset = stroke_dataset.drop('stroke', axis=1)
stroke_dataset.hypertension = stroke_dataset.hypertension.apply(lambda x: 'Hypertension' if x == 1 else 'No Hypertension')
stroke_dataset.heart_disease = stroke_dataset.heart_disease.apply(lambda x: 'Heart Disease' if x == 1 else 'No Heart Disease')
stroke_dataset.ever_married = stroke_dataset.ever_married.apply(lambda x: 'Married' if x == 'Yes' else 'Unmarried')
stroke_dataset.Stroke = stroke_dataset.Stroke.apply(lambda x: 'Stroke' if x == 1 else 'No Stroke')
stroke_dataset.head()

# Exploratory Data Analysis (EDA)

In [None]:
dataset.stroke.value_counts()

In [None]:
sns.countplot(x='Stroke', data=stroke_dataset, palette='RdYlGn', edgecolor='black')
plt.title('Patient Distribution (Stroke or No Stroke)', loc='center', pad=30, fontsize=14)
plt.xlabel('Patient Type')
plt.ylabel('Count of Patient')
plt.show()

In [None]:
fig, (ax1, ax2, ax3)  = plt.subplots(1, 3, figsize=(18,6))
plt.text(-120,1000, 'Numerical Features Distribution By Stroke and No Stroke', horizontalalignment='center', fontsize=20)

sns.histplot(x='age', data=stroke_dataset, hue='Stroke', palette='RdYlGn', bins=20, edgecolor='black', ax=ax1)
ax1.set(xlabel='Age', ylabel='Count of Patient')

sns.histplot(x='bmi', data=stroke_dataset, hue='Stroke', palette='RdYlGn', bins=20, edgecolor='black', ax=ax2)
ax2.set(xlabel='Body Mass Index', ylabel='Count of Patient')

sns.histplot(x='avg_glucose_level', data=stroke_dataset, hue='Stroke', palette='RdYlGn', bins=20, edgecolor='black', ax=ax3)
ax3.set(xlabel='Avg Glucose Level', ylabel='Count of Patient')
plt.show()

In [None]:
fig, ax = plt.subplots(3, 3, figsize=(15,12))

plt.text(-3,6500, 'Non-Numerical Features Distribution By Stroke and No Stroke', horizontalalignment='center', fontsize=20)
sns.countplot(x='gender', data=stroke_dataset, hue='Stroke', palette='autumn', ax=ax[0][0], edgecolor='black')
ax[0][0].set(xlabel='Patient Gender', ylabel='Count of Patient')

sns.countplot(x='ever_married', data=stroke_dataset, hue='Stroke', palette='autumn', ax=ax[0][1], edgecolor='black')
ax[0][1].set(xlabel='Patient Marital Status', ylabel='Count of Patient')

sns.countplot(x='hypertension', data=stroke_dataset, hue='Stroke', palette='autumn', ax=ax[0][2], edgecolor='black')
ax[0][2].set(xlabel=' Patient Hypertension Status', ylabel='Count of Patient')

sns.countplot(x='heart_disease', data=stroke_dataset, hue='Stroke', palette='autumn', ax=ax[1][1], edgecolor='black')
ax[1][1].set(xlabel='Patient Heart Disease Status', ylabel='Count of Patient')

sns.countplot(x='Residence_type', data=stroke_dataset, hue='Stroke', palette='autumn', ax=ax[2][0], edgecolor='black')
ax[2][0].set(xlabel='Patient Residence Type', ylabel='Count of Patient')

sns.countplot(x='work_type', data=stroke_dataset, hue='Stroke', palette='autumn', ax=ax[2][1], edgecolor='black')
ax[2][1].set(xlabel='Patient Work Type', ylabel='Count of Patient')
labels = ['Private', 'Self employed', 'Govt job', 'Children', 'Never worked']
ax[2][1].set_xticklabels(labels, rotation=30, horizontalalignment='center')

sns.countplot(x='smoking_status', data=stroke_dataset, hue='Stroke', palette='autumn', ax=ax[2][2], edgecolor='black')
ax[2][2].set(xlabel='Patient Smoking Status', ylabel='Count of Patient')
labels = ['Formerly smoked', 'Never smoked', 'Smokes', 'Unknown',]
ax[2][2].set_xticklabels(labels, rotation=30, horizontalalignment='center')
plt.show()

## Insight
* Most patients who have suffered a stroke are senior citizens.
* Most patients who have suffered a stroke have a body mass index between 20 and 40.
* Body mass index and average glucose level contain many outliers, which need to be handled before modeling.
* The classes are heavily imbalanced: about 95% of patients have not suffered a stroke and about 5% have. This imbalance can bias a model toward predicting "no stroke", so the classes should be balanced before training.
* There is no clear association between a patient's gender and whether they have suffered a stroke.
* Most patients who have suffered a stroke (88%) have been married.
* Patients with hypertension appear more likely to have suffered a stroke.
* Patients with heart disease appear more likely to have suffered a stroke.
* There is no clear association between a patient's residence type and whether they have suffered a stroke.
* Children and patients who have never worked are less likely to have suffered a stroke.
* Patients who formerly smoked or who currently smoke are more likely to have suffered a stroke.
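
The outlier and class-imbalance observations above can also be checked numerically. The following is a minimal sketch, assuming the common 1.5 × IQR rule for flagging outliers (the notebook does not state how its own cut-offs were chosen). It uses a small made-up frame `toy` in place of the real data, and `iqr_bounds` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences using the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values for illustration only, not taken from the stroke dataset.
toy = pd.DataFrame({
    'bmi': [22.0, 25.0, 27.0, 30.0, 24.0, 26.0, 28.0, 60.0],
    'stroke': [0, 0, 0, 0, 0, 0, 0, 1],
})

# Rows whose bmi falls outside the fences are flagged as outliers.
lower, upper = iqr_bounds(toy['bmi'])
outliers = toy[(toy['bmi'] < lower) | (toy['bmi'] > upper)]

# Share of each class, as fractions that sum to 1.
class_share = toy['stroke'].value_counts(normalize=True)

print(lower, upper)
print(outliers)
print(class_share)
```

On the real data, the same two checks would be applied to `dataset['bmi']`, `dataset['avg_glucose_level']`, and `dataset['stroke']`.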

# Data Modeling

In [None]:

# Drop rows whose bmi or average glucose level falls outside the chosen outlier bounds
dataset = dataset[((dataset['bmi'] > 10.3) & (dataset['bmi'] < 43))]
dataset = dataset[((dataset['avg_glucose_level'] > 31 ) & (dataset['avg_glucose_level'] < 130))]


fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,2))
sns.boxplot(x=dataset['bmi'], palette='autumn', ax=ax1)
ax1.set(xlabel='Body Mass Index')
sns.boxplot(x=dataset['avg_glucose_level'], palette='autumn', ax=ax2)
ax2.set(xlabel='Average Glucose Level')
plt.show()

In [None]:
majority_data = dataset[dataset['stroke'] == 0]
minority_data = dataset[dataset['stroke'] == 1]

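# Upsample the minority class with replacement; a fixed seed keeps the result reproducible
upsampled = resample(minority_data, replace=True, n_samples=len(majority_data), random_state=42)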

In [None]:
new_stroke_dataset = pd.concat([majority_data,upsampled])
new_stroke_dataset = new_stroke_dataset.sample(frac=1).reset_index(drop=True)
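
One caveat with upsampling the whole dataset before splitting: duplicated minority rows can land in both the training and the test set, which tends to inflate test scores. A minimal sketch of the alternative ordering (split first, then upsample only the training part) follows. It uses a made-up frame `toy` and hypothetical variable names, and it is not the procedure used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Made-up imbalanced data for illustration: 90 negatives, 10 positives.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'age': rng.integers(20, 80, size=100),
    'stroke': [0] * 90 + [1] * 10,
})

# 1. Split first, stratifying so both parts keep the class ratio.
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy['stroke'], random_state=42
)

# 2. Upsample the minority class inside the training part only.
train_major = train[train['stroke'] == 0]
train_minor = train[train['stroke'] == 1]
train_minor_up = resample(
    train_minor, replace=True, n_samples=len(train_major), random_state=42
)

# 3. Recombine and shuffle; the test part is left untouched.
train_balanced = pd.concat([train_major, train_minor_up]).sample(frac=1, random_state=42)

print(train_balanced['stroke'].value_counts())
print(test['stroke'].value_counts())
```

With this ordering the test set keeps the original class ratio, so the reported metrics reflect how the model would behave on unbalanced data.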

In [None]:
for column in new_stroke_dataset.columns:
    # Leave numeric columns unchanged; label-encode only the categorical ones
    if pd.api.types.is_numeric_dtype(new_stroke_dataset[column]): continue
    new_stroke_dataset[column] = LabelEncoder().fit_transform(new_stroke_dataset[column])

In [None]:
X = new_stroke_dataset.drop(['stroke'], axis=1)
y = new_stroke_dataset['stroke']

In [None]:
X_train, X_test, y_train, y_test =  train_test_split(X, y, test_size = 0.25, random_state = 42)

# Fit the scaler on the training data only, then apply it to both splits
normalize = MinMaxScaler().fit(X_train)

X_train = normalize.transform(X_train)
X_test = normalize.transform(X_test)

In [None]:
log_model = LogisticRegression()
log_model = log_model.fit(X_train, y_train)
y_predict = log_model.predict(X_test)

In [None]:
lr_precision = precision_score(y_test, y_predict)
lr_recall = recall_score(y_test, y_predict)
lr_f1_score = f1_score(y_test, y_predict)
lr_accuracy = accuracy_score(y_test, y_predict)

In [None]:
lr_confu_matrix = pd.DataFrame((confusion_matrix(y_test, y_predict)),('No Stroke', 'Stroke'), ('No Stroke', 'Stroke'))

In [None]:
rf_model = RandomForestClassifier()
rf_model = rf_model.fit(X_train, y_train)
y_predict = rf_model.predict(X_test)

In [None]:
rf_precision = precision_score(y_test, y_predict)
rf_recall = recall_score(y_test, y_predict)
rf_f1_score = f1_score(y_test, y_predict)
rf_accuracy = accuracy_score(y_test, y_predict)

In [None]:
rf_confu_matrix = pd.DataFrame((confusion_matrix(y_test, y_predict)),('No Stroke', 'Stroke'), ('No Stroke', 'Stroke'))

In [None]:
gbt_model = GradientBoostingClassifier()
gbt_model = gbt_model.fit(X_train, y_train)
y_predict = gbt_model.predict(X_test)

In [None]:
gbt_precision = precision_score(y_test, y_predict)
gbt_recall = recall_score(y_test, y_predict)
gbt_f1_score = f1_score(y_test, y_predict)
gbt_accuracy = accuracy_score(y_test, y_predict)

In [None]:
gbt_confu_matrix = pd.DataFrame((confusion_matrix(y_test, y_predict)),('No Stroke', 'Stroke'), ('No Stroke', 'Stroke'))

In [None]:
dt_model = DecisionTreeClassifier()
dt_model = dt_model.fit(X_train, y_train)
y_predict = dt_model.predict(X_test)

In [None]:
dt_precision = precision_score(y_test, y_predict)
dt_recall = recall_score(y_test, y_predict)
dt_f1_score = f1_score(y_test, y_predict)
dt_accuracy = accuracy_score(y_test, y_predict)

In [None]:
dt_confu_matrix = pd.DataFrame((confusion_matrix(y_test, y_predict)),('No Stroke', 'Stroke'), ('No Stroke', 'Stroke'))

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(15,12))
plt.text(-0.5,-2.7, 'Confusion Matrix', horizontalalignment='center', fontsize=20)

sns.heatmap(lr_confu_matrix, annot=True, annot_kws={'size': 14}, fmt='d', cmap='YlGnBu', ax=ax[0][0])
ax[0][0].set(xlabel='Predicted label', ylabel='True Label', title='Logistic Regression')

sns.heatmap(rf_confu_matrix, annot=True, annot_kws={'size': 14}, fmt='d', cmap='YlGnBu', ax=ax[0][1])
ax[0][1].set(xlabel='Predicted label', ylabel='True Label', title='Random Forest Classifier')

sns.heatmap(gbt_confu_matrix, annot=True, annot_kws={'size': 14}, fmt='d', cmap='YlGnBu', ax=ax[1][0])
ax[1][0].set(xlabel='Predicted label', ylabel='True Label', title='Gradient Boosting Classifier')

sns.heatmap(dt_confu_matrix, annot=True, annot_kws={'size': 14}, fmt='d', cmap='YlGnBu', ax=ax[1][1])
ax[1][1].set(xlabel='Predicted label', ylabel='True Label', title='Decision Tree Classifier')
plt.show()

In [None]:
score = { 'Logistic Regression' : [lr_precision, lr_recall, lr_f1_score, lr_accuracy],
          'Random Forest' : [rf_precision, rf_recall, rf_f1_score, rf_accuracy], 
          'Gradient Boosting' : [gbt_precision, gbt_recall, gbt_f1_score, gbt_accuracy],
          'Decision Tree' : [dt_precision, dt_recall, dt_f1_score, dt_accuracy],
          'Metrics' : ['Precision', 'Recall', 'F1 Score', 'Accuracy'] }

score_df = pd.DataFrame(data=score)
score_df = score_df.set_index('Metrics')
score_df
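
For reference, the four metrics in the table follow directly from the confusion-matrix counts. The short sketch below uses made-up labels (not results from this notebook), recomputes each metric by hand, and prints the scikit-learn values alongside for comparison.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                           # correct among predicted positives
recall = tp / (tp + fn)                              # found among actual positives
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)           # correct among all predictions

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
print(accuracy, accuracy_score(y_true, y_pred))
```

Because missing a stroke (a false negative) is usually the costlier error, recall is often the metric to watch most closely when comparing the four models.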