# Predicting a Stroke

This short notebook illustrates how **SMOTE** (Synthetic Minority Over-sampling Technique), combined with a `Random Forest` classifier, can improve accuracy on an imbalanced classification problem.

We call a classification problem imbalanced when one or more target classes have a significantly smaller number of observations. In our case we have:
 - **4860** observations with class `0` (patient had no stroke)
 - **249** observations with class `1` (patient had a stroke)
 
This complicates the researcher's life: an ML model can take the path of least resistance, predict that the patient had no stroke **every time**, and still be correct approximately 95% of the time (4860 of 5109 observations). I will illustrate how `Random Forest` behaves under these circumstances.
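
The arithmetic behind that "always predict no stroke" baseline can be checked directly from the class counts listed above. This is only a sanity check, not a model; the variable names are mine.

```python
# Class counts quoted in the list above.
n_no_stroke = 4860
n_stroke = 249

# A "model" that always predicts class 0 is right on every class-0 row
# and wrong on every class-1 row.
baseline_accuracy = n_no_stroke / (n_no_stroke + n_stroke)
baseline_recall = 0 / n_stroke  # no stroke patient is ever flagged

print(f"Always-predict-0 accuracy: {baseline_accuracy:.3f}")
print(f"Always-predict-0 recall on class 1: {baseline_recall:.3f}")
```

High accuracy and zero recall on the class we care about can coexist, which is why accuracy alone is a poor yardstick here.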

Predicting that a patient had no stroke is a good thing, but predicting whether they will have a stroke might be even more important. As researchers, we have to find the right balance between `true positives` and `false positives`, or other metrics, depending on where our interest lies.
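
To make that balance concrete, here is a minimal sketch of precision and recall computed from confusion-matrix counts. The helper `precision_recall` and the counts passed to it are hypothetical, made up for illustration; they are not results from this notebook.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts, for illustration only.
cautious = precision_recall(tp=10, fp=2, fn=40)   # rarely raises an alarm
eager = precision_recall(tp=45, fp=60, fn=5)      # raises many alarms
print(f"cautious: precision={cautious[0]:.2f}, recall={cautious[1]:.2f}")
print(f"eager:    precision={eager[0]:.2f}, recall={eager[1]:.2f}")
```

The cautious model is usually right when it flags someone but misses most strokes; the eager one catches most strokes at the cost of more false alarms. The zero-denominator guard matters for a model that never predicts class `1`.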

So let's get started.

## TL;DR

On imbalanced data, the `Random Forest` classifier identified on average 0-5% of `true positives` (patient has a stroke), which is pretty bad, considering we would like to warn a person at risk of a stroke beforehand. It had no problem correctly identifying patients with no stroke, of course. With oversampling, `Random Forest` was able to identify ~90% of `true positives`, and the number of `false negatives` also decreased.

# Importing Libraries

In [None]:
import os
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        file = os.path.join(dirname, filename)

# 1. Loading Data

In [None]:
df = pd.read_csv(file)

In [None]:
df.head()

In [None]:
df.stroke.value_counts()

# 2. Preprocessing

## 2.1 Dealing With Missing Values

In [None]:
df.isna().sum()

In [None]:
df.bmi = df.bmi.fillna(df.bmi.median()) # => 28

## 2.2 Transforming Categorical Features

In [None]:
df = df[df.gender != 'Other'] # there is just one row with 'Other' gender. Just drop it
gender = pd.get_dummies(df.gender)
hypertension = pd.get_dummies(df.hypertension, prefix='hypertension')
heart_disease = pd.get_dummies(df.heart_disease, prefix='heart_disease')
ever_married = pd.get_dummies(df.ever_married, prefix='married')
work_type = pd.get_dummies(df.work_type)
smoking_status = pd.get_dummies(df.smoking_status)

In [None]:
clean_df = pd.concat(
    [
        df.age, 
        df.avg_glucose_level, 
        df.bmi,
        gender, 
        hypertension, 
        heart_disease, 
        ever_married, 
        work_type, 
        smoking_status,
        df.stroke
    ],
    axis=1)

In [None]:
clean_df.head()

## 2.3 Train-Test-Split

In [None]:
# Separating features and target variable AKA dependent variable
X = clean_df.drop('stroke', axis=1)
y = clean_df.stroke

# Scaling features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# We're going to use imbalanced dataset first to see what accuracy we can get
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.25, random_state=42
)
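
With so few class `1` rows, a plain random split can leave the test set with noticeably more or fewer stroke cases than the training set. One option, sketched below on synthetic stand-in labels (`X_demo` and `y_demo` are assumptions, not the stroke data), is the `stratify` argument of `train_test_split`, which keeps the class ratio roughly equal in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced target: 950 negatives, 50 positives.
y_demo = np.array([0] * 950 + [1] * 50)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# With stratify, both splits keep roughly the original 5% positive rate.
print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```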

# 3. Modeling

## 3.1 Random Forest on Imbalanced Data

In [None]:
rf = RandomForestClassifier(n_estimators=100, n_jobs=4)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(f'Random Forest accuracy: {accuracy_score(y_test, y_pred)}')

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f'TP: {tp}')
print(f'FN: {fn}')
print(f'TN: {tn}')
print(f'FP: {fp}')

> So what does it mean? Random Forest was able to predict class '0' pretty well (TN), but failed to identify patients with class '1' (TP). On top of that it falsely identified 79 patients with a stroke. Not good.

## 3.2 Random Forest 2.0 (with SMOTE)

In [None]:
# Oversampling with SMOTE
X_smote, y_smote = SMOTE().fit_resample(X_scaled, y)
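
For intuition about what the oversampling step does: SMOTE creates each synthetic minority row by picking a minority sample, picking one of its nearest minority neighbours, and placing a new point somewhere on the segment between them. The sketch below is a simplified illustration of that idea in NumPy. The function `smote_like_samples` is hypothetical and is not imblearn's implementation.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Simplified SMOTE-style interpolation (illustrative, not imblearn's code)."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    # Pairwise Euclidean distances between minority points.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :min(k, n - 1)]

    base = rng.integers(0, n, size=n_new)                    # random minority points
    pick = rng.integers(0, neighbours.shape[1], size=n_new)  # neighbour slot to use
    nn = neighbours[base, pick]                              # one neighbour per new row
    lam = rng.random((n_new, 1))                             # position along the segment
    return X_minority[base] + lam * (X_minority[nn] - X_minority[base])

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=6, k=2)
print(synthetic.shape)
```

Because every new row is a blend of two existing minority rows, the synthetic points stay inside the region already occupied by the minority class.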

In [None]:
y_smote.value_counts()

In [None]:
X_train_smote, X_test_smote, y_train_smote, y_test_smote = train_test_split(
    X_smote, y_smote, test_size=0.25, random_state=42
)

In [None]:
rf_smote = RandomForestClassifier(n_estimators=100, n_jobs=4)
rf_smote.fit(X_train_smote, y_train_smote)
y_pred_smote = rf_smote.predict(X_test_smote)
print(f'Random Forest accuracy: {accuracy_score(y_test_smote, y_pred_smote)}')

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test_smote, y_pred_smote).ravel()
print(f'TP: {tp}')
print(f'FN: {fn}')
print(f'TN: {tn}')
print(f'FP: {fp}')

This looks better. The model correctly identified 1158 patients with a stroke and 1120 without. Though it had its flaws with false positives and false negatives, it did a much better job than the base model.

But what we're really interested in is how our new model performs on the imbalanced data from the first example. Let's try.

## 3.3 Random Forest 2.0 on Imbalanced Data

In [None]:
y_pred = rf_smote.predict(X_test)
print(f'Random Forest accuracy: {accuracy_score(y_test, y_pred)}')

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f'TP: {tp}')
print(f'FN: {fn}')
print(f'TN: {tn}')
print(f'FP: {fp}')
print('\n')
print(f'Precision: {tp/(tp+fp)}\nRecall: {tp/(tp+fn)}')

Pretty good! Instead of 0 TP we now have 73, but we did sacrifice quite a bit on `false positives` (31), which means our model wrongly flagged healthy people as having a stroke. Yikes.

## Conclusion

Oversampling worked magic in this example, bringing total accuracy up to 97%. But since we care more about correctly identifying patients with a stroke (class 1), we should pay attention to the `true positives` and `false negatives`: true positives are up and false negatives are down. I have to admit that the title of this notebook is a bit clickbaity. If you run the test many times, you'll get 95% accuracy on average.
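
Since a single run can be lucky or unlucky, one way to get an "on average" number is cross-validation with `cross_val_score`, reporting recall next to accuracy. This is only a sketch on synthetic data (`X_demo`, `y_demo` and the model settings are assumptions), not a rerun of the experiment above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data, for illustration only.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
# For classifiers, an integer cv uses stratified folds.
acc = cross_val_score(rf_demo, X_demo, y_demo, cv=5, scoring="accuracy")
rec = cross_val_score(rf_demo, X_demo, y_demo, cv=5, scoring="recall")
print(f"accuracy: {acc.mean():.3f} +/- {acc.std():.3f}")
print(f"recall:   {rec.mean():.3f} +/- {rec.std():.3f}")
```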

**PS: if you have questions or recommendations, please feel free to comment. Cheers.**