# Introduction

In this notebook we go through the "Stroke prediction" dataset. First we explore and visualize the data, then fill in the missing values using KNN.
Finally we compare three classifiers (SVC, Random Forest, and an ANN) for predicting stroke for a given individual.

Notebook walkthrough:
1. <a href="#eda">Exploratory Data Analysis</a>
2. <a href="#preproc">Preprocessing the data</a>
    * <a href="#format">Formatting the features</a>
    * <a href="#fillna">Filling the missing values</a>
    * <a href="#datanorm">Data Normalization</a>
    * <a href="#train_dev_split">Train/dev split</a>
3. <a href="#SVC">Classification using SVC</a>
4. <a href="#randForest">Classification using Random Forest</a>
5. <a href="#ANN">Classification using ANN</a>

# 1. Exploratory Data Analysis: <a id="eda"></a>

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import random as rd

data = pd.read_csv('/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
print("Shape:", data.shape)
print("Features:", list(data.columns))
print()
print("Data Describe:")
print(data.iloc[:, 0:5].describe())
print(data.iloc[:, 5:12].describe())

In [None]:
# Setting up graphics and color palette
from matplotlib import rcParams
rcParams['figure.figsize'] = 9, 7

sns.set_context('notebook')
sns.set_style('whitegrid')
pal = sns.color_palette('Set2')
sns.set_palette(pal)

import warnings  
warnings.filterwarnings('ignore')

## Checking for null values:

In [None]:
print(data.isnull().sum())

## Distribution of the features:

In [None]:
sns.histplot(data['stroke'], bins=2)
plt.show()
imbalance_ratio = sum(data['stroke']==1)/sum(data['stroke']==0)
print("Ratio stroke/no_stroke:")
print(imbalance_ratio)
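
The ratio printed above compares positives to negatives, which is not the same as the share of positives among all rows. A minimal sketch of the relation between the two, using made-up counts (the names `y_toy`, `ratio`, and `prevalence` are illustrative and not part of the notebook):

```python
import numpy as np

# Toy labels: 5 positives among 100 rows (illustrative numbers only)
y_toy = np.array([1] * 5 + [0] * 95)

n_pos = int((y_toy == 1).sum())
n_neg = int((y_toy == 0).sum())

ratio = n_pos / n_neg                  # positives per negative, like imbalance_ratio
prevalence = n_pos / (n_pos + n_neg)   # fraction of all rows that are positive

print(f"ratio={ratio:.4f}  prevalence={prevalence:.4f}")
# The two are linked by prevalence = ratio / (1 + ratio)
```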

In [None]:
rcParams['figure.figsize'] = 20, 7
fig, axes = plt.subplots(1, 2)
sns.histplot(data['age'], bins=8, kde= True, ax=axes[0])
sns.histplot(data['bmi'], bins=10, kde=True, ax=axes[1])
plt.suptitle('Continuous data distribution')
plt.show()
rcParams['figure.figsize'] = 9, 7
sns.histplot(data['avg_glucose_level'], kde=True, bins=8)
plt.show()

In [None]:
cat_features = ['gender', 'work_type', 'Residence_type', 'smoking_status']
rcParams['figure.figsize'] = 20, 15
fig, axes = plt.subplots(2, 2)
for i, feature in enumerate(cat_features):
    sns.histplot(data[feature], ax=axes[i // 2, i % 2])
plt.suptitle('Categorical data distribution')
plt.show()

In [None]:
bin_features = ['hypertension', 'heart_disease', 'ever_married', 'stroke']
data['ever_married'] = (data['ever_married'] == 'Yes').astype(int)  # 'Yes'/'No' -> 1/0
rcParams['figure.figsize'] = 20, 15
fig, axes = plt.subplots(2, 2)
for i, feature in enumerate(bin_features):
    sns.histplot(data[feature], bins=2, ax=axes[i // 2, i % 2])
plt.suptitle('Binary data distribution')
plt.show()

In [None]:
rcParams['figure.figsize'] = 15,11
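# numeric_only=True skips the string columns, which recent pandas versions refuse to correlate
corr_mat = data.drop(['id'], axis=1).corr(numeric_only=True)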
sns.heatmap(corr_mat, vmin=-1, vmax=1, cmap=sns.diverging_palette(360, 180, as_cmap=True))
plt.show()

# 2. Preprocessing the data: <a id="preproc"></a>

## Formatting the features: <a id="format"></a>

In [None]:
smoke_to_int = {
    'never smoked': 0,
    'formerly smoked': 1,
    'smokes': 2,
    'Unknown': -1
}
data['smoking_status'] = [smoke_to_int[s] for s in data['smoking_status']]
print(data['smoking_status'])

In [None]:
work_to_int = {
    'Private': 1,
    'Self-employed': 2,
    'Govt_job': 3,
    'children': 4,
    'Never_worked': 0
}
data['work_cat'] = [work_to_int[s] for s in data['work_type']]
print(data['work_cat'])
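
Integer codes like the ones above impose an order on `work_type` that distance-based models will take literally. One alternative, which this notebook does not use, is one-hot encoding. A small sketch with `pd.get_dummies` on a toy frame (the frame `toy` and the `work` prefix are assumptions for illustration):

```python
import pandas as pd

# Toy column with the five work_type categories from the mapping above
toy = pd.DataFrame({
    'work_type': ['Private', 'Self-employed', 'Govt_job', 'children', 'Never_worked']
})

# One 0/1 indicator column per category, so no ordering is implied
one_hot = pd.get_dummies(toy['work_type'], prefix='work', dtype=int)
print(one_hot)
```

The trade-off is more columns (five here instead of one), which matters little for a dataset of this size.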

In [None]:
data['gender_female'] = [int(m) for m in data['gender'] == 'Female']
data['residence_urban'] = [int(m) for m in data['Residence_type'] == 'Urban']
print(data['residence_urban'])

In [None]:
features = [
    'id', 'gender_female', 'age', 'hypertension', 
    'heart_disease', 'ever_married', 'work_cat',
    'residence_urban', 'avg_glucose_level', 'bmi', 
    'smoking_status', 'stroke'
]

df = data[features]

## Filling in missing values (KNN algorithm): <a id="fillna"></a>

### 1) BMI

In [None]:
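from sklearn.neighbors import KNeighborsRegressor

# Rows with a known BMI train the regressor; rows without one get a prediction.
# .copy() gives independent frames, so the assignments below don't act on a view of df.
train = df[df['bmi'].notna()].copy()
pred = df[df['bmi'].isna()].copy()

bmi_regressor = KNeighborsRegressor(n_neighbors=5)
X = train.drop(['bmi'], axis=1)  # every other column, unscaled (this includes 'id' and 'stroke')
y = train['bmi']
bmi_regressor.fit(X, y)
y_hat = bmi_regressor.predict(pred.drop(['bmi'], axis=1))
pred.loc[:, 'bmi'] = y_hat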
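
KNN imputation predicts a missing value as the average of that value among the k most similar rows. Because similarity is a Euclidean distance, columns with large numeric ranges dominate unless the features are scaled first. The sketch below illustrates the idea on made-up rows with a scaling step added. The pipeline, the toy arrays, and the choice of `n_neighbors=3` are assumptions for illustration, not the notebook's exact procedure:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy rows: [age, avg_glucose_level]; all values are made up
X_known = np.array([[20, 80], [25, 85], [30, 90],
                    [60, 200], [65, 210], [70, 220]], dtype=float)
bmi_known = np.array([21.0, 22.0, 23.0, 30.0, 31.0, 32.0])
X_missing = np.array([[24, 84], [66, 212]], dtype=float)

# Scale first so that no single column dominates the distance
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3))
knn.fit(X_known, bmi_known)

# Each missing BMI becomes the mean BMI of its 3 nearest known rows
bmi_filled = knn.predict(X_missing)
print(bmi_filled)
```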

In [None]:
train.loc[:, 'cat'] = 'Train'
pred.loc[:, 'cat'] = 'Pred'

fig, axes= plt.subplots(2,1, sharex=True)
sns.histplot(train, x='bmi', stat='probability', hue='cat', bins=15, ax=axes[0], kde=True)
sns.histplot(pred, x='bmi', stat='probability', hue='cat', bins=15, ax=axes[1], kde=True)
plt.show()

df = pd.concat([train, pred])
df.drop(['cat'], axis=1, inplace=True)


### 2) Smoking status:

Since a large share of people have the smoking status 'Unknown', we'll apply KNN to fill in their smoking status:

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# -1 encodes 'Unknown'; those rows are the ones to fill in.
# .copy() gives independent frames, so the assignments below don't act on a view of df.
train = df[df['smoking_status']!=-1].copy()
pred = df[df['smoking_status']==-1].copy()

smoker_classifier = KNeighborsClassifier(n_neighbors=5)
X = train.drop(['smoking_status'], axis=1)
y = train.loc[:, 'smoking_status']
smoker_classifier.fit(X, y)
y_hat = smoker_classifier.predict(pred.drop(['smoking_status'], axis=1))
pred.loc[:, 'smoking_status'] = y_hat

In [None]:
train.loc[:, 'cat'] = 'Train'
pred.loc[:, 'cat'] = 'Pred'

fig, axes= plt.subplots(2,1, sharex=True)
sns.histplot(train, x='smoking_status', stat='probability', hue='cat', bins=3, ax=axes[0])
sns.histplot(pred, x='smoking_status', stat='probability', hue='cat', bins=3, ax=axes[1])
plt.show()

df = pd.concat([train, pred])
df.drop(['cat'], axis=1, inplace=True)

## Data Normalization: <a id="datanorm"></a>

Standardizing the features is an important step: both the SVC and the neural network are sensitive to feature scale.

In [None]:
df.describe()

In [None]:
from sklearn.preprocessing import StandardScaler

X = np.array(df.drop(['stroke', 'id'], axis=1))
y = np.array(df.loc[:, 'stroke'])
scaler = StandardScaler()
X = scaler.fit_transform(X)
pd.DataFrame(X).describe()
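
Standardization rescales each column to zero mean and unit variance. The cell above fits the scaler on all rows. A common variant, sketched here on random data as an assumption rather than as the notebook's procedure, fits the scaler on the training split only so that dev-set statistics do not leak into training:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features
y_toy = rng.integers(0, 2, size=200)                      # synthetic labels

X_tr, X_dev, y_tr, y_dev = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=42)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training rows only
X_tr_s = scaler.transform(X_tr)
X_dev_s = scaler.transform(X_dev)     # dev rows reuse the training mean and std

print(X_tr_s.mean(axis=0).round(6), X_tr_s.std(axis=0).round(6))
```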

## Train and Dev Splits: <a id="train_dev_split"></a>

In [None]:
from sklearn.model_selection import train_test_split

print("# of samples: " + str(y.shape[0]))

# Splitting data into train (90%) and dev (10%); the X_test / y_test names below refer to the dev set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .1, stratify = y, random_state = 42)
y_train = y_train.astype(np.float32).reshape((-1,1))
y_test = y_test.astype(np.float32).reshape((-1,1))

# #Transposing the data
# X_train, X_test = [np.array(x).T for x in [X_train, X_test]]
# y_train, y_test = [np.array(y).reshape(1, -1) for y in [y_train, y_test]]

print("X_train shape: " + str(X_train.shape) + "\t y_train shape:" + str(y_train.shape))
print("X_test shape:  " + str(X_test.shape) + "\t y_test shape: " + str(y_test.shape))

print(sum(y_train==1))
print(sum(y_test==1))
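
The cell above holds out a single 10% split, which serves as the dev set for the rest of the notebook. If a separate test set is wanted as well, `train_test_split` can be applied twice. A sketch on synthetic labels, where the 80/10/10 proportions and the toy arrays are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 4))
y_toy = np.array([1] * 50 + [0] * 950)   # 5% positives, made up

# First carve out 20% of the rows, then split that part half-and-half
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)
X_dev, X_te, y_dev, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

print(len(y_tr), len(y_dev), len(y_te))     # split sizes
print(y_tr.sum(), y_dev.sum(), y_te.sum())  # positives per split
```

Stratifying both calls keeps the positive rate the same in all three parts, which matters when positives are this rare.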

# 3. Classification using SVC <a id="SVC"></a>

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score

svc = SVC(C=1000, class_weight = {0: 1, 1: 20}, kernel='poly', degree=5)
svc.fit(X_train, y_train)

In [None]:
# Random search over the SVC hyperparameters:
iterations = 30
C_range = [int(10**rd.uniform(0, 4)) for i in range(iterations)] # Regularization parameter C, sampled log-uniformly between 1 and 10^4
weight_range = [int(10**rd.uniform(1, 2.5)) for i in range(iterations)] # Weight of the positive class, to counter the imbalance
degree_range = [int(rd.uniform(2, 7)) for i in range(iterations)] # Polynomial degree, 2 to 6

combos = list(zip(C_range, weight_range, degree_range))
best_score = 0
best_combination = (1, 10, 2)

for C, weight, degree in combos:
    svc = SVC(C=C, class_weight = {0: 1, 1: weight}, kernel='poly', degree=degree)
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_test)
    current_score = f1_score(y_test, y_pred)
    print("C:", C, "\tWeight:", weight,"\tdegree:", degree, "\tScore:", current_score)
    if current_score > best_score:
        best_score = current_score
        best_combination = (C, weight, degree)

C, weight, degree = best_combination
svc = SVC(C=C, class_weight = {0: 1, 1: weight}, kernel='poly', degree=degree)
svc.fit(X_train, y_train)
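
The loop above scores every candidate on the dev set, which is also the set used for the final report, so the reported dev score tends to be optimistic. One alternative, sketched here on synthetic data rather than the stroke dataset, scores each candidate by cross-validation on the training rows. The fold count, the search ranges, and the number of draws are arbitrary choices for illustration:

```python
import random
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic imbalanced problem (roughly 10% positives)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, weights=[0.9, 0.1], random_state=0)

rng = random.Random(0)   # seeded so the search is repeatable
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

best_score, best_params = -1.0, None
for _ in range(5):
    params = {
        'C': 10 ** rng.uniform(0, 2),                            # log-uniform in [1, 100]
        'degree': rng.randint(2, 3),                             # 2 or 3
        'class_weight': {0: 1, 1: int(10 ** rng.uniform(1, 2))}, # positive-class weight
    }
    model = SVC(kernel='poly', **params)
    # Mean F1 over the folds; the dev set is never touched during the search
    score = cross_val_score(model, X_toy, y_toy, cv=cv, scoring='f1').mean()
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 3))
```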

In [None]:
y_pred = svc.predict(X_train)
print('Score on the training set:')
print(classification_report(y_train, y_pred))
print('roc_auc score: ', end='')
print(roc_auc_score(y_train, y_pred))
print('f1 score:', f1_score(y_train, y_pred), end='\n\n')

y_pred = svc.predict(X_test)
print('Score on the dev set:')
print(classification_report(y_test, y_pred))
print('roc_auc score: ', end='')
print(roc_auc_score(y_test,  y_pred))
print('f1 score:', f1_score(y_test,  y_pred), end='\n\n')
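
ROC AUC measures how well a model ranks positives above negatives, so it is usually computed from continuous scores. Passing hard 0/1 predictions, as the SVC cell above does, evaluates a single operating point only. A minimal sketch on synthetic data (not the stroke dataset) that computes both variants, using `SVC.decision_function` for the scores:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

# Synthetic imbalanced problem (roughly 10% positives)
X_toy, y_toy = make_classification(
    n_samples=400, n_features=6, weights=[0.9, 0.1], random_state=0)
clf = SVC(kernel='poly', degree=3, class_weight={0: 1, 1: 10}).fit(X_toy, y_toy)

# AUC from hard labels: the ROC curve collapses to one point
auc_from_labels = roc_auc_score(y_toy, clf.predict(X_toy))
# AUC from signed distances to the decision boundary: the full ranking is used
auc_from_scores = roc_auc_score(y_toy, clf.decision_function(X_toy))

print(round(auc_from_labels, 3), round(auc_from_scores, 3))
```

The two numbers generally differ, so AUC values are only comparable across models when they are computed the same way.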

# 4. Classification using Random Forest: <a id="randForest"></a>

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=150, criterion='gini',
                                  class_weight = {0: 1, 1: 100}, min_samples_leaf=6,
                                  max_features = None)
rf_model.fit(X_train, np.ravel(y_train))

In [None]:
#Added Parameter Tuning:
weight_range = [int(x) for x in np.logspace(1, 2.5, num=5)] # Weights for imbalanced data
min_leaf_range = [int(x) for x in np.linspace(8, 20, num=5)] # Candidate values for min_samples_leaf
best_score = 0
best_combination = (10, 5)

for w, leaf_s in [(x, y) for x in weight_range for y in min_leaf_range]:
    rf_model = RandomForestClassifier(n_estimators=150, criterion='gini',
                              class_weight = {0: 1, 1: w}, min_samples_leaf=leaf_s,
                              max_features = None)
    rf_model.fit(X_train, np.ravel(y_train))
    y_pred = rf_model.predict_proba(X_test)
    current_score = f1_score(y_test,np.around(y_pred[:, 1]))
    print("Weight:", w,"\tMin leaf sample:", leaf_s, "\tF1-Score:", current_score)
    if current_score > best_score:
        best_score = current_score
        best_combination = (w, leaf_s)

w, leaf_s = best_combination
rf_model = RandomForestClassifier(n_estimators=150, criterion='gini',
                              class_weight = {0: 1, 1: w}, min_samples_leaf=leaf_s,
                              max_features = None)
rf_model.fit(X_train, np.ravel(y_train))

In [None]:
y_pred = rf_model.predict_proba(X_train)
print('Score on the training set:')
print(classification_report(y_train, np.around(y_pred[:, 1])))
print('roc_auc score: ', end='')
print(roc_auc_score(y_train, y_pred[:, 1]))
print('f1 score:', f1_score(y_train,np.around(y_pred[:, 1])), end='\n\n')

y_pred = rf_model.predict_proba(X_test)
print('Score on the dev set:')
print(classification_report(y_test, np.around(y_pred[:, 1])))
print('roc_auc score: ', end='')
print(roc_auc_score(y_test, y_pred[:, 1]))
print('f1 score:', f1_score(y_test,np.around(y_pred[:, 1])), end='\n\n')
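
Rounding the predicted probability, as in the cells above, fixes the decision threshold at 0.5. With rare positives, a different threshold can give a better F1. The sketch below picks the F1-maximizing threshold with `precision_recall_curve` on synthetic scores. The score distributions are invented for illustration, and in practice the threshold should be chosen on data that is not used for the final evaluation:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

rng = np.random.default_rng(0)
y_true = np.array([1] * 30 + [0] * 270)
# Synthetic scores: positives tend to score higher but seldom exceed 0.5
proba = np.concatenate([rng.uniform(0.2, 0.6, 30), rng.uniform(0.0, 0.4, 270)])

precision, recall, thresholds = precision_recall_curve(y_true, proba)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best = int(np.argmax(f1[:-1]))       # the final curve point has no threshold attached
best_threshold = thresholds[best]

f1_at_half = f1_score(y_true, (proba >= 0.5).astype(int))
f1_at_best = f1_score(y_true, (proba >= best_threshold).astype(int))
print(round(float(best_threshold), 3), round(f1_at_half, 3), round(f1_at_best, 3))
```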

# 5. Classification using ANN: <a id="ANN"></a>

First we'll define a function to visualize the history of our model (performance):

In [None]:
def dfify(hist):
    """Reshape a Keras History into a long DataFrame with a 'phase' column (train/val)."""
    df = pd.DataFrame(hist.history)
    df['epoch'] = df.index
    val_cols = [x for x in df.columns if x.startswith('val')]
    # .copy() so that renaming and adding columns does not act on a slice of df
    df_val = df[val_cols+['epoch']].copy()
    df.drop(columns=val_cols, inplace=True)
    df_val.rename(columns={col: col.split('val_')[-1] for col in df_val.columns}, inplace=True)
    df['phase'] = 'train'
    df_val['phase'] = 'val'
    return pd.concat([df, df_val], ignore_index=True)

def visu_history(hist):
    rcParams['figure.figsize'] = 14, 10
    hist_df = dfify(hist)
    fig, axes = plt.subplots(2, 2)
    sns.lineplot(data = hist_df, x='epoch', y='loss', hue='phase', ax=axes[0,0])
    sns.lineplot(data = hist_df, x='epoch', y='auc', hue='phase', ax=axes[0,1])
    sns.lineplot(data = hist_df, x='epoch', y='precision', hue='phase', ax=axes[1,0])
    sns.lineplot(data = hist_df, x='epoch', y='recall', hue='phase', ax=axes[1,1])
    plt.show()
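
The reshaping step is easier to follow on a tiny example. Keras stores training history as a dict of per-epoch lists, with validation metrics prefixed by `val_`. The sketch below performs the same wide-to-long conversion on a hand-written dict. `fake_history` and its values are made up and stand in for `History.history`:

```python
import pandas as pd

# Stand-in for History.history: per-epoch metric lists (values are made up)
fake_history = {'loss': [0.9, 0.7, 0.6], 'val_loss': [1.0, 0.8, 0.75]}

wide = pd.DataFrame(fake_history)
wide['epoch'] = wide.index

# One block of rows per phase, with the 'val_' prefix stripped from the metric name
train_part = wide[['epoch', 'loss']].assign(phase='train')
val_part = (wide[['epoch', 'val_loss']]
            .rename(columns={'val_loss': 'loss'})
            .assign(phase='val'))

long_df = pd.concat([train_part, val_part], ignore_index=True)
print(long_df)
```

The long format lets seaborn draw train and validation curves on one axis through `hue='phase'`.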


In [None]:
import tensorflow as tf
import tensorflow_addons as tfa
from sklearn.metrics import classification_report

tf.keras.backend.clear_session()


def make_model(optimizer, loss_fn, metrics, output_bias='zeros', dropout=0):
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(X_train.shape[-1],)),
        tf.keras.layers.Dense(10, activation='relu', kernel_initializer=tf.keras.initializers.HeNormal),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(6, activation='relu', kernel_initializer=tf.keras.initializers.HeNormal),
        tf.keras.layers.Dropout(dropout),
#         tf.keras.layers.Dense(6, activation='relu', kernel_initializer=tf.keras.initializers.HeNormal),
#         tf.keras.layers.Dropout(dropout),
#         tf.keras.layers.Dense(3, activation='tanh', kernel_initializer=tf.keras.initializers.HeNormal),
#         tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(1, activation='sigmoid', kernel_initializer=tf.initializers.GlorotUniform, bias_initializer=output_bias)
    ])
    
    model.compile(
        optimizer=optimizer,
        loss=loss_fn,
        metrics=metrics
    )
    
    return model


loss_fn = tfa.losses.SigmoidFocalCrossEntropy(from_logits=False)
# loss_fn = tf.losses.BinaryCrossentropy(from_logits=False)

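# threshold=0.5 binarizes the sigmoid output; without it the metric takes an argmax over a single column
f1_score_tf = tfa.metrics.F1Score(num_classes=1, average='macro', threshold=0.5)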
pres = tf.keras.metrics.Precision()
rec = tf.keras.metrics.Recall()
auc = tf.keras.metrics.AUC()
metrics = ['accuracy', pres, rec, f1_score_tf, auc]

optimizer = tf.keras.optimizers.SGD(learning_rate=.1, momentum=1)

model = make_model(optimizer, loss_fn, metrics)
model.summary()

## Initializing the output layer bias:

In [None]:
# Initializing The final layer:
output_initializer = tf.keras.initializers.Constant(np.log(imbalance_ratio)) # sigmoid(ln(pos/neg)) = pos/(pos+neg), the positive-class rate
model = make_model(optimizer, loss_fn, metrics, output_initializer)
model.save_weights('initial_weights')
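
The reason for setting the bias to the log of the class ratio is that the sigmoid then outputs the positive-class rate before any training has happened. A quick numerical check with made-up counts (`n_pos`, `n_neg`, and `b0` are illustrative names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_pos, n_neg = 5, 95            # made-up class counts
ratio = n_pos / n_neg           # plays the role of imbalance_ratio
b0 = np.log(ratio)              # initial bias of the output unit

# Output of the last unit when its weighted input is zero and only the bias acts
p0 = sigmoid(b0)
print(p0, n_pos / (n_pos + n_neg))
```

Both printed values agree because sigmoid(ln r) = r / (1 + r), and with r = pos/neg that equals pos / (pos + neg).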

In [None]:
result = model.evaluate(X_train, y_train, batch_size=256)

In [None]:
model.layers[-1].bias.assign([0])
result = model.evaluate(X_train, y_train, batch_size=256)

Clearly the model with the initialized bias has a better initial loss.
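
The size of that gap can be estimated by hand. The sketch below compares the loss of a model that predicts 0.5 for every row (zero bias) with one that predicts the positive-class rate (initialized bias). It uses plain binary cross-entropy on made-up labels as a simplification: the notebook trains with a focal loss, and the sketch assumes the hidden layers contribute nothing to the output at initialization.

```python
import numpy as np

def mean_bce(y, p):
    """Mean binary cross-entropy when every row gets the same probability p."""
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

y_toy = np.array([1] * 5 + [0] * 95)   # made-up labels, 5% positives
prior = y_toy.mean()

loss_zero_bias = mean_bce(y_toy, 0.5)      # sigmoid(0) = 0.5
loss_prior_bias = mean_bce(y_toy, prior)   # sigmoid(log(pos/neg)) = prior

print(round(loss_zero_bias, 4), round(loss_prior_bias, 4))
```

With zero bias the loss starts at ln 2 regardless of the data. With the initialized bias the model starts from the best constant prediction, so the first epochs are not spent learning the base rate.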

## Overfitting the model on 10 rows of data (sanity check):

In [None]:
# Overfitting a single batch of 10 rows
model.load_weights('initial_weights')
history = model.fit(X_train[97:107], y_train[97:107], epochs=100, batch_size=256, verbose=2)

In [None]:
rcParams['figure.figsize'] = 7, 5
grid = sns.lineplot(data = history.history['loss'])
grid.set(yscale='log')
plt.xlabel('epochs')
plt.ylabel('loss')
plt.show()

In [None]:
np.column_stack((np.around(model.predict(X_train[97:107]), 3), y_train[97:107]))

## Overfitting the whole dataset:

In [None]:
# defining the class_weights
class_weight = {0: 1, 1: 1/imbalance_ratio}
class_weight
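
Weighting the positive class by neg/pos makes both classes contribute equally to the total loss. scikit-learn's `'balanced'` heuristic produces weights with the same ratio, only scaled differently. A sketch comparing the two on made-up labels (`ratio_weights` and `balanced_weights` are illustrative names):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_toy = np.array([1] * 5 + [0] * 95)   # made-up labels

# Notebook-style weights: negatives count 1, positives count neg/pos
n_pos, n_neg = int((y_toy == 1).sum()), int((y_toy == 0).sum())
ratio_weights = {0: 1.0, 1: n_neg / n_pos}

# 'balanced' heuristic: n_samples / (n_classes * count_per_class)
balanced = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=y_toy)
balanced_weights = {0: float(balanced[0]), 1: float(balanced[1])}

print(ratio_weights, balanced_weights)
```

Only the ratio between the two weights changes which class the model favors. The overall scale acts like a multiplier on the loss, which interacts with the learning rate.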

In [None]:
model = make_model(tf.keras.optimizers.Adam(learning_rate=1e-2), loss_fn, metrics, output_initializer)
model.load_weights('initial_weights')
history = model.fit(X_train, y_train, epochs=300, batch_size=1024, verbose=2, class_weight=class_weight, validation_data=(X_test, y_test))

In [None]:
visu_history(history)

## Regularization:

Applying weight decay (AdamW), a dropout rate of 0.2, and early stopping on the validation AUC:

In [None]:
# Note: patience equals the number of epochs (400), so this callback never cuts training short
callback = tf.keras.callbacks.EarlyStopping(monitor='val_auc', patience=400, mode='max', restore_best_weights=True, verbose=1)
model = make_model(tfa.optimizers.AdamW(learning_rate=1e-2, weight_decay=5e-4), loss_fn, metrics, output_initializer, dropout=.2)
model.load_weights('initial_weights')
history = model.fit(
    X_train, y_train, epochs=400, batch_size=256, class_weight=class_weight, 
    callbacks=[callback], validation_data=(X_test, y_test), verbose=2
)
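
Early stopping halts training once the monitored metric has gone `patience` epochs without reaching a new best. The function below is a simplified model of that rule in plain Python, not Keras's implementation: it ignores options such as `min_delta` and `baseline`, and the validation curve is made up.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (stop_epoch, best_epoch) for a metric that should be maximized.

    Training stops once `patience` consecutive epochs pass without a new best.
    If that never happens, stop_epoch is the index of the last epoch.
    """
    best, best_epoch, wait = float('-inf'), -1, 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_scores) - 1, best_epoch

val_auc = [0.70, 0.75, 0.80, 0.79, 0.78, 0.81, 0.80, 0.79, 0.78, 0.77]  # made-up curve
print(early_stopping_epoch(val_auc, patience=3))    # stops three epochs after the best one
print(early_stopping_epoch(val_auc, patience=10))   # patience >= number of epochs: runs to the end
```

The second call shows why `patience` has to be smaller than the epoch budget for the rule to have any effect.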

In [None]:
visu_history(history)

In [None]:
y_pred = model.predict(X_train)
print(classification_report(y_train, np.around(y_pred)))
print('roc_auc score: ', end='')
print(roc_auc_score(y_train, y_pred))
print('f1 score:', f1_score(y_train,np.around(y_pred)), end='\n\n')

y_pred = model.predict(X_test)
print(classification_report(y_test, np.around(y_pred)))
print('roc_auc score: ', end='')
print(roc_auc_score(y_test, y_pred))
print('f1 score:', f1_score(y_test,np.around(y_pred)), end='\n\n')

# Conclusion:

In this notebook we looked at the stroke prediction data, explored it through EDA, preprocessed it, and applied different machine learning models (polynomial SVM classifier, Random Forest, ANN), with a special emphasis on the ANN model. The models performed about the same (AUC score of .82), which is arguably acceptable given that the data is highly imbalanced.

The performance of the models could be improved further with these steps:
* Tuning the hyperparameters of the ANN
* Data augmentation / oversampling / undersampling
* Ensembling the models
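
As an illustration of the oversampling idea, the sketch below duplicates randomly chosen positive rows until both classes are the same size. It uses synthetic arrays and plain NumPy. The variable names are illustrative, and in practice this would be applied to the training split only, so that duplicated rows never end up in the dev set.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))      # synthetic features
y_toy = np.array([1] * 5 + [0] * 95)   # made-up labels, 5% positives

pos_idx = np.flatnonzero(y_toy == 1)
neg_idx = np.flatnonzero(y_toy == 0)

# Draw positives with replacement until both classes have the same count
extra = rng.choice(pos_idx, size=len(neg_idx) - len(pos_idx), replace=True)
keep = np.concatenate([neg_idx, pos_idx, extra])
rng.shuffle(keep)

X_bal, y_bal = X_toy[keep], y_toy[keep]
print(np.bincount(y_bal))
```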

Comments and critiques are welcome.

Thank you for your time!