# 0. Import libraries and Read data

In [None]:

import numpy as np 
import pandas as pd 
import seaborn as sns
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.pipeline import Pipeline

from sklearn.tree import DecisionTreeRegressor,DecisionTreeClassifier

import matplotlib.pyplot as plt

In [None]:
DATA_PATH = '/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv'
x_df = pd.read_csv(DATA_PATH)
x_df.head()

In [None]:
x_df.describe()

# 1. Exploratory Data Analysis

In this section, we analyze the feature and target distributions to get a first look at potential relationships and to guide feature selection.

In [None]:
sns.countplot(x='stroke',data=x_df)

The dataset is imbalanced: far fewer people in it have had a stroke than have not. This is expected, since a 2010 study showed that 0.25% of the world population had a stroke during that year.
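The imbalance can also be quantified instead of being read off the plot. A minimal sketch, assuming a toy `stroke` column (the 95/5 split below is illustrative only and is not taken from the dataset):

```python
import pandas as pd

# Toy labels for illustration only: 95 negatives and 5 positives.
toy_df = pd.DataFrame({"stroke": [0] * 95 + [1] * 5})

# Share of each class as a fraction of all rows.
class_share = toy_df["stroke"].value_counts(normalize=True)
positive_rate = class_share.loc[1]
print("Positive class share: {:.2%}".format(positive_rate))
```

On the notebook's data, the equivalent call would be `x_df['stroke'].value_counts(normalize=True)`.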

In [None]:
sns.countplot(x='Residence_type',data=x_df)

We have a fair representation between rural and urban people.


In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
sns.violinplot(ax=axes[0], x="stroke", y="avg_glucose_level", data=x_df)
sns.violinplot(ax=axes[1], x="stroke", y="bmi", data=x_df)
sns.violinplot(ax=axes[2], x="stroke", y="age", data=x_df)

* The distribution of **average glucose level** is similar in the two classes; the main difference is that a somewhat larger share of people with stroke have an average glucose level above 150.

* The distribution of **BMI** does not differ visibly between the two classes, so BMI does not appear to have a large effect on stroke in this dataset. There are, however, potential outliers, since a BMI over 65 is quite rare.

* The **age** distributions differ markedly: people who had a stroke are much older than the rest of the population. Age therefore appears to be an important factor in stroke; the older a person is, the higher the chance of a stroke.
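The visual comparison above can be backed by a per-class summary. The sketch below uses a small toy frame (all values are made up for illustration) and compares group medians, which are less sensitive to outliers than means:

```python
import pandas as pd

# Toy data for illustration only; column names mirror the notebook's.
toy_df = pd.DataFrame({
    "stroke": [0, 0, 0, 0, 1, 1],
    "age": [25, 40, 33, 52, 71, 79],
    "bmi": [22.0, 27.5, 30.1, 24.3, 28.0, 26.4],
})

# Median of each numerical feature within each class.
summary = toy_df.groupby("stroke")[["age", "bmi"]].median()
print(summary)
```

Applied to `x_df`, a large gap between the two rows for `age` and a small one for `bmi` would be consistent with the violin plots.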


In [None]:
x_df = x_df[~(x_df['gender'] == 'Other')]
sns.violinplot(x="stroke", y="age", data=x_df ,hue='gender')

* Among people who had a stroke, the age distributions are similar for men and women. The main difference is that men appear more at risk around 60 years old.

In [None]:
sns.violinplot(x="stroke", y="age", data=x_df ,hue='smoking_status')



* Among people who had a stroke, those who smoke or formerly smoked tend to have it earlier than the others, so smoking might affect the chances of a stroke.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

sns.violinplot(ax=axes[0], x="stroke", y="age", data=x_df ,hue='Residence_type')
sns.violinplot(ax=axes[1], x="stroke", y="age", data=x_df ,hue='hypertension')
sns.violinplot(ax=axes[2], x="stroke", y="age", data=x_df ,hue='heart_disease')




* Among people who had a stroke, the age distributions are similar for rural and urban residents, so these plots give no indication that living in a city or a rural area affects strokes.

* Among people who had a stroke, the age distributions are similar for people with and without hypertension, so it is hard to tell whether hypertension alone has an impact on strokes.

* It is also hard to tell whether heart_disease alone has an impact on strokes.




# 2. Missing BMI values

In [None]:
print("Missing values :\n{}".format(x_df.isnull().sum()))


In [None]:
# source : https://en.wikipedia.org/wiki/Body_mass_index

NORMAL = 18.5 # All bmi values under 18.5 refer to underweight
OVERWEIGHT = 25
OBESE_1 = 30
OBESE_2 = 35
OBESE_3 = 40
MAX_BMI = 55


x_df = x_df[(x_df['bmi'].isnull()) | (x_df['bmi'] < MAX_BMI)] # Filter out BMI outliers, keep rows with missing BMI
x_overweight_df = x_df[x_df['bmi'] > OVERWEIGHT]
x_underweight_df = x_df[x_df['bmi'] <= NORMAL]
x_normal_df = x_df[(x_df['bmi'] > NORMAL) & (x_df['bmi'] <= OVERWEIGHT)]
x_null_df = x_df[x_df['bmi'].isnull()]

# Share of all stroke cases (positive class) that falls into each BMI group
n_positive = x_df['stroke'].sum()

print('Share of stroke cases who are overweight : {:.02f}%'.format(100 * x_overweight_df['stroke'].sum() / n_positive))
print('Share of stroke cases who are underweight : {:.02f}%'.format(100 * x_underweight_df['stroke'].sum() / n_positive))
print('Share of stroke cases with normal weight : {:.02f}%'.format(100 * x_normal_df['stroke'].sum() / n_positive))
print('Share of stroke cases with MISSING BMI : {:.02f}%'.format(100 * x_null_df['stroke'].sum() / n_positive))

sns.violinplot(data=x_df,x='stroke',y='bmi')
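The same grouping can be written more compactly with `pd.cut`. This is a sketch on toy values (made up for illustration), using the thresholds from the cell above with right-closed bins, so that 18.5 counts as underweight and 25 as normal, as in the filters:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy_df = pd.DataFrame({
    "bmi": [17.0, 22.0, 27.0, 33.0, np.nan, 24.0],
    "stroke": [0, 0, 1, 1, 1, 0],
})

# Right-closed bins: (-inf, 18.5], (18.5, 25], (25, inf)
bins = [-np.inf, 18.5, 25, np.inf]
labels = ["underweight", "normal", "overweight"]
groups = pd.cut(toy_df["bmi"], bins=bins, labels=labels)

# Keep rows with missing BMI as their own group instead of dropping them.
toy_df["bmi_group"] = groups.astype(object).fillna("missing")

# Share of all positive samples that falls into each group.
share_of_positives = toy_df.groupby("bmi_group")["stroke"].sum() / toy_df["stroke"].sum()
print(share_of_positives)
```

Keeping an explicit `missing` group makes it easy to see whether rows without a BMI carry a disproportionate share of the positive class.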

## Replace missing BMI with median 

In [None]:
# Solution 1 :
#x_df['bmi'] = x_df['bmi'].fillna(x_df['bmi'].median())

## Replace missing BMI using DecisionTree

In [None]:
# Solution 2 :
# Predict missing 'bmi' with other values based on 'age' and 'gender' attributes with a simple Decision Tree
bmi_pipe = Pipeline([('scaler', StandardScaler()), 
                     ('dtr', DecisionTreeRegressor(random_state=42))
                    ])

x_pipe_df = x_df[['age','gender','bmi']].copy()
x_pipe_df['gender'] = x_pipe_df['gender'].map({'Male': 0, 'Female': 1, 'Other': -1}).astype(np.int8) # Signed dtype: uint8 cannot represent -1

x_missing_df = x_pipe_df[x_pipe_df['bmi'].isnull()].drop(columns='bmi')

x_pipe_df = x_pipe_df[~x_pipe_df['bmi'].isnull()]
y_pipe_df = x_pipe_df.pop('bmi')

bmi_pipe.fit(x_pipe_df,y_pipe_df)
x_df.loc[x_missing_df.index, 'bmi'] = bmi_pipe.predict(x_missing_df)
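Before trusting either imputation strategy, it may help to check whether a tree actually predicts BMI better than the median does. The sketch below is an assumption-laden illustration: the data are synthetic, and `max_depth=3` is a hypothetical choice (the notebook's tree is unrestricted). It compares cross-validated mean absolute error against a median baseline:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the (age, gender) -> bmi problem; not real data.
rng = np.random.default_rng(42)
n_samples = 300
age = rng.uniform(20, 80, size=n_samples)
gender = rng.integers(0, 2, size=n_samples)
bmi = 22 + 0.08 * age + rng.normal(0, 3, size=n_samples)
features = np.column_stack([age, gender])

# Baseline: always predict the median BMI (equivalent to median imputation).
baseline = DummyRegressor(strategy="median")
# Candidate: a shallow tree (max_depth is an assumption for this sketch).
tree = DecisionTreeRegressor(max_depth=3, random_state=42)

scoring = "neg_mean_absolute_error"
baseline_mae = -cross_val_score(baseline, features, bmi, cv=5, scoring=scoring).mean()
tree_mae = -cross_val_score(tree, features, bmi, cv=5, scoring=scoring).mean()
print("Median baseline MAE : {:.3f}".format(baseline_mae))
print("Decision tree MAE   : {:.3f}".format(tree_mae))
```

If the tree's error is not clearly lower than the baseline's on the real rows with known BMI, median imputation is probably the simpler and safer choice.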

# 3. Numerical features


Stroke in children is a very rare event, and half of the cases cannot be precisely explained ([source](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3255104/)).  
The dataset contains only 2 samples of childhood stroke, which is certainly not enough to model childhood stroke, and including them might skew the training process. We have therefore decided to remove them and to predict **ONLY adult strokes**.   
Note that if we wanted to predict all types of stroke, removing these childhood stroke samples would introduce a strong bias.


In [None]:
# Drop stroke samples from children (here: age below 20)
x_children_stroke_df = x_df[(x_df['age'] < 20 ) & (x_df['stroke'] == 1)]
x_df = x_df.drop(x_children_stroke_df.index)

In [None]:
# Drop id column
if 'id' in x_df.columns:
    x_df = x_df.drop(columns='id')


# 4. Categorical features

In [None]:
categorical_cols = ['gender','ever_married','work_type','Residence_type','smoking_status']

if 'Residence_type' in x_df.columns:
    x_df = x_df.drop(columns='Residence_type') # From our experiments: 'Residence_type' adds more noise than relevant information
x_df = pd.get_dummies(x_df)
x_df.head() 


# 5. Split dataset

In [None]:
x,y = x_df.drop(columns='stroke'), x_df['stroke']


x_train, x_val, y_train, y_val = train_test_split(x,y, test_size=0.2, random_state = 42, shuffle = True, stratify=y)
print("Train shape : {}\nValidation Shape : {}".format(x_train.shape, x_val.shape))
print("Positive # samples : {}".format(np.count_nonzero(y_val == 1)))




# 6. Balancing dataset

Balancing the dataset did not improve performance (in terms of recall), so for now the code below is commented out. Note that upsampling and downsampling should be applied to the training set only, and the final model should always be evaluated on original (not synthetic) data samples.
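The "training set only" rule can be illustrated as follows. This is a sketch on toy data (10 positives and 90 negatives, made up for illustration), with hypothetical variable names chosen so that they do not clash with the notebook's:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 10 positives, 90 negatives.
toy_x = pd.DataFrame({"age": np.arange(100)})
toy_y = pd.Series([1] * 10 + [0] * 90, name="stroke")

# Split first, so the validation set keeps the original class distribution.
tx_train, tx_val, ty_train, ty_val = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Then downsample negatives in the training split only (drop half of them).
neg_to_drop = ty_train[ty_train == 0].sample(frac=0.5, random_state=42).index
tx_train_bal = tx_train.drop(neg_to_drop)
ty_train_bal = ty_train.drop(neg_to_drop)

print("Train before : {} rows, {} positives".format(len(ty_train), ty_train.sum()))
print("Train after  : {} rows, {} positives".format(len(ty_train_bal), ty_train_bal.sum()))
print("Validation   : {} rows, {} positives".format(len(ty_val), ty_val.sum()))
```

The validation rows are never touched, so metrics computed on them still reflect the real class ratio.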

## Upsampling : SMOTE

In [None]:
# Solution 1 : Upsampling with SMOTE
# from imblearn.over_sampling import SMOTE
# oversample = SMOTE()
# x_train, y_train = oversample.fit_resample(x_train, y_train)

# print("Input shape after SMOTE : {}".format(x_train.shape))


## Downsampling

In [None]:
# Solution 2 : Downsample
# FRAC = 0.9 # Drop 90% of negative samples
# y_train_to_drop = y_train[y_train == 0].sample(frac = FRAC,random_state = 42)
# x_train = x_train.drop(y_train_to_drop.index)
# y_train = y_train.drop(y_train_to_drop.index)

# print("Input shape after downsampling by {}% : {}".format(FRAC*100,x_train.shape))


# 7. Classification

## Evaluation metrics


Depending on business goals, the evaluation metric to optimize may differ. In our experiments, accurately identifying positive cases (people who had a stroke) was difficult for several reasons: insufficient data samples, insufficient relevant features...
This classification problem also differs from classical ones such as fraud detection or image classification, where a sample has a unique target label regardless of time. In stroke classification, some samples may have many stroke-correlated features without having had a stroke YET. They are therefore labelled as negative, although they might have a stroke in the near future.

TLDR: In this notebook, we focus on recall rather than accuracy, because the dataset is imbalanced with positive samples in the minority. We also consider False Negative predictions more dangerous than False Positives in this dataset.
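To see why accuracy is misleading here, consider a classifier that always predicts "no stroke". The sketch below uses toy labels (a 95/5 split chosen for illustration, not the dataset's actual ratio):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Toy labels for illustration only: 95 negatives and 5 positives.
y_true = np.array([0] * 95 + [1] * 5)
# A trivial "model" that never predicts a stroke.
y_all_negative = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_all_negative)
rec = recall_score(y_true, y_all_negative)
tn, fp, fn, tp = confusion_matrix(y_true, y_all_negative).ravel()

print("Accuracy : {:.2f}".format(acc))
print("Recall   : {:.2f}".format(rec))
print("False negatives : {}".format(fn))
```

The trivial model reaches 95% accuracy while missing every positive case, which is exactly the failure mode that recall exposes.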

In [None]:
from sklearn.metrics import recall_score, f1_score, precision_score, accuracy_score, average_precision_score
from sklearn.metrics import precision_recall_curve, auc
from matplotlib import pyplot as plt

def plot_metrics(targets,predictions):
    print("Validation accuracy : {:.4f}".format(accuracy_score(targets, predictions)))
    print("Validation recall : {:.4f}".format(recall_score(targets,predictions)))
    print("Validation precision : {:.4f}".format(precision_score(targets,predictions)))
    print("Validation f1-score : {:.4f}".format(f1_score(targets,predictions)))
    precision, recall, _ = precision_recall_curve(targets, predictions)
    pr_auc = auc(recall,precision) # NB : computed from hard 0/1 predictions, so the curve has only a few points; average_precision_score on continuous scores is a related precision/recall summary
    print("Validation auc : {:.4f}".format(pr_auc))
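A precision/recall curve is more informative when it is built from continuous scores rather than from hard labels. The sketch below assumes synthetic data from `make_classification` and a logistic regression as a stand-in model; all `demo_` names are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data for illustration only (about 10% positives).
demo_X, demo_y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
dX_train, dX_val, dy_train, dy_val = train_test_split(
    demo_X, demo_y, test_size=0.2, stratify=demo_y, random_state=42
)

demo_model = LogisticRegression(class_weight="balanced", max_iter=1000)
demo_model.fit(dX_train, dy_train)

# Continuous scores give one curve point per distinct threshold.
demo_scores = demo_model.decision_function(dX_val)
demo_precision, demo_recall, demo_thresholds = precision_recall_curve(dy_val, demo_scores)

pr_auc_trapezoid = auc(demo_recall, demo_precision)
avg_precision = average_precision_score(dy_val, demo_scores)
print("PR AUC (trapezoid)  : {:.4f}".format(pr_auc_trapezoid))
print("Average precision   : {:.4f}".format(avg_precision))
```

The two summaries are usually close but not identical, since average precision uses a step-wise sum instead of trapezoidal interpolation.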

## Logistic regression

In [None]:
# Scale features

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_val = scaler.transform(x_val)

In [None]:
from sklearn.linear_model import LogisticRegression


clf = LogisticRegression(C=1,random_state=42,class_weight='balanced')
clf.fit(x_train, y_train)
predictions = clf.predict(x_val)
plot_metrics(y_val, predictions)


In [None]:
# Feature importance (coefficients of the linear model)
plt.bar(x.columns, clf.coef_[0])
plt.xticks(rotation=90)



* The model seems to give too much weight to the 'age' attribute.
* The undetected cases all have relatively low ages compared with people who had a stroke (80 yo vs 58 yo).
* If the individual is a child, this helps the model predict the sample as non-stroke (large negative coefficient for work_type = children).
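Because the features were standardized, coefficient magnitudes are roughly comparable, and ranking them by absolute value gives a quick numerical counterpart to the bar plot. A minimal sketch with made-up coefficients and hypothetical feature names:

```python
import numpy as np
import pandas as pd

# Illustrative values only; these are not the notebook's fitted coefficients.
feature_names = ["age", "bmi", "work_type_children"]
coefs = np.array([1.8, 0.1, -0.9])

# Rank features by the absolute size of their coefficient.
ranked = pd.Series(coefs, index=feature_names).sort_values(key=np.abs, ascending=False)
print(ranked)
```

With the fitted model, the series would be built as `pd.Series(clf.coef_[0], index=x.columns)`.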

## Decision tree

In [None]:

dt_clf = DecisionTreeClassifier(random_state=42, class_weight='balanced')
dt_clf.fit(x_train,y_train)
predictions = dt_clf.predict(x_val)
plot_metrics(y_val, predictions)

## SVM

In [None]:
from sklearn.svm import LinearSVC
svm_clf = LinearSVC(C=0.01, class_weight ='balanced')
svm_clf.fit(x_train, y_train)

# Evaluation
predictions = svm_clf.predict(x_val)
plot_metrics(y_val, predictions)


In [None]:
# Feature importance (coefficients of the linear model)
plt.bar(x.columns, svm_clf.coef_[0])
plt.xticks(rotation=90)

## Non linear SVM

In [None]:
from sklearn.svm import SVC
svm_clf = SVC(C=1, class_weight ='balanced', kernel='rbf',gamma='auto')
svm_clf.fit(x_train, y_train)

# Evaluation
predictions = svm_clf.predict(x_val)
plot_metrics(y_val, predictions)


# 8. Gridsearch and cross-validation with SVM
In the previous section, simple models were evaluated with manually chosen parameters on a fixed training/validation split. 
To automate the evaluation process and to get a better estimate of how well the model will perform in practice, we use **GridSearch** and **cross-validation**.

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)


In [None]:
scaler = StandardScaler()
svc = SVC()
svc_pipe = Pipeline(steps=[('scaler', scaler), ('svc', svc)])

In [None]:
parameters = {'svc__kernel':['linear', 'rbf'], 
              'svc__C':[0.1, 1, 10],
             'svc__class_weight' : ['balanced']}


clf = GridSearchCV(svc_pipe, parameters,cv=skf, scoring=['recall','precision'], refit='recall')
clf.fit(x,y)


In [None]:
best_estimator = clf.best_estimator_
print('Best params :{}\nBest CV score(recall) : {}'.format(clf.best_params_,clf.best_score_))
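With several scorers, `cv_results_` holds one mean score column per metric, which makes the recall/precision trade-off visible for every parameter combination. The sketch below runs a reduced grid on synthetic data (all `demo_` names and the data are assumptions for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced data for illustration only.
demo_X, demo_y = make_classification(n_samples=300, weights=[0.85, 0.15], random_state=42)

demo_pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(class_weight="balanced"))])
demo_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
demo_grid = GridSearchCV(
    demo_pipe,
    {"svc__C": [0.1, 1]},
    cv=demo_cv,
    scoring=["recall", "precision"],
    refit="recall",
)
demo_grid.fit(demo_X, demo_y)

# One row per parameter combination, one mean score column per metric.
results = pd.DataFrame(demo_grid.cv_results_)[
    ["param_svc__C", "mean_test_recall", "mean_test_precision"]
]
print(results)
```

Reading recall and precision side by side helps to spot configurations where a small loss in recall buys a large gain in precision.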

* Finally, we end up with a high CV recall of 0.84, traded against accuracy/precision: the model predicts many negative samples as positive. Setting the classification framing aside, a sample predicted positive by our model can also be read as an individual with more than a 50% chance of having a stroke. Thinking in terms of probabilities may be more useful, since the predictions concern people who have not had a stroke yet, so that they can receive preventive treatment according to their risk.
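One way to move from hard labels to risk estimates is to use a model that exposes `predict_proba` and to choose the decision threshold explicitly. The sketch below swaps in a logistic regression on synthetic data instead of the SVC above, and the 0.2 threshold is a hypothetical value, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data for illustration only.
risk_X, risk_y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
rX_train, rX_val, ry_train, ry_val = train_test_split(
    risk_X, risk_y, test_size=0.2, stratify=risk_y, random_state=0
)

# No class weighting here, so the outputs stay closer to actual frequencies.
risk_model = LogisticRegression(max_iter=1000).fit(rX_train, ry_train)

# Estimated probability of the positive class for each validation sample.
risk = risk_model.predict_proba(rX_val)[:, 1]

RISK_THRESHOLD = 0.2  # hypothetical cut-off for flagging someone for follow-up
flagged = risk >= RISK_THRESHOLD
default_flagged = risk >= 0.5

print("Flagged at 0.2 : {}".format(flagged.sum()))
print("Flagged at 0.5 : {}".format(default_flagged.sum()))
```

Lowering the threshold flags more people and raises recall at the cost of precision, which makes the trade-off an explicit decision instead of a side effect of class weighting.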