<img src="https://www.heart.org/-/media/images/news/2019/october-2019/1017strokeptsd_sc.jpg" alt="drawing" height="600" width="600"/>

# **A stroke is a medical condition in which poor blood flow to the brain causes cell death.**
### There are two main types of stroke: ischemic, due to lack of blood flow, and hemorrhagic, due to bleeding. Both cause parts of the brain to stop functioning properly. Signs and symptoms of a stroke may include an inability to move or feel on one side of the body, problems understanding or speaking, dizziness, or loss of vision to one side. Signs and symptoms often appear soon after the stroke has occurred. If symptoms last less than one or two hours, the stroke is a transient ischemic attack (TIA), also called a mini-stroke. A hemorrhagic stroke may also be associated with a severe headache. The symptoms of a stroke can be permanent. Long-term complications may include pneumonia and loss of bladder control.

### **In this notebook I will analyze the dataset to find which factors increase the probability of a stroke and build a model that classifies patients automatically.**


## Please upvote my work if you find it helpful. Happy reading :)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as skl

%matplotlib inline

In [None]:
data = pd.read_csv('/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
data.head()

In [None]:
data.info()

In [None]:
data.describe()

### The proportion of records with a stroke seems to be very low. Before training the model we will have to address this imbalance, otherwise the model will be skewed towards non-stroke patients.
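
The imbalance can be quantified directly instead of being read off `describe()`. The sketch below uses a small hypothetical frame (`toy`, not the real dataset) with a single `stroke` column; on the notebook's `data` the same two calls should apply.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; only the 'stroke' column matters here.
toy = pd.DataFrame({'stroke': [0] * 19 + [1]})

# Absolute counts and relative shares of each class.
class_counts = toy['stroke'].value_counts()
class_share = toy['stroke'].value_counts(normalize=True)

print(class_counts)
print(class_share)
```

A share far below 0.5 for the positive class is the signal that plain accuracy will be misleading later on.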

### Let's go through all categorical columns to see if everything looks correct

In [None]:
def present_categorical(col):    
    print(data[col].value_counts())
    data[col].value_counts().plot.bar()

In [None]:
present_categorical('gender')

### "Other" appears in a single record. Let's drop it, because one example is too few to learn anything from

In [None]:
data = data[data['gender'] != 'Other']

In [None]:
present_categorical('ever_married')

In [None]:
present_categorical('work_type')

In [None]:
present_categorical('Residence_type')

In [None]:
present_categorical('smoking_status')

### Now that we know how the categorical data is distributed, let's see how each column relates to stroke

In [None]:
def cat_plot(x):
    sns.catplot(data=data, x=x, hue='stroke',kind='count')

In [None]:
cat_plot('gender')

In [None]:
cat_plot('ever_married')

In [None]:
cat_plot('work_type')

In [None]:
cat_plot('Residence_type')

In [None]:
cat_plot('smoking_status')

### The only column that seems to have a clear relationship with stroke is ever_married
### But generally it is hard to spot anything because of the dataset imbalance
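
One way to see past the imbalance is to compare the stroke rate within each category rather than raw counts. This is a minimal sketch on a hypothetical toy frame; `stroke_rate` is a helper name introduced here for illustration and is not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data; the real notebook would pass `data` instead.
toy = pd.DataFrame({
    'ever_married': ['Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No'],
    'stroke':       [1,     0,     1,     0,     0,    0,    0,    1],
})

def stroke_rate(frame, col):
    """Share of stroke cases within each category of `col`, highest first."""
    return frame.groupby(col)['stroke'].mean().sort_values(ascending=False)

rates = stroke_rate(toy, 'ever_married')
print(rates)
```

Because `stroke` is coded 0/1, the group mean is the fraction of patients with a stroke in that group, which stays comparable even when the groups differ a lot in size.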


### Let's plot a scatter matrix. To improve readability we first remove the binary columns

In [None]:
from pandas.plotting import scatter_matrix
cols = data.columns
cols = cols.drop(['hypertension', 'heart_disease'])

_ = scatter_matrix(data[cols], alpha=0.1, figsize=(14,14), hist_kwds={'bins': 30})

### Let's check the relationship for the numerical features as well

In [None]:
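def plot_face_grid(x):
    # One density plot per stroke class, side by side.
    g = sns.FacetGrid(data, col='stroke', height=6)
    # `fill=True` replaces the deprecated `shade=True` in recent seaborn versions.
    g.map(sns.kdeplot, x, fill=True).add_legend()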

In [None]:
plot_face_grid('age')

In [None]:
plot_face_grid('bmi')

In [None]:
plot_face_grid('avg_glucose_level')

### avg_glucose_level and age seem to have the most significant impact

### Before creating the model, let's drop all rows with NaN values. The dataset is small, but there are not many of them. Let's drop the ids too, since we won't need them for predictions

In [None]:
data.dropna(inplace=True)
data.drop(['id'], axis=1, inplace=True)

### Let's see the correlations between features

In [None]:
fig, ax = plt.subplots(figsize=(10,5)) 
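# numeric_only=True skips the string columns, which recent pandas versions refuse to correlate.
sns.heatmap(data.corr(numeric_only=True), annot=True, ax=ax)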

### Let's split data into train, test and val set

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, train_size=0.8, test_size=0.2, random_state=123)
val, test = train_test_split(test, train_size=0.5, test_size=0.5, random_state=123)
train_y = train['stroke']
test_y = test['stroke']
val_y = val['stroke']

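# Reassign instead of dropping in place: the splits are slices of `data`,
# and in-place edits on slices can trigger SettingWithCopyWarning.
train = train.drop(columns=['stroke'])
test = test.drop(columns=['stroke'])
val = val.drop(columns=['stroke'])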
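
With so few positive cases, a purely random split can leave the smaller sets with very few strokes. A possible variation, not what this notebook does, is to pass `stratify` to `train_test_split` so that each split keeps roughly the same stroke share. The sketch uses a hypothetical toy frame with 10% positives.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 10 of them positive.
toy = pd.DataFrame({
    'age': np.arange(100),
    'stroke': [1] * 10 + [0] * 90,
})

X = toy.drop(columns=['stroke'])
y = toy['stroke']

# 80% train, then the remaining 20% is halved into val and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.8, random_state=123, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, train_size=0.5, random_state=123, stratify=y_rest)

print(y_train.mean(), y_val.mean(), y_test.mean())
```

All three printed shares should sit close to the overall positive rate, which makes the metrics on the small held-out sets less dependent on the luck of the split.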

In [None]:
# Select columns by the dtypes of `train` itself, so the mask always matches its columns.
cat_cols = train.select_dtypes(exclude='number').columns
num_cols = train.select_dtypes(include='number').columns

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

num_pipeline = Pipeline([
        ('std_scaler', StandardScaler())
    ])

cat_pipeline = Pipeline([
        ('one_hot', OneHotEncoder(handle_unknown='ignore'))
    ])

full_pipeline = ColumnTransformer([
        ('num', num_pipeline, num_cols),
        ('cat', cat_pipeline, cat_cols)
    ])
    

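# Fit the preprocessing on the training set only, then reuse the fitted
# transformer so that test and val are scaled and encoded the same way.
train = full_pipeline.fit_transform(train, train_y)
test = full_pipeline.transform(test)
val = full_pipeline.transform(val)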
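
It is worth checking that held-out data is encoded with the statistics learned from the training data. The following sketch, on hypothetical toy frames, fits a `ColumnTransformer` once and then only calls `transform` on the second frame; `sparse_threshold=0` is set here just to force a dense array for easy inspection.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frames; 'Never_worked' appears only in the held-out frame.
train_toy = pd.DataFrame({'age': [20.0, 40.0, 60.0],
                          'work_type': ['Private', 'Govt_job', 'Private']})
test_toy = pd.DataFrame({'age': [40.0, 80.0],
                         'work_type': ['Private', 'Never_worked']})

prep = ColumnTransformer([
    ('num', StandardScaler(), ['age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['work_type']),
], sparse_threshold=0)

train_arr = prep.fit_transform(train_toy)  # learns mean/std and the category list
test_arr = prep.transform(test_toy)        # reuses them, learns nothing new

print(train_arr.shape, test_arr.shape)
print(test_arr)
```

Both arrays have the same number of columns, the age of 40 maps to 0 because it equals the training mean, and the unseen category becomes an all-zero one-hot row. Refitting on the held-out frame would instead produce a different scale and possibly a different column layout.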

In [None]:
train.shape

### We can see that only 5% of the data shows patients who had a stroke. This is a clear imbalance which will not allow the model to learn properly. To avoid that I will try a couple of methods (undersampling and oversampling) to reduce the problem.
### Let's check which method works best with RandomForestClassifier
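
To make the idea of undersampling concrete, here is a hand-rolled illustration in plain NumPy. It is not the imbalanced-learn implementation used in this notebook; `random_undersample` is a hypothetical helper that keeps every minority row and an equally sized random subset of the larger class.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Hand-rolled illustration: shrink every class to the size of the smallest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    # Draw n_keep row indices per class without replacement.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()
    return X[keep], y[keep]

# Hypothetical toy data: 18 negatives and 2 positives.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 18 + [1] * 2)

X_bal, y_bal = random_undersample(X_toy, y_toy)
print(np.bincount(y_bal))
```

Oversampling methods such as SMOTE go the other way and synthesize extra minority rows, so they keep all the majority data at the cost of adding artificial points.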

In [None]:
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN, SVMSMOTE
from imblearn.under_sampling import NearMiss, RandomUnderSampler, AllKNN, NeighbourhoodCleaningRule

equalizers = [
    SMOTE(),
    BorderlineSMOTE(),
    ADASYN(),
    SVMSMOTE(),
    NearMiss(),
    RandomUnderSampler(),
    AllKNN(),
    NeighbourhoodCleaningRule()
]

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

def train_and_evaluate(model, train, train_y, test, test_y, eq=None, train_model=True, threashold=0.5):
    if train_model:
        model.fit(train, train_y)
    
    results = model.predict_proba(test)
    
    proba = results[:,1]
    results = (results[:,1] > threashold).astype(int)
    
    print('/'*80)
    print(model)
    if eq is not None:
        print(eq)
    print()
    print('confusion_matrix')
    print(confusion_matrix(test_y, results))
    print('roc_auc')
    print(roc_auc_score(test_y, proba))
    print(classification_report(test_y, results))
    
    return proba

In [None]:
for eq in equalizers:
    model = RandomForestClassifier(random_state=1234)
    # np.asarray avoids Series.ravel(), which is deprecated in recent pandas versions.
    train_eq, train_y_eq = eq.fit_resample(train, np.asarray(train_y))
    train_and_evaluate(model, train_eq, train_y_eq, test, test_y, eq)

### As we can see, RandomUnderSampler seems to work best (it maximizes the recall for stroke) keeping 

In [None]:
eq = RandomUnderSampler()
# np.asarray avoids Series.ravel(), which is deprecated in recent pandas versions.
train, train_y = eq.fit_resample(train, np.asarray(train_y))
train.shape

### Let's quickly go through a couple of models and pick the best 2 or 3 of them, then try to improve the results with various hyperparameters. We are going to try to maximize the roc_auc score

In [None]:
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC, SVC
from xgboost.sklearn import XGBClassifier
from lightgbm.sklearn import LGBMClassifier



from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.model_selection import cross_validate

np.random.seed(1234)

In [None]:
models = [
    (AdaBoostClassifier(), 'AdaBoost'),
    (RandomForestClassifier(), 'RandomForest'),
    (ExtraTreesClassifier(), 'ExtraTreesClassifier'),
    (LogisticRegression(), 'LogisticRegression'),
    (KNeighborsClassifier(), 'KNeighbors'),
    (SVC(probability=True), 'SVC'),
    (XGBClassifier(use_label_encoder=False), 'XGB'),
    (LGBMClassifier(), 'LGBM')
]

def print_scores(scores, model_name):
    print(model_name)
    print()
    print(scores)
    print("mean: {}".format(scores.mean()))
    print("std: {}".format(scores.std()))
    print()
    print()

In [None]:
for model, name in models:
    train_and_evaluate(model, train, train_y, test, test_y)

### It seems that logistic regression obtains the best results, and that roc_auc is a good metric for comparing models
### Let's run the same comparison with cross-validation to confirm which models are the best

In [None]:
scores = []
scoring = ['roc_auc', 'balanced_accuracy']
for model, name in models:
    score = cross_validate(model, train, train_y, cv=5, scoring=scoring)
    scores.append((score['test_roc_auc'], name))    

In [None]:
for score, name in scores:
    print_scores(score, name)
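
The loop above requests two metrics from `cross_validate` but keeps only ROC AUC. As a possible variation, both can be collected into one table for easier comparison. This sketch runs on synthetic data from `make_classification` with two stand-in models; the names `toy_models`, `rows` and `summary` are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the resampled training data.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_models = [
    (LogisticRegression(), 'LogisticRegression'),
    (DecisionTreeClassifier(random_state=0), 'DecisionTree'),
]
scoring = ['roc_auc', 'balanced_accuracy']

rows = []
for model, name in toy_models:
    cv = cross_validate(model, X_toy, y_toy, cv=5, scoring=scoring)
    # With a list of scorers, results are keyed as 'test_<scorer name>'.
    rows.append({
        'model': name,
        'roc_auc_mean': cv['test_roc_auc'].mean(),
        'roc_auc_std': cv['test_roc_auc'].std(),
        'balanced_accuracy_mean': cv['test_balanced_accuracy'].mean(),
    })

summary = pd.DataFrame(rows).sort_values('roc_auc_mean', ascending=False)
print(summary)
```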

### As we can see, the best result was obtained by logistic regression, second by SVC, and third by KNN.
### Let's take these models and try to improve the score as much as we can. Instead of logistic regression we will try to improve the 4th model, RandomForest, because logistic regression has few hyperparameters to tune (mainly the regularization strength).

In [None]:
parameters = [
    {
    'C': [0.01, 0.5, 1, 2, 5, 10],
    'kernel' : ['poly'],
    'degree' : [2,3],
    'gamma': ['scale', 'auto'],
    'coef0': [0.5, 1, 2, 3],
    'class_weight': ['balanced', None]    
    },
    {
    'C': [0.01, 0.5, 1, 2, 5, 10],
    'kernel' : ['rbf', 'sigmoid'],
    'gamma': ['scale', 'auto'],
    'class_weight': ['balanced', None]    
    },
    {
    'C': [0.01, 0.5, 1, 2, 5, 10],
    'kernel' : ['linear'],
    'class_weight': ['balanced', None] 
    }
]

model = SVC(probability=True)
grid_search = GridSearchCV(model,
                           param_grid=parameters,
                           cv=5,
                           scoring='roc_auc',
                           refit='roc_auc',
                           )

r = grid_search.fit(train, train_y)
scores = r.cv_results_
svc = r.best_estimator_

In [None]:
max(scores['mean_test_score'])

In [None]:
for mean_score, params in sorted(list(zip(scores["mean_test_score"], scores["params"])),key = lambda x: x[0]):
     print(mean_score, params)

In [None]:
parameters = [
    {
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'n_neighbors': [1, 3, 5, 7],
    },
]

model = KNeighborsClassifier()
grid_search = GridSearchCV(model,
                           param_grid=parameters,
                           cv=5,
                           scoring='roc_auc',
                           refit='roc_auc',
                           )

r = grid_search.fit(train, train_y)
scores = r.cv_results_
knn = r.best_estimator_

In [None]:
max(scores['mean_test_score'])

In [None]:
for mean_score, params in sorted(list(zip(scores["mean_test_score"], scores["params"])),key = lambda x: x[0]):
     print(mean_score, params)

In [None]:
parameters = [
    {
    'n_estimators': [10, 50, 100, 200],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 2, 4, 8],
    'min_samples_split': [2, 4, 8],
    'min_samples_leaf': [1, 2, 4],
    },
]

model = RandomForestClassifier()
grid_search = GridSearchCV(model,
                           param_grid=parameters,
                           cv=5,
                           scoring='roc_auc',
                           refit='roc_auc',
                           )

r = grid_search.fit(train, train_y)
scores = r.cv_results_
forest = r.best_estimator_

In [None]:
max(scores['mean_test_score'])

In [None]:
for mean_score, params in sorted(list(zip(scores["mean_test_score"], scores["params"])),key = lambda x: x[0]):
     print(mean_score, params)

In [None]:
forest_proba = train_and_evaluate(forest, train, train_y, test, test_y, train_model=False)
knn_proba = train_and_evaluate(knn, train, train_y, test, test_y, train_model=False)
svc_proba = train_and_evaluate(svc, train, train_y, test, test_y, train_model=False)

In [None]:
logistic_reg = LogisticRegression()
logistic_reg.fit(train, train_y)
lr_proba = train_and_evaluate(logistic_reg, train, train_y, test, test_y, train_model=False)

### Now that we have our models, let's find the best threshold for predictions. I will use the F1 score as the scoring metric

In [None]:
from sklearn.metrics import f1_score

def test_threshold(probas, test_y):
    results = []
    for i in range(20, 70):
        result = (probas > i / 100).astype(int)
        results.append((f1_score(test_y, result), i / 100))
    return sorted(results, key=(lambda x : x[0]), reverse=True)
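
To see what such a threshold scan does, here is a small worked example on hypothetical labels and probabilities. It covers the same 0.20 to 0.69 range and picks the threshold with the highest F1 score; the arrays and variable names are made up for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical true labels and predicted stroke probabilities.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
proba = np.array([0.10, 0.25, 0.30, 0.45, 0.40, 0.55, 0.65, 0.90])

# Same grid as range(20, 70) divided by 100.
thresholds = np.arange(20, 70) / 100
scores = [f1_score(y_true, (proba > t).astype(int)) for t in thresholds]

best_idx = int(np.argmax(scores))  # first threshold reaching the maximum
best_threshold, best_f1 = thresholds[best_idx], scores[best_idx]
print(best_threshold, best_f1)
```

Raising the threshold trades recall for precision, so the best F1 lands where dropping false positives no longer costs true positives. A threshold chosen this way should be reported on data that was not used to choose it.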

In [None]:
forest_best_f_score = test_threshold(forest_proba, test_y)[0]
svc_best_f_score = test_threshold(svc_proba, test_y)[0]
knn_best_f_score = test_threshold(knn_proba, test_y)[0]
lr_best_f_score = test_threshold(lr_proba, test_y)[0]

In [None]:
train_and_evaluate(forest, train, train_y, test, test_y, train_model=False, threashold=forest_best_f_score[1])
train_and_evaluate(knn, train, train_y, test, test_y, train_model=False, threashold=knn_best_f_score[1])
train_and_evaluate(svc, train, train_y, test, test_y, train_model=False, threashold=svc_best_f_score[1])
_ = train_and_evaluate(logistic_reg, train, train_y, test, test_y, train_model=False, threashold=lr_best_f_score[1])


## Now let's try the 2 best models on the val data to evaluate final accuracy and recall

In [None]:
_ = train_and_evaluate(svc, train, train_y, val, val_y, train_model=False, threashold=svc_best_f_score[1])

In [None]:
_ = train_and_evaluate(logistic_reg, train, train_y, val, val_y, train_model=False, threashold=lr_best_f_score[1])

# Summary

## The best model turned out to be SVC.
## Final accuracy is around 85% and the model recognizes around 72% of strokes. 

### The biggest problem with this task is the small, imbalanced dataset. It contains only around 200 positive examples. To improve the result, the first step would be to collect more data. Assuming a bigger dataset is available, some feature engineering seems to be a good idea. Trying k-means before classification could work too. I could also try to impute the missing values instead of dropping the rows that contain NaN.
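
As a rough sketch of the imputation idea, `SimpleImputer` can fill missing numeric values with the column median instead of dropping rows. The toy frame is hypothetical, and placing the missing value in `bmi` is an assumption made for illustration; `data.info()` shows which columns are actually affected.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing bmi value (assumed column).
toy = pd.DataFrame({'bmi': [22.0, np.nan, 30.0, 26.0],
                    'age': [40.0, 55.0, 63.0, 71.0]})

# Median is less sensitive to outliers than the mean.
imputer = SimpleImputer(strategy='median')
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

In a real run the imputer would be fitted on the training split only, for example as the first step of the numeric pipeline, so that no information from the held-out sets leaks into the fill value.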

## Please let me know your thoughts, since this is my first public notebook. All corrections and hints are welcome. 