In this notebook, I:
   - Visualize the effect of each variable on stroke
   - Build models to predict stroke from the predictors

The main problem with this dataset is that the target class (stroke) is highly imbalanced. Methods like SMOTE and adjusting the decision threshold can help us deal with this problem.

In [None]:
!pip install seaborn --upgrade

Let's import the fundamental modules.

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import os
import warnings

warnings.filterwarnings('ignore')
sns.set()
%matplotlib inline
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
data = pd.read_csv('/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv').drop(['id'], axis=1)
data.head()

# 1) Exploratory Data Analysis

In [None]:
data.info()

In [None]:
# Create new columns for visualization
data['Stroke?'] = data['stroke']==1
data['Hypertension?'] = data['hypertension']==1
data['Heart Disease'] = data['heart_disease']==1

# Declare size of figures
my_size = {'width':800, 'height':500}

## 1.1) Age

In [None]:
data['gender'].value_counts()

In [None]:
fig = px.histogram(data, x='age',
                   nbins=20, 
                   title='Age distribution', 
                   color_discrete_sequence=px.colors.qualitative.Antique,
                   marginal='box', 
                   color='Stroke?',
                   **my_size,)

fig.update_layout(bargap=0.1)

We clearly see that most of the people who had a stroke are elderly.

## 1.2) Disease record : Hypertension, heart disease

In [None]:
temp = pd.pivot_table(
            data,
            values = ['stroke'],
            index = ['hypertension'],
            columns = ['heart_disease'],
            aggfunc = {'stroke':['count','mean']}
        )

temp.columns = temp.columns.set_levels(['No', 'Yes'], level=2)
temp.index = pd.Index(['No','Yes'], name='Hypertension')

temp.style.set_properties(**{'background-color': 'khaki','border-color': 'white'},subset=[('stroke','mean','No'),('stroke','mean','Yes')])

In [None]:
px.imshow(
    temp.loc[:,('stroke','mean')],
    labels = dict(color='Stroke'),
    title = 'Stroke probabilities',
    color_continuous_scale = px.colors.sequential.Redor,
    **my_size
)

We see that people who have both heart disease and hypertension are the most likely to have a stroke. On the other hand, people who have neither condition tend not to have a stroke either.

## 1.3) Personal information

In this section, we'll look into the effect of marital status, work type, and residence type on stroke.

In [None]:
def get_quick_report(feature):
    temp = pd.pivot_table(
                    data,
                    values = 'stroke',
                    index = feature,
                    aggfunc = ['sum','count','mean']
                )
    temp.columns = pd.MultiIndex.from_arrays([['Stroke','Stroke','Stroke'],['sum','count','mean']])
    
    return temp

    
temp_married = get_quick_report('ever_married')
temp_work = get_quick_report('work_type')
temp_residence = get_quick_report('Residence_type')

In [None]:
def one_to_many(index):
    out = []
    for i in index.values:
        out.append((index.name, i))
    return out

In [None]:
temp = pd.concat([temp_married, temp_work, temp_residence], axis=0)

arr = one_to_many(temp_married.index) + one_to_many(temp_work.index) + one_to_many(temp_residence.index)

temp.index = pd.MultiIndex.from_tuples(arr)
temp.style.background_gradient(sns.light_palette('darkorange',as_cmap=True), subset=[('Stroke','mean')])

## 1.4) Health information

In this section, we'll look into the effect of smoking status, BMI, gender, and glucose level on stroke.

In [None]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(
    shared_yaxes =True,
    rows=1, cols=2,
    horizontal_spacing = 0.02,
    subplot_titles = ("Average Glucose level", "Body mass index")
)

for i in [0,1]:
    if i == 0:
        name = 'No'
        color = 'rgb(217,175,107)'
        group = 'g_No'
    else:
        name = 'Yes'
        color = 'rgb(204,80,62)'
        group = 'g_Yes'
        
    fig.add_trace(
        go.Histogram(
            x = data[data['stroke']==i]['avg_glucose_level'],
            nbinsx  = 50,
            legendgroup = group,
            name = name,
            marker = dict(color=color),
            showlegend = False
        ),
        row=1, col=1,
    )
    
    fig.add_trace(
        go.Histogram(
            x = data[data['stroke']==i]['bmi'],
            nbinsx  = 50,
            legendgroup = group,
            name = name,
            marker = dict(color=color)
        ),
        row=1, col=2
    )

fig.update_layout(barmode='overlay', bargap=0)
fig.update_xaxes(row=1, col=1, title_text='Glucose level')
fig.update_xaxes(row=1, col=2, title_text='BMI')
fig.update_yaxes(row=1, col=1, title_text='count')
fig.update_layout(legend_title_text='Stroke')

fig.show()

We don't see any clear relation between glucose level or BMI and stroke. It seems that people can have a stroke at any level of glucose and BMI.
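
One caveat with overlaid count histograms is that the stroke group is far smaller, so its shape is hard to read next to the majority group. A minimal sketch of comparing normalized histograms instead, on made-up BMI-like values (the arrays `bmi_no` and `bmi_yes` are synthetic and are not taken from the dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: a large "no stroke" group and a small "stroke" group
bmi_no = rng.normal(28, 6, size=4800)
bmi_yes = rng.normal(30, 6, size=200)

# Same bin edges for both groups so the bars are comparable
bins = np.linspace(10, 50, 21)
widths = np.diff(bins)

# density=True rescales each histogram so that its area is 1,
# which removes the effect of the very different group sizes
dens_no, _ = np.histogram(bmi_no, bins=bins, density=True)
dens_yes, _ = np.histogram(bmi_yes, bins=bins, density=True)

print('area (no stroke):', (dens_no * widths).sum())
print('area (stroke)   :', (dens_yes * widths).sum())
```

In the Plotly figure above, the analogous option would be the `histnorm` argument of `go.Histogram` (for example `histnorm='probability density'`); whether it changes the conclusion here would need to be checked on the real data.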

In [None]:
temp = data.groupby(by='smoking_status')['Stroke?'].agg('mean')*100

fig = make_subplots(
    subplot_titles = ["Smoke and stroke"],
    specs=[[{"secondary_y": True}]]
)

for i in [0,1]:
    
    if i == 0:
        name = 'No'
        color = 'rgb(217,175,107)'
    else:
        name = 'Yes'
        color = 'rgb(204,80,62)'
        
    
    fig.add_trace(
        go.Histogram(x=data[data['stroke']==i]['smoking_status'], 
                     name=name, 
                     marker = dict(color=color)),
        secondary_y=False,
    )

fig.add_trace(
    go.Scatter(x=temp.index, 
               y=temp.values, 
               name="Average", 
               mode='markers', 
               marker=dict(size=20, color='royalblue')),
    secondary_y=True,
)

fig.update_layout(legend_title_text='Stroke', **my_size)
fig.update_yaxes(title_text='count', secondary_y=False)
fig.update_yaxes(title_text='% stroke', secondary_y=True)

We see that people who have ever smoked (formerly or currently) have a relatively high chance of having a stroke.

In [None]:
px.histogram(
    data[data['gender']!='Other'],  # Because it has only 1 observation
    x = 'gender',
    color = 'Stroke?',
    barmode = 'group',
    color_discrete_sequence = px.colors.qualitative.Antique,
    title = 'Gender and stroke',
    **my_size
)

# 2) Preprocess the data

## - Missing values, Standardization, Encoding <br>
Let's import the dataset again.

In [None]:
data = pd.read_csv('/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv').drop(['id'], axis=1)
data.head(3)

In [None]:
sns.countplot(data=data, x='stroke')
plt.show()

We see that the dataset is very imbalanced. I'll do my best to deal with it later. <br>
But first, let's see how many missing values there are.

In [None]:
data.isnull().sum()

There are missing values only in the bmi column. I'll fill them with the median. <br>
Next, I'll preprocess the data and fit the models. Let's import the relevant classes.

In [None]:
# Preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
import xgboost

# Evaluation
from sklearn.metrics import classification_report, confusion_matrix

I'll rearrange the columns (categorical first, then numeric) so that we can easily track the column indices for ColumnTransformer in the next step.

In [None]:
X = data.drop(['stroke'],axis=1)
Y = data['stroke']

X_category = X.select_dtypes(include='object')
X_numeric = X.select_dtypes(exclude='object')

X = pd.concat([X_category, X_numeric], axis=1)

I'll use sklearn.pipeline.Pipeline to transform the numerical columns sequentially: imputing followed by scaling. Then I'll pass this pipeline, along with OneHotEncoder, to ColumnTransformer to do the preprocessing.

Of course, we have to split the data into a train set and a test set. Then we fit the ColumnTransformer on the train set only and use it to transform both sets.
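
To illustrate the fit-on-train, transform-both pattern in isolation, here is a minimal sketch on a toy frame. The values, the names `toy_train`, `toy_test`, `toy_ct`, and the choice to select columns by name are all assumptions for illustration; `handle_unknown='ignore'` and `sparse_threshold=0` are added here so the toy example handles an unseen category and always returns a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (made-up values), not the stroke dataset
toy_train = pd.DataFrame({
    'work_type': ['Private', 'Govt_job', 'Private', 'Self-employed'],
    'bmi': [22.0, np.nan, 31.0, 27.0],
})
toy_test = pd.DataFrame({
    'work_type': ['Private', 'Never_worked'],  # 'Never_worked' never appears in train
    'bmi': [np.nan, 25.0],
})

# Numeric columns: fill missing values with the median, then standardize
numeric = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

toy_ct = ColumnTransformer(
    transformers=[
        ('encode', OneHotEncoder(handle_unknown='ignore'), ['work_type']),
        ('numeric', numeric, ['bmi']),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Categories, median, mean and std are all learned from the train set only
toy_ct.fit(toy_train)
toy_train_t = toy_ct.transform(toy_train)
toy_test_t = toy_ct.transform(toy_test)

print(toy_test_t)
```

The missing test value is filled with the train median, and the unseen category becomes an all-zero one-hot row, so nothing about the test set leaks into the fitted transformer.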

In [None]:
# Building the preprocessing pipeline
imp_std = Pipeline(
    steps=[
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]
)

ct = ColumnTransformer(
    remainder='passthrough',
    transformers = [
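        # handle_unknown='ignore': a category absent from the train split (e.g. the single 'Other' gender) won't break transform
        ("Encoding",OneHotEncoder(handle_unknown='ignore'),[0,1,2,3,4]),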
        ("Scaler", imp_std,[5,6,7,8,9])
    ]
)


# Split the data
X_train_idle, X_test_idle, y_train, y_test = train_test_split(X, Y, 
                                                              test_size=0.2, 
                                                              stratify=Y)

# Fit our transformers to train set
ct.fit(X_train_idle)

# Transform both train and test set
X_train = ct.transform(X_train_idle)
X_test = ct.transform(X_test_idle)

Because this dataset is highly imbalanced, on my first run the models performed very well at predicting the majority class (0: not having a stroke) but very poorly on the minority class. So, I'll try applying SMOTE to oversample the train set in the hope that the models can learn more effectively.
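
For intuition, the core idea of SMOTE can be sketched in a few lines of NumPy: each synthetic point is placed on the segment between a minority sample and one of its nearest minority neighbours. The helper `smote_like` below is hypothetical and heavily simplified; it is not the imblearn implementation.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Hypothetical helper: create n_new synthetic points by interpolating
    between minority samples and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority samples
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a random minority point and one of its k neighbours
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Random position on the segment between the two points
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[picked] - X_min[base])

# Made-up minority samples in two dimensions
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_new = smote_like(X_min, n_new=10)
print(X_new)
```

One consequence of interpolating is that one-hot columns can take fractional values in the synthetic rows. imblearn also provides `SMOTENC` for mixed categorical and numeric data, which may be worth trying here.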

In [None]:
from imblearn.over_sampling import SMOTE

X_train_resampled, y_train_resampled = SMOTE().fit_resample(X_train, y_train)

# 3) Building models

Building models with mostly default parameters (the decision tree and random forest also weight the stroke class twice as heavily).

In [None]:
models = dict()
models['Decision Tree'] = DecisionTreeClassifier(class_weight={0:1,1:2})
models['Random Forest'] = RandomForestClassifier(class_weight={0:1,1:2})
models['Logreg'] = LogisticRegression()
models['GradientBoost'] = GradientBoostingClassifier()
models['AdaBoost'] = AdaBoostClassifier()
models['XGBoost'] = xgboost.XGBClassifier()

Fit the models to the resampled train set.

In [None]:
for model in models:
    models[model].fit(X_train_resampled, y_train_resampled)
    print(model + ' : fit')

See the performance on the train set.

In [None]:
print("Train set prediction")
for x in models:
        
    print('------------------------'+x+'------------------------')
    model = models[x]
    y_train_pred = model.predict(X_train_resampled)
    arg_train = {'y_true':y_train_resampled, 'y_pred':y_train_pred}
    print(confusion_matrix(**arg_train))
    print(classification_report(**arg_train))

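The performance on the train set is (too) good. Part of the reason is SMOTE: the models train on a perfectly balanced dataset, and flexible models such as unpruned trees can fit it almost perfectly, so these scores say little about unseen data. <br>
Next, let's see the performance on the test set.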

In [None]:
print("Test set prediction")
for x in models:
        
    print('------------------------'+x+'------------------------')
    model = models[x]
    y_test_pred = model.predict(X_test)
    arg_test = {'y_true':y_test, 'y_pred':y_test_pred}
    print(confusion_matrix(**arg_test))
    print(classification_report(**arg_test))

The metric I care about most is **"Recall"** rather than accuracy, because I want to avoid a situation like the following: <br>
    - "A person is very likely to have a stroke but the model says he/she isn't"

That is a very bad situation. The model makes this kind of error when it has low recall (a high False Negative rate). <br>
The False Positive situation (the model says that a person will have a stroke but he/she actually doesn't) is not that bad compared to the first one. In that case, the person will just take extra care of his/her health.
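
To make the terms concrete, here is a small sketch with made-up labels (not model output from this notebook). A missed stroke is a false negative and lowers recall; a false alarm is a false positive and lowers precision.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 4 real strokes, 6 non-strokes
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)      # share of real strokes that were caught
precision = tp / (tp + fp)   # share of stroke predictions that were right

print('TN, FP, FN, TP :', tn, fp, fn, tp)
print('recall         :', recall)
print('precision      :', precision)
```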

Judging from the models' classification reports, I would say that the logistic regression model has done the best job here. <br>
**Note:** Furthermore, we can try **tuning the models' hyperparameters** or **adjusting the probability threshold** to improve their performance. (*I'll do that in the next update*)
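
As a rough sketch of what threshold adjustment could look like, the example below uses synthetic imbalanced data from `make_classification` rather than the stroke dataset. The names `X_demo`, `y_demo`, `clf` and the thresholds are assumptions for illustration. Instead of calling `predict` (which cuts at 0.5), we cut the predicted probability ourselves.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positives
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

results = {}
for threshold in (0.5, 0.3, 0.1):
    # Predict positive whenever the probability reaches the threshold
    y_hat = (proba >= threshold).astype(int)
    results[threshold] = (
        recall_score(y_te, y_hat),
        precision_score(y_te, y_hat, zero_division=0),
    )
    print(f'threshold={threshold:.1f}  recall={results[threshold][0]:.3f}  '
          f'precision={results[threshold][1]:.3f}')
```

Lowering the threshold can only keep or increase recall, usually at the cost of precision. To avoid tuning on the test set, the threshold would ideally be chosen on a separate validation split.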

Lastly, let's look at the ROC curves to compare the performance of the different models.

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

fig, ax = plt.subplots()
fig.set_size_inches(13,6)

for m in models:
    y_pred = models[m].predict_proba(X_test)
    fpr, tpr, _ = roc_curve(y_test, y_pred[:,1].ravel())
    plt.plot(fpr,tpr, label=m)
plt.xlabel('False-Positive rate')
plt.ylabel('True-Positive rate')
plt.legend()
plt.show()

In [None]:
print('roc_auc_score')
for i in models:
    model = models[i]
    print(i + ' : ',roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]).round(4))