# Stroke Prediction EDA and Prediction with Tree-based methods
We explore, analyse and process the given dataset followed by model design, training and tuning using Tree-based machine learning algorithms.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
df.head(10)

## Initial Exploration of Dataset

In [None]:
df.info()

We have 5110 samples and 12 columns. Of these, `id` is an identifier, `stroke` is our label column (response), and the remaining 10 are feature columns. The column `id` can be removed and stored as a `Series` object in case we require it later.

In [None]:
patient_id = df.pop('id')
df.describe()

Among our predictors, we have 3 numerical features: `age`, `avg_glucose_level` and `bmi`. The remaining features, along with the label, are categorical and should be converted to the `category` datatype, as done below.

In [None]:
cat_cols = ['gender', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'smoking_status', 'stroke']
num_cols = ['age', 'avg_glucose_level', 'bmi']

df[cat_cols] = df[cat_cols].astype('category')
df.info()

In [None]:
for column in df.columns:
    if df[column].dtype.name == 'category':
        print(column,": ")
        print(df[column].value_counts())
        print()

#### Observations:
* The `age` feature looks fairly evenly distributed, but `avg_glucose_level` and `bmi` show signs of heavy right tails: their maximum values lie far above their third quartiles.
* The minimum `age` value of 0.08 and the maximum `bmi` of 97 are suspicious.
* There is only one sample with `gender=Other`. This may result in our final model not performing adequately for similar samples when used later with unseen data. It might create further issues depending on which set (training or test) it gets split into. For our case, we will keep it, but in real-world applications such an event requires discussion with the data source or a domain expert.
* Our binary medical feature columns (`hypertension` and `heart_disease`) have many more `0`s than `1`s.
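
The quartile-versus-maximum comparison in the first observation can be made explicit with a skewness estimate. The following is a minimal sketch on synthetic data, not on the stroke dataset; the column names `symmetric` and `right_tailed` and the `upper_gap_in_iqr` measure are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins: one evenly spread column, one with a heavy right tail
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.uniform(0, 80, size=2000),
    "right_tailed": rng.lognormal(mean=4.5, sigma=0.4, size=2000),
})

summary = toy.describe().T
# Distance from the third quartile to the maximum, measured in interquartile ranges
summary["upper_gap_in_iqr"] = (summary["max"] - summary["75%"]) / (summary["75%"] - summary["25%"])
# Sample skewness: close to 0 for symmetric data, positive for a right tail
summary["skew"] = toy.skew()
print(summary[["25%", "75%", "max", "upper_gap_in_iqr", "skew"]])
```

The same two lines can be applied to `df[num_cols]` to put a number on what `df.describe()` only hints at.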

## Exploratory Data Analysis
First, we plot a Pairs Grid as well as the correlation matrix with the numerical columns along with the label.

In [None]:
sns.pairplot(x_vars=num_cols, y_vars=num_cols, hue='stroke', data=df, palette='bright')

In [None]:
df[num_cols].corr()

Now, we plot individual histograms for all the numeric feature columns.

In [None]:
for column in df.columns:
    if df[column].dtype.name != 'category':  # compare strings with !=, not "is not"
        plt.hist(df[column], bins=30, edgecolor='black')
        plt.title(column)
        plt.show()

As stated in the observations of the previous section, we can clearly see that `bmi` and `avg_glucose_level` have heavy-tailed distributions while `age` approximates a Gaussian distribution. If we were using parametric classification algorithms like logistic regression, we would have to normalize these using Z-score normalization or a similar method. But we have chosen to use Tree-based methods, which are non-parametric and split on thresholds, so they depend only on the ordering of the feature values and hence are unaffected.  
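
The claim that trees are unaffected by rescaling can be checked directly. The sketch below uses made-up features and labels (`x1`, `x2` and `y` are hypothetical, not columns of the stroke dataset): a Decision Tree is fitted once on raw values and once on their logarithms, and the training predictions are compared. Because a strictly increasing transform keeps the ordering of the values, both trees partition the training samples in the same way.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Two hypothetical positive-valued features with distinct values
x1 = np.arange(1, 101, dtype=float)
x2 = (np.arange(100) * 37 % 100 + 1).astype(float)
X_raw = np.column_stack([x1, x2])
X_log = np.log(X_raw)  # strictly increasing transform of each feature

# Hypothetical binary label that depends on both features
y = ((x1 > 60) & (x2 > 30)).astype(int)

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_raw, y)
tree_log = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_log, y)

# Each tree is evaluated on the representation it was trained on
same = np.array_equal(tree_raw.predict(X_raw), tree_log.predict(X_log))
print("Identical training predictions:", same)
```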
For the categorical features, we will graph categorical plots.

In [None]:
for column in df.columns:
    if (df[column].dtype.name == 'category') and (column != 'stroke'):  # compare strings with ==, not "is"
        sns.catplot(y=column, hue="stroke", kind="count", edgecolor="black", data=df)
        plt.title(column)
        plt.show()

It is important to interpret each graph keeping in mind the number of samples per category. For example, looking at the `ever_married` plot we could say that married people have a higher chance of stroke than unmarried people, but the raw counts alone ignore the fact that we have more samples of married people: 3353 `ever_married=Yes` samples against just 1757 `ever_married=No` samples.
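
One way to account for group size is to compare rates instead of counts. The following sketch uses invented numbers (the `group` and `event` columns and their counts are hypothetical, chosen only to show the effect): group `A` has more events in absolute terms, yet a lower event rate than group `B`.

```python
import pandas as pd

# Hypothetical counts, chosen only to illustrate the effect of group size
toy = pd.DataFrame({
    "group": ["A"] * 800 + ["B"] * 200,
    "event": [1] * 40 + [0] * 760 + [1] * 20 + [0] * 180,
})

# Raw counts per group, as a count plot would show them
counts = pd.crosstab(toy["group"], toy["event"])
# Row-normalized table: the share of each outcome within a group
rates = pd.crosstab(toy["group"], toy["event"], normalize="index")
print(counts)
print(rates)
```

Applied to the notebook's data, `pd.crosstab(df[column], df['stroke'], normalize='index')` would give the stroke rate within each category.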

## Handling Erroneous Data
First, we look at all samples that have an age less than 1.

In [None]:
df[ df['age'] < 1 ]

As seen from the `work_type` column, we can safely state that none of these are errors. We could round them off to zero, but that is not necessary in our case as we plan on using Tree-based methods.  
Now, we look at all samples with a bmi greater than 60. We could also use a lower threshold of 40, as that is considered the maximum usually observed.

In [None]:
df[ df['bmi'] > 60 ]

A look at the above samples and their features does not provide a clearer understanding of whether these are erroneous entries or extreme cases. The only thing common to all of them is that they had neither heart disease nor a stroke. For our case, we will let the observations be, but in the field this might once again involve discussion with and clarification from the data source or a domain expert.

## Handling Missing Data

In [None]:
df.isna().sum()

We have 201 missing values in the `bmi` column. If the missing value was in a categorical column, we could ignore it as it would be handled in the encoding of the dataset later.  
Let us have a look at the features of the corresponding samples.

In [None]:
bmifilt = df['bmi'].isna() 
for column in df.columns:
    if df[column].dtype.name == 'category':  # compare strings with ==, not "is"
        print(column,': ')
        print( df.loc[bmifilt, column].value_counts())
        print()
    else:
        print(column,': ')
        print( df.loc[bmifilt, column].describe())
        print()

Compared to our complete dataset, these samples have a higher average glucose level. The remaining features look usual with nothing particularly standing out. To fill the missing values, we will use the mean of the `bmi` column, computed separately for each `stroke` value.
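
Class-conditional mean imputation can be sketched on a toy frame first. The six rows below are made up, and `bmi_filled` is a hypothetical column name; `groupby(...).transform("mean")` is used here as a compact alternative to building one mask per class.

```python
import numpy as np
import pandas as pd

# Made-up rows: one missing bmi value in each stroke class
toy = pd.DataFrame({
    "stroke": [0, 0, 0, 1, 1, 1],
    "bmi": [20.0, 30.0, np.nan, 28.0, 32.0, np.nan],
})

# Mean of the observed bmi values within each stroke class, broadcast back to every row
class_mean = toy.groupby("stroke")["bmi"].transform("mean")
# Only the missing entries are replaced; observed values are kept
toy["bmi_filled"] = toy["bmi"].fillna(class_mean)
print(toy)
```

Because the fill value is chosen by the label, this approach needs `stroke` to be known for every row, which is not the case for new, unlabeled samples.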

In [None]:
strokefilt = (df['stroke'] == 1)

s0_mean = df.loc[~strokefilt,'bmi'].mean()
s1_mean = df.loc[strokefilt,'bmi'].mean()

df.loc[(bmifilt) & (~strokefilt), 'bmi'] = s0_mean
df.loc[(bmifilt) & (strokefilt), 'bmi'] = s1_mean

df.isna().sum()

## Splitting the Dataset
We need to split the dataset into a training set and a test set before we start modelling. We will split the dataset in a ratio of 7:3. A separate validation set is not required because we will be using cross-validation for hyperparameter tuning.  
Before we split, we need to one-hot encode our categorical columns, except the binary ones. The columns with binary string data will be label encoded. This has to be done due to the limitation in Scikit-Learn where categorical variables cannot be passed despite working directly on categorical data being a main advantage of Tree-based methods.  
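
The two encodings described above look like this on a toy frame. The category names are taken from the dataset, but the three rows are made up, and the binary column is mapped with a plain comparison here instead of `LabelEncoder`, purely to keep the sketch short.

```python
import pandas as pd

# Three made-up rows using category names from the dataset
toy = pd.DataFrame({
    "ever_married": ["Yes", "No", "Yes"],
    "work_type": ["Private", "children", "Govt_job"],
})

# Binary string column -> a single 0/1 column
toy["ever_married"] = (toy["ever_married"] == "Yes").astype(int)
# Multi-valued column -> one indicator column per category
encoded = pd.get_dummies(toy, columns=["work_type"], dtype=int)
print(encoded)
```

One-hot encoding avoids implying an order between categories such as `Private` and `children`, which a single integer code would do.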

In [None]:
from sklearn.preprocessing import LabelEncoder

le_gender = LabelEncoder()
le_marr = LabelEncoder()
le_residence = LabelEncoder()

In [None]:
mod_df = df.copy()

mod_df['gender'] = le_gender.fit_transform(df['gender'])
mod_df['ever_married'] = le_marr.fit_transform(df['ever_married'])
mod_df['Residence_type'] = le_residence.fit_transform(df['Residence_type'])

#display the labels to act as reference later
print(le_gender.classes_)
print(le_marr.classes_)
print(le_residence.classes_)

We also label encoded the `gender` column despite it being non-binary, because `Other` can be conceptually ignored as there is only one sample with it.  
Now, we one-hot encode using the `.get_dummies()` function in the *Pandas* library.

In [None]:
encdf = pd.get_dummies( df[['work_type', 'smoking_status']] )
mod_df.drop(['work_type', 'smoking_status'], axis=1, inplace=True)
mod_df = mod_df.join(encdf)
mod_df.head(10)

In [None]:
mod_df.info()

Before we split, let us take one look at the correlation matrix of the fully processed dataset.

In [None]:
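# Cast to float so the category and indicator columns (including the label) are part of the matrix
sns.heatmap(mod_df.astype(float).corr())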

`age` and `ever_married` look to be related, but nothing significant enough to require feature engineering.  
Now, we split.

In [None]:
from sklearn.model_selection import train_test_split

y = mod_df.pop('stroke')
X = mod_df

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train.shape, '\t', X_test.shape)
print(y_train.shape, '\t', y_test.shape)
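
With a rare positive class, a plain random split can leave the training and test sets with different shares of positive samples. `train_test_split` accepts a `stratify` argument that keeps the class proportions equal in both parts. A minimal sketch on synthetic labels (the 950/50 class counts are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 950 negatives and 50 positives
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 950 + [1] * 50)

# stratify=y preserves the 5% positive rate in both the 70% and the 30% part
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
print(y_tr.mean(), y_te.mean())
```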

## Model Design and Training
We will use two machine learning algorithms: **Decision Tree** and **Random Forest**. The models with default parameters are run first and compared based on training accuracy, F-score and AUC (Area under the ROC Curve) to get an initial understanding. Then, we tune both models using cross-validation to improve our results. Nowhere during this step should the test dataset be used.
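
AUC is a ranking metric, so it is normally computed from continuous scores such as the output of `predict_proba`; hard 0/1 predictions collapse the ROC curve to a single operating point and usually give a different value. A small sketch with made-up labels and scores (`y_true` and `scores` are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.45, 0.4, 0.8, 0.9])
# Hard predictions at the usual 0.5 threshold
hard = (scores >= 0.5).astype(int)

auc_scores = roc_auc_score(y_true, scores)  # uses the full ranking
auc_hard = roc_auc_score(y_true, hard)      # only sees two distinct values
print(auc_scores, auc_hard)
```

Here 8 of the 9 negative-positive pairs are ranked correctly by the scores, while thresholding turns three of those pairs into ties.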

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier(random_state=0)
dtree.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

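train_pred = dtree.predict(X_train)
print(accuracy_score(y_train, train_pred))
print(f1_score(y_train, train_pred))
# AUC is a ranking metric, so score it on the predicted probability of stroke=1
print(roc_auc_score(y_train, dtree.predict_proba(X_train)[:, 1]))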

Surprisingly, we get 100% for all three of our metrics. This might seem ideal, but it is more likely a case of our model overfitting the training data: with default parameters, a Decision Tree keeps splitting until its leaves are pure. If that is the case, our Tree will perform poorly on unseen samples, making it useless for any real-world application. To go a little further, let us look at the cross-validation score for our model as well.
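
The gap between training accuracy and cross-validated accuracy can be reproduced on synthetic data. The sketch below assumes a noisy dataset from `make_classification` (its parameters are arbitrary choices, not related to the stroke dataset): the unconstrained tree fits the training set perfectly, while the 10-fold estimate is lower.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y), so a perfect fit must be memorization
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)

# Accuracy on the same samples the tree was trained on
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
train_acc = tree.score(X, y)

# Mean accuracy over 10 held-out folds
cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10).mean()
print(train_acc, cv_acc)
```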


In [None]:
from sklearn.model_selection import cross_val_score

dt_cvscore = cross_val_score(dtree, X_train, y_train, cv=10)
print(dt_cvscore)
print(dt_cvscore.mean())

Now, we get a clearer metric, a mean cross-validation accuracy of about 91%, that can be used to compare our models.  
We can now try to visualise our tree.

In [None]:
from sklearn import tree

plt.figure(figsize = (20,14))
tree.plot_tree(dtree, rounded=True)
plt.show()

Due to the large number of predictors, the visualization does not show fine details. A better approach would be to save the tree as an image file using the instructions given [here](https://chrisalbon.com/machine_learning/trees_and_forests/visualize_a_decision_tree/) and analyze it using a standard GUI image viewer. We recommend this step because the interpretability of the model, as provided by the visualized tree, is one of the main advantages of using a Decision Tree.
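
When the plot is too dense to read, a text dump of the split rules is another option. This sketch uses scikit-learn's `export_text` on a shallow tree fitted to the bundled iris data, which stands in for the notebook's data only so that the example is self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in dataset; a shallow tree keeps the printed rules short
iris = load_iris()
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# One line per split or leaf, indented by depth
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```

For the tree trained above, the equivalent call would presumably be `export_text(dtree, feature_names=list(X_train.columns))`.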

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)

In [None]:
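train_pred = rf.predict(X_train)
print(accuracy_score(y_train, train_pred))
print(f1_score(y_train, train_pred))
# AUC is a ranking metric, so score it on the predicted probability of stroke=1
print(roc_auc_score(y_train, rf.predict_proba(X_train)[:, 1]))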

In [None]:
rf_cvscore = cross_val_score(rf, X_train, y_train, cv=10)
print(rf_cvscore)
print(rf_cvscore.mean())

Here, we see that the original three metrics are still 100% but we get a much better cross-validation score when compared to a Decision Tree.

## Hyperparameter Tuning
We will be using `GridSearchCV` for this.
### Decision Tree tuning

In [None]:
from sklearn.model_selection import GridSearchCV

dtree = DecisionTreeClassifier(random_state=0)
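# 'sqrt' replaces 'auto', which meant the same thing for classification trees and was removed in scikit-learn 1.3
dtparam = {"max_depth":[3,5,10,None], 'max_features':['sqrt',8,16,None]}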
dtcv = GridSearchCV(dtree, param_grid=dtparam, cv=10, scoring=['f1','accuracy'], refit='f1')
dtcv.fit(X_train,y_train)

In [None]:
print(dtcv.best_estimator_)
print(dtcv.best_score_)
print(dtcv.cv_results_['mean_test_accuracy'][dtcv.best_index_])

Our accuracy remained the same. To get a better understanding, we plot a heatmap with all parameters to see how our accuracy improves.

In [None]:
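# Rows follow max_depth and columns follow max_features, the order in which GridSearchCV enumerates the grid
depth_labels = [str(v) for v in dtparam['max_depth']]
feature_labels = [str(v) for v in dtparam['max_features']]
mean_acc = dtcv.cv_results_['mean_test_accuracy']  # separate name so the label vector y is not overwritten

tempdf = pd.DataFrame(data=np.reshape(mean_acc, (len(depth_labels), len(feature_labels))), index=depth_labels, columns=feature_labels)
sns.heatmap(tempdf)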

Limiting the depth of our tree to 3 or 5 has the most significant impact on our model. A further, minute improvement comes from limiting the number of features considered at each split to 16.
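
The heatmap above relies on reshaping the score vector in the order in which `GridSearchCV` enumerates the grid. An alternative that does not depend on that order is to pivot `cv_results_` by the parameter columns. The sketch below runs on synthetic data with a small, arbitrary grid; none of its values come from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and an arbitrary 3x3 grid, for illustration only
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
grid = {"max_depth": [2, 4, None], "max_features": [2, 4, None]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), grid, cv=5, scoring="accuracy"
).fit(X, y)

results = pd.DataFrame(search.cv_results_)
# Convert to strings so that None stays visible as a label instead of being treated as missing
for col in ["param_max_depth", "param_max_features"]:
    results[col] = results[col].astype(str)

# One row per max_depth, one column per max_features
table = results.pivot(
    index="param_max_depth", columns="param_max_features", values="mean_test_score"
)
print(table)
```

The resulting `table` can be passed to `sns.heatmap` directly, with the axis labels coming from the parameter values themselves.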

### Random Forest tuning

In [None]:
rf = RandomForestClassifier(random_state=0)
rfparam = {"n_estimators":[5,10,50,100,200], "max_depth":[3,5,10,None]}
rfcv = GridSearchCV(rf, param_grid=rfparam, cv=10,  scoring=['f1', 'accuracy'], refit='f1')
rfcv.fit(X_train,y_train)

In [None]:
print(rfcv.best_estimator_)
print(rfcv.best_score_)
print(rfcv.cv_results_['mean_test_accuracy'][rfcv.best_index_])

In [None]:
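# Rows follow max_depth and columns follow n_estimators, the order in which GridSearchCV enumerates the grid
depth_labels = [str(v) for v in rfparam['max_depth']]
estimator_labels = [str(v) for v in rfparam['n_estimators']]
mean_acc = rfcv.cv_results_['mean_test_accuracy']  # separate name so the label vector y is not overwritten

tempdf = pd.DataFrame(data=np.reshape(mean_acc, (len(depth_labels), len(estimator_labels))), index=depth_labels, columns=estimator_labels)
sns.heatmap(tempdf)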

Once again, limiting our depth provides the majority of the improvement in our accuracy.

## Test Accuracy

In [None]:
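from sklearn.metrics import RocCurveDisplay, ConfusionMatrixDisplay

def final_report(model):
    # plot_roc_curve and plot_confusion_matrix were removed in scikit-learn 1.2;
    # the Display classes (available since 1.0) replace them
    print(accuracy_score(y_test, model.predict(X_test)))
    ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
    RocCurveDisplay.from_estimator(model, X_test, y_test)
    plt.show()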

In [None]:
final_report(dtcv.best_estimator_)

In [None]:
final_report(rfcv.best_estimator_)

## Final Observations
Even though we get high accuracy values, we get both a low AUC and a low F-score with our models. The high accuracy is due to the imbalance between the number of samples with `stroke=0` and `stroke=1`. This is why the F-score is a better metric for comparison: it is the harmonic mean of precision and recall on the positive class and ignores true negatives, so it is not inflated by the majority class.  
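
The effect can be shown with a worst case. In the sketch below the labels are hypothetical (950 negatives, 50 positives) and the "model" always predicts the majority class: accuracy looks high while the F-score, $F_1 = 2PR/(P+R)$ with precision $P$ and recall $R$, drops to zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 950 negatives and 50 positives
y_true = np.array([0] * 950 + [1] * 50)
# A useless "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 sets the undefined precision to 0 instead of warning
f1 = f1_score(y_true, y_pred, zero_division=0)
print(acc, f1)
```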

#### Please do provide suggestions for areas of improvement, both technical and general, in this notebook as this is my first try on Kaggle. It will be greatly appreciated.
