<center><h1>Heart Failure Prediction EDA</h1></center>

<hr/>

<img src="https://storage.googleapis.com/kaggle-datasets-images/727551/1263738/b480e9c8a7b4efd0026dff1a2aeb98df/dataset-cover.png?t=2020-08-18-10-19-56" />

<hr/>

This notebook is based on a Kaggle [dataset](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data/activity) about heart failure diagnosis.

This is the second of two notebooks. In it we build models to predict whether a patient is in danger of heart failure.

The [first](https://www.kaggle.com/ilijal/heart-failure-prediction-1-2-eda) notebook focused on exploratory data analysis.

Without further ado, let's start!

## Table of contents

* [Conclusions from previous part](#Conclusions-from-previous-part)
* [Loading libraries and modules](#Loading-libraries-and-modules)
* [Importing data](#Importing-data)
* [Preparing dataset](#Preparing-dataset)
* [Inspect outliers](#Inspect-outliers)
* [Predictive modeling](#Predictive-modeling)
* [Conclusions](#Conclusions)
* [References](#References)

## Conclusions from previous part

[TOC](#Table-of-contents)

After the exploratory data analysis we found that:
- there are **no missing** values,
- there are **no duplicate** entries,
- the dataset is **moderately imbalanced**,
- **none** of the **boolean** predictor variables correlates with patient death,
- 4 of the **numerical predictor variables seem to indicate whether a patient will die**. **Those are `serum_creatinine`, `ejection_fraction`, `serum_sodium` and `age`**, with age having the least effect,
- multiple numeric variables contain **outliers** (`creatinine_phosphokinase`, `platelets`, and `serum_creatinine`). However, we will only handle outliers for **`serum_creatinine`**, since it is the one that indicates patient death.

Based on these conclusions we want to:
- use the 4 numerical predictor variables `serum_creatinine`, `ejection_fraction`, `serum_sodium` and `age`,
- handle outliers in the `serum_creatinine` variable,
- scale variable values, since some variables have value ranges orders of magnitude larger than others,
- use Decision Tree, Random Forest and Logistic Regression models for classification.
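
To illustrate the scaling point above, here is a minimal sketch that standardizes two columns whose ranges differ by orders of magnitude. The array `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: first column is on the order of 1e5, second on the order of 1
toy = np.array([
    [265000.0, 1.9],
    [263358.0, 1.1],
    [162000.0, 1.3],
    [210000.0, 2.7],
])

scaled = StandardScaler().fit_transform(toy)

# After standardization each column has mean 0 and unit variance,
# so neither column dominates a distance- or gradient-based model
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

Tree-based models are largely insensitive to this kind of rescaling, which is presumably why only the Logistic Regression pipelines below include a scaler.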



## Loading libraries and modules

[TOC](#Table-of-contents)

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.linear_model import LogisticRegression

from sklearn.preprocessing import PolynomialFeatures

# plot_roc_curve was removed in scikit-learn 1.2 and is not used in this notebook
from sklearn.metrics import accuracy_score, make_scorer
from sklearn import metrics

from sklearn.model_selection import learning_curve, validation_curve, cross_val_score, train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

from sklearn.pipeline import Pipeline

from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler

from mlxtend.evaluate import confusion_matrix
from mlxtend.plotting import plot_confusion_matrix

## Importing data

[TOC](#Table-of-contents)

In [None]:
file_name = '/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv'

df = pd.read_csv(file_name)

df

## Preparing dataset 
[TOC](#Table-of-contents)

In [None]:
# convert variables to bool

df['DEATH_EVENT'] = df.DEATH_EVENT.astype('bool')  

BOOL_COLUMNS = ['anaemia', 'diabetes', 'high_blood_pressure', 'sex', 'smoking']
# df['anaemia'] = df.anaemia.astype('bool')
# df['diabetes'] = df.diabetes.astype('bool')
# df['high_blood_pressure'] = df.high_blood_pressure.astype('bool')
# df['sex'] = df.sex.astype('bool')
# df['smoking'] = df.smoking.astype('bool')

df = df.drop(columns=['time'])  # drop time column

In [None]:
df

In [None]:
df.info()

## Inspect outliers
[TOC](#Table-of-contents)

It is good to view possible outliers in variables again.

In [None]:
r = c = 0

columns = 3
rows = int(np.ceil(len(df.columns)/columns))

fig, axs = plt.subplots(nrows=rows, ncols=columns, figsize=(20, 20))
plt.subplots_adjust(hspace=0.4)

for n, i in enumerate(df.columns):
    if i in BOOL_COLUMNS:
        sns.countplot(x=i, data=df, ax=axs[r, c])
    else: 
        sns.boxenplot(x='DEATH_EVENT', y=i, data=df, ax=axs[r, c])
    axs[r, c].set_title(i.upper(), y=1.02)
    axs[r, c].set_ylabel(None)
    
    c += 1
    if (n + 1) % columns == 0:
        r += 1
        c = 0


fig.suptitle('Box plots and count plots for variables grouped by DEATH_EVENT', size=20, y=0.95)
plt.show()

From a couple of articles on the web, we found that:
- Ejection Fraction (EF) has a normal range of 55%-70%, is slightly below normal at 40%-54%, moderately below normal at 35%-39% and **severely below normal at less than 35%**. More [here](https://my.clevelandclinic.org/health/articles/16),
- Normal serum creatinine levels are 0.9-1.3 mg/dL for males and 0.6-1.1 mg/dL for females. More [here](https://www.medicalnewstoday.com/articles/322380#what-does-the-test-involve) and [here](https://www.urmc.rochester.edu/encyclopedia/content.aspx?ContentTypeID=167&ContentID=creatinine_serum),
- Serum sodium has normal levels between 135 and 145 mEq/L. Values **below 135 mEq/L may indicate health issues**. More [here](https://www.mayoclinic.org/diseases-conditions/hyponatremia/symptoms-causes/syc-20373711#:~:text=A%20normal%20blood%20sodium%20level,Certain%20medications).
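
As a rough sketch, the ranges above could be turned into categorical flags with `pd.cut`. The frame `toy`, its values and the label names are hypothetical. The `normal_or_higher` bucket is also an assumption, since the ranges above do not say how to treat EF values over 70%.

```python
import pandas as pd

# Hypothetical patients; values are illustrative, not taken from the dataset
toy = pd.DataFrame({
    'ejection_fraction': [20, 38, 45, 60],
    'serum_sodium': [130, 136, 134, 140],
})

# Left-closed bins: [0, 35), [35, 40), [40, 55), [55, inf)
ef_bins = [0, 35, 40, 55, float('inf')]
ef_labels = ['severely_below', 'moderately_below', 'slightly_below', 'normal_or_higher']
toy['ef_category'] = pd.cut(toy.ejection_fraction, bins=ef_bins, labels=ef_labels, right=False)

# Sodium below 135 mEq/L is flagged as potentially problematic
toy['low_sodium'] = toy.serum_sodium < 135

print(toy)
```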


In [None]:
df.serum_creatinine.describe()

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(15, 5))

# distplot is deprecated since seaborn 0.11; histplot draws the same kind of histogram
sns.histplot(x=df.serum_creatinine, color='blue', alpha=0.4, ax=axs[0])
axs[0].set_title('serum_creatinine distribution plot')
sns.histplot(x=df.loc[df.DEATH_EVENT == True, 'serum_creatinine'], color='red', alpha=0.4, ax=axs[1])
sns.histplot(x=df.loc[df.DEATH_EVENT == False, 'serum_creatinine'], color='blue', alpha=0.4, ax=axs[1])
axs[1].set_title('serum_creatinine distribution plot per DEATH_EVENT');

Let's use `median + 3*std` as a threshold.

In [None]:
serum_creatinine_thresh = df.serum_creatinine.median() + df.serum_creatinine.std() * 3

print('Threshold value:', serum_creatinine_thresh)

These are the records to be excluded, as they are marked as outliers by `serum_creatinine` values.

In [None]:
df[df.serum_creatinine > serum_creatinine_thresh]

## Predictive modeling

[TOC](#Table-of-contents)

Here we handle hyperparameter tuning, model comparison and the choice of the best model.

The steps are:
- Separate the target variable from the predictors,
- Divide the dataset into training and test sets,
- Define a CV scheme and use the training data to find the best hyperparameters and compare algorithms,
- Take the best performing model, tune its hyperparameters on the whole training data, then estimate the generalization error on the test set,
- Display the accuracy of the classifier.

In [None]:
CHOSEN_COLUMNS = ['age', 'ejection_fraction', 'serum_creatinine', 'serum_sodium']

class FeaturesChooser(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
    def fit(self, X, y = None):
        return self
    
    def transform(self, X, y = None):
        X = X.copy()
        return X[CHOSEN_COLUMNS]
    
class CreatinineOutlierHandler(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
    def fit(self, X, y = None):
        return self
    
    def transform(self, X, y = None):
        X = X.copy()
        serum_creatinine_thresh = X.serum_creatinine.median() + X.serum_creatinine.std() * 3
        return X[X.serum_creatinine <= serum_creatinine_thresh]

### Separate predictors and label

In [None]:
HANDLE_OUTLIERS = True

_df = df.copy()

if HANDLE_OUTLIERS is True:
    _df = CreatinineOutlierHandler().fit_transform(df)

X, y = _df.drop(columns=['DEATH_EVENT']), _df.DEATH_EVENT

X.shape, y.shape

### Separate training and test data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, stratify=y, random_state=42)

X.shape, X_train.shape, X_test.shape

Let's briefly display the distribution of each variable in the train and test sets.

In [None]:
fig, axs = plt.subplots(ncols=3, nrows=4, figsize=(20, 20))

r, c = 0, 0

for n, i in enumerate(X_train.columns):
    X_train[i].hist(ax=axs[r, c])
    X_test[i].hist(ax=axs[r, c])
    axs[r, c].set_title('Distribution of {}'.format(i))

    c += 1
    if (n + 1) % columns == 0:
        r += 1
        c = 0
        
fig.suptitle('Distributions of variables for train and test sets.', size=20, y=0.925);

### Defining CV scheme and hyperparameter tuning

The idea is to compare results of:
- LogisticRegression,
- DecisionTreeClassifier and
- RandomForestClassifier,

as introductory algorithms for classification.

What we want to do here is to:
- tune hyperparameters and pick the best ones by measuring how stable each model is,
- compare the models based on stability,
- use the test set to estimate how well the final model generalizes.

So basically, we want to:
- take part of the data as the test set (20%),
- take the rest of the data as the train set (80%),
- use a nested 5x4 CV scheme: 5 folds in the outer loop for estimating generalization and 4 folds in the inner loop for hyperparameter tuning.
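
The nested scheme can be sketched in a few lines on synthetic data. This is only an illustration under stated assumptions: `X_toy` and `y_toy` come from `make_classification` rather than the heart failure dataset, and the `C` grid is shortened. The idea is that a `GridSearchCV` (inner loop) is itself passed as the estimator to `cross_val_score` (outer loop).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption, not the real dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=4, n_informative=3,
                                   n_redundant=1, random_state=42)

inner_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)  # hyperparameter tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # generalization estimate

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('est', LogisticRegression(solver='liblinear')),
])

# Inner loop: for each outer training fold, pick the best C with 4-fold CV
search = GridSearchCV(pipe, param_grid={'est__C': [0.01, 0.1, 1, 10]},
                      scoring='roc_auc', cv=inner_cv)

# Outer loop: score the tuned model on 5 held-out folds it never saw during tuning
nested_scores = cross_val_score(search, X_toy, y_toy, scoring='roc_auc', cv=outer_cv)

print(nested_scores.mean(), nested_scores.std())
```

Because tuning happens inside each outer fold, the outer scores are not biased by the hyperparameter search, which is the point of nesting.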


In [None]:
INNER_SPLITS = 4
OUTTER_SPLITS = 5

print(f"Train set size {X_train.shape[0]}\nTest set size {X_test.shape[0]}\n")

print('Train/test size = {}/{} in outer splits.'.format(int(np.ceil(X_train.shape[0]*(OUTTER_SPLITS-1)/OUTTER_SPLITS)), X_train.shape[0]//OUTTER_SPLITS))
outter_train_size = int(np.ceil(X_train.shape[0]*(OUTTER_SPLITS-1)/OUTTER_SPLITS))

inner_train_size = int(np.ceil(outter_train_size*(INNER_SPLITS-1)/INNER_SPLITS))
inner_test_size = int(outter_train_size-inner_train_size)
print('Train/test size = {}/{} in inner splits.'.format(inner_train_size, inner_test_size))

In [None]:
# Choose scoring function for CV scheme
scoring_function = 'roc_auc'
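
For intuition about the `'roc_auc'` scorer chosen above, here is a small sketch with made-up labels and scores. ROC AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted scores, for illustration only
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Of the 4 (positive, negative) pairs, 3 are ranked correctly:
# 0.35 > 0.1, 0.8 > 0.1, 0.8 > 0.4, but 0.35 < 0.4
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

Since this metric depends only on the ranking of scores and not on a decision threshold, it is a common choice for moderately imbalanced data such as this.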

In [None]:
def get_gridsearchcvs(stratified_folds=INNER_SPLITS):
    names, pipes, params = [], [], []
    
    pipe = Pipeline((
                    ('stds', StandardScaler()),
                    ('est', LogisticRegression(solver='liblinear'))))

    pipe_params = {
        'est__C': [0.0001, 0.001, .01, 0.1, 1, 10, 100]
    }
    
    names.append('LogisticRegression_1')
    pipes.append(pipe)
    params.append(pipe_params)

    pipe = Pipeline((
                    ('stds', MinMaxScaler()),
                    ('est', LogisticRegression(solver='liblinear'))))

    pipe_params = {
        'est__C': [0.0001, 0.001, .01, 0.1, 1, 10, 100]
    }
    
    names.append('LogisticRegression_2')
    pipes.append(pipe)
    params.append(pipe_params)
    
    pipe = Pipeline((('less_fts', FeaturesChooser()),
                    ('stds', StandardScaler()),
                    ('est', LogisticRegression(solver='liblinear'))))

    pipe_params = {
        'est__C': [0.0001, 0.001, .01, 0.1, 1, 10, 100]
    }

    names.append('LogisticRegression_3')
    pipes.append(pipe)
    params.append(pipe_params)

    
    pipe = Pipeline((
        ('est', DecisionTreeClassifier(criterion='entropy', random_state=42)),
    ))

    pipe_params = {
        'est__max_depth': [3, 4, 5, 6], 
        'est__min_samples_leaf': [5, 10, 15, 20, 30]
    }

    names.append('DecisionTree_1')
    pipes.append(pipe)
    params.append(pipe_params)

    pipe = Pipeline((
        ('poly', PolynomialFeatures(interaction_only=True, include_bias=False)),
        ('est', DecisionTreeClassifier(criterion='entropy', random_state=42)),
    ))

    pipe_params = {
        'poly__degree': [1, 2, 3],
        'est__max_depth': [3, 4, 5, 6], 
        'est__min_samples_leaf': [5, 10, 15, 20, 30]
    }

    names.append('DecisionTree_2')
    pipes.append(pipe)
    params.append(pipe_params)
    
    pipe = Pipeline((
        ('est', RandomForestClassifier(criterion='entropy', random_state=42)),
    ))

    pipe_params = {
        'est__n_estimators': [50, 100, 150, 200],
        'est__max_depth': [3, 4],
        'est__min_samples_leaf': [6, 10, 13]
    }

    names.append('RandomForest_1')
    pipes.append(pipe)
    params.append(pipe_params)
    
    gcvs = []

    # use a distinct loop variable so the `params` list is not shadowed
    for name, pipe, grid in zip(names, pipes, params):
        print(f'Adding GridSearchCV for {name} estimator...')
        gcv = GridSearchCV(pipe, 
                           param_grid=grid, 
                           scoring=scoring_function, 
                           refit=True, 
                           n_jobs=-1, 
                           cv=stratified_folds, 
                           return_train_score=True)
        gcvs.append({'name': name, 'gs': gcv})
    
    print('\n')
    return gcvs

In [None]:
gcvs = get_gridsearchcvs()

for gcv in gcvs:
    scores = cross_val_score(gcv['gs'], X_train, y_train, scoring=scoring_function, cv=OUTTER_SPLITS)
    print('{:>20s}, {:.2f}, std +/- {:.2f}%'.format(gcv['name'], 100 * np.mean(scores), 100 * np.std(scores)))

In [None]:
gcvs = get_gridsearchcvs(stratified_folds=4)

stats = []
    
for gcv in gcvs:
    gcv['gs'].fit(X_train, y_train)
    print(f"Estimator: {gcv['name']}, params: {gcv['gs'].best_params_}")
    cols = ['mean_train_score', 'std_train_score', 'mean_test_score', 'std_test_score']
    cv_stats = pd.DataFrame(gcv['gs'].cv_results_)[cols].describe().loc['mean', :]

    # scores and their standard deviations are fractions in [0, 1], not percentages
    print(f"CV train score {cv_stats['mean_train_score']:.5f} +/- {cv_stats['std_train_score']:.5f}")
    print(f"CV validation score {cv_stats['mean_test_score']:.5f} +/- {cv_stats['std_test_score']:.5f}")

    whole_train_score = gcv['gs'].score(X_train, y_train)
    test_score = gcv['gs'].score(X_test, y_test)
    
    stats.append({'name': gcv['name'],
                  'folds': 4,
                  'train_score_mean': cv_stats['mean_train_score'], 
                  'train_score_std': cv_stats['std_train_score'],
                  'train_score': whole_train_score,
                  
                  'validation_score_mean': cv_stats['mean_test_score'], 
                  'validation_score_std': cv_stats['std_test_score'],
                  'test_score': test_score})
    
    print(f"Whole train score: {whole_train_score:.5f}")
    print(f"Whole test score: {test_score:.5f}")
    print()

In [None]:
gcvs = get_gridsearchcvs(stratified_folds=5)

for gcv in gcvs:
    gcv['gs'].fit(X_train, y_train)
    print(f"Estimator: {gcv['name']}, params: {gcv['gs'].best_params_}")
    cols = ['mean_train_score', 'std_train_score', 'mean_test_score', 'std_test_score']
    cv_stats = pd.DataFrame(gcv['gs'].cv_results_)[cols].describe().loc['mean', :]

    # scores and their standard deviations are fractions in [0, 1], not percentages
    print(f"CV train score {cv_stats['mean_train_score']:.5f} +/- {cv_stats['std_train_score']:.5f}")
    print(f"CV validation score {cv_stats['mean_test_score']:.5f} +/- {cv_stats['std_test_score']:.5f}")

    whole_train_score = gcv['gs'].score(X_train, y_train)
    test_score = gcv['gs'].score(X_test, y_test)
    
    stats.append({'name': gcv['name'],
                  'folds': 5,
                  'train_score_mean': cv_stats['mean_train_score'], 
                  'train_score_std': cv_stats['std_train_score'],
                  'train_score': whole_train_score,
                  
                  'validation_score_mean': cv_stats['mean_test_score'], 
                  'validation_score_std': cv_stats['std_test_score'],
                  'test_score': test_score})
    
    print(f"Whole train score: {whole_train_score:.5f}")
    print(f"Whole test score: {test_score:.5f}")
    print()


In [None]:
df_stats = pd.DataFrame(stats)
df_stats

In [None]:
fig, axs = plt.subplots(nrows=1, ncols=4, sharey=True, figsize=(20, 5));

sns.barplot(y='name', x='train_score_mean', hue='folds', data=df_stats, ax=axs[0])
axs[0].set_xlim(min(df_stats['train_score_mean'] - 0.05), 1);

sns.barplot(y='name', x='train_score_std', hue='folds', data=df_stats, ax=axs[1])
axs[1].set_xlim(min(min(df_stats['train_score_std']), 0));

sns.barplot(y='name', x='validation_score_mean', hue='folds', data=df_stats, ax=axs[2])
axs[2].set_xlim(min(df_stats['validation_score_mean'])- 0.1, 1);

sns.barplot(y='name', x='validation_score_std', hue='folds', data=df_stats, ax=axs[3])
axs[3].set_xlim(min(0, min(df_stats['validation_score_std'])-0.005));

In [None]:
fig, axs = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(20, 5));
sns.barplot(x='train_score', y='name', hue='folds', data=df_stats, ax=axs[0]);
axs[0].set_xlim(.5, .9);
sns.barplot(x='test_score', y='name', hue='folds', data=df_stats, ax=axs[1]);
axs[1].set_xlim(.5, .9);

In [None]:
fig, axs = plt.subplots(nrows=1, ncols=len(gcvs), figsize=(20, 5), sharey=True)

cv=10

for i, gcv in enumerate(gcvs):
#     print(gcv['name'])
    x, y1, y2 = learning_curve(gcv['gs'].best_estimator_, X_train, y_train, cv=cv, scoring=scoring_function)
    axs[i].plot(x, y1.mean(1), label='train');
    axs[i].plot(x, y2.mean(1), label='test');
    axs[i].set_title(f"{gcv['name']}")
    axs[i].axhline((y1.mean(1)[-1]+y2.mean(1)[-1])/2, c='r', linewidth=.5, linestyle='--')
    axs[i].legend()
    
fig.suptitle('Learning curves for our models', size=20, y=1.02);

In [None]:
fig, axs = plt.subplots(nrows=1, ncols=len(gcvs), figsize=(20, 4))

for i, gcv in enumerate(gcvs):
    ac_score = accuracy_score(y_test, gcv['gs'].predict(X_test))
    plot_confusion_matrix(confusion_matrix(y_test, gcv['gs'].predict(X_test)), axis=axs[i]);
    axs[i].set_title(f"{gcv['name']}\n{ac_score:.4f}")
    axs[i].set_ylabel(None)
    axs[0].set_ylabel('true label')
    axs[i].set_xlabel('predicted label')

fig.suptitle('Confusion matrices for our models', size=20, y=1.02);

## Conclusions

[TOC](#Table-of-contents)

* Removing outliers improves the accuracy score for almost all classifiers,
* Logistic Regression turns out to be the best algorithm for this problem, with a maximum accuracy score of 0.763,
* Logistic Regression is more stable in its predictions than the others. We can see this by comparing how similar the validation and test scores are for each model.
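
One way to quantify the stability comparison is the absolute gap between mean validation score and test score. The sketch below assumes a frame with the same column names as `df_stats`. The numbers in `toy_stats` are placeholders and are NOT results from this notebook.

```python
import pandas as pd

# Placeholder numbers for illustration only; they are NOT results from this notebook
toy_stats = pd.DataFrame({
    'name': ['model_a', 'model_b'],
    'validation_score_mean': [0.80, 0.78],
    'test_score': [0.79, 0.70],
})

# A smaller gap means validation performance carried over to the test set more faithfully
toy_stats['val_test_gap'] = (toy_stats.validation_score_mean - toy_stats.test_score).abs()

most_stable = toy_stats.sort_values('val_test_gap').iloc[0]['name']
print(toy_stats)
print(most_stable)
```

Applying the same two lines to `df_stats` would give the gap for each of the actual models.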

## References

[TOC](#Table-of-contents)

- Research [paper](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5) referenced in the Kaggle dataset description,
- [Sebastian Raschka's great paper on **Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning**](https://arxiv.org/pdf/1811.12808.pdf),
- [Great presentation on Bias-Variance tradeoff](https://stdm.github.io/downloads/courses/ML/V06_BiasVariance-LearningCurves.pdf)

<hr>

I hope you enjoyed this notebook and learned something new about the dataset and data science in general.

Cheers!