<center><h1>Heart Failure Prediction EDA</h1></center>

<hr/>

<img src="https://storage.googleapis.com/kaggle-datasets-images/727551/1263738/b480e9c8a7b4efd0026dff1a2aeb98df/dataset-cover.png?t=2020-08-18-10-19-56" />

<hr/>

This notebook is an exploratory data analysis (EDA) of a Kaggle [dataset](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data/activity) related to heart failure prediction.

This is the first of two notebooks. Here we conduct the EDA and check which predictor variables are associated with the undesirable event of a patient's death.

The [second](https://www.kaggle.com/ilijal/heart-failure-prediction-1-2-eda) notebook will focus on using this notebook's insights to build predictive models.

Without further ado, let's start!

## Table of contents

* [Loading libraries and modules](#Loading-libraries-and-modules)
* [Importing data](#Importing-data)
* [Brief data overview](#Brief-data-overview)
* [Detailed data overview](#Detailed-data-overview)
* [Conclusions](#Conclusions)
* [References](#References)

## Loading libraries and modules

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import chi2_contingency
from scipy.stats import ks_2samp
from scipy.spatial.distance import jensenshannon

from IPython.display import display, Markdown, Latex

## Importing data
[TOC](#Table-of-contents)

In [None]:
file_name = '/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv'

df = pd.read_csv(file_name)

df

There are 13 columns: 12 predictor variables and 1 target variable (`DEATH_EVENT`).

One of the predictor variables (`time`) can be misleading, since it only became known in the course of the study. It is therefore highly indicative of the target variable. However, we will not know it for any future patient we want to diagnose, which is why we do not want to use it in a predictive model.

One idea is to use the other 11 predictor variables to predict this numerical variable (`time`) with a regression model. Food for thought.

## Brief data overview
[TOC](#Table-of-contents)

In [None]:
df.describe().T

In [None]:
df.info()

Some of the variable types could be more specific, e.g. the boolean ones.

It would be nice to convert them for later use.

In [None]:
print('Number of missing values per column:')
df.isna().sum()

Apparently, there are no missing values at all.

Let's check for duplicated rows.

In [None]:
print('Are there any duplicated rows:', df.duplicated().any())

There are no duplicated rows. Great. A very well prepared data set so far.

Now let's check how balanced this set is.

In [None]:
print(df.DEATH_EVENT.value_counts())
sns.countplot(x='DEATH_EVENT',data=df);

We can conclude that this set is imbalanced, but not extremely so.

Because of this imbalance, we will be careful about how we build and choose models and how we assess their accuracy.
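To make the point about accuracy concrete, here is a minimal sketch on hypothetical labels (70 negatives and 30 positives, not the real dataset). It shows that a "model" which always predicts the majority class already reaches a high accuracy while detecting no positive cases at all.

```python
import pandas as pd

# Hypothetical labels for illustration only (not taken from the dataset).
y = pd.Series([False] * 70 + [True] * 30, name="DEATH_EVENT")

# Share of each class in the sample.
class_share = y.value_counts(normalize=True)

# Always predicting the majority class yields this accuracy "for free".
baseline_accuracy = class_share.max()

# The same trivial model never flags a positive case, so its recall is zero.
majority_label = class_share.idxmax()
y_pred = pd.Series([majority_label] * len(y))
recall_positive = (y_pred & y).sum() / y.sum()

print(f"Baseline accuracy: {baseline_accuracy:.2f}")
print(f"Recall for the positive class: {recall_positive:.2f}")
```

Any model we build later should be judged against such a baseline, and with metrics that look at the minority class, not accuracy alone.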

[TOC](#Table-of-contents)

## Detailed data overview

[TOC](#Table-of-contents)

Let's first convert the data types to more usable ones.

As noted earlier, some variables are really boolean. There are 6 of them, including `DEATH_EVENT`.

Let's convert them.

In [None]:
df['anaemia'] = df.anaemia.astype('bool')
df['diabetes'] = df.diabetes.astype('bool')
df['high_blood_pressure'] = df.high_blood_pressure.astype('bool')
df['sex'] = df.sex.astype('bool')
df['smoking'] = df.smoking.astype('bool')
df['DEATH_EVENT'] = df.DEATH_EVENT.astype('bool')

In [None]:
df.info()

Column `time` is something we should not use for predicting the death event, since it is strongly indicative of the outcome.

Moreover, it's not clear whether this variable will be known for future patients.

In [None]:
time_column = df['time']
df.drop(columns=['time'], inplace=True)

In [None]:
bool_columns = df.columns[df.dtypes.apply(pd.api.types.is_bool_dtype)].values
num_columns = df.columns[~df.columns.isin(bool_columns)].values.tolist() + ['DEATH_EVENT']

In [None]:
g = sns.pairplot(data=df[num_columns], hue='DEATH_EVENT', height=1.5);
g.map_lower(sns.kdeplot, levels=4, color=".2");
g.fig.suptitle('Pairplot of numerical variables with scatter and KDE plots\nabove and below diagonal.', size=15, y=1.05);

The distribution plots on the diagonal might prove useful. Let's plot those separately.

In [None]:
r = 0
c = 0

fig, axs = plt.subplots(2, 3, figsize=(20, 10))
plt.subplots_adjust(hspace=0.3)

for n, i in enumerate(num_columns[:-1]):
    class0 = df.loc[df.DEATH_EVENT == False, i].sort_values()
    class1 = df.loc[df.DEATH_EVENT == True, i].sort_values()
    
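    # sns.distplot is deprecated; histplot with a KDE overlay on a density scale is a comparable plot.
    sns.histplot(class0, color='blue', kde=True, stat='density', ax=axs[r, c]);
    sns.histplot(class1, color='red', kde=True, stat='density', ax=axs[r, c]);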
    
    sns.rugplot(class0, color='blue', alpha=.3, ax=axs[r, c])
    sns.rugplot(class1, color='red', alpha=.3, ax=axs[r, c])
    
    axs[r, c].legend(title='DEATH_EVENT', labels=['False', 'True']);
    axs[r, c].set_title(i.upper(), y=1.02)
    axs[r, c].set_ylabel(None)
    c += 1
    if (n + 1) % 3 == 0:
        r += 1
        c = 0
    
fig.suptitle('Distribution plots for numerical variables by DEATH_EVENT', size=20)
plt.show()

As we can see, some distributions have long tails that point to potential outliers.

In [None]:
r = 0
c = 0

fig, axs = plt.subplots(2, 3, figsize=(20, 10))
plt.subplots_adjust(hspace=0.3)

for n, i in enumerate(num_columns[:-1]):
    class0 = df.loc[df.DEATH_EVENT == False, i].sort_values()
    class0N = len(class0)
    class1 = df.loc[df.DEATH_EVENT == True, i].sort_values()
    class1N = len(class1)
    
    axs[r, c].plot(class0, np.array(range(class0N))/float(class0N), c='b')
    axs[r, c].plot(class1, np.array(range(class1N))/float(class1N), c='r')
    axs[r, c].set_ylim(0, 1)
    
    axs[r, c].legend(title='DEATH_EVENT', labels=['False', 'True']);

    axs[r, c].set_title(i.upper(), y=1.02)
    axs[r, c].set_ylabel(None)
    c += 1
    if (n + 1) % 3 == 0:
        r += 1
        c = 0
    
fig.suptitle('Cumulative distribution plots for numerical variables by DEATH_EVENT', size=20)
plt.show()

These [CDF](https://en.wikipedia.org/wiki/Cumulative_distribution_function) plots can be used to inspect the difference between multiple distributions. Of course, this is only a visual inspection.

We can use a more rigorous method, such as the [Kolmogorov-Smirnov](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) test, to see whether there is a difference and how significant it is.

In [None]:
r = []

for n, i in enumerate(num_columns[:-1]):
    s, p = ks_2samp(df.loc[df.DEATH_EVENT == 0, i], df.loc[df.DEATH_EVENT == 1, i])
    a = '--> ' if p < 0.05 else '    '
    print('{}KS s={:.2f}, p={:.3f} for {}'.format(a, s, p, i))
    r.append({'predictor': i, 'statistic': s, 'p': p})
    
print()
pd.DataFrame(r).sort_values(by=['p']).query('p < 0.05')

Based on the results of the Kolmogorov-Smirnov tests and their p-values, only 4 predictor variables appear to have distributions that differ between the `DEATH_EVENT` groups at a statistically significant level (p < 0.05).

Those are: `serum_creatinine`, `ejection_fraction`, `serum_sodium` and `age`. We will keep those in mind.
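For intuition about what the KS statistic measures, here is a small sketch on synthetic samples (the data and the helper `ecdf` are made up for illustration). The statistic is the largest vertical gap between the two empirical CDFs, the same curves we plotted above, and the manual value should agree with `ks_2samp`.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic samples, purely illustrative.
sample_a = rng.normal(loc=0.0, scale=1.0, size=200)
sample_b = rng.normal(loc=0.5, scale=1.0, size=150)


def ecdf(sample, points):
    """Fraction of `sample` values that are <= each value in `points`."""
    sample = np.sort(sample)
    return np.searchsorted(sample, points, side="right") / len(sample)


# The largest gap between two step functions occurs at an observed value,
# so it is enough to evaluate both ECDFs on the pooled sample.
grid = np.concatenate([sample_a, sample_b])
manual_statistic = np.max(np.abs(ecdf(sample_a, grid) - ecdf(sample_b, grid)))

result = ks_2samp(sample_a, sample_b)
print(f"manual D = {manual_statistic:.4f}")
print(f"scipy  D = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```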

Let's try the [Jensen-Shannon](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence) distance between the per-class distributions of each numerical variable and see what value we get per variable.

In [None]:
fig, axs = plt.subplots(2, 3, figsize=(20, 10))
plt.subplots_adjust(hspace=0.4)

r, c = 0, 0
distances = []

for i, col in enumerate(num_columns[:-1]):
    _, bins = np.histogram(df[col], bins='fd')
    a, _ = np.histogram(df.loc[df.DEATH_EVENT == True, col], bins=bins)
    b, _ = np.histogram(df.loc[df.DEATH_EVENT == False, col], bins=bins)

    jsd = jensenshannon(a, b)
    
    # Keep colors consistent with the earlier plots: blue = False, red = True.
    sns.barplot(x=bins[:-1], y=b, color='b', alpha=0.2, ax=axs[r, c]);
    g = sns.barplot(x=bins[:-1], y=a, color='r', alpha=0.3, ax=axs[r, c]);
    g.set_title('Distance for {} is {:.3f}'.format(col, jsd), size=12, y=1.05);
    g.set_xticklabels(['%.2f' %(float(z.get_text()),) for z in g.get_xticklabels()], rotation=75)
   
    distances.append({'column': col, 'distance': jsd})
    c += 1
    if (i + 1) % 3 == 0:
        r += 1
        c = 0
    
fig.suptitle('Jensen-Shannon distances for numerical variables grouped by DEATH_EVENT', size=20)
plt.show()

print('Jensen-Shannon distances for numerical variables, sorted in descending order:')
pd.DataFrame(distances).sort_values(by='distance', ascending=False)

Apparently, the variables mentioned above also have larger distances than the rest. The JS distance may be useful in similar scenarios.
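As a sanity check on what `jensenshannon` returns, the sketch below recomputes it by hand on hypothetical histogram counts (the numbers and the helper `kl` are illustrative assumptions). SciPy normalizes the inputs to probabilities and returns the distance, i.e. the square root of the Jensen-Shannon divergence, using the natural logarithm by default.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical histogram counts over shared bins (illustrative values).
counts_a = np.array([5, 20, 40, 25, 10], dtype=float)
counts_b = np.array([15, 30, 30, 15, 10], dtype=float)

# Turn counts into probability vectors and build the mixture distribution.
p = counts_a / counts_a.sum()
q = counts_b / counts_b.sum()
m = 0.5 * (p + q)


def kl(u, v):
    """Kullback-Leibler divergence KL(u || v), skipping empty bins of u."""
    mask = u > 0
    return np.sum(u[mask] * np.log(u[mask] / v[mask]))


js_divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
js_distance = np.sqrt(js_divergence)

scipy_distance = jensenshannon(counts_a, counts_b)
print(f"manual = {js_distance:.4f}, scipy = {scipy_distance:.4f}")
```

With the natural logarithm the distance is bounded above by `sqrt(ln 2)` (about 0.83); passing `base=2` rescales it to the range 0 to 1.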

Anyway, we proceed with the 4 variables mentioned above.

Let's see how the numerical variables are distributed using box plots (here seaborn's `boxenplot`, a letter-value variant of the box plot).

This gives us an additional perspective for comparing medians, quantiles and outliers.

In [None]:
r = c = 0

fig, axs = plt.subplots(2, 3, figsize=(20, 10))
plt.subplots_adjust(hspace=0.3)

for n, i in enumerate(num_columns[:-1]):
    sns.boxenplot(x='DEATH_EVENT', y=i, data=df, ax=axs[r, c])
    axs[r, c].set_title(i.upper(), y=1.02)
    axs[r, c].set_ylabel(None)
    c += 1
    if (n + 1) % 3 == 0:
        r += 1
        c = 0
    
fig.suptitle('Box plots for numerical variables grouped by DEATH_EVENT', size=20)
plt.show()

In [None]:
df.groupby('DEATH_EVENT').describe().T

Again, we can see outliers here too. We should probably do something about them.
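One common convention for flagging outliers is Tukey's 1.5 × IQR rule, which is also what the whiskers of a classic box plot are based on. The sketch below applies it to hypothetical measurements. It is only an illustration of the rule, not necessarily how outliers are handled in the follow-up notebook, and in clinical data extreme values may be genuine rather than errors.

```python
import pandas as pd

# Hypothetical measurements, illustrative only.
values = pd.Series(
    [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 5.0, 9.4],
    name="example_measurement",
)

# Interquartile range and the Tukey fences derived from it.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
is_outlier = (values < lower) | (values > upper)

print(f"fences: [{lower:.2f}, {upper:.2f}]")
print(values[is_outlier])
```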

Let's check the correlations between these numerical predictors and `DEATH_EVENT`.

We will use the Pearson, Spearman and Kendall coefficients.

In [None]:
plt.figure(figsize=(10, 8))

sns.heatmap(df[num_columns].corr(method='pearson'), 
            annot=True,
            fmt='.2f');
plt.title('Pearson correlation between numeric predictor variables\nand with predicted variable.', size=15, y=1.05);

In [None]:
plt.figure(figsize=(10, 8))

sns.heatmap(df[num_columns].corr(method='spearman'), 
            annot=True,
            fmt='.2f');

plt.title('Spearman correlation between numeric predictor variables\nand with predicted variable.', size=15, y=1.05);

In [None]:
plt.figure(figsize=(10, 8))

sns.heatmap(df[num_columns].corr(method='kendall'), 
            annot=True,
            fmt='.2f');

plt.title('Kendall correlation between numeric predictor variables\nand with predicted variable.', size=15, y=1.05);

All three correlation coefficients show pretty much the same thing.
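The three coefficients do not have to agree in general. As a small illustration on made-up data (the frame `toy` is an assumption, not part of the dataset), a monotonic but non-linear relationship gives perfect rank correlations (Spearman and Kendall) while the Pearson coefficient stays below 1, because Pearson only measures linear association.

```python
import pandas as pd

# Toy data: y grows monotonically with x, but not linearly.
x = pd.Series(range(1, 11), dtype=float)
toy = pd.DataFrame({"x": x, "y": x ** 3})

pearson = toy["x"].corr(toy["y"], method="pearson")
spearman = toy["x"].corr(toy["y"], method="spearman")
kendall = toy["x"].corr(toy["y"], method="kendall")

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
print(f"Kendall:  {kendall:.3f}")
```

When all three agree, as they do here in the heatmaps, the relationships are unlikely to be dominated by strong non-linear monotonic effects or by a few extreme points.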

Let's print the strongest correlations explicitly.

In [None]:
s_p_corr_variables = df.corr(method='pearson')['DEATH_EVENT'].abs().sort_values(ascending=False)
s_s_corr_variables = df.corr(method='spearman')['DEATH_EVENT'].abs().sort_values(ascending=False)
s_k_corr_variables = df.corr(method='kendall')['DEATH_EVENT'].abs().sort_values(ascending=False)

In [None]:
print('Pearson correlations')
print(s_p_corr_variables[s_p_corr_variables >= .2])
print('')
print('Spearman correlations')
print(s_s_corr_variables[s_s_corr_variables >= .2])
print('')
print('Kendall correlations')
print(s_k_corr_variables[s_k_corr_variables >= .2])

Based on the correlation coefficients above, we may take a couple of variables into consideration. These variables might be indicative of a death event.

These variables are: `serum_creatinine`, `ejection_fraction`, `age` and `serum_sodium`.

This matches what we already discovered. It looks promising.

Let's take the boolean predictor variables and calculate the Pearson correlation coefficient with the target variable.

In [None]:
plt.figure(figsize=(10, 8))

sns.heatmap(df[bool_columns].corr(method='pearson'), 
            annot=True,
            fmt='.2f');

plt.title('Pearson correlation between boolean predictor variables\nand with predicted variable.', size=15, y=1.05);

There is no notable correlation between the boolean predictor variables and the target variable. Interesting...
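For two binary variables, the Pearson coefficient is the phi coefficient, which is tied to the chi-squared statistic of their 2x2 contingency table by phi² = χ² / n (without continuity correction). The sketch below checks this on two synthetic boolean columns. The data is random and only illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)

# Two synthetic boolean columns, illustrative only.
a = pd.Series(rng.random(300) < 0.4)
b = pd.Series(rng.random(300) < 0.3)

# Pearson correlation of the 0/1 encodings.
pearson_r = a.astype(float).corr(b.astype(float))

# Chi-squared statistic of the 2x2 table, without Yates' correction.
table = pd.crosstab(a, b)
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
phi = np.sqrt(chi2 / len(a))

print(f"|Pearson r| = {abs(pearson_r):.4f}")
print(f"phi         = {phi:.4f}")
```

So a near-zero Pearson coefficient between two boolean columns already hints that a chi-squared test on the same pair is unlikely to be significant.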

How about we compare the proportions of patients per boolean predictor variable, split by class?

In [None]:
r = 0
c = 0

fig, axs = plt.subplots(2, 3, figsize=(20, 10))
plt.subplots_adjust(hspace=0.3)

for n, i in enumerate(bool_columns[:-1]):
    ct = pd.crosstab(columns=df[i], index=df.DEATH_EVENT, normalize="columns")
    
    ct.T.plot(kind='bar', stacked=True, ax=axs[r, c]);

    axs[r, c].set_title(i.upper(), y=1.02)
    axs[r, c].set_ylabel("Proportion of observations")

    c += 1
    if (n + 1) % 3 == 0:
        r += 1
        c = 0
    
axs[r,c].axis("off")

fig.suptitle('Stacked bar plots for boolean variables grouped by DEATH_EVENT', size=20)
plt.show()

By visual inspection, only `high_blood_pressure` might be indicative of a death event, since the percentage of death events is higher among patients who suffer from high blood pressure.

However, this should be checked more rigorously.

Let's check the contingency tables for each variable.

In [None]:
for c in bool_columns[:-1]:
    ct = pd.crosstab(df[c], df.DEATH_EVENT)
    print('Contingency table for {}'.format(c))
    print(ct)
    print('\n')

There are a couple of options for testing associations between categorical variables in contingency tables.

Some of the tests that can be used for this purpose are:
- Chi-squared (Chi2) contingency test,
- Fisher's exact test.

Each has its pros and cons; the most notable are:
- The Chi2 test provides an approximation that gets more precise as the sample grows. The approximation is inadequate when sample sizes are small, or when the data are very unequally distributed among the cells of the table, so that the cell counts predicted under the null hypothesis (the “expected values”) are low.
- Fisher's test is exact, but is typically used with smaller samples. It becomes difficult to calculate with large samples or well-balanced tables, but fortunately these are exactly the conditions where the Chi2 test is appropriate.

Based on this, we will use the Chi2 contingency test.

For more info on this, check the Wikipedia [article](https://en.wikipedia.org/wiki/Fisher%27s_exact_test) on Fisher's exact test.
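The choice between the two tests can be checked on the expected counts themselves. The sketch below uses a hypothetical 2x2 table (made-up counts) and applies a common rule of thumb: the Chi2 approximation is usually considered adequate when every expected count is at least 5. It also runs Fisher's exact test for comparison. Note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical 2x2 contingency table (rows: outcome, columns: risk factor).
table = np.array([[90, 60],
                  [40, 35]])

# Chi2 test; Yates' continuity correction is on by default for 2x2 tables.
chi2, p_chi2, dof, expected = chi2_contingency(table)

# Fisher's exact test on the same table.
odds_ratio, p_fisher = fisher_exact(table)

# Rule of thumb (an assumption here): all expected counts should be >= 5.
min_expected = expected.min()

print(f"smallest expected count: {min_expected:.1f}")
print(f"Chi2 p-value:   {p_chi2:.3f}")
print(f"Fisher p-value: {p_fisher:.3f}")
```

When the expected counts are large, as in this example, both tests typically lead to the same decision.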

In [None]:
h0 = 'H0: There is no relationship between {} and DEATH_EVENT'
h1 = 'H1: There is a relationship between {} and DEATH_EVENT'

for c in bool_columns[:-1]:
    ct = pd.crosstab(columns=df[c],index=df.DEATH_EVENT)
    stat, p, dof, expected = chi2_contingency(ct) 

    print('Contingency table for {}\n'.format(c))
    print(ct)
    print()
    print(h0.format(c))
    print(h1.format(c))
    print('\np value:', np.round(p,2))
    if p < 0.05:
        message = Markdown('<b style="color: red;">Reject H0</b>')
    else:
        message = Markdown('<b style="color: green;">Failed to reject H0</b>')
    display(message)
    print('\n')
    

The results show that none of the binary predictor variables has a statistically significant association with `DEATH_EVENT`.

## Conclusions

After the exploratory data analysis we have found that:
- there are **no missing** values,
- there are **no duplicate** entries,
- the dataset is **moderately imbalanced**,
- **none** of the **boolean** predictor variables shows a significant association with patient death,
- 4 of the **numerical predictor variables appear to be associated with patient death**. **Those are `serum_creatinine`, `ejection_fraction`, `serum_sodium` and `age`**, with age having the least effect,
- multiple numeric variables contain **outliers** (`creatinine_phosphokinase`, `platelets`, and `serum_creatinine`). However, we will handle outliers for **`serum_creatinine`**, since it is associated with patient death.

[TOC](#Table-of-contents)

## References

- Research [paper](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5) referenced in the Kaggle dataset description
- Expected values of serum creatinine [calculator](http://www.scymed.com/en/smnxps/psxdf212_c.htm)
- Another [description](https://www.mayoclinic.org/tests-procedures/creatinine-test/about/pac-20384646) of expected serum creatinine test values
- [Article](https://medium.com/datalab-log/measuring-the-statistical-similarity-between-two-samples-using-jensen-shannon-and-kullback-leibler-8d05af514b15) on measuring the statistical similarity between two samples using Jensen-Shannon and Kullback-Leibler divergences
- [Article](https://towardsdatascience.com/how-to-compare-two-distributions-in-practice-8c676904a285) on how to compare two distributions in practice
- Great [tutorial](https://machinelearningmastery.com/divergence-between-probability-distributions/) on how to calculate the KL divergence for machine learning
- [Wiki](https://en.wikipedia.org/wiki/Fisher%27s_exact_test) article on Fisher's exact test and when to use it

[TOC](#Table-of-contents)


<hr>

I hope you have enjoyed this EDA notebook and that you have learned something new about the dataset and Data Science in general. 

Cheers!