# Data Visualization and Modeling of the Heart Failure Dataset





## Outline

- Data Visualization

- Recursive feature elimination with cross validation

- Tree based feature selection methods with random forest

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # data visualization library  
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score,confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import RFE
from sklearn.feature_selection import RFECV

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


## Data Analysis

### Heart Failure Prediction Dataset

We are creating a model to predict the mortality caused by heart failure. 

### About this dataset

"Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.
Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help." 

Davide Chicco, Giuseppe Jurman: Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making 20, 16 (2020).

https://www.kaggle.com/andrewmvd/heart-failure-clinical-data

### Feature Information

1) age - age of the patient

2) anaemia - decrease of red blood cells or hemoglobin (boolean, 0 - no / 1 - yes)

3) creatinine phosphokinase - Level of the CPK enzyme in the blood (mcg/L)

4) diabetes - if they have it or not (boolean, 0 - no / 1 - yes)

5) ejection fraction - Percentage of blood leaving the heart at each contraction (percentage)

6) high blood pressure - if they have it or not (boolean, 0 - no / 1 - yes)

7) platelets - Platelets in the blood (kiloplatelets/mL)

8) serum creatinine - Level of serum creatinine in the blood (mg/dL)

9) serum sodium - Level of serum sodium in the blood (mEq/L)

10) sex - (0 - woman / 1 - man)

11) smoking - (0 - no / 1 - yes)

12) time - Follow-up period (days)

13) DEATH EVENT - Whether the patient died during the follow-up period (boolean, 0 - no / 1 - yes); this is the target label, not a feature


In [None]:
heart_dat = pd.read_csv('/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')
heart_dat.head()

In [None]:
heart_dat.columns

In [None]:
# y includes our labels and x includes our features
dlist = ['DEATH_EVENT']
x = heart_dat.drop(dlist, axis=1)
y = heart_dat['DEATH_EVENT']
x.head()

In [None]:
ax = sns.countplot(x=y)                   # Deaths = 96, Alive = 203
counts = y.value_counts()                 # 1 - Deaths, 0 - Alive
N, Y = counts.loc[0], counts.loc[1]       # look up by label instead of relying on sort order
print('Number of Deaths: ',Y)
print('Number Alive : ',N)

In [None]:
x.describe()

## Visualization

### Standardization

To compare features with one another, we first put them all on the same "playing field" by standardizing each feature with the following formula:

$$ Z = \frac{x - \mu}{\sigma} $$

where $\mu$ is the feature's mean and $\sigma$ is its standard deviation. This puts every feature on the same scale, which makes the visualizations cleaner and more informative.
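As a minimal sketch of the formula, the snippet below standardizes a small toy table. The values and the `toy` / `toy_z` names are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy values, invented for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "age": [50.0, 60.0, 70.0, 80.0],
    "platelets": [150000.0, 250000.0, 300000.0, 400000.0],
})

# Z = (x - mean) / std, computed column by column.
# pandas' std() uses the sample standard deviation (ddof=1) by default.
toy_z = (toy - toy.mean()) / toy.std()

print(toy_z.round(3))
print(toy_z.mean().round(6))  # each column is centred on 0
print(toy_z.std().round(6))   # each column has unit sample standard deviation
```

After this step, columns measured in very different units (years versus platelet counts) can share one plot axis.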

### 0 - Alive / 1 - Death


In [None]:
heart_dat_n_2 = (x - x.mean()) / (x.std())              # standardization (z-score per column)
data = pd.concat([y,heart_dat_n_2.iloc[:,0:12]],axis=1) # label column plus the 12 standardized features
data = pd.melt(data,id_vars="DEATH_EVENT",
                    var_name="features",
                    value_name='value') 
plt.figure(figsize=(10,10))
sns.violinplot(x="features", y="value", hue="DEATH_EVENT", data=data,split=True, inner="quart")
plt.xticks(rotation=90)

Here we can see that *age, creatinine_phosphokinase, ejection_fraction, serum_creatinine, serum_sodium, and time* all appear to be separated enough between the two classes to be useful for modeling. This is only a face-value interpretation and must be explored further.

In [None]:
plt.figure(figsize=(10,10))
sns.swarmplot(x="features", y="value", hue="DEATH_EVENT", data=data)
plt.xticks(rotation=90)

Here we can see that *age and time* appear to be the best separated of all the features.

In [None]:
plot_1 = heart_dat.drop(['anaemia',
       'diabetes', 'high_blood_pressure', 'sex', 'smoking', 'age', 'creatinine_phosphokinase','ejection_fraction'],axis=1)
pd.plotting.scatter_matrix(plot_1, c=y, figsize=(15, 15),
 marker='o', hist_kwds={'bins': 20}, s=60,
 alpha=.8)

In [None]:
plot_2 = heart_dat.drop(['anaemia',
       'diabetes', 'high_blood_pressure', 'sex', 'smoking','platelets', 'serum_creatinine', 'serum_sodium','time'],axis=1)
pd.plotting.scatter_matrix(plot_2, c=y, figsize=(15, 15),
 marker='o', hist_kwds={'bins': 20}, s=60,
 alpha=.8)

## Recursive feature elimination with cross validation and random forest classification

Here I created a Random Forest Classifier using all of the features to start with.

In [None]:
# Creating a Random Forest Classifier with all features

X_train, X_test, y_train, y_test = train_test_split(
 x, y, test_size=0.2, random_state=10110)

# random forest classifier with the default number of trees (n_estimators is not set here)
clf_rf = RandomForestClassifier(random_state=10110)
clf_rf.fit(X_train,y_train)

ac = accuracy_score(y_test,clf_rf.predict(X_test))
print('Accuracy is: ',ac)
cm = confusion_matrix(y_test,clf_rf.predict(X_test))
sns.heatmap(cm,annot=True,fmt="d")
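Because the classes are imbalanced, accuracy alone can hide how many deaths the model misses. The sketch below shows one way to read a binary confusion matrix. The `y_true` and `y_pred` lists are hypothetical values chosen for illustration, not results from this notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical labels and predictions (0 = alive, 1 = death), for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, rows are true classes and columns are predicted classes,
# so ravel() yields (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)
recall = tp / (tp + fn)        # share of actual deaths the model caught
precision = tp / (tp + fp)     # share of predicted deaths that were correct
f1 = f1_score(y_true, y_pred)  # harmonic mean of precision and recall

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("accuracy:", accuracy, "recall:", recall, "precision:", precision, "f1:", f1)
```

In this toy case the accuracy is 0.7 while recall on the death class is only 0.5, which is the kind of gap the heatmap makes visible.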

Here we implement the recursive feature elimination with cross validation to determine the optimal number of features to be used in our random forest classifier and which features should be used.

In [None]:
# The "accuracy" scoring is proportional to the number of correct classifications
clf_rf_2 = RandomForestClassifier(random_state=10110, n_estimators=30) 
rfecv = RFECV(estimator=clf_rf_2, step=1, cv=5,scoring='accuracy')   #5-fold cross-validation
rfecv = rfecv.fit(X_train, y_train)

print('Optimal number of features :', rfecv.n_features_)
print('Best features :', X_train.columns[rfecv.support_])
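To make the mechanics of RFECV easier to inspect in isolation, here is a small sketch on synthetic data. The data comes from `make_classification`, and the names `X_demo`, `y_demo`, and `selector` are assumptions for illustration; the settings are not tuned for the heart failure data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for the training split: 200 rows, 8 features, 3 informative.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0,
    random_state=0,
)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    step=1,              # drop one feature per elimination round
    cv=5,                # 5-fold cross-validation for each feature count
    scoring="accuracy",
)
selector.fit(X_demo, y_demo)

print("Features kept:", selector.n_features_)
print("Mask of kept features:", selector.support_)
print("Ranking (1 = kept):", selector.ranking_)
```

Features with `ranking_` equal to 1 are the ones retained; larger ranks were eliminated earlier.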

In [None]:
# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score of number of selected features")
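# rfecv.grid_scores_ was removed in scikit-learn 1.2; cv_results_["mean_test_score"] holds the mean CV score per feature count
mean_scores = rfecv.cv_results_["mean_test_score"]
plt.plot(range(1, len(mean_scores) + 1), mean_scores)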
plt.show()

Here I created another model using only the 4 best features identified above (ejection_fraction, platelets, serum_creatinine, and time). The model performed just as well, with an accuracy of 86.6%.

In [None]:
# Creating a Random Forest Classifier with the 4 best features and n_estimators = 30

# y includes our labels and x includes our features
dlist = ['DEATH_EVENT', 'age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
        'high_blood_pressure','serum_sodium', 'sex', 'smoking']
x_2 = heart_dat.drop(dlist,axis=1)
y_2 = heart_dat['DEATH_EVENT']

X_train, X_test, y_train, y_test = train_test_split(
 x_2, y_2, test_size=0.2, random_state=10110)

#random forest classifier with n_estimators=30
clf_rf_3 = RandomForestClassifier(random_state=10110, n_estimators=30)      
clf_rf_3.fit(X_train,y_train)

ac = accuracy_score(y_test,clf_rf_3.predict(X_test))
print('Accuracy is: ',ac)
cm = confusion_matrix(y_test,clf_rf_3.predict(X_test))
sns.heatmap(cm,annot=True,fmt="d")
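The outline also lists tree-based feature selection with a random forest. One common way to do this, sketched here on synthetic data, is to rank features by the forest's impurity-based `feature_importances_`. The column names `f0` to `f5` and all variable names are hypothetical, and this is an illustration rather than the selection method used above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical column names f0..f5 (not the clinical features).
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0,
    random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances: non-negative and summing to 1 across features.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

A ranking like this can be compared with the RFECV result as a sanity check on which features carry the signal.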

## Final Model

Here we train the model on all of the data to make full use of the samples we have. It is common practice to fit the final model on the full dataset once training and validation have been completed.

In [None]:
# Creating the final model, a Random Forest Classifier with the 4 best features and n_estimators = 30

# y includes our labels and x includes our features
dlist = ['DEATH_EVENT', 'age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
        'high_blood_pressure','serum_sodium', 'sex', 'smoking']
x_2 = heart_dat.drop(dlist,axis=1)
y_2 = heart_dat['DEATH_EVENT']


#random forest classifier with n_estimators=30
clf_rf_final = RandomForestClassifier(random_state=10110, n_estimators=30)      
clf_rf_final.fit(x_2,y_2)
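Once the model is fit on every row, no held-out data remains to score it. One possible workflow, sketched below on synthetic data, is to estimate performance with cross-validation first and only then fit on the full dataset. The arrays `X_all` and `y_all` and the `new_rows` placeholder are assumptions for illustration; they mirror the 4 selected columns in count only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with 4 features (not the clinical data).
X_all, y_all = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    random_state=0,
)

model = RandomForestClassifier(n_estimators=30, random_state=0)

# Estimate generalization with 5-fold cross-validation before the final fit;
# cross_val_score clones the estimator, so `model` itself is still unfitted here.
cv_scores = cross_val_score(model, X_all, y_all, cv=5, scoring="accuracy")
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# Final fit on every row; no held-out rows remain to score this model.
model.fit(X_all, y_all)
new_rows = X_all[:3]                  # placeholder for unseen patients
print(model.predict(new_rows))        # predicted class per row
print(model.predict_proba(new_rows))  # columns follow model.classes_
```

The cross-validated score is then the performance figure to report, since the final model has seen every row.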