# **Analysing the Heart Failure prediction dataset.**


We have been provided with 12 key features for a set of patients who may or may not have died during their follow-up period.
Our task is to study how these features correlate with the target and to create a model that predicts the probability of death of a patient given their current condition.

The primary purpose of this notebook is to gain an understanding of how the features are related to our Target (Death Event) and then we move on to the model.

I have avoided adding visualizations just for the sake of it. Each visualization is chosen so that, after studying all of them, we understand every feature and how it is relevant to the dataset.

To keep this notebook simple I haven't added any automated hyperparameter optimization techniques. In case you are interested in those methods, please check my other notebook: [Click Here](https://www.kaggle.com/ankur123xyz/advanced-hyperparameter-tuning-techniques).

For anyone looking to gain further understanding of this dataset, there is an interesting study I found:
[Click Here](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5)

Here is a snapshot of the data before we start, so that it is easier to delve into it.

![](https://www.linkpicture.com/q/Screenshot-2020-08-25-at-5.39.20-PM.png)
Source: [Link](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5/tables/1)

Please do upvote if you learn something out of this notebook. Cheers !


Importing all the necessary libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix,accuracy_score,roc_curve,auc
from sklearn.model_selection import GridSearchCV

Import the dataset and check the first few rows to get a glimpse at the data

In [None]:
dataset = pd.read_csv("/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv")
dataset.head()

Check for missing data.
None of the data is missing, so we can proceed further.

In [None]:
dataset.isnull().sum()

Distribution plot for age
The distribution for age has a mean of 60.8 and S.D. of 11.89. The distribution is close to a normal distribution

In [None]:
f,ax = plt.subplots(figsize=(6,6))
sns.distplot(dataset["age"],kde=True,axlabel="Age",fit=norm)
ax.text(0.8,0.9,"Mean -" + str("{:.2f}".format(dataset["age"].mean())),transform=ax.transAxes)
ax.text(0.8,0.85,"SD -" + str("{:.2f}".format(dataset["age"].std())),transform=ax.transAxes)
ax.set_title("Age Distribution")

Let us see the significance of age for the death event.

We can see that age is definitely a factor: patients who died tend to be older.

In [None]:
f,ax = plt.subplots(figsize=(6,6))
sns.boxplot(x="DEATH_EVENT",y="age",data=dataset)
ax.set_title("Age by Death Event")
ax.set_xlabel("Death Event (0-No, 1-Yes)")
ax.set_ylabel("Age")
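
To back the boxplot with numbers, a grouped summary can help. This is a minimal sketch on a hypothetical toy frame (the values are made up for illustration); on the real data, dataset would take the place of toy.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "age":         [45, 50, 55, 60, 65, 60, 70, 75, 80],
    "DEATH_EVENT": [0, 0, 0, 0, 0, 1, 1, 1, 1],
})

# Median, mean and group size of age for survivors (0) and deaths (1)
summary = toy.groupby("DEATH_EVENT")["age"].agg(["median", "mean", "count"])
print(summary)
```

The same pattern works for any numeric feature, for example ejection_fraction or serum_creatinine.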

Share of anaemic and non-anaemic patients.
Most patients in our dataset are non-anaemic.

In [None]:
data_anaemia = dataset["anaemia"].value_counts()
plt.pie(data_anaemia,explode=(0,0.1),labels=["Non Anaemic","Anaemic"],shadow=True,startangle=90,autopct="%1.1f%%")

Being anaemic may also have an effect on the heart. Let us find out.

Patients with anaemia are at a higher risk of death: 36% of the anaemic patients died, compared with 29% of the non-anaemic patients.

In [None]:
f,ax = plt.subplots(figsize=(6,6))
ax=sns.countplot(x="anaemia",hue="DEATH_EVENT",data=dataset)
patch = ax.patches
half = int(len(patch)/2)

for i in range(half):
    pat_1= patch[i]
    pat_2 = patch[i+half]
    height_1 = pat_1.get_height()
    height_2 = pat_2.get_height()
    total = height_1  + height_2
    width_1 = pat_1.get_x()+pat_1.get_width()/2-.05
    width_2 = pat_2.get_x()+pat_2.get_width()/2-.05
    ax.text(width_1,height_1+2,"{:.0%}".format(height_1/total))
    ax.text(width_2,height_2+2,"{:.0%}".format(height_2/total))
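
The percentages written on the bars are shares within each anaemia group (survived vs. died), because each pair of bars is divided by its own total. As a cross-check that does not depend on the plot, the same shares can be computed with a crosstab. This is a minimal sketch on a hypothetical toy frame, not the real data.

```python
import pandas as pd

# Hypothetical toy data: 6 non-anaemic and 4 anaemic patients
toy = pd.DataFrame({
    "anaemia":     [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "DEATH_EVENT": [0, 0, 0, 0, 1, 1, 0, 0, 1, 1],
})

# normalize="index" divides each row by its row total, so every row
# holds the share of survivors (0) and deaths (1) within that group
death_rate = pd.crosstab(toy["anaemia"], toy["DEATH_EVENT"], normalize="index")
print(death_rate)
```

Replacing toy with dataset, and anaemia with any other binary feature, should reproduce the labels on the corresponding count plot.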

Share of diabetic and non-diabetic patients.

Most patients in our dataset are non-diabetic.

In [None]:
data_diabetes = dataset["diabetes"].value_counts()
plt.pie(data_diabetes,explode=(0,0.1),labels=["Non Diabetic","Diabetic"],shadow=True,startangle=90,autopct="%1.1f%%")

Let us study the impact of diabetes on the death event.

According to our sample, diabetes does not play a part in the death event.

In [None]:
f,ax = plt.subplots(figsize=(6,6))
ax=sns.countplot(x="diabetes",hue="DEATH_EVENT",data=dataset)
patch = ax.patches
half = int(len(patch)/2)

for i in range(half):
    pat_1= patch[i]
    pat_2 = patch[i+half]
    height_1 = pat_1.get_height()
    height_2 = pat_2.get_height()
    total = height_1  + height_2
    width_1 = pat_1.get_x()+pat_1.get_width()/2-.05
    width_2 = pat_2.get_x()+pat_2.get_width()/2-.05
    ax.text(width_1,height_1+2,"{:.0%}".format(height_1/total))
    ax.text(width_2,height_2+2,"{:.0%}".format(height_2/total))

Let us see how significant the ejection fraction is.

We can observe that patients with a lower ejection fraction are at higher risk.

In [None]:
f,ax = plt.subplots(figsize=(6,6))
sns.boxplot(x="DEATH_EVENT",y="ejection_fraction",data=dataset)
ax.set_title("Ejection Fraction by Death Event")
ax.set_xlabel("Death Event (0-No, 1-Yes)")
ax.set_ylabel("Ejection Fraction")

Let us remove some outliers in creatinine phosphokinase (values of 3000 and above) that I observed through a boxplot.

In [None]:
dataset=dataset[dataset["creatinine_phosphokinase"]<3000]
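
The cutoff of 3000 above was picked by eye. For comparison, here is a minimal sketch of the common 1.5 × IQR rule, which is what a boxplot uses to draw its whiskers. This is a swapped-in technique, not what this notebook does, and the cpk values are hypothetical.

```python
import pandas as pd

# Hypothetical values standing in for creatinine_phosphokinase
cpk = pd.Series([60, 80, 110, 150, 250, 580, 600, 7861])

# Interquartile range and the upper whisker limit of a boxplot
q1, q3 = cpk.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Values above the fence would be flagged as outliers
outliers = cpk[cpk > upper_fence]
print(upper_fence, outliers.tolist())
```

On a heavily skewed feature this rule can flag many points, so a hand-picked threshold like the one used here is a reasonable alternative.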

The creatinine phosphokinase (CP) distribution is similar for both death event values, which suggests that it has a lower impact on the death event.

In [None]:
f,ax=plt.subplots(1,2,figsize=(15,7))
sns.distplot(dataset.loc[dataset["DEATH_EVENT"]==0]["creatinine_phosphokinase"],kde=True,axlabel="CP- Death(No)",fit=norm,ax=ax[0])
sns.distplot(dataset.loc[dataset["DEATH_EVENT"]==1]["creatinine_phosphokinase"],kde=True,axlabel="CP - Death(Yes)",fit=norm,ax=ax[1])
ax[0].text(.8,.9,"Mean - " + str("{:.0f}".format(dataset.loc[dataset["DEATH_EVENT"]==0]["creatinine_phosphokinase"].mean())),transform=ax[0].transAxes)
ax[0].text(.8,.8,"S.D. - " + str("{:.0f}".format(dataset.loc[dataset["DEATH_EVENT"]==0]["creatinine_phosphokinase"].std())),transform=ax[0].transAxes)
ax[1].text(.8,.9,"Mean - " + str("{:.0f}".format(dataset.loc[dataset["DEATH_EVENT"]==1]["creatinine_phosphokinase"].mean())),transform=ax[1].transAxes)
ax[1].text(.8,.8,"S.D. - " + str("{:.0f}".format(dataset.loc[dataset["DEATH_EVENT"]==1]["creatinine_phosphokinase"].std())),transform=ax[1].transAxes)

Let us study the impact of high blood pressure on the death event.

The death rate is higher in patients with high blood pressure.

In [None]:
f,ax = plt.subplots(figsize=(6,6))
ax=sns.countplot(x="high_blood_pressure",hue="DEATH_EVENT",data=dataset)
patch = ax.patches
half = int(len(patch)/2)

for i in range(half):
    pat_1= patch[i]
    pat_2 = patch[i+half]
    height_1 = pat_1.get_height()
    height_2 = pat_2.get_height()
    total = height_1  + height_2
    width_1 = pat_1.get_x()+pat_1.get_width()/2
    width_2 = pat_2.get_x()+pat_2.get_width()/2
    ax.text(width_1,height_1+2,"{:.0%}".format(height_1/total))
    ax.text(width_2,height_2+2,"{:.0%}".format(height_2/total))

Let us study the impact of smoking on the death event.

According to our sample, women appear to be at greater risk from smoking than men.

In [None]:
f,(ax1,ax2) = plt.subplots(1,2,figsize=(7,8))
data_female = dataset[dataset["sex"]==0]
data_male = dataset[dataset["sex"]==1]
sns.countplot(x="DEATH_EVENT",hue="smoking",data=data_female,ax=ax1)
sns.countplot(x="DEATH_EVENT",hue="smoking",data=data_male,ax=ax2)
ax1.set_title("Female")
ax2.set_title("Male")
patch_1 = ax1.patches
patch_2 = ax2.patches
half = int(len(patch_1)/2)

for i in range(half):
    pat_1= patch_1[i]
    pat_2 = patch_1[i+half]
    height_1 = pat_1.get_height()
    height_2 = pat_2.get_height()
    total = height_1  + height_2
    width_1 = pat_1.get_x()+pat_1.get_width()/2-.05
    width_2 = pat_2.get_x()+pat_2.get_width()/2-.05
    ax1.text(width_1,height_1+1,"{:.0%}".format(height_1/total))
    ax1.text(width_2,height_2+1,"{:.0%}".format(height_2/total))

for i in range(half):
    pat_1= patch_2[i]
    pat_2 = patch_2[i+half]
    height_1 = pat_1.get_height()
    height_2 = pat_2.get_height()
    total = height_1  + height_2
    width_1 = pat_1.get_x()+pat_1.get_width()/2-.05
    width_2 = pat_2.get_x()+pat_2.get_width()/2-.05
    ax2.text(width_1,height_1+1,"{:.0%}".format(height_1/total))
    ax2.text(width_2,height_2+1,"{:.0%}".format(height_2/total))

Platelet count has a lower impact on the death rate. The mean and SD are comparable for patients who died and patients who survived.

In [None]:
f,ax=plt.subplots(1,2,figsize=(15,7))
sns.distplot(dataset.loc[dataset["DEATH_EVENT"]==0]["platelets"],kde=True,axlabel="Platelet- Death(No)",fit=norm,ax=ax[0])
ax[0].text(.8,.9,"Mean - " + str("{:.0f}".format(dataset.loc[dataset["DEATH_EVENT"]==0]["platelets"].mean())),transform=ax[0].transAxes)
ax[0].text(.8,.8,"S.D. - " + str("{:.0f}".format(dataset.loc[dataset["DEATH_EVENT"]==0]["platelets"].std())),transform=ax[0].transAxes)
sns.distplot(dataset.loc[dataset["DEATH_EVENT"]==1]["platelets"],kde=True,axlabel="Platelet - Death(Yes)",fit=norm,ax=ax[1])
ax[1].text(.8,.9,"Mean - " + str("{:.0f}".format(dataset.loc[dataset["DEATH_EVENT"]==1]["platelets"].mean())),transform=ax[1].transAxes)
ax[1].text(.8,.8,"S.D. - " + str("{:.0f}".format(dataset.loc[dataset["DEATH_EVENT"]==1]["platelets"].std())),transform=ax[1].transAxes)

Let us see how significant serum creatinine is.

We can see that it is better to maintain a lower Serum Creatinine to stay healthy.

In [None]:
f,ax = plt.subplots(figsize=(6,6))
sns.boxplot(x="DEATH_EVENT",y="serum_creatinine",data=dataset)
ax.set_title("Serum Creatinine by Death Event")
ax.set_xlabel("Death Event (0-No, 1-Yes)")
ax.set_ylabel("Serum Creatinine")

Let us see how significant serum sodium is.

We can see that it is better to maintain a higher Serum Sodium to stay healthy.

In [None]:
f,ax = plt.subplots(figsize=(6,6))
sns.boxplot(x="DEATH_EVENT",y="serum_sodium",data=dataset)
ax.set_title("Serum Sodium by Death Event")
ax.set_xlabel("Death Event (0-No, 1-Yes)")
ax.set_ylabel("Serum Sodium")

Let us create a heatmap of the correlation matrix of the features to compare it to our EDA so far.

In [None]:
corr_matrix = np.triu(dataset.corr())
f,ax = plt.subplots(figsize=(10,10))
sns.heatmap(dataset.corr(),cmap="coolwarm",mask=corr_matrix,vmin=-1,annot=True)
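
One detail about the mask: it is built from the correlation values themselves, and as far as I understand seaborn hides a cell wherever the mask is truthy. A correlation of exactly 0 in the upper triangle would therefore stay visible. Below is a minimal sketch of a boolean mask that avoids this, shown on a small hypothetical correlation matrix.

```python
import numpy as np

# Hypothetical 3x3 correlation matrix, for illustration only
corr = np.array([
    [1.0, 0.3, -0.2],
    [0.3, 1.0, 0.1],
    [-0.2, 0.1, 1.0],
])

# True on and above the diagonal, False below it,
# regardless of the correlation values themselves
mask_bool = np.triu(np.ones_like(corr, dtype=bool))
print(mask_bool)
```

Passing mask=mask_bool to sns.heatmap should show only the lower triangle, like the plot above.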

Now let us see whether the findings from the heatmap corroborate the EDA that we have done so far.
1. We have seen that older patients are at higher risk, and the heatmap also shows a clear positive correlation between age and Death_Event.
2. Anaemic patients are at higher risk as per our earlier findings, and we can see a positive correlation in the heatmap.
3. For Creatinine Phosphokinase we were unable to find a definite relation to the death rate, but the heatmap shows a negative correlation with the death rate, albeit a small one.
4. For diabetes we concluded earlier that we could not find a strong correlation with the death rate. The heatmap also shows a very small negative correlation with the death rate.
5. For ejection fraction our heatmap and our earlier analysis agree: the higher the ejection fraction, the lower the risk.
6. Patients with high blood pressure are at higher risk of a death event; again our heatmap and analysis are in sync.
7. A higher platelet count appears to be better for health according to the heatmap. Our distributions were ambiguous, but the means pointed in the same direction.
8. Serum Creatinine and Serum Sodium are positively and negatively correlated with the death rate, respectively, according to both our analysis and the heatmap.
9. For smoking we failed to find any correlation with the death rate.
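
To read the points above off the numbers rather than off the colours, the correlations with the target can be ranked directly. This is a minimal sketch on a hypothetical toy frame; on the real data, dataset would take the place of toy.

```python
import pandas as pd

# Hypothetical toy data: age rises with the death event, platelets do not
toy = pd.DataFrame({
    "age":         [40, 50, 60, 70, 80, 90],
    "platelets":   [3, 1, 2, 3, 1, 2],
    "DEATH_EVENT": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every feature with the target
target_corr = toy.corr()["DEATH_EVENT"].drop("DEATH_EVENT")

# Rank features by the absolute strength of that correlation
strength = target_corr.abs().sort_values(ascending=False)
print(strength)
```

Keep in mind that Pearson correlation only captures linear relationships, so a low value does not prove that a feature is useless to a tree-based model.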

Now let us build a model that predicts the death event based on the data provided.

We drop the follow-up time feature for two reasons. First, patients who died naturally have shorter follow-up periods, because their follow-up was cut off by the death event. Second, we may not have the follow-up period when we make predictions in the real world.

Let us split the data into training and test sets.

In [None]:
y=dataset["DEATH_EVENT"]
X = dataset.drop(["time","DEATH_EVENT"],axis=1)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
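
Since deaths are the minority class, a purely random split can leave the test set with a different death rate than the training set. One option, not used in this notebook, is the stratify argument of train_test_split. Below is a minimal sketch on a hypothetical target with 32 deaths among 100 patients.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 32 deaths among 100 patients
y_toy = np.array([1] * 32 + [0] * 68)
X_toy = np.arange(100).reshape(-1, 1)

# stratify keeps the class proportions roughly equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```

With only a few hundred rows, this can make the test metrics a little more stable from one split to another.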

We will apply a Random Forest model to our data and run a grid search to tune its hyperparameters.

Evaluating the model on the test set gives an AUC of about 0.75. The exact value can vary between runs because the forest is not seeded.

In [None]:


# No random_state is set, so results can differ slightly between runs
rf = RandomForestClassifier()

n_estimators = [100,200,300]
max_depth = [4,5,6,7]
min_samples_split = [4,5,6,7]

params = dict(n_estimators = n_estimators, max_depth = max_depth, min_samples_split = min_samples_split)

grid = GridSearchCV(rf, params, cv = 5, verbose = 1, n_jobs = -1)

grid_r = grid.fit(X_train, y_train)

y_scores = grid_r.best_estimator_.predict_proba(X_test)
y_scores = y_scores[:,1]
fpr,tpr,threshold = roc_curve(y_test,y_scores)

roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
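
Two things could be tried as a variation on the cell above: GridSearchCV selects by accuracy by default while we report AUC, and neither the forest nor the search is seeded. Below is a minimal sketch that tunes on AUC with a fixed random_state. It runs on synthetic data from make_classification (an assumption, so that the block is self-contained) and uses a smaller, hypothetical parameter grid, so its numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the heart failure data (assumption)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=11, n_informative=4,
    weights=[0.68, 0.32], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Select hyperparameters by cross-validated AUC instead of accuracy
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [4, 6]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_tr, y_tr)

# AUC of the refitted best model on the held-out test set
test_auc = roc_auc_score(y_te, grid.predict_proba(X_te)[:, 1])
print(grid.best_params_, round(test_auc, 2))
```

roc_auc_score gives the same area as the roc_curve plus auc combination used above, in a single call.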