# **Introduction**

The heart is a muscular organ made of cardiac muscle and other tissues. Its primary function is to pump blood through the vessels of the circulatory system. Many ailments can arise when this vital organ malfunctions, and one of them is **heart failure**. Heart failure occurs when the heart cannot pump blood as well as it should.

Several conditions can leave the heart in this state, such as narrowing of the arteries or high blood pressure. Certain treatments and lifestyle changes can help manage the disease.


# **Symptoms**

The symptoms of heart failure include, but are not limited to:
- Chest Pain
- Fatigue
- Swelling in body parts
- Fainting
- Shortness of breath

# **Objective**

The primary objective of this notebook is to visualize the trends in the dataset and to predict heart failure for individual patients. Heart failure can have many causes. We'll use Python's visualization libraries to understand the trends in the patient dataset and then build a model to predict heart failure.

# **Prerequisites**

To get the most out of this notebook, make sure that you know the basics of Pandas, Plotly and machine learning algorithms. This notebook aims to help beginners understand the machine learning workflow, so it is kept simple. No advanced knowledge is required.

# **Dataset**

The dataset consists of the following attributes of heart patients.
- Age (Age of the patient)
- Creatinine Phosphokinase (Level of the CPK enzyme in the blood (mcg/L))
- Anaemia (Decrease of red blood cells or hemoglobin)
- Diabetes (If the patient has diabetes)
- Ejection Fraction (Percentage of blood leaving the heart at each contraction (percentage))
- Hypertension (If the patient has hypertension)
- Platelets (Platelets in the blood (kiloplatelets/mL))
- Serum Creatinine (Level of serum creatinine in the blood (mg/dL))
- Serum Sodium (Level of serum sodium in the blood (mEq/L))
- Sex (Gender of the patient)
- Smoking (If the patient smokes or not (boolean))
- Time (Follow-up period (days))
- Death Event (Whether the patient died during the follow-up period)

# **Table of Contents**

- Importing Libraries and Dataset
- Quick Inspection of Dataset
- Exploratory Data Analysis
- Feature Engineering
- Model Building

# **Importing Libraries and Dataset**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly.offline as py
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go

import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Reading the dataset into a dataframe and storing the original copy for later reference

df = pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')
original = df.copy()

# **Quick Inspection of Dataset**

In [None]:
# Printing the shape of the dataset

print('Dataset has', df.shape[0], 'rows and', df.shape[1], 'columns')

In [None]:
# Printing the info of the dataset

df.info()

- This dataset is fairly clean: it has no missing values, so no imputation is required.
- The data types of the features are also sensible and do not require any type conversion.
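
The missing-value claim can also be checked by counting missing entries per column. The sketch below uses a small made-up frame (`toy`, with invented values and one deliberate gap) rather than the real dataset; on the notebook's `df`, the equivalent call would be `df.isna().sum()`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset; values are invented.
toy = pd.DataFrame({
    'age': [60.0, 75.0, None, 50.0],
    'anaemia': [0, 1, 0, 1],
    'DEATH_EVENT': [1, 0, 0, 1],
})

# Count missing values per column; all zeros would mean no imputation is needed.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Total number of missing cells in the frame.
total_missing = int(missing_per_column.sum())
print('Total missing values:', total_missing)
```

A column that reports a non-zero count, like `age` in this toy frame, is one that would need imputation or row removal.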

In [None]:
# Printing the head of the dataset

df.head()

In [None]:
# Printing the descriptive stats of the dataset

df.describe()

# **Exploratory Data Analysis**

In this section, we'll try to understand the factors associated with heart failure by plotting various graphs. The inferences drawn from these graphs will help us build the predictive model. We'll try to answer different questions about the possible causal factors that come to mind.

In [None]:
init_notebook_mode()

In [None]:
#Splitting up the dataset based on the death event feature

failed = df.loc[df['DEATH_EVENT'] == 1, :]
not_failed = df.loc[df['DEATH_EVENT'] == 0, :]

In [None]:
print(failed.shape)
print(not_failed.shape)

## **Are older patients more prone to heart failure**

In [None]:
# Create two traces each with the target class label and plot boxplots by considering age

trace1 = go.Box(y = failed['age'], 
             name = 'Failed',
             marker = dict(color = 'black'))

trace2 = go.Box(y = not_failed['age'], 
             name = 'Not Failed',
             marker = dict(color = '#eb2862'))


layout = go.Layout(title = 'Failure Rate by Age Group',
                  xaxis = dict(title = 'Heart Failure'),
                  yaxis = dict(title = 'Age'))

fig = go.Figure(data = [trace1, trace2], layout = layout)
iplot(fig)

**Inference**
- The median age of patients whose heart failed is higher than that of the others.
- This makes sense, as older patients tend to be more prone to heart failure.
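
The medians read off a box plot can also be computed numerically. This is a minimal sketch on invented data (`toy` is hypothetical); with the real dataset the same `groupby` call would run on `df`.

```python
import pandas as pd

# Hypothetical toy data; the ages and outcomes below are invented for illustration.
toy = pd.DataFrame({
    'age': [50, 55, 60, 65, 70, 80],
    'DEATH_EVENT': [0, 0, 0, 1, 1, 1],
})

# Median age within each outcome group (0 = survived, 1 = died).
median_age = toy.groupby('DEATH_EVENT')['age'].median()
print(median_age)
```

Swapping `'age'` for any other continuous column gives the number behind the corresponding box plot.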

## **Does a decrease in haemoglobin cause heart failure**

In [None]:
#Splitting up the dataset based on the 'anaemia' feature

anaemic = df.loc[df['anaemia'] == 1, :]
not_anaemic = df.loc[df['anaemia'] == 0, :]

In [None]:
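# Calculate the share of patients in each outcome class (anaemic or not).
# The proportions are named 'DEATH_EVENT' explicitly so the result is the same across pandas versions.

failed_anaemia = (anaemic['DEATH_EVENT'].value_counts(normalize = True).sort_index()
                  .rename('DEATH_EVENT').rename_axis('outcome').reset_index())
failed_not_anaemia = (not_anaemic['DEATH_EVENT'].value_counts(normalize = True).sort_index()
                      .rename('DEATH_EVENT').rename_axis('outcome').reset_index())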

In [None]:
# Create two traces of bar plots

trace1 = go.Bar(x = failed_anaemia.index,
                y = failed_anaemia.DEATH_EVENT,
                name = "Anaemic",
                marker = dict(color = '#eb2862',
                             line=dict(color='rgb(0,0,0)',width=1.5)))


trace2 = go.Bar(x = failed_not_anaemia.index,
                y = failed_not_anaemia.DEATH_EVENT,
                name = "Not Anaemic",
                marker = dict(color = '#615a5c',
                             line=dict(color='rgb(0,0,0)',width=1.5)))

In [None]:
# Add appropriate titles and labels

data = [trace1, trace2]
layout = go.Layout(title = 'Failure Rate by Anaemia',
                   xaxis = dict(title = 'Heart Failed'),
                   yaxis = dict(title = '% Patients'),
                   barmode = "group")
fig = go.Figure(data = data, layout = layout)
iplot(fig)

**Inference**
- 35% of the anaemic patients faced heart failure, while the other 65% did not.
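
A rate for one group is easier to interpret next to the rate for the other group. Because `DEATH_EVENT` is coded 0/1, its mean within a group is the share of patients in that group who died. The sketch below shows the idea on invented data (`toy` is hypothetical and its numbers are not from the dataset).

```python
import pandas as pd

# Hypothetical toy data: four anaemic and four non-anaemic patients.
toy = pd.DataFrame({
    'anaemia':     [1, 1, 1, 1, 0, 0, 0, 0],
    'DEATH_EVENT': [1, 1, 0, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column within each group = death rate of that group.
death_rate = toy.groupby('anaemia')['DEATH_EVENT'].mean()
print(death_rate)
```

The same pattern applies to the other binary features in this notebook (`diabetes`, `high_blood_pressure`, `sex`, `smoking`).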

## **Influence of CPK Enzyme in Heart Failure**

In [None]:
# Create two traces each with the target class label and plot boxplots by considering creatinine phosphokinase

trace1 = go.Box(y = failed['creatinine_phosphokinase'], 
             name = 'Failed',
             marker = dict(color = 'black'))

trace2 = go.Box(y = not_failed['creatinine_phosphokinase'], 
             name = 'Not Failed',
             marker = dict(color = '#eb2862'))


layout = go.Layout(title = 'Failure Rate by CPK Level',
                  xaxis = dict(title = 'Heart Failure'),
                  yaxis = dict(title = 'CPK'))

fig = go.Figure(data = [trace1, trace2], layout = layout)
iplot(fig)

**Inference**
- The CPK enzyme level in the blood of affected patients is higher than in the others.
- Even though the difference in enzyme level between the two groups is small, this may also be one of the factors.

## **Diabetes and Heart Failure**

In [None]:
#Splitting up the dataset based on the diabetes feature

diabetic = df.loc[df['diabetes'] == 1, :]
not_diabetic = df.loc[df['diabetes'] == 0, :]

In [None]:
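# Calculate the share of patients in each outcome class (diabetic or not).
# The proportions are named 'DEATH_EVENT' explicitly so the result is the same across pandas versions.

failed_diabetic = (diabetic['DEATH_EVENT'].value_counts(normalize = True).sort_index()
                   .rename('DEATH_EVENT').rename_axis('outcome').reset_index())
failed_not_diabetic = (not_diabetic['DEATH_EVENT'].value_counts(normalize = True).sort_index()
                       .rename('DEATH_EVENT').rename_axis('outcome').reset_index())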

In [None]:
# Create two traces of bar plots

trace1 = go.Bar(x = failed_diabetic.index,
                y = failed_diabetic.DEATH_EVENT,
                name = "Diabetic",
                marker = dict(color = '#eb2862',
                             line=dict(color='rgb(0,0,0)',width=1.5)))


trace2 = go.Bar(x = failed_not_diabetic.index,
                y = failed_not_diabetic.DEATH_EVENT,
                name = "Not Diabetic",
                marker = dict(color = '#615a5c',
                             line=dict(color='rgb(0,0,0)',width=1.5)))

In [None]:
# Add appropriate titles and labels

data = [trace1, trace2]
layout = go.Layout(title = 'Failure Rate by Diabetes',
                   xaxis = dict(title = 'Heart Failed'),
                   yaxis = dict(title = '% Patients'),
                   barmode = "group")
fig = go.Figure(data = data, layout = layout)
iplot(fig)

**Inference**
- 35% of the diabetic patients faced heart failure.
- Only 29% of the non-diabetic patients suffered heart failure.

## **Does a change in the percentage of blood leaving the heart cause it to fail**

In [None]:
# Create two traces each with the target class label and plot boxplots by considering ejection fraction

trace1 = go.Box(y = failed['ejection_fraction'], 
             name = 'Failed',
             marker = dict(color = 'black'))

trace2 = go.Box(y = not_failed['ejection_fraction'], 
             name = 'Not Failed',
             marker = dict(color = '#eb2862'))


layout = go.Layout(title = 'Failure Rate by Ejection Fraction',
                  xaxis = dict(title = 'Heart Failure'),
                  yaxis = dict(title = 'Ejection Fraction'))

fig = go.Figure(data = [trace1, trace2], layout = layout)
iplot(fig)

**Inference**
- The ejection fraction of patients who suffered heart failure tends to be lower than that of the others.

## **Are BP patients more prone to heart failure**

In [None]:
#Splitting up the dataset based on the 'high_blood_pressure' feature

bp = df.loc[df['high_blood_pressure'] == 1, :]
normal = df.loc[df['high_blood_pressure'] == 0, :]

In [None]:
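# Calculate the share of patients in each outcome class (high BP or normal).
# The proportions are named 'DEATH_EVENT' explicitly so the result is the same across pandas versions.

failed_bp = (bp['DEATH_EVENT'].value_counts(normalize = True).sort_index()
             .rename('DEATH_EVENT').rename_axis('outcome').reset_index())
failed_normal = (normal['DEATH_EVENT'].value_counts(normalize = True).sort_index()
                 .rename('DEATH_EVENT').rename_axis('outcome').reset_index())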

In [None]:
# Create two traces of bar plots

trace1 = go.Bar(x = failed_bp.index,
                y = failed_bp.DEATH_EVENT,
                name = "High Blood Pressure",
                marker = dict(color = '#eb2862',
                             line=dict(color='rgb(0,0,0)',width=1.5)))


trace2 = go.Bar(x = failed_normal.index,
                y = failed_normal.DEATH_EVENT,
                name = "Normal",
                marker = dict(color = '#615a5c',
                             line=dict(color='rgb(0,0,0)',width=1.5)))

In [None]:
# Add appropriate titles and labels

data = [trace1, trace2]
layout = go.Layout(title = 'Failure Rate by High BP',
                   xaxis = dict(title = 'Heart Failed'),
                   yaxis = dict(title = '% Patients'),
                   barmode = "group")
fig = go.Figure(data = data, layout = layout)
iplot(fig)

**Inference**
- 38% of the patients who have high BP also suffered heart failure.

## **Does platelet count really matter**

In [None]:
# Create two traces each with the target class label and plot boxplots by considering platelets

trace1 = go.Box(y = failed['platelets'], 
             name = 'Failed',
             marker = dict(color = 'black'))

trace2 = go.Box(y = not_failed['platelets'], 
             name = 'Not Failed',
             marker = dict(color = '#eb2862'))


layout = go.Layout(title = 'Failure Rate by Platelets Count',
                  xaxis = dict(title = 'Heart Failure'),
                  yaxis = dict(title = 'Platelets'))

fig = go.Figure(data = [trace1, trace2], layout = layout)
iplot(fig)

**Inference**
- Even though the difference in platelet counts between the two groups is subtle, the affected patients have comparatively fewer platelets.

## **What difference does serum creatinine make**

In [None]:
# Create two traces each with the target class label and plot boxplots by considering creatinine

trace1 = go.Box(y = failed['serum_creatinine'], 
             name = 'Failed',
             marker = dict(color = 'black'))

trace2 = go.Box(y = not_failed['serum_creatinine'], 
             name = 'Not Failed',
             marker = dict(color = '#eb2862'))


layout = go.Layout(title = 'Failure Rate by Serum Creatinine Level',
                  xaxis = dict(title = 'Heart Failure'),
                  yaxis = dict(title = 'Serum Creatinine'))

fig = go.Figure(data = [trace1, trace2], layout = layout)
iplot(fig)

**Inference**
- The serum creatinine level in the patients who suffered heart failure is higher than in the others.

## **Impact of Sodium level in Heart Failure**

In [None]:
# Create two traces each with the target class label and plot boxplots by considering sodium

trace1 = go.Box(y = failed['serum_sodium'], 
             name = 'Failed',
             marker = dict(color = 'black'))

trace2 = go.Box(y = not_failed['serum_sodium'], 
             name = 'Not Failed',
             marker = dict(color = '#eb2862'))


layout = go.Layout(title = 'Failure Rate by Serum sodium Level',
                  xaxis = dict(title = 'Heart Failure'),
                  yaxis = dict(title = 'Serum Sodium'))

fig = go.Figure(data = [trace1, trace2], layout = layout)
iplot(fig)

**Inference**
- There is no significant difference observed in the two groups.

## **Is a particular gender more prone to heart failure**

In [None]:
#Splitting up the dataset based on the 'sex' feature

men = df.loc[df['sex'] == 1, :]
women = df.loc[df['sex'] == 0, :]

In [None]:
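# Calculate the share of patients in each outcome class for each gender.
# The proportions are named 'DEATH_EVENT' explicitly so the result is the same across pandas versions.

failed_men = (men['DEATH_EVENT'].value_counts(normalize = True).sort_index()
              .rename('DEATH_EVENT').rename_axis('outcome').reset_index())
failed_women = (women['DEATH_EVENT'].value_counts(normalize = True).sort_index()
                .rename('DEATH_EVENT').rename_axis('outcome').reset_index())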

In [None]:
# Create two traces of bar plots

trace1 = go.Bar(x = failed_men.index,
                y = failed_men.DEATH_EVENT,
                name = "Men",
                marker = dict(color = '#eb2862',
                             line=dict(color='rgb(0,0,0)',width=1.5)))


trace2 = go.Bar(x = failed_women.index,
                y = failed_women.DEATH_EVENT,
                name = "Women",
                marker = dict(color = '#615a5c',
                             line=dict(color='rgb(0,0,0)',width=1.5)))

In [None]:
# Add appropriate titles and labels

data = [trace1, trace2]
layout = go.Layout(title = 'Failure Rate by Gender',
                   xaxis = dict(title = 'Heart Failed'),
                   yaxis = dict(title = '% Patients'),
                   barmode = "group")
fig = go.Figure(data = data, layout = layout)
iplot(fig)

**Inference**
- This graph suggests that the failure rate is nearly equal for the two genders.

## **Impact of smoking on the functioning of heart**

In [None]:
#Splitting up the dataset based on the 'smoking' feature

smoking = df.loc[df['smoking'] == 1, :]
not_smoking = df.loc[df['smoking'] == 0, :]

In [None]:
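# Calculate the share of patients in each outcome class (smoker or not).
# The proportions are named 'DEATH_EVENT' explicitly so the result is the same across pandas versions.

failed_smoking = (smoking['DEATH_EVENT'].value_counts(normalize = True).sort_index()
                  .rename('DEATH_EVENT').rename_axis('outcome').reset_index())
failed_no_smoking = (not_smoking['DEATH_EVENT'].value_counts(normalize = True).sort_index()
                     .rename('DEATH_EVENT').rename_axis('outcome').reset_index())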

In [None]:
# Create two traces of bar plots

trace1 = go.Bar(x = failed_smoking.index,
                y = failed_smoking.DEATH_EVENT,
                name = "Smoker",
                marker = dict(color = '#eb2862',
                             line=dict(color='rgb(0,0,0)',width=1.5)))


trace2 = go.Bar(x = failed_no_smoking.index,
                y = failed_no_smoking.DEATH_EVENT,
                name = "Non Smoker",
                marker = dict(color = '#615a5c',
                             line=dict(color='rgb(0,0,0)',width=1.5)))

In [None]:
# Add appropriate titles and labels

data = [trace1, trace2]
layout = go.Layout(title = 'Failure Rate by Smoking',
                   xaxis = dict(title = 'Heart Failed'),
                   yaxis = dict(title = '% Patients'),
                   barmode = "group")
fig = go.Figure(data = data, layout = layout)
iplot(fig)

**Inference**
- 31% of the smokers have faced heart failure.

In [None]:
fig, ax = plt.subplots(figsize = (14, 10))

sns.heatmap(df.corr(), annot = True, cmap = 'summer')
plt.show()

# **Feature Engineering**

In [None]:
df.head()

Let's split up the features and then scale them since the features are of different scales.

In [None]:
scale = df.drop(columns = ['anaemia', 'diabetes', 'high_blood_pressure', 'sex', 'smoking', 'DEATH_EVENT'])
no_scale = df[['anaemia', 'diabetes', 'high_blood_pressure', 'sex', 'smoking', 'DEATH_EVENT']]

In [None]:
# Import StandardScaler for standardizing the features

from sklearn.preprocessing import StandardScaler

**Standardization** is a technique that rescales each feature so that it has a mean of 0 and a standard deviation of 1. When it is applied to a dataframe, every feature ends up on a comparable scale with unit standard deviation.
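
Standardization can be written out by hand: subtract the feature's mean from every value and divide by its standard deviation. The sketch below does this in plain Python on an invented list of ages (`ages` is hypothetical). It uses the population standard deviation (dividing by n), which is the convention `StandardScaler` follows.

```python
import statistics

# Hypothetical ages, invented for illustration.
ages = [40.0, 50.0, 60.0, 70.0, 80.0]

mean = statistics.fmean(ages)
# Population standard deviation (divides by n, not n - 1).
std = statistics.pstdev(ages)

# z-score: subtract the mean, then divide by the standard deviation.
z_scores = [(x - mean) / std for x in ages]
print(z_scores)

# After standardization the values have mean 0 and standard deviation 1.
print(statistics.fmean(z_scores), statistics.pstdev(z_scores))
```

Features measured in very different units, such as platelet counts and ejection fraction, become directly comparable after this step.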

In [None]:
# Fit the scaler on the continuous features and transform them
# (note: this uses the full dataset, before the train/test split)

sc = StandardScaler()

sc.fit(scale)

scaled = pd.DataFrame(sc.transform(scale), columns = scale.columns)

In [None]:
scaled.head(2)

In [None]:
no_scale.head(2)

In [None]:
scaled_df = pd.concat([scaled, no_scale], axis = 1)

scaled_df.head(3)

# **Model Building**

Let's build a baseline model with all the features in the dataset. The dataset has 12 independent features. We'll then reduce the model complexity by reducing the number of independent features.

In [None]:
# Importing required libraries

from sklearn.model_selection import train_test_split

The independent and dependent features need to be separated into a feature dataframe `X` and a target series `y`.

In [None]:
X = scaled_df.drop(columns = ['DEATH_EVENT'])
y = scaled_df['DEATH_EVENT']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, stratify = y, random_state = 36)

We've split the data so that the testing set holds 30% of the rows. The `stratify` parameter ensures that the class proportions of the dependent feature are preserved in both the training and testing sets.
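
To make the effect of stratification concrete, here is a simplified plain-Python sketch of a stratified split. It is not scikit-learn's implementation; `stratified_split` is a hypothetical helper and the labels are invented. It splits each class separately, so both parts keep the same share of positive cases.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=36):
    """Return (train_idx, test_idx) with class proportions preserved in each part."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then send the same fraction of it to the test set.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 70 survivors (0) and 30 deaths (1).
labels = [0] * 70 + [1] * 30
train_idx, test_idx = stratified_split(labels)

# Share of positive cases in each part.
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_rate, test_rate)
```

Without stratification, a small dataset can easily end up with a test set whose class balance differs from the training set, which makes the test score harder to trust.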

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

## **Logistic Regression**

Logistic Regression estimates the probability that an observation belongs to a particular class by passing a weighted sum of the features through a sigmoid curve.
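
The sigmoid maps any real number to a value between 0 and 1, which is then read as a probability. The sketch below shows the calculation for a single observation; the weights, bias and feature values are invented for illustration and are not taken from the fitted model.

```python
import math

def sigmoid(z):
    """Map a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one observation."""
    # Weighted sum of the features plus the bias term.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights and one hypothetical standardized record.
weights = [0.8, -0.9]
bias = -0.5
patient = [1.0, -1.0]

p = predict_proba(patient, weights, bias)
# Classify as positive when the probability reaches 0.5.
label = int(p >= 0.5)
print(round(p, 3), label)
```

A positive weight pushes the probability up as the feature grows, and a negative weight pushes it down.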

In [None]:
lr = LogisticRegression(solver = 'liblinear', random_state = 42)

lr_model = lr.fit(X_train, y_train)
print('Training Score:', lr_model.score(X_train, y_train))
print('Testing Score:', lr_model.score(X_test, y_test))

The training and test scores both look decent. Let's try other models as well before settling on the best approach.

## **Decision Tree Classifier**

The Decision Tree algorithm creates a model that predicts the class or value of the target variable by learning simple decision rules inferred from the training dataset.

In [None]:
dt = DecisionTreeClassifier(random_state = 42)

dt_model = dt.fit(X_train, y_train)
print('Training Score:', dt_model.score(X_train, y_train))
print('Testing Score:', dt_model.score(X_test, y_test))

The training score is 1.0, which means the tree has memorized the training data. The testing score is only 0.7, which clearly shows that the model is overfitting the training dataset.

## **Random Forest Classifier**

Random Forest is an ensemble method that combines multiple decision trees. The predictions from the different decision trees are aggregated to obtain the final prediction.

In [None]:
rf = RandomForestClassifier(random_state = 42)

rf_model = rf.fit(X_train, y_train)
print('Training Score:', rf_model.score(X_train, y_train))
print('Testing Score:', rf_model.score(X_test, y_test))

The training score is 1.0 whereas the testing score is 0.77, so the forest still overfits, though it generalizes better than the single decision tree.

## **K-Neighbors Classifier**

The K-Neighbors Classifier predicts the target label of an observation by a majority vote among its k nearest neighbors in the training set.
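
The idea fits in a few lines of plain Python. The sketch below is a simplified illustration, not scikit-learn's implementation: `knn_predict` is a hypothetical helper, the points are invented, and distance is plain Euclidean. The default of `k=5` mirrors the default `n_neighbors` of `KNeighborsClassifier()`.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair every training point's distance to the query with its label, nearest first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k nearest neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-feature training data with two well separated groups.
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9), (0.9, 1.2)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (0.15, 0.15), k=3))
print(knn_predict(train_points, train_labels, (1.0, 1.05), k=3))
```

Because the vote depends on distances, this model is sensitive to feature scale, which is one reason the features were standardized earlier.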

In [None]:
knn = KNeighborsClassifier()

knn_model = knn.fit(X_train, y_train)
print('Training Score:', knn_model.score(X_train, y_train))
print('Testing Score:', knn_model.score(X_test, y_test))

The accuracy on the training set is high, but the accuracy on the testing set is considerably lower.

## **Support Vector Classifier**

The SVM classifier forms a hyperplane to differentiate between different target labels and predict the outcomes.

In [None]:
svm = SVC(kernel = 'linear')

svm_model = svm.fit(X_train, y_train)
print('Training Score:', svm_model.score(X_train, y_train))
print('Testing Score:', svm_model.score(X_test, y_test))

The SVM model gives decent training and testing scores.

## **Naive Bayes Classifier**

The Naive Bayes Classifier predicts the outcome by calculating the conditional probability of each target label given the features, assuming that the features are independent of one another.

In [None]:
nb = GaussianNB()

nb_model = nb.fit(X_train, y_train)
print('Training Score:', nb_model.score(X_train, y_train))
print('Testing Score:', nb_model.score(X_test, y_test))

The training and testing scores are both low.

As the Logistic Regression model has decent training and testing scores, let's check how it could be improved further. We'll also try to reduce the number of features to reduce the model complexity.

In [None]:
# Check the weights given to features in the LR model

imp = lr_model.coef_

This gives the array of weights that the logistic regression assigns to each of the features.

In [None]:
imp

In [None]:
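# Use the fitted coefficients directly instead of retyping them. Values from the original run:
# [0.86033582,  0.07359915, -0.88052799, -0.29623866,  0.35454231, -0.26178752,
#  -1.57917285, -0.10954753,  0.16821838, -0.60792902, -0.69118234,  0.13141275]
data = imp[0]
cols = X_train.columns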

In [None]:
# Plotting the coefficients of LR Model to observe visually

fig, ax = plt.subplots(figsize = (14, 7))

sns.barplot(x = cols, y = data, palette = 'winter')
plt.title('Coefficients of LR Model')
plt.xlabel('Feature')
plt.ylabel('Coefficient')
plt.xticks(rotation = 90)
plt.show()

Let's try a reduced number of features to keep the model from getting too complex. We'll try different combinations and settle on the best one.
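
One way to shortlist candidates is to rank the coefficients by absolute value. The sketch below does that with the coefficient values printed earlier in this notebook. The feature order is an assumption: it follows how `scaled_df` was assembled (scaled columns first, then the binary ones) and should be checked against `X_train.columns`.

```python
# Coefficients copied from the notebook's earlier output.
coefficients = [0.86033582, 0.07359915, -0.88052799, -0.29623866, 0.35454231,
                -0.26178752, -1.57917285, -0.10954753, 0.16821838, -0.60792902,
                -0.69118234, 0.13141275]

# Assumed column order of X_train (scaled features first, then binary features).
feature_names = ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets',
                 'serum_creatinine', 'serum_sodium', 'time', 'anaemia', 'diabetes',
                 'high_blood_pressure', 'sex', 'smoking']

# Sort features from the largest to the smallest absolute coefficient.
ranked = sorted(zip(feature_names, coefficients),
                key=lambda pair: abs(pair[1]), reverse=True)

for name, coef in ranked:
    print(f'{name:<26}{coef:+.3f}')
```

The ranking is only a starting point. The binary features were not standardized, so their coefficients are not strictly comparable with the scaled ones, and the final feature list should still be validated on the test set.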

In [None]:
features = ['age', 'ejection_fraction', 'time', 'serum_creatinine', 'serum_sodium']

In [None]:
lr = LogisticRegression(solver = 'liblinear', random_state = 42, C = 0.05, max_iter = 300)

lr_model = lr.fit(X_train[features], y_train)
print('Training Score:', lr_model.score(X_train[features], y_train))
print('Testing Score:', lr_model.score(X_test[features], y_test))

The model performs decently on both the training and testing datasets, with an accuracy of ~84%. This suggests that the model is neither underfitted nor overfitted.

# **Further Steps**

A baseline model was built first and then fine-tuned. An overly complex model may overfit, whereas an overly simple model may underfit. Hence it is very important to make sure that the model has learnt the underlying patterns in the data. The model can be further improved by:

- Hyperparameter tuning
    - By performing GridSearchCV or RandomizedSearchCV
- Averaging techniques
    - By using either hard or soft Voting Classifier
- Bagging techniques
    - Bagging Classifier
- Boosting techniques
    - AdaBoost Classifier
    - Gradient Boosting Classifier
    - Extreme Gradient Boosting
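
As a rough illustration of the hyperparameter tuning item, the sketch below runs `GridSearchCV` over the regularization strength `C` of a logistic regression. It uses synthetic data from `make_classification` rather than the patient dataset, and the grid values are arbitrary assumptions; in this notebook, `X_train[features]` and `y_train` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the patient data; not the real dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=0)

# Hypothetical grid of regularization strengths to try.
param_grid = {'C': [0.01, 0.05, 0.1, 1.0, 10.0]}

# Evaluate every value of C with 5-fold cross-validation on accuracy.
search = GridSearchCV(
    LogisticRegression(solver='liblinear', random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

print('Best C:', search.best_params_['C'])
print('Mean cross-validated accuracy:', round(search.best_score_, 3))
```

Cross-validation scores each candidate on several held-out folds, so the chosen value depends less on one particular train/test split.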

# **Conclusion**

I'll update this notebook at regular intervals with more interesting insights and sophisticated models. You can carry out the points mentioned in the further steps yourself, which will help you gain more knowledge of predictive modelling.

Hope you enjoyed reading this notebook. Please upvote and leave your comments if you like my work. Thanks!