# HEART FAILURE PREDICTION

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.

Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.

First, I will load the dataset and perform an exploratory analysis.

In [12]:
import pandas as pd

df = pd.read_csv('heart_failure_clinical_records_dataset.csv',sep=',')

display(df.sample(5))
print('Dataset length = '+str(len(df))+' rows')

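| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 245 | 61.0 | 1 | 80 | 1 | 38 | 0 | 282000.0 | 1.4 | 137 | 1 | 0 | 213 | 0 |
| 201 | 45.0 | 0 | 308 | 1 | 60 | 1 | 377000.0 | 1.0 | 136 | 1 | 0 | 186 | 0 |
| 48 | 80.0 | 1 | 553 | 0 | 20 | 1 | 140000.0 | 4.4 | 133 | 1 | 0 | 41 | 1 |
| 38 | 60.0 | 0 | 2656 | 1 | 30 | 0 | 305000.0 | 2.3 | 137 | 1 | 0 | 30 | 0 |
| 188 | 60.667 | 1 | 151 | 1 | 40 | 1 | 201000.0 | 1.0 | 136 | 0 | 0 | 172 | 0 |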


Dataset length = 299 rows


Now I will describe each variable:

- <u>Age</u>: age of the patient
- <u>Anaemia</u>: decrease of red blood cells or hemoglobin
- <u>Creatinine phosphokinase</u>: level of the CPK enzyme in the blood (mcg/L)
- <u>Diabetes</u>: if the patient has diabetes
- <u>Ejection fraction</u>: percentage of blood leaving the heart at each contraction
- <u>High blood pressure</u>: if the patient has hypertension
- <u>Platelets</u>: platelets in the blood (kiloplatelets/mL)
- <u>Serum creatinine</u>: level of serum creatinine in the blood (mg/dL)
- <u>Serum sodium</u>: level of serum sodium in the blood (mEq/L) 
- <u>Sex</u>: woman or man
- <u>Smoking</u>: if the patient smokes or not
- <u>Time</u>: follow-up period (days)
- <u>DEATH EVENT</u>: if the patient died during the follow-up period

Let's see the data type of each one:

In [16]:
# df.info() prints its summary directly and returns None,
# so it does not need to be wrapped in display()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB

As we can see, there are no NaN values (all 13 columns have 299 non-null entries), and every column is numeric.
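The same two checks can also be made explicitly instead of reading them off the `info()` summary. The snippet below is only a minimal sketch: to stay self-contained it builds a small frame (`sample`, a name introduced here) from a few columns of the five rows displayed earlier. In the notebook itself the same calls would presumably be applied to `df`.

```python
import pandas as pd

# A few columns of the five rows shown in the sample above; in the notebook,
# the full `df` loaded from the CSV would be used instead.
sample = pd.DataFrame({
    "age": [61.0, 45.0, 80.0, 60.0, 60.667],
    "serum_creatinine": [1.4, 1.0, 4.4, 2.3, 1.0],
    "serum_sodium": [137, 136, 133, 137, 136],
    "DEATH_EVENT": [0, 0, 1, 0, 0],
})

# Count missing values in each column
missing_per_column = sample.isna().sum()

# True only if every column has a numeric dtype (int or float)
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in sample.dtypes)

print(missing_per_column)
print("Total missing values:", int(missing_per_column.sum()))
print("All columns numeric:", all_numeric)
```

An explicit check like this is easier to reuse later, for instance as an assertion before training a model.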

Let's do a statistical exploration:

In [17]:
display(df.describe())

| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 |
| mean | 60.833893 | 0.431438 | 581.839465 | 0.41806 | 38.083612 | 0.351171 | 263358.029264 | 1.39388 | 136.625418 | 0.648829 | 0.32107 | 130.26087 | 0.32107 |
| std | 11.894809 | 0.496107 | 970.287881 | 0.494067 | 11.834841 | 0.478136 | 97804.236869 | 1.03451 | 4.412477 | 0.478136 | 0.46767 | 77.614208 | 0.46767 |
| min | 40.0 | 0.0 | 23.0 | 0.0 | 14.0 | 0.0 | 25100.0 | 0.5 | 113.0 | 0.0 | 0.0 | 4.0 | 0.0 |
| 25% | 51.0 | 0.0 | 116.5 | 0.0 | 30.0 | 0.0 | 212500.0 | 0.9 | 134.0 | 0.0 | 0.0 | 73.0 | 0.0 |
| 50% | 60.0 | 0.0 | 250.0 | 0.0 | 38.0 | 0.0 | 262000.0 | 1.1 | 137.0 | 1.0 | 0.0 | 115.0 | 0.0 |
| 75% | 70.0 | 1.0 | 582.0 | 1.0 | 45.0 | 1.0 | 303500.0 | 1.4 | 140.0 | 1.0 | 1.0 | 203.0 | 1.0 |
| max | 95.0 | 1.0 | 7861.0 | 1.0 | 80.0 | 1.0 | 850000.0 | 9.4 | 148.0 | 1.0 | 1.0 | 285.0 | 1.0 |


As we can see in the table above, there are 6 boolean features: they are stored as integers, and for each one the minimum value is 0 and the maximum is 1. Those features are *anaemia*, *diabetes*, *high_blood_pressure*, *sex*, *smoking* and *DEATH_EVENT*.
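The boolean columns can also be found programmatically rather than by reading the min and max rows. This is a rough sketch, not part of the original analysis: `binary_columns` is a hypothetical helper, and `sample` again holds only a few columns of the five rows displayed earlier. Checking the set of distinct values is a little stricter than comparing min and max, since a float column ranging over [0, 1] would pass the min/max test without being boolean.

```python
import pandas as pd

# A few columns of the five sample rows shown earlier; in the notebook,
# the full `df` would be passed to the helper instead.
sample = pd.DataFrame({
    "age": [61.0, 45.0, 80.0, 60.0, 60.667],
    "anaemia": [1, 0, 1, 0, 1],
    "diabetes": [1, 1, 0, 1, 1],
    "ejection_fraction": [38, 60, 20, 30, 40],
    "high_blood_pressure": [0, 1, 1, 0, 1],
    "serum_creatinine": [1.4, 1.0, 4.4, 2.3, 1.0],
    "DEATH_EVENT": [0, 0, 1, 0, 0],
})


def binary_columns(frame):
    """Return the columns whose distinct non-null values all lie in {0, 1}."""
    return [
        col for col in frame.columns
        if set(frame[col].dropna().unique()) <= {0, 1}
    ]


print(binary_columns(sample))
```

On the full dataset this would be expected to return the six columns listed above, although that has not been run here.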

Additionally, we can deduce that the label