# Heart Failure Death Prediction

*Data Source* - https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records#

*Features* - The dataset has thirteen (13) columns: twelve clinical features and one target label:

- age: age of the patient (years)
- anaemia: decrease of red blood cells or hemoglobin (boolean)
- high blood pressure: if the patient has hypertension (boolean)
- creatinine phosphokinase (CPK): level of the CPK enzyme in the blood (mcg/L)
- diabetes: if the patient has diabetes (boolean)
- ejection fraction: percentage of blood leaving the heart at each contraction (percentage)
- platelets: platelets in the blood (kiloplatelets/mL)
- sex: woman or man (binary)
- serum creatinine: level of serum creatinine in the blood (mg/dL)
- serum sodium: level of serum sodium in the blood (mEq/L)
- smoking: if the patient smokes or not (boolean)
- time: follow-up period (days)
- death event (target/label): if the patient died during the follow-up period (boolean)
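
The list above can be turned into a quick schema check before any modelling. This is a minimal sketch: the column names are the ones that appear in the CSV header of this notebook, while `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Column names as they appear in the CSV header used in this notebook.
EXPECTED_COLUMNS = [
    "age", "anaemia", "creatinine_phosphokinase", "diabetes",
    "ejection_fraction", "high_blood_pressure", "platelets",
    "serum_creatinine", "serum_sodium", "sex", "smoking", "time",
    "DEATH_EVENT",
]


def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Tiny illustrative frame that deliberately lacks the target column.
sample = pd.DataFrame({col: [0] for col in EXPECTED_COLUMNS[:-1]})
print(missing_columns(sample))  # → ['DEATH_EVENT']
```

An empty list means every expected column is present.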

*Method* - We use the PyCaret classification module to predict the `DEATH_EVENT` of heart failure patients.

*Reference* - https://www.pycaret.org/tutorials/html/CLF101.html
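
This notebook only covers data preparation, but a rough sketch of how the prepared data might later be fed to PyCaret could look like the following. It assumes PyCaret's functional API (`setup`, `compare_models`, `predict_model`). The helper name `train_and_predict` and the seed value are illustrative choices, not part of the original workflow, and the exact arguments and output column names vary between PyCaret versions.

```python
import pandas as pd


def train_and_predict(df_learn: pd.DataFrame, df_predict: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fit a classifier on df_learn and score df_predict."""
    # Imported lazily so this sketch can be loaded without PyCaret installed.
    from pycaret.classification import setup, compare_models, predict_model

    # session_id fixes the random seed; 123 is an arbitrary choice.
    setup(data=df_learn, target="DEATH_EVENT", session_id=123)

    # Cross-validate the available classifiers and keep the best one.
    best_model = compare_models()

    # Returns df_predict with prediction columns appended
    # (their names depend on the PyCaret version).
    return predict_model(best_model, data=df_predict)
```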

## Step 1 - Data Preparation

Check for null values and hold out a small portion of the data for prediction.

In [12]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [13]:
df = pd.read_csv("data/heart_failure.csv")

print(df.shape)
df.head()

(299, 13)


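| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |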


In [14]:
# check for null values

df.isna().sum()

age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64
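
Every count above is zero, so no imputation is needed here. For datasets where that is not guaranteed, the same check can be reduced to a short report of only the offending columns. This is a minimal sketch: `columns_with_missing` is a hypothetical helper, and the demo frame uses a deliberately inserted gap rather than real data.

```python
import pandas as pd


def columns_with_missing(df: pd.DataFrame) -> dict:
    """Map each column that contains nulls to its null count (hypothetical helper)."""
    counts = df.isna().sum()
    return {col: int(n) for col, n in counts.items() if n > 0}


# Illustrative frame with one artificial gap in "age".
demo = pd.DataFrame({"age": [75.0, None, 65.0], "serum_sodium": [130, 136, 129]})
report = columns_with_missing(demo)
print(report)  # → {'age': 1}
```

An empty dictionary corresponds to the all-zero output shown for this dataset.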

In [15]:
# Convert ejection_fraction from a percentage to a decimal fraction.

df["ejection_fraction"] = df["ejection_fraction"] / 100

df.head()

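| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 0.2 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 0.38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 0.2 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 0.2 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 0.2 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |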
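
Because the conversion overwrites the column in place, running that cell a second time would divide by 100 again. One possible guard is sketched below. It rests on the assumption that a percentage-scale ejection fraction is always greater than 1, and `to_fraction` is a hypothetical helper rather than part of the original notebook.

```python
import pandas as pd


def to_fraction(values: pd.Series) -> pd.Series:
    """Divide by 100 only if the values still look like percentages.

    Assumption: percentage-scale values exceed 1, fraction-scale values do not.
    """
    if values.max() > 1:
        return values / 100
    return values


# Ejection fractions of the first five rows, in percent.
ef = pd.Series([20, 38, 20, 20, 20], name="ejection_fraction")
once = to_fraction(ef)     # converted to the 0-1 scale
twice = to_fraction(once)  # already converted, so left unchanged
print(twice.equals(once))  # → True
```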


In [16]:
# randomly select 10 observations for prediction using the learned model

df_predict = df.sample(10, random_state=123)
df_predict

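| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 202 | 70.0 | 0 | 97 | 0 | 0.6 | 1 | 220000.0 | 0.9 | 138 | 1 | 0 | 186 | 0 |
| 114 | 60.0 | 1 | 754 | 1 | 0.4 | 1 | 328000.0 | 1.2 | 126 | 1 | 0 | 91 | 0 |
| 164 | 45.0 | 0 | 2442 | 1 | 0.3 | 0 | 334000.0 | 1.1 | 139 | 1 | 0 | 129 | 1 |
| 277 | 70.0 | 0 | 582 | 1 | 0.38 | 0 | 25100.0 | 1.1 | 140 | 1 | 0 | 246 | 0 |
| 292 | 52.0 | 0 | 190 | 1 | 0.38 | 0 | 382000.0 | 1.0 | 140 | 1 | 1 | 258 | 0 |
| 266 | 55.0 | 0 | 1199 | 0 | 0.2 | 0 | 263358.03 | 1.83 | 134 | 1 | 1 | 241 | 1 |
| 11 | 62.0 | 0 | 231 | 0 | 0.25 | 1 | 253000.0 | 0.9 | 140 | 1 | 1 | 10 | 1 |
| 171 | 52.0 | 0 | 3966 | 0 | 0.4 | 0 | 325000.0 | 0.9 | 140 | 1 | 1 | 146 | 0 |
| 238 | 65.0 | 1 | 720 | 1 | 0.4 | 0 | 257000.0 | 1.0 | 136 | 0 | 0 | 210 | 0 |
| 263 | 68.0 | 1 | 157 | 1 | 0.6 | 0 | 208000.0 | 1.0 | 140 | 0 | 0 | 237 | 0 |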


In [17]:
# The remaining 289 observations are used for machine learning (training and testing)

df_learn = df.drop(df_predict.index)
print(df_learn.shape)

(289, 13)
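
The split above works because `drop` removes exactly the row labels that `sample` selected. A small sanity check, sketched here on the first five rows of the dataset reduced to three columns, confirms that the two subsets are disjoint and together cover the original frame. The sample size of 2 is chosen only to fit the tiny demo frame.

```python
import pandas as pd

# First five rows of the dataset, reduced to three columns for brevity.
df = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 50.0, 65.0],
    "time": [4, 6, 7, 7, 8],
    "DEATH_EVENT": [1, 1, 1, 1, 1],
})

df_predict = df.sample(2, random_state=123)  # held-out rows
df_learn = df.drop(df_predict.index)         # everything else

# No row label may appear in both subsets, and together they must cover df.
assert df_learn.index.intersection(df_predict.index).empty
assert len(df_learn) + len(df_predict) == len(df)
```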


In [18]:
df_learn.to_csv("data/learn_dataset.csv", index=False)
df_predict.to_csv("data/predict_dataset.csv", index=False)
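
Note that `index=False` drops the original row labels, so the saved files cannot be traced back to rows of the source CSV by position. The sketch below illustrates this with two of the held-out rows and a temporary directory. The directory and the two-column frame are illustrative stand-ins, not the files written above.

```python
import os
import tempfile

import pandas as pd

# Two held-out rows with their original labels, reduced to two columns.
held_out = pd.DataFrame(
    {"age": [70.0, 60.0], "DEATH_EVENT": [0, 0]},
    index=[202, 114],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "predict_dataset.csv")
    held_out.to_csv(path, index=False)  # labels 202 and 114 are not written
    reloaded = pd.read_csv(path)

print(list(reloaded.index))  # → [0, 1]
```

If the original labels matter downstream, one option is to write the files with `index=True` and read them back with `index_col=0`.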

## The End