# Heart Failure Death Prediction

*Data Source* - https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records#

*Features* - The dataset has thirteen (13) columns: twelve clinical features and the target label.

- age: age of the patient (years)
- anaemia: decrease of red blood cells or hemoglobin (boolean)
- high blood pressure: if the patient has hypertension (boolean)
- creatinine phosphokinase (CPK): level of the CPK enzyme in the blood (mcg/L)
- diabetes: if the patient has diabetes (boolean)
- ejection fraction: percentage of blood leaving the heart at each contraction (percentage)
- platelets: platelets in the blood (kiloplatelets/mL)
- sex: woman or man (binary)
- serum creatinine: level of serum creatinine in the blood (mg/dL)
- serum sodium: level of serum sodium in the blood (mEq/L)
- smoking: if the patient smokes or not (boolean)
- time: follow-up period (days)
- death event (target/label): if the patient died during the follow-up period (boolean)

*Method* - We use the PyCaret classification module to predict DEATH_EVENT for heart failure patients.

*Reference* - https://www.pycaret.org/tutorials/html/CLF101.html


In [1]:
import pandas as pd
from pycaret.classification import *

import warnings
warnings.filterwarnings('ignore')

## Step 1 - Data Preparation

Check for null values and set aside a small portion of the data for the final prediction step.

In [2]:
df = pd.read_csv("heart_failure_clinical_records_dataset.csv")

print(df.shape)
df.head()

(299, 13)


Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


In [3]:
df.isna().sum()

age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64
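
Beyond nulls, it can help to confirm that the flag columns hold only 0 or 1. The following is a minimal sketch using only the standard library and the five rows shown in the `df.head()` output above; the helper name `check_rows` is hypothetical and not part of the notebook.

```python
import csv
import io

# The five rows printed by df.head() above (index column omitted).
HEAD_CSV = """age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1
"""

BINARY_COLUMNS = ["anaemia", "diabetes", "high_blood_pressure", "sex", "smoking", "DEATH_EVENT"]


def check_rows(csv_text):
    """Return a list of (row_number, column, problem) tuples."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader):
        for column, value in row.items():
            if value is None or value.strip() == "":
                problems.append((row_number, column, "missing"))
            elif column in BINARY_COLUMNS and value not in ("0", "1"):
                problems.append((row_number, column, "not 0/1"))
    return problems


problems = check_rows(HEAD_CSV)
print(problems)  # an empty list means no problems were found
```

The same function could be pointed at the full CSV file contents to cover all 299 rows.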

In [4]:
# Convert the ejection_fraction to decimal.

df["ejection_fraction"] = df["ejection_fraction"] / 100

df.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,0.2,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,0.38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,0.2,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,0.2,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,0.2,0,327000.0,2.7,116,0,0,8,1


In [5]:
# Randomly select 10 observations to predict with the trained model.
# The remaining 289 observations are used for machine learning (training and testing).

df_unseen = df.sample(10, random_state=123)
df_learn = df.drop(df_unseen.index)

print(df_unseen.shape)
print(df_learn.shape)

(10, 13)
(289, 13)
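
The split above relies on `sample` drawing rows without replacement and `drop` removing exactly those rows, so the two sets are disjoint and together cover the data. The same idea can be sketched with the standard library over row positions. The positions chosen here will not match those picked by pandas, and `split_positions` is a hypothetical helper.

```python
import random


def split_positions(n_rows, n_unseen, seed):
    """Split row positions into (unseen, learn) without overlap."""
    rng = random.Random(seed)
    unseen = sorted(rng.sample(range(n_rows), n_unseen))  # without replacement
    unseen_set = set(unseen)
    learn = [i for i in range(n_rows) if i not in unseen_set]
    return unseen, learn


unseen, learn = split_positions(n_rows=299, n_unseen=10, seed=123)
print(len(unseen), len(learn))  # 10 289
```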


## Step 2 - Model Training and Selection

This step also involves model evaluation and tuning.

In [6]:
s = setup(data=df_learn, silent=True, target="DEATH_EVENT",
    log_experiment=True, experiment_name="gold")

Unnamed: 0,Description,Value
0,session_id,328
1,Target,DEATH_EVENT
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(289, 13)"
5,Missing Values,False
6,Numeric Features,7
7,Categorical Features,5
8,Ordinal Features,False
9,High Cardinality Features,False


2022/05/20 16:01:09 INFO mlflow.tracking.fluent: Experiment with name 'gold' does not exist. Creating a new experiment.


In [7]:
best = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.8364,0.9288,0.7,0.8094,0.7391,0.6224,0.6355,0.124
gbc,Gradient Boosting Classifier,0.8117,0.9066,0.6524,0.7457,0.6828,0.5535,0.5628,0.041
lda,Linear Discriminant Analysis,0.811,0.879,0.6238,0.7595,0.6768,0.5481,0.5579,0.006
lightgbm,Light Gradient Boosting Machine,0.8064,0.9009,0.6333,0.6952,0.658,0.5316,0.5311,0.014
ridge,Ridge Classifier,0.7917,0.0,0.5643,0.7555,0.6353,0.496,0.513,0.006
lr,Logistic Regression,0.7869,0.8535,0.6095,0.7208,0.6432,0.4968,0.5125,0.515
ada,Ada Boost Classifier,0.7812,0.867,0.5714,0.7448,0.6202,0.475,0.498,0.041
nb,Naive Bayes,0.7676,0.8174,0.4738,0.745,0.5589,0.4167,0.4467,0.007
dt,Decision Tree Classifier,0.7667,0.7491,0.6905,0.6745,0.6585,0.4882,0.509,0.008
et,Extra Trees Classifier,0.7624,0.8811,0.4786,0.7583,0.5509,0.4109,0.4484,0.118
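
`compare_models` ranks by accuracy by default. For a death-prediction task, missed deaths may matter more than overall accuracy, so it can be worth reading the same leaderboard by recall. The sketch below re-sorts the values printed above; PyCaret's `compare_models` also accepts a `sort` argument for this purpose.

```python
# (model id, accuracy, recall, F1) copied from the compare_models() table above.
leaderboard = [
    ("rf", 0.8364, 0.7000, 0.7391),
    ("gbc", 0.8117, 0.6524, 0.6828),
    ("lda", 0.8110, 0.6238, 0.6768),
    ("lightgbm", 0.8064, 0.6333, 0.6580),
    ("ridge", 0.7917, 0.5643, 0.6353),
    ("lr", 0.7869, 0.6095, 0.6432),
    ("ada", 0.7812, 0.5714, 0.6202),
    ("nb", 0.7676, 0.4738, 0.5589),
    ("dt", 0.7667, 0.6905, 0.6585),
    ("et", 0.7624, 0.4786, 0.5509),
]

# Sort by the recall column, highest first.
by_recall = sorted(leaderboard, key=lambda row: row[2], reverse=True)
for model_id, accuracy, recall, f1 in by_recall[:3]:
    print(f"{model_id:<10} recall={recall:.4f} accuracy={accuracy:.4f}")
```

The random forest stays on top, while the decision tree moves from ninth place by accuracy to second place by recall. The 0.0 AUC reported for the Ridge Classifier most likely reflects that this estimator does not expose class probabilities, not a genuinely uninformative model.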


In [8]:
model_rf = create_model("rf")

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8095,0.8724,0.5714,0.8,0.6667,0.5385,0.5534
1,0.9048,0.9847,0.7143,1.0,0.8333,0.7692,0.7906
2,0.8,0.9048,0.5,0.75,0.6,0.4737,0.491
3,0.95,1.0,0.8333,1.0,0.9091,0.875,0.8819
4,0.75,0.8452,0.8333,0.5556,0.6667,0.4792,0.5044
5,0.9,0.9881,0.8333,0.8333,0.8333,0.7619,0.7619
6,0.7,0.8297,0.7143,0.5556,0.625,0.3814,0.3898
7,0.8,0.9121,0.5714,0.8,0.6667,0.5294,0.5447
8,0.95,1.0,0.8571,1.0,0.9231,0.8864,0.8921
9,0.8,0.9505,0.5714,0.8,0.6667,0.5294,0.5447


In [9]:
tuned_rf = tune_model(model_rf)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8571,0.8776,0.7143,0.8333,0.7692,0.6667,0.6708
1,0.9048,0.9694,0.8571,0.8571,0.8571,0.7857,0.7857
2,0.9,0.8929,0.8333,0.8333,0.8333,0.7619,0.7619
3,0.95,1.0,1.0,0.8571,0.9231,0.8864,0.8921
4,0.75,0.869,0.8333,0.5556,0.6667,0.4792,0.5044
5,0.95,0.9881,0.8333,1.0,0.9091,0.875,0.8819
6,0.75,0.8462,0.8571,0.6,0.7059,0.5,0.5241
7,0.75,0.9231,0.5714,0.6667,0.6154,0.4318,0.4346
8,0.95,1.0,1.0,0.875,0.9333,0.8936,0.8987
9,0.9,0.9341,0.8571,0.8571,0.8571,0.7802,0.7802
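
The per-fold tables are easier to compare through their means. A small sketch using the accuracy columns printed above for `model_rf` and `tuned_rf`:

```python
from statistics import mean, stdev

# Accuracy per fold, copied from the two cross-validation tables above.
base_accuracy = [0.8095, 0.9048, 0.8, 0.95, 0.75, 0.9, 0.7, 0.8, 0.95, 0.8]
tuned_accuracy = [0.8571, 0.9048, 0.9, 0.95, 0.75, 0.95, 0.75, 0.75, 0.95, 0.9]

for name, scores in [("model_rf", base_accuracy), ("tuned_rf", tuned_accuracy)]:
    # stdev() is the sample standard deviation across the ten folds.
    print(f"{name}: mean={mean(scores):.4f} sd={stdev(scores):.4f}")
```

Mean accuracy rises from 0.8364 to 0.8662 after tuning. The fold values (for example 0.95 and 0.8095) suggest each fold holds only about 20 rows, so a difference of this size should be treated as indicative, not conclusive.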


In [10]:
evaluate_model(tuned_rf)


In [11]:
# Evaluate the tuned model on the hold-out set that setup() reserved from df_learn

predict_model(tuned_rf)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.7586,0.8599,0.7407,0.5882,0.6557,0.4736,0.4811


Unnamed: 0,age,creatinine_phosphokinase,ejection_fraction,platelets,serum_creatinine,serum_sodium,time,anaemia_1,diabetes_1,high_blood_pressure_1,sex_1,smoking_1,DEATH_EVENT,Label,Score
0,68.0,577.0,0.25,166000.00000,1.00,138.0,43.0,1.0,0.0,1.0,1.0,0.0,1,1,0.7944
1,40.0,101.0,0.40,226000.00000,0.80,141.0,187.0,1.0,0.0,0.0,0.0,0.0,0,0,0.9574
2,60.0,2261.0,0.35,228000.00000,0.90,136.0,115.0,0.0,0.0,1.0,1.0,0.0,0,0,0.9432
3,50.0,54.0,0.40,279000.00000,0.80,141.0,250.0,1.0,0.0,0.0,1.0,0.0,0,0,0.9533
4,65.0,113.0,0.25,497000.00000,1.83,135.0,67.0,0.0,1.0,0.0,1.0,0.0,1,1,0.7589
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82,69.0,582.0,0.20,266000.00000,1.20,134.0,73.0,0.0,0.0,0.0,1.0,1.0,1,1,0.6407
83,75.0,582.0,0.45,263358.03125,1.18,137.0,87.0,0.0,0.0,1.0,1.0,0.0,0,0,0.7288
84,65.0,224.0,0.50,149000.00000,1.30,137.0,72.0,0.0,1.0,0.0,1.0,1.0,0,0,0.6431
85,50.0,159.0,0.30,302000.00000,1.20,138.0,29.0,1.0,1.0,0.0,0.0,0.0,0,1,0.8350
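
In this output every `Score` is above 0.5, including rows labelled 0, which suggests that `Score` is the probability of the predicted `Label` and not the probability of death. Under that assumption, a death probability can be recovered as follows (`death_probability` is a hypothetical helper):

```python
def death_probability(label, score):
    """Convert (Label, Score) into P(DEATH_EVENT = 1).

    Assumes Score is the probability of the predicted label.
    """
    return score if label == 1 else 1.0 - score


# (Label, Score) pairs from the first five hold-out rows above.
rows = [(1, 0.7944), (0, 0.9574), (0, 0.9432), (0, 0.9533), (1, 0.7589)]
for label, score in rows:
    print(label, round(death_probability(label, score), 4))
```

Working with a single death probability makes it easier to rank patients by risk or to try a decision threshold other than 0.5.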


In [12]:
# Finalize the model

final_rf = finalize_model(tuned_rf)
print(final_rf)



RandomForestClassifier(bootstrap=False, ccp_alpha=0.0,
                       class_weight='balanced_subsample', criterion='entropy',
                       max_depth=5, max_features='log2', max_leaf_nodes=None,
                       max_samples=None, min_impurity_decrease=0.0005,
                       min_impurity_split=None, min_samples_leaf=2,
                       min_samples_split=10, min_weight_fraction_leaf=0.0,
                       n_estimators=100, n_jobs=-1, oob_score=False,
                       random_state=328, verbose=0, warm_start=False)


In [13]:
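# Re-score the finalized model on the hold-out data.
# Note: finalize_model refits on all of df_learn, hold-out rows included,
# so the metrics below are optimistic and do not amount to an independent test.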
predict_model(final_rf);

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.9195,0.9765,0.8519,0.8846,0.8679,0.8101,0.8104
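
Scoring a model on rows it was trained on inflates the result. That is the likely reason hold-out accuracy moves from 0.7586 to 0.9195 after `finalize_model`, since that call refits the estimator on the hold-out rows as well. A deliberately extreme toy illustration follows, using a lookup-table "model" on made-up data, not the notebook's random forest:

```python
from collections import Counter


def fit_lookup(rows):
    """Memorize feature -> label; fall back to the majority label."""
    table = {features: label for features, label in rows}
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    return table, majority


def accuracy(model, rows):
    """Share of rows whose looked-up (or fallback) label is correct."""
    table, majority = model
    hits = sum(table.get(features, majority) == label for features, label in rows)
    return hits / len(rows)


# Made-up data: each row is (feature tuple, label).
train = [((1,), 0), ((2,), 0), ((3,), 1), ((4,), 0)]
holdout = [((5,), 1), ((6,), 0), ((7,), 1), ((8,), 0)]

honest = fit_lookup(train)            # never saw the hold-out rows
refit = fit_lookup(train + holdout)   # refit on everything

print(accuracy(honest, holdout))  # 0.5
print(accuracy(refit, holdout))   # 1.0
```

The 10 rows in `df_unseen` were never passed to `setup`, so they remain the cleaner check of the finalized model.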


## Step 3 - Perform Prediction

Use the finalized model to predict on unseen data.

In [14]:
unseen_predictions = predict_model(final_rf, data=df_unseen)
unseen_predictions

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.9,0.9524,0.6667,1.0,0.8,0.7368,0.7638


Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT,Label,Score
202,70.0,0,97,0,0.6,1,220000.0,0.9,138,1,0,186,0,0,0.8913
114,60.0,1,754,1,0.4,1,328000.0,1.2,126,1,0,91,0,0,0.6879
164,45.0,0,2442,1,0.3,0,334000.0,1.1,139,1,0,129,1,0,0.7671
277,70.0,0,582,1,0.38,0,25100.0,1.1,140,1,0,246,0,0,0.7983
292,52.0,0,190,1,0.38,0,382000.0,1.0,140,1,1,258,0,0,0.9075
266,55.0,0,1199,0,0.2,0,263358.03,1.83,134,1,1,241,1,1,0.5795
11,62.0,0,231,0,0.25,1,253000.0,0.9,140,1,1,10,1,1,0.8287
171,52.0,0,3966,0,0.4,0,325000.0,0.9,140,1,1,146,0,0,0.8681
238,65.0,1,720,1,0.4,0,257000.0,1.0,136,0,0,210,0,0,0.877
263,68.0,1,157,1,0.6,0,208000.0,1.0,140,0,0,237,0,0,0.8468
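
The summary metrics can be reproduced by hand from the `DEATH_EVENT` and `Label` columns of the ten rows above. Doing so also shows how little data they rest on: a single missed death (index 164) accounts for the entire gap from perfect recall.

```python
from math import sqrt

# DEATH_EVENT and Label columns from the table above, in row order.
actual = [0, 0, 1, 0, 0, 1, 1, 0, 0, 0]
predicted = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

# Confusion-matrix counts.
pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)
tn = sum(a == 0 and p == 0 for a, p in pairs)
fp = sum(a == 0 and p == 1 for a, p in pairs)
fn = sum(a == 1 and p == 0 for a, p in pairs)
n = len(pairs)

accuracy = (tp + tn) / n
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

# Cohen's kappa: observed agreement corrected for chance agreement.
expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
kappa = (accuracy - expected) / (1 - expected)

# Matthews correlation coefficient.
mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(round(accuracy, 4), round(recall, 4), round(precision, 4))  # 0.9 0.6667 1.0
print(round(f1, 4), round(kappa, 4), round(mcc, 4))  # 0.8 0.7368 0.7638
```

These values match the Accuracy, Recall, Prec., F1, Kappa and MCC columns reported by `predict_model` above.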


## Step 4 - Saving the Model

Save the model for future deployment and operation.

In [15]:
save_model(final_rf,'gold-pipeline')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=False, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[],
                                       target='DEATH_EVENT', time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric...
                  RandomForestClassifier(bootstrap=False, ccp_alpha=0.0,
                                         class_weight='balanced_subsample',
                                         criterion='entropy', max_depth=5,
                                         max_featu

## Step 5 - Load and Predict

Load the saved model and perform prediction.

In [16]:
saved_rf = load_model("gold-pipeline")

In [17]:
predict_model(saved_rf, data=df_unseen)
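
Before loading, it can help to confirm that the model file is present. The sketch below assumes that `save_model(final_rf,'gold-pipeline')` wrote `gold-pipeline.pkl` to the working directory (PyCaret appends the `.pkl` extension); `model_file_exists` is a hypothetical helper, not a PyCaret function.

```python
from pathlib import Path


def model_file_exists(model_name, directory="."):
    """Return True if '<model_name>.pkl' exists in the given directory."""
    return (Path(directory) / f"{model_name}.pkl").is_file()


if model_file_exists("gold-pipeline"):
    print("Found gold-pipeline.pkl; load_model('gold-pipeline') should work.")
else:
    print("gold-pipeline.pkl not found; run the save step first.")
```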