<a href="https://colab.research.google.com/github/rhiats/Heart-Failure-Clinical-Records/blob/main/Heart_Failure_Clinical_Records.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Heart Failure Clinical Records**

This notebook replicates the analysis in the following paper:

Chicco, D., & Jurman, G. (2020). Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making, 20, 16. https://doi.org/10.1186/s12911-020-1023-5

Dataset:
Heart Failure Clinical Records [Dataset]. (2020). UCI Machine Learning Repository. https://doi.org/10.24432/C5Z89R.

In [1]:
!pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [3]:
import pandas as pd

**Load Data from UCI ML Repo**

In [2]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
heart_failure_clinical_records = fetch_ucirepo(id=519)

# data (as pandas dataframes)
X = heart_failure_clinical_records.data.features
y = heart_failure_clinical_records.data.targets

# metadata
print(heart_failure_clinical_records.metadata)

# variable information
print(heart_failure_clinical_records.variables)

{'uci_id': 519, 'name': 'Heart Failure Clinical Records', 'repository_url': 'https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records', 'data_url': 'https://archive.ics.uci.edu/static/public/519/data.csv', 'abstract': 'This dataset contains the medical records of 299 patients who had heart failure, collected during their follow-up period, where each patient profile has 13 clinical features.', 'area': 'Health and Medicine', 'tasks': ['Classification', 'Regression', 'Clustering'], 'characteristics': ['Multivariate'], 'num_instances': 299, 'num_features': 12, 'feature_types': ['Integer', 'Real'], 'demographics': ['Age', 'Sex'], 'target_col': ['death_event'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2020, 'last_updated': 'Mon Feb 26 2024', 'dataset_doi': '10.24432/C5Z89R', 'creators': [], 'intro_paper': {'ID': 286, 'type': 'NATIVE', 'title': 'Machine learning can predict survival of patients with heart failure f

In [25]:
df = pd.concat([X, y], axis=1)
df

,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,death_event
0,75.0,0,582,0,20,1,265000.00,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.00,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.00,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.00,2.7,116,0,0,8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
294,62.0,0,61,1,38,1,155000.00,1.1,143,1,1,270,0
295,55.0,0,1820,0,38,0,270000.00,1.2,139,0,0,271,0
296,45.0,0,2060,1,60,0,742000.00,0.8,138,0,0,278,0
297,45.0,0,2413,0,38,0,140000.00,1.4,140,1,1,280,0


**Continuous Features EDA**

In [11]:
# Continuous (non-binary) clinical features for the full sample
continuous_features_df = X[['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time']]

# Median, mean and standard deviation of each continuous feature
summary_continuous_full_sample_df = pd.DataFrame({
    'Median': continuous_features_df.median(),
    'Mean': continuous_features_df.mean().round(2),
    'Standard Deviation': continuous_features_df.std().round(2)
})

In [12]:
summary_continuous_full_sample_df

Feature,Median,Mean,Standard Deviation
age,60.0,60.83,11.89
creatinine_phosphokinase,250.0,581.84,970.29
ejection_fraction,38.0,38.08,11.83
platelets,262000.0,263358.03,97804.24
serum_creatinine,1.1,1.39,1.03
serum_sodium,137.0,136.63,4.41
time,115.0,130.26,77.61


In [31]:
# Continuous features for patients who died during follow-up (death_event == 1)
continuous_features_dead_df = df.loc[df['death_event'] == 1, ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time']]

In [32]:
summary_continuous_dead_df = pd.DataFrame({
    'Median': continuous_features_dead_df.median(),
    'Mean': continuous_features_dead_df.mean().round(2),
    'Standard Deviation': continuous_features_dead_df.std().round(2)
})

In [33]:
summary_continuous_dead_df

Feature,Median,Mean,Standard Deviation
age,65.0,65.22,13.21
creatinine_phosphokinase,259.0,670.2,1316.58
ejection_fraction,30.0,33.47,12.53
platelets,258500.0,256381.04,98525.68
serum_creatinine,1.3,1.84,1.47
serum_sodium,135.5,135.38,5.0
time,44.5,70.89,62.38
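
The per-group summary above could also be produced for both outcome groups in one step. The following is a minimal sketch, assuming a DataFrame shaped like `df`; the helper name `summarize_by_outcome` and the toy rows are hypothetical and are not values from the dataset.

```python
import pandas as pd


def summarize_by_outcome(frame, columns, target="death_event"):
    """Median, mean and standard deviation of `columns` for each value of `target`."""
    summary = frame.groupby(target)[columns].agg(["median", "mean", "std"])
    return summary.round(2)


# Toy rows for illustration only; these are NOT values from the dataset.
toy_continuous_df = pd.DataFrame({
    "age": [50.0, 60.0, 70.0, 80.0],
    "serum_creatinine": [1.0, 1.2, 1.8, 2.2],
    "death_event": [0, 0, 1, 1],
})

toy_summary = summarize_by_outcome(toy_continuous_df, ["age", "serum_creatinine"])
print(toy_summary)
```

In the notebook, the same helper would be called on `df` with the list of continuous column names.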


**Categorical Features EDA**
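
One possible way to summarize the binary features is to count, for each outcome group, how many patients have the feature set to 1 and what percentage of the group that is. This is a sketch under that assumption; the helper name `binary_feature_summary` and the toy rows are hypothetical and are not taken from the dataset.

```python
import pandas as pd


def binary_feature_summary(frame, columns, target="death_event"):
    """Count and percentage of 1-values in each binary column, per outcome group."""
    grouped = frame.groupby(target)[columns]
    counts = grouped.sum()                       # number of patients with value 1
    percents = (grouped.mean() * 100).round(2)   # share of the group with value 1
    return counts, percents


# Toy rows for illustration only; these are NOT values from the dataset.
toy_binary_df = pd.DataFrame({
    "anaemia": [0, 1, 1, 1],
    "smoking": [0, 0, 1, 0],
    "death_event": [0, 0, 1, 1],
})

toy_counts, toy_percents = binary_feature_summary(toy_binary_df, ["anaemia", "smoking"])
print(toy_counts)
print(toy_percents)
```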

**Mann–Whitney U test**
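
A minimal sketch of how this test might be run per feature, comparing patients who died with patients who survived. It uses `scipy.stats.mannwhitneyu`; the helper name `mann_whitney_by_outcome` and the toy rows are hypothetical, and the two-sided alternative is an assumption of this sketch.

```python
import pandas as pd
from scipy.stats import mannwhitneyu


def mann_whitney_by_outcome(frame, columns, target="death_event"):
    """Two-sided Mann-Whitney U test of each column between the two outcome groups."""
    dead = frame[frame[target] == 1]
    survived = frame[frame[target] == 0]
    rows = []
    for col in columns:
        result = mannwhitneyu(dead[col], survived[col], alternative="two-sided")
        rows.append({"feature": col, "U": result.statistic, "p_value": result.pvalue})
    # Smallest p-value first: features whose distributions differ most between groups
    return pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)


# Toy rows for illustration only; these are NOT values from the dataset.
toy_test_df = pd.DataFrame({
    "age": [60, 70, 55, 65, 58, 72, 50, 66],
    "serum_creatinine": [1.8, 2.2, 2.5, 3.0, 0.9, 1.0, 1.1, 1.2],
    "death_event": [1, 1, 1, 1, 0, 0, 0, 0],
})

toy_mwu = mann_whitney_by_outcome(toy_test_df, ["age", "serum_creatinine"])
print(toy_mwu)
```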

**Pearson correlation coefficient**
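
As a sketch, the correlation between each feature and the target could be computed with `scipy.stats.pearsonr` and sorted by absolute value. The helper name `pearson_with_target` and the toy rows are hypothetical; sorting by absolute correlation is a choice made for this illustration.

```python
import pandas as pd
from scipy.stats import pearsonr


def pearson_with_target(frame, columns, target="death_event"):
    """Pearson r (and its p-value) between each column and the target."""
    rows = []
    for col in columns:
        r, p_value = pearsonr(frame[col], frame[target])
        rows.append({"feature": col, "r": r, "abs_r": abs(r), "p_value": p_value})
    # Strongest absolute correlation first
    return pd.DataFrame(rows).sort_values("abs_r", ascending=False).reset_index(drop=True)


# Toy rows for illustration only; these are NOT values from the dataset.
toy_corr_df = pd.DataFrame({
    "serum_creatinine": [1.0, 2.0, 3.0, 4.0],
    "ejection_fraction": [60, 50, 30, 20],
    "death_event": [0, 0, 1, 1],
})

toy_pearson = pearson_with_target(toy_corr_df, ["serum_creatinine", "ejection_fraction"])
print(toy_pearson)
```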

**Scatterplot comparing Serum Creatinine v. Ejection Fraction**

**Survival Month**

**Survival Month v. Survival**

**Survival Prediction on all Clinical Features**



1. Random Forests
2. Decision Tree
3. Gradient Boosting
4. One Rule
5. Artificial Neural Network
6. Naive Bayes
7. SVM Radial
8. SVM Linear
9. K-nearest Neighbor

Of these, Random Forests performed best.
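
A comparison like the one listed above could be sketched with scikit-learn as follows. Several things here are assumptions: the data is a synthetic stand-in generated with `make_classification` (not the clinical records), the Matthews correlation coefficient under 5-fold stratified cross-validation is chosen as the score for this sketch, all hyperparameters are library defaults, and One Rule and the neural network are left out because One Rule has no scikit-learn implementation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same number of rows as the dataset; NOT the real data.
X_demo, y_demo = make_classification(
    n_samples=299, n_features=11, n_informative=4,
    weights=[0.68, 0.32], random_state=0,
)

models = {
    "Random Forests": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    # Distance- and margin-based models get standardized inputs
    "SVM Radial": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "SVM Linear": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "K-nearest Neighbor": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
mcc_scorer = make_scorer(matthews_corrcoef)

# Mean cross-validated MCC per model
mcc_by_model = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring=mcc_scorer).mean()
    for name, model in models.items()
}

for name, score in sorted(mcc_by_model.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

To apply this to the notebook's data, `X_demo` and `y_demo` would be replaced with the feature table and the `death_event` column.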







**Feature Ranking**


1. RReliefF
2. Max-Min Parents and Children
3. Random Forest
4. One Rule
5. Recursive Partitioning and Regression Trees
6. Support Vector Machines with linear kernel
7. eXtreme Gradient Boosting

The ranking uses 11 features (all features except time).

The individual rankings are combined into a Borda list: the lower the score, the more important the feature.
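
One way to read the Borda scoring described above is as the mean position of each feature across the individual rankings. The sketch below assumes that reading; the function name `borda_aggregate`, the method labels and the toy rankings are hypothetical and are not the paper's results.

```python
def borda_aggregate(rankings):
    """Combine several rankings into one list, ordered by mean position.

    `rankings` maps a method name to a list of features ordered from most
    to least important. A lower score means a more important feature.
    """
    positions = {}
    for ranking in rankings.values():
        for position, feature in enumerate(ranking, start=1):
            positions.setdefault(feature, []).append(position)
    scores = {feature: sum(p) / len(p) for feature, p in positions.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Toy rankings for illustration only; NOT the rankings produced by the methods above.
toy_rankings = {
    "method_a": ["serum_creatinine", "ejection_fraction", "age"],
    "method_b": ["ejection_fraction", "serum_creatinine", "age"],
    "method_c": ["serum_creatinine", "age", "ejection_fraction"],
}

borda_list = borda_aggregate(toy_rankings)
for feature, score in borda_list:
    print(f"{feature}: {score:.2f}")
```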

The authors found the top two features to be serum creatinine and ejection fraction. They then trained three classifiers on these two features alone: Random Forests (RF), Support Vector Machine with Gaussian Kernel (GSVM), and eXtreme Gradient Boosting (XGB).
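
A two-feature experiment of this kind might look like the sketch below. It is not the authors' pipeline: the data is synthetic, the split is a single stratified 80/20 hold-out, hyperparameters are defaults, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so that the example needs no extra package. The helper name `evaluate_top_two` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

TOP_TWO = ["serum_creatinine", "ejection_fraction"]


def evaluate_top_two(frame, target="death_event", seed=0):
    """Fit three classifiers on the two features and return test-set MCC per model."""
    X_train, X_test, y_train, y_test = train_test_split(
        frame[TOP_TWO], frame[target],
        test_size=0.2, stratify=frame[target], random_state=seed,
    )
    models = {
        "RF": RandomForestClassifier(random_state=seed),
        "GSVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        # Stand-in for XGB (assumption): scikit-learn gradient boosting
        "GB": GradientBoostingClassifier(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = matthews_corrcoef(y_test, model.predict(X_test))
    return scores


# Synthetic rows for illustration only; NOT the clinical records.
rng = np.random.default_rng(0)
outcome = rng.integers(0, 2, size=200)
synthetic_df = pd.DataFrame({
    "serum_creatinine": 1.0 + 0.8 * outcome + rng.normal(0, 0.3, size=200),
    "ejection_fraction": 40 - 10 * outcome + rng.normal(0, 5, size=200),
    "death_event": outcome,
})

two_feature_scores = evaluate_top_two(synthetic_df)
print(two_feature_scores)
```

In the notebook, `evaluate_top_two(df)` would run the same comparison on the real records.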


**Random Forests (RF)**

**Support Vector Machine with Gaussian Kernel (GSVM)**

**eXtreme Gradient Boosting (XGB)**

**Stratified Logistic Regression All Features**

**Stratified Logistic Regression Top 3 Features**