Data taken from Kaggle https://www.kaggle.com/andrewmvd/heart-failure-clinical-data
Originally from https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records

Extra: https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5/tables/1

Thirteen (13) clinical features:

- age: age of the patient (years)
- anaemia: decrease of red blood cells or hemoglobin (boolean)
- high blood pressure: if the patient has hypertension (boolean)
- creatinine phosphokinase (CPK): level of the CPK enzyme in the blood (mcg/L)
- diabetes: if the patient has diabetes (boolean)
- ejection fraction: percentage of blood leaving the heart at each contraction (percentage)
- platelets: platelets in the blood (kiloplatelets/mL)
- sex: woman or man (binary)
- serum creatinine: level of serum creatinine in the blood (mg/dL)
- serum sodium: level of serum sodium in the blood (mEq/L)
- smoking: if the patient smokes or not (boolean)
- time: follow-up period (days)
- [target] death event: whether the patient died during the follow-up period (boolean)

In [1]:
%pip install imbalanced-learn

Note: you may need to restart the kernel to use updated packages.


In [2]:
#import necessary libraries 
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import itertools
from imblearn.under_sampling import RandomUnderSampler
import numpy as np
import pandas as pd

In [3]:
#load the dataset and preview the first rows
csv_path = 'C:/Users/Sydney/Desktop/heart_failure_clinical_records_dataset.csv'
dataset = pd.read_csv(csv_path)

dataset.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


### Data preprocessing and data cleaning

In [4]:
#get idea of the mean and stds within the dataset
dataset.describe()


Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
count,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0
mean,60.833893,0.431438,581.839465,0.41806,38.083612,0.351171,263358.029264,1.39388,136.625418,0.648829,0.32107,130.26087,0.32107
std,11.894809,0.496107,970.287881,0.494067,11.834841,0.478136,97804.236869,1.03451,4.412477,0.478136,0.46767,77.614208,0.46767
min,40.0,0.0,23.0,0.0,14.0,0.0,25100.0,0.5,113.0,0.0,0.0,4.0,0.0
25%,51.0,0.0,116.5,0.0,30.0,0.0,212500.0,0.9,134.0,0.0,0.0,73.0,0.0
50%,60.0,0.0,250.0,0.0,38.0,0.0,262000.0,1.1,137.0,1.0,0.0,115.0,0.0
75%,70.0,1.0,582.0,1.0,45.0,1.0,303500.0,1.4,140.0,1.0,1.0,203.0,1.0
max,95.0,1.0,7861.0,1.0,80.0,1.0,850000.0,9.4,148.0,1.0,1.0,285.0,1.0


In [5]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.4 KB


In [6]:
#check for N/A values within the dataset
dataset.isnull().sum()

age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64

### Make sample + Train/Test Sets

In [7]:
#grab random sample from the dataset that can then be used for training and testing sets
X = dataset.drop('DEATH_EVENT', axis=1)
y = dataset.DEATH_EVENT

#balance the two classes by randomly undersampling the majority class
model = RandomUnderSampler(random_state=42)
sampleX, sampley = model.fit_resample(X,y)
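As a rough illustration of what random undersampling does, the sketch below keeps every minority-class row and draws an equally sized random subset of the majority class. It uses only the standard library; `random_undersample` is a hypothetical helper written for this note, not imblearn's implementation, and the toy labels assume 203 survivors and 96 deaths, which is what a `DEATH_EVENT` mean of 0.32107 over 299 rows implies.

```python
import random
from collections import Counter

def random_undersample(labels, seed=42):
    """Return sorted row indices: all rows of the smallest class plus an
    equally sized random subset of every larger class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = min(len(rows) for rows in by_class.values())
    kept = []
    for rows in by_class.values():
        # sampling `target` rows from the smallest class keeps all of them
        kept.extend(rng.sample(rows, target))
    return sorted(kept)

# Toy labels with the class counts implied by the describe() output (assumption)
labels = [0] * 203 + [1] * 96
kept = random_undersample(labels)
balanced = Counter(labels[i] for i in kept)
print(balanced)
```

Under these assumptions each class ends up with 96 rows (192 in total), which is consistent with the 48-row test sets reported later for a 25% split.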

In [8]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [9]:
#normalize the data 
processor = MinMaxScaler()
scaled_X = processor.fit_transform(sampleX)

In [10]:
#create training and testing sets 
X_train, X_test, y_train, y_test = train_test_split(scaled_X, sampley, test_size = 0.25)
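The cells above scale the whole sample before splitting it, so the scaler sees the test rows. One common alternative, sketched here on synthetic data, is to split first and fit `MinMaxScaler` on the training rows only. The array shapes, the `stratify` argument, and `random_state=42` are assumptions chosen for illustration; they are not taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for sampleX / sampley (192 balanced rows, 12 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(192, 12))
y_demo = np.array([0, 1] * 96)

# Split first; stratify keeps the class ratio equal in both parts,
# and random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# Fit the scaler on the training rows only, then reuse it for the test rows
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Training columns span roughly [0, 1]; test values may fall slightly outside
print(X_tr_scaled.min(), X_tr_scaled.max())
print(X_te_scaled.shape, y_te.mean())
```

Test values outside [0, 1] are expected with this approach, because the test rows played no part in choosing the minimum and maximum.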

# Develop Pipelines

In [11]:
#import all necessary libraries for pipeline development
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

from sklearn.svm import OneClassSVM, SVC, LinearSVC, SVR
from sklearn.decomposition import PCA
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif, RFE, VarianceThreshold
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import classification_report, confusion_matrix

Each pipeline below applies anomaly detection to remove outliers from the training set, then feature selection or dimensionality reduction, then a classifier that predicts whether the patient died during the follow-up period (DEATH_EVENT).
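The outlier-removal step is repeated for every pipeline, so it could be factored into one helper. The sketch below is a minimal version on synthetic data; `drop_outliers` is a hypothetical name, and the fixed `random_state` values are assumptions added for repeatability. It relies on `fit_predict`, which all three detectors used in this notebook provide, returning -1 for outliers and 1 for inliers.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

def drop_outliers(detector, X, y):
    """Fit `detector` on X and return only the rows it labels as inliers (+1)."""
    inlier_mask = detector.fit_predict(X) == 1
    return X[inlier_mask], y[inlier_mask]

# Synthetic stand-in for X_train / y_train
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(144, 12))
y_demo = rng.integers(0, 2, size=144)

for detector in (EllipticEnvelope(random_state=0),
                 IsolationForest(random_state=0),
                 LocalOutlierFactor()):
    X_in, y_in = drop_outliers(detector, X_demo, y_demo)
    print(type(detector).__name__, "kept", len(X_in), "of", len(X_demo))
```

The same boolean mask also works when `y` is a pandas Series, as `y_train` is in this notebook.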

### Pipeline 1

In [12]:
#anomaly detection with elliptic envelope
#fit the model, then flag outliers where prediction == -1
#the filtered X and y values are used to fit the pipeline
env = EllipticEnvelope().fit(X_train, y_train)
env_outliers = env.predict(X_train)==-1
X_new = X_train[~env_outliers]
y_new = y_train[~env_outliers]

In [13]:
#build pipeline 1

#PCA reduces the dimensionality of the data while preserving most of its variance
#LinearSVC then predicts the outcome
pipe1 = Pipeline([
    ('PCA', PCA()),
    ('LSVC', LinearSVC(max_iter=5000))
])

param_grid = {
    'PCA__n_components':[0.25,0.5,0.75,1,3,5],
    'PCA__tol':[0.0,0.1,1,1.5],
    'LSVC__C':[0.25,1,1.5,5]
}

#model
model_grid1 = GridSearchCV(pipe1, param_grid, cv=5)
model_grid1.fit(X_new, y_new)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('PCA', PCA()),
                                       ('LSVC', LinearSVC(max_iter=5000))]),
             param_grid={'LSVC__C': [0.25, 1, 1.5, 5],
                         'PCA__n_components': [0.25, 0.5, 0.75, 1, 3, 5],
                         'PCA__tol': [0.0, 0.1, 1, 1.5]})

In [14]:
#model_grid1.get_params().keys()

In [15]:
#look at the best estimator for pipeline 1
model_grid1.best_estimator_

Pipeline(steps=[('PCA', PCA(n_components=5)),
                ('LSVC', LinearSVC(C=0.25, max_iter=5000))])

In [16]:
#see best parameters individually 
model_grid1.best_params_

{'LSVC__C': 0.25, 'PCA__n_components': 5, 'PCA__tol': 0.0}

In [17]:
#best score
model_grid1.best_score_

0.5895384615384616

In [18]:
#inspect the cross-validation results (the grid is already fitted, so no refit is needed)
print(model_grid1.cv_results_)

{'mean_fit_time': array([0.00470748, 0.00343523, 0.00821829, 0.00316052, 0.00456223,
       0.00667229, 0.00041738, 0.00313268, 0.00579495, 0.00651703,
       0.00039897, 0.00457902, 0.00429134, 0.0028491 , 0.00504198,
       0.00565639, 0.0037838 , 0.00354352, 0.00405798, 0.004352  ,
       0.00224571, 0.00426836, 0.00383039, 0.0010303 , 0.00571766,
       0.00282102, 0.00369563, 0.00367064, 0.0033051 , 0.0039844 ,
       0.00401478, 0.00612741, 0.00293264, 0.00252843, 0.00019937,
       0.00727391, 0.00019937, 0.00810442, 0.00838099, 0.        ,
       0.00647082, 0.00617418, 0.00632706, 0.00314851, 0.00670619,
       0.00669656, 0.00313101, 0.        , 0.00236616, 0.        ,
       0.00311847, 0.00019946, 0.        , 0.00973473, 0.00312619,
       0.0002049 , 0.00363145, 0.00020556, 0.00314493, 0.0061172 ,
       0.00671439, 0.00670357, 0.00316114, 0.00060349, 0.00079618,
       0.00419416, 0.00625758, 0.00354962, 0.        , 0.00316291,
       0.003127  , 0.00019937, 0.01017923, 0
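
The printed `cv_results_` dictionary is hard to read in raw form. A possible alternative is to load it into a DataFrame and keep only the score columns, sorted by rank. The sketch below runs a small grid search on synthetic data purely to produce a `cv_results_` dictionary; the data, the estimator, and the grid are assumptions for illustration, not the notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only to obtain a fitted grid search
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

grid = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
summary = (
    pd.DataFrame(grid.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(summary)
```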

In [19]:
#find the predicted y based on the pipeline
pred_y = model_grid1.predict(X_test)

#find the precision, recall, f1 score, and accuracy over the model
model_grid1.score(X_test, y_test)
print(classification_report(y_test, pred_y))
pd.DataFrame(confusion_matrix(y_test, pred_y))

              precision    recall  f1-score   support

           0       0.45      0.59      0.51        22
           1       0.53      0.38      0.44        26

    accuracy                           0.48        48
   macro avg       0.49      0.49      0.48        48
weighted avg       0.49      0.48      0.47        48



Unnamed: 0,0,1
0,13,9
1,16,10


### Pipeline 2

In [20]:
#anomaly detection via isolation forest
#find outliers via predict = -1
#set new x and y 
iso = IsolationForest().fit(X_train, y_train)
iso_outliers = iso.predict(X_train)==-1
X_iso = X_train[~iso_outliers]
y_iso = y_train[~iso_outliers]

In [21]:
#build pipeline 2

#RFE used for feature selection to find ideal ones
#perform logistic regression for classification
pipe2 = Pipeline([
    ('RFE', RFE(estimator=SVR(kernel='linear'))),
    ('LR', LogisticRegression())
])

param_grid = {
    'RFE__n_features_to_select':[2,4,6,8,10,11],
    'LR__tol':[1e-4, 1e-5, 1e-3, 2e-4, 3e-4],
    'LR__C':[1.0,1.1,1.2,1.5,1.8,2.0]
}

#model
model_grid2 = GridSearchCV(pipe2, param_grid, cv=5)
model_grid2.fit(X_iso, y_iso)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('RFE',
                                        RFE(estimator=SVR(kernel='linear'))),
                                       ('LR', LogisticRegression())]),
             param_grid={'LR__C': [1.0, 1.1, 1.2, 1.5, 1.8, 2.0],
                         'LR__tol': [0.0001, 1e-05, 0.001, 0.0002, 0.0003],
                         'RFE__n_features_to_select': [2, 4, 6, 8, 10, 11]})

In [22]:
#look at the best estimator for pipeline 2
model_grid2.best_estimator_

Pipeline(steps=[('RFE',
                 RFE(estimator=SVR(kernel='linear'), n_features_to_select=2)),
                ('LR', LogisticRegression())])

In [23]:
#best score
model_grid2.best_score_

0.844155844155844

In [24]:
#find the predicted y based on the pipeline
pred_y = model_grid2.predict(X_test)

#find the precision, recall, f1 score, and accuracy over the model
model_grid2.score(X_test, y_test)
print(classification_report(y_test, pred_y))
pd.DataFrame(confusion_matrix(y_test, pred_y))

              precision    recall  f1-score   support

           0       0.65      0.91      0.75        22
           1       0.88      0.58      0.70        26

    accuracy                           0.73        48
   macro avg       0.76      0.74      0.73        48
weighted avg       0.77      0.73      0.72        48



Unnamed: 0,0,1
0,20,2
1,11,15


Pipeline 2 outperformed pipeline 1 on precision, recall, f1-score, and overall accuracy (0.73 vs. 0.48). The confusion matrix shows 13 incorrect predictions out of 48. The best cross-validation score was 0.84.
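
The figures in the classification report can be re-derived from the confusion matrix alone. The short check below uses the pipeline 2 matrix printed above and plain Python; `precision_recall` is a helper name introduced only for this note.

```python
# Pipeline 2 confusion matrix: rows = true class, columns = predicted class
cm = [[20, 2],
      [11, 15]]

total = sum(sum(row) for row in cm)
correct = cm[0][0] + cm[1][1]
errors = total - correct
accuracy = correct / total

def precision_recall(cm, cls):
    """Precision and recall for one class of a 2x2 confusion matrix."""
    tp = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)  # column sum: predicted as cls
    actual = sum(cm[cls])                    # row sum: truly cls
    return tp / predicted, tp / actual

p0, r0 = precision_recall(cm, 0)
p1, r1 = precision_recall(cm, 1)
print(errors, round(accuracy, 2))
print(round(p0, 2), round(r0, 2))
print(round(p1, 2), round(r1, 2))
```

The low recall for class 1 (15 of 26 deaths identified) is what drives most of the 13 errors.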

### Pipeline 3

In [25]:
#anomaly detection via local outlier factor
#find outliers via fit_predict == -1
#set new x and y 
lof = LocalOutlierFactor()
lof_outliers = lof.fit_predict(X_train)==-1
X_lof = X_train[~lof_outliers]
y_lof = y_train[~lof_outliers]

In [26]:
#build pipeline 3

#select K best used for feature selection
#DTC classification
pipe3 = Pipeline([
    ('SKB', SelectKBest(chi2)),
    ('DTC', DecisionTreeClassifier())
])


param_grid = {
    'DTC__max_features':[2,4,6,8,10],
    'DTC__min_samples_split':[2,3,4]
}

#model
model_grid3 = GridSearchCV(pipe3,param_grid, cv=5)
model_grid3.fit(X_lof, y_lof)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('SKB',
                                        SelectKBest(score_func=<function chi2 at 0x09C6F9C0>)),
                                       ('DTC', DecisionTreeClassifier())]),
             param_grid={'DTC__max_features': [2, 4, 6, 8, 10],
                         'DTC__min_samples_split': [2, 3, 4]})

In [27]:
#look at the best estimator for pipeline 3
model_grid3.best_estimator_

Pipeline(steps=[('SKB', SelectKBest(score_func=<function chi2 at 0x09C6F9C0>)),
                ('DTC',
                 DecisionTreeClassifier(max_features=10, min_samples_split=3))])

In [28]:
#best score
model_grid3.best_score_

0.8327586206896551

In [29]:
#find the predicted y based on the pipeline
pred_y = model_grid3.predict(X_test)

#find the precision, recall, f1 score, and accuracy over the model
model_grid3.score(X_test, y_test)
print(classification_report(y_test, pred_y))
pd.DataFrame(confusion_matrix(y_test, pred_y))

              precision    recall  f1-score   support

           0       0.69      0.91      0.78        22
           1       0.89      0.65      0.76        26

    accuracy                           0.77        48
   macro avg       0.79      0.78      0.77        48
weighted avg       0.80      0.77      0.77        48



Unnamed: 0,0,1
0,20,2
1,9,17


Pipeline 3 outperformed pipeline 2, with an accuracy of 0.77 and 11 incorrect predictions out of 48. Its best cross-validation score was 0.83.

### Pipeline 4

In [30]:
#anomaly detection via isolation forest
#find outliers via predict = -1
#set new x and y 
iso = IsolationForest().fit(X_train, y_train)
iso_outliers = iso.predict(X_train)==-1
X_iso = X_train[~iso_outliers]
y_iso = y_train[~iso_outliers]

In [31]:
#build pipeline 4

#SKB via chi2 and LR for the pipeline
pipe4 = Pipeline([
    ('SKB', SelectKBest(chi2)),
    ('LR', LogisticRegression())
])

param_grid = {
    'SKB__k':[1,5,10],
    'LR__tol':[1e-4, 1e-5, 1e-3, 2e-4, 3e-4],
    'LR__C':[1.0,1.1,1.2,1.5,1.8,2.0]
}

#model
model_grid4 = GridSearchCV(pipe4, param_grid, cv=5)
model_grid4.fit(X_iso, y_iso)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('SKB',
                                        SelectKBest(score_func=<function chi2 at 0x09C6F9C0>)),
                                       ('LR', LogisticRegression())]),
             param_grid={'LR__C': [1.0, 1.1, 1.2, 1.5, 1.8, 2.0],
                         'LR__tol': [0.0001, 1e-05, 0.001, 0.0002, 0.0003],
                         'SKB__k': [1, 5, 10]})

In [32]:
#look at the best estimator for pipeline 4
model_grid4.best_estimator_

Pipeline(steps=[('SKB',
                 SelectKBest(k=1, score_func=<function chi2 at 0x09C6F9C0>)),
                ('LR', LogisticRegression())])

In [33]:
#best score
model_grid4.best_score_

0.8095238095238095

In [34]:
#find the predicted y based on the pipeline
pred_y = model_grid4.predict(X_test)

#find the precision, recall, f1 score, and accuracy over the model
model_grid4.score(X_test, y_test)
print(classification_report(y_test, pred_y))
pd.DataFrame(confusion_matrix(y_test, pred_y))

              precision    recall  f1-score   support

           0       0.63      0.86      0.73        22
           1       0.83      0.58      0.68        26

    accuracy                           0.71        48
   macro avg       0.73      0.72      0.71        48
weighted avg       0.74      0.71      0.70        48



Unnamed: 0,0,1
0,19,3
1,11,15


Pipeline 4 had accuracy similar to pipeline 2: 0.71, or 14 misclassified cases out of 48. Pipeline 3 still performs best so far.

### Pipeline 5 

In [35]:
#anomaly detection via local outlier factor
#find outliers via fit_predict == -1
#set new x and y 
lof = LocalOutlierFactor()
lof_outliers = lof.fit_predict(X_train)==-1
X_lof = X_train[~lof_outliers]
y_lof = y_train[~lof_outliers]

In [36]:
#build pipeline 5

#VarianceThreshold drops zero-variance features, then a random forest classifies
pipe5 = Pipeline([
    ('VT',VarianceThreshold()),
    ('RFC',RandomForestClassifier())
])

param_grid = {
    'RFC__n_estimators':[10,50,100],
    'RFC__min_samples_split':[2,3,4]
}

#model
model_grid5 = GridSearchCV(pipe5, param_grid, cv=5)
model_grid5.fit(X_lof, y_lof)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('VT', VarianceThreshold()),
                                       ('RFC', RandomForestClassifier())]),
             param_grid={'RFC__min_samples_split': [2, 3, 4],
                         'RFC__n_estimators': [10, 50, 100]})

In [37]:
#model_grid5.get_params().keys()

In [38]:
#look at the best estimator for pipeline 5
model_grid5.best_estimator_

Pipeline(steps=[('VT', VarianceThreshold()),
                ('RFC',
                 RandomForestClassifier(min_samples_split=4, n_estimators=50))])

In [39]:
#best score
model_grid5.best_score_

0.846551724137931

In [40]:
#find the predicted y based on the pipeline
pred_y = model_grid5.predict(X_test)

#find the precision, recall, f1 score, and accuracy over the model
model_grid5.score(X_test, y_test)
print(classification_report(y_test, pred_y))
pd.DataFrame(confusion_matrix(y_test, pred_y))

              precision    recall  f1-score   support

           0       0.76      0.86      0.81        22
           1       0.87      0.77      0.82        26

    accuracy                           0.81        48
   macro avg       0.81      0.82      0.81        48
weighted avg       0.82      0.81      0.81        48



Unnamed: 0,0,1
0,19,3
1,6,20


Pipeline 5 had the highest test accuracy (0.81) and macro-averaged precision (0.81) of all five pipelines, along with the highest best cross-validation score (0.85).
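
To compare the five pipelines side by side, their test confusion matrices can be collected and ranked. The sketch below copies the matrices from the reports above and ranks by accuracy; recall on class 1 (deaths) is shown as a second column because it differs more between pipelines than accuracy does. `summarize` is a helper name introduced only for this note.

```python
# Confusion matrices copied from the five test reports
# (rows = true class, columns = predicted class)
reports = {
    "Pipeline 1": [[13, 9], [16, 10]],
    "Pipeline 2": [[20, 2], [11, 15]],
    "Pipeline 3": [[20, 2], [9, 17]],
    "Pipeline 4": [[19, 3], [11, 15]],
    "Pipeline 5": [[19, 3], [6, 20]],
}

def summarize(cm):
    """Return (accuracy, recall on class 1) for a 2x2 confusion matrix."""
    total = sum(map(sum, cm))
    accuracy = (cm[0][0] + cm[1][1]) / total
    death_recall = cm[1][1] / sum(cm[1])
    return accuracy, death_recall

# Rank pipelines from highest to lowest test accuracy
ranked = sorted(reports, key=lambda name: summarize(reports[name])[0], reverse=True)
for name in ranked:
    acc, rec = summarize(reports[name])
    print(f"{name}: accuracy={acc:.2f}, recall(death)={rec:.2f}")
```

Because the train/test split was made without a fixed `random_state`, a rerun of the notebook may produce a different ordering.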

# Retrained Model Using Pipeline 5

In [52]:
#rerun the pipeline 5 grid search on the full training split (outliers not removed)
trained_model = model_grid5.fit(X_train, y_train)

In [53]:
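#scale X test
#note: X_test was already min-max scaled in In [9]; this refits a new scaler on the test rows alone
scaled_X_test = MinMaxScaler().fit_transform(X_test)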

In [54]:
#prediction for y
pred_y = trained_model.predict(scaled_X_test)

In [55]:
#test it all
model_grid5.score(scaled_X_test, y_test)

print(classification_report(y_test, pred_y))
pd.DataFrame(confusion_matrix(y_test, pred_y))

              precision    recall  f1-score   support

           0       0.74      0.91      0.82        22
           1       0.90      0.73      0.81        26

    accuracy                           0.81        48
   macro avg       0.82      0.82      0.81        48
weighted avg       0.83      0.81      0.81        48



Unnamed: 0,0,1
0,20,2
1,7,19


### Retrained Model, Pipeline 5, Removed Outliers

In [56]:
#run test with outliers removed
lof = LocalOutlierFactor()
lof_outliers = lof.fit_predict(scaled_X_test)==-1
X_lof_test = scaled_X_test[~lof_outliers]
y_lof_test = y_test[~lof_outliers]

In [57]:
#prediction for y
pred_y = trained_model.predict(X_lof_test)

In [58]:
#test it all
model_grid5.score(X_lof_test, y_lof_test)

print(classification_report(y_lof_test, pred_y))
pd.DataFrame(confusion_matrix(y_lof_test, pred_y))

              precision    recall  f1-score   support

           0       0.74      0.91      0.82        22
           1       0.90      0.73      0.81        26

    accuracy                           0.81        48
   macro avg       0.82      0.82      0.81        48
weighted avg       0.83      0.81      0.81        48



Unnamed: 0,0,1
0,20,2
1,7,19


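The outcome is unchanged with test outliers removed: the support still totals 48 rows, which suggests that no test rows were flagged as outliers. Pipeline 5 had an overall accuracy of 0.81.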

### Train Test Split Overall Sample to Assess Performance

In [59]:
#split overall data into train and test sets
X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(X,y,test_size=0.25)

In [60]:
#min-max scale X (each scaler is fit separately on its own split)
scaled_X_train = MinMaxScaler().fit_transform(X_train_n)
scaled_X_test = MinMaxScaler().fit_transform(X_test_n)

#train using best pipeline model
trained_model = model_grid5.fit(scaled_X_train, y_train_n)
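
Fitting a separate scaler on the test split means the two splits are mapped with different minima and maxima. One way to avoid this, sketched below on synthetic data, is to make `MinMaxScaler` the first step of the pipeline, so that `GridSearchCV` refits it on each training fold and reuses it unchanged at prediction time. The step name `scale`, the reduced grid, the synthetic data, and the `random_state` values are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for X / y (299 rows, 12 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(299, 12))
y_demo = (X_demo[:, 0] + rng.normal(size=299) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# The scaler is a pipeline step, so it is fit on training data only
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("VT", VarianceThreshold()),
    ("RFC", RandomForestClassifier(random_state=42)),
])
grid = GridSearchCV(
    pipe,
    {"RFC__n_estimators": [10, 50], "RFC__min_samples_split": [2, 4]},
    cv=5,
)
grid.fit(X_tr, y_tr)

# Raw (unscaled) test rows go straight in; the fitted scaler is applied internally
test_accuracy = grid.score(X_te, y_te)
print(grid.best_params_)
print(round(test_accuracy, 2))
```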

In [61]:
#prediction for y
pred_y = trained_model.predict(scaled_X_test)

In [62]:
#test it all
model_grid5.score(scaled_X_test, y_test_n)

print(classification_report(y_test_n, pred_y))
pd.DataFrame(confusion_matrix(y_test_n, pred_y))

              precision    recall  f1-score   support

           0       0.85      0.87      0.86        52
           1       0.68      0.65      0.67        23

    accuracy                           0.80        75
   macro avg       0.77      0.76      0.76        75
weighted avg       0.80      0.80      0.80        75



Unnamed: 0,0,1
0,45,7
1,8,15


On the held-out test split of the full (non-undersampled) dataset, the model reached an accuracy of 80%: 15 of 75 outcomes were misclassified.
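
Because this test split is imbalanced (52 survivors, 23 deaths), it helps to read the 80% figure against a trivial baseline that always predicts the majority class. The check below uses only the confusion matrix reported above.

```python
# Final test-set confusion matrix (rows = true class, columns = predicted class)
cm = [[45, 7],
      [8, 15]]

support = [sum(row) for row in cm]        # true rows per class
total = sum(support)
model_accuracy = (cm[0][0] + cm[1][1]) / total
baseline_accuracy = max(support) / total  # always predict the majority class
death_recall = cm[1][1] / support[1]

print(f"model accuracy:    {model_accuracy:.2f}")
print(f"majority baseline: {baseline_accuracy:.2f}")
print(f"recall on deaths:  {death_recall:.2f}")
```

The model beats the baseline by roughly 11 percentage points, while about a third of the deaths in this split (8 of 23) are still missed.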