# Destined for (Heart <3) Failure? 

# Heart Failure Predictions
Cardiovascular diseases are becoming one of the most prominent causes of death in the modern world. They are commonly linked to habits such as smoking and sedentary behaviour, to other conditions such as obesity and diabetes, and to genetics. Predicting the likelihood of a person having heart failure would be extremely valuable to the medical field, as early diagnosis improves the chances of a sound recovery. Machine learning could serve as a tool to forecast whether a particular patient is likely to suffer from heart failure, so that more treatment resources can be directed to them.

The dataset used in this notebook was provided in the journal article [Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5) by Davide Chicco & Giuseppe Jurman. It contains a set of features that can be used to predict whether a fatal heart failure will occur in a given patient.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df= pd.read_csv('/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')
df

In [None]:
df.describe()

# 1. Exploratory Data Analysis

We find that the data is pre-cleaned, with categorical columns such as sex, presence of anaemia, diabetes and high blood pressure already encoded, which lends itself to high usability. The one concern is that there are only 299 instances in this dataset, which might lead to some overfitting.

## 1.1 Features and Target
### Features and their description
We notice a set of inherent, habitual and biologically measurable factors available at our disposal to use. 
* age: The person's age (numeric) 
* anaemia: Decrease of red blood cells or hemoglobin (boolean)
* creatinine_phosphokinase: Level of the CPK enzyme in the blood (mcg/L)
* diabetes: If the patient has diabetes (boolean)
* ejection_fraction: Percentage of blood leaving the heart at each contraction (percentage)
* high_blood_pressure: If the patient has hypertension (boolean)
* platelets: Platelets in the blood (kiloplatelets/mL)
* serum_creatinine: Level of serum creatinine in the blood (mg/dL)
* serum_sodium: Level of serum sodium in the blood (mEq/L)
* sex: Woman or man (binary)
* smoking: If the patient smokes or not (boolean)
* time: Follow-up period (days)

### Target
* DEATH_EVENT: If the patient died during the follow-up period (boolean)

## 1.2 Correlation heatmap

The correlation heatmap shows the strength of the linear (Pearson) correlation between each pair of variables. It gives us a simple yet powerful overview of the underlying relationships between the variables.
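
As a quick check of what each cell in the heatmap means, here is a minimal sketch that computes one Pearson coefficient by hand and compares it with `DataFrame.corr()`. The toy values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the heart failure dataset)
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8],
})

# Pearson r = sum of products of centred values / sqrt(product of sums of squares)
x_c = toy["x"] - toy["x"].mean()
y_c = toy["y"] - toy["y"].mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# pandas computes the same quantity for every pair of columns
r_pandas = toy.corr().loc["x", "y"]
print(r_manual, r_pandas)
```

A value close to +1 or -1 means the pair is well described by a straight line; a value near 0 means there is no linear relationship (a non-linear one may still exist).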

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('darkgrid')

plt.figure(figsize=(10,10))
sns.heatmap(df.corr(), annot=True, square=True)

The series below is a sorted version of the relevant part of the correlation matrix above: the absolute correlation of each feature with `DEATH_EVENT`, indicating each factor's salience in predicting the death event.

In [None]:
df.corr()['DEATH_EVENT'].apply(np.abs).sort_values(ascending=False)

#### Data Leakage
Note that the `time` feature has an unusually large correlation with the `DEATH_EVENT` target and constitutes a form of data leakage. One possible reading is that those who have longer (perhaps earlier) treatment are more likely to recover, but the follow-up period is also cut short when a patient dies, so its value depends on the outcome. We won't be able to use this attribute for prediction, as it is only available once we know the outcome of `DEATH_EVENT`.
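
To see why a follow-up time can leak the outcome, here is a small simulation on synthetic data. It assumes a fixed study length and an exponentially distributed time to death; both are simplifications chosen for illustration, not properties of the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical setup: every patient is followed for at most 250 days
follow_up_days = 250.0
# Hypothetical latent time until death for each patient
days_to_death = rng.exponential(scale=400.0, size=n)

# A death is recorded only if it happens within the follow-up window
death_event = (days_to_death < follow_up_days).astype(int)
# The recorded time stops at death, otherwise at the end of follow-up
observed_time = np.minimum(days_to_death, follow_up_days)

# The recorded time is determined by the outcome, hence the strong correlation
corr = np.corrcoef(observed_time, death_event)[0, 1]
print(round(corr, 2))
```

In this sketch the recorded time carries no medical information at all, yet it correlates strongly (and negatively) with the death event, simply because dying ends the follow-up.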

#### Promising Features
It seems that serum_creatinine, ejection_fraction, age and serum_sodium are promising features to use in this task. We will confirm this with the pairplot.

## 1.3 Pairplot

The most useful aspect of this plot is the kernel density estimate on the diagonal, as it separates the distribution of each feature into the two target classes. This shows that those who die from heart disease tend to have
* higher age
* higher creatinine values
* but lower ejection fractions

These relationships are also captured by the heatmap above.

In [None]:
sns.pairplot(df, vars=['serum_creatinine', 'ejection_fraction', 'age', 'serum_sodium', 'platelets'], hue = 'DEATH_EVENT')

## 1.4 Variable frequencies

We note that the target class distribution is imbalanced (the mean of `DEATH_EVENT` is the fraction of patients who died), so we will use stratified sampling when splitting our data.
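
Here is a minimal sketch of what stratification does, using synthetic labels with 30% positives (an illustrative ratio, not the one in our data). With `stratify`, `train_test_split` keeps the class proportions roughly equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 30 positives, 70 negatives (illustrative only)
y_toy = np.array([1] * 30 + [0] * 70)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy asks for the same class ratio in the train and validation parts
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Fraction of positives in each part
print(y_tr.mean(), y_va.mean())
```

Without stratification, a small validation set could end up with noticeably more or fewer positives than the training set, which makes scores harder to compare.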

In [None]:
df['DEATH_EVENT'].mean()

## 1.5 Sidetrack: Smoking, sex and heart disease

The following tables show that, in our sample,
* many more males than females smoke
* a similar number of non-smoking males and females die of heart disease
* there is no obvious difference in the death fractions across groups, and hence no clear correlation between smoking/sex and death from heart disease

In [None]:
#smoke_sex = df.groupby(['smoking','sex', 'DEATH_EVENT'])['DEATH_EVENT'].count().unstack()
pd.crosstab([df.smoking, df.sex ], df['DEATH_EVENT']) 

In [None]:
# normalise index because we want to see the proportion of each group
pd.crosstab([df.smoking, df.sex], df['DEATH_EVENT'], normalize='index') 

---
# 2. Modelling

Now that we have roughly understood the data, we can move on to the next step.

We would like to apply machine learning models with the aim of predicting whether a person is likely to die of heart failure, based on a set of known risk factors.

In [None]:
from sklearn.model_selection import train_test_split

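# Select predictors: every column except the last two,
# `time` (data leakage, see 1.2) and `DEATH_EVENT` (the target)
cols_to_use = df.columns[:-2]
X = df[cols_to_use]

# Select target (the last column, DEATH_EVENT)
y = df.iloc[:, -1]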

# Separate data into training and validation sets
# as splits have randomness, we apply a random_state seed for reproducibility
# stratified as imbalanced classes
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=20, stratify = y)
    

## 2.1 Model Selection

We would like to try a few classification models on our data and evaluate their success in predicting our target variable.

In [None]:
# define useful helper function for displaying cv scores.

def display_scores(scores, return_scores = False):
    if return_scores: 
        print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

In [None]:
# import models
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, NuSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# import method for selection criteria
from sklearn.model_selection import cross_val_score


# create list of Classification models to choose from
models = [XGBClassifier(),
          SVC(C= 0.1),
          NuSVC(),
          LogisticRegression(),
          RandomForestClassifier(),
          DecisionTreeClassifier()]

# loop through models and evaluate their performance based on 10-fold cross validation 
def loopthruModels(models, X, y, cv=10): 
    for model in models: 
        print(str(model))
        scores = cross_val_score(model, X,y, scoring="f1_macro", cv= cv)
        display_scores(scores)
        print('---')
        
loopthruModels(models, X_train, y_train)

The F1-score was used instead of accuracy to evaluate our classification models, because our dataset is skewed towards the negative class. The `f1_macro` scorer used above averages the F1-score of the two classes. For a single class, the F1-score is given by the formula
$$F1 = {{2 (prec \times rec)} \over {prec + rec} }$$

where $prec$ is the precision (how many selected items are relevant) and $rec$ is the recall (how many relevant items are selected). 
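
As a worked example of the formula, here is a small sketch with made-up confusion-matrix counts (hypothetical numbers, not results from our models). It also shows the macro average, which is the unweighted mean of the per-class F1-scores.

```python
# Hypothetical counts: true positives, false positives, false negatives, true negatives
tp, fp, fn, tn = 15, 5, 10, 45

def f1_from_counts(tp, fp, fn):
    """F1 for one class, from its precision and recall."""
    prec = tp / (tp + fp)   # how many selected items are relevant
    rec = tp / (tp + fn)    # how many relevant items are selected
    return 2 * prec * rec / (prec + rec)

f1_pos = f1_from_counts(tp, fp, fn)
# For the negative class the roles swap: true negatives become the "hits",
# false negatives become its false positives, and vice versa
f1_neg = f1_from_counts(tn, fn, fp)

f1_macro = (f1_pos + f1_neg) / 2
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(f1_pos, 3), round(f1_neg, 3), round(f1_macro, 3), accuracy)
# → 0.667 0.857 0.762 0.8
```

Note how the accuracy (0.8) looks better than the F1-score of the positive class (0.667): the majority class dominates accuracy, which is why we prefer F1 here.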

The three models with the highest mean F1-score are the XGBoost Classifier, the Random Forest Classifier and the Logistic Regression model.

However, do keep in mind that there is a certain degree of randomness in these results, mainly depending on the outcome of the `train_test_split` function, because our dataset is relatively small. It was noted that the Logistic Regression model failed to converge on some particular splits.



---
# 3. Hyperparameter Optimisation

We now focus on the XGBoost and Random Forest models because they gave the best F1-scores in the previous section.

Using `GridSearchCV`, we can search a grid of candidate values for the best hyperparameters, which are settings of the model that are fixed before training and do not change as the model is trained.
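
To get a feel for how much work a grid search does, here is a rough sketch that enumerates the candidates of a small, hypothetical grid with `itertools.product`. This only mimics the enumeration step; it is not how `GridSearchCV` is implemented internally.

```python
from itertools import product

# Hypothetical grid, in the same dict-of-lists shape that GridSearchCV accepts
param_grid = {
    "n_estimators": [50, 100, 500],
    "learning_rate": [0.01, 0.1],
}

# Every combination of values is one candidate model
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# With k-fold cross-validation, each candidate is trained k times
n_folds = 10
n_fits = len(candidates) * n_folds
print(len(candidates), "candidates,", n_fits, "model fits")
# → 6 candidates, 60 model fits
```

The number of fits grows multiplicatively with every hyperparameter added, so it pays to keep the grid small. By default `GridSearchCV` also refits the best candidate once on the whole training set.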


## 3.1 XGBoost
XGBoost is an ensemble method that combines several weak learners into a stronger one, with each new learner correcting the errors of the previous ones. The name stands for eXtreme Gradient Boosting: successive predictors are fitted to the *residual errors* made so far, so the ensemble improves each round. The library is designed to be fast and scalable.

A good mathematical description of the principles of XGBoost can be found [here](https://xgboost.readthedocs.io/en/latest/tutorials/model.html) in XGBoost's documentation.

A good balance between `learning_rate` and `n_estimators` was key to avoiding both overfitting and underfitting of our data.
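
To illustrate the roles of `learning_rate` and `n_estimators`, here is a minimal sketch of plain gradient boosting for squared error on synthetic data. It is a simplified stand-in and not XGBoost's actual algorithm, which adds regularisation and many optimisations; the data, tree depth and parameter values are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1   # how much of each new tree's correction we keep
n_estimators = 50     # how many trees we add

# Start from a constant model, then repeatedly fit a small tree to the residuals
prediction = np.full_like(y_toy, y_toy.mean())
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(n_estimators):
    residuals = y_toy - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residuals)
    prediction += learning_rate * tree.predict(X_toy)
    mse_history.append(np.mean((y_toy - prediction) ** 2))

# Training error before boosting and after the last round
print(mse_history[0], mse_history[-1])
```

A small `learning_rate` needs more trees to reach the same training error, while a large one with many trees can fit the noise; this is the trade-off the grid below explores.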

In [None]:
from sklearn.model_selection import GridSearchCV

xg_param_grid = [
    {'n_estimators': [50, 70, 100, 500, 1000], 
     'learning_rate':[0.001, 0.01, 0.05, 0.1,  0.5],
     'n_jobs': [4]}
]

xgmodel = XGBClassifier()

grid_search = GridSearchCV(xgmodel, xg_param_grid, cv=10, scoring="f1_macro", return_train_score= True)
grid_search.fit(X_train,y_train)
print(grid_search.best_params_)

cvres = grid_search.cv_results_

for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
    print(mean_score, params)

## 3.2 Random Forest

A Random Forest is another powerful ensemble method that bunches together Decision Trees.

In [None]:
forest_param_grid = [
    {'n_estimators': [200, 700, 1000], 
     'max_features':[0.2, 0.5, None],
    'n_jobs': [4]}
]

forest = RandomForestClassifier()

grid_search = GridSearchCV(forest, forest_param_grid, cv=10, scoring="f1_macro", return_train_score=True)
grid_search.fit(X_train,y_train)
print(grid_search.best_params_)

cvres = grid_search.cv_results_

for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
    print(mean_score, params)

As our dataset is on the smaller side, we would like to explore whether all the features are really needed to determine whether a person is predisposed to fatal heart failure. The next section explores this.

# 4. Evaluating feature importances

We used L1 regularisation and a random forest to find the features with the highest importance and, interestingly, they gave slightly different results.

## 4.1 Forest feature evaluation

In [None]:
forest_model = RandomForestClassifier()

forest_model.fit(X_train, y_train)

# evaluate each feature's importance
importances = forest_model.feature_importances_

# sort by descending order of importances
indices = np.argsort(importances)[::-1]

#create sorted dictionary
forest_importances = {}

print("Feature ranking:")
for f in range(X.shape[1]):
    forest_importances[X.columns[indices[f]]] = importances[indices[f]]
    print("%d. %s (%f)" % (f + 1, X.columns[indices[f]], importances[indices[f]]))

## 4.2 L1 feature importance evaluation
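
L1 regularisation selects features by driving the coefficients of unhelpful ones to exactly zero. Here is a small sketch on synthetic data, where only the first two of five columns influence the label; the data and the values of `C` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: columns 0 and 1 drive the label, columns 2 to 4 are pure noise
rng = np.random.default_rng(0)
n = 500
X_toy = rng.normal(size=(n, 5))
logits = 2.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1]
y_toy = (logits + rng.logistic(size=n) > 0).astype(int)

# Smaller C means stronger regularisation in scikit-learn
coefs = {}
for C in (1.0, 0.02):
    clf = LogisticRegression(C=C, penalty="l1", solver="liblinear").fit(X_toy, y_toy)
    coefs[C] = clf.coef_.ravel()
    print(C, np.round(coefs[C], 2))
```

With the stronger penalty, the noise columns should be zeroed out while the informative ones survive. `SelectFromModel` then keeps the columns whose coefficients are non-zero.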

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# Set the regularization parameter C=0.1 (smaller C means stronger L1 regularization)
logistic = LogisticRegression(C = .1, penalty="l1", solver='liblinear', random_state=7).fit(X_train, y_train)
model = SelectFromModel(logistic, prefit=True)

X_new = model.transform(X)
X_new

# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(model.inverse_transform(X_new), 
                                 index=X.index,
                                 columns=X.columns)

# Dropped columns have values of all 0s, keep other columns 
reg_selected_columns = selected_features.columns[selected_features.var() != 0]
reg_selected_columns

## 4.3 Reevaluating models having selected features

Apparently, the models using the features ranked highest by the Random Forest importances perform best, even better than the models using all the features, for both XGBoost and Random Forest.

In [None]:
# define function for evaluating different feature selections

def eval_features(model, param_grid, feature_list):
    # run one grid search per candidate feature subset
    for features in feature_list:
        grid_search = GridSearchCV(model, param_grid, cv=10, scoring="f1_macro", return_train_score=True)
        grid_search.fit(X_train[features], y_train)

        # report the subset, then the best mean cross-validated F1-score
        # and the hyperparameters that achieved it
        print(list(features))
        print(grid_search.best_score_, grid_search.best_params_)

In [None]:
xgmodel = XGBClassifier()

# parameter grid for XGBoost
xg_param_grid = [
    {'n_estimators': [25, 50,  500, 1000], 
     'learning_rate':[0.001, 0.01, 0.05, 0.1,  0.5],
     'n_jobs': [4]}
]

# we want to test this series of features
feature_list = [['serum_creatinine', 'ejection_fraction'],
                ['serum_creatinine', 'ejection_fraction', 'age'], 
                ['serum_creatinine', 'ejection_fraction', 'age', 'creatinine_phosphokinase'],
                ['serum_creatinine', 'ejection_fraction', 'age', 'creatinine_phosphokinase', 'platelets'],
                reg_selected_columns]

eval_features(xgmodel, xg_param_grid, feature_list)    

In [None]:
forest = RandomForestClassifier()

forest_param_grid = [
    {'n_estimators': [200, 700, 1000], 
     'max_features':[0.2, 0.5, None],
    'n_jobs': [4]}
]

# use same feature list
eval_features(forest, forest_param_grid, feature_list)   

It is particularly interesting to see that using just two features already gives a pretty decent F1 score (about 0.70). The more features we use, the higher the chance that overfitting will affect our model.

# 5. Final Model

From section 4, we observed that reducing the number of features indeed led to a higher F1-score. I will select the set of three features `['serum_creatinine', 'ejection_fraction', 'age']` for the construction of our final model: the low number of features makes the model more resistant to overfitting, yet it still offers a slightly better score than using only the two features suggested by the original study.

I prefer the XGBoost Classifier for the final model, as it is much quicker but offers basically the same level of performance.

In [None]:
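# defining useful functions
from sklearn.metrics import confusion_matrix, f1_score

def getScore(model, X_train=X_train, y_train=y_train):
    """Fit `model` on the selected columns and return its validation accuracy."""
    my_pipeline = Pipeline(steps=[
                                  ('model', model)
                                 ])
    # fit on the same columns that are used for validation
    # (sel_cols is defined in the final pipeline cell below)
    my_pipeline.fit(X_train[sel_cols], y_train)

    # for classifiers, .score() returns the mean accuracy, not the F1-score
    score = my_pipeline.score(X_valid[sel_cols], y_valid)
    return score

def getconfusion(model, y_valid, y_pred, f1=True):
    """Plot the confusion matrix; optionally print the macro-averaged F1-score."""
    cf_matrix = confusion_matrix(y_valid, y_pred)

    if f1:
        # macro average, matching the "f1_macro" scoring used in cross-validation;
        # `model` stays in the signature for compatibility but is not needed here
        print('The F1 score is:', f1_score(y_valid, y_pred, average='macro'))
    sns.heatmap(cf_matrix, annot=True)
    plt.xlabel('Predicted values')
    plt.ylabel('True values')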


## Final Pipeline

In [None]:
from sklearn.pipeline import Pipeline

# relevant columns for a good prediction
sel_cols = ['serum_creatinine', 'ejection_fraction', 'age']

# model with optimised hyperparameters
model = XGBClassifier(n_estimators=1000, learning_rate= 0.5)

steps=[
       ('model', model)
      ]

# Bundle preprocessing and modeling code in a pipeline
pipe = Pipeline(steps)

# Preprocessing of training data, fit model 
pipe.fit(X_train[sel_cols], y_train)

# Preprocessing of validation data, get predictions
preds = pipe.predict(X_valid[sel_cols])

# Evaluate the model using the confusion matrix
getconfusion(pipe, y_valid, preds)


As we can see, our model is not perfect but its value is undeniable! Notice that for this particular split, the model gives more false negatives than false positives.

# 6. Conclusion

An XGBoost model was used to model whether a person would suffer from fatal heart failure and an F1-score of ~73% was consistently achieved.

Three predictors were sufficient for this model to achieve a high score: `ejection_fraction` (the percentage of blood leaving the heart at each contraction), `serum_creatinine` (the level of creatinine in the blood) and `age`.

The main source of improvement for this project would be to obtain more data. We could also try obtaining a probabilistic estimate of the likelihood of someone developing a fatal heart failure, which might be more useful for medical professionals to identify marginal cases!
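
As a rough sketch of what such a probabilistic output could look like, scikit-learn style classifiers expose `predict_proba`. The example below uses a logistic regression on synthetic data as a stand-in (an assumption, to keep it self-contained); `XGBClassifier` offers the same method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for patient features (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)

# Column 1 of predict_proba is the estimated probability of the positive class
proba = clf.predict_proba(X_toy[:5])[:, 1]
labels = clf.predict(X_toy[:5])

for p, label in zip(proba, labels):
    print(f"estimated probability {p:.2f} -> predicted class {label}")
```

Cases with a probability close to 0.5 are the marginal ones, which a hard 0/1 prediction would hide.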