# Heart Failure Prediction with and without time - XGB model (95%) and interpretability of results

## Importing libraries

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
%config InlineBackend.figure_format='svg'
%matplotlib inline 

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split#, GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score

## Loading and inspecting data

In [None]:
clinical_data = pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')
clinical_data.head()

In [None]:
clinical_data.info()

The data has 299 samples and 13 columns. There are no null values and every column is numerical; `anaemia`, `diabetes`, `high_blood_pressure`, `sex`, `smoking` and `DEATH_EVENT` are binary. Summary statistics for the non-binary columns are given below.

In [None]:
clinical_data[['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time']].describe()
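The statements above about missing values and binary columns can be checked programmatically. The sketch below is not part of the original analysis; it uses a small made-up frame (`toy`, a hypothetical stand-in for `clinical_data`) so that it runs on its own.

```python
import pandas as pd

# Hypothetical miniature frame standing in for clinical_data (values are made up).
toy = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 50.0],
    "anaemia": [0, 0, 1, 1],
    "serum_creatinine": [1.9, 1.1, 1.3, 1.9],
    "DEATH_EVENT": [1, 1, 0, 0],
})

# Total number of missing values across the whole frame.
n_missing = int(toy.isnull().sum().sum())

# A column counts as binary when every value is either 0 or 1.
binary_cols = [c for c in toy.columns if toy[c].isin([0, 1]).all()]

print(n_missing)
print(binary_cols)
```

Replacing `toy` with `clinical_data` should list the six binary columns named above.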

The objective is to predict the target column `DEATH_EVENT`, which indicates death due to heart failure, from the information in the other 12 columns. The Pearson correlation coefficient is a good starting point. The graph below shows this coefficient between the target column and each of the other variables. `time` has a strong negative correlation with the target. After it, `serum_creatinine`, `ejection_fraction`, `age` and `serum_sodium` also show considerable correlation with the target compared with the remaining columns.

In [None]:
plt.figure()
# Correlation of every feature with the target, sorted so the bars appear in order
corr_death = clinical_data.corr('pearson')['DEATH_EVENT'].drop('DEATH_EVENT').sort_values()
# Use the Series' own index as labels so names and values always stay aligned
plt.barh(corr_death.index, corr_death.values)
plt.title('Pearson correlation with DEATH_EVENT')
plt.xlabel('corr_value')
plt.ylabel('variable')
plt.show()
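For reference, the Pearson coefficient plotted above can be computed by hand. The sketch below is an illustration with made-up values (`follow_up_days` and `death_event` are hypothetical, not taken from the dataset); it shows how a binary target that is mostly 1 for short follow-ups yields a negative coefficient.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Made-up values: deaths (1) concentrated at short follow-up periods.
follow_up_days = [10, 30, 60, 120, 200, 250]
death_event = [1, 1, 1, 0, 0, 0]

r = pearson_r(follow_up_days, death_event)
print(round(r, 3))
```

The result agrees with `np.corrcoef(follow_up_days, death_event)[0, 1]`.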

Next we will see if these correlations are identified and/or useful for the machine learning algorithm.

## Machine Learning Process

First we separate the target from the feature columns and then split the data into train and test sets. The test size is set to 20% of the original data.

In [None]:
X = clinical_data.drop(['DEATH_EVENT'], axis = 1)
y = clinical_data['DEATH_EVENT']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2)

print(X_train.shape)
print(X_test.shape)
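The printed shapes can be anticipated with a little arithmetic. The sketch below assumes that `train_test_split` rounds a fractional test size up to the next whole sample, which is my understanding of scikit-learn's behaviour rather than something stated in this notebook.

```python
import math

n_samples = 299   # rows in the dataset
test_size = 0.2   # fraction passed to train_test_split

# Assumption: the test set size is the fraction rounded up.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under that assumption the train set has 239 rows and the test set 60, each with 12 feature columns.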

Then we train a `GradientBoostingClassifier` on the train set. The hyperparameters of this classifier were found using a grid search (the parameter grid is shown at the end of this document).

In [None]:
xgb = GradientBoostingClassifier(max_depth=2, min_samples_split=0.5, n_estimators=50,random_state=1)
xgb.fit(X_train, y_train)

The accuracy and the confusion matrix of this classifier are shown below.

In [None]:
y_pred = xgb.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, y_pred))
# scikit-learn expects (y_true, y_pred): rows are the real class, columns the predicted class
cm = confusion_matrix(y_test, y_pred)

def create_confusion_graph(cm, title='Confusion matrix'):
    fig, ax = plt.subplots()
    im = ax.imshow(cm, cmap='Blues')
    
    ax.set_xticks([0,1])
    ax.set_yticks([0,1])
    # Tick positions 0 and 1 correspond to the DEATH_EVENT classes 0 and 1
    ax.set_xticklabels(['0', '1'])
    ax.set_yticklabels(['0', '1'])

    plt.xlabel('Predicted')
    plt.ylabel('Real')

    for i in range(len(cm)):
        for j in range(len(cm[0])):
            text = ax.text(j, i, cm[i, j],
                           ha="center", va="center", color="black")

    plt.title(title)
    
    return fig

create_confusion_graph(cm, 'Confusion matrix for the first gradient boost classifier')
plt.show()
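When reading a confusion matrix it helps to remember the argument order and layout that scikit-learn uses: `confusion_matrix(y_true, y_pred)` puts the real classes on the rows and the predicted classes on the columns. The toy labels below are made up purely to show that layout.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: one real 0 is predicted as 1, everything else is correct.
y_true_toy = [0, 0, 1, 1]
y_pred_toy = [0, 1, 1, 1]

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

# For a binary problem, ravel() yields the cells in this order.
tn, fp, fn, tp = (int(v) for v in cm_toy.ravel())

print(cm_toy)
print(tn, fp, fn, tp)
```

Swapping the two arguments transposes the matrix, which exchanges false positives and false negatives.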

### Interpretability of the first model

Next we analyse the results of the first model. Gradient boosting models in `scikit-learn` have an attribute called `feature_importances_`, which tells us "how much that feature reduced the criterion of a split" (this is known as the Gini importance). The graph below shows this measure.

In [None]:
sorted_idx = xgb.feature_importances_.argsort()

plt.barh(y=X.columns[sorted_idx], width=xgb.feature_importances_[sorted_idx])

plt.title('Gini importance')
plt.xlabel('Gini importance')
plt.ylabel('feature')
plt.show()

As can be seen, and not coincidentally, the variables with the greatest Pearson correlation are also the most important according to the Gini measure. To confirm this, a permutation importance inspection was done; the results, which confirm the Gini measure, are shown below.

In [None]:
from sklearn.inspection import permutation_importance
result = permutation_importance(xgb, X, y, scoring='accuracy', random_state=1)
sorted_idx = result.importances_mean.argsort()

fig, ax = plt.subplots()
ax.boxplot(result.importances[sorted_idx].T,
           vert=False, labels=X_test.columns[sorted_idx])
# The importances above are computed on the full X, y, so the title says so
ax.set_title("Permutation Importances (full dataset)")
fig.set_size_inches((7,7))
plt.show()
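The idea behind permutation importance is simple enough to write by hand: shuffle one column, re-score the model, and record how much the score drops. The sketch below is a simplified illustration, not scikit-learn's implementation; the `predict` function is a hypothetical stand-in for a fitted model and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: column 0 decides the label, column 1 is pure noise.
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

def predict(X):
    # Hypothetical stand-in for a fitted model; it only looks at column 0.
    return (X[:, 0] > 0).astype(int)

def accuracy(y_true, y_hat):
    return float(np.mean(y_true == y_hat))

def permutation_drop(X, y, col, n_repeats=5):
    """Mean accuracy drop after shuffling one column n_repeats times."""
    base = accuracy(y, predict(X))
    drops = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        X_perm[:, col] = rng.permutation(X_perm[:, col])
        drops.append(base - accuracy(y, predict(X_perm)))
    return float(np.mean(drops))

drop_signal = permutation_drop(X_toy, y_toy, col=0)
drop_noise = permutation_drop(X_toy, y_toy, col=1)
print(drop_signal, drop_noise)
```

Shuffling a column the model ignores leaves the accuracy unchanged, while shuffling the column it relies on lowers it. This is the pattern the box plot shows for `time`.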

With the results confirmed, we can say that, for this classifier, the most important feature is `time`, by a large margin over the other features. Both `ejection_fraction` and `serum_creatinine` also seem considerably important to this model. A partial dependence plot for these three variables, which shows the average effect of each feature on the predicted target, is shown below. From it we can clearly see that the model learned what the Pearson correlation told us earlier.

In [None]:
# plot_partial_dependence was removed in scikit-learn 1.2; from_estimator is its replacement
from sklearn.inspection import PartialDependenceDisplay
PartialDependenceDisplay.from_estimator(xgb, X, features=['time','ejection_fraction','serum_creatinine'], n_cols=3, response_method='predict_proba', method='brute')
plt.suptitle('Partial dependence plot for the first classifier')
plt.show()

This graph also shows:
* Patients with a follow-up period (`time`) of less than 50 days, `ejection_fraction < 20` and/or `serum_creatinine > 7.5` have a higher probability of dying.
* The curve for `time` rises between the values of 150 and 170. This happens because all of the patients whose follow-up period fell in this interval died, as shown below.

In [None]:
clinical_data[(clinical_data['time'] >= 150) & (clinical_data['time'] <= 170)]['DEATH_EVENT'].value_counts()

* The only 3 samples that were predicted incorrectly in the test set are shown below. All of them have a `DEATH_EVENT` value of 1 and were predicted as 0. Their values do not match the patterns shown in the partial dependence plot, which is why the model could not predict them correctly.

In [None]:
X_test[xgb.predict(X_test) != y_test][['time', 'ejection_fraction', 'serum_creatinine']]
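The three misclassified samples are consistent with the reported accuracy. The short check below assumes the test set holds 60 rows (20% of 299, rounded up); that count is inferred here rather than printed in the text.

```python
n_test = 60    # assumed size of the test set
n_wrong = 3    # misclassified samples listed above

accuracy = (n_test - n_wrong) / n_test
print(accuracy)
```

Since all three errors are real deaths predicted as survivals, they are false negatives.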

## Is that all?
The model created in the last section did a pretty good job (95% accuracy). However, we saw that its predictions were based mostly on the `time` feature. Although this is not necessarily bad, `time`, which gives a patient's follow-up period in days, is not a biological feature. If we remove `time`, would the model still work well? And which variables would it base its predictions on?

## A new gradient model without `time`

First we remove the `time` feature from the previous X data. Then we split the new set.

In [None]:
X_no_time = X.drop(['time'], axis = 1)
X_train_no_time, X_test_no_time, y_train, y_test = train_test_split(X_no_time, y, test_size = 0.2, random_state = 2)
print(X_train_no_time.shape)
print(y_train.shape)

Then we create a new model. The hyperparameters of this model were found with the second grid search at the end of this document.

In [None]:
xgb_no_time = GradientBoostingClassifier(learning_rate=0.01, max_depth=2,
                           min_samples_leaf=0.1, min_samples_split=0.5,
                           random_state=1)
xgb_no_time.fit(X_train_no_time, y_train)

With the best hyperparameters found, the maximum accuracy was 80%. As expected, without `time` the model has more difficulty learning.

In [None]:
y_pred = xgb_no_time.predict(X_test_no_time)
print('Accuracy: ', accuracy_score(y_test, y_pred))
# scikit-learn expects (y_true, y_pred): rows are the real class, columns the predicted class
create_confusion_graph(confusion_matrix(y_test, y_pred))
plt.show()

To answer the second question asked earlier, we repeat the same analysis of the results. A Gini importance graph is shown below. Without `time`, the most important features according to the Gini measure are `serum_creatinine` and `ejection_fraction`, which were among the top three before. Interestingly, the model now seems to consider only three variables when making its predictions: the two already mentioned and `age`.

In [None]:
sorted_idx_no_time = xgb_no_time.feature_importances_.argsort()

# Label the bars with the columns the model was actually trained on (no `time`)
plt.barh(y=X_no_time.columns[sorted_idx_no_time], width=xgb_no_time.feature_importances_[sorted_idx_no_time])
plt.title('Gini importance')
plt.xlabel('Gini importance')
plt.ylabel('feature')
plt.show()
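The observation that the model relies on only a few variables can be read off the importances directly, not only from the bar chart. The helper below, `used_features`, is a hypothetical name introduced for this sketch; it is demonstrated on synthetic data where only one column carries signal.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def used_features(model, columns):
    """Return (name, importance) pairs with non-zero importance, largest first."""
    imp = model.feature_importances_
    order = np.argsort(imp)[::-1]
    return [(columns[i], float(imp[i])) for i in order if imp[i] > 0]

# Synthetic data: only column "f2" determines the label.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = (X_toy[:, 2] > 0).astype(int)
toy_columns = ["f0", "f1", "f2", "f3"]

toy_model = GradientBoostingClassifier(max_depth=1, n_estimators=10, random_state=1)
toy_model.fit(X_toy, y_toy)

ranked = used_features(toy_model, toy_columns)
print(ranked)
```

In the notebook the equivalent call would be `used_features(xgb_no_time, list(X_no_time.columns))`. The importances are normalised, so they sum to 1.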

The permutation importance measurement below confirms what was shown with the Gini importance measurement.

In [None]:
result = permutation_importance(xgb_no_time, X_no_time, y, scoring='accuracy', random_state=1)
sorted_idx_no_time = result.importances_mean.argsort()

fig, ax = plt.subplots()
ax.boxplot(result.importances[sorted_idx_no_time].T,
           vert=False, labels=X_test_no_time.columns[sorted_idx_no_time])
# The importances above are computed on the full X_no_time, y, so the title says so
ax.set_title("Permutation Importances (full dataset)")
fig.set_size_inches((7,7))
plt.show()

The partial dependence plot for these three variables is shown below. The curves for `ejection_fraction` and `serum_creatinine` look like those obtained earlier, but are now much more pronounced. For `age`, the partial dependence plot shows that, for this model, an age greater than 70 increases the probability of death.

In [None]:
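# plot_partial_dependence was removed in scikit-learn 1.2; from_estimator is its replacement
from sklearn.inspection import PartialDependenceDisplay
PartialDependenceDisplay.from_estimator(xgb_no_time, X_no_time, ['ejection_fraction','age', 'serum_creatinine'], grid_resolution=500, response_method='predict_proba', method='brute')
plt.suptitle('Partial dependence plot for the second classifier')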
plt.show()
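The `method='brute'` option used in the partial dependence plots corresponds to a simple procedure: for each grid value, overwrite the feature with that value in every row and average the predicted probability. The sketch below illustrates that idea with a hypothetical probability function and made-up data; it is not the library's own code.

```python
import numpy as np

def predict_proba_death(X):
    # Hypothetical stand-in for model.predict_proba(X)[:, 1]:
    # risk falls with column 0 and rises with column 1.
    z = -0.03 * X[:, 0] + 0.8 * X[:, 1]
    return 1.0 / (1.0 + np.exp(-z))

def partial_dependence_brute(predict, X, col, grid):
    """Average prediction when column `col` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, col] = value          # same value for every row
        averages.append(predict(X_mod).mean())
    return np.array(averages)

rng = np.random.default_rng(0)
X_toy = np.column_stack([
    rng.uniform(0, 300, size=100),    # made-up "time"-like column
    rng.uniform(0.5, 9.0, size=100),  # made-up "serum_creatinine"-like column
])

grid = np.linspace(0, 300, 7)
pd_curve = partial_dependence_brute(predict_proba_death, X_toy, col=0, grid=grid)
print(pd_curve)
```

Because the other columns keep their observed values, each point on the curve is an average over the whole dataset, not the prediction for a single patient.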

## Conclusion

With all that has been presented, we can conclude that, for these samples, the follow-up period, the level of serum creatinine in the blood, and the ejection fraction (the percentage of blood leaving the heart at each contraction) were the most important factors for predicting a death event due to heart failure.

## Grid search parameters

In [None]:
'''param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 150],
    'max_depth': [2,3,5],
    'min_samples_split': [0.01, 0.1, 0.5],    
}

grid_search = GridSearchCV(GradientBoostingClassifier(random_state=1), cv=5, param_grid=param_grid, scoring='roc_auc', verbose = 1, n_jobs = 2)
grid_search.fit(X_train, y_train)

print('Best params found: \n\t', grid_search.best_params_)
print('Best ROC_AUC score found: \n\t', grid_search.best_score_)
print('Best estimator: \n\t', grid_search.best_estimator_)
'''

print('grid search - first model')

In [None]:
'''param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 150],
    'max_depth': [2,3,5,10],
    'min_samples_split': [0.0, 0.1, 0.5],
    'min_samples_leaf': [0.0, 0.1, 0.5],
    'min_impurity_decrease': [0.0, 0.1,0.5]
}

grid_search = GridSearchCV(GradientBoostingClassifier(random_state=1), cv=5, param_grid=param_grid, scoring='roc_auc', verbose = 1, n_jobs = -1)
grid_search.fit(X_train_no_time, y_train)

print('Best params found: \n\t', grid_search.best_params_)
print('Best ROC_AUC score found: \n\t', grid_search.best_score_)
print('Best estimator: \n\t', grid_search.best_estimator_)'''

print('grid search - second model')
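To get a feel for the cost of the two searches, the number of cross-validation fits can be counted from the grids above. The sketch below only multiplies the number of candidate values per hyperparameter by the number of folds (`cv=5`); the dictionary names are introduced here for illustration.

```python
import math

cv = 5  # folds used in both searches

# Number of candidate values per hyperparameter, copied from the grids above.
first_grid_sizes = {
    "learning_rate": 3, "n_estimators": 3, "max_depth": 3, "min_samples_split": 3,
}
second_grid_sizes = {
    "learning_rate": 3, "n_estimators": 3, "max_depth": 4,
    "min_samples_split": 3, "min_samples_leaf": 3, "min_impurity_decrease": 3,
}

first_fits = math.prod(first_grid_sizes.values()) * cv
second_fits = math.prod(second_grid_sizes.values()) * cv

print(first_fits, second_fits)
```

The second grid is twelve times larger than the first, which is probably why it was run with `n_jobs = -1`.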