# Random Forest classification

In this notebook we focus on reproducing the results of heart failure prediction with Random Forest. We use the same model evaluation metrics and the same feature importance assessment methods as the original paper, and compare our results with those it reports.

The procedure is as follows:
1. Split the data into train and test sets
2. Fit the model on the train set 
3. Calculate evaluation metrics
4. Calculate feature importance
5. Repeat the procedure 100 times and aggregate the results

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.inspection import permutation_importance

In [2]:
# Read data from .csv file
data = pd.read_csv('../data/heart_failure_records.csv')

There are four key issues we faced when reproducing the results of the original paper:
1. Lack of information on hyperparameters
2. Inconsistent train/test splitting during model fitting and feature importance evaluation
3. Calculating feature importance on train set
4. Reporting best result out of 100 for some of the evaluation metrics instead of mean

These will be explained in detail in the next sections.

## Hyperparameter tuning
There is no information on hyperparameter tuning of the Random Forest classifier. The authors are aware of the concept, since they apply tuning to SVM and Multi-layer Perceptron, but they group Random Forest with methods that do not require tuning, such as Logistic Regression. This implies that the authors used the default hyperparameters of the `randomForest` R package.

Based on this we can deduce that the hyperparameters are:
- Number of trees: 500
- Number of features sampled at each split: $\lfloor\sqrt{n}\rfloor$, where $n$ is the number of features
- Fraction of observations sampled for each tree: 1 (drawn with replacement)
- Maximum leaf nodes in each tree: unlimited
- Minimum size of terminal nodes / minimum samples in a leaf: 1

It should be noted that while the default hyperparameters of the `scikit-learn` implementation of the Random Forest classifier are mostly the same, the most important one, the number of trees grown, is different: 100. The values actually used can only be deduced from the code made available by the researchers; they are not stated in the paper.
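
The deduced settings can be written out as explicit `RandomForestClassifier` arguments. The sketch below is only an illustration: the mapping between `randomForest` arguments and scikit-learn parameters in the comments is our reading of the two libraries, the name `r_like_params` is hypothetical, and the synthetic data from `make_classification` merely stands in for the real dataset (the feature count of 11 is an assumption).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical name; values mirror the randomForest defaults listed above.
r_like_params = {
    "n_estimators": 500,     # ntree = 500 (the scikit-learn default is 100)
    "max_features": "sqrt",  # mtry = floor(sqrt(number of features))
    "bootstrap": True,       # observations are sampled with replacement
    "max_samples": None,     # each tree draws as many rows as the training set has
    "max_leaf_nodes": None,  # no limit on the number of leaves
    "min_samples_leaf": 1,   # nodesize = 1
}

# Synthetic stand-in data: 299 rows, 11 features (assumed for illustration).
X_demo, y_demo = make_classification(n_samples=299, n_features=11, random_state=0)

rf_demo = RandomForestClassifier(**r_like_params, random_state=0).fit(X_demo, y_demo)

# Number of trees grown and number of features considered at each split.
print(len(rf_demo.estimators_), rf_demo.estimators_[0].max_features_)
```

Apart from `n_estimators`, these values coincide with the scikit-learn defaults, so passing `n_estimators=500` alone is equivalent; spelling them out only makes the assumptions visible.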

In [3]:
roc_auc_scores = []
pr_auc_scores = []
accuracy_scores = []
f1_scores = []
tp_scores = []
tn_scores = []
mcc_scores = []

# Instantiate series for accumulating feature importance rankings
rf_mdi_importance = pd.Series(0, index=data.drop(columns=['DEATH_EVENT', 'time']).columns, name='mean_impurity_decrease_ranksum')
rf_perm_importance = pd.Series(0, index=data.drop(columns=['DEATH_EVENT', 'time']).columns, name='mean_accuracy_decrease_ranksum')

# We do not need to use the same seed as researchers, since random number generator implementations
# are different in R and NumPy, so the partitions will always be different.
# We set this only once, since we want different partitions in each run.
np.random.seed(12345)

# Both performance assessment and feature importance are averaged over 100 runs
for i in range(100):
    # Partition data into 80/20 training/test sets
    X_train, X_test, y_train, y_test = train_test_split(data.drop(columns=['DEATH_EVENT', 'time']), data['DEATH_EVENT'], test_size=0.2)

    # Instantiate, train, predict
    rf = RandomForestClassifier(n_estimators=500)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)

    # Calculate performance assessment metrics
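    # Note: ROC AUC and PR AUC are computed from hard class labels (y_pred), not predicted probabilities
    roc_auc_scores.append(metrics.roc_auc_score(y_test, y_pred))
    precision, recall, _ = metrics.precision_recall_curve(y_test, y_pred)
    pr_auc_scores.append(metrics.auc(recall, precision))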
    accuracy_scores.append(metrics.accuracy_score(y_test, y_pred))
    f1_scores.append(metrics.f1_score(y_test, y_pred))
    tp_scores.append(metrics.recall_score(y_test, y_pred))
    tn_scores.append(metrics.recall_score(y_test, y_pred, pos_label=0))
    mcc_scores.append(metrics.matthews_corrcoef(y_test, y_pred))

    # Partition data into 70/30 training/test - researchers used different split than in performance assessment
    X_train, X_test, y_train, y_test = train_test_split(data.drop(columns=['DEATH_EVENT', 'time']), data['DEATH_EVENT'], test_size=0.3)

    # Instantiate and train using the new partitions (no prediction is needed here)
    rf = RandomForestClassifier(n_estimators=500)
    rf.fit(X_train, y_train)

    # Calculate and rank feature importances
    rf_importance_1 = pd.Series(
        rf.feature_importances_,
        index=rf.feature_names_in_,
        name='mean_impurity_decrease'
    ).sort_values().rank(ascending=False)

    # Calculate permutation importance on training data, as in the paper
    result = permutation_importance(rf, X_train, y_train, n_repeats=5)
    rf_importance_2 = pd.Series(
        result['importances_mean'],
        index=rf.feature_names_in_,
        name='mean_accuracy_decrease'
    ).sort_values().rank(ascending=False)

    # Accumulate rankings
    rf_mdi_importance += rf_importance_1
    rf_perm_importance += rf_importance_2

## Performance evaluation

In [4]:
evaluations = {
    'MCC score': np.max(mcc_scores),
    'F1 score': np.mean(f1_scores),
    'Accuracy': np.max(accuracy_scores),
    'TP rate': np.mean(tp_scores),
    'TN rate': np.mean(tn_scores),
    'PR AUC': np.mean(pr_auc_scores),
    'ROC AUC': np.max(roc_auc_scores),
}
pd.DataFrame.from_dict(evaluations, orient='index', columns=['Random Forest'])

| Metric | Random Forest |
|---|---|
| MCC score | 0.629754 |
| F1 score | 0.52671 |
| Accuracy | 0.85 |
| TP rate | 0.46701 |
| TN rate | 0.86636 |
| PR AUC | 0.635579 |
| ROC AUC | 0.7875 |


The table above reproduces the Random Forest evaluation metrics from Table 4 in the original paper. As in the original paper, ROC AUC, Accuracy and MCC are taken from the best-performing fit out of the 100 repetitions, while the remaining metrics are means. It is not clear why the researchers chose to present the results this way; given the inherent variability due to random train/test splitting, we consider it a mistake, and it also makes the results very difficult to reproduce.

The impact is most visible in the MCC score, where we obtain 0.63 (higher is better), whereas the original paper reported 0.384. Similarly, our best fit reaches 0.85 accuracy, while the researchers reported 0.74. The metrics based on means are much more consistent with the paper. They still do not match it exactly, which is expected given the different train/test splits.
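
The gap between a best-of-100 figure and a mean can be illustrated with simulated scores. The sketch below does not use the paper's data: it assumes, purely for illustration, that per-run accuracy is normally distributed with mean 0.74 and standard deviation 0.04 (both values are hypothetical), and repeats the 100-run experiment 1000 times.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-run accuracies: 1000 simulated experiments of 100 runs each.
simulated = rng.normal(loc=0.74, scale=0.04, size=(1000, 100))

mean_of_runs = simulated.mean(axis=1)  # what reporting the mean would give
best_of_runs = simulated.max(axis=1)   # what reporting the best run would give

print(f"mean-based estimate:  {mean_of_runs.mean():.3f} +/- {mean_of_runs.std():.3f}")
print(f"best-of-100 estimate: {best_of_runs.mean():.3f} +/- {best_of_runs.std():.3f}")
```

Under these assumptions the best-of-100 figure sits well above the true mean and varies more from one experiment to the next, which is why two honest reproductions can disagree substantially on such metrics.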

## Feature importance assessment

Researchers calculated all feature importance rankings on training data. We consider this a methodological mistake - with deep trees grown on such a small number of observations (`n=299`), the results are of little value, since they are likely driven by noise in the data rather than by underlying relationships. The goal of the paper is to use feature importance to select the top features, which are then used to make predictions on the test set with a more parsimonious model. With that goal in mind, it would be preferable in our view to split the data into training, validation and test sets and calculate the relevant importances (e.g. permutation importance) on the validation set. Also, the authors state that the training set is 70% of all observations, which is inconsistent with the 80% stated for model performance assessment.
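
The validation-set alternative suggested above could look roughly as follows. This is a minimal sketch rather than the procedure used in this notebook or in the paper: the 60/20/20 split, the seeds and the variable names are assumptions, and synthetic data from `make_classification` stands in for the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (shape and seed chosen for illustration only).
X_demo, y_demo = make_classification(
    n_samples=299, n_features=11, n_informative=3, random_state=0
)

# Hold out 20% as a test set, then split the remainder 75/25 into
# fitting and validation sets (roughly 60/20/20 overall; an assumption).
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

rf_demo = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_fit, y_fit)

# Permutation importance measured on rows the forest did not see while fitting.
val_result = permutation_importance(rf_demo, X_val, y_val, n_repeats=5, random_state=0)
print(val_result["importances_mean"].round(3))
```

The held-out test set (`X_hold`, `y_hold`) is left untouched here, so it remains available for evaluating a model refitted on the selected features.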

Due to differences in random number generation, it is impossible to fully reproduce the figures in another language. However, the researchers repeated the importance calculation on 100 different splits, which should minimize the differences.

We rank the features by their importance: the higher the decrease in either impurity or accuracy, the lower the rank - i.e. the more important the feature is.

The ranks from individual runs are aggregated using Borda's method, i.e. we sum the individual ranks from multiple runs and rank the sums in ascending order. The same method is applied to aggregate the results of the two feature importance assessment methods.
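
As a small worked example of this aggregation, the sketch below applies it to made-up importances for three hypothetical features over three runs; the feature names and values are illustrative only.

```python
import pandas as pd

# Hypothetical importances: rows are features, columns are runs.
scores = pd.DataFrame(
    {
        "run_1": [0.30, 0.20, 0.05],
        "run_2": [0.25, 0.28, 0.02],
        "run_3": [0.31, 0.15, 0.07],
    },
    index=["feature_a", "feature_b", "feature_c"],
)

# Rank within each run: the largest importance gets rank 1.
ranks = scores.rank(ascending=False)

# Borda aggregation: sum the ranks across runs, then rank the sums ascending.
rank_sums = ranks.sum(axis=1)
borda = rank_sums.rank(method="min")
print(borda.sort_values())
```

Here the rank sums are 4, 5 and 9, so `feature_a` comes first overall even though `feature_b` wins one of the three runs.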


In [5]:
importances = pd.DataFrame(rf_mdi_importance).join(rf_perm_importance)

importances['rank_mdi_dec'] = importances['mean_impurity_decrease_ranksum'].rank()
importances['rank_acc_dec'] = importances['mean_accuracy_decrease_ranksum'].rank()
importances['rank_sum'] = importances['rank_mdi_dec'] + importances['rank_acc_dec']
importances['rank_borda'] = importances['rank_sum'].rank(method='min')
importances = importances.drop(columns=['mean_impurity_decrease_ranksum', 'mean_accuracy_decrease_ranksum'])

importances.sort_values('rank_borda')

| Feature | rank_mdi_dec | rank_acc_dec | rank_sum | rank_borda |
|---|---|---|---|---|
| ejection_fraction | 2.0 | 1.0 | 3.0 | 1.0 |
| serum_creatinine | 1.0 | 2.0 | 3.0 | 1.0 |
| age | 3.0 | 3.0 | 6.0 | 3.0 |
| platelets | 4.0 | 4.0 | 8.0 | 4.0 |
| creatinine_phosphokinase | 5.0 | 5.0 | 10.0 | 5.0 |
| serum_sodium | 6.0 | 6.0 | 12.0 | 6.0 |
| high_blood_pressure | 7.5 | 7.0 | 14.5 | 7.0 |
| anaemia | 7.5 | 8.0 | 15.5 | 8.0 |
| diabetes | 10.0 | 9.0 | 19.0 | 9.0 |
| sex | 9.0 | 11.0 | 20.0 | 10.0 |


The table above reproduces key results presented in Table 8 in the original paper. 

The results are broadly aligned - the top three features are the same. However, in the original paper the authors found the Serum creatinine feature to be the most important, with both methods ranking it first. In our results the two methods disagree on the top feature, so Serum creatinine and Ejection fraction tie for first place in the aggregated ranking. We suspect this is due to the fact that the authors used training data to calculate feature importance, which is prone to overfitting and more likely to be driven by random noise.

Outside of the top three features, the results diverge. For example, we found Platelets to be the 4th most important feature, while the authors found it to be the 6th. We also found Sex to be the 2nd least important feature, while the authors placed it in the middle of the pack.

Our results confirm that the authors' choice of the top two features for the final model was justified. However, the misalignment in the rest of the ranking casts doubt on the robustness of Random Forest feature importance rankings as part of the scientific method.

## Summary

Overall, our key results align with the authors' Random Forest findings. However, a number of decisions by the original paper's authors make it impossible to fully reproduce the results. We believe that the authors should have provided more information on hyperparameter tuning and should have used more robust methods for feature importance assessment. We also believe that the performance assessment was flawed, since the authors reported the best results instead of means for some metrics.