# Logistic Regression Classification

In this notebook we focus on reproducing the results of heart failure prediction with **Logistic Regression**. We will use the same model evaluation metrics as in the original paper and compare the results.

In the paper, logistic regression appears to serve several purposes.
[Table 4](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5/tables/4) shows a linear model compared against other machine learning methods.<br>
[Table 10](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5/tables/10) uses the full model, including a temporal component (*follow-up time*), to determine the feature ranking.<br>
Based on that ranking, the authors compare the full model to a restricted model that uses only *ejection fraction*, *serum creatinine*, and *follow-up time*, as shown in [Table 11](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5/tables/11).

In this reproduction we focus on comparing the restricted and unrestricted models.

The procedure is as follows:
1. Transform follow-up time to months, encoded as a factor
2. Split the data into train and test sets
3. Fit the model on the train set 
4. Calculate evaluation metrics
5. Calculate feature importance
6. Repeat the procedure 100 times and aggregate the results
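
Step 1 of the procedure can be sketched as follows. This is a minimal illustration, not the paper's exact transformation: it assumes that `time` is recorded in days and that a month can be approximated as 30 days. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Toy follow-up times in days (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({"time": [4, 30, 61, 95, 250]})

# Assumption: one month is approximated as 30 days
toy["month"] = (toy["time"] // 30).astype("category")

# One-hot encode the factor so that a linear model can consume it
month_dummies = pd.get_dummies(toy["month"], prefix="month")
print(month_dummies.columns.tolist())
```

Encoding the month as a factor gives each month its own coefficient instead of forcing a single linear effect of follow-up time.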

In [7]:
# Loading libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.inspection import permutation_importance

In [8]:
# Read data from .csv file
data = pd.read_csv('../data/heart_failure_records.csv')

There are four key issues we faced when reproducing the results of the original paper:
1. Use of Linear Regression instead of Logistic Regression for a binary classification model 
2. Inconsistent train/test splitting during model fitting and feature importance evaluation
3. Calculating feature importance without train/test splitting
4. Using the mean for the evaluation metrics but comparing them to the best results out of 100 for an alternative logistic regression

These will be explained in detail in the next sections.
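
To illustrate why issue 4 matters, the sketch below compares the mean and the best value of a metric over 100 runs. The scores are simulated with hypothetical parameters (mean 0.75, standard deviation 0.05) and do not come from the paper or from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical accuracy scores from 100 repeated train/test splits
scores = rng.normal(loc=0.75, scale=0.05, size=100).clip(0, 1)

mean_score = scores.mean()  # typical performance across splits
best_score = scores.max()   # most favourable split only

print(f"mean: {mean_score:.3f}, best of 100: {best_score:.3f}")
```

The best of 100 runs can never be lower than the mean, so comparing one model's mean against another model's best run favours the latter regardless of its actual quality.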

## Model fitting

In [5]:
roc_auc_scores = []
pr_auc_scores = []
accuracy_scores = []
f1_scores = []
tp_scores = []
tn_scores = []
mcc_scores = []

# Instantiate series for accumulating feature importance rankings
lr_mdi_importance = pd.Series(0, index=data.drop(columns=['DEATH_EVENT', 'time']).columns, name='mean_impurity_decrease_ranksum')
lr_perm_importance = pd.Series(0, index=data.drop(columns=['DEATH_EVENT', 'time']).columns, name='mean_accuracy_decrease_ranksum')

# We do not need to use the same seed as the researchers: the random number generator
# implementations in R and NumPy differ, so the partitions will differ regardless.
# We set the seed only once, since we want a different partition in each run.
np.random.seed(12345)

# Both performance assessment and feature importance are averaged over 100 runs
for i in range(100):
    # Partition data into 80/20 training/test sets
    X_train, X_test, y_train, y_test = train_test_split(data.drop(columns=['DEATH_EVENT', 'time']), data['DEATH_EVENT'], test_size=0.2)

    # Instantiate, train, predict
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)

    # Calculate performance assessment metrics
    roc_auc_scores.append(metrics.roc_auc_score(y_test, y_pred))
    precision, recall, _ = metrics.precision_recall_curve(y_test, y_pred)
    pr_auc_scores.append(metrics.auc(recall, precision))
    accuracy_scores.append(metrics.accuracy_score(y_test, y_pred))
    f1_scores.append(metrics.f1_score(y_test, y_pred))
    tp_scores.append(metrics.recall_score(y_test, y_pred))
    tn_scores.append(metrics.recall_score(y_test, y_pred, pos_label=0))
    mcc_scores.append(metrics.matthews_corrcoef(y_test, y_pred))

#     # Partition data into 70/30 training/test - researchers used different split than in performance assessment
#     X_train, X_test, y_train, y_test = train_test_split(data.drop(columns=['DEATH_EVENT', 'time']), data['DEATH_EVENT'], test_size=0.3)

#     # Instantiate, train, predict using new partitions
#     lr = LogisticRegression()
#     lr.fit(X_train, y_train)

#     # Calculate and rank feature importances
#     lr_importance_1 = pd.Series(
#         lr.feature_importances_,
#         index=lr.feature_names_in_,
#         name='mean_impurity_decrease'
#     ).sort_values().rank(ascending=False)

#     # Calculate permutation importance on training data, as in the paper
#     result = permutation_importance(lr, X_train, y_train, n_repeats=5)
#     lr_importance_2 = pd.Series(
#         result['importances_mean'],
#         index=lr.feature_names_in_,
#         name='mean_accuracy_decrease'
#     ).sort_values().rank(ascending=False)

#     # Accumulate rankings
#     lr_mdi_importance += lr_importance_1
#     lr_perm_importance += lr_importance_2

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
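
The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of both remedies follows, using synthetic data from `make_classification` rather than the heart failure records. The inflated first column is an assumption meant to mimic a feature on a much larger scale than the others, and `max_iter=1000` is an arbitrary illustrative value.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the heart failure dataset)
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X[:, 0] *= 1e5  # mimic one feature measured on a much larger scale

# Standardize the features, then fit with a higher iteration cap
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Number of solver iterations actually used
n_iter = model.named_steps["logisticregression"].n_iter_[0]
print(n_iter)
```

Putting the scaler inside the pipeline means it is fitted on the training data only when the pipeline is used with a train/test split.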


## Performance evaluation

In [6]:
evaluations = {
    'MCC score': np.max(mcc_scores),
    'F1 score': np.mean(f1_scores),
    'Accuracy': np.max(accuracy_scores),
    'TP rate': np.mean(tp_scores),
    'TN rate': np.mean(tn_scores),
    'PR AUC': np.mean(pr_auc_scores),
    'ROC AUC': np.max(roc_auc_scores),
}
pd.DataFrame.from_dict(evaluations, orient='index', columns=['Logistic Regression'])

| | Random Forest |
|---|---|
| MCC score | 0.629754 |
| F1 score | 0.443796 |
| Accuracy | 0.883333 |
| TP rate | 0.350412 |
| TN rate | 0.91606 |
| PR AUC | 0.62369 |
| ROC AUC | 0.765873 |
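
The model-fitting cell passes hard class predictions (`y_pred`) to `roc_auc_score` and `precision_recall_curve`. Both functions also accept continuous scores, which lets the curves be traced over all thresholds. The sketch below shows that alternative on synthetic data. It is not what the paper or the cell above does, and the dataset and parameters are hypothetical.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the heart failure dataset)
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_label = clf.predict(X_test)              # hard 0/1 predictions
y_score = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# ROC AUC from labels (single operating point) vs. from scores (full curve)
auc_from_labels = metrics.roc_auc_score(y_test, y_label)
auc_from_scores = metrics.roc_auc_score(y_test, y_score)

# PR AUC from scores, using the same trapezoidal rule as the cell above
precision, recall, _ = metrics.precision_recall_curve(y_test, y_score)
pr_auc_from_scores = metrics.auc(recall, precision)

print(auc_from_labels, auc_from_scores, pr_auc_from_scores)
```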


## Feature importance assessment


We rank the features by their importance: the higher the decrease in either impurity or accuracy, the lower the rank - i.e. the more important the feature is.

The ranks from individual runs are aggregated using Borda's method - i.e. the individual ranks from multiple runs are summed and the sums are ranked in ascending order. The same method is applied to aggregate the results of the two feature importance assessment methods.
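
As a small worked example of this aggregation, the sketch below applies it to three hypothetical runs over three hypothetical features. The importance values are made up for illustration.

```python
import pandas as pd

# Hypothetical importance scores from three runs (higher = more important)
runs = pd.DataFrame(
    {
        "run_1": [0.30, 0.20, 0.10],
        "run_2": [0.25, 0.35, 0.05],
        "run_3": [0.40, 0.10, 0.20],
    },
    index=["feature_a", "feature_b", "feature_c"],
)

# Rank within each run: rank 1 is the most important feature
per_run_ranks = runs.rank(ascending=False)

# Borda aggregation: sum the ranks, then rank the sums in ascending order
rank_sums = per_run_ranks.sum(axis=1)
borda_rank = rank_sums.rank(method="min")

print(pd.DataFrame({"rank_sum": rank_sums, "rank_borda": borda_rank}))
```

Here `feature_a` has rank sum 4, `feature_b` has 6, and `feature_c` has 8, so the Borda ranks are 1, 2, and 3 respectively.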

In [35]:
importances = pd.DataFrame(lr_mdi_importance).join(lr_perm_importance)

importances['rank_mdi_dec'] = importances['mean_impurity_decrease_ranksum'].rank()
importances['rank_acc_dec'] = importances['mean_accuracy_decrease_ranksum'].rank()
importances['rank_sum'] = importances['rank_mdi_dec'] + importances['rank_acc_dec']
importances['rank_borda'] = importances['rank_sum'].rank(method='min')
importances = importances.drop(columns=['mean_impurity_decrease_ranksum', 'mean_accuracy_decrease_ranksum'])

importances.sort_values('rank_borda')

| | rank_mdi_dec | rank_acc_dec | rank_sum | rank_borda |
|---|---|---|---|---|
| ejection_fraction | 2.0 | 1.0 | 3.0 | 1.0 |
| serum_creatinine | 1.0 | 2.0 | 3.0 | 1.0 |
| age | 3.0 | 3.0 | 6.0 | 3.0 |
| platelets | 4.0 | 4.0 | 8.0 | 4.0 |
| creatinine_phosphokinase | 5.0 | 5.0 | 10.0 | 5.0 |
| serum_sodium | 6.0 | 6.0 | 12.0 | 6.0 |
| high_blood_pressure | 7.5 | 7.0 | 14.5 | 7.0 |
| anaemia | 7.5 | 8.0 | 15.5 | 8.0 |
| diabetes | 10.0 | 9.0 | 19.0 | 9.0 |
| sex | 9.0 | 11.0 | 20.0 | 10.0 |
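
Issues 2 and 3 from the introduction concern which data the feature importance is computed on. One possible alternative, sketched below on synthetic data, reuses a single train/test split for both fitting and importance assessment and computes permutation importance on the held-out test set. This is an assumption about how the procedure could be made consistent, not the method used in the paper. The feature names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical feature names
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])

# One split, used for both model fitting and importance assessment
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Permutation importance measured on held-out data
result = permutation_importance(clf, X_test, y_test, n_repeats=5, random_state=0)
perm_rank = pd.Series(
    result["importances_mean"], index=X.columns, name="mean_accuracy_decrease"
).rank(ascending=False)

print(perm_rank.sort_values())
```

Measuring importance on held-out data reflects how much each feature contributes to generalization rather than to the fit on the training data.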


## Summary
