# Logistic Regression Classification

In this notebook we focus on reproducing the results of heart failure prediction with **Logistic Regression**. We will use the same model evaluation metrics as in the original paper and compare the results.

## Usage of Logistic Regression
In the paper, logistic regression is seemingly used for different purposes. 

We can see in [Table 4](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5/tables/4) that a linear model was compared against other machine learning methods. One might assume that logistic regression was used for this binary classification problem. However, the [code](https://github.com/davidechicco/cardiovascular_heart_disease/blob/master/bin/lin_reg_classification.r) the authors made available shows that for [Table 4](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5/tables/4) they used linear regression and applied a threshold of 0.5 to the regression output to obtain class labels. While this is possible, logistic regression would have been better suited to the task, since it models the probability of the binary outcome directly, whereas linear regression produces an unbounded continuous value.
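To make the difference concrete, here is a minimal sketch on made-up data (the values of `x` and `y` are hypothetical and unrelated to the heart failure dataset). It fits a one-feature least-squares line, thresholds its output at 0.5, and shows that the raw linear score can leave the [0, 1] range, while the logistic (sigmoid) function always maps a score into (0, 1). It does not fit an actual logistic regression.

```python
import math

# Hypothetical toy data: one feature, binary outcome
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0, 0, 0, 1, 0, 1, 1, 1]

# Closed-form ordinary least squares for y = a + b * x
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
b = sxy / sxx
a = mean_y - b * mean_x


def linear_score(value):
    """Raw linear regression output; not restricted to [0, 1]."""
    return a + b * value


def classify(score, threshold=0.5):
    """Turn a continuous score into a class label."""
    return int(score >= threshold)


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


for value in (-5.0, 4.0, 15.0):
    score = linear_score(value)
    print(value, round(score, 3), classify(score), round(sigmoid(score), 3))
```

Thresholding the linear score still yields usable labels, but the score itself cannot be read as a probability for inputs far from the training range.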

Also, in [Table 10](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5/tables/10) the full model with an added temporal component (*follow-up time*) was used to determine the feature ranking. The authors concluded that *ejection fraction* and *serum creatinine* are the most important features.
To draw this conclusion, the authors did not work with a train/test split but shuffled the dataset 100 times and used the average ranking.

Based on that, the authors decided to compare a full model to a restricted model using only *ejection fraction*, *serum creatinine*, and *follow-up time*, as shown in [Table 11](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5/tables/11), and to evaluate their performance using the performance measures already introduced.

In this reproduction we will focus on the comparison of the restricted and unrestricted model.

The procedure is as follows:
1. Transform time into a factor, as proposed by the authors
2. Split the data into train and test sets
3. Fit the models on the train set 
4. Calculate evaluation metrics
5. Repeat the procedure 100 times and aggregate the results

In [1]:
# Loading libraries
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.inspection import permutation_importance

In [2]:
# Read data from .csv file
data = pd.read_csv('../data/heart_failure_records.csv')

In [3]:
data.head()

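| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |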


In [4]:
# Transform time as a factor and dummy code it
data['time'] = data['time'].astype('string')
data = pd.get_dummies(data,drop_first=True)
time_col = [col for col in data if col.startswith('time_')]
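The cell above turns every distinct follow-up time into its own indicator column. As a rough illustration of what dummy coding with a dropped first level does, the following pure-Python sketch uses made-up follow-up times. It assumes that levels are ordered as sorted strings, so the reference level is the lexicographically smallest string (here `'10'`, not `'4'`); the variable names are hypothetical.

```python
# Hypothetical follow-up times (in days) for six patients
times = [4, 6, 7, 7, 8, 10]

# Casting to strings makes every distinct value a category;
# sorting strings is lexicographic, so '10' comes before '4'
levels = sorted({str(t) for t in times})

# The first level is dropped and acts as the reference category
kept_levels = levels[1:]

# One indicator column per remaining level
rows = [
    {f"time_{level}": int(str(t) == level) for level in kept_levels}
    for t in times
]

for t, row in zip(times, rows):
    print(t, row)
```

A patient at the reference level has zeros in all indicator columns. Note that with one column per distinct follow-up time, the number of model features grows with the number of distinct times in the data.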

In [5]:
data

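| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | ... | time_85 | time_86 | time_87 | time_88 | time_90 | time_91 | time_94 | time_95 | time_96 | time_97 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.00 | 1.9 | 130 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.00 | 1.3 | 129 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.00 | 1.9 | 137 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.00 | 2.7 | 116 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 294 | 62.0 | 0 | 61 | 1 | 38 | 1 | 155000.00 | 1.1 | 143 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 295 | 55.0 | 0 | 1820 | 0 | 38 | 0 | 270000.00 | 1.2 | 139 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 296 | 45.0 | 0 | 2060 | 1 | 60 | 0 | 742000.00 | 0.8 | 138 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 297 | 45.0 | 0 | 2413 | 0 | 38 | 0 | 140000.00 | 1.4 | 140 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |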


There are two key issues we faced when reproducing the results of the original paper:
1. To evaluate the restricted model, the authors first ran an unrestricted model and evaluated the restricted model only if the unrestricted model ranked the two chosen features as the most important ones.
2. According to the paper, the restricted model reports the top result for each score, but according to the code the mean results are reported.

These will be explained in detail in the next sections.
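The second issue matters because the two aggregations answer different questions. The sketch below uses simulated scores (hypothetical values drawn from a normal distribution, not results from the paper or this notebook) to show that the best score over 100 runs is systematically more optimistic than the mean over the same runs.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical accuracies from 100 random train/test splits
scores = [random.gauss(0.75, 0.05) for _ in range(100)]

mean_score = statistics.mean(scores)  # typical performance across runs
best_score = max(scores)              # single most favourable run

print(f"mean over runs: {mean_score:.3f}")
print(f"best over runs: {best_score:.3f}")
```

Reporting the maximum describes the luckiest split rather than the expected performance, so numbers aggregated one way cannot be compared directly with numbers aggregated the other way.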

## Model fitting

In [7]:
results = pd.DataFrame(columns = [
    'Model',
    'MCC score',
    'F1 score',
    'Accuracy',
    'TP rate',
    'TN rate',
    'PR AUC',
    'ROC AUC'])

In [9]:
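def model_fitting(data, mode):
    """Fit logistic regression on 100 random 80/20 splits and return the mean scores.

    `mode` is the label stored in the 'Model' entry of the returned dict.
    """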
    roc_auc_scores = []
    pr_auc_scores = []
    accuracy_scores = []
    f1_scores = []
    tp_scores = []
    tn_scores = []
    mcc_scores = []
    
    for i in range(100):
        # Partition data into 80/20 training/test sets
        X_train, X_test, y_train, y_test = train_test_split(
            data.drop(columns=['DEATH_EVENT']),
            data['DEATH_EVENT'],
            test_size=0.2)

        # Instantiate, train, predict
        lr = LogisticRegression(max_iter=10000)
        lr.fit(X_train, y_train)
        y_pred = lr.predict(X_test)

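        # Calculate performance assessment metrics.
        # Note: both AUC scores are computed from the hard class labels in y_pred,
        # not from predicted probabilities.
        roc_auc_scores.append(metrics.roc_auc_score(y_test, y_pred))
        precision, recall, _ = metrics.precision_recall_curve(y_test, y_pred)
        pr_auc_scores.append(metrics.auc(recall, precision))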
        accuracy_scores.append(metrics.accuracy_score(y_test, y_pred))
        f1_scores.append(metrics.f1_score(y_test, y_pred))
        tp_scores.append(metrics.recall_score(y_test, y_pred))
        tn_scores.append(metrics.recall_score(y_test, y_pred, pos_label=0))
        mcc_scores.append(metrics.matthews_corrcoef(y_test, y_pred))
        
    # Average each metric over the 100 runs
    evaluations = {
        'Model': mode,
        'MCC score': np.mean(mcc_scores),
        'F1 score': np.mean(f1_scores),
        'Accuracy': np.mean(accuracy_scores),
        'TP rate': np.mean(tp_scores),
        'TN rate': np.mean(tn_scores),
        'PR AUC': np.mean(pr_auc_scores),
        'ROC AUC': np.mean(roc_auc_scores),
    }
    return evaluations

In [10]:
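# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
results = pd.concat(
    [results, pd.DataFrame([model_fitting(data, 'Unrestricted Model')])],
    ignore_index=True)

In [11]:
restricted_cols = ['ejection_fraction', 'serum_creatinine', 'DEATH_EVENT'] + time_col
results = pd.concat(
    [results, pd.DataFrame([model_fitting(data[restricted_cols], 'Restricted Model')])],
    ignore_index=True)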

In [12]:
results

| | Model | MCC score | F1 score | Accuracy | TP rate | TN rate | PR AUC | ROC AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | Unrestricted Model | 0.326399 | 0.437624 | 0.731 | 0.340221 | 0.92194 | 0.624319 | 0.631081 |
| 1 | Restricted Model | 0.425509 | 0.527581 | 0.769167 | 0.422291 | 0.931022 | 0.673489 | 0.676656 |
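In `model_fitting`, `metrics.roc_auc_score` receives the hard labels returned by `lr.predict`. With only two distinct score values the ROC curve has a single interior point, so the area reduces to the mean of the TP rate and the TN rate. The table above is consistent with this: for the unrestricted model, (0.340221 + 0.92194) / 2 ≈ 0.631081. The sketch below illustrates the effect with a pure-Python AUC on made-up labels and probabilities. The helper `roc_auc` and all data are hypothetical and not part of the original notebook.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and predicted probabilities
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.45, 0.3, 0.4, 0.2, 0.1, 0.6, 0.05]
y_pred = [int(p >= 0.5) for p in y_prob]  # hard labels at threshold 0.5

tp_rate = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / y_true.count(1)
tn_rate = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred)) / y_true.count(0)

auc_from_labels = roc_auc(y_true, y_pred)
auc_from_probs = roc_auc(y_true, y_prob)

print(auc_from_labels, (tp_rate + tn_rate) / 2)  # these two agree
print(auc_from_probs)                            # uses the full ranking
```

If a threshold-independent AUC is wanted, one option would be to pass the positive-class probabilities, e.g. `lr.predict_proba(X_test)[:, 1]`, instead of `y_pred`. This would change the reported AUC values, so the numbers in the table above would no longer apply.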


In [7]:
roc_auc_scores = []
pr_auc_scores = []
accuracy_scores = []
f1_scores = []
tp_scores = []
tn_scores = []
mcc_scores = []


# We do not need to use the same seed as researchers, since random number generator implementations
# are different in R and NumPy, so the partitions will always be different.
# We set this only once, since we want different partitions in each run.
np.random.seed(12345)

# Both performance assessment and feature importance are averaged over 100 runs
for i in range(100):
    # Partition data into 80/20 training/test sets
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(columns=['DEATH_EVENT']),
        data['DEATH_EVENT'],
        test_size=0.2)

    # Instantiate, train, predict
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)

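    # Calculate performance assessment metrics.
    # Note: both AUC scores are computed from the hard class labels in y_pred,
    # not from predicted probabilities.
    roc_auc_scores.append(metrics.roc_auc_score(y_test, y_pred))
    precision, recall, _ = metrics.precision_recall_curve(y_test, y_pred)
    pr_auc_scores.append(metrics.auc(recall, precision))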
    accuracy_scores.append(metrics.accuracy_score(y_test, y_pred))
    f1_scores.append(metrics.f1_score(y_test, y_pred))
    tp_scores.append(metrics.recall_score(y_test, y_pred))
    tn_scores.append(metrics.recall_score(y_test, y_pred, pos_label=0))
    mcc_scores.append(metrics.matthews_corrcoef(y_test, y_pred))


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

## Performance evaluation

In [8]:
evaluations = {
    'MCC score': np.max(mcc_scores),
    'F1 score': np.mean(f1_scores),
    'Accuracy': np.max(accuracy_scores),
    'TP rate': np.mean(tp_scores),
    'TN rate': np.mean(tn_scores),
    'PR AUC': np.mean(pr_auc_scores),
    'ROC AUC': np.max(roc_auc_scores),
}
pd.DataFrame.from_dict(evaluations, orient='index', columns=['Logistic Regression'])

| | Logistic Regression |
|---|---|
| MCC score | 0.629754 |
| F1 score | 0.448444 |
| Accuracy | 0.883333 |
| TP rate | 0.353766 |
| TN rate | 0.913327 |
| PR AUC | 0.618931 |
| ROC AUC | 0.765873 |
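For reference, the label-based scores in this table can all be derived from the four confusion-matrix counts. The following is a minimal pure-Python sketch on made-up labels; the helper names are hypothetical, and the zero-denominator convention for MCC (returning 0) is an assumption chosen here for the sketch.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn


def label_scores(y_true, y_pred):
    """Compute MCC, F1, accuracy, TP rate and TN rate from the counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        # Assumed convention: MCC is 0 when the denominator is 0
        'MCC score': (tp * tn - fp * fn) / denom if denom else 0.0,
        'F1 score': 2 * tp / (2 * tp + fp + fn),
        'Accuracy': (tp + tn) / (tp + tn + fp + fn),
        'TP rate': tp / (tp + fn),
        'TN rate': tn / (tn + fp),
    }


# Hypothetical labels: 3 positives, 5 negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 1, 0]

scores = label_scores(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

In this toy case the accuracy looks moderate while the TP rate is low, the same pattern as in the table above, where a high TN rate coexists with a low TP rate.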


## Feature importance assessment


We rank the features by their importance: the higher the decrease in either impurity or accuracy, the lower the rank - i.e. the more important the feature is.

The ranks from individual runs are aggregated using Borda's method, i.e. the individual ranks from multiple runs are summed and the sums are ranked in ascending order. The same method is applied to aggregate the results of the two feature importance assessment methods.
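The aggregation described above can be sketched in a few lines of pure Python. The per-run ranks below are made up for illustration, and the helper `rank_ascending` is hypothetical; it gives tied sums the smallest shared rank, which is the behaviour of `rank(method='min')` used in the cell below.

```python
def rank_ascending(values):
    """1-based ranks in ascending order; ties share the smallest rank."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


# Hypothetical ranks of three features in two runs (1 = most important)
run_ranks = {
    'ejection_fraction': [1, 2],
    'serum_creatinine': [2, 1],
    'age': [3, 3],
}

# Borda's method: sum the ranks per feature, then rank the sums
rank_sums = {feature: sum(ranks) for feature, ranks in run_ranks.items()}
borda = dict(zip(rank_sums, rank_ascending(list(rank_sums.values()))))

for feature, rank in sorted(borda.items(), key=lambda item: item[1]):
    print(feature, rank_sums[feature], rank)
```

Two features with the same rank sum share rank 1 and the next feature gets rank 3, which is the same tie pattern as `rank_borda` in the table below.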

In [35]:
importances = pd.DataFrame(lr_mdi_importance).join(lr_perm_importance)

importances['rank_mdi_dec'] = importances['mean_impurity_decrease_ranksum'].rank()
importances['rank_acc_dec'] = importances['mean_accuracy_decrease_ranksum'].rank()
importances['rank_sum'] = importances['rank_mdi_dec'] + importances['rank_acc_dec']
importances['rank_borda'] = importances['rank_sum'].rank(method='min')
importances = importances.drop(columns=['mean_impurity_decrease_ranksum', 'mean_accuracy_decrease_ranksum'])

importances.sort_values('rank_borda')

| | rank_mdi_dec | rank_acc_dec | rank_sum | rank_borda |
|---|---|---|---|---|
| ejection_fraction | 2.0 | 1.0 | 3.0 | 1.0 |
| serum_creatinine | 1.0 | 2.0 | 3.0 | 1.0 |
| age | 3.0 | 3.0 | 6.0 | 3.0 |
| platelets | 4.0 | 4.0 | 8.0 | 4.0 |
| creatinine_phosphokinase | 5.0 | 5.0 | 10.0 | 5.0 |
| serum_sodium | 6.0 | 6.0 | 12.0 | 6.0 |
| high_blood_pressure | 7.5 | 7.0 | 14.5 | 7.0 |
| anaemia | 7.5 | 8.0 | 15.5 | 8.0 |
| diabetes | 10.0 | 9.0 | 19.0 | 9.0 |
| sex | 9.0 | 11.0 | 20.0 | 10.0 |


## Summary
