# Logistic Regression Classification

In this notebook we focus on reproducing the results of heart failure prediction with **Logistic Regression**. We will use the same model evaluation metrics as in the original paper and compare the results.

## Usage of Logistic Regression
In the paper, logistic regression appears to be used for several different purposes.

[Table 4](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5/tables/4) shows that a linear model was compared against other machine learning methods. One might assume that logistic regression was used for this binary classification problem. However, the [code](https://github.com/davidechicco/cardiovascular_heart_disease/blob/master/bin/lin_reg_classification.r) the authors made available shows that for [Table 4](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5/tables/4) they used linear regression and classified the regression output with a threshold of 0.5. While this is a possible approach, logistic regression would have been better suited to the task, since it models the probability of the event, bounded between 0 and 1, rather than an unbounded continuous outcome.
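
The difference between the two approaches can be illustrated with a minimal sketch. The data below is synthetic and purely hypothetical (it is not the heart failure dataset), and the variable names are chosen for illustration only. The sketch fits a linear regression on the 0/1 labels and thresholds its output at 0.5, then fits a logistic regression on the same data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic one-feature binary problem (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Linear regression: fit on the 0/1 labels, then threshold the continuous output at 0.5
lin = LinearRegression().fit(X, y)
lin_scores = lin.predict(X)                    # unbounded, can fall outside [0, 1]
lin_labels = (lin_scores >= 0.5).astype(int)

# Logistic regression: the output is a probability bounded between 0 and 1
log = LogisticRegression().fit(X, y)
log_probs = log.predict_proba(X)[:, 1]
log_labels = log.predict(X)

print("linear output range:  ", lin_scores.min(), lin_scores.max())
print("logistic output range:", log_probs.min(), log_probs.max())
```

Both models yield 0/1 labels in the end, but only the logistic output can be read directly as a probability.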

Also, in [Table 10](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5/tables/10) the full model with an added temporal component (*follow-up time*) was used to determine the feature ranking. The authors concluded that *ejection fraction* and *serum creatinine* are the most important features.
To draw this conclusion, the authors did not work with a train/test split but shuffled the dataset 100 times and used the average ranking.

Based on that, the authors decided to compare a full model to a restricted model using only *ejection fraction*, *serum creatinine*, and *follow-up time*, as shown in [Table 11](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5/tables/11), and to evaluate their performance using the performance measures already introduced.

In this reproduction we will focus on the comparison of the restricted and unrestricted models.

The procedure is as follows:
1. Transform time into a factor, as proposed by the authors
2. Split the data into train and test sets
3. Fit the models on the train set 
4. Calculate evaluation metrics
5. Repeat the procedure 100 times and aggregate the results

In [1]:
# Loading libraries
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.inspection import permutation_importance

In [2]:
# Read data from .csv file
data = pd.read_csv('../data/heart_failure_records.csv')

In [3]:
data.head()

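| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |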


In [4]:
# Transform time as a factor and dummy code it
data['time'] = data['time'].astype('string')
data = pd.get_dummies(data,drop_first=True)
time_col = [col for col in data if col.startswith('time_')]
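
To make the effect of this transformation concrete, here is a small sketch on a toy frame. The values are hypothetical, and `columns=["time"]` is passed explicitly here (unlike in the cell above) so the example does not depend on how a given pandas version treats the `string` dtype.

```python
import pandas as pd

# Toy frame with a hypothetical follow-up time column (illustrative values only)
toy = pd.DataFrame({"ejection_fraction": [20, 38, 60, 35],
                    "time": [4, 6, 6, 7]})

# Cast to string so that each distinct follow-up time is treated as a category
toy["time"] = toy["time"].astype("string")

# One indicator column per level; drop_first=True drops the first level,
# which then acts as the reference category
toy_dummies = pd.get_dummies(toy, columns=["time"], drop_first=True)

print(list(toy_dummies.columns))
```

Because the levels are strings, they are ordered lexicographically, so the dropped reference level is the first one in string order rather than necessarily the smallest number.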

In [5]:
data

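| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | ... | time_85 | time_86 | time_87 | time_88 | time_90 | time_91 | time_94 | time_95 | time_96 | time_97 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.00 | 1.9 | 130 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.00 | 1.3 | 129 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.00 | 1.9 | 137 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.00 | 2.7 | 116 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 294 | 62.0 | 0 | 61 | 1 | 38 | 1 | 155000.00 | 1.1 | 143 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 295 | 55.0 | 0 | 1820 | 0 | 38 | 0 | 270000.00 | 1.2 | 139 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 296 | 45.0 | 0 | 2060 | 1 | 60 | 0 | 742000.00 | 0.8 | 138 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 297 | 45.0 | 0 | 2413 | 0 | 38 | 0 | 140000.00 | 1.4 | 140 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |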


There are three key issues we faced when reproducing the results of the original paper:

1. R's `glm` function and sklearn's logistic regression use different solvers, and R does some preprocessing under the hood.
2. To evaluate the restricted model, an unrestricted model was run first. The restricted model was evaluated only if this unrestricted model ranked the two chosen features as the most important ones. It is therefore not transparent how many runs were actually considered for the restricted model.
3. According to the paper, the restricted model uses the top results for each score, but according to the code the mean results were used.

These will be explained in detail in the next sections.

## Model fitting

To remedy some of these issues, the code had to be modified.
As mentioned in issue 1, logistic regression in R and in sklearn uses different regularization methods and cost functions (read more [here](https://github.com/scikit-learn/scikit-learn/issues/6595)). Therefore, some further preprocessing is needed, such as scaling the input data and using another solver, to reproduce the results as closely as possible.
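
One concrete difference is that sklearn's `LogisticRegression` applies an L2 penalty with `C=1.0` by default, whereas R's `glm` fits the model without a penalty. The following sketch, on synthetic and purely hypothetical data, shows how the coefficients change when the penalty is made negligible with a very large `C`. It is meant as an illustration of the effect, not as the configuration used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical, not the heart failure records)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(300) < 1 / (1 + np.exp(-logits))).astype(int)

X_scaled = StandardScaler().fit_transform(X)

# sklearn default: L2 penalty with C=1.0
penalized = LogisticRegression(solver="newton-cg", max_iter=10000).fit(X_scaled, y)

# A very large C makes the penalty negligible, approximating an unpenalized fit
near_unpenalized = LogisticRegression(
    C=1e10, solver="newton-cg", max_iter=10000).fit(X_scaled, y)

print("default C=1.0:", penalized.coef_.round(3))
print("C=1e10:       ", near_unpenalized.coef_.round(3))
```

The penalized coefficients are shrunk towards zero relative to the nearly unpenalized ones, which is one reason why results from the two libraries do not match out of the box.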

As mentioned in issue 2, the code revealed that the restricted model was only run if the unrestricted model yielded the chosen features as the most important ones, so it is not transparent how many runs the results are actually based on. Since the paper did not indicate this, and since more runs should give a more stable result, we base the restricted model on all 100 runs.

In [6]:
results = pd.DataFrame(columns = [
    'Model',
    'MCC score',
    'F1 score',
    'Accuracy',
    'TP rate',
    'TN rate',
    'PR AUC',
    'ROC AUC'])

In [15]:
def model_fitting(data, mode):
    """Fit a logistic regression on 100 random 80/20 splits and return the mean scores."""
    roc_auc_scores = []
    pr_auc_scores = []
    accuracy_scores = []
    f1_scores = []
    tp_scores = []
    tn_scores = []
    mcc_scores = []

    for _ in range(100):
        # Partition data into 80/20 training/test sets
        X_train, X_test, y_train, y_test = train_test_split(
            data.drop(columns=['DEATH_EVENT']), data['DEATH_EVENT'], test_size=0.2)

        # Scale values; the scaler is fit on the training set only
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)

        # Instantiate, train, predict
        lr = LogisticRegression(max_iter=10000, solver='newton-cg', fit_intercept=True)
        lr.fit(X_train, y_train)
        y_pred = lr.predict(X_test)

        # Calculate performance assessment metrics
        # Note: both AUC scores are computed from the hard 0/1 predictions
        roc_auc_scores.append(metrics.roc_auc_score(y_test, y_pred))
        precision, recall, _ = metrics.precision_recall_curve(y_test, y_pred)
        pr_auc_scores.append(metrics.auc(recall, precision))
        accuracy_scores.append(metrics.accuracy_score(y_test, y_pred))
        f1_scores.append(metrics.f1_score(y_test, y_pred))
        tp_scores.append(metrics.recall_score(y_test, y_pred))
        tn_scores.append(metrics.recall_score(y_test, y_pred, pos_label=0))
        mcc_scores.append(metrics.matthews_corrcoef(y_test, y_pred))

    # Aggregate the 100 runs by their mean
    evaluations = {
        'Model': mode,
        'MCC score': np.mean(mcc_scores),
        'F1 score': np.mean(f1_scores),
        'Accuracy': np.mean(accuracy_scores),
        'TP rate': np.mean(tp_scores),
        'TN rate': np.mean(tn_scores),
        'PR AUC': np.mean(pr_auc_scores),
        'ROC AUC': np.mean(roc_auc_scores),
    }
    return evaluations
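
Note that the AUC scores above are computed from the hard 0/1 predictions. In that case the ROC AUC reduces to the mean of the TP rate and the TN rate, which is consistent with the values reported in the results table. As an alternative, the following sketch computes both AUC scores from the predicted probabilities instead. It uses synthetic, hypothetical data and is not what produced the results in this notebook.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(size=400) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

lr = LogisticRegression(max_iter=10000).fit(X_train, y_train)
y_pred = lr.predict(X_test)               # hard 0/1 labels
y_prob = lr.predict_proba(X_test)[:, 1]   # predicted probability of class 1

# ROC AUC from hard labels versus from probabilities
roc_from_labels = metrics.roc_auc_score(y_test, y_pred)
roc_from_probs = metrics.roc_auc_score(y_test, y_prob)

# PR AUC from probabilities
precision, recall, _ = metrics.precision_recall_curve(y_test, y_prob)
pr_from_probs = metrics.auc(recall, precision)

print("ROC AUC (labels):       ", roc_from_labels)
print("ROC AUC (probabilities):", roc_from_probs)
print("PR AUC (probabilities): ", pr_from_probs)
```

Scores based on probabilities evaluate the ranking of all test samples rather than a single operating point at the 0.5 threshold.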

In [16]:
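# Unrestricted model
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
results = pd.concat(
    [results, pd.DataFrame([model_fitting(data, 'Unrestricted Model')])],
    ignore_index=True)

# Restricted model
restricted_cols = ['ejection_fraction', 'serum_creatinine', 'DEATH_EVENT'] + time_col
results = pd.concat(
    [results, pd.DataFrame([model_fitting(data[restricted_cols], 'Restricted Model')])],
    ignore_index=True)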

## Performance evaluation

Coming back to the third identified issue: according to the paper, the authors used the top results for each score for the restricted model. This skews the comparability of the two models. However, the code shows that the mean was also used for the restricted model. Therefore, we reproduce the results based on the mean results for both models.
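
Why the choice of aggregation matters can be seen in a small sketch. The per-run scores below are randomly generated and purely hypothetical; they only illustrate how far the best of 100 runs can lie from the mean of the same runs.

```python
import numpy as np

# Hypothetical per-run scores for two metrics over 100 repetitions
rng = np.random.default_rng(7)
runs = {
    "Accuracy": rng.normal(loc=0.76, scale=0.04, size=100).clip(0, 1),
    "MCC score": rng.normal(loc=0.46, scale=0.10, size=100).clip(-1, 1),
}

# Compare the mean over all runs with the single best run
for name, scores in runs.items():
    print(f"{name}: mean={scores.mean():.3f}, best={scores.max():.3f}")
```

Reporting the best run for one model and the mean for the other would favour the former regardless of the actual model quality.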

In [17]:
results

| | Model | MCC score | F1 score | Accuracy | TP rate | TN rate | PR AUC | ROC AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | Unrestricted Model | 0.458535 | 0.616076 | 0.7655 | 0.586987 | 0.855422 | 0.696579 | 0.721204 |
| 1 | Restricted Model | 0.495961 | 0.652434 | 0.766667 | 0.704014 | 0.800015 | 0.718701 | 0.752015 |
| 2 | Unrestricted Model | 0.443128 | 0.605842 | 0.7595 | 0.581675 | 0.848596 | 0.685476 | 0.715135 |
| 3 | Restricted Model | 0.473181 | 0.627728 | 0.757833 | 0.664513 | 0.811377 | 0.708274 | 0.737945 |
| 4 | Unrestricted Model | 0.461903 | 0.621638 | 0.767333 | 0.601354 | 0.850261 | 0.696609 | 0.725808 |
| 5 | Restricted Model | 0.466696 | 0.622471 | 0.755 | 0.675513 | 0.797346 | 0.698947 | 0.736429 |


## Summary
