# Model Training

## Testing different models, tuning their parameters and comparing performance.

This is a binary classification problem, so I will consider Logistic Regression, LDA, QDA, and K-NN models. I will also evaluate the use of regularisation terms.

We are building a model to predict heart disease from a number of indicators, so **False Negatives are BAD**! We don't want to tell someone they probably don't have heart disease when they do, which makes recall on the heart disease class the metric to watch.
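
To make the focus on false negatives concrete, here is a minimal sketch with made-up labels (not the heart disease data) showing how recall is read off a confusion matrix. The arrays `actual` and `predicted` are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = heart disease, 0 = no heart disease
actual = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
predicted = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

# Recall: share of real heart disease cases the model caught
recall = tp / (tp + fn)
# Precision: share of predicted heart disease cases that were real
precision = tp / (tp + fp)

print(f"false negatives: {fn}, recall: {recall:.2f}, precision: {precision:.2f}")
print("recall via sklearn:", recall_score(actual, predicted))
```

Every false negative lowers recall, so a model that misses fewer sick patients scores higher on exactly the metric we care about.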


In [None]:
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Metrics to evaluate the model
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_curve

In [None]:
# A utility function to generate performance metrics for each model

def generate_metrics(actual, predicted):
    print(classification_report(actual, predicted))
    
    cm = confusion_matrix(actual, predicted)

    plt.figure(figsize = (8, 5))

    labels = ['No Heart Disease', 'Heart Disease']

    # Confusion matrix entries are integer counts, so annotate them as whole numbers
    sns.heatmap(cm, annot = True, fmt = 'd', xticklabels = labels, yticklabels = labels)

    plt.ylabel('Actual')

    plt.xlabel('Predicted')
    
    plt.show()


### 1. Logistic Regression

In [None]:
LR_df = pd.read_csv('../data/preprocessed/heart_preprocessed.csv')

X = LR_df.drop(columns=['target'])
y = LR_df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(type(X_train), type(y_train))

LR_model = LogisticRegression(max_iter=1000)

LR_model.fit(X_train, y_train)

#### LR Model Analysis
Training Dataset

In [None]:
y_train_predicted = LR_model.predict(X_train)
generate_metrics(y_train, y_train_predicted)

Test Dataset

In [None]:
y_test_predicted = LR_model.predict(X_test)
generate_metrics(y_test, y_test_predicted)

Metrics show 84% recall on test data - pretty good for a first attempt! But let's see if we can do better.

#### Parameter Analysis 

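Let's see which features have the biggest impact on a heart disease prediction. Exponentiating the coefficients turns them into odds ratios: values above 1 raise the odds of heart disease, values below 1 lower them.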

In [None]:
LR_model_coeffs = np.exp(LR_model.coef_)

pd.DataFrame(LR_model_coeffs, columns=X_train.columns).T.sort_values(by=0, ascending=False)
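
As a sanity check on how to read the table above, the sketch below uses a hypothetical intercept and coefficient (not values from `LR_model`) to show that `exp(coefficient)` is the factor by which the odds change when a feature goes up by one unit, with everything else held fixed.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a log-odds value to a probability."""
    return 1 / (1 + np.exp(-z))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)

# Hypothetical values, chosen only for illustration
intercept = -1.0
coef = 0.7

# Predicted probability with the feature at 0 and at 1
p_before = sigmoid(intercept + coef * 0)
p_after = sigmoid(intercept + coef * 1)

# The ratio of the two odds should match exp(coef)
odds_ratio = odds(p_after) / odds(p_before)
print(odds_ratio, np.exp(coef))
```

Note that the ranking is only a fair comparison between features measured on similar scales, since a one-unit change means different things for different columns.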

`thal_2` is a big predictor of heart disease. I previously noted that it is also the feature with the highest VIF! Perhaps it should be removed from the data.
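
For reference, a VIF can be computed by regressing each feature on all the others and taking 1 / (1 - R²). The sketch below does this with plain NumPy on synthetic columns; `vif_table` and the `demo` frame are illustrative names, not part of this notebook's earlier analysis, and may differ from how the VIFs were originally computed.

```python
import numpy as np
import pandas as pd

def vif_table(features: pd.DataFrame) -> pd.Series:
    """Variance inflation factor per column: 1 / (1 - R^2) from regressing it on the rest."""
    X = features.to_numpy(dtype=float)
    vifs = {}
    for j, name in enumerate(features.columns):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Add an intercept column so R^2 is measured around the mean
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1 - residuals.var() / target.var()
        vifs[name] = 1 / (1 - r_squared)
    return pd.Series(vifs).sort_values(ascending=False)

# Synthetic demo: 'a_plus_noise' is almost a copy of 'a', 'b' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
demo = pd.DataFrame({"a": a, "b": b, "a_plus_noise": a + 0.1 * rng.normal(size=500)})

print(vif_table(demo))
```

The two nearly identical columns get large VIFs while the independent one stays close to 1, which is the pattern to look for before deciding whether to drop a feature.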

#### Precision-Recall Curve for Logistic Regression

In [None]:
# compute probability of heart disease yes/no for the training data
LR_y_scores = LR_model.predict_proba(X_train)

print(LR_model.classes_)
print('Heart Disease: \n      No ----- Yes')
print(LR_y_scores)


In [None]:
LR_precisions, LR_recalls, LR_thresholds = precision_recall_curve(
    y_train, LR_y_scores[:,-1]
)

# Plot values of precisions, recalls, and thresholds
plt.figure(figsize = (10, 7))

plt.plot(LR_thresholds, LR_precisions[:-1], 'b--', label = 'precision')

plt.plot(LR_thresholds, LR_recalls[:-1], 'g--', label = 'recall')

plt.xlabel('Threshold')

plt.legend(loc = 'upper left')

plt.ylim([0, 1])

plt.show()
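
The `[:-1]` slices above are needed because `precision_recall_curve` returns one more precision and recall value than thresholds: the final pair (precision 1, recall 0) has no threshold attached. A small sketch on toy scores, not the model's output:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and scores, for illustration only
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)

# precisions and recalls are one element longer than thresholds
print(len(precisions), len(recalls), len(thresholds))

# The extra trailing pair is the end point of the curve
print(precisions[-1], recalls[-1])
```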

The above figure shows recall drops off quickly after a threshold of about 0.6. Because recall is critical, let's lower the threshold to 0.3 to see if we can increase recall without compromising precision too much.

In [None]:
point_3_threshold = .3

# Adjusted decisions using the new 0.3 threshold:
# a boolean array that is True where P(Heart Disease) > 0.3, otherwise False
y_train_predicted_point_3 = LR_y_scores[:,1] > point_3_threshold
generate_metrics(y_train, y_train_predicted_point_3)

As expected, recall improved (fewer false negatives), and precision is not terrible either (false positives are still acceptable).
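
Instead of picking 0.3 by eye, the threshold could also be chosen as the highest one that still meets a target recall on the training data, then checked on held-out data. The sketch below does this on synthetic data from `make_classification`; the 0.95 target and all variable names are assumptions for illustration, not results from the heart disease dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
train_scores = model.predict_proba(X_tr)[:, 1]

precisions, recalls, thresholds = precision_recall_curve(y_tr, train_scores)

# Assumed target: catch at least 95% of positive cases on the training data
target_recall = 0.95

# recalls[:-1] lines up with thresholds; keep the highest threshold that meets the target
meets_target = recalls[:-1] >= target_recall
chosen_threshold = thresholds[meets_target].max()

# precision_recall_curve treats score >= threshold as a positive prediction
train_pred = train_scores >= chosen_threshold
test_pred = model.predict_proba(X_te)[:, 1] >= chosen_threshold

print("chosen threshold:", chosen_threshold)
print("train recall:", recall_score(y_tr, train_pred))
print("test recall:", recall_score(y_te, test_pred))
print("test precision:", precision_score(y_te, test_pred))
```

Checking the chosen threshold on the test split matters because a threshold tuned on training data can look better there than it does on unseen patients.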