# Heart Disease Prediction: get the most out of Logistic Regression
## Heart Disease Data set exploration
### Attribute Information:
* **Age:** Age <br>
* **Sex:** Sex (1 = male; 0 = female) <br>
* **ChestPain:** Chest pain (typical, asymptomatic, nonanginal, nontypical) <br>
* **RestBP:** Resting blood pressure <br>
* **Chol:** Serum cholesterol in mg/dl <br>
* **Fbs:** Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) <br>
* **RestECG:** Resting electrocardiographic results <br>
* **MaxHR:** Maximum heart rate achieved <br>
* **ExAng:** Exercise induced angina (1 = yes; 0 = no) <br>
* **Oldpeak:** ST depression induced by exercise relative to rest <br>
* **Slope:** Slope of the peak exercise ST segment <br>
* **Ca:** Number of major vessels colored by fluoroscopy (0 - 3) <br>
* **Thal:** (3 = normal; 6 = fixed defect; 7 = reversible defect) <br>
* **target:** AHD - Diagnosis of heart disease (1 = yes; 0 = no) <br>

**Source:** https://archive.ics.uci.edu/ml/datasets/Heart+Disease

**Creators:**

* Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
* University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
* University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
* V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
* Donor: David W. Aha (aha '@' ics.uci.edu) (714) 856-8779

**Data Set Information:**

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date.

The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1, 2, 3, 4) from absence (value 0).

The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.

## Packages Import

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler, binarize
from sklearn.metrics import recall_score, roc_auc_score, confusion_matrix
from math import sqrt
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 1000)

## Data Import and EDA

Let's import our dataset and have a closer look at it.

In [None]:
df = pd.read_csv('../input/heart-disease/heart.csv')
df.head()

In [None]:
df.describe()

There are no NaN values and no negative values in features that must be non-negative, so no additional cleaning is needed. Let's check whether the classes are balanced:

In [None]:
plt.plot()
sns.countplot(x='target', data=df)
plt.title('Target class distribution');
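
The count plot can be complemented with exact numbers. The sketch below is a minimal illustration on a small hypothetical frame (`toy` stands in for `df`; its values are made up for the example) of how class shares and missing values per column could be checked.

```python
import pandas as pd

# Hypothetical stand-in for the heart-disease frame; in the notebook this would be `df`.
toy = pd.DataFrame({
    'age': [63, 37, 41, 56, 57, 45],
    'chol': [233, 250, 204, 236, 354, 199],
    'target': [1, 1, 0, 1, 0, 0],
})

# Share of each class; values near 0.5 indicate a balanced binary target.
class_share = toy['target'].value_counts(normalize=True).sort_index()

# Number of missing values per column; all zeros means there are no NaNs.
missing_per_column = toy.isna().sum()

print(class_share)
print(missing_per_column)
```

Applied to the real data, the same two calls give the class proportions and confirm the absence of NaN values without reading them off a plot.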

The classes are close to balanced. Next, look at the correlation heatmap:

In [None]:
plt.plot()
sns.heatmap(df.corr())
plt.title('Heat Map');
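
A heatmap is easier to read alongside a ranked list. The following sketch, built on synthetic data (the columns `strong`, `weak` and `noise` are hypothetical), shows one way to sort features by the absolute value of their correlation with the target.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: three hypothetical features with different ties to the target.
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    'strong': target + rng.normal(0, 0.3, size=n),   # tightly linked to the target
    'weak': target + rng.normal(0, 3.0, size=n),     # loosely linked to the target
    'noise': rng.normal(0, 1.0, size=n),             # unrelated to the target
    'target': target,
})

# Pearson correlation of every feature with the target, ranked by magnitude.
corr_with_target = (
    toy.corr()['target']
    .drop('target')
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```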

Several features show a noticeable linear correlation with the target: ***slope***, ***cp***, ***exang***, ***oldpeak***, ***ca***. Next, we draw a pair plot of the features. It can take a while to render, but it gives a better understanding of the data and can provide valuable insights.

In [None]:
sns.pairplot(df, hue='target', diag_kws={'bw':0.5});

Some of the conclusions from the plot above:
* Consider the distributions of the features. Some of them differentiate the target rather well: ***thal***, ***slope***, ***exang***, ***restecg***, ***thalach***
* Consider the data points. For some feature pairs the classes can be separated linearly reasonably well.
* The feature ***oldpeak*** plotted against the other features gives the best linear separation of the classes.

We will use a **Logistic Regression** model. It can provide sufficient prediction quality for this data, and it is easy to train and interpret.

## First try for Logistic Regression: base case

### Pre-processing

First, split the data into training and test sets:

In [None]:
X = df.drop(['target'], axis=1).to_numpy()
y = df['target'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, shuffle=True)


Since we are going to use Logistic Regression, we need to scale our data:

In [None]:
scaler = MinMaxScaler()

X_train = scaler.fit_transform(X_train)
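# Reuse the min/max learned from the training set; refitting on the test set
# would scale the two sets inconsistently.
X_test = scaler.transform(X_test)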
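
A common convention is to learn the scaling parameters from the training split only and reuse them on the test split. The sketch below illustrates this on synthetic data (`X_demo`, `y_demo` and the other names are hypothetical stand-ins for the notebook's arrays). It also shows a `Pipeline`, which applies the same rule automatically, including inside cross-validation folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the heart-disease features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Manual version: fit on the training rows, then only transform the test rows.
scaler_demo = MinMaxScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

# Pipeline version: the scaler is fitted on whatever data `fit` receives
# and reused unchanged when predicting or scoring.
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(C=0.2, solver='liblinear'))
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print('Test accuracy: {:.4f}'.format(test_accuracy))
```

Passing such a pipeline to `GridSearchCV` instead of the bare estimator keeps the scaler from seeing the validation fold during the search.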

### Grid Search

The amount of data is small, thus we can afford an extensive grid search for the best model parameters.

In [None]:
log_reg = LogisticRegression(solver='liblinear')
params = {
    'penalty': ['l1', 'l2'],
    'C': np.linspace(0, 0.6, 20)[1:]  # drop C=0: the inverse regularization strength must be positive
}

In [None]:
scoring_list = ['accuracy', 'f1', 'precision', 'recall', 'roc_auc']
fig, axs = plt.subplots(1, len(scoring_list), figsize=(30,5))
for ax, scoring in zip(axs, scoring_list):
    # Run the grid search once per metric and plot the mean CV score against C
    grd = GridSearchCV(log_reg, params, scoring=scoring, cv=5)
    grd.fit(X_train, y_train)
    res = pd.DataFrame({
        'mean_test_score': grd.cv_results_['mean_test_score'],
        'C': [param['C'] for param in grd.cv_results_['params']],
        'penalty': [param['penalty'] for param in grd.cv_results_['params']],
    })
    sns.lineplot(x='C', y='mean_test_score', hue='penalty', data=res, ax=ax)
    ax.set_title('Scoring: ' + scoring)

Conclusions:
* ***l2*** outperforms ***l1*** for ***C*** < 0.3
* ***l2*** gives a better ***ROC-AUC*** regardless of ***C***
* For this task we should focus on a higher ***recall*** score, because in medical diagnostics it is more important to reduce false negative results

Considering these facts, we choose the ***l2*** penalty and ***C*** = 0.2. This gives a reasonably good ***ROC-AUC*** and a high ***recall***.
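
The per-metric comparison can also be obtained from a single search, because `GridSearchCV` accepts a list of scorers. The sketch below uses synthetic data and a coarser, hypothetical grid. `refit='recall'` is an assumption chosen to match the focus on recall stated above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

param_grid = {
    'penalty': ['l1', 'l2'],
    'C': np.linspace(0.05, 0.6, 5),  # hypothetical coarse grid; C must be strictly positive
}
metrics = ['accuracy', 'recall', 'roc_auc']

search = GridSearchCV(
    LogisticRegression(solver='liblinear'),
    param_grid,
    scoring=metrics,   # every metric is evaluated in the same CV run
    refit='recall',    # the metric that decides best_params_
    cv=5,
)
search.fit(X_demo, y_demo)

# One row per parameter combination, one column per metric.
results = pd.DataFrame(search.cv_results_)
summary = results[['param_penalty', 'param_C'] + ['mean_test_' + m for m in metrics]]
print(summary)
print(search.best_params_)
```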

### Logistic Regression training and results

Let's train the model and check the performance on the test data:

In [None]:
log_reg = LogisticRegression(C=0.2, penalty='l2', solver='liblinear')
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

print('Recall: {:.4f}'.format(recall_score(y_test, y_pred)))
print('ROC-AUC: {:.4f}'.format(roc_auc_score(y_test, y_pred)))
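
The ROC-AUC above is computed from hard class labels. For binary labels that value coincides with balanced accuracy, while the ranking-based ROC-AUC is obtained from predicted probabilities. The sketch below shows the difference on synthetic data (all names are hypothetical stand-ins).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart-disease data.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

clf = LogisticRegression(C=0.2, solver='liblinear').fit(X_tr, y_tr)

hard_labels = clf.predict(X_te)              # 0/1 predictions at the 0.5 threshold
proba_pos = clf.predict_proba(X_te)[:, 1]    # probability of the positive class

auc_from_labels = roc_auc_score(y_te, hard_labels)   # equals balanced accuracy
auc_from_proba = roc_auc_score(y_te, proba_pos)      # threshold-independent ranking quality
balanced_acc = balanced_accuracy_score(y_te, hard_labels)

print('ROC-AUC from labels:        {:.4f}'.format(auc_from_labels))
print('Balanced accuracy:          {:.4f}'.format(balanced_acc))
print('ROC-AUC from probabilities: {:.4f}'.format(auc_from_proba))
```

The label-based value changes with the classification threshold, while the probability-based value does not.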

Now look at the feature importances:

In [None]:
def get_features_importance(df, target, lin_predictor):
    '''
    Input:
        `df` - Pandas DataFrame that was used for model training. It stores the feature names.
        `target` - target column name.
        `lin_predictor` - linear model that was trained on this data.
    Output:
        `features_importance` - Pandas DataFrame that stores the absolute values of the
            coefficients of the linear model, used here as feature importance.
            The magnitudes are comparable only when the features share a common scale.
        barplot of features importance
    '''
    features = [col for col in df.columns if col != target]
    importance = np.abs(lin_predictor.coef_[0])
    features_importance = (
        pd.DataFrame({'Feature': features, 'Importance': importance})
        .sort_values('Importance', ascending=False)
        .reset_index(drop=True)
    )
    fig, ax = plt.subplots(figsize=(25, 5))
    bar = sns.barplot(x='Feature', y='Importance', data=features_importance, ax=ax)
    # Rotate the tick labels that seaborn already placed in sorted order;
    # relabelling with the unsorted `features` list would mislabel the bars.
    ax.tick_params(axis='x', rotation=90)
    ax.set_title('Features Importance')
    for index, row in features_importance.iterrows():
        bar.text(index, row['Importance'], round(row['Importance'], 4), color='black', ha="center", fontsize=8)

    return features_importance

In [None]:
features_importance = get_features_importance(df, 'target', log_reg)

## Second try for Logistic Regression: adding polynomial features

Now we add some polynomial features. According to the **pair plot**, they can help to separate the classes. We add interactions of the feature ***oldpeak*** with the others.

In [None]:
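def get_polynomial_features(df, feature, features_interaction):
    '''
    Build degree-2 terms between `feature` and every column in `features_interaction`:
    feature^2, sqrt(2) * feature * f and f^2.
    '''
    df_res = pd.DataFrame()
    # The squared base feature is the same for every pair, so compute it once.
    df_res[feature + '^2'] = df[feature]**2
    for f in features_interaction:
        df_res[feature + '-' + f] = sqrt(2) * df[feature] * df[f]
        df_res[f + '^2'] = df[f]**2
    return df_res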
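
One plausible reading of the `sqrt(2)` factor in `get_polynomial_features` is that the three terms form the explicit feature map of the homogeneous degree-2 polynomial kernel, for which $\phi(x)\cdot\phi(z) = (x\cdot z)^2$. The short check below verifies this identity numerically. The function `phi` and the vectors `a` and `b` are hypothetical names introduced only for the illustration.

```python
import numpy as np

def phi(x1, x2):
    """Explicit degree-2 feature map: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

# Two arbitrary 2-dimensional points.
a = np.array([1.5, -2.0])
b = np.array([0.5, 3.0])

lhs = phi(*a) @ phi(*b)   # inner product in the expanded feature space
rhs = (a @ b) ** 2        # squared inner product in the original space

# Both values equal 27.5625 up to floating-point rounding.
print(lhs, rhs)
```

Without the `sqrt(2)` factor the cross term would carry half the weight it has in the kernel expansion.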

In [None]:
all_features = list(df.columns.values)
all_features.remove('target')
df_oldpeak = get_polynomial_features(df, 'oldpeak', all_features)

In [None]:
df_2 = pd.concat([df, df_oldpeak], axis=1)
df_2.head()

In [None]:
X_2 = df_2.drop(['target'], axis=1).to_numpy()
y_2 = df_2['target'].to_numpy()

X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_2, y_2, test_size=0.3, random_state=0, shuffle=True)

In [None]:
log_reg_2 = LogisticRegression(C=0.2, penalty='l2', solver='liblinear')
log_reg_2.fit(X_train_2, y_train_2)
y_pred_2 = log_reg_2.predict(X_test_2)

In [None]:
print('Recall: {:.4f}'.format(recall_score(y_test_2, y_pred_2)))
print('ROC-AUC: {:.4f}'.format(roc_auc_score(y_test_2, y_pred_2)))

The results have clearly improved compared with the base case.

In [None]:
features_importance_2 = get_features_importance(df_2, 'target', log_reg_2)

## Third try for Logistic Regression: remove unimportant features

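However, with so many features we risk overfitting. Let's drop the least important features and check the model's performance. Note that the polynomial features were not rescaled, so the coefficient magnitudes used for this selection depend on each feature's units.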

In [None]:
features_to_remove = features_importance_2[features_importance_2['Importance'] < 0.01]['Feature'].values

In [None]:
df_3 = df_2.drop(features_to_remove, axis=1)
df_3.head()

In [None]:
X_3 = df_3.drop(['target'], axis=1).to_numpy()
y_3 = df_3['target'].to_numpy()

X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X_3, y_3, test_size=0.3, random_state=0, shuffle=True)

In [None]:
log_reg_3 = LogisticRegression(C=0.2, penalty='l2', solver='liblinear')
log_reg_3.fit(X_train_3, y_train_3)
y_pred_3 = log_reg_3.predict(X_test_3)

In [None]:
print('Recall: {:.4f}'.format(recall_score(y_test_3, y_pred_3)))
print('ROC-AUC: {:.4f}'.format(roc_auc_score(y_test_3, y_pred_3)))

Feature selection has slightly improved the results again.

In [None]:
features_importance_3 = get_features_importance(df_3, 'target', log_reg_3)

## Results Evaluation

We need to check whether we can tune the model by adjusting the classification threshold. First of all, have a look at the confusion matrix.

In [None]:
def get_confusion_matrix(y_test, y_pred):
    # With labels=[1, 0], rows are actual classes and columns are predicted classes,
    # both ordered positive first.
    cm = confusion_matrix(y_test, y_pred, labels=[1, 0])
    cm_df = pd.DataFrame({'Actual Positive': cm[0, :], 'Actual Negative': cm[1, :]},
                         index=['Predicted\nPositive', 'Predicted\nNegative'])
    sns.heatmap(cm_df, annot=True, cbar=False)
    plt.show()

In [None]:
get_confusion_matrix(y_test_3, y_pred_3)

There are 3 false negative results, which is undesirable for a medical test. Can we change the threshold to get rid of them?

In [None]:
y_pred_proba = log_reg_3.predict_proba(X_test_3)

In [None]:
thresholds = np.linspace(0, 1, 100)
recall_list = []
roc_auc_list = []
for threshold in thresholds:
    y_pred_shifted = binarize(y_pred_proba, threshold=threshold)[:,1]
    recall_list.append(recall_score(y_test_3, y_pred_shifted))
    roc_auc_list.append(roc_auc_score(y_test_3, y_pred_shifted))

In [None]:
plt.figure(figsize=(10, 5))
plt.plot(thresholds, recall_list, label='Recall')
plt.plot(thresholds, roc_auc_list, label='ROC-AUC')
plt.xticks(np.linspace(0, 1, 25), rotation=90)
plt.legend(title='Metric:', loc=3)
plt.grid()
plt.title('Threshold vs Metrics');

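The plot shows that the default threshold of 0.5 provides the best result. Recall can be improved by lowering the threshold, but this causes a significant decrease in ROC-AUC. Note that ROC-AUC is computed here from thresholded labels, which makes it equal to balanced accuracy and therefore dependent on the threshold.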
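
If a minimum recall is required, the threshold can also be chosen directly instead of being read off a plot. The sketch below uses `precision_recall_curve` on synthetic data. The value `target_recall = 0.95` and all variable names are hypothetical, and in practice the threshold would normally be chosen on a validation split rather than on the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart-disease data.
X_demo, y_demo = make_classification(n_samples=400, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

clf = LogisticRegression(C=0.2, solver='liblinear').fit(X_tr, y_tr)
proba_pos = clf.predict_proba(X_te)[:, 1]

# precision and recall have one more entry than thresholds;
# the extra last point (recall = 0) has no threshold attached.
precision, recall, thresholds = precision_recall_curve(y_te, proba_pos)

target_recall = 0.95  # hypothetical requirement
meets_target = np.where(recall[:-1] >= target_recall)[0]
chosen_threshold = thresholds[meets_target[-1]]  # highest threshold that still meets the target

# precision_recall_curve treats a score >= threshold as a positive prediction.
y_pred_chosen = (proba_pos >= chosen_threshold).astype(int)
achieved_recall = recall_score(y_te, y_pred_chosen)

print('Chosen threshold: {:.4f}'.format(chosen_threshold))
print('Recall at that threshold: {:.4f}'.format(achieved_recall))
print('Precision at that threshold: {:.4f}'.format(precision[meets_target[-1]]))
```

Taking the highest threshold that satisfies the recall requirement keeps the number of false positives as low as that requirement allows.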