# Heart Disease Data Set

by Theodor Lanzer

Task: Determine whether heart disease is present in a given patient.

Classification Problem

#1. Problem Definition and description of data

The selected dataset consists of physical attributes of human patients. Some features are measured values; others are reported subjectively by the patient. The features are described in more detail below. The target variable is whether heart disease is present in the patient or not.

This is a binary classification problem: heart disease is either present or not.

The dataset was published by the V.A. Medical Center, Long Beach, and the Cleveland Clinic Foundation.

Now let's look at the attributes of this data set.

Attribute Information:

The Problem has the following Inputs:

1.   **age**: in years
2.   **sex**:

  *   Value 1: male
  *   Value 0: female


3.   **cp**: chest pain 

  *   Value 1: typical angina
  *   Value 2: atypical angina
  *   Value 3: non-anginal pain
  *   Value 4: asymptomatic 


4.   **trestbps**: resting blood pressure (in mm Hg)

5.   **chol**: serum cholesterol in mg/dl
6.   **fbs**: fasting blood sugar > 120 mg/dl

   *    Value 1: true
   *    Value 0: false



7.   **restecg**:  resting electrocardiographic results

  *   Value 0: normal
  *   Value 1: having ST-T wave abnormality
  *   Value 2: showing probable or definite left ventricular hypertrophy


8.   **thalach**: maximum heart rate achieved



9.   **exang**: exercise induced angina

  *   Value 1: yes
  *   Value 0: no


10.   **oldpeak**: ST depression induced by exercise relative to rest


11.   **slope**: the slope of the peak exercise ST segment
 
  *   Value 1: upsloping
  *   Value 2: flat
  *   Value 3: downsloping


 
12.   **ca**: number of major vessels (0-3) colored by fluoroscopy


13.   **thal**: A blood disorder called 'Thalassemia':

  *   Value 3: normal
  *   Value 6: fixed defect
  *   Value 7: reversible defect

Output:



1.  **Heartdisease**:

  *   Value 1: present
  *   Value 0: not present


To get started with the topic, I researched which factors promote heart disease.
According to the NHS (1), risk factors for developing heart disease include high cholesterol, high blood pressure, diabetes, being overweight, family history and smoking. I would like to find out whether these hypotheses can be confirmed with the present data set.

To measure my success, I decided to use accuracy, which gives the percentage of patients that were classified correctly. Additionally, I look at the area under the ROC curve as well as precision and recall.

Source:
(1) https://www.nhs.uk/conditions/cardiovascular-disease/















#2. Preparing the environment

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (12,6)
import matplotlib as mpl
import plotly.express as px
import matplotlib.cm as cm
import seaborn as sns
sns.set_theme()
import os

In [None]:
# Data Preparation
from sklearn import preprocessing as pp
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error



In [None]:
# ML Algorithms to be used
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
# Import the classifier class itself rather than the module under the class name
from lightgbm import LGBMClassifier
from sklearn.svm import SVR
from sklearn.metrics import confusion_matrix, plot_confusion_matrix
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

import tensorflow as tf
from tensorflow import keras
from keras import optimizers, models, layers, regularizers
tf.__version__


# 3.1 Import Data from Kaggle

Before I start the analysis, I want to get an overview of the data and check whether there are any discrepancies. As a first step, I look at the correlation matrix. One discrepancy is visible right away: the feature chest pain (cp) takes values from 0 to 3 in this file, whereas the description lists values from 1 to 4.

In [None]:

data = pd.read_csv('../input/heart-disease-uci/heart.csv')
data.head()

## 3.1.2 Is Kaggle Data Set right? A look at the correlation matrix


In [None]:
correlationMatrix = data.corr() 

f = plt.figure(figsize=(15, 8))
plt.matshow(correlationMatrix, fignum=f.number, cmap='viridis')
plt.xticks(range(data.shape[1]), data.columns, fontsize=14, rotation=75)
plt.yticks(range(data.shape[1]), data.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.show()

In [None]:
correlationMatrix.style.background_gradient(cmap='viridis').set_precision(2)

After looking at the correlation matrix, I noticed that the feature exang is negatively correlated with target (heartdisease). This would mean that angina induced by exercise reduces the risk of heart disease. Angina is a type of chest pain caused by reduced blood flow to the heart (1), so this makes no sense. In addition, according to this data, younger people would be more likely to be affected by heart disease. The feature ca also takes different values than described. Something seems to be wrong, so I did some more research and found out that the target values are reversed. I have therefore decided to import the data set from the original source and prepare it myself.


Source:

(1) https://www.mayoclinic.org/diseases-conditions/angina/symptoms-causes/syc-20369373#:~:text=But%20when%20you%20increase%20the,arteries%20slow%20down%20blood%20flow.
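
A check like the one described above can also be done programmatically by comparing the values each categorical column actually takes with the documented ones. The following is only a sketch on a small hand-made frame, not on the Kaggle file; the `documented` mapping and the helper name `find_undocumented_values` are assumptions made for illustration.

```python
import pandas as pd

# Documented value sets, taken from the attribute list at the top of this notebook.
documented = {
    "cp": {1, 2, 3, 4},
    "slope": {1, 2, 3},
    "thal": {3, 6, 7},
}

def find_undocumented_values(df, documented):
    """Return {column: sorted list of values that the documentation does not list}."""
    problems = {}
    for column, allowed in documented.items():
        unexpected = set(df[column].unique().tolist()) - allowed
        if unexpected:
            problems[column] = sorted(unexpected)
    return problems

# Toy data imitating a recoded file (cp shifted to 0..3); this is not the real dataset.
toy = pd.DataFrame({
    "cp": [0, 1, 2, 3],
    "slope": [1, 2, 3, 2],
    "thal": [3, 6, 7, 3],
})
problems = find_undocumented_values(toy, documented)
print(problems)
```

An empty result would mean that every column stays within its documented values; here only cp is reported.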

#3.2 Import Data from original source 
The Kaggle Dataset has a few inconsistencies. Therefore I import the dataset from the original website (https://archive.ics.uci.edu/ml/datasets/Heart+Disease) and prepare it myself.

In [None]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data'
new_names = ['age','sex','cp','trestbps','chol','fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'heartdisease']
dataframe = pd.read_csv(url, names=new_names)
dataframe.head()

#4. Preprocessing Data 

First, I check for missing values and data types.

In [None]:
dataframe.info()

The features **ca** and **thal** are displayed as object type, which is wrong. They should be numeric like the other features. After investigating, I found that some rows contain a **?** instead of a value. I decided to remove those rows.

Remove question marks from **ca**
1. Find the indices of the rows that contain a question mark and remove the whole rows
2. Change the datatype to float

In [None]:
index_invalid_ca = dataframe[dataframe.ca == '?'].index
dataframe.drop(index_invalid_ca, inplace = True)
dataframe.ca = pd.to_numeric(dataframe.ca, downcast = 'float')

Remove question marks from **thal**

In [None]:
index_invalid_thal = dataframe[dataframe.thal == '?'].index
dataframe.drop(index_invalid_thal, inplace = True)
dataframe.thal = pd.to_numeric(dataframe.thal, downcast = 'float')
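
As a possible alternative to the two cleaning cells above, the question marks can be declared as missing values while reading the file, so that the columns are numeric from the start. This is a sketch under assumptions: the three rows below are made up and only imitate the file layout, and the name `cleaned` is illustrative.

```python
import io
import pandas as pd

# Three made-up rows in the same comma-separated layout; '?' marks a missing value.
sample = io.StringIO(
    "60.0,1.0,2.0,130.0,240.0,0.0,0.0,150.0,0.0,1.0,2.0,0.0,3.0,0\n"
    "65.0,1.0,4.0,150.0,280.0,0.0,2.0,110.0,1.0,1.5,2.0,?,7.0,2\n"
    "50.0,0.0,3.0,125.0,210.0,0.0,2.0,120.0,0.0,0.0,1.0,1.0,?,0\n"
)
names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
         'exang', 'oldpeak', 'slope', 'ca', 'thal', 'heartdisease']

# na_values='?' makes pandas read the markers as NaN, so ca and thal stay numeric.
cleaned = pd.read_csv(sample, names=names, na_values='?')

# Drop every row in which ca or thal is missing and renumber the index.
cleaned = cleaned.dropna(subset=['ca', 'thal']).reset_index(drop=True)

print(cleaned.dtypes[['ca', 'thal']])
print(len(cleaned))
```

With this approach no separate type conversion is needed, because the columns never contain strings.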

In the original dataset, the target takes integer values between 0 and 4. Here, 0 means no heart disease and 1 to 4 indicate the severity of the heart disease. Since we only want to find out whether a disease is present or not, all values greater than 0 are mapped to 1. This leads to a binary classification problem.


In [None]:
dataframe.heartdisease = dataframe.heartdisease.where(dataframe.heartdisease < 1, 1)

#update index
dataframe = dataframe.reset_index()

## Exploring Data and Visualization

First, I look at the histograms to see how the data is distributed. There are more men than women. 160 participants have no heart disease and 137 have it, so the classes are roughly balanced.
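
The class counts also give a reference point for the accuracy values later on: a classifier that always predicts the majority class already reaches a certain accuracy. A short worked calculation with the counts quoted above (the variable names are illustrative):

```python
# Class counts quoted in the text above.
n_healthy = 160
n_diseased = 137
n_total = n_healthy + n_diseased

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = max(n_healthy, n_diseased) / n_total
print(f"{n_total} patients, majority-class baseline accuracy = {baseline_accuracy:.3f}")
# prints: 297 patients, majority-class baseline accuracy = 0.539
```

Any model should therefore be judged against roughly 54% rather than against 0%.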

In [None]:
dataframe = dataframe.copy().drop(['index'], axis = 1)

In [None]:
dataframe.describe()

In [None]:
dataframe.hist(bins = 15,figsize= (20,20))
plt.show()

For clarity in the next plots I replace here the values of the attributes sex and heartdisease. Afterwards I change the values back again.

In [None]:
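# Assign the result back instead of calling replace(..., inplace=True) on a column selection,
# which does not reliably modify the DataFrame in recent pandas versions.
dataframe['sex'] = dataframe['sex'].replace({1:'Male',0:'Female'})
dataframe['heartdisease'] = dataframe['heartdisease'].replace({1:'Heart_attack - Yes',0:'Heart_attack - No'})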

The plot below shows that, among patients without heart disease, the age range is wider for women than for men. Among patients with heart disease, the age range is wider for men.

In [None]:
sns.catplot(x ='age', y ='heartdisease', col = 'sex', data = dataframe, color = 'crimson', kind = 'box')

The table shows the number of patients by sex and age (the `count()` includes patients with and without heart disease). The largest groups are men between the ages of 57 and 59.

In [None]:
s= dataframe.groupby(['sex','age'])['heartdisease'].count().reset_index().sort_values(by='heartdisease',ascending=False)
s.head(10).style.background_gradient(cmap='Purples')
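
Note that `count()` counts every row of a group. If only the heart disease cases per group are of interest, summing a 0/1 coded target is one option. The sketch below uses a tiny made-up sample and assumes the binary coding (1 = present); the column names `patients` and `cases` are illustrative.

```python
import pandas as pd

# Tiny made-up sample with the binary 0/1 coding of heartdisease (1 = present).
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female", "Female", "Male"],
    "age": [57, 57, 58, 57, 62, 58],
    "heartdisease": [1, 0, 1, 1, 0, 1],
})

# count() counts every row in a group; sum() of a 0/1 column counts only the cases.
summary = (
    toy.groupby(["sex", "age"])["heartdisease"]
    .agg(patients="count", cases="sum")
    .reset_index()
    .sort_values(by="cases", ascending=False)
)
print(summary)
```

In this toy sample the men aged 57 and the men aged 58 form equally large groups, but only the group aged 58 consists entirely of cases.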

The next plot shows the patient's age on the X-axis and the chest pain attribute on the Y-axis. In addition, the plot shows whether heart disease is present or not. It can be seen that most heart disease cases occur at a value of 4 (asymptomatic).

In [None]:
p1 = sns.scatterplot(data = dataframe, x = 'age', y = 'cp', hue = "heartdisease", s = 200)
p1.set(xlabel='Age [Years]', ylabel='Chest Pain')

The next plot shows the relationship between thalassemia and age. A reversible defect (value of 7) seems to be a pretty strong indicator of heart disease.

In [None]:
p3 = sns.scatterplot(data = dataframe, x = 'age', y = 'thal', hue = "heartdisease", s = 200)
p3.set(xlabel='Age [Years]', ylabel='thalassemia')

The next plot shows the relationship between resting blood pressure and age. In this plot, no relationship is apparent with respect to heart disease.

In [None]:
p2 = sns.scatterplot(data = dataframe, x = 'age', y = 'trestbps', hue = "heartdisease", s = 200)
p2.set(xlabel='Age [Years]', ylabel='resting blood pressure')

In [None]:
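# Map the labels back to their numeric codes (assigning the result instead of using inplace=True)
dataframe['sex'] = dataframe['sex'].replace({'Male':1,'Female':0})
dataframe['heartdisease'] = dataframe['heartdisease'].replace({'Heart_attack - Yes':1,'Heart_attack - No':0})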

Correlation Matrix:

According to the correlation matrix, age, chest pain, exang, slope, oldpeak, ca and thal correlate positively with heartdisease. Thalach and restecg are negatively correlated. 

Resting blood pressure and cholesterol correlate only weakly with heart disease, although my initial hypothesis assumed that they would be strong risk factors.

The attribute fbs shows practically no correlation with heart disease. Therefore, I decided to leave it out of the further calculations.

In [None]:
correlationMatrix = dataframe.corr() 

f = plt.figure(figsize=(15, 8))
plt.matshow(correlationMatrix, fignum=f.number, cmap='viridis')
plt.xticks(range(dataframe.shape[1]), dataframe.columns, fontsize=15, rotation=65)
plt.yticks(range(dataframe.shape[1]), dataframe.columns, fontsize=15)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=15)
plt.show()

In [None]:
correlationMatrix.style.background_gradient(cmap='viridis').set_precision(2)
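
Instead of reading the target row off the full matrix, the features can also be ranked directly by the strength of their correlation with the target. This is a sketch on synthetic data; the column names `related`, `inverse` and `noise` are made up and only the pattern is meant to carry over.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2021)
n = 200
target = rng.integers(0, 2, size=n)

# Synthetic stand-ins: one feature tied to the target, one inversely tied, one pure noise.
toy = pd.DataFrame({
    "related": target + rng.normal(0, 0.5, size=n),
    "inverse": -target + rng.normal(0, 0.5, size=n),
    "noise": rng.normal(0, 1, size=n),
    "heartdisease": target,
})

# Correlation of each feature with the target, ordered by absolute strength.
corr_with_target = toy.corr()["heartdisease"].drop("heartdisease")
order = corr_with_target.abs().sort_values(ascending=False).index
ranking = corr_with_target.reindex(order)
print(ranking.round(2))
```

Sorting by the absolute value keeps strongly negative features near the top, which a plain sort would push to the bottom.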

## Preparing Dataset for ML

### Creating Feature Matrix

Creating arrays for the X and Y data. As mentioned above, I drop **fbs** from the X data.


In [None]:
dataX = dataframe.copy().drop(['heartdisease', 'fbs'],axis=1)
dataY = dataframe['heartdisease'].astype(int).copy()

In [None]:
dataY.value_counts()

Rescaling the data, because many ML algorithms work better if the features are standardized (zero mean, unit variance).

In [None]:
featuresToScale = dataX.columns
sX = pp.StandardScaler(copy=True)
dataX.loc[:,featuresToScale] = sX.fit_transform(dataX[featuresToScale])

Split the data into a training and a test set. At first I chose 20% as the test set, but some ML methods then achieved much better results on the test data than on the training data, so I increased the test set to 30%.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(dataX,
dataY, test_size=0.3,
random_state=2021, stratify=dataY)

y_test.value_counts()
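
A variant that may be worth considering is to split first and fit the scaler on the training rows only, so that no information from the test rows enters the scaling. The sketch below uses synthetic data; the names `X_tr`, `X_te` and `scaler` are illustrative and this is not the procedure used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and the binary target.
rng = np.random.default_rng(2021)
X = rng.normal(loc=50, scale=10, size=(100, 3))
y = np.array([0] * 54 + [1] * 46)

# stratify=y keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=2021, stratify=y)

# Fit the scaler on the training rows only, then reuse its mean/std for the test rows.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print("class 1 share, train:", round(y_tr.mean(), 2), "test:", round(y_te.mean(), 2))
print("largest absolute train column mean after scaling:", np.abs(X_tr_scaled.mean(axis=0)).max())
```

The test columns will not have exactly zero mean after this transformation, which is expected because their statistics were not used for fitting.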

### Principal Component Analysis

Since I have 12 attributes, I try out dimensionality reduction to see if it improves my accuracy.

#### Functions used for PCA

In [None]:
def anomalyScore (originalDF, reducedDF):
  loss = np.sum((np.array(originalDF)-np.array(reducedDF))**2, axis=1)
  loss = pd.Series(data=loss,index=originalDF.index)
  loss = (loss-np.min(loss))/(np.max(loss)-np.min(loss))
  return loss
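
To make the score more tangible, here is a small worked example with made-up numbers. The function `anomaly_score` repeats the computation of `anomalyScore` above so that the snippet runs on its own; the two frames are purely illustrative.

```python
import numpy as np
import pandas as pd

def anomaly_score(original_df, reduced_df):
    # Same computation as anomalyScore above: squared error per row, then min-max scaling.
    loss = np.sum((np.array(original_df) - np.array(reduced_df)) ** 2, axis=1)
    loss = pd.Series(data=loss, index=original_df.index)
    return (loss - loss.min()) / (loss.max() - loss.min())

# Made-up original rows and imperfect reconstructions of them.
original = pd.DataFrame([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], index=["a", "b", "c"])
reconstructed = pd.DataFrame([[1.0, 2.0], [3.0, 5.0], [7.0, 6.0]], index=["a", "b", "c"])

# Squared errors per row are 0, 1 and 4, which scale to 0.0, 0.25 and 1.0.
scores = anomaly_score(original, reconstructed)
print(scores)
```

The row that is reconstructed worst always receives the score 1 and the best one the score 0, so the values are relative to the data set at hand.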

In [None]:
def plotResults(trueLabels, anomalyScore, returnPreds = False, plotting = True):
  preds = pd.concat([trueLabels, anomalyScore], axis=1)
  preds.columns = ['trueLabel', 'anomalyScore']
  
  precision, recall, thresholds = \
  precision_recall_curve(preds['trueLabel'],preds['anomalyScore'])
  

  average_precision = average_precision_score(preds['trueLabel'],preds['anomalyScore'])
  if plotting:
    plt.step(recall, precision, color='b', alpha=0.7, where='post')
    plt.fill_between(recall, precision, step='post', alpha=0.3, color='b')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.1])
    plt.xlim([0.0, 1.0])
    plt.title('Average Precision = {0:0.2f}'.format(average_precision))

  if returnPreds==True:
    return preds, average_precision

#### Hyperparameters for PCA

In [None]:
n_components =11
svd_solver = 'auto'
random_state = 2021

In [None]:
from sklearn.decomposition import KernelPCA
pca = KernelPCA(n_components=n_components,kernel = 'rbf', fit_inverse_transform = True ,random_state= 2021)

Model Implementation on X and Y Data

In [None]:
X_train_PCA = pca.fit_transform(X_train)
# Only transform the test set: the projection must be learned from the training data alone
X_test_PCA = pca.transform(X_test)

Organizing the data into pandas DataFrames

In [None]:
X_train_PCA = pd.DataFrame(data=X_train_PCA, index=X_train.index)
X_test_PCA = pd.DataFrame(data=X_test_PCA, index=X_test.index)
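
The usual pattern for a dimensionality reduction step is to learn the projection from the training rows and to apply that same fitted projection to the test rows. The following sketch shows the pattern on synthetic data with fewer components than above; `kpca`, `X_tr` and `X_te` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(2021)
X_tr = rng.normal(size=(60, 5))   # synthetic training rows
X_te = rng.normal(size=(20, 5))   # synthetic test rows

kpca = KernelPCA(n_components=3, kernel='rbf', fit_inverse_transform=True,
                 random_state=2021)

# The projection is learned from the training rows only ...
X_tr_reduced = kpca.fit_transform(X_tr)
# ... and the same fitted projection is then applied to the test rows.
X_te_reduced = kpca.transform(X_te)

# Map the reduced training data back to the original feature space.
X_tr_restored = kpca.inverse_transform(X_tr_reduced)

print(X_tr_reduced.shape, X_te_reduced.shape, X_tr_restored.shape)
```

Because the fitted object is reused, the later `inverse_transform` call also refers to the projection learned from the training data.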

In [None]:
#print(pca.explained_variance_ratio_)

In [None]:
#plt.bar(range(len(pca.explained_variance_ratio_)),pca.explained_variance_ratio_)

Anomaly Score

In [None]:
X_train_PCA_inverse = pca.inverse_transform(X_train_PCA)
X_train_PCA_inverse = pd.DataFrame(data=X_train_PCA_inverse, index=X_train.index)

In [None]:
anomalyScorePCA = anomalyScore(X_train, X_train_PCA_inverse)
predsPCA = plotResults(y_train, anomalyScorePCA, False, True)

Decision whether to use PCA or not.

In [None]:
trigger_PCA = True
if trigger_PCA == True:
  X_train = X_train_PCA
  X_test = X_test_PCA

### Cross Validation

To validate my models I use Cross Validation.

In [None]:
k_fold = StratifiedKFold(n_splits=4, shuffle=True, random_state=2021)
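
To illustrate what the stratified variant does, the sketch below splits synthetic labels with a 60/40 class ratio into four folds and reports the size and the class share of each validation fold. The data is made up; only the splitter settings mirror the cell above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with a 60/40 class split; the features are placeholders.
y = np.array([0] * 60 + [1] * 40)
X = np.zeros((len(y), 1))

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=2021)
fold_sizes = []
fold_shares = []
for train_idx, cv_idx in skf.split(X, y):
    fold_sizes.append(len(cv_idx))
    fold_shares.append(float(y[cv_idx].mean()))

print("validation fold sizes:", fold_sizes)
print("share of class 1 per fold:", fold_shares)
```

Each validation fold keeps the 40% share of class 1, so every fold is evaluated on the same class balance as the full training set.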

#5. Model Selection 

I have chosen the following as my baseline models, which are well suited for classification problems:
*   Logistic Regression 
*   Random Forests 
* K-Nearest Neighbors

After testing the baseline models, I build a neural network:
*   Neural Network
*   Fine tuning hyperparameters

Workflow:

My workflow with the baseline models is always the same. I start with a grid search to find the best parameters. Then, to validate the model, I look at the accuracy in cross-validation. After that I train the model on the whole training set and evaluate the results on the test set using the confusion matrix, precision and recall, and the ROC curve.
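
The workflow can be summarized in a compact sketch. It runs on synthetic data from `make_classification` and uses `cross_validate` in place of a hand-written fold loop; the small parameter grid and all variable names are assumptions made for illustration, not the settings used in the sections below.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_validate, train_test_split)

# Synthetic binary data standing in for the prepared heart disease features.
X, y = make_classification(n_samples=300, n_features=11, n_informative=5,
                           random_state=2021)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=2021, stratify=y)
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=2021)

# Step 1: grid search over a small, illustrative parameter grid.
search = GridSearchCV(LogisticRegression(solver='liblinear'),
                      {'C': [0.01, 0.1, 1.0]}, scoring='accuracy', cv=folds)
search.fit(X_tr, y_tr)

# Step 2: cross-validate the best setting, keeping training and validation scores.
cv = cross_validate(search.best_estimator_, X_tr, y_tr, cv=folds,
                    scoring='accuracy', return_train_score=True)

# Step 3: refit on the whole training set and score the held-out test set.
final_model = search.best_estimator_.fit(X_tr, y_tr)
test_accuracy = accuracy_score(y_te, final_model.predict(X_te))

print('best C:', search.best_params_['C'])
print('mean train accuracy:', cv['train_score'].mean())
print('mean CV accuracy:', cv['test_score'].mean())
print('test accuracy:', test_accuracy)
```

A large gap between the mean training accuracy and the mean CV accuracy is the signal for overfitting that the per-fold printouts in the following sections are meant to reveal.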






## Logistic Regression

Hyperparameters

In [None]:
penalty = 'l2' 
C = 0.1
random_state = 2021
solver = 'liblinear'
logReg = LogisticRegression(penalty=penalty, C=C,random_state=random_state, solver=solver)

Grid Search to find best result.

In [None]:
penalty = ['l2']
C = np.arange(0.01, 1, 0.1 )
random_state = 2021
solver = ['lbfgs', 'liblinear', 'saga']
grid = {'penalty': penalty,'C':C, 'solver': solver}

gridSearch = GridSearchCV(logReg, grid, scoring='accuracy', cv=k_fold, refit=True)
gridSearch.fit(X_train, y_train)
results = gridSearch.cv_results_

print('Best accuracy obtained:', gridSearch.best_score_)
print('C value for the best case:', gridSearch.best_estimator_.C)
print('Penalty value for the best case:', gridSearch.best_estimator_.penalty)
print('Solver value for the best case:', gridSearch.best_estimator_.solver)

In [None]:
#Set Parameters to the values from GridSearch
logReg.set_params(C = gridSearch.best_estimator_.C, solver = gridSearch.best_estimator_.solver )


Cross-validation for validating estimator performance

In [None]:
#Lists for storing scores
trainingScores = []
cvScores = []

for train_index, cv_index in k_fold.split(X_train,y_train):

  #Filtering data based on indices
  X_train_fold, X_cv_fold = X_train.iloc[train_index,:], X_train.iloc[cv_index,:]
  Y_train_fold, Y_cv_fold = y_train.iloc[train_index], y_train.iloc[cv_index]

  #Fitting Model
  logReg.fit(X_train_fold, Y_train_fold)

  #Checking how good the model is on trainingsdata
  accuracy_score_Training = accuracy_score(Y_train_fold,logReg.predict(X_train_fold))
  print('--------------------------------------------------------')
  print('Training accuracy_score: ', accuracy_score_Training)
  #Checking how good the model is on cv data
  accuracy_score_cv = accuracy_score(Y_cv_fold,logReg.predict(X_cv_fold))
  print('CV accuracy_score: ', accuracy_score_cv)

  trainingScores.append(accuracy_score_Training)
  cvScores.append(accuracy_score_cv)

print('--------------------------------------------------------')
print('--------------------------------------------------------')
mean_accuracy_score_training = np.array(trainingScores).mean()
print('mean Accuracy_score Training:', mean_accuracy_score_training )
print('--------------------------------------------------------')
mean_accuracy_score_cv = np.array(cvScores).mean()
print('mean Accuracy_score cv:', mean_accuracy_score_cv )

Train the model with all training data and the hyperparameters from above

In [None]:
logReg.fit(X_train, y_train)

Test model with training and test data set.

In [None]:
# Prediction and accuracy on trainings data
y_pred_train_proba_lg = logReg.predict_proba(X_train)
y_pred_train_proba_lg = pd.DataFrame(data = y_pred_train_proba_lg, index = X_train.index)

y_train_preds_lg = logReg.predict(X_train)
accuracy_training_ges_lg = accuracy_score(y_train,y_train_preds_lg)

In [None]:
# Prediction and accuracy on test data
y_pred_proba_lg = logReg.predict_proba(X_test)
y_pred_proba_lg = pd.DataFrame(data = y_pred_proba_lg, index = X_test.index)

y_preds_lg = logReg.predict(X_test)
accuracy_test_ges_lg = accuracy_score(y_test,y_preds_lg)


Accuracy for training and test data:

In [None]:
print('--------------------------------------------------------')
print('accuracy_score whole trainings set', accuracy_training_ges_lg )
print('--------------------------------------------------------')
print('accuracy_score whole test set', accuracy_test_ges_lg )
print('--------------------------------------------------------')




## Evaluate the results





Confusion Matrix

In [None]:
cm1 = confusion_matrix(y_test,y_preds_lg)
#storing false negatives
fn_lg = cm1[1,0]

In [None]:
plot_confusion_matrix(logReg,X_test,y_test,cmap='Blues')

#### Precision Recall Curve:

**Precision** 

*   Precision = True Positive / ( True Positive  + False Positive)
*   captures how often a positive prediction made by the model turns out to be correct.

**Recall** 


*   Recall  = True Positive / (True Positive + False Negative)
*   tells us how confident we can be that all instances with the positive target level have been found by the model

As with the ROC curve, the decision threshold is varied, and precision and recall are computed for every threshold.
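
A small worked example of the two formulas, using made-up labels and predictions for ten patients (1 = heart disease); none of these numbers come from the models in this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true labels and predictions for ten patients (1 = heart disease).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of positive predictions that are correct
recall = tp / (tp + fn)      # share of actual positives that were found

print("TP, FP, FN, TN:", tp, fp, fn, tn)   # 3 true positives, 1 false positive, 2 false negatives, 4 true negatives
print("precision:", precision)             # 3 / (3 + 1) = 0.75
print("recall:", recall)                   # 3 / (3 + 2) = 0.6
```

The same values are returned by `precision_score(y_true, y_pred)` and `recall_score(y_true, y_pred)`.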

In [None]:
preds = pd.concat([y_test,y_pred_proba_lg.loc[:,1]], axis=1)
preds.columns = ['trueLabel','prediction']
precision, recall, thresholds = precision_recall_curve(preds['trueLabel'],preds['prediction'])
average_precision = average_precision_score(preds['trueLabel'],preds['prediction'])

plt.step(recall, precision, color='k', alpha=0.7, where='post')
plt.fill_between(recall, precision, step='post', alpha=0.3, color='k')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Precision-Recall curve: Average Precision = {0:0.2f}'.format(average_precision))

#### ROC Curve

The true positive rate and the false positive rate are calculated for each threshold. The curve results from all these points. The closer the curve is to the upper left corner, the better the model.

In [None]:
fpr, tpr, thresholds = roc_curve(preds['trueLabel'],preds['prediction'])
areaUnderROC = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='r', lw=2, label='ROC curve')
plt.plot([0, 1], [0, 1], color='k', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic: Area under the curve = {0:0.2f}'.format(areaUnderROC))
plt.legend(loc="lower right")
plt.show()
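
How a single point of the ROC curve comes about can be sketched by hand. The labels and scores below are made up, and the helper `roc_point` is illustrative; `roc_curve` performs the same computation for all relevant thresholds at once.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for eight patients.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

def roc_point(y_true, y_score, threshold):
    """Return (TPR, FPR) when predicting 1 for every score >= threshold."""
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)
    fpr = fp / np.sum(y_true == 0)
    return tpr, fpr

points = {}
for threshold in [0.2, 0.5, 0.85]:
    points[threshold] = roc_point(y_true, y_score, threshold)
    tpr, fpr = points[threshold]
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}")

# 12 of the 16 (positive, negative) pairs are ranked correctly, so the AUC is 0.75.
auc_value = roc_auc_score(y_true, y_score)
print("AUC:", auc_value)
```

Lowering the threshold moves the point towards the upper right corner (more true positives, but also more false positives); raising it moves the point towards the lower left.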

## Random Forest

Hyperparameters

In [None]:
n_estimators = 200
random_state = 2021
criterion = 'gini'

max_depth = 3
max_leaf_nodes = None
min_samples_split = 2

max_features = 'sqrt'

RFC = RandomForestClassifier(n_estimators= n_estimators,  random_state= random_state, criterion=criterion, max_features = max_features, max_depth = max_depth, max_leaf_nodes=max_leaf_nodes, min_samples_split=min_samples_split)

In [None]:
n_estimators = [ 100, 150, 200]
criterion = ['gini', 'entropy']
random_state = 2021
max_depth = range(1, 5)
max_features = ['sqrt', 'log2']
grid = {'n_estimators': n_estimators,'criterion':criterion, 'max_depth': max_depth, 'max_features':max_features}

gridSearch = GridSearchCV(RFC, grid, scoring='accuracy', cv=k_fold, refit=True)
gridSearch.fit(X_train, y_train)
results = gridSearch.cv_results_

print('Best accuracy obtained:', gridSearch.best_score_)
print('n_estimators value for the best case:', gridSearch.best_estimator_.n_estimators)
print('criterion value for the best case:', gridSearch.best_estimator_.criterion)
print('max_depth value for the best case:', gridSearch.best_estimator_.max_depth)
print('max_features value for the best case:', gridSearch.best_estimator_.max_features)



In [None]:
#Set Parameters to the values from GridSearch
RFC.set_params(n_estimators = gridSearch.best_estimator_.n_estimators, criterion = gridSearch.best_estimator_.criterion, max_depth = gridSearch.best_estimator_.max_depth, max_features  =  gridSearch.best_estimator_.max_features)

Evaluate Hyperparameters with CV

In [None]:
#Storing Scores
trainingScores = []
cvScores = []


for train_index, cv_index in k_fold.split(X_train,y_train):

  #Filtering data based on indices
  X_train_fold, X_cv_fold = X_train.iloc[train_index,:], X_train.iloc[cv_index,:]
  Y_train_fold, Y_cv_fold = y_train.iloc[train_index], y_train.iloc[cv_index]

  #Fitting Model
  RFC.fit(X_train_fold, Y_train_fold)

  #Checking how good the model is on trainingsdata
  accuracy_score_Training = accuracy_score(Y_train_fold,RFC.predict(X_train_fold))
  print('--------------------------------------------------------')
  print('Training accuracy_score: ', accuracy_score_Training)
  #Checking how good the model is on cv data
  accuracy_score_cv = accuracy_score(Y_cv_fold,RFC.predict(X_cv_fold))
  print('cv accuracy_score: ', accuracy_score_cv)
 
  trainingScores.append(accuracy_score_Training)
  cvScores.append(accuracy_score_cv)

print('--------------------------------------------------------')
print('--------------------------------------------------------')
mean_accuracy_score_training = np.array(trainingScores).mean()
print('Mean Accuracy_score Training:', mean_accuracy_score_training )
print('--------------------------------------------------------')
mean_accuracy_score_cv = np.array(cvScores).mean()
print('Mean Accuracy_score cv:', mean_accuracy_score_cv )


Train the model on the whole training set and evaluate it on the training and test sets

In [None]:
RFC.fit(X_train, y_train)

In [None]:
#Predict and Accuracy on trainings data
# Use the random forest here (not the logistic regression model from the previous section)
y_pred_train_proba_rf = RFC.predict_proba(X_train)
y_pred_train_proba_rf = pd.DataFrame(data = y_pred_train_proba_rf, index = X_train.index)


y_train_preds_rf = RFC.predict(X_train)
accuracy_score_training_ges_rf = accuracy_score(y_train,y_train_preds_rf)

In [None]:
#Predict and Accuracy on test data
y_pred_proba_rf = RFC.predict_proba(X_test)
y_pred_proba_rf = pd.DataFrame(data = y_pred_proba_rf, index = X_test.index)

y_preds_rf = RFC.predict(X_test)
accuracy_test_ges_rf = accuracy_score(y_test,y_preds_rf)

Accuracy for training and test set: 

In [None]:
print('--------------------------------------------------------')
print('accuracy_score whole trainings set', accuracy_score_training_ges_rf )
print('--------------------------------------------------------')
print('accuracy_score whole test set', accuracy_test_ges_rf )
print('--------------------------------------------------------')

### Evaluate Results

Confusion Matrix

In [None]:
cm2 = confusion_matrix(y_test,y_preds_rf)
#storing false negatives
fn_rf = cm2[1,0]

In [None]:
plot_confusion_matrix(RFC,X_test,y_test,cmap='Blues')

#### Recall Precision Curve

In [None]:
preds = pd.concat([y_test,y_pred_proba_rf.loc[:,1]], axis=1)
preds.columns = ['trueLabel','prediction']
precision, recall, thresholds = precision_recall_curve(preds['trueLabel'],preds['prediction'])
average_precision = average_precision_score(preds['trueLabel'],preds['prediction'])

plt.step(recall, precision, color='k', alpha=0.7, where='post')
plt.fill_between(recall, precision, step='post', alpha=0.3, color='k')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Precision-Recall curve: Average Precision = {0:0.2f}'.format(average_precision))

#### ROC-Curve

In [None]:
fpr, tpr, thresholds = roc_curve(preds['trueLabel'],preds['prediction'])
areaUnderROC = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='r', lw=2, label='ROC curve')
plt.plot([0, 1], [0, 1], color='k', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic: Area under the curve = {0:0.2f}'.format(areaUnderROC))
plt.legend(loc="lower right")
plt.show()

## K Nearest Neighbors

In [None]:
n_neighbors = 13
weights = 'uniform'
algorithm = 'auto'
knn = KNeighborsClassifier(n_neighbors =  n_neighbors, algorithm = algorithm, weights = weights)

In [None]:
n_neighbors = np.arange(1,30, 2)
weights = ['uniform', 'distance']
algorithm = ['auto', 'ball_tree', 'kd_tree', 'brute']
grid = {'n_neighbors': n_neighbors,'weights':weights, 'algorithm': algorithm}

gridSearch = GridSearchCV(knn, grid, scoring='accuracy', cv=k_fold, refit=True)
gridSearch.fit(X_train, y_train)
results = gridSearch.cv_results_

print('Best accuracy obtained:', gridSearch.best_score_)
print('n_neighbors value for the best case:', gridSearch.best_estimator_.n_neighbors)
print('weights value for the best case:', gridSearch.best_estimator_.weights)
print('algorithm for the best case:', gridSearch.best_estimator_.algorithm)

In [None]:
#Set Parameters to the values from GridSearch
knn.set_params(n_neighbors = gridSearch.best_estimator_.n_neighbors,  weights = gridSearch.best_estimator_.weights , algorithm = gridSearch.best_estimator_.algorithm )

In [None]:
trainingScores = []
cvScores = []


for train_index, cv_index in k_fold.split(X_train,y_train):

  #Filtering data based on indices
  X_train_fold, X_cv_fold = X_train.iloc[train_index,:], X_train.iloc[cv_index,:]
  Y_train_fold, Y_cv_fold = y_train.iloc[train_index], y_train.iloc[cv_index]

  #Fitting Model
  knn.fit(X_train_fold, Y_train_fold)

  #Checking how good the model is on trainingsdata
  accuracy_score_Training = accuracy_score(Y_train_fold,knn.predict(X_train_fold))
  print('--------------------------------------------------------')
  print('Training accuracy_score: ', accuracy_score_Training)
  #Checking how good the model is on cv data
  accuracy_score_cv = accuracy_score(Y_cv_fold,knn.predict(X_cv_fold))
  print('CV accuracy_score: ', accuracy_score_cv)
  trainingScores.append(accuracy_score_Training)
  cvScores.append(accuracy_score_cv)

print('--------------------------------------------------------')
print('--------------------------------------------------------')
mean_accuracy_score_training = np.array(trainingScores).mean()
print('Mean accuracy_score Training:', mean_accuracy_score_training )
print('--------------------------------------------------------')
mean_accuracy_score_cv = np.array(cvScores).mean()
print('Mean accuracy_score CV:', mean_accuracy_score_cv )

Train and test model with training and test set.

In [None]:
knn.fit(X_train, y_train)

In [None]:
#Prediction and Accuracy on trainings data
y_pred_train_proba_knn = knn.predict_proba(X_train)
y_pred_train_proba_knn = pd.DataFrame(data = y_pred_train_proba_knn, index = X_train.index)
#print(y_preds)

y_train_preds_knn = knn.predict(X_train)
accuracy_training_ges_knn = accuracy_score(y_train,y_train_preds_knn)

In [None]:
#Prediction and Accuracy on test data
y_pred_proba_knn = knn.predict_proba(X_test)
y_pred_proba_knn = pd.DataFrame(data = y_pred_proba_knn, index = X_test.index)
#print(y_preds)

y_preds_knn = knn.predict(X_test)
accuracy_test_ges_knn = accuracy_score(y_test,y_preds_knn)

Accuracy on training and test set:

In [None]:
print('--------------------------------------------------------')
print('accuracy_score whole trainings set', accuracy_training_ges_knn )
print('--------------------------------------------------------')
print('accuracy_score whole test set', accuracy_test_ges_knn )
print('--------------------------------------------------------')

Confusion Matrix

In [None]:
cm3 = confusion_matrix(y_test,y_preds_knn)
#storing false negatives
fn_knn = cm3[1,0]

In [None]:
plot_confusion_matrix(knn,X_test,y_test,cmap='Blues')

#### Recall Precision Curve

In [None]:
preds = pd.concat([y_test,y_pred_proba_knn.loc[:,1]], axis=1)
preds.columns = ['trueLabel','prediction']
precision, recall, thresholds = precision_recall_curve(preds['trueLabel'],preds['prediction'])
average_precision = average_precision_score(preds['trueLabel'],preds['prediction'])

plt.step(recall, precision, color='k', alpha=0.7, where='post')
plt.fill_between(recall, precision, step='post', alpha=0.3, color='k')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Precision-Recall curve: Average Precision = {0:0.2f}'.format(average_precision))

#### ROC-Curve

In [None]:
fpr, tpr, thresholds = roc_curve(preds['trueLabel'],preds['prediction'])
areaUnderROC = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='r', lw=2, label='ROC curve')
plt.plot([0, 1], [0, 1], color='k', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic: Area under the curve = {0:0.2f}'.format(areaUnderROC))
plt.legend(loc="lower right")
plt.show()

# Neural Network


First, I started with an NN with 2 hidden layers of 32 neurons each and an output layer with one neuron. As activation function I used **relu** for the two hidden layers and **sigmoid** for the output layer. Since it is a binary classification problem, I decided to use **binary_crossentropy** as loss and **accuracy** as metric. As optimizer I chose **adam**.

In [None]:
def build_model():
  #Sequential API
  model = models.Sequential()
  #Defining the first hidden layer:
  model.add(layers.Dense(units = 32, activation='relu', input_shape=(X_train.shape[1],)))
  model.add(layers.Dense(units = 32, activation='relu'))
  #Sigmoid for values between 0 and 1 (good for binary classification).
  model.add(layers.Dense(units = 1,activation='sigmoid'))


  model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
  return model

In [None]:
#Looking into model structure:
build_model().summary()
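
The parameter counts reported by the summary can be reproduced by hand: a dense layer has one weight per input and unit plus one bias per unit. The sketch below assumes 11 input features (matching `n_components = 11` of the PCA step); with a different number of input columns the first count changes accordingly. The helper `dense_params` is illustrative.

```python
def dense_params(n_inputs, n_units):
    # Each unit has one weight per input plus one bias.
    return n_inputs * n_units + n_units

# Assumption: 11 input features, matching n_components = 11 of the PCA step.
n_features = 11
layer_units = [32, 32, 1]

counts = []
n_inputs = n_features
for units in layer_units:
    counts.append(dense_params(n_inputs, units))
    n_inputs = units   # the output of this layer is the input of the next one

total_params = sum(counts)
print(counts, total_params)
# prints: [384, 1056, 33] 1473
```

With well over a thousand trainable parameters and only a few hundred patients, the network has enough capacity to memorize the training data.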

To get started, I trained the model for 200 epochs.

In [None]:
#Model Training
#--------------------------------------------------------------------------
#Hyperparameters
num_epochs = 200

batch_size = 10

In [None]:
#Lists for storing scores
trainingScores = []
cvScores = []

for train_index, cv_index in k_fold.split(X_train,y_train):

  #Filtering data based on indices
  X_train_fold, X_cv_fold = X_train.iloc[train_index,:], X_train.iloc[cv_index,:]
  Y_train_fold, Y_cv_fold = y_train.iloc[train_index], y_train.iloc[cv_index]
  #Building the model
  model = build_model()
  #Fitting Model
  #model.fit(X_train_fold, Y_train_fold, epochs=num_epochs, batch_size=batch_size, verbose=0)
  history =  model.fit(X_train_fold, Y_train_fold, epochs=num_epochs, batch_size=batch_size, validation_data=(X_cv_fold, Y_cv_fold) ,verbose=0)

  
  #Evaluating the training performance:
  val_binary_crossentropy, val_accuracy = model.evaluate(X_train_fold, Y_train_fold, verbose=0)
  trainingScores.append(val_binary_crossentropy)
  print('--------------------------------------------------------')
  print('Training accuracy: ', val_accuracy)

  #Evaluating the CV performance:
  val_binary_crossentropy, val_accuracy = model.evaluate(X_cv_fold, Y_cv_fold, verbose=0)
  cvScores.append(val_binary_crossentropy)
  print('CV accuracy: ', val_accuracy)

The training accuracy reaches 1, while the CV accuracy is well below it, so the network is overfitting. To confirm this, I look at the loss in the learning curve, which also shows strong overfitting.

In [None]:
def plot_learning_curves(history):
  #Plotting training and CV loss per epoch:
  plt.plot(history.epoch,history.loss, "k--", linewidth=1.5, label="Training")
  plt.plot(history.epoch,history.val_loss, "b-.", linewidth=1.5, label="CV test")
  plt.legend()
  #Limiting the y-axis to [0, 1] for a better visualization:
  plt.ylim(0., 1.)
  #plt.yscale("log")
  plt.xlabel("Epochs")
  plt.ylabel("loss")

In [None]:
hist = pd.DataFrame(history.history)
#Adding epoch column:
hist['epoch'] = history.epoch
# The history contains the loss as well as the accuracy for both training and CV data:
#hist.sample(3)

In [None]:
plot_learning_curves(hist)

In [None]:
model = history.model

In [None]:
val_binary_crossentropy_training, val_accuracy_total_training = model.evaluate(X_train, y_train, verbose=0)
print('--------------------------------------------------------')
print('accuracy for the entire training dataset: ', val_accuracy_total_training)
print('--------------------------------------------------------')

In [None]:
val_binary_crossentropy_test, val_accuracy_total_test = model.evaluate(X_test, y_test, verbose=0)
print('--------------------------------------------------------')
print('accuracy for the entire test dataset: ', val_accuracy_total_test)
print('--------------------------------------------------------')

In [None]:
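#predict_classes was removed in TensorFlow 2.6, so the sigmoid output is thresholded at 0.5 instead:
y_preds = (model.predict(X_test) > 0.5).astype("int32").ravel()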

In [None]:
cm1 = confusion_matrix(y_test,y_preds)

In [None]:
df_cm = pd.DataFrame(cm1, range(2), range(2))
# plt.figure(figsize=(10,7))
sns.set(font_scale=1.4) # for label size
sns.heatmap(df_cm, annot=True, annot_kws={"size": 16}, cmap = 'crest') # font size

plt.show()



## Fine Tuning Parameters

First, I added a dropout layer between the two hidden layers. This helped, but the network still overfitted. Next, I reduced the number of neurons per layer to 16 and added another hidden layer after the dropout layer. A fourth layer and a second dropout layer did not improve the accuracy of the model, so I have commented them out. Additionally, I added a kernel regularizer. I tried **l1**, **l2**, and **l1_l2** as regularizers and got the best results with **l2**. I did not change the loss, the metric, or the optimizer.
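The three regularizers differ only in the penalty term they add to the loss for each regularized layer. As a minimal sketch in plain Python, assuming the usual definitions (L1 sums absolute weights, L2 sums squared weights, each scaled by a factor) and using made-up toy weights:

```python
def l1_penalty(weights, l1=0.001):
    """L1 penalty: factor times the sum of absolute weights."""
    return l1 * sum(abs(w) for w in weights)

def l2_penalty(weights, l2=0.001):
    """L2 penalty: factor times the sum of squared weights."""
    return l2 * sum(w * w for w in weights)

def l1_l2_penalty(weights, l1=0.001, l2=0.001):
    """Combined penalty: both terms added together."""
    return l1_penalty(weights, l1) + l2_penalty(weights, l2)

# Toy weights (hypothetical), flattened from one layer's kernel
weights = [0.5, -2.0, 0.0, 1.5]

print("l1:   ", l1_penalty(weights))
print("l2:   ", l2_penalty(weights))
print("l1_l2:", l1_l2_penalty(weights))
```

Because L2 squares the weights, it penalizes a few large weights more strongly than many small ones, which tends to keep all weights small without forcing them to exactly zero.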

In [None]:
def build_model():
  #Sequential API
  model = models.Sequential()
  #Defining the first hidden layer:
  model.add(layers.Dense(units = 16,  kernel_regularizer=regularizers.l2(0.001), kernel_initializer="he_uniform", activation='relu', input_shape=(X_train.shape[1],)))
 # model.add(layers.Dense(units = 16,  kernel_regularizer=regularizers.l2(0.001),activation='relu'))
 # model.add(layers.Dropout(0.3))
  model.add(layers.Dense(units = 16,  kernel_regularizer=regularizers.l2(0.001),kernel_initializer="he_uniform", activation='relu'))
  model.add(layers.Dropout(0.3))
  model.add(layers.Dense(units = 16,  kernel_regularizer=regularizers.l2(0.001),kernel_initializer="he_uniform",activation='relu'))
  #Sigmoid for values between 0 and 1 (good for binary classification).
  model.add(layers.Dense(units = 1,activation='sigmoid'))


  model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
  return model

In [None]:
#Looking into model structure:
build_model().summary()

In [None]:
#Lists for storing scores
trainingScores = []
cvScores = []

for train_index, cv_index in k_fold.split(X_train,y_train):

    #Filtering data based on indices
  X_train_fold, X_cv_fold = X_train.iloc[train_index,:], X_train.iloc[cv_index,:]
  Y_train_fold, Y_cv_fold = y_train.iloc[train_index], y_train.iloc[cv_index]
  #Building the model
  model = build_model()
  #Fitting Model
  #model.fit(X_train_fold, Y_train_fold, epochs=num_epochs, batch_size=batch_size, verbose=0)
  history =  model.fit(X_train_fold, Y_train_fold, epochs=num_epochs, batch_size=batch_size, validation_data=(X_cv_fold, Y_cv_fold) ,verbose=0)


  #Evaluating the training performance:
  val_binary_crossentropy, val_accuracy = model.evaluate(X_train_fold, Y_train_fold, verbose=0)
  trainingScores.append(val_binary_crossentropy)
  print('--------------------------------------------------------')
  print('Training accuracy: ', val_accuracy)

  #Evaluating the CV performance:
  val_binary_crossentropy, val_accuracy = model.evaluate(X_cv_fold, Y_cv_fold, verbose=0)
  cvScores.append(val_binary_crossentropy)
  print('CV accuracy: ', val_accuracy)


In [None]:
hist = pd.DataFrame(history.history)
#Adding epoch column:
hist['epoch'] = history.epoch

In [None]:
plot_learning_curves(hist)

As the learning curve shows, the model still overfits. To counter this, I added early stopping: training stops when the monitored validation loss has not improved for 10 consecutive epochs (the `patience` setting), and the weights of the best epoch are restored. To keep the best model from cross validation, I also added a checkpoint.
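The stopping rule described above can be sketched in plain Python. This is a simplified illustration of the patience logic, assuming the monitored quantity is the validation loss (the Keras default for `EarlyStopping`) and ignoring options such as `min_delta`. The helper name `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; best_epoch is where the lowest loss was seen.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training used all epochs
    return len(val_losses) - 1, best_epoch

# Toy validation losses (hypothetical) that start rising after epoch 2
val_losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.7]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print("stopped at epoch", stop_epoch, "best epoch", best_epoch)
```

With `restore_best_weights=True`, the model ends up with the weights from `best_epoch` rather than from the epoch at which training stopped.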

In [None]:
myCheckpoint= keras.callbacks.ModelCheckpoint("my_best_model1.h5", save_best_only=True)
myEarly_stopping = keras.callbacks.EarlyStopping(patience=10,restore_best_weights=True)

In [None]:
# run model with checkpoint and early stopping
#Lists for storing scores

trainingScores = []
cvScores = []

for train_index, cv_index in k_fold.split(X_train,y_train):

    #Filtering data based on indices
  X_train_fold, X_cv_fold = X_train.iloc[train_index,:], X_train.iloc[cv_index,:]
  Y_train_fold, Y_cv_fold = y_train.iloc[train_index], y_train.iloc[cv_index]
  #Building the model
  model = build_model()
  #Fitting Model
  #model.fit(X_train_fold, Y_train_fold, epochs=num_epochs, batch_size=batch_size, verbose=0)
  history =  model.fit(X_train_fold, Y_train_fold, epochs=num_epochs, batch_size=batch_size, validation_data=(X_cv_fold, Y_cv_fold) ,callbacks=[myCheckpoint,myEarly_stopping],verbose=0)


  #Evaluating the training performance:
  val_binary_crossentropy, val_accuracy = model.evaluate(X_train_fold, Y_train_fold, verbose=0)
  trainingScores.append(val_binary_crossentropy)
  print('--------------------------------------------------------')
  print('Training accuracy: ', val_accuracy)

  #Evaluating the CV performance:
  val_binary_crossentropy, val_accuracy = model.evaluate(X_cv_fold, Y_cv_fold, verbose=0)
  cvScores.append(val_binary_crossentropy)
  print('CV accuracy: ', val_accuracy)

#Load best model from Checkpoint
model = keras.models.load_model("my_best_model1.h5")

In [None]:
val_binary_crossentropy_test, val_accuracy_total_test = model.evaluate(X_test, y_test, verbose=0)
print('--------------------------------------------------------')
print('accuracy for the entire test dataset: ', val_accuracy_total_test)
print('--------------------------------------------------------')

In [None]:
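#predict_classes was removed in TensorFlow 2.6, so the sigmoid output is thresholded at 0.5 instead:
y_preds = (model.predict(X_test) > 0.5).astype("int32").ravel()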


In [None]:
cm2 = confusion_matrix(y_test,y_preds)
#storing false negatives
fn_nn = cm2[1,0]

In [None]:


df_cm = pd.DataFrame(cm2, range(2), range(2))
# plt.figure(figsize=(10,7))
sns.set(font_scale=1.4) # for label size
sns.heatmap(df_cm, annot=True, annot_kws={"size": 16},  cmap = 'crest') # font size

plt.show()



### Summary 

The neural network has the potential to reach the highest accuracy. Depending on how the weights are initialized, its accuracy varies between 81% and 86%. The number of false positives is lowest in the best NN run.

#6. Evaluation of the model predictions

Looking at the accuracies, all four models perform about the same, at roughly 84%. To judge the models, I therefore look at the number of false negatives. Since the goal is to detect heart disease, the worst case is a patient who has heart disease that goes undetected. By this measure, Logistic Regression and the Neural Network are best. For the neural network, the result depends, as mentioned above, on the initialized weights. In my experiments, it produced between 6 false negatives in the best case and 12 in the worst case (with 30% of the data used as test data).
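To show where the false negatives sit in a confusion matrix, here is a minimal sketch in plain Python. It follows the scikit-learn layout (rows are true labels, columns are predicted labels), so the entry at row 1, column 0 counts patients with heart disease who were predicted as healthy. The labels are toy values, not results from the notebook, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels, where 1 = disease present."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1  # disease present but not detected
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def recall(fn, tp):
    """Share of actual positives that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Toy labels (hypothetical)
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
matrix = [[tn, fp], [fn, tp]]  # same layout as sklearn's confusion_matrix
print("false negatives:", matrix[1][0])
print("recall:", recall(fn, tp))
```

Two models with the same accuracy can differ in recall, which is why the count of false negatives is used here as the deciding criterion.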

In [None]:
#Storing all scores 
scores = [accuracy_test_ges_lg, accuracy_test_ges_rf, accuracy_test_ges_knn, val_accuracy_total_test]
fales_negatives_count = [fn_lg, fn_rf, fn_knn,fn_nn]
algorithms = ['Logistic Regression', 'Random Forests',  'K-Nearest Neighbors', 'Neural Network']

In [None]:
sns.set(rc={'figure.figsize':(15,8)})
plt.xlabel("Algorithms")
plt.ylabel("Accuracy score")

sns.barplot(x = algorithms, y = scores)
print('Accuracies: logReg:' ,scores[0], 'Random Forest:' ,scores[1],  'K-Nearest-Neighbors:' ,scores[2],  'Neural Network:' ,scores[3])

In [None]:
sns.set(rc={'figure.figsize':(15,8)})
plt.xlabel("Algorithms")
plt.ylabel("Count of false negatives")

sns.barplot(x = algorithms, y = fales_negatives_count)

#7. Lessons Learnt and Conclusions

* Whether heart disease is present can be predicted with an accuracy of around 84%.
* The baseline models produce fewer false positives than false negatives. For this task, the other way around would be better, because a missed heart disease is the more serious error.
* None of the models achieves an accuracy significantly above 85%. Since all models reach similar accuracies and I have not found better results on Kaggle, this is probably close to the best possible result for this data set.
* I think the data set is too small to get better results.
* Since the data consists of human measurements and subjective values such as chest pain, the data set may be inconsistent, which makes good results hard to achieve.
* After some research, I also found that other attributes are important in the development of heart disease, for example smoking, obesity, stress and alcohol consumption. This data would certainly also help with detection.
* In the beginning, I hypothesized that high cholesterol and blood pressure influence heart disease. This data set did not show that.



