# Machine Learning 1: Heart Disease

## Introduction

In this short experiment I work with a relatively small dataset on heart disease. The dataset was sourced from kaggle.com and was originally collected in 1988 by researchers from UC Irvine in California, using data from hospitals in Budapest, Zurich, Basel and Cleveland. It originally contained 75 features, which were later cut down to the 14 most useful ones for analysis. Many sensible correlations can be expected among these features, as they all relate to the strength and well-being of the heart.

Despite the number of features, the dataset is still challenging to work with: some seemingly continuous variables are actually categorical, and the set has only 301 complete data items. This means we have to select enough features to give the model sufficient information, while making sure each selected variable is actually meaningful when interpreted as continuous.
    
Now to introduce the dataset:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Readable replacements for the original feature names.
column_names = ['Age', 'Sex', 'Chest Pain Level', 'Resting Blood Pressure', 'Cholesterol',
                'Fasting Blood Sugar > 120', 'Resting ECG Result', 'Max Heart Rate',
                'Exercise Angina', 'Exercise ST Depression', 'Slope Exercise ST',
                'Major Vessels Flourosopy', 'Thalassemia', 'Heart Disease Level']
# Raw string so the backslashes in the Windows path are not read as escape sequences.
df = pd.read_excel(r"E:\Data_mappie\heart_disease_ml1/processed.cleveland.xlsx", names=column_names)[1:]

df.head(25)

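|  | Age | Sex | Chest Pain Level | Resting Blood Pressure | Cholesterol | Fasting Blood Sugar > 120 | Resting ECG Result | Max Heart Rate | Exercise Angina | Exercise ST Depression | Slope Exercise ST | Major Vessels Flourosopy | Thalassemia | Heart Disease Level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | 7 | 1 |
| 2 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | 3 | 0 |
| 3 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | 3 | 0 |
| 4 | 56 | 1 | 2 | 120 | 236 | 0 | 0 | 178 | 0 | 0.8 | 1 | 0 | 3 | 0 |
| 5 | 62 | 0 | 4 | 140 | 268 | 0 | 2 | 160 | 0 | 3.6 | 3 | 2 | 3 | 3 |
| 6 | 57 | 0 | 4 | 120 | 354 | 0 | 0 | 163 | 1 | 0.6 | 1 | 0 | 3 | 0 |
| 7 | 63 | 1 | 4 | 130 | 254 | 0 | 2 | 147 | 0 | 1.4 | 2 | 1 | 7 | 2 |
| 8 | 53 | 1 | 4 | 140 | 203 | 1 | 2 | 155 | 1 | 3.1 | 3 | 0 | 7 | 1 |
| 9 | 57 | 1 | 4 | 140 | 192 | 0 | 0 | 148 | 0 | 0.4 | 2 | 0 | 6 | 0 |
| 10 | 56 | 0 | 2 | 140 | 294 | 0 | 2 | 153 | 0 | 1.3 | 2 | 0 | 3 | 0 |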


As you can see, the set contains 5 truly continuous features, 3 binary features and 6 other categorical ones. I have replaced the original feature names with more readable ones. The first 7 features mostly speak for themselves; 'Max Heart Rate', angina and the two ST features are results from measurements after the subject exercised; then we have the number of major vessels (0 to 3) colored by fluoroscopy, the state of thalassemia in three levels, and finally the general severity of heart disease in the patient.

## The Experiments

As recommended for the dataset, I will use the severity of heart disease as the target feature, since that is also the most practically useful feature to predict in a medical context. Each of the properly continuous variables seems useful for the analysis, and I will expand this selection with the chest pain level, which I treat as having a continuous character, and the ST slope after exercise, where 1 is upsloping, 2 is flat, and 3 is downsloping, which also suggests an ordered scale. For clarification, the ST segment is part of the electrocardiogram, and in a healthy heart it is meant to slope upward within a short amount of time during exercise.
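
Treating coded categories as numbers is a modelling choice. As a hedged illustration of the alternative, the sketch below one-hot encodes the two coded columns on a small made-up sample, so that no ordering between the codes is implied. The rows and the names `sample` and `one_hot` are hypothetical; only the column names come from this notebook. It is not the approach used in the experiments that follow.

```python
import pandas as pd

# Hypothetical mini-sample; only the column names mirror the notebook.
sample = pd.DataFrame({
    'Age': [67, 37, 41, 56],
    'Chest Pain Level': [4, 3, 2, 2],
    'Slope Exercise ST': [2, 3, 1, 1],
})

# One-hot alternative: every code gets its own 0/1 column,
# so the model no longer assumes that code 4 is "twice" code 2.
one_hot = pd.get_dummies(
    sample,
    columns=['Chest Pain Level', 'Slope Exercise ST'],
    dtype=int,
)

print(one_hot.columns.tolist())
print(one_hot)
```

The price of this encoding is more columns for the same 301 rows, which is one reason the numeric treatment can still be a reasonable choice on a dataset this small.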

Now for some general setup, including a small simplification of the heart disease categories to avoid having too small a sample size for the models:

In [3]:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Collapse disease levels 1-4 into a single "disease present" class (1).
# A vectorised clip covers every row and avoids chained assignment.
df['Heart Disease Level'] = df['Heart Disease Level'].clip(upper=1)
print(df['Heart Disease Level'].value_counts())
y = np.array(df['Heart Disease Level'])
X = np.array(df[['Age', 'Resting Blood Pressure','Cholesterol', 'Max Heart Rate', 'Exercise ST Depression', 'Chest Pain Level', 'Slope Exercise ST']])
y.shape, X.shape

0    163
1    138
Name: Heart Disease Level, dtype: int64


((301,), (301, 7))

First, we run a logistic regression model on the data at several tolerance levels and cross-validate each one:

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, shuffle=True, random_state=42, stratify=y)
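# Try tolerances from 1e-5 up to 1e4 (range(-5, 5) yields the exponents -5..4).
for n in range(-5, 5):
    print('\nTolerance: ' + str(10**n))
    lr = LogisticRegression(tol=10.**n)
    artery = make_pipeline(StandardScaler(), lr)
    # cross_val_score clones and fits the pipeline on each fold, so no separate fit is needed here.
    print(cross_val_score(estimator=artery, X=X_train, y=y_train, cv=10, n_jobs=-1))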


Tolerance: 1e-05
[0.66666667 0.85714286 0.61904762 0.71428571 0.71428571 0.66666667
 0.66666667 0.9047619  0.66666667 0.71428571]

Tolerance: 0.0001
[0.66666667 0.85714286 0.61904762 0.71428571 0.71428571 0.66666667
 0.66666667 0.9047619  0.66666667 0.71428571]

Tolerance: 0.001
[0.66666667 0.85714286 0.61904762 0.71428571 0.71428571 0.66666667
 0.66666667 0.9047619  0.66666667 0.71428571]

Tolerance: 0.01
[0.66666667 0.85714286 0.61904762 0.71428571 0.71428571 0.66666667
 0.66666667 0.9047619  0.66666667 0.71428571]

Tolerance: 0.1
[0.66666667 0.85714286 0.61904762 0.71428571 0.71428571 0.66666667
 0.66666667 0.9047619  0.66666667 0.71428571]

Tolerance: 1
[0.66666667 0.85714286 0.61904762 0.71428571 0.71428571 0.66666667
 0.66666667 0.9047619  0.66666667 0.71428571]

Tolerance: 10
[0.61904762 0.85714286 0.80952381 0.66666667 0.66666667 0.71428571
 0.66666667 0.95238095 0.71428571 0.66666667]

Tolerance: 100
[0.52380952 0.52380952 0.52380952 0.52380952 0.52380952 0.52380952
 0.571428
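
Ten per-fold scores per tolerance are hard to compare by eye, so a summary of mean and standard deviation can help. The sketch below shows one way to do that. It runs on synthetic stand-in data from `make_classification` (an assumption, sized like the training split), so its numbers will not match the output above; `X_demo`, `y_demo` and `summary` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (assumption: 210 rows, 7 features).
X_demo, y_demo = make_classification(
    n_samples=210, n_features=7, n_informative=4, random_state=42
)

summary = {}
for n in range(-5, 3):
    tol = 10.0 ** n
    pipe = make_pipeline(StandardScaler(), LogisticRegression(tol=tol))
    scores = cross_val_score(pipe, X_demo, y_demo, cv=10)
    # Keep one (mean, std) pair per tolerance instead of ten raw fold scores.
    summary[tol] = (scores.mean(), scores.std())
    print(f"tol={tol:g}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```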

The scores are identical for every tolerance up to 1 and only start to change from 10 upward, where they eventually degrade, so let's pick a safe value below that point:

In [5]:
lr = LogisticRegression(tol=0.1)
artery = make_pipeline(StandardScaler(), lr)
artery.fit(X_train, y_train)
predictions = artery.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))

[[40  9]
 [11 31]]
0.7802197802197802
              precision    recall  f1-score   support

           0       0.78      0.82      0.80        49
           1       0.78      0.74      0.76        42

    accuracy                           0.78        91
   macro avg       0.78      0.78      0.78        91
weighted avg       0.78      0.78      0.78        91
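
The figures in the report can be reproduced from the confusion matrix alone. The short sketch below does that and also computes the accuracy of a model that always predicts the larger class, as a reference point. The variable names are illustrative; the four counts are copied from the output above.

```python
import numpy as np

# Confusion matrix as printed above: rows = true class, columns = predicted class.
cm = np.array([[40, 9],
               [11, 31]])

total = cm.sum()                          # 91 test samples
accuracy = np.trace(cm) / total           # (40 + 31) / 91
baseline = cm.sum(axis=1).max() / total   # always predict the larger true class: 49 / 91
precision_1 = cm[1, 1] / cm[:, 1].sum()   # 31 / 40
recall_1 = cm[1, 1] / cm[1].sum()         # 31 / 42

print(f"accuracy={accuracy:.3f}, majority-class baseline={baseline:.3f}")
print(f"class 1: precision={precision_1:.3f}, recall={recall_1:.3f}")
```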



Class 0 accounts for 49 of the 91 test samples, so a model that guesses 0 for everything would already score about 54%; against that baseline the accuracy of 78% isn't too impressive. Let's see if we get a better result with a perceptron:

In [6]:
artery = make_pipeline(StandardScaler(), Perceptron(eta0=0.1, max_iter=100, early_stopping=True))
artery.fit(X_train, y_train)
predictions = artery.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))

[[26 23]
 [10 32]]
0.6373626373626373
              precision    recall  f1-score   support

           0       0.72      0.53      0.61        49
           1       0.58      0.76      0.66        42

    accuracy                           0.64        91
   macro avg       0.65      0.65      0.64        91
weighted avg       0.66      0.64      0.63        91



The perceptron does even worse at 64% accuracy, not far above what always guessing class 0 would achieve. Finally, let's try support vector machines with a parameter grid search:

In [7]:
from sklearn.svm import SVC

artery = make_pipeline(StandardScaler(), SVC())
param_grid = [{'svc__C': [0.01, 0.1, 1.0, 10.0, 100.0],
               'svc__kernel': ['linear']},
              {'svc__C': [0.01, 0.1, 1.0, 10.0, 100.0],
               'svc__gamma': [0.01, 0.1, 1.0, 10.0, 100.0],
               'svc__kernel': ['rbf']}]

gs = GridSearchCV(estimator=artery, param_grid=param_grid, scoring='accuracy', cv=10, refit=True)
gs = gs.fit(X_train, y_train)
gs.best_params_

{'svc__C': 0.01, 'svc__kernel': 'linear'}
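
Because the search is created with `refit=True`, the fitted search object already holds a pipeline retrained on the full training split with the winning parameters, available as `best_estimator_`. Using it directly avoids retyping the parameters by hand. The sketch below illustrates this on synthetic stand-in data (an assumption, so its scores say nothing about the heart disease data) with a reduced grid; `X_demo`, `search` and `best_model` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), shaped like the notebook's X and y.
X_demo, y_demo = make_classification(
    n_samples=301, n_features=7, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

pipe = make_pipeline(StandardScaler(), SVC())
grid = [{'svc__C': [0.01, 0.1, 1.0], 'svc__kernel': ['linear']}]
search = GridSearchCV(pipe, param_grid=grid, scoring='accuracy', cv=10, refit=True)
search.fit(X_tr, y_tr)

# best_estimator_ is the pipeline refitted on all of X_tr with best_params_.
best_model = search.best_estimator_
test_accuracy = accuracy_score(y_te, best_model.predict(X_te))
print(search.best_params_, round(test_accuracy, 3))
```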

In [8]:
# Linear SVC with C set by hand to 0.1 (the grid search above reported C=0.01).
artery = make_pipeline(StandardScaler(), SVC(C=0.1, kernel='linear'))
artery.fit(X_train, y_train)
predictions = artery.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))

[[42  7]
 [12 30]]
0.7912087912087912
              precision    recall  f1-score   support

           0       0.78      0.86      0.82        49
           1       0.81      0.71      0.76        42

    accuracy                           0.79        91
   macro avg       0.79      0.79      0.79        91
weighted avg       0.79      0.79      0.79        91



## Results and Discussion

From this data it would appear that support vector machines give the best result by a small margin: 72 correct test predictions out of 91, against 71 for logistic regression. It is, however, hard to draw conclusions from this, since many different approaches could have changed the outcome. Although I did not think it useful to show this in separate cells, I have tried a variety of different groupings of the target variable to get a stronger result, but any more than two categories just caused the model to ignore the middle one. I have also tried omitting some predictor variables, without any improvement.

Overall, I found the result somewhat underwhelming. One would assume that there would be some predictive power here, considering all these patients presumably went to get their heart checked out because they were noticing symptoms. It seems the human body is too complex to be properly understood from only 301 data items, and any further algorithm testing will require bigger datasets for this kind of research.