Variables/Features

1. Age (in years)

2. Sex 
    - female = 0
    - male = 1

3. cp - chest pain type
    - typical angina = 1
    - atypical angina = 2
    - non-anginal pain = 3
    - asymptomatic = 4

4. trestbps - resting blood pressure on admission to the hospital (in mmHg)

5. chol - serum cholesterol (in mg/dL)

6. fbs - fasting blood sugar
    - > 120 mg/dL = 1
    - <= 120 mg/dL = 0

7. restecg - resting electrocardiographic results
    - normal = 0
    - having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) = 1
    - showing probable or definite left ventricular hypertrophy by Estes' criteria = 2

8. thalach - maximum heart rate achieved

9. exang - exercise induced angina
    - yes = 1
    - no  = 0

10. oldpeak - ST depression induced by exercise relative to rest

11. slope - slope of the peak exercise ST segment
    - upsloping = 1
    - flat = 2
    - downsloping = 3

12. ca - number of major vessels (0-3) colored by fluoroscopy

13. thal - thallium stress test result
    - normal = 3
    - fixed defect = 6
    - reversible defect = 7

14. num - diagnosis of heart disease --> what we want to predict (CLASS)
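
The coded values listed above can be mapped back to readable labels when inspecting the data. The following is a minimal sketch under stated assumptions: `CODE_LABELS` is a hypothetical helper dictionary (not something shipped with the dataset), and the two sample rows reuse the sex, cp, exang and slope values of the first two records of the Cleveland file.

```python
import pandas as pd

# Code-to-label mappings transcribed from the variable list.
# CODE_LABELS is an illustrative name, not part of the dataset or of pandas.
CODE_LABELS = {
    "sex": {0: "female", 1: "male"},
    "cp": {
        1: "typical angina",
        2: "atypical angina",
        3: "non-anginal pain",
        4: "asymptomatic",
    },
    "exang": {0: "no", 1: "yes"},
    "slope": {1: "upsloping", 2: "flat", 3: "downsloping"},
}

# Values of the first two records for the four coded columns above.
sample = pd.DataFrame(
    {
        "sex": [1.0, 1.0],
        "cp": [1.0, 4.0],
        "exang": [0.0, 1.0],
        "slope": [3.0, 2.0],
    }
)

# The file stores the codes as floats, so cast to int before the lookup.
decoded = sample.copy()
for column, labels in CODE_LABELS.items():
    decoded[column] = sample[column].astype(int).map(labels)

print(decoded)
```

Decoding like this is meant for inspection and plots only; a model would still be trained on the numeric codes.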

In [None]:
import sys 

Import libraries

In [24]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

import sklearn 
from sklearn import model_selection
from sklearn.metrics import classification_report, accuracy_score

import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from keras.utils import to_categorical


Loading data

In [27]:
# Heart disease dataset
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

# names of each column in our pandas DataFrame
fields = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeeak", "slope", "ca", "thal", "class"]


In [None]:
# read csv
cleveland_data = pd.read_csv(url, names = fields)

In [None]:
cleveland_data.head() # shows the first few rows of the DataFrame

    age  sex   cp  trestbps   chol  fbs  restecg  thalach  exang  oldpeeak  slope   ca  thal  class
0  63.0  1.0  1.0     145.0  233.0  1.0      2.0    150.0    0.0       2.3    3.0  0.0   6.0      0
1  67.0  1.0  4.0     160.0  286.0  0.0      2.0    108.0    1.0       1.5    2.0  3.0   3.0      2
2  67.0  1.0  4.0     120.0  229.0  0.0      2.0    129.0    1.0       2.6    2.0  2.0   7.0      1
3  37.0  1.0  3.0     130.0  250.0  0.0      0.0    187.0    0.0       3.5    3.0  0.0   3.0      0
4  41.0  0.0  2.0     130.0  204.0  0.0      2.0    172.0    0.0       1.4    1.0  0.0   3.0      0


In [None]:
# print the shape of the DataFrame to see how many cases (examples) we have
print("Shape of DataFrame: {}".format(cleveland_data.shape))
print(cleveland_data.loc[1:2]) # print the rows labelled 1 and 2

Shape of DataFrame: (303, 14)
    age  sex   cp  trestbps   chol  ...  oldpeeak  slope   ca  thal  class
1  67.0  1.0  4.0     160.0  286.0  ...       1.5    2.0  3.0   3.0      2
2  67.0  1.0  4.0     120.0  229.0  ...       2.6    2.0  2.0   7.0      1

[2 rows x 14 columns]


There are 303 cases and 14 columns: 13 features plus the class label.

In [None]:
print(cleveland_data.loc[280:]) 

      age  sex   cp  trestbps   chol  ...  oldpeeak  slope   ca  thal  class
280  57.0  1.0  4.0     110.0  335.0  ...       3.0    2.0  1.0   7.0      2
281  47.0  1.0  3.0     130.0  253.0  ...       0.0    1.0  0.0   3.0      0
282  55.0  0.0  4.0     128.0  205.0  ...       2.0    2.0  1.0   7.0      3
283  35.0  1.0  2.0     122.0  192.0  ...       0.0    1.0  0.0   3.0      0
284  61.0  1.0  4.0     148.0  203.0  ...       0.0    1.0  1.0   7.0      2
285  58.0  1.0  4.0     114.0  318.0  ...       4.4    3.0  3.0   6.0      4
286  58.0  0.0  4.0     170.0  225.0  ...       2.8    2.0  2.0   6.0      2
287  58.0  1.0  2.0     125.0  220.0  ...       0.4    2.0    ?   7.0      0
288  56.0  1.0  2.0     130.0  221.0  ...       0.0    1.0  0.0   7.0      0
289  56.0  1.0  2.0     120.0  240.0  ...       0.0    3.0  0.0   3.0      0
290  67.0  1.0  3.0     152.0  212.0  ...       0.8    2.0  0.0   7.0      1
291  55.0  0.0  2.0     132.0  342.0  ...       1.2    1.0  0.0   3.0      0

Data Preparation
- Some fields contain "?", which marks a missing value (for example, ca in row 287).
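
Before removing anything, it can help to count how many "?" entries each column holds. A minimal sketch, assuming a small stand-in frame built from the slope, ca and thal values of rows 286 to 288; in the notebook the same call would be applied to `cleveland_data`.

```python
import pandas as pd

# Stand-in for cleveland_data: three columns of rows 286-288.
# "ca" and "thal" hold strings because a "?" in a column keeps pandas
# from parsing that column as numbers.
sample = pd.DataFrame(
    {
        "slope": [2.0, 2.0, 1.0],
        "ca": ["2.0", "?", "0.0"],
        "thal": ["6.0", "7.0", "7.0"],
    },
    index=[286, 287, 288],
)

# isin(["?"]) flags every cell equal to "?"; summing counts them per column.
missing_per_column = sample.isin(["?"]).sum()
print(missing_per_column)

# Rows that contain at least one "?".
rows_with_missing = sample[sample.isin(["?"]).any(axis=1)]
print(rows_with_missing)
```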

In [40]:
data = cleveland_data[~cleveland_data.isin(['?'])] # replace "?" entries with NaN
data.loc[280:]


      age  sex   cp  trestbps   chol  fbs  restecg  thalach  exang  oldpeeak  slope   ca  thal  class
280  57.0  1.0  4.0     110.0  335.0  0.0      0.0    143.0    1.0       3.0    2.0  1.0   7.0      2
281  47.0  1.0  3.0     130.0  253.0  0.0      0.0    179.0    0.0       0.0    1.0  0.0   3.0      0
282  55.0  0.0  4.0     128.0  205.0  0.0      1.0    130.0    1.0       2.0    2.0  1.0   7.0      3
283  35.0  1.0  2.0     122.0  192.0  0.0      0.0    174.0    0.0       0.0    1.0  0.0   3.0      0
284  61.0  1.0  4.0     148.0  203.0  0.0      0.0    161.0    0.0       0.0    1.0  1.0   7.0      2
285  58.0  1.0  4.0     114.0  318.0  0.0      1.0    140.0    0.0       4.4    3.0  3.0   6.0      4
286  58.0  0.0  4.0     170.0  225.0  1.0      2.0    146.0    1.0       2.8    2.0  2.0   6.0      2
287  58.0  1.0  2.0     125.0  220.0  0.0      0.0    144.0    0.0       0.4    2.0  NaN   7.0      0
288  56.0  1.0  2.0     130.0  221.0  0.0      2.0    163.0    0.0       0.0    1.0  0.0   7.0      0
289  56.0  1.0  2.0     120.0  240.0  0.0      0.0    169.0    0.0       0.0    3.0  0.0   3.0      0
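
The same substitution can probably be done while loading the file, using the `na_values` parameter of `pd.read_csv`. The sketch below is an alternative to the masking step, not the approach this notebook takes. It reads two inline records (rows 286 and 287, with the "?" in ca) from a string instead of the URL so that it runs without network access.

```python
import io
import pandas as pd

fields = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
          "thalach", "exang", "oldpeeak", "slope", "ca", "thal", "class"]

# Two records in the layout of processed.cleveland.data;
# the second one has "?" in the ca field.
raw_text = (
    "58.0,0.0,4.0,170.0,225.0,1.0,2.0,146.0,1.0,2.8,2.0,2.0,6.0,2\n"
    "58.0,1.0,2.0,125.0,220.0,0.0,0.0,144.0,0.0,0.4,2.0,?,7.0,0\n"
)

# na_values="?" tells pandas to parse "?" as NaN, so ca stays numeric.
sample = pd.read_csv(io.StringIO(raw_text), names=fields, na_values="?")

print(sample["ca"])
print(sample["ca"].dtype)
```

Parsing "?" as NaN at load time has the side effect that the affected columns are read as floats rather than as text.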


In [45]:
data = data.dropna(axis=0) # drop rows that contain NaN values
data.loc[280:]

      age  sex   cp  trestbps   chol  fbs  restecg  thalach  exang  oldpeeak  slope   ca  thal  class
280  57.0  1.0  4.0     110.0  335.0  0.0      0.0    143.0    1.0       3.0    2.0  1.0   7.0      2
281  47.0  1.0  3.0     130.0  253.0  0.0      0.0    179.0    0.0       0.0    1.0  0.0   3.0      0
282  55.0  0.0  4.0     128.0  205.0  0.0      1.0    130.0    1.0       2.0    2.0  1.0   7.0      3
283  35.0  1.0  2.0     122.0  192.0  0.0      0.0    174.0    0.0       0.0    1.0  0.0   3.0      0
284  61.0  1.0  4.0     148.0  203.0  0.0      0.0    161.0    0.0       0.0    1.0  1.0   7.0      2
285  58.0  1.0  4.0     114.0  318.0  0.0      1.0    140.0    0.0       4.4    3.0  3.0   6.0      4
286  58.0  0.0  4.0     170.0  225.0  1.0      2.0    146.0    1.0       2.8    2.0  2.0   6.0      2
288  56.0  1.0  2.0     130.0  221.0  0.0      2.0    163.0    0.0       0.0    1.0  0.0   7.0      0
289  56.0  1.0  2.0     120.0  240.0  0.0      0.0    169.0    0.0       0.0    3.0  0.0   3.0      0
290  67.0  1.0  3.0     152.0  212.0  0.0      2.0    150.0    0.0       0.8    2.0  0.0   7.0      1
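
Dropping the NaN rows does not change column types. Columns that originally contained "?" (ca is one of them) were most likely read as text, so they may still need converting to numbers before being passed to a model. A minimal sketch of that conversion, assuming a small stand-in frame; in the notebook the same call would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Stand-in for the frame after the "?" -> NaN substitution:
# "ca" holds strings plus one NaN, "age" is already numeric.
sample = pd.DataFrame(
    {"age": [58.0, 58.0, 56.0], "ca": ["2.0", np.nan, "0.0"]},
    index=[286, 287, 288],
)

# Drop the row with the missing value, as in the cell above.
sample = sample.dropna(axis=0)
print(sample.dtypes)  # inspect the types before the conversion

# Convert every column to a numeric dtype.
numeric = sample.apply(pd.to_numeric)
print(numeric.dtypes)  # inspect the types after the conversion
```

`pd.to_numeric` raises an error on values it cannot parse, which makes it a handy check that no "?" entries are left.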
