# Heart Disease UCI

Source: https://www.kaggle.com/ronitf/heart-disease-uci

## Variables

1. age: The person's age in years
2. sex: The person's sex (1 = male, 0 = female)
3. cp: The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)
4. trestbps: The person's resting blood pressure (mm Hg on admission to the hospital)
5. chol: The person's cholesterol measurement in mg/dl
6. fbs: The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
7. restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
8. thalach: The person's maximum heart rate achieved
9. exang: Exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot)
11. slope: The slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)
12. ca: The number of major vessels colored by fluoroscopy (0-4)
13. thal: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
14. target: Heart disease (0 = no, 1 = yes)

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
import torch.nn as nn
import torch

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
%matplotlib inline

## Load the data

In [3]:
data = pd.read_csv('./data/heart.csv')

## Pre-process the data

1. Check for missing data - none found
2. Drop the irrelevant predictors - unnecessary in this case
3. Ensure data types are correct
4. Encode the categorical columns
5. Normalize the continuous columns
6. Put the categorical and continuous columns back together
7. Split the data into a training set and a validation set
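
The checks and transformations listed above can be sketched on a small stand-in frame. The rows below are made up for illustration and are not taken from `heart.csv`; only the column names come from the variable list, and `toy` is a hypothetical name.

```python
import pandas as pd

# Hypothetical stand-in rows; the notebook itself reads ./data/heart.csv.
toy = pd.DataFrame({
    'age':    [63, 37, 41, 56],
    'chol':   [233, 250, 204, 236],
    'cp':     [3, 2, 1, 1],
    'target': [1, 1, 0, 0],
})

# Missing-data check: count missing values per column.
missing_per_col = toy.isnull().sum()
print(missing_per_col)

# Data-type check: confirm every column was parsed as a numeric dtype.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in toy.dtypes)
print(all_numeric)

# Categorical encoding: one indicator column per observed level of cp.
cp_dummies = pd.get_dummies(toy['cp'], prefix='cp')
print(list(cp_dummies.columns))

# Normalization: zero mean and unit (sample) standard deviation.
chol_std = (toy['chol'] - toy['chol'].mean()) / toy['chol'].std()
print(chol_std.round(3).tolist())
```

On the real data, the same calls would be applied to `data` rather than `toy`.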

In [4]:
# Categorical columns
cat_cols = ['cp', 'restecg', 'slope', 'thal']

# Option 1:  one-hot encode categorical variables
def onehot_encoding(data, cat_cols):
    df_cat   = []
    for col in cat_cols:
        df_cat.append(pd.get_dummies(data[col], prefix = col))
    df_cat   = pd.concat(df_cat, axis = 1)
    return df_cat
df_cat = onehot_encoding(data, cat_cols)

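# Option 2: embedding (replaces the one-hot df_cat built above)
# Note: the embedding weights are randomly initialized and are not trained here,
# so each category level is mapped to a fixed random vector.
def embedding(data, cat_cols):
    # number of distinct levels per categorical column;
    # assumes the levels are coded 0..n-1, as nn.Embedding requires
    cat_szs = [len(data[col].unique()) for col in cat_cols]
    # embedding width: half the number of levels (rounded up), capped at 50
    emb_szs = [(size, min(50, (size+1)//2)) for size in cat_szs]
    embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni,nf in emb_szs])
    embeddingz = []
    for i,e in enumerate(embeds):
        cat = torch.tensor(data[cat_cols[i]].values, dtype = torch.int64)
        embeddingz.append(e(cat))
    df_cat = torch.cat(embeddingz, 1)
    df_cat = pd.DataFrame(df_cat.detach().numpy(), index = data.index,
                          columns = ['embedded_{}'.format(i) for i in range(df_cat.shape[1])])
    return df_cat, emb_szs

df_cat, emb_szs = embedding(data, cat_cols)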

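# continuous columns: everything except cat_cols and the target
# (this also standardizes the binary flags sex, fbs and exang)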
df_cont = data.drop(cat_cols + ['target'], axis = 1)
df_cont = (df_cont - df_cont.mean()) / df_cont.std()

# put back together the categorical and continuous columns
X_all = pd.concat([df_cont, df_cat], axis = 1).values

# make sure the target variable is integer, in order to represent discrete classes
y_all = data['target'].values.astype(int)

# train/test split
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size = 0.2, random_state = 0)

# no need to convert to torch tensors: scikit-learn and XGBoost accept NumPy arrays
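
The cell above standardizes the columns with means and standard deviations computed on the full data set, before the train/test split. A common alternative is to compute those statistics on the training rows only, so that no information about the test rows enters pre-processing. The sketch below shows that variant on synthetic data; `X_demo`, `y_demo`, and their shapes are assumptions for illustration, and this is not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the continuous columns (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Split first, then fit the scaler on the training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)

X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)  # reuses the training mean and std

print(X_tr_std.mean(axis=0).round(6))
print(X_tr_std.std(axis=0).round(6))
```

Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas `DataFrame.std()` in the cell above uses the sample standard deviation (`ddof=1`), so the two approaches differ by a small constant factor.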

## Random Forest Training

In [7]:
# n_estimators      - number of trees in the forest
# criterion         - 'gini' or 'entropy'; measures the quality of a split
# max_depth         - maximum depth of each tree (None = no limit)
# min_samples_split - minimum number of samples required to split an internal node
# min_samples_leaf  - minimum number of samples required at a leaf node
# oob_score         - whether to also compute an out-of-bag accuracy estimate
reg = RandomForestClassifier(n_estimators = 200, criterion = 'entropy',
                             random_state = 0, max_depth = None,
                             min_samples_split = 2, min_samples_leaf = 1, oob_score = True)
reg.fit(X_train, y_train)
# score() returns the mean accuracy on the test set (the "CE" label is not a cross-entropy)
print("CE = {}".format(reg.score(X_test, y_test)))
y_val = reg.predict(X_test)

# evaluate the model
rows = 50
correct = 0
print(f'{"MODEL OUTPUT":26}  Y_TEST')
for i in range(rows):
    print(f'{str(y_val[i]):26} {y_test[i]:^7}')
    if y_val[i] == y_test[i]:
        correct += 1
print(f'\n{correct} out of {rows} = {100*correct/rows:.2f}% correct')

CE = 0.8524590163934426
MODEL OUTPUT                Y_TEST
0                             0   
1                             1   
1                             0   
0                             0   
0                             1   
1                             0   
0                             0   
0                             0   
0                             0   
0                             0   
1                             1   
1                             1   
0                             0   
1                             1   
1                             1   
1                             1   
0                             1   
1                             1   
0                             0   
1                             1   
1                             1   
0                             0   
0                             0   
0                             0   
1                             1   
0                             0   
0                             0
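
The manual loop above counts matches row by row. The same tally, together with a breakdown of the error types, can be obtained from `sklearn.metrics`. The label arrays below are short made-up examples, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions, for illustration only.
y_true_demo = np.array([0, 1, 0, 0, 1, 0, 1, 1])
y_pred_demo = np.array([0, 1, 1, 0, 0, 0, 1, 1])

# Fraction of predictions that match the true labels.
acc = accuracy_score(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true_demo, y_pred_demo)

print(acc)  # 0.75
print(cm)
```

In the notebook, `y_test` and `y_val` would take the place of the two demo arrays.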

In [15]:
print(reg.estimators_[:10])

[DecisionTreeRegressor(max_features='auto', random_state=209652396), DecisionTreeRegressor(max_features='auto', random_state=398764591), DecisionTreeRegressor(max_features='auto', random_state=924231285), DecisionTreeRegressor(max_features='auto', random_state=1478610112), DecisionTreeRegressor(max_features='auto', random_state=441365315), DecisionTreeRegressor(max_features='auto', random_state=1537364731), DecisionTreeRegressor(max_features='auto', random_state=192771779), DecisionTreeRegressor(max_features='auto', random_state=1491434855), DecisionTreeRegressor(max_features='auto', random_state=1819583497), DecisionTreeRegressor(max_features='auto', random_state=530702035)]


In [16]:
print(reg.oob_score_)

0.5022959612068372


In [24]:
reg.classes_

array([0, 1])
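
With `oob_score = True`, a fitted `RandomForestClassifier` exposes `oob_score_`, the accuracy measured on the samples each tree did not see in its bootstrap sample. It lies between 0 and 1 and can serve as a validation estimate without a held-out set. A minimal sketch on synthetic data follows; the data from `make_classification` and the name `clf` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification problem (hypothetical data).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

clf = RandomForestClassifier(n_estimators=200, criterion='entropy',
                             oob_score=True, random_state=0)
clf.fit(X_demo, y_demo)

# Accuracy on out-of-bag samples; no separate validation set is needed.
print(clf.oob_score_)

# Each fitted member of the ensemble is a DecisionTreeClassifier.
print(type(clf.estimators_[0]).__name__)

# Class labels seen during fitting.
print(clf.classes_)
```

The `DecisionTreeRegressor` entries in the output above suggest that those cells were last run with a regressor; rerunning them with the classifier would be expected to list `DecisionTreeClassifier` objects instead.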

## XGBoost training

In [8]:
# n_estimators - number of boosting rounds, or number of decision trees to build
# objective - loss function; 'binary:logistic' for binary classification, 'multi:softmax' for multiclass classification
# max_depth - maximum depth of each regression tree
# eta - learning rate
# gamma - minimum loss reduction required to make a further partition on a leaf node of the tree
# subsample - fraction of samples to be used for fitting each tree
# colsample_bytree - fraction of features to be used for each tree
reg = XGBClassifier(n_estimators=500, max_depth=7, eta=0.1, gamma = 0, 
                    objective = 'binary:logistic', 
                    subsample=0.6, colsample_bytree=0.5, random_state = 333,
                    use_label_encoder = False)
reg.fit(X_train, y_train)
# score() returns the mean accuracy on the test set (the "CE" label is not a cross-entropy)
print("CE = {}".format(reg.score(X_test, y_test)))
y_val = reg.predict(X_test)

# evaluate the model
rows = 50
correct = 0
print(f'{"MODEL OUTPUT":26}  Y_TEST')
for i in range(rows):
    print(f'{str(y_val[i]):26} {y_test[i]:^7}')
    if y_val[i] == y_test[i]:
        correct += 1
print(f'\n{correct} out of {rows} = {100*correct/rows:.2f}% correct')

CE = 0.8524590163934426
MODEL OUTPUT                Y_TEST
0                             0   
1                             1   
1                             0   
0                             0   
0                             1   
0                             0   
0                             0   
0                             0   
0                             0   
0                             0   
1                             1   
1                             1   
0                             0   
1                             1   
1                             1   
1                             1   
0                             1   
1                             1   
0                             0   
1                             1   
1                             1   
0                             0   
0                             0   
0                             0   
1                             1   
1                             0   
0                             0
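
Both models report the same accuracy on a single 20% test split, which makes them hard to tell apart. k-fold cross-validation gives an estimate that depends less on one particular split. The sketch below uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier` so that it runs without `xgboost` installed; both choices are assumptions for illustration, not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem (hypothetical data).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

models = {
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=0),
    # Stand-in for XGBClassifier; swap in the real estimator if xgboost is available.
    'gradient_boosting': GradientBoostingClassifier(n_estimators=100, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold stratified CV; each score is the accuracy on one held-out fold.
    cv_scores[name] = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    print(f'{name}: {cv_scores[name].mean():.3f} +/- {cv_scores[name].std():.3f}')
```

Comparing the fold-to-fold spread with the gap between the two means indicates whether a difference between models is larger than the noise from the split.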