# Risk Factors and Prediction of Coronary Heart Disease

**Context**

Coronary heart disease (CHD) is a [major cause of death](https://www.nhs.uk/conditions/coronary-heart-disease/) in the UK and worldwide.
CHD is the term that describes what happens when the heart's blood supply is blocked or interrupted by a build-up of fatty substances in the coronary arteries. This build-up process, known as atherosclerosis, can be caused by lifestyle factors, such as smoking and regularly drinking excessive amounts of alcohol. The risk of atherosclerosis increases with conditions like high cholesterol, high blood pressure (hypertension) or diabetes.

The main symptom of coronary heart disease is chest pain (angina). CHD is diagnosed from a combination of risk assessment (medical history, family history and lifestyle) and testing. A number of different tests are used to diagnose heart-related problems, including:
    
    blood tests
    electrocardiogram (ECG)
    exercise stress tests (e.g. a treadmill test)
    coronary angiography (using dye and X-ray to detect coronary artery blockages)
    
Our aim in this analysis is to determine which of the given features, if any, are risk factors for CHD and to fit a predictive model to determine CHD status.

**The Dataset**

The UCI Heart Disease dataset contains records for 303 subjects; the recruitment process is unknown. The original dataset contained 76 attributes, but all published experiments use the subset of 14 listed below. There are several descriptions of these variables and their values online; the description below was settled upon after cross-referencing with the variables in the dataset. The "target" field refers to the presence or absence of heart disease in the patient.

**Attribute Information:**
        
        age: (age of subject)
        sex: (1 = male; 0 = female)
        cp: chest pain type
            0: typical angina
            1: atypical angina
            2: non-anginal pain
            3: asymptomatic
        trestbps: resting blood pressure (in mm Hg on admission to the hospital)
        chol: serum cholesterol in mg/dl 
        fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
        restecg: resting electrocardiographic results (‘ST’ relates to positions on the ECG plot)
            0: normal
            1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
            2: showing probable or definite left ventricular hypertrophy by Estes' criteria
            
        thalach: maximum heart rate achieved
        exang: exercise induced angina (1 = yes; 0 = no) 
        oldpeak: ST depression induced by exercise relative to rest (‘ST’ relates to positions on the ECG plot.)
        slope: the slope of the peak exercise ST segment
            0: upsloping
            1: flat
            2: downsloping
        ca: number of major vessels (0-3) colored by dye in angiography (i.e. number of clear vessels)
        thal: Results of the blood flow observed via the radioactive dye
            0: NULL (dropped from the dataset previously)
            1: normal blood flow
            2: fixed defect (no blood flow in some part of the heart)
            3: reversible defect (a blood flow is observed but it is not normal)
        target: diagnosis of heart disease (angiographic disease status) in any major vessel
            0: < 50% diameter narrowing
            1: > 50% diameter narrowing


**Acknowledgements**

Creators:

    Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
    University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
    University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
    V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

In [None]:
#Import Libraries:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import  train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import roc_auc_score, classification_report


In [None]:
#Read in data:
data = pd.read_csv("../input/heart-disease-uci/heart.csv")
data.head()

In [None]:
#The target has been wrongly classified in this particular dataset so we must first switch the labels around:
data['target'] = data['target'].replace({0:1,1:0})

# EDA

In [None]:
data.info()

The dataset consists of 303 observations with 13 features - all coded as integers or floats - and a binary target of the presence or absence of CHD. There are no missing values. There are 6 numeric features and 7 categorical features, which we will reclassify as object types. Some of the feature names are a little opaque, so we will rename them. We will also add in the categorical level labels.

In [None]:
#Rename columns:
data = data.rename(columns={'cp':'chest_pain','trestbps':'resting_bp','chol':'cholesterol','fbs':'fasting_bs',
                           'restecg':'resting_ecg','thalach':'max_heart_rate','exang':'ex_angina',
                           'oldpeak':'ST_depression','ca':'vessels','thal':'blood_flow'})

In [None]:
#Reclassify categorical variables as objects:
for col in ['sex','chest_pain','fasting_bs','resting_ecg', 'ex_angina','slope','blood_flow']:
    data[col] = data[col].astype('object')

#Replace categorical factor labels:
data['sex'] = data['sex'].replace({0:'female', 1:'male'})
data['chest_pain'] = data['chest_pain'].replace({0:'typical angina', 1:'atypical angina', 2:'non-anginal pain', 3:'asymptomatic'})
data['fasting_bs'] = data['fasting_bs'].replace({0:'under 120 mg/dl', 1:'over 120 mg/dl'})
data['resting_ecg'] = data['resting_ecg'].replace({0:'normal', 1:'wave abnormality', 2:'vent. hypertrophy'})
data['slope'] = data['slope'].replace({0:'upsloping', 1:'flat', 2:'downsloping'})
data['blood_flow'] = data['blood_flow'].replace({0:'NULL', 1:'normal flow', 2:'fixed defect', 3:'reversible defect'})

In [None]:
#Separate features and target:
data_y = data['target'].copy()
data_X = data.drop('target', axis=1).copy()

#Split data into training and test sets:
X, X_test, y, y_test = train_test_split(data_X,data_y,test_size=0.2,random_state=0)

## Numeric Variables

In [None]:
#Create lists of numeric and categorical variables:

num_vars = [col for col in X.columns if X[col].dtype in ['float64','int64']]
cat_vars = [col for col in X.columns if X[col].dtype=='object']

print("num_vars:\n", num_vars)
print("cat_vars:\n", cat_vars)

In [None]:
X[num_vars].describe()

In [None]:
#Histograms of numeric variables:
plt.subplots(2,3,figsize=(10,6))
plt.tight_layout()
j=1
for i in num_vars:
    skew = X[i].skew()
    plt.subplot(2,3,j)
    sns.histplot(X[i], bins=30)
    plt.title(i + ", skew="+ str(round(skew, 2)), fontsize=15)
    plt.xlabel("")
    j+=1

Age is approximately normally distributed, with a mean of 54.9 years (std = 8.9). Max_heart_rate and resting_bp are both moderately skewed, with medians of 153 bpm (IQR = 31.5) and 130 mmHg (IQR = 20) respectively; note the rounding of blood pressure values to units of ten. Cholesterol is very skewed, but this is driven by the outlier value over 500. Without this value the skew for cholesterol is a more moderate 0.5. Neither ST_depression (median = 0.8, IQR = 1.75) nor vessels (median = 1, IQR = 1) is normally distributed, and vessels is a discrete count of the number of clear vessels.
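
To illustrate why a single extreme value can dominate the skew while barely moving the median-based summaries, here is a minimal sketch on synthetic data. The values, the seed and the appended reading of 550 are all assumptions made for illustration and are not taken from the heart dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, roughly symmetric values standing in for a lab measurement (assumed, not real data)
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=245, scale=45, size=240))

# Append one extreme reading (arbitrary value chosen for illustration)
with_outlier = pd.concat([values, pd.Series([550.0])], ignore_index=True)

def iqr(s):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return s.quantile(0.75) - s.quantile(0.25)

skew_clean = values.skew()
skew_outlier = with_outlier.skew()

print("skew without / with outlier:", round(skew_clean, 2), round(skew_outlier, 2))
print("IQR without / with outlier:", round(iqr(values), 1), round(iqr(with_outlier), 1))
```

The skew rises noticeably once the extreme value is added, while the IQR hardly changes, which is why we report medians and IQRs for the skewed features.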

### Numeric Variables Relationship with Outcome and Each Other

In [None]:
#Create lists of continuous and discrete numeric variables:
cont_num = [col for col in num_vars if X[col].nunique()>5]
disc_num = [col for col in num_vars if col not in cont_num]

In [None]:
#Boxplots of continuous features by CHD status:
nrows=2
ncols=3
plt.subplots(nrows,ncols,figsize=(9,7))
plt.tight_layout()

j=1
for i in cont_num:
    plt.subplot(nrows,ncols,j)
    sns.boxplot(x=y,y=X[i], palette=['lightgrey','lightcoral'])
    plt.title(i)
    j+=1

Subjects with CHD are older with a lower max_heart_rate and higher ST_depression score. There are several outliers with one particularly extreme cholesterol value of over 500.

In [None]:
#Calculate CHD proportions by number of vessels (training set only):
table = pd.crosstab(X['vessels'], y, normalize='index')
table[[1,0]].plot.bar(stacked=True, color=['lightcoral','lightgrey'])
plt.title("CHD Proportion by Number of Vessels", fontsize=15)
table

The incidence of CHD rises with the number of clear vessels before dropping back again at 4 vessels, although there are only 4 subjects in this group. This outcome seems to contradict the definition of CHD. It is likely that these labels are misclassified, so caution must be used when interpreting these results.

In [None]:
#Plot correlation matrix of numeric variables:
corrmat = X[num_vars].corr(method='spearman')

#Mask from Seaborn tutorial:
mask = np.zeros_like(corrmat,dtype=bool)
mask[np.triu_indices_from(mask)]=True

plt.figure(figsize=(6,5))
sns.heatmap(corrmat,annot=True,mask=mask,cmap=sns.diverging_palette(240,10,as_cmap=True),vmin=-1,vmax=1)
plt.title("Feature Correlation Matrix",fontsize=20)
plt.show()

There is little correlation between the numeric features. The strongest correlation of -0.42 is between ST_depression and max_heart_rate, giving no cause for concern.

In [None]:
#Pairplots of features:
all_training = pd.concat([X[num_vars],y,],axis=1)
sns.pairplot(all_training, hue='target', height=1.5)
plt.show()

Notable is the cholesterol outlier visible in several plots.

## Categorical Variables

In [None]:
#Countplots of categorical variables:
plt.subplots(3,3,figsize=(11,11))
plt.tight_layout()

j=1
for i in cat_vars:
    plt.subplot(3,3,j)
    sns.countplot(x=X[i], palette='viridis')
    plt.title(i,fontsize=15)
    plt.xlabel("")
    j+=1

All categorical variables have more than one level, but there are only 4 observations in the resting_ecg left ventricular hypertrophy group. There are 2 NULL blood_flow values included which, according to the variable information, should have been removed beforehand. We will remove them later.

In [None]:
#Plot CHD by categorical variables:
for i in cat_vars:
    table = pd.crosstab(X[i],y)
    table[[1,0]].div(table.sum(1),axis=0).plot.bar(stacked=True,color=['lightcoral','lightgrey'])
    plt.title("CHD by " +i,fontsize=15)
    plt.ylabel("Proportion")

There are considerable differences in CHD rates across the levels of the categorical variables. The rate of CHD is higher in male subjects and in those with typical angina, left ventricular hypertrophy, exercise-induced angina, a flat slope, and either normal flow or a reversible defect.

# Feature Engineering
We will:
- Remove the 2 NULL blood flow entries,
- Remove the cholesterol outlier,
- Consider transforming the continuous variables,
- Encode categorical variables,
- Examine balance.

In [None]:
#Remove observations:
print("X shape",X.shape)
all_X = pd.concat([X,y],axis=1)
all_X = all_X[(all_X['blood_flow']!='NULL') & (all_X['cholesterol']<500)]
print("all_X shape",all_X.shape)
X = all_X.drop('target', axis=1).copy()
y = all_X['target'].copy()
print("X shape",X.shape,"\ny shape",y.shape)

### Transformations

In [None]:
#Re-plot cont_num variables:

plt.subplots(2,3,figsize=(10,6))
plt.tight_layout()
j=1
for i in cont_num:
    plt.subplot(2,3,j)
    sns.histplot(X[i],bins=30, kde=False)
    plt.title(i, fontsize=15)
    plt.xlabel("")
    j+=1

In [None]:
#Plot log transformed cont_num variables:

plt.subplots(2,3,figsize=(10,6))
plt.tight_layout()
j=1
for i in cont_num:
    plt.subplot(2,3,j)
    sns.histplot(np.log1p(X[i]),bins=30, kde=False)
    plt.title(i, fontsize=15)
    plt.xlabel("")
    j+=1

Resting_bp, cholesterol and ST_depression are improved by the transformation, although ST_depression is still not normally distributed. We will transform these 3 features and rely on the central limit theorem for the others.
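
As a rough illustration of what the log transformation does to a right-skewed feature, the sketch below applies `np.log1p` to synthetic log-normal data. The distribution parameters and seed are assumptions chosen for the example and do not come from the heart dataset.

```python
import numpy as np
import pandas as pd

# Right-skewed synthetic data (log-normal); parameters are arbitrary
rng = np.random.default_rng(1)
raw = pd.Series(rng.lognormal(mean=5.0, sigma=0.5, size=500))

# log1p(x) = log(1 + x), which is defined at x = 0 (useful for features such as ST_depression)
logged = np.log1p(raw)

print("skew before:", round(raw.skew(), 2))
print("skew after: ", round(logged.skew(), 2))

# The transform is invertible with expm1, so values can be mapped back to the original scale
restored = np.expm1(logged)
```

Because `log1p` maps 0 to 0, it can be applied to features that contain zeros without any offset being added by hand.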

In [None]:
#Log transform log_cols in X and X_test:
log_cols = ['resting_bp','cholesterol','ST_depression']
X_test = X_test.copy()

for i in log_cols:
    X[i] = np.log1p(X[i])
    X_test[i] = np.log1p(X_test[i])

### Encode Categorical Variables
Of our 7 categorical variables, sex, fasting_bs and ex_angina are already two-level dummy variables. We will create dummy variables for the remaining 4 categorical variables.
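
A minimal sketch of dummy encoding with `drop_first=True`, using a toy frame rather than the real data. The four toy rows are made up; only the level names are borrowed from the slope variable.

```python
import pandas as pd

# Toy frame with one three-level categorical variable
toy = pd.DataFrame({"slope": ["upsloping", "flat", "downsloping", "flat"]})

# drop_first=True drops the first level in sorted order ('downsloping' here),
# which becomes the reference level encoded as all zeros
dummies = pd.get_dummies(toy, prefix=["slope"], drop_first=True)

print(list(dummies.columns))
print(dummies.astype(int))
```

A variable with k levels therefore produces k-1 dummy columns, and the dropped level is the baseline against which the model coefficients are interpreted.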

In [None]:
#Create list of cat vars to encode:
encode_vars = ['chest_pain', 'resting_ecg', 'slope', 'blood_flow']    

#Join train and test together:
X_both = pd.concat([X,X_test],axis=0)

dummies = pd.get_dummies(X_both[encode_vars],prefix = ['chest_pain', 'resting_ecg', 'slope', 'blood_flow'],drop_first=True)
X_both.drop(encode_vars, axis=1, inplace=True)

X_both = pd.concat([X_both,dummies],axis=1)


#Remove labels from sex and fasting_bs and store them as integers:
X_both['sex'] = X_both['sex'].replace({'female':0, 'male':1}).astype('int64')
X_both['fasting_bs'] = X_both['fasting_bs'].replace({'under 120 mg/dl':0, 'over 120 mg/dl':1}).astype('int64')

#Recode ex_angina as an int64:
X_both['ex_angina'] = X_both['ex_angina'].astype('int64')

#Split X and X_test apart again:
X = X_both.iloc[:len(X),:].copy()
X_test = X_both.iloc[len(X):,:].copy()



X.shape,X_test.shape

In [None]:
X.info()

## Balance

In [None]:
#Sample proportion of CHD:
#Calculate proportions:
prop = 100*y.value_counts()/len(y)

#Plot doughnut chart (sort by class label so labels and colours line up with the wedges):
counts = y.value_counts().sort_index()
labels = ['No CHD '+ str(round(prop[0],1)) + "%",'CHD '+ str(round(prop[1],1)) + "%"]
colormap = ['lightgrey','lightcoral']
counts.plot.pie(startangle=90, colors=colormap, labels=labels)
plt.title("Overall CHD Proportion", fontsize=15)
plt.ylabel('')
circle = plt.Circle((0,0),0.7,color="white")
p = plt.gcf()
p.gca().add_artist(circle)
plt.show()

Our data is quite evenly balanced.

## Feature Selection with L1 Regularisation

In [None]:
#Tune an L1 Log Reg model:
#Pipeline with scaling:
clf_L1 = LogisticRegression(penalty='l1', solver = 'liblinear', max_iter=10000)

preprocessor = ColumnTransformer(transformers =[('num', MinMaxScaler(),num_vars)], remainder='passthrough')

pipe = Pipeline(steps=[('preprocessing', preprocessor),
                          ('model', clf_L1)])

hparams = {'model__C':[0.01,0.03,0.1,0.3,1,3,10]}

grid = GridSearchCV(pipe, param_grid = hparams, cv=5,n_jobs=3,scoring = 'roc_auc',verbose=1 )
grid.fit(X,y)
print("Best Params", grid.best_params_)
print("L1 Logistic Regression best CV AUC score:",grid.best_score_)

In [None]:
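#Get selected features direct from gridsearchCV best estimator.
#The ColumnTransformer outputs the scaled num_vars first, then the passthrough columns,
#so the coefficients must be labelled in that order rather than in X.columns order:
feat_order = num_vars + [col for col in X.columns if col not in num_vars]
grid_coefs = pd.Series(grid.best_estimator_.named_steps['model'].coef_[0], index=feat_order)
grid_sel_feats = grid_coefs[grid_coefs!=0]
selected = list(grid_sel_feats.index)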
#Plot coefs of selected:
grid_sel_feats.sort_values().plot.barh()
plt.title("L1 Regularization kept Feature Coefficients", fontsize=15);

dropped = [col for col in X.columns if col not in grid_sel_feats.index]
print(len(dropped),"dropped features:", dropped, "\n")
print(len(selected),"selected features:", selected, "\n\n")
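
The selection step above relies on the L1 penalty shrinking some coefficients exactly to zero. The sketch below shows the same idea on synthetic data from `make_classification`, and uses `SelectFromModel` as one possible way to turn the fitted coefficients into a feature mask. The data, the two values of `C` and all variable names here are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic problem: 10 features, only 3 of which carry signal
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=3,
                                     n_redundant=0, random_state=0)
X_scaled = MinMaxScaler().fit_transform(X_demo)

# Small C means a strong penalty; large C means a weak penalty
strong = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X_scaled, y_demo)
weak = LogisticRegression(penalty="l1", solver="liblinear", C=10).fit(X_scaled, y_demo)

n_strong = int((strong.coef_[0] != 0).sum())
n_weak = int((weak.coef_[0] != 0).sum())
print("non-zero coefficients with C=0.05:", n_strong)
print("non-zero coefficients with C=10:  ", n_weak)

# SelectFromModel wraps an already fitted model and exposes a boolean mask of kept features
selector = SelectFromModel(weak, prefit=True)
kept = selector.get_support()
print("kept feature indices:", [i for i, keep in enumerate(kept) if keep])
```

Tuning `C` by cross-validation, as we do in the grid search, is what decides how aggressive the selection is.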

In [None]:
#Update selected num_vars:
sel_num_vars = [col for col in num_vars if col in selected]

# Machine Learning

We will consider and apply the following classification models:
- Logistic Regression with L1 regularisation,
- Logistic Regression with L2 regularisation,
- Naive Bayes,
- KNN,
- LinearSVC,
- KernelSVC,
- Decision Tree,
- Random Forest,
- Gradient Boosting.

In [None]:
#Create a table for model evaluation:
model_table = pd.DataFrame(columns = ['Model','CV AUC Score'])

## L1 Logistic Regression

In [None]:
#Fit L1 logistic regression to selected features:
clf_L1 = LogisticRegression(penalty='l1', solver = 'liblinear', max_iter=10000)

preprocessor = ColumnTransformer(transformers =[('num', MinMaxScaler(),sel_num_vars)], remainder='passthrough')

pipe = Pipeline(steps=[('preprocessing', preprocessor),
                          ('model', clf_L1)])

hparams = {'model__C':[0.01,0.03,0.1,0.3,1,3,10]}

grid = GridSearchCV(pipe, param_grid = hparams, cv=5,n_jobs=3,scoring = 'roc_auc',verbose=1 )
grid.fit(X[selected],y)

print("Best Params", grid.best_params_)
print("L1 Logistic Regression best CV AUC score:",grid.best_score_)

#DataFrame.append was removed in pandas 2.0, so add the row with pd.concat:
model_table = pd.concat([model_table, pd.DataFrame([{'Model': clf_L1,'CV AUC Score':grid.best_score_}])], ignore_index=True)

## L2 Logistic Regression

In [None]:
#Fit L2 logistic regression to selected features:
clf_L2 = LogisticRegression(max_iter=10000)

preprocessor = ColumnTransformer(transformers =[('num', MinMaxScaler(),sel_num_vars)], remainder='passthrough')

pipe = Pipeline(steps=[('preprocessing', preprocessor),
                          ('model', clf_L2)])


hparams = {'model__C':[0.01,0.03,0.1,0.3,1,3,10]}

grid = GridSearchCV(pipe, param_grid = hparams, cv=5,n_jobs=3,scoring = 'roc_auc',verbose=1 )
grid.fit(X[selected],y)
print("Best Params", grid.best_params_)
print("L2 Logistic Regression best CV score",grid.best_score_)

#DataFrame.append was removed in pandas 2.0, so add the row with pd.concat:
model_table = pd.concat([model_table, pd.DataFrame([{'Model': clf_L2,'CV AUC Score':grid.best_score_}])], ignore_index=True)

## Naive Bayes
This dataset contains a mix of continuous and binary variables, so we did not consider any of the Naive Bayes models appropriate and none was fitted.

## KNN
We can use KNN for this binary classification problem. Non-parametric in nature, it can handle mixed data types with no assumptions about their distributions, as long as they are scaled. 
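
To show why scaling matters for a distance-based method, here is a small sketch with three made-up points: one feature measured in the hundreds and one 0/1 flag. The numbers and the helper `nearest_to_first` are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: a cholesterol-like value and a binary flag (toy values)
pts = np.array([[200.0, 0.0],
                [210.0, 1.0],
                [260.0, 0.0]])

def nearest_to_first(a):
    """Return the row index (1 or 2) of the point closest to row 0 in Euclidean distance."""
    d = np.linalg.norm(a[1:] - a[0], axis=1)
    return int(np.argmin(d)) + 1

raw_nn = nearest_to_first(pts)
scaled_nn = nearest_to_first(MinMaxScaler().fit_transform(pts))

print("nearest neighbour of row 0 on raw data:   ", raw_nn)
print("nearest neighbour of row 0 on scaled data:", scaled_nn)
```

On the raw data the large-valued feature decides the neighbour on its own; after min-max scaling the binary flag carries comparable weight and the nearest neighbour changes.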

In [None]:
#Fit KNN to selected features:
clf_knn = KNeighborsClassifier()

preprocessor = ColumnTransformer(transformers =[('num', MinMaxScaler(),sel_num_vars)], remainder='passthrough')

pipe = Pipeline(steps=[('preprocessing', preprocessor),
                          ('model', clf_knn)])

hparams = {'model__n_neighbors':[3,5,7,9]}

grid = GridSearchCV(pipe, param_grid = hparams, cv=5,n_jobs=3,scoring = 'roc_auc',verbose=1 )
grid.fit(X[selected],y)

print("Best Params", grid.best_params_)
print("KNN best CV score",grid.best_score_)

#DataFrame.append was removed in pandas 2.0, so add the row with pd.concat:
model_table = pd.concat([model_table, pd.DataFrame([{'Model': clf_knn,'CV AUC Score':grid.best_score_}])], ignore_index=True)

## LinearSVC

In [None]:
#Fit LinearSVC to selected:
clf_lsvc = SVC(kernel='linear')

preprocessor = ColumnTransformer(transformers =[('num', MinMaxScaler(),sel_num_vars)], remainder='passthrough')

pipe = Pipeline(steps=[('preprocessing', preprocessor),
                          ('model', clf_lsvc)])

hparams = {'model__C':[0.01,0.03,0.1,0.3,1,3,10]}

grid = GridSearchCV(pipe, param_grid = hparams, cv=5,n_jobs=3,scoring = 'roc_auc',verbose=1 )
grid.fit(X[selected],y)

print("Best Params", grid.best_params_)
print("LinearSVC best CV score",grid.best_score_)

#DataFrame.append was removed in pandas 2.0, so add the row with pd.concat:
model_table = pd.concat([model_table, pd.DataFrame([{'Model': clf_lsvc,'CV AUC Score':grid.best_score_}])], ignore_index=True)

## KernelSVC

In [None]:
#Fit RBF SVC to selected features:
clf_rbf = SVC(kernel='rbf')

preprocessor = ColumnTransformer(transformers =[('num', MinMaxScaler(),sel_num_vars)], remainder='passthrough')

pipe = Pipeline(steps=[('preprocessing', preprocessor),
                          ('model', clf_rbf)])

hparams = {'model__C':[0.001,0.01,0.1,1,10],
          'model__gamma':[0.0001,0.001,0.01,1,10]}

grid = GridSearchCV(pipe, param_grid = hparams, cv=5,n_jobs=3,scoring = 'roc_auc',verbose=1 )
grid.fit(X[selected],y)

print("Best Params", grid.best_params_)
print("RbfSVC best CV score",grid.best_score_)

#DataFrame.append was removed in pandas 2.0, so add the row with pd.concat:
model_table = pd.concat([model_table, pd.DataFrame([{'Model': clf_rbf,'CV AUC Score':grid.best_score_}])], ignore_index=True)

## Decision Tree

In [None]:
#Fit Decision tree to selected:
clf_DT = DecisionTreeClassifier()

preprocessor = ColumnTransformer(transformers =[('num', MinMaxScaler(),sel_num_vars)], remainder='passthrough')

pipe = Pipeline(steps=[('preprocessing', preprocessor),
                          ('model', clf_DT)])

hparams = {'model__max_depth':[2,3,4,6],
          'model__max_leaf_nodes':[6,8,10,20]}

grid = GridSearchCV(pipe, param_grid = hparams, cv=5,n_jobs=3,scoring = 'roc_auc',verbose=1 )
grid.fit(X[selected],y)

print("Best Params", grid.best_params_)
print("Decision Tree best CV score",grid.best_score_)

#DataFrame.append was removed in pandas 2.0, so add the row with pd.concat:
model_table = pd.concat([model_table, pd.DataFrame([{'Model': clf_DT,'CV AUC Score':grid.best_score_}])], ignore_index=True)

## Random Forest

In [None]:
#Fit Random Forest to selected:
clf_RF = RandomForestClassifier(random_state=0)

preprocessor = ColumnTransformer(transformers =[('num', MinMaxScaler(),sel_num_vars)], remainder='passthrough')

pipe = Pipeline(steps=[('preprocessing', preprocessor),
                          ('model', clf_RF)])

hparams = {'model__max_depth':[2,3,4,5,6],
          'model__n_estimators':[20,30,40,50,60,80]}

grid = GridSearchCV(pipe, param_grid = hparams, cv=5,n_jobs=3,scoring = 'roc_auc',verbose=1 )
grid.fit(X[selected],y)

print("Best Params", grid.best_params_)
print("Random Forest best CV score",grid.best_score_)

#DataFrame.append was removed in pandas 2.0, so add the row with pd.concat:
model_table = pd.concat([model_table, pd.DataFrame([{'Model': clf_RF,'CV AUC Score':grid.best_score_}])], ignore_index=True)

## Gradient Boosting

In [None]:
#Fit Gradient Boosting to selected:
clf_GB = GradientBoostingClassifier()

preprocessor = ColumnTransformer(transformers =[('num', MinMaxScaler(),sel_num_vars)], remainder='passthrough')

pipe = Pipeline(steps=[('preprocessing', preprocessor),
                          ('model', clf_GB)])

hparams = {'model__n_estimators':[20, 40, 60, 80],
          'model__learning_rate':[0.01,0.03,0.1],
          'model__max_depth':[3,4,5]}

grid = GridSearchCV(pipe, param_grid = hparams, cv=5,n_jobs=3,scoring = 'roc_auc',verbose=1 )
grid.fit(X[selected],y)
print("Best Params", grid.best_params_)
print("GradientBoosting best CV score",grid.best_score_)

#DataFrame.append was removed in pandas 2.0, so add the row with pd.concat:
model_table = pd.concat([model_table, pd.DataFrame([{'Model': clf_GB,'CV AUC Score':grid.best_score_}])], ignore_index=True)

### Table of Results

In [None]:
model_table.sort_values(by='CV AUC Score',ascending=False)

The tuned Random Forest model gives the highest AUC score. 

### Feature Importance

In [None]:
#Refit the tuned Random Forest Model:
X_train, X_valid, y_train, y_valid = train_test_split(X,y,test_size=0.2, random_state=0)

clf_RF = RandomForestClassifier(max_depth= 3, n_estimators= 40, random_state=0)

preprocessor = ColumnTransformer(transformers =[('num', MinMaxScaler(),sel_num_vars)], remainder='passthrough')

pipe = Pipeline(steps=[('preprocessing', preprocessor),
                          ('model', clf_RF)])

pipe.fit(X_train[selected],y_train)

In [None]:
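#Get Feature Importances from the final tuned RF model.
#The pipeline outputs sel_num_vars first, then the passthrough columns, so label in that order:
rf_feat_order = sel_num_vars + [col for col in selected if col not in sel_num_vars]
feat_imp = pd.Series(pipe.named_steps['model'].feature_importances_, index=rf_feat_order)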
feat_imp = feat_imp.sort_values(ascending=False)
sns.barplot(x=feat_imp, y=feat_imp.index)
plt.title("Feature Importances of Random Forest Model", fontsize=15)

## Evaluate Final Model on Test Set

In [None]:
#Refit the tuned Random Forest model on the full training set and evaluate it on the held-out test set:

clf_RF = RandomForestClassifier(max_depth= 3, n_estimators= 40, random_state=0)

preprocessor = ColumnTransformer(transformers =[('num', MinMaxScaler(),sel_num_vars)], remainder='passthrough')

pipe = Pipeline(steps=[('preprocessing', preprocessor),
                          ('model', clf_RF)])

pipe.fit(X[selected],y)
y_preds = pipe.predict(X_test[selected])
y_probs = pipe.predict_proba(X_test[selected])[:,1]
auc_score = roc_auc_score(y_test, y_probs)
print("Final AUC Evaluation with Test Set:", auc_score)

In [None]:
#Create Confusion Matrix:
con_mat = pd.crosstab(y_test,y_preds,rownames=['Actual'], colnames=['Predicted'])
sns.heatmap(con_mat,annot=True, cmap='Blues')
plt.title("Confusion Matrix",fontsize=15)
plt.show()

In [None]:
print(classification_report(y_test, y_preds))
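
As a reading aid for the classification report, the sketch below computes recall, accuracy and the macro and weighted averages on ten made-up labels. The label vectors are invented for the example and are unrelated to our test set results.

```python
from sklearn.metrics import accuracy_score, classification_report, recall_score

# Toy labels: 4 positives and 6 negatives, with one error in each class
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# Recall for the positive class: true positives / all actual positives = 3 / 4
recall_pos = recall_score(y_true_demo, y_pred_demo)
accuracy = accuracy_score(y_true_demo, y_pred_demo)

report = classification_report(y_true_demo, y_pred_demo, output_dict=True)
# Macro average: unweighted mean of the per-class recalls
macro_recall = report["macro avg"]["recall"]
# Weighted average: per-class recalls weighted by class support (equals accuracy for recall)
weighted_recall = report["weighted avg"]["recall"]

print("recall (CHD class):", recall_pos)
print("accuracy:", accuracy)
print("macro recall:", round(macro_recall, 3), "weighted recall:", round(weighted_recall, 3))
```

With balanced classes, as in our data, the macro and weighted averages will usually be close to each other.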

# Summary

Max_heart_rate was the most important risk factor for CHD in both the Logistic Regression and final Random Forest models. Our final predictive model gave a recall of 85% and micro and macro accuracies of 87%. However, this is an old dataset with much uncertainty around the feature labelling, so this analysis should be taken as a demonstration of techniques rather than a source of accurate insights.