
# Lab 7: Heart Attack
## Instructions

You will submit an HTML document to Canvas as your final version.

Your document should show your code chunks/cells as well as any output. Make sure that only relevant output is printed. Do not, for example, print the entire dataset in your final rendered file.

Your document should also be clearly organized, so that it is easy for a reader to find your answers to each question.

## The Data

In this lab, we will use medical data to predict the likelihood of a person experiencing an exercise-induced heart attack.

Our dataset consists of clinical data from patients who entered the hospital complaining of chest pain (“angina”) during exercise. The information collected includes:

age : Age of the patient

sex : Sex of the patient

cp : Chest Pain type
Value 0: asymptomatic
Value 1: typical angina
Value 2: atypical angina
Value 3: non-anginal pain

trtbps : resting blood pressure (in mm Hg)

chol : cholesterol in mg/dl fetched via BMI sensor

restecg : resting electrocardiographic results
Value 0: normal
Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria

thalach : maximum heart rate achieved during exercise

output : the doctor’s diagnosis of whether the patient is at risk for a heart attack
0 = not at risk of heart attack
1 = at risk of heart attack

Although it is not a formal question on this assignment, you should begin by reading in the dataset and briefly exploring and summarizing the data, and by adjusting any variables that need cleaning.
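One possible cleaning pass is sketched below. It rests on two assumptions that the lab text does not confirm: that the `Unnamed: 0` column shown in the `ha_df.head()` preview is a leftover row index rather than a measurement, and that the integer-coded `sex`, `cp`, and `restecg` columns are better treated as categories. The helper name `clean_heart_df` is hypothetical, and the five rows are copied from that preview so the block runs without a download.

```python
import pandas as pd

# Five rows copied from the ha_df.head() preview in this lab.
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4],
    "age": [63, 37, 56, 57, 57],
    "sex": [1, 1, 1, 0, 1],
    "cp": [3, 2, 1, 0, 0],
    "trtbps": [145, 130, 120, 120, 140],
    "chol": [233, 250, 236, 354, 192],
    "restecg": [0, 1, 1, 1, 1],
    "thalach": [150, 187, 178, 163, 148],
    "output": [1, 1, 1, 1, 1],
})

# Assumption: these columns hold coded labels, not quantities.
CATEGORICAL = ["sex", "cp", "restecg"]


def clean_heart_df(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop a leftover index column (if present)
    and mark the coded columns as categorical. Returns a new frame."""
    out = df.drop(columns=["Unnamed: 0"], errors="ignore")
    return out.astype({col: "category" for col in CATEGORICAL})


cleaned = clean_heart_df(sample)
print(cleaned.dtypes)
print("missing values:", int(cleaned.isna().sum().sum()))
```

If the coded columns are cast this way, a selector such as `make_column_selector(dtype_include=object)` would no longer match them; `dtype_include="category"` is one way to route them to a one-hot encoder instead.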

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.compose import ColumnTransformer, make_column_selector
from plotnine import ggplot, aes, geom_line, geom_abline, labs, theme_classic, ggtitle, scale_color_manual

In [4]:
ha_df = pd.read_csv("https://www.dropbox.com/s/aohbr6yb9ifmc8w/heart_attack.csv?dl=1")
ha_df.head()

Unnamed: 0,age,sex,cp,trtbps,chol,restecg,thalach,output
0,63,1,3,145,233,0,150,1
1,37,1,2,130,250,1,187,1
2,56,1,1,120,236,1,178,1
3,57,0,0,120,354,1,163,1
4,57,1,0,140,192,1,148,1


In [5]:
X = ha_df.drop(columns=['output'])
y = ha_df['output']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

ct = ColumnTransformer(
  [
    ("dummify", OneHotEncoder(sparse_output = False, handle_unknown='ignore'), make_column_selector(dtype_include=object)),
    ("standardize", StandardScaler(), make_column_selector(dtype_include=np.number))
  ],
  remainder = "passthrough"
)

## Part One: Fitting Models

This section asks you to create a final best model for each of the model types studied this week. For each, you should:

Find the best model based on ROC AUC for predicting the target variable.

Report the (cross-validated!) ROC AUC metric.

Fit the final model.

Output a confusion matrix; that is, the counts of how many observations fell into each predicted class for each true class.

(Where applicable) Interpret the coefficients and/or estimates produced by the model fit.

You should certainly try multiple model pipelines to find the best model. You do not need to include the output for every attempted model, but you should describe all of the models explored. You should include any hyperparameter tuning steps in your writeup as well.
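To make the confusion-matrix layout concrete, here is a small sketch on made-up labels (not the lab data). In scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes; the row and column names below are illustrative additions.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels purely for illustration (not from the heart attack data).
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 1]

# Fix the label order so row/column 0 is "not at risk" and 1 is "at risk".
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

labelled = pd.DataFrame(
    cm,
    index=["true: not at risk", "true: at risk"],
    columns=["pred: not at risk", "pred: at risk"],
)
print(labelled)
```

Wrapping the array in a labelled frame is optional, but it spares the reader from having to remember which axis is which.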

Q1: KNN

Q2: Logistic Regression

Q3: Decision Tree

Q4: Interpretation

Which predictors were most important to predicting heart attack risk?

Q5: ROC Curve

Plot the ROC curve for your three models above.
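One way to prepare the data for a three-model ROC plot is sketched below. It runs on synthetic stand-in data from `make_classification`, and the hyperparameters are placeholders rather than tuned values from this lab. The idea is to collect each model's false and true positive rates into one tidy frame (here called `roc_df`, a name chosen for illustration) with a `model` column to color by.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; swap in the lab's own train/test split.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Placeholder hyperparameters, not the tuned values from the lab.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15)),
    "Logistic": make_pipeline(StandardScaler(), LogisticRegression()),
    "Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

frames = []
for name, model in models.items():
    model.fit(Xtr, ytr)
    prob = model.predict_proba(Xte)[:, 1]  # predicted probability of class 1
    fpr, tpr, _ = roc_curve(yte, prob)
    frames.append(pd.DataFrame({"model": name, "fpr": fpr, "tpr": tpr}))
    print(name, "test AUC:", round(roc_auc_score(yte, prob), 3))

# One row per (model, threshold); suitable for a line plot colored by model.
roc_df = pd.concat(frames, ignore_index=True)
print(roc_df.head())
```

A frame in this shape could then be passed to `ggplot(roc_df, aes(x="fpr", y="tpr", color="model")) + geom_line() + geom_abline(slope=1, intercept=0)`, using the plotnine names already imported in this lab.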

In [17]:
KNN_pipeline = Pipeline(steps=[
    ('preprocessor', ct),
    ('knn', KNeighborsClassifier())
])

knn_params = {'knn__n_neighbors': range(1, 21)}
knn_gscv = GridSearchCV(KNN_pipeline, knn_params, cv=5, scoring='roc_auc')
knn_gscv.fit(X_train, y_train)

print(f"Best KNN parameters: {knn_gscv.best_params_}")
print(f"AUC: {knn_gscv.best_score_.round(5)}")

y_pred_knn = knn_gscv.best_estimator_.predict(X_test)
print(confusion_matrix(y_test, y_pred_knn))

Best KNN parameters: {'knn__n_neighbors': 17}
AUC: 0.84721
[[14  7]
 [ 7 27]]
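
The printed KNN confusion matrix can be turned into a few summary rates. The sketch below only re-uses the four counts shown above (rows are true classes, columns are predicted classes); the variable names are illustrative. Note that the AUC above is cross-validated on the training data, while these counts come from the held-out test set, so the two are not directly comparable.

```python
import numpy as np

# Test-set confusion matrix printed for the tuned KNN model
# (rows = true class, columns = predicted class).
cm_knn = np.array([[14, 7],
                   [7, 27]])

# For a 2x2 matrix with labels [0, 1], ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = cm_knn.ravel()

accuracy = (tp + tn) / cm_knn.sum()   # 41 / 55
sensitivity = tp / (tp + fn)          # at-risk patients correctly flagged: 27 / 34
specificity = tn / (tn + fp)          # not-at-risk patients correctly cleared: 14 / 21

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
# prints: accuracy=0.745 sensitivity=0.794 specificity=0.667
```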


In [16]:
logit_pipeline = Pipeline([
    ('preprocessor', ct),
    ('logit', LogisticRegression())
])

logit_params = {'logit__C': [0.01, 0.1, 1, 10, 100]}
logit_gscv = GridSearchCV(logit_pipeline, logit_params, cv=5, scoring='roc_auc')
logit_gscv.fit(X_train, y_train)

print(f"Best Logistic parameters: {logit_gscv.best_params_}")
print(f"AUC: {logit_gscv.best_score_.round(5)}")

y_pred_logit = logit_gscv.best_estimator_.predict(X_test)
print(confusion_matrix(y_test, y_pred_logit))

Best Logistic parameters: {'logit__C': 0.01}
AUC: 0.85397
[[16  5]
 [ 6 28]]
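
For the coefficient interpretation, a sketch of how the fitted coefficients might be pulled out and converted to odds ratios follows. It uses synthetic data with placeholder column names (`x0` to `x3`), not the lab data, so the numbers it prints say nothing about heart attack risk. Because the predictors are standardized before the model, each coefficient is the change in the log-odds of class 1 per one-standard-deviation increase in that predictor. With `C=0.01` the default L2 penalty is strong, so coefficients are shrunk toward zero.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with placeholder column names (not the lab data).
X_arr, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=1
)
X_demo = pd.DataFrame(X_arr, columns=["x0", "x1", "x2", "x3"])

pipe = Pipeline([
    ("standardize", StandardScaler()),
    ("logit", LogisticRegression(C=0.01)),  # same C value the grid search selected
])
pipe.fit(X_demo, y_demo)

# coef_ has shape (1, n_features) for a binary target.
coefs = pd.Series(pipe.named_steps["logit"].coef_[0], index=X_demo.columns)

# Order predictors by absolute coefficient size, largest first.
order = coefs.abs().sort_values(ascending=False).index
summary = pd.DataFrame({"coef": coefs, "odds_ratio": np.exp(coefs)}).reindex(order)
print(summary)
```

In this lab's pipeline, the analogous object would presumably be `logit_gscv.best_estimator_.named_steps['logit']`, with feature names available from the `'preprocessor'` step via `get_feature_names_out()`.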
