
Our dataset consists of clinical data from patients who entered the hospital complaining of chest pain ("angina") during exercise.  The information collected includes:

* `age` : Age of the patient

* `sex` : Sex of the patient

* `cp` : Chest Pain type

    + Value 0: asymptomatic
    + Value 1: typical angina
    + Value 2: atypical angina
    + Value 3: non-anginal pain
   
    
* `trtbps` : resting blood pressure (in mm Hg)

* `chol` : cholesterol in mg/dl fetched via BMI sensor

* `restecg` : resting electrocardiographic results

    + Value 0: normal
    + Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    + Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

* `thalach` : maximum heart rate achieved during exercise

* `output` : the doctor's diagnosis of whether the patient is at risk for a heart attack
    + 0 = not at risk of heart attack
    + 1 = at risk of heart attack
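
As a quick illustration of the coding scheme above, the sketch below maps the integer `cp` codes to their labels and counts how often each one occurs. The `CP_LABELS` dictionary and the `toy` rows are hypothetical and made up for illustration; they are not taken from the real data.

```python
import pandas as pd

# Labels copied from the variable list above.
CP_LABELS = {
    0: "asymptomatic",
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
}

# Made-up rows, only to show the mapping; not the real dataset.
toy = pd.DataFrame({"cp": [0, 0, 2, 1, 3, 0, 2]})
toy["cp_label"] = toy["cp"].map(CP_LABELS)

# Count how many rows fall into each chest pain type.
cp_counts = toy["cp_label"].value_counts()
print(cp_counts)
```

Running the same `value_counts()` on the real `cp` column is a cheap way to see how unbalanced the four classes are before fitting any model.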

In [None]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
ha = pd.read_csv("https://www.dropbox.com/s/aohbr6yb9ifmc8w/heart_attack.csv?dl=1")

# One-hot encode the categorical columns; `cp` stays as an integer because it is the target later on.
re_dummies = pd.get_dummies(ha["restecg"], prefix="re")
ha = pd.concat([ha.drop(columns=["restecg"]), re_dummies], axis=1)
out_dummies = pd.get_dummies(ha["output"], prefix="out")
ha = pd.concat([ha.drop(columns=["output"]), out_dummies], axis=1)
sex_dummies = pd.get_dummies(ha["sex"], prefix="sex")
ha = pd.concat([ha.drop(columns=["sex"]), sex_dummies], axis=1)
ha.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 273 entries, 0 to 272
Data columns (total 12 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   age      273 non-null    int64
 1   cp       273 non-null    int64
 2   trtbps   273 non-null    int64
 3   chol     273 non-null    int64
 4   thalach  273 non-null    int64
 5   re_0     273 non-null    bool 
 6   re_1     273 non-null    bool 
 7   re_2     273 non-null    bool 
 8   out_0    273 non-null    bool 
 9   out_1    273 non-null    bool 
 10  sex_0    273 non-null    bool 
 11  sex_1    273 non-null    bool 
dtypes: bool(7), int64(5)
memory usage: 12.7 KB


## Q1: Natural Multiclass Models

Fit a multiclass KNN, Decision Tree, and LDA for the heart disease data, this time predicting the type of chest pain (categories 0 to 3) that a patient experiences.  For the decision tree, plot the fitted tree and interpret the first couple of splits.


In [None]:
y = ha["cp"]
X = ha.drop(columns=["cp"])

num_cols = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
cat_cols = X.select_dtypes(exclude=["int64", "float64"]).columns.tolist()

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)
])

from sklearn.metrics import classification_report, confusion_matrix

knn_pipe = Pipeline([
    ("prep", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=7))
])

knn_pipe.fit(X_train, y_train)

y_pred_knn = knn_pipe.predict(X_test)

print("KNN Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_knn))

print("\nKNN Classification Report:")
print(classification_report(y_test, y_pred_knn))

tree_pipe = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=42))
])

tree_pipe.fit(X_train, y_train)
y_pred_tree = tree_pipe.predict(X_test)

print("Decision Tree Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_tree))

print("\nDecision Tree Classification Report:")
print(classification_report(y_test, y_pred_tree))

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda_pipe = Pipeline([
    ("prep", preprocess),
    ("lda", LinearDiscriminantAnalysis())
])

lda_pipe.fit(X_train, y_train)
y_pred_lda = lda_pipe.predict(X_test)

print("LDA Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_lda))

print("\nLDA Classification Report:")
print(classification_report(y_test, y_pred_lda))

KNN Confusion Matrix:
[[23  1  2  0]
 [ 5  2  4  1]
 [11  1  3  0]
 [ 2  0  0  0]]

KNN Classification Report:
              precision    recall  f1-score   support

           0       0.56      0.88      0.69        26
           1       0.50      0.17      0.25        12
           2       0.33      0.20      0.25        15
           3       0.00      0.00      0.00         2

    accuracy                           0.51        55
   macro avg       0.35      0.31      0.30        55
weighted avg       0.47      0.51      0.45        55

Decision Tree Confusion Matrix:
[[16  0  8  2]
 [ 4  0  8  0]
 [ 2  3  9  1]
 [ 1  0  1  0]]

Decision Tree Classification Report:
              precision    recall  f1-score   support

           0       0.70      0.62      0.65        26
           1       0.00      0.00      0.00        12
           2       0.35      0.60      0.44        15
           3       0.00      0.00      0.00         2

    accuracy                           0.45        
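
The question also asks for a view of the fitted tree. A minimal sketch of one way to inspect the splits is shown below, using `export_text` to print the tree as indented rules. It runs on synthetic data (the `X_toy` frame and the simulated `cp` values are assumptions made so the example is self-contained), so the splits it prints say nothing about the real patients.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the heart data (hypothetical values, not real patients).
rng = np.random.default_rng(0)
n = 200
cp = rng.choice([0, 1, 2, 3], size=n, p=[0.45, 0.20, 0.25, 0.10])
X_toy = pd.DataFrame({
    "age": rng.integers(30, 78, size=n),
    "thalach": 130 + 8 * cp + rng.normal(0, 15, size=n),
})

# A shallow tree keeps the printed rules short enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_toy, cp)

# Each "|---" line is one split; indentation shows the depth.
rules = export_text(tree, feature_names=list(X_toy.columns))
print(rules)
```

For the pipeline fitted above, the tree itself would be `tree_pipe.named_steps["tree"]`, and the feature names after preprocessing can be read from `tree_pipe.named_steps["prep"].get_feature_names_out()`. `sklearn.tree.plot_tree` draws the same structure with matplotlib if a figure is preferred.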



## Q2: OvR

Create a new column in the `ha` dataset called `cp_is_3`, which is equal to `1` if the `cp` variable is equal to `3` and `0` otherwise.

Then, fit a Logistic Regression to predict this new target, and report the **F1 Score**.

Repeat for the other three `cp` categories.  Which category was the OvR approach best at distinguishing?

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

ha["cp_is_0"] = (ha["cp"] == 0).astype(int)
ha["cp_is_1"] = (ha["cp"] == 1).astype(int)
ha["cp_is_2"] = (ha["cp"] == 2).astype(int)
ha["cp_is_3"] = (ha["cp"] == 3).astype(int)

y = ha["cp_is_3"]
X = ha.drop(columns=["cp","cp_is_0","cp_is_1","cp_is_2","cp_is_3"])


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

log_pipe = Pipeline([
    ("preprocess", preprocess),
    ("logreg", LogisticRegression())
])

log_grid = {
    "logreg__C": [0.01, 0.1, 1, 5, 10],
    "logreg__penalty": ["l2"]

}

log_cv = GridSearchCV(log_pipe, log_grid, cv=5, scoring="f1")
log_cv.fit(X_train, y_train)

print("Best Logistic Regression Params:", log_cv.best_params_)
print("Best CV F1:", log_cv.best_score_)

log_best = log_cv.best_estimator_
y_pred_log = log_best.predict(X_test)
print(confusion_matrix(y_test, y_pred_log))

final_lr = log_cv.best_estimator_.named_steps["logreg"]
print("Logistic Regression Coefficients:")
print(final_lr.coef_)

print("\nLogistic Regression Classification Report:")
print(classification_report(y_test, y_pred_log))


Best Logistic Regression Params: {'logreg__C': 0.01, 'logreg__penalty': 'l2'}
Best CV F1: 0.0
[[53  0]
 [ 2  0]]
Logistic Regression Coefficients:
[[ 0.02172138  0.07519455 -0.02233399  0.06934126 -0.02004614  0.02000725
   0.01699923 -0.01703812  0.00296913 -0.00300802  0.01998115 -0.02002004
  -0.02002004  0.01998115  0.02680615 -0.02684504 -0.02684504  0.02680615]]

Logistic Regression Classification Report:
              precision    recall  f1-score   support

           0       0.96      1.00      0.98        53
           1       0.00      0.00      0.00         2

    accuracy                           0.96        55
   macro avg       0.48      0.50      0.49        55
weighted avg       0.93      0.96      0.95        55
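
The cell above only covers `cp_is_3`. A minimal sketch of how the same one-vs-rest comparison could be repeated for all four categories is given below. It uses synthetic data and a plain `StandardScaler` plus `LogisticRegression` pipeline without the grid search; the `X_toy` frame, the simulated `cp` values, and the `f1_by_class` dictionary are assumptions for illustration, so the scores it prints are not results for the heart data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data (hypothetical values, not real patients).
rng = np.random.default_rng(0)
n = 300
cp = rng.choice([0, 1, 2, 3], size=n, p=[0.45, 0.20, 0.25, 0.10])
X_toy = pd.DataFrame({
    "age": rng.integers(30, 78, size=n),
    "thalach": 130 + 8 * cp + rng.normal(0, 15, size=n),
})

f1_by_class = {}
for k in [0, 1, 2, 3]:
    # One-vs-rest target: 1 if the chest pain type is k, 0 otherwise.
    y_k = (cp == k).astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_toy, y_k, test_size=0.2, random_state=42, stratify=y_k
    )
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_tr, y_tr)
    # zero_division=0 reports F1 as 0 when the model never predicts the positive class.
    f1_by_class[k] = f1_score(y_te, model.predict(X_te), zero_division=0)

print(f1_by_class)
print("Best separated category:", max(f1_by_class, key=f1_by_class.get))
```

Stratifying the split is a design choice here: with a rare category, an unstratified 20% test set can end up with only a couple of positive cases, as in the confusion matrix above, which makes the F1 score very unstable.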





## Q3: OvO

Reduce your dataset to only the `0` and `1` types of chest pain.

Then, fit a Logistic Regression to predict between the two groups, and report the **ROC-AUC**.  

Repeat, comparing category `0` against category `2` and then against category `3`.  Which pair was the OvO approach best at distinguishing?

In [None]:
from sklearn.metrics import roc_auc_score

# Keep only patients with chest pain type 0 or 1
ha_01 = ha[ha["cp"].isin([0, 1])]

# Target: 1 if the chest pain type is 1, 0 if it is 0
y = (ha_01["cp"] == 1).astype(int)
# Drop the target and any OvR indicator columns left over from Q2
X = ha_01.drop(
    columns=["cp", "cp_is_0", "cp_is_1", "cp_is_2", "cp_is_3"], errors="ignore"
)


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

log_pipe = Pipeline([
    ("preprocess", preprocess),
    ("logreg", LogisticRegression())
])

log_grid = {
    "logreg__C": [0.01, 0.1, 1, 5, 10],
    "logreg__penalty": ["l2"]
}

# Tune on ROC AUC, the metric the question asks for
log_cv = GridSearchCV(log_pipe, log_grid, cv=5, scoring="roc_auc")
log_cv.fit(X_train, y_train)

print("Best Logistic Regression Params:", log_cv.best_params_)
print("Best CV ROC AUC:", log_cv.best_score_)

log_best = log_cv.best_estimator_
y_pred_log = log_best.predict(X_test)
# ROC AUC needs predicted probabilities for the positive class, not hard labels
y_prob_log = log_best.predict_proba(X_test)[:, 1]
print("Test ROC AUC:", roc_auc_score(y_test, y_prob_log))
print(confusion_matrix(y_test, y_pred_log))

final_lr = log_cv.best_estimator_.named_steps["logreg"]
print("Logistic Regression Coefficients:")
print(final_lr.coef_)

print("\nLogistic Regression Classification Report:")
print(classification_report(y_test, y_pred_log))


     age   cp  trtbps  chol  thalach   re_0   re_1   re_2  out_0  out_1  sex_0  sex_1  cp_is_0  cp_is_1
  0   63    3     145   233      150   True  False  False  False   True  False   True        0        0
  1   37    2     130   250      187  False   True  False  False   True  False   True        0        0
  2   56    1     120   236      178  False   True  False  False   True  False   True        0        1
  3   57    0     120   354      163  False   True  False  False   True   True  False        1        0
  4   57    0     140   192      148  False   True  False  False   True  False   True        1        0
...  ...  ...     ...   ...      ...    ...    ...    ...    ...    ...    ...    ...      ...      ...
268   59    0     164   176       90   True  False  False   True  False  False   True        1        0
269   57    0     140   241      123  False   True  False   True  False   True  False        1        0
270   45    3     110   264      132  False   True  False   True  False  False   True        0        0
271   68    0     144   193      141  False   True  False   True  False  False   True        1        0
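
The question asks for three pairwise comparisons (`0` vs `1`, `0` vs `2`, `0` vs `3`). One way this could be organized is sketched below: for each pair, keep only the rows in those two categories, fit a logistic regression, and score it with ROC-AUC on predicted probabilities. The data is synthetic, and `X_toy`, the simulated `cp` values, and the `auc_by_pair` dictionary are assumptions for illustration; the printed values are not results for the heart data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data (hypothetical values, not real patients).
rng = np.random.default_rng(0)
n = 300
cp = rng.choice([0, 1, 2, 3], size=n, p=[0.45, 0.20, 0.25, 0.10])
X_toy = pd.DataFrame({
    "age": rng.integers(30, 78, size=n),
    "thalach": 130 + 8 * cp + rng.normal(0, 15, size=n),
})

auc_by_pair = {}
for other in [1, 2, 3]:
    # One-vs-one: restrict the data to category 0 and one other category.
    mask = np.isin(cp, [0, other])
    X_pair = X_toy.loc[mask]
    y_pair = (cp[mask] == other).astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_pair, y_pair, test_size=0.2, random_state=42, stratify=y_pair
    )
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_tr, y_tr)
    # ROC-AUC is computed from the predicted probability of the positive class.
    auc_by_pair[(0, other)] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(auc_by_pair)
print("Best separated pair:", max(auc_by_pair, key=auc_by_pair.get))
```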
