
Our dataset consists of clinical data from patients who entered the hospital complaining of chest pain ("angina") during exercise.  The information collected includes:

* `age` : Age of the patient

* `sex` : Sex of the patient

* `cp` : Chest Pain type

    + Value 0: asymptomatic
    + Value 1: typical angina
    + Value 2: atypical angina
    + Value 3: non-anginal pain
   
    
* `trtbps` : resting blood pressure (in mm Hg)

* `chol` : cholesterol in mg/dl fetched via BMI sensor

* `restecg` : resting electrocardiographic results

    + Value 0: normal
    + Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    + Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

* `thalach` : maximum heart rate achieved during exercise

* `output` : the doctor's diagnosis of whether the patient is at risk for a heart attack

    + Value 0: not at risk of heart attack
    + Value 1: at risk of heart attack

In [79]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

In [80]:
ha = pd.read_csv("https://www.dropbox.com/s/aohbr6yb9ifmc8w/heart_attack.csv?dl=1")

## Q1: Natural Multiclass Models

Fit a multiclass KNN, Decision Tree, and LDA to the heart disease data, this time predicting the type of chest pain (categories 0-3) that a patient experiences. For the decision tree, plot the fitted tree and interpret the first couple of splits.
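
A minimal sketch of one way to approach this question. So that the block runs on its own, it uses `make_fake_ha`, a hypothetical helper that builds a synthetic stand-in for `ha` with the same column names but made-up values. The hyperparameters (`n_neighbors=5`, `max_depth=3`) and the use of accuracy as the metric are assumptions, not choices taken from the assignment. `export_text` prints the tree's splits as text; `sklearn.tree.plot_tree` draws the same tree when matplotlib is available.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text


def make_fake_ha(n=200, seed=0):
    """Hypothetical synthetic stand-in for `ha` (same column names, made-up values)."""
    rng = np.random.default_rng(seed)
    cp = rng.integers(0, 4, size=n)
    return pd.DataFrame({
        "age": rng.integers(30, 80, size=n),
        "sex": rng.integers(0, 2, size=n),
        "cp": cp,
        "trtbps": rng.normal(130, 15, size=n).round(),
        "chol": rng.normal(240, 40, size=n).round(),
        "restecg": rng.integers(0, 3, size=n),
        # Tie one feature loosely to `cp` so the models have some signal
        "thalach": (150 - 8 * cp + rng.normal(0, 15, size=n)).round(),
        "output": rng.integers(0, 2, size=n),
    })


ha_demo = make_fake_ha()
X = ha_demo.drop(columns=["cp"])  # predictors: everything except the target
y = ha_demo["cp"]                 # four-class target (0-3)

# All three estimators handle more than two classes without any wrapper
models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "lda": LinearDiscriminantAnalysis(),
}

cv_accuracy = {}
for name, model in models.items():
    pipe = Pipeline([("standardize", StandardScaler()), (name, model)])
    cv_accuracy[name] = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()

print(cv_accuracy)

# Fit the tree on unscaled data so the split thresholds are in original units
fitted_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_text = export_text(fitted_tree, feature_names=list(X.columns))
print(tree_text)
```

With the real data, replace `make_fake_ha()` with the `ha` frame loaded above; the splits nearest the root of the printed tree are the ones to interpret.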


## Q2:  OvR

Create a new column in the `ha` dataset called `cp_is_3`, which is equal to `1` if the `cp` variable is equal to `3` and `0` otherwise.

Then, fit a Logistic Regression to predict this new target, and report the **F1 Score**.

Repeat for the other three `cp` categories.  Which category was the OvR approach best at distinguishing?

In [81]:
ha["cp_is_3"] = (ha["cp"] == 3).astype('int')

In [82]:
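# NOTE: only the new indicator is dropped, so `cp` itself stays in X.
# The target is a deterministic function of `cp`, which inflates the scores below.
X = ha.drop(["cp_is_3"], axis = 1)
y = ha["cp_is_3"]

# Standardize every numeric column; pass any other column through unchanged
ct = ColumnTransformer(
  [
    ("standardize", 
    StandardScaler(), 
    make_column_selector(dtype_include=np.number))
  ],
  remainder = "passthrough"
)

# Preprocessing followed by a logistic regression classifier
log_pipe = Pipeline(
    [("preprocessing", ct),
    ("logistic_regression", LogisticRegression())]
)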

In [83]:
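# Fit once on the full data, then estimate performance with 5-fold cross-validation.
# 'f1_weighted' averages the per-class F1 scores, weighted by class support.
log_pipe.fit(X, y)
scores = cross_val_score(log_pipe, X, y, cv=5, scoring='f1_weighted')
scores.mean()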

0.9830157279176885
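
The call above uses `scoring='f1_weighted'`, which averages F1 over both classes weighted by their support, whereas `scoring='f1'` reports F1 for the positive class only. On an imbalanced target the two can differ a lot. A toy sketch of the difference, with labels made up purely for illustration:

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced labels: nine negatives, one positive
y_true = [0] * 9 + [1]
# A classifier that never predicts the positive class
y_pred = [0] * 10

# F1 of the positive class only (zero_division=0 silences the undefined-precision warning)
binary_f1 = f1_score(y_true, y_pred, average="binary", zero_division=0)
# Support-weighted average of the F1 scores of both classes
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"binary F1:   {binary_f1:.3f}")
print(f"weighted F1: {weighted_f1:.3f}")
```

Here the positive-class F1 is zero while the weighted F1 stays above 0.8, because the majority class dominates the average.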

In [84]:
ha = pd.read_csv("https://www.dropbox.com/s/aohbr6yb9ifmc8w/heart_attack.csv?dl=1")

ha["cp_is_2"] = (ha["cp"] == 2).astype('int')
X = ha.drop(["cp_is_2"], axis = 1)
y = ha["cp_is_2"]

log_pipe.fit(X, y)
scores = cross_val_score(log_pipe, X, y, cv=5, scoring='f1_weighted')
scores.mean()

0.8362721582110308

In [85]:
ha = pd.read_csv("https://www.dropbox.com/s/aohbr6yb9ifmc8w/heart_attack.csv?dl=1")

ha["cp_is_1"] = (ha["cp"] == 1).astype('int')
X = ha.drop(["cp_is_1"], axis = 1)
y = ha["cp_is_1"]

log_pipe.fit(X, y)
scores = cross_val_score(log_pipe, X, y, cv=5, scoring='f1_weighted')
scores.mean()

0.7668272106122176

In [86]:
ha = pd.read_csv("https://www.dropbox.com/s/aohbr6yb9ifmc8w/heart_attack.csv?dl=1")

ha["cp_is_0"] = (ha["cp"] == 0).astype('int')
X = ha.drop(["cp_is_0"], axis = 1)
y = ha["cp_is_0"]

log_pipe.fit(X, y)
scores = cross_val_score(log_pipe, X, y, cv=5, scoring='f1_weighted')
scores.mean()

0.9851851851851852

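OvR was best at distinguishing `cp = 0` (weighted F1 ≈ 0.985), only slightly ahead of `cp = 3` (≈ 0.983); `cp = 2` (≈ 0.836) and `cp = 1` (≈ 0.767) were harder to separate. These scores should be read with caution: each `X` above still contains the `cp` column, so the model sees the very variable the target is built from. A linear model on `cp` can isolate the extreme values 0 and 3 with a single threshold but not the middle values 1 and 2, which likely explains the pattern.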
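
One way to check how much the OvR scores depend on `cp` remaining among the predictors is to drop it before fitting and loop over the four categories. The sketch below does that with the positive-class F1 (`scoring="f1"`). It runs on `make_fake_ha`, a hypothetical helper that generates synthetic data with the same column names as `ha`, so its numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def make_fake_ha(n=200, seed=0):
    """Hypothetical synthetic stand-in for `ha` (same column names, made-up values)."""
    rng = np.random.default_rng(seed)
    cp = rng.integers(0, 4, size=n)
    return pd.DataFrame({
        "age": rng.integers(30, 80, size=n),
        "sex": rng.integers(0, 2, size=n),
        "cp": cp,
        "trtbps": rng.normal(130, 15, size=n).round(),
        "chol": rng.normal(240, 40, size=n).round(),
        "restecg": rng.integers(0, 3, size=n),
        # Tie one feature loosely to `cp` so the models have some signal
        "thalach": (150 - 8 * cp + rng.normal(0, 15, size=n)).round(),
        "output": rng.integers(0, 2, size=n),
    })


ha_demo = make_fake_ha()

# Drop `cp` from the predictors so the target cannot leak into X
X = ha_demo.drop(columns=["cp"])

log_pipe = Pipeline([
    ("standardize", StandardScaler()),
    ("logistic_regression", LogisticRegression()),
])

# One binary "this category vs. the rest" problem per chest pain type
ovr_f1 = {}
for k in sorted(ha_demo["cp"].unique()):
    y_k = (ha_demo["cp"] == k).astype(int)
    ovr_f1[int(k)] = cross_val_score(log_pipe, X, y_k, cv=5, scoring="f1").mean()

for k, score in ovr_f1.items():
    print(f"cp == {k}: mean F1 = {score:.3f}")
```

When a fold predicts no positives at all, scikit-learn reports an F1 of 0 for that fold and emits a warning, so low scores here are expected for categories with weak signal.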

## Q3: OvO

Reduce your dataset to only the `0` and `1` types of chest pain.

Then, fit a Logistic Regression to predict between the two groups, and report the **ROC-AUC**.  

Repeat, comparing category `0` with `2` and then with `3`.  Which pair was the OvO approach best at distinguishing?

In [87]:
ha = pd.read_csv("https://www.dropbox.com/s/aohbr6yb9ifmc8w/heart_attack.csv?dl=1")

# Keep only chest pain types 0 and 1
ha = ha[(ha['cp'] != 2) & (ha['cp'] != 3)]
X = ha.drop(["cp"], axis = 1)
y = ha["cp"]

log_pipe.fit(X, y)
scores = cross_val_score(log_pipe, X, y, cv=5, scoring='roc_auc')
scores.mean()

0.8376709401709401

In [88]:
ha = pd.read_csv("https://www.dropbox.com/s/aohbr6yb9ifmc8w/heart_attack.csv?dl=1")

# Keep only chest pain types 0 and 2
ha = ha[(ha['cp'] != 1) & (ha['cp'] != 3)]
X = ha.drop(["cp"], axis = 1)
y = ha["cp"]

log_pipe.fit(X, y)
scores = cross_val_score(log_pipe, X, y, cv=5, scoring='roc_auc')
scores.mean()

0.7515837104072398

In [89]:
ha = pd.read_csv("https://www.dropbox.com/s/aohbr6yb9ifmc8w/heart_attack.csv?dl=1")

# Keep only chest pain types 0 and 3
ha = ha[(ha['cp'] != 2) & (ha['cp'] != 1)]
X = ha.drop(["cp"], axis = 1)
y = ha["cp"]

log_pipe.fit(X, y)
scores = cross_val_score(log_pipe, X, y, cv=5, scoring='roc_auc')
scores.mean()

0.7412307692307692

The OvO model was best at distinguishing chest pain types 0 and 1 (ROC-AUC ≈ 0.838), ahead of 0 vs. 2 (≈ 0.752) and 0 vs. 3 (≈ 0.741).
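
The three pairwise fits above repeat the same steps, so they can be written as a single loop. The sketch below is one possible way to do that. It again uses `make_fake_ha`, a hypothetical helper producing synthetic data with the same column names as `ha`, so that the block runs without the download; the resulting scores are not comparable to the ones reported above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def make_fake_ha(n=200, seed=0):
    """Hypothetical synthetic stand-in for `ha` (same column names, made-up values)."""
    rng = np.random.default_rng(seed)
    cp = rng.integers(0, 4, size=n)
    return pd.DataFrame({
        "age": rng.integers(30, 80, size=n),
        "sex": rng.integers(0, 2, size=n),
        "cp": cp,
        "trtbps": rng.normal(130, 15, size=n).round(),
        "chol": rng.normal(240, 40, size=n).round(),
        "restecg": rng.integers(0, 3, size=n),
        # Tie one feature loosely to `cp` so the models have some signal
        "thalach": (150 - 8 * cp + rng.normal(0, 15, size=n)).round(),
        "output": rng.integers(0, 2, size=n),
    })


ha_demo = make_fake_ha()

log_pipe = Pipeline([
    ("standardize", StandardScaler()),
    ("logistic_regression", LogisticRegression()),
])

# Compare category 0 against each other category in turn
ovo_auc = {}
for other in (1, 2, 3):
    pair = ha_demo[ha_demo["cp"].isin([0, other])]  # rows from the two classes only
    X = pair.drop(columns=["cp"])
    y = (pair["cp"] == other).astype(int)           # 1 = the non-zero category
    ovo_auc[(0, other)] = cross_val_score(
        log_pipe, X, y, cv=5, scoring="roc_auc"
    ).mean()

best_pair = max(ovo_auc, key=ovo_auc.get)
for pair_key, auc in ovo_auc.items():
    print(f"cp {pair_key[0]} vs {pair_key[1]}: mean ROC-AUC = {auc:.3f}")
print("best separated pair:", best_pair)
```

Recoding the target to 0/1 inside the loop is a design choice that makes the positive class explicit; passing the raw labels, as the cells above do, also works for a two-class problem.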