## The Data

At this link, you will find a dataset containing information about heart disease patients: https://www.dropbox.com/scl/fi/0vrpdnq5asmeulc4gd50y/ha_1.csv?rlkey=ciisalceotl77ffqhqe3kujzv&dl=1

A description of the original dataset can be found here: https://archive.ics.uci.edu/dataset/45/heart+disease (However, this dataset has been cleaned and reduced, and the people have been given fictitious names.)

In [69]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures


In [70]:
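# Read directly from the dataset link above so the cell runs on any machine
df = pd.read_csv("https://www.dropbox.com/scl/fi/0vrpdnq5asmeulc4gd50y/ha_1.csv?rlkey=ciisalceotl77ffqhqe3kujzv&dl=1")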
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Name       204 non-null    object
 1   age        204 non-null    int64 
 2   sex        204 non-null    int64 
 3   cp         204 non-null    int64 
 4   trtbps     204 non-null    int64 
 5   chol       204 non-null    int64 
 6   restecg    204 non-null    int64 
 7   thalach    204 non-null    int64 
 8   diagnosis  204 non-null    object
dtypes: int64(7), object(2)
memory usage: 14.5+ KB


In [71]:
ct = ColumnTransformer(
  [
    ("dummify", 
    OneHotEncoder(sparse_output = False, handle_unknown='ignore'),
    make_column_selector(dtype_include=object))
  ],
  remainder = "passthrough"
)

## 1. Logistic Regression

Fit a Logistic Regression using only `age` and `chol` (cholesterol) as predictors.

For a 55-year-old, how high would their cholesterol need to be for the doctors to predict that heart disease is present?

How high would it need to be for the doctors to estimate a 90% chance that heart disease is present?

In [72]:
X = df[["age", "chol"]]
y = df["diagnosis"]

In [73]:
log_pipe = Pipeline(
  [("preprocessing", ct),
  ("logistic_regression", LogisticRegression())]
)

log_pipe.fit(X, y)

coefs = log_pipe.named_steps["logistic_regression"].coef_[0]
intercept = log_pipe.named_steps["logistic_regression"].intercept_[0]
print("coefs: ", coefs ,", intercept:", intercept)

coefs:  [0.04686331 0.00180124] , intercept: -3.2401122588585896
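
Before interpreting these coefficients, it may be worth confirming which class they describe. Because `diagnosis` is stored as strings, scikit-learn sorts the labels, and for a binary problem `coef_` and `intercept_` give the log-odds of `classes_[1]`. The sketch below illustrates this on synthetic data; the label strings `"Disease"` and `"No Disease"` are assumptions for illustration, not values confirmed from `ha_1.csv`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the label strings are hypothetical, not taken from ha_1.csv
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = np.where(X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0, "Disease", "No Disease")

demo_model = LogisticRegression().fit(X_demo, y_demo)

# classes_ is sorted, and the coefficients give the log-odds of classes_[1]
positive_class = demo_model.classes_[1]
print(demo_model.classes_, "-> coefficients describe:", positive_class)
```

In the notebook, the same check would be `log_pipe.named_steps["logistic_regression"].classes_`. If `classes_[1]` turns out to be the no-disease label, the thresholds computed from these coefficients describe the opposite outcome and need to be read accordingly.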


In [74]:
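# coef_ follows the column order of X: age first, then chol
beta_age = coefs[0]
beta_chol = coefs[1]

# The 50% threshold is where the log-odds equal zero:
# intercept + beta_age * 55 + beta_chol * chol = 0
chol_50 = -(intercept + beta_age * 55) / beta_chol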
print(f"Cholesterol needed for 50% threshold: {chol_50:.2f}")

Cholesterol needed for 50% threshold: 367.87
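
For reference, the threshold formula can be derived by inverting the fitted model. This is a sketch under two assumptions: the coefficients describe the log-odds of the disease class, and $\beta_{\text{chol}} > 0$, so that higher cholesterol raises the predicted probability. The symbols $\beta_0$, $\beta_{\text{age}}$, $\beta_{\text{chol}}$ stand for `intercept`, `beta_age`, and `beta_chol`.

$$
\log\frac{p}{1-p} = \beta_0 + \beta_{\text{age}}\,\text{age} + \beta_{\text{chol}}\,\text{chol}
\quad\Longrightarrow\quad
\text{chol} = \frac{\log\frac{p}{1-p} - \beta_0 - \beta_{\text{age}}\,\text{age}}{\beta_{\text{chol}}}
$$

At $p = 0.5$ the log-odds are zero, so for a 55-year-old:

$$
\text{chol} = -\frac{\beta_0 + 55\,\beta_{\text{age}}}{\beta_{\text{chol}}}
= -\frac{-3.2401 + 55 \times 0.046863}{0.0018012} \approx 367.9
$$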


In [75]:
# A 90% probability corresponds to log-odds of log(0.9 / 0.1)
log_odds_90 = np.log(0.9 / (1 - 0.9))
chol_90 = (log_odds_90 - intercept - beta_age * 55) / beta_chol
print(f"Cholesterol needed for 90% threshold: {chol_90:.2f}")

Cholesterol needed for 90% threshold: 1587.72
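
Both thresholds follow the same pattern, so they can be wrapped in one small helper. The function name `chol_threshold` and its parameters are hypothetical, introduced only for illustration; the coefficient values are copied from the printed logistic regression output, so results may differ from the notebook in the last decimal because of rounding.

```python
import math

def chol_threshold(intercept, beta_age, beta_chol, age, prob=0.5):
    """Cholesterol at which the fitted log-odds equal logit(prob) for a given age."""
    log_odds = math.log(prob / (1 - prob))
    return (log_odds - intercept - beta_age * age) / beta_chol

# Coefficients copied from the logistic regression output (rounded as printed)
intercept, beta_age, beta_chol = -3.2401122588585896, 0.04686331, 0.00180124

chol_50 = chol_threshold(intercept, beta_age, beta_chol, age=55)
chol_90 = chol_threshold(intercept, beta_age, beta_chol, age=55, prob=0.9)
print(f"50%: {chol_50:.2f}, 90%: {chol_90:.2f}")
```

Note that the 90% threshold lies far outside the range of plausible cholesterol values, which suggests that `chol` alone moves the predicted probability very little in this model.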


## 2. Linear Discriminant Analysis

Fit an LDA model using only `age` and `chol` (cholesterol) as predictors.

For a 55-year-old, how high would their cholesterol need to be for the doctors to predict that heart disease is present?

In [76]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [77]:
lda_pipe = Pipeline(
  [("preprocessing", ct),
  ("lda", LinearDiscriminantAnalysis())]
)

lda_pipe.fit(X, y)

coefs = lda_pipe.named_steps["lda"].coef_[0]
intercept = lda_pipe.named_steps["lda"].intercept_[0]
print("coefs: ", coefs ,", intercept:", intercept)

coefs:  [0.04655744 0.00178967] , intercept: -3.2196776604068775


In [78]:
beta_age = coefs[0]
beta_chol = coefs[1]

chol_50 = -(intercept + beta_age * 55) / beta_chol
print(f"Cholesterol needed for prediction of heart disease: {chol_50:.2f}")

Cholesterol needed for prediction of heart disease: 368.24


## 3. Support Vector Classifier

Fit an SVC model using only `age` and `chol` as predictors. Don't forget to tune the regularization parameter `C`.

For a 55-year-old, how high would their cholesterol need to be for the doctors to predict that heart disease is present?

In [79]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC


In [80]:
svc_pipe = Pipeline(
  [("preprocessing", ct),
  ("svc", SVC(kernel='linear'))]
)

param_grid = {'svc__C': [0.1, 1, 10, 100]}
grid_search = GridSearchCV(svc_pipe, param_grid, cv=5)
grid_search.fit(X, y)
print(grid_search.best_params_)

{'svc__C': 0.1}


In [81]:
best_svc = grid_search.best_estimator_

beta_age = best_svc.named_steps["svc"].coef_[0][0]
beta_chol = best_svc.named_steps["svc"].coef_[0][1]
intercept = best_svc.named_steps["svc"].intercept_[0]

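# For a linear kernel, the predicted class flips where
# beta_age * age + beta_chol * chol + intercept = 0
chol_50 = -(beta_age * 55 + intercept) / beta_chol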

print(f"Cholesterol needed for prediction of heart disease: {chol_50:.2f}")

Cholesterol needed for prediction of heart disease: 312.69
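
One way to sanity-check a boundary value like this is to evaluate the model's `decision_function` at the computed point, where it should be numerically zero. The sketch below does that on synthetic `(age, chol)` data, since the fitted notebook model is not reproduced here; the data-generating rule and the names `svc_demo` and `chol_boundary` are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in for the (age, chol) columns; not the ha_1.csv data
rng = np.random.default_rng(1)
age = rng.integers(30, 80, size=150)
chol = rng.normal(250, 50, size=150)
X_demo = np.column_stack([age, chol])
y_demo = (0.05 * age + 0.02 * chol + rng.normal(0, 0.5, size=150) > 7.75).astype(int)

svc_demo = SVC(kernel="linear", C=0.1).fit(X_demo, y_demo)
w_age, w_chol = svc_demo.coef_[0]
b = svc_demo.intercept_[0]

# Boundary cholesterol for a 55-year-old: solve w_age * 55 + w_chol * chol + b = 0
chol_boundary = -(w_age * 55 + b) / w_chol

# The decision function should be (numerically) zero at that point
value_on_boundary = svc_demo.decision_function([[55, chol_boundary]])[0]
print(chol_boundary, value_on_boundary)
```

In the notebook, the analogous check would be `best_svc.decision_function(pd.DataFrame({"age": [55], "chol": [chol_50]}))`, which should return a value very close to zero.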


## 4. Comparing Decision Boundaries

Make a scatterplot of `age` and `chol`, coloring the points by their true disease outcome. Add a line to the plot representing the **linear separator** (aka **decision boundary**) for each of the three models above.
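
A possible starting point, sketched with matplotlib: each model's boundary is the line where `intercept + beta_age * age + beta_chol * chol = 0`, so it can be drawn from two endpoints. The helper names `boundary_chol` and `plot_boundaries` are hypothetical. The logistic regression and LDA values are copied from the printed outputs in sections 1 and 2; the SVC entry is left out because its coefficients are not printed, and would be added from `best_svc.named_steps["svc"]` after fitting.

```python
def boundary_chol(age, intercept, beta_age, beta_chol):
    """Cholesterol on the linear decision boundary at a given age."""
    return -(intercept + beta_age * age) / beta_chol

# (intercept, beta_age, beta_chol) as printed in sections 1 and 2
models = {
    "Logistic regression": (-3.2401122588585896, 0.04686331, 0.00180124),
    "LDA": (-3.2196776604068775, 0.04655744, 0.00178967),
}

def plot_boundaries(df, models, label_col="diagnosis"):
    """Scatter age vs. chol colored by diagnosis, with one boundary line per model."""
    import matplotlib.pyplot as plt  # imported here so the helpers above work without it

    fig, ax = plt.subplots()
    for label, group in df.groupby(label_col):
        ax.scatter(group["age"], group["chol"], label=str(label), alpha=0.6)
    ages = [df["age"].min(), df["age"].max()]
    for name, params in models.items():
        ax.plot(ages, [boundary_chol(a, *params) for a in ages], label=name)
    ax.set_xlabel("age")
    ax.set_ylabel("chol")
    ax.legend()
    return ax
```

Calling `plot_boundaries(df, models)` would draw the figure. Because the boundaries are steep relative to the data, setting the y-axis limits to the observed `chol` range (for example with `ax.set_ylim`) may make the plot easier to read.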