## The Data

At this link, you will find a dataset containing information about heart disease patients: https://www.dropbox.com/scl/fi/0vrpdnq5asmeulc4gd50y/ha_1.csv?rlkey=ciisalceotl77ffqhqe3kujzv&dl=1

A description of the original dataset can be found here: https://archive.ics.uci.edu/dataset/45/heart+disease (However, this dataset has been cleaned and reduced, and the people have been given fictitious names.)

In [None]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load the cleaned heart disease data and preview the first rows
ha = pd.read_csv('https://www.dropbox.com/scl/fi/0vrpdnq5asmeulc4gd50y/ha_1.csv?rlkey=ciisalceotl77ffqhqe3kujzv&dl=1')
ha.head()


Unnamed: 0,Name,age,sex,cp,trtbps,chol,restecg,thalach,diagnosis
0,Magnolia Cassin,60,1,1,117,230,1,160,No Disease
1,Dr. Cathern Heathcote,60,0,3,102,318,1,160,Disease
2,Miles Wolf,62,0,3,130,263,1,97,No Disease
3,Mikaila Block,43,1,1,115,303,1,181,Disease
4,Mrs. Jacquline Marquardt,66,1,1,120,302,0,151,Disease


## 1. Logistic Regression

Fit a Logistic Regression using only `age` and `chol` (cholesterol) as predictors.

For a 55-year-old, how high would their cholesterol need to be for the doctors to predict that heart disease is present?

How high would it need to be for the doctors to estimate a 90% chance that heart disease is present?

In [None]:
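X = ha[["age", "chol"]]
y = ha["diagnosis"]

# A very large C makes the L2 penalty negligible (effectively unregularized)
log_reg = LogisticRegression(C=1e6, solver="lbfgs")
log_reg.fit(X, y)

# For a binary model, intercept_ and coef_ describe the log-odds of
# log_reg.classes_[1]. String labels are stored in sorted order, so with the
# labels shown above that class is "No Disease", not "Disease".
b0 = log_reg.intercept_[0]        # intercept
b_age, b_chol = log_reg.coef_[0]  # coefficients for age and chol

print("Intercept (b0):", b0)
print("b_age:", b_age)
print("b_chol:", b_chol)

def chol_for_prob(age, p):
    """
    Solve for the cholesterol level such that
    P(log_reg.classes_[1] | age, chol) = p
    using the fitted logistic regression model.

    If classes_[1] is "No Disease", pass 1 - p to target P("Disease") = p.
    """
    logit = np.log(p / (1 - p))
    return (logit - b0 - b_age * age) / b_chol

age_55 = 55

# p refers to the class in log_reg.classes_[1]; the 50% boundary is the
# same for both classes, but the 90% level is not
chol_50 = chol_for_prob(age_55, 0.5)   # 50% probability
chol_90 = chol_for_prob(age_55, 0.9)   # 90% probability

print(f"\nFor a {age_55}-year-old:")
print(f"Chol for 50% probability: {chol_50:.2f}")
print(f"Chol for 90% probability: {chol_90:.2f}")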


Intercept (b0): -3.24069238858948
b_age: 0.046876445708862215
b_chol: 0.0018007041214176147

For a 55-year-old:
Chol for 50% probability: 367.90
Chol for 90% probability: 1588.11
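
As a check on the arithmetic, the printed coefficients can be plugged back into the inverse of the logistic function, chol = (logit(p) - b0 - b_age * age) / b_chol. The sketch below uses only the values printed above and needs no fitted model. It also illustrates a caveat worth confirming with `log_reg.classes_`: scikit-learn stores string labels in sorted order, and for a binary model the coefficients describe the log-odds of the second label. With the labels shown in `ha.head()`, that would be "No Disease". Under that assumption, a 90% chance of "Disease" corresponds to p = 0.1 rather than p = 0.9.

```python
import math

# Coefficients copied from the printed output above
b0 = -3.24069238858948
b_age = 0.046876445708862215
b_chol = 0.0018007041214176147

def chol_for_prob(age, p):
    """Cholesterol at which the modeled class has probability p."""
    logit = math.log(p / (1 - p))
    return (logit - b0 - b_age * age) / b_chol

chol_50 = chol_for_prob(55, 0.5)
chol_90 = chol_for_prob(55, 0.9)

# Assumption: the modeled class is "No Disease",
# so P(Disease) = 0.9 is the same as p = 0.1 here
chol_disease_90 = chol_for_prob(55, 0.1)

print(f"{chol_50:.2f}")   # 367.90
print(f"{chol_90:.2f}")   # 1588.11
print(f"{chol_disease_90:.2f}")
```

If the assumption holds, the last value comes out negative, meaning no attainable cholesterol level gives a 55-year-old a 90% estimated chance of disease under this model.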


## 2. Linear Discriminant Analysis

Fit an LDA model using only `age` and `chol` (cholesterol) as predictors.

For a 55-year-old, how high would their cholesterol need to be for the doctors to predict that heart disease is present?

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# For two classes, coef_ and intercept_ define a single linear score;
# a positive score predicts lda.classes_[1]
w_age, w_chol = lda.coef_[0]   # linear weights for age and chol
intercept = lda.intercept_[0]  # intercept

print("w_age:", w_age)
print("w_chol:", w_chol)
print("intercept:", intercept)

def chol_cutoff(age):
    """Cholesterol level on the decision boundary for a given age."""
    # decision boundary: w_age*age + w_chol*chol + intercept = 0
    return -(w_age * age + intercept) / w_chol

chol_55 = chol_cutoff(55)
print("\nCholesterol threshold for 55-year-old:", chol_55)

w_age: 0.04655743553694484
w_chol: 0.0017896671039291314
intercept: -3.2196776604068784

Cholesterol threshold for 55-year-old: 368.235357529938
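
The threshold alone does not say which side of it is labeled "Disease". For a binary scikit-learn linear classifier, a positive decision score selects the second label in sorted order. The sketch below assumes the class order is ("Disease", "No Disease"), which is worth confirming with `lda.classes_`, and evaluates the printed LDA weights on either side of the threshold for a 55-year-old. The helper names `decision_score` and `predicted_label` are illustrative, not part of scikit-learn.

```python
# Weights copied from the printed LDA output above
w_age = 0.04655743553694484
w_chol = 0.0017896671039291314
intercept = -3.2196776604068784

# Assumption: sorted label order, as scikit-learn would store it in lda.classes_
classes = ("Disease", "No Disease")

def decision_score(age, chol):
    """Linear score whose sign determines the predicted class."""
    return w_age * age + w_chol * chol + intercept

def predicted_label(age, chol):
    # A positive score selects classes[1]; otherwise classes[0]
    return classes[1] if decision_score(age, chol) > 0 else classes[0]

# One cholesterol value below the threshold and one above it
for chol in (300, 400):
    print(chol, round(decision_score(55, chol), 4), predicted_label(55, chol))
```

If the assumption holds, the model predicts "Disease" for a 55-year-old when cholesterol is below roughly 368 and "No Disease" above it.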


## 3. Support Vector Classifier

Fit an SVC model using only `age` and `chol` as predictors. Don't forget to tune the regularization parameter.

For a 55-year-old, how high would their cholesterol need to be for the doctors to predict that heart disease is present?

In [None]:
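from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Standardize the predictors first: a linear-kernel SVC can be very slow
# to converge on unscaled features, especially for large C
svc_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC(kernel="linear")),
])

param_grid = {"svc__C": [0.01, 0.1, 1, 10, 100, 1000]}

svc_grid = GridSearchCV(svc_pipe, param_grid, cv=5)
svc_grid.fit(X, y)

print("Best C:", svc_grid.best_params_)

# Best tuned model
best = svc_grid.best_estimator_
scaler = best.named_steps["scale"]
svc = best.named_steps["svc"]

# Convert the coefficients from standardized units back to original units:
# sum(w_s * (x - mean) / scale) + b_s = 0
#   =>  sum((w_s / scale) * x) + (b_s - sum(w_s * mean / scale)) = 0
w_scaled = svc.coef_[0]
w_age, w_chol = w_scaled / scaler.scale_
b = svc.intercept_[0] - np.sum(w_scaled * scaler.mean_ / scaler.scale_)

print("w_age:", w_age)
print("w_chol:", w_chol)
print("intercept:", b)

def chol_cutoff(age):
    """Cholesterol level on the decision boundary for a given age."""
    # Solve: w_age*age + w_chol*chol + b = 0
    return -(w_age * age + b) / w_chol

# Cholesterol cutoff for a 55-year-old
chol_55 = chol_cutoff(55)
print("Cholesterol threshold for 55-year-old:", chol_55)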

## 4. Comparing Decision Boundaries

Make a scatterplot of `age` and `chol`, coloring the points by their true disease outcome.  Add a line to the plot representing the **linear separator** (aka **decision boundary**) for each of the three models above.
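
One possible way to set this up is sketched below. Each model's boundary satisfies w_age * age + w_chol * chol + b = 0, so two points are enough to draw it. The helper names `boundary_chol` and `plot_boundaries` are hypothetical, and the plotting function assumes a DataFrame with the `age`, `chol`, and `diagnosis` columns shown in `ha.head()`. The coefficients are copied from the printed logistic regression and LDA outputs above; no SVC values are recorded in this notebook, so that entry has to be added after fitting.

```python
def boundary_chol(w_age, w_chol, b, age):
    """Cholesterol at which w_age*age + w_chol*chol + b = 0."""
    return -(w_age * age + b) / w_chol

# (w_age, w_chol, intercept) copied from the printed outputs above
boundaries = {
    "Logistic regression": (0.046876445708862215, 0.0018007041214176147, -3.24069238858948),
    "LDA": (0.04655743553694484, 0.0017896671039291314, -3.2196776604068784),
}

def plot_boundaries(ha, boundaries):
    """Scatter age vs. chol colored by diagnosis, with one line per model."""
    # Imported here so boundary_chol can be used without a plotting backend
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for label, group in ha.groupby("diagnosis"):
        ax.scatter(group["age"], group["chol"], label=label, alpha=0.6)

    # A straight line is fully determined by its two endpoints
    ages = [ha["age"].min(), ha["age"].max()]
    for name, (w_age, w_chol, b) in boundaries.items():
        chols = [boundary_chol(w_age, w_chol, b, a) for a in ages]
        ax.plot(ages, chols, label=name)

    # Keep the view on the data; the lines can extend far beyond it
    ax.set_ylim(ha["chol"].min() - 20, ha["chol"].max() + 20)
    ax.set_xlabel("age")
    ax.set_ylabel("chol")
    ax.legend()
    return ax
```

Calling `plot_boundaries(ha, boundaries)` after adding an `"SVC"` entry with that model's original-unit weights would draw all three lines on the same axes.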