#### The goal of this question is to predict the heart health of patients in a hospital. In the homework package, you can access the data file **“HeartData.csv”**, which consists of 13 features and one response variable (num). The features are measurements of the patients’ health attributes, and num indicates heart health: num = 0 means the heart is healthy, and num = 1 means an issue was reported.

#### Split the data into a training set and a test set. Samples 1 to 200 form the training set and samples 201 to 297 form the test set. Try the following classification models to predict “num” from the other features in the dataset:
    – Use logistic regression for your classification. Report the p-values associated with the intercept and all the features. Which features have large p-values? Use the test data to estimate the accuracy of your model.
    – Apply LDA and QDA, and again report your model accuracies using the test data.
    – Among logistic regression, LDA, and QDA, which model(s) appear to be the most accurate?

In [39]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from ISLP.models import (ModelSpec as MS,
                         summarize)
from ISLP import confusion_table
from sklearn.discriminant_analysis import \
     (LinearDiscriminantAnalysis as LDA,
      QuadraticDiscriminantAnalysis as QDA)

# = = = = = Logistic Regression = = = = =

In [40]:
Data = pd.read_csv('HeartData.csv')
train = (Data.index < 200)
X = MS(Data.columns.drop(['num'])).fit_transform(Data)
Y = Data['num']
print(Data)

     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  \
0     63    1   1       145   233    1        2      150      0      2.3   
1     67    1   4       160   286    0        2      108      1      1.5   
2     67    1   4       120   229    0        2      129      1      2.6   
3     37    1   3       130   250    0        0      187      0      3.5   
4     41    0   2       130   204    0        2      172      0      1.4   
..   ...  ...  ..       ...   ...  ...      ...      ...    ...      ...   
292   57    0   4       140   241    0        0      123      1      0.2   
293   45    1   1       110   264    0        0      132      0      1.2   
294   68    1   4       144   193    1        0      141      0      3.4   
295   57    1   4       130   131    0        0      115      1      1.2   
296   57    0   2       130   236    0        2      174      0      0.0   

     slope  ca  thal  num  
0        3   0     6    0  
1        2   3     3    1  
2  

In [48]:
y_train, X_train = Y.loc[train] , X.loc[train]
y_test, X_test = Y.loc[~train] , X.loc[~train]
print(y_train)
LogisticRegressionModel = sm.GLM(y_train, X_train, family=sm.families.Binomial()).fit()
LogisticRegressionModel.summary()

0      0
1      1
2      1
3      0
4      0
      ..
195    0
196    1
197    0
198    0
199    0
Name: num, Length: 200, dtype: int64


Dep. Variable:      num               No. Observations:    200
Model:              GLM               Df Residuals:        186
Model Family:       Binomial          Df Model:            13
Link Function:      Logit             Scale:               1.0
Method:             IRLS              Log-Likelihood:      -63.478
Date:               Wed, 15 May 2024  Deviance:            126.96
Time:               13:42:11          Pearson chi2:        174
No. Iterations:     6                 Pseudo R-squ. (CS):  0.5236
Covariance Type:    nonrobust

               coef  std err       z  P>|z|   [0.025    0.975]
intercept  -10.3711    3.716  -2.791  0.005  -17.655    -3.087
age         -0.0073    0.031  -0.236  0.813   -0.068     0.054
sex          1.8161    0.703   2.582  0.010    0.438     3.195
cp           0.9642    0.290   3.324  0.001    0.396     1.533
trestbps     0.0341    0.014   2.432  0.015    0.007     0.062
chol         0.0075    0.005   1.583  0.113   -0.002     0.017
fbs         -1.0563    0.654  -1.616  0.106   -2.337     0.225
restecg      0.4627    0.244   1.894  0.058   -0.016     0.942
thalach     -0.0285    0.014  -1.967  0.049   -0.057 -9.99e-05
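
The question asks which features have large p-values. As a minimal sketch, the snippet below recomputes the two-sided p-values from the z statistics of the rows visible in the summary table and flags those above a threshold. The 0.05 cutoff and the names `z_stats`, `two_sided_p`, and `ALPHA` are assumptions made for illustration; only the rows shown in the output are included, so features missing from the displayed table are not covered.

```python
import math

# z statistics copied from the rows visible in the summary table.
z_stats = {
    "intercept": -2.791, "age": -0.236, "sex": 2.582, "cp": 3.324,
    "trestbps": 2.432, "chol": 1.583, "fbs": -1.616,
    "restecg": 1.894, "thalach": -1.967,
}

def two_sided_p(z):
    """Two-sided p-value of a z statistic under a standard normal."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))

ALPHA = 0.05  # assumed significance level, not specified in the question

p_values = {name: two_sided_p(z) for name, z in z_stats.items()}
large_p = [name for name, p in p_values.items() if p > ALPHA]

for name, p in p_values.items():
    print(f"{name:10s} p = {p:.3f}")
print("p > 0.05:", large_p)  # ['age', 'chol', 'fbs', 'restecg']
```

With a fitted statsmodels result, the same values should be available directly through the `pvalues` attribute, so this recomputation is mainly a check on how the `P>|z|` column is derived.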


In [55]:
probs = LogisticRegressionModel.predict(exog=X_test)
print('===========================================================================')
labels = np.array([0]*y_test.shape[0])
print(probs>0.5)
labels[probs>0.5] = 1
print(probs.shape)
print(confusion_table(labels, y_test))
print('===========================================================================')
print('True rate:', np.mean(labels == y_test), ', False rate:', np.mean(labels != y_test))

200    False
201    False
202     True
203     True
204     True
       ...  
292     True
293    False
294     True
295     True
296    False
Length: 97, dtype: bool
(97,)
Truth       0   1
Predicted        
0          46  15
1           4  32
True rate: 0.8041237113402062 , False rate: 0.1958762886597938
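
The "True rate" above is the overall accuracy, which can be read off the confusion table directly. The sketch below (the helper name `confusion_metrics` is hypothetical) recomputes it from the four counts and also reports sensitivity and specificity, which show that most of the errors are patients with a heart issue who were predicted healthy.

```python
def confusion_metrics(tn, fn, fp, tp):
    """Accuracy, sensitivity, and specificity from confusion-table counts."""
    total = tn + fn + fp + tp
    return {
        "accuracy": (tn + tp) / total,      # share of correct predictions
        "sensitivity": tp / (tp + fn),      # share of true 1s predicted as 1
        "specificity": tn / (tn + fp),      # share of true 0s predicted as 0
    }

# Counts from the logistic regression confusion table
# (rows are predictions, columns are the truth).
metrics = confusion_metrics(tn=46, fn=15, fp=4, tp=32)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```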


# = = = = = Running LDA = = = = =

In [38]:
lda = LDA(store_covariance=True)
# The LDA estimator fits its own intercept, so the 'intercept' column added by ModelSpec
# must be removed from both X_train and X_test. y_train already holds the 0/1 class
# labels, so it can be passed to fit() directly.

if 'intercept' in X_train:
    X_train, X_test = [M.drop(columns=['intercept']) for M in [X_train, X_test]]
    # print(X_test)
print('===========================================================================')
# print(y_train)
lda.fit(X_train, y_train)
lda_pred = lda.predict(X_test)
print(labels)    # logistic regression predictions from the previous cell, for comparison
print(lda_pred)  # LDA predictions
print(confusion_table(lda_pred, y_test))
print('True rate:', np.mean(lda_pred == y_test), ', False rate:', np.mean(lda_pred != y_test))

[0 0 1 1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 1 0 0 1 1 1 0 0
 0 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 1 0 0 0 0 0
 0 0 1 0 1 0 1 1 1 0 0 1 0 1 1 0 0 1 1 0 1 1 0]
[0 0 1 1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 1 0 0 1 1 1 0 0
 0 0 0 0 0 1 1 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 1 0 0 0 0 0
 0 0 1 0 1 0 1 1 1 0 0 1 0 1 1 0 0 1 1 0 1 1 0]
Truth       0   1
Predicted        
0          46  14
1           4  33
True rate: 0.8144329896907216 , False rate: 0.18556701030927836


# = = = = = Running QDA = = = = =

In [23]:
qda = QDA(store_covariance=True)
qda.fit(X_train, y_train)
qda_pred = qda.predict(X_test)
print(confusion_table(qda_pred, y_test))
print(np.mean(qda_pred == y_test), np.mean(qda_pred != y_test))

Truth       0   1
Predicted        
0          46  15
1           4  32
0.8041237113402062 0.1958762886597938


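#### Among these models, ***LDA*** is the most accurate on the test set: it classifies 79 of the 97 test samples correctly (81.4%), while logistic regression and ***QDA*** each classify 78 correctly (80.4%).
#### The gap is a single test sample, so the three models perform almost identically here. ***QDA*** may be slightly less accurate because it estimates a separate covariance matrix for each class from only 200 training samples, which makes it more prone to overfitting.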
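
To put the comparison in perspective, the sketch below recomputes the three test accuracies from the confusion tables and attaches a rough binomial standard error to each. Treating the 97 test predictions as independent Bernoulli trials is an assumption made for illustration, and it is only a coarse guide since all three models are scored on the same test samples.

```python
import math

n_test = 97

# Correct predictions (diagonal of each confusion table above).
correct = {"logistic": 46 + 32, "LDA": 46 + 33, "QDA": 46 + 32}

accuracy = {model: c / n_test for model, c in correct.items()}
# Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n).
std_err = {model: math.sqrt(a * (1 - a) / n_test) for model, a in accuracy.items()}

for model in correct:
    print(f"{model:9s} accuracy = {accuracy[model]:.4f} +/- {std_err[model]:.4f}")

# Difference between the best model and the others: one test sample.
gap = accuracy["LDA"] - accuracy["QDA"]
print(f"gap = {gap:.4f}")
```

Under these assumptions the gap between LDA and the other two models is smaller than one standard error, so the ranking could easily change with a different train/test split.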