<a href="https://colab.research.google.com/github/thaonguyyen/project_chd/blob/main/linear_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Import the necessary packages

In [24]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Read in the training data

In [25]:
url = 'https://raw.githubusercontent.com/thaonguyyen/project_chd/main/cleaned_train_data.csv'
df = pd.read_csv(url)
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,sex,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,0,1267,1,58,Some high school,0,0.0,0.0,0,0,0,220.0,143.0,104.0,29.85,75,87.0,1
1,1,1209,0,40,Some high school,1,15.0,0.0,0,0,0,199.0,122.0,82.0,22.16,85,77.0,0
2,2,2050,0,52,Some high school,0,0.0,0.0,0,0,0,275.0,112.0,71.0,25.68,80,85.0,0
3,3,1183,1,38,High school/GED,1,43.0,0.0,0,1,0,170.0,130.0,94.0,23.9,110,75.0,0
4,4,3225,0,43,Some high school,0,0.0,0.0,0,0,0,202.0,124.0,92.0,21.26,75,74.0,0


Several variables in this dataset are dummy (0/1) variables. Most are intuitive (1 if true, 0 if false); for the 'sex' variable, 0 represents female and 1 represents male.
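The education column, by contrast, is stored as text and has to be converted to dummies before it can enter a regression. Below is a minimal sketch of that encoding on a made-up series; the values are illustrative, not rows from the dataset.

```python
import pandas as pd

# Toy stand-in for df['education']; these rows are made up for illustration.
education = pd.Series(['Some high school', 'College', 'High school/GED', 'College'])

# One 0/1 column per category; dtype='int' gives 0/1 rather than True/False.
dummies = pd.get_dummies(education, dtype='int')
print(dummies)

# Every row belongs to exactly one category, so each row of dummies sums to 1.
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())
```

Because the columns sum to 1 in every row, a full set of dummies carries the same information as a constant column, which matters when a model is fit without an intercept.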

In [35]:
# Linear model using all available variables
X = pd.concat([df[['sex', 'age']],
               pd.get_dummies(df['education'], dtype='int'),
               df[['currentSmoker', 'cigsPerDay', 'BPMeds', 'prevalentStroke', 'prevalentHyp', 'diabetes', 'totChol', 'sysBP', 'diaBP', 'BMI', 'heartRate', 'glucose']]], axis=1)
# Target variable: 1 if the patient develops CHD within ten years, else 0
y = df['TenYearCHD']

# Construct a linear model with no intercept using sklearn
# (the full set of education dummies takes the place of the intercept)
from sklearn import linear_model
reg = linear_model.LinearRegression(fit_intercept=False).fit(X, y)

results = pd.DataFrame({'Variables':reg.feature_names_in_, 'Coefficient':reg.coef_})
results

Unnamed: 0,Variables,Coefficient
0,sex,0.064109
1,age,0.006864
2,College,-0.582501
3,High school/GED,-0.613281
4,Some college,-0.587104
5,Some high school,-0.585679
6,Unknown education,-0.625217
7,currentSmoker,-0.011914
8,cigsPerDay,0.002694
9,BPMeds,0.069935


In [36]:
# Calculate R^2
r2 = reg.score(X, y)
print('R_squared: ', r2)

# Calculate Adj R^2
n = len(y)
k = X.shape[1]
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)
print('Adj R_squared: ', adj_r2)

R_squared:  0.09982977825200601
Adj R_squared:  0.09428038363383207


In [37]:
# Linear model with no education variable
X = df[['sex', 'age', 'currentSmoker', 'cigsPerDay', 'BPMeds', 'prevalentStroke', 'prevalentHyp', 'diabetes', 'totChol', 'sysBP', 'diaBP', 'BMI', 'heartRate', 'glucose']]
y = df['TenYearCHD']

# Construct a linear model with no intercept using sklearn
from sklearn import linear_model
reg = linear_model.LinearRegression(fit_intercept=False).fit(X, y)

results = pd.DataFrame({'Variables':reg.feature_names_in_, 'Coefficient':reg.coef_})
results

Unnamed: 0,Variables,Coefficient
0,sex,0.060253
1,age,0.004993
2,currentSmoker,-0.03649
3,cigsPerDay,0.003004
4,BPMeds,0.078546
5,prevalentStroke,0.122861
6,prevalentHyp,0.081005
7,diabetes,0.124615
8,totChol,5.2e-05
9,sysBP,0.002058


In [34]:
# Calculate R^2
r2 = reg.score(X, y)
print('R_squared: ', r2)

# Calculate Adj R^2
n = len(y)
k = X.shape[1]
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)
print('Adj R_squared: ', adj_r2)

R_squared:  0.08319492283227181
Adj R_squared:  0.07903707667731608


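Looking over the variables available in the dataset, we did not expect education level to be a factor in coronary heart disease, so we built two models: one including all available variables and one excluding education. The model with all variables has the higher adjusted R^2 (about 0.094 versus 0.079), which suggests it has somewhat more explanatory power. The comparison is not quite like-for-like, though: both models are fit with fit_intercept=False, but provided every row has an education category, the full set of education dummies acts as an intercept in the first model, while the second model has none.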
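Since the comparison above rests on adjusted R^2, it may help to see the penalty for extra regressors in isolation. The helper below is a small sketch using the same formula as the notebook cells; the function name and the sample numbers are made up for illustration and are not results from this dataset.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k regressors fit on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Illustrative numbers only: same raw R^2 and sample size, different k.
small_model = adjusted_r2(r2=0.10, n=1000, k=5)
large_model = adjusted_r2(r2=0.10, n=1000, k=50)
print(small_model, large_model)
```

With R^2 held fixed, the model with more regressors gets the lower adjusted value, so a larger model only comes out ahead when its gain in raw R^2 outweighs the penalty.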

A few problems with using linear regression models for this particular dataset:
1. The variable we are trying to predict, TenYearCHD, is binary (0 or 1), so we have a linear probability model. The predicted values are interpreted as probabilities of the event occurring, even though a linear fit can produce values below 0 or above 1.
2. Under a linear probability model, the regression line can never fit the data perfectly when the dependent variable is binary and the regressors are continuous. This means R^2, our primary tool for measuring the explanatory power or 'usefulness' of a model, loses its usual interpretation.
3. For machine learning purposes, where we are only trying to build a model that predicts coronary heart disease, a linear probability model isn't ideal: it outputs probabilities rather than class labels, so a cutoff has to be chosen separately. A classifier such as a decision tree would be a better fit.
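The points above can be illustrated with a rough sketch on synthetic data (not the CHD dataset). The data-generating process, the 0.5 cutoff, and max_depth=3 are all assumptions chosen for illustration; the aim is only to show a linear probability model producing out-of-range "probabilities" next to a decision tree that predicts labels directly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: one continuous regressor and a binary outcome.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=2000)
p_true = 1 / (1 + np.exp(-2 * x))  # true probability that y = 1
y = (rng.uniform(size=x.size) < p_true).astype(int)
X = x.reshape(-1, 1)

# Linear probability model: ordinary least squares on a 0/1 target.
lpm = LinearRegression().fit(X, y)
lpm_pred = lpm.predict(X)
print('LPM prediction range:', lpm_pred.min(), lpm_pred.max())

# Turning LPM output into labels needs an explicit cutoff (0.5 is an assumption).
lpm_labels = (lpm_pred >= 0.5).astype(int)

# A decision tree classifier predicts 0/1 labels directly.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_labels = tree.predict(X)

print('LPM accuracy at 0.5 cutoff:', (lpm_labels == y).mean())
print('Tree accuracy:', (tree_labels == y).mean())
```

The accuracies here are computed in-sample, so they are only a rough check; a fair comparison on the real data would need a held-out test set.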