# The Framingham Heart Study: Evaluating Risk Factors to Save Lives

<img src="images/Framingham2.jpg"/>

We'll be predicting the 10-year risk of coronary heart disease (CHD). This was the subject of an important 1998 paper that introduced what is now known as the Framingham Risk Score.

CHD is a disease of the blood vessels supplying the heart. It is one type of heart disease, which has been the leading cause of death worldwide since 1921. In 2008, 7.3 million people died from CHD. Even though the number of deaths due to CHD is still very high, age-adjusted death rates have declined 60% since 1950. This decline is due in part to earlier detection and monitoring, to which the Framingham Heart Study contributed.

This data set includes several demographic risk factors:

    the sex of the patient, male or female;

    the age of the patient in years; 

    the education level, coded as 1 for some high school, 2 for a high school diploma or GED, 3 for some college or vocational school, or 4 for a college degree.

The data set also includes behavioral risk factors associated with smoking:

    whether or not the patient is a current smoker and 
    
    the number of cigarettes that the person smoked on average in one day.

While it is now widely known that smoking increases the risk of heart disease, the idea that smoking is bad for you was novel in the 1940s.

Medical history risk factors were also included:

    whether or not the patient was on blood pressure medication, 

    whether or not the patient had previously had a stroke, 

    whether or not the patient was hypertensive, and 

    whether or not the patient had diabetes.

Lastly, the data set includes risk factors from the first physical examination of the patient:

    total cholesterol level,
    
    systolic blood pressure,

    diastolic blood pressure, 

    Body Mass Index or BMI, 

    heart rate, and 
    
    blood glucose level.

### Read in Dataset

In [1]:
framingham = read.csv("data/framingham.csv")
head(framingham)

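| | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<int>` | `<int>` |
| 1 | 1 | 39 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 195 | 106.0 | 70 | 26.97 | 80 | 77 | 0 |
| 2 | 0 | 46 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 250 | 121.0 | 81 | 28.73 | 95 | 76 | 0 |
| 3 | 1 | 48 | 1 | 1 | 20 | 0 | 0 | 0 | 0 | 245 | 127.5 | 80 | 25.34 | 75 | 70 | 0 |
| 4 | 0 | 61 | 3 | 1 | 30 | 0 | 0 | 1 | 0 | 225 | 150.0 | 95 | 28.58 | 65 | 103 | 1 |
| 5 | 0 | 46 | 3 | 1 | 23 | 0 | 0 | 0 | 0 | 285 | 130.0 | 84 | 23.1 | 85 | 85 | 0 |
| 6 | 0 | 43 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 228 | 180.0 | 110 | 30.3 | 77 | 99 | 0 |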


In [2]:
str(framingham)

'data.frame':	4240 obs. of  16 variables:
 $ male           : int  1 0 1 0 0 0 0 0 1 1 ...
 $ age            : int  39 46 48 61 46 43 63 45 52 43 ...
 $ education      : int  4 2 1 3 3 2 1 2 1 1 ...
 $ currentSmoker  : int  0 0 1 1 1 0 0 1 0 1 ...
 $ cigsPerDay     : int  0 0 20 30 23 0 0 20 0 30 ...
 $ BPMeds         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ prevalentStroke: int  0 0 0 0 0 0 0 0 0 0 ...
 $ prevalentHyp   : int  0 0 0 1 0 1 0 0 1 1 ...
 $ diabetes       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ totChol        : int  195 250 245 225 285 228 205 313 260 225 ...
 $ sysBP          : num  106 121 128 150 130 ...
 $ diaBP          : num  70 81 80 95 84 110 71 71 89 107 ...
 $ BMI            : num  27 28.7 25.3 28.6 23.1 ...
 $ heartRate      : int  80 95 75 65 85 77 60 79 76 93 ...
 $ glucose        : int  77 76 70 103 85 99 85 78 79 88 ...
 $ TenYearCHD     : int  0 0 0 1 0 0 1 0 0 0 ...


In [3]:
summary(framingham)

      male             age          education     currentSmoker   
 Min.   :0.0000   Min.   :32.00   Min.   :1.000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:42.00   1st Qu.:1.000   1st Qu.:0.0000  
 Median :0.0000   Median :49.00   Median :2.000   Median :0.0000  
 Mean   :0.4292   Mean   :49.58   Mean   :1.979   Mean   :0.4941  
 3rd Qu.:1.0000   3rd Qu.:56.00   3rd Qu.:3.000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :70.00   Max.   :4.000   Max.   :1.0000  
                                  NA's   :105                     
   cigsPerDay         BPMeds        prevalentStroke     prevalentHyp   
 Min.   : 0.000   Min.   :0.00000   Min.   :0.000000   Min.   :0.0000  
 1st Qu.: 0.000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.0000  
 Median : 0.000   Median :0.00000   Median :0.000000   Median :0.0000  
 Mean   : 9.006   Mean   :0.02962   Mean   :0.005896   Mean   :0.3106  
 3rd Qu.:20.000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:1.0000  
 Max.   :70.000   Max.   :1.0000
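
The `NA's` entry under `education` in the summary shows that at least one column has missing values. A minimal sketch of how missing values could be counted per column is given below; the data frame `toy` and its values are made up for illustration and are not rows from the Framingham file.

```r
# Toy data frame standing in for framingham; values are invented for illustration
toy <- data.frame(
  age       = c(39, 46, 48, 61, 46),
  education = c(4, NA, 1, 3, NA),
  glucose   = c(77, 76, NA, 103, 85)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows with no missing values at all.
# With R's default settings, glm() drops incomplete rows (na.action = na.omit).
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Running `colSums(is.na(framingham))` on the real data would give the same kind of per-column count.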

In [4]:
# Load the library caTools
library(caTools)

#### Randomly split the data into training and testing sets

In [5]:
set.seed(1000)

# Put 65% of the data in the training set; sample.split keeps the outcome ratio similar in both sets.
split = sample.split(framingham$TenYearCHD, SplitRatio = 0.65)

#### Split up the data using subset

In [6]:
train = subset(framingham, split==TRUE)
test = subset(framingham, split==FALSE)
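
`sample.split` splits on the outcome vector so that both sets keep roughly the same proportion of each class. The following base-R sketch illustrates that idea on synthetic outcomes; `stratified_split` is a hypothetical helper written for this example, not the caTools implementation, and it will not reproduce the exact rows chosen above.

```r
# Hypothetical helper: sample the given ratio of indices within each class of y
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_train <- round(ratio * length(idx))
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

set.seed(1000)
# Synthetic 0/1 outcome with roughly 15% positives (illustrative only)
y <- rbinom(1000, 1, 0.15)
split_demo <- stratified_split(y, 0.65)

# Share of rows in the training set, and the positive rate in each set
train_share <- mean(split_demo)
rate_train  <- mean(y[split_demo])
rate_test   <- mean(y[!split_demo])
print(c(train_share = train_share, rate_train = rate_train, rate_test = rate_test))
```

Keeping the positive rate similar in both sets matters here because the outcome is imbalanced: most patients do not develop CHD within 10 years.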

#### Logistic Regression Model

In [7]:
framinghamLog = glm(TenYearCHD ~ ., data = train, family=binomial)
summary(framinghamLog)


Call:
glm(formula = TenYearCHD ~ ., family = binomial, data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8487  -0.6007  -0.4257  -0.2842   2.8369  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -7.886574   0.890729  -8.854  < 2e-16 ***
male             0.528457   0.135443   3.902 9.55e-05 ***
age              0.062055   0.008343   7.438 1.02e-13 ***
education       -0.058923   0.062430  -0.944  0.34525    
currentSmoker    0.093240   0.194008   0.481  0.63080    
cigsPerDay       0.015008   0.007826   1.918  0.05514 .  
BPMeds           0.311221   0.287408   1.083  0.27887    
prevalentStroke  1.165794   0.571215   2.041  0.04126 *  
prevalentHyp     0.315818   0.171765   1.839  0.06596 .  
diabetes        -0.421494   0.407990  -1.033  0.30156    
totChol          0.003835   0.001377   2.786  0.00533 ** 
sysBP            0.011344   0.004566   2.485  0.01297 *  
diaBP           -0.004740   0.008001  -0.592  0
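
Logistic regression coefficients are on the log-odds scale, so exponentiating an estimate gives an odds ratio. The sketch below does this for a few estimates copied from the printed table; it treats them as fixed numbers and does not refit the model.

```r
# Estimates copied from the coefficient table printed by summary(framinghamLog)
coefs <- c(male = 0.528457, age = 0.062055, totChol = 0.003835, sysBP = 0.011344)

# Odds ratio for a one-unit increase in each predictor, holding the others fixed
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# Odds ratio for a 10-year increase in age
or_age_10 <- exp(10 * coefs[["age"]])
print(round(or_age_10, 2))
```

Under this model, each additional year of age multiplies the odds of CHD by roughly 1.06, and being male multiplies them by roughly 1.7.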

#### Predictions on the test set

In [8]:
predictTest = predict(framinghamLog, type="response", newdata=test)
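
With `type="response"`, `predict` returns probabilities rather than log-odds. A small sketch on synthetic data (not the Framingham data; `fit`, `x`, and `y` are illustrative names) shows that these probabilities are the logistic function applied to the linear predictor:

```r
set.seed(1)
# Synthetic predictor and 0/1 outcome, for illustration only
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)

new_obs <- data.frame(x = c(-1, 0, 2))

# Probabilities directly from predict()
p_response <- predict(fit, newdata = new_obs, type = "response")

# Same probabilities computed by hand: logistic function of the log-odds
log_odds <- predict(fit, newdata = new_obs, type = "link")
p_manual <- plogis(log_odds)  # equivalent to 1 / (1 + exp(-log_odds))

print(cbind(p_response, p_manual))
```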

#### Confusion matrix with threshold of 0.5

In [9]:
table(test$TenYearCHD, predictTest > 0.5)

   
    FALSE TRUE
  0  1069    6
  1   187   11

With a threshold of 0.5, we predict an outcome of 1 (the `TRUE` column) very rarely: only 17 times on the test set. This means that our model rarely predicts a 10-year CHD risk above 50%.

#### Accuracy

In [10]:
(1069+11)/(1069+6+187+11)

So the accuracy of our model is about 84.8%.

#### Baseline accuracy

In [11]:
(1069+6)/(1069+6+187+11) 

The baseline model, which always predicts no CHD, would get an accuracy of about 84.4%, so our model barely beats the baseline in terms of accuracy.
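
The accuracy figures can also be computed from the confusion matrix as a whole, together with sensitivity and specificity. The sketch below rebuilds the matrix from the counts printed above; the object name `cm` is an assumption made for this example.

```r
# Counts copied from the confusion matrix (rows: actual outcome, columns: predictTest > 0.5)
cm <- matrix(c(1069,  6,
                187, 11),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE")))

accuracy    <- sum(diag(cm)) / sum(cm)            # correct predictions / all predictions
baseline    <- sum(cm["0", ]) / sum(cm)           # accuracy of always predicting 0
sensitivity <- cm["1", "TRUE"] / sum(cm["1", ])   # true positive rate
specificity <- cm["0", "FALSE"] / sum(cm["0", ])  # true negative rate

print(round(c(accuracy = accuracy, baseline = baseline,
              sensitivity = sensitivity, specificity = specificity), 3))
```

At the 0.5 threshold the model flags only 11 of the 198 patients who went on to develop CHD, a sensitivity of about 5.6%, while specificity is above 99%.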

But can the model still be valuable if we vary the threshold?
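
To illustrate what varying the threshold does, the sketch below computes sensitivity and specificity at a few thresholds. It uses synthetic outcomes and predicted probabilities, not the notebook's `predictTest`, and `rates_at` is a hypothetical helper.

```r
set.seed(42)
# Synthetic outcomes and predicted probabilities (illustrative only)
actual <- rbinom(500, 1, 0.15)
pred   <- plogis(-2 + 1.5 * actual + rnorm(500))

# Hypothetical helper: sensitivity and specificity at a given threshold
rates_at <- function(threshold) {
  predicted_pos <- pred > threshold
  c(threshold   = threshold,
    sensitivity = sum(predicted_pos & actual == 1) / sum(actual == 1),
    specificity = sum(!predicted_pos & actual == 0) / sum(actual == 0))
}

rates <- t(sapply(c(0.1, 0.2, 0.5), rates_at))
print(round(rates, 3))
```

Lowering the threshold flags more patients as high risk, which can only raise sensitivity and lower specificity. The same pattern could be checked on the real predictions by replacing `pred` and `actual` with `predictTest` and `test$TenYearCHD`, after removing missing predictions.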

#### Test set AUC 

In [12]:
# Load the library ROCR
library(ROCR)

In [13]:
ROCRpred = prediction(predictTest, test$TenYearCHD)
as.numeric(performance(ROCRpred, "auc")@y.values)

0.7421095

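So we have an AUC of about 0.74 on our test set. In other words, given one randomly chosen patient who developed CHD and one who did not, the model assigns the higher risk to the patient who developed CHD about 74% of the time, so it differentiates between low risk and high risk patients fairly well.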
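
The AUC can be read as a ranking probability: the chance that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case. The base-R sketch below checks that reading on synthetic data (not the Framingham predictions) by computing the AUC two ways, without ROCR.

```r
set.seed(7)
# Synthetic outcomes and predicted probabilities (illustrative only)
actual <- rbinom(300, 1, 0.2)
pred   <- plogis(-1.5 + actual + rnorm(300))

n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)

# Rank-based (Mann-Whitney) form of the AUC
auc_rank <- (sum(rank(pred)[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Direct form: compare every positive with every negative, counting ties as 1/2
pos <- pred[actual == 1]
neg <- pred[actual == 0]
auc_pairs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

print(c(auc_rank = auc_rank, auc_pairs = auc_pairs))
```

The two values agree, which is why an AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.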

As we saw in R, we were able to build a logistic regression model with a few interesting properties. It rarely predicted a 10-year CHD risk above 50%, so its accuracy at the 0.5 threshold was very close to that of the baseline model.

However, the model could differentiate between low risk patients and high risk patients fairly well, with an out-of-sample AUC of 0.74.

Additionally, some of the significant variables suggest possible interventions to prevent CHD. We saw that **more cigarettes per day**, **higher cholesterol**, **higher systolic blood pressure**, and **higher glucose levels** were all associated with higher predicted risk.