# Linear regression using the Framingham Heart Study and the StatsModels package

In this Notebook we are going to perform an epidemiological study assessing the possible clinical association between systolic blood pressure (sBP) and age.
Previous data shows sBP tends to increase with age. In other words, the older you get, the higher your mean sBP.

Our clinical question is:
### Is there a linear association between age and mean sBP, when considering potential clinical confounders?
Chosen confounders: gender, education, smoking, intake of drugs for BP, and total cholesterol

* **Null hypothesis (H0)**: There is NO association between sBP and age (considering the chosen confounders);
* **Alternative hypothesis (H1)**: There is an association between sBP and age (considering the chosen confounders).
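
One way to write these hypotheses formally, assuming the multivariable linear model fitted later in this notebook. The symbols $\beta$ and $\varepsilon$ are notation introduced here for illustration, and the dots stand for the remaining confounders:

$$
\text{sysBP}_i = \beta_0 + \beta_1\,\text{age}_i + \beta_2\,\text{male}_i + \dots + \varepsilon_i
$$

$$
H_0: \beta_1 = 0 \qquad\qquad H_1: \beta_1 \neq 0
$$

Under this notation, the question is whether the coefficient of age differs from zero once the other covariates are held fixed.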

We are going to use a publicly available version of the Framingham Heart Study from: https://www.kaggle.com/amanajmera1/framingham-heart-study-dataset/version/1

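On this website, you can also consult the coding for the different variables. For example, gender is stored in the variable male, which is coded 0/1 in this file (the summary statistics below show a minimum of 0 and a maximum of 1): male=1 means male and male=0 means female.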

### First let's import all important packages

In [1]:
import numpy as np 
import pandas as pd
import statsmodels.formula.api as smf   # looks new? 

# any other options?
# import statsmodels as sm 
# import statsmodels.api as sm 

# Exploring data

### Open the dataset

In [2]:
df = pd.read_csv('framingham.csv')

### Show data frame

In [3]:
df.head()

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0


The head function shows us all columns and the first 5 rows/observations of our data frame.

In [6]:
df.shape

(4240, 16)

The output means the data frame has 4240 observations (rows) and 16 variables (columns). Remember **.head()** only gave us a glimpse from the 4240 rows.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4240 entries, 0 to 4239
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             4240 non-null   int64  
 1   age              4240 non-null   int64  
 2   education        4135 non-null   float64
 3   currentSmoker    4240 non-null   int64  
 4   cigsPerDay       4211 non-null   float64
 5   BPMeds           4187 non-null   float64
 6   prevalentStroke  4240 non-null   int64  
 7   prevalentHyp     4240 non-null   int64  
 8   diabetes         4240 non-null   int64  
 9   totChol          4190 non-null   float64
 10  sysBP            4240 non-null   float64
 11  diaBP            4240 non-null   float64
 12  BMI              4221 non-null   float64
 13  heartRate        4239 non-null   float64
 14  glucose          3852 non-null   float64
 15  TenYearCHD       4240 non-null   int64  
dtypes: float64(9), int64(7)
memory usage: 530.1 KB


In [8]:
df.describe()

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
count,4240.0,4240.0,4135.0,4240.0,4211.0,4187.0,4240.0,4240.0,4240.0,4190.0,4240.0,4240.0,4221.0,4239.0,3852.0,4240.0
mean,0.429245,49.580189,1.979444,0.494104,9.005937,0.029615,0.005896,0.310613,0.025708,236.699523,132.354599,82.897759,25.800801,75.878981,81.963655,0.151887
std,0.495027,8.572942,1.019791,0.500024,11.922462,0.169544,0.076569,0.462799,0.15828,44.591284,22.0333,11.910394,4.07984,12.025348,23.954335,0.358953
min,0.0,32.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,107.0,83.5,48.0,15.54,44.0,40.0,0.0
25%,0.0,42.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,206.0,117.0,75.0,23.07,68.0,71.0,0.0
50%,0.0,49.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,234.0,128.0,82.0,25.4,75.0,78.0,0.0
75%,1.0,56.0,3.0,1.0,20.0,0.0,0.0,1.0,0.0,263.0,144.0,90.0,28.04,83.0,87.0,0.0
max,1.0,70.0,4.0,1.0,70.0,1.0,1.0,1.0,1.0,696.0,295.0,142.5,56.8,143.0,394.0,1.0


**Categorical (non-binary) variable:** education

**Discrete binary variables:** male, currentSmoker, BPMeds, prevalentStroke, prevalentHyp, diabetes and TenYearCHD.

Calling **df.describe()** with empty parentheses uses the default settings, which summarise every numeric column (here, all 16 variables).

Here we can see the number of non-missing observations per column (count), followed by the mean and standard deviation (std). Note that for binary variables the interpretation is somewhat different: the mean equals the proportion of observations coded 1 (on a 0 to 1 scale, instead of 0–100%).
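
The point about binary variables can be checked with a tiny example. This is a minimal sketch on a made-up column (the data frame `toy` and its values are hypothetical, not taken from the Framingham file):

```python
import pandas as pd

# Hypothetical toy column, coded like the binary variables above (1 = yes, 0 = no)
toy = pd.DataFrame({"currentSmoker": [1, 0, 0, 1, 1, 0, 0, 0]})

# Mean of the 0/1 column
mean_value = toy["currentSmoker"].mean()

# Proportion of rows coded 1, computed by hand
proportion = (toy["currentSmoker"] == 1).sum() / len(toy)

print(mean_value, proportion)
```

Both numbers are the same, which is why the mean of **currentSmoker** in the table above can be read as the share of current smokers.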

### Simple Linear Regression 

First we are going to test the association between our outcome/dependent variable, **sysBP**, and our main independent variable, **age**.
**Let's leverage the StatsModels package!**

But first, let's analyse our input, before even going to our potential output.

**model1 = smf.ols(formula='sysBP ~ age', data=df).fit()**

Reading our input

* **smf** is the alias under which we imported the StatsModels formula interface (statsmodels.formula.api)
* **ols** tells Python we are using an Ordinary Least Squares (OLS) regression (a type of linear regression)
* **formula=** is used to write the dependent and all the independent variable(s), as a single string in quotes
* **first variable inside the formula, before the "~" (dependent variable/outcome)**: The first variable is our only dependent variable. This is our outcome, the variable that determines which type of regression to use, and the one to be associated with all other covariates;
* **~ (tilde)**: Marks the border between the outcome (dependent variable) to the left, and the covariates (independent variables) to the right;
* **independent variables (covariates)**: All other variables, after the "~" inside the formula string;
* **+ sign inside the formula**: the + sign is used to separate different independent variables inside the same model (useful for multivariable models, i.e. models with many independent variables)
* **data=** gives the name of the data frame.
* **.fit()** tells Python to fit the model, that is, to estimate the coefficients from the data

In [4]:
model1 = smf.ols(formula='sysBP ~ age', data=df).fit()  # any other way?

# Alternative with the array interface (requires: import statsmodels.api as sm):
# df['intercept'] = 1
# sm.OLS(df['sysBP'], df[['intercept', 'age']]).fit()

In [6]:
print(model1.summary()) 

                            OLS Regression Results                            
Dep. Variable:                  sysBP   R-squared:                       0.155
Model:                            OLS   Adj. R-squared:                  0.155
Method:                 Least Squares   F-statistic:                     779.0
Date:                Wed, 19 Jan 2022   Prob (F-statistic):          1.58e-157
Time:                        22:19:52   Log-Likelihood:                -18770.
No. Observations:                4240   AIC:                         3.754e+04
Df Residuals:                    4238   BIC:                         3.756e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     82.1420      1.826     44.992      0.0

Same model in another style

In [11]:
model1.summary()

0,1,2,3
Dep. Variable:,sysBP,R-squared:,0.155
Model:,OLS,Adj. R-squared:,0.155
Method:,Least Squares,F-statistic:,779.0
Date:,"Tue, 18 Jan 2022",Prob (F-statistic):,1.58e-157
Time:,19:41:54,Log-Likelihood:,-18770.0
No. Observations:,4240,AIC:,37540.0
Df Residuals:,4238,BIC:,37560.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,82.1420,1.826,44.992,0.000,78.563,85.721
age,1.0128,0.036,27.911,0.000,0.942,1.084

0,1,2,3
Omnibus:,669.703,Durbin-Watson:,1.97
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1392.384
Skew:,0.944,Prob(JB):,4.44e-303
Kurtosis:,5.078,Cond. No.,295.0


* In the left column, we see that
   * The dependent variable is sysBP
   * We are running an ordinary least squares (OLS) regression model (a form of linear regression)
   * We included 4240 observations (patients).


Regarding the lower half

We are going to read the output column by column. Here we are assessing the association between our outcome (sysBP) and our independent variable, age (first column). Age has a positive coefficient of 1.0128 (second column): on average, sysBP is about 1.01 units higher for each additional year of age. The standard error is 0.036 (third column) and the t-statistic is 27.911 (fourth column), which gives a p-value <0.001 (fifth column). The 95% confidence interval for the beta coefficient runs from 0.942 (lower bound, sixth column) to 1.084 (upper bound, seventh column). To make things simpler…

### We can reject the null hypothesis of no association between age and sBP. There is a positive (beta coefficient>0.0) association between age and systolic blood pressure, which is statistically significant (p<0.05 and 95% CI does not include a coefficient=0.0).
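
The columns of the summary table can also be pulled out of a fitted model directly, which is handy when reporting results. The sketch below uses synthetic stand-in data (the data frame `toy` and the dictionary `age_row` are names made up for this example); the attributes `params`, `bse`, `tvalues`, `pvalues` and the method `conf_int()` belong to the StatsModels results object.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the Framingham data)
rng = np.random.default_rng(1)
toy = pd.DataFrame({"age": rng.integers(32, 71, size=300)})
toy["sysBP"] = 80 + 1.0 * toy["age"] + rng.normal(0, 20, size=300)

fit = smf.ols("sysBP ~ age", data=toy).fit()

# 95% confidence intervals: a DataFrame with columns 0 (lower) and 1 (upper)
ci = fit.conf_int(alpha=0.05)

# One entry per column of the summary table, for the 'age' row
age_row = {
    "coef": fit.params["age"],
    "std_err": fit.bse["age"],
    "t": fit.tvalues["age"],
    "p_value": fit.pvalues["age"],
    "ci_lower": ci.loc["age", 0],
    "ci_upper": ci.loc["age", 1],
}
print(age_row)
```

Note that the t-statistic is simply the coefficient divided by its standard error.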

### Multiple Linear Regression 

Now that we feel comfortable calling OLS models using StatsModels, let's try a **multiple linear regression**.
The call is basically the same, but now we will add our chosen confounders; in other words, we are adding more than one independent variable (to the right of the "~" in our formula).

In [13]:
model2 = smf.ols(formula='sysBP ~ age + male + C(education) + cigsPerDay + BPMeds + totChol', data=df).fit()

In [14]:
print(model2.summary())

                            OLS Regression Results                            
Dep. Variable:                  sysBP   R-squared:                       0.213
Model:                            OLS   Adj. R-squared:                  0.212
Method:                 Least Squares   F-statistic:                     135.4
Date:                Wed, 19 Jan 2022   Prob (F-statistic):          9.75e-202
Time:                        22:39:49   Log-Likelihood:                -17595.
No. Observations:                4007   AIC:                         3.521e+04
Df Residuals:                    3998   BIC:                         3.526e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept              79.0066    
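
Notice that the number of observations fell from 4240 in the simple model to 4007 here. This is consistent with the missing values reported by **df.info()**: by default, the formula interface leaves out any row with a missing value in one of the model variables. The sketch below illustrates that behaviour on synthetic data (the data frame `toy` and the ten blanked-out values are assumptions made for the example).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the Framingham data)
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "age": rng.integers(32, 71, size=100).astype(float),
    "totChol": rng.normal(235, 45, size=100),
})
toy["sysBP"] = 80 + toy["age"] + rng.normal(0, 20, size=100)

# Blank out the first 10 values of one covariate to mimic missing data
toy.loc[:9, "totChol"] = np.nan

fit = smf.ols("sysBP ~ age + totChol", data=toy).fit()

# Rows with complete data in every model variable
complete_rows = len(toy.dropna(subset=["sysBP", "age", "totChol"]))

print(len(toy), int(fit.nobs), complete_rows)
```

The number of observations used by the model matches the number of complete rows, not the number of rows in the data frame.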

### Let's change "education" to a categorical variable

A very nice output. However, education has 4 categories, so it should enter the model as a categorical variable and not as a number.

In [14]:
df.education.value_counts()

1.0    1720
2.0    1253
3.0     689
4.0     473
Name: education, dtype: int64

Let's make a small adaptation to our model in order to get an output for each category of education: wrapping the variable in **C()** tells the formula to treat it as categorical.

In [15]:
model3 = smf.ols(formula='sysBP ~ age + male + C(education) + cigsPerDay + BPMeds + totChol', data=df).fit()

# what about dummy variables?

# can we change the reference? 
# https://stackoverflow.com/questions/38023881/pandas-change-the-order-of-levels-of-factor-type-object 
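
On the question of dummy variables: **C()** builds them behind the scenes, one 0/1 column per category except the reference. A minimal sketch of the manual route, using `pd.get_dummies` on synthetic data, is shown below (the data frame `toy` and the column names `edu2`, `edu3`, `edu4` are made up for this example).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the Framingham data)
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    "age": rng.integers(32, 71, size=400),
    "education": rng.choice([1.0, 2.0, 3.0, 4.0], size=400),
})
toy["sysBP"] = 80 + toy["age"] - 2 * toy["education"] + rng.normal(0, 20, size=400)

# Route 1: let the formula create the dummy variables
fit_c = smf.ols("sysBP ~ age + C(education)", data=toy).fit()

# Route 2: create them by hand; drop_first=True leaves out level 1.0,
# which therefore becomes the reference category
dummies = pd.get_dummies(toy["education"], drop_first=True).astype(float)
dummies.columns = ["edu2", "edu3", "edu4"]  # hypothetical, formula-friendly names
toy_d = pd.concat([toy, dummies], axis=1)
fit_d = smf.ols("sysBP ~ age + edu2 + edu3 + edu4", data=toy_d).fit()

print(fit_c.params)
print(fit_d.params)
```

Both routes should give the same coefficients; the formula route is shorter and keeps the category labels in the output.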

In [16]:
print(model3.summary())

                            OLS Regression Results                            
Dep. Variable:                  sysBP   R-squared:                       0.213
Model:                            OLS   Adj. R-squared:                  0.212
Method:                 Least Squares   F-statistic:                     135.4
Date:                Tue, 18 Jan 2022   Prob (F-statistic):          9.75e-202
Time:                        19:41:54   Log-Likelihood:                -17595.
No. Observations:                4007   AIC:                         3.521e+04
Df Residuals:                    3998   BIC:                         3.526e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept              79.0066    

Same model in another style

In [17]:
model3.summary()

0,1,2,3
Dep. Variable:,sysBP,R-squared:,0.213
Model:,OLS,Adj. R-squared:,0.212
Method:,Least Squares,F-statistic:,135.4
Date:,"Tue, 18 Jan 2022",Prob (F-statistic):,9.75e-202
Time:,19:41:54,Log-Likelihood:,-17595.0
No. Observations:,4007,AIC:,35210.0
Df Residuals:,3998,BIC:,35260.0
Df Model:,8,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,79.0066,2.416,32.700,0.000,74.270,83.744
C(education)[T.2.0],-0.8474,0.760,-1.115,0.265,-2.338,0.643
C(education)[T.3.0],-3.5724,0.907,-3.937,0.000,-5.351,-1.794
C(education)[T.4.0],-4.5333,1.043,-4.348,0.000,-6.577,-2.489
age,0.8365,0.040,21.108,0.000,0.759,0.914
male,-0.0871,0.670,-0.130,0.897,-1.401,1.226
cigsPerDay,-0.0238,0.028,-0.849,0.396,-0.079,0.031
BPMeds,26.6926,1.848,14.443,0.000,23.069,30.316
totChol,0.0537,0.007,7.333,0.000,0.039,0.068

0,1,2,3
Omnibus:,652.299,Durbin-Watson:,1.983
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1444.451
Skew:,0.945,Prob(JB):,0.0
Kurtosis:,5.253,Cond. No.,1960.0


**Change the reference category for education**

In [18]:
df['education'].value_counts()

1.0    1720
2.0    1253
3.0     689
4.0     473
Name: education, dtype: int64

In [19]:
df['education']

0       4.0
1       2.0
2       1.0
3       3.0
4       3.0
       ... 
4235    2.0
4236    1.0
4237    2.0
4238    3.0
4239    3.0
Name: education, Length: 4240, dtype: float64

In [20]:
df['education'] = df['education'].astype('category')

df['education']


0       4.0
1       2.0
2       1.0
3       3.0
4       3.0
       ... 
4235    2.0
4236    1.0
4237    2.0
4238    3.0
4239    3.0
Name: education, Length: 4240, dtype: category
Categories (4, float64): [1.0, 2.0, 3.0, 4.0]

In [21]:
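# Assign the result back (the inplace= argument was removed in pandas 2.0);
# the first category listed becomes the reference
df['education'] = df['education'].cat.reorder_categories([4.0, 1.0, 2.0, 3.0])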
# https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html 
df['education']



0       4.0
1       2.0
2       1.0
3       3.0
4       3.0
       ... 
4235    2.0
4236    1.0
4237    2.0
4238    3.0
4239    3.0
Name: education, Length: 4240, dtype: category
Categories (4, float64): [4.0, 1.0, 2.0, 3.0]
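
Reordering the categories is one way to change the reference. Another option, sketched here on synthetic data, is to name the reference level inside the formula with `Treatment(reference=...)`, a feature of the patsy formula language that the StatsModels formula interface has traditionally used. The data frame `toy` and the variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the Framingham data)
rng = np.random.default_rng(4)
toy = pd.DataFrame({
    "age": rng.integers(32, 71, size=400),
    "education": rng.choice([1.0, 2.0, 3.0, 4.0], size=400),
})
toy["sysBP"] = 80 + toy["age"] - 2 * toy["education"] + rng.normal(0, 20, size=400)

# Option 1: reorder the categories so that 4.0 comes first
reordered = toy.copy()
reordered["education"] = (
    reordered["education"].astype("category").cat.reorder_categories([4.0, 1.0, 2.0, 3.0])
)
fit_reordered = smf.ols("sysBP ~ age + education", data=reordered).fit()

# Option 2: leave the data untouched and name the reference in the formula
fit_treatment = smf.ols(
    "sysBP ~ age + C(education, Treatment(reference=4.0))", data=toy
).fit()

print(fit_reordered.params)
print(fit_treatment.params)
```

The second option leaves the data frame unchanged, which can be convenient when different models need different reference categories.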

**Re-fit the model**

In [22]:
model_ref = smf.ols(formula='sysBP ~ age + male + education + cigsPerDay + BPMeds + totChol', data=df).fit()

# C() is optional here: education now has the category dtype, so the formula already treats it as categorical

In [23]:
print(model_ref.summary())

                            OLS Regression Results                            
Dep. Variable:                  sysBP   R-squared:                       0.213
Model:                            OLS   Adj. R-squared:                  0.212
Method:                 Least Squares   F-statistic:                     135.4
Date:                Tue, 18 Jan 2022   Prob (F-statistic):          9.75e-202
Time:                        19:41:54   Log-Likelihood:                -17595.
No. Observations:                4007   AIC:                         3.521e+04
Df Residuals:                    3998   BIC:                         3.526e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept           74.4733      2.444  

Perfect, now we can interpret the output and answer our initial clinical question.

## What about interaction terms?

In [24]:
smf.ols(formula='sysBP ~ age + education + cigsPerDay + totChol + male * BPMeds', data=df).fit().summary()

0,1,2,3
Dep. Variable:,sysBP,R-squared:,0.213
Model:,OLS,Adj. R-squared:,0.212
Method:,Least Squares,F-statistic:,120.5
Date:,"Tue, 18 Jan 2022",Prob (F-statistic):,6.58e-201
Time:,19:41:54,Log-Likelihood:,-17594.0
No. Observations:,4007,AIC:,35210.0
Df Residuals:,3997,BIC:,35270.0
Df Model:,9,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,74.3962,2.445,30.426,0.000,69.602,79.190
education[T.1.0],4.5365,1.043,4.351,0.000,2.492,6.581
education[T.2.0],3.7216,1.078,3.453,0.001,1.609,5.835
education[T.3.0],0.9603,1.195,0.803,0.422,-1.383,3.304
age,0.8379,0.040,21.132,0.000,0.760,0.916
cigsPerDay,-0.0231,0.028,-0.822,0.411,-0.078,0.032
totChol,0.0539,0.007,7.355,0.000,0.039,0.068
male,-0.1975,0.678,-0.291,0.771,-1.526,1.131
BPMeds,25.4124,2.197,11.569,0.000,21.106,29.719

0,1,2,3
Omnibus:,653.622,Durbin-Watson:,1.983
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1450.654
Skew:,0.946,Prob(JB):,0.0
Kurtosis:,5.26,Cond. No.,3380.0
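
In a formula, **male * BPMeds** is shorthand for the two main effects plus their product term, written **male:BPMeds**. The output above reports Df Model: 9, which is consistent with one extra term compared with the previous model. The sketch below checks the shorthand on synthetic data (the data frame `toy` and its simulated values are assumptions made for the example).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the Framingham data)
rng = np.random.default_rng(5)
toy = pd.DataFrame({
    "male": rng.integers(0, 2, size=500),
    "BPMeds": rng.integers(0, 2, size=500),
})
toy["sysBP"] = 130 + 25 * toy["BPMeds"] + rng.normal(0, 20, size=500)

# Shorthand: main effects and interaction in one expression
fit_star = smf.ols("sysBP ~ male * BPMeds", data=toy).fit()

# Long form: the same model written out term by term
fit_long = smf.ols("sysBP ~ male + BPMeds + male:BPMeds", data=toy).fit()

print(list(fit_star.params.index))
```

The coefficient of the product term estimates how much the association between BPMeds and sysBP differs between males and females.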


## Is there an association between systolic blood pressure and age?

### Yes, there is a positive association between age and systolic blood pressure (beta coefficient [95% CI]: 0.84 [0.76; 0.91]), which is statistically significant (p<0.05 and confidence interval does not include 0), even when considering the effect of possible relevant confounders (covariates in the model).

### References

* https://www.w3schools.com/python/python_ml_multiple_regression.asp
* More information on running a regression: https://www.statsmodels.org/stable/glm.html