# Practice notebook for regression analysis with NHANES

This notebook will give you the opportunity to perform some
regression analyses with the NHANES data that are similar to
the analyses done in the week 2 case study notebook.

You can enter your code into the cells that say "enter your code here",
and you can type responses to the questions into the cells that say "Type Markdown and Latex".

Note that most of the code that you will need to write below is very similar
to code that appears in the case study notebook.  You will need
to edit code from that notebook in small ways to adapt it to the
prompts below.

To get started, we will use the same module imports and
read the data in the same way as we did in the case study:

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import numpy as np

# Raw string so the backslashes in the Windows-style path are not treated as escapes.
url = r"..\..\nhanes_2015_2016.csv"
da = pd.read_csv(url)

# Drop unused columns, drop rows with any missing values.
vars = ["BPXSY1", "RIDAGEYR", "RIAGENDR", "RIDRETH1", "DMDEDUC2", "BMXBMI", "SMQ020"]
da = da[vars].dropna()

## Question 1:

Use linear regression to relate the expected body mass index (BMI) to a person's age.

In [18]:
# enter your code here

### OLS Model of BMXBMI with RIDAGEYR
model = sm.OLS.from_formula("BMXBMI ~ RIDAGEYR", data=da)
print(type(model))
result = model.fit()
print(result.summary())
cc = da[['BMXBMI','RIDAGEYR']].corr()
print(cc)
print(cc['BMXBMI']['RIDAGEYR']**2)
da.head()

<class 'statsmodels.regression.linear_model.OLS'>
                            OLS Regression Results                            
Dep. Variable:                 BMXBMI   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     2.720
Date:                Sun, 21 Feb 2021   Prob (F-statistic):             0.0991
Time:                        23:27:02   Log-Likelihood:                -17149.
No. Observations:                5102   AIC:                         3.430e+04
Df Residuals:                    5100   BIC:                         3.432e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
In

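| | BPXSY1 | RIDAGEYR | RIAGENDR | RIDRETH1 | DMDEDUC2 | BMXBMI | SMQ020 |
|---|---|---|---|---|---|---|---|
| 0 | 128.0 | 62 | 1 | 3 | 5.0 | 27.8 | 1 |
| 1 | 146.0 | 53 | 1 | 3 | 3.0 | 30.8 | 1 |
| 2 | 138.0 | 78 | 1 | 3 | 3.0 | 28.8 | 1 |
| 3 | 132.0 | 56 | 2 | 3 | 5.0 | 42.4 | 2 |
| 4 | 100.0 | 42 | 2 | 4 | 4.0 | 20.3 | 2 |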


__Q1a.__ According to your fitted model, do older people tend to have higher or lower BMI than younger people?

The age coefficient is very small, so the model indicates essentially no change in BMI with age. The R-squared of 0.001 corroborates this: the correlation between age and BMI is close to zero.

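__Q1b.__ Based on your analysis, are you confident that there is a relationship between BMI and age in the population that NHANES represents?

I am confident in the analysis, but it does not show a relationship. The p-value for the F-test is 0.0991, which is above the conventional 0.05 threshold, so there is no strong evidence that BMI and age are related in this population.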

__Q1c.__ By how much does the average BMI of a 40 year old differ from the average BMI of a 20 year old?

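Essentially zero. The expected difference is 20 times the age coefficient, and that coefficient is very close to zero.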

__Q1d.__ What fraction of the variation of BMI in this population is explained by age?

Almost none. The R-squared is 0.001, so age explains about 0.1% of the variation in BMI.

## Question 2: 

Add gender and ethnicity as additional control variables to your linear model relating BMI to age.  You will need to recode the ethnic groups based
on the values in the codebook entry for [RIDRETH1](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#RIDRETH1).

In [26]:
# enter your code here
# Create a labeled version of the gender variable
da["RIAGENDRx"] = da['RIAGENDR'].replace({1: "Male", 2: "Female"})
# Create a labeled version of the ethnicity variable
da["RIDRETH1x"] = da['RIDRETH1'].replace({1: "Mexican American", 2: "Other Hispanic", 3: "Non-Hispanic White", 4: "Non-Hispanic Black", 5: "Other"})
da.head()
model = sm.OLS.from_formula("BMXBMI ~ RIDAGEYR + RIAGENDRx + RIDRETH1x", data=da)
print(type(model))
result = model.fit()
result.summary()

<class 'statsmodels.regression.linear_model.OLS'>


| | | | |
|---|---|---|---|
| Dep. Variable: | BMXBMI | R-squared: | 0.055 |
| Model: | OLS | Adj. R-squared: | 0.054 |
| Method: | Least Squares | F-statistic: | 49.27 |
| Date: | Sun, 21 Feb 2021 | Prob (F-statistic): | 3.98e-59 |
| Time: | 23:57:55 | Log-Likelihood: | -17007.0 |
| No. Observations: | 5102 | AIC: | 34030.0 |
| Df Residuals: | 5095 | BIC: | 34070.0 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 31.2361 | 0.355 | 87.891 | 0.000 | 30.539 | 31.933 |
| RIAGENDRx[T.Male] | -1.0226 | 0.190 | -5.370 | 0.000 | -1.396 | -0.649 |
| RIDRETH1x[T.Non-Hispanic Black] | -0.4499 | 0.308 | -1.460 | 0.144 | -1.054 | 0.154 |
| RIDRETH1x[T.Non-Hispanic White] | -1.8555 | 0.282 | -6.588 | 0.000 | -2.408 | -1.303 |
| RIDRETH1x[T.Other] | -4.7799 | 0.334 | -14.318 | 0.000 | -5.434 | -4.125 |
| RIDRETH1x[T.Other Hispanic] | -0.9379 | 0.345 | -2.721 | 0.007 | -1.614 | -0.262 |
| RIDAGEYR | 0.0065 | 0.005 | 1.196 | 0.232 | -0.004 | 0.017 |

| | | | |
|---|---|---|---|
| Omnibus: | 917.09 | Durbin-Watson: | 2.006 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1855.286 |
| Skew: | 1.075 | Prob(JB): | 0.0 |
| Kurtosis: | 5.026 | Cond. No. | 308.0 |


__Q2a.__ How did the mean relationship between BMI and age change when you added additional covariates to the model?

The age coefficient increased slightly, but it is still almost negligible (0.0065 BMI units per year of age in this model).

__Q2b.__ How did the standard error for the regression parameter for age change when you added additional covariates to the model?

Assuming this refers to the standard error of the intercept, it increased by 0.065, from 0.290 to 0.355. For the age coefficient itself, the standard error in this model is 0.005.

__Q2c.__ How much additional variation in BMI is explained by age, gender, and ethnicity that is not explained by age alone?

The R-squared rose from 0.001 with age alone to 0.055 with age, gender, and ethnicity, so the additional covariates explain about 5.4 percentage points more of the variation in BMI.
Judging by the coefficients of the covariates, ethnicity and gender account for most of this when all other variables are held constant.
Males, compared to females with all else equal, have an expected BMI about 1 point lower.
Females and males in the "Other" ethnic group have an expected BMI almost 5 points lower than if they were "Mexican American".
Females and males in the "Non-Hispanic White" ethnic group have an expected BMI almost 2 points lower than if they were "Mexican American".

__Q2d.__ What reference level did the software select for the ethnicity variable?

Mexican American

__Q2e.__ What is the expected difference between the BMI of a 40 year-old non-Hispanic black man and a 30 year-old non-Hispanic black man?

Almost none. The two men differ only in age, so the expected difference is 10 times the age coefficient: 10 × 0.0065 = 0.065 BMI units.

__Q2f.__ What is the expected difference between the BMI of a 50 year-old Mexican American woman and a 50 year-old non-Hispanic black man?

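About 1.47 BMI units. Both people are 50, so age drops out and only gender and ethnicity differ. The Mexican American woman is at the reference level of both variables, while the non-Hispanic black man has the Male coefficient (-1.0226) and the Non-Hispanic Black coefficient (-0.4499) added. The woman's expected BMI is therefore 1.0226 + 0.4499 = 1.4725 units higher.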
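
As a rough check on Q2e and Q2f, the expected differences can be computed by hand from the coefficients in the Question 2 summary table. The sketch below copies those coefficients into a dictionary; `coef` and `expected_bmi` are illustrative names chosen here, not part of statsmodels.

```python
# Coefficients copied from the Question 2 summary table.
coef = {
    "Intercept": 31.2361,
    "RIAGENDRx[T.Male]": -1.0226,
    "RIDRETH1x[T.Non-Hispanic Black]": -0.4499,
    "RIDRETH1x[T.Non-Hispanic White]": -1.8555,
    "RIDRETH1x[T.Other]": -4.7799,
    "RIDRETH1x[T.Other Hispanic]": -0.9379,
    "RIDAGEYR": 0.0065,
}


def expected_bmi(age, gender, ethnicity):
    """Fitted mean BMI. Reference levels (Female, Mexican American)
    have no coefficient of their own, so they contribute 0."""
    value = coef["Intercept"] + coef["RIDAGEYR"] * age
    value += coef.get(f"RIAGENDRx[T.{gender}]", 0.0)
    value += coef.get(f"RIDRETH1x[T.{ethnicity}]", 0.0)
    return value


# Q2e: same gender and ethnicity, ages 40 vs 30.
q2e = (expected_bmi(40, "Male", "Non-Hispanic Black")
       - expected_bmi(30, "Male", "Non-Hispanic Black"))

# Q2f: same age, different gender and ethnicity.
q2f = (expected_bmi(50, "Female", "Mexican American")
       - expected_bmi(50, "Male", "Non-Hispanic Black"))

print(round(q2e, 4), round(q2f, 4))
```

In the notebook itself, the same numbers could be read from `result.params` instead of a hand-copied dictionary.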

## Question 3: 

Randomly sample 25% of the NHANES data, then fit the same model you used in question 2 to this data set.

In [25]:
# enter your code here
ds = da.sample(frac=0.25, replace=False, random_state=1)
ds.describe()
model = sm.OLS.from_formula("BMXBMI ~ RIDAGEYR + RIAGENDRx + RIDRETH1x", data=ds)
print(type(model))
result = model.fit()
result.summary()

<class 'statsmodels.regression.linear_model.OLS'>


| | | | |
|---|---|---|---|
| Dep. Variable: | BMXBMI | R-squared: | 0.06 |
| Model: | OLS | Adj. R-squared: | 0.055 |
| Method: | Least Squares | F-statistic: | 13.48 |
| Date: | Sun, 21 Feb 2021 | Prob (F-statistic): | 7.11e-15 |
| Time: | 23:55:41 | Log-Likelihood: | -4233.4 |
| No. Observations: | 1276 | AIC: | 8481.0 |
| Df Residuals: | 1269 | BIC: | 8517.0 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 31.6344 | 0.689 | 45.933 | 0.000 | 30.283 | 32.986 |
| RIAGENDRx[T.Male] | -1.0360 | 0.376 | -2.753 | 0.006 | -1.774 | -0.298 |
| RIDRETH1x[T.Non-Hispanic Black] | -0.6427 | 0.595 | -1.080 | 0.280 | -1.810 | 0.525 |
| RIDRETH1x[T.Non-Hispanic White] | -2.1053 | 0.547 | -3.852 | 0.000 | -3.178 | -1.033 |
| RIDRETH1x[T.Other] | -4.8910 | 0.653 | -7.491 | 0.000 | -6.172 | -3.610 |
| RIDRETH1x[T.Other Hispanic] | -0.6012 | 0.672 | -0.895 | 0.371 | -1.920 | 0.717 |
| RIDAGEYR | -0.0028 | 0.011 | -0.260 | 0.795 | -0.024 | 0.018 |

| | | | |
|---|---|---|---|
| Omnibus: | 236.335 | Durbin-Watson: | 2.014 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 485.09 |
| Skew: | 1.068 | Prob(JB): | 4.61e-106 |
| Kurtosis: | 5.135 | Cond. No. | 298.0 |


__Q3a.__ How do the estimated regression coefficients and their standard errors compare between these two models?  Do you see any systematic relationship between the two sets of results?

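The coefficients for gender and ethnicity are similar in the two models. Most grew slightly in magnitude, while the one for Other Hispanic shrank (from -0.9379 to -0.6012). The age coefficient changed sign (from 0.0065 to -0.0028) but is still almost negligible.
The standard errors roughly doubled because of the smaller sample. This is the systematic pattern to expect: standard errors scale with 1/sqrt(n), so cutting the sample to one quarter multiplies them by about 2.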
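
The effect of sample size on the standard errors can be checked with a little arithmetic. The sketch below copies the standard errors from the two summary tables and compares their ratios with sqrt(5102 / 1276), the factor expected if standard errors scale with 1/sqrt(n). The dictionary names are illustrative only.

```python
import math

# Standard errors copied from the Question 2 (full data) summary table.
se_full = {
    "Intercept": 0.355,
    "RIAGENDRx[T.Male]": 0.190,
    "RIDRETH1x[T.Non-Hispanic Black]": 0.308,
    "RIDRETH1x[T.Non-Hispanic White]": 0.282,
    "RIDRETH1x[T.Other]": 0.334,
    "RIDRETH1x[T.Other Hispanic]": 0.345,
    "RIDAGEYR": 0.005,
}

# Standard errors copied from the Question 3 (25% sample) summary table.
se_quarter = {
    "Intercept": 0.689,
    "RIAGENDRx[T.Male]": 0.376,
    "RIDRETH1x[T.Non-Hispanic Black]": 0.595,
    "RIDRETH1x[T.Non-Hispanic White]": 0.547,
    "RIDRETH1x[T.Other]": 0.653,
    "RIDRETH1x[T.Other Hispanic]": 0.672,
    "RIDAGEYR": 0.011,
}

# Ratio of subsample SE to full-sample SE for each coefficient.
ratios = {name: se_quarter[name] / se_full[name] for name in se_full}

# Factor expected from the two sample sizes alone.
expected_ratio = math.sqrt(5102 / 1276)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
print(f"expected from sample sizes: {expected_ratio:.2f}")
```

The age ratio is the least precise of these because the table rounds its standard errors to three decimals.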

## Question 4:

Generate a scatterplot of the residuals against the fitted values for the model you fit in question 2.

In [None]:
# enter your code here

__Q4a.__ What mean/variance relationship do you see?

## Question 5: 

Generate a plot showing the fitted mean BMI as a function of age for Mexican American men.  Include a 95% simultaneous confidence band on your graph.

In [None]:
# enter your code here

__Q5a.__ According to your graph, what is the longest interval starting at year 30 following which the mean BMI could be constant?  *Hint:* What is the longest horizontal line starting at age 30 that remains within the confidence band?

__Q5b.__ Add an additional line and confidence band to the same plot, showing the relationship between age and BMI for Mexican American women.  At what ages do these intervals not overlap?

## Question 6:

Use an added variable plot to assess the linearity of the relationship between BMI and age (when controlling for gender and ethnicity).

In [None]:
# enter your code here

__Q6a.__ What is your interpretation of the added variable plot?

## Question 7: 

Generate a binary variable reflecting whether a person has had at least 12 drinks in their lifetime, based on the [ALQ110](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/ALQ_I.htm#ALQ110) variable in NHANES.  Calculate the marginal probability, odds, and log odds of this variable for women and for men.  Then calculate the odds ratio for females relative to males.

In [None]:
# enter your code here
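
The probability, odds, and log odds can be computed with a group-by, as sketched below on a tiny made-up data set (`toy`, not NHANES values). The coding of ALQ110 assumed here (1 = yes, 2 = no, anything else treated as missing) should be checked against the codebook linked above. Note also that the `vars` list in the first cell does not include ALQ110, so it would need to be added there before the same steps are run on `da`.

```python
import numpy as np
import pandas as pd

# Tiny made-up data set: 6 of 10 women and 8 of 10 men answer "yes".
toy = pd.DataFrame({
    "RIAGENDRx": ["Female"] * 10 + ["Male"] * 10,
    "ALQ110": [1] * 6 + [2] * 4 + [1] * 8 + [2] * 2,
})

# Binary indicator: 1 = yes, 0 = no; any other code becomes missing.
toy["drinks"] = toy["ALQ110"].map({1: 1, 2: 0})

# Marginal probability, odds, and log odds within each gender.
tab = toy.groupby("RIAGENDRx")["drinks"].mean().to_frame("prob")
tab["odds"] = tab["prob"] / (1 - tab["prob"])
tab["log_odds"] = np.log(tab["odds"])

# Odds ratio for females relative to males.
odds_ratio_f_vs_m = tab.loc["Female", "odds"] / tab.loc["Male", "odds"]

print(tab)
print(odds_ratio_f_vs_m)
```

A positive log odds corresponds to a probability above 50%, which is the fact Q7a relies on.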

__Q7a.__ Based on the log odds alone, do more than 50% of women drink alcohol?

__Q7b.__ Does there appear to be an important difference between the alcohol use rate of women and men?

## Question 8: 

Use logistic regression to express the log odds that a person drinks (based on the binary drinking variable that you constructed above) in terms of gender.

In [None]:
# enter your code here

__Q8a.__ Is there statistical evidence that the drinking rate differs between women and men?  If so, in what direction is there a difference?

__Q8b.__ Confirm that the log odds ratio between drinking and gender calculated using the logistic regression model matches the log odds ratio calculated directly in question 7.

## Question 9: 

Use logistic regression to relate drinking to age, gender, and education.

In [None]:
# enter your code here

__Q9a.__ Which of these predictor variables shows a statistically significant association with drinking?

__Q9b.__ What are the odds of a college educated, 50 year old woman drinking?

__Q9c.__ What is the odds ratio between the drinking status for college graduates and high school graduates (with no college), holding gender and age fixed?

__Q9d.__ Did the regression parameter for gender change to a meaningful degree when age and education were added to the model?

## Question 10:

Construct a CERES plot for the relationship between drinking and age (using the model that controls for gender and educational attainment).

In [None]:
# enter your code here

__Q10a.__ Does the plot indicate any major non-linearity in the relationship between age and the log odds for drinking?