[The National Longitudinal Study of Adolescent to Adult Health (Add Health)](https://addhealth.cpc.unc.edu/) is a panel study of about 20,000 adolescents first interviewed in grades 7-12 in 1994-1995. The cohort has been reinterviewed 4 times for a total of 5 waves of data collection. 

The third wave of data collection occurred in 2001-2002, when the cohort was aged 18-26 and a little over 15,000 members were successfully reinterviewed. The [Study Design](https://addhealth.cpc.unc.edu/documentation/study-design/) page shows this and other details.

For privacy reasons, the [public-use files](https://addhealth.cpc.unc.edu/data/#public-use) contain far fewer respondents than the full study. The dataset we examine here, the public-use extract of the wave 3 in-home questionnaire responses, contains a little under 5,000 observations.

In [5]:
library(haven)
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.4.4     ✔ purrr   1.0.2
✔ tibble  3.2.1     ✔ dplyr   1.1.3
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



This extract contains the entire public-file dataset for the wave 3 in-home questionnaire, courtesy of [ICPSR](https://www.icpsr.umich.edu/web/ICPSR/studies/21600/). Here are the PDFs for the [questionnaire](doc/21600-0012-Questionnaire.pdf) and the [codebook](doc/21600-0012-Codebook.pdf).

In [3]:
addhealth_w3 <- read_dta("data/21600-0012-Data.dta")

In [4]:
head(addhealth_w3)

CASEID,AID,IMONTH3,IDAY3,IYEAR3,MACNO3,INTID3,BIO_SEX3,VERSION3,FRIEND,⋯,H3IR12,H3IR13,H3IR14,H3IR15,H3IR16,H3IR17,H3IR18,H3IR19,H3IR20,H3IR21
<dbl>,<chr>,<dbl+lbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl+lbl>,<dbl+lbl>,<dbl+lbl>,⋯,<dbl+lbl>,<dbl+lbl>,<dbl+lbl>,<dbl+lbl>,<dbl+lbl>,<dbl+lbl>,<dbl+lbl>,<dbl+lbl>,<dbl+lbl>,<dbl+lbl>
1,57100270,12,13,2001,707,611707,2,13,0,⋯,0,0,0,0,0,3,1,2,0,0
2,57101310,11,19,2001,537,505537,2,11,0,⋯,0,0,0,0,0,1,1,2,2,1
3,57103869,1,23,2002,691,610691,1,15,0,⋯,0,1,0,0,0,1,1,3,2,0
4,57104676,3,11,2002,577,520577,1,16,1,⋯,0,1,0,0,0,1,1,1,0,0
5,57109625,2,26,2002,810,552810,1,16,1,⋯,0,1,0,0,0,5,1,2,0,0
6,57111071,11,9,2001,164,609164,1,11,1,⋯,0,1,0,0,0,5,2,2,0,0


Here are some useful variables, recoded for easy analysis.

In [6]:
# female = 1 when BIO_SEX3 == 2
addhealth_w3 <- mutate(addhealth_w3, 
                       female = BIO_SEX3 - 1
                      )

The variable `H3GH8` records a self-reported doctor's diagnosis of an eating disorder ("such as anorexia nervosa or bulimia"). Values of 6 or more indicate that the question was refused, answered "don't know," or not applicable.

Recode it to a binary indicator, setting those values to missing:

In [7]:
addhealth_w3 <- mutate(addhealth_w3,
                       eatdis = 
                       ifelse(H3GH8 < 6, H3GH8, NA)
                      )

The mean of this indicator is the prevalence in the sample, 2.36%:

In [8]:
summary(addhealth_w3$eatdis)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  0.0000  0.0236  0.0000  1.0000       9 
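
The recode and the prevalence calculation can be illustrated on a toy vector. This is a minimal sketch in base R: `toy_codes` is made-up data, and the codes 6, 8, and 9 are only stand-ins for "values of 6 or more," not values taken from the codebook.

```r
# Toy response codes (illustrative, NOT real Add Health data):
# 0 = no, 1 = yes, codes of 6 or more stand for non-substantive answers
toy_codes <- c(0, 1, 0, 6, 8, 0, 1, 9)

# Keep 0/1 as-is; turn every code of 6 or more into NA
toy_binary <- ifelse(toy_codes < 6, toy_codes, NA)
print(toy_binary)

# The mean of a 0/1 indicator, ignoring NA, is the share of "yes" answers:
# here 2 "yes" out of 5 valid answers
toy_prevalence <- mean(toy_binary, na.rm = TRUE)
print(toy_prevalence)  # 0.4
```

`summary()` drops the missing values in the same way when it reports the mean, which is why the mean above can be read as a prevalence among valid responses.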

In [9]:
addhealth_w3 <- mutate(addhealth_w3, 
                       age = CALCAGE3
                       )

In [12]:
addhealth_w3 <- mutate(addhealth_w3, 
                       blacknh = 
                       (H3OD4B == 1)*(H3OD2 == 0)
                       )

In [11]:
summary(addhealth_w3$blacknh)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.2401  0.0000  1.0000 

In [21]:
addhealth_w3 <- mutate(addhealth_w3, 
                       hispanic = 
                       ifelse(H3OD2 == 1,1,0)
                       )

In [18]:
addhealth_w3 <- mutate(addhealth_w3, 
                       othernh = 
                       (H3OD4A == 0)*(H3OD4B == 0)*(H3OD2 == 0)
                       )

In [14]:
summary(addhealth_w3$othernh)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.0424  0.0000  1.0000 
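
The race and ethnicity indicators above multiply logical comparisons, which R treats as 0/1 values, so a product equals 1 only when every condition holds. Here is a small sketch on a made-up data frame; the column names `white`, `black`, and `hisp` are hypothetical stand-ins for `H3OD4A`, `H3OD4B`, and `H3OD2`.

```r
# Toy respondents (illustrative, NOT real Add Health data)
toy <- data.frame(
  white = c(1, 0, 0, 1, 0),
  black = c(0, 1, 0, 0, 1),
  hisp  = c(0, 0, 0, 1, 1)
)

# TRUE * TRUE is 1; any FALSE in the product gives 0
toy$blacknh <- (toy$black == 1) * (toy$hisp == 0)
toy$othernh <- (toy$white == 0) * (toy$black == 0) * (toy$hisp == 0)

# Hispanic, non-Hispanic Black, and non-Hispanic other never overlap,
# so their sum is at most 1; rows summing to 0 form the omitted group
toy$n_groups <- toy$hisp + toy$blacknh + toy$othernh
print(toy)
```

In this toy data only the first row has `n_groups` equal to 0: a non-Hispanic respondent who reports white and not Black, which is the comparison group such dummies are measured against in a regression.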

In [23]:
# Let's code edyrs so that first year of grad
# school (H3ED1 == 18) becomes 17 years.
# Codes outside 6-22 match no condition, so
# case_when() leaves them as NA.
addhealth_w3 <- addhealth_w3 %>%
mutate(edyrs = case_when(
    H3ED1 == 6 ~ 6,
    H3ED1 == 7 ~ 7,
    H3ED1 == 8 ~ 8,
    H3ED1 == 9 ~ 9,
    H3ED1 == 10 ~ 10,
    H3ED1 == 11 ~ 11,
    H3ED1 == 12 ~ 12,
    H3ED1 == 13 ~ 13,
    H3ED1 == 14 ~ 14,
    H3ED1 == 15 ~ 15,
    H3ED1 == 16 ~ 16,
    H3ED1 == 17 ~ 17,
    H3ED1 == 18 ~ 17,
    H3ED1 == 19 ~ 18,
    H3ED1 == 20 ~ 19,
    H3ED1 == 21 ~ 20,
    H3ED1 == 22 ~ 21,
))

In [25]:
summary(addhealth_w3$edyrs)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   6.00   12.00   13.00   13.19   14.00   21.00       4 
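
The same mapping can be written more compactly, since codes 6 to 17 are kept as-is and codes 18 to 22 are shifted down by one. The sketch below uses base R on a plain numeric vector; `recode_edyrs` is a hypothetical helper, not part of the original analysis, and 96 and 98 are illustrative stand-ins for codes outside the 6-22 range.

```r
# Hypothetical helper: map education codes to years of schooling
recode_edyrs <- function(code) {
  ifelse(code >= 6 & code <= 17, code,           # grades and college: unchanged
         ifelse(code >= 18 & code <= 22, code - 1,  # graduate school: shift down by one
                NA))                              # anything else: missing
}

# Toy codes (illustrative, NOT real Add Health data)
toy_ed <- c(6, 12, 17, 18, 22, 96, 98)
toy_edyrs <- recode_edyrs(toy_ed)
print(toy_edyrs)
```

Applying this to the labelled `H3ED1` column would likely require converting it to plain numeric first, so treat it as a check on the logic rather than a drop-in replacement.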

In [27]:
addhealth_w3 <- mutate(addhealth_w3, 
                       foodstamps = 
                       ifelse(H3EC1C < 6, H3EC1C, NA)
                       )

In [28]:
summary(addhealth_w3$foodstamps)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
0.00000 0.00000 0.00000 0.04872 0.00000 1.00000      18 

In [29]:
eatdis_reg1 <- lm(eatdis ~ female + 
                  blacknh + hispanic + othernh +
                  edyrs +
                  foodstamps,
                 data = addhealth_w3)
summary(eatdis_reg1)


Call:
lm(formula = eatdis ~ female + blacknh + hispanic + othernh + 
    edyrs + foodstamps, data = addhealth_w3)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.06482 -0.04432 -0.02368 -0.00930  1.01073 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.013679   0.015501   0.882    0.378    
female       0.035341   0.004414   8.006 1.47e-15 ***
blacknh     -0.020649   0.005250  -3.933 8.51e-05 ***
hispanic    -0.008032   0.007211  -1.114    0.265    
othernh      0.002411   0.010853   0.222    0.824    
edyrs       -0.000313   0.001149  -0.272    0.785    
foodstamps   0.016517   0.010398   1.589    0.112    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1509 on 4848 degrees of freedom
  (27 observations deleted due to missingness)
Multiple R-squared:  0.0173,	Adjusted R-squared:  0.01609 
F-statistic: 14.23 on 6 and 4848 DF,  p-value: 3.882e-16
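
Because `eatdis` is a 0/1 outcome, this `lm()` fit is a linear probability model, so each coefficient reads as a change in the probability of a reported diagnosis with the other covariates held fixed. The `female` estimate of 0.035, for example, is about 3.5 percentage points. The sketch below shows the simplest case on simulated data: with a single binary regressor, the slope equals the difference in group means. All names and probabilities here are made up for illustration.

```r
# Simulated data (illustrative, NOT real Add Health data)
set.seed(1)
n <- 1000
x <- rbinom(n, 1, 0.5)                          # binary regressor
y <- rbinom(n, 1, ifelse(x == 1, 0.05, 0.02))   # binary outcome

# Linear probability model with one binary regressor
fit <- lm(y ~ x)
slope <- unname(coef(fit)["x"])

# Difference in the share of y == 1 between the two groups
diff_means <- mean(y[x == 1]) - mean(y[x == 0])

# The two agree up to floating-point error
print(all.equal(slope, diff_means))
```

With several covariates, as in the regression above, the coefficients are no longer raw differences in means, but they keep the same probability-scale interpretation.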
