<img src="images/econ140R_logo.png" width="200" />

<h1>ECON 140R Class 09</h1>

We'd like to understand the role of "family size," meaning number of siblings, as a potentially omitted variable in the [Dale and Krueger (2002)](https://www-jstor-org.libproxy.berkeley.edu/stable/4132484) study of earnings as a function of private school attendance. 

The role of family size is discussed by Angrist and Pischke in <i>Mastering Metrics</i> on pp. 72-74. They reason that if family size were an omitted variable, it would likely create positive bias in the coefficient on private school, meaning that the true effect of private school attendance on earnings is (even) less positive than estimated.

This kind of reasoning about the sign of a bias is clever, and it is an important part of ECON 140R. But if we had a dataset that actually measured siblings, what would the empirical signs and magnitudes imply?

<h2>Learning Objectives</h2>

1. Look at actual data from the U.S. Health and Retirement Study, on birth cohorts that are similar to those studied by [Dale and Krueger (2002)](https://www-jstor-org.libproxy.berkeley.edu/stable/4132484)

2. Use `mutate()` to create some indicator variables for gender identity and race/ethnicity in a real dataset. That function is found in the `dplyr` package, which is also part of the `tidyverse` package, so you can just load `tidyverse`

3. Run some regressions using `lm()`

4. Visualizations: common pitfalls, some workarounds

5. Empirically assess the associations between "family size," meaning number of siblings, and earnings, and thus whether it likely was an omitted variable in [Dale and Krueger (2002)](https://www-jstor-org.libproxy.berkeley.edu/stable/4132484)

<h2>The U.S. Health and Retirement Study</h2>

The dataset we'll use is the U.S. Health and Retirement Study (HRS), a panel survey of Americans aged 50 and older that started in 1992 and has been refreshed periodically.


This is an extract I prepared specially for this purpose. The entire RAND version of the longitudinal file is big, over 1 GB in size. Berkeley's datahub is not configured to allow more than a gigabyte of memory per user, so this would be problematic. If you want to use these data yourself:
* Navigate to [https://hrs.isr.umich.edu/](https://hrs.isr.umich.edu/) and register as a user
* Start with the RAND file; I think it's the easiest
* Download the data to your local machine and use RStudio

The fourth wave took place in 1998, and we'll examine data from it. It isn't a perfect match to the cohort examined by [Dale and Krueger (2002)](https://www-jstor-org.libproxy.berkeley.edu/stable/4132484) of college entrants in 1976 reinterviewed in 1995, but the analysis is feasible and may offer some insights. Here is a summary of a major way the samples are different:

[Dale and Krueger (2002)](https://www-jstor-org.libproxy.berkeley.edu/stable/4132484) 
* Entered college in 1976
* Likely born around 1958
* Earnings measured for 1995

Extract of the 1998 wave of HRS
* Aged 50-59 in 1998
* Born 1939-1948
* Earnings measured for 1997

The 1998 wave of HRS and subsequent waves were designed to be nationally representative of Americans aged 50 and over. Mid-boomers born 1954-1959 were not formally added to the HRS until 2010. (See [HRS Survey Design and Methodology](https://hrs.isr.umich.edu/documentation/survey-design) for details).

Another dataset that might provide a better look at this question is the [National Longitudinal Survey of Youth 1979 (NLSY79)](https://www.bls.gov/nls/nlsy79.htm), which has followed nearly 13,000 men and women born 1957-1964. A similar option would be to examine the [National Longitudinal Study of 1972](https://nces.ed.gov/surveys/nls72/), as [Dale and Krueger (2002)](https://www-jstor-org.libproxy.berkeley.edu/stable/4132484) did in their original paper. But they appear not to have modeled family size as an additional covariate in that dataset, whether because of its omission or because they wanted to replicate their main results in a different dataset. A potential downside of NLS-72 is its smaller size, only 2,127 workers.

<i>Why choose HRS?</i> I know the dataset very well, so it was a snap for me to examine. If I had more time, I would look at the NLSY79 or the NLS-72 instead.

<h2>HRS and siblings</h2>

We'd like to understand how "family size," as discussed by Angrist and Pischke, might be an omitted variable in the [Dale and Krueger (2002)](https://www-jstor-org.libproxy.berkeley.edu/stable/4132484) study. The HRS measures the number of living siblings in wave 4, `r4livsib`, for this sample aged 50-59 in 1998. They are roughly 10-20 years older than the Dale-Krueger study participants, who entered college in 1976 and so were probably born around 1958.

Our objective is to run a version of this "long regression" from page 73 of <i>Mastering Metrics</i>:

$$
\ln Y_i = \alpha^l 
+ \beta^l \ P_i + 
\sum_j \gamma_j^l GROUP_{ji} 
+ \delta_1^l SAT_i
+ \delta_2^l \ln PI_i
+ \lambda FS_i
+ e^l_i
$$

where we can measure $FS_i$ in the HRS data, as `r4livsib`. We don't have many of the other right-hand-side variables shown here, but that shouldn't matter for this exercise. The coefficient on family size in a log earnings regression is not likely to depend much on the other controls shown here.

In [None]:
library(tidyverse)
library(haven)

In [None]:
hrs_w4_earn_sibs <- read_dta("data/hrs_w4_earn_sibs.dta")
head(hrs_w4_earn_sibs)

The RAND file uses a very helpful variable naming convention: `rKvarname`, where K is the wave. Here, let's look at summary statistics for the variable `r4livsib`, which is the number of living siblings in wave 4. For the people we'll look at, this is going to be very close to siblings ever born.

In [None]:
summary(hrs_w4_earn_sibs$r4livsib)
hist(hrs_w4_earn_sibs$r4livsib)

It's also helpful to look at years of education `raedyrs`, because that appears to be pretty important for understanding the effects of number of siblings on earnings:

In [None]:
summary(hrs_w4_earn_sibs$raedyrs)
hist(hrs_w4_earn_sibs$raedyrs)

Many folks are at that huge spike at a high school degree, 12 years. The Dale-Krueger dataset includes only people with at least that much education, so it has none of the left tail.

It might be interesting to see these two variables in a scatterplot, wouldn't it? Unfortunately, variables like this that take on integer values can create extremely unfortunate visualizations:

In [None]:
plot(hrs_w4_earn_sibs$r4livsib, hrs_w4_earn_sibs$raedyrs)

A tried and true solution to this problem is to MESS WITH THE DATA. You may not have known it, but chances are that in STAT 20 or DATA 8, you saw more than your fair share of scatterplots with deliberately "cooked" data in the way we're about to cook it.

If we add a small random number to both variables, a trick usually called jittering, we are monkeying with the data but basically preserving it. It's good to seed the random number generator (RNG) first so that we can reproduce the same plot later.

In [None]:
set.seed(20220927)

In [None]:
# Let's create variables ending in -r that have a 
# normally distributed random variable added, with mean 0 and small SD
hrs_w4_earn_sibs <- mutate(hrs_w4_earn_sibs, 
                           raedyrsr = raedyrs + rnorm(n(),0,1))
hrs_w4_earn_sibs <- mutate(hrs_w4_earn_sibs, 
                           r4livsibr = r4livsib + rnorm(n(),0,0.5))
head(hrs_w4_earn_sibs)

In [None]:
plot(hrs_w4_earn_sibs$r4livsibr, hrs_w4_earn_sibs$raedyrsr)
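Base R also ships a `jitter()` function that does the same kind of cooking with bounded uniform noise. The sketch below is only an illustration on simulated integer data, not the HRS extract: the vectors `sibs_toy` and `educ_toy`, the noise amount of 0.3, and the semi-transparent point color are all assumptions made for this example.

```r
# Simulated integer-valued data standing in for siblings and years of education
# (hypothetical values, not drawn from the HRS)
set.seed(20220927)
sibs_toy <- rpois(500, lambda = 3)
educ_toy <- pmin(17, pmax(0, round(rnorm(500, mean = 12, sd = 3))))

# jitter() with amount = a adds uniform noise between -a and +a
sibs_jit <- jitter(sibs_toy, amount = 0.3)
educ_jit <- jitter(educ_toy, amount = 0.3)

# Semi-transparent points make dense cells look darker
plot(sibs_jit, educ_jit,
     pch = 16, col = rgb(0, 0, 0, alpha = 0.3),
     xlab = "Living siblings (jittered)",
     ylab = "Years of education (jittered)")
```

Because the noise is bounded below 0.5, every point stays inside the cell of its original integer values, so the picture can still be read back to the underlying data.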

Another thing we could do is just run a regression. Here it is:

$$
raedyrs_i = \alpha^e + \beta^e \ livingsiblings_i + \epsilon^e_i
$$

In [None]:
edyrs_sib_reg <- lm(raedyrs ~ r4livsib, 
                    data = hrs_w4_earn_sibs)
summary(edyrs_sib_reg)

<hr>

Let's call `mutate()` to add some indicator (0/1) variables, for female gender identity and for the race/ethnicity categories that are useful to summarize folks:

In [None]:
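# This assumes ragender is coded 1 = male, 2 = female,
# so subtracting 1 yields a 0/1 indicator for female
hrs_w4_earn_sibs <- mutate(hrs_w4_earn_sibs, 
                           rafemale = ragender - 1)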

In [None]:
hrs_w4_earn_sibs <- mutate(hrs_w4_earn_sibs, 
                           rablacknh  = ifelse(raraceth == 2, 1, 0))
hrs_w4_earn_sibs <- mutate(hrs_w4_earn_sibs, 
                           rahispanic = ifelse(raraceth == 3, 1, 0))
hrs_w4_earn_sibs <- mutate(hrs_w4_earn_sibs, 
                           raothernh  = ifelse(raraceth == 4, 1, 0))
head(hrs_w4_earn_sibs)
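The same indicators can be built in a single `mutate()` call by converting a logical comparison to an integer. This is a minimal sketch on a made-up five-row tibble called `toy` (hypothetical, not the HRS extract). It assumes `ragender` is coded 1 = male, 2 = female, and it reuses the `raraceth` codes 2, 3 and 4 from the cells above, with 1 left as the reference group.

```r
library(dplyr)

# Hypothetical toy data using the same variable names as the HRS extract
toy <- tibble(ragender = c(1, 2, 2, 1, NA),
              raraceth = c(1, 2, 3, 4, 2))

# A comparison gives TRUE/FALSE/NA; as.integer() turns that into 1/0/NA
toy <- mutate(toy,
              rafemale   = as.integer(ragender == 2),
              rablacknh  = as.integer(raraceth == 2),
              rahispanic = as.integer(raraceth == 3),
              raothernh  = as.integer(raraceth == 4))
toy
```

Missing values stay missing rather than silently becoming 0, which is also how the `ifelse()` version behaves.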

Behind the scenes, I have created some standard "labor economics variables." One thing you can do in a log-wage regression is control for age and age-squared. You could also control for age group, with indicators for set ranges of age, maybe in 5-year age groups. You could also calculate and use a rough measure of years of potential "experience" in the labor market. Here it is calculated as age minus years of education (a common variant also subtracts about 6 for the years before school starts):

$$
r4exper_i = r4age_i - raedyrs_i
$$

Labor economists often use this measure instead of age in a log wage regression, but the differences between these approaches tend to be minimal. 

I also created a variable `r4expersq` by squaring the experience variable. Over a broad age range, earnings typically rise and then plateau with age, so a quadratic in experience captures the typical earnings profile fairly well. We expect the coefficient on the linear term to be positive and the coefficient on the squared term to be negative, so that the parabola opens downward. This isn't always true in practice, especially if we limit our analysis to a particular age range rather than all working ages 20-64.
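As a sketch of the algebra behind the downward-opening parabola, call the coefficients on `r4exper` and `r4expersq` $\theta_1$ and $\theta_2$ (names assumed here for illustration; they do not appear elsewhere in this notebook). Setting the derivative of log earnings with respect to experience to zero gives the experience level at which predicted earnings peak:

$$
\frac{\partial \ln earnings_i}{\partial \, r4exper_i} = \theta_1 + 2 \theta_2 \ r4exper_i = 0
\quad \Longrightarrow \quad
r4exper^{*} = -\frac{\theta_1}{2 \theta_2}
$$

With $\theta_1 > 0$ and $\theta_2 < 0$, the peak $r4exper^{*}$ is positive. If it falls outside the range of experience observed in the sample, the fitted profile only rises or only falls over that range, which is one reason the expected signs can fail to show up in a narrow age band.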

Let's run this regression:

$$
\ln earnings_i = \alpha + \beta \ livingsiblings_i + \gamma \ raedyrs_i + B \cdot controls_i + e_i
$$

where the controls include years of experience and their square, and indicator variables for gender and racial/ethnic identity.

In [None]:
hrs_reg1 <- lm(logr4iearn ~ r4livsib 
               + raedyrs 
               + r4exper + r4expersq 
               + rafemale 
               + rablacknh + rahispanic + raothernh, 
               data = hrs_w4_earn_sibs)
summary(hrs_reg1)

Once we have controlled for age or experience, years of education, gender identity, and race/ethnicity, <b>it doesn't appear that number of living siblings tells us anything about earnings.</b>

By contrast, number of living siblings in 1998 definitely does appear to be correlated with years of education, controlling for gender and race/ethnicity:

In [None]:
hrs_reg2 <- lm(raedyrs ~ r4livsib 
               + rafemale 
               + rablacknh + rahispanic + raothernh, 
               data = hrs_w4_earn_sibs)
summary(hrs_reg2)

<hr>

Another reasonable approach here would be to drop respondents who have less education than people in the Dale and Krueger (2002) study. They describe their sample in Table II on p. 1506, and they report that 85% graduated from college, and 56% obtained an advanced degree.

It turns out that in the HRS data, if we look at people with `raedyrs` of 15 and more, that gets us a sample that looks similar along that dimension. This code shows us frequencies of educational attainment:

In [None]:
raedyrs_freq <- as.data.frame( table(hrs_w4_earn_sibs$raedyrs) )
colnames(raedyrs_freq) <- c("Years of educ","Freq")
#temp <- table(unlist(hrs_w4_earn_sibs$raedyrs))
raedyrs_freq

In this HRS sample, we see 244 observations at 15 years of education but not a college degree, 620 with a college degree only, and 706 with more than a college degree.

In [None]:
share_no_college <- 244/(244+620+706)
share_no_college

share_college <- (620+706)/(244+620+706)
share_college

share_beyond_college <- 706/(244+620+706)
share_beyond_college

The code above shows that the 244 people at `raedyrs` == 15 are 15% of the total at or above that level, meaning 85% graduated college. That fits the Dale and Krueger sample well. Roughly 45% got more than a college degree, which is a little lower than what Dale and Krueger report. Close enough? It's hard to tell, but let's move ahead.
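The shares can also be computed from the data rather than from hand-typed counts. The sketch below rebuilds a stand-in vector, `raedyrs_toy` (a hypothetical name), from the three counts quoted above. It assumes that `raedyrs` of 16 marks a college degree only and 17 marks more than a college degree.

```r
# Stand-in for raedyrs among people with 15+ years, rebuilt from the quoted counts
# (assumes 16 = college degree only, 17 = more than a college degree)
raedyrs_toy <- rep(c(15, 16, 17), times = c(244, 620, 706))

# Share at each level of education
round(prop.table(table(raedyrs_toy)), 3)

# The mean of a TRUE/FALSE comparison is the share of TRUEs
share_no_college     <- mean(raedyrs_toy == 15)
share_college        <- mean(raedyrs_toy >= 16)
share_beyond_college <- mean(raedyrs_toy >= 17)
c(share_no_college, share_college, share_beyond_college)
```

On the real extract, the same expressions would be applied to `raedyrs` after filtering to 15 or more years, so the shares update automatically if the sample changes.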

The code below drops education from the right-hand side and runs the model on people with 15+ years of education. This produces a subsample that is similar to the Dale and Krueger data.

In [None]:
hrs_reg3 <- lm(logr4iearn ~ r4livsib 
               #+ raedyrs
               + r4exper 
               + r4expersq 
               + rafemale 
               + rablacknh + rahispanic + raothernh, 
               data = subset(hrs_w4_earn_sibs, raedyrs >= 15))
summary(hrs_reg3)
nobs(hrs_reg3)

There's not much evidence of a relationship here either. We can also examine a pretty extreme model, where we drop all other covariates and look at the bivariate relationship between log earnings and living siblings:

In [None]:
hrs_reg4 <- lm(logr4iearn ~ r4livsib, 
               data = subset(hrs_w4_earn_sibs, raedyrs >= 15))
summary(hrs_reg4)
nobs(hrs_reg4)

Even here, we see little evidence that number of siblings, or family size, mattered much for earnings at these ages for this birth cohort.

<h2>Bottom lines and open questions</h2>

The point of this exercise was to explore the supposition that family size might be an omitted variable in the [Dale and Krueger (2002)](https://www-jstor-org.libproxy.berkeley.edu/stable/4132484) study, as posited by Angrist and Pischke in <i>Mastering Metrics</i> on pp. 72-74.

Here are some takeaways:

<b>Angrist and Pischke's guesses are shrewd</b>, and they reveal the usefulness of the OVB formula in thinking through robustness. It is very plausible that if family size $FS$ were omitted from this long regression of log earnings on private school attendance and other things:

$$
\ln Y_i = \alpha^l 
+ \beta^l \ P_i + 
\sum_j \gamma_j^l GROUP_{ji} 
+ \delta_1^l SAT_i
+ \delta_2^l \ln PI_i
+ \lambda FS_i
+ e^l_i
$$

then the omission may produce positive bias in $\beta^s$ from a short regression. Why? You have to have two things:

1. We need $\lambda < 0$, meaning family size reduces earnings. Here $\lambda$ is the coefficient on $FS_i$ in the long regression above

2. We need $\pi_1 < 0$, meaning family size and private school attendance are negatively related. Here $\pi_1$ is the coefficient on $P_i$ in a regression of $FS_i$ on $P_i$ and the other controls

When these are both true, then $OVB = \pi_1 \times \lambda > 0$ and thus $\beta^s$ will be too positive relative to $\beta^l$. (Another way this could work is if both $\lambda$ and $\pi_1$ were positive, but that does not seem likely.)

Both assumptions are plausible. 
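Spelled out, the argument compares the long regression above with two other regressions. The sketch below follows the usual omitted variables bias setup. The family-size coefficient is the one written $\lambda$ in the long regression above, while the names $\pi_0$, $\pi_2$, $\pi_3$, $\theta_j$ and $u_i$ are assumptions introduced here only to write the auxiliary regression out in full.

The short regression leaves out $FS_i$:

$$
\ln Y_i = \alpha^s 
+ \beta^s \ P_i + 
\sum_j \gamma_j^s GROUP_{ji} 
+ \delta_1^s SAT_i
+ \delta_2^s \ln PI_i
+ e^s_i
$$

The auxiliary regression puts the omitted variable on the left-hand side:

$$
FS_i = \pi_0 
+ \pi_1 \ P_i + 
\sum_j \theta_j GROUP_{ji} 
+ \pi_2 SAT_i
+ \pi_3 \ln PI_i
+ u_i
$$

The OVB formula then links the two private school coefficients:

$$
\beta^s = \beta^l + \pi_1 \times \lambda
$$

so the bias $\beta^s - \beta^l$ is positive exactly when $\pi_1$ and $\lambda$ have the same sign, and it vanishes if either one is zero.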

<b>But HRS data suggest family size does not <u>independently</u> affect earnings: the coefficient on family size in the long regression, $\lambda$, is approximately zero</b>. I think this is because family size affects earnings through education. Here is another look at this: let's model earnings as a function of `r4livsib` in the full HRS sample, omitting education:

In [None]:
hrs_reg5 <- lm(logr4iearn ~ r4livsib 
               #+ raedyrs
               + r4exper 
               + r4expersq 
               + rafemale 
               + rablacknh + rahispanic + raothernh, 
               data = hrs_w4_earn_sibs)
summary(hrs_reg5)
nobs(hrs_reg5)

We now see a statistically significant negative coefficient on `r4livsib`, because education is omitted.
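The mechanism described above, with siblings mattering for earnings only through education, can be illustrated with simulated data. Everything in this sketch is hypothetical: the variable names, the sample size, and the coefficients (0.3 fewer years of education per sibling, a 10% return to a year of education, and no direct sibling effect) are assumptions chosen for illustration, not HRS estimates.

```r
# Simulated data in which siblings affect earnings only through education
set.seed(140)                                       # hypothetical seed
n       <- 5000
sib     <- rpois(n, lambda = 3)                     # number of siblings
educ    <- 14 - 0.3 * sib + rnorm(n, sd = 2)        # education falls with siblings
logearn <- 8 + 0.10 * educ + rnorm(n, sd = 0.5)     # no direct sibling effect

long_reg  <- lm(logearn ~ sib + educ)   # education included
short_reg <- lm(logearn ~ sib)          # education omitted
aux_reg   <- lm(educ ~ sib)             # omitted variable on included variable

b_long_sib  <- coef(long_reg)[["sib"]]
b_long_educ <- coef(long_reg)[["educ"]]
b_short_sib <- coef(short_reg)[["sib"]]
pi_sib      <- coef(aux_reg)[["sib"]]

# The short coefficient equals the long coefficient plus the OVB term
c(short   = b_short_sib,
  long    = b_long_sib,
  implied = b_long_sib + pi_sib * b_long_educ)
```

In this setup the coefficient on siblings is close to zero once education is included and negative once it is omitted, and the `short` and `implied` entries agree up to rounding error because the OVB formula holds exactly in sample.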

The bottom line that I take away from this is that <i><b>family size is probably not an omitted variable in the Dale and Krueger (2002) study</b></i>, "getting off on a technicality," as it were. Because their cohort uniformly attended college, a big channel through which siblings matter for earnings, namely education, was shut off. In the long regression, the coefficient on family size, $\lambda$, is statistically indistinguishable from zero.

<b>An open question is whether the sample matters.</b> This is always a concern for applied research, and we usually call it <i>external validity</i>: whether a result from a particular sample can be generalized beyond that sample to the broader human experience.

<hr>

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>