<img src="images/econ140R_logo.png" width="200" />

<h1>ECON 140R Class 10_sim</h1>

<h2>Simulating bad controls</h2>

<h2>Learning objectives:</h2>

1. A "Monte Carlo" simulation allows us to create a toy reality where we can play and learn  
2. "Bad controls" cut off avenues through which a treatment affects an outcome
3. "Good controls" help us compare apples to apples

<h2>Bad controls</h2>

The problem of bad controls can be simple, and it can also be complicated. Scott Cunningham works through a <b>complicated</b> version of bad controls in the case of gender inequality in earnings in <i>Causal Inference: The Mixtape</i> [section 3.1.5](https://mixtape.scunning.com/03-directed_acyclical_graphs#discrimination-and-collider-bias), with an example drawn from work with Erin Hengel. Their example features unobserved ability, which, like gender discrimination, also affects occupation and earnings. Omitting ability can even flip the sign of the estimated effect of gender discrimination.

A <b>simpler</b> point is that when we run OLS on a log earnings equation like this:

$$
\log w_i = \alpha + \beta \ female_i + \gamma \ occupation_i + \epsilon_i
$$

then our estimate of $\beta$ is the <i>marginal effect</i> of gender discrimination on the log wage, where occupation is held constant. In the language of calculus, $\beta$ is the partial derivative of $\log w_i$ with respect to gender, not the total derivative. 

If gender discrimination also affects occupation, this model will attribute that effect solely to occupation and not at all to the female indicator variable. If the math is helpful, the key points are that the partial derivative is:

$$
\frac{\partial \log w_i}{\partial \ female_i} = \beta
$$

and the total derivative is:
$$
\frac{d \log w_i}{d \ female_i} = \beta + \gamma \frac{\partial \ occupation_i}{\partial \ female_i} = \ ?
$$

In [None]:
library(tidyverse)

Let us simulate a dataset with 10,000 observations split roughly equally between self-identified males and females (each observation is assigned a gender by a fair coin flip). Thanks to Scott Cunningham's code, we can do this by constructing a `tibble()`, with calls to the random number generators `runif()` and `rnorm()`. We begin with males and females, and then we create an occupation variable and a log wage, both with baked-in gender discrimination:
$$
occupation_i = \alpha^o + \beta^o \ female_i + \nu^o_i
$$
$$
\log w_i = \alpha^w + \beta^w \ female_i + \gamma^w \ occupation_i + \nu^w_i
$$
"Occupation" here is proxied by a single index, which you could think of as occupational prestige.
When the $\beta$'s are negative, there is gender discrimination penalizing females.

It's helpful to seed the random number generator with a fixed value, so we can reproduce the results. I like to use today's date.

In [None]:
set.seed(20240930)

In [None]:
ao = 1  
bo = -2 # gender discrimination in occupation

aw = 3.6
bw = -0.3 # gender discrimination in wages
gw = 0.2  # effect of occupation on wages

data_disc_1k_1 <- tibble(
    female     = ifelse(runif(10000)>=0.5,1,0),
    occupation = ao + bo*female + rnorm(10000, mean = 0, sd = 0.1),
    logwage    = aw + bw*female + gw*occupation + rnorm(10000, mean = 0, sd = 0.1) 
)
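
Before running any regressions, it can help to check that the simulated data look the way the equations say they should. The sketch below is a self-contained check in base R (so it does not need the tidyverse); it re-simulates `female` and `occupation` with the same parameter values as above, and the names `share_female` and `occ_means` are introduced here only for illustration.

```r
# Self-contained sanity check in base R; it re-simulates the data instead of
# reusing the tibble above, so treat the exact numbers as illustrative.
set.seed(20240930)

n  <- 10000
ao <- 1     # occupation intercept (value from the text)
bo <- -2    # gender discrimination in occupation (value from the text)

female     <- ifelse(runif(n) >= 0.5, 1, 0)
occupation <- ao + bo * female + rnorm(n, mean = 0, sd = 0.1)

# Share of females: close to 0.5, but not exactly, because gender is a coin flip
share_female <- mean(female)

# Mean occupation index by gender: roughly ao for males, ao + bo for females
occ_means <- tapply(occupation, female, mean)

print(share_female)
print(occ_means)
```

The gap between the two group means of `occupation` should sit near $\beta^o = -2$, which is the baked-in discrimination in occupation.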

Let's now examine what OLS returns in short and long regressions. In the short regression, we model only the effect of gender discrimination: 
$$
\log w_i = \alpha^s + \beta^s \ female_i + \epsilon^s_i
$$
In the long regression, we also control for the effect of occupation:
$$
\log w_i = \alpha^l + \beta^l \ female_i + \gamma^l \ occupation_i + \epsilon^l_i
$$

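The long regression has the same form as the data-generating wage equation written above, while the short regression omits occupation. It is useful to keep the regression coefficients notationally separate from the data-generating parameters, for reasons that will become clear shortly. Here is the short regression: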

In [None]:
reg_1_short <- lm(logwage ~ female,
                  data = data_disc_1k_1)
summary(reg_1_short)

In the short regression above, the estimated effect of gender discrimination is quite large: $\beta^s \approx -0.70$. Because there are no other variables on the right-hand side, this is actually the total derivative of the log wage with respect to $female_i$, and we can see 
$$ 
\beta^s \approx \beta^w + \gamma^w \times \beta^o = -0.3 + 0.2 \times (-2) = -0.3 - 0.4 = -0.70
$$
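
One way to see where this decomposition comes from is to substitute the data-generating occupation equation into the data-generating wage equation. This uses only the two relationships defined earlier:

$$
\log w_i = \alpha^w + \beta^w \ female_i + \gamma^w \left( \alpha^o + \beta^o \ female_i + \nu^o_i \right) + \nu^w_i
$$
$$
\log w_i = \left( \alpha^w + \gamma^w \alpha^o \right) + \left( \beta^w + \gamma^w \beta^o \right) female_i + \left( \gamma^w \nu^o_i + \nu^w_i \right)
$$

Because $female_i$ is drawn independently of both noise terms, the short regression estimates the intercept and slope in parentheses. With the parameter values used here, the intercept is $3.6 + 0.2 \times 1 = 3.8$ and the slope is $-0.3 + 0.2 \times (-2) = -0.7$.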

This $\beta^s$ picks up the direct effect of gender discrimination on wages, and it also picks up its indirect effect through occupation, <i>because we are not controlling for occupation</i>. What if we did? Here is the long regression:

In [None]:
reg_1_long <- lm(logwage ~ female
                 + occupation,
                 data = data_disc_1k_1)
summary(reg_1_long)

We see a very different story emerge here, because controlling for occupation <i>shuts it off as a channel of causality</i> running from gender discrimination through occupation into wages. In the long regression, we recover the data-generating wage equation that we posited, up to sampling error, with $\beta^l \approx \beta^w = -0.3$ and $\gamma^l \approx \gamma^w = 0.2$. But are those results indicative of the full sweep of gender discrimination? Most would probably say they are not.
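
The link between the short and long coefficients can also be checked numerically. The sketch below, in base R and self-contained, re-simulates the data and adds an auxiliary regression of occupation on female. This auxiliary regression and the object names (`aux`, `b_short`, `b_long`, `g_long`, `d_aux`) are not in the cells above; they are introduced here for illustration. For OLS, the short coefficient equals the long coefficient plus the coefficient on the omitted variable times the auxiliary slope, and this holds exactly in the sample.

```r
# Self-contained check of the short-versus-long relationship (base R only)
set.seed(20240930)

n  <- 10000
ao <- 1;   bo <- -2                 # occupation equation (values from the text)
aw <- 3.6; bw <- -0.3; gw <- 0.2    # wage equation (values from the text)

female     <- ifelse(runif(n) >= 0.5, 1, 0)
occupation <- ao + bo * female + rnorm(n, mean = 0, sd = 0.1)
logwage    <- aw + bw * female + gw * occupation + rnorm(n, mean = 0, sd = 0.1)

short <- lm(logwage ~ female)               # omits occupation
long  <- lm(logwage ~ female + occupation)  # controls for occupation
aux   <- lm(occupation ~ female)            # auxiliary regression (illustrative)

b_short <- coef(short)[["female"]]
b_long  <- coef(long)[["female"]]
g_long  <- coef(long)[["occupation"]]
d_aux   <- coef(aux)[["female"]]

# Short coefficient, and the same number rebuilt from the long and auxiliary fits
print(c(short = b_short, rebuilt = b_long + g_long * d_aux))
```

The two printed numbers agree up to floating-point error, which is the sample counterpart of $\beta^s = \beta^w + \gamma^w \times \beta^o$.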

<hr>

One could turn off direct gender discrimination in wages to see this point another way. Suppose we set $\beta^w = 0$:

In [None]:
ao = 1  
bo = -2 # gender discrimination in occupation

aw = 3.6
bw = 0    # NO gender discrimination in wages
gw = 0.2  # effect of occupation on wages

data_disc_1k_2 <- tibble(
    female     = ifelse(runif(10000)>=0.5,1,0),
    occupation = ao + bo*female + rnorm(10000, mean = 0, sd = 0.1),
    logwage    = aw + bw*female + gw*occupation + rnorm(10000, mean = 0, sd = 0.1) 
)

In [None]:
reg_2_short <- lm(logwage ~ female,
                  data = data_disc_1k_2)
summary(reg_2_short)

The short regression still shows gender discrimination, because it reveals the total derivative of the log wage with respect to $female_i$. But all that is left is the $\gamma^w \times \beta^o = -0.4$ piece.

Consider the long regression:

In [None]:
reg_2_long <- lm(logwage ~ female
                 + occupation,
                 data = data_disc_1k_2)
summary(reg_2_long)

Here, there is no direct effect of gender discrimination on wages, <i>even though there still is an indirect effect running through occupation into wages</i>. In the long regression, we cannot and should not reject $\beta^l = 0$, but it would be a mistake to conclude there is no gender discrimination in wages. There clearly is discrimination, but it runs through occupation, which we have held constant.
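
A single simulated dataset is one draw from our toy reality. In the Monte Carlo spirit, we could repeat the experiment many times and look at how the two estimates behave on average. The sketch below does this in base R under the $\beta^w = 0$ scenario. The helper `simulate_once()`, the 500 replications, and the smaller sample size of 1,000 per replication are all assumptions chosen here to keep the run short; they are not part of the cells above.

```r
# Monte Carlo sketch: repeat the beta^w = 0 experiment many times (base R only)
set.seed(20240930)

# Hypothetical helper: simulate one dataset and return the coefficient on
# female from the short and the long regression
simulate_once <- function(n = 1000, ao = 1, bo = -2, aw = 3.6, bw = 0, gw = 0.2) {
  female     <- ifelse(runif(n) >= 0.5, 1, 0)
  occupation <- ao + bo * female + rnorm(n, mean = 0, sd = 0.1)
  logwage    <- aw + bw * female + gw * occupation + rnorm(n, mean = 0, sd = 0.1)
  c(short = coef(lm(logwage ~ female))[["female"]],
    long  = coef(lm(logwage ~ female + occupation))[["female"]])
}

# One row per replication, with columns "short" and "long"
draws <- t(replicate(500, simulate_once()))

avg <- colMeans(draws)       # average estimate across replications
sds <- apply(draws, 2, sd)   # spread of the estimates across replications

print(round(rbind(mean = avg, sd = sds), 3))
```

Across replications, the short estimates should center near $-0.4$ and the long estimates near $0$. The long estimates are also noticeably noisier, because $female_i$ and $occupation_i$ are highly correlated in this design.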

<hr>

<h2>Good controls</h2>

Is it always a mistake to control for other characteristics? Definitely not. But especially when the research question has to do with <i>disparities</i>, or inequalities that should not exist from a normative or moral perspective, the choice of controls can be controversial.

Most studies of the labor market find that workers with more years of labor market <b>experience</b> typically earn more. Other things equal, older workers will have more experience than younger workers, and thus we probably would expect there to be an earnings benefit associated with age. In other words, we would probably not perceive wage inequality favoring older workers as unjust.

In a cross section of workers, <b>experience</b> and <b>education</b> are often negatively correlated. There are two reasons why: acquiring years of education past the age of compulsory schooling typically requires sacrificing years of work experience; and older birth cohorts typically had less access to education than younger birth cohorts.

If we are interested in the effect of education on earnings, then <i>we should control for experience</i> or age in the regression, so that we compare apples to apples. Experience or age is probably a <b>good control</b> in an earnings regression in most cases. There are situations when the causal effect of a treatment variable on earnings might flow through experience; see for example the voluminous work on gender inequality by 2023 Nobel Laureate in Economics, [Claudia Goldin](https://www.nobelprize.org/prizes/economic-sciences/2023/prize-announcement/). But in many other situations, age is likely to be a more immutable characteristic, one that the treatment cannot change. 
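
The same simulation approach can illustrate a good control. The sketch below is a toy example in base R, and every number in it is an assumption made up for illustration: a return to education of 0.08, a return to experience of 0.02, ages between 25 and 60, and "potential experience" defined as age minus years of education minus 6. It captures only the first of the two reasons above for the negative correlation, namely that more schooling means fewer years of work.

```r
# Toy simulation of a good control (base R only; all parameters are hypothetical)
set.seed(20240930)
n <- 10000

age        <- runif(n, min = 25, max = 60)
education  <- sample(10:18, n, replace = TRUE)  # years of schooling
experience <- age - education - 6               # potential experience (assumed)

# Assumed data-generating process: education and experience both raise wages
logwage <- 1.5 + 0.08 * education + 0.02 * experience + rnorm(n, mean = 0, sd = 0.3)

short <- lm(logwage ~ education)               # omits experience
long  <- lm(logwage ~ education + experience)  # controls for experience

b_educ_short <- coef(short)[["education"]]
b_educ_long  <- coef(long)[["education"]]
educ_exp_cor <- cor(education, experience)     # negative by construction

print(round(c(short = b_educ_short, long = b_educ_long, cor = educ_exp_cor), 3))
```

In this toy world, the long regression recovers a return to education near the assumed 0.08, while the short regression understates it (near 0.06), because more-educated workers have less experience. Here experience is not a channel through which education acts on purpose; controlling for it compares workers at the same point in their careers.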

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>