<img src="images/econ140R_logo.png" width="200" />

<h1>ECON 140R Class 10</h1>

<h2>Bad controls in the January 2018 Current Population Survey (CPS)</h2>

<h2>Learning objectives:</h2>

1. More experience with OLS
2. Handy `factor()` function
3. Bad controls may cut off the avenues through which the treatment affects the outcome
4. Balancing omitted variable bias vs. bad controls is difficult

<h2>Bad controls</h2>

Let's examine how controlling for occupation and industry, which is easy to do, changes the estimated coefficients on education and on the female indicator. Adding these controls is probably not a good idea unless there is a compelling reason to look at within-occupation differences in earnings by education.

To explore this topic, we'll look at an extract from the January 2018 Current Population Survey (CPS), the annual job tenure supplement. I downloaded the extract from [IPUMS](http://cps.ipums.org), and in the future we'll walk through the IPUMS interface and contents.

In [1]:
library(tidyverse)
library(haven)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.4.3     ✔ purrr   1.0.2
✔ tibble  3.2.1     ✔ dplyr   1.1.3
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



Setting this option removes scientific notation, which can sometimes be a real bummer to read.

In [6]:
options(scipen = 999)
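
To see what `scipen` changes, here is a small sketch that formats one of the tiny p-values from the regression output later in this notebook, first under R's default setting and then under `scipen = 999`. The variable names are illustrative only, and the previous option value is restored at the end.

```r
# A small p-value, copied from the regression output later in this notebook
p <- 0.00000001381864817

old <- options(scipen = 0)   # R's default penalty: scientific notation is allowed
sci <- format(p)             # comes out in exponent notation

options(scipen = 999)        # heavy penalty against scientific notation
plain <- format(p)           # written out with leading zeros instead

options(old)                 # restore whatever was set before this sketch
print(c(default = sci, scipen999 = plain))
```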

This extract contains people aged 25-64, working ages by which education is likely to be completed.

In [7]:
cpsj18_2564 = read_dta("data/cpsj18_2564_1.dta")
head(cpsj18_2564)

year,serial,month,hwtfinl,cpsid,statefip,metro,pernum,wtfinl,cpsidp,⋯,jtsuppwt,earnweek,logearnweek,female,hispanic,raceth,married,edyrs,exper,expersq
<dbl>,<dbl>,<dbl+lbl>,<dbl>,<dbl>,<dbl+lbl>,<dbl+lbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl+lbl>,<dbl>,<dbl>,<dbl>,<dbl>
2018,1,1,1490.589,20161000000100,1,2,2,1490.589,20161000000102,⋯,0.0,,,1,0,1,1,14,50,2500
2018,3,1,1797.041,20180100000300,1,2,2,1797.041,20180100000302,⋯,2082.874,,,1,0,1,1,16,45,2025
2018,9,1,1735.762,20171000000400,1,2,1,1735.762,20171000000401,⋯,2324.656,903.0,6.805723,1,0,2,0,12,40,1600
2018,10,1,1582.769,20171000000600,1,2,1,1582.769,20171000000601,⋯,2070.467,1250.0,7.130899,1,0,2,0,16,40,1600
2018,12,1,1927.688,20170100001000,1,2,1,1927.688,20170100001001,⋯,2521.665,,,1,0,2,1,16,43,1849
2018,12,1,1927.688,20170100001000,1,2,2,2151.5,20170100001002,⋯,2830.419,,,0,0,2,1,14,39,1521


Here is a baseline regression of log weekly earnings on a 0/1 indicator for female gender identity and years of education, plus a set of controls: race/ethnicity, years of experience, years of experience squared, and years of job tenure:

$$
\ln Y_i = \alpha + \beta^f \ female_i + \beta^e \ educ_i + B \ controls_i + \epsilon_i
$$

We'll run this regression and examine what we find for $\beta^f$ and $\beta^e$:

In [8]:
cps_reg1 <- lm(logearnweek ~ female 
               + edyrs 
               + factor(raceth) 
               + exper + expersq + jtyears, 
               data = cpsj18_2564)
summary(cps_reg1)


Call:
lm(formula = logearnweek ~ female + edyrs + factor(raceth) + 
    exper + expersq + jtyears, data = cpsj18_2564)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.2460 -0.3124  0.0553  0.4172  2.6295 

Coefficients:
                   Estimate  Std. Error t value             Pr(>|t|)    
(Intercept)      4.91807089  0.06655211  73.898 < 0.0000000000000002 ***
female          -0.36884313  0.01438936 -25.633 < 0.0000000000000002 ***
edyrs            0.10686433  0.00308434  34.647 < 0.0000000000000002 ***
factor(raceth)2 -0.14380309  0.02531355  -5.681  0.00000001381864817 ***
factor(raceth)3 -0.06483195  0.02308744  -2.808              0.00499 ** 
factor(raceth)4 -0.08274354  0.02620348  -3.158              0.00160 ** 
exper            0.02694388  0.00342620   7.864  0.00000000000000415 ***
expersq         -0.00045188  0.00005626  -8.032  0.00000000000000108 ***
jtyears          0.01684017  0.00098428  17.109 < 0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.

We see a pretty large earnings penalty for female workers, about 0.37 log points (roughly $-$37% by the usual log approximation, closer to $-$31% when converted exactly), and we see a relatively large benefit for an additional year of education: 10.7%. Racial/ethnic minorities see earnings penalties relative to white non-Hispanic people (the baseline group). Log earnings are a parabola opening down in years of experience, which is consistent with typical theory. Each additional year of job tenure is associated with about 1.7% higher earnings.
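
Two bits of arithmetic sit behind that reading, sketched below with the coefficients typed in by hand from the `summary(cps_reg1)` output above. First, a coefficient $b$ in a log-earnings regression implies an exact percent change of $100 \times (e^b - 1)$, which drifts away from $100 \times b$ as $b$ gets large. Second, a quadratic in experience peaks where its derivative is zero. The variable names here are made up for illustration.

```r
# Coefficients copied from the summary(cps_reg1) output above
b_female  <- -0.36884313
b_edyrs   <-  0.10686433
b_exper   <-  0.02694388
b_expersq <- -0.00045188

# Exact percent change implied by a log-point coefficient: 100 * (exp(b) - 1)
pct_female <- 100 * (exp(b_female) - 1)
pct_edyrs  <- 100 * (exp(b_edyrs) - 1)

# b_exper * x + b_expersq * x^2 peaks where b_exper + 2 * b_expersq * x = 0
peak_exper <- -b_exper / (2 * b_expersq)

round(c(pct_female = pct_female, pct_edyrs = pct_edyrs, peak_exper = peak_exper), 1)
```

The female coefficient works out to a gap of about 31%, the education coefficient to a bit over 11%, and predicted log earnings peak at roughly 30 years of experience.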

Now let's explore controlling for occupation `occ` and, for kicks, also for industry `ind` and state of residence `statefip`. The last one is probably not a bad control, come to think of it, unless one believes that education's effect on earnings runs partially through enhancing geographic mobility, which isn't an impossibility. So let's try just that one first:

In [10]:
cps_reg2 <- lm(logearnweek ~ female 
               + edyrs 
               + factor(raceth) 
               + exper + expersq + jtyears
               + factor(statefip)
               , data = cpsj18_2564)
summary(cps_reg2)


Call:
lm(formula = logearnweek ~ female + edyrs + factor(raceth) + 
    exper + expersq + jtyears + factor(statefip), data = cpsj18_2564)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.1730 -0.3054  0.0556  0.4154  2.5813 

Coefficients:
                      Estimate  Std. Error t value             Pr(>|t|)    
(Intercept)         4.92568447  0.08389761  58.711 < 0.0000000000000002 ***
female             -0.36446582  0.01432826 -25.437 < 0.0000000000000002 ***
edyrs               0.10236582  0.00310533  32.965 < 0.0000000000000002 ***
factor(raceth)2    -0.17117692  0.02644045  -6.474         0.0000000001 ***
factor(raceth)3    -0.11219924  0.02477623  -4.529         0.0000060173 ***
factor(raceth)4    -0.14210344  0.02776289  -5.118         0.0000003144 ***
exper               0.02892904  0.00342474   8.447 < 0.0000000000000002 ***
expersq            -0.00048464  0.00005618  -8.627 < 0.0000000000000002 ***
jtyears             0.01699280  0.00098285  17.289 < 0.0000000000000

What do we see? 

In the coefficients on the top variables (i.e., not the state indicators), we do not see many big differences compared to the first regression. The coefficient on female, $\beta^f$, shows a penalty of about $-$36%, and the coefficient on years of education, $\beta^e$, shows that the benefit of an additional year is 10.2%.

Let's throw the kitchen sink at things.

In these data, there are almost 500 occupation categories, and there are roughly 250 industries. For details, see the documentation at IPUMS: [OCC documentation](https://cps.ipums.org/cps/codes/occ_20112019_codes.shtml), [IND documentation](https://cps.ipums.org/cps/codes/ind_2014_codes.shtml).
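
Wrapping a coded variable in `factor()` is what turns those hundreds of category codes into separate 0/1 indicators. The toy sketch below uses base R's `model.matrix()` to show the columns that `lm()` would build. The data frame `toy` and its three occupation codes are made up for illustration and are not taken from the CPS extract.

```r
# Made-up occupation codes for six hypothetical workers
toy <- data.frame(occ = c(10, 20, 4700, 10, 4700, 20))

# Treated as a plain number, occ enters as a single column (rarely what we want)
X_numeric <- model.matrix(~ occ, data = toy)

# Wrapped in factor(), occ becomes one 0/1 indicator per code,
# with the lowest code (10) dropped as the baseline group
X_factor <- model.matrix(~ factor(occ), data = toy)

colnames(X_numeric)
colnames(X_factor)
```

With $k$ distinct codes, `factor()` adds $k - 1$ indicator columns next to the intercept, so the kitchen-sink regression estimates several hundred extra coefficients.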

In [11]:
cps_reg3 <- lm(logearnweek ~ female 
               + edyrs 
               + factor(raceth) 
               + exper + expersq + jtyears
               + factor(occ) 
               + factor(ind) 
               + factor(statefip)
               , data = cpsj18_2564)
summary(cps_reg3)


Call:
lm(formula = logearnweek ~ female + edyrs + factor(raceth) + 
    exper + expersq + jtyears + factor(occ) + factor(ind) + factor(statefip), 
    data = cpsj18_2564)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3302 -0.2492  0.0370  0.3346  2.6950 

Coefficients: (1 not defined because of singularities)
                      Estimate  Std. Error t value             Pr(>|t|)    
(Intercept)         6.29870039  0.21752378  28.956 < 0.0000000000000002 ***
female             -0.23514338  0.01647747 -14.271 < 0.0000000000000002 ***
edyrs               0.05071949  0.00365472  13.878 < 0.0000000000000002 ***
factor(raceth)2    -0.06976667  0.02518971  -2.770             0.005624 ** 
factor(raceth)3    -0.02901805  0.02357138  -1.231             0.218331    
factor(raceth)4    -0.10409156  0.02629176  -3.959 0.000075866961920102 ***
exper               0.02223704  0.00322661   6.892 0.000000000005915869 ***
expersq            -0.00035482  0.00005297  -6.699 0.00000000002236971

What do we see here compared to earlier results?

The coefficient on female identity is now about $\beta^f = -0.24$, smaller in magnitude by about a third. In other words, within occupation-industry-state groupings in the data, female workers earn on average 24% less than observationally identical males.

Meanwhile, the coefficient on years of education has fallen by roughly half, down to $\beta^e = 0.051$. Here too, the story is that within occupation-industry-state groupings of workers, the returns to an additional year of education are smaller than they are overall.
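
The "about a third" and "roughly half" figures can be checked with a few lines of arithmetic on the printed estimates. This is only a sketch: the coefficients are typed in by hand from the `cps_reg1` and `cps_reg3` outputs above, and `shrink()` is a helper made up for this illustration.

```r
# Coefficients copied from the outputs above (cps_reg1 vs. cps_reg3)
b_female <- c(baseline = -0.36884313, kitchen_sink = -0.23514338)
b_edyrs  <- c(baseline =  0.10686433, kitchen_sink =  0.05071949)

# Share of the baseline coefficient that disappears once the controls are added
shrink <- function(b) unname(1 - b["kitchen_sink"] / b["baseline"])

round(100 * c(female = shrink(b_female), edyrs = shrink(b_edyrs)), 1)
```

The female coefficient shrinks by about 36% and the education coefficient by a little over 52%.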

<h2>Discussion</h2>

As is often the case, including more variables $Z$ on the right-hand side of the regression equation reduces the magnitudes and statistical significance of the coefficients we care about, such as the one on a key treatment variable $X$.

<b>The problem</b> is deciding whether the $Z$'s belong there or not. Are they potentially omitted variables that would obscure the true story about $\beta$, the effect of $X$ on $Y$, if they were omitted? Or are they bad controls, which block off a critical channel of causality running between $X$ and $Y$?

An <b>omitted variable</b> is a $Z$ that should be there, and by its omission is biasing the story about the causal effect of $X$ on the outcome $Y$. A classic example would be a parental characteristic that varies with the treatment variable and also affects the outcome. The thought experiment is: suppose a policy changed $X$ for kids while leaving important parental characteristics $Z$ as they are. What is the likely effect of the policy on $Y$?
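
A simulation can make the omitted variable case concrete. The sketch below is not based on the CPS data: every number in it (the true effect of 0.08, the strength of the parental characteristic, the noise levels) is an assumption chosen for illustration. It also checks the textbook omitted variable bias formula, under which the short-regression coefficient equals the long-regression coefficient plus the effect of $Z$ times the slope from regressing $Z$ on $X$.

```r
set.seed(140)
n <- 20000

# Simulated data; all parameter values are assumptions for illustration
z <- rnorm(n)                                        # parental characteristic
x <- 13 + 1.5 * z + rnorm(n, sd = 2)                 # kid's education, correlated with z
y <- 5 + 0.08 * x + 0.20 * z + rnorm(n, sd = 0.4)    # log earnings; true effect of x is 0.08

long  <- lm(y ~ x + z)   # z included
short <- lm(y ~ x)       # z omitted
aux   <- lm(z ~ x)       # how z moves with x

# Omitted variable bias formula: short = long + (effect of z) * (slope of z on x)
implied_short <- coef(long)["x"] + coef(long)["z"] * coef(aux)["x"]

round(c(long          = unname(coef(long)["x"]),
        short         = unname(coef(short)["x"]),
        implied_short = unname(implied_short)), 3)
```

Leaving out `z` pushes the coefficient on `x` above the true 0.08, because `x` picks up part of what `z` does to `y`.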

A <b>bad control</b> is a $Z$ that shouldn't be there, because controlling for it blocks off a critical pathway through which $X$ causes $Y$. A great example might be $Z = occupation$ and/or industry in a regression of $Y = earnings$ on $X = education$. Education raises earnings in part because it moves people out of manual labor and into services and other industries. It might be true that a highly educated janitor could be more productive and thus earn more than a poorly educated janitor, but it is the transition from janitor to something like an office worker, doctor, or lawyer that really raises earnings a lot. If we control for occupation and industry in an earnings regression, we will be closing off that pathway of switching occupations or industries, through which we imagine education has a large causal effect on earnings.
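
The bad control case can be simulated the same way. In the sketch below, which again uses made-up numbers rather than the CPS data, education raises log earnings directly (by an assumed 0.04 per year) and also raises the chance of holding a professional job, which pays an assumed premium of 0.60 log points. The names `educ`, `prof`, and `logearn` are hypothetical.

```r
set.seed(140)
n <- 20000

# Simulated data; all parameter values are assumptions for illustration
educ <- sample(10:18, n, replace = TRUE)               # years of education
prof <- rbinom(n, 1, plogis(-7 + 0.5 * educ))          # professional job, more likely with education
logearn <- 5 + 0.04 * educ + 0.60 * prof + rnorm(n, sd = 0.4)

total  <- lm(logearn ~ educ)          # leaves the occupation pathway open
within <- lm(logearn ~ educ + prof)   # controls for occupation: the bad control

round(c(total  = unname(coef(total)["educ"]),
        within = unname(coef(within)["educ"])), 3)
```

The regression without `prof` recovers the full payoff to a year of education in this setup, including the part that works through landing a better job. Adding `prof` leaves only the within-occupation return, which is much smaller, the same pattern as the drop in the `edyrs` coefficient between the first and third CPS regressions.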

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>