<img src="images/econ151.png" width="200" />

<h1>ECON 151 Class 20</h1>

As we did in Class 01, let us examine a dataset that measures wages and other characteristics of a cohort of men in the National Longitudinal Surveys (NLS), using a useful repository of data from Jeffrey Wooldridge's excellent textbook, <i>Introductory Econometrics: A Modern Approach</i>.

This appears as Example 9.3 on page 281 of the 6th edition, and it draws on a dataset provided by [Blackburn and Neumark (1992)](https://www-jstor-org.libproxy.berkeley.edu/stable/2118394) on monthly earnings and other characteristics among men in 1980. As described by Blackburn and Neumark, the data come from the Young Men's Cohort of the NLS, first surveyed in 1966 at ages 14-24 (i.e., born in 1942-1952) and then again at one- or two-year intervals afterward. Wooldridge remarks that the `wage2` extract includes wage data from 1980, when the individuals are aged 28-38.

<h2>Loading in the data</h2>

Helpfully, folks have dumped all Wooldridge's public datasets into an R package for us to use. Here is code that sets that up. Highlight the code snippet with your mouse or trackpad, and hit <tt>SHIFT+ENTER</tt> to run it.

In [1]:
install.packages('wooldridge')

Installing package into ‘/srv/r’
(as ‘lib’ is unspecified)



This command digs into that installed package and retrieves the part of it that holds our data:

In [2]:
data(wage2, package='wooldridge')

There are several ways of probing what it is that we just loaded. One convenient function to call is <tt>head()</tt>:

In [3]:
head(wage2)

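|   | wage | hours | IQ | KWW | educ | exper | tenure | age | married | black | south | urban | sibs | brthord | meduc | feduc | lwage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` |
| 1 | 769 | 40 | 93 | 35 | 12 | 11 | 2 | 31 | 1 | 0 | 0 | 1 | 1 | 2 | 8 | 8 | 6.645091 |
| 2 | 808 | 50 | 119 | 41 | 18 | 11 | 16 | 37 | 1 | 0 | 0 | 1 | 1 | NA | 14 | 14 | 6.694562 |
| 3 | 825 | 40 | 108 | 46 | 14 | 11 | 9 | 33 | 1 | 0 | 0 | 1 | 1 | 2 | 14 | 14 | 6.715384 |
| 4 | 650 | 40 | 96 | 32 | 12 | 13 | 7 | 32 | 1 | 0 | 0 | 1 | 4 | 3 | 12 | 12 | 6.476973 |
| 5 | 562 | 40 | 74 | 27 | 11 | 14 | 5 | 34 | 1 | 0 | 0 | 1 | 10 | 6 | 6 | 11 | 6.331502 |
| 6 | 1400 | 40 | 116 | 43 | 16 | 14 | 2 | 35 | 1 | 1 | 0 | 1 | 1 | 2 | 8 | NA | 7.244227 |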


Another is to pull up the dataset's help page by typing `?wooldridge::wage2`:

In [None]:
?wooldridge::wage2

The variables have mnemonic names you can guess. Probably the strangest one is <tt>lwage</tt>, which appears at the far right of the results window (scroll right), and which is the <b>natural logarithm of monthly earnings</b>.
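As a quick sanity check on that description, here is a minimal sketch that compares <tt>lwage</tt> with the natural log of <tt>wage</tt>. To keep it self-contained, it hard-codes the first six rows as printed by <tt>head(wage2)</tt> rather than touching the full dataset; the vector names and the tolerance are my own choices for illustration.

```r
# First six values of wage and lwage, copied from the head(wage2) output above
wage  <- c(769, 808, 825, 650, 562, 1400)
lwage <- c(6.645091, 6.694562, 6.715384, 6.476973, 6.331502, 7.244227)

# log() in R is the natural logarithm by default
max_gap <- max(abs(log(wage) - lwage))
print(max_gap)

# TRUE if the stored values match log(wage) up to rounding in the printed output
print(max_gap < 1e-5)
```

With the full data loaded, the same comparison could be run on <tt>wage2$wage</tt> and <tt>wage2$lwage</tt> directly.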

<h3>The hourly wage</h3>

Monthly earnings are the product of the hourly wage, weekly hours worked, and weeks worked per month. Because each of these can vary across workers, it might be best to examine the hourly wage. Let's construct an estimate of the hourly wage using the `hours` variable in the dataset, which measures average weekly hours. We'll assume there are 4 weeks per month.

In [11]:
# hourly wage
wage2$hourlywage <- wage2$wage/wage2$hours/4

# log hourly wage
wage2$loghourlywage <- log(wage2$hourlywage)
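How much does the 4-weeks-per-month assumption matter? The sketch below suggests that, in logs, it only shifts every worker's value by the same constant, so in a regression with an intercept it should move the intercept and leave the slope coefficients alone. It uses the first six rows printed above; the alternative of 52/12 weeks per month is my own assumption for comparison, not something the dataset specifies.

```r
# First six rows of wage and hours, copied from the head(wage2) output above
wage  <- c(769, 808, 825, 650, 562, 1400)
hours <- c(40, 50, 40, 40, 40, 40)

weeks_4   <- 4         # the assumption used in this notebook
weeks_alt <- 52 / 12   # alternative assumption: about 4.33 weeks per month

log_hw_4   <- log(wage / hours / weeks_4)
log_hw_alt <- log(wage / hours / weeks_alt)

# The two versions differ by the same constant for every worker
gap <- log_hw_4 - log_hw_alt
print(gap)
print(log(weeks_alt / weeks_4))
```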

While we're tweaking variables, let's also add the square of years of labor market experience.

In [15]:
# square of experience
wage2$expersq <- wage2$exper^2

<h2>Mincer earnings regressions</h2>

Jacob Mincer ([1974](https://www.nber.org/books-and-chapters/schooling-experience-and-earnings)) formalized what is now a standard tool in economics: the log earnings regression. [Heckman, Lochner, and Todd (2003)](https://www-nber-org.libproxy.berkeley.edu/papers/w9732) provide a review. The basic formulation of a Mincer log wage regression is

$$\log w_i = \alpha + \rho_s \ s_i + \beta_0 x_i + \beta_1 x_i^2 + \epsilon_i$$

where $w_i$ is the hourly wage for worker $i$; $s_i$ measures their years of schooling or education; there is a quadratic in their years of labor market experience, $x_i$; and $\epsilon_i$ is a white-noise error term with mean zero.

Other variables can appear on the right-hand side, such as gender identity, racial or ethnic identity, geographic location, and industry and occupation. When test scores are available, those also sometimes appear on the right-hand side.

<h3>Hourly wage or earnings?</h3>

I think many studies move back and forth between these two concepts as potential $y$-variables. If there are big differences in hours worked between units, then earnings will be affected by those differences, and it might be smarter to model wages instead.

<h3>Ordinary least squares in R</h3>

In R the <tt>lm()</tt> function fits multivariate linear models conveniently. The syntax takes getting used to, but to estimate this model:
$$y = \alpha + \beta x + \gamma z + \epsilon$$
we call this code:

<center><tt>lm(y ~ x + z)</tt></center>

Can you see the similarities?

To estimate this equation via ordinary least squares (OLS), we call <tt>lm()</tt> (for "linear model") with an estimating equation. I like assigning the output to new structures on the left-hand side of the "gets" operator, <tt>&lt;-</tt>.
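To see the formula syntax in action before we turn to the wage data, here is a minimal sketch on simulated data. Everything in it is made up for illustration: the data frame <tt>sim</tt>, the seed, and the "true" parameters $\alpha = 1$, $\beta = 2$, and $\gamma = -0.5$.

```r
set.seed(151)  # arbitrary seed so the simulated draws are reproducible

n <- 500
x <- rnorm(n)
z <- rnorm(n)

# Simulated outcome with known parameters: alpha = 1, beta = 2, gamma = -0.5
y <- 1 + 2 * x - 0.5 * z + rnorm(n, sd = 0.1)

sim <- data.frame(y = y, x = x, z = z)  # hypothetical data frame for illustration
fit <- lm(y ~ x + z, data = sim)

# Estimates appear in the order (Intercept), x, z
print(coef(fit))
```

The intercept is included automatically, which is why $\alpha$ never shows up in the formula.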

<h3>Racial inequality in men's wages</h3>

A useful way to document and explore inequality in wages is with multivariate OLS, controlling for the productive characteristics of workers on the right-hand side. Such an approach "bakes in" any causal effect of discrimination that runs through the productive characteristics, like educational attainment. But it provides a useful answer to a specific question: how much inequality in wages do we see that is attributable to dynamics in the labor market?

Let's begin with an unadjusted comparison of log wages between black and white men. (Other groups were omitted from this extract.)

In [14]:
reg1 <- lm(loghourlywage ~ black,
           data = wage2)
summary(reg1)


Call:
lm(formula = loghourlywage ~ black, data = wage2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.20611 -0.27901  0.03483  0.31195  1.59182 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.65272    0.01564 105.642  < 2e-16 ***
black       -0.23745    0.04367  -5.438  6.9e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4466 on 933 degrees of freedom
Multiple R-squared:  0.03072,	Adjusted R-squared:  0.02968 
F-statistic: 29.57 on 1 and 933 DF,  p-value: 6.9e-08


<hr>

In words: Black men's wages were about 0.237 log points lower than white men's wages in this dataset. That is roughly 21 percent lower, since $e^{-0.237} - 1 \approx -0.21$.
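Why can we read the coefficient as a gap between groups? With a single 0/1 regressor, the OLS slope is exactly the difference in group means of the outcome. Here is a minimal sketch on simulated data; the indicator <tt>d</tt>, the outcome <tt>y</tt>, and all the numbers are hypothetical and only loosely mimic the regression above.

```r
set.seed(20)  # arbitrary seed for reproducibility

# Hypothetical simulated data: d is a 0/1 group indicator, y is a log wage
d <- rbinom(200, size = 1, prob = 0.3)
y <- 1.6 - 0.2 * d + rnorm(200, sd = 0.4)

fit <- lm(y ~ d)

# Difference in group means of y
mean_gap <- mean(y[d == 1]) - mean(y[d == 0])

# These two numbers agree up to floating-point error
print(unname(coef(fit)["d"]))
print(mean_gap)
```

The intercept, likewise, is just the mean of <tt>y</tt> in the group with <tt>d</tt> equal to 0.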

Let's now control for education and years of experience:

In [17]:
reg2 <- lm(loghourlywage ~ black + educ + exper + expersq,
           data = wage2)
summary(reg2)


Call:
lm(formula = loghourlywage ~ black + educ + exper + expersq, 
    data = wage2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.05102 -0.26087  0.03652  0.29009  1.63980 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.5059786  0.1381430   3.663 0.000264 ***
black       -0.1736897  0.0425699  -4.080 4.89e-05 ***
educ         0.0661956  0.0073160   9.048  < 2e-16 ***
exper        0.0231237  0.0147436   1.568 0.117129    
expersq     -0.0001334  0.0006173  -0.216 0.828953    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.428 on 930 degrees of freedom
Multiple R-squared:  0.1127,	Adjusted R-squared:  0.1089 
F-statistic: 29.53 on 4 and 930 DF,  p-value: < 2.2e-16


<hr>

Once we control for years of education and a quadratic in years of labor market experience, the wage penalty for Black men falls in magnitude to 0.174 log points, or roughly 16 percent.

Note that the quadratic term is negative, which is what one would like to see: a wage that rises with experience or age, but at a decreasing rate, until reaching some maximum. Here, the maximum is reached at this level ("the x-coordinate of the vertex," from algebra or calculus):

In [20]:
coefs_reg2 <- coef(reg2)
(maxexper_reg2 <- -1*coefs_reg2["exper"]/(2*coefs_reg2["expersq"]))
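To put a number on that vertex without rerunning the regression, here is a minimal sketch that plugs in the point estimates as printed in the <tt>summary(reg2)</tt> output. The variable names are my own. Keep in mind that neither experience term is individually significant in that output, so the peak is very imprecisely estimated.

```r
# Point estimates copied from the summary(reg2) output above
b_exper   <- 0.0231237
b_expersq <- -0.0001334

# Setting the derivative b_exper + 2 * b_expersq * x to zero gives the peak
peak_exper <- -b_exper / (2 * b_expersq)
print(peak_exper)

# Marginal effect of one more year of experience at 10 and 15 years
marginal <- b_exper + 2 * b_expersq * c(10, 15)
print(marginal)
```

With these estimates the peak lands at roughly 87 years of experience, far outside anything these men, aged 28-38, could have accumulated. Within the range of the data, the fitted experience profile is close to linear, at about 2 percent per year.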

Finally, let's also control for the IQ score:

In [21]:
reg3 <- lm(loghourlywage ~ black + educ + exper + expersq
                           + IQ,
           data = wage2)
summary(reg3)


Call:
lm(formula = loghourlywage ~ black + educ + exper + expersq + 
    IQ, data = wage2)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0924 -0.2632  0.0344  0.2928  1.6163 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.2470939  0.1538850   1.606 0.108679    
black       -0.1149684  0.0451380  -2.547 0.011024 *  
educ         0.0526338  0.0081317   6.473 1.56e-10 ***
exper        0.0233206  0.0146432   1.593 0.111593    
expersq     -0.0001436  0.0006131  -0.234 0.814860    
IQ           0.0042780  0.0011516   3.715 0.000215 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4251 on 929 degrees of freedom
Multiple R-squared:  0.1257,	Adjusted R-squared:  0.121 
F-statistic: 26.71 on 5 and 929 DF,  p-value: < 2.2e-16


<hr>

In this last step, we see the magnitude of the wage penalty felt by Black men fall to 0.115 log points, or roughly 11 percent.
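A coefficient in a log wage regression is measured in log points, which only approximates a percent difference; the exact percent gap implied by a coefficient $b$ is $100 \times (e^{b} - 1)$. Here is a minimal sketch of that conversion for the three coefficients on <tt>black</tt>, copied by hand from the outputs above. The names <tt>b_black</tt> and <tt>pct_exact</tt> are my own.

```r
# Coefficients on black, copied from the three summary() outputs above
b_black <- c(reg1 = -0.23745, reg2 = -0.1736897, reg3 = -0.1149684)

# Exact percent difference implied by a log-point coefficient b
pct_exact <- 100 * (exp(b_black) - 1)

# Side by side: the log-point approximation (100 * b) and the exact percent gap
print(round(cbind(log_points = 100 * b_black, pct_exact = pct_exact), 1))
```

The approximation gets better as the coefficient shrinks toward zero, which is why the two columns are closest for the last regression.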

This is a simple but useful approach to probing wage inequality across groups. Here, our results suggest that discrimination and other factors are responsible for a Black wage penalty of about 11 to 12 percent. That's pretty big: at the estimated return of about 5 percent per year of education, it amounts to about two years' worth of schooling.

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>