# Mroz 1987: Regression and Instrumental Variables

Mroz (1987) explored the (mis)specification of statistical models using 1975 labor data on married women. These data are used for a number of examples in the book *Econometric Analysis of Cross Section and Panel Data* by Jeffrey Wooldridge. Some of those examples are replicated here.

## Setup

In [1]:
import numpy as np
import scipy as sp
import pandas as pd

import delicatessen as deli
from delicatessen import MEstimator
from delicatessen.estimating_equations import ee_2sls, ee_regression

print("Versions")
print("NumPy:        ", np.__version__)
print("SciPy:        ", sp.__version__)
print("pandas:       ", pd.__version__)
print("Delicatessen: ", deli.__version__)

Versions
NumPy:         2.3.5
SciPy:         1.16.3
pandas:        2.3.3
Delicatessen:  4.1


In [2]:
d = pd.read_csv('data/mroz.csv').dropna()
d['intercept'] = 1

## Chapter 4: The Single-Equation Linear Model and OLS Estimation 

Here, a simple model for the log-transformed wage (`lwage`) is fit as a function of labor market experience (`exper`) and its square (`expersq`), years of schooling (`educ`), age (`age`), number of kids younger than 6 years old (`kidslt6`), and number of kids 6-18 years old (`kidsge6`). This model is easily fit using the built-in `ee_regression` functionality.

In [3]:
design_matrix = ['intercept', 'exper', 'expersq', 'educ', 'age', 'kidslt6', 'kidsge6']

In [4]:
def psi_lm(theta):
    return ee_regression(theta, 
                         X=d[design_matrix], 
                         y=d['lwage'], 
                         model='linear')  

In [5]:
estr = MEstimator(psi_lm, init=[0., ]*7)
estr.estimate()

In [6]:
r = pd.DataFrame()
r['label'] = design_matrix
r['Est'] = estr.theta
r['SE'] = np.diag(estr.variance)**0.5
r.set_index('label').round(3)

label,Est,SE
intercept,-0.421,0.316
exper,0.04,0.015
expersq,-0.001,0.0
educ,0.108,0.014
age,-0.001,0.006
kidslt6,-0.061,0.105
kidsge6,-0.015,0.029


These results match those provided in the book. Note that the standard error (SE) reported here corresponds to the heteroskedasticity-robust standard error reported in the book.
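To make the robust SE concrete, the following is a minimal sketch of ordinary least squares with a sandwich (HC0-type) variance, written directly in NumPy. It uses simulated data rather than the Mroz data, and all variable names are illustrative assumptions. It is not how `delicatessen` computes the variance internally; it only shows the bread-meat-bread form that a sandwich variance takes for linear regression when no finite-sample correction is applied.

```python
import numpy as np

# Simulated data for illustration only (not the Mroz data)
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
# The error scale depends on x, so the errors are heteroskedastic
y = 1.0 + 0.5 * x + rng.normal(size=n) * (1.0 + 0.5 * np.abs(x))

# OLS point estimates solve sum_i x_i * (y_i - x_i' beta) = 0
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta

# Sandwich variance: bread^{-1} @ meat @ bread^{-1}
bread = X.T @ X
meat = (X * resid[:, None] ** 2).T @ X
bread_inv = np.linalg.inv(bread)
cov_robust = bread_inv @ meat @ bread_inv
se_robust = np.sqrt(np.diag(cov_robust))

# Classical SE, which assumes a constant error variance, for comparison
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * bread_inv))

print("Est:         ", beta.round(3))
print("Robust SE:   ", se_robust.round(3))
print("Classical SE:", se_classical.round(3))
```

When the errors are heteroskedastic, as in this simulation, the two sets of SEs will generally differ, which is why it matters which one a reference reports.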

## Chapter 5: Instrumental Variables Estimation of Single-Equation Linear Models 

The next chapter considers instrumental variable estimation using two-stage least squares (2SLS). In particular, we are interested in the effect of education (`educ`) on log-transformed wages (`lwage`). Here, we will account for labor market experience (`exper` and `expersq`) in both stages. The instruments in this setting are mother's education (`motheduc`), father's education (`fatheduc`), and husband's education (`huseduc`).

We will apply the 2SLS estimator using the built-in `ee_2sls` function.

In [7]:
def psi_2sls(theta):
    return ee_2sls(theta,
                   y=d['lwage'],
                   A=d['educ'],
                   Z=d[['motheduc', 'fatheduc', 'huseduc']],
                   W=d[['intercept', 'exper', 'expersq']])

In [8]:
init_vals = [0., ] + [0., ]*3 + [0., ]*3*2
estr = MEstimator(psi_2sls, init=init_vals)
estr.estimate()

In [9]:
r = pd.DataFrame()
r['label'] = ['educ', 'intercept', 'exper', 'expersq']
r['Est'] = estr.theta[:4]
r['SE'] = np.diag(estr.variance)[:4]**0.5
r.set_index('label').round(3)

label,Est,SE
educ,0.08,0.022
intercept,-0.187,0.3
exper,0.043,0.015
expersq,-0.001,0.0


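Again, these match the output provided in the book. Note that `ee_2sls` orders its output slightly differently from the book: the coefficient for the action variable (`educ`) comes first, followed by the coefficients for the remaining covariates.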
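For intuition about what the two stages are doing, here is a minimal NumPy sketch of 2SLS with a single action variable. The data are simulated for illustration (not the Mroz data), the variable names are assumptions, and the code is not the implementation behind `ee_2sls`. An unobserved confounder `u` affects both the action and the outcome, so ordinary least squares is biased, whereas 2SLS recovers a value near the true coefficient of 0.3.

```python
import numpy as np

# Simulated data for illustration only (not the Mroz data)
rng = np.random.default_rng(1)
n = 2000
z = rng.normal(size=(n, 2))      # two instruments
u = rng.normal(size=n)           # unobserved confounder
w = np.ones((n, 1))              # exogenous covariates (intercept only)
a = z @ np.array([0.8, 0.5]) + u + rng.normal(size=n)   # action variable
y = 0.5 + 0.3 * a - u + rng.normal(size=n)               # outcome

# Stage 1: regress the action on the instruments and exogenous covariates
ZW = np.column_stack([z, w])
stage1 = np.linalg.lstsq(ZW, a, rcond=None)[0]
a_hat = ZW @ stage1

# Stage 2: regress the outcome on the predicted action and the covariates
# (action first, then covariates, mirroring the output order shown above)
X2 = np.column_stack([a_hat, w])
theta_2sls = np.linalg.lstsq(X2, y, rcond=None)[0]

# Naive OLS on the observed action, for comparison
X = np.column_stack([a, w])
theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print("2SLS:", theta_2sls.round(3))
print("OLS: ", theta_ols.round(3))
```

Note that running the second stage as a plain regression gives valid point estimates but not valid standard errors, because it ignores the estimation of the first stage. Stacking both stages as estimating equations, as done in this notebook, lets the sandwich variance account for that.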

## Example from `OneSampleMR`

As a final use of the Mroz data, we replicate the example with two action variables (education and labor force experience) from the `OneSampleMR` documentation, provided [here](https://remlapmot.github.io/OneSampleMR/articles/f-statistic-comparison.html). In this case, age and the two number-of-kids variables serve as the instruments.

Currently, `ee_2sls` does not allow for multiple action variables. Therefore, we instead code up the 2SLS estimator using the basic regression functions. Briefly, we fit two models in the first stage (one for `educ` and one for `exper`). Using the predicted values from these models, we then fit the second-stage model for `lwage`.
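Written out, this approach corresponds to a set of stacked estimating equations. The notation here is introduced for illustration and is not taken from the book or the `OneSampleMR` documentation: $Y_i$ is `lwage`, $A_i$ is `educ`, $B_i$ is `exper`, $Z_i$ is the instrument design vector (including the intercept), $\alpha$ and $\beta$ are the first-stage parameters, and $\gamma$ holds the second-stage parameters.

```latex
\sum_{i=1}^{n}
\begin{bmatrix}
\left( Y_i - \hat{X}_i(\alpha, \beta)^T \gamma \right) \hat{X}_i(\alpha, \beta) \\
\left( A_i - Z_i^T \alpha \right) Z_i \\
\left( B_i - Z_i^T \beta \right) Z_i
\end{bmatrix}
= 0,
\qquad
\hat{X}_i(\alpha, \beta) = \left( 1, \; Z_i^T \alpha, \; Z_i^T \beta \right)^T
```

The parameter vector is $\theta = (\gamma, \alpha, \beta)$, so the second-stage parameters come first. Because all three sets of equations are solved jointly, the variance for $\gamma$ reflects the uncertainty in the first-stage estimates.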

In [10]:
# Instrument design matrix
Z = d[['intercept', 'age', 'kidslt6', 'kidsge6']]

In [11]:
def psi(theta):
    gamma = theta[:3]
    alpha = theta[3:3+4]
    beta = theta[3+4:]
    
    # First-stage regression for education
    ee_s1a = ee_regression(alpha, X=Z, y=d['educ'], 
                           model='linear')  
    a_hat = np.dot(Z, alpha)

    # First-stage regression for experience
    ee_s1b = ee_regression(beta, X=Z, y=d['exper'], 
                           model='linear')
    b_hat = np.dot(Z, beta)

    # Second-stage regression for log(wage)
    Xhat = np.c_[np.asarray(d['intercept']), a_hat, b_hat]
    ee_2s = ee_regression(gamma, X=Xhat, y=d['lwage'], 
                           model='linear')

    return np.vstack([ee_2s, ee_s1a, ee_s1b])

In [12]:
init_vals = [0., ]*3 + [0., ]*4*2
estr = MEstimator(psi, init=init_vals)
estr.estimate()

In [13]:
r = pd.DataFrame()
# Labels follow the column order of Xhat: intercept, educ, exper
r['label'] = ['intercept', 'educ', 'exper']
r['Est'] = estr.theta[:3]
r['SE'] = np.diag(estr.variance)[:3]**0.5
r.set_index('label').round(3)

label,Est,SE
intercept,-0.36,1.105
educ,0.106,0.088
exper,0.016,0.008


The point estimates match those provided in the documentation. However, note that the SEs differ because a different variance estimator is used here (the sandwich variance estimator).

## References

Mroz, T. A. (1987). The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions. *Econometrica* 55(4), 765-799.

Wooldridge, J. M. (2010). *Econometric Analysis of Cross Section and Panel Data*. MIT Press.