# Three-stage Least Squares (3SLS)

This example demonstrates how a system of simultaneous equations can be jointly estimated using three-stage least squares (3SLS).  The simultaneous equations model the wage and number of hours worked.  The two equations are 

$$
\begin{eqnarray}
hours & = & \beta_0 + \beta_1 \ln(wage) + \beta_2 educ + \beta_3 age + \beta_4 kidslt6 + \beta_5 nwifeinc + \epsilon^h_i 
\\
\ln(wage) & = & \gamma_0 + \gamma_1 hours + \gamma_2 educ + \gamma_3 exper + \gamma_4 exper^2 + \epsilon^w_i 
\end{eqnarray}
$$

Each equation has a single endogenous regressor: $\ln(wage)$ in the hours equation and $hours$ in the wage equation.  The instruments for each endogenous variable are the exogenous regressors that appear in the other equation but not in its own.
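The exclusion logic can be checked mechanically. The following is a minimal sketch, not part of `linearmodels`, that counts the excluded exogenous variables for each equation and checks the order condition (at least as many excluded instruments as endogenous regressors). The names `system`, `excluded_instruments` and `order_condition_holds` are hypothetical helpers, and the variable names are the column names used by the formulas in this example (`lwage` for $\ln(wage)$, `expersq` for squared experience).

```python
# Included exogenous regressors ("exog") and endogenous regressors ("endog")
# for each equation, written with the data set's column names.
system = {
    "hours": {"exog": {"educ", "age", "kidslt6", "nwifeinc"}, "endog": {"lwage"}},
    "lwage": {"exog": {"educ", "exper", "expersq"}, "endog": {"hours"}},
}


def excluded_instruments(system, name):
    """Exogenous variables used somewhere in the system but not in equation `name`."""
    all_exog = set().union(*(eq["exog"] for eq in system.values()))
    return all_exog - system[name]["exog"]


def order_condition_holds(system, name):
    """True when there are at least as many excluded instruments as endogenous regressors."""
    return len(excluded_instruments(system, name)) >= len(system[name]["endog"])


for name in system:
    print(name, sorted(excluded_instruments(system, name)), order_condition_holds(system, name))
```

The order condition is necessary but not sufficient for identification, so this check only rules out equations that cannot be identified.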

## Data

The data set is the MROZ data set from Wooldridge (2002).

In [1]:
from linearmodels.datasets import mroz
data = mroz.load()

Here the relevant variables are selected and missing observations are dropped to avoid warnings.

In [2]:
data = data[["hours","educ","age","kidslt6","nwifeinc","lwage","exper","expersq"]]
data = data.dropna()

The main models are imported:

* `IV2SLS` - single equation 2-stage least squares
* `IV3SLS` - system estimation using instrumental variables
* `SUR` - system estimation without endogenous variables
* `IVSystemGMM` - system estimation using GMM


In [3]:
from linearmodels import IV2SLS, IV3SLS, SUR, IVSystemGMM

## Formulas

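These examples use the formula interface.  This is usually simpler when models have exogenous regressors, endogenous regressors and instruments.  The syntax is the same as in the 2SLS models: exogenous regressors are listed directly, and the endogenous regressors and their instruments are placed in square brackets as `[endog ~ instruments]`.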
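When the variable lists are held in Python objects, the formula string can be assembled rather than typed. This is a small sketch under the assumption that plain string concatenation is all that is needed; `iv_formula` is a hypothetical helper, not a `linearmodels` function.

```python
def iv_formula(dependent, exog, endog, instruments):
    """Build 'dep ~ exog + [endog ~ instruments]' from lists of column names."""
    bracket = "[" + " + ".join(endog) + " ~ " + " + ".join(instruments) + "]"
    return dependent + " ~ " + " + ".join(list(exog) + [bracket])


hours_formula = iv_formula(
    "hours",
    exog=["educ", "age", "kidslt6", "nwifeinc"],
    endog=["lwage"],
    instruments=["exper", "expersq"],
)
print(hours_formula)
```

The resulting string has the same form as the formulas passed to `from_formula` in this example.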

In [4]:
hours = "hours ~ educ + age + kidslt6 + nwifeinc + [lwage ~ exper + expersq]"

hours_mod = IV2SLS.from_formula(hours, data)
hours_res = hours_mod.fit(cov_type="unadjusted")
print(hours_res)

                          IV-2SLS Estimation Summary                          
Dep. Variable:                  hours   R-squared:                      0.1903
Estimator:                    IV-2SLS   Adj. R-squared:                 0.1807
No. Observations:                 428   F-statistic:                    399.30
Date:                Mon, Dec 14 2020   P-value (F-stat)                0.0000
Time:                        16:47:04   Distribution:                  chi2(5)
Cov. Estimator:            unadjusted                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ          -99.299     48.997    -2.0266     0.0427     -195.33     -3.2666
age            19.429     6.2770     3.0952     0.00

The $\ln$ wage model can be specified and estimated in the same way:

In [5]:
lwage = "lwage ~ educ + exper + expersq + [hours ~ age + kidslt6 + nwifeinc]"

lwage_mod = IV2SLS.from_formula(lwage, data)
lwage_res = lwage_mod.fit(cov_type="unadjusted")
print(lwage_res)

                          IV-2SLS Estimation Summary                          
Dep. Variable:                  lwage   R-squared:                      0.7582
Estimator:                    IV-2SLS   Adj. R-squared:                 0.7559
No. Observations:                 428   F-statistic:                    1362.4
Date:                Mon, Dec 14 2020   P-value (F-stat)                0.0000
Time:                        16:47:04   Distribution:                  chi2(4)
Cov. Estimator:            unadjusted                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ           0.0875     0.0162     5.3892     0.0000      0.0557      0.1193
exper          0.0524     0.0299     1.7501     0.08

A system can be specified using a dictionary of formulas.  The dictionary keys are used as equation labels.  Aside from this simple change, the syntax is identical.

Here the model is estimated using `method="ols"`, which estimates the two equations jointly but skips the GLS step, so the parameter estimates are identical to those from the separate 2SLS regressions above.

In [6]:
equations = dict(hours=hours, lwage=lwage)
system_2sls = IV3SLS.from_formula(equations, data)
system_2sls_res = system_2sls.fit(method="ols", cov_type="unadjusted")
print(system_2sls_res)

                           System OLS Estimation Summary                           
Estimator:                        OLS   Overall R-squared:                   0.1903
No. Equations.:                     2   McElroy's R-squared:                 0.1276
No. Observations:                 428   Judge's (OLS) R-squared:            -2.0961
Date:                Mon, Dec 14 2020   Berndt's R-squared:                 -0.7279
Time:                        16:47:05   Dhrymes's R-squared:                 0.1903
                                        Cov. Estimator:                  unadjusted
                                        Num. Constraints:                      None
                  Equation: hours, Dependent Variable: hours                  
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ          -99.299     48.997    -2.0266     0.0427     -195.33     -3.2666
age         

Using `method="gls"` adds the GLS step, which can be more efficient than equation-by-equation estimation when the errors are correlated across equations.  Here only the first equation changes.  This is due to the structure of the problem.
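For reference, the GLS step can be written in the usual textbook form. This is a sketch of the standard 3SLS estimator, not a statement of how `linearmodels` computes it internally; the symbols $\hat{X}$, $\hat{\Sigma}$ and $N$ are notation introduced here. Stack the two equations as $y = X\beta + \epsilon$ with $X$ block diagonal, let $\hat{X}$ contain the first-stage fitted regressors, and let $\hat{\Sigma}$ be the $2 \times 2$ covariance of the 2SLS residuals. Then

$$
\hat{\beta}_{3SLS} = \left[\hat{X}'\left(\hat{\Sigma}^{-1} \otimes I_N\right)\hat{X}\right]^{-1} \hat{X}'\left(\hat{\Sigma}^{-1} \otimes I_N\right) y
$$

When $\hat{\Sigma}$ is diagonal, the weights cancel equation by equation and the expression reduces to separate 2SLS estimates.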

In [7]:
equations = dict(hours=hours, lwage=lwage)
system_3sls = IV3SLS.from_formula(equations, data)
system_3sls_res = system_3sls.fit(method="gls", cov_type="unadjusted")
print(system_3sls_res)

                           System GLS Estimation Summary                           
Estimator:                        GLS   Overall R-squared:                   0.0120
No. Equations.:                     2   McElroy's R-squared:                 0.0873
No. Observations:                 428   Judge's (OLS) R-squared:            -2.7778
Date:                Mon, Dec 14 2020   Berndt's R-squared:                 -0.7279
Time:                        16:47:05   Dhrymes's R-squared:                 0.0120
                                        Cov. Estimator:                  unadjusted
                                        Num. Constraints:                      None
                  Equation: hours, Dependent Variable: hours                  
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ          -109.90     48.052    -2.2870     0.0222     -204.08     -15.716
age         

## Direct Model Specification

The model can be directly specified using a dictionary of dictionaries where the inner dictionaries contain the 4 components of the model:

* dependent - The dependent variable
* exog - Exogenous regressors
* endog - Endogenous regressors
* instruments - Instrumental variables

The estimates are the same as those from the formula interface.  This interface is more useful for programmatically generating and estimating models.
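As an illustration of programmatic generation, the nested dictionary can be built from a table of column names. This is a sketch only: `build_equations` and `spec` are hypothetical names, and `data` is assumed to be any object that returns a column selection when indexed with a list of names, as a pandas `DataFrame` does.

```python
def build_equations(data, spec):
    """Map {label: {component: [columns]}} to {label: {component: data[columns]}}."""
    return {
        label: {component: data[list(columns)] for component, columns in parts.items()}
        for label, parts in spec.items()
    }


# Column names for the four components of each equation in this example.
spec = {
    "hours": {
        "dependent": ["hours"],
        "exog": ["educ", "age", "kidslt6", "nwifeinc"],
        "endog": ["lwage"],
        "instruments": ["exper", "expersq"],
    },
    "lwage": {
        "dependent": ["lwage"],
        "exog": ["educ", "exper", "expersq"],
        "endog": ["hours"],
        "instruments": ["age", "kidslt6", "nwifeinc"],
    },
}
```

With the data frame loaded earlier, `build_equations(data, spec)` should produce a dictionary with the same four components per equation, which can then be passed to `IV3SLS`.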

In [8]:
hours = {"dependent": data[["hours"]],
         "exog": data[["educ","age","kidslt6","nwifeinc"]],
         "endog": data[["lwage"]],
         "instruments": data[["exper","expersq"]]}

lwage = {"dependent": data[["lwage"]],
         "exog": data[["educ","exper","expersq"]],
         "endog": data[["hours"]],
         "instruments": data[["age","kidslt6","nwifeinc"]]}

equations = dict(hours=hours, lwage=lwage)
system_3sls = IV3SLS(equations)
system_3sls_res = system_3sls.fit(cov_type="unadjusted")
print(system_3sls_res)

                           System GLS Estimation Summary                           
Estimator:                        GLS   Overall R-squared:                   0.0120
No. Equations.:                     2   McElroy's R-squared:                 0.0873
No. Observations:                 428   Judge's (OLS) R-squared:            -2.7778
Date:                Mon, Dec 14 2020   Berndt's R-squared:                 -0.7279
Time:                        16:47:05   Dhrymes's R-squared:                 0.0120
                                        Cov. Estimator:                  unadjusted
                                        Num. Constraints:                      None
                  Equation: hours, Dependent Variable: hours                  
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ          -109.90     48.052    -2.2870     0.0222     -204.08     -15.716
age         

# System GMM Estimation

System GMM is an alternative to 3SLS estimation.  It is the natural extension of GMM estimation of single-equation IV models to systems of equations.  It makes weaker assumptions about instruments than 3SLS does.  In particular, instruments are assumed exogenous on an equation-by-equation basis, whereas 3SLS assumes that all instruments are exogenous in all equations.
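To see what the two assumptions mean for this particular system, the instrument set of each equation (its exogenous regressors plus its excluded instruments) can be compared with the pooled set. This is an illustrative sketch in plain Python; `instruments_by_equation` and `pooled` are names introduced here and say nothing about the library's internals.

```python
# Full instrument set per equation: included exogenous regressors
# together with the instruments listed in the square brackets.
instruments_by_equation = {
    "hours": {"educ", "age", "kidslt6", "nwifeinc"} | {"exper", "expersq"},
    "lwage": {"educ", "exper", "expersq"} | {"age", "kidslt6", "nwifeinc"},
}

# Set that would be used if every instrument were applied to every equation.
pooled = set().union(*instruments_by_equation.values())

for name, own in instruments_by_equation.items():
    unused = sorted(pooled - own)
    print(name, "instruments from other equations not used here:", unused)
```

In this example both equations end up with the same six instruments, so the distinction between equation-specific and common instruments does not bind. It matters in systems where an instrument is valid for one equation but not for another.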

The system GMM estimator is similar to the 3SLS estimator except that it requires a choice of moment weighting estimator.  Two families of options are available: `"unadjusted"` or `"homoskedastic"`, which assume that residuals are conditionally homoskedastic, and `"robust"` or `"heteroskedastic"`, which allow for conditional heteroskedasticity.
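The role of the weighting estimator can be made concrete with the textbook form of the objective. This is a sketch in notation introduced here ($z_{ji}$ is the instrument vector and $\epsilon_{ji}$ the residual of equation $j$ for observation $i$); the library may apply centering or degree-of-freedom adjustments that are not shown.

$$
\hat{\beta} = \arg\min_{\beta} \; \bar{g}(\beta)' \hat{S}^{-1} \bar{g}(\beta), \qquad
\bar{g}(\beta) = \frac{1}{N}\sum_{i=1}^{N} g_i(\beta), \qquad
g_i(\beta) = \begin{bmatrix} z_{1i}\,\epsilon_{1i}(\beta) \\ z_{2i}\,\epsilon_{2i}(\beta) \end{bmatrix}
$$

Under the homoskedastic choice, block $(j,k)$ of $\hat{S}$ is $\hat{\sigma}_{jk} \, N^{-1}\sum_{i} z_{ji} z_{ki}'$. Under the robust choice, $\hat{S} = N^{-1}\sum_{i} g_i(\hat{\beta})\, g_i(\hat{\beta})'$, where $\hat{\beta}$ is a preliminary estimate.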

The system GMM estimator also supports iterated estimation, where the weighting matrix and the parameter estimates are updated repeatedly until convergence.

Here the examples make use of the same data as in the 3SLS example and only use the formula interface. The default uses 2-step (efficient) GMM.

In [9]:
equations = dict(hours="hours ~ educ + age + kidslt6 + nwifeinc + [lwage ~ exper + expersq]", 
                 lwage="lwage ~ educ + exper + expersq + [hours ~ age + kidslt6 + nwifeinc]")
system_gmm = IVSystemGMM.from_formula(equations, data, weight_type="unadjusted")
system_gmm_res = system_gmm.fit(cov_type="unadjusted")
print(system_gmm_res)

                    System 2-Step System GMM Estimation Summary                    
Estimator:          2-Step System GMM   Overall R-squared:                   0.0121
No. Equations.:                     2   McElroy's R-squared:                 0.0871
No. Observations:                 428   Judge's (OLS) R-squared:            -2.7776
Date:                Mon, Dec 14 2020   Berndt's R-squared:                 -0.7268
Time:                        16:47:05   Dhrymes's R-squared:                 0.0121
                                        Cov. Estimator:                  unadjusted
                                        Num. Constraints:                      None
                  Equation: hours, Dependent Variable: hours                  
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ          -109.89     48.038    -2.2876     0.0222     -204.05     -15.741
age         

Robust weighting can be used by setting `weight_type="robust"`.  The maximum number of iterations can be set using `iter_limit`.  Overall the parameters do not meaningfully change.

In [10]:
system_gmm = IVSystemGMM.from_formula(equations, data, weight_type="robust")
system_gmm_res = system_gmm.fit(cov_type="robust", iter_limit=100)
print("Number of iterations: " + str(system_gmm_res.iterations))
print(system_gmm_res)

Number of iterations: 20


                    System Iterative System GMM Estimation Summary                   
Estimator:         Iterative System GMM   Overall R-squared:                  -0.0345
No. Equations.:                       2   McElroy's R-squared:                -0.2256
No. Observations:                   428   Judge's (OLS) R-squared:            -2.9557
Date:                  Mon, Dec 14 2020   Berndt's R-squared:                 -2.0361
Time:                          16:47:05   Dhrymes's R-squared:                -0.0345
                                          Cov. Estimator:                      robust
                                          Num. Constraints:                      None
                  Equation: hours, Dependent Variable: hours                  
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
educ          -118.31     57.508    -2.0572     0.0397     -231.02     -5.5