Following on from the example in Section 6.3, the file hwage.dat contains another subset of the data used by labor economist Tom Mroz. The variables with which we are concerned are

- $HW=$ husband’s wage in 2006 dollars
- $HE=$ husband’s educational attainment in years
- $HA=$ husband's age
- $CIT=$ a variable equal to one if living in a large city, otherwise zero 

Estimate the model
$$ HW = \beta_1 + \beta_2 HE + \beta_3 HA + e$$
(a) What effects do changes in the level of education and age have on wages?

In [1]:
clear all
use http://www.principlesofeconometrics.com/poe4/data/stata/hwage.dta

reg hw he ha





      Source |       SS           df       MS      Number of obs   =       753
-------------+----------------------------------   F(2, 750)       =     74.37
       Model |  31825.8982         2  15912.9491   Prob > F        =    0.0000
    Residual |   160479.81       750   213.97308   R-squared       =    0.1655
-------------+----------------------------------   Adj R-squared   =    0.1633
       Total |  192305.708       752  255.725676   Root MSE        =    14.628

------------------------------------------------------------------------------
          hw |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          he |   2.193289   .1800506    12.18   0.000     1.839826    2.546752
          ha |   .1996641   .0674912     2.96   0.003       .06717    .3321583
       _cons |  -8.123578   4.158325    -1.95   0.051    -16.28692     .039763
------------------------------------------------------------------------------

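This suggests that, holding age constant, each additional year of educational attainment is associated with a wage increase of about 2.19 dollars, and that, holding education constant, each additional year of age is associated with a wage increase of about 0.20 dollars.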

(b) Does RESET suggest that the model in part (a) is adequate?

In [2]:
estat ovtest


Ramsey RESET test using powers of the fitted values of hw
       Ho:  model has no omitted variables
                 F(3, 747) =      6.65
                  Prob > F =      0.0002


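No, it suggests the opposite. With F(3, 747) = 6.65 and a p-value of 0.0002, we reject the null hypothesis of no omitted variables at any conventional significance level, so the model in part (a) appears to be misspecified (omitted variables or an incorrect functional form).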
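For reference, the RESET statistic can be reproduced by hand. The sketch below assumes that the default `estat ovtest` adds the second, third, and fourth powers of the fitted values to the original regression and tests them jointly, which is consistent with the F(3, 747) degrees of freedom reported above. The variable names `yhat_a`, `yhat_a2`, `yhat_a3`, and `yhat_a4` are illustrative and not part of the original notebook.

```stata
* Assumes hwage.dta is still loaded from the first cell
quietly regress hw he ha
predict double yhat_a, xb               // fitted values from the model in part (a)
generate double yhat_a2 = yhat_a^2
generate double yhat_a3 = yhat_a^3
generate double yhat_a4 = yhat_a^4

* Augmented regression, then a joint F test on the added powers
regress hw he ha yhat_a2 yhat_a3 yhat_a4
test yhat_a2 yhat_a3 yhat_a4
```

Because the powers of the fitted values are highly collinear, small numerical differences from the built-in command are possible.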

(c) Add the variables $HE^2$ and $HA^2$ to the original equation and re-estimate it. Describe the effect that education and age have on wages in this newly estimated model.

In [3]:
gen he2 = he^2
gen ha2 = ha^2
reg hw he ha he2 ha2





      Source |       SS           df       MS      Number of obs   =       753
-------------+----------------------------------   F(4, 748)       =     44.37
       Model |  36876.7034         4  9219.17584   Prob > F        =    0.0000
    Residual |  155429.005       748  207.792787   R-squared       =    0.1918
-------------+----------------------------------   Adj R-squared   =    0.1874
       Total |  192305.708       752  255.725676   Root MSE        =    14.415

------------------------------------------------------------------------------
          hw |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          he |  -1.457971   1.122786    -1.30   0.195    -3.662157    .7462154
          ha |   2.889541   .7328868     3.94   0.000     1.450781    4.328301
         he2 |   .1511426   .0458277     3.30   0.001     .0611762    .2411089
         ha2 |  -.0301212   .0081339    -3.70  

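In this model the coefficient on HE alone is not statistically significant (p = 0.195), but the coefficient on $HE^2$ is (p = 0.001); the coefficients on HA and $HA^2$ are both significant. The marginal effect of education, $\beta_2 + 2\beta_4 HE$, rises with education and is positive once HE exceeds about 4.8 years. The marginal effect of age, $\beta_3 + 2\beta_5 HA$, falls with age and turns negative at about age 48.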
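Because the model is quadratic in both variables, each marginal effect changes sign where $\beta_k + 2\beta_m x = 0$. A minimal sketch of how those turning points, with delta-method standard errors, could be obtained is given below; it assumes the regression from this part is the most recent estimation, and the labels `age_peak` and `educ_min` are illustrative names only.

```stata
* Run immediately after: reg hw he ha he2 ha2
* Turning point of a quadratic b1*x + b2*x^2 is x = -b1 / (2*b2)
nlcom (age_peak: -_b[ha] / (2*_b[ha2])) (educ_min: -_b[he] / (2*_b[he2]))
```

`nlcom` leaves the stored estimation results in place, so a following `estat ovtest` still refers to the same regression.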

(d) Does RESET suggest that the model in part (c) is adequate?

In [4]:
estat ovtest


Ramsey RESET test using powers of the fitted values of hw
       Ho:  model has no omitted variables
                 F(3, 745) =      1.33
                  Prob > F =      0.2627


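Yes. With F(3, 745) = 1.33 and a p-value of 0.2627, we fail to reject the null hypothesis of no omitted variables, so RESET gives no evidence that the model in part (c) is misspecified.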

(e) Reestimate the model in part (c) with the variable CIT included. What can you say about the level of wages in large cities relative to outside those cities?

In [5]:
reg hw he ha he2 ha2 cit


      Source |       SS           df       MS      Number of obs   =       753
-------------+----------------------------------   F(5, 747)       =     48.30
       Model |  46984.2168         5  9396.84337   Prob > F        =    0.0000
    Residual |  145321.492       747  194.540149   R-squared       =    0.2443
-------------+----------------------------------   Adj R-squared   =    0.2393
       Total |  192305.708       752  255.725676   Root MSE        =    13.948

------------------------------------------------------------------------------
          hw |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          he |  -2.207574   1.091357    -2.02   0.043    -4.350066   -.0650814
          ha |   2.621256   .7101069     3.69   0.000     1.227214    4.015299
         he2 |   .1687597   .0444096     3.80   0.000     .0815773    .2559421
         ha2 |  -.0277679    .007877    -3.53   0.

Being in a city is correlated with a large boost in wages.

(f) Do you think CIT should be included in the equation?

Yes, because the adjusted R-squared is higher with it included (0.2393, compared with 0.1874 for the model in part (c)).

In [6]:
estat ovtest


Ramsey RESET test using powers of the fitted values of hw
       Ho:  model has no omitted variables
                 F(3, 744) =      0.55
                  Prob > F =      0.6511


RESET also gives no evidence of misspecification for the model with CIT included (F(3, 744) = 0.55, p-value = 0.6511).

(g) For both the model estimated in part (c) and the model estimated in part (e), evaluate the following four derivatives:
$$ \frac{\partial HW}{\partial HE} \text{ for } HE = 6 \text{ and } HE = 15$$
$$ \frac{\partial HW}{\partial HA} \text{ for } HA = 35 \text{ and } HA = 50$$

$$ HW = \beta_1 + \beta_2 HE + \beta_3 HA + \beta_4HE^2 + \beta_5 HA^2 + e$$

$$\frac{\partial HW}{\partial HE} = \beta_2 + 2\beta_4 HE \Rightarrow \frac{\partial HW}{\partial HE}(6) = 0.3557402, \frac{\partial HW}{\partial HE}(15)  = 3.076307$$
$$\frac{\partial HW}{\partial HA} = \beta_3 + 2\beta_5 HA \Rightarrow \frac{\partial HW}{\partial HA}(35) = 0.781057,  \frac{\partial HW}{\partial HA}(50) = -0.122579 $$

$$ HW = \beta_1 + \beta_2 HE + \beta_3 HA + \beta_4HE^2 + \beta_5 HA^2 + \beta_6 CIT + e$$

$$\frac{\partial HW}{\partial HE} = \beta_2 + 2\beta_4 HE \Rightarrow \frac{\partial HW}{\partial HE}(6) = -0.1824576, \frac{\partial HW}{\partial HE}(15)  = 2.855217$$
$$\frac{\partial HW}{\partial HA} = \beta_3 + 2\beta_5 HA \Rightarrow \frac{\partial HW}{\partial HA}(35) = 0.677503,  \frac{\partial HW}{\partial HA}(50) = -0.155534 $$
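Rather than evaluating the derivatives by hand, one option is to let Stata compute each linear combination of coefficients with `lincom`, which also reports a standard error and confidence interval. This is only a sketch: it assumes hwage.dta is still in memory and that `he2` and `ha2` were generated as in part (c).

```stata
* Model (c): quadratic in education and age
quietly regress hw he ha he2 ha2
lincom he + 12*he2       // dHW/dHE at HE = 6  (2 * 6  = 12)
lincom he + 30*he2       // dHW/dHE at HE = 15 (2 * 15 = 30)
lincom ha + 70*ha2       // dHW/dHA at HA = 35 (2 * 35 = 70)
lincom ha + 100*ha2      // dHW/dHA at HA = 50 (2 * 50 = 100)

* Model (e): the same derivatives with CIT included
quietly regress hw he ha he2 ha2 cit
lincom he + 12*he2
lincom he + 30*he2
lincom ha + 70*ha2
lincom ha + 100*ha2
```

The point estimates are the same quantities as in the formulas above; the added value is the inference on each derivative.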

Does the omission of CIT lead to omitted-variable bias? Can you suggest why?

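A little bit, but not by much. Omitted-variable bias arises only to the extent that the omitted variable is correlated with the included regressors, and I'm assuming that there isn't much of a correlation between age and the probability of living in a large city.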

Write a one paragraph summary of Farrar and Glauber (1967) explaining the multicollinearity problem in terms of estimation and specification.

Multicollinearity is interdependence among the explanatory variables $X$. As this interdependence grows, the correlation matrix $(X^T X)$ approaches singularity and the elements of the inverse matrix $(X^T X)^{-1}$ explode. The variances of the parameter estimates, given by $V(b') = \sigma^2_u (X^T X)^{-1}$, therefore also grow without bound. This means that explained variance can be allocated arbitrarily among the regression coefficients. As for specification, models usually tend to be pared-down versions of the econometrician's more complex mental model, because of limitations (e.g. multicollinearity) in the data. The sample usually contains a limited amount of basic information, which is spread over a larger number of increasingly multicollinear independent variables. This rapidly decreases the stability, and therefore the sample significance, of each independent variable's contribution to explained variance. The econometrician is therefore in a bind: he wants a large number of variables in order to model complex relationships and obtain reliable forecasts, but adding them almost always introduces greater multicollinearity.

In summary, multicollinearity is a statistical rather than a mathematical condition, and hence the task is to determine its severity rather than its existence. What we are trying to do is reduce the gap between the informational requirements of a model and the informational content of the data, and we can do this through a combination of reducing model complexity and increasing information. Ideally, the model is completely specified and the data are internally orthogonal. By locating the interdependence in $X$ and understanding its pattern (factor analysis), instability can be evaluated and corrected. Multicollinearity among non-critical variables can be tolerated, but if critical variables are affected, we need more information in order to provide coefficient estimates (either for the variables directly, or for the members of the set on which they are dependent).

In a statistical model collinearity arises because of poor experimental design, or in the case below, because of data that do not vary enough to permit precise measurement of the parameters. Run the code below and explain the output in each step. Note that the variance inflation factor (VIF) is a function of the R-squared of an auxiliary regression. The rule-of-thumb is that a VIF > 10 indicates weak identification of the corresponding variable's coefficient.

In [7]:
clear all
use https://www.stata.com/data/s4poe4/rice.dta 
summarize
pwcorr





    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        firm |        352        22.5     12.7165          1         44
        year |        352      1993.5    2.294549       1990       1997
        prod |        352    6.466392    5.076672        .09       31.1
        area |        352    2.117528    1.451403         .2          7
       labor |        352    107.2003     76.6456          8        436
-------------+---------------------------------------------------------
        fert |        352    187.0545    168.5852        3.4     1030.9


             |     firm     year     prod     area    labor     fert
-------------+------------------------------------------------------
        firm |   1.0000 
        year |   0.0000   1.0000 
        prod |  -0.2344   0.0183   1.0000 
        area |  -0.2410  -0.0431   0.8876   1.0000 
       labor |  -0.2514  -0.0341   0.8899   0.9192   1

We see a concerning degree of correlation between some of the variables. For example, the correlation between labor and fert is 0.83, between labor and area 0.92, between area and fert 0.84, between prod and area 0.89, between prod and labor 0.89, and between prod and fert 0.82.

In [8]:
gen lprod = log(prod) 
gen larea = log(area) 
gen llabor = log(labor) 
gen lfert = log(fert) 
pwcorr lprod larea llabor lfert //to quickly regenerate correlations, one can use the wildcard * 







             |    lprod    larea   llabor    lfert
-------------+------------------------------------
       lprod |   1.0000 
       larea |   0.8934   1.0000 
      llabor |   0.9004   0.9280   1.0000 
       lfert |   0.8540   0.8520   0.8631   1.0000 


Again, the variables remain highly correlated after taking logs: the pairwise correlations among the regressors larea, llabor, and lfert range from 0.85 to 0.93.

In [9]:
regress lprod larea llabor lfert if year == 1994 


      Source |       SS           df       MS      Number of obs   =        44
-------------+----------------------------------   F(3, 40)        =     92.91
       Model |  27.3469024         3  9.11563413   Prob > F        =    0.0000
    Residual |  3.92452618        40  .098113154   R-squared       =    0.8745
-------------+----------------------------------   Adj R-squared   =    0.8651
       Total |  31.2714286        43  .727242524   Root MSE        =    .31323

------------------------------------------------------------------------------
       lprod |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       larea |   .2106069   .1820736     1.16   0.254    -.1573777    .5785914
      llabor |    .377584   .2550577     1.48   0.147    -.1379068    .8930747
       lfert |   .3433348   .1279984     2.68   0.011     .0846404    .6020292
       _cons |  -1.947286   .7384865    -2.64   0.

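Neither larea nor llabor is statistically significant at the 5% level (p = 0.254 and p = 0.147), but they are probably important: the overall F-statistic of 92.91 shows that the regressors are jointly highly significant. Because of the high degree of multicollinearity, the regression cannot precisely apportion the effect between larea and llabor, which inflates their standard errors.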

In [10]:
estat vif //variance inflation factor 


    Variable |       VIF       1/VIF  
-------------+----------------------
      llabor |     17.73    0.056389
       larea |      9.15    0.109297
       lfert |      7.68    0.130141
-------------+----------------------
    Mean VIF |     11.52


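Yes, there is a lot of multicollinearity. llabor is especially problematic: its VIF of 17.73 exceeds the rule-of-thumb threshold of 10, and the VIF for larea (9.15) is just below it.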
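To make the link between the VIF and the auxiliary regression concrete, the sketch below regresses llabor on the other two regressors over the same 1994 sample and applies $VIF = 1/(1 - R^2)$. It assumes the logged variables from the earlier cell are still in memory; the result should agree with the llabor row of the `estat vif` table.

```stata
* Auxiliary regression: llabor on the remaining regressors, 1994 only
quietly regress llabor larea lfert if year == 1994

* e(r2) holds the R-squared of the most recent regression
display "Auxiliary R-squared = " e(r2)
display "VIF for llabor      = " 1/(1 - e(r2))
```

Note that this replaces the stored estimation results, so the production-function regression has to be re-run before any further post-estimation commands on it.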

In [11]:
constraint 1 larea + llabor + lfert = 1 
cnsreg lprod larea llabor lfert if year ==1994, c(1) 





Constrained linear regression                   Number of obs     =         44
                                                Root MSE          =     0.3134

 ( 1)  larea + llabor + lfert = 1
------------------------------------------------------------------------------
       lprod |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       larea |   .2262278   .1815276     1.25   0.220    -.1403746    .5928303
      llabor |   .4834192      .2332     2.07   0.044     .0124623     .954376
       lfert |    .290353   .1170861     2.48   0.017     .0538929    .5268131
       _cons |  -2.168297   .7064722    -3.07   0.004    -3.595047   -.7415474
------------------------------------------------------------------------------


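We constrain the regression with our own prior knowledge that the coefficients on larea, llabor, and lfert sum to one (constant returns to scale). The constraint adds information, so we obtain more precise estimates than in the previous regression: the standard error on llabor falls from 0.255 to 0.233, and its coefficient is now significant at the 5% level (p = 0.044).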
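Imposing a restriction only helps if the data do not contradict it, so it may be worth testing it first. A minimal sketch, assuming the same 1994 subsample and logged variables as above:

```stata
* Unconstrained regression on the 1994 subsample
quietly regress lprod larea llabor lfert if year == 1994

* F test of the single linear restriction used in constraint 1
test larea + llabor + lfert = 1
```

A large p-value would mean the restriction is compatible with the data; a small one would suggest the constrained estimates are biased.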

In [12]:
reg lprod larea llabor lfert 
estat vif



      Source |       SS           df       MS      Number of obs   =       352
-------------+----------------------------------   F(3, 348)       =    646.51
       Model |  226.084873         3  75.3616243   Prob > F        =    0.0000
    Residual |   40.565356       348  .116567115   R-squared       =    0.8479
-------------+----------------------------------   Adj R-squared   =    0.8466
       Total |  266.650229       351  .759687262   Root MSE        =    .34142

------------------------------------------------------------------------------
       lprod |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       larea |   .3617359   .0639678     5.65   0.000     .2359237    .4875481
      llabor |   .4328479   .0668825     6.47   0.000      .301303    .5643928
       lfert |   .2095023   .0382654     5.47   0.000     .1342417    .2847628
       _cons |  -1.546786   .2556536    -6.05   0

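Just a normal regression, but this time on all 352 observations rather than the 44 from 1994. The regressors are still highly correlated, but the richer dataset carries more information, so the standard errors are much smaller (for llabor, 0.067 compared with 0.255) and all three coefficients are statistically significant.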

In [13]:
cnsreg lprod larea llabor lfert , c(1) 
estat vif



Constrained linear regression                   Number of obs     =        352
                                                Root MSE          =     0.3409

 ( 1)  larea + llabor + lfert = 1
------------------------------------------------------------------------------
       lprod |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       larea |    .359491   .0625303     5.75   0.000     .2365074    .4824745
      llabor |   .4299212   .0645841     6.66   0.000     .3028982    .5569442
       lfert |   .2105878    .037687     5.59   0.000     .1364657      .28471
       _cons |  -1.538065    .250208    -6.15   0.000     -2.03017   -1.045959
------------------------------------------------------------------------------

estat vif not valid


r(321);




