In [40]:
import pandas as pd
import numpy as np 
import statsmodels.api as sm 
from patsy import dmatrices, dmatrix

@author: Yiming Cai

### Question (a)

Use OLS to estimate the parameters of the model
$$logw = \beta_{1} + \beta_{2}educ + \beta_{3}exper + \beta_{4}exper^{2} + \beta_{5}smsa + \beta_{6}south + \epsilon $$
Give an interpretation to the estimated β2 coefficient.

In [2]:
df = pd.read_excel("Test4_data.xls")
df.head()

Unnamed: 0,logw,educ,age,exper,smsa,south,nearc,daded,momed
0,6.306275,7,29,16,1,0,0,9.94,10.25
1,6.175867,12,27,9,1,0,0,8.0,8.0
2,6.580639,12,34,16,1,0,0,14.0,12.0
3,5.521461,11,27,10,1,0,1,11.0,12.0
4,6.591674,12,34,16,1,0,1,8.0,7.0


In [3]:
y, X = dmatrices("logw ~ educ + exper + np.square(exper) + smsa + south", df)

>The OLS estimation result is given as follows: 

In [4]:
mod = sm.OLS(y, X).fit()
print (mod.summary())

                            OLS Regression Results                            
Dep. Variable:                   logw   R-squared:                       0.263
Model:                            OLS   Adj. R-squared:                  0.262
Method:                 Least Squares   F-statistic:                     214.6
Date:                Mon, 14 May 2018   Prob (F-statistic):          3.70e-196
Time:                        20:28:05   Log-Likelihood:                -1365.6
No. Observations:                3010   AIC:                             2743.
Df Residuals:                    3004   BIC:                             2779.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            4.6110      0.068  

### Answer (a) 

> Other things being equal, one additional year of education increases the expected log wage (logw) by 0.0816. Put another way, $$ \log\left(\frac{w_2}{w_1}\right) = 0.0816 \Rightarrow \frac{w_2}{w_1} = e^{0.0816} \approx 1.085 \Rightarrow w_2 = (1+8.5\%)\,w_1, $$ so one additional year of education is associated with an 8.5% increase in the expected wage level.
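
The conversion from a log-point coefficient to a percentage change can be checked numerically. This is a minimal sketch that takes the 0.0816 figure used in the derivation as given; the variable names are illustrative and not part of the notebook.

```python
import math

beta_educ = 0.0816  # coefficient on educ used in the derivation above

# Exact proportional wage change implied by one more year of education
exact_change = math.exp(beta_educ) - 1

# For small coefficients the log-point value itself is a close approximation
approx_change = beta_educ

print(f"exact: {exact_change:.4f}, approximation: {approx_change:.4f}")
```

The gap between the two numbers grows with the size of the coefficient, which is why the exponential form is the safer one to report.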

### Question (b)
 
 OLS may be inconsistent in this case as __educ__ and __exper__ may be endogenous. Give a reason why this may be the case. Also indicate whether the estimate in part (a) is still useful.

### Answer (b)

> educ and exper might be endogenous due to __omitted variables__. For example, an individual's characteristics are likely to influence both educ and exper. Individuals with higher intellectual ability and motivation are likely to obtain more education, i.e., more years of schooling (educ). Likewise, people with a strong work ethic tend to accumulate more working experience. These characteristics are likely to raise the wage level but are not included in the model, so they end up in the error term. Therefore, __educ__ and __exper__ are endogenous, and the estimate in part (a) is inconsistent.
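
The omitted-variable argument can be illustrated with a small simulation. This is a sketch on synthetic data, not the assignment dataset: the variable `ability`, the coefficient values, and the sample size are all assumptions chosen only to make the bias visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

ability = rng.normal(size=n)                   # unobserved by the econometrician
educ = 12 + 2 * ability + rng.normal(size=n)   # schooling rises with ability
true_return = 0.05
logw = 1.0 + true_return * educ + 0.10 * ability + rng.normal(scale=0.3, size=n)

# OLS of logw on a constant and educ only, with ability omitted
X = np.column_stack([np.ones(n), educ])
b_ols = np.linalg.lstsq(X, logw, rcond=None)[0]

print(f"true return: {true_return}, OLS estimate: {b_ols[1]:.3f}")
```

Because `ability` raises both schooling and wages here, the OLS slope absorbs part of its effect and overstates the return to education, no matter how large the sample is.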

### Question (c)
 
 Give a motivation why __age__ and __age2__ can be used as instruments for exper and exper2.

### Answer (c)

> Older people tend to have more working experience, whereas the wage is unlikely to be influenced by age itself. Therefore, $age$ and $age^{2}$ are likely to be correlated with $exper$ and $exper^{2}$ but uncorrelated with the error term ($\epsilon$), which qualifies them as instruments for exper and exper2.
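
The relevance of age can be made concrete using the five rows printed by `df.head()`. In those rows, exper coincides with potential experience, age - educ - 6. Whether this identity holds for the whole dataset is an assumption, since only five rows are visible here. The sketch below checks the displayed rows only.

```python
# (educ, age, exper) for the five rows shown by df.head()
rows = [(7, 29, 16), (12, 27, 9), (12, 34, 16), (11, 27, 10), (12, 34, 16)]

# Check whether exper equals potential experience, age - educ - 6
matches = [exper == age - educ - 6 for educ, age, exper in rows]

print(matches)
```

If the identity does hold throughout, exper is a linear function of age and educ, so age is strongly correlated with exper by construction.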

### Question (d)

Run the first-stage regression for __educ__ for the two-stage least squares estimation of the parameters in the model above when __age, age2, nearc, dadeduc, and momeduc__ are used as additional instruments. What do you conclude about the suitability of these instruments for schooling?


In [22]:
y2, X2 = dmatrices("educ ~ age + np.square(age) + nearc + daded + momed + smsa + south", df)

In [23]:
first_stage_mod = sm.OLS(y2, X2).fit()
print (first_stage_mod.summary())

                            OLS Regression Results                            
Dep. Variable:                   educ   R-squared:                       0.247
Model:                            OLS   Adj. R-squared:                  0.245
Method:                 Least Squares   F-statistic:                     140.4
Date:                Mon, 14 May 2018   Prob (F-statistic):          2.14e-179
Time:                        20:28:44   Log-Likelihood:                -6808.2
No. Observations:                3010   AIC:                         1.363e+04
Df Residuals:                    3002   BIC:                         1.368e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         -5.6524      3.976     -1.

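The above result suggests: 
<ol>
  <li>There are enough instruments: five excluded instruments (age, age2, nearc, dadeduc, momeduc) for three endogenous regressors (educ, exper, exper2).</li>
  <li>The p-values suggest the instruments are correlated with educ.</li>
</ol>
Therefore, these instruments are suitable for schooling in terms of relevance. However, their validity still has to be checked with the Sargan test that follows.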
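
Individual p-values aside, instrument relevance is often summarized by the joint F statistic of the excluded instruments in the first stage, with values above roughly 10 commonly cited as a rule of thumb. The sketch below computes that statistic from restricted and unrestricted sums of squared residuals. It runs on synthetic stand-in data, and the helper `joint_f_stat` and all coefficient values are hypothetical.

```python
import numpy as np

def joint_f_stat(y, X_restricted, X_full):
    """F statistic for H0: the extra columns of X_full have zero coefficients."""
    def ssr(X):
        resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
        return resid @ resid
    n, k_full = X_full.shape
    q = k_full - X_restricted.shape[1]          # number of restrictions tested
    ssr_r, ssr_u = ssr(X_restricted), ssr(X_full)
    return ((ssr_r - ssr_u) / q) / (ssr_u / (n - k_full))

# Synthetic stand-in data (not the Test4 dataset)
rng = np.random.default_rng(1)
n = 2000
exog = rng.integers(0, 2, size=(n, 2)).astype(float)   # plays the role of smsa, south
instr = rng.normal(size=(n, 3))                        # plays the role of excluded instruments
educ = 12 + 0.5 * instr[:, 0] + 0.3 * exog[:, 0] + rng.normal(size=n)

const = np.ones((n, 1))
X_restricted = np.hstack([const, exog])         # included exogenous regressors only
X_full = np.hstack([const, exog, instr])        # plus the excluded instruments

f_stat = joint_f_stat(educ, X_restricted, X_full)
print(f"first-stage F for excluded instruments: {f_stat:.1f}")
```

The restricted model keeps only the exogenous regressors that also appear in the wage equation, so the statistic isolates the explanatory power contributed by the instruments themselves.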

### Question (e)

Estimate the parameters of the model for log wage using two-stage least squares where you correct for the endogeneity of education and experience. Compare your result to the estimate in part (a).

> As suggested by Questions (c) and (d), $age$, $age^{2}$, $nearc$, $dadeduc$, and $momeduc$ can be used as instruments for $educ$, and $age$ and $age^{2}$ serve as instruments for $exper$ and $exper^{2}$, respectively.

In [24]:
y3, X3 = dmatrices("exper ~ age + np.square(age) + nearc + daded + momed + smsa + south ", df )
expr_stage1_mod = sm.OLS(y3, X3).fit()
print (expr_stage1_mod.summary())

                            OLS Regression Results                            
Dep. Variable:                  exper   R-squared:                       0.685
Model:                            OLS   Adj. R-squared:                  0.685
Method:                 Least Squares   F-statistic:                     933.7
Date:                Mon, 14 May 2018   Prob (F-statistic):               0.00
Time:                        20:29:12   Log-Likelihood:                -6808.2
No. Observations:                3010   AIC:                         1.363e+04
Df Residuals:                    3002   BIC:                         1.368e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         -0.3476      3.976     -0.

In [25]:
y4, X4 = dmatrices("np.square(exper) ~ age + np.square(age) + nearc + daded + momed + smsa + south", df )
expr2_stage1_mod = sm.OLS(y4, X4).fit()
print (expr2_stage1_mod.summary())

                            OLS Regression Results                            
Dep. Variable:       np.square(exper)   R-squared:                       0.657
Model:                            OLS   Adj. R-squared:                  0.656
Method:                 Least Squares   F-statistic:                     820.4
Date:                Mon, 14 May 2018   Prob (F-statistic):               0.00
Time:                        20:29:23   Log-Likelihood:                -16020.
No. Observations:                3010   AIC:                         3.206e+04
Df Residuals:                    3002   BIC:                         3.210e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept        681.3828     84.846      8.

> Calculate the predicted values of $educ$, $exper$, and $exper^{2}$ from the first-stage regressions

In [26]:
educ_explained = first_stage_mod.predict(X2)

In [27]:
exper_explained = expr_stage1_mod.predict(X3)

In [28]:
exper2_explained = expr2_stage1_mod.predict(X4)

> Next, use the predicted values as regressors in the second-stage OLS

In [29]:
df_2sls = df[["logw","smsa", "south"] ].copy()
df_2sls["educ_explained"] = educ_explained
df_2sls["exper_explained"] = exper_explained
df_2sls["exper2_explained"] = exper2_explained

In [30]:
df_2sls.head()

Unnamed: 0,logw,smsa,south,educ_explained,exper_explained,exper2_explained
0,6.306275,1,0,13.55971,9.44029,96.593284
1,6.175867,1,0,12.589499,8.410501,78.458225
2,6.580639,1,0,14.330376,13.669624,207.685943
3,5.521461,1,0,14.363443,6.636557,43.802227
4,6.591674,1,0,12.279697,15.720303,245.456999


In [31]:
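# Build the second-stage design matrix from df_2sls, which holds the first-stage fitted values
y, X_stage2 = dmatrices("logw ~ educ_explained + exper_explained + exper2_explained + smsa + south", df_2sls)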

In [32]:
stage2_mod = sm.OLS(y, X_stage2).fit()
print (stage2_mod.summary())

                            OLS Regression Results                            
Dep. Variable:                   logw   R-squared:                       0.219
Model:                            OLS   Adj. R-squared:                  0.218
Method:                 Least Squares   F-statistic:                     168.6
Date:                Mon, 14 May 2018   Prob (F-statistic):          1.84e-158
Time:                        20:29:46   Log-Likelihood:                -1452.9
No. Observations:                3010   AIC:                             2918.
Df Residuals:                    3004   BIC:                             2954.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            4.4169      0.118  

> Below is the estimate from (a) for comparison. Under 2SLS, the estimated impact of education on wage becomes smaller, while the effect of experience grows. The coefficient on the non-linear term $exper^{2}$ remains negative but almost doubles in magnitude.

In [34]:
print (mod.summary())

                            OLS Regression Results                            
Dep. Variable:                   logw   R-squared:                       0.263
Model:                            OLS   Adj. R-squared:                  0.262
Method:                 Least Squares   F-statistic:                     214.6
Date:                Mon, 14 May 2018   Prob (F-statistic):          3.70e-196
Time:                        20:42:57   Log-Likelihood:                -1365.6
No. Observations:                3010   AIC:                             2743.
Df Residuals:                    3004   BIC:                             2779.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            4.6110      0.068  
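
The two-step procedure used in (e) can be written compactly in matrix form. The sketch below runs on synthetic data with one endogenous regressor; the function `two_sls` and every coefficient value are hypothetical. One caveat applies to the manual approach in general: the standard errors printed by the second-stage OLS are not valid 2SLS standard errors, because they are based on residuals computed with the fitted regressors rather than the original ones.

```python
import numpy as np

def two_sls(y, X, Z):
    """2SLS coefficients: project X on Z, then regress y on the fitted X."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # first stage, all columns at once
    b = np.linalg.lstsq(X_hat, y, rcond=None)[0]       # second stage
    resid = y - X @ b                                  # residuals use the original X
    return b, resid

# Synthetic data (not the Test4 dataset)
rng = np.random.default_rng(2)
n = 50_000
z = rng.normal(size=(n, 2))                            # two instruments
u = rng.normal(size=n)                                 # structural error
x = z @ np.array([1.0, 0.5]) + 0.8 * u + rng.normal(size=n)  # endogenous: depends on u
y = 1.0 + 0.5 * x + u                                  # true slope is 0.5

const = np.ones(n)
X = np.column_stack([const, x])
Z = np.column_stack([const, z])

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
b_iv, resid = two_sls(y, X, Z)

print(f"OLS slope: {b_ols[1]:.3f}, 2SLS slope: {b_iv[1]:.3f}")
```

The residuals returned by `two_sls` are the ones a Sargan test needs, since they are formed with the original regressors.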

### Question (f)

Perform the Sargan test for validity of the instruments. What is your conclusion?

> 1 . Calculate the residuals using formula $e_{2SLS} = y - Xb_{2SLS} $

In [35]:
# X is the original design matrix from (a), so this gives the 2SLS residuals y - X b_2SLS
e_2sls = df.logw.values - stage2_mod.predict(X)

>  2 . Regress $e_{2SLS}$ on $Z$, where $Z$ is (constant, age, age2, nearc, dadeduc, momeduc, smsa, south)

In [36]:
z = dmatrix("age + np.square(age) + nearc + daded + momed + smsa + south",
        data= df ,return_type= "dataframe")

In [37]:
z_mod = sm.OLS(e_2sls, z).fit()
print (z_mod.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.5282
Date:                Mon, 14 May 2018   Prob (F-statistic):              0.814
Time:                        20:43:06   Log-Likelihood:                -1388.1
No. Observations:                3010   AIC:                             2792.
Df Residuals:                    3002   BIC:                             2840.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          0.1258      0.657      0.

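> 3 . Calculate $nR^{2}$. Under the null hypothesis that the instruments are valid, $nR^{2} \sim \chi^{2}(m-k)$, where $m = 8$ is the number of columns in $Z$ and $k = 6$ is the number of regressors in the wage equation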

In [39]:
n = 3010
R_2 = z_mod.rsquared
n * R_2 

3.7023886431634678

> Since the critical value of $\chi^{2}(2)$ at the 5% significance level is 5.99 and 3.7 < 5.99, we do not reject the null hypothesis that the correlation between $Z$ and $\epsilon$ is 0.

> Conclusion: the Sargan test gives no evidence against the validity of the instruments.
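
The decision arithmetic for the Sargan statistic can be reproduced without extra libraries. For two degrees of freedom the chi-square survival function has the closed form $e^{-x/2}$, so both the 5% critical value and the p-value follow directly. A minimal sketch using the $nR^{2}$ value printed above; the variable names are illustrative.

```python
import math

sargan_stat = 3.7023886431634678   # n * R^2 reported above
dof = 2                            # m - k = 8 - 6

# The closed form exp(-x / 2) is specific to 2 degrees of freedom
p_value = math.exp(-sargan_stat / 2)
critical_5pct = -2 * math.log(0.05)

print(f"chi2({dof}) critical value: {critical_5pct:.2f}, p-value: {p_value:.3f}")
print("reject H0 at 5%:", sargan_stat > critical_5pct)
```

For other degrees of freedom the closed form no longer applies, and a library routine for the chi-square distribution would be needed instead.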