## Labor Economics PS #2 

By Rachel Anderson 10/14/2018

---



### 2. Angrist-Krueger
------
### Question 1

#### Identification Strategy
Angrist and Krueger (1995) exploit exogenous variation in education generated by individuals' season of birth to estimate the impact of compulsory schooling on education and earnings.
<p>
Their logic for why there should be a correlation is as follows:
<p>
>"If the fraction of students who desire to leave school before they reach the legal dropout age is constant across birthdays, a student's birthday should be expected to influence his or her ultimate educational attainment.  This relationship would be expected because, in the absence of rolling admissions to school, students born in different months of the year start school at different ages.  This fact, in conjunction with compulsory schooling laws, which require students to attend school until they reach a specified birthday, produces a correlation between date of birth and years of schooling."
<p>

#### Main Assumptions
From this we understand the main assumptions:

1. Quarter of birth is uncorrelated with ability and other characteristics that affect wages other than schooling.
2. Quarter of birth is correlated with years of schooling.
3. The returns to education are approximately linear.
4. Students who attend school longer because of compulsory schooling receive higher earnings as a result of their increased schooling.

Support for the identifying assumption comes from: (1) Children born in the first quarter of the year have a slightly *lower* average level of education than children born later in the year. (2) The seasonal pattern in education is not evident in college graduation rates, nor is it evident in graduate school completion rates.  (3) In comparing enrollment rates of 15- and 16-year-olds in states that have an age 16 schooling requirement with enrollment rates in states that have an age 17 schooling requirement, we find a greater decline in the enrollment of 16-year-olds in states that permit 16-year-olds to leave school than in states that compel 16-year-olds to attend school.

#### Instrument Validity

That season of birth should be correlated with ability or other factors that impact an individual's labor market outcomes is dubious, and so I am confident that date of birth is a valid instrument, i.e.
$$ E(Z_i \epsilon_i) = 0 $$ 
for $Z_i = (qob1, qob2, qob3, qob4)$.
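
For reference, the exogeneity condition above can be written next to the relevance condition as a two-equation system. The symbols $\pi_j$ and $v_i$ are notation introduced here for illustration, and one quarter is omitted from the first stage only to avoid collinearity with the constant:

$$
\begin{aligned}
\log(\text{wage})_i &= a + b\, educ_i + \epsilon_i \\
educ_i &= \pi_0 + \pi_1\, qob1_i + \pi_2\, qob2_i + \pi_3\, qob3_i + v_i
\end{aligned}
$$

Under this notation, validity requires $E(Z_i \epsilon_i) = 0$ and relevance requires $(\pi_1, \pi_2, \pi_3) \neq (0, 0, 0)$.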

One might believe that individuals born in the first quarter of the year may have been exposed to extreme weather conditions in the months following their birth that differentially affected their development compared to other season-of-birth cohorts.  However, the effect found by AK appears for all of the United States, including states like Florida where the weather is consistent year-round, and it persists after adding state fixed effects, so the instrument is not invalid for this reason.

More relevant questions are whether the instrument is sufficiently *strong*, and whether the authors truly measure the returns to an additional year of schooling or some other phenomenon. 

Bound & Jaeger (1996) point out that the strength of the relationship between quarter of birth and educational attainment is weaker for later cohorts, but there is no evidence that the strength of the relationship between QOB and earnings is weaker in those cohorts. Furthermore, they find that the association between quarter of birth and earnings or other labor market outcomes existed for cohorts that were not bound by compulsory school attendance. 


In [369]:
import pandas as pd
import numpy as np
from IPython.display import display, Markdown, Latex, Math
from scipy.stats import chi2, t

#Set up directories
dataDir = "../Data/"
texDir = "../TeX"

In [297]:
#Import Data
akdata_raw = pd.read_csv(dataDir+"akdata.csv")

#Drop if year of birth != 30
akdata = akdata_raw[akdata_raw['yob'] == 30]
akdata.insert(0, 'ones', 1)

#Count remaining observations
display(Markdown("### Question 2"))
display(Markdown("After dropping all observations with year of birth "\
    "different from 1930, there are " + str(len(akdata)) + \
    " observations left in the data set."))

### Question 2

After dropping all observations with year of birth different from 1930, there are 33602 observations left in the data set.

In [276]:
def do_ols(y,X):
    
    #Function that performs OLS with indep. vars X, dep. var y
    #Returns White robust standard errors
    
    tX = np.transpose(X)
    XX = np.dot(tX,X)
    XXinv = np.linalg.inv(XX)
    Xy = np.dot(tX,y)
    
    beta = np.dot(XXinv,Xy)
    e = y-np.dot(X,beta)
    
    inner = np.diag(np.array([i**2 for i in e]))
    outer = np.dot(XXinv, tX)
    
    avar = np.dot(outer, inner)
    avar = np.dot(avar, np.transpose(outer))

    se = get_se(avar)
    return(beta, se, np.asmatrix(avar))

def get_se(avar):
    return(np.sqrt(np.diagonal(avar)))
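
The `np.diag` step in `do_ols` builds an n-by-n matrix, which for the 33,602 observations used here is roughly 9 GB of float64 storage. A minimal sketch of an equivalent computation that scales the rows of `X` by the squared residuals instead is shown below. The helper name `ols_hc0` and the simulated data are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np

def ols_hc0(y, X):
    """OLS with White (HC0) robust standard errors, without an n x n matrix.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    XXinv = np.linalg.inv(X.T @ X)
    beta = XXinv @ (X.T @ y)
    e = y - X @ beta
    # "Meat" of the sandwich, X' diag(e^2) X, via row-wise scaling of X
    meat = (X * (e ** 2)[:, None]).T @ X
    avar = XXinv @ meat @ XXinv
    return beta, np.sqrt(np.diag(avar)), avar

# Simulated data with heteroskedastic errors (not the AK data)
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y_sim = 1.0 + 0.5 * x + rng.normal(size=n) * (1.0 + np.abs(x))
X_sim = np.column_stack((np.ones(n), x))

beta_sim, se_sim, avar_sim = ols_hc0(y_sim, X_sim)
print(beta_sim, se_sim)
```

The row-scaling trick gives the same sandwich estimate as the explicit diagonal matrix, because both compute the sum of `e_i**2 * x_i x_i'` over observations.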


In [144]:
def make_markdown_table(array):

    """ Input: Python list whose elements are lists
               First element contains column names. 
               Second element contains row names.
               Remaining elements contain the values for each row.
        Output: String to put into a .md file 
        
    Ex Input: 
        [["Name", "Age", "Height"],
         ["Jake", "Mary"],
         [20, "5'10"],
         [21, "5'7"]] 
    """


    markdown = "\n" + str("|")
    
    for e in array[0]:
        to_add = " " + str(e) + str(" |")
        markdown += to_add
    markdown += "\n"

    # One separator cell per column, matching the header row
    markdown += '|'
    for i in range(len(array[0])):
        markdown += str("-------------- | ")
    markdown += "\n"


    for index, entry in enumerate(array[2:]): 
        markdown += str("| ") + str(array[1][index]) + " | "
        for e in entry:
            to_add = str(e) + str(" | ")
            markdown += to_add
        markdown += "\n"

    return markdown + "\n"

In [216]:
y = akdata['lwage']
X = akdata[['ones', 'educ']]

betaOLS, seOLS, avarOLS = do_ols(y,X)


In [302]:
temp = np.column_stack((betaOLS, seOLS))
display(Markdown("### Question 3"))
display(Markdown(r"Below are the OLS estimates of <p>$$ Y_i = a + b\ educ_i + \epsilon_i $$ <p> and the corresponding robust standard errors"))
display(Markdown(make_markdown_table([['', 'Coefficient','Standard Error'], ['a', 'b'], temp[0], temp[1]])))

### Question 3

Below are the OLS estimates of <p>$$ Y_i = a + b\ educ_i + \epsilon_i $$ <p> and the corresponding robust standard errors


|  | Coefficient | Standard Error |
|-------------- | -------------- | -------------- |
| a | 5.04000937949 | 0.0152481769739 | 
| b | 0.0692617169477 | 0.00116277846384 | 



### Question 4

Below is code that generates results for the first stage regression of years of education on the instruments (and a constant):


In [236]:
# Generate dummy variables qob1-qob4 from qob variable
# (vectorized comparison, so it does not depend on the row index labels)

for x in range(1,5):
    name = 'qob' + str(x)
    if name not in akdata.columns:  # skip columns that already exist on re-run
        akdata.insert(x, name, (akdata['qob'] == x).astype(int))

In [247]:
# List combinations of instruments

Z_drop4 = akdata[['ones', 'qob1', 'qob2', 'qob3']] #4 dropped
Z_drop3 = akdata[['ones', 'qob1', 'qob2', 'qob4']] #3 dropped
Z_drop2 = akdata[['ones', 'qob1', 'qob3', 'qob4']] #2 dropped
Z_drop1 = akdata[['ones', 'qob2', 'qob3', 'qob4']] #1 dropped

tables = {}
coef = {}
se = {}
avar = {}

for x in range(1,5):
    index = "Z_drop" + str(x)
    coef[index], se[index], avar[index] = do_ols(akdata['educ'], eval(index))
    temp = np.column_stack((coef[index], se[index]))
    display(Markdown("First stage results for dropping qob" + str(x) + " \n"))
    display(Markdown(make_markdown_table([['Variable', 'Coefficient','Standard Error'], eval(index).columns.tolist(), temp[0], temp[1], temp[2],temp[3]])))

First stage results for dropping qob1 



| Variable | Coefficient | Standard Error |
|-------------- | -------------- | -------------- |
| ones | 12.280405003 | 0.0376135342181 | 
| qob2 | 0.148013291448 | 0.0534304307423 | 
| qob3 | 0.211454662236 | 0.0523515293 | 
| qob4 | 0.344270482249 | 0.0534090296364 | 



First stage results for dropping qob2 



| Variable | Coefficient | Standard Error |
|-------------- | -------------- | -------------- |
| ones | 12.4284182944 | 0.0379477663761 | 
| qob1 | -0.148013291448 | 0.0534304307423 | 
| qob3 | 0.0634413707885 | 0.0525921822765 | 
| qob4 | 0.196257190801 | 0.0536449388411 | 



First stage results for dropping qob3 



| Variable | Coefficient | Standard Error |
|-------------- | -------------- | -------------- |
| ones | 12.4918596652 | 0.0364129738372 | 
| qob1 | -0.211454662236 | 0.0523515293 | 
| qob2 | -0.0634413707885 | 0.0525921822765 | 
| qob4 | 0.132815820012 | 0.0525704399258 | 



First stage results for dropping qob4 



| Variable | Coefficient | Standard Error |
|-------------- | -------------- | -------------- |
| ones | 12.6246754852 | 0.0379176276991 | 
| qob1 | -0.344270482249 | 0.0534090296364 | 
| qob2 | -0.196257190801 | 0.0536449388411 | 
| qob3 | -0.132815820012 | 0.0525704399258 | 



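From the above calculations, we can see that the choice of which quarter-of-birth dummy to drop in the first-stage equation does not matter: each specification is a reparameterization of the same model, so the fitted values of education are identical. For example, the constant when qob4 is dropped (12.6247) equals the constant plus the qob4 coefficient when qob1 is dropped (12.2804 + 0.3443).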
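
The invariance of the first stage to the omitted dummy can also be checked numerically. The sketch below uses simulated data (not the AK sample), and the names `qob`, `educ_sim`, and `fitted_values` are hypothetical. It fits the first stage four times, dropping a different quarter each time, and compares the fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
qob = rng.integers(1, 5, size=n)                  # quarter of birth, 1..4
educ_sim = 12.0 + 0.1 * qob + rng.normal(size=n)  # simulated years of schooling

def fitted_values(dropped):
    """Fitted values from regressing educ_sim on a constant and three dummies."""
    kept = [q for q in (1, 2, 3, 4) if q != dropped]
    Z = np.column_stack([np.ones(n)] + [(qob == q).astype(float) for q in kept])
    pi, *_ = np.linalg.lstsq(Z, educ_sim, rcond=None)
    return Z @ pi

fits = {dropped: fitted_values(dropped) for dropped in (1, 2, 3, 4)}

# Largest absolute difference in fitted values across the four specifications
max_gap = max(np.max(np.abs(fits[1] - fits[d])) for d in (2, 3, 4))
print(max_gap)
```

In every specification the fitted value for an observation is the mean of `educ_sim` within its quarter, so `max_gap` should differ from zero only by floating-point error.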

In [243]:
def calc_Wald(R, V, beta, r, q):
    # tests q linear restrictions of the form R*beta = r
    # V is estimate of avar(beta)
    
    rb = np.dot(R, beta) - r
    RVR = np.dot(np.dot(R, V), np.transpose(R))
    RVRinv = np.linalg.inv(RVR)
    
    W = np.dot(np.dot(np.transpose(rb), RVRinv), rb)
    
    pvalue = 1- chi2.cdf(W, q)
    
    return(W, pvalue)

def calc_Fstat(Wald, q):
    # Wald is the (W, pvalue) tuple returned by calc_Wald
    # np.asscalar was removed from NumPy; .item() extracts the scalar instead
    return(np.asarray(Wald[0]/q).item())

In [244]:
Wald = {}
Fstats = {}
q = 4
R = np.identity(4)
for x in avar:
    Wald[x] = calc_Wald(R, avar[x], coef[x], 0, q)
    Fstats[x] = calc_Fstat(Wald[x], q)
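
A common convention for the first-stage F statistic is to test only the coefficients on the excluded instruments and leave the constant unrestricted, so that `R` has three rows and `q = 3`. The sketch below illustrates that variant on simulated data. The helpers `robust_ols` and `first_stage_F` and all simulated variables are hypothetical, not part of the notebook.

```python
import numpy as np
from scipy.stats import chi2

def robust_ols(y, X):
    """OLS coefficients and White (HC0) covariance matrix."""
    XXinv = np.linalg.inv(X.T @ X)
    beta = XXinv @ (X.T @ y)
    e = y - X @ beta
    avar = XXinv @ ((X * (e ** 2)[:, None]).T @ X) @ XXinv
    return beta, avar

def first_stage_F(beta, avar):
    """Wald test that all slope coefficients are zero (constant left free).

    Assumes the constant is the first column of the design matrix.
    """
    k = len(beta)
    R = np.eye(k)[1:]          # drop the row that selects the constant
    q = k - 1
    rb = R @ beta
    W = float(rb @ np.linalg.inv(R @ avar @ R.T) @ rb)
    return W / q, chi2.sf(W, q)

# Simulated first stage (not the AK data): only quarter 4 shifts schooling
rng = np.random.default_rng(2)
n = 4000
qob = rng.integers(1, 5, size=n)
Z = np.column_stack([np.ones(n)] + [(qob == j).astype(float) for j in (2, 3, 4)])
educ_sim = 12.0 + 0.5 * (qob == 4) + rng.normal(size=n)

pi_hat, avar_pi = robust_ols(educ_sim, Z)
F, pval = first_stage_F(pi_hat, avar_pi)
print(F, pval)
```

Including the constant among the restrictions would instead test whether mean schooling is zero, which is not informative about instrument strength.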

In [353]:
def do_2sls(y, Z, pi, avar1):
    
    #Inputs - y = dep. var, Z = instruments (incl. constant),
    #pi = first stage regression coef.
    #avar1 = first stage avar(pi) (currently unused)
    #Outputs - coef, se, avar2 
    
    Xhat = np.dot(Z, pi)
    ones = np.ones(len(Xhat))
    X= np.column_stack((ones,Xhat))
    
    beta, se, avarOLS = do_ols(y,X)
    avar = calc_avar(y,X,Z,beta)
    
    se = get_se(avar)
    
    return(beta, se, avar)

######################################

def calc_avar(y,X,Z,beta):

    n=len(y)
    e = y-np.dot(X,beta)
    Omega = np.diag(np.array([i**2 for i in e]))
    
    inner = np.dot(np.dot(np.transpose(Z),Omega), Z)
    
    outer = calc_outer(Z,X)
    
    avar = np.dot(np.dot(outer, inner), np.transpose(outer))
    
    return(avar)

######################################

def calc_outer(Z,X):
    tZ = np.transpose(Z)
    ZX = np.dot(tZ, X)
    XZ = np.transpose(ZX)
    
    ZZ = np.dot(tZ, Z) 
    ZZinv = np.linalg.inv(ZZ)
    
    outer = np.linalg.inv(np.dot(np.dot(XZ,ZZinv), ZX))
    outer = np.dot(np.dot(outer, XZ), ZZinv)

    return(outer)

######################################

# def calc_avar2(y,Z,pi):

#     n=len(y)
#     Xhat= np.dot(Z,pi)
#     e = y-Xhat
#     Omega = np.diag(np.array([i**2 for i in e]))
    
#     inner = np.dot(np.dot(np.transpose(Xhat),Omega), Xhat)
    
#     outer = calc_outer2(Z,pi)
#     try:
#         avar = np.dot(np.dot(outer, inner), np.transpose(outer))
#     except:
#         avar = (outer**2)*inner
    
#     return(avar)

# def calc_outer2(Z,pi):
#     Xhat = np.dot(Z, pi)
#     outer = 1/Xhat
#     return(outer)

In [354]:
beta2SLS, se2SLS, avar2SLS = do_2sls(y,Z_drop1,coef['Z_drop1'], avar['Z_drop1'])

In [355]:
temp = np.column_stack((beta2SLS, se2SLS))
display(Markdown("### Question 5"))
display(Markdown(r"Below are the 2SLS estimates of <p>$$ Y_i = a + b\widehat{X}_i + \epsilon_i $$ <p> where <p>" \
                 r"$$ \widehat{X}_i = z_i'\widehat{\pi} $$"))
display(Markdown(make_markdown_table([['', 'Coefficient','Standard Error'], ['a', 'b'], temp[0], temp[1]])))

### Question 5

Below are the 2SLS estimates of <p>$$ Y_i = a + b\widehat{X}_i + \epsilon_i $$ <p> where <p>$$ \widehat{X}_i = z_i'\widehat{\pi} $$


|  | Coefficient | Standard Error |
|-------------- | -------------- | -------------- |
| a | 4.94201352746 | 0.380633660952 | 
| b | 0.0771296141957 | 0.0305587493448 | 
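
Textbook 2SLS standard errors are usually computed from residuals that use the observed regressor, $y_i - a - b\, educ_i$, rather than the fitted regressor $\widehat{X}_i$; the point estimates are the same either way. The sketch below illustrates that variant on simulated data. The function `tsls_robust` and all simulated variables are hypothetical, and the data are not the AK sample.

```python
import numpy as np

def tsls_robust(y, X, Z):
    """2SLS coefficients with heteroskedasticity-robust standard errors.

    X: n x k observed regressors (including the constant)
    Z: n x l instruments (including the constant), with l >= k
    """
    ZZinv = np.linalg.inv(Z.T @ Z)
    ZX = Z.T @ X
    # A = (X'Z (Z'Z)^-1 Z'X)^-1 X'Z (Z'Z)^-1, so that beta = A Z'y
    A = np.linalg.inv(ZX.T @ ZZinv @ ZX) @ ZX.T @ ZZinv
    beta = A @ (Z.T @ y)
    e = y - X @ beta                      # residuals use the observed X
    meat = (Z * (e ** 2)[:, None]).T @ Z  # Z' diag(e^2) Z
    avar = A @ meat @ A.T
    return beta, np.sqrt(np.diag(avar)), avar

# Simulated data in which schooling is endogenous by construction
rng = np.random.default_rng(3)
n = 20000
qob = rng.integers(1, 5, size=n)
Z = np.column_stack([np.ones(n)] + [(qob == j).astype(float) for j in (2, 3, 4)])
ability = rng.normal(size=n)              # unobserved confounder
educ_sim = 12.0 + 0.5 * (qob - 1) + ability + rng.normal(size=n)
lwage_sim = 5.0 + 0.07 * educ_sim + 0.3 * ability + rng.normal(scale=0.5, size=n)
X = np.column_stack((np.ones(n), educ_sim))

beta_iv, se_iv, _ = tsls_robust(lwage_sim, X, Z)
beta_ols = np.linalg.lstsq(X, lwage_sim, rcond=None)[0]
print(beta_iv[1], se_iv[1], beta_ols[1])
```

In this simulation the true return is 0.07, so the 2SLS slope should land near that value while the OLS slope is pushed upward by the omitted `ability` term.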



### Question 6

Following Wooldridge (2002), the Hausman test is conducted by estimating

$$ \log(\text{wage})_i = \beta_0 + \beta_1 educ_i + \rho \widehat{v}_2 + error$$

where $\widehat{v}_2$ is the residual from the first-stage regression of $educ$ on $Z = [1, qob1, qob2, qob3]$.

Then the usual heteroskedasticity-robust $t$ statistic for $\widehat{\rho}$ is a valid test of $H_0: \rho = 0$, i.e., that $educ$ is exogenous.

In [372]:
def do_hausman(y, X, Z):
    pi, sePi, avarPi = do_ols(X, Z)
    v2 = X-np.dot(Z, pi)
    
    newX = np.column_stack((np.ones(len(y)),X,v2))
    coefNew, seNew, avarNew = do_ols(y, newX)
    
    tt,pval = do_t_test(len(y), coefNew[-1], 0, seNew[-1])
    
    return(tt, pval)

def do_t_test(n, thetaHat, theta0, se):
    tt = (thetaHat-theta0)/se
    pval = t.sf(np.abs(tt), n-1)*2 
    return(tt, pval)

In [373]:
tt, pval = do_hausman(y, akdata['educ'], Z_drop1)
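
As a sanity check on the regression-based test, it can help to run it on simulated data where education is endogenous by construction, so the coefficient on the first-stage residual should be clearly nonzero. A minimal sketch follows. The names `robust_ols` and `control_function_test` are hypothetical, the data are simulated rather than the AK sample, and the p-value uses a normal approximation instead of the $t$ distribution.

```python
import numpy as np
from scipy.stats import norm

def robust_ols(y, X):
    """OLS coefficients with White (HC0) standard errors."""
    XXinv = np.linalg.inv(X.T @ X)
    beta = XXinv @ (X.T @ y)
    e = y - X @ beta
    avar = XXinv @ ((X * (e ** 2)[:, None]).T @ X) @ XXinv
    return beta, np.sqrt(np.diag(avar))

def control_function_test(y, x, Z):
    """Robust t test of rho = 0 in y = b0 + b1*x + rho*v2 + error."""
    pi, _ = robust_ols(x, Z)
    v2 = x - Z @ pi                            # first-stage residual
    W = np.column_stack((np.ones(len(y)), x, v2))
    coef, se = robust_ols(y, W)
    t_stat = coef[-1] / se[-1]
    return t_stat, 2 * norm.sf(abs(t_stat))    # normal approximation, large n

# Simulated data: `ability` enters both schooling and wages
rng = np.random.default_rng(4)
n = 20000
qob = rng.integers(1, 5, size=n)
Z = np.column_stack([np.ones(n)] + [(qob == j).astype(float) for j in (2, 3, 4)])
ability = rng.normal(size=n)
educ_sim = 12.0 + 0.5 * (qob - 1) + ability + rng.normal(size=n)
lwage_sim = 5.0 + 0.07 * educ_sim + 0.3 * ability + rng.normal(scale=0.5, size=n)

t_stat, p_val = control_function_test(lwage_sim, educ_sim, Z)
print(t_stat, p_val)
```

A large absolute `t_stat` here reflects the built-in correlation between schooling and the wage error; with exogenous schooling the statistic would be expected to be small.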
