# Problem set 2
Before we start on the exercises, we briefly introduce two concepts in Python: first, importing and exporting data; second, using functions. If you are already familiar with these features, you can skip the next two sections and jump directly to the exercises.

In [1]:
import numpy as np
from numpy import linalg as la
import pandas as pd
from io import StringIO
from tabulate import tabulate
from matplotlib import pyplot as plt
from scipy.stats import chi2
import scipy.stats as st

# Suppress FutureWarnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Import this week's LinearModels .py file
import w2_LinearModels_post as lm
%load_ext autoreload
%autoreload 2

## Exercises
### Import data

The exercise takes up the union membership example from before. The data set WAGEPAN.TXT contains information about 545 men who worked every year from 1980 to 1987 in the US. The model of interest is



$$
\begin{align}
\log\left(\textit{wage}_{it}\right) & =\beta_{0}+\beta_{1}\textit{exper}_{it}+\beta_{2}\textit{exper}_{it}^{2}+\beta_{3}\textit{union}_{it}+\beta_{4}\textit{married}_{it} +\beta_{5}\textit{educ}_{i}+\beta_{6}\textit{hisp}_{i}+\beta_{7}\textit{black}_{i}+c_{i}+u_{it} \tag{1}
\end{align}
$$

Note that *educ*, *hisp*, and *black* are time-invariant variables.

The data has 10 columns, numbered from 0 to 9. Here is a variable description:
- Column 0: ID
- Column 1: Year
- Column 2: Black
- Column 3: Experience
- Column 4: Hispanic
- Column 5: Married
- Column 6: Education
- Column 7: Union
- Column 8: ln wage
- Column 9: Experience sqr

Start by loading the data. Some of this has been done for you already. Since we are working with panel data, we need to know how many individuals there are and over how many time periods we observe them. Because the panel is balanced, every individual is observed in the same number of periods, which makes our life a little easier.
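As a minimal sketch of the counting step, the snippet below checks that a stacked ID column describes a balanced panel before reading off the number of units and periods. The ID values and the names `toy_id`, `N_toy` and `T_toy` are hypothetical and only serve as an illustration.

```python
import numpy as np

# Hypothetical ID column: 3 units, each observed 4 times, stacked by unit.
toy_id = np.repeat([10, 20, 30], 4)

# Unique IDs and the number of rows belonging to each of them.
ids, counts = np.unique(toy_id, return_counts=True)

N_toy = ids.size                                  # number of units
is_balanced = bool(np.all(counts == counts[0]))   # same count for every unit?
T_toy = int(counts[0])                            # periods per unit (valid if balanced)

print(f'N = {N_toy}, T = {T_toy}, balanced = {is_balanced}')
```

In an unbalanced panel the counts would differ across units, and a single `T` would no longer describe the data.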

In [2]:
# First, import the data into numpy.
# Load the firms.csv file provided for the production function exercises.
data = np.loadtxt('firms.csv', delimiter=',', skiprows=1)
id_array = data[:, 0].astype(int)

# Count how many firms we have and how many years per firm.
unique_id = np.unique(id_array, return_counts=True)
N = unique_id[0].size
T = int(unique_id[1].mean())
year = data[:, 1].astype(int)


In [3]:
# Load the rest of the data into arrays.
# Dependent variable: log deflated sales (ldsa).
y = data[:, 4].reshape(-1, 1)

# Regressors: constant, log labour, log capital.
x = np.column_stack([
    np.ones(N * T),
    data[:, 3],  # log labour
    data[:, 2]   # log capital
])

label_y = 'Log deflated sales'
label_x = ['Constant', 'Log labour', 'Log capital']


## Pooled OLS (POLS) Estimator
- **Estimate (1) by pooled OLS,** thus treating, for the moment, the unobserved components of (1) as one (composite) error term $v_{it}=c_{i}+u_{it}$. 
- Fill in the remaining parts of the function est_ols() in the accompanying Python file (LinearModelsWeek2_ante.py) to estimate the model.
- What assumptions are made about $E\left[c_{i}\mathbf{x}_{it}\right]$ and $E\left[u_{it}\mathbf{x}_{it}\right]$ when justifying this estimation approach? (Hint: See Wooldridge p. 283)
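For orientation, pooled OLS stacks all $NT$ observations and computes $\hat{\boldsymbol{\beta}}=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$. The sketch below applies that formula to synthetic data; `ols_sketch` and the toy arrays are hypothetical and are not the course's est_ols() implementation.

```python
import numpy as np

def ols_sketch(y, x):
    """Hypothetical helper: solve the normal equations (x'x) b = x'y."""
    return np.linalg.solve(x.T @ x, x.T @ y)

rng = np.random.default_rng(0)
n = 500

# Toy regressors: a constant and one random variable.
x_toy = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([[1.0], [0.5]])

# No noise is added, so OLS should recover beta_true up to rounding error.
y_toy = x_toy @ beta_true
b_toy = ols_sketch(y_toy, x_toy)
print(b_toy.ravel())
```

Solving the normal equations with `np.linalg.solve` avoids forming the inverse explicitly, which tends to be numerically safer than `inv(x.T @ x) @ x.T @ y`.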

In [4]:
# Estimate coefficients
b_hat = lm.est_ols(y,x)

# Print the results
for label, b_k in zip(label_x, b_hat):
    print(f'{label:16}: {b_k[0]:7.4f}')

Constant        :  0.0000
Log labour      :  0.6748
Log capital     :  0.3100


Calculate the standard errors of the coefficients. This is very similar to last week's exercise. (Hint: See Wooldridge p. 59-60)

In [5]:
# Calculate the residuals
resid = y - x @ b_hat

# Calculate estimate of variance of residuals
SSR = resid.T @ resid
K = x.shape[1]
sigma = SSR / (N*T - K)

# Calculate the variance-covariance matrix
cov = sigma * la.inv(x.T @ x)

# Calculate the standard errors 
# Make sure to output the result in a vector
se = np.sqrt(np.diag(cov)).reshape(-1,1)

# Print results
for label, b_k, se_k in zip(label_x, b_hat, se):
    print(f'{label:16}: {b_k[0]:7.4f}    ({se_k[0]:6.4f})')

Constant        :  0.0000    (0.0050)
Log labour      :  0.6748    (0.0102)
Log capital     :  0.3100    (0.0091)


Fill in the functions estimate() and variance() in the accompanying python file. You can reuse most of the code above.
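The calls to estimate() in this notebook pass `robust_se=True`. What that flag computes is defined in the accompanying file and is not shown here. One common choice for panel data, assumed in this sketch, is the cluster-robust sandwich estimator $(\mathbf{X}'\mathbf{X})^{-1}\left(\sum_{i}\mathbf{X}_{i}'\hat{\mathbf{u}}_{i}\hat{\mathbf{u}}_{i}'\mathbf{X}_{i}\right)(\mathbf{X}'\mathbf{X})^{-1}$. The function `cluster_robust_cov` and the toy data are hypothetical.

```python
import numpy as np

def cluster_robust_cov(x, resid, N, T):
    """Sandwich estimator, assuming rows are stacked unit by unit (T rows each)."""
    K = x.shape[1]
    xtx_inv = np.linalg.inv(x.T @ x)
    meat = np.zeros((K, K))
    for i in range(N):
        rows = slice(i * T, (i + 1) * T)   # rows belonging to unit i
        score = x[rows].T @ resid[rows]    # K x 1 vector X_i' u_i
        meat += score @ score.T
    return xtx_inv @ meat @ xtx_inv

# Toy panel: 50 units, 4 periods, a constant and one regressor.
rng = np.random.default_rng(1)
N_toy, T_toy = 50, 4
x_toy = np.column_stack([np.ones(N_toy * T_toy), rng.standard_normal(N_toy * T_toy)])
y_toy = x_toy @ np.array([[1.0], [0.5]]) + rng.standard_normal((N_toy * T_toy, 1))

b_toy = np.linalg.solve(x_toy.T @ x_toy, x_toy.T @ y_toy)
cov_toy = cluster_robust_cov(x_toy, y_toy - x_toy @ b_toy, N_toy, T_toy)
se_toy = np.sqrt(np.diag(cov_toy)).reshape(-1, 1)
```

Unlike the homoskedastic formula $\hat{\sigma}^{2}(\mathbf{X}'\mathbf{X})^{-1}$, this estimator allows the errors of the same unit to be correlated over time.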

Using the function print_table(), you should reproduce the table below.

In [6]:
# Estimate model using OLS
ols_result = lm.estimate(y,x, N=N, T=T, robust_se=True)

# Print table
lm.print_table((label_y, label_x), ols_result, title="Pooled OLS", floatfmt='.4f')

Pooled OLS
Dependent variable: Log deflated sales

               Beta      Se    t-values
-----------  ------  ------  ----------
Constant     0.0000  0.0161      0.0000
Log labour   0.6748  0.0366     18.4526
Log capital  0.3100  0.0324      9.5810
R² = 0.914
σ² = 0.131


In [7]:
# --- CRS restriction helpers ---
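def _crs_wald(results, skip):
    # Wald test of H0: beta_labour + beta_capital = 1 (constant returns to scale).
    # `skip` is the number of leading coefficients (e.g. a constant) to drop.
    b = results['b_hat'][skip:, :]
    cov = results['cov'][skip:, skip:]
    R = np.array([[1.0, 1.0]])  # restriction matrix
    q = np.array([[1.0]])       # restricted value
    diff = R @ b - q
    var_Rb = R @ cov @ R.T
    W = (diff.T @ la.inv(var_Rb) @ diff).item()
    # Under H0, W is asymptotically chi-squared with one degree of freedom.
    crit = chi2.ppf(0.95, 1)
    p_val = chi2.sf(W, 1)
    return W, crit, p_val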

def crs_test_fe(results):
    return _crs_wald(results, 0)

def crs_test_fd(results):
    return _crs_wald(results, 0)

def crs_test_re(results):
    return _crs_wald(results, 1)


## Fixed Effects (FE) Estimator
In the next step, we will estimate the model using fixed effects. This is done by first performing the fixed effects (within-groups) transformation on the data and then using pooled OLS on the transformed data.
We will break this down into multiple steps.

### Using numpy
Create a transformation matrix with dimensions $T \times T$ that can be used to transform the data. Note that the matrix premultiplies the data of each individual separately, so the dimensions will match in the end.

In [8]:
# Create transformation matrix
def demeaning_matrix(T):
    Q_T = np.eye(T) - np.tile(1/T, (T, T))
    return Q_T
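A quick way to convince yourself that such a matrix demeans a vector is to apply it to a small example. The sketch below uses hypothetical values and checks three standard properties of the within-transformation matrix: it subtracts the mean, it is idempotent, and it has rank $T-1$.

```python
import numpy as np

T_toy = 4
Q = np.eye(T_toy) - np.ones((T_toy, T_toy)) / T_toy

# Hypothetical observations of one unit over four periods.
v = np.array([1.0, 2.0, 3.0, 6.0])
v_demeaned = Q @ v

print(v_demeaned)                    # same as v - v.mean()
print(np.allclose(Q @ Q, Q))         # idempotent: demeaning twice changes nothing
print(np.linalg.matrix_rank(Q))      # T - 1, because Q maps constants to zero
```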



Use the supplied perm() function to apply the transformation to the data.

In [9]:
# Create the demeaning matrix
Q_T = demeaning_matrix(T)

# Transform the data
y_demean = lm.perm(Q_T, y)
x_demean = lm.perm(Q_T, x)
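The implementation of perm() lives in the accompanying file and is not reproduced here. As a rough sketch of the idea, assuming the data are stacked unit by unit, a function of this kind premultiplies each unit's block of $T$ rows by the transformation matrix. The name `perm_sketch` is hypothetical.

```python
import numpy as np

def perm_sketch(Q, A):
    """Hypothetical stand-in: apply the (M x T) matrix Q to each unit's T rows of A."""
    M, T = Q.shape
    N = A.shape[0] // T
    K = A.shape[1]
    # View the data as N blocks of shape (T, K), transform each block, restack.
    blocks = A.reshape(N, T, K)
    return (Q @ blocks).reshape(N * M, K)

# Toy panel with N = 2 units, T = 3 periods and K = 2 variables.
A_toy = np.arange(12, dtype=float).reshape(6, 2)
Q_toy = np.eye(3) - np.ones((3, 3)) / 3

A_within = perm_sketch(Q_toy, A_toy)

# The same result via a block-diagonal (Kronecker) matrix, for comparison.
A_kron = np.kron(np.eye(2), Q_toy) @ A_toy
```

The block-wise version avoids building the $NT \times NT$ Kronecker matrix, which would be wasteful for large $N$.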


What are the rank and eigenvalues of the within-transformed $\mathbf{X}$ matrix? Why?

What happens to *educ, hisp, and black* and the constant when the data are within transformed? 
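To build intuition, the toy example below within-transforms a constant, a time-invariant variable and a time-varying variable. All values and names are hypothetical; the point is only that anything that does not vary within a unit is mapped to zero, which reduces the rank of the transformed matrix.

```python
import numpy as np

N_toy, T_toy = 3, 4
Q = np.eye(T_toy) - np.ones((T_toy, T_toy)) / T_toy
rng = np.random.default_rng(2)

const = np.ones(N_toy * T_toy)
fixed = np.repeat([12.0, 16.0, 9.0], T_toy)    # constant within each unit
varying = rng.standard_normal(N_toy * T_toy)   # changes over time
x_toy = np.column_stack([const, fixed, varying])

# Apply Q to every unit's block through a block-diagonal matrix.
x_within = np.kron(np.eye(N_toy), Q) @ x_toy
rank_within = np.linalg.matrix_rank(x_within)

print(x_within[:4].round(4))   # first two columns are zero
print(rank_within)             # only the time-varying column contributes
```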

In [10]:
# Create function to check rank of demeaned matrix, and return its eigenvalues.
def check_rank(x):
    print(f'Rank of demeaned x: {la.matrix_rank(x)}')
    lambdas, V = la.eig(x.T@x)
    np.set_printoptions(suppress=True)  # This is just to print nicely.
    print(f'Eigenvalues of within-transformed x: {lambdas.round(decimals=0)}')
    print(V)
    # Use the eigenvectors to identify which variables are dropped.


Adjust `x_demean` such that the model can be estimated using the FE estimator. Adjust the labels to match with `x_demean`.

Estimate the model using the estimate() function, and print the results.

In [11]:
# Choose variables to include in fixed effects model
x_demean = x_demean[:, 1:]
label_x_fe = label_x[1:]


In [12]:
# Estimate FE OLS using the demeaned variables.
fe_result = lm.estimate(y_demean, x_demean, transform='fe', N=N, T=T, robust_se=True)

# Print results
lm.print_table((label_y, label_x_fe), fe_result, title='FE regression', floatfmt='.4f')

FE regression
Dependent variable: Log deflated sales

               Beta      Se    t-values
-----------  ------  ------  ----------
Log labour   0.6942  0.0417     16.6674
Log capital  0.1546  0.0299      5.1630
R² = 0.477
σ² = 0.018


In [None]:
# Between estimator feeding the RE calculations
P_T = np.ones((1, T)) / T
y_mean = lm.perm(P_T, y)
x_mean = lm.perm(P_T, x)
be_result = lm.estimate(y_mean, x_mean, transform='be', N=N, T=T, robust_se=True)
lm.print_table((label_y, label_x), be_result, title='Between Estimator', floatfmt='.4f')


In [None]:
# Quasi-demeaning parameter for RE
sigma2_u = float(fe_result['sigma'])
sigma2_w = float(be_result['sigma'])
sigma2_c = max(sigma2_w - sigma2_u / T, 0.0)
_lambda = 1 - np.sqrt(sigma2_u / (sigma2_u + T * sigma2_c))
print(f'Lambda is approximately equal to {_lambda:.4f}.')
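The quasi-demeaning parameter computed above is $\lambda = 1-\sqrt{\sigma_{u}^{2}/(\sigma_{u}^{2}+T\sigma_{c}^{2})}$. The sketch below evaluates it for a few hypothetical variance values to show its two limiting cases: $\lambda=0$ leaves the data untransformed (pooled OLS), while $\lambda \to 1$ approaches the within transformation (FE).

```python
import numpy as np

def quasi_demeaning_lambda(sigma2_u, sigma2_c, T):
    """Hypothetical helper mirroring the formula used in the cell above."""
    return 1 - np.sqrt(sigma2_u / (sigma2_u + T * sigma2_c))

# No unobserved heterogeneity: lambda is exactly zero.
lam_no_effect = quasi_demeaning_lambda(1.0, 0.0, 8)

# Equal variances with T = 8: lambda = 1 - sqrt(1/9) = 2/3.
lam_mid = quasi_demeaning_lambda(1.0, 1.0, 8)

# Very long panel: lambda is close to one.
lam_large_T = quasi_demeaning_lambda(1.0, 1.0, 10_000)

print(lam_no_effect, round(lam_mid, 4), round(lam_large_T, 4))
```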


In [None]:
# Random effects transformation and estimation
P_T_full = np.ones((T, T)) / T
C_T = np.eye(T) - _lambda * P_T_full
y_re = lm.perm(C_T, y)
x_re = lm.perm(C_T, x)
re_result = lm.estimate(y_re, x_re, transform='re', N=N, T=T, robust_se=True)
lm.print_table((label_y, label_x), re_result, title='Random Effects', floatfmt='.4f')


## First-difference (FD) Estimator
Construct $\mathbf{D}$ and use the procedure `perm` $(\mathbf{D},\mathbf{x})$ to compute first differences of the elements of $\mathbf{y}$ and $\mathbf{x}$. $\mathbf{D}$ should be a $(T-1) \times T$ matrix. Why?

What happens to *educ, hisp* and *black* and the constant when the data are transformed into first differences? What is the rank of the first differenced $\mathbf{x}$-matrix? Why?

In [13]:
# Create transformation matrix
def fd_matrix(T):
    D_T = np.eye(T) - np.eye(T, k=-1)
    D_T = D_T[1:]
    return D_T

# Print the matrix
D_T = fd_matrix(T)
print(f'First differencing matrix for T={T} \n', D_T)

First differencing matrix for T=12 
 [[-1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0. -1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0. -1.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0. -1.  1.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0. -1.  1.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0. -1.  1.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0. -1.  1.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0. -1.  1.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0. -1.  1.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0. -1.  1.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0. -1.  1.]]
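As a sanity check on a matrix of this form, the sketch below rebuilds it for a small hypothetical $T$ and compares the result with `np.diff`. It also shows that a vector of ones is mapped to zero.

```python
import numpy as np

def fd_matrix_sketch(T):
    """Same construction as fd_matrix above: rows of the form (..., -1, 1, ...)."""
    return (np.eye(T) - np.eye(T, k=-1))[1:]

D = fd_matrix_sketch(4)

# Hypothetical series for one unit.
v = np.array([1.0, 4.0, 9.0, 16.0])

print(D.shape)               # (T - 1, T): one difference is lost per unit
print(D @ v)                 # equals np.diff(v)
print(D @ np.ones(4))        # a constant differences to zero
```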


In [14]:
# Transform the data.
y_diff = lm.perm(D_T, y)
x_diff = lm.perm(D_T, x)

# Print x_diff
print(x_diff)

[[ 0.         0.000907  -0.0733878]
 [ 0.        -0.023856  -0.0455976]
 [ 0.        -0.052741  -0.0365186]
 ...
 [ 0.         0.102943   0.172166 ]
 [ 0.         0.048671   0.182393 ]
 [ 0.         0.056783   0.014258 ]]


In [15]:
# Check rank condition.
check_rank(x_diff)

Rank of demeaned x: 2
Eigenvalues of within-transformed x: [35. 47.  0.]
[[ 0.          0.          1.        ]
 [ 0.6043423  -0.79672478  0.        ]
 [-0.79672478 -0.6043423   0.        ]]


Adjust `x_diff` such that the model can be estimated using the FD estimator. Adjust the labels to match with `x_diff`.

Estimate the model using the estimate() function, and print the results.

In [16]:
# Choose variables to include in first-difference model
x_diff = x_diff[:, 1:]
label_x_fd = label_x[1:]


In [17]:
# CRS Wald tests for FE, FD, and RE

# Ensure FD result is evaluated
fd_result = lm.estimate(y_diff, x_diff, transform='fd', N=N, T=T-1, robust_se=True)

W_fe, crit_fe, pval_fe = crs_test_fe(fe_result)
print(f'CRS Wald test (FE): {W_fe:.4f}')
print(f'Critical value (5%): {crit_fe:.4f}')
print(f'p-value: {pval_fe:.4f}')

W_fd, crit_fd, pval_fd = crs_test_fd(fd_result)
print(f'CRS Wald test (FD): {W_fd:.4f}')
print(f'Critical value (5%): {crit_fd:.4f}')
print(f'p-value: {pval_fd:.4f}')

W_re, crit_re, pval_re = crs_test_re(re_result)
print(f'CRS Wald test (RE): {W_re:.4f}')
print(f'Critical value (5%): {crit_re:.4f}')
print(f'p-value: {pval_re:.4f}')


CRS Wald test (FE): 19.4029
Critical value (5%): 3.8415
p-value: 0.0000
CRS Wald test (FD): 150.0280
Critical value (5%): 3.8415
p-value: 0.0000


In [18]:
# Estimate FD OLS using the first-differenced variables.
fd_result = lm.estimate(y_diff, x_diff, transform='fd', N=N, T=T-1, robust_se=True)

# Print results
lm.print_table((label_y, label_x_fd), fd_result, title='FD regression', floatfmt='.4f')

FD regression
Dependent variable: Log deflated sales

               Beta      Se    t-values
-----------  ------  ------  ----------
Log labour   0.5487  0.0292     18.8191
Log capital  0.0630  0.0232      2.7097
R² = 0.165
σ² = 0.014


You should get a table that looks like this:

FD regression <br>
Dependent variable: Log wage

|                |    Beta |     Se |   t-values |
|----------------|---------|--------|------------|
| Experience     |  0.1158 | 0.0196 |     5.9096 |
| Experience sqr | -0.0039 | 0.0014 |    -2.8005 |
| Union          |  0.0428 | 0.0197 |     2.1767 |
| Married        |  0.0381 | 0.0229 |     1.6633 |
R² = 0.004 <br>
σ² = 0.196

**NB:** Did you use the right standard errors? Did you use the right number of time periods in the estimate() function?

How big is the union premium according to the estimate from this model? Compare the FD estimate with the estimate that you calculated from the FE regression. Is there a difference? If yes, what (if anything) can we conclude based on this finding?

## Tests
### Test for serial correlation in the errors using an auxiliary AR(1) model
Tests assumption FD.3, where the errors $e_{it} = \Delta u_{it}$ should be serially uncorrelated.

We can easily test this assumption given the OLS residuals from the FD version of equation (1). Run the regression (note that you will lose data for
the first *two* periods)
\begin{equation}
\hat{e}_{it}=\rho\hat{e}_{it-1}+error_{it},\quad t=\color{red}{3},\dotsc,T,\quad i=1,\dotsc,N\tag{2}
\end{equation}

Do you find any evidence of serial correlation? Does FD.3 seem appropriate? And why don't we include an intercept in this auxiliary equation?

*Note:* Under FE.3, the idiosyncratic errors $u_{it}$ are serially uncorrelated. However, FE.3 then implies that the $e_{it}$'s are autocorrelated. In fact, if the $u_{it}$'s are serially uncorrelated to begin with, $\mathrm{corr}\left(e_{it},e_{it-1}\right)=-0.5$. (Check!) This test is of course only valid if the explanatory variables are strictly exogenous!
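The value $-0.5$ follows from $\mathrm{Cov}(u_{it}-u_{it-1},\,u_{it-1}-u_{it-2})=-\sigma_{u}^{2}$ and $\mathrm{Var}(u_{it}-u_{it-1})=2\sigma_{u}^{2}$ when the $u_{it}$ are uncorrelated with common variance $\sigma_{u}^{2}$. A small simulation, with hypothetical sample size and seed and normally distributed errors as an extra assumption, illustrates the same result numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Serially uncorrelated errors u_{i1}, u_{i2}, u_{i3} for many units.
u = rng.standard_normal((200_000, 3))

# First differences: column 0 is e_{i2}, column 1 is e_{i3}.
e = np.diff(u, axis=1)

# Sample correlation between e_{it} and e_{i,t-1}.
rho = np.corrcoef(e[:, 1], e[:, 0])[0, 1]
print(f'Simulated correlation: {rho:.3f}')
```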

*Hint:* You can use the `perm` function to lag the error term. Consider the following:

$$
{\begin{bmatrix}
1 & 0 & 0 & \cdots & 0 & 0\\
0 & 1 & 0 & \cdots & 0 & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
0 & 0 & 0 & \cdots & 1 & 0
\end{bmatrix}}_{(T-1)\times T}\times{\begin{bmatrix}y_{1}\\
y_{2}\\
\vdots\\
y_{T}
\end{bmatrix}}_{T \times 1}={\begin{bmatrix}y_{1}\\
y_{2}\\
\vdots\\
y_{T - 1}
\end{bmatrix}}_{(T - 1)\times 1}
$$

*Hint:* You can use the `perm` function to remove the first time period from the residuals. Consider the following:

$$
{\begin{bmatrix}
0 & 1 & 0 & \cdots & 0 & 0\\
0 & 0 & 1 & \cdots & 0 & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
0 & 0 & 0 & \cdots & 0 & 1
\end{bmatrix}}_{(T-1)\times T}\times{\begin{bmatrix}y_{1}\\
y_{2}\\
\vdots\\
y_{T}
\end{bmatrix}}_{T \times 1}={\begin{bmatrix}y_{2}\\
y_{3}\\
\vdots\\
y_{T}
\end{bmatrix}}_{(T - 1)\times 1}
$$
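The two selection matrices in the hints can be built by slicing an identity matrix. The sketch below does this for a single unit with hypothetical values, so the lagged and the current series line up row by row.

```python
import numpy as np

T_toy = 5
y_toy = np.arange(1.0, T_toy + 1).reshape(-1, 1)   # y_1, ..., y_5 = 1, ..., 5

# Drop the last row of the identity: keeps y_1, ..., y_{T-1} (the lagged values).
lag_matrix = np.eye(T_toy)[:-1]

# Drop the first row of the identity: keeps y_2, ..., y_T (the current values).
drop_first_matrix = np.eye(T_toy)[1:]

y_lagged = lag_matrix @ y_toy
y_current = drop_first_matrix @ y_toy

# Row t of y_current is paired with its own lag in row t of y_lagged.
print(np.hstack([y_current, y_lagged]))
```

Note that `np.eye(T_toy)[:-1]` is the same matrix as `np.eye(T_toy, k=-1)[1:]`, so either construction can be used for the lag.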

In [19]:
# Make function to calculate the serial correlation
def serial_corr(y, x, T):
    # Calculate the residuals
    b_hat = lm.est_ols(y, x)
    e = y - x@b_hat
    
    # Create a lag transformation matrix
    L_T = np.eye(T, k=-1)
    L_T = L_T[1:]

    # Lag residuals
    e_l = lm.perm(L_T, e)

    # Create a transformation matrix that removes the first observation of each individual
    I_T = np.eye(T, k=0)
    I_T = I_T[1:]
    
    # Remove first observation of each individual
    e = lm.perm(I_T, e)
    
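    # Calculate the serial correlation.
    # The number of units is inferred from e instead of relying on a global N.
    return lm.estimate(e, e_l, N=e.shape[0] // (T - 1), T=T - 1)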

In [20]:
# Estimate serial correlation
corr_result = serial_corr(y_diff, x_diff, T-1)

# Print results
label_ye = 'OLS residual, e\u1d62\u209c'
label_e = ['e\u1d62\u209c\u208B\u2081']
lm.print_table(
    (label_ye, label_e), corr_result, 
    title='Serial Correlation', floatfmt='.4f'
)

Serial Correlation
Dependent variable: OLS residual, eᵢₜ

          Beta      Se    t-values
-----  -------  ------  ----------
eᵢₜ₋₁  -0.1987  0.0148    -13.4493
R² = 0.039
σ² = 0.014


You should get a table that looks like this:

Serial Correlation <br>
Dependent variable: OLS residual, eᵢₜ

|       |    Beta |     Se |   t-values |
|-------|---------|--------|------------|
| eᵢₜ₋₁ | -0.3961 | 0.0147 |   -27.0185 |
R² = 0.182 <br>
σ² = 0.143

### Test for strict exogeneity

Add a lead of the union variable, $union_{i,t+1}$, to equation (1) (note that you will lose the data from period $T$, 1987) and estimate the model with *fixed effects* (i.e., you have to demean $union_{i,t+1}$ along with all the other variables and drop the time-constant variables). Is $union_{i,t+1}$ significant? What does this imply for the strict exogeneity assumption?

*Hint:* To lead a variable, think along the same lines as in the previous exercise.
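Leading works like lagging with the offset reversed. The sketch below, using hypothetical values for a single unit, builds a lead matrix with `np.eye(T, k=1)` and a matching matrix that drops the last period, so that each row pairs $z_{t}$ with $z_{t+1}$.

```python
import numpy as np

T_toy = 4
z = np.array([[10.0], [20.0], [30.0], [40.0]])   # z_1, ..., z_4

# Ones on the first superdiagonal pick z_{t+1}; the last row has no lead and is dropped.
lead_matrix = np.eye(T_toy, k=1)[:-1]

# Matching matrix for the contemporaneous values: drop the last period.
keep_matrix = np.eye(T_toy)[:-1]

z_lead = lead_matrix @ z   # z_2, z_3, z_4
z_now = keep_matrix @ z    # z_1, z_2, z_3

print(np.hstack([z_now, z_lead]))
```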

In [21]:
# Lead log labour
F_T = np.eye(T, k=1)[:-1]
lemp_lead = lm.perm(F_T, x[:, 1].reshape(-1, 1))


In [22]:
# Remove the last observed year for every individual
I_T = np.eye(T, k=0)
I_T = I_T[:-1]

x_exo = lm.perm(I_T, x)
y_exo = lm.perm(I_T, y)

In [23]:
# Add lemp_lead to x_exo
x_exo = np.hstack((x_exo, lemp_lead))

# Within transform the data
Q_T = demeaning_matrix(T - 1)
yw_exo = lm.perm(Q_T, y_exo)
xw_exo = lm.perm(Q_T, x_exo)

# Drop the demeaned constant and keep labour, capital, lead labour
xw_exo = xw_exo[:, 1:]


In [24]:
# Estimate model
exo_test = lm.estimate(yw_exo, xw_exo, N=N, T=T - 1, transform='fe', robust_se=True)

# Print results
label_exo = label_x_fe + ['Lead log labour']
lm.print_table((label_y, label_exo), exo_test, title='Exogeneity test', floatfmt='.4f')


Exogeneity test
Dependent variable: Log deflated sales

                   Beta      Se    t-values
---------------  ------  ------  ----------
Log labour       0.5681  0.0397     14.3113
Log capital      0.1495  0.0291      5.1287
Lead log labour  0.1532  0.0281      5.4442
R² = 0.473
σ² = 0.016
