# Week 5: High-Dimensional Methods and Confidence Intervals

The purpose of this week's problem set is to get familiar with inference based on high-dimensional methods. Our focus is again on Lasso-based methods, and we again use the <tt>housing.csv</tt> dataset. (See the previous problem set for data details.) Note that our focus has shifted from prediction (of house prices) to inference (on the drivers of house prices).

We first read the data into Python and remove missings.

In [1]:
# Load packages
import numpy as np
import numpy.linalg as la
import pandas as pd
from sklearn.linear_model import Lasso
from scipy.stats import norm
from sklearn.preprocessing import PolynomialFeatures

# Read data
housing = pd.read_csv("housing.csv")
housing=housing.dropna() # dropping observations missing a bedroom count 
print("The data has shape (rows, columns) = {}".format(housing.shape)) # data dimensions

The data has shape (rows, columns) = (20433, 10)


In [2]:
print("Columns names are \n {}".format(housing.columns))

Columns names are 
 Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')


In [3]:
print(housing.head()) # first observations

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  
0       322.0       126.0         8.3252            452600.0        NEAR BAY  
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  
2       496.0       177.0         7.2574            352100.0        NEAR BAY  
3       558.0       219.0         5.6431            341300.0        NEAR BAY  
4       565.0       259.0         3.8462            342200.0        NEAR BAY  


In [4]:
print(housing.tail()) # last observations

       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
20635    -121.09     39.48                25.0       1665.0           374.0   
20636    -121.21     39.49                18.0        697.0           150.0   
20637    -121.22     39.43                17.0       2254.0           485.0   
20638    -121.32     39.43                18.0       1860.0           409.0   
20639    -121.24     39.37                16.0       2785.0           616.0   

       population  households  median_income  median_house_value  \
20635       845.0       330.0         1.5603             78100.0   
20636       356.0       114.0         2.5568             77100.0   
20637      1007.0       433.0         1.7000             92300.0   
20638       741.0       349.0         1.8672             84700.0   
20639      1387.0       530.0         2.3886             89400.0   

      ocean_proximity  
20635          INLAND  
20636          INLAND  
20637          INLAND  
20638          INLAND  
20639          INLAND  

In [5]:
print(housing.dtypes) # data types

longitude             float64
latitude              float64
housing_median_age    float64
total_rooms           float64
total_bedrooms        float64
population            float64
households            float64
median_income         float64
median_house_value    float64
ocean_proximity        object
dtype: object


We model house prices (<tt>median_house_value</tt>) using a linear (in the parameters) model of the basic regressors (minus the categorical variable <tt>ocean_proximity</tt>). 

$$
\underbrace{\mathtt{median\,house\,value}}_{=Y}= \alpha\times\underbrace{\mathtt{median\,income}}_{=D} + Z'\gamma + \varepsilon,\quad\mathrm{E}[\varepsilon|D,Z]=0.
$$

Note that Z should contain a constant, but the Lasso implementation in sklearn adds one automatically.

We here focus on constructing a confidence interval for the coefficient of <tt>median_income</tt> after having used the Lasso. In doing so we treat both <tt>median_income</tt> and the remaining ($p=7$) controls as exogenous. Moreover, we augment the above model with another linear model

$$
\mathtt{median\,income}=Z'\psi + \nu,\quad\mathrm{E}[\nu|Z]=0,
$$

now for <tt>median_income</tt>.

(One would be hard pressed to claim that median income *causes* house price movements. This is only an exercise in the mechanics.)

# Exercises

## Part 1: Prepare data
Use the seven basic controls ($Z_1,\dotsc,Z_p$) and add all control quadratics ($Z_1^2,\dotsc,Z_p^2$), cubics ($Z_1^3,\dotsc,Z_p^3$), first-order interactions ($Z_1Z_2,Z_1Z_3,\dotsc,Z_{p-1}Z_{p}$), and second-order interactions ($Z_1Z_1Z_2,Z_1Z_1Z_3,\dotsc,Z_{p}Z_{p}Z_{p-1}$, as well as triple products such as $Z_1Z_2Z_3$). 

Hints: Use <tt>sklearn.preprocessing.PolynomialFeatures</tt> for simple transformation. Your optimizer may not converge. Consider increasing the maximum number of iterations using the Lasso option <tt>max_iter=</tt>[your number].

### Question 1.1
Set up the data and add all control quadratics, cubics, first-order interactions, and second-order interactions. Don't include a constant; this is done automatically by the Lasso implementation in Python. How many regressors do you have now?

In [6]:
# Setup data
y = housing.median_house_value
d = housing.median_income
Z_basic = housing.drop(["median_house_value","median_income","ocean_proximity"],axis=1)

# Add polynomial features
# Hint: remember, you don't want the constant
Z = PolynomialFeatures(3, include_bias=False).fit_transform(Z_basic)

# Display number of regressors
print("The number of regressors in Z is {}".format(Z.shape[1]))

The number of regressors in Z is 119


    You should get: The number of regressors in Z is 119
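The count of 119 can be cross-checked without touching the data. The sketch below enumerates every monomial of total degree 1 to 3 in $p=7$ variables, which is what a degree-3 polynomial expansion without a constant produces. The variable names are illustrative only.

```python
from itertools import combinations_with_replacement
from math import comb

p = 7        # basic controls in Z_basic
degree = 3   # highest total degree of the polynomial expansion

# Enumerate every monomial of total degree 1..3 as a multiset of column indices,
# e.g. (0, 0, 1) stands for Z_1 * Z_1 * Z_2.
terms = [
    combo
    for k in range(1, degree + 1)
    for combo in combinations_with_replacement(range(p), k)
]

# Closed form: all monomials of degree <= 3 in p variables, minus the constant
closed_form = comb(p + degree, degree) - 1

print(len(terms), closed_form)   # 119 119
```

The 119 terms split into 7 linear terms, 28 terms of degree two, and 84 terms of degree three.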

In [7]:
# Construct X 
X = np.column_stack((d,Z))

# Find N
N = X.shape[0]

### Question 1.2
Standardize variables before running the Lasso.

*Note:* Make sure to apply a degrees-of-freedom correction when computing the standard deviations. Pandas does this automatically, but if you use numpy, you should set the argument ddof=1 in the function np.std().
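A minimal sketch of what the correction changes, using a small made-up vector rather than the housing data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

sd_population = np.std(x)        # divides by N (numpy default, ddof=0)
sd_sample = np.std(x, ddof=1)    # divides by N-1 (the pandas default)

# Standardize with the sample standard deviation
x_stan = (x - x.mean()) / sd_sample

print(sd_population, sd_sample)
print(x_stan.mean(), np.std(x_stan, ddof=1))
```

With 20,433 observations the two versions differ only marginally, but using ddof=1 keeps numpy and pandas results aligned.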

In [8]:
# Create a function for standardizing
def standardize(X):

    X_stan = (X - np.mean(X, axis=0))/np.std(X, axis=0, ddof=1)
    return X_stan

# Standardize data
X_stan = standardize(X)
Z_stan = standardize(Z)
d_stan = standardize(d)

## Part 2: OLS

Results can differ slightly across Python and package versions. Your results for Exercise 3 should be correct to 3 significant figures.
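One plausible source of such differences (an assumption, not something verified on <tt>housing.csv</tt>) is that raw-scale cubic features make $X'X$ badly conditioned, so an explicit inverse is sensitive to rounding. The sketch below uses a hypothetical single regressor and noise-free outcome to compare the explicit inverse with a least-squares solver.

```python
import numpy as np

# Hypothetical regressor on a raw scale with its square and cube
x = np.linspace(1.0, 100.0, 200)
X = np.column_stack((np.ones_like(x), x, x**2, x**3))
beta_true = np.array([1.0, 0.5, -2e-3, 1e-6])
y = X @ beta_true  # noise-free outcome, so both solvers target beta_true

# Condition number of X; forming X'X roughly squares it
cond_X = np.linalg.cond(X)

# Normal equations with an explicit inverse
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Least-squares solver that avoids forming X'X
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("cond(X) =", cond_X)
print("max abs error, inverse:", np.max(np.abs(beta_inv - beta_true)))
print("max abs error, lstsq:  ", np.max(np.abs(beta_lstsq - beta_true)))
```

Whatever the exact numbers on a given machine, small discrepancies in the last digits of the OLS estimate are to be expected in a design like this.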

### Question 2.1
Estimate $\alpha$ using Ordinary Least Squares (OLS). Remember to add a constant to the regressors for this part.

In [9]:
# Add a constant to X
xx = np.column_stack((np.ones(N),X))

# Reshape y
yy = np.array(y).reshape(-1,1)

# Calculate OLS estimate
coefs_OLS = la.inv(xx.T@xx)@xx.T@yy
alpha_OLS = coefs_OLS[1][0]

# Calculate residuals
res_OLS = yy - xx@coefs_OLS

# Display alpha
print("alpha_OLS = ",alpha_OLS.round(2))

alpha_OLS =  37149.54


#### Hint: We are doing OLS not Lasso

    You should get: alpha_OLS =  37143.8

### Question 2.2

Estimate the variance of the OLS estimator and calculate the standard deviation of $\hat{\alpha}$. For this exercise we will assume homoscedasticity.
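
Under homoscedasticity, the requested quantities take the familiar form below. As a sketch, $X$ denotes the regressor matrix including the constant, $K$ its number of columns, and $\hat{\beta}$ the full OLS coefficient vector (notation assumed here for illustration):

$$
\hat{\sigma}^2=\frac{\sum_i \hat{\varepsilon}_i^2}{N-K},\qquad
\widehat{\mathrm{Var}}(\hat{\beta})=\hat{\sigma}^2 (X'X)^{-1},\qquad
\mathrm{se}(\hat{\alpha})=\sqrt{\big[\widehat{\mathrm{Var}}(\hat{\beta})\big]_{DD}},
$$

where the subscript $DD$ picks the diagonal entry belonging to <tt>median_income</tt>.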

In [10]:
# Estimate variance
SSR = res_OLS.T@res_OLS
sigma2_OLS = SSR/(N-xx.shape[1])
var = sigma2_OLS*la.inv(xx.T@xx)

# Calculate standard errors
se = np.sqrt(np.diagonal(var)).reshape(-1,1)

# Get standard error of alpha
se_OLS = se[1][0]

# Display standard error
print("se_OLS = ",se_OLS.round(2))


se_OLS =  394.33


    You should get:  se_OLS =  394.41

### Question 2.3 

Calculate the 95% confidence interval for $\hat{\alpha}$.

*Hint:* Use scipy.stats.norm.ppf to find quantiles of the normal distribution.
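
As a quick illustration of the quantile step, the sketch below uses the standard library's `statistics.NormalDist`, which gives the same 97.5% quantile as `scipy.stats.norm.ppf(0.975)`. The estimate and standard error are the rounded values printed by the OLS cells above.

```python
from statistics import NormalDist

# Point estimate and standard error as printed by the OLS cells in this notebook
alpha_hat = 37149.54
se_hat = 394.33

# 97.5% quantile of N(0,1), i.e. the critical value of a two-sided 95% interval
q = NormalDist().inv_cdf(1 - 0.025)

ci_low = alpha_hat - q * se_hat
ci_high = alpha_hat + q * se_hat

print(round(q, 4))                          # 1.96
print(round(ci_low, 2), round(ci_high, 2))  # 36376.67 37922.41
```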

In [11]:
# Calculate the quantile of the standard normal distribution that corresponds to the 95% confidence interval of a two-sided test
q = norm.ppf(1-0.025)

# Calculate confidence interval
CI_low_OLS  = alpha_OLS-q*se_OLS
CI_high_OLS = alpha_OLS+q*se_OLS

# Display confidence interval
CI_OLS = (CI_low_OLS.round(2), CI_high_OLS.round(2))
print("CI_OLS = ",CI_OLS)

CI_OLS =  (36376.67, 37922.41)


    You should get:  CI_OLS =  (36370.76, 37916.84)

## Part 3: Post-Single Lasso

### Question 3.1
Estimate $\alpha$ using Post-Single Lasso (PSL).

Step 0: Calculate BRT

In [12]:
# Make a function that calculates BRT. Hint: You implemented a version of this last week
def BRT(X_tilde,y):
    (N,p) = X_tilde.shape
    sigma = np.std(y, ddof=1)
    c=1.1
    alpha=0.05

    penalty_BRT= (sigma*c)/np.sqrt(N)*norm.ppf(1-alpha/(2*p))

    return penalty_BRT

In [13]:
# Calculate BRT
penalty_BRTyx = BRT(X_stan, y)
print("lambda_BRT =",penalty_BRTyx.round(2))

lambda_BRT = 3135.12


    You should get:  lambda_BRT = 3135.12
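
The <tt>BRT</tt> function above evaluates $\lambda = c\,\hat{\sigma}\,N^{-1/2}\,\Phi^{-1}(1-\alpha/(2p))$ with $c=1.1$ and $\alpha=0.05$. This matches the scale of sklearn's Lasso objective, $\frac{1}{2N}\lVert y - Xb\rVert_2^2 + \lambda\lVert b\rVert_1$. The sketch below restates the formula with scalar inputs so the role of $p$ is easy to see; the helper name `brt_penalty` and the value $\hat{\sigma}=1$ are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def brt_penalty(sigma, n, p, c=1.1, alpha=0.05):
    """BRT penalty level on the scale used by the BRT function above."""
    return c * sigma / sqrt(n) * NormalDist().inv_cdf(1 - alpha / (2 * p))

# Illustrative inputs (sigma=1 is hypothetical); n matches housing.shape[0]
n = 20433
for p in (1, 10, 119, 120):
    print(p, round(brt_penalty(1.0, n, p), 5))
```

Because $p$ enters only through a normal quantile, the penalty grows slowly in the number of regressors.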

Step 1: Lasso Y using D and Z. Collect variables in Z with non-zero coefficients in a set called Z_J.

*Hint:* Set max_iter=10_000 to make the Lasso converge.

In [14]:
# Run Lasso 
fit_BRTyx = Lasso(penalty_BRTyx, max_iter=10000).fit(X_stan,y)
coefs=fit_BRTyx.coef_

# Save variables where coefficients are not zero
Z_J = Z[:,coefs[1:]!=0] # Note: We use Z and not Z_stan

# Display number of variables in Z_J
print("The number of variables in Z_J is {}".format(Z_J.shape[1]))

The number of variables in Z_J is 8


    You should get: The number of variables in Z_J is 8
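
The selection step relies on boolean masking. The sketch below uses made-up coefficients to show why the first entry is dropped before masking: it belongs to D, while the remaining entries line up with the columns of Z.

```python
import numpy as np

# Hypothetical Lasso coefficients: first entry belongs to D, the rest to Z
coefs = np.array([2.0, 0.0, 1.5, 0.0, -0.3])
Z = np.arange(12.0).reshape(3, 4)   # 3 observations, 4 controls

selected = coefs[1:] != 0           # drop D's coefficient before masking Z
Z_J = Z[:, selected]

print(selected)      # [False  True False  True]
print(Z_J.shape)     # (3, 2)
```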

Step 2: Regress Y using D and Z_J

In [15]:
# Add a constant to X
xx = np.column_stack((np.ones(N),d,Z_J))
yy = np.array(y).reshape(-1,1)

# Calculate OLS estimate
coefs_PSL = la.inv(xx.T@xx)@xx.T@yy
alpha_PSL = coefs_PSL[1][0]

# Calculate residuals
res_PSL = yy - xx@coefs_PSL

# Display alpha
print("alpha_PSL = ",alpha_PSL.round(2))

alpha_PSL =  38147.07


    You should get: alpha_PSL =  38147.07

### Question 3.2

Estimate the variance of the second step OLS estimator and calculate the standard deviation of $\tilde{\alpha}$.

In [16]:
# Estimate variance
SSR = res_PSL.T@res_PSL
sigma2_PSL = SSR/(N-xx.shape[1])
var = sigma2_PSL*la.inv(xx.T@xx)

# Calculate standard errors
se = np.sqrt(np.diagonal(var)).reshape(-1, 1)
se_PSL=se[1][0]

# Display standard error
print("se_PSL = ",se_PSL.round(2))


se_PSL =  268.92


    You should get: se_PSL =  268.92

### Question 3.3 

Calculate the 95% confidence interval for $\tilde{\alpha}$.

In [17]:
# Calculate the z statistic that corresponds to the 95% confidence interval of a two-sided test
q = norm.ppf(1-0.025)

# Calculate confidence interval
CI_low_PSL  = alpha_PSL-q*se_PSL
CI_high_PSL = alpha_PSL+q*se_PSL

# Display confidence interval
CI_PSL = (CI_low_PSL.round(2), CI_high_PSL.round(2))
print("CI_PSL = ",CI_PSL)

CI_PSL =  (37620.01, 38674.14)


    You should get: CI_PSL =  (37620.01, 38674.14)

## Part 4: Post Double Lasso

### Question 4.1
Estimate $\alpha$ using Post Double Lasso (PDL).

Step 0: Calculate BRT

*Note:* In this exercise we will use the penalty suggested by BRT. BRT relies on homoscedasticity which is a strong assumption.

In [18]:
# Calculate BRT
penalty_BRTyx = BRT(X_stan,y)
print("lambda_BRT =",penalty_BRTyx.round(2))

lambda_BRT = 3135.12


    You should get: lambda_BRT = 3135.12

Step 1: Lasso Y using D and Z

*Hint:* To calculate the residuals from the LASSO-regression you can use the predict method from the Lasso object. The predict method returns the predicted values from the LASSO regression. You can then calculate the residuals by subtracting the predicted values from the actual values. 

In [19]:
# Run Lasso 
fit_BRTyx = Lasso(penalty_BRTyx, max_iter=10000).fit(X_stan, y)
coefs=fit_BRTyx.coef_

# Calculate residuals
resyx = y-fit_BRTyx.predict(X_stan)

# Calculate Y - Z@gamma (epsilon + alpha*d)
# Hint: You only need the variables given to you in this cell, in addition
# to a standardized data set you made previously.
resyxz = resyx + d_stan*coefs[0]

# Display first coefficient
print("First coefficient =",coefs[0].round(2))

First coefficient = 74248.24


    You should get: First coefficient = 74248.24
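
The line that builds <tt>resyxz</tt> adds D's fitted contribution back to the full Lasso residual. The sketch below checks this with a hypothetical fitted model on random data: the result equals the outcome net of the intercept and the controls only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
d_stan = rng.standard_normal(n)
Z_stan = rng.standard_normal((n, 3))
y = rng.standard_normal(n)

# Hypothetical fitted Lasso: intercept, coefficient on d, coefficients on Z
intercept, a_hat, gamma_hat = 0.4, 1.5, np.array([0.0, -0.7, 0.2])

fitted = intercept + a_hat * d_stan + Z_stan @ gamma_hat   # what .predict returns
resyx = y - fitted                                         # full residual
resyxz = resyx + a_hat * d_stan                            # add D's part back

# Same thing computed directly: outcome net of intercept and controls only
direct = y - intercept - Z_stan @ gamma_hat
print(np.allclose(resyxz, direct))   # True
```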

Step 2: Lasso D using Z

In [20]:
# Calculate BRT
penalty_BRTdz = BRT(Z_stan, d)

In [21]:
# Run Lasso
fit_BRTdz = Lasso(penalty_BRTdz, max_iter=10000).fit(Z_stan, d)
coefs=fit_BRTdz.coef_

# Calculate residuals
resdz=d-fit_BRTdz.predict(Z_stan)

# Display first coefficient
print("First coefficient =",coefs[0].round(2))

First coefficient = -0.55


    You should get: First coefficient = -0.55

Step 3: Estimate alpha

In [22]:
# Calculate alpha
num = resdz@resyxz
denom = resdz@d
alpha_PDL = num/denom

# Display alpha
print("alpha_PDL = ",alpha_PDL.round(2))

alpha_PDL =  40788.63


    You should get: alpha_PDL =  40788.63
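
In the notation of the two model equations, the cell above can be read as computing

$$
\check{\alpha}=\frac{\sum_i \hat{\nu}_i\,(Y_i - Z_i'\hat{\gamma})}{\sum_i \hat{\nu}_i D_i},
$$

where $\hat{\nu}_i$ are the residuals from the Lasso of D on Z (<tt>resdz</tt>) and $Y_i - Z_i'\hat{\gamma}$ is the outcome net of the fitted intercept and controls from the Lasso of Y on D and Z (<tt>resyxz</tt>).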

### Question 4.2
Calculate the implied variance estimate, $\check{\sigma}^2$, and calculate the standard deviation of $\check{\alpha}$.
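
The solution cell below uses the form $\check{\sigma}^2 = \frac{N^{-1}\sum_i \hat{\nu}_i^2\hat{\varepsilon}_i^2}{(N^{-1}\sum_i \hat{\nu}_i^2)^2}$, with $\hat{\varepsilon}_i$ the residual from the Lasso of Y on D and Z. As a sketch on simulated residuals, with the hypothetical helper name `pdl_variance`:

```python
import numpy as np

def pdl_variance(res_d, res_y):
    """Variance estimate: mean(res_d^2 * res_y^2) / mean(res_d^2)^2."""
    n = res_d.shape[0]
    num = np.sum(res_d**2 * res_y**2) / n
    denom = (np.sum(res_d**2) / n) ** 2
    return num / denom

# Hypothetical residuals: nu plays the role of resdz, eps the role of resyx
rng = np.random.default_rng(1)
n = 10_000
nu = rng.standard_normal(n)
eps = 2.0 * rng.standard_normal(n)   # independent of nu, variance 4

sigma2 = pdl_variance(nu, eps)
se = np.sqrt(sigma2 / n)             # standard error of the estimate of alpha
print(sigma2, se)
```

With independent residuals of this kind, the estimate should land near $4$, the variance of the simulated $\varepsilon$.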

In [23]:
print(resdz)
print(resyxz)

0        4.202866
1        2.192354
2        2.468338
3        1.736934
4       -0.491862
           ...   
20635   -1.308216
20636   -0.673057
20637   -1.432693
20638   -1.295756
20639   -0.635164
Name: median_income, Length: 20433, dtype: float64
0        245315.420670
1        135649.102591
2        129070.921505
3        110644.275816
4        101898.049265
             ...      
20635    -66141.944258
20636    -59714.968561
20637    -51759.065020
20638    -63054.102000
20639    -55439.748102
Length: 20433, dtype: float64


In [24]:
# Calculate variance    
num = resdz**2@resyx**2/N
denom = (resdz.T@resdz/N)**2
sigma2_PDL = num/denom

# Display variance
print("sigma2_PDL = ",sigma2_PDL.round(2))

sigma2_PDL =  4557181789.27


    You should get: sigma2_PDL =  4557181789.27

In [25]:
# Calculate standard error
se_PDL = np.sqrt(sigma2_PDL/N)

# Display standard error
print("se_PDL = ",se_PDL.round(2))

se_PDL =  472.26


    You should get: se_PDL =  472.26

### Question 4.3
Calculate the confidence interval for $\check{\alpha}$.

In [26]:
# Calculate the quantile of the standard normal distribution that corresponds to the 95% confidence interval of a two-sided test
q = norm.ppf(1-0.025)

# Calculate confidence interval
CI_low_PDL  = alpha_PDL - q * se_PDL
CI_high_PDL = alpha_PDL + q * se_PDL

# Display confidence interval
print("CI_PDL = ",(CI_low_PDL.round(2),CI_high_PDL.round(2)))

CI_PDL =  (39863.01, 41714.24)


    You should get: CI_PDL =  (39863.01, 41714.24)

### Question 4.4
Compare OLS, PSL and PDL. 
- Which estimator do you believe the most? 
- Does the dimensionality of the problem affect your answer?

In [27]:
# Create a dictionary with the results
results = {'OLS': [alpha_OLS, se_OLS, CI_low_OLS, CI_high_OLS], 
           'PSL': [alpha_PSL, se_PSL, CI_low_PSL, CI_high_PSL],
           'PDL': [alpha_PDL, se_PDL, CI_low_PDL, CI_high_PDL]}

# Create a dataframe from the dictionary
df_results = pd.DataFrame.from_dict(results, orient='index', columns=['Estimate of alpha', 'Standard error', 'Low bound of CI', 'High bound of CI'])

# Format the dataframe to two digits after the comma
df_results = df_results.round(2)

# Display the dataframe
df_results


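|      | Estimate of alpha | Standard error | Low bound of CI | High bound of CI |
|-----:|------------------:|---------------:|----------------:|-----------------:|
|  OLS |          37149.54 |         394.33 |        36376.67 |         37922.41 |
|  PSL |          38147.07 |         268.92 |        37620.01 |         38674.14 |
|  PDL |          40788.63 |         472.26 |        39863.01 |         41714.24 |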


    You should get:
|      | Estimate of alpha | Standard error | Low bound of CI | High bound of CI |
|-----:|------------------:|---------------:|----------------:|-----------------:|
|  OLS |          37143.80 |         394.41 |        36370.76 |         37916.84 |
|  PSL |          38147.07 |         268.92 |        37620.01 |         38674.14 |
|  PDL |          40788.63 |         472.26 |        39863.01 |         41714.24 |

## Part 5: Post Partialling Out Lasso

An alternative to Post Double Lasso is Post Partialling Out Lasso (PPOL). PPOL is based on another orthogonalized moment condition, which is asymptotically first order equivalent to the one used in Post Double Lasso,

$$
\mathrm{E}[(D - Z'\psi_0) ([Y - Z'\delta_0] - \alpha_0[D - Z' \psi_0])] = 0.
$$
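
To see where this moment condition comes from, substitute the model for <tt>median_income</tt> into the model for the outcome. This is a sketch using the two model equations from the introduction, with $\delta_0$ defined as below:

$$
Y = \alpha_0 (Z'\psi_0 + \nu) + Z'\gamma_0 + \varepsilon = Z'\underbrace{(\alpha_0\psi_0 + \gamma_0)}_{=\delta_0} + \alpha_0\nu + \varepsilon
\quad\Longrightarrow\quad
[Y - Z'\delta_0] - \alpha_0[D - Z'\psi_0] = \varepsilon .
$$

The moment condition therefore states $\mathrm{E}[\nu\varepsilon]=0$, which follows from $\mathrm{E}[\varepsilon|D,Z]=0$ because $\nu = D - Z'\psi_0$ is a function of $(D,Z)$.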

The PPOL estimator of $\alpha_0$ can be found by applying the following 3 steps:
1. Lasso Y using Z to get residuals $\hat{\zeta} = Y - Z' \hat{\delta}$
2. Lasso D using Z to get residuals $\hat{\nu} = D - Z' \hat{\psi}$
3. OLS of $\hat{\zeta}$ on $\hat{\nu}$ to get $\breve{\alpha} = \frac{\sum_i \hat{\nu}_i \hat{\zeta}_i}{\sum_i \hat{\nu}_i^2}$
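
Step 3 is a regression without intercept of one residual vector on the other. A minimal sketch on simulated residuals, where the helper name `ppol_alpha` and the data are hypothetical:

```python
import numpy as np

def ppol_alpha(res_y, res_d):
    """OLS slope (no intercept) of res_y on res_d."""
    return (res_d @ res_y) / (res_d @ res_d)

# Hypothetical residuals standing in for zeta-hat and nu-hat
rng = np.random.default_rng(2)
n = 5_000
nu_hat = rng.standard_normal(n)
zeta_hat = 3.0 * nu_hat + rng.standard_normal(n)   # true slope is 3

print(ppol_alpha(zeta_hat, nu_hat))
```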


### Question 5.1
Estimate $\alpha$ using Post Partialling Out Lasso (PPOL).

Step 1: Lasso Y using Z

In [28]:
penalty_BRTyz = BRT(Z_stan, y)
print("lambda_BRT =",penalty_BRTyz.round(2))

lambda_BRT = 3133.16


    You should get: lambda_BRT = 3133.16

In [29]:
# Run Lasso
fit_BRTyz = Lasso(penalty_BRTyz, max_iter=10000).fit(Z_stan,y)
coefs=fit_BRTyz.coef_ # coefficients of the Y-on-Z fit (fit_BRTdz belongs to the D-on-Z Lasso)

# Calculate residuals
resyz = y-fit_BRTyz.predict(Z_stan)

# Display first coefficient
print("First coefficient =",coefs[0].round(2))

Step 2: Lasso D using Z

In [30]:
penalty_BRTdz = BRT(Z_stan, d)
print("lambda_BRT =",penalty_BRTdz.round(2))

lambda_BRT = 0.05


        You should get: lambda_BRT = 0.05

In [31]:
# Run Lasso
fit_BRTdz = Lasso(penalty_BRTdz, max_iter=10000).fit(Z_stan,d)
coefs=fit_BRTdz.coef_

# Calculate residuals
resdz = d-fit_BRTdz.predict(Z_stan)

# Display first coefficient
print("First coefficient =",coefs[0].round(2))

First coefficient = -0.55


        You should get: First coefficient = -0.55

Step 3: Estimate alpha

In [32]:
# Calculate alpha
num = resdz.T@resyz
denom = resdz.T@resdz
alpha_PPOL = num/denom

# Display alpha
print("alpha_PPOL = ",alpha_PPOL.round(2))

alpha_PPOL =  41175.15


        You should get: alpha_PPOL =  41175.15

### Question 5.2

The variance of the PPOL estimator is given by

$$
\breve{\sigma}^2 = \frac{N^{-1}\sum_i \hat{\zeta}_i^2 \hat{\nu}_i^2}{(N^{-1}\sum_i \hat{\nu}_i^2)^2}
$$

where it can be shown that 
$$
\sqrt{N} (\breve{\alpha} - \alpha_0)/\breve{\sigma} \xrightarrow{d} N(0,1)
$$

Calculate the implied variance estimate, $\breve{\sigma}^2$, and calculate the standard deviation of $\breve{\alpha}$.

In [33]:
# Calculate variance    
num = resyz**2 @ resdz**2 / N
denom = (resdz.T@resdz / N)**2
sigma2_PPOL = num/denom

# Display variance
print("sigma2_PPOL = ",sigma2_PPOL.round(2))

sigma2_PPOL =  15304055350.41


        You should get: sigma2_PPOL =  15304055350.41

In [34]:
# Calculate standard error
se_PPOL = np.sqrt(sigma2_PPOL/N)

# Display standard error
print("se_PPOL = ",se_PPOL.round(2))

se_PPOL =  865.44


        You should get: se_PPOL =  865.44

### Question 5.3
Calculate the confidence interval for $\breve{\alpha}$.

In [35]:
# Calculate the quantile of the standard normal distribution that corresponds to the 95% confidence interval of a two-sided test
q = norm.ppf(1-0.025)

# Calculate confidence interval
CI_low_PPOL  = alpha_PPOL - q * se_PPOL
CI_high_PPOL = alpha_PPOL + q * se_PPOL

# Display confidence interval
print("CI_PPOL = ",(CI_low_PPOL.round(2),CI_high_PPOL.round(2)))

CI_PPOL =  (39478.92, 42871.38)


        You should get: CI_PPOL =  (39478.92, 42871.38)

### Question 5.4
Compare OLS, PSL, PDL and PPOL.

In [36]:
# Create a dictionary with the results
results = {'OLS'   : [alpha_OLS,    se_OLS,    CI_low_OLS,    CI_high_OLS], 
           'PSL'   : [alpha_PSL,    se_PSL,    CI_low_PSL,    CI_high_PSL],
           'PDL'   : [alpha_PDL,    se_PDL,    CI_low_PDL,    CI_high_PDL],
           'PPOL'  : [alpha_PPOL,   se_PPOL,   CI_low_PPOL,   CI_high_PPOL]}

# Create a dataframe from the dictionary
df_results = pd.DataFrame.from_dict(results, orient='index', columns=['Estimate of alpha', 'Standard error', 'Low bound of CI', 'High bound of CI'])

# Format the dataframe to two digits after the comma
df_results = df_results.round(2)

# Display the dataframe
df_results


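|      | Estimate of alpha | Standard error | Low bound of CI | High bound of CI |
|-----:|------------------:|---------------:|----------------:|-----------------:|
|  OLS |          37149.54 |         394.33 |        36376.67 |         37922.41 |
|  PSL |          38147.07 |         268.92 |        37620.01 |         38674.14 |
|  PDL |          40788.63 |         472.26 |        39863.01 |         41714.24 |
| PPOL |          41175.15 |         865.44 |        39478.92 |         42871.38 |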


    You should get:
|      | Estimate of alpha | Standard error | Low bound of CI | High bound of CI |
|-----:|------------------:|---------------:|----------------:|-----------------:|
|  OLS |          37138.55 |         394.31 |        36365.72 |         37911.38 |
|  PSL |          38147.07 |         268.92 |        37620.01 |         38674.14 |
|  PDL |          40788.63 |         472.26 |        39863.01 |         41714.24 |
| PPOL |          41175.15 |         865.44 |        39478.92 |         42871.38 |

Why are the PDL and PPOL estimates not identical? 

## (Optional) Part 6: Repeat with BCCH and CV

* Repeat the exercises using the Belloni-Chen-Chernozhukov-Hansen (BCCH) penalty level for each Lasso (which may be justified without any independence/homoscedasticity assumptions).
* Repeat the exercises using cross-validation (CV) for each Lasso.
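
As a starting point for the BCCH variant, the sketch below follows one common two-step formulation. It is assumed here, since this problem set does not spell the formula out. It replaces $\hat{\sigma}$ in BRT with the data-driven scale $\max_j \sqrt{N^{-1}\sum_i r_i^2 x_{ij}^2}$, first with $r_i = y_i - \bar{y}$ to get a pilot penalty, and then with the residuals from the pilot Lasso. The helper name `bcch_penalty`, its arguments, and the simulated data are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def bcch_penalty(X_tilde, resid, c=1.1, alpha=0.05):
    """Penalty with scale max_j sqrt(mean(resid^2 * x_j^2)) (assumed BCCH form)."""
    n, p = X_tilde.shape
    scale = np.sqrt(np.max((X_tilde**2).T @ (resid**2) / n))
    return c / np.sqrt(n) * NormalDist().inv_cdf(1 - alpha / (2 * p)) * scale

# Simulated, standardized regressors and outcome (illustration only)
rng = np.random.default_rng(3)
X = rng.standard_normal((500, 4))
X_tilde = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
y = X_tilde[:, 0] + rng.standard_normal(500)

# Pilot penalty based on the demeaned outcome
penalty_pilot = bcch_penalty(X_tilde, y - y.mean())
print(penalty_pilot)

# Next (not run here): fit a Lasso with the pilot penalty, take its residuals,
# and call bcch_penalty(X_tilde, residuals) to get the final penalty.
```

When the squared residuals are unrelated to the squared standardized regressors, the scale is close to the residual standard deviation and the expression reduces to the BRT form. For the CV variant, sklearn's `LassoCV` selects the penalty by cross-validation and exposes it as `alpha_` after fitting.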