# PSet 2 - Question 3 (Quantile regression)

Dataset consumer.csv stores information on a representative sample of American consumers in January, 2013.

**a** _Suppose you are interested in understanding whether consumers have an understanding of the Taylor principle in monetary policy. You then decide to run the following model._

$$PX1 = \gamma_0 + \gamma_1 RATEX + \gamma_2 UNEMP + \beta'X + u$$

_where PX1 is the expected inflation rate in the subsequent 12 months, RATEX equals 1 if the individual thinks interest rates will be higher in 12 months, 0 if they will remain the same and -1 if they will go down 1 ; and UNEMP follows the {1, 0, −1} pattern for expectations over the change in the unemployment rate (1 if unemployment is expected to be higher). We also include controls of age (AGE), gender (SEX), family size (FAMSIZE) and the log of income (INCOME)._

_Estimate the above model. Report your results (use robust standard errors) and comment on your estimates._

## Solution

In [4]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from scipy.special import factorial
from scipy.optimize import minimize

In [7]:
pd.set_option('display.precision', 2)
df = pd.read_csv("consumer.csv")

In [8]:
ols_reg = smf.ols("PX1 ~ RATEX + UNEMP + AGE + SEX + FAMSIZE + np.log(INCOME)", df)
ols_fit = ols_reg.fit(cov_type = 'HC1')

In [9]:
ols_fit.summary2().tables[1].loc[:, ['Coef.', 'Std.Err.']]

Unnamed: 0,Coef.,Std.Err.
Intercept,9.82,2.79
RATEX,-0.18,0.3
UNEMP,1.12,0.27
AGE,0.01,0.01
SEX,0.07,0.37
FAMSIZE,0.11,0.19
np.log(INCOME),-3.79,1.5


**b** _A friend of yours tells you that you should also run a quantile regression, as it would be interesting to see what happens at higher/lower quantiles of the (conditional) inflation expectation distribution. Run a quantile linear regression model using the controls in (a) for τ∈{0.25,0.5,0.75}. Do not bootstrap standard errors! Interpret your estimates. What do they mean? What did you find? Hint: you may use the quantreg package. Note: We are not yet able to test hypotheses involving distinct quantiles, as we do not know the asymptotic covariance between these estimators. We’ll be able to do so once we see quantile regression in the context of GMM estimation._

## Solution

In [10]:
def fit_models(q):
    fit = quant_reg.fit(q)
    q_df = fit.summary2().tables[1].loc[:, ['Coef.', 'Std.Err.']]
    return q_df

In [11]:
quantiles = [0.25, 0.5, 0.75]
quant_reg = smf.quantreg("PX1 ~ RATEX + UNEMP + AGE + SEX + FAMSIZE + np.log(INCOME)", df)
table = pd.DataFrame()
for q in quantiles:
    table = pd.concat([table, fit_models(q)], axis=1)

table

Unnamed: 0,Coef.,Std.Err.,Coef..1,Std.Err..1,Coef..2,Std.Err..2
Intercept,2.0,1.76,6.27,2.0,17.9,3.37
RATEX,2.66e-07,0.212,0.00919,0.27,-0.07,0.43
UNEMP,1.0,0.184,1.15,0.22,1.75,0.37
AGE,1.15e-08,0.00918,0.0116,0.01,0.03,0.02
SEX,-1.23e-06,0.268,-0.108,0.32,0.03,0.5
FAMSIZE,-7.27e-07,0.111,0.116,0.14,0.17,0.25
np.log(INCOME),-1.05e-06,0.994,-2.24,1.11,-8.3,1.81


# PSet 3 - Question 4 (Count Models)

In this exercise, we'll use the data in `smoke.csv`. I advise you
not to use packages except for those in R's base installation. If you decide to do it though, you
must explain carefully what your package is calculating at each step. Take your time interpreting
what's being asked and write down your model, scores and Hessians.

The idea here is to model how many cigarettes the individual smokes per day as a function of
price and other covariates. In this case, we're counting the number of cigarettes, so a Poisson-like
model may seem suitable.

- **a** Estimate an heteroskedasticity-robust OLS model of _cigs_ on _ln(cigpric), ln(income), restaurn, white, educ, age, age^2_ . Interpret and discuss the coefficient on _cigpric_.

- **b** Estimate a CMLE model, where $cigs_i |x_i$ follows a Poisson distribution which expected value is $\lambda_i = e^{x_i'\beta}$ . Using asymptotic theory, estimate standard errors and comment on the results on this new model.

- **c** A Poisson model can be a good choice for modeling the effects of some variable on a variable that counts cigarettes, but nothing guarantees that cigs i follows a conditional Poisson. Comment on the extent that your results can be valid or not. Could you make your estimation more flexible using Quasi-MLE? In what extent? Calculate these new standard errors. What changes?

- **d** You can impose a bit more structure in your estimates by modeling $var(cig_i |x_i ) = \sigma^2 \lambda_i$. Show that this implies $\mathbb{E}[s_i(\beta)s_i'(\beta)] = \sigma^2 \mathbb{E}[H_i(\beta)]$, where s is the score function of the CMLE model, $H_i$ is the Hessian. What is the asymptotic variance of the estimates for $\beta$ in this case? Calculate these new standard errors. How do you interpret these? Is it likely that your conditional distribution was a Poisson, to begin with?

- **e** How many people don't smoke at all on your data? How could you write a model that accommodates this?

## Solution

### item a

In [7]:
df = pd.read_csv('smoke.csv')
df = df.drop('Unnamed: 0', axis=1)

In [5]:
ols_reg = smf.ols('cigs ~ lcigpric+lincome+restaurn+white+educ+age+agesq', df)
ols_fit = ols_reg.fit(cov_type = 'HC1')

In [6]:
ols_fit.summary2().tables[1].loc[:, ['Coef.', 'Std.Err.']]

Unnamed: 0,Coef.,Std.Err.
Intercept,-2.682435,25.901938
lcigpric,-0.850904,6.054396
lincome,0.869014,0.597972
restaurn,-2.865621,1.017275
white,-0.559236,1.378283
educ,-0.501753,0.16241
age,0.774502,0.138032
agesq,-0.009069,0.001459


### item b

$\mathbb{E}[cigs|x_i]=\exp\{x_i'\beta_0\}$ and the density of cigs is Poisson: $f(y_i|x_i;\beta_0)=\frac{\exp\{x_i'\beta_0\}^{y_i}\exp\{-\exp\{x_i'\beta_0\}\}}{y_i!}$. The log-likelihood function ($\mathcal{l}_i(\beta)$), score ($s_i(\beta)$) and Hessian ($H_i(\beta)$), respectively are:

- $\mathcal{l}_i(\beta)=y_i x_i'\beta - \exp\{x_i'\beta\} - \ln(y_i!)$
- $s_i(\beta)=(y_i - \exp\{x_i'\beta\})x_i$
- $H_i(\beta)=- \exp\{x_i'\beta\}x_i x_i'$

In [99]:
def neg_lik_i(beta, X, Y):
    nrows = Y.shape[0]
    nl = 0.0
    for i in range(nrows):
        xbeta = X[i,:].dot(beta)
        nl -= Y[i]*xbeta-np.exp(xbeta)-np.log(factorial(Y[i]))
    
    return nl/nrows

def score_i(beta, X, Y):
    nrows = Y.shape[0]
    s_dim = X.shape[1]
    score = np.zeros(s_dim)
    for i in range(nrows):
        xbeta = X[i,:].dot(beta)
        score -= (Y[i]-np.exp(xbeta))*X[i,:]
    
    return score/nrows

def hessian_i(beta, X, Y):
    nrows = Y.shape[0]
    s_dim = X.shape[1]
    hess = np.zeros((s_dim, s_dim))
    for i in range(nrows):
        xbeta = X[i,:].dot(beta)
        xtx = np.multiply.outer(X[i,:], (X[i,:]))
        hess -= -np.exp(xbeta)*xtx
    
    return hess/nrows

In [16]:
def neg_log_lik(beta, X, Y):
    xbeta = X.dot(beta)
    yxbeta = Y.dot(xbeta)
    lny = sum(np.log(factorial(Y)))
    nrows = Y.shape[0]
    return -(yxbeta - sum(np.exp(xbeta)) - lny)/nrows

def score(beta, X, Y):
    nrows = Y.shape[0]
    expxbeta = np.exp(X.dot(beta))
    sbeta = X.T.dot(Y - expxbeta)
    return -sbeta/nrows

def hessian(beta, X, Y):
    nrows = Y.shape[0]
    expxbeta = np.exp(X.dot(beta))
    return -(-X.T.dot(X)*sum(expxbeta))/nrows

In [8]:
Y_train = df['cigs'].to_numpy()
X_train = df.drop('cigs', axis=1).to_numpy()

In [51]:
initial_guess = 0.0001*np.ones(9)

In [71]:
#%%timeit
neg_lik_i(initial_guess, X_train, Y_train)

13.907723165028608

In [72]:
#%%timeit
neg_log_lik(initial_guess, X_train, Y_train)

13.90772316502863

In [73]:
#%%timeit
score_i(initial_guess, X_train, Y_train)

array([5.41282868e+01, 2.15969696e+02, 3.07050226e+00, 1.73413745e+02,
       1.40627756e+05, 1.80114129e+00, 3.86266908e+01, 9.61870716e+03,
       1.45233250e+01])

In [75]:
#%%timeit
score(initial_guess, X_train, Y_train)

array([5.41282868e+01, 2.15969696e+02, 3.07050226e+00, 1.73413745e+02,
       1.40627756e+05, 1.80114129e+00, 3.86266908e+01, 9.61870716e+03,
       1.45233250e+01])

In [104]:
hessian_i(initial_guess, X_train, Y_train)

array([[2.21499276e+03, 9.71034062e+03, 1.41015367e+02, 6.81116986e+03,
        4.19531190e+06, 4.55268371e+01, 1.62380053e+03, 3.26806099e+05,
        6.57699238e+02],
       [9.71034062e+03, 4.49689342e+04, 6.47935106e+02, 3.16287701e+04,
        1.90623964e+07, 2.10969950e+02, 7.46546657e+03, 1.53761801e+06,
        3.03406871e+03],
       [1.41015367e+02, 6.47935106e+02, 1.07359298e+01, 4.57737338e+02,
        2.77402533e+05, 2.81075460e+00, 1.08459078e+02, 2.21100440e+04,
        4.39835219e+01],
       [6.81116986e+03, 3.16287701e+04, 4.57737338e+02, 2.53178063e+04,
        1.33672791e+07, 1.44659285e+02, 5.26635200e+03, 1.35435812e+06,
        2.14034626e+03],
       [4.19531190e+06, 1.90623964e+07, 2.77402533e+05, 1.33672791e+07,
        8.66987826e+09, 9.15769190e+04, 3.21239879e+06, 6.40206563e+08,
        1.29134986e+06],
       [4.55268371e+01, 2.10969950e+02, 2.81075460e+00, 1.44659285e+02,
        9.15769190e+04, 3.42815492e+00, 3.47993161e+01, 6.92850270e+03,
        1.4

In [105]:
hessian(initial_guess, X_train, Y_train)

array([[1.62561910e+06, 7.41969892e+06, 1.08010072e+05, 4.97854402e+06,
        2.46600393e+09, 3.11096379e+04, 1.19800347e+06, 2.34341314e+08,
        5.03781420e+05],
       [7.41969892e+06, 3.60762056e+07, 5.21459775e+05, 2.45436230e+07,
        1.14992946e+10, 1.49426580e+05, 5.76264417e+06, 1.18515996e+09,
        2.43939185e+06],
       [1.08010072e+05, 5.21459775e+05, 8.66328879e+03, 3.55695820e+05,
        1.66814131e+08, 1.99170109e+03, 8.38494796e+04, 1.70479349e+07,
        3.54685432e+04],
       [4.97854402e+06, 2.45436230e+07, 3.55695820e+05, 1.96242308e+07,
        7.75191447e+09, 9.74589441e+04, 3.93424040e+06, 1.06926544e+09,
        1.66591998e+06],
       [2.46600393e+09, 1.14992946e+10, 1.66814131e+08, 7.75191447e+09,
        4.49814694e+12, 5.10571963e+07, 1.90310261e+09, 3.61910472e+11,
        7.80089873e+08],
       [3.11096379e+04, 1.49426580e+05, 1.99170109e+03, 9.74589441e+04,
        5.10571963e+07, 2.43158599e+03, 2.38201389e+04, 4.56580979e+06,
        1.0

In [43]:
opt_res = minimize(neg_log_lik, initial_guess, args=(X_train, Y_train), method='Nelder-Mead')

In [44]:
opt_res.message

'Optimization terminated successfully.'

In [80]:
opt_res.x

array([6.65667927e-05, 1.08806996e-04, 1.15295308e-04, 1.00522002e-04,
       7.48238094e-05, 1.09907386e-04, 1.17427332e-04, 1.42733840e-04,
       7.05175775e-05])

In [107]:
opt_res2 = minimize(neg_log_lik, initial_guess, args=(X_train, Y_train), method='trust-ncg', jac=score, hess=hessian_i)

In [108]:
opt_res2.message

'Optimization terminated successfully.'

In [109]:
opt_res2.x

array([-5.83594742e-02,  5.45697553e-03, -4.87919219e-02,  1.14617583e-01,
       -1.02407663e-05, -3.61038271e-01,  2.35816981e-01, -1.37465091e-03,
       -3.60542559e-01])