Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include stage-1 errors (or whole model) #1

Closed
grantmcdermott opened this issue Sep 4, 2020 · 24 comments
Closed

Include stage-1 errors (or whole model) #1

grantmcdermott opened this issue Sep 4, 2020 · 24 comments
Assignees
Labels
enhancement New feature or request

Comments

@grantmcdermott
Copy link

grantmcdermott commented Sep 4, 2020

Currently ivreg() retains some features from the stage-1 regression. Notably, the "residuals1", "qr1", "rank1", and "coefficients1" components of the return object. It would be great if we could retain the associated stage-1 standard errors too, or perhaps even the whole model object.

Motivation

It's very common to see coefficient estimates from the first stage of an IV presented in regression tables. Currently, this is not possible to automate and requires quite a bit tinkering or re-running the first-stage manually. However, since we already have the stage-1 coefficients, all we would need is the associated standard errors to automate the table construction.

Alternatively, one could include the whole stage-1 model object, which would allow for even more flexibility. This is the approach that lfe::felm adopts for its IV method. (Example below.) You could even go a step further and return it as a "coeftest" object to reduce size.

library(ivreg)
library(lfe)
#> Loading required package: Matrix

data("CigaretteDemand", package = "ivreg")

## ivreg 
m1 = ivreg(log(packs) ~ log(rprice) + log(rincome) | salestax + log(rincome),
                  data = CigaretteDemand)
## "Stage-1" elements of the return object (missing coefficient standard errors)
grep('1', attr(m1, 'names'), value = TRUE)
#> [1] "residuals1"    "qr1"           "rank1"         "coefficients1"

## lfe::felm equivalent
m2 = felm(log(packs) ~ log(rincome)  | 0 | (log(rprice) ~ salestax), 
                 data = CigaretteDemand)
## Retains whole stage-1 model object
summary(m2$stage1)
#> 
#> Call:
#>    NULL 
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -0.163799 -0.033049  0.001907  0.049322  0.185542 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  3.590811   0.225558  15.920  < 2e-16 ***
#> log(rincome) 0.389283   0.085104   4.574 3.74e-05 ***
#> salestax     0.027395   0.004077   6.720 2.65e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.07848 on 45 degrees of freedom
#> Multiple R-squared(full model): 0.6389   Adjusted R-squared: 0.6228 
#> Multiple R-squared(proj model): 0.6389   Adjusted R-squared: 0.6228 
#> F-statistic(full model):39.81 on 2 and 45 DF, p-value: 1.114e-10 
#> F-statistic(proj model): 39.81 on 2 and 45 DF, p-value: 1.114e-10 
#> F-statistic(excl instr.):45.16 on 1 and 45 DF, p-value: 2.655e-08

Created on 2020-09-04 by the reprex package (v0.3.0)

Related discussions:

P.S. Thanks so much for the work on this package. Having a self-contained R library for IV makes a ton of sense to me. The new features and pkgdown site are excellent too.

@zeileis
Copy link
Owner

zeileis commented Sep 4, 2020

Grant, thanks for the suggestion. However, it's not the case that we compute these quantities and then throw them away, we never compute them. As noted in the documentation we only call lm.fit() and never set up a full-blown lm() and hence are more efficient as we do less pre- and post-processing. I think that for a model-fitting function it is good practice not to do computations that are not strictly necessary and only needed in certain summaries.

Doing the necessary computations later on, is possible though, just with the information in the fitted model object. My impression is, though, that it is only common to include the first-stage results if there is just a single endogenous variable, is that correct? (I'm not the biggest IV user myself to be honest, just implemented it for our "AER" book.) To mimic the outcome from stage1 in the felm() results, we have everything available in m1. We just need to infer the endogenous variable and avoid including the intercept twice:

x <- model.matrix(m1, component = "regressors")
z <- model.matrix(m1, component = "instruments")
x <- x[, which(!(colnames(x) %in% colnames(z)))]
zint <- attr(terms(m1, component = "instruments"), "intercept")
if(length(zint) > 0) z <- z[, -zint, drop = FALSE]
stage1 <- lm(x ~ z)

If you just want the conventional covariance matrix, you can obtain it similarly:

x <- model.matrix(m1, component = "regressors")
z <- model.matrix(m1, component = "instruments")
r <- m1$residuals1[, which(!(colnames(x) %in% colnames(z)))]
v <- sum(r^2) / (nrow(z) - ncol(z)) * solve(crossprod(z))

While the former code also works in the same way if there are multiple endogenous variables, the latter needs some tweaking in that case. I'm not sure what would be a natural place to put this code and (as mentioned above) how to report this in case of multiple endogenous variables.

@john-d-fox
Copy link
Collaborator

john-d-fox commented Sep 4, 2020 via email

@grantmcdermott
Copy link
Author

Thanks for the quick reply, Achim.

My impression is, though, that it is only common to include the first-stage results if there is just a single endogenous variable, is that correct?

Yes, I think that's quite fair. In cases with more than one endogenous variable, I think your already-reported diagnostic tests (Weak instrument, Sargan ,etc.) should more than suffice.

With that in mind, your second solution is very elegant. I don't want to make assertions about the computation time involved. But if you feel it's not too burdensome, then having the stage1 vcov matrix returned by default is something I'd certainly find helpful as a user. Failing that, I like John's suggestion of the logical argument, which could ofc be passed down from other functions/methods that rely on ivreg.

@zeileis
Copy link
Owner

zeileis commented Sep 4, 2020

I think that pre-computing the stage-1 vcov makes no sense. We don't even pre-compute the stage-2 vcov.

What we should pre-compute, though, is the information which columns of x are endogenous and which are exogenous. I also remembered that this computation can be more involved in case there are interactions among the regressors. The relevant code is in ivdiag() though.

Then, we should introduce an argument for vcov() and probably additionally coef() and residuals(). We already have a type argument for residuals() and could leverage that. Somewhat inconsistently (for historical reasons) we use component in model.matrix() and terms(). So open questions:

  • What to call the additional argument in coef() and vcov()? component = "stage1"? Or type = "stage1"?
  • Should these always only return the results for the endogenous variable(s)?
  • What to do if there are multiple endogenous variables?

@john-d-fox
Copy link
Collaborator

john-d-fox commented Sep 4, 2020 via email

@zeileis
Copy link
Owner

zeileis commented Sep 4, 2020

OK, thanks. I also just had a look at how I compute endogenous variables in ivdiag(). That relies on column names and terms. Thus, it only works in ivreg() but not ivreg.fit(). So maybe it would be better to do this numerically. What do you think about this:

endo <- which(colMeans(m1$residuals1^2) > sqrt(.Machine$double.eps))
instr <- which(rowMeans(m1$coefficients1[, -endo]^2) < sqrt(.Machine$double.eps))

@john-d-fox
Copy link
Collaborator

john-d-fox commented Sep 5, 2020 via email

@zeileis zeileis pinned this issue Sep 5, 2020
@zeileis zeileis unpinned this issue Sep 5, 2020
@zeileis zeileis added the enhancement New feature or request label Sep 5, 2020
@zeileis zeileis self-assigned this Sep 5, 2020
@zeileis
Copy link
Owner

zeileis commented Sep 5, 2020

Thanks! I just pushed my proposed solution here:

efaa332

Your solution should also work, I think, but I was worried that a column of x might also be linearly dependent on a combination of z variables. Admittedly, this would be very unusual but not impossible, I think. Also, I'm not sure whether we could be even stricter than sqrt(.Machine$double.eps), e.g., .Machine$double.eps^0.7.

Re: sleep. Yes, soon... :-)

@zeileis
Copy link
Owner

zeileis commented Sep 5, 2020

And now I just pushed extended coef() and vcov() methods: 2b49ec8

In the case of multiple endogenous variables, I essentially return what you would get for the corresponding mlm object. The only difference is that coef() is "flattened" to a vector (rather than a matrix) with names matching those of vcov().

It would be great if you could take a look whether this does the job. Examples should also be added to the documentation...

P.S.: John, in case of aliased coefficients our vcov() currently drops the rows/columns pertaining to the aliased coefficients. For 'lm() objects this are retained with NAs by default. We should probably mimic what lm() does. Can you take a look? Should I open an issue for this?

@john-d-fox
Copy link
Collaborator

john-d-fox commented Sep 5, 2020 via email

@zeileis
Copy link
Owner

zeileis commented Sep 5, 2020

Great, thanks! In order to learn more about handling issues in GitHub, I've also created a separate issue for the complete coef vs. vcov problem.

@grantmcdermott
Copy link
Author

I feel like I dropped hand grenade in a room and then left everyone else to clean up...

Suffice it to say that I really like the proposed changes. Testing a quick proof of concept using Achim's latest commits to the dev branch suggests that everything works exactly as expected. In the below reprex I'm recreating the same basic table as broom::tidy, since this is what gets passed on to table-producing libraries like modelsummary. Once you're both happy with everything on this side, I'll submit a PR on broom's side to allow for support on that side.

library(ivreg) ## dev version; remotes::install_github('john-d-fox/ivreg')
library(lfe)
library(broom)

## ivreg
m1 = ivreg(log(packs) ~ log(rprice) + log(rincome) | salestax + log(rincome),
                 data = CigaretteDemand)
## lfe::felm
m2 = felm(log(packs) ~ log(rincome) | 0 | (log(rprice) ~ salestax),
                 data = CigaretteDemand)

## extract components from ivreg stage 1
c1 = coef(m1, component = 'stage1')
v1 = vcov(m1, component = 'stage1')
se1 = sqrt(diag(v1))
t1 = c1/se1
p1 = 2*pt(-abs(t1),df=m1$df.residual1)

## Mimic broom::tidy output for ivreg stage1
tibble::tibble(term = names(c1), estimate = c1, std.error = se1,
                     statistic = t1, p.value = p1)
#> # A tibble: 3 x 5
#>   term         estimate std.error statistic  p.value
#>   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)    3.59     0.226       15.9  4.17e-20
#> 2 salestax       0.0274   0.00408      6.72 2.65e- 8
#> 3 log(rincome)   0.389    0.0851       4.57 3.74e- 5

## Same as felm stage1
broom::tidy(m2$stage1)
#> # A tibble: 3 x 5
#>   term         estimate std.error statistic  p.value
#>   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)    3.59     0.226       15.9  4.17e-20
#> 2 log(rincome)   0.389    0.0851       4.57 3.74e- 5
#> 3 salestax       0.0274   0.00408      6.72 2.65e- 8

Created on 2020-09-05 by the reprex package (v0.3.0)

@zeileis
Copy link
Owner

zeileis commented Sep 5, 2020

Grant, thanks for checking and confirming that we're on the right track, much appreciated. Let's see whether John finds further improvements.

Re: hand grenade. Not at all, this was a fun bit of coding for the start into the weekend. Also, I think that the detection of endogenous variables and corresponding instruments is now more robust so that's a nice side effect.

@grantmcdermott
Copy link
Author

grantmcdermott commented Sep 6, 2020

Re: hand grenade. Not at all, this was a fun bit of coding for the start into the weekend. Also, I think that the detection of endogenous variables and corresponding instruments is now more robust so that's a nice side effect.

Super. I'm glad to hear that there were positive spillovers. One very minor thought is to also add an ivreg.confint method. As things stand, it's trivial to calculate it manually with the information already available. But it might be nice to provide something that supports the first stage instead of silently failing for users.

Up to you, of course, but something as simple as the below does the trick.

confint.ivreg <-
  function (object, parm, level = 0.95, component = c("stage2", "stage1"), ...) {
    component <- match.arg(component)
    cf <- coef(object, component = component)
    ses <- sqrt(diag(vcov(object, component = component)))
    pnames <- names(ses)
    if (is.matrix(cf)) 
      cf <- setNames(as.vector(cf), pnames)
    if (missing(parm)) 
      parm <- pnames
    else if (is.numeric(parm)) 
      parm <- pnames[parm]
    a <- (1 - level)/2
    a <- c(a, 1 - a)
    dof <- if (component=='stage1'){
      object$df.residual1
    } else {
      object$df.residual
    }
    fac <- qt(a, dof)
    pct <- stats:::format.perc(a, 3)
    ci <- array(NA_real_, dim = c(length(parm), 2L), dimnames = list(parm, 
                                                                     pct))
    ci[] <- cf[parm] + ses[parm] %o% fac
    ci
  }

@zeileis
Copy link
Owner

zeileis commented Sep 6, 2020

Hmmm, too much copying of code, I think. If we want to support this, then I'd rather tag an additional class on the object that has different defaults for the extractor:

stage1 <- function(object) {
  class(object) <- c("stage1", class(object))
  return(object)
}
coef.stage1 <- function(object, component = "stage1", ...) coef.ivreg(object, component = component, ...)
vcov.stage1 <- function(object, component = "stage1", ...) vcov.ivreg(object, component = component, ...)

I haven't tested this, yet, but it should then work with stats::confint() and lmtest::coeftest().

@john-d-fox
Copy link
Collaborator

john-d-fox commented Sep 6, 2020 via email

@zeileis
Copy link
Owner

zeileis commented Sep 6, 2020

Thanks! I streamlined the code a little bit and tried to avoid duplication of the post-processing steps: 2bf33d3

So I think the last decision we need to make is whether it is worth to include something like this stage1() extractor function to facilitate getting confint() and coeftest(). Or maybe you have a better idea...

@john-d-fox
Copy link
Collaborator

john-d-fox commented Sep 7, 2020 via email

@zeileis
Copy link
Owner

zeileis commented Sep 7, 2020

That's what Grant suggested, see his code above. My reaction was that this copies more code than necessary. Also, we don't get the coefficient tests by that.
The question is a bit whether we provide this mainly as infrastructure to obtain stage-1 model summaries in other packages (like broom etc.) or whether we want to integrate stage-1 results in our output. I'm not sure what the best answer is here...

@john-d-fox
Copy link
Collaborator

john-d-fox commented Sep 7, 2020 via email

@zeileis
Copy link
Owner

zeileis commented Sep 7, 2020

Thanks, John, this helped! I think that the point about the t distribution is a good one. So I changed my mind and vote for including a confint.ivreg() method. (As a short note: coefci() in "lmtest" chooses between normal and t distribution based on df.residual() which we do provide.)

@grantmcdermott
Copy link
Author

grantmcdermott commented Sep 8, 2020

Thanks gents. Sorry if this has been more work than originally anticipated. I'm grateful for whatever functionality you decide to roll in (or not). Even following your discussion has been most informative. Cheers, G.

@zeileis
Copy link
Owner

zeileis commented Jan 31, 2021

I've now added the dedicated confint() method for ivreg objects (f347094). I think we can also close this issue then.

@grantmcdermott
Copy link
Author

Agree. Thanks so much for adding this functionality!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants