Include stage-1 errors (or whole model) #1
Grant, thanks for the suggestion. However, it's not the case that we compute these quantities and then throw them away; we never compute them. As noted in the documentation we only call `lm.fit()` and never set up a full-blown `lm()`, and hence are more efficient as we do less pre- and post-processing. I think that for a model-fitting function it is good practice not to do computations that are not strictly necessary and only needed in certain summaries.

Doing the necessary computations later on is possible, though, just with the information in the fitted model object. My impression is, though, that it is only common to include the first-stage results if there is just a single endogenous variable; is that correct? (I'm not the biggest IV user myself, to be honest; I just implemented it for our "AER" book.) To mimic the outcome from `stage1` in the `felm()` results, we have everything available in `m1`. We just need to infer the endogenous variable and avoid including the intercept twice:

    x <- model.matrix(m1, component = "regressors")
    z <- model.matrix(m1, component = "instruments")
    x <- x[, which(!(colnames(x) %in% colnames(z)))]
    zint <- attr(terms(m1, component = "instruments"), "intercept")
    if(length(zint) > 0) z <- z[, -zint, drop = FALSE]
    stage1 <- lm(x ~ z)

If you just want the conventional covariance matrix, you can obtain it similarly:

    x <- model.matrix(m1, component = "regressors")
    z <- model.matrix(m1, component = "instruments")
    r <- m1$residuals1[, which(!(colnames(x) %in% colnames(z)))]
    v <- sum(r^2) / (nrow(z) - ncol(z)) * solve(crossprod(z))

While the former code also works in the same way if there are multiple endogenous variables, the latter needs some tweaking in that case. I'm not sure what would be a natural place to put this code and (as mentioned above) how to report this in case of multiple endogenous variables.
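[Editorial aside, not part of the thread: the conventional covariance formula `sum(r^2) / (n - k) * solve(crossprod(z))` used above is just the standard OLS covariance matrix. A quick base-R sanity check on simulated data (no ivreg object involved; all names here are made up for illustration) confirms it matches what `vcov()` reports for the corresponding `lm()` fit:]

```r
## Sanity check of the conventional vcov formula on simulated data.
set.seed(1)
n <- 100
z <- cbind("(Intercept)" = 1, z1 = rnorm(n), z2 = rnorm(n))  # toy instrument matrix
x <- as.vector(z %*% c(0.5, 1, -2) + rnorm(n))               # toy first-stage response

r <- lm.fit(z, x)$residuals
v <- sum(r^2) / (nrow(z) - ncol(z)) * solve(crossprod(z))

## agrees (up to numerical error) with the vcov of the full lm() fit
all.equal(unname(v), unname(vcov(lm(x ~ z[, -1]))))  ## -> TRUE
```

The same design matrix spans both fits, so the residuals, the residual degrees of freedom, and hence the estimated covariance matrices coincide.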
Hi,
On 2020-09-04 6:12 p.m., Achim Zeileis wrote:
> Grant, thanks for the suggestion. However, it's not the case that we
> compute these quantities and then throw them away; we never compute
> them. As noted in the documentation we only call `lm.fit()` and never
> set up a full-blown `lm()`, and hence are more efficient as we do less
> pre- and post-processing. I think that for a model-fitting function it
> is good practice not to do computations that are not strictly necessary
> and only needed in certain summaries.
>
> Doing the necessary computations later on is possible, though, just with
> the information in the fitted model object. My impression is, though,
> that it is only common to include the first-stage results if there is
> just a single endogenous variable; is that correct? (I'm not the biggest
> IV user myself, to be honest; I just implemented it for our "AER" book.)
> To mimic the outcome from `stage1` in the `felm()` results, we have
> everything available in `m1`. We just need to infer the endogenous
> variable and avoid including the intercept twice:
>
>     x <- model.matrix(m1, component = "regressors")
>     z <- model.matrix(m1, component = "instruments")
>     x <- x[, which(!(colnames(x) %in% colnames(z)))]
>     zint <- attr(terms(m1, component = "instruments"), "intercept")
>     if(length(zint) > 0) z <- z[, -zint, drop = FALSE]
>     stage1 <- lm(x ~ z)
>
> If you just want the conventional covariance matrix, you can obtain it
> similarly:
>
>     x <- model.matrix(m1, component = "regressors")
>     z <- model.matrix(m1, component = "instruments")
>     r <- m1$residuals1[, which(!(colnames(x) %in% colnames(z)))]
>     v <- sum(r^2) / (nrow(z) - ncol(z)) * solve(crossprod(z))
>
> While the former code also works in the same way if there are multiple
> endogenous variables, the latter needs some tweaking in that case. I'm
> not sure what would be a natural place to put this code and (as
> mentioned above) how to report this in case of multiple endogenous
> variables.
We could add a logical argument to vcov.ivreg(), say stage1, defaulting
to FALSE. If TRUE, the result could be the standard covariance matrix of
the coefficients if there's a single endogenous regressor or a list of
covariance matrices if there's more than one.
Best,
John
Thanks for the quick reply, Achim.
Yes, I think that's quite fair. In cases with more than one endogenous variable, I think your already-reported diagnostic tests (weak instruments, Sargan, etc.) should more than suffice. With that in mind, your second solution is very elegant. I don't want to make assertions about the computation time involved, but if you feel it's not too burdensome, then having the stage-1 vcov matrix returned by default is something I'd certainly find helpful as a user. Failing that, I like John's suggestion of the logical argument, which could of course be passed down from other functions/methods that rely on ivreg.
I think that pre-computing the stage-1 vcov makes no sense. We don't even pre-compute the stage-2 vcov. What we should pre-compute, though, is the information about which columns of `x` are endogenous and which are exogenous. I also remembered that this computation can be more involved in case there are interactions among the regressors. The relevant code is in `ivdiag()`, though.

Then, we should introduce an argument for `vcov()` and probably additionally `coef()` and `residuals()`. We already have a `type` argument for `residuals()` and could leverage that. Somewhat inconsistently (for historical reasons) we use `component` in `model.matrix()` and `terms()`. So open questions:

* What to call the additional argument in `coef()` and `vcov()`? `component = "stage1"`? Or `type = "stage1"`?
* Should these always only return the results for the endogenous variable(s)?
* What to do if there are multiple endogenous variables?
Hi,
On 2020-09-04 7:13 p.m., Achim Zeileis wrote:
> I think that pre-computing the stage-1 vcov makes no sense. We don't
> even pre-compute the stage-2 vcov.
I agree.
> What we should pre-compute, though, is the information about which columns
> of `x` are endogenous and which are exogenous. I also remembered that this
> computation can be more involved in case there are interactions among
> the regressors. The relevant code is in `ivdiag()`, though.
OK, I frankly hadn't thought of that, though I was aware that the regressors, not the explanatory variables, have to be handled properly.
> Then, we should introduce an argument for `vcov()` and probably
> additionally `coef()` and `residuals()`. We already have a `type`
> argument for `residuals()` and could leverage that. Somewhat
> inconsistently (for historical reasons) we use `component` in
> `model.matrix()` and `terms()`. So open questions:
>
> * What to call the additional argument in `coef()` and `vcov()`?
>   `component = "stage1"`? Or `type = "stage1"`?
I'd prefer `component` for `coef()` and `vcov()`.
> * Should these *always* only return the results for the endogenous
>   variable(s)?
What would be the point of returning this information for the exogenous regressors, where each has one coefficient equal to 1 and all others equal to 0, and where the residuals are all 0 and hence the SEs are 0?
> * What to do if there are multiple endogenous variables?
I'd just return a list with named elements. We could give that a class, but is there really a point? That is, I doubt whether this feature will be used much.
Best,
John
OK, thanks. I also just had a look at how I compute endogenous variables in `ivdiag()`. That relies on column names and `terms`. Thus, it only works in `ivreg()` but not `ivreg.fit()`. So maybe it would be better to do this numerically. What do you think about this:

    endo <- which(colMeans(m1$residuals1^2) > sqrt(.Machine$double.eps))
    instr <- which(rowMeans(m1$coefficients1[, -endo]^2) < sqrt(.Machine$double.eps))
Hi,
On 2020-09-04 7:45 p.m., Achim Zeileis wrote:
> OK, thanks. I also just had a look at how I compute endogenous variables
> in `ivdiag()`. That relies on column names and `terms`. Thus, it only
> works in `ivreg()` but not `ivreg.fit()`. So maybe it would be better to
> do this numerically. What do you think about this:
>
>     endo <- which(colMeans(m1$residuals1^2) > sqrt(.Machine$double.eps))
>     instr <- which(rowMeans(m1$coefficients1[, -endo]^2) < sqrt(.Machine$double.eps))
Actually, something similar occurred to me, but shouldn't `endo` and `instr` partition the regressors? So the second computation is redundant.
But the second computation should be adaptable to make it more efficient than the first because it doesn't involve the n cases -- that is, the exogenous regressors are the ones that have one coefficient equal to 1 (within rounding error) and the rest 0. Something like (untested):

    coef1 <- m1$coefficients1
    p <- ncol(coef1)
    instr <- which(rowSums(abs(coef1 - 1) < sqrt(.Machine$double.eps)) == 1 &
                   rowSums(abs(coef1) < sqrt(.Machine$double.eps)) == (p - 1))
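[Editorial aside, not part of the thread: the "untested" rule above can be checked in base R on a small synthetic coefficient matrix; the row names and values below are invented for illustration, with no fitted model required.]

```r
## Toy check of the exogenous-regressor detection rule: an exogenous
## regressor's first-stage "fit" has one coefficient of exactly 1
## (on itself) and zeros elsewhere; an endogenous one does not.
coef1 <- rbind(
  exog1 = c(1, 0, 0),          # exogenous: acts as its own instrument
  endo1 = c(0.2, 0.5, -0.3),   # endogenous: a genuine stage-1 fit
  exog2 = c(0, 0, 1)           # exogenous
)
p <- ncol(coef1)
eps <- sqrt(.Machine$double.eps)
instr <- which(rowSums(abs(coef1 - 1) < eps) == 1 &
               rowSums(abs(coef1) < eps) == (p - 1))
names(instr)  ## -> "exog1" "exog2"
```

Only the rows with a single unit coefficient and zeros elsewhere are flagged, so the endogenous row drops out as intended.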
Anyway, shouldn't you be sleeping?
Best,
John
Thanks! I just pushed my proposed solution. Your solution should also work, I think, but I was worried that a column of... Re: sleep. Yes, soon... :-)
And now I just pushed extended `coef()` and `vcov()` methods: 2b49ec8

In the case of multiple endogenous variables, I essentially return what you would get for the corresponding `mlm` object. The only difference is that `coef()` is "flattened" to a vector (rather than a matrix) with names matching those of `vcov()`.

It would be great if you could take a look at whether this does the job. Examples should also be added to the documentation...

P.S.: John, in case of aliased coefficients our `vcov()` currently drops the rows/columns pertaining to the aliased coefficients. For `lm()` objects these are retained with `NA`s by default. We should probably mimic what `lm()` does. Can you take a look? Should I open an issue for this?
Hi,
On 2020-09-04 10:23 p.m., Achim Zeileis wrote:
> And now I just pushed extended `coef()` and `vcov()` methods: 2b49ec8
>
> In the case of multiple endogenous variables, I essentially return what
> you would get for the corresponding `mlm` object. The only difference is
> that `coef()` is "flattened" to a vector (rather than a matrix) with
> names matching those of `vcov()`.
>
> It would be great if you could take a look at whether this does the job.
> Examples should also be added to the documentation...
>
> P.S.: John, in case of aliased coefficients our `vcov()` currently drops
> the rows/columns pertaining to the aliased coefficients. For `lm()`
> objects these are retained with `NA`s by default. We should probably
> mimic what `lm()` does. Can you take a look? Should I open an issue for
> this?
I'll look at all this when I have a chance -- almost surely before the end of next week. I don't know whether it merits examples (I don't expect that it will be used much), but we probably should have some tests. You can open an issue if you wish, but it's not really necessary to do so. BTW, I believe that vcov.lm() has an argument for how to handle aliased coefficients, with NAs the default (suggested some time ago by Terry Therneau and me) -- I'll confirm that too.
Best,
John
Great, thanks! In order to learn more about handling issues in GitHub, I've also created a separate issue for the complete coef vs. vcov problem.
I feel like I dropped a hand grenade in a room and then left everyone else to clean up... Suffice it to say that I really like the proposed changes. Testing a quick proof of concept using Achim's latest commits to the dev branch suggests that everything works exactly as expected. In the below reprex I'm recreating the same basic table as `lfe::felm`'s first stage.

library(ivreg) ## dev version; remotes::install_github('john-d-fox/ivreg')
library(lfe)
library(broom)
## ivreg
m1 = ivreg(log(packs) ~ log(rprice) + log(rincome) | salestax + log(rincome),
data = CigaretteDemand)
## lfe::felm
m2 = felm(log(packs) ~ log(rincome) | 0 | (log(rprice) ~ salestax),
data = CigaretteDemand)
## extract components from ivreg stage 1
c1 = coef(m1, component = 'stage1')
v1 = vcov(m1, component = 'stage1')
se1 = sqrt(diag(v1))
t1 = c1/se1
p1 = 2*pt(-abs(t1),df=m1$df.residual1)
## Mimic broom::tidy output for ivreg stage1
tibble::tibble(term = names(c1), estimate = c1, std.error = se1,
statistic = t1, p.value = p1)
#> # A tibble: 3 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 3.59 0.226 15.9 4.17e-20
#> 2 salestax 0.0274 0.00408 6.72 2.65e- 8
#> 3 log(rincome) 0.389 0.0851 4.57 3.74e- 5
## Same as felm stage1
broom::tidy(m2$stage1)
#> # A tibble: 3 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 3.59 0.226 15.9 4.17e-20
#> 2 log(rincome) 0.389 0.0851 4.57 3.74e- 5
#> 3 salestax      0.0274    0.00408      6.72 2.65e- 8

Created on 2020-09-05 by the reprex package (v0.3.0)
Grant, thanks for checking and confirming that we're on the right track, much appreciated. Let's see whether John finds further improvements. Re: hand grenade. Not at all, this was a fun bit of coding for the start into the weekend. Also, I think that the detection of endogenous variables and corresponding instruments is now more robust, so that's a nice side effect.
Super. I'm glad to hear that there were positive spillovers. One very minor thought is to also add a `confint()` method. Up to you, of course, but something as simple as the below does the trick.

confint.ivreg <-
function (object, parm, level = 0.95, component = c("stage2", "stage1"), ...) {
component <- match.arg(component)
cf <- coef(object, component = component)
ses <- sqrt(diag(vcov(object, component = component)))
pnames <- names(ses)
if (is.matrix(cf))
cf <- setNames(as.vector(cf), pnames)
if (missing(parm))
parm <- pnames
else if (is.numeric(parm))
parm <- pnames[parm]
a <- (1 - level)/2
a <- c(a, 1 - a)
dof <- if (component=='stage1'){
object$df.residual1
} else {
object$df.residual
}
fac <- qt(a, dof)
pct <- stats:::format.perc(a, 3)
ci <- array(NA_real_, dim = c(length(parm), 2L), dimnames = list(parm, pct))
ci[] <- cf[parm] + ses[parm] %o% fac
ci
}
Hmmm, too much copying of code, I think. If we want to support this, then I'd rather tag an additional class on the object that has different defaults for the extractors. I haven't tested this yet, but it should then work with...
Hi,
I wrote some tests that uncovered problems with the code for stage-1 results, fixed the problems, and added a "complete" argument to the coef() and vcov() methods. I plan to push these changes to GitHub later today. The new code can probably use some cleaning up, and the tests might be extended.
Best,
John
On 2020-09-05 6:03 p.m., Achim Zeileis wrote:
> Grant, thanks for checking and confirming that we're on the right track,
> much appreciated. Let's see whether John finds further improvements.
>
> Re: hand grenade. Not at all, this was a fun bit of coding for the start
> into the weekend. Also, I think that the detection of endogenous
> variables and corresponding instruments is now more robust, so that's a
> nice side effect.
Thanks! I streamlined the code a little bit and tried to avoid duplication of the post-processing steps: 2bf33d3

So I think the last decision we need to make is whether it is worth including something like this `stage1()` extractor function to facilitate getting `confint()` and `coeftest()`. Or maybe you have a better idea...
Hi,
On 2020-09-06 6:17 p.m., Achim Zeileis wrote:
> Thanks! I streamlined the code a little bit and tried to avoid
> duplication of the post-processing steps: 2bf33d3
>
> So I think the last decision we need to make is whether it is worth
> including something like this `stage1()` extractor function to
> facilitate getting `confint()` and `coeftest()`. Or maybe you have a
> better idea...
I guess I have to think about it. After all, stage 1 is just a multivariate linear model and the "natural" tests are multivariate. Both Anova() and linearHypothesis() in the car package support that, and I wonder whether it really makes sense to duplicate this functionality for ivreg objects.

OTOH, unless I'm missing something, I think that it would be very easy to write a confint() method for the individual coefficients given coef() and vcov(). Shall I just do that?
Best,
John
That's what Grant suggested; see his code above. My reaction was that this copies more code than necessary. Also, we don't get the coefficient tests that way.
On 2020-09-06 9:57 p.m., Achim Zeileis wrote:
> That's what Grant suggested; see his code above. My reaction was that
> this copies more code than necessary. Also, we don't get the coefficient
> tests that way.
OK, I see. Yes, Grant's suggestion is essentially what I was proposing. Sorry I didn't look at that carefully before.

I think this has gotten out of hand. From my point of view, we provided coef() and vcov() methods that cover stage 1 essentially for symmetry and because the information was already there, not because it was very interesting to do so.

We don't at present provide a confint() method. The default method works and gives stage-2 confidence intervals based on the normal distribution, and it ignores the component argument. That's OK. Writing a specific ivreg method would be a slight improvement, using the t distribution. If we do that, the same symmetry consideration suggests that we include a component argument. Grant's suggested function is quite simple and transparent, and I'm not bothered by the small duplication of code. Your suggestion is more elegant if somewhat more opaque, and would, I think, produce z rather than t intervals. I'm indifferent about which of these approaches, if either, is implemented.
> The question is a bit whether we provide this mainly as infrastructure
> to obtain stage-1 model summaries in other packages (like broom etc.) or
> whether we want to integrate stage-1 results in our output. I'm not sure
> what the best answer is here...
I think that we should avoid spending much more time here. There are (at least?) three ways to think about a multivariate linear model: as a ravelled vector of coefficients with a covariance matrix, as a set of related univariate linear models, and multivariately. For most applications of MLMs, the third view makes the most sense. For 2SLS the second view probably makes the most sense; it is what you get, e.g., from the summary() method for mlms. At present, we provide the first view with coef() and vcov(), and may now do so with confint(), because it's the easiest to implement, but it's also the least useful, in my opinion.

As I said, I don't think that we should pursue this tangent much further. Someone who is really interested in the first-stage regressions beyond the information we already provide can just fit the mlm (with only the endogenous regressors as responses, of course) and use the extensive tools available for that.

The only awkwardness, and a possible argument for automating the first-stage regression, is that in some cases it will be necessary to use model.matrix() to generate the LHS of endogenous regressors for the mlm. If so, and if we judge it worth the cost, we can save the first-stage mlm object in the ivreg object, as I think Grant originally suggested. That would simplify everything, since we could just use existing mlm methods.
Best,
John
…
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#1 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADLSAQROZGRLQKJ6OKMNYSDSEQ4Z3ANCNFSM4QZJU4FQ>.
|
Thanks, John, this helped! I think that the point about the t distribution is a good one, so I changed my mind and vote for including a confint.ivreg() method. (As a short note: coefci() in "lmtest" chooses between the normal and t distributions based on df.residual(), which we do provide.)
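[Editorial aside, not part of the thread: the t-vs-normal point is easy to see numerically in base R. The residual df of 10 below is an arbitrary small-sample example, not taken from any model in the thread.]

```r
## 95% critical values: normal vs. t with a small residual df
qnorm(0.975)        ## -> 1.959964
qt(0.975, df = 10)  ## -> 2.228139 (t intervals are noticeably wider)

## the difference vanishes as the residual df grows
qt(0.975, df = 1000) - qnorm(0.975) < 0.01
```

So a z-based confint() would be too narrow in small samples, which is exactly why the dedicated method uses t quantiles with the model's residual degrees of freedom.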
Thanks gents. Sorry if this has been more work than originally anticipated. I'm grateful for whatever functionality you decide to roll in (or not). Even following your discussion has been most informative. Cheers, G.
I've now added the dedicated `confint()` method.
Agree. Thanks so much for adding this functionality!
Currently `ivreg()` retains some features from the stage-1 regression: notably, the "residuals1", "qr1", "rank1", and "coefficients1" components of the return object. It would be great if we could retain the associated stage-1 standard errors too, or perhaps even the whole model object.

Motivation

It's very common to see coefficient estimates from the first stage of an IV presented in regression tables. Currently, this is not possible to automate and requires quite a bit of tinkering or re-running the first stage manually. However, since we already have the stage-1 coefficients, all we would need is the associated standard errors to automate the table construction.

Alternatively, one could include the whole stage-1 model object, which would allow for even more flexibility. This is the approach that `lfe::felm` adopts for its IV method. (Example below.) You could even go a step further and return it as a "coeftest" object to reduce size.

Created on 2020-09-04 by the reprex package (v0.3.0)
Related discussions:
P.S. Thanks so much for the work on this package. Having a self-contained R library for IV makes a ton of sense to me. The new features and pkgdown site are excellent too.