
Clustered SE different from that produced by reghdfe #1

Closed
weilu opened this issue Jul 3, 2018 · 29 comments · Fixed by #26


@weilu

weilu commented Jul 3, 2018

My model is something like felm(y ~ x1 + x2 | prefecture + province_year | 0 | province, data, weights=data$population), which was translated from Stata code: reghdfe y x1 x2 [aw=population], absorb(prefect province_year) cluster(province)

The point estimates are identical, but the clustered SEs are quite different between R and Stata. I've manually removed the singletons from the data so the number of observations matches that reported by Stata, but the resulting clustered SE is still higher than what's reported by reghdfe. I suspect the higher SEs arise because the fixed effects swept away by the within-group transformation are nested within clusters. I've tried various DoF adjustments I can find on google, like this and this, but none produced matching clustered SEs. I started reading your implementation and found this comment in the code:

need to check this computation, the SE's are slightly numerically different from Stata's.
it seems stata does not do the small-sample adjustment

Could you shed some light on what causes the difference and how I can adjust for it?

@ChandlerLutz

Can you please provide a reproducible example with both the R and Stata code and output?

@weilu

weilu commented Sep 20, 2018

The data is private to the professor I've done work for. I can ask if I can share it for the purpose of this ticket. Do you have an email address I can send the data to if the prof says yes?

@karldw

karldw commented Jan 31, 2019

Here's an example with very slight differences. felm gives a standard error of 0.00017561, while reghdfe gives 0.00017453.
I'm guessing the difference is from degrees of freedom, as @weilu mentioned. The discussion from Cameron and Miller (2015, pp.14-15) on clusters with fixed effects is relevant here.

R

library(lfe)
library(nycflights13)
library(haven)
# Write out for Stata in the next block.
haven::write_dta(nycflights13::flights, "nycflights.dta")

lfe::felm(arr_delay ~ distance | tailnum + origin | 0 | tailnum, data = nycflights13::flights)$cse

Stata

// ssc install reghdfe
use nycflights  // created by the R block above
encode tailnum, gen(tailnum_n)
encode origin, gen(origin_n)
reghdfe arr_delay distance, absorb(tailnum_n origin_n) vce(cluster tailnum_n) cformat(%9.8f)

@vikjam

vikjam commented Feb 5, 2019

I think @karldw is correct about the discrepancy being due to the treatment of the degrees-of-freedom adjustment. Gormley and Matsa (RFS 2014) describe the difference in the last section, "Stata programs that can be used to estimate models with multiple high-dimensional FE". They also include a description on how to manually adjust the standard errors.

It might be useful to include an option in felm for users if they want to match reghdfe when they have fixed effects nested within clusters.

@karldw

karldw commented Jul 5, 2019

Here's a more precise take on the issue above. reghdfe adjusts the degrees-of-freedom calculation when the FE are nested within clusters. Could lfe do something similar, and would you be interested in a PR? How would you want to deal with backward compatibility?

The Cameron and Miller PDF has more details on the degrees-of-freedom correction.

Below is a contrived example where the degrees-of-freedom calculation changes the significance level of the result (assuming you're particularly interested in a null hypothesis significance test at 5%).

R with lfe

set.seed(0)

# Generate 500 FE levels, with 5 obs each, nested within 50 clusters
n_clusters <- 50
n_obs_per_fe_level <- 5
n_fe_levels <- 500
n_obs <- n_obs_per_fe_level * n_fe_levels
n_fe_per_cluster <- n_fe_levels / n_clusters
n_obs_per_cluster <- n_obs / n_clusters

# Make 500 FE levels
fe <- sort(rep(1:n_fe_levels, length.out = n_obs))
# Make clusters that nest FE levels
cl <- ceiling(fe / n_fe_per_cluster)


draw_one_cluster <- function(N) {
  # Each cluster has its own mean and se for epsilon.
  cluster_eps_mean <- rnorm(1)
  cluster_eps_sd <- 10 * runif(1)
  rnorm(N, mean=cluster_eps_mean, sd = cluster_eps_sd)
}
eps <- unlist(
  replicate(n_clusters, draw_one_cluster(n_obs_per_cluster), simplify = FALSE)
)

x <- rnorm(n_obs, 0, 10)
y <- 0.02 * x + fe / 100 + eps

DF <- data.frame(cl, y, x, fe)
haven::write_dta(DF, "~/test_data.dta")

print(broom::tidy(lfe::felm(y ~ x | fe | 0 | cl, data=DF)))
#>  term estimate std.error statistic p.value
#> <chr>    <dbl>     <dbl>     <dbl>   <dbl>
#> x       0.0284    0.0148      1.91  0.0562


# Edit (2019-08-27): I previously had these values:
#> x       0.00936    0.0148     0.630   0.529

Stata with reghdfe

use "~/test_data", clear

reghdfe y x, absorb(fe) vce(cluster cl)
/*

(MWFE estimator converged in 1 iterations)

HDFE Linear regression                            Number of obs   =      2,500
Absorbing 1 HDFE group                            F(   1,     49) =       4.56
Statistics robust to heteroskedasticity           Prob > F        =     0.0377
                                                  R-squared       =     0.2373
                                                  Adj R-squared   =     0.0466
                                                  Within R-sq.    =     0.0018
Number of clusters (cl)      =         50         Root MSE        =     6.4882

                                    (Std. Err. adjusted for 50 clusters in cl)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .0283565   .0132782     2.14   0.038      .001673    .0550401
       _cons |   2.372491   .0022073  1074.83   0.000     2.368056    2.376927
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
          fe |       500         500           0    *|
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation



Edit (2019-08-27): I previously had these values:
           x |   .0283565   .0132782     2.14   0.038      .001673    .0550401
       _cons |   2.372491   .0022073  1074.83   0.000     2.368056    2.376927

Edit (2019-09-12): The "update" was incorrect, based on an old version of reghdfe.
Those values were:
          x |   .0283565   .0132755     2.14   0.038     .0016783    .0550347
*/

@grantmcdermott

FWIW, this FE-nested-within-clusters issue was recently addressed in Julia's analogous FixedEffectModels.jl package. See discussion and PRs here:

https://github.com/matthieugomez/FixedEffectModels.jl/pull/48
https://github.com/matthieugomez/FixedEffectModels.jl/issues/49
https://github.com/matthieugomez/FixedEffectModels.jl/pull/50

I imagine everyone agrees that consistency between all of these packages is desirable. However, as Cameron, Gelbach & Miller (2011; section 2.3) make clear in their original article, there are several (potentially?) valid bias correction measures when it comes to nested and multiway clusters. A note along those lines in the package documentation would probably be helpful for users. @karldw are you still thinking of putting in a PR?

@karldw

karldw commented Aug 21, 2019

@grantmcdermott, I'm up for writing a PR, but wanted to hear back from @sgaure before sinking a bunch of time into it.

@edrubin

edrubin commented Aug 22, 2019

@grantmcdermott @karldw @weilu Is it possible that the discrepancy is on Stata's side? Following @karldw's simulation, I found the same standard errors from lfe::felm() and estimatr::lm_robust():

# Add magrittr
library(magrittr)

# With lfe::felm()
lfe::felm(
  y ~ x | fe | 0 | cl, data=DF
) %>% summary() %>% coef()

#     Estimate Cluster s.e.  t value   Pr(>|t|)
# x 0.02835652   0.01484326 1.910397 0.05622501

# With estimatr::lm_robust()
estimatr::lm_robust(
  y ~ x, fixed_effects = ~ fe, clusters = cl, data = DF, se_type = "stata"
) %>% summary() %>% coef()

#     Estimate Std. Error  t value   Pr(>|t|)     CI Lower   CI Upper DF
# x 0.02835652 0.01484326 1.910397 0.06194321 -0.001472122 0.05818516 49

Neither set of estimates matches Stata's output (from @karldw's estimates). Has anyone run the same analysis in Julia?

I guess it is still worth noting that lfe and estimatr do not agree on p-values (emphasizing differences in degrees of freedom).

Finally, there are two other options that might cause differences between Stata and lfe (though it seems strange that estimatr and lfe agree). Could the discrepancy come (in part) from non-exact degrees of freedom? I believe lfe::felm() estimates degrees of freedom by default (changed via exactDOF = TRUE). There's also the possibility that Stata and lfe are dealing with non-definite variance matrices in different ways (psdef option). I'm not enough of a Stata pro to know the answer here.

@grantmcdermott

grantmcdermott commented Aug 22, 2019

@edrubin Here's the Julia version. Everything is the same as Stata... except, bizarrely enough, the 95% confidence intervals [UPDATE: and p-values]. (The plot thickens!)

#= Skipping the RCall.jl part where I replicate @karldw's simulation and 
then transfer the DF object to Julia...
=#

using DataFrames, FixedEffectModels
DF.fe =  categorical(DF.fe)
DF.cl =  categorical(DF.cl)
reg(DF, @model(y ~ x, fe = fe, vcov = cluster(cl)))

#=
Fixed Effect Model                     
=============================================================
Number of obs:           2500   Degrees of freedom:         2
R2:                     0.237   R2 within:              0.002
F-Statistic:          4.56065   p-value:                0.038
Iterations:                 1   Converged:               true
=============================================================
      Estimate Std.Error t value Pr(>|t|) Lower 95% Upper 95%
-------------------------------------------------------------
x    0.0283565 0.0132782 2.13557    0.033 0.0023191 0.0543939
=============================================================
=#

@grantmcdermott

FWIW, I'm not surprised a priori that estimatr produces different results to reghdfe even when the former's se_type option is set to "stata". That's because (I think) estimatr is replicating Stata's clustering/SE approach for the vanilla reg command. More here. And I don't think that reg makes the necessary DOF corrections for nesting.

A relevant section from Cameron and Miller (2015):

In Stata the within command xtreg y x, fe vce(robust) gives the desired CRVE. The alternative LSDV commands regress y x i.id_clu, vce(cluster id_clu) and, equivalently, regress y x, absorb(id_clu) vce(cluster id_clu) use the wrong degrees-of-freedom correction.

I'm rushing this morning so I may well be missing something, but I wanted to flag this before I head out.

PS -- Does anyone know what the constant (_cons) in the Stata output is referring to? Is that just an artifact of the model's numerical routine?

@edrubin

edrubin commented Aug 22, 2019

Thanks, @grantmcdermott!

The p-values differ too. Stata and Julia are using different degrees of freedom (when calculating p-values and confidence intervals). In calculating the p-value and confidence interval, lfe::felm() is using df=1999, while estimatr::lm_robust() is using df=49. Stata appears to be using df=49, and Julia is using df=2500 (which probably needs to be changed).

So I guess any attempt to align the estimates needs to keep in mind the DF used in inference—in addition to the DF adjustment for the CRVE. I'm probably getting off track from the original question/issue.
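To put numbers on the df question: the two-sided p-value for the shared t-statistic (1.910397) can be recomputed under each df choice. A self-contained sketch (Python rather than R, purely for illustration; it integrates the Student-t density numerically so no stats library is needed):

```python
import math

def t_pdf(x, nu):
    # Student-t density; log-gamma avoids overflow for large nu.
    logc = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi))
    return math.exp(logc - (nu + 1) / 2 * math.log1p(x * x / nu))

def t_pvalue_two_sided(t, nu, n=4000):
    # Two-sided p-value via Simpson's rule on [0, |t|]: p = 2*(0.5 - integral).
    t = abs(t)
    h = t / n
    s = t_pdf(0.0, nu) + t_pdf(t, nu)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 2 * (0.5 - s * h / 3)

tstat = 1.910397  # shared by lfe and estimatr above
print(t_pvalue_two_sided(tstat, 1999))  # ~0.0562, lfe's p-value
print(t_pvalue_two_sided(tstat, 49))    # ~0.0619, estimatr's p-value
```

So the entire p-value gap between lfe and estimatr here is the inference df, not the SE.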

@grantmcdermott

Good catch @edrubin! I'll let the FixedEffectModels.jl package maintainer know about the CI discrepancy in a separate issue.

Out of interest, I just ran the default estimatr cluster robust model (i.e. without specifying "stata" standard errors). This default implements a new small cluster correction method proposed by Pustejovsky and Tipton (2018). The results are very close to reghdfe's:

estimatr::lm_robust(y ~ x, fixed_effects = ~ fe, clusters = cl, data = DF)
#     Estimate Std. Error  t value  Pr(>|t|)    CI Lower   CI Upper       DF
# x 0.02835652  0.0132763 2.135875 0.0379786 0.001641456 0.05507158 46.56475

I don't want to get too far off track here with other packages, except to say I'm pretty confident felm is the outlier at this point...

@edrubin

edrubin commented Aug 22, 2019

I guess the question for @sgaure is whether lfe and felm() should match Stata—or if there should be an option that makes it easier to match.

@karldw

karldw commented Aug 28, 2019

@grantmcdermott and @edrubin, I just ran my code again and came up with different results. See the updated numbers above. (This resolves the questions about _cons and the p-value differing between estimatr and lfe.)
I don't know if the differences are due to changes in the programs or just my mistakes in copy-pasting. Probably the latter!

@grantmcdermott

grantmcdermott commented Aug 28, 2019

@karldw Thanks for the update. I probably should have double checked too. I guess it's (yet another) good argument in favour of using the reprex package.

In other news, I've got a bunch of things on my plate, but I'm up for helping on a PR as much as I can if you want to distribute the workload. At least, until we hear back from Simen...

@grantmcdermott

Okay, I finally sat down and wrote a simple correction myself, using the Gormley and Matsa approach already mentioned above. I can confirm that this now produces the correct SEs using @karldw's simulation example. However, I'd like for other folks to kick the tyres with some more complicated examples before I submit a PR. (I haven't tested with multiple clusters, for example.)

Here's the relevant fork: https://github.com/grantmcdermott/lfe/tree/clusternest
Clone it to your machine and make sure you check out the "clusternest" branch. Then run devtools::load_all() to simulate a local package build, and you should be able to play around as if you'd loaded everything normally.

Of course, I'm happy to accept any PRs on this branch if you spot any mistakes.


PS. Here's a reprex using my fork. As you can see, the standard errors, t-stats, p-values, and CIs now all match the Stata output from earlier.

devtools::load_all(path = "~/Documents/Projects/lfe") ## Change to path where you cloned the repo
#> Loading lfe
#> Loading required package: Matrix
set.seed(0)

# Generate 500 FE levels, with 5 obs each, nested within 50 clusters
n_clusters <- 50
n_obs_per_fe_level <- 5
n_fe_levels <- 500
n_obs <- n_obs_per_fe_level * n_fe_levels
n_fe_per_cluster <- n_fe_levels / n_clusters
n_obs_per_cluster <- n_obs / n_clusters

# Make 500 FE levels
fe <- sort(rep(1:n_fe_levels, length.out = n_obs))
# Make clusters that nest FE levels
cl <- ceiling(fe / n_fe_per_cluster)


draw_one_cluster <- function(N) {
  # Each cluster has its own mean and se for epsilon.
  cluster_eps_mean <- rnorm(1)
  cluster_eps_sd <- 10 * runif(1)
  rnorm(N, mean=cluster_eps_mean, sd = cluster_eps_sd)
}
eps <- unlist(
  replicate(n_clusters, draw_one_cluster(n_obs_per_cluster), simplify = FALSE)
)

x <- rnorm(n_obs, 0, 10)
y <- 0.02 * x + fe / 100 + eps

DF <- data.frame(cl, y, x, fe)
# haven::write_dta(DF, "~/test_data.dta")

mod = lfe::felm(y ~ x | fe | 0 | cl, data=DF)
summary(mod)
#> 
#> Call:
#>    lfe::felm(formula = y ~ x | fe | 0 | cl, data = DF) 
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -23.7582  -2.6064  -0.0222   2.5281  24.1545 
#> 
#> Coefficients:
#>   Estimate Cluster s.e. t value Pr(>|t|)  
#> x  0.02836      0.01328   2.136   0.0377 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6.488 on 1999 degrees of freedom
#> Multiple R-squared(full model): 0.2373   Adjusted R-squared: 0.04657 
#> Multiple R-squared(proj model): 0.001831   Adjusted R-squared: -0.2478 
#> F-statistic(full model, *iid*):1.244 on 500 and 1999 DF, p-value: 0.0007646 
#> F-statistic(proj model): 4.561 on 1 and 49 DF, p-value: 0.03774
confint(mod)
#>         2.5 %     97.5 %
#> x 0.001672966 0.05504007

Created on 2019-09-05 by the reprex package (v0.3.0)

@zweiven

zweiven commented Sep 6, 2019

Thanks for doing this. Haven't had the chance to really put it through its paces yet, but looking at the output for @karldw's simulation led to a very minor question. At line 682:
dfadj <- dfadj * z$df/(z$df + totvar - 1)

Why the minus one in the denominator? Shouldn't the adjustment just add back in the absorbed df for the nested fe, i.e., totvar? Or am I missing something?

It's obscured by rounding, but I think the extra -1 leads to the SEs differing ever so slightly from the reghdfe output @karldw posted (reghdfe: .0132755 vs. updated felm: 0.0132782), which also propagates to the CIs. Without the -1 they should match.
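This is easy to verify by hand: assuming dfadj scales the variance (so the SE scales by the square root of the factor), applying the factor with and without the -1 to the unadjusted felm SE from @karldw's simulation reproduces both reported values. A quick arithmetic check (Python, just back-of-the-envelope math):

```python
import math

# Values from @karldw's simulation above:
se_felm = 0.01484326  # felm's cluster SE before any nesting correction
df = 1999             # z$df: residual df after absorbing the FEs
totvar = 500          # absorbed FE levels, all nested within clusters

# The patch multiplies the variance by z$df / (z$df + totvar - 1),
# so the SE scales by the square root of that factor:
se_with_minus_one = se_felm * math.sqrt(df / (df + totvar - 1))
# Adding back exactly the absorbed df (no -1):
se_without = se_felm * math.sqrt(df / (df + totvar))

print(se_with_minus_one)  # ~0.0132782, the patched felm output
print(se_without)         # ~0.0132756, reghdfe's .0132755 up to rounding
```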

@karldw

karldw commented Sep 7, 2019

Hey @grantmcdermott, a couple of super small things below. I'll try this out with more interesting examples next week.

  • If z$fe is already a factor, length(levels(as.factor(z$fe[[fe_k]]))) gives the number of levels in z$fe, even if the values don't appear in z$fe[[fe_k]]. I haven't traced out is_nested enough to know if that matters. Do you know?
  • Below is a slightly tweaked version, with minor replacements like seq_along(x) instead of seq_along(1:length(x)) and vapply instead of sapply.
fe_cl_grid <- expand.grid(fe_k=seq_along(z$fe), cl_g=seq_along(cluster))
any_nested <- vapply(
  seq_len(nrow(fe_cl_grid)), function(n) {
    fe_k <- fe_cl_grid$fe_k[n]
    cl_g <- fe_cl_grid$cl_g[n]
    is_nested(z$fe[[fe_k]], cluster[[cl_g]])
  }, 
  FUN.VALUE = logical(1)
)
if (any(any_nested)) {
  # Will use the simple correction proposed by Gormley and Matsa. See:
  # https://www.kellogg.northwestern.edu/faculty/matsa/htm/fe.htm
  dfadj <- dfadj * z$df / (z$df + totvar - 1) 
  # In addition to the above, matching Stata's output in the case of 
  # nested clusters also requires that the regressor p-values are
  # calculated according to the Wald Test "df2" degrees of freedom; i.e.
  # reduce the degrees of freedom to the number of clusters-1
  z$df <- min(nlevels(z$clustervar[[1]]) - 1L, z$df)
  z$df.residual <- z$df
}

@grantmcdermott

grantmcdermott commented Sep 8, 2019

@zweiven Don't you still lose one degree of freedom with the nested case? At least, that's how I read the Gormley and Matsa correction. (It's a little bit confusing because of the way dfadj is iteratively defined in the code, but see fn 15 of their paper, for example.)

FWIW, I did a sanity check by comparing my results locally with Stata and Julia output... But here's where things get weird because I still get @karldw's original Stata output (e.g. SE of 0.0132782 instead of the updated value 0.0132755). Karl, can you double check to make sure the update that you mention here wasn't in error??

FWIW, I'm running Stata 13 with reghdfe version 5.7.2 (the latest AFAIK).

@grantmcdermott

grantmcdermott commented Sep 9, 2019

@karldw

... a couple of super small things below. I'll try this out with more interesting examples next week.

Thanks, those suggestions look good to me. (Ugh, I had meant to fully swap out seq for seq_along, but must have got distracted before finishing...) TBH I haven't thought carefully about potentially unobserved levels of z$fe and the is_nested function. But I have to say that my instinct is to leave everything as is. I don't think those kind of adjustments are made anywhere else in the code and I don't want to introduce inconsistency.

--

In the meantime, I've also realised that the "df2" calculation here is not robust to a change in cluster variable order (i.e. when there is more than one of these), since currently it just goes by whatever is the first object. Following CGM (2011, sec 2.3), I think I'll change this to default to whatever is the minimum-length cluster variable.

Actually... come to think of it, I stole this bit of "df2" code from elsewhere in the lfe package (here to be specific). I wonder, then, whether this is another source of discrepancy between lfe::felm and reghdfe in cases where there is more than one cluster?

@zweiven

zweiven commented Sep 9, 2019

Thanks @grantmcdermott. I'll have to defer to you and others with Stata as to its output. Also I realize this is immaterial for many datasets -- I don't mean to waste your time.

I agree with your read of the Gormley and Matsa footnote. My question was motivated by my perhaps incorrect interpretation of Cameron and Miller (section IIIB, bottom of page 330 to middle of 331), which seems to suggest that the total number of groups is to be added back, at least in the simple case with a single (e.g. unit-level) fixed effect.

If I'm interpreting both sources correctly, it's possible their recommendations differ because of the model formulation: G&M include an intercept--and I assume drop one of the FE levels rather than the intercept--while C&M omit the intercept and keep all fixed effects. I think felm is doing the latter, while Stata (for xtreg, at least) seems to keep the intercept and impose a mean constraint on (rather than drop one of) the fixed effects. Fun times?

@grantmcdermott

Also I realize this is immaterial for many datasets -- I don't mean to waste your time.

Not at all, @zweiven. I think catching and documenting these minor discrepancies is really valuable. Personally, I get unreasonably frustrated when I can't reconcile results exactly when trying to reproduce some analysis that was run in another language. Or, the same language for that matter... Speaking of which, I think you could well be right about the intercept issue. My suitably lazy heuristic at the moment is to match the estimates across platforms and ask questions later (or, never!).

@grantmcdermott

Okay everyone, I think we're just about there. Yesterday, I was playing around with multiway cluster cases and managed to plug what I believe is the only remaining gap between felm and reghdfe. You can skip to the reprex example at the bottom if you just want to see my proposed implementation, but here are the gory details for anyone still interested, summarizing much that has already been discussed in this thread:

Details

CGM (2011, Section 2.3) describe two possible small cluster corrections that are relevant to multiway clustering.

  1. The first approach adjusts each sub-component of the CRVE by its own "c_i" adjustment factor. For example, the first component (on cluster G) is c_1 = G/(G-1) * (N-1)/(N-K), the second component (on cluster H) is c_2 = H/(H-1) * (N-1)/(N-K)... etc.
  2. The second approach just uses a constant adjustment on all components: c = J/(J-1) * (N-1)/(N-K), where J = min(G,H) is the minimum dimension of G and H in the case of two clusters.

Both approaches are described as valid and they will obviously yield exactly the same results when there is only a single cluster. Still, CGM2011 adopt the first approach in their paper and their companion cgmreg routine. This is also the approach that felm() adopts, since @sgaure was following CGM2011 in his implementation for R. However, reghdfe (and several other implementations from what I can tell) adopt the second approach.
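To make the two corrections concrete, here is a minimal sketch of both adjustment factors (Python rather than R, just for the arithmetic; the function names are mine, and the two-way example dimensions at the bottom are made up):

```python
def cgm_factors(N, K, dims):
    """Approach 1 (CGM2011, felm): one factor c_i per CRVE component."""
    return [G / (G - 1) * (N - 1) / (N - K) for G in dims]

def min_dim_factor(N, K, dims):
    """Approach 2 (reghdfe): a single factor based on the smallest dimension."""
    J = min(dims)
    return J / (J - 1) * (N - 1) / (N - K)

# With a single cluster dimension, the two approaches coincide exactly
# (numbers from the simulation earlier in the thread):
print(cgm_factors(2500, 501, [50])[0] == min_dim_factor(2500, 501, [50]))  # True

# With two-way clustering they diverge; made-up dimensions for illustration:
print(cgm_factors(1000, 5, [40, 10]))    # two different per-component factors
print(min_dim_factor(1000, 5, [40, 10])) # one constant factor, from J = 10
```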

(As an aside, a quick search suggests that this discrepancy has been a point of confusion for Stata users too. See here and here, for example.)

So, my proposed solution is to simply add a new "reghdfe" option to felm's (silent) cmethod argument to go alongside the existing "cgm" (default) and "gaure" (not sure what this does TBH) options. Again, see the reprex below for an example.

Summary

Here's my summary of everything this GH thread has uncovered thus far:

  • Discrepancies between felm and reghdfe appear to be driven by two DoF adjustments: i) An adjustment for when clusters are nested within FEs, and ii) A choice between adjustment factors for small-cluster sizes that manifests in the case of multiway clustering.
  • While (i) and (ii) are somewhat related, they should be treated differently from an implementation perspective. In my view, (i) is a (minor) bug and felm() should adopt the reghdfe approach to account for nesting by default. This is what my earlier patch does. OTOH, (ii) is clearly a matter of opinion. As such, it should be left to the package owner (or user) to decide. Hence, my decision to roll this into a new cmethod="reghdfe" argument for anyone who wants to match reghdfe exactly.
  • That said, I'm open to being convinced otherwise. It's trivial to roll both (i) and (ii) into the same cmethod="reghdfe" if people feel strongly about it.
  • Speaking of which, calling the new cmethod option "reghdfe" is pretty arbitrary. I thought about calling it "cgm2", but "reghdfe" seemed an easier mental touchpoint for users. Again, I'm open to other names and/or being convinced otherwise...

There are some minor changes that still need to be made before I issue a PR (e.g. add the new cmethod to the documentation). However, I think we're pretty much there from the perspective of reconciling any differences between felm and reghdfe. Please everyone give these latest changes a run ASAP, or forever hold your peace!

Example

As promised, here is a simple multiway cluster example comparing felm() with the two Stata implementations (cgmreg and reghdfe). I've confirmed that the same principles apply with more complicated cases (e.g. more fixed effects, more clusters), but feel free to test that yourself.

Stata (for reference)

First cgmreg

. use "http://www.stata-press.com/data/r14/nlswork.dta", clear

. // Need to download and install the cgmreg ado manually.
. // See: http://cameron.econ.ucdavis.edu/research/cgmreg.ado
. qui cgmreg ln_wage age i.race, cluster(idcode year)
. preserve
. // ssc install regsave
. qui regsave age, tstat pval ci
. li
     +--------------------------------------------------------------------------------------+
     | var       coef     stderr      tstat   pval   ci_lower   ci_upper       N         r2 |
     |--------------------------------------------------------------------------------------|
  1. | age   .0196731   .0014538   13.53187      0   .0168237   .0225226   28510   .0946353 |
     +--------------------------------------------------------------------------------------+
. restore

Then reghdfe

. qui reghdfe ln_wage age, absorb(race) cluster(idcode year)
. preserve
. qui regsave age, tstat pval ci
. li
     +------------------------------------------------------------------------------------------+
     | var       coef     stderr      tstat       pval   ci_lower   ci_upper       N         r2 |
     |------------------------------------------------------------------------------------------|
  1. | age   .0196731   .0014594   13.48067   2.07e-09   .0165431   .0228032   28510   .0946353 |
     +------------------------------------------------------------------------------------------+
. restore

R felm()

Reminder: Clone my fork and switch to the clusternest branch to run the code below.

devtools::load_all(path = "~/Documents/Projects/lfe") ## Change to path where you cloned the repo
#> Loading lfe
#> Loading required package: Matrix
nlswork = haven::read_dta("http://www.stata-press.com/data/r14/nlswork.dta")

## Vanilla felm. Will give same results as cgmreg in Stata
est1 = 
  felm(
    ln_wage ~ age | race | 0 | idcode + year, 
    data = nlswork
    )
data.frame(broom::tidy(est1, conf.int=T))
#>   term   estimate   std.error statistic      p.value  conf.low  conf.high
#> 1  age 0.01967314 0.001453812  13.53211 1.359083e-41 0.0168236 0.02252268

## New cmethod="reghdfe" argument if you want same results as reghdfe
est2 = 
  felm(
    ln_wage ~ age | race | 0 | idcode + year, 
    data = nlswork, 
    cmethod="reghdfe" ## Added
    )
data.frame(broom::tidy(est2, conf.int=T))
#>   term   estimate   std.error statistic      p.value   conf.low  conf.high
#> 1  age 0.01967314 0.001459359  13.48067 2.072479e-09 0.01654312 0.02280315

Created on 2019-09-11 by the reprex package (v0.3.0)

@sergiocorreia

Hi folks,

Reghdfe author here; thanks for all the work in ensuring consistency between implementations. I'm compiling a few thoughts below that might be useful.

Why might SEs differ

I agree with Grant that there are several things that can lead to different results. In fact, I think there's a fourth way of ending up with different SEs: whether or not singletons are dropped. In summary, the number of observations in a regression, N, is used when computing the degrees-of-freedom adjustment. If there are individuals that only appear once in the sample and we have individual FEs, then each such observation gives zero information (its residual will be zero, etc.) and the best approach is to drop it, as otherwise we would be biasing the SEs downwards.
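Because removing an observation can create new singletons in another FE dimension, singleton-dropping has to iterate to a fixed point. A minimal sketch of that loop (Python, my own toy version; reghdfe's actual implementation lives in Mata and is more involved):

```python
def drop_singletons(fe_groups):
    """fe_groups: one list of FE levels per dimension, all the same length.
    Iteratively drop observations whose FE level appears only once, since
    dropping one observation can create new singletons elsewhere."""
    keep = set(range(len(fe_groups[0])))
    changed = True
    while changed:
        changed = False
        for groups in fe_groups:
            counts = {}
            for i in keep:
                counts[groups[i]] = counts.get(groups[i], 0) + 1
            singletons = {i for i in keep if counts[groups[i]] == 1}
            if singletons:
                keep -= singletons
                changed = True
    return sorted(keep)

# Obs 3 is a singleton in the first dimension; dropping it turns obs 2 and 4
# into singletons in the second dimension, so they go too:
print(drop_singletons([[1, 1, 2, 3, 2],
                       [1, 1, 2, 2, 3]]))  # [0, 1]
```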

So in total these are the five things to keep in mind if we want to have the same SEs:

  1. How is N computed (singletons option)
  2. How is K computed. Many FEs are collinear so this is not trivial. It's relatively straightforward to do so when there are only two sets of FEs, but with more all approaches that I know of (e.g.) are quite slow. Things get even more complex with fixed slopes (or heterogeneous slopes.. is there a standard name for this?)
  3. Also related to K: whether to decrease K if clusters are nested.
  4. How to adjust by the number of clusters (the two approaches outlined by CGM)
  5. How to adjust for a non positive semi-definite variance matrix.
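On point 2: for the two-FE case, the number of redundant FE coefficients can be computed exactly as the number of connected components of the bipartite graph linking the two FE dimensions (the Abowd-Kramarz-Margolis observation that, as far as I know, both lfe and reghdfe rely on). A union-find sketch with hypothetical naming (Python, for illustration only):

```python
def n_fe_coefs(fe1, fe2):
    """K contributed by two absorbed FE dimensions:
    levels(fe1) + levels(fe2) minus the number of connected components
    of the bipartite graph with one edge (fe1[i], fe2[i]) per observation."""
    parent = {}

    def find(a):
        while parent.setdefault(a, a) != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a, b in zip(fe1, fe2):
        parent[find(("g1", a))] = find(("g2", b))
    components = len({find(k) for k in parent})
    return len(set(fe1)) + len(set(fe2)) - components

# One connected component means exactly one redundant level -- the familiar
# "drop one level per extra FE dimension" special case:
print(n_fe_coefs([1, 1, 2, 2, 3], [1, 2, 2, 3, 3]))  # 3 + 3 - 1 = 5
```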

How to pin down why results differ

When I was writing reghdfe I tried to always compare it against other estimators, so there are ways to turn most of these options off, and also to see exactly how the DoF are computed. For instance

sysuse auto
reghdfe price weight length, a(turn trunk) verbose(1) vce(cluster turn)

Here, we include a bit of intermediate output due to the verbose(1) option (less than with verbose(2) or verbose(3)). In particular, there's a line that says:

- Small-sample-adjustment: q = (65 - 1) / (65 - 15) * 13 / (13 - 1) = 1.38666667

And from the standard output, you can see that there are 9 singleton observations, and that turn is "nested within cluster". We can turn these off:

sysuse auto
reghdfe price weight length, a(turn trunk) verbose(1) vce(cluster turn) keepsingletons dof(none)

Now we have:

   - Small-sample-adjustment: q = (74 - 1) / (74 - 37) * 18 / (18 - 1) = 2.08903021

So we have 74 obs instead of 65, and 37 parameters instead of 15. The dof() option also allows you to turn off only some components, but it gets a bit complex.
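The adjustment line is easy to reproduce by hand: with N observations, K parameters, and M clusters, q = (N - 1)/(N - K) * M/(M - 1). Checking both verbose(1) lines above (Python, just the arithmetic):

```python
def small_sample_q(N, K, M):
    # reghdfe-style small-sample adjustment that multiplies the clustered VCE:
    # q = (N - 1) / (N - K) * M / (M - 1), with M clusters.
    return (N - 1) / (N - K) * M / (M - 1)

# Default run: 9 singletons dropped (N = 65), turn nested in the cluster
# variable so K = 15, and 13 clusters remain:
print(round(small_sample_q(65, 15, 13), 8))  # 1.38666667

# With keepsingletons dof(none): N = 74, K = 37, 18 clusters:
print(round(small_sample_q(74, 37, 18), 8))  # 2.08903021
```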

One final note in terms of software: if you use cgmreg for benchmarking, note that they actually compute K-1 instead of K, so in the test folder of reghdfe I have a fixed version (see the diff here).

Cheers,
Sergio

@sergiocorreia

sergiocorreia commented Sep 12, 2019

Finally, I think a broader and still not fully solved problem is that when you have heteroskedasticity and many fixed effects (K/N does not go to zero asymptotically), inference is biased even asymptotically.

This problem is a generalization of Stock and Watson, "Heteroskedasticity-robust standard errors for fixed-effects panel-data regression," Econometrica 76 (2008): 155-174; AFAIK it cannot be solved by the usual methods (wild bootstrap, jackknife, clustering) and is covered by several recent papers, discussed below.

The problem, of course, is that none of the solutions that I have looked up (from these papers) appear to be computationally feasible:

  • The papers involving the Hadamard operator (Cattaneo, Dobriban-Su, etc.) propose something that is impossible to compute on any large dataset, even if you are the NSA or Google: inverting a dense N*N matrix (the element-wise square of the annihilator matrix). Interestingly, the Dobriban-Su paper also mentions the singleton case as an extreme case that should be dropped.
  • The paper by Valentin Verdier involved (if I'm recalling correctly) hundreds of cores running nonstop for a couple of weeks, on a dataset that was not that large, and it only worked with certain types of FE configurations.
  • The leave-one-out paper by Kline et al. requires you to run one fixed-effects regression for every observation. As Simen and others might notice when reading the paper and the Matlab code available on GitHub, their approach is only fast for a balanced panel, and two-way fixed effects is a trivial problem in a balanced panel (as pointed out in Baltagi's textbook, no iterative process is needed because convergence is immediate). There are some possible shortcuts that might make this feasible (such as precomputing some results, or only applying their approach to a random subset of observations), but I have my doubts.
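On the balanced-panel point: in a balanced panel the two-way within transformation has a closed form, y_it − ȳ_i − ȳ_t + ȳ, so no iteration is needed at all. A small numpy check of this (a hypothetical illustration, not any package's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_i, n_t = 50, 20                          # balanced: every i observed in every t
y = (rng.normal(size=(n_i, 1))             # individual effects
     + rng.normal(size=(1, n_t))           # time effects
     + rng.normal(size=(n_i, n_t)))        # idiosyncratic noise

# one-shot two-way demeaning: y_it - ybar_i - ybar_t + ybar
y_dd = y - y.mean(axis=1, keepdims=True) - y.mean(axis=0, keepdims=True) + y.mean()

# already a fixed point: both sets of group means are exactly zero,
# so a second demeaning pass changes nothing (convergence is immediate)
y_dd2 = y_dd - y_dd.mean(axis=1, keepdims=True) - y_dd.mean(axis=0, keepdims=True) + y_dd.mean()
```

With an unbalanced panel the row and column means interact, which is why the general case needs the iterative (alternating-projections) approach.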

So, to recap: there's a problem with inference that makes our discussion of degrees of freedom seem almost trivial. At the same time, there's no clear solution to this problem, so we are in a tough spot, and any suggestions would be welcome.

PS: The papers mentioned above that discuss this are actually extremely good (if Valentin or any other author reads this, this is not at all a rant about the papers, but about just how difficult it is to come up with a practical solution to the problem).

@grantmcdermott
Contributor

Thanks @sergiocorreia, that's really helpful background and context.

As an aside: a huge proportion of the applied economics seminars/papers that I see rely on either reghdfe or lfe. I find it remarkable that a) you and @sgaure essentially put these packages together all by yourselves, and b) most of us are still relying on you two individuals to sort through all the remaining technical difficulties on our collective behalf. Massive kudos to you both!

@sergiocorreia

Thanks Grant; at least in my case it was not really planned, but started as a by-product of more concrete needs.

BTW, one thing the Stata and R communities have in common, more than you might have thought, is that compared to e.g. Julia and Python their projects are less collaborative. For instance, I think all commits to reghdfe have been from me, and most commits to lfe have been from Simen (compare with e.g. Julia's GLM).

@ghost

ghost commented Jan 1, 2020

This is an interesting discussion! Do you have any insights on which degrees of freedom to use for the robust F test in case of multi-way clustering? (See also the question on Cross Validated: https://stats.stackexchange.com/questions/229678/double-clustered-standard-errors-and-degrees-of-freedom-in-wald-style-f-test-for)

@kendonB

kendonB commented Jan 8, 2020

Great work here folks. Just adding multiwayvcov as an extra comparison. It seems to match the cmethod = "cgm" option:

library(lfe)
#> Loading required package: Matrix
library(multiwayvcov)
nlswork = haven::read_dta("http://www.stata-press.com/data/r14/nlswork.dta")

## Vanilla felm. Will give same results as cgmreg in Stata
est1 = 
  felm(
    ln_wage ~ age | race | 0 | idcode + year, 
    data = nlswork
  )
data.frame(broom::tidy(est1, conf.int=T))
#>   term   estimate   std.error statistic      p.value  conf.low  conf.high
#> 1  age 0.01967314 0.001453812  13.53211 1.359083e-41 0.0168236 0.02252268

## New cmethod="reghdfe" argument if you want same results as reghdfe
est2 = 
  felm(
    ln_wage ~ age | race | 0 | idcode + year, 
    data = nlswork, 
    cmethod="reghdfe" ## Added
  )
data.frame(broom::tidy(est2, conf.int=T))
#>   term   estimate   std.error statistic      p.value   conf.low  conf.high
#> 1  age 0.01967314 0.001459359  13.48067 2.072479e-09 0.01654312 0.02280315

est3 = 
  lm(
    ln_wage ~ age + factor(race), 
    data = nlswork,
  )
# The multiwayvcov standard error here matches est1 i.e. cmethod = "cgm"
sqrt(multiwayvcov::cluster.vcov(model = est3, cluster = nlswork[, c("idcode", "year")])[2,2])
#> [1] 0.001453812

Created on 2020-01-09 by the reprex package (v0.3.0)
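For completeness, the multi-way formula that all of these implementations share is inclusion-exclusion over the cluster dimensions: V = V_id + V_year − V_id∩year, where each piece is a one-way clustered sandwich. Here is a bare-bones numpy sketch of that structure (no small-sample factors, and hypothetical code rather than what multiwayvcov, lfe, or cgmreg actually do):

```python
import numpy as np

def oneway_vcov(X, resid, groups):
    """One-way clustered sandwich estimator, no small-sample adjustment."""
    bread = np.linalg.inv(X.T @ X)
    k = X.shape[1]
    meat = np.zeros((k, k))
    for g in np.unique(groups):
        score = X[groups == g].T @ resid[groups == g]   # cluster-summed score
        meat += np.outer(score, score)
    return bread @ meat @ bread

def twoway_vcov(X, resid, g1, g2):
    """CGM two-way clustering by inclusion-exclusion."""
    inter = np.array([f"{a}|{b}" for a, b in zip(g1, g2)])
    return (oneway_vcov(X, resid, g1) + oneway_vcov(X, resid, g2)
            - oneway_vcov(X, resid, inter))

# toy OLS fit to feed the estimator
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
g1 = rng.integers(0, 10, size=n)
g2 = rng.integers(0, 8, size=n)
V2 = twoway_vcov(X, resid, g1, g2)
```

A handy sanity check on the structure: when the two cluster dimensions coincide, the intersection equals either dimension and the formula collapses back to one-way clustering. The differences discussed in this thread live entirely in the small-sample factors applied to each of the three pieces.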
