[ENH] Add CRV3 Inference via the Cluster Jackknife to OLS and WLS #8461 #8596

s3alfisc · 2022-12-27T19:34:16Z

Hi all!

This PR closes #8461.

It adds support for CRV3 cluster robust inference for OLS and WLS via a cluster jackknife.

In short, the PR adds a new cov_type, cluster-jackknife, which can be run in the following way:

crv3 = model.fit(cov_type='cluster-jackknife', cov_kwds={'groups': id})

I have updated the docstring in linear_model.py.

I have not yet added tests, as I'd first like to get feedback on where and how to test the new cov_type (and I also have to get up to speed with pytest) . There are multiple strategies how to test the new feature:

test against implementations in other languages (e.g. Stata's summclust, R's summclust, or R's sandwich::vcovJK).
internal tests: under appropriate small sample corrections and when clusters are singletons, the cluster jackknife is equivalent to the HC3 estimator.

Below is a small example of the ladder test, using the new cluster-jackknife covtype and HC3 inference:

from statsmodels.regression.linear_model import OLS
from statsmodels.datasets import grunfeld
from statsmodels.tools.tools import add_constant
import pandas as pd
import numpy as np

dtapa = grunfeld.data.load_pandas()
dtapa_endog = dtapa.endog[:200]
dtapa_exog = dtapa.exog[:200]
exog = add_constant(dtapa_exog[["value", "capital"]], prepend=False)
N = exog.shape[0]
id = pd.Series(range(0, N))

res_hc3 = OLS(dtapa_endog, exog).fit(cov_type = "HC3")
res_crv3 = OLS(dtapa_endog, exog).fit(cov_type = "cluster-jackknife", cov_kwds={'groups': id})

res = pd.DataFrame({'tstat hc3:': res_hc3.tvalues * np.sqrt((N-1) / (N)), 'tstat crv3:' : res_crv3.tvalues})
res
#          tstat hc3:  tstat crv3:
# value     16.093573    16.093573
# capital    3.932688     3.932688
# const     -3.040458    -3.040458

Let me know which types of tests you'd like to see =)

statsmodels/stats/sandwich_covariance.py

josef-pkt · 2022-12-27T20:21:10Z

I'm soon going skiing for a few days. Will look at the details in January.

I think unit tests for OLS, WLS should go into statsmodels.regression.tests.test_robustcov
(overall unit tests for other models like GLM, discrete cov_type are spread out across directories. stats.test.test_sandwich.py has only a few tests that don't use models)

In statsmodels.regression.tests.test_robustcov there are test classes with names that end in fit. Those are tests after we decided to put the cov_type option into the model.fit method. The other unit tests don't follow the current pattern, and I never refactored those.

The integration into the current sandwich cov look good (base on skimming the changes).
However, we need to disable it in the generic version because it only applies to OLS/WLS and would be inherited by other model classes. However, it might also apply to IRLS in GLM. I need to check that later.

Thanks for the PR

The current unit tests fail in tests unrelated to this PR
FAILED statsmodels/imputation/tests/test_mice.py::TestMICE::test_combine - As...
FAILED statsmodels/stats/tests/test_mediation.py::test_mixedlm - AssertionErr...
I just realized that those unit tests fail now in many test environments and I will either fix or disable those unit tests #8597

josef-pkt · 2022-12-27T20:50:23Z

I merged the disabling of the two failing unit tests.

If you rebase on main, then the unit tests should be green (except for pre-release testing)

s3alfisc · 2022-12-28T10:13:56Z

Thanks for your feedback, @josef-pkt !

I will add the tests to statsmodels.regression.tests.test_robustcov / fit
I'll disable the jackknife in the generic version. In R's sandwich package, it is possible to employ a cluster jackknife for glm() etc, though my understanding is that the literature on its empirical performance is quite thin. The implementation in this PR should be amenable to glm's, but would require minor changes. Also, the jackknife might not be equivalent to HC3/CRV3 vcov's for glm's.
I'll rebase the PR.

Enjoy your skiing vacation! Best, Alex

…not use it)

josef-pkt · 2023-01-03T19:16:51Z

style check failures are mostly trailing whitespace, and
statsmodels/stats/sandwich_covariance.py:601:18: E712 comparison to False should be 'if cond is False:' or 'if not cond:'

josef-pkt · 2023-01-03T19:36:28Z

Do you have the Stata package summclust to write unit tests against?

I read partway through their working paper, but not enough yet to understand the details.

Aside: GEE has an option for bias reduced cluster robust standard errors. OLS is a special case of GEE and should crv should have similar standard errors.

an old script of mine
https://gist.github.com/josef-pkt/89585d0b084739a4ed1c
(I don't remember where the data csv file is, or whether I have an updated notebook )

s3alfisc · 2023-01-04T07:19:42Z

Yes, I should still have access to stata (unless my license has expired), so I can prepare tests against summclust and HC3 errors produced by statsmodels. I'll also take a look at the GEE implementation & script. My goal is to do most of this by the end of the week.

josef-pkt · 2023-01-04T14:58:48Z

When you have unit tests, then I will merge this without going through the math and theory. There is too much and too many articles to read for me to go over it now.

question:
Do we need any options for variation on the cluster-jackknife?
There seems to be different variants and options in the literature, but I don't know which of those are important. But those can be added later if the current version can remain as default.

s3alfisc · 2023-02-05T12:36:13Z

Hi @josef-pkt - sorry for taking so long to get back to you about this PR. Early January I was traveling, then there was more work to do than expected, and outside of work, I had to fix/push some things with my R projects.

Indeed there are two variants of the cluster jackknife discussed in MNW, the difference is in whether one should use the "mean" of the jackknifed regression coefficients, or the sample estimate in one point of the vcov formula. What is implemented in the PR is the "regression coefficient" approach, which is equal to the HC3 estimator when each cluster is a singleton. I will add an option for the mean version as well, as it is easy to do.

Regarding tests - I have to admit that I struggle a little bit with the tests in statsmodels.regression.tests.test_robustcov. The code comments suggest that statsmodels implementations are tested against Stata, but where is the Stata code actually run (or results from Stata code actually changed)?

Regarding all the math in MNW - it is indeed a lot of linear algebra. But what I effectively implement is quite straightforward - equation (31) (and potentially 30) in their summclust paper.

To speed things up, I could also create a separate test file for the new vcov estimator, which you could then refactor into the appropriate test file? But I'd also be happy to do this myself, but would need some guidance =) Best, Alex

josef-pkt · 2023-02-05T14:29:54Z

no problem with the delay. I was busy with work for our release that is already late, (and got distracted by other topics).
I will still be busy for several days or a week with release preparation.

about the unit tests
The current tests for cov_type use results exported from Stata.
Results exported from R use a different pattern and naming convention. So we will need a new tests that can copy and adjust (parts of) the existing test method. If R doesn't have confidence intervals, then we can skip those.
Unit test for cluster robust use an adjusted grunfeld data set (setup_class in CheckOLSRobustCluster). We can use either this or some other dataset if you prefer that.

I can help with this, and look into it if you have the R code and statsmodels code that should produce equivalent results.
(I have not exported regression results from R in some time and it will have to look it up again.)

s3alfisc · 2023-02-20T15:47:53Z

Hi @josef-pkt, I have finally added support for the second jackknife type in MNW and added a new option, cluster-crv3. I have also found where you save the Stata results (not sure how I could miss these for this long😅 ) Are these results auto generated, or handcoded? If autogenerated, is there a do-file I should update?

josef-pkt · 2023-02-20T16:20:12Z

We have an ado file that exports the model data.
The do file needs to be written for the individual cases

I will look for my latest version of #1716 and an example.
The old version is not pep-8 compatible, and, IIRC, I made some changes.
The latest Stata version that I have is 14.

josef-pkt · 2023-02-20T17:11:36Z

The version that I used last is now in #8691
https://raw.githubusercontent.com/statsmodels/statsmodels/e628769b08f7027ac008caa32b6c1f0ad5229618/tools/estmat2nparray.ado

I don't think we have a proper helpfile or docstring for it.

s3alfisc · 2023-02-23T20:45:12Z

I have added a first set of tests for OLS and oneway-clustering, which both pass on my laptop. I test the CRV3 estimator against the HC3 estimator when all clusters are singleton, and against my R implementation, but using the Stata schema (unfortunately my Stata license has expired. The R version is tested against Stata here). Let me know if you'd prefer that I integrate the new test into the old test structure. I will add tests for WLS and two-way clustering over the weekend.

s3alfisc · 2023-02-25T11:10:54Z

Hi @josef-pkt, I think this PR is now ready for review. There are two things potentially lacking: tests for two-way clustering (as I have not yet implemented this in my R package) and a cleanup of the test files. Plus of course anything else that you will spot =)

josef-pkt · 2023-02-25T13:18:47Z

Thanks,
I'm leaving tomorrow for a 4 day skiing trip. So, I will not be able to look at it until next week.

Did you try to compare it with the GEE version?

josef-pkt · 2023-02-25T13:21:11Z

style check fails, needs pep-8 corrections

s3alfisc · 2023-02-25T15:53:20Z

I have not yet compared with the GEE version. I'll try to take a closer look tomorrow. So far, the statsmodels implementation is tested against my R package (which is again tested against the Stata package + the cluster jackknife in sandwich::vcovJK()). There also statsmodels-internal tests of HC3 vs CRV3, which need to match exactly (and they do!).

josef-pkt · 2023-02-25T16:38:00Z

comparison with GEE is more for reference, that is, I might have to answer question about differences, or compare those for extension of CRV3 to GLM.

s3alfisc · 2023-04-16T10:21:17Z

Hi @josef-pkt , I have some free time in the next week (vacation coming up). Would you be so kind to retrigger the CI workflow so that I can see what I still need to fix? I vaguely recall that it was mostly style checks that were failing?

josef-pkt · 2023-04-16T13:31:16Z

I have not found a way to restart an old, deleted testrun on Azure.
I usually add a commit and push, or for older PRs rebase on current main and force push.
Some developers close and reopen a PR to trigger CI runs, but I never tried that.

I'm finally done with 0.14 (after floating around for 3 months trying to figure out topics for 0.15)
So, I can help out with this.

s3alfisc · 2023-04-16T14:06:08Z

Ok, great, I will then push an "empty" commit =)

josef-pkt · 2023-04-16T15:24:39Z

looks like style failures are all in the results module
statsmodels/regression/tests/results/results_grunfeld_ols_robust_cluster.py:821:1: E265 block comment should start with '# '
You can do just a global replace in the file

(one unrelated test failure)

s3alfisc · 2023-04-17T14:49:03Z

Only took me four attempts, but looks like I managed to fix all style checks!

josef-pkt · 2023-06-12T21:47:25Z

I'm running some examples from the test cases to try out the PR

crv3 without correction is the same as GEE biasreduced cov_type

res_crv3_nc.bse / res_gee_br.bse
value      1.0
capital    1.0
const      1.0
dtype: float64

res_crv3_nc.bse - res_gee_br.bse
value      6.938894e-17
capital    1.276756e-15
const      2.344791e-13
dtype: float64

s3alfisc · 2023-06-13T19:39:28Z

Let me know if I can be of any help, if still required =)

add crv3 cluster robust inference via the cluster jackknife for OLS/WLS

d994e8d

josef-pkt reviewed Dec 27, 2022

View reviewed changes

statsmodels/stats/sandwich_covariance.py Show resolved Hide resolved

s3alfisc added 2 commits December 28, 2022 11:20

Merge branch 'statsmodels:main' into main

ccb0a26

move computation of 'scale' inside if clause (cluster jackknife does …

e2dc74a

…not use it)

josef-pkt mentioned this pull request Dec 29, 2022

ENH: influence - deletion of multiple observations, leave one group out #7924

Open

josef-pkt mentioned this pull request Jan 4, 2023

score_tests cluster robust, sandwiches #2054

Open

add support for both variants from MNW (jackknife and CRV3)

a5ddd5a

s3alfisc added 6 commits February 23, 2023 19:55

minor changes to _cluster_jackknife()

c7ebeb6

add cluster-crv3 to fit cov_type

195c860

add results from R package summclust

f090e79

add OLS tests

fa479d1

fix jackknife test - add classmethod

26a7538

revert to relative path, delete WLS tests

6eaa460

s3alfisc added 4 commits February 23, 2023 22:18

achieve 100% diff coverage

e5eea3e

add tests for WLS

10e4ba2

rename cluster-jackknife to shorter cluster-jk

1b759d9

fix trailing whitespaces

60f06fb

pip-8 to pass style checks

83054b9

change description to trigger CI run

1496685

s3alfisc added 4 commits April 16, 2023 18:56

fix style

8bd239b

fix style checks II

bc6243b

fix style III

b43d9aa

fix style IV

514470b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENH] Add CRV3 Inference via the Cluster Jackknife to OLS and WLS #8461 #8596

[ENH] Add CRV3 Inference via the Cluster Jackknife to OLS and WLS #8461 #8596

s3alfisc commented Dec 27, 2022 •

edited

Loading

josef-pkt commented Dec 27, 2022

josef-pkt commented Dec 27, 2022

s3alfisc commented Dec 28, 2022 •

edited

Loading

josef-pkt commented Jan 3, 2023

josef-pkt commented Jan 3, 2023

s3alfisc commented Jan 4, 2023

josef-pkt commented Jan 4, 2023

s3alfisc commented Feb 5, 2023

josef-pkt commented Feb 5, 2023

s3alfisc commented Feb 20, 2023

josef-pkt commented Feb 20, 2023

josef-pkt commented Feb 20, 2023

s3alfisc commented Feb 23, 2023 •

edited

Loading

s3alfisc commented Feb 25, 2023

josef-pkt commented Feb 25, 2023

josef-pkt commented Feb 25, 2023

s3alfisc commented Feb 25, 2023 •

edited

Loading

josef-pkt commented Feb 25, 2023

s3alfisc commented Apr 16, 2023 •

edited

Loading

josef-pkt commented Apr 16, 2023 •

edited

Loading

s3alfisc commented Apr 16, 2023

josef-pkt commented Apr 16, 2023

s3alfisc commented Apr 17, 2023

josef-pkt commented Jun 12, 2023

s3alfisc commented Jun 13, 2023

[ENH] Add CRV3 Inference via the Cluster Jackknife to OLS and WLS #8461 #8596

Are you sure you want to change the base?

[ENH] Add CRV3 Inference via the Cluster Jackknife to OLS and WLS #8461 #8596

Conversation

s3alfisc commented Dec 27, 2022 • edited Loading

josef-pkt commented Dec 27, 2022

josef-pkt commented Dec 27, 2022

s3alfisc commented Dec 28, 2022 • edited Loading

josef-pkt commented Jan 3, 2023

josef-pkt commented Jan 3, 2023

s3alfisc commented Jan 4, 2023

josef-pkt commented Jan 4, 2023

s3alfisc commented Feb 5, 2023

josef-pkt commented Feb 5, 2023

s3alfisc commented Feb 20, 2023

josef-pkt commented Feb 20, 2023

josef-pkt commented Feb 20, 2023

s3alfisc commented Feb 23, 2023 • edited Loading

s3alfisc commented Feb 25, 2023

josef-pkt commented Feb 25, 2023

josef-pkt commented Feb 25, 2023

s3alfisc commented Feb 25, 2023 • edited Loading

josef-pkt commented Feb 25, 2023

s3alfisc commented Apr 16, 2023 • edited Loading

josef-pkt commented Apr 16, 2023 • edited Loading

s3alfisc commented Apr 16, 2023

josef-pkt commented Apr 16, 2023

s3alfisc commented Apr 17, 2023

josef-pkt commented Jun 12, 2023

s3alfisc commented Jun 13, 2023

s3alfisc commented Dec 27, 2022 •

edited

Loading

s3alfisc commented Dec 28, 2022 •

edited

Loading

s3alfisc commented Feb 23, 2023 •

edited

Loading

s3alfisc commented Feb 25, 2023 •

edited

Loading

s3alfisc commented Apr 16, 2023 •

edited

Loading

josef-pkt commented Apr 16, 2023 •

edited

Loading