ENH: use shrinkage variance (semi-pooled) in oneway anova, multiple tests #8397
Comments
Going from oneway, I was browsing the citations of the above article in Google Scholar; they are mostly applications to microarray data.
In WLS, using the shrinkage var_weights would correspond to estimating FGLS with a penalized variance function. (Related to SUR: we could extend this to shrinking the cross-sectional cov to a target cov.) Both standard SUR/panel data and vectorized OLS assume a balanced panel, which means the shrinkage does not differ by individual group sample sizes as in the oneway comparison or OLS with group dummies.
Similar methods could apply to other models, e.g. Poisson QMLE with excess dispersion, or NegBin and other (excess) dispersion models. Binary/Binomial? Beta-Binomial if we have counts.
It can also apply to the two-sample t-test.
Adding data-dependent shrinkage methods will take more time, and we first need to figure out which of the available "optimal" shrinkage estimators to implement. This empirical Bayes version has a large number of citations, but it estimates the prior variance and does not just use the standard pooled variance as the shrinkage target (AFAICS from brief skimming). I found the article because the following article is based on it and uses Smyth as the main reference for shrinkage.
Aside, a problem for implementation:
Interface idea: use a dict for variance shrinkage options if not None, and maybe an option for handling cases where the variance in a sample is zero: lower bound or ignore (e.g. nobs_i = 1?).
Aside: currently, e.g. for cov_kwds of cov_type, we validate mainly case by case and not at the source.
(mainly parking a reference for an old idea)
Variance and covariance estimates are not very good in small or very small samples.
One idea is to use penalized or shrinkage (co)variance estimators to get better small-sample properties. In oneway and similar cases with heteroscedasticity, we can shrink to a pooled estimate.
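As a minimal sketch of this semi-pooled idea (the function name and the fixed shrinkage weight are illustrative assumptions, not a proposed API):

```python
import numpy as np

def shrunk_group_variances(samples, weight=0.5, ddof=1):
    """Shrink per-group sample variances toward the pooled variance.

    ``weight`` is the weight on the within-group variance; a fixed value
    is used here only for illustration, a real rule would depend on the
    group sample sizes.
    """
    variances = np.array([np.var(s, ddof=ddof) for s in samples])
    nobs = np.array([len(s) for s in samples])
    # pooled variance: degrees-of-freedom weighted average of group variances
    df = nobs - ddof
    var_pooled = (df * variances).sum() / df.sum()
    # convex combination of within-group and pooled variance
    return weight * variances + (1 - weight) * var_pooled
```

With `weight=1` this returns the raw group variances, with `weight=0` every group gets the pooled variance; anything in between is the semi-pooled case.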
related:
#3197 cov shrinkage, penalized (also related outlier robust cov methods)
#2942 applied to GMM weight/cov matrix
#2882 similar issue for mean params shrinkage/penalization to a target
The following looks like a good starting point for oneway anova and multiple, e.g. pairwise, comparisons (context in #8396 )
Adding this to individual hypothesis test functions in stats is much easier than integrating it into cov_type in models (which I have not yet tried out).

Cui, Xiangqin, J. T. Gene Hwang, Jing Qiu, Natalie J. Blades, and Gary A. Churchill. 2005. “Improved Statistical Tests for Differential Gene Expression by Shrinking Variance Components Estimates.” Biostatistics 6 (1): 59–75. https://doi.org/10.1093/biostatistics/kxh018.
They include references to previous literature on semi-pooled variances in tests.
The article has 608 citations, so there might be a lot more in this direction.
extension idea:
The pooled estimate would have enough observations to also use estimators other than the simple variance, e.g. outlier-robust estimators; i.e., we could shrink to a robust, pooled variance estimate. (A possible problem: a robust "variance" may be just a dispersion measure, possibly calibrated to the normal distribution.)
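A sketch of such a robust pooled estimate, using the MAD of the pooled median-centered residuals calibrated to the normal distribution (the function name and the choice of MAD are illustrative only):

```python
import numpy as np

def robust_pooled_variance(samples):
    """Robust pooled "variance": squared, normal-calibrated MAD of the
    pooled centered residuals (illustrative sketch).
    """
    # center each group by its median, then pool the residuals
    resid = np.concatenate([np.asarray(s) - np.median(s) for s in samples])
    mad = np.median(np.abs(resid))
    # 1.4826 ~ 1 / Phi^{-1}(0.75) makes the MAD consistent for the
    # standard deviation at the normal distribution
    return (1.4826 * mad) ** 2
```

This illustrates the calibration caveat in the text: the factor 1.4826 makes the estimate comparable to a variance only under (approximate) normality.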
The main objective here is to improve heteroscedasticity-robust hypothesis tests in small samples. If some group sizes are small, then a variance estimate using only within-group information is too noisy.
e.g. Welch anova and multiple comparisons with unequal variances.
(Related, AFAIR: BF mean anova does not use weights in computing the average, but uses the group variances for inference. In meta-analysis we also have the option of using variance weights or not for the weighted average, IIRC.)
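To illustrate on the simplest case, a Welch-type two-sample t-test where both variances are shrunk toward the pooled variance; the fixed weight is a placeholder for a data-dependent rule, and the function name is hypothetical:

```python
import numpy as np
from scipy import stats

def ttest_shrunk(x, y, weight=0.5):
    """Welch-type two-sample t-test with variances shrunk toward the
    pooled variance. ``weight`` is the weight on the within-sample
    variances; a fixed value is used here only as a placeholder.
    """
    nx, ny = len(x), len(y)
    vx, vy = np.var(x, ddof=1), np.var(y, ddof=1)
    v_pooled = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
    # shrink each sample variance toward the pooled variance
    vx_s = weight * vx + (1 - weight) * v_pooled
    vy_s = weight * vy + (1 - weight) * v_pooled
    se = np.sqrt(vx_s / nx + vy_s / ny)
    t = (np.mean(x) - np.mean(y)) / se
    # Welch-Satterthwaite degrees of freedom using the shrunk variances
    df = se**4 / ((vx_s / nx) ** 2 / (nx - 1) + (vy_s / ny) ** 2 / (ny - 1))
    p = 2 * stats.t.sf(abs(t), df)
    return t, p
```

With `weight=1` this reduces to the standard Welch t-test; with `weight=0` both standard error terms use the pooled variance. (The df correction for shrunk variances is itself an open question and is only approximated here.)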
The main decision is how much to shrink, i.e. choosing the weights between the sample and pooled variances. Those weights should depend on the sample sizes: if a group is large, then we do not need to shrink its variance.
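For example, one simple sample-size dependent rule weights by the group degrees of freedom relative to a prior/pooled degrees-of-freedom constant (the name `prior_df`, its default, and the functional form are assumptions for illustration, not a recommendation):

```python
import numpy as np

def shrinkage_weights(nobs, prior_df=10):
    """Illustrative sample-size dependent shrinkage weights.

    The weight on the within-group variance grows with the group's
    degrees of freedom; ``prior_df`` plays the role of the prior /
    pooled degrees of freedom and is an assumed tuning constant.
    """
    df = np.asarray(nobs) - 1
    # weight -> 0 for tiny groups (mostly pooled variance),
    # weight -> 1 for large groups (mostly within-group variance)
    return df / (df + prior_df)
```

Data-dependent choices, e.g. the empirical Bayes estimate of the prior degrees of freedom in Smyth's approach, would replace the fixed `prior_df` here.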