
ENH: use shrinkage variance (semi-pooled) in oneway anova, multiple tests #8397

josef-pkt opened this issue Sep 6, 2022 · 3 comments
josef-pkt commented Sep 6, 2022

(mainly parking a reference for an old idea)

Variance and covariance estimates are not very good in small or very small samples.
One idea is to use penalized or shrinkage (co)variance estimates to get better small sample properties. In oneway and similar cases with heteroscedasticity, we can shrink towards a pooled estimate.

related
#3197 cov shrinkage, penalized (also related outlier robust cov methods)
#2942 applied to GMM weight/cov matrix
#2882 similar issue for shrinkage/penalization of mean params towards a target

The following looks like a good starting point for oneway anova and multiple, e.g. pairwise, comparisons (context in #8396).
Adding this to individual hypothesis test functions in stats is much easier than integrating it into cov_type in models (which I have not yet tried out).

Cui, Xiangqin, J. T. Gene Hwang, Jing Qiu, Natalie J. Blades, and Gary A. Churchill. 2005. “Improved Statistical Tests for Differential Gene Expression by Shrinking Variance Components Estimates.” Biostatistics 6 (1): 59–75. https://doi.org/10.1093/biostatistics/kxh018.

They include references to the earlier literature on semi-pooled variances in tests.
The article has 608 citations, so there might be a lot more in this direction.

Extension idea:
The pooled estimate would have enough observations to also use estimators other than the simple variance, e.g. outlier robust estimators, i.e. we could shrink towards a robust, pooled variance estimate (see the sketch below). A possible problem is that a robust "variance" may be just a dispersion measure, possibly calibrated to the normal distribution.
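A rough sketch of what such a robust pooled target could look like (the helper name is hypothetical; it assumes scipy's normal-calibrated MAD as the robust scale, which illustrates exactly the calibration caveat above):

```python
import numpy as np
from scipy.stats import median_abs_deviation

def robust_pooled_var(samples):
    # center each group at its median, then pool the centered observations
    pooled = np.concatenate([x - np.median(x) for x in samples])
    # MAD calibrated to the normal distribution; squared to get a variance-like scale
    return median_abs_deviation(pooled, scale="normal") ** 2
```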

The main objective here is to improve heteroscedasticity robust hypothesis tests in small samples. If some group sizes are small, then a variance estimate using only within-group information is too noisy,
e.g. in Welch anova and in multiple comparisons with unequal variances.
(Related, AFAIR:
Brown-Forsythe (BF) mean anova does not use weights in computing the average, but uses the group variances for inference.
In meta-analysis we also have the option to use variance weights or not for the weighted average, IIRC.)

The main decision is how much to shrink, e.g. choosing the weights between the sample and pooled variances. Those weights should depend on the sample sizes: if a group is large, then we do not need to shrink its variance. A minimal sketch is below.
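As an illustration of such size-dependent weights (not an existing statsmodels function; the helper name and the prior strength k=10 are arbitrary illustrative choices, cf. the shrink_arg discussion further down):

```python
import numpy as np

def shrink_variances(samples, k=10):
    """Semi-pooled variances: shrink each group variance towards the pooled one.

    The pooled target counts as k pseudo-observations, so small groups are
    shrunk strongly while large groups keep (most of) their own variance.
    """
    nobs = np.array([len(x) for x in samples])
    var = np.array([np.var(x, ddof=1) for x in samples])
    # pooled variance, weighted by within-group degrees of freedom
    var_pooled = np.sum((nobs - 1) * var) / np.sum(nobs - 1)
    w = nobs / (nobs + k)
    return w * var + (1 - w) * var_pooled

rng = np.random.default_rng(0)
samples = [rng.normal(0, s, size=n) for s, n in [(1.0, 5), (2.0, 8), (1.5, 50)]]
print(shrink_variances(samples))  # the n=5 and n=8 variances move towards the pool
```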

josef-pkt commented:
going from oneway stats to models:

I was browsing the citations of the above article in Google Scholar; they are mostly applications to microarray data.
I did not read the articles, but here are a few related ideas.

  • OLS with a oneway categorical plus possibly other exog:
    If we allow for group heteroscedasticity, then the residual variances depend on the groups (e.g. cov_type HC depending only on group dummies). If groups are small, then we might want to shrink towards a pooled estimate of the residual variance.
    The cov_type correction would use a group specific scale (variance) but would otherwise be nonrobust.
  • In WLS we could use var_weights based on shrinkage group variances (see the sketch below).
  • vectorized OLS:
    Each group forms a separate regression with a separate group specific residual variance, i.e. no pooling of any information across groups. In the current implementation we estimate the scale for each group separately. Instead, we could add an option to shrink the scales towards a given target.
    If the vectorized OLS model includes all groups, then the pooled residual variance can be computed from the data in the model. If we batch groups, then the target needs to be predefined, i.e. we need a separate step to compute the pooled variance.

In WLS, using the shrinkage var_weights would correspond to estimating FGLS with a penalized variance function.
Vectorized OLS would correspond to SUR with a diagonal cross-sectional covariance matrix, with a penalized variance function on the diagonal.
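A minimal FGLS-style sketch of the WLS idea (nothing here is an existing statsmodels option; the prior strength of 10 pseudo-observations is again only illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_per_group = np.array([6, 8, 40])
groups = np.repeat(np.arange(3), n_per_group)
x = rng.normal(size=groups.size)
y = 1 + 2 * x + rng.normal(scale=np.repeat([1.0, 2.0, 0.5], n_per_group))

# step 1: group residual variances from OLS, shrunk towards the pooled variance
resid = sm.OLS(y, sm.add_constant(x)).fit().resid
var_g = np.array([resid[groups == g].var(ddof=1) for g in range(3)])
var_pooled = np.sum((n_per_group - 1) * var_g) / np.sum(n_per_group - 1)
w = n_per_group / (n_per_group + 10.0)
var_shrunk = w * var_g + (1 - w) * var_pooled

# step 2: WLS with var_weights from the shrunken group variances
res = sm.WLS(y, sm.add_constant(x), weights=1.0 / var_shrunk[groups]).fit()
print(res.params)
```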

(related to SUR: we could extend this to shrinking cross-sectional cov to a target cov)

Both standard SUR/panel data and vectorized OLS assume a balanced panel, which means the shrinkage does not differ by individual group sample sizes as it does in the oneway comparison or in OLS with group dummies.

Similar methods could apply to other models, e.g. Poisson QMLE with excess dispersion, or NegBin and other (excess) dispersion models. Binary/Binomial? Beta-Binomial if we have counts.

josef-pkt commented Sep 12, 2022

It can also apply to the two-sample t-test,
e.g. add an additional option use_var="shrink" and extra keywords method_shrink and shrink_arg,
where we initially shrink only by a number of "prior" observations, e.g. shrink_arg=10 means we assume the pooled variance corresponds to 10 observations, and the shrinking weights are nobsi / (nobsi + 10) and 10 / (nobsi + 10). A minimal sketch is below.
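A sketch of what such a use_var="shrink" could compute (use_var="shrink", method_shrink, and shrink_arg do not exist yet; the df handling is an open question, here plain Welch-Satterthwaite with the shrunken variances plugged in):

```python
import numpy as np
from scipy import stats

def ttest_shrink(x1, x2, shrink_arg=10):
    """Welch-type two-sample t-test with semi-pooled (shrunken) variances."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = np.var(x1, ddof=1), np.var(x2, ddof=1)
    v_pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    # shrinking weights nobsi / (nobsi + shrink_arg) and shrink_arg / (nobsi + shrink_arg)
    v1s = (n1 * v1 + shrink_arg * v_pooled) / (n1 + shrink_arg)
    v2s = (n2 * v2 + shrink_arg * v_pooled) / (n2 + shrink_arg)
    se2 = v1s / n1 + v2s / n2
    statistic = (np.mean(x1) - np.mean(x2)) / np.sqrt(se2)
    # Welch-Satterthwaite df, computed with the shrunken variances
    df = se2**2 / ((v1s / n1) ** 2 / (n1 - 1) + (v2s / n2) ** 2 / (n2 - 1))
    return statistic, 2 * stats.t.sf(np.abs(statistic), df)
```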

Adding data-dependent shrink methods will take more time, and we first need to figure out which of the available "optimal" shrinkage methods to implement.

This empirical Bayes version has a large number of citations:
Smyth, Gordon K. 2004. “Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments.” Statistical Applications in Genetics and Molecular Biology 3 (1). https://doi.org/10.2202/1544-6115.1027.

but it estimates the prior variance and does not just use the standard pooled variance as the shrinking target (AFAICS from brief skimming).

I found the article because the following article is based on it and uses Smyth as the main reference for shrinkage:
Qiu, Jing, Yue Qi, and Xiangqin Cui. 2014. “Applying Shrinkage Variance Estimators to the TOST Test in High Dimensional Settings.” Statistical Applications in Genetics and Molecular Biology 13 (3): 323–41. https://doi.org/10.1515/sagmb-2013-0045.

Aside, a problem for implementation:
AFAICS, the literature on variance shrinking for microarrays that includes optimal shrinkage weights is all for balanced panels, i.e. equal nobs in all samples.
This makes it more difficult to mainly shrink the variances of the groups that have fewer observations.

josef-pkt commented Sep 15, 2022

interface idea:

Use a dict for the variance shrinkage options, e.g. a keyword shrink_var=None in the relevant test functions or methods of classes, where shrinkage is applied if it is not None:

shrink_option = {
    "???": "add",  # additive or multiplicative shrinkage; keyword name to be decided
    "weight": None,  # value of the shrinkage parameter: additive weight or power
        # coefficient, float in (0, 1), maybe a tuple if the sum < 1;
        # possibly a string for an "optimal" shrinkage method (if we have data to compute it)
    "target": "mean",  # string or float, the shrinkage target; "mean" works if we have
        # two or more samples, and e.g. target="geom" for the geometric instead of the
        # arithmetic mean
    "ddof": 0,  # degrees of freedom correction, not sure whether we need it,
        # e.g. do we correct the Welch df for shrinkage? Which effective df do we use?
}

And maybe an option for handling cases where the variance in a sample is zero (e.g. if nobsi=1?): use a lower bound or ignore it.
(The geometric mean is zero if any variance is zero.)

Aside:
We might need to validate the dict keys, i.e. check that they are valid and not misspelled or unused keywords.
Maybe a helper function to update a dict with another dict, or just one that validates the keys before
kwds_default.update(kwds_user)
A possible sketch is below.
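For example, a hypothetical helper (not existing statsmodels code) that validates before updating, so misspelled option keys raise instead of being silently ignored:

```python
def update_validated(kwds_default, kwds_user):
    """Update default options with user options, rejecting unknown keys."""
    invalid = set(kwds_user) - set(kwds_default)
    if invalid:
        raise ValueError("invalid keyword(s): %s" % sorted(invalid))
    kwds = dict(kwds_default)
    kwds.update(kwds_user)
    return kwds
```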

Currently, e.g. for the cov_kwds of cov_type, we validate mainly case by case and not at the source.
That was easier in that case because the valid cov_kwds depend on the cov_type.
