ENH: minimum (gof) distance estimator #7412

Open
josef-pkt opened this issue Apr 10, 2021 · 8 comments

Comments

@josef-pkt

josef-pkt commented Apr 10, 2021

https://rdrr.io/rforge/distrMod/man/MDEstimator.html
https://stackoverflow.com/questions/67007706/how-to-calculate-efficient-minimum-distance-in-python-regression

edit: https://rdrr.io/rforge/fitdistrplus/man/mgedist.html
EDF-based gof criteria for estimation (KS, CvM, AD) following Luceño (2006)

Minimize a gof statistic (Hellinger, AD, Cramér-von Mises) to estimate parametric models or the parameters of a distribution.

These are robust to outliers.

I looked at Hellinger distance a long time ago, but there is no code in the sandbox.
The only related code in the sandbox is mutual information, and that was intended more as a correlation measure.

This might be useful if we want to estimate predictive distributions.
#7142

maybe:
For some distributions MLE for parameter estimation has a bad reputation, e.g. genextreme, and alternative estimators have become popular in some fields, e.g. minimum spacings.
(However, even in many of those cases MLE works fine with some limitations and good starting values.)
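As a rough illustration of the minimum distance / "maximum goodness-of-fit" idea, a minimal sketch (not statsmodels code; the use of scipy's `cramervonmises` and `genextreme`, the data, and the starting-value strategy are all illustrative choices) that estimates genextreme parameters by minimizing the Cramér-von Mises statistic, with the MLE as starting values:

```python
import numpy as np
from scipy import optimize, stats

def cvm_objective(params, data):
    # params = (shape c, loc, scale) of scipy.stats.genextreme
    c, loc, scale = params
    if scale <= 0:
        return np.inf
    # Cramér-von Mises statistic of the data against the candidate cdf
    return stats.cramervonmises(data, stats.genextreme(c, loc, scale).cdf).statistic

data = stats.genextreme.rvs(-0.2, loc=1, scale=2, size=500, random_state=1234)

# use the MLE as starting values, then minimize the CvM distance (derivative free)
start = stats.genextreme.fit(data)
res = optimize.minimize(cvm_objective, start, args=(data,), method="Nelder-Mead")
print(res.x)  # minimum-CvM estimates of (c, loc, scale)
```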

One issue for these types of estimators and tests is whether they extend to conditional distributions, e.g. in a regression setting with explanatory variables.
Related to gof testing in regression models:
#7154 gof EDF tests for regression
#5408
#3904

@josef-pkt

For copulas:

Weiß, G. Copula parameter estimation by maximum-likelihood and minimum-distance estimators: a simulation study. Comput Stat 26, 31–54 (2011). https://doi.org/10.1007/s00180-010-0203-7

I only looked at the abstract. It sounds pretty negative on MD compared to MLE.

@josef-pkt

An application, from browsing some literature: a weighted likelihood representation for some minimum distance estimators, for outlier-robust estimation including discrete models.
#4266

The following and related references:

Markatou, Marianthi, Ayanedranath Basu, and Bruce Lindsay. 1997. “Weighted Likelihood Estimating Equations: The Discrete Case with Applications to Logistic Regression.” Journal of Statistical Planning and Inference, Robust Statistics and Data Analysis, Part II, 57 (2): 215–32. https://doi.org/10.1016/S0378-3758(96)00045-6.

(I downloaded this and added it to Zotero in 2014 and again in 2016 when working on robust estimators.)

Aside: Hellinger distance is a density distance, not an EDF/CDF distance.

Related literature: trimmed likelihood (observations downweighted to zero or dropped).

@josef-pkt

This article looks explicit enough to translate to code:
Lu, Zudi, Yer Van Hui, and Andy H. Lee. 2003. “Minimum Hellinger Distance Estimation for Finite Mixtures of Poisson Regression Models and Its Applications.” Biometrics 59 (4): 1016–26. https://doi.org/10.1111/j.0006-341X.2003.00117.x.

It looks like they use the distance for aggregate relative frequencies or probabilities, see eq. 3.2 to 3.6, especially 3.3, the empirical relative frequency.
AFAIR, that is also what we use in the Vuong test for Poisson, and what I used as a diagnostic (plot).
Hellinger uses the sqrt.
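A minimal sketch (illustrative only, not taken from the article, and not statsmodels API) of the squared Hellinger distance between the empirical relative frequencies of count data and the predicted probabilities of a Poisson model on a truncated support; the sqrt of the frequencies and probabilities is the "Hellinger uses sqrt" part:

```python
import numpy as np
from scipy import stats

def hellinger_count(y, prob):
    # squared Hellinger distance between empirical relative frequencies of the
    # counts y on support 0..k_max and model probabilities prob on that support
    k_max = len(prob) - 1
    freq = np.bincount(np.asarray(y), minlength=k_max + 1)[:k_max + 1] / len(y)
    return 0.5 * np.sum((np.sqrt(freq) - np.sqrt(prob))**2)

y = stats.poisson.rvs(3.0, size=1000, random_state=123)
support = np.arange(y.max() + 1)
prob = stats.poisson.pmf(support, y.mean())   # predicted probabilities at a fitted mean
print(hellinger_count(y, prob))
```

Minimizing this over the model parameters, instead of just evaluating it at the fitted mean, would give a minimum Hellinger distance estimator on the aggregated frequencies.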

@josef-pkt

Two likely candidates for implementation:

  • "density power divergence approach"
    includes bias correction term for fisher consistency of estimating equations, in literature available for Logit and Poisson, and gaussian
    references Basu, Gosh including article on GLM regression

  • "Maximum Lq-likelihood Estimation"

no correction for Fisher consistency include, requires adjustment, bias correction of estimated parameters

Both are advertised as not needing a kernel density estimator, i.e. they use the empirical density.

For discrete models, Hellinger distance or power divergence can be computed directly on the finite/countable support (a sketch follows below).
-> Check the predicted aggregate distribution or frequencies for count models, as in my ZI count notebook, as a measure of fit when comparing models. I only used the squared (chisquare?) difference in probabilities, AFAIR.
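A minimal sketch of the first candidate, using the density power divergence objective in the form I understand from the Basu et al. literature, evaluated directly on the countable support of a Poisson model so no kernel density estimate is needed; the choice of alpha, the truncation of the support, and all names are illustrative, not statsmodels code:

```python
import numpy as np
from scipy import optimize, stats

def dpd_objective(mu, y, alpha=0.5, k_max=None):
    # density power divergence objective for a Poisson(mu) model:
    #   sum_k f(k)**(1 + alpha)  -  (1 + 1/alpha) * mean_i f(y_i)**alpha
    # the first (summed) term plays the role of the Fisher consistency correction
    if mu <= 0:
        return np.inf
    if k_max is None:
        k_max = int(3 * max(y.max(), mu)) + 20   # crude truncation of the countable support
    support = np.arange(k_max + 1)
    f_support = stats.poisson.pmf(support, mu)
    f_obs = stats.poisson.pmf(y, mu)
    return np.sum(f_support**(1 + alpha)) - (1 + 1 / alpha) * np.mean(f_obs**alpha)

y = stats.poisson.rvs(3, size=500, random_state=0)
y[:25] = 30   # contaminate with a few outliers
res = optimize.minimize_scalar(dpd_objective, bounds=(0.1, 20), args=(y,), method="bounded")
print(res.x, y.mean())   # the DPD estimate should be pulled less by the outliers than the mean
```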

Maybe similarly for multinomial models, i.e. the chisquare distance or similar that I computed for the ordered model (similar to the Hosmer-Lemeshow test).
This would be more a gof statistic used for model selection, and not for estimating the parameters of a given model.

(We might get measures for outlier identification based on the implied weights, even at the MLE, even if we don't do robust estimation.)

@josef-pkt

josef-pkt commented Apr 19, 2021

Aside: AIC for MD

Very brief look at the following article.
They have a version of AIC that uses the ratio tr(inv(J) * K), which looks analogous to GAIC and TIC with an information matrix ratio
(definition 1 and the following parts).

Kurata, Sumito, and Etsuo Hamada. 2018. “A Robust Generalization and Asymptotic Properties of the Model Selection Criterion Family.” Communications in Statistics - Theory and Methods 47 (3): 532–47. https://doi.org/10.1080/03610926.2017.1307405.

Same authors, with a comparison to other related IC versions:
Kurata, Sumito, and Etsuo Hamada. 2020. “On the Consistency and the Robustness in Model Selection Criteria.” Communications in Statistics - Theory and Methods 49 (21): 5175–95. https://doi.org/10.1080/03610926.2019.1615093.
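For reference, the tr(inv(J) * K) penalty structure mentioned above is simple to compute once per-observation scores and the Hessian are available. This is a generic numpy sketch of a TIC-style criterion, not the criterion from the Kurata and Hamada papers:

```python
import numpy as np

def tic_like(loglike, score_obs, hessian):
    # score_obs: (nobs, k_params) per-observation scores at the estimate
    # hessian:   (k_params, k_params) Hessian of the total log-likelihood at the estimate
    K = score_obs.T @ score_obs                # outer product of scores
    J = -hessian                               # observed information
    penalty = np.trace(np.linalg.solve(J, K))  # tr(J^{-1} K); equals k_params under a correct model
    return -2 * loglike + 2 * penalty
```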

@josef-pkt

josef-pkt commented Apr 21, 2021

aside:
Computational formulas for EDF-based gof statistics are in the GOF class in statsmodels.sandbox.distributions.gof_new.
The GOF class includes d, d+, d-, a2, w2, and u2, v, a. (I don't remember what a is; likely a variant of the AD A2. Needs docstrings.)

The appendix in Luceño (2006) includes computational formulas for the KS D, AD A2, and CvM W2, and variations of AD with different weights in the denominator.

Luceño, Alberto. 2006. “Fitting the Generalized Pareto Distribution to Data Using Maximum Goodness-of-Fit Estimators.” Computational Statistics & Data Analysis 51 (2): 904–17. https://doi.org/10.1016/j.csda.2005.09.011.
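For comparison, the standard computational forms of these EDF statistics from sorted cdf values z_i = F(x_(i)) are short. This is a sketch using the usual textbook formulas, not a copy of gof_new or of Luceño's appendix:

```python
import numpy as np

def edf_statistics(cdf_values):
    # textbook computational formulas for KS D, CvM W2, AD A2 from cdf values
    z = np.sort(np.asarray(cdf_values))
    n = len(z)
    i = np.arange(1, n + 1)
    d_plus = np.max(i / n - z)
    d_minus = np.max(z - (i - 1) / n)
    w2 = np.sum((z - (2 * i - 1) / (2 * n))**2) + 1 / (12 * n)
    a2 = -n - np.mean((2 * i - 1) * (np.log(z) + np.log(1 - z[::-1])))
    return {"D": max(d_plus, d_minus), "W2": w2, "A2": a2}
```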

note:
The GOF class computes the cdf values in __init__ and assumes iid observations.
This needs an extension to independent but not identically distributed observations, e.g. random variables that depend on explanatory variables.
Also: creating random samples (rvs) inside the class is not really appropriate and should be removed. This was written in analogy to scipy's ks test. However, we might need rvs for the bootstrap case as in the extra bootstrap functions.

@josef-pkt

The test statistic has a weighted sum of chisquare distributions as its distribution #3363

Basu, A., A. Mandal, N. Martin, and L. Pardo. 2013. “Testing Statistical Hypotheses Based on the Density Power Divergence.” Annals of the Institute of Statistical Mathematics 65 (2): 319–48. https://doi.org/10.1007/s10463-012-0372-y.

@josef-pkt

josef-pkt commented May 7, 2021

Maximizing the Lq likelihood is off the table for now.

The correction, i.e. the parameter transformation for (Fisher) consistency, is almost nonexistent in the literature. There is no (clear) description of the transformation for specific cases, and I don't see a general way of deriving or computing it.

Also, reading some of the small print in Ferrari and Yang (2010): the consistency proof has the assumption
"Let q_n be a sequence such that q_n -> 1 as n -> inf".
This means the estimator is asymptotically MLE; q = 1 corresponds to MLE.
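For concreteness, the Lq "log" in Ferrari and Yang (2010) is L_q(u) = (u^(1-q) - 1) / (1 - q), which reduces to log(u) as q -> 1, so the q_n -> 1 assumption makes the estimator asymptotically the MLE. A minimal sketch (illustrative names, not statsmodels code):

```python
import numpy as np

def lq(u, q):
    # Lq "logarithm": (u**(1 - q) - 1) / (1 - q); equals log(u) in the limit q -> 1
    u = np.asarray(u, dtype=float)
    if np.isclose(q, 1.0):
        return np.log(u)           # q = 1 gives the ordinary log-likelihood contribution
    return (u**(1.0 - q) - 1.0) / (1.0 - q)

# maximum Lq-likelihood objective: sum(lq(f(y_i; theta), q)) over observations;
# for q < 1 the contribution of very low density (outlying) observations is bounded,
# unlike the ordinary log-likelihood.
```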

It might be possible to implement some special cases, maybe for Beta regression (*). But that might require some guesswork about how the parameter transformation for consistency is actually done.

Some later articles by other authors also don't say anything, or much, about the reparameterization and (Fisher) consistency.
A reparameterization or parameter transformation might be computationally simpler than a Fisher consistency term in the estimating equation, but the theoretical derivation of the transformation is very unclear.

(I didn't really read any of those articles, but was looking specifically for Fisher consistency or bias correction.)

It is possible that the parameter transformation for the canonical GLM/LEF case is just theta = theta_e / q, but that is only mentioned in comments; I haven't seen any derivation or proof.

Ribeiro, Terezinha K. A., and Silvia L. P. Ferrari. 2020. “Robust Estimation in Beta Regression via Maximum Lq-Likelihood.” ArXiv:2010.11368 [Stat], October. http://arxiv.org/abs/2010.11368.
It seems to have an explicit reparameterization without showing where it comes from.
(I cannot find the supplementary material mentioned in the article.)

Similar: density power divergence for Beta regression (I didn't look much at it).
Ghosh, Abhik. 2019. “Robust Inference under the Beta Regression Model with Application to Health Care Studies.” Statistical Methods in Medical Research 28 (3): 871–88. https://doi.org/10.1177/0962280217738142.

For several models the Fisher consistency term for the density power divergence is analytically available.
Poisson and similar models require summing over the points of the density, similar to what we have in predict_prob.
