
ENH: right truncated count models for outlier trimming #9146

Open
josef-pkt opened this issue Feb 9, 2024 · 1 comment

Comments

@josef-pkt
Member

Related to robust regression, e.g. #3315, #8690 (comment).

The problem is imposing Fisher consistency.

One possibility would be a full MLE model that is right truncated at the trimming points. Then we should get consistent estimates for the underlying distribution parameter(s).

For example, trim at isf(alpha; theta), where alpha is e.g. 0.025, 0.01 or similar.
The weight function is then the indicator function for {y : y < isf(alpha; theta)}.
The truncation point will differ across observations.
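A minimal sketch of the trimming rule, assuming a Poisson model with per-observation predicted means mu; the helper names and the default alpha are illustrative, not existing statsmodels API:

```python
import numpy as np
from scipy import stats


def trim_thresholds(mu, alpha=0.025):
    """Per-observation right-trimming points isf(alpha; mu_i) for a Poisson model.

    mu : array of predicted means, one per observation
    alpha : tail probability used for trimming, e.g. 0.025 or 0.01
    """
    return stats.poisson.isf(alpha, np.asarray(mu))


def indicator_weights(y, mu, alpha=0.025):
    """Indicator weight function: 1 if y_i < isf(alpha; mu_i), else 0."""
    c = trim_thresholds(mu, alpha)
    return (np.asarray(y) < c).astype(float)
```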

  • Need a preliminary estimator for parameter theta (mu in the Poisson case), e.g. a robust initial estimate.
  • Iterate the weighted/truncated MLE to update theta.
  • Drop outliers and re-estimate the truncated MLE (see the sketch after this list).
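A rough sketch of the iteration, using a hand-written right-truncated Poisson likelihood via GenericLikelihoodModel as a stand-in for the proposed model. The class and function names are illustrative, y and x are assumed to be NumPy arrays with x already containing a constant, and a plain Poisson fit stands in for a robust initial estimator:

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.base.model import GenericLikelihoodModel


class RightTruncatedPoisson(GenericLikelihoodModel):
    """Poisson MLE with observation-specific right truncation.

    trunc[i] is the largest count that is kept for observation i.
    Illustrative only; not an existing statsmodels model.
    """

    def __init__(self, endog, exog, trunc, **kwargs):
        super().__init__(endog, exog, **kwargs)
        self.trunc = np.asarray(trunc)

    def nloglikeobs(self, params):
        mu = np.exp(self.exog @ params)
        # conditional log-likelihood given Y <= trunc[i]
        ll = (stats.poisson.logpmf(self.endog, mu)
              - stats.poisson.logcdf(self.trunc, mu))
        return -ll


def trimmed_poisson(y, x, alpha=0.01, maxiter=5):
    # preliminary estimate of theta (plain Poisson as a stand-in
    # for a robust initial estimator)
    params = sm.Poisson(y, x).fit(disp=0).params
    res = None
    for _ in range(maxiter):
        mu = np.exp(x @ params)
        c = stats.poisson.isf(alpha, mu)   # per-observation trimming points
        keep = y < c                       # drop right-tail outliers
        # largest kept count is c - 1, so condition the likelihood on Y <= c - 1
        mod = RightTruncatedPoisson(y[keep], x[keep], trunc=c[keep] - 1)
        res = mod.fit(start_params=params, disp=0)
        params = res.params
    return res
```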

For predict and results we treat it as an untruncated model, i.e. predicted mean = mu. This simplifies post-estimation by a large amount.
(E.g. the truncated mean does not have a closed-form expression, and we would need to compute it by enumeration.)
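For illustration, the truncated mean requires enumeration over the support, while the untruncated predicted mean is just mu; the numbers below are made up:

```python
import numpy as np
from scipy import stats

mu, c = 3.0, 8                     # illustrative mean and right-truncation point
k = np.arange(int(c) + 1)

# truncated mean E[Y | Y <= c]: needs enumeration over the support
mean_trunc = (k * stats.poisson.pmf(k, mu)).sum() / stats.poisson.cdf(c, mu)

# untruncated predicted mean used for predict/results: just mu
mean_untrunc = mu
```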

The first target is small counts only, i.e. we only need right truncation and do not need to look at left outliers.

Possible theoretical problem:
Trimming/truncation is endogenous, i.e. data dependent through the parameters, and the support of the truncated distribution depends on the parameters.
Standard MLE theory would not apply, but the effect should be "negligible" if we only truncate in the far tail.

@josef-pkt
Member Author

Parking a (semi-random) idea; it should be more general for robust GLM, Poisson, logit, ...

The Fisher consistency term needs the expectation of the estimating equation w.r.t. the distribution. The distribution depends on the estimated distribution parameter for each observation, e.g. mu in Poisson.
So I thought it would require slow computation if we have to compute the consistency factor by numerical integration for each observation.

The current idea is to estimate this expectation under the aggregate predictive distribution, e.g. using the pdf values that we already compute in get_prediction with aggregate=True (predictive probs for count data models).

E[psi(y, x)] = ∫∫ psi(y, x) dF(y | theta) dF(x), with estimated predicted distribution parameters theta = theta(x).
(F(theta | x) is a point mass at predict(x), i.e. theta is just one point for each x.)
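A loose sketch of the aggregate idea, assuming Poisson predicted means and a weight/psi function that depends on the count only (the actual psi depends on (y, x)); the helpers are illustrative and build the aggregate pmf directly rather than through any existing get_prediction option:

```python
import numpy as np
from scipy import stats


def aggregate_predictive_pmf(mu, y_max):
    """Sample-averaged predictive pmf for a Poisson model.

    mu : predicted means, one per observation
    y_max : largest count included in the enumeration
    """
    k = np.arange(y_max + 1)
    # n x (y_max + 1) matrix of per-observation predicted probabilities
    probs = stats.poisson.pmf(k[None, :], np.asarray(mu)[:, None])
    return k, probs.mean(0)


def expected_weight(weight, mu, y_max):
    """Approximate E[w(Y)] under the aggregate predictive distribution."""
    k, pmf = aggregate_predictive_pmf(mu, y_max)
    return (weight(k) * pmf).sum()
```

With the indicator weights from the trimming rule above, this would approximate the expected kept fraction under the model; the same enumeration could be used for the expectation needed in the consistency correction, at the aggregate level instead of per observation.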

Detail:
We will need to truncate the aggregate predictive distribution at some y. This will be difficult if the distribution is heavy tailed, i.e. the tail weight cannot be ignored except in the very far tail.
However, redescending weights will be zero after some threshold c, so the tail component will just be c * sf(c) = c * (1 - cdf(c)), and cdf(c) can be computed from the center of y, or something similar if we trim/downweight at both tails.

(I need to look at Ronchetti and similar references for robust GLM again in more detail.)
