
ENH: right truncated count models for outlier trimming #9146

Open
josef-pkt opened this issue Feb 9, 2024 · 1 comment

Comments

@josef-pkt
Member

Related to robust regression, e.g. #3315, #8690 (comment).

The problem is imposing Fisher consistency.

One possibility would be a full MLE model that is right truncated at the trimming points. Then we should get consistent estimates for the underlying distribution parameter(s).

For example, trim at isf(alpha; theta), where alpha is e.g. 0.025, 0.01 or similar.
The weight function is then the indicator function for {y : y < isf(alpha; theta)}.
The truncation point will differ across observations.
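A minimal sketch of the trimming rule, assuming a Poisson model with per-observation predicted means mu; the helper names and the default alpha are illustrative, not existing statsmodels API:

```python
import numpy as np
from scipy import stats


def trim_thresholds(mu, alpha=0.025):
    """Per-observation right-trimming points isf(alpha; mu_i) for a Poisson model.

    mu : array of predicted means, one per observation
    alpha : tail probability used for trimming, e.g. 0.025 or 0.01
    """
    return stats.poisson.isf(alpha, np.asarray(mu))


def indicator_weights(y, mu, alpha=0.025):
    """Indicator weight function: 1 if y_i < isf(alpha; mu_i), else 0."""
    c = trim_thresholds(mu, alpha)
    return (np.asarray(y) < c).astype(float)
```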

  • Need a preliminary estimator for parameter theta (mu in the Poisson case), e.g. a robust initial estimate.
  • Iterate the weighted/truncated MLE to update theta.
  • Drop outliers and re-estimate the truncated MLE (see the sketch after this list).
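A rough sketch of the iteration, using a hand-written right-truncated Poisson likelihood via GenericLikelihoodModel as a stand-in for the proposed model. The class and function names are illustrative, y and x are assumed to be NumPy arrays with x already containing a constant, and a plain Poisson fit stands in for a robust initial estimator:

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.base.model import GenericLikelihoodModel


class RightTruncatedPoisson(GenericLikelihoodModel):
    """Poisson MLE with observation-specific right truncation.

    trunc[i] is the largest count that is kept for observation i.
    Illustrative only; not an existing statsmodels model.
    """

    def __init__(self, endog, exog, trunc, **kwargs):
        super().__init__(endog, exog, **kwargs)
        self.trunc = np.asarray(trunc)

    def nloglikeobs(self, params):
        mu = np.exp(self.exog @ params)
        # conditional log-likelihood given Y <= trunc[i]
        ll = (stats.poisson.logpmf(self.endog, mu)
              - stats.poisson.logcdf(self.trunc, mu))
        return -ll


def trimmed_poisson(y, x, alpha=0.01, maxiter=5):
    # preliminary estimate of theta (plain Poisson as a stand-in
    # for a robust initial estimator)
    params = sm.Poisson(y, x).fit(disp=0).params
    res = None
    for _ in range(maxiter):
        mu = np.exp(x @ params)
        c = stats.poisson.isf(alpha, mu)   # per-observation trimming points
        keep = y < c                       # drop right-tail outliers
        # largest kept count is c - 1, so condition the likelihood on Y <= c - 1
        mod = RightTruncatedPoisson(y[keep], x[keep], trunc=c[keep] - 1)
        res = mod.fit(start_params=params, disp=0)
        params = res.params
    return res
```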

For predict and results we treat it as an untruncated model, i.e. predicted mean = mu. This simplifies post-estimation by a large amount.
(E.g. the truncated mean does not have a closed-form expression, and we would need to compute it by enumeration.)
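For illustration, the truncated mean requires enumeration over the support, while the untruncated predicted mean is just mu; the numbers below are made up:

```python
import numpy as np
from scipy import stats

mu, c = 3.0, 8                     # illustrative mean and right-truncation point
k = np.arange(int(c) + 1)

# truncated mean E[Y | Y <= c]: needs enumeration over the support
mean_trunc = (k * stats.poisson.pmf(k, mu)).sum() / stats.poisson.cdf(c, mu)

# untruncated predicted mean used for predict/results: just mu
mean_untrunc = mu
```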

The first target is small counts only, i.e. we only need right truncation and do not need to look at left outliers.

Possible theoretical problem:
Trimming/truncation is endogenous, i.e. data dependent through the parameters, and the support of the truncated distribution depends on the parameters.
Standard MLE theory would not apply, but the effect should be "negligible" if we only truncate in the far tail.

@josef-pkt
Member Author

Parking a (semi-random) idea; it should be more general for robust GLM, Poisson, logit, ...

The Fisher consistency term needs the expectation of the estimating equation w.r.t. the distribution. The distribution depends on the estimated distribution parameter for each observation, e.g. mu in Poisson.
So I thought it would require slow computation if we have to compute the consistency factor by numerical integration for each observation.

The current idea is to estimate this expectation under the aggregate predictive distribution, e.g. using the pdf values that we already compute in get_prediction with aggregate=True (predictive probs for count data models).

E[psi(y, x)] = ∫∫ psi(y, x) dF(y | theta) dF(x), with estimated predicted distribution parameters theta = theta(x).
(F(theta | x) is a point mass at predict(x), i.e. theta is just one point for each x.)
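A loose sketch of the aggregate idea, assuming Poisson predicted means and a weight/psi function that depends on the count only (the actual psi depends on (y, x)); the helpers are illustrative and build the aggregate pmf directly rather than through any existing get_prediction option:

```python
import numpy as np
from scipy import stats


def aggregate_predictive_pmf(mu, y_max):
    """Sample-averaged predictive pmf for a Poisson model.

    mu : predicted means, one per observation
    y_max : largest count included in the enumeration
    """
    k = np.arange(y_max + 1)
    # n x (y_max + 1) matrix of per-observation predicted probabilities
    probs = stats.poisson.pmf(k[None, :], np.asarray(mu)[:, None])
    return k, probs.mean(0)


def expected_weight(weight, mu, y_max):
    """Approximate E[w(Y)] under the aggregate predictive distribution."""
    k, pmf = aggregate_predictive_pmf(mu, y_max)
    return (weight(k) * pmf).sum()
```

With the indicator weights from the trimming rule above, this would approximate the expected kept fraction under the model; the same enumeration could be used for the expectation needed in the consistency correction, at the aggregate level instead of per observation.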

Detail:
We will need to truncate the aggregate predictive distribution at some y. This will be difficult if the distribution is heavy tailed, i.e. the tail weight cannot be ignored except in the very far tail.
However, redescending weights will be zero after some threshold c, so the tail component will just be c * sf(c) = c * (1 - cdf(c)), and cdf(c) can be computed from the center of y, or something similar if we trim/downweight at both tails.

(I need to look at Ronchetti and similar references for robust GLM again in more detail.)
