perfect fit (in rlm) #55
Comments
[ LP comment 1 by: joep, on 2011-05-31 16:02:55.408947+00:00 ] |
launchpad ticket contains attachment |
changed description: added markup |
additional reference for a related case: https://mailman.stat.ethz.ch/pipermail/r-sig-robust/2011/000317.html A case where more than 50% of the observations have an identical value. In the R case, this results in a zero variance estimate, because there is no variation in the 50% trimmed sample. I haven't checked what happens in statsmodels in this case, but see also https://mailman.stat.ethz.ch/pipermail/r-sig-robust/2009/000284.html |
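The degenerate case is easy to reproduce with plain numpy (a minimal sketch with made-up values, not taken from the thread):

```python
import numpy as np

# More than half of the observations share one value, so the median sits
# exactly on that value and the median absolute deviation (MAD) collapses
# to zero -- any MAD-based scale estimate then starts from scale = 0.
x = np.array([27.01] * 7 + [28.5, 29.0, 30.2])

med = np.median(x)                 # 27.01
mad = np.median(np.abs(x - med))   # 0.0
```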
Behavior in SM '0.5.0.dev-90729a3':
|
I think this was fixed when looking at the trim/divide by zero corner cases. Can we close this? I.e., is having a non-nan return sufficient for this case? |
Seems this is still a problem on the latest git revision. The original example now works correctly, but when using TukeyBiweight the problem reappears.

```python
import numpy as np
import statsmodels.api as sm

dataset = [27.01, 27.01, 28.5, 27.01, 27.04]
yy = np.array(dataset)
rlm = sm.RLM(yy, np.ones(len(yy)), M=sm.robust.norms.TukeyBiweight())
res = rlm.fit(scale_est=sm.robust.scale.HuberScale())
```

This will then lead to the following error:

```
Traceback (most recent call last):
  File "test.py", line 29, in <module>
    res = rlm.fit(scale_est=sm.robust.scale.HuberScale())
  File "statsmodels/robust/robust_linear_model.py", line 286, in fit
    weights=self.weights).fit()
  File "statsmodels/regression/linear_model.py", line 196, in fit
    self.rank = rank(np.diag(singular_values))
  File "statsmodels/tools/tools.py", line 404, in rank
    D = svdvals(X)
  File "site-packages/scipy/linalg/decomp_svd.py", line 146, in svdvals
    check_finite=check_finite)
  File "site-packages/scipy/linalg/decomp_svd.py", line 89, in svd
    a1 = asarray_chkfinite(a)
  File "site-packages/numpy/lib/function_base.py", line 590, in asarray_chkfinite
    "array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs
```

I'm assuming this is the same issue. As a workaround, I'm prefiltering my input to add a very small value to any identical values. This seems to work well. |
I'm not sure it's the same problem. Thanks for the report, and especially for providing a quickly runnable example. Using pdb on the example: weights is nan, scale is nan, but resid are not zero.
Something wrong in HuberScale?
this looks like the reduced test case :
|
In HuberScale in the example not all resid are the same, but
|
Putting a lower bound on s in HuberScale, the fit no longer aborts, but the results have nan standard errors, and the estimated parameter is completely wrong.
The results using
|
I think there is a conceptual problem with this combination; bounding the scale away from zero by a bit doesn't help. The problem, I think, is that the residuals from the starting estimates are huge compared to the scale. My impression is that TukeyBiweight() and, I guess, other redescending weight functions need good starting values, either an MM-estimation (LTS starting values), or maybe better starting weights. Another question in the case of TukeyBiweight is whether we get only local convergence of the optimization, given the starting values, or whether different starting values would converge to different results. I guess this only plays a role if the scale estimate by HuberScale is very small and the residuals are large in a bimodal distribution. |
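The failure mode with a redescending norm can be seen from the Tukey bisquare weight function itself (a sketch of the standard textbook definition, not the statsmodels implementation):

```python
import numpy as np

def tukey_biweight_weights(u, c=4.685):
    """Bisquare weights: w(u) = (1 - (u/c)**2)**2 for |u| <= c, else 0."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= c, (1.0 - (u / c) ** 2) ** 2, 0.0)

# With a tiny scale estimate, every nonzero residual standardizes to a
# huge value and gets weight exactly 0 -- the next IRLS step then only
# sees the exactly-fitted points and can degenerate.
resid = np.array([0.0, 0.0, 1.49, 0.03])
scale = 1e-8
w = tukey_biweight_weights(resid / scale)   # [1., 1., 0., 0.]
```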
to the last point: the small estimated scale in HuberScale is most likely a refactoring bug from the redefinition of If it is, then there is obviously a gap in the test coverage, otherwise we should have caught a refactoring bug like this. And there is no warning in |
Still not clear
I guess that's it. |
about test coverage: |
related: calling HuberScale with integer df_resid and nobs runs into integer division and returns nans
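The integer-division failure mode can be shown without statsmodels (a sketch of the Python 2 pitfall; the names df_resid/nobs just mirror the report):

```python
# Under Python 2, `/` on two ints floored the result; `//` reproduces that
# behavior in Python 3. A correction factor built from two small ints then
# silently becomes 0, and a later division or log turns it into inf/nan.
df_resid, nobs = 3, 5

old_style = df_resid // nobs        # 0   (what Python 2's `/` gave)
true_div = df_resid / nobs          # 0.6 (Python 3 true division)
fixed = float(df_resid) / nobs      # explicit cast: safe in either version
```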
Is there an R function to test HuberScale against, directly instead of through RLM? |
No it's not, the original code used I'm still convinced I don't find a description of the algorithm in Huber/Ronchetti 2009. It looks like the implied (normal) scale from the winsorized residuals to me. |
I assume it's from Huber's old textbook like the rest of the code. I don't recall well the details right now though. |
I also checked/searched in Huber 1981, but didn't see anything directly for the algorithm. Maybe you filled in the details from somewhere else, or my search wasn't thorough enough. I'm trying to reverse engineer it from winsorized variance, but something still looks strange. |
I don't think I wrote this. Maybe refactored it. |
Ok, maybe still an inherited dark corner, and not much touched since. My "blame" search stopped when the function was moved into the current module. An overhaul and checking was required since #550, which was about Huber instead of HuberScale. My earlier impression was that they were just not intended for use outside of RLM; I'll look into it. |
Without looking I have a vague recollection that this is just a standalone, univariate, robust estimator of scale. (One of them is at least). |
for example
and the bugfix candidate: HuberScale divides by sqrt(nobs) instead of nobs
update/edit: my mistake, it's df_resid, not df_model or ddof (given that RLM works, this would have been too big a bug)
For the (almost) perfect prediction case in this issue: we still need some specific unit tests. |
They should be, but as we had in the discussion on stand_mad versus mad, they are only used for residuals in RLM. |
update on references: |
final comment: HuberScale is fine except for the starting value in the case when mad=0. I get exactly the same results by calculating the implied scale from the winsorized sum of squares using fsolve. I think the solution is unique, and the mad center change will not cause any problems. (Open: I didn't try to figure out whether the implied scale calculation can be extended to other loc-scale distributions.) |
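The "implied scale from the winsorized sum of squares" reading can be checked with a small root-finding sketch (my reconstruction, not the statsmodels code; plain bisection stands in for fsolve):

```python
import math
import numpy as np

def implied_huber_scale(resid, c=1.345, n_iter=100):
    """Solve mean(min(|r|/s, c)**2) == E[min(|Z|, c)**2], Z ~ N(0,1), for s.

    This equates the winsorized sum of squares of the standardized
    residuals with its expectation under normality (Fisher consistency).
    """
    r = np.abs(np.asarray(resid, dtype=float))
    if not r.any():               # all residuals zero: perfect fit
        return 0.0
    pdf_c = math.exp(-0.5 * c * c) / math.sqrt(2.0 * math.pi)
    cdf_c = 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))
    # E[min(|Z|, c)^2] = 2*(Phi(c) - 1/2 - c*phi(c)) + 2*c^2*(1 - Phi(c))
    h = 2.0 * (cdf_c - 0.5 - c * pdf_c) + 2.0 * c * c * (1.0 - cdf_c)

    def g(s):                     # decreasing in s; root is the scale estimate
        return np.mean(np.minimum(r / s, c) ** 2) - h

    lo, hi = 1e-12, r.max() / c + 1.0
    while g(hi) > 0:              # widen until the root is bracketed
        hi *= 2.0
    for _ in range(n_iter):       # plain bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)
```

On a large normal sample this returns roughly the standard deviation, and it still returns a positive scale when mad=0 as long as enough residuals are nonzero.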
one more: mad with center=0 can also be 0 in almost-perfect prediction cases;
we still need another backup if mad is zero |
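A hedged sketch of what such a backup could look like (the fallback choice here is illustrative, not a proposal from the code):

```python
import numpy as np

def robust_start_scale(resid):
    """MAD around center=0, falling back to a scaled mean absolute
    deviation when the MAD degenerates (hypothetical backup rule)."""
    r = np.abs(np.asarray(resid, dtype=float))
    s = 1.4826 * np.median(r)      # MAD with center=0, normal-consistent
    if s == 0.0 and r.any():
        # more than half the residuals are exactly zero: back up to the
        # mean absolute deviation, scaled for consistency at the normal
        s = math_scale = np.sqrt(np.pi / 2.0) * r.mean()
    return s

resid = np.array([0.0, 0.0, 0.0, 1.49])    # almost perfect prediction
mad0 = 1.4826 * np.median(np.abs(resid))   # 0.0: the MAD degenerates
s = robust_start_scale(resid)              # positive backup value
```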
Unfortunately not always true. Back to the original problem: with a redescending norm, extreme outliers get zero weight, so parts of the results might be indeterminate in the perfect prediction case. Unreliable cov_params, scale, ...? Add a warning if we detect perfect prediction? Or does the indeterminacy happen only in certain norm/scale combinations? |
Ok, to the original topic, variance is zero and perfect prediction: the problem in the actual implementation is that the weights are all zero (or nan?). Under the interpretation that the largest weight should be normalized to 1, we could set the weights to 1 or 0 depending on whether the resid are zero or not, which is essentially a trimmed estimator. |
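That interpretation can be written out directly (a sketch of the 1/0-weight reading, using the data from this issue's example; not the implemented behavior):

```python
import numpy as np

y = np.array([27.01, 27.01, 28.5, 27.01, 27.04])
fit = 27.01                      # location that fits a majority exactly

resid = y - fit
w = (resid == 0).astype(float)   # weight 1 iff the residual is zero
est = np.average(y, weights=w)   # the trimmed (majority) estimate
```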
Another thought when I was playing with different ways to force weights and scale to be non-zero: I suspect, but I'm not sure, that we could get stuck in a solution where we have a few zero residual observations, scale=0, and all other values are ignored. (in redescending norms that put zero weight on outliers) A tiny noise always results in a locally unique solution, I guess. So everything is just for edge cases where we don't have a sample from a continuous distribution (as in binned or rounded measurements). |
Ensure RLM doesn't crash on a perfect fit. Clean norms so that they don't crash either. Avoid calculations that lead to nan. Improve testing of RLM. Replace fabs with abs.
closes statsmodels#5585 closes statsmodels#1341 closes statsmodels#5878 closes statsmodels#5356 closes statsmodels#3319 xref statsmodels#55
Original Launchpad bug 790770: https://bugs.launchpad.net/statsmodels/+bug/790770
Reported by: josef-pktd (joep).
look at what happens in the models / model results when we have a perfect fit, and decide what to do in these cases
example for rlm, but there might be similar problems in other cases
mailinglist 2011-05-31
for rlm, having a zero residual doesn't cause an exception anymore, but having a perfect fit can produce nans in some results (scale=0 and 0/0 division)