
perfect fit (in rlm) #55

Open
wesm opened this issue Aug 25, 2011 · 29 comments

Comments

@wesm (Member) commented Aug 25, 2011:

Original Launchpad bug 790770: https://bugs.launchpad.net/statsmodels/+bug/790770
Reported by: josef-pktd (joep).

Look at what happens in the models and model results when we have a perfect fit, and decide what to do in these cases.
The example is for RLM, but there might be similar problems in other cases.

Mailing list, 2011-05-31:
For RLM, having a zero residual doesn't cause an exception anymore, but a perfect fit can produce nans in some results (scale=0 and 0/0 division).

import numpy as np
import scikits.statsmodels.api as sm
x = np.array([0, 0.96, 2.18])
levels = np.array([2.8, 2.8, 2.8])
res1 = sm.RLM(endog=levels, exog=sm.add_constant(x), M=sm.robust.norms.HuberT()).fit()
print res1.params
print res1.sresid
print res1.scale

>>> print res1.params
[  2.77555756e-17   2.80000000e+00]
>>> print res1.sresid
[ nan  nan  nan]
>>> print res1.scale
0.0
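The nan sresid values are exactly the 0/0 named in the description: sresid is the residual divided by the estimated scale, and both are (numerically) zero here. A minimal illustration of just that division:

import numpy as np

resid = np.zeros(3)   # perfect fit: all residuals (numerically) zero
scale = 0.0
with np.errstate(invalid='ignore'):
    print(resid / scale)   # [nan nan nan]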
@wesm (Member, Author) commented Aug 25, 2011:

[ LP comment 1 by: joep, on 2011-05-31 16:02:55.408947+00:00 ]

@josef-pkt (Member) changed the description: added markup

@josef-pkt (Member) commented:

Additional reference for a related case: https://mailman.stat.ethz.ch/pipermail/r-sig-robust/2011/000317.html

A case where more than 50% of the observations have an identical value. In the R case, this results in a zero variance estimate, because there is no variation in the 50% trimmed sample.

I haven't checked what happens in statsmodels in this case.

But see also https://mailman.stat.ethz.ch/pipermail/r-sig-robust/2009/000284.html
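For reference, a quick sketch of the majority-identical case with statsmodels' mad (the values here are made up): once more than half of the observations share one value, the median absolute deviation is exactly zero.

import numpy as np
import statsmodels.api as sm

x = np.array([1., 1., 1., 2., 5.])   # 3 of 5 observations identical
print(sm.robust.mad(x))              # 0.0: the median is 1, so a majority of deviations are 0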

@vincentarelbundock (Contributor) commented:

Behavior in SM '0.5.0.dev-90729a3':

In [45]: x = np.array([0, 0.96, 2.18])
In [46]: levels = np.array([2.8, 2.8, 2.8])
In [47]: res1 = sm.RLM(endog=levels,exog=sm.add_constant(x),M=sm.robust.norms.HuberT()).fit()
In [48]: print res1.params
[  2.80000000e+00  -1.24900090e-16]

In [49]: print res1.sresid
[-0.67448975 -0.67448975  0.        ]

In [50]: print res1.scale
6.58407647738e-16

@jseabold (Member) commented:

I think this was fixed when looking at the trim/divide-by-zero corner cases. Can we close this? I.e., is a non-nan return sufficient for this case?

@kevcampb commented:

This still seems to be a problem on the latest git revision. The original example now works correctly, but when using TukeyBiweight the problem reappears.

import numpy as np
import statsmodels.api as sm

dataset = [27.01, 27.01, 28.5, 27.01, 27.04]
yy = np.array(dataset)
rlm = sm.RLM(yy, np.ones(len(yy)), M=sm.robust.norms.TukeyBiweight())
res = rlm.fit(scale_est=sm.robust.scale.HuberScale())

This then leads to the following error:

Traceback (most recent call last):
  File "test.py", line 29, in <module>
    res = rlm.fit(scale_est=sm.robust.scale.HuberScale())
  File "statsmodels/robust/robust_linear_model.py", line 286, in fit
    weights=self.weights).fit()
  File "statsmodels/regression/linear_model.py", line 196, in fit
    self.rank = rank(np.diag(singular_values))
  File "statsmodels/tools/tools.py", line 404, in rank
    D = svdvals(X)
  File "site-packages/scipy/linalg/decomp_svd.py", line 146, in svdvals
    check_finite=check_finite)
  File "site-packages/scipy/linalg/decomp_svd.py", line 89, in svd
    a1 = asarray_chkfinite(a)
  File "site-packages/numpy/lib/function_base.py", line 590, in asarray_chkfinite
    "array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs

I'm assuming this is the same issue. As a workaround, I'm pre-filtering my input to add a very small value to any identical values; this seems to work well.
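For anyone else hitting this, a sketch of the workaround described above (break_ties is a hypothetical helper, not part of statsmodels):

import numpy as np

def break_ties(y, eps=1e-9):
    # add a tiny increasing offset to repeated values so no two entries are identical
    y = np.asarray(y, dtype=float).copy()
    for val in np.unique(y):
        idx = np.flatnonzero(y == val)
        y[idx] += eps * np.arange(len(idx))
    return y

yy = break_ties([27.01, 27.01, 28.5, 27.01, 27.04])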

@josef-pkt (Member) commented:

I'm not sure it's the same problem.

Thanks for the report, and especially for providing a quickly runnable example.

Using pdb on the example, the weights are nan:

> e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\regression\linear_model.py(196)fit()
-> self.rank = rank(np.diag(singular_values))
(Pdb) singular_values
array([ nan])
(Pdb) locals().keys()
['self', 'singular_values', 'method', 'kwargs']
(Pdb) a
self = <statsmodels.regression.linear_model.WLS object at 0x0553AB70>
method = pinv
kwargs = {}
(Pdb) self.exog.shape
(5, 1)
(Pdb) self.exog
array([[ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.]])
(Pdb) self.endog
array([ 27.01,  27.01,  28.5 ,  27.01,  27.04])
(Pdb) self.weights
array([ nan,  nan,  nan,  nan,  nan])

Scale is nan, but the residuals are not zero:

(Pdb) u
> e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\robust\robust_linear_model.py(286)fit()
-> weights=self.weights).fit()
(Pdb) locals().keys()
['scale_est', 'conv', 'cov', 'update_scale', 'iteration', 'self', 'wls_results', 'init', 'criterion', 'tol', 'maxiter', 'converged', 'history']
(Pdb) self
<statsmodels.robust.robust_linear_model.RLM object at 0x04FECF30>
(Pdb) self.scale
nan

(Pdb) wls_results.resid
array([-0.304, -0.304,  1.186, -0.304, -0.274])

Something wrong in HuberScale?

(Pdb) self._estimate_scale(wls_results.resid)
nan
(Pdb) scale_est
<statsmodels.robust.scale.HuberScale object at 0x04FECF70>

This looks like the reduced test case:

(Pdb) self.scale_est
<statsmodels.robust.scale.HuberScale object at 0x04FECF70>
(Pdb) self.scale_est(self.df_resid, self.nobs, wls_results.resid)
nan
(Pdb) (self.df_resid, self.nobs, wls_results.resid)
(4.0, 5.0, array([-0.304, -0.304,  1.186, -0.304, -0.274]))
(Pdb) self.scale_est(4.0, 5.0, np.array([-0.304, -0.304,  1.186, -0.304, -0.274]))
nan

HuberScale.__call__ is not easy to read, so it's hard to tell what's going on.

@josef-pkt (Member) commented:

In HuberScale in the example: not all residuals are the same, but mad is zero.

(Pdb) resid
array([-0.304, -0.304,  1.186, -0.304, -0.274])
(Pdb) mad(resid)
0.0
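Spelled out: mad centers at the median, and with three of the five residuals equal, the median of the absolute deviations collapses to zero (the divisor is the usual normal consistency constant).

import numpy as np

resid = np.array([-0.304, -0.304, 1.186, -0.304, -0.274])
center = np.median(resid)        # -0.304, the majority value
dev = np.abs(resid - center)     # [0, 0, 1.49, 0, 0.03]
print(np.median(dev) / 0.6745)   # 0.0; mad divides by Phi^{-1}(3/4) ~ 0.6745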

@josef-pkt (Member) commented:

Putting a lower bound on s in HuberScale, s = np.maximum(mad(resid), 1e-14), the fit now doesn't stop, but the results have nan standard errors and the estimated parameter is completely wrong:

                                        Robust linear Model Regression Results                                        
======================================================================================================================
Dep. Variable:                                                              y   No. Observations:                    5
Model:                                                                    RLM   Df Residuals:                        4
Method:                                                                  IRLS   Df Model:                            0
Norm:                                                           TukeyBiweight                                         
Scale Est.:        <statsmodels.robust.scale.HuberScale object at 0x05530630>                                         
Cov Type:                                                                  H1                                         
Date:                                                        Thu, 23 Jan 2014                                         
Time:                                                                22:20:59                                         
No. Iterations:                                                             2                                         
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const               0        nan        nan        nan           nan       nan
==============================================================================

If the model instance has been used for another fit with different fit
parameters, then the fit options might not be the correct ones anymore.

The results using scale_est='mad' look reasonable, but it doesn't discount the outlier completely:

                    Robust linear Model Regression Results                    
==============================================================================
Dep. Variable:                      y   No. Observations:                    5
Model:                            RLM   Df Residuals:                        4
Method:                          IRLS   Df Model:                            0
Norm:                   TukeyBiweight                                         
Scale Est.:                       mad                                         
Cov Type:                          H1                                         
Date:                Thu, 23 Jan 2014                                         
Time:                        22:20:59                                         
No. Iterations:                     4                                         
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         27.0157      0.010   2677.624      0.000        26.996    27.035
==============================================================================

If the model instance has been used for another fit with different fit
parameters, then the fit options might not be the correct ones anymore.

@josef-pkt (Member) commented:

I think there is a conceptual problem with the combination M=sm.robust.norms.TukeyBiweight(), scale_est=sm.robust.scale.HuberScale().

bounding the scale away from zero by a bit doesn't help.

The problem, I think, is that the residuals from the starting estimates are huge compared to the scale that HuberScale estimates, and very large residuals get zero weight in TukeyBiweight(). If all residuals are large, then all weights are zero. If I force a large initial scale, then it doesn't converge to discounting the outlier very much, compared to the default settings of RLM.

My impression is that TukeyBiweight() and, I guess, other redescending weight functions need good starting values: either an MM-estimation (LTS starting values), or maybe better starting weights.

Another question in the case of TukeyBiweight is whether we get only local convergence of the optimization given the starting values, or whether different starting values would converge to different results.
Also, I don't know whether the slight downweighting of the "outlier" isn't just what TukeyBiweight and HuberScale together are supposed to do. My guess is that it's an expected result, and not specific to our implementation.

I guess this only plays a role if the scale estimate from HuberScale is very small while the residuals are large, as in a bimodal distribution.
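A minimal demo of the redescending-weights point (assuming the default tuning constant c = 4.685): any standardized residual beyond c gets weight exactly zero, so a tiny scale estimate wipes out all weights at once.

import numpy as np
import statsmodels.api as sm

norm = sm.robust.norms.TukeyBiweight()   # default c = 4.685
z = np.array([0.5, 2.0, 5.0, 50.0])      # residual / scale
print(norm.weights(z))                   # ~[0.98, 0.67, 0.0, 0.0]: beyond c the weight is exactly 0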

@josef-pkt (Member) commented:

To the last point: the small estimated scale in HuberScale is most likely a refactoring bug from the redefinition of mad in PR #1107 (which is not yet in a released version).
(I currently have several other changes in the source and cannot be sure.)

If it is, then there is obviously a gap in the test coverage; otherwise we would have caught a refactoring bug like this. And there is no warning in mad that the (default) definition has changed.

@josef-pkt (Member) commented:

Still not clear:
HuberScale has full line test coverage: https://coveralls.io/files/120873710
The resid or weighted resid should have mean zero; the problem might only arise if the median differs by a large amount from the mean, as in the example here:

>>> sm.robust.mad(np.array([-0.304, -0.304,  1.186, -0.304, -0.274]))  # uses center=median as default, now
0.0
>>> sm.robust.mad(np.array([-0.304, -0.304,  1.186, -0.304, -0.274]), center=0)
0.45071107442570296
>>> sm.robust.mad(np.array([-0.304, -0.304,  1.186, -0.304, -0.274]), center=np.mean)
0.45071107442570296

I guess that's it.

@josef-pkt (Member) commented:

About test coverage:
test_scale doesn't test HuberScale, but it's used in one of the RLM tests. So it looks like it didn't affect convergence, unless there is a problem with the starting scale as in the current example.

@josef-pkt (Member) commented:

Related: calling HuberScale with integer df_resid and nobs runs into integer division and returns nans.
(I'm raising a RuntimeError right now in my code when the scale is nan.)

>>> sm.robust.HuberScale()(4, 5, np.array([-0.304, -0.304,  1.186, -0.304, -0.274]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\robust\scale.py", line 237, in __call__
    raise RuntimeError
RuntimeError
>>> sm.robust.HuberScale()(4., 5., np.array([-0.304, -0.304,  1.186, -0.304, -0.274]))
0.67068924200019675
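The truncation itself is just Python 2 integer division somewhere inside HuberScale.__call__ (exactly where is not shown here); the pitfall in isolation:

# under Python 2, int/int truncates; under Python 3 it doesn't
df_resid, nobs = 4, 5
print(df_resid / nobs)          # 0 on Python 2, 0.8 on Python 3
print(float(df_resid) / nobs)   # 0.8 on both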

Is there an R function to test HuberScale against directly, instead of through RLM?

josef-pkt added a commit to josef-pkt/statsmodels that referenced this issue Jan 26, 2014
@josef-pkt (Member) commented:

to the last point: the small estimated scale in HuberScale is most likely a refactoring bug

No, it's not: the original code used stand_mad. (Found while trying to write regression tests against an older version.)
The 0.5.0 version also produces nans in the example.

I'm still convinced that stand_mad (or mad with center=median) doesn't make much sense, at least not in its usage in RLM, which uses a different estimate of the center.
What do we do? And does it even matter for other cases?
If the iterations converge to a unique estimate of the scale once we get started, then switching to center=0 will not hurt (except for maybe small changes in the time it takes to converge).

I can't find a description of the algorithm in Huber/Ronchetti 2009.
@jseabold Do you remember where HuberScale comes from?

It looks like the implied (normal) scale from the winsorized residuals to me.

@jseabold (Member) commented:

I assume it's from Huber's old textbook, like the rest of the code. I don't recall the details well right now, though.

@josef-pkt (Member) commented:

I also checked/searched in Huber 1981, but didn't see anything directly about the algorithm. Maybe you filled in the details from somewhere else, or my search wasn't thorough enough.
There are general descriptions of the scale estimators, but nothing that I can see that would directly translate to the implementation.

I'm trying to reverse engineer it from the winsorized variance, but something still looks strange.

@jseabold (Member) commented:

I don't think I wrote this. Maybe refactored it.

@josef-pkt (Member) commented:

Ok, maybe still an inherited dark corner, not much touched since. My "blame" search stopped when the function was moved into the current module.

An overhaul and checking was required since #550, which was about Huber instead of HuberScale. My earlier impression was that they were just not intended for use outside of RLM; I'll look into it.

@jseabold (Member) commented:

Without looking, I have a vague recollection that this is just a standalone, univariate, robust estimator of scale. (One of them is, at least.)

@josef-pkt (Member) commented:

For example, with 0.5.0, to make sure it's not one of my current changes:

>>> import numpy as np
>>> x = np.sqrt(2) * np.random.randn(500)
>>> import statsmodels.robust.scale as rscale
>>> rscale.Huber()(x)
(array(-0.03878802012269981), array(1.344585028724749))
>>> rscale.HuberScale()(1., len(x), x)
30.253540609632406

And the bugfix candidate: HuberScale divides by sqrt(nobs) instead of nobs.

>>> rscale.HuberScale()(1., len(x), x) / np.sqrt(len(x))
1.3529794672637696

Update/edit:

My mistake: it's df_resid, not df_model or ddof. (Given that RLM works, this would have been too big a bug.)

>>> rscale.HuberScale()(len(x) - 1., len(x), x)
1.3355792631491707

For the (almost) perfect prediction case in this issue: we still need some specific unit tests.

@josef-pkt (Member) commented:

Without looking I have a vague recollection that this is just a standalone, univariate, robust estimator of scale.

They should be, but as we discussed for stand_mad versus mad, they are only used for residuals in RLM.
Huber is more standalone, since it can estimate loc and scale at the same time, while for HuberScale the loc is supposed to be fixed and handled by the regression part in RLM, according to my interpretation.

@josef-pkt (Member) commented:

Update on references:
Huber's Proposal 2 and similar code was written by Skipper.
Commit messages mention SAS:
http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_rreg_sect018.htm
(SCALE=HUBER<(D=d)>) defines h as in HuberScale, and references Huber 1981, p. 179.

@josef-pkt (Member) commented:

Final comment:

HuberScale is fine, except for the starting value in the case when mad=0. I get exactly the same results by calculating the implied scale from the winsorized sum of squares using fsolve. I think the solution is unique, and the mad center change will not cause any problems.

(Open: I didn't try to figure out whether the implied scale calculation can be extended to other loc-scale distributions.
Eventually, we need to tie together the robust statistics: the PR for skew/kurtosis, robust ANOVA for mean and for variance, based on trimming and winsorizing, which might just be special cases of M-estimators.)
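To make the "implied scale from the winsorized sum of squares" reading concrete, here is a reverse-engineered sketch (my own derivation of h, not statsmodels' code, and brentq instead of fsolve); it reproduces the 0.6707 value from the transcript above.

import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def implied_scale(resid, df_resid, nobs, d=2.5):
    # h = E[min(Z^2, d^2)] for Z ~ N(0, 1): the expected winsorized second moment
    h = 2 * norm.cdf(d) - 1 - 2 * d * norm.pdf(d) + 2 * d**2 * norm.sf(d)
    # solve mean(min(r^2, (d*s)^2)) = (df_resid / nobs) * h * s^2 for s
    f = lambda s: np.mean(np.minimum(resid**2, (d * s)**2)) - (df_resid / nobs) * h * s**2
    return brentq(f, 1e-8, 10 * np.abs(resid).max())

r = np.array([-0.304, -0.304, 1.186, -0.304, -0.274])
print(implied_scale(r, 4., 5.))   # ~0.67069, matching HuberScale()(4., 5., r) above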

@josef-pkt (Member) commented:

One more:

mad with center=0 can also be 0 in almost-perfect-prediction cases:

>>> x = np.array([0,0,0,0,0,0, -1, 1])
>>> x.mean()
0.0
>>> np.median(x)
0.0
>>> rscale.mad(x)
0.0
>>> rscale.mad(x, center=0)
0.0
>>> moment_winsorized(x, 2.5, ddof=0)   # var of winsorized data
0.25
>>> x.var()
0.25
>>> huber_scale(x, 2.5, ddof=0)    # that's my function, not statsmodels'
(array([ 0.]), <function diff_scale_wins at 0x05042EF0>)
C:\Programs\Python27\lib\site-packages\scipy\optimize\minpack.py:227: RuntimeWarning: The iteration is not making good progress, as measured by the 
  improvement from the last ten iterations.
  warnings.warn(msg, RuntimeWarning)

We still need another fallback if mad is zero.

@josef-pkt (Member) commented:

I think the solution is unique

Unfortunately, that's not always true.
In the corner case x = np.array([0,0,0,0,0,0, -1, 1]), the winsorized variance is constant once all non-zero elements are winsorized; i.e., in this case any variance small enough results in perfect prediction.

Back to the original problem: with a redescending norm, where extreme outliers get zero weight, parts of the results might be indeterminate in the perfect prediction case (unreliable cov_params, scale, ...?).
Since we are in a linear model (not binary as in Logit), the parameter estimate should still be unique.

Add a warning if we detect perfect prediction? Or does the indeterminacy happen only in certain norm/scale combinations?

@josef-pkt (Member) commented:

Ok, back to the original topic, zero variance and perfect prediction:

The problem in the actual implementation is that the weights are all zero (or nan?). Under the interpretation that the largest weight should be normalized to 1, we could set the weights to 1 or 0 depending on whether the residual is zero or not, which is essentially a trimmed estimator.
I think WLS will then correctly return nan for the scale and cov_params. We always get the parameter estimates, but will get nans in the inferential statistics in the perfect-prediction-of-inliers case.

Edit/correction: WLS in the perfect prediction case should have scale=0 up to numerical precision, and bse equal to zero.
That's the result I get in RLM if I do set the weights to 1 or 0.
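A sketch of that 0/1 trimming idea (my own illustration, not the RLM code; the data are made up):

import numpy as np
import statsmodels.api as sm

y = np.array([2.8, 2.8, 2.8, 5.0])         # three inliers fit perfectly, one outlier
X = np.ones((len(y), 1))
resid = y - 2.8                             # stand-in for the current RLM residuals
w = (np.abs(resid) < 1e-12).astype(float)   # weight 1 for zero residuals, 0 otherwise
res = sm.WLS(y, X, weights=w).fit()
print(res.params, res.scale, res.bse)       # params [2.8], scale ~ 0, bse ~ 0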

@josef-pkt (Member) commented:

Another thought from playing with different ways to force weights and scale to be non-zero:

I suspect, but I'm not sure, that we could get stuck in a solution where a few observations have zero residuals, scale=0, and all other values are ignored (with redescending norms that put zero weight on outliers).
That's only a possible problem if we have zero residuals (not a continuous sample) and a starting value that starts at this point. The latter might be possible if we start with something like an LTS.

A tiny bit of noise always results in a locally unique solution, I guess. So all of this only matters for edge cases where we don't have a sample from a continuous distribution (as with binned or rounded measurements).

josef-pkt added a commit to josef-pkt/statsmodels that referenced this issue Jan 27, 2014
bashtage added a commit to bashtage/statsmodels that referenced this issue Jul 17, 2019
Ensure RLM doesn't crash if perfect fit
Clean norms so that they don't crash either
Avoid calculations that lead to nan
Improve testing of RLM
Replace fabs with abs

closes statsmodels#5585
closes statsmodels#1341
closes statsmodels#5878
closes statsmodels#5356
closes statsmodels#3319
xref statsmodels#55