
perfect fit (in rlm) #55

Open
wesm opened this issue Aug 25, 2011 · 29 comments

Comments

@wesm (Member) commented Aug 25, 2011:

Original Launchpad bug 790770: https://bugs.launchpad.net/statsmodels/+bug/790770
Reported by: josef-pktd (joep).

Look at what happens in the models and model results when we have a perfect fit, and decide what to do in these cases.
The example is for RLM, but there might be similar problems in other cases.

Mailing list, 2011-05-31:
For RLM, having a zero residual doesn't cause an exception anymore, but a perfect fit can produce nans in some results (scale=0 and 0/0 division).

import numpy as np
import scikits.statsmodels.api as sm
x = np.array([0, 0.96, 2.18])
levels = np.array([2.8, 2.8, 2.8])
res1 = sm.RLM(endog=levels, exog=sm.add_constant(x), M=sm.robust.norms.HuberT()).fit()
print res1.params
print res1.sresid
print res1.scale

>>> print res1.params
[  2.77555756e-17   2.80000000e+00]
>>> print res1.sresid
[ nan  nan  nan]
>>> print res1.scale
0.0
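The nan sresid values are exactly the 0/0 named in the description: sresid is the residual divided by the estimated scale, and both are (numerically) zero here. A minimal illustration of just that division:

import numpy as np

resid = np.zeros(3)   # perfect fit: all residuals (numerically) zero
scale = 0.0
with np.errstate(invalid='ignore'):
    print(resid / scale)   # [nan nan nan]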
@wesm (Member, Author) commented Aug 25, 2011:

[ LP comment 1 by: joep, on 2011-05-31 16:02:55.408947+00:00 ]

@josef-pkt (Member) changed the description: added markup

@josef-pkt (Member) commented:

Additional reference for a related case: https://mailman.stat.ethz.ch/pipermail/r-sig-robust/2011/000317.html

A case where more than 50% of the observations have an identical value. In the R case, this results in a zero variance estimate, because there is no variation in the 50% trimmed sample.

I haven't checked what happens in statsmodels in this case.

But see also https://mailman.stat.ethz.ch/pipermail/r-sig-robust/2009/000284.html
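For reference, a quick sketch of the majority-identical case with statsmodels' mad (the values here are made up): once more than half of the observations share one value, the median absolute deviation is exactly zero.

import numpy as np
import statsmodels.api as sm

x = np.array([1., 1., 1., 2., 5.])   # 3 of 5 observations identical
print(sm.robust.mad(x))              # 0.0: the median is 1, so a majority of deviations are 0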

@vincentarelbundock (Contributor) commented:

Behavior in SM '0.5.0.dev-90729a3':

In [45]: x = np.array([0, 0.96, 2.18])
In [46]: levels = np.array([2.8, 2.8, 2.8])
In [47]: res1 = sm.RLM(endog=levels,exog=sm.add_constant(x),M=sm.robust.norms.HuberT()).fit()
In [48]: print res1.params
[  2.80000000e+00  -1.24900090e-16]

In [49]: print res1.sresid
[-0.67448975 -0.67448975  0.        ]

In [50]: print res1.scale
6.58407647738e-16

@jseabold (Member) commented:

I think this was fixed when looking at the trim/divide-by-zero corner cases. Can we close this? I.e., is a non-nan return sufficient for this case?

@kevcampb commented:

This still seems to be a problem on the latest git revision. The original example now works correctly, but when using TukeyBiweight the problem reappears.

import numpy as np
import statsmodels.api as sm

dataset = [27.01, 27.01, 28.5, 27.01, 27.04]
yy = np.array(dataset)
rlm = sm.RLM(yy, np.ones(len(yy)), M=sm.robust.norms.TukeyBiweight())
res = rlm.fit(scale_est=sm.robust.scale.HuberScale())

This then leads to the following error:

Traceback (most recent call last):
  File "test.py", line 29, in <module>
    res = rlm.fit(scale_est=sm.robust.scale.HuberScale())
  File "statsmodels/robust/robust_linear_model.py", line 286, in fit
    weights=self.weights).fit()
  File "statsmodels/regression/linear_model.py", line 196, in fit
    self.rank = rank(np.diag(singular_values))
  File "statsmodels/tools/tools.py", line 404, in rank
    D = svdvals(X)
  File "site-packages/scipy/linalg/decomp_svd.py", line 146, in svdvals
    check_finite=check_finite)
  File "site-packages/scipy/linalg/decomp_svd.py", line 89, in svd
    a1 = asarray_chkfinite(a)
  File "site-packages/numpy/lib/function_base.py", line 590, in asarray_chkfinite
    "array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs

I'm assuming this is the same issue. As a workaround, I'm pre-filtering my input to add a very small value to any identical values; this seems to work well.
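For anyone else hitting this, a sketch of the workaround described above (break_ties is a hypothetical helper, not part of statsmodels):

import numpy as np

def break_ties(y, eps=1e-9):
    # add a tiny increasing offset to repeated values so no two entries are identical
    y = np.asarray(y, dtype=float).copy()
    for val in np.unique(y):
        idx = np.flatnonzero(y == val)
        y[idx] += eps * np.arange(len(idx))
    return y

yy = break_ties([27.01, 27.01, 28.5, 27.01, 27.04])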

@josef-pkt (Member) commented:

I'm not sure it's the same problem.

Thanks for the report, and especially for providing a quickly runnable example.

Using pdb on the example, the weights are nan:

> e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\regression\linear_model.py(196)fit()
-> self.rank = rank(np.diag(singular_values))
(Pdb) singular_values
array([ nan])
(Pdb) locals().keys()
['self', 'singular_values', 'method', 'kwargs']
(Pdb) a
self = <statsmodels.regression.linear_model.WLS object at 0x0553AB70>
method = pinv
kwargs = {}
(Pdb) self.exog.shape
(5, 1)
(Pdb) self.exog
array([[ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.]])
(Pdb) self.endog
array([ 27.01,  27.01,  28.5 ,  27.01,  27.04])
(Pdb) self.weights
array([ nan,  nan,  nan,  nan,  nan])

Scale is nan, but the residuals are not zero:

(Pdb) u
> e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\robust\robust_linear_model.py(286)fit()
-> weights=self.weights).fit()
(Pdb) locals().keys()
['scale_est', 'conv', 'cov', 'update_scale', 'iteration', 'self', 'wls_results', 'init', 'criterion', 'tol', 'maxiter', 'converged', 'history']
(Pdb) self
<statsmodels.robust.robust_linear_model.RLM object at 0x04FECF30>
(Pdb) self.scale
nan

(Pdb) wls_results.resid
array([-0.304, -0.304,  1.186, -0.304, -0.274])

Something wrong in HuberScale?

(Pdb) self._estimate_scale(wls_results.resid)
nan
(Pdb) scale_est
<statsmodels.robust.scale.HuberScale object at 0x04FECF70>

This looks like the reduced test case:

(Pdb) self.scale_est
<statsmodels.robust.scale.HuberScale object at 0x04FECF70>
(Pdb) self.scale_est(self.df_resid, self.nobs, wls_results.resid)
nan
(Pdb) (self.df_resid, self.nobs, wls_results.resid)
(4.0, 5.0, array([-0.304, -0.304,  1.186, -0.304, -0.274]))
(Pdb) self.scale_est(4.0, 5.0, np.array([-0.304, -0.304,  1.186, -0.304, -0.274]))
nan

HuberScale.__call__ is not easy to read, so it's hard to tell what's going on.

@josef-pkt (Member) commented:

In HuberScale in the example: not all residuals are the same, but mad is zero.

(Pdb) resid
array([-0.304, -0.304,  1.186, -0.304, -0.274])
(Pdb) mad(resid)
0.0
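Spelled out: mad centers at the median, and with three of the five residuals equal, the median of the absolute deviations collapses to zero (the divisor is the usual normal consistency constant).

import numpy as np

resid = np.array([-0.304, -0.304, 1.186, -0.304, -0.274])
center = np.median(resid)        # -0.304, the majority value
dev = np.abs(resid - center)     # [0, 0, 1.49, 0, 0.03]
print(np.median(dev) / 0.6745)   # 0.0; mad divides by Phi^{-1}(3/4) ~ 0.6745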

@josef-pkt (Member) commented:

Putting a lower bound on s in HuberScale, s = np.maximum(mad(resid), 1e-14), the fit now doesn't stop, but the results have nan standard errors and the estimated parameter is completely wrong:

                                        Robust linear Model Regression Results                                        
======================================================================================================================
Dep. Variable:                                                              y   No. Observations:                    5
Model:                                                                    RLM   Df Residuals:                        4
Method:                                                                  IRLS   Df Model:                            0
Norm:                                                           TukeyBiweight                                         
Scale Est.:        <statsmodels.robust.scale.HuberScale object at 0x05530630>                                         
Cov Type:                                                                  H1                                         
Date:                                                        Thu, 23 Jan 2014                                         
Time:                                                                22:20:59                                         
No. Iterations:                                                             2                                         
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const               0        nan        nan        nan           nan       nan
==============================================================================

If the model instance has been used for another fit with different fit
parameters, then the fit options might not be the correct ones anymore.

The results using scale_est='mad' look reasonable, but it doesn't discount the outlier completely:

                    Robust linear Model Regression Results                    
==============================================================================
Dep. Variable:                      y   No. Observations:                    5
Model:                            RLM   Df Residuals:                        4
Method:                          IRLS   Df Model:                            0
Norm:                   TukeyBiweight                                         
Scale Est.:                       mad                                         
Cov Type:                          H1                                         
Date:                Thu, 23 Jan 2014                                         
Time:                        22:20:59                                         
No. Iterations:                     4                                         
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         27.0157      0.010   2677.624      0.000        26.996    27.035
==============================================================================

If the model instance has been used for another fit with different fit
parameters, then the fit options might not be the correct ones anymore.

@josef-pkt (Member) commented:

I think there is a conceptual problem with the combination M=sm.robust.norms.TukeyBiweight(), scale_est=sm.robust.scale.HuberScale().

bounding the scale away from zero by a bit doesn't help.

The problem, I think, is that the residuals from the starting estimates are huge compared to the scale that HuberScale estimates, and very large residuals get zero weight in TukeyBiweight(). If all residuals are large, then all weights are zero. If I force a large initial scale, then it doesn't converge to discounting the outlier very much, compared to the default settings of RLM.

My impression is that TukeyBiweight() and, I guess, other redescending weight functions need good starting values: either an MM-estimation (LTS starting values), or maybe better starting weights.

Another question in the case of TukeyBiweight is whether we get only local convergence of the optimization given the starting values, or whether different starting values would converge to different results.
Also, I don't know whether the slight downweighting of the "outlier" isn't just what TukeyBiweight and HuberScale together are supposed to do. My guess is that it's an expected result, and not specific to our implementation.

I guess this only plays a role if the scale estimate from HuberScale is very small while the residuals are large, as in a bimodal distribution.
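A minimal demo of the redescending-weights point (assuming the default tuning constant c = 4.685): any standardized residual beyond c gets weight exactly zero, so a tiny scale estimate wipes out all weights at once.

import numpy as np
import statsmodels.api as sm

norm = sm.robust.norms.TukeyBiweight()   # default c = 4.685
z = np.array([0.5, 2.0, 5.0, 50.0])      # residual / scale
print(norm.weights(z))                   # ~[0.98, 0.67, 0.0, 0.0]: beyond c the weight is exactly 0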

@josef-pkt (Member) commented:

To the last point: the small estimated scale in HuberScale is most likely a refactoring bug from the redefinition of mad in PR #1107 (which is not yet in a released version).
(I currently have several other changes in the source and cannot be sure.)

If it is, then there is obviously a gap in the test coverage; otherwise we would have caught a refactoring bug like this. And there is no warning in mad that the (default) definition has changed.

@josef-pkt (Member) commented:

Still not clear:
HuberScale has full line test coverage: https://coveralls.io/files/120873710
The resid or weighted resid should have mean zero; the problem might only arise if the median differs by a large amount from the mean, as in the example here:

>>> sm.robust.mad(np.array([-0.304, -0.304,  1.186, -0.304, -0.274]))  # uses center=median as default, now
0.0
>>> sm.robust.mad(np.array([-0.304, -0.304,  1.186, -0.304, -0.274]), center=0)
0.45071107442570296
>>> sm.robust.mad(np.array([-0.304, -0.304,  1.186, -0.304, -0.274]), center=np.mean)
0.45071107442570296

I guess that's it.

@josef-pkt (Member) commented:

About test coverage:
test_scale doesn't test HuberScale, but it's used in one of the RLM tests. So it looks like it didn't affect convergence, unless there is a problem with the starting scale as in the current example.

@josef-pkt (Member) commented:

Related: calling HuberScale with integer df_resid and nobs runs into integer division and returns nans.
(I'm raising a RuntimeError right now in my code when the scale is nan.)

>>> sm.robust.HuberScale()(4, 5, np.array([-0.304, -0.304,  1.186, -0.304, -0.274]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\robust\scale.py", line 237, in __call__
    raise RuntimeError
RuntimeError
>>> sm.robust.HuberScale()(4., 5., np.array([-0.304, -0.304,  1.186, -0.304, -0.274]))
0.67068924200019675
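The truncation itself is just Python 2 integer division somewhere inside HuberScale.__call__ (exactly where is not shown here); the pitfall in isolation:

# under Python 2, int/int truncates; under Python 3 it doesn't
df_resid, nobs = 4, 5
print(df_resid / nobs)          # 0 on Python 2, 0.8 on Python 3
print(float(df_resid) / nobs)   # 0.8 on both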

Is there an R function to test HuberScale against directly, instead of through RLM?

josef-pkt added a commit to josef-pkt/statsmodels that referenced this issue Jan 26, 2014
@josef-pkt (Member) commented:

to the last point: the small estimated scale in HuberScale is most likely a refactoring bug

No, it's not: the original code used stand_mad. (Found while trying to write regression tests against an older version.)
The 0.5.0 version also produces nans in the example.

I'm still convinced that stand_mad (or mad with center=median) doesn't make much sense, at least not in its usage in RLM, which uses a different estimate of the center.
What do we do? And does it even matter for other cases?
If the iterations converge to a unique estimate of the scale once we get started, then switching to center=0 will not hurt (except for maybe small changes in the time it takes to converge).

I can't find a description of the algorithm in Huber/Ronchetti 2009.
@jseabold Do you remember where HuberScale comes from?

It looks like the implied (normal) scale from the winsorized residuals to me.

@jseabold (Member) commented:

I assume it's from Huber's old textbook, like the rest of the code. I don't recall the details well right now, though.

@josef-pkt (Member) commented:

I also checked/searched in Huber 1981, but didn't see anything directly about the algorithm. Maybe you filled in the details from somewhere else, or my search wasn't thorough enough.
There are general descriptions of the scale estimators, but nothing that I can see that would directly translate to the implementation.

I'm trying to reverse engineer it from the winsorized variance, but something still looks strange.

@jseabold (Member) commented:

I don't think I wrote this. Maybe refactored it.

@josef-pkt (Member) commented:

Ok, maybe still an inherited dark corner, not much touched since. My "blame" search stopped when the function was moved into the current module.

An overhaul and checking was required since #550, which was about Huber instead of HuberScale. My earlier impression was that they were just not intended for use outside of RLM; I'll look into it.

@jseabold (Member) commented:

Without looking, I have a vague recollection that this is just a standalone, univariate, robust estimator of scale. (One of them is, at least.)

@josef-pkt (Member) commented:

For example, with 0.5.0, to make sure it's not one of my current changes:

>>> import numpy as np
>>> x = np.sqrt(2) * np.random.randn(500)
>>> import statsmodels.robust.scale as rscale
>>> rscale.Huber()(x)
(array(-0.03878802012269981), array(1.344585028724749))
>>> rscale.HuberScale()(1., len(x), x)
30.253540609632406

And the bugfix candidate: HuberScale divides by sqrt(nobs) instead of nobs.

>>> rscale.HuberScale()(1., len(x), x) / np.sqrt(len(x))
1.3529794672637696

Update/edit:

My mistake: it's df_resid, not df_model or ddof. (Given that RLM works, this would have been too big a bug.)

>>> rscale.HuberScale()(len(x) - 1., len(x), x)
1.3355792631491707

For the (almost) perfect prediction case in this issue: we still need some specific unit tests.

@josef-pkt (Member) commented:

Without looking I have a vague recollection that this is just a standalone, univariate, robust estimator of scale.

They should be, but as we discussed for stand_mad versus mad, they are only used for residuals in RLM.
Huber is more standalone, since it can estimate loc and scale at the same time, while for HuberScale the loc is supposed to be fixed and handled by the regression part in RLM, according to my interpretation.

@josef-pkt (Member) commented:

Update on references:
Huber's Proposal 2 and similar code was written by Skipper.
Commit messages mention SAS:
http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_rreg_sect018.htm
(SCALE=HUBER<(D=d)>) defines h as in HuberScale, and references Huber 1981, p. 179.

@josef-pkt (Member) commented:

Final comment:

HuberScale is fine, except for the starting value in the case when mad=0. I get exactly the same results by calculating the implied scale from the winsorized sum of squares using fsolve. I think the solution is unique, and the mad center change will not cause any problems.

(Open: I didn't try to figure out whether the implied scale calculation can be extended to other loc-scale distributions.
Eventually, we need to tie together the robust statistics: the PR for skew/kurtosis, robust ANOVA for mean and for variance, based on trimming and winsorizing, which might just be special cases of M-estimators.)
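To make the "implied scale from the winsorized sum of squares" reading concrete, here is a reverse-engineered sketch (my own derivation of h, not statsmodels' code, and brentq instead of fsolve); it reproduces the 0.6707 value from the transcript above.

import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def implied_scale(resid, df_resid, nobs, d=2.5):
    # h = E[min(Z^2, d^2)] for Z ~ N(0, 1): the expected winsorized second moment
    h = 2 * norm.cdf(d) - 1 - 2 * d * norm.pdf(d) + 2 * d**2 * norm.sf(d)
    # solve mean(min(r^2, (d*s)^2)) = (df_resid / nobs) * h * s^2 for s
    f = lambda s: np.mean(np.minimum(resid**2, (d * s)**2)) - (df_resid / nobs) * h * s**2
    return brentq(f, 1e-8, 10 * np.abs(resid).max())

r = np.array([-0.304, -0.304, 1.186, -0.304, -0.274])
print(implied_scale(r, 4., 5.))   # ~0.67069, matching HuberScale()(4., 5., r) above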

@josef-pkt (Member) commented:

One more:

mad with center=0 can also be 0 in almost-perfect-prediction cases:

>>> x = np.array([0,0,0,0,0,0, -1, 1])
>>> x.mean()
0.0
>>> np.median(x)
0.0
>>> rscale.mad(x)
0.0
>>> rscale.mad(x, center=0)
0.0
>>> moment_winsorized(x, 2.5, ddof=0)   # var of winsorized data
0.25
>>> x.var()
0.25
>>> huber_scale(x, 2.5, ddof=0)    # that's my function, not statsmodels'
(array([ 0.]), <function diff_scale_wins at 0x05042EF0>)
C:\Programs\Python27\lib\site-packages\scipy\optimize\minpack.py:227: RuntimeWarning: The iteration is not making good progress, as measured by the 
  improvement from the last ten iterations.
  warnings.warn(msg, RuntimeWarning)

We still need another fallback if mad is zero.

@josef-pkt (Member) commented:

I think the solution is unique

Unfortunately, that's not always true.
In the corner case x = np.array([0,0,0,0,0,0, -1, 1]), the winsorized variance is constant once all non-zero elements are winsorized; i.e., in this case any variance small enough results in perfect prediction.

Back to the original problem: with a redescending norm, where extreme outliers get zero weight, parts of the results might be indeterminate in the perfect prediction case (unreliable cov_params, scale, ...?).
Since we are in a linear model (not binary as in Logit), the parameter estimate should still be unique.

Add a warning if we detect perfect prediction? Or does the indeterminacy happen only in certain norm/scale combinations?

@josef-pkt (Member) commented:

Ok, back to the original topic, zero variance and perfect prediction:

The problem in the actual implementation is that the weights are all zero (or nan?). Under the interpretation that the largest weight should be normalized to 1, we could set the weights to 1 or 0 depending on whether the residual is zero or not, which is essentially a trimmed estimator.
I think WLS will then correctly return nan for the scale and cov_params. We always get the parameter estimates, but will get nans in the inferential statistics in the perfect-prediction-of-inliers case.

Edit/correction: WLS in the perfect prediction case should have scale=0 up to numerical precision, and bse equal to zero.
That's the result I get in RLM if I do set the weights to 1 or 0.
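A sketch of that 0/1 trimming idea (my own illustration, not the RLM code; the data are made up):

import numpy as np
import statsmodels.api as sm

y = np.array([2.8, 2.8, 2.8, 5.0])         # three inliers fit perfectly, one outlier
X = np.ones((len(y), 1))
resid = y - 2.8                             # stand-in for the current RLM residuals
w = (np.abs(resid) < 1e-12).astype(float)   # weight 1 for zero residuals, 0 otherwise
res = sm.WLS(y, X, weights=w).fit()
print(res.params, res.scale, res.bse)       # params [2.8], scale ~ 0, bse ~ 0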

@josef-pkt (Member) commented:

Another thought from playing with different ways to force weights and scale to be non-zero:

I suspect, but I'm not sure, that we could get stuck in a solution where a few observations have zero residuals, scale=0, and all other values are ignored (with redescending norms that put zero weight on outliers).
That's only a possible problem if we have zero residuals (not a continuous sample) and a starting value that starts at this point. The latter might be possible if we start with something like an LTS.

A tiny bit of noise always results in a locally unique solution, I guess. So all of this only matters for edge cases where we don't have a sample from a continuous distribution (as with binned or rounded measurements).

josef-pkt added a commit to josef-pkt/statsmodels that referenced this issue Jan 27, 2014
bashtage added a commit to bashtage/statsmodels that referenced this issue Jul 17, 2019
Ensure RLM doesn't crash if perfect fit
Clean norms so that they don't crash either
Avoid calculations that lead to nan
Improve testing of RLM
Replace fabs with abs

closes statsmodels#5585
closes statsmodels#1341
closes statsmodels#5878
closes statsmodels#5356
closes statsmodels#3319
xref statsmodels#55