
lowess silently returns nans #1798

Closed
anntzer opened this issue Jun 29, 2014 · 12 comments

anntzer (Contributor) commented Jun 29, 2014

With statsmodels 0.5.0, lowess(arange(20), arange(20), frac=.4) returns a smoothed array made entirely of nans. This seems to be triggered by a value of frac that is too small relative to the size of the data. With unevenly spaced data, sometimes only part of the smoothed output is nan, e.g. lowess(arange(20), arange(20) ** 2, frac=.2).
I don't know the exact details of the lowess implementation, but it seems to me that if frac is too small (i.e. there aren't enough data points for the local smoothing?), the return values should just be set to the originally given values.
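
For reference, a minimal self-contained reproduction of the first call above, using the statsmodels.api import convention (the True result is the behaviour reported for 0.5.0, not something guaranteed on later versions):

>>> import numpy as np
>>> import statsmodels.api as sm
>>> out = sm.nonparametric.lowess(np.arange(20), np.arange(20), frac=.4)
>>> np.isnan(out[:, 1]).all()   # every smoothed value is nan on the affected version
True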

josef-pkt (Member) commented:

Thanks for reporting.

I'm not very familiar with the details (or don't remember them).

I tried to see whether it is related to the missing noise. It doesn't seem to be: even for the same x, with more noise I get cases where we don't get the nans. So it cannot depend only on the number of neighbors in x.

From the docstring, the number of neighbors should always be frac * nobs, so it is independent of the distances between the x values.

>>> np.random.seed(5)
>>> sm.nonparametric.lowess(np.arange(20)+0.1*np.random.randn(20), np.arange(20) ** 2, frac=.2)
array([[  0.00000000e+00,   4.41227487e-02],
       [  1.00000000e+00,   5.93861341e-01],
       [  4.00000000e+00,   2.24307712e+00],
       [  9.00000000e+00,   2.97552081e+00],
       [  1.60000000e+01,   4.01096098e+00],
       [  2.50000000e+01,   4.86511308e+00],
       [  3.60000000e+01,   5.90907676e+00],
       [  4.90000000e+01,   6.91609292e+00],
       [  6.40000000e+01,   7.94557699e+00],
       [  8.10000000e+01,   8.93231829e+00],
       [  1.00000000e+02,   9.90208116e+00],
       [  1.21000000e+02,   1.09191158e+01],
       [  1.44000000e+02,   1.19687423e+01],
       [  1.69000000e+02,   1.29132845e+01],
       [  1.96000000e+02,   1.38906751e+01],
       [  2.25000000e+02,   1.49331609e+01],
       [  2.56000000e+02,   1.60789549e+01],
       [  2.89000000e+02,   1.70955848e+01],
       [  3.24000000e+02,   1.80655079e+01],
       [  3.61000000e+02,   1.90565134e+01]])
>>> np.random.seed(5)
>>> sm.nonparametric.lowess(np.arange(20)+0.01*np.random.randn(20), np.arange(20) ** 2, frac=.2)
array([[   0.        ,           nan],
       [   1.        ,           nan],
       [   4.        ,           nan],
       [   9.        ,           nan],
       [  16.        ,           nan],
       [  25.        ,           nan],
       [  36.        ,           nan],
       [  49.        ,    6.95891029],
       [  64.        ,    7.96594254],
       [  81.        ,    8.96681907],
       [ 100.        ,    9.96664979],
       [ 121.        ,   10.97013503],
       [ 144.        ,   11.97710646],
       [ 169.        ,   12.97595696],
       [ 196.        ,   13.97295998],
       [ 225.        ,   14.9769176 ],
       [ 256.        ,   15.99103018],
       [ 289.        ,   16.99093443],
       [ 324.        ,   17.98489548],
       [ 361.        ,   19.00644563]])

josef-pkt (Member) commented:

The number of nans is increasing with each iteration:

>>> [np.isnan(sm.nonparametric.lowess(np.arange(20), np.arange(20.) ** 2, frac=.2, it=niter)[:,1]).sum() for niter in range(10)]
[0, 3, 5, 7, 9, 11, 20, 20, 20, 20]

josef-pkt (Member) commented:

This might be a perfect-prediction problem. Maybe somewhere an error variance estimate goes to zero.

>>> # `noise` was defined earlier in the session; presumably standard normal, e.g. np.random.randn(20)
>>> for sdinv in range(1, 20):
...     print([np.isnan(sm.nonparametric.lowess(np.arange(20) + 1./sdinv * noise, np.arange(20.) ** 2, frac=.2, it=niter)[:, 1]).sum() for niter in range(10)])
... 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 2, 8, 12, 20]
[0, 0, 0, 0, 0, 0, 0, 0, 1, 5]
[0, 0, 0, 0, 0, 0, 0, 0, 2, 8]
[0, 0, 0, 0, 0, 0, 1, 5, 10, 20]
[0, 0, 0, 0, 0, 2, 8, 12, 20, 20]
[0, 0, 0, 0, 2, 8, 12, 20, 20, 20]
[0, 0, 0, 1, 5, 10, 20, 20, 20, 20]
[0, 0, 0, 1, 6, 11, 20, 20, 20, 20]
[0, 3, 5, 7, 9, 11, 20, 20, 20, 20]
[0, 3, 5, 7, 9, 11, 20, 20, 20, 20]
[0, 3, 5, 7, 9, 11, 20, 20, 20, 20]
[0, 3, 5, 7, 9, 11, 20, 20, 20, 20]
[0, 3, 5, 7, 9, 11, 20, 20, 20, 20]
[0, 3, 5, 7, 9, 11, 20, 20, 20, 20]
[0, 3, 5, 7, 9, 11, 20, 20, 20, 20]
[0, 3, 5, 7, 9, 11, 20, 20, 20, 20]
[0, 3, 5, 7, 9, 11, 20, 20, 20, 20]

anntzer (Contributor, Author) commented Jun 29, 2014

I noticed that Biopython also provides an implementation of lowess that suffers from a similar issue:

>>> from Bio.Statistics.lowess import lowess as blowess
>>> blowess(arange(20) ** 2, arange(20), .2)
/usr/lib/python3.4/site-packages/Bio/Statistics/lowess.py:88: RuntimeWarning: invalid value encountered in double_scalars
  beta1 = (A22 * b1 - A12 * b2) / determinant
/usr/lib/python3.4/site-packages/Bio/Statistics/lowess.py:89: RuntimeWarning: invalid value encountered in double_scalars
  beta2 = (A11 * b2 - A21 * b1) / determinant
/usr/lib/python3.4/site-packages/Bio/Statistics/lowess.py:88: RuntimeWarning: divide by zero encountered in double_scalars
  beta1 = (A22 * b1 - A12 * b2) / determinant
array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])

josef-pkt (Member) commented:

@anntzer Can you check whether the determinant is zero in the Biopython code? I guess it is, which would most likely confirm the perfect-prediction diagnosis.
It might be easier to debug, and the same fix might apply to our code.

anntzer (Contributor, Author) commented Jun 29, 2014

Yes, it is equal to 0. Now I don't understand why the first two warnings are "invalid value" instead of "divide by zero", though.
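
As an aside (an assumption offered here, not something established in the thread): numpy distinguishes 0/0, an "invalid" operation returning nan, from nonzero/0, a "divide by zero" returning inf, so the mixed warnings would suggest that some of the numerators are themselves zero.

>>> import numpy as np
>>> np.float64(0.0) / np.float64(0.0)   # RuntimeWarning: invalid value encountered
nan
>>> np.float64(1.0) / np.float64(0.0)   # RuntimeWarning: divide by zero encountered
inf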

anntzer (Contributor, Author) commented Jul 1, 2014

OK, so I checked MATLAB's implementation of lowess (smooth, in the Curve Fitting Toolbox). They explicitly use mldivide least squares here (which doesn't return the least-squares solution with the smallest norm, but the one with the most zero elements -- although I guess lstsq would work just as well).
Statsmodels' Cython implementation is a bit obscure to me, but it explicitly says that it chooses not to use lstsq. Should this choice be reverted?
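
A rough sketch of what a lstsq-based local fit could look like (hypothetical helper names, not the statsmodels internals; the point is that lstsq still returns a least-squares solution when the design is singular, rather than producing nans):

import numpy as np

def local_fit(x, y, w, x0):
    # Weighted local linear fit evaluated at x0.
    X = np.column_stack([np.ones_like(x), x - x0])   # design centered at x0
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[0]   # intercept = fitted value at x0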

josef-pkt (Member) commented:

No, not using least squares is a "feature" of lowess.
Our lowess (and, I think, R's and the original articles') uses an (outlier-)robust version. It is similar to robust RLM and not a standard local kernel regression based on least-squares fitting.

With it=0 (no robustness iterations) it might be just least squares, but I never thought about that before now.

But that might be a good clue. For RLM, I have an open pull request that fixes the weights when the variance is zero.

> mldivide least squares (which doesn't return the least-squares solution with the smallest norm, but the one with the most zero elements)

I don't understand this part: if they use least squares, or a version that works for singular matrices, then it returns the minimum-norm solution and doesn't select based on the number of zero elements.

josef-pkt (Member) commented:

Based on this, I guess the problem is in

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/nonparametric/_smoothers_lowess.pyx#L537

np.median(std_resid) will be zero if at least half of the residuals are zero (perfect fit). It's the same problem as in RLM.

The solution is to set 0/0 = 0 for the std_resid (i.e. replace the nans with zeros when np.median(std_resid) == 0).
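
A minimal sketch of that guard (illustrative only, assuming standardization by 6 * median(|resid|); this is not the actual Cython code):

import numpy as np

def standardized_residuals(resid):
    # Standardize residuals for the robustness reweighting.
    scale = 6.0 * np.median(np.abs(resid))
    if scale == 0:
        # Perfect fit: instead of 0/0 = nan, return zeros so that
        # every observation keeps full weight in the next iteration.
        return np.zeros_like(resid)
    return resid / scale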

anntzer (Contributor, Author) commented Jul 1, 2014

I agree that my comment about mldivide is not clear -- mostly because I don't know exactly what mldivide does myself. See e.g. http://scicomp.stackexchange.com/questions/5603/in-matlab-what-differences-are-between-linsolve-and-mldivide
I thought that "robust" lowess just meant setting the weights of distant outliers to zero? At least that's what MATLAB's docs (http://www.mathworks.com/help/curvefit/smooth.html) seem to imply, and it is also what Biopython's implementation seems to do.

josef-pkt (Member) commented:

It uses bisquare weighting, which sets the weight of outliers to zero but also lowers the weight of inliers compared to least squares.

The MATLAB documentation for smooth's 'rlowess' doesn't say what the weight is for "inliers". Our bisquare also sets the weight of an observation to zero if its residual is larger than 6.0 * np.median(std_resid).
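
For illustration, a generic bisquare (Tukey biweight) function of the residuals, assuming the 6 * median(|resid|) cutoff mentioned above (a sketch, not the statsmodels or MATLAB code):

import numpy as np

def bisquare_weights(resid):
    # Weight 1 at resid == 0, decreasing smoothly, exactly 0 beyond the cutoff.
    u = np.abs(resid) / (6.0 * np.median(np.abs(resid)))
    return np.where(u < 1, (1.0 - u ** 2) ** 2, 0.0)

Note that this still misbehaves when np.median(np.abs(resid)) is zero (the perfect-fit case discussed above), which is exactly where the proposed guard comes in.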

anntzer (Contributor, Author) commented Jul 1, 2014

MATLAB also uses bisquare.
