lowess silently returns nans #1798

Closed
anntzer opened this Issue Jun 29, 2014 · 12 comments

Projects

None yet

2 participants

@anntzer
Contributor
anntzer commented Jun 29, 2014

With statsmodels 0.5.0, lowess(arange(20), arange(20), frac=.4) returns a smoothed array entirely made of nans. This seems to be triggered by a value too small for frac, relative to the size of the data. With unevenly spaced data, sometimes only part of the smoothed data is nan; e.g. lowess(arange(20), arange(20) ** 2, frac=.2).
I don't know the exact details of the lowess implementation but it seems to me that if frac is too small (i.e. there aren't enough data points for the local smoothing?), the return value should just be set to the initially given value.

@josef-pkt
Member

Thanks for reporting.

I'm not very familiar with the details (or don't remember)

I tried to see if it is related to the missing noise. It doesn't seem to be.
But even for the same x with more noise I get cases where we don't get the nans. So it cannot just only depend on the number of neighbors in x.

From the docstring, the number of neighbors should always be frac * nobs, so independent of the distance between the x.

>>> np.random.seed(5)
>>> sm.nonparametric.lowess(np.arange(20)+0.1*np.random.randn(20), np.arange(20) ** 2, frac=.2)
array([[  0.00000000e+00,   4.41227487e-02],
       [  1.00000000e+00,   5.93861341e-01],
       [  4.00000000e+00,   2.24307712e+00],
       [  9.00000000e+00,   2.97552081e+00],
       [  1.60000000e+01,   4.01096098e+00],
       [  2.50000000e+01,   4.86511308e+00],
       [  3.60000000e+01,   5.90907676e+00],
       [  4.90000000e+01,   6.91609292e+00],
       [  6.40000000e+01,   7.94557699e+00],
       [  8.10000000e+01,   8.93231829e+00],
       [  1.00000000e+02,   9.90208116e+00],
       [  1.21000000e+02,   1.09191158e+01],
       [  1.44000000e+02,   1.19687423e+01],
       [  1.69000000e+02,   1.29132845e+01],
       [  1.96000000e+02,   1.38906751e+01],
       [  2.25000000e+02,   1.49331609e+01],
       [  2.56000000e+02,   1.60789549e+01],
       [  2.89000000e+02,   1.70955848e+01],
       [  3.24000000e+02,   1.80655079e+01],
       [  3.61000000e+02,   1.90565134e+01]])
>>> np.random.seed(5)
>>> sm.nonparametric.lowess(np.arange(20)+0.01*np.random.randn(20), np.arange(20) ** 2, frac=.2)
array([[   0.        ,           nan],
       [   1.        ,           nan],
       [   4.        ,           nan],
       [   9.        ,           nan],
       [  16.        ,           nan],
       [  25.        ,           nan],
       [  36.        ,           nan],
       [  49.        ,    6.95891029],
       [  64.        ,    7.96594254],
       [  81.        ,    8.96681907],
       [ 100.        ,    9.96664979],
       [ 121.        ,   10.97013503],
       [ 144.        ,   11.97710646],
       [ 169.        ,   12.97595696],
       [ 196.        ,   13.97295998],
       [ 225.        ,   14.9769176 ],
       [ 256.        ,   15.99103018],
       [ 289.        ,   16.99093443],
       [ 324.        ,   17.98489548],
       [ 361.        ,   19.00644563]])
@josef-pkt
Member

The number of nans is increasing with each iteration:

>>> [np.isnan(sm.nonparametric.lowess(np.arange(20), np.arange(20.) ** 2, frac=.2, it=niter)[:,1]).sum() for niter in range(10)]
[0, 3, 5, 7, 9, 11, 20, 20, 20, 20]
@josef-pkt
Member

This might be a perfect prediction problem. Maybe somewhere a error variance estimate goes to zero.

>>> for sdinv in range(1, 20):[np.isnan(sm.nonparametric.lowess(np.arange(20)+ 1./sdinv * noise, np.arange(20.) ** 2, frac=.2, it=niter)[:,1]).sum() for niter in range(10)]
... 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 2, 8, 12, 20]
[0, 0, 0, 0, 0, 0, 0, 0, 1, 5]
[0, 0, 0, 0, 0, 0, 0, 0, 2, 8]
[0, 0, 0, 0, 0, 0, 1, 5, 10, 20]
[0, 0, 0, 0, 0, 2, 8, 12, 20, 20]
[0, 0, 0, 0, 2, 8, 12, 20, 20, 20]
[0, 0, 0, 1, 5, 10, 20, 20, 20, 20]
[0, 0, 0, 1, 6, 11, 20, 20, 20, 20]
[0, 3, 5, 7, 9, 11, 20, 20, 20, 20]
[0, 3, 5, 7, 9, 11, 20, 20, 20, 20]
[0, 3, 5, 7, 9, 11, 20, 20, 20, 20]
[0, 3, 5, 7, 9, 11, 20, 20, 20, 20]
[0, 3, 5, 7, 9, 11, 20, 20, 20, 20]
[0, 3, 5, 7, 9, 11, 20, 20, 20, 20]
[0, 3, 5, 7, 9, 11, 20, 20, 20, 20]
[0, 3, 5, 7, 9, 11, 20, 20, 20, 20]
[0, 3, 5, 7, 9, 11, 20, 20, 20, 20]
@anntzer
Contributor
anntzer commented Jun 29, 2014

I noticed that biopython also provides an implementation of lowess that suffers from a similar issue:

>>> from Bio.Statistics.lowess import lowess
>>> blowess(arange(20) ** 2, arange(20), .2)
/usr/lib/python3.4/site-packages/Bio/Statistics/lowess.py:88: RuntimeWarning: invalid value encountered in double_scalars
  beta1 = (A22 * b1 - A12 * b2) / determinant
/usr/lib/python3.4/site-packages/Bio/Statistics/lowess.py:89: RuntimeWarning: invalid value encountered in double_scalars
  beta2 = (A11 * b2 - A21 * b1) / determinant
/usr/lib/python3.4/site-packages/Bio/Statistics/lowess.py:88: RuntimeWarning: divide by zero encountered in double_scalars
  beta1 = (A22 * b1 - A12 * b2) / determinant
array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])
@josef-pkt
Member

@anntzer Can you check whether determinant is zero in the biopython code? I guess it is, and it would most likely confirm the perfect prediction diagnosis..
It might be easier to debug, and the same fix might apply to our code.

@anntzer
Contributor
anntzer commented Jun 29, 2014

Yes, it is equal to 0. Now I don't understand why the first two warnings are "invalid value" instead of "divide by zero", though.

@anntzer
Contributor
anntzer commented Jul 1, 2014

OK, so I checked matlab's implementation of lowess (smooth, in the curve fit toolbox). They explicitly use mldivide-least-squares (which doesn't return the least-squares solution with the smallest norm, but the one with the most zero elements -- but I guess lstsq would just work as well) here.
Statsmodels' cython implementation is a bit obscure for me but it explicitly says that it chooses not to use lstsq. Should this choice be reverted.

@josef-pkt
Member

No, not using least squares is a "feature" of lowess.
our lowess (and I think R's, and in the original articles) uses a (outlier) robust version. It is similar to robust RLM and not a standard local kernel regression based on least squares fitting.

The iteration=0, it=0 might be just LS, but I never thought about that before now.

But that might be a good clue, For RLM. I have an open pull request that fixes the weights when the variance is zero.

mldivide-least-squares (which doesn't return the least-squares solution with the smallest norm, but the one with the most zero elements)

I don't understand this part, if they use least squares or a version that works for singular matrices, then it returns the minimum, and doesn't select on the number of zero elements.
??

@josef-pkt
Member

based on this, I guess the problem is in

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/nonparametric/_smoothers_lowess.pyx#L537

np.median(std_resid) will be zero if at least half of the residuals are zero (perfect fit),
It's the same problem as in RLM.

The solution is to set 0/0 = 0 for the std_resid (i.e. replace nans by zero if np.median(std_resid) == 0

@anntzer
Contributor
anntzer commented Jul 1, 2014

I agree that my comment about mldivide is not clear -- mostly because indeed I don't exactly know what mldivide is doing. See e.g. http://scicomp.stackexchange.com/questions/5603/in-matlab-what-differences-are-between-linsolve-and-mldivide
I thought that "robust" lowess just meant setting weights of distant outliers to zero? At least that's what MATLAB's docs (http://www.mathworks.com/help/curvefit/smooth.html) seem to imply and also what Biopython's implementation seems to do.

@josef-pkt
Member

It uses bisquare weighting, which sets the weight of outliers to zero, but also lowers the weight of inliers compared to least squares.

the matlab documentation for smooth 'rlowess', doesn't say what the weight is for "inliers". Our bisquare also sets the weight of observations to zero if the residual is larger than 6.0 * np.median(std_resid).

@anntzer
Contributor
anntzer commented Jul 1, 2014

MATLAB also uses bisquare.

@jseabold jseabold closed this in f28610e Nov 24, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment