
iterated lowess is wrong #2108

Open
anntzer opened this issue Nov 24, 2014 · 7 comments · May be fixed by #9220
Comments

@anntzer
Contributor

anntzer commented Nov 24, 2014

Consider

In [1]: sm.nonparametric.lowess([0] * 10 + [1] * 10, np.arange(20), it=2)[:, 1].T
Out[1]: 
array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.26162324,  0.45476355,
        0.56315913,  0.70972324,  1.        ,  1.        ,  1.        ,
        1.        ,  1.        ,  1.        ,  1.        ,  1.        ])

(This happens both with the old 0.6.0 version and with the latest "nan-fixed" version.)
This is likely incorrect, as the output array isn't even symmetric around 0.5.

Compare with R's result:

> lowess(c(rep(0, 10), rep(1, 10)), iter=2)
$x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

$y
 [1] 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
 [7] 0.00000000 0.03796574 0.29511209 0.44982749 0.55017251 0.70488791
[13] 0.96203426 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000
[19] 1.00000000 1.00000000

which is symmetric.

Note that the issue doesn't appear if iter=1 (statsmodels gives the same result as R).
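To make the symmetry claim concrete, here is a small check using the two output arrays quoted above (the helper name is mine, not part of either library): since the step input is antisymmetric about its midpoint, a correct smoother should satisfy y[i] + y[-1-i] = 1 for all i.

```python
import numpy as np

# Outputs copied from above: statsmodels (it=2) vs R lowess (iter=2)
# on y = [0]*10 + [1]*10, x = 0..19.
sm_y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0.26162324, 0.45476355,
                 0.56315913, 0.70972324, 1, 1, 1, 1, 1, 1, 1, 1])
r_y = np.array([0, 0, 0, 0, 0, 0, 0, 0.03796574, 0.29511209, 0.44982749,
                0.55017251, 0.70488791, 0.96203426, 1, 1, 1, 1, 1, 1, 1])

def is_symmetric_about_half(y, tol=1e-6):
    # The step data is antisymmetric about its midpoint, so the smoothed
    # curve should satisfy y[i] + y[-1-i] == 1 for every i.
    return np.allclose(y + y[::-1], 1.0, atol=tol)

print(is_symmetric_about_half(sm_y))  # False
print(is_symmetric_about_half(r_y))   # True
```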

@jseabold
Member

@carljv Any comment on this? I'm not familiar enough with this code to say anything without studying it for a bit.

@jseabold
Member

Well, it's not quite as dire as the title sounds. It could just be that, in these odd cases, R is doing something different. If you want to dig in here and figure it out, it would be much appreciated. For 'normal' data, we agree with R.

R> lowess(cars, iter=3)
$x
 [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
[26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25

$y
 [1]  4.965  4.965 13.124 13.124 15.859 18.580 21.280 21.280 21.280 24.129
[11] 24.129 27.120 27.120 27.120 27.120 30.027 30.027 30.027 30.027 32.963
[21] 32.963 32.963 32.963 36.758 36.758 36.758 40.435 40.435 43.463 43.463
[31] 43.463 46.885 46.885 46.885 46.885 50.793 50.793 50.793 56.491 56.491
[41] 56.491 56.491 56.491 67.586 73.080 78.643 78.643 78.643 78.643 84.329

Statsmodels

In [1]: cars = sm.datasets.get_rdataset("cars").data

In [2]: sm.nonparametric.lowess(cars.dist, cars.speed, it=3, delta=.01*np.ptp(cars.speed))
Out[2]:
array([[  4.        ,   4.96545928],
       [  4.        ,   4.96545928],
       [  7.        ,  13.12449504],
       [  7.        ,  13.12449504],
       [  8.        ,  15.85863338],
       [  9.        ,  18.57969051],
       [ 10.        ,  21.28031258],
       [ 10.        ,  21.28031258],
       [ 10.        ,  21.28031258],
       [ 11.        ,  24.12927715],
       [ 11.        ,  24.12927715],
       [ 12.        ,  27.11954855],
       [ 12.        ,  27.11954855],
       [ 12.        ,  27.11954855],
       [ 12.        ,  27.11954855],
       [ 13.        ,  30.02727633],
       [ 13.        ,  30.02727633],
       [ 13.        ,  30.02727633],
       [ 13.        ,  30.02727633],
       [ 14.        ,  32.96250613],
       [ 14.        ,  32.96250613],
       [ 14.        ,  32.96250613],
       [ 14.        ,  32.96250613],
       [ 15.        ,  36.75772834],
       [ 15.        ,  36.75772834],
       [ 15.        ,  36.75772834],
       [ 16.        ,  40.43507456],
       [ 16.        ,  40.43507456],
       [ 17.        ,  43.46349178],
       [ 17.        ,  43.46349178],
       [ 17.        ,  43.46349178],
       [ 18.        ,  46.88547894],
       [ 18.        ,  46.88547894],
       [ 18.        ,  46.88547894],
       [ 18.        ,  46.88547894],
       [ 19.        ,  50.79315172],
       [ 19.        ,  50.79315172],
       [ 19.        ,  50.79315172],
       [ 20.        ,  56.49122409],
       [ 20.        ,  56.49122409],
       [ 20.        ,  56.49122409],
       [ 20.        ,  56.49122409],
       [ 20.        ,  56.49122409],
       [ 22.        ,  67.58582423],
       [ 23.        ,  73.07969527],
       [ 24.        ,  78.64316355],
       [ 24.        ,  78.64316355],
       [ 24.        ,  78.64316355],
       [ 24.        ,  78.64316355],
       [ 25.        ,  84.32869809]])

@josef-pkt
Member

I'm trying to partially figure out how the neighborhood is determined; see #2449.

It looks to me like there is an asymmetry. Line 269:

radius = fmax(x[i] - x[left_end], x[right_end-1] - x[i])

which, in my interpretation, centers just to the left of the intersection between x and the midpoint 0.5 * (x_low + x_high).
However, maybe it's correct, because we have already increased left_end by one, so we are trimming one observation on each side.

???
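For illustration, here is a toy version of that radius computation with assumed window endpoints (my own example, not the actual cython source): with an even window size the two distances inside the fmax differ, so the window cannot be centered on x[i].

```python
import numpy as np

x = np.arange(20.0)          # evenly spaced data
i = 10                       # hypothetical target point
left_end, right_end = 8, 14  # assumed half-open window of k = 6 points

# the quoted line 269, rewritten in plain Python
radius = max(x[i] - x[left_end], x[right_end - 1] - x[i])

# left distance is 2.0, right distance is 3.0: the window is asymmetric
print(x[i] - x[left_end], x[right_end - 1] - x[i], radius)  # 2.0 3.0 3.0
```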

@josef-pkt
Member

However, in calculate_weights we are using right_end for slicing, x[left_end:right_end], which does not include the right_end index.

Also, what's the behavior at the lower boundary? Is x[0] necessarily included in the first interval?
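Just to pin down the slicing semantics being discussed: Python slices are half-open, so x[left_end:right_end] contains exactly right_end - left_end points and excludes index right_end.

```python
x = list(range(10))
left_end, right_end = 2, 8

window = x[left_end:right_end]  # indices 2..7; index 8 is excluded
print(window)                   # [2, 3, 4, 5, 6, 7]
print(len(window))              # 6 == right_end - left_end
```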

@jbrockmendel
Contributor

Does this apply to smoothers_lowess or smoothers_lowess_old?

@josef-pkt
Member

It's for the cython code that is actually used.
The cython code is a translation of the python code with some extras, AFAIR.
In #2449 I went through the cython code, and might not have looked much at the python code.

@bashtage added this to bugs in 0.11 Jul 16, 2019
@bashtage added this to To do in 0.12 Jan 24, 2020
@bashtage removed this from Bugs (Not-started) in 0.11 Jan 24, 2020
@kaktus42

So, I looked into the question of indexing, and it looks fine.

Let's say you have 30 datapoints and f = 0.2; then you get a window size of k = 6,
and the following is set: left_end, right_end = 0, k,
which correctly includes 6 datapoints with the indexing x[left_end:right_end].
However, only 5 datapoints effectively contribute, because they are weighted by their distance to x[i].
The algorithm says: use the k nearest neighbours, then divide by the distance h = x[i] - x[a], with a being the point k-furthest away (the "k-smallest number among |x[i] - x[j]| ∀ j"). This means the furthest point within the window is always downweighted to 0.

Does this help?
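A sketch of the weighting described above (a plain-NumPy paraphrase of tricube weighting, not the statsmodels source): with k = 6 points in the window, the furthest one gets weight exactly 0, so only 5 points effectively contribute.

```python
import numpy as np

def tricube_weights(x, i, left_end, right_end):
    # Weight each point in the half-open window x[left_end:right_end]
    # by the tricube of its distance to x[i], scaled by the window radius.
    window = x[left_end:right_end]
    dist = np.abs(window - x[i])
    radius = dist.max()                 # distance to the furthest point
    u = np.clip(dist / radius, 0.0, 1.0)
    return (1.0 - u ** 3) ** 3

x = np.arange(30.0)
w = tricube_weights(x, i=0, left_end=0, right_end=6)  # k = 6 window at the boundary
print(w)  # the furthest point (x[5]) gets weight exactly 0
```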

@jseabold linked a pull request Apr 18, 2024 that will close this issue