Issue with lowess() smoother in statsmodels #946

Closed
yarden opened this issue Jul 4, 2013 · 10 comments

@yarden

yarden commented Jul 4, 2013

Following Josef's suggestion (https://groups.google.com/forum/#!topic/pystatsmodels/A5KMexQA1D8), I am using lowess() from statsmodels.nonparametric.smoothers_lowess to do Lowess smoothing. When I run it on the X and Y values in "test_data.txt" (available here: https://gist.github.com/yarden/5929702), I get the error:

  File "/home/yarden/.local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-x86_64.egg/statsmodels/nonparametric/smoothers_lowess.py", line 162, in lowess
    x = np.array(x[sort_index])
IndexError: index 13632 is out of bounds for size 10146

My code just calls lowess on the x, y values in the file:

import pandas
from statsmodels.nonparametric.smoothers_lowess import lowess

df = pandas.read_table("./test_data.txt", sep="\t")
y_vals = lowess(df["Y"], df["X"], return_sorted=False)

where NaN values in the test_data.txt file just represent missing values. What is going wrong here?

Also, if I do:

df = df.dropna(subset=["X", "Y"], how="any")

Then it seems to work, but I thought NaN values are by default dropped (based on the missing argument to lowess()), so I am not sure what caused the problem in this case.

Thanks very much for your help.

@josef-pkt
Member

BUG, my mistake for not having a test case with nans.

argsort on line 161 should use x not exog

sort_index = np.argsort(x)
not
sort_index = np.argsort(exog)

exog is the original full length, x, y have fewer rows if there are missing values
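To illustrate the length mismatch described above, here is a minimal NumPy-only sketch. It does not reproduce the statsmodels source; the array values and the names `mask`, `bad_index`, and `good_index` are made up for illustration, and only `exog`, `x`, and `sort_index`-style indexing mirror the names in the comment.

```python
import numpy as np

# Illustrative data: NaN marks a missing value in either column.
exog = np.array([3.0, np.nan, 1.0, 2.0, np.nan, 0.5])
endog = np.array([30.0, 10.0, np.nan, 20.0, 5.0, 7.0])

# Drop every row where either value is missing, so x and y are shorter than exog.
mask = ~(np.isnan(exog) | np.isnan(endog))
x = exog[mask]
y = endog[mask]

# Sorting indices computed from the full-length array can point past the end of x.
bad_index = np.argsort(exog)
try:
    x[bad_index]
    raised = False
except IndexError:
    raised = True

# Sorting indices computed from the shortened array are always in bounds.
good_index = np.argsort(x)
x_sorted = x[good_index]
y_sorted = y[good_index]
```

The same pattern explains the traceback in the report: the index came from an array of the original length, while the array being indexed had already lost its rows with missing values.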

@yarden
Author

yarden commented Jul 4, 2013

thanks very much for your prompt reply and fix!

@josef-pkt
Member

Fix will land in master within a day, with test case.

Thanks for reporting it.

@yarden
Author

yarden commented Jul 4, 2013

Quick followup: if I pass lowess the argument missing="none", return_sorted=False, it runs with no error. Does that mean that missing values (nans) are ignored and nan and x values that are missing are simply returned as nan? That's the behavior I hope to achieve but I wanted to make sure I am not misunderstanding.

@josef-pkt
Member

missing="none" means we don't check for nans (to save computation when we already know we don't have NaNs or infs).
If there are nans, then they are treated as nans in the floating point operations. All code just runs with float (double). I never checked how NaNs propagate in this case.
My guess from similar code is that all smoothed values that have a nan in their neighborhood will also turn into nans.

Maybe you want missing="drop", return_sorted=False, which drops nans and sorts the array for the calculations, but then puts the results back in the same order and shape as the original data, with nans in the positions where either x or y had a nan in the data.

That's the intended behavior, but the unit tests might not include a case with nans, given the previous error (not all option combinations are unit tested).
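The drop, sort, smooth, and restore sequence described in this comment can be sketched roughly as follows. This is not the statsmodels implementation: `smooth_keep_shape` is a hypothetical helper, and `neighbor_mean` is a simple three-point average standing in for the real LOWESS fit so the example stays self-contained.

```python
import numpy as np

def smooth_keep_shape(y, x, smoother):
    """Hypothetical helper: smooth the valid rows, return original order and length."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    # Keep only rows where both x and y are present.
    valid = ~(np.isnan(x) | np.isnan(y))
    xv, yv = x[valid], y[valid]
    # Sort by x for the calculation.
    order = np.argsort(xv)
    fitted_sorted = smoother(yv[order], xv[order])
    # Undo the sort so fitted values line up with the valid rows again.
    fitted_valid = np.empty_like(fitted_sorted)
    fitted_valid[order] = fitted_sorted
    # Place results into a full-length array; dropped rows stay NaN.
    out = np.full(y.shape, np.nan)
    out[valid] = fitted_valid
    return out

def neighbor_mean(y_sorted, x_sorted):
    """Stand-in smoother (not LOWESS): mean of each point and its two neighbors."""
    padded = np.pad(y_sorted, 1, mode="edge")
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

x = np.array([3.0, np.nan, 1.0, 2.0, 4.0])
y = np.array([9.0, 5.0, 1.0, np.nan, 16.0])
result = smooth_keep_shape(y, x, neighbor_mean)
```

Here `result` has the same length as the input, with NaN at the two positions where either `x` or `y` was missing.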

@yarden
Author

yarden commented Jul 4, 2013

Ah I see, on closer look, it does put nans on all values that are near nans, which is definitely not what I intended. Based on your description, I am looking for missing="drop", return_sorted=False: dropping nans for the lowess operation, and then putting nans back so that the output arrays have the same length as the input.
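The propagation of nans to neighboring values can be seen with any local average, not just LOWESS. The sketch below uses a plain three-point moving mean as an assumed stand-in for the local fit; the data are made up.

```python
import numpy as np

# One missing value in the middle of otherwise clean data.
y = np.array([1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0])
window = 3

# Plain floating-point arithmetic: any window that touches the NaN yields NaN.
smoothed = np.array(
    [y[i:i + window].mean() for i in range(len(y) - window + 1)]
)
```

A single NaN contaminates every window that contains it, so three of the five outputs here are NaN. With a wider neighborhood, as LOWESS typically uses, the affected region grows accordingly.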

@josef-pkt
Member

I just tried with your data: return_sorted=True and return_sorted=False return exactly the same valid points.

If you want to speed up the calculations with a large dataset, then you could use the delta option which skips points that are not a minimum distance apart. (I never really tried it out. Carl implemented it to have the same fast option as in R.)
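A rough sketch of using the delta option on a larger dataset follows. The synthetic data, the `frac` value, and the choice of delta as 1% of the x range are assumptions for illustration, not settings taken from this thread.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 10.0, size=2000))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Assumed choice: a delta equal to 1% of the x range.
delta = 0.01 * (x.max() - x.min())

# return_sorted=False returns only the fitted values, in the input order.
fitted = lowess(y, x, frac=0.3, delta=delta, return_sorted=False)
```

A larger delta trades accuracy for speed, so it is worth comparing against a run with delta=0.0 on a subset before relying on it.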

@yarden
Author

yarden commented Jul 4, 2013

Did you change anything in the code? I don't see updates to the master branch, and when I try it with a recently cloned repository and missing="drop" (with either setting of return_sorted) I get the error I mentioned above about dimensions being incorrect.

@josef-pkt
Member

No, I just made the change exog -> x locally, in the copy of statsmodels that my python is using.

I was mainly looking at plots, and trying to figure out how to write additional unit tests for these options.

You could change it in your installed statsmodels, then you can run it right away.

josef-pkt added a commit that referenced this issue Jul 5, 2013
BUG fix lowess sort when nans closes #946
@josef-pkt
Member

fix and more unit tests are in master

PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue Sep 2, 2014