Improved lowess #266

Closed
wants to merge 7 commits into from

4 participants

@carljv

A Cython version of lowess that, in addition to the original parameters, adds a 'delta' parameter that can be used to speed up computation time on moderate-to-large data by linearly interpolating between points. (Similar to R's functionality). Moved original test-suite data to /results along with new delta tests. An R file is now available to reproduce all the results CSV files.

There may be some confusion in the code, as you guys recently made changes deprecating lowess.py. I'm also doing a bit of guesswork as to how Cython modules are treated for compilation, based on your setup files. Let me know if there's anything I should fix along those, or other lines.

@josef-pkt
statsmodels member

Thanks Carl,

I had looked tried it out from your git repo, and this is a very welcome improvement.

I need to look at the structure of your pull requests.

The way we have it right now is that compilation is delegated to the subpackages, so I think we should be able to follow the same pattern as nonparametric and tsa to cythonize and/or compile, that Skipper set up.

So far, using the compiled version was optional, which will change at some point in the future (maybe soon).
For lowess it would also be possible to keep the pure python version if we want to avoid requiring compilation, although the additional option in your version makes it not completely identical.

If this cannot be merged in a backwards compatible way, then I would prefer to hold off with the merge for one week, since I'm just trying to sort out what to do with a 0.4.1 release.

@carljv

Josef,

Of course, merge at your convenience. If there's anything I can revise to make it easier, let me know.

From a user perspective, there's no real backward compatibility issue, I think. The default of the new parameter is zero, so any calls that omit the argument (i.e. to the original function), will result in exactly the same output. It will just be quite a bit faster (even without a delta argument).

We spoke a bit about adding confidence intervals, etc. earlier. That's probably all doable with easy additions to this code. But perhaps that sort of thing is better done under an all new API, where Lowess is a class that is fit(), similar to other procedures in sm?

Thanks,
Carl

@josef-pkt
statsmodels member

reply to last part is Yes.

Making lowess into a "standard" model class was one of my questions in the original lowess merge.

I also looked briefly at R's loess, which has also prediction intervals. Being able to get a class with fit() and predict() (and other statistics) would make it possible to use lowess/loess/local polynomial regression like other smoothers in nonparametrics, kernel regression, splines, ...
If you know lowess well enough to add this, then this would be very useful. (I didn't try yet to understand the details of the lowess algorithm.)

aside: for some nonparametric methods, bootstrap confidence intervals are recommended, and with a fast lowess this will be much easier to do.

@jseabold
statsmodels member

This looks good to me. It will apply cleanly after a rebase. We need to add the extension in nonparametric/setup.py, but I can do that.

The only sticking point is that we would now require a C compiler to build statsmodels. This is where I want to go anyway, so I don't have any objection to this...

@josef-pkt
statsmodels member

I don't think we need to require the C compiler. Except for one option we have the equivalent python version.
Without the compiled extension, we could just raise a NotImplementedError in the pure python version.

Given that the distribution of binaries has improved a lot, I think requiring a C compiler will not be a big issue anymore. (I still need to check my python 3 environment which doesn't have a c compiler yet, or we have to find an alternative for python 3 compatibility checks.)

But given that lowess is available in python, I don't think it's a crucial enough improvement to requiring a c compiler for this merge. I would just follow the same optional pattern as in the other current cython extensions.

@jseabold
statsmodels member

Missed this comment.

When we discussed the compiler changes, I implemented all the conditional stuff on the assumption that it was a stop-gap. We have more building / cython issues outstanding, and I don't really want to try to work around them anymore and I want to fix the build situation. I also don't really want to have to add back in the python code after this merge. It's just more work for me.

@jseabold
statsmodels member

This is still floating around, and I'd still like to merge. Any chance on moving to Cython-only building. Installing SDK or mingw32 with my instructions is still painless.

@rgommers
statsmodels member

What about supported dtypes? Right now this is float64 only it looks like. Do you require anything else? In principle not supporting other types is a backwards compatibility break.

@josef-pkt
statsmodels member

From what I understand of the usages of lowess, I don't think any other dtype is necessary.
int shouldn't make sense, and for float32, I don't think lowess will be applied to data where memory is a problem.
(I have no idea about complex but I'm doubtful that it works correctly.)

@rgommers
statsmodels member

Why doesn't int make sense? I can imagine using lowess on integer data. Data with continuous endog and discrete exog may not even be that uncommon.

@josef-pkt
statsmodels member

You still want to fit a smooth curve through it. Or not?
If you have a discrete exog, then the x axis would only have a few positions but we still want a smooth curve for y. and all the calculations will be in float.

I haven't looked at the implementation details (in a while), would there be anything to gain by not converting int exogs to float?

@rgommers
statsmodels member

Yes, you want a smooth curve. But why don't you think it's possible to use an x-axis of discrete values (say dollars in 1k$ increments for income, or number of people in range 1 - 7 billion, etc. discrete != a few values.

You can tell the user to just input his data as xdata.astype(np.float64) of course, but that's not all that user-friendly.

Either way, accepted types should be documented in the docstring.

@josef-pkt
statsmodels member

Ok, I agree, I thought we are doing the float conversion automatically before calling the c code.

I think we should accept all types that can be converted to float64, but have the c code only for float64.

@josef-pkt
statsmodels member

I looked briefly at the function. I haven't looked at cython in a while.

Can the type conversion be done inside the lowess functions before defining everything as a fixed type (double),
or is it better to do the conversion in a wrapper function?

@rgommers rgommers commented on the diff Dec 18, 2012
statsmodels/nonparametric/setup.py
@@ -27,6 +27,7 @@ def configuration(parent_package='', top_path=None):
#config.add_data_files('tests/results/results_kde_weights.csv')
if has_c_compiler():
cython(['fast_linbin.pyx'], working_path=cur_dir)
+ cython(['lowess.pyx'], working_path=cur_dir)
@rgommers
statsmodels member

This isn't complete, need also add_extension in the same way as for fast_linbin below.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@rgommers
statsmodels member

It looks to me like there wouldn't be a gain of using fused types here. Simply removing the type declarations on endog/exog and casting those inside lowess to float64 should do the job.

@josef-pkt josef-pkt closed this in 76c7f0b Jun 22, 2013
@josef-pkt
statsmodels member

@carljv Thank you
sorry that this turned into the oldest PR, merged rebased version in #856

@jhamman jhamman referenced this pull request in mwaskom/seaborn Aug 2, 2015
Open

add lowess bootstrap/ci functionality #553

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment