[WIP] Earth (MARS) #2285

Closed
wants to merge 23 commits into from
@jcrudy

This pull request adds multivariate adaptive regression splines to scikit-learn.

  1. Adds the sklearn.earth module, which provides the Earth class
  2. Adds an earth page to the documentation
  3. Adds some earth examples

It probably could still use more work, but I'm looking for feedback on how best to proceed. Currently my plan is to focus on documentation going forward.

@glouppe
scikit-learn member

Thanks! I believe this is a great addition, but it will indeed need some work to be to our standards.

Also, don't expect any review right now - we are in the middle of releasing the new 0.14 version. Everyone is busy at it :)

(Note: don't hesitate to ping us again when 0.14 is released, to have a deeper look at this.)

@GaelVaroquaux
scikit-learn member
@amueller
scikit-learn member

same here ;)

@jcrudy

Thanks, guys:). I'm going to be very busy for the next two weeks anyway because I'm moving. After I'm settled I'll ping this thread again and probably put some more work into documentation.

@jcrudy

I'm going to have some time to work on this soon. Does anyone have time to give it a look and point me in the right direction? If not I will put time into documentation, but it would be great to get some feedback to move the code forward as well.

@glouppe
scikit-learn member

This PR is really huge! Can you point the most important parts for which you need a initial review? Also, can you explain us in a few words what is each Cython file used for? (to help us have a good overview of your code)

@jcrudy

The file that's most likely needs attention is earth.py. It contains the Earth class, which is the only class that users are expected to interact with directly. I have tried to make it conform to the sklearn interface standards, but I am not sure how successful I was. My main question is whether I need to change anything to make it more compatible with the rest of sklearn. The pyx files contain the following stuff:

_basis.pyx : classes that represent MARS basis functions and sets of basis functions
_forward.pyx : implements the MARS forward pass
_pruning.pyx : implements the MARS pruning pass
_record.pyx : classes for recording and reporting on the execution of the MARS algorithm (how many iterations were performed, what was the error after each iteration, what were the stopping conditions, etc.)
_util.pyx : functions that don't fit well anywhere else but are used repeatedly throughout the code

Perhaps one place I should focus is on commenting the parts of the code that are not user-facing. The following other things might be issues:

  1. When I build the documentation, my examples don't display as nicely as the ones already in sklearn. I think I am missing something about how examples are supposed to be added.

  2. My version is not as fast as the R package in most cases. It can be made faster with some work (and at some point I want to make a multicore version for my own education), but I'm not sure how much of a priority speed should be. See the pyearth_vs_earth.py file (which I now realize needs to be renamed) for a comparison.

  3. I have some public functionality on the Earth class not found in other types of sklearn models. I find it useful, but is its usefulness worth the heterogeneity? For example, the Earth class uses column names from certain types of data structures (such as pandas DataFrames) in a way that I don't think other sklearn models do, and reports information on the fitting process in a non-standard way. It also allows you to run different parts of the fitting process (forward pass, pruning pass, linear fit) independently.

@jcrudy

If you need a reference for MARS, wikipedia has a great general description of the algorithm:

http://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines

@ogrisel
scikit-learn member

I don't have time to have look tonight but you can start here:

http://scikit-learn.org/stable/developers/index.html#contributing-code

Also as the official name of the algorithm is MARS I think the name of the classes in scikit-learn should be one of:

  • MARS (but this is an acronym...)
  • MARSRegressor
  • MultiVariateAdaptiveRegressionSplines (maybe too long)
  • MultiVariateAdaptiveSplineRegressor (descriptive name, almost as long but more in line with the *Regressor convention already often used in scikit-learn)

I would also remove all references to "earth" and replace them by "mars" in module names, function names, variable names and so on.

We should also not commit images in the doc but replace them with plot automatically generated from the examples when possible. For instance, the plot in this rendered paragraph:

http://scikit-learn.org/stable/modules/ensemble.html#extremely-randomized-trees

is rendered by this example:

https://github.com/scikit-learn/scikit-learn/blob/master/examples/ensemble/plot_forest_iris.py

which is included in the reST source:

https://raw.github.com/scikit-learn/scikit-learn/master/doc/modules/ensemble.rst

@ogrisel ogrisel commented on an outdated diff Aug 27, 2013
doc/modules/earth.rst
+score is not actually based on cross-validation, but rather is meant to approximate a true
+cross-validation score by penalizing model complexity. The final result is a set of basis functions
+that is nonlinear in the original feature space, may include interactions, and is likely to
+generalize well.
+
+
+.. math::
+ y=1-2\text{h}\left(1-x\right)+\frac{1}{2}\text{h}\left(x-1\right)
+
+
+.. image:: ../images/piecewise_linear.png
+
+
+
+
+
@ogrisel
scikit-learn member
ogrisel added a note Aug 27, 2013

Also please never insert more than 2 consecutive blank lines in reST or python code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@ogrisel ogrisel commented on an outdated diff Aug 27, 2013
examples/earth/classifier_comp.py
@@ -0,0 +1,134 @@
+'''
+====================================
+Classification comparison with Earth
+====================================
+
+
+This script recreates the scikit-learn classifier comparison example found at http://scikit-learn.org/dev/auto_examples/plot_classifier_comparison.html.
+It has been modified to include an :class:`Earth` based classifier. The Earth based classifier is made up of a Pipeline
+that uses an Earth model as a transformer and a LogisticRegression model as a predictor.
@ogrisel
scikit-learn member
ogrisel added a note Aug 27, 2013

Please update the existing examples to add the MARS algorithm when appropriate rather than duplicate the examples.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@ogrisel ogrisel and 1 other commented on an outdated diff Aug 27, 2013
examples/earth/pyearth_vs_earth.py
+=============================
+
+
+This script randomly generates earth-style models, then randomly generates data from those models and
+fits earth models to those data using both the python (:class:`Earth`) and R implementations. It records the sample size,
+m, the number of input dimensions, n, the number of forward pass iterations, the runtime, and the r^2
+statistic for each fit and writes the result to a CSV file. This script requires pandas, rpy2, and a
+functioning R installation with the earth package installed.
+'''
+from __future__ import print_function
+import numpy
+import pandas.rpy.common as com
+import rpy2.robjects as robjects
+import time
+import pandas
+from sklearn.earth import Earth
@ogrisel
scikit-learn member
ogrisel added a note Aug 27, 2013

We don't want to introduce any new dependencies besides numpy and scipy, even in the examples. So either this example is converted into a benchmark in the benchmark folder with a try / except ImportError statement around the additional import, either we move it somewhere else (e.g. a standalone gist script or http://scikit-learn.org/ml-benchmarks/ ).

@pprett
scikit-learn member
pprett added a note Aug 28, 2013

@jcrudy Its great that you compare your implementation against R's earth. I have similar benchmarks for gradient boosting to compare against R's GBM - they currently reside in a separate branch - we should probably move them to ml-benchmarks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@jcrudy

Great feedback, guys. Keep it coming. Regarding the name, the problem with MARS is that, as I understand the situation, it is under copyright (only based on a comment on CRAN by Steve Milborrow: http://cran.r-project.org/web/packages/earth/index.html). I don't know if that matters for the name of a class within a larger software project like scikit-learn. Perhaps someone has more knowledge of copyright law than I do. Given that information, if all of the core developers are okay with MARSRegressor then so am I. If anyone is worried about copyrights, then I think MultiVariateAdaptiveSplineRegressor might be a good choice.

@GaelVaroquaux
scikit-learn member
@ogrisel
scikit-learn member

Actually MARS this is a trademark:

The term "MARS" is trademarked and licensed to Salford Systems. In order to avoid trademark infringements,
many open source implementations of MARS are called "Earth".

source: http://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines

Among other open source projects that use the Earth name there are:

So I am either for keeping the Earth name, maybe as EarthRegressor for the class name or to use MultiVariateAdaptiveSplineRegressor or AdaptiveSplinesRegressor for the class name and adaptive_splines for the module name (and just mention the MARS name in the docstring).

@jcrudy

I like EarthRegressor because it's short and consistent with other regressors. I'll change it to EarthRegressor unless anyone objects.

@ogrisel
scikit-learn member

Actually as MARS would only be used in class and python package names inside scikit-learn and not as the project name it might not be a subject to trade mark conflict: scikit-learn as a project name might be considered a trade mark but Python classes inside a project probably not. So using MARSRegressor as a class name might be alright after all. However I am not a lawyer.

Note that scikit-learn already has a class named RandomForestClassifier while "Random Forests" is a trade mark. However in this case "random" and "forest" are regular descriptive English words which is not the case of the acronym MARS.

Maybe should just ask Dan Steinberg of Salford Systems (the owner of both trade marks) if it's ok for an open source project like scikit-learn to use the MARS name in its classe and package names.

@jcrudy

I'll write him and report back. If he says it's okay then I'll use MARSRegressor. Otherwise, I'll use EarthRegressor.

jcrudy added some commits Oct 29, 2013
@jcrudy jcrudy Merged classifier comparison and removed benchmark vs the R package e2c027e
@jcrudy jcrudy Changed name of Earth to EarthRegressor d0997d5
@jcrudy jcrudy Fixed the remaining examples to be more consistent with the rest of s…
…klearn and to work with gen_rst
f32bfd5
@jcrudy jcrudy Fixed titles of examples 4e22199
@jcrudy jcrudy Changed doc page so that example is pulled from plot_v_functio.py 0db8e5a
@jcrudy jcrudy Incorporated some new features and bug fixes from py-earth
1. Added allow_linear argument
2. Fixed a bug in the way endspan argument was processed
3. Fixed a bug in the way sample_weights were being applied
4. Added simple test case for 3 (test needs to be improved, may be
platform dependent)
5. Fixed a typo where min_search_points was accidentally
min_searh_points
8c07970
@jcrudy

Brief update: I never got a response from Salford Systems, so I went with EarthRegressor. It's easy to change it to MARSRegressor. I fixed most of the problems pointed out in this thread, as well as a few bugs discovered by py-earth users.

I discovered one new issue. Some of my unit tests simply provide a data set, fit a model, and make sure the model is identical to a human-confirmed output file. Such tests can be platform dependent (a new version of anaconda caused the output to change slightly for me, breaking a test), so these tests need to be replaced or improved.

@jcrudy

After all this time, someone from Salford Systems just sent me an email. Here is their response:

"You had asked us if scikit-learn could use the name MARSRegressor. I checked and you cannot use the name MARS in your module; however, we are fine with scikit-learn using earth. Please do not hesitate to contact us if you should have further questions about this."

@dougvk

@jcrudy excited about this feature. how can I help?

@jcrudy

@dougvk I'd love to have some help. Depending on your interests, skills, and free time, there are many choices. As I mentioned briefly in this thread, the tests could use a little bit of attention. If you're more interested in bug fixes or adding features, then you should check out the py-earth project. This PR is essentially a fork of py-earth, so all open issues on py-earth also apply to the PR. What are your goals with respect to this project? Do you see anything you would be particularly interested in?

@raghavrv

I feel it would be nice to have a todo list for this PR :)

EDIT : Perhaps this should go into a gam directory? /sklearn/gam/earth.py

@amueller
scikit-learn member

I think someone needs starting to review to create a todo list. I am currently working on bugfixes for 0.16 (ugh sounds too much like the start of the discussion only two years later :-/) but I have quite some sklearn cycles at the moment, so maybe after that....

@raghavrv

so maybe after that....

Sure :) I'd like to help... ( after the todo is framed )

@mehdidc

Hello there, I want to work on this PR and I can do the todo list. I think the first step should be to fix the merge conflicts, no ?

@agramfort
scikit-learn member

to take over you should create a new PR with these commits and start addressing the issues. Style, examples, docs et perf benchmarks

@jcrudy

@mehdidc @ragv @amueller I am still happy to be involved, too, once there is a todo list. My time is limited right now, but at least I can be an extra set of eyes and point out how things work where it isn't clear. I would say that anyone working on this or taking it over should look at py-earth (https://github.com/jcrudy/py-earth), as there have been some improvements and bug fixes since I made this PR.

@mehdidc

great ! @jcrudy I was preparing a new PR based on the earth branch of your scikit learn fork, but it seems they are not synced (as you said), so now if I want to add a new PR, I start from py-earth, update the files that are not synced, then create the PR, right ?

@raghavrv

Also please move earth.py into sklearn/additive as discussed in #3482 in your new pr :)

@jcrudy

@mehdidc Yes, that's right. I would ignore my scikit-learn fork, as it is way out of date. Probably best to create your own fork of scikit-learn, make an earth branch, and add in the py-earth code under sklearn/additive as mentioned by @ragv . That way there shouldn't be so much work to do getting everything synced up

Also, I've been planning to merge the derivatives branch of py-earth into master for some time, but have not yet. It implements smoothing and first derivatives for MARS models, as described in Friedman's 1991 paper. Basically, it let's you calculate the gradient of your model. I'm not sure whether that's something that should go into sklearn, but if it is then that branch needs to be merged. Maybe @amueller @agramfort @ragv would have an opinion about whether that feature should be included?

@amueller
scikit-learn member

I don't have a good feeling on what these models are used for, so I don't have a strong opinion. What kind of datasets do people use pyearth on?
What kind of benefits do smoothing and first derivatives bring you?

@jcrudy

@amueller For me they have been useful for figuring out the sensitivity of my model's output to small changes in its inputs. Such knowledge sometimes has business uses for me. In my case, I am usually modeling healthcare costs or risk of hospitalization. I had one py-earth user (other than myself) request derivatives as a feature, but I don't know what he wanted them for. Friedman does not offer much justification for them in his paper, but does go to the trouble of describing how to accomplish them. Here is what he does offer, from section 3.7 of Friedman, 1991:

"If the intent is accurate estimation of the function (as opposed to its derivatives of various orders), there is little to be gained by imposing continuity beyond that of the function itself. If the true underlying function nowhere has a very large local second derivative, then a small additional increase in accuracy can be achieved by imposing continuous first derivatives. Also, continuous (first) derivative approximations have considerable more cosmetic appeal."

However, smoothing and derivatives are not a feature of the R package earth, nor any other implementation that I know of, so are probably not commonly used.

@amueller
scikit-learn member

Thanks for the explanation. That sounds like we would probably not include them for the time being. On the other hand, it might be awkward if you include them in pyearth and we don't include them. I understood your intention was to not have pyearth as a separate project at some future point.

@jcrudy

@amueller Yes, that's true. However, I have found maintenance of py-earth to not be much of a burden, and I'm sure even less so once it is in sklearn, so I am not too worried about that. I suppose it is possible that over time the two could diverge in other ways, too, however, which might make life more complicated. I guess my slight preference would be for inclusion, but for mostly reasons having to do with my own uses and ease of maintenance.

@mblondel
scikit-learn member

py-earth seems like a well-maintained project. I am not completely sure we need to merge it in scikit-learn.

@agramfort
scikit-learn member
@jcrudy

@agramfort It is compliant as far as I know how to make it, or at least it was when I wrote it. I expect there are probably some ways in which I have not followed conventions, but I am not aware of them. I agree that working to make py-earth more compliant before trying to integrate it makes sense. It would require someone more knowledgable about scikit-learn conventions than myself to give it a look and make a todo list. I expect it would be a fairly short list, and I would happily accept a PR from @mehdidc or anyone else to address any issues. I guess the issues are basically the same whichever order things are done in (fix and integrate or integrate and fix), so it really matters what works best for @mehdidc or whoever wants to push things forward.

@mblondel As far as leaving it as potentially a separate project, some downsides are:

  1. Users of scikit-learn who want to use MARS might not be aware of py-earth
  2. Py-earth requires a separate installation
  3. It might create issues around maintaining compatibility

If the consensus is that py-earth is better as a separate project, I can put a little more effort into making it known and available. I can at least put it on pypi. I would also still welcome a todo list of compatibility issues and any pull requests addressing those issues.

@amueller
scikit-learn member

One of the things we want to have soon is a simple test that you can import and run it on your estimators to check compatibility. Not there yet, though :-/

@jcrudy

@amueller That would make things a lot easier as far as maintaining compatibility.

@agramfort
scikit-learn member
@jcrudy

Great, thank you @agramfort. A clarification: what do you mean by checking import orders? @mehdidc, is there any part of that you're interested in taking on?

As one of these involves removing some functionality (avoiding public methods other than fit, predict, and transform), I would suggest that at least that change should be done as part of integration into sklearn. The rest can be done directly in py-earth.

@mehdidc

@jcrudy : In addition to what He suggested above, @agramfort gave a good todo list here : #4341, I am working on it.

@dsquareindia

Hey everyone, I am planning to take this project of integrating pyearth into sklearn forward. I'll create a new branch, make a new sklearn/additive, pull all the py-earth code in it and work on that. But on the other hand, to integrate into sklearn later I'll have to merge into the files such as classes.rst so I was thinking of manually adding into these files and making incremental commits so that the review process can be faster. Should I go ahead with the second method?

@mblondel
scikit-learn member

We recently created a new organization called scikit-learn-contrib whose goal is to host scikit-learn compatible projects:
https://github.com/scikit-learn-contrib

This is very much a work in progress but our goal is to gather nice projects in the scikit-learn ecosystem. We just created it so only lightning is there but more projects will be added in the future.

We are thinking py-earth could possibly be moved there. The advantage is that py-earth would become more collaborative. When lightning was hosted in my account I was a bottleneck since I was the only one who could merge PRs. But now others have joined and bugs can be fixed more quickly.

Earth / Mars is sufficiently mature to warrant inclusion in scikit-learn but moving py-earth to scikit-learn-contrib would probably involve much less work.

@agramfort
scikit-learn member
@dsquareindia

So probably one of the first things we should do is add a description of this in the scikit learn readme so that the users get to know about it.

@agramfort
scikit-learn member
@mblondel
scikit-learn member

Yes it's too early to advertise it. I just informed you so you don't loose time.

@dsquareindia

@agramfort just wanted to ask one more thing, are the remaining projects of adding GAM (and for LSS), SpAM and LISO to sklearn still of any interest?

@agramfort
scikit-learn member
@mblondel
scikit-learn member

@jcrudy This is very much a work in progress but I have uploaded the basic workflow for inclusion in scikit-learn-contrib. If you're interested in moving py-earth there, we can provide some assistance.

@jcrudy

@mblondel Yes, I would like to do that. I think it's a great idea. I'm probably too busy to look at it much for the next couple weeks, but I will as soon as I can, and I'll start with the workflow you uploaded. Is there anything else I should keep in mind?

@agramfort
scikit-learn member
@mehdidc

@agramfort Yes ! @jcrudy @mblondel Ok so I first go through the requirements then add a request in scikit-learn-contrib

@jcrudy

@mehdidc There is a branch of py-earth called rewrite_forward_pass that I've been working on. I'm planning to merge it into master in the not too distant future, but it isn't quite ready yet. It includes some bug fixes to missing data functionality and allows arbitrary (non-separable) sample weights. It also refactors/rewrites the forward pass in a way that is less messy, and uses householder reflections instead of gram-schmidt for orthogonalization of the B matrix (the main reason I did it, as I was having stability problems on a project). Just want you to be aware so you don't repeat work. If you find major changes are needed to pass the contrib requirements, let's talk before you do them. (Because it might be better to merge first.) Also, let me know if I can answer any questions or be of assistance in some other way.

@mehdidc

@jcrudy Great ! it will be nice to have arbitrary sample weights. Ok I see, thanks I will let you know

@springcoil

I see that @jcrudy made some progress with Py-Earth. Which I think make it even more appetising. I know there are lots of open PR's for Scikitlearn, but I think MARS should get in there. Has anyone made a todo list for what needs to be done for this port?

@mblondel
scikit-learn member

If it's ok with @jcrudy and @mehdidc, I think we should transfer [*] py-earth to the contrib organization and collaboratively work on meeting the requirements there. We need a project base so as to make the contrib more attractive. Currently, with lighting only, this feels a bit sad.

[*] Transferring rather than forking is important so as to not loose branches and stars.

@jcrudy

@mblondel That's fine with me. @mehdidc, will that mess up anything you're doing?

@mehdidc

@mblondel @jcrudy No, it's fine with me too

@jcrudy

@mblondel @mehdidc Okay, I'll transfer it over. Exciting times!

@jcrudy

@mblondel Do I need to be added to the contrib organization in order to make the transfer?

https://help.github.com/articles/transferring-a-repository/#transferring-from-a-user-to-an-organization

@agramfort
scikit-learn member
@jcrudy

@agramfort @mehdidc @mblondel Transfer complete. The py-earth project is now a part of scikit-learn-contrib!

How is progress on making it actually compliant with the requirements? It seems it might be actually there, since @mehdidc has addressed the problems it was having with check_estimator. If that is complete, I think the next step is to work toward a release.

@mblondel
scikit-learn member

@jcrudy Thanks a lot! You might also want to fork the project in your github. This way, the old project URL is still valid.

@mblondel mblondel closed this Mar 24, 2016
@mblondel
scikit-learn member

Closed this pull-request. Let's create an issue directly in the project to follow the progress of meeting the requirements.

@mblondel
scikit-learn member

I added a py-earth team. @jcrudy you should be able to add collaborators to the team.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment