Decision Trees, AdaBoost, and Bagging #134

Closed · wants to merge 41 commits · 7 participants

@ndawe
scikit-learn member

I have implemented decision trees from scratch in C++, interfaced through Cython.
AdaBoost and bagging are implemented in a new module, "scikits.learn.ensemble". An implementation of gradient boosting will follow soon.

For preliminary tests/sample usage see:
scikit-learn/scikits/learn/decisiontree/tests/test_decisiontree.py

ndawe added some commits Mar 31, 2011
@ndawe ndawe adding boosting and decision trees 476824a
@ndawe ndawe adding bagging and gradboost 5112d7c
@ndawe ndawe minor change c6cd221
@ndawe ndawe working on interfacing with Cython eeb9ba4
@ndawe ndawe minor updates 2886f05
@ndawe ndawe Merge branch 'master' of git://github.com/scikit-learn/scikit-learn e5d8ded
@ndawe ndawe pull from upstream 2bc0e76
@ndawe ndawe Merge branch 'master' of git://github.com/scikit-learn/scikit-learn 1ed5718
@ndawe ndawe implemented AdaBoost 5a71b60
@ndawe ndawe refactoring 239cf8b
@ndawe ndawe minor fix 9a8cb21
@ndawe ndawe minor fix 17a06c7
@ndawe ndawe almost done... 73b1a3f
@ndawe ndawe it compiles! d72d838
@ndawe ndawe now it really compiles 16d3990
@ndawe ndawe minor fix 129702f
@ndawe ndawe working on segfault a097bf8
@ndawe ndawe now it works 174fc2f
@ndawe ndawe trying to fix score bounds 2657354
@ndawe ndawe updates 6166e2e
@ndawe ndawe Merge branch 'master' of git://github.com/scikit-learn/scikit-learn 058f5ae
@ndawe ndawe sanity check in adaboost a480d43
@ndawe ndawe more sanity checks in adaboost 86ca325
@ndawe ndawe fairly stable now 8067ea1
@ndawe ndawe fixed bug where node cuts were not set but left at 0 79d1830
@ndawe ndawe working on limiting cases cf10808
@ndawe ndawe updates 4d1c12c
@ndawe ndawe fixing bug in adaboost 39e1d3e
@ndawe ndawe Merge branch 'master' of git://github.com/scikit-learn/scikit-learn 05b2395
@ndawe ndawe updates 8dc785f
@ndawe ndawe minor change c9a49a5
@ndawe ndawe Merge branch 'master' of git://github.com/scikit-learn/scikit-learn 9f4a53f
@ndawe ndawe bagging now implemented 8a1a996
@ndawe ndawe removing committee for now c278f17
@ndawe ndawe updates b7267f0
@ndawe ndawe adding tests 6c74aa1
@ndawe ndawe better demonstration in test module b6c89ee
@ndawe ndawe minor change f641291
@ndawe ndawe bugfix 12c577f
@fabianp
scikit-learn member

Hi!

This is a really impressive contribution. However, reviewing such a big chunk is extremely time-consuming. Would it be possible for you to send separate pull requests for the different parts, i.e. at least one for decisiontree and another for ensemble?

Thanks,

Fabian

@mblondel
scikit-learn member

There seems to be overlap with this pull request: #133

@agramfort
scikit-learn member

Wow!

1/ +1 for 2 separate pull requests and I would start with ensemble
2/ there is indeed a big overlap with satra's pull request and his tree_model package. This will need more work to evaluate / merge / test / document.

@agramfort
scikit-learn member

Also, some examples should be added in /examples/ensemble.
See also satra's nice examples for trees.

@ndawe
scikit-learn member

Thanks. If it makes things easier for you, I will submit separate pull requests for ensemble and decisiontree.
The main idea behind separating boosting/bagging from the decision tree module is that any estimator (even regressors) can be boosted/bagged/combined, so the ensemble module should be implemented generally enough to support any of the scikit learners.
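That design ("any estimator can be boosted") is easy to see in a sketch of discrete AdaBoost (AdaBoost.M1) over an arbitrary base learner. This is a hedged illustration of the general idea, not the code in this branch; `DecisionStump`, `adaboost_fit`, and `adaboost_predict` are names made up for the example:

```python
import numpy as np

def adaboost_fit(make_estimator, X, y, n_estimators=10):
    """Discrete AdaBoost.M1 for binary labels in {-1, +1}.

    make_estimator() must return a fresh object exposing
    fit(X, y, sample_weight=...) and predict(X): any scikit learner
    with that interface could be plugged in, which is the point of
    keeping ensemble separate from decisiontree.
    """
    n_samples = X.shape[0]
    w = np.full(n_samples, 1.0 / n_samples)   # uniform initial weights
    estimators, alphas = [], []
    for _ in range(n_estimators):
        est = make_estimator()
        est.fit(X, y, sample_weight=w)
        pred = est.predict(X)
        err = w[pred != y].sum()
        if err >= 0.5:                        # no better than chance: stop
            break
        err = max(err, 1e-10)                 # avoid a divide-by-zero on a perfect fit
        alpha = 0.5 * np.log((1.0 - err) / err)
        w *= np.exp(-alpha * y * pred)        # up-weight the mistakes
        w /= w.sum()
        estimators.append(est)
        alphas.append(alpha)
    return estimators, alphas

def adaboost_predict(estimators, alphas, X):
    """Weighted-majority vote of the boosted estimators."""
    score = sum(a * est.predict(X) for est, a in zip(estimators, alphas))
    return np.sign(score)

class DecisionStump:
    """Tiny weighted base learner: one threshold on one feature."""
    def fit(self, X, y, sample_weight):
        best_err = np.inf
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = np.where(X[:, j] <= t, -sign, sign)
                    err = sample_weight[pred != y].sum()
                    if err < best_err:
                        best_err = err
                        self.feature, self.threshold, self.sign = j, t, sign
        return self

    def predict(self, X):
        return np.where(X[:, self.feature] <= self.threshold,
                        -self.sign, self.sign)
```

Bagging fits the same template: resample (X, y) with replacement each round and average the predictions instead of weighting them.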

I noticed satra's pull request before submitting mine (to my surprise, what a coincidence!) but I still wanted to present what I have as an option. If I understand satra's code correctly, he is using parts of the milk module? The problem with the milk code is that all of the tree construction is done on the Python side, which really hurts performance. The decisiontree module here, on the other hand, performs all tree construction (fit) and evaluation (predict) on the compiled C++ side, making it quite fast (even on my tiny notebook). decisiontree could really benefit from the other node-splitting criteria I see in the milk code, though: currently I only have the Gini index implemented.
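For reference, the Gini index is short to state; here is a plain-numpy sketch (illustrative only, not the C++ implementation in this branch; the function names are invented for the example):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array: 1 - sum_k p_k^2.

    0.0 for a pure node; maximal (1 - 1/K) for a uniform mix of K classes.
    """
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(y, mask):
    """Weighted impurity decrease of splitting y into y[mask] / y[~mask].

    A splitter picks the feature/threshold pair maximizing this quantity.
    """
    n = len(y)
    n_left = mask.sum()
    n_right = n - n_left
    return (gini(y)
            - (n_left / n) * gini(y[mask])
            - (n_right / n) * gini(y[~mask]))
```

Other criteria (entropy, misclassification rate) drop in by swapping the impurity function.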

My future decisiontree development would include cost-complexity pruning, more splitting criteria, and support for regression and multiclass classification.

@ndawe
scikit-learn member

I should let you know: this is my first experience on GitHub submitting pull requests etc., so don't hesitate to let me know if I am not following proper procedure.

@GaelVaroquaux
scikit-learn member

Impressive amount of work!

> The problem with the milk code is that all of the tree construction is done on the python side which really hurts the performance. On the other hand, the decisiontree module I have performs all tree construction (fit) and evaluation (predict) on the compiled c++ side, making it quite fast (even on my tiny notebook). decisiontree could really benefit from the addition of other node-splitting criteria I see in the milk code though. Currently I only have the Gini index implemented.

Thanks for your detailed analysis. I understand that compiled code brings performance, but it is also a maintenance cost; there is a balance to find between the two approaches. However, looking at your code, you have a very limited amount of compiled code.

On the topic of compiled code, I must admit that I'd rather avoid C++, as it has semantic differences from C that people are not always aware of. I'd also prefer if things were completely in Cython, without hand-crafted C. Maybe this is not realistic. However, the parts of the code that are currently hard to follow are the Node.cpp and Histogram.h files. Something like Histogram.h should be trivial to implement in Cython, mixing C and numpy, and I wonder how much of Node.cpp could be converted to Cython without inducing a computational cost. A sizeable fraction, at first glance.
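As a rough indication of why something like Histogram.h ports easily: the core operation of a fixed-binning weighted histogram is a few lines of numpy, and a Cython version would look much the same with typed loops. A sketch (the function name and signature are invented for this example, not taken from the branch):

```python
import numpy as np

def weighted_histogram(values, weights, n_bins, lo, hi):
    """Fixed-width weighted histogram over [lo, hi).

    Equivalent in spirit to a hand-rolled C histogram: compute the bin
    index of each value, then accumulate the weights per bin.
    """
    idx = np.floor((values - lo) / (hi - lo) * n_bins).astype(int)
    idx = np.clip(idx, 0, n_bins - 1)  # clamp under/overflow into the edge bins
    return np.bincount(idx, weights=weights, minlength=n_bins)
```

In Cython the same function would type `idx` and loop bounds as C ints and fill a preallocated array, with no Python overhead in the inner loop.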

A few more remarks related to style: not all docstrings follow the numpy standard described at
https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt
and as such they will not render properly in the HTML documentation. In addition, you have (for example in AdaBoost) some parameters in the fit method that should be attributes set in __init__, as is the convention in the scikit. Finally, the code is not PEP 8 compliant.

However, the most important point to address is how to merge the good points of your implementation with Satra's work, and to try to get rid of the C++ code in favor of Cython.

@satra
scikit-learn member

hi noel,

this is great. i had no idea somebody else was working on decision trees and ensemble methods. it'll take me some time to go over the code, but i would definitely love to work together on combining this with some of the features of the pull i had submitted, probably around the same time .. dang, there should be more communication!

i understand the c++ piece in speeding things up, but the basic decision tree that i have is fairly fast, and based on fabian's comments, i'm going to take out even the cython code that's there. that said, my decision tree implements the ID3 algorithm (no pruning) and it would be good to have a version that implements C4.5/J48, which i think is closest to C5.0 (commercial only).

i'll check out your branch and merge some of my stuff into it and then perhaps we can modify it together? also it might be good to give it a performance test (after i've removed the cython code).

as an example, a quick look suggested that you are using gini index as the impurity criterion in the splitter. i think it would be nice to have this be settable by the user (which you already have as a parameter, but it doesn't change).

what do you think?

@ndawe
scikit-learn member

Hi Gael,

Thanks for your comments/suggestions.

About a year ago I approached Fabian about implementing decision trees and he recommended using Cython only for interfacing the compiled C/C++ for performance reasons. Has this preference changed?

@ndawe
scikit-learn member

Hi Satra,

I think that's a great idea. It's too bad we didn't connect earlier. Let's combine forces :)

@ogrisel
scikit-learn member

I am also in favor of using cython instead of c++ to make the code more readable / maintainable by the scikit-learn community, which is more familiar with python syntax than c++ idioms.

BTW: thanks to you and satra for those contributions; the scikit was really lacking in this domain.

@ogrisel
scikit-learn member

@satra please use larger datasets than iris for performance evaluation. iris is really a toy problem (low dimension, few samples, linearly separable) and not representative at all of the datasets where ensemble methods shine...

You could use the datasets used by https://github.com/scikit-learn/ml-benchmarks, which are less trivial. That would also allow us to compare these methods with existing methods implemented in the scikit and elsewhere.

@GaelVaroquaux
scikit-learn member

> About a year ago I approached Fabian about implementing decision trees and he recommended using Cython only for interfacing the compiled C/C++ for performance reasons. Has this preference changed?

My preference has not changed (although the more I use Cython, the more I like it). It is possible that Fabian's did. Sorry for the bumpy ride. Hopefully that little extra work, combined with Satra's effort, will enable you to get even better code.

I agree with Satra and Olivier that the road forward is to benchmark on real datasets and to make decisions based on the performance.

@fabianp
scikit-learn member

Having to deal with a lot of legacy C/C++ in the last year has indeed changed my mind. Cython is definitely better for maintainability.

@mblondel
scikit-learn member

I also think that tree manipulation must be done in compiled code to achieve good performance. The thing that worries me is that manipulating tree nodes involves working with pointers, so a Cython module would include a lot of hybrid code (C++-style and Python-style). Personally, when the code becomes hybrid like this, I prefer to write everything in C++ and make a Cython wrapper.

@ogrisel
scikit-learn member

You can use cython to define arbitrarily structured extension types quite naturally:

http://docs.cython.org/src/userguide/extension_types.html

And the fields of the extension types can be recursive pointers to the same type (e.g. a pair of Node pointers to form a binary tree). I don't really see what the C++ plus Cython wrapper combo would bring in terms of maintainability.
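In pure Python the recursive-node idea looks like the sketch below; a Cython cdef class would have the same shape, with the `left`/`right` fields typed as references to the same class. `Node` and its attributes are illustrative, not actual structures from either branch:

```python
class Node:
    """Binary decision-tree node; leaves carry a value, internal nodes a split.

    In Cython this would be a cdef class whose left/right fields are
    typed references to Node, giving the recursive tree structure
    without any hand-written C++.
    """
    def __init__(self, feature=None, threshold=None, value=None,
                 left=None, right=None):
        self.feature = feature      # index of the feature to split on
        self.threshold = threshold  # split point
        self.value = value          # prediction stored at a leaf
        self.left = left            # subtree for x[feature] <= threshold
        self.right = right          # subtree for x[feature] >  threshold

    def predict_one(self, x):
        """Walk the tree down to a leaf for a single sample x."""
        if self.value is not None:  # leaf
            return self.value
        child = self.left if x[self.feature] <= self.threshold else self.right
        return child.predict_one(x)
```

A `fit` routine would grow this structure by repeatedly choosing the best split and recursing on the two partitions.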

@mblondel
scikit-learn member

I think we could ask people to announce on the ML what they plan to work on before they start coding. This way we can point them in the right direction at an early stage, and there will be less overlap between people's pull requests.

@ogrisel
scikit-learn member

Yes, the contributors guide needs an update.

@fabianp
scikit-learn member

Either way, no matter what language you choose, the things that will ensure that your code will get in are:

  • Start with a small patch. You will get a lot of feedback that you can use in subsequent work. Otherwise nobody will want to review your work.

  • Make sure you've run pep8 and pyflakes on your code. This avoids unnecessary iterations and lets reviewers focus on more important things.

  • Write docstrings and tests for each important function.

  • Write at least one example and some narrative (RST) docs.

Best,

Fabian

@ndawe
scikit-learn member

I am closing this pull request and will submit a new one including only the ensemble module after I clean it up. satra and I will continue working on the decision tree module.

@ndawe ndawe closed this Apr 10, 2011
@GaelVaroquaux
scikit-learn member

Thanks a lot for such a positive attitude. It is very beneficial for the code.

Don't hesitate to ask for feedback on the code even while it is unfinished, or to ask for advice on specific points.
