Skip to content
This repository

Decision Trees, AdaBoost, and Bagging #134

Closed
wants to merge 41 commits into from

7 participants

Noel Dawe Fabian Pedregosa Mathieu Blondel Alexandre Gramfort Gael Varoquaux Satrajit Ghosh Olivier Grisel
Noel Dawe
ndawe commented

I have implemented decision trees from scratch in C++ which is interfaced with Cython.
AdaBoost and bagging are implemented in a new module "scikits.learn.ensemble". An implementation of gradient boosting will come soon.

For preliminary tests/sample usage see:
scikit-learn/scikits/learn/decisiontree/tests/test_decisiontree.py

ndawe added some commits
Noel Dawe ndawe adding boosting and decision trees 476824a
Noel Dawe ndawe adding bagging and gradboost 5112d7c
Noel Dawe ndawe minor change c6cd221
Noel Dawe ndawe working on interfacing with Cython eeb9ba4
Noel Dawe ndawe minor updates 2886f05
Noel Dawe ndawe Merge branch 'master' of git://github.com/scikit-learn/scikit-learn e5d8ded
Noel Dawe ndawe pull from upstream 2bc0e76
Noel Dawe ndawe Merge branch 'master' of git://github.com/scikit-learn/scikit-learn 1ed5718
Noel Dawe ndawe implemented AdaBoost 5a71b60
Noel Dawe ndawe refactoring 239cf8b
Noel Dawe ndawe minor fix 9a8cb21
Noel Dawe ndawe minor fix 17a06c7
Noel Dawe ndawe almost done... 73b1a3f
Noel Dawe ndawe it compiles\! d72d838
Noel Dawe ndawe now it really compiles 16d3990
Noel Dawe ndawe minor fix 129702f
Noel Dawe ndawe working on segfault a097bf8
Noel Dawe ndawe now it works 174fc2f
Noel Dawe ndawe trying to fix score bounds 2657354
Noel Dawe ndawe updates 6166e2e
Noel Dawe ndawe Merge branch 'master' of git://github.com/scikit-learn/scikit-learn 058f5ae
Noel Dawe ndawe sanity check in adaboost a480d43
Noel Dawe ndawe more sanity checks in adaboost 86ca325
Noel Dawe ndawe fairly stable now 8067ea1
Noel Dawe ndawe fixed bug where node cuts were not set but left at 0 79d1830
Noel Dawe ndawe working on limiting cases cf10808
Noel Dawe ndawe updates 4d1c12c
Noel Dawe ndawe fixing bug in adaboost 39e1d3e
Noel Dawe ndawe Merge branch 'master' of git://github.com/scikit-learn/scikit-learn 05b2395
Noel Dawe ndawe updates 8dc785f
Noel Dawe ndawe minor change c9a49a5
Noel Dawe ndawe Merge branch 'master' of git://github.com/scikit-learn/scikit-learn 9f4a53f
Noel Dawe ndawe bagging now implemented 8a1a996
Noel Dawe ndawe removing committee for now c278f17
Noel Dawe ndawe updates b7267f0
Noel Dawe ndawe adding tests 6c74aa1
Noel Dawe ndawe better demonstration in test module b6c89ee
Noel Dawe ndawe minor change f641291
Noel Dawe ndawe bugfix 12c577f
Fabian Pedregosa
Owner

Hi!

This is a really impressive contribution. However, to review such a big chunk is extremely time-consuming. Would it be possible that you send separate pull request for different parts, i.e. at least one for decisiontree and another one for ensemble ?

Thanks,

Fabian

Mathieu Blondel
Owner

There seems to be overlap with this pull request: #133

Alexandre Gramfort
Owner

wouhaou !

1/ +1 for 2 separate pull requests and I would start with ensemble
2/ there is indeed a big overlap with satra's pull request and his tree_model package. This will need more work to evaluate / merge / test / document.

Alexandre Gramfort
Owner

also some examples should be added. In /examples/ensemble
See also the nice examples of satra for trees.

Noel Dawe
ndawe commented

Thanks. If it makes things easier for you I will submit separate pull requests for ensemble and decisiontree.
The main idea behind separating boosting/bagging from the decision tree module is that any classifier (even regressors) can be boosted/bagged/combined and so the ensemble module should be implemented in such a general way as to support any of the scikit learners.

I noticed satra's pull request before submitting mine (to my surprise! what a coincidence...) but I wanted to still present what I have as an option. If I understand satra's code correctly, it looks like he is using some of the milk module? The problem with the milk code is that all of the tree construction is done on the python side which really hurts the performance. On the other hand, the decisiontree module I have performs all tree construction (fit) and evaluation (predict) on the compiled c++ side, making it quite fast (even on my tiny notebook). decisiontree could really benefit from the addition of other node-splitting criteria I see in the milk code though. Currently I only have the Gini index implemented.

My future decisiontree development would include implementing tree pruning with cost-complexity, more separation criteria, and support for regression and multiclass classification.

Noel Dawe
ndawe commented

I should let you know: this is my first experience on Github submitting pull requests etc. so don't hesitate in letting me know if I am not following proper procedures.

Gael Varoquaux
Owner

Impressive amount of work!

The problem with the milk code is that all of the tree construction is done on the python side which really hurts the performance. On the other hand, the decisiontree module I have performs all tree construction (fit) and evaluation (predict) on the compiled c++ side, making it quite fast (even on my tiny notebook). decisiontree could really benefit from the addition of other node-splitting criteria I see in the milk code though. Currently I only have the Gini index implemented.

Thanks for your detailed analysis. I understand that compiled code brings performance, but it is also a maintenance cost. There is a balance to find between both approach. However, looking at your code, you do have a very limited amount of compiled code.

On the topic of compiled code, I must admit that I'd rather avoid C++ code as it has semantics differences from C that people are not always aware of. I'd also prefer if things were completely in Cython, without hand-crafted C. Maybe this is not realistic. However, the current part of the code that is hard to follow are the Node.cpp and Histogram.h file. It seems that something like Histogram.h should be trivial to implement in Cython, mixing C and numpy. I wonder how much of Node.cpp could be converted to Cython without inducing a computational cost. A sizeable fraction, at first glance.

A few remarks more related to style: not all docstrings respect the numpy standard as described on
https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt
as such they will not render properly in the HTML document. In addition you have (for example in AdaBoost) some parameters in the fit method that should be attributes, set in the init, as it is the convention in the scikit. Finally you are not PEP-008 compliant.

However, the most important point to address is how to merge the good points of your implementation with Satra's work, and to try to get rid of the C++ code in favor of Cython.

Satrajit Ghosh
Owner
satra commented

hi noel,

this is great. i had no idea somebody else was working on decision trees and ensemble methods. it'll take me some time to go over the code, but i would definitely love to work together in combining this with some of the features of the pull i had submiited - probably around the same time .. dang - there should be more communication!

i understand the c++ piece in speeding up things, but the basic decision tree that i have is fairly fast and based on fabian's comments, i'm going to take out even the cython code that's there. that said, the decision tree implements the id3 algo (no pruning) and it would be good to have a version that implements C/J4.8, which i think is closest to C5 (commercial only).

i'll check out your branch and merge some of my stuff into it and then perhaps we can modify it together? also it might be good to give it a performance test (after i've removed the cython code).

as an example, a quick look suggested that you are using gini index as the impurity criterion in the splitter. i think it would be nice to have this be settable by the user (which you already have as a parameter, but it doesn't change).

what do you think?

Noel Dawe
ndawe commented

Hi Gael,

Thanks for your comments/suggestions.

About a year ago I approached Fabian about implementing decision trees and he recommended using Cython only for interfacing the compiled C/C++ for performance reasons. Has this preference changed?

Noel Dawe
ndawe commented

Hi Satra,

I think that's a great idea. It's too bad we didn't connect earlier. Let's combine forces :)

Satrajit Ghosh
Owner
satra commented
Olivier Grisel
Owner

I am also in favor of using cython instead of c++ to make the code more readable / maintainable by the scikit-learn community that is more familiar with the python syntax than c++ idioms.

BTW: thanks to you and satra for you those contributions, the scikit was really lacking in this domain.

Olivier Grisel
Owner

@satra please use larger datasets than iris for performance evaluation: iris is really a toy problem: low dim, low number of samples and linearly separable: not representative at all of datasets where ensemble methods shine...

You could use the dataset used by https://github.com/scikit-learn/ml-benchmarks that is less trivial. That also allow us to compare those methods with existing methods implemented in the scikit and else where:

Satrajit Ghosh
Owner
satra commented
Gael Varoquaux
Owner

About a year ago I approached Fabian about implementing decision trees and he recommended using Cython only for interfacing the compiled C/C++ for performance reasons. Has this preference changed?

My preference has not changed (although the more I use Cython, the more I like it). It is possible that Fabian's did. Sorry for the bumpy ride. Hopefully that little extra work, combined with Satra's effort, will enable you to get even better code.

I agree with Satra and Olivier that the road forward is to benchmark on real datasets and to take decision based the performances.

Fabian Pedregosa
Owner

Having to deal with a lot of legacy C/C++ in the last year has indeed changed my mind. Cython is definitely better for maintainability.

Mathieu Blondel
Owner

I also think that tree manipulation must be done in compiled code to achieve good performance. The think that worries me is that manipulating tree nodes involves working with pointers and so a Cython module would include a lot of hybrid code (C++-style and Python-style). Personally, when the code becomes hybrid like this, I prefer to write everything in C++ and make a Cython wrapper.

Olivier Grisel
Owner

You can use cython to define arbitrarily structured extension types quite naturally

http://docs.cython.org/src/userguide/extension_types.html

And the fields of the extension types can be recursive pointers to the same type (e.g. a pair of Node pointers to form a binary tree). I don't really see what the c++ + cython wrapper combo would bring in term of maintainability.

Mathieu Blondel
Owner

I think we could ask people to announce on the ML what they plan to work on prior to start coding. This way, we can point them in the right direction at an early stage and there will be less overlap between people's pull requests.

Olivier Grisel
Owner

Yes the contributors guide need an update.

Fabian Pedregosa
Owner

Either way, no matter what language you choose, the things that will ensure that your code will get in are:

  • Start with a small patch. You will get a lot of feedback that you can use in subsequent work. Otherwise nobody will want to review your work.

  • Make sure you've run pep8 and pyflaks on your code. This avoids unnecessary iterations and lets reviewer focus on more important things.

  • Write docstrings and test for each important function.

    • Write at least one example and some narrative (RST) docs.

Best,

Fabian

Noel Dawe
ndawe commented

I am closing this pull request and will submit a new one including only the ensemble module after I clean it up. satra and I will continue working on the decision tree module.

Noel Dawe ndawe closed this
Gael Varoquaux
Owner

Thanks a lot for such a positive attitude. It is very beneficial for the code.

Don't hesitate to ask for feedback on the code even when it is in unfinished code or to ask for advice on specific points.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Showing 41 unique commits by 1 author.

Mar 31, 2011
Noel Dawe ndawe adding boosting and decision trees 476824a
Noel Dawe ndawe adding bagging and gradboost 5112d7c
Noel Dawe ndawe minor change c6cd221
Noel Dawe ndawe working on interfacing with Cython eeb9ba4
Apr 01, 2011
Noel Dawe ndawe minor updates 2886f05
Apr 02, 2011
Noel Dawe ndawe Merge branch 'master' of git://github.com/scikit-learn/scikit-learn e5d8ded
Noel Dawe ndawe pull from upstream 2bc0e76
Apr 05, 2011
Noel Dawe ndawe Merge branch 'master' of git://github.com/scikit-learn/scikit-learn 1ed5718
Noel Dawe ndawe implemented AdaBoost 5a71b60
Noel Dawe ndawe refactoring 239cf8b
Noel Dawe ndawe minor fix 9a8cb21
Noel Dawe ndawe minor fix 17a06c7
Noel Dawe ndawe almost done... 73b1a3f
Apr 06, 2011
Noel Dawe ndawe it compiles\! d72d838
Noel Dawe ndawe now it really compiles 16d3990
Noel Dawe ndawe minor fix 129702f
Noel Dawe ndawe working on segfault a097bf8
Noel Dawe ndawe now it works 174fc2f
Noel Dawe ndawe trying to fix score bounds 2657354
Noel Dawe ndawe updates 6166e2e
Noel Dawe ndawe Merge branch 'master' of git://github.com/scikit-learn/scikit-learn 058f5ae
Noel Dawe ndawe sanity check in adaboost a480d43
Noel Dawe ndawe more sanity checks in adaboost 86ca325
Apr 07, 2011
Noel Dawe ndawe fairly stable now 8067ea1
Noel Dawe ndawe fixed bug where node cuts were not set but left at 0 79d1830
Noel Dawe ndawe working on limiting cases cf10808
Noel Dawe ndawe updates 4d1c12c
Noel Dawe ndawe fixing bug in adaboost 39e1d3e
Noel Dawe ndawe Merge branch 'master' of git://github.com/scikit-learn/scikit-learn 05b2395
Noel Dawe ndawe updates 8dc785f
Apr 08, 2011
Noel Dawe ndawe minor change c9a49a5
Noel Dawe ndawe Merge branch 'master' of git://github.com/scikit-learn/scikit-learn 9f4a53f
Noel Dawe ndawe bagging now implemented 8a1a996
Noel Dawe ndawe removing committee for now c278f17
Noel Dawe ndawe updates b7267f0
Noel Dawe ndawe adding tests 6c74aa1
Noel Dawe ndawe better demonstration in test module b6c89ee
Noel Dawe ndawe minor change f641291
Noel Dawe ndawe bugfix 12c577f
Noel Dawe ndawe Merge branch 'master' of git://github.com/scikit-learn/scikit-learn c2f3ac9
Noel Dawe ndawe minor change deed432
Something went wrong with that request. Please try again.