[MRG] Complete rewrite of the tree module #2131

Merged
124 commits merged on Jul 22, 2013

glouppe commented Jul 6, 2013

Here is some good news for Kaggle competitors... :)

In this PR, I propose a complete rewrite of the core tree module (_tree.pyx) and of all tree-dependent estimators. In particular, this new implementation factors out the splitting strategy at the core of the tree construction process. Such a strategy is now implemented in a Splitter object, as specified by the new interface in _tree.pxd. As of now, this PR provides two splitting strategies: BestSplitter, which finds the best split in a node (this is CART), and RandomSplitter, which finds the best random split (this is Extra-Trees).
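The real interface is declared in Cython in _tree.pxd; purely as an illustrative Python-level sketch of the idea (class and method names below are made up, not the actual declarations), a pluggable splitter could look like:

```python
import numpy as np

class Splitter:
    """Illustrative sketch of a pluggable splitting strategy (hypothetical
    Python API; the real interface is Cython code in _tree.pxd)."""

    def init(self, X, y, random_state):
        self.X, self.y, self.random_state = X, y, random_state

    def node_split(self, sample_indices):
        """Return a (feature, threshold) pair for the given node."""
        raise NotImplementedError

class RandomSplitter(Splitter):
    """Extra-Trees-style strategy: pick a random feature, then draw the
    threshold uniformly between its min and max in the node -- no sorting."""

    def node_split(self, sample_indices):
        f = self.random_state.randint(self.X.shape[1])
        values = self.X[sample_indices, f]
        lo, hi = values.min(), values.max()
        threshold = lo + self.random_state.random_sample() * (hi - lo)
        return f, threshold
```

A BestSplitter would instead scan sorted feature values for the impurity-optimal threshold; since the tree builder only talks to the shared interface, a binning-based strategy could be plugged in the same way.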

The PR addresses three issues with the module: modularity, space and speed.

  1. Modularity: It is now more convenient to write a new splitting strategy. For example, one could now easily write a splitting strategy based on binning and plug it into the module.

I think it is also possible to reimplement our old X_argsorted strategy in a shared Splitter object. We will have to carry out some benchmarks though, to see if it is worth it.

  2. Space: no more X_argsorted, no more sample_mask, no more min_density stuff. No more duplicates of X! A tree is now built directly on the original version of X.

  3. Speed: Both Random Forest and Extra-Trees have been sped up. The most significant improvement benefits Extra-Trees, which are now properly implemented, i.e. without any sorting.

As a small benchmark, I did a little experiment on mnist3vs8 (~12,000 samples, 784 features, 2 classes):

Parameters:
n_estimators=10
max_features=5
bootstrap=False
random_state=0
all other defaults

master
    RandomForestClassifier
        In [4]: %timeit -n 5 clf.fit(X_train, y_train)
        5 loops, best of 3: 22.2 s per loop

    ExtraTreesClassifier
        In [7]: %timeit -n 5 clf.fit(X_train, y_train)
        5 loops, best of 3: 25.8 s per loop

trees-v2
    RandomForestClassifier
        In [5]: %timeit -n 5 clf.fit(X_train, y_train)
        5 loops, best of 3: 1.33 s per loop
        => Speedup = 16.7x

    ExtraTreesClassifier
        In [9]: %timeit -n 5 clf.fit(X_train, y_train)
        5 loops, best of 3: 757 ms per loop
        => Speedup = 34.08x

I think the figures speak for themselves... :-)


This is still a work in progress and a lot of work still needs to be done. However, all tests in test_tree.py and test_forest.py already pass.

On my todo list:

  • Fix the GBRT module for the new interface
  • Fix the AdaBoost module for the new interface
  • Deprecate removed parameters
  • Update documentation
  • More benchmarks
  • PEP8 / Flake
  • Check the impact of this refactoring on the RandomState unpickling perf issue from #1622

Once all of this is done, I'll call for reviews. My goal is to have this beast merged in during the sprint. I think it is shaping up very well... :)

CC: @pprett @bdholt1 @ndawe @arjoly

jakevdp commented Jul 6, 2013

This looks awesome! Though it might ruin wise.io's favorite speed demo... 😄

satra commented Jul 6, 2013

This is great.

P.S. At the recent Pattern Recognition in Neuroimaging conference, many people were introduced to random forests and extra-trees in a tutorial; this will benefit all of those folks as well.

pprett commented Jul 6, 2013

aahhhh... I'm so excited - great work Gilles!

I will get GBRT running in no time.


pprett commented Jul 6, 2013

@glouppe could you make a gist with the benchmark or is it in the repo?

glouppe commented Jul 6, 2013

@pprett The benchmark is not in the repo. I ran the following script inside IPython:

import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
data = np.load("/home/gilles/PhD/db/data/mnist3vs8.npz")
X_train = data["X_train"]
y_train = data["y_train"]
X_test = data["X_test"]
y_test = data["y_test"]

clf = RandomForestClassifier(n_estimators=10, max_features=5, bootstrap=False)
%timeit -n 5 clf.fit(X_train, y_train)

So, nothing very scientific. We should make a proper script to benchmark the two implementations in more detail (this is on the todo list).

pprett commented Jul 6, 2013

in _tree.pyx, line 88:

 def __cinit__(self, SIZE_t n_outputs, object n_classes):

shouldn't n_classes be an np.ndarray of dtype SIZE_t, or a SIZE_t[::1] typed memoryview?

(doesn't matter much performance wise but the yellow stain in the annotated cython hurts my eyes)

pprett commented Jul 6, 2013

Currently, you call the following line in each iteration of the max_features loop:

f_j = random_state.randint(0, n_features - f_idx)

this calls into the interpreter; for large max_features it might be better to create a random array of size max_features and then loop over that (alternatively, maybe we can use the C API of RandomState)
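For context, the surrounding loop implements a partial Fisher-Yates shuffle to draw features without replacement; a rough Python equivalent of the pattern (hypothetical helper, with one interpreter-level randint call per draw, which is the overhead noted here):

```python
import numpy as np

def sample_features(n_features, max_features, random_state):
    """Hypothetical helper sketching the pattern under discussion: a partial
    Fisher-Yates shuffle drawing max_features distinct feature indices,
    with one Python-level randint call per draw."""
    features = np.arange(n_features)
    for f_idx in range(max_features):
        # Pick uniformly among the features not yet drawn.
        f_j = f_idx + random_state.randint(0, n_features - f_idx)
        features[f_idx], features[f_j] = features[f_j], features[f_idx]
    return features[:max_features]

rng = np.random.RandomState(0)
chosen = sample_features(100, 10, rng)
```

Pre-drawing a fixed-size batch of random numbers is awkward here because the upper bound shrinks with every draw, which is exactly the complication discussed below.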

glouppe commented Jul 6, 2013

it might be better to create a random array of size max_features and then loop over that

The thing is, we may need to visit more than max_features features (because e.g, some may be constant for that node and should not be counted among the max_features features to be considered).

I'll dig into the C API of RandomState and try to solve that.

GaelVaroquaux commented Jul 7, 2013

This looks awesome! Though it might ruin wise.io's favorite speed demo...

It's already fake: if you choose reasonable settings you don't achieve speedups as large as they claim.

glouppe commented Jul 7, 2013

@pprett Both of your comments have been addressed. However, GBRT still happens to be slower than before. I think this comes from the new default splitting strategy in decision trees. As you build more and more trees, pre-sorting X compensates for, and eventually beats, sorting the features on the fly at each node. I'll work on a new splitter starting tomorrow.

pprett commented Jul 7, 2013

@glouppe I can do that too - already started some coding - for now I just wanted to write a splitter that implements the same strategy as in our current impl (presorting + sample mask).

Do you see a better way to incorporate presorting?

glouppe commented Jul 7, 2013

A better strategy is the one from Breiman's code: X_argsorted is computed once. Then, for each tree, it is duplicated and then rearranged inplace when splitting a node. This is fast, but requires additional memory space.

(I'll be away today, we can discuss this tomorrow.)
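A rough Python sketch of that bookkeeping (hypothetical helper; the real code works in place in Cython): each column of the per-tree copy of X_argsorted is stably partitioned around the chosen split, so both children receive already-sorted index ranges and never need to re-sort.

```python
import numpy as np

def split_argsorted(X_argsorted, X, start, end, feature, threshold):
    """Hypothetical sketch of the strategy described above. X_argsorted[:, f]
    holds sample indices sorted by feature f, in a per-tree copy. After a
    split is chosen, every column of the node's range is stably partitioned
    so that left-child samples precede right-child ones; each side then
    remains sorted."""
    goes_left = X[:, feature] <= threshold
    n_left = 0
    for f in range(X_argsorted.shape[1]):
        segment = X_argsorted[start:end, f]
        mask = goes_left[segment]
        X_argsorted[start:end, f] = np.concatenate([segment[mask],
                                                    segment[~mask]])
        n_left = int(mask.sum())
    return start + n_left  # where the right child's range begins

rng = np.random.RandomState(0)
X = rng.rand(8, 3)
X_argsorted = np.argsort(X, axis=0)
pos = split_argsorted(X_argsorted, X, 0, 8, 0, float(np.median(X[:, 0])))
```

The extra memory cost mentioned above is the per-tree copy of X_argsorted, an (n_samples, n_features) index array.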

pprett commented Jul 7, 2013

ok - thanks


pprett commented Jul 7, 2013

another random thought: currently you sort the data at the end of find_split. Since find_split is called for each node (internal and leaf), it might be a bit faster to sort only if the node is internal.

What do you think, does it matter? For deep trees and low min_samples_leaf it might not matter much...

ogrisel commented Jul 7, 2013

It seems that most of the code of method functions like find_split does not need to call into Python at all. In that case it would be interesting to add the nogil marker either for the whole function or for the CPU intensive inner loops in with nogil blocks.

http://docs.cython.org/src/userguide/external_C_code.html#releasing-the-gil

That way, a threadpool-based version of joblib.Parallel (TODO) could efficiently train random forests on multicore machines without incurring any memory copy of the dataset.

We can discuss this further during the next sprint, but adding nogil declarations right now will make it easier to prototype with threads later.

glouppe commented Jul 8, 2013

Regarding the wise.io implementation, I just followed the benchmark they posted some time ago (http://about.wise.io/blog/2012/11/22/wiserf-introduction-and-benchmarks/) and compared computing times with this branch. The full benchmark script is available at https://gist.github.com/glouppe/5949526

I used WiseRF 1.5.6 (as available on their website).

On my machine, results are the following:

RandomForestClassifier:  Accuracy: 0.95  34.61s
ExtraTreesClassifier:    Accuracy: 0.96  26.65s
WiseRFClassifier:        Accuracy: 0.95  28.48s

Take-home message: according to their blog post, WiseRFClassifier was 7.19x faster than RandomForestClassifier. It is now only 1.21x faster. Barely noticeable. Still, this means we have some room for improvement. Let's beat that!

Also, ExtraTreesClassifier is confirmed to be faster than both of them - while being more accurate! Really, why bother finding the very best splits? ;-)

arjoly commented Jul 8, 2013

Nice !!! :-)

ogrisel commented Jul 8, 2013

Very cool. I was already sold on extra trees but it's nice to see it confirmed once more.

ogrisel commented Jul 8, 2013

What is the current performance impact on Adaboost and GBRT?

glouppe commented Jul 8, 2013

What is the current performance impact on Adaboost and GBRT?

Currently, this branch is slower than master for GBRT. We have been profiling the code with Peter, and it turns out that pre-sorting X is a better strategy for shallow trees with a large max_features value (as with the stumps used in GBRT). To solve that, we plan to write a third Splitter object tuned for that use case, roughly reimplementing our old strategy (pre-sorting X once and for all trees).

ogrisel commented Jul 8, 2013

Yes this is what I understood from the previous exchanges. What I wanted to know was more actual numbers: is it 10%, 50% or 200% slower or does it really depend on the size of the dataset?

glouppe commented Jul 8, 2013

Yes this is what I understood from the previous exchanges. What I wanted to know was more actual numbers: is it 10%, 50% or 200% slower or does it really depend on the size of the dataset?

About twice as slow on the benchmarks we did.

(On some others, it was faster though... so this is really parameter dependent.)

yang commented Jul 8, 2013

@glouppe Awesome work! For those of us who have been curious about how wise.io's RFs were so much faster, what do you think was the most important change for the performance gains?

glouppe commented Jul 8, 2013

@yang For historical reasons, I would say: finding a split within a node was previously linear in the total number of samples in X. It is now linear in the number of samples falling into that node, as it should always have been. The new implementation is also organized in a better way, which makes it easy to implement various splitting strategies. In particular, Extra-Trees now benefit from their own Splitter object, which finds splits without having to sort features. Previously, the implementation of Extra-Trees was closely tied to that of Random Forest, which made them actually quite inefficient... Regarding wise.io, I wouldn't say, as they have always remained quite secretive about implementation details, but I don't think they do anything magical. Just a plain, classical, but well coded, Random Forest algorithm.
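As a toy illustration of that complexity change (made-up sizes, not the actual tree code): the old sample_mask approach scans one entry per sample in X at every node, whereas indexing the node's own sample indices touches only those.

```python
import numpy as np

n_total, n_node = 1_000_000, 500
rng = np.random.RandomState(0)
X_col = rng.rand(n_total)                  # one feature column

node_samples = np.arange(n_node)           # indices reaching this node
sample_mask = np.zeros(n_total, dtype=bool)
sample_mask[node_samples] = True

# Old strategy: every node scans the full-length mask, O(n_total) work.
old_values = X_col[sample_mask]

# New strategy: touch only this node's samples, O(n_node) work.
new_values = X_col[node_samples]

assert np.array_equal(np.sort(old_values), np.sort(new_values))
```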

# Define a datatype for the data array
DTYPE = np.float32
ctypedef np.float32_t DTYPE_t
ctypedef np.npy_intp SIZE_t

larsmans commented Jul 9, 2013

Good to see npy_intp in use!


larsmans commented Jul 9, 2013

How would you handle negative sample weights, apart from throwing an exception?

glouppe commented Jul 9, 2013

Just redid the benchmark:

RandomForestClassifier:  Accuracy: 0.95  27.13s
ExtraTreesClassifier:    Accuracy: 0.95  22.17s
WiseRFClassifier:        Accuracy: 0.95  27.68s

Load on my machine seemed to have affected computing times :-)

(Anyway, for a proper comparison, one should run that benchmark several times and then make a t-test to assess the significance.)
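Such a check could be sketched with scipy's independent two-sample t-test; the timings below are invented purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical repeated wall-clock timings, in seconds.
times_master = np.array([27.1, 27.8, 26.9, 27.4, 28.0])
times_branch = np.array([22.2, 22.9, 21.8, 22.5, 22.1])

t_stat, p_value = stats.ttest_ind(times_master, times_branch)
# A small p-value suggests the gap is not just machine-load noise.
print("t = %.2f, p = %.4f" % (t_stat, p_value))
```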

arjoly commented Jul 12, 2013

Thanks for taking my comment into account (commit 54b77e1 to b26273c).

pprett commented Jul 12, 2013

I did some benchmarks with GBRT on covertype - here are the results

I've used the bench_covtype.py script but changed the GBRT params as follows:

GradientBoostingClassifier(n_estimators=200, min_samples_split=5,
                           max_features='log2', max_depth=6, subsample=0.2,
                           verbose=3, random_state=opts.random_seed)
version   train time   test time   error rate
tree-v2   472.6818s    1.5883s     0.1489
master    219.1950s    0.5327s     0.1332

Two things seem strange: a) test time on tree-v2 is much larger -- this shouldn't be the case unless the trees have been built differently (maybe the random state is used differently in master vs tree-v2?);
b) train time is 2x faster on master even though the tree depth is rather large -- maybe this interacts with subsample?

pprett commented Jul 12, 2013

Here I have a comparison on a number of datasets -- again using fairly deep trees (depth=6).

For classification I used the following params:

classification_params = {'loss': 'deviance', 'n_estimators': 250,
                         'min_samples_leaf': 1, 'max_depth': 6,
                         'learning_rate': .6, 'subsample': 1.0}

[benchmark chart: master-treev2-clf]

For regression I used the params:

regression_params = {'n_estimators': 250, 'max_depth': 6,
                     'min_samples_leaf': 1, 'learning_rate': 0.1,
                     'loss': 'ls'}

[benchmark chart: master-treev2-reg]

ogrisel commented Jul 12, 2013

The test time change is strange. Could you output the mean effective size of the trees in the ensembles? Maybe an implementation detail has changed the way the trees are actually built.
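For instance, the mean effective size could be reported from the fitted estimators (a sketch on synthetic data; tree_.node_count is the per-tree node total):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the real check would use the benchmark datasets.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Mean number of nodes per tree in the ensemble.
mean_nodes = np.mean([est.tree_.node_count for est in clf.estimators_])
print(mean_nodes)
```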

pprett commented Jul 12, 2013

Gilles, you will like this one :-)

I tried your ordinary splitter instead of the Breiman splitter for GBRT - here are the results for covertype:

version   train time   test time   error rate
tree-v2   144.0397s    0.6534s     0.1481
master    219.1950s    0.5327s     0.1332

way faster! also, the test time thing was probably an artifact...

pprett commented Jul 12, 2013

here is the best splitter with smaller trees (depth=3)

version   train time   test time   error rate
tree-v2   73.3558s     0.5371s     0.1958
master    88.0081s     0.3438s     0.1936

and here only 100 trees w/ max_features=None

version   train time   test time   error rate
master    200.9225s    0.1981s     0.1926
tree-v2   227.3296s    0.3089s     0.1954

and here only 20 trees w/o sub-sampling

version   train time   test time   error rate
master    82.1088s     0.0800s     0.2243
tree-v2   139.6819s    0.0920s     0.2243
pprett commented Jul 12, 2013

From these benchmarks I basically conclude that your default splitter actually does pretty well for GBRT on larger datasets (even using "shallow" trees).

glouppe commented Jul 12, 2013

Thanks for the benchmarks, Peter! So basically, this is good news. I will remove the BreimanSplitter since it does not prove to be that effective. I have actually always preferred the default one, since it uses no additional memory and happens to be faster, as your last results suggest.

glouppe commented Jul 12, 2013

I have just removed BreimanSplitter and made GBRT use the default splitter. This greatly simplifies the code :)

I have also just pushed a small optimization to predict. Don't know if it will change anything. This is quite strange though, since the code is basically identical to master. Maybe this is due to compilation option (I added -03 and -funroll-all-loops in setup.py, this proved to be slightly faster on my machine, but it is maybe not the case on yours. Could run a quick benchmark with and without loop unrolling? If it happens to be slower, then I'll remove it.)

glouppe commented Jul 12, 2013

> maybe the random state is used differently in master vs tree-v2?

Yes, random states are used differently, but the variance shouldn't be that high...

glouppe commented Jul 12, 2013

I'll have some time tomorrow - I'll launch some more benchmarks for RandomForest, Extra-Trees and GBRT on larger datasets, comparing results with master.

@ogrisel reviewed changes in sklearn/ensemble/weight_boosting.py (comments now outdated)
glouppe commented Jul 22, 2013

This PR has been rebased on top of master. Both @pprett and @arjoly gave their +1, and @ogrisel also had a look earlier.

I think this is ready to be merged. Shall I click the green button?

GaelVaroquaux commented Jul 22, 2013

> I think this is ready to be merged. Shall I click the green button?

Go for it. Hurray, hurray, hurray

glouppe added a commit that referenced this pull request Jul 22, 2013

Merge pull request #2131 from glouppe/trees-v2
[MRG] Complete rewrite of the tree module

@glouppe merged commit 239054d into scikit-learn:master Jul 22, 2013

1 check passed: The Travis CI build passed

glouppe commented Jul 22, 2013

Done :)

pprett commented Jul 22, 2013

awesome!

GaelVaroquaux commented Jul 22, 2013

F####ing aye!

arjoly commented Jul 22, 2013

Congratulations!!!

satra commented Jul 22, 2013

fantastic work guys!

mblondel commented Jul 22, 2013

🍻

@pprett referenced this pull request Jul 22, 2013: Random Forest Performance #1435 (closed)

amueller commented Jul 22, 2013

Grats! Awesome work!

amueller commented Jul 22, 2013

It seems there is no whatsnew entry. I'm pretty sure there should be one ;)

glouppe commented Jul 22, 2013

Just did in c0b35eb.

Thanks for the support guys!

amueller commented Jul 22, 2013

By the way, did you end up using -funroll-all-loops? AFAIK it should never be used (it unrolls loops whose length is not known at compile time) and usually slows things down. Sorry for my late feedback, I'm at least three months behind on sklearn :-/

else:
    estimator.fit(X, y, sample_weight=sample_weight)
try:
    estimator.set_params(random_state=self.random_state)

amueller commented Jul 22, 2013

There is actually a helper function in utils.testing that does that, but it's probably not really worth using here.
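The pattern under discussion here (seeding an estimator's random_state only when the estimator actually accepts that parameter) can be sketched as follows. Both `maybe_set_random_state` and `DummyEstimator` are hypothetical stand-ins for illustration, not scikit-learn code:

```python
# Hedged sketch: seed random_state only if the estimator exposes it,
# swallowing the error otherwise. Names here are illustrative stand-ins.
def maybe_set_random_state(estimator, random_state):
    try:
        estimator.set_params(random_state=random_state)
    except ValueError:
        pass  # estimator has no random_state parameter; leave it untouched
    return estimator


class DummyEstimator:
    """Minimal estimator accepting only a random_state parameter."""

    def __init__(self):
        self.random_state = None

    def set_params(self, **params):
        for key, value in params.items():
            if key != "random_state":
                # scikit-learn's set_params raises ValueError on unknown params
                raise ValueError("Invalid parameter %s" % key)
            self.random_state = value
        return self
```

Usage: `maybe_set_random_state(DummyEstimator(), 42)` returns the estimator with `random_state` set to 42, while an estimator lacking the parameter would pass through unchanged.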

glouppe commented Jul 22, 2013

Indeed, this was still lying around. I have now removed it.

jnothman commented Jul 24, 2013

Well done! But I notice that all the attribute documentation for Tree has disappeared...

@jnothman referenced this pull request Jul 25, 2013: DOC docstring for Tree #2215 (closed)

ndawe commented Aug 22, 2013

Great work @glouppe! Just getting caught up on sklearn developments. Excited to see the speed improvements in my analysis framework!
