
[MRG] faster sorting in trees; random forests almost 2× as fast #2747

Merged: 3 commits merged into scikit-learn:master from larsmans:tree-sort on Jan 26, 2014

Conversation

@larsmans
Member

larsmans commented Jan 13, 2014

Changed the heapsort in the tree learners into a quicksort and gave it cache-friendlier data access. This speeds up RF training almost two-fold. In fact, profiling with @fabianp's yep tool shows the time taken by sort dropping from 65% to <10% of total running time in the covertype benchmark.

This is taking longer than I thought but I figured I should at least show @glouppe and @pprett what I've got so far.

TODO:

  • more benchmarks, esp. on a denser dataset than covertype (sparse data is easy :)
  • make tests pass
  • clean up code
  • filter out the cruft
  • decide on the final algorithm: quicksort takes O(n²) time in the worst case, which can be avoided by introsort at the expense of more code.
@ogrisel
Member

ogrisel commented Jan 13, 2014

Nice! I tagged this PR for 0.15 milestone if everyone agrees :)

@larsmans
Member Author

larsmans commented Jan 13, 2014

On the flip side, the optimization to the sorting is so good that it makes the rest of the tree code look slow :p

(But again, covertype is really easy. I'll try 20news after SVD-200 as well.)

@pprett
Member

pprett commented Jan 13, 2014

@larsmans I've a benchmark suite that contains datasets with different characteristics -- will send the results tomorrow

@ogrisel
Member

ogrisel commented Jan 13, 2014

You can try on MNIST as well with the mldata loader: there is a script in the MLP PR: https://github.com/IssamLaradji/scikit-learn/blob/multilayer-perceptron/benchmarks/bench_mnist.py

@larsmans
Member Author

larsmans commented Jan 13, 2014

@pprett Then be sure to use vanilla quicksort, not the randomized one. Shuffling turns out to be extremely expensive.

@larsmans
Member Author

larsmans commented Jan 13, 2014

pprof (Google perftools) graph w/ quicksort on covertype:

[image: quicksort-pprof]

@larsmans
Member Author

larsmans commented Jan 13, 2014

On 20news, all categories, 100 SVD components, 500 trees and four cores of an Intel i7, training time goes down from 24.181s to 11.683s. F1-score goes down from ~.75 to ~.6, though, so I may have a bug somewhere...

@larsmans
Member Author

larsmans commented Jan 13, 2014

Covertype accuracy actually went down the drain as well. This wasn't the case before I rebased; I must have made a mistake in handling the new X_fx_stride.

-            while ((p + 1 < end) and
-                   (X[X_sample_stride * samples[p + 1] + X_fx_stride * current_feature] <=
-                    X[X_sample_stride * samples[p] + X_fx_stride * current_feature] + EPSILON_FLT)):
+            while p + 1 < end and Xf[p + 1] <= Xf[p] + EPSILON_FLT:
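The new one-line condition reads from a contiguous buffer Xf instead of doing two strided dereferences into X per iteration. A rough plain-Python sketch of the gathering step (`gather_and_sort` is a hypothetical helper, and X is simplified to a list of rows rather than a strided array):

```python
def gather_and_sort(X, samples, start, end, feature):
    """Fill Xf[start:end] with the current feature's values for the active
    samples, then sort that slice, permuting samples[start:end] in tandem."""
    Xf = [0.0] * len(samples)            # overallocated to n_samples, like the Cython code
    for p in range(start, end):          # p ranges over [start, end)
        Xf[p] = X[samples[p]][feature]   # one strided gather per feature...
    order = sorted(range(start, end), key=lambda p: Xf[p])
    Xf[start:end] = [Xf[p] for p in order]
    samples[start:end] = [samples[p] for p in order]
    return Xf                            # ...then every later scan is unit-stride
```

After this, the duplicate-skipping loop touches only Xf[p] and Xf[p + 1], which are adjacent in memory.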

@pprett

pprett Jan 14, 2014

Member

@larsmans in the block above you set Xf[p] for p in range(0, end-start). Here p runs from range(start, end) - is that correct?

@pprett

pprett Jan 14, 2014

Member

When I did my profiling of the tree code a couple of weeks ago it turned out that for datasets with lots of split points the bulk of time is spent in the while condition -- maybe part of your speed-up stems from this refactoring rather than the new sorting.

@glouppe

glouppe Jan 14, 2014

Member

@larsmans in the block above you set Xf[p] for p in range(0, end-start). Here p runs from range(start, end) - is that correct?

+1, indices are not correct. Please always make p range over [start, end) to avoid bugs and confusion with other parts of the code.

@larsmans

larsmans Jan 14, 2014

Author Member

Right, this is it. Will change this bit tonight.

@pprett No, this isn't actually the cause of the speedup, it was near 50% before I even introduced this bug.

@pprett

pprett Jan 14, 2014

Member

great - thx Lars


#abort()
Xf[p] = X[X_sample_stride * samples[p + start]
          + X_fx_stride * current_feature]
qsort(Xf, samples + start, end - start)

@glouppe

glouppe Jan 14, 2014

Member

With regards to my comment below, I would therefore replace this part of the code with:

for p in range(start, end):
    Xf[p] = X[X_sample_stride * samples[p] + X_fx_stride * current_feature]
qsort(Xf + start, samples + start, end - start)

@larsmans

larsmans Jan 14, 2014

Author Member

@glouppe That would overallocate. Could that be a problem in terms of peak memory usage? If not, I'll simplify the code.

@glouppe

glouppe Jan 14, 2014

Member

It is not really a huge problem in my opinion. In terms of memory usage it is like adding a new feature column. Anyway, you already overallocate Xf since you make it of size self.n_samples above (and not of size end - start). This seems to work :)

In fact, you could also avoid the mallocs/frees by allocating Xf only once during init, as we do for samples. That would be a bit cleaner in my opinion.

@larsmans

larsmans Jan 14, 2014

Author Member

I'll see if that's better. It would save adding a return value to this function.

-if current_threshold == X[X_sample_stride * samples[p] + X_fx_stride * current_feature]:
-    current_threshold = X[X_sample_stride * samples[p - 1] + X_fx_stride * current_feature]
+if current_threshold == Xf[p]:
+    current_threshold = Xf[p]

@glouppe

glouppe Jan 15, 2014

Member

It should be current_threshold = Xf[p - 1].
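The intent of the fix can be sketched in plain Python (a hypothetical standalone helper, assuming the threshold is computed as the midpoint of adjacent sorted values, as in the surrounding splitter code): if floating-point rounding pushes the midpoint up onto the right-hand value, falling back to Xf[p - 1] keeps samples equal to the threshold on the left side. With the bug, `current_threshold = Xf[p]` made the fallback a no-op.

```python
def split_threshold(Xf, p):
    """Candidate threshold between adjacent sorted values Xf[p - 1] < Xf[p]."""
    t = (Xf[p - 1] + Xf[p]) / 2.0
    if t == Xf[p]:        # float rounding pushed the midpoint onto the right value
        t = Xf[p - 1]     # fall back to the previous value (the fix)
    return t
```

The rounding case is real: for adjacent doubles a and b with the right mantissa parity, (a + b) / 2.0 rounds exactly to b.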


cdef inline void swap(DTYPE_t* Xf, SIZE_t* samples, SIZE_t i, SIZE_t j) nogil:
    # Helper for sort
    Xf[i], Xf[j] = Xf[j], Xf[i]
    samples[i], samples[j] = samples[j], samples[i]

@arjoly

arjoly Jan 15, 2014

Member

Swapping with this syntax generates one more C instruction than what is really needed.

  /* "sklearn/tree/_tree.pyx":1141
 * cdef inline void swap(DTYPE_t* Xf, SIZE_t* samples, SIZE_t i, SIZE_t j) nogil:
 *     # Helper for sort
 *     Xf[i], Xf[j] = Xf[j], Xf[i]             # <<<<<<<<<<<<<<
 *     samples[i], samples[j] = samples[j], samples[i]
 * 
 */
  __pyx_t_1 = (__pyx_v_Xf[__pyx_v_j]);
  __pyx_t_2 = (__pyx_v_Xf[__pyx_v_i]);
  (__pyx_v_Xf[__pyx_v_i]) = __pyx_t_1;
  (__pyx_v_Xf[__pyx_v_j]) = __pyx_t_2;

@larsmans

larsmans Jan 15, 2014

Author Member

In C, yes. But the assembly for

int t = a[i];
a[i] = a[j];
a[j] = t;

and

int ti = a[i];
int tj = a[j];
a[i] = tj;
a[j] = ti;

is identical (gcc -O2 -S).

The only thing I still need to try is putting an if (i != j) around this, but that's for later.

@pprett
Member

pprett commented Jan 19, 2014

@larsmans can I benchmark the enhancements or are you still working out some issues in the code?

@glouppe
Member

glouppe commented Jan 19, 2014

@pprett As long as the trees are not guaranteed to be the same (which is not the case since accuracy drops), there is no point in benchmarking the current changes. We should invest some time to try to figure this out. I can have a look tomorrow.

@larsmans
Member Author

larsmans commented Jan 19, 2014

I re-applied the patches on top of current master. The first patch, faster heapsort, can AFAIC be merged into master immediately. It gives an almost two-fold speedup and it passes the testsuite.

The second patch, quicksort, doesn't pass all the tests due to randomness issues, but further speeds up tree learning significantly.

@larsmans
Member Author

larsmans commented Jan 19, 2014

@glouppe You can certainly benchmark 238d692, it passes all of the tests.

The second produces somewhat different trees. I'm not sure if we can ever fix that, since neither quicksort nor heapsort is a stable sort.

@pprett
Member

pprett commented Jan 19, 2014

I'm running benchmarks now - should be finished in a couple of hours


@ogrisel
Member

ogrisel commented Jan 19, 2014

The second produces somewhat different trees. I'm not sure if we can ever fix that, since neither quicksort nor heapsort are stable sorts.

But do you get good test accuracy on covertype and other benchmarks with quicksort?

@glouppe
Member

glouppe commented Jan 20, 2014

The second produces somewhat different trees. I'm not sure if we can ever fix that, since neither quicksort nor heapsort are stable sorts.

Stability of the sorting algorithm shouldn't, in theory, have any impact on the trees that are built. As long as the feature values are sorted, the same cutting points should be found. I'll investigate when I have some time.
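That point can be checked with a toy example (an assumed minimal setup, using Python's built-in stable sort): an unstable sort may permute tied entries differently than a stable one, but the sorted value sequence, and therefore the candidate cut points, are identical.

```python
# An unstable sort may order tied entries differently than a stable one,
# but the sorted *values* -- and hence the candidate cut points -- agree.
values = [2.0, 1.0, 2.0, 1.0, 3.0]
samples = list(range(len(values)))

stable = sorted(samples, key=lambda i: values[i])          # stable: ties keep input order
unstable = sorted(samples, key=lambda i: (values[i], -i))  # forces a different tie order

assert stable != unstable                                  # the permutations differ...
assert [values[i] for i in stable] == [values[i] for i in unstable]  # ...but values agree
```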

i += 1

l -= 1
r += 1

@glouppe

glouppe Jan 20, 2014

Member

This is wrong. Removing these two lines solves all the bugs on my box. :)

@glouppe
Member

glouppe commented Jan 20, 2014

@larsmans I just submitted the fix to your branch.

Since I am sure caching the feature values should also be profitable for other splitters, I'd like to make similar changes to PresortBestSplitter and RandomSplitter. I'll push one more patch to your branch during the day.

@pprett
Member

pprett commented Jan 20, 2014

here are some benchmark results -- I only looked at the first commit 238d692

[image: heap_s_error]

[image: heap_s_train_time]

[image: heap_s_test_time]

All values are relative to master, where master is the version after @arjoly's recent MSE enhancement (w/o @jnothman's tree structure refactoring) -- sorry for that, but I only realized when it was too late.
We can see a nice performance improvement for all large datasets (covtype, expedia, solar, bioresponse) -- the improvement is about 15-20%.
Good work @larsmans - awesome!
There was a slight performance decrease on the synthetic regression benchmarks (Friedman #1-3) -- these mostly have a large number of split points, so stability should not be an issue at all.

@pprett
Member

pprett commented Jan 20, 2014

I looked at RandomForestClassifier|Regressor only and used the following parameters:

classification_params = {'n_estimators': 100,
                         'max_depth': None,}
regression_params = {'n_estimators': 100,
                     'max_depth': None, 'max_features': 0.3,
                     'min_samples_leaf': 1,
                     }
@larsmans
Member Author

larsmans commented Jan 20, 2014

@glouppe I will merge the patch later this week. @pprett Not as impressive as on the covtype benchmark... did you fix the random seed? I'm surprised to see a difference in accuracy with the refactored heapsort; it should give the exact same ordering.

@pprett
Member

pprett commented Jan 20, 2014

@larsmans each value is the mean of 3 repetitions, each with a different random seed (the same for both branches)

@pprett
Member

pprett commented Jan 20, 2014

@larsmans which parameters did you use for your covtype benchmark?

@larsmans
Member Author

larsmans commented Jan 20, 2014

Just the standard ones from the covtype script, n_estimators=20, random_seed=13.

@larsmans
Member Author

larsmans commented Jan 20, 2014

@pprett How many cores? I see a somewhat smaller speedup w/ one core compared to the four I used to benchmark previously:

master 133.6747s
238d692 86.0302s

That's about 36% off. At four cores, I get 43% off, despite having only four cores and a browser still running.

@pprett
Member

pprett commented Jan 20, 2014

@larsmans I ran only single threaded experiments

@ogrisel
Member

ogrisel commented Jan 21, 2014

I just tried to run:

python benchmarks/bench_covertype.py --classifiers=ExtraTrees --n-jobs=8

On this branch and master. The validation error is the same (~0.021). However I do not see any significant training time improvement (the standard deviations overlap). Maybe the speedup observed by @larsmans is architecture specific (e.g. related to the CPU cache size)?

Here are some attributes from one of my cores:

model name  : Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
cpu MHz     : 2660.000
cache size  : 12288 KB
bogomips    : 5585.92
@glouppe
Member

glouppe commented Jan 26, 2014

Thanks for the merge and the rebase @larsmans

I have personally no further comments to make. Have you benchmarked introsort just to make sure it is indeed faster?

@larsmans
Member Author

larsmans commented Jan 26, 2014

Introsort is about as fast as quicksort. The only reason to use it is to get rid of the worst-case quadratic behavior.

@ogrisel
Member

ogrisel commented Jan 26, 2014

I don't have access to my workstation right now, but on my laptop I get for the covertype bench:

python benchmarks/bench_covertype.py --n-jobs=4 --classifiers=RandomForest --random-seed=1
  • master: RandomForest  81.5990s   0.2137s   0.0301
  • this branch: RandomForest  34.9615s   0.2133s   0.0301

So this is very good (at least as fast as the previous benchmarks run with quicksort).

@ogrisel
Member

ogrisel commented Jan 26, 2014

I had a quick look at the code and it looks fine to me although I am not familiar with sorting algorithms. +1 for merging on my side.

@ogrisel
Member

ogrisel commented Jan 26, 2014

I also ran my memory leak detection script with the DecisionTreeRegressor class instead of the ExtraTreeRegressor class and neither psutil's reported RSS nor objgraph.get_leaking_objects detected a leak.

@glouppe
Member

glouppe commented Jan 26, 2014

Thanks for the check! @larsmans feel free to merge this in :)


@amueller
Member

amueller commented Jan 26, 2014

awesome!!! great work!

larsmans added a commit that referenced this pull request Jan 26, 2014
[MRG] faster sorting in trees; random forests almost 2× as fast
@larsmans larsmans merged commit 9f6dbc5 into scikit-learn:master Jan 26, 2014
1 check passed: The Travis CI build passed.
@ogrisel
Member

ogrisel commented Jan 26, 2014

\o/

@ogrisel
Member

ogrisel commented Jan 26, 2014

Please don't forget to add an entry in the whats_new.rst file.

@jnothman
Member

jnothman commented Jan 26, 2014

Great work! Nice to know we persist in teaching diverse sorting algorithms for good reason!


@songgc

songgc commented Jan 27, 2014

After this merge, the GBRT regression takes 2 times longer on a data set than the previous build (commit bf1635d). The loss score seems OK. BTW, the data set is MSLR-WEB10K/Fold1 (MS learning to rank).

@glouppe
Member

glouppe commented Jan 27, 2014

Could you try with 31491f9 as head instead? The only changes on GBRT are with regards to the PresortBestSplitter and shouldn't make things slower. CC: @pprett


@songgc

songgc commented Jan 27, 2014

Hash 31491f9 is even faster than bf1635d by 15%.

@pprett
Member

pprett commented Jan 27, 2014

thanks @songgc - I did a quick benchmark using my solar dataset (regression).

I looked at master ( ), introsort (31491f9), MSE Optim (0b7c79b), best-first (834b375).
I can definitely see a performance regression between 0b7c79b and 31491f9 .
Since it wasn't exposed in the latest benchmarks we did I assume it is an effect of the memory leak fix. I need to check this in more detail.

@songgc I find the 2x performance regression quite harsh -- can you tell me which parameters you used (max_features, max_depth, n_estimators)?

@ogrisel
Member

ogrisel commented Jan 27, 2014

@songgc have you fixed the random_state parameter of your GradientBoostingRegressor? I tried on a subsample of 62244 MSLR results / 136 (500 queries) with GradientBoostingRegressor(n_estimators=100, random_state=1) and it trains in 1m30s both in master and on bf1635d and yields NDCG@10=0.507 each time.

I also tried to bench GradientBoostingRegressor on a simple make_friedman3 dataset with 100k samples and the training speed is the same.

@pprett
Member

pprett commented Jan 27, 2014

@ogrisel ok - I'm running a benchmark suite now with 3 repetitions between current master and @arjoly's MSE optimization -- I'll keep you posted.

@songgc it would be great if you could post the parameters you used -- tree building performance can differ quite considerably depending on the parameters used (e.g. max_features)

@pprett
Member

pprett commented Jan 27, 2014

[image: gbrt-bench-perf-reg]

this one just uses smaller datasets -- it looks good IMHO. I used the following params:

classification_params = {'n_estimators': 500, 'loss': 'deviance',
                         'min_samples_leaf': 1, 'max_leaf_nodes': 6,
                         'max_depth': None,
                         'learning_rate': .01, 'subsample': 1.0, 'verbose': 0}

regression_params = {'n_estimators': 500, 'max_leaf_nodes': 6,
                     'max_depth': None,
                     'min_samples_leaf': 1, 'learning_rate': 0.01,
                     'loss': 'ls', 'subsample': 1.0, 'verbose': 0}
@ogrisel
Member

ogrisel commented Jan 27, 2014

Same here, I do not see any regression between 31491f9 and the current master:

I trained GradientBoostingRegressor(n_estimators=100, random_state=1) on the full Fold1 split of MSLR-10K (3 folds train + 1 fold val == 958671 samples) in 33 min on master and 36 min on 31491f9. In both cases I get the following scores on the Fold1 test fold:

  • NDCG@5: 0.506
  • NDCG@10: 0.514
  • R2: 0.168

If I understand correctly, the only impacting commit between 31491f9 and master is @glouppe's cache optimization a681c9b (aka: ENH Make PresortBestSplitter cache friendly + cosmetics). It does indeed seem to work on my box, shaving 3 min off the training time.

@larsmans larsmans deleted the larsmans:tree-sort branch Jan 27, 2014
@songgc

songgc commented Jan 27, 2014

My apologies for the false alarm! I found that I had installed version 0.14.1 rather than the master branch...

pip install scikit-learn git+https://github.com/scikit-learn/scikit-learn.git gives me the stable version.
pip install -f scikit-learn file://"a synced local repos" gives me the master branch.

My lesson is "check version first". Currently, my benchmarks are consistent with you guys. The speed improvement is impressive! I don't need to transfer data to R for the GBM package :)

@pprett
Member

pprett commented Jan 27, 2014

no worries @songgc thanks for double-checking -- I'd say there is definitely no reason now to switch to R for the randomForest package ;)

@larsmans
Member Author

larsmans commented Jan 27, 2014

@songgc @pprett Any benchmarks against R? :)

@pprett
Member

pprett commented Jan 27, 2014

I've code for this -- the next days are a bit busy -- will post it later this week. I expect quite a difference because 0.14.1 used to be faster as well. WiseRF is the competitor :)


@songgc

songgc commented Jan 27, 2014

@larsmans
I had a benchmark case against GBM as follows:
Data set: MSLR-WEB10K/Fold1
params:
for GBRT {'n_estimators': 100, 'max_depth': 4, 'min_samples_split': 10,
'learning_rate': 0.03, 'loss': 'ls', 'subsample': 0.5, 'random_state': 11, 'verbose': 1}
for GBM {"distribution": "gaussian", "shrinkage": 0.03,
"n.tree": 100, "bag.fraction": 0.5, "verbose": True,
"n.minobsinnode": 10, "interaction.depth": 6}
Please note that max depths are different. GBM usually requires deeper trees compared to GBRT to achieve a similar performance.

Benchmark result:
library, test MSE, running time
GBRT, 0.5854, 1238s
GBM, 0.5943, 1442s

@pprett
Member

pprett commented Jan 28, 2014

@songgc the current master also includes an option to build GBM-style trees in GradientBoostingRegressor|Classifier -- use the max_leaf_nodes argument (max_leaf_nodes - 1 equals interaction.depth)

@amueller amueller modified the milestones: 0.16, 0.15 Jul 15, 2014