
[MRG] Gradient Boosting enhancements #2570

Merged 74 commits from pprett:gbrt-enh-stackrec-greedy into scikit-learn:master on Jan 8, 2014

Conversation

@pprett (Member) commented Nov 3, 2013

This PR contains a number of enhancements for GBRT:

  1. The warm_start argument now allows adding additional trees to an already trained model.
  2. The fit parameter monitor allows injecting a callback into the learning procedure; the callback can implement early stopping, evaluate training progress on held-out data, create snapshots, ...
  3. Trees can now be grown using max_leaf_nodes as a stopping criterion.
    Setting max_leaf_nodes > 0 grows the tree in a best-first fashion. Nodes are pushed onto a PriorityQueue (implemented as a binary heap) and the node with the highest impurity improvement is expanded next.
    If max_leaf_nodes < 0 then trees are grown in a depth-first fashion by using a stack instead of the PriorityQueue.
    Trees grown with max_leaf_nodes > 0 have depth at most max_leaf_nodes - 1 and thus can model interactions of (at most) max_leaf_nodes - 1 features. Individual trees might, however, be shallower than that.
  4. New estimator ZeroEstimator if you want to start your GBRT model from scratch.
  5. Tree code now supports both fortran- and c-style inputs X.
    Tree ensemble estimators don't enforce a specific layout of X. Benchmarks indicate that fortran layout is indeed faster -- even for ExtraTrees.
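The best-first growth described in point 3 can be sketched in a few lines of Python using heapq as the priority queue (a minimal sketch only; the real implementation is Cython inside the tree builder, and the node ids and improvement values below are made up for illustration):

```python
import heapq

def grow_best_first(root_improvement, expand, max_leaf_nodes):
    """Expand nodes best-first until the tree has max_leaf_nodes leaves.

    `expand(node_id)` stands in for the splitter: it returns the impurity
    improvements of the two children, or [] if the node stays a leaf.
    """
    # heapq is a min-heap, so improvements are negated in order to pop
    # the node with the *highest* impurity improvement first.
    heap = [(-root_improvement, 0)]
    next_id, n_leaves, expanded = 1, 1, []
    while heap and n_leaves < max_leaf_nodes:
        _, node = heapq.heappop(heap)
        children = expand(node)
        if not children:
            continue                   # node is (almost) pure: keep as leaf
        expanded.append(node)
        n_leaves += len(children) - 1  # one leaf replaced by two
        for improvement in children:
            heapq.heappush(heap, (-improvement, next_id))
            next_id += 1
    return expanded

# Toy run: the root's left child (id 1, improvement 5.0) is expanded
# before the right child (id 2, improvement 1.0).
toy = {0: [5.0, 1.0], 1: [0.5, 0.2]}
order = grow_best_first(10.0, lambda n: toy.get(n, []), max_leaf_nodes=3)
```

Setting max_leaf_nodes < 0 corresponds to swapping the heap for a plain stack, which yields the depth-first growth described above.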

  Moved verbose output code to VerboseReporter class.
  Cosmit: logical structured code blocks
  Refactored: moved init state to method
  Added partial_fit to GradientBoosting
  Fix: is_classification not available; n_features not n_features_
  Tests
  Add ``partial_fit`` to docs
…rion.children_impurity. Pass partition impurity to stack to avoid re-computation (saves some runtime).

New tree parameter: complete; specifies whether complete binary trees are grown or if a greedy branch of max_depth is grown with at most max_depth + 1 leaves.
max_leaf_nodes instead of complete parameter
@pprett (Member Author) commented Nov 3, 2013

Some timings against master to show that there is no performance regression for forests.
ExtraTrees and RF use default params and 20 trees.
Covertype is a binary classification problem with 54 features, 0.5M training and 60k test samples.
Solar is a regression problem with 512 features, 0.2M training and 50k test samples.
All timings are the mean of 3 repetitions.

| branch | dataset   | estimator  | train    | test  | error   |
|--------|-----------|------------|----------|-------|---------|
| PR     | Covertype | ExtraTrees | 147.34s  | 0.45s | 0.0211  |
| PR     | Covertype | RF         | 160.29s  | 0.43s | 0.0302  |
| PR     | Solar     | ExtraTrees | 31.71s   | 0.54s | 2106559 |
| PR     | Solar     | RF         | 148.03s  | 0.43s | 2124727 |
| master | Covertype | ExtraTrees | 145.351s | 0.45s | 0.0211  |
| master | Covertype | RF         | 157.471s | 0.39s | 0.0302  |
| master | Solar     | ExtraTrees | 31.063s  | 0.48s | 2107144 |
| master | Solar     | RF         | 143.913s | 0.43s | 2122097 |

Remarks: training timings are similar (the PR is slightly slower, but IMHO not relevantly so).
On Solar, master and the PR lead to different results -- why? They use the same random seed.

@pprett (Member Author) commented Nov 3, 2013

I looked into the difference between the results on Solar: the parameters (incl. random states) are the same for master and the PR. Of all 20 trees there is only one tree that differs between the two versions. It seems there are minor differences in the impurity scores (because we now don't compute them when processing a node but pass them down from the parent split node). If a node's impurity score is below a threshold (1e-7) we do not try to split the node since it is almost pure -- I tracked the number of times this happens and it differs quite a bit between master and this PR.

If this is indeed the cause, it should be affected by the setting of min_samples_split - I'll investigate that.

@pprett (Member Author) commented Nov 3, 2013

Here is a quick benchmark of sklearn vs gbm on covertype:

I used the same parameters for both (min_samples_leaf=5, n_estimators=100, num_leafs=4, learning_rate=0.1)

| Classifier | train-time | test-time | error-rate |
|------------|------------|-----------|------------|
| GBM        | 226.6512s  | 1.2698s   | 0.2102     |
| sklearn c  | 287.1446s  | 0.2773s   | 0.2166     |
| sklearn f  | 211.1848s  | 0.2883s   | 0.2166     |

f and c stand for fortran or c-style memory layout of X.

@GaelVaroquaux (Member)

> Here is a quick benchmark of sklearn vs gbm on covertype:

Looks really good. Impressive!

@pprett (Member Author) commented Nov 3, 2013

@GaelVaroquaux @glouppe I've updated the GBRT benchmark - I modified our tree code to support both fortran- and c-style inputs X (previously only c-style was supported) - will update the PR in a minute.

As expected, fortran layout is significantly better -- but we expect the advantage to diminish when random feature sampling is used - I'll keep you posted.

@pprett (Member Author) commented Nov 3, 2013

Performance after the above changes:

| branch | dataset   | estimator    | train   | test  | error  |
|--------|-----------|--------------|---------|-------|--------|
| PR     | Covertype | ExtraTrees*c | 145.90s | 0.43s | 0.0211 |
| PR     | Covertype | ExtraTrees*f | 115.51s | 0.48s | 0.0211 |

@glouppe this might interest you

@glouppe (Contributor) commented Nov 3, 2013

@pprett That's great! Handling both c-contiguous and fortran-style inputs without making a new copy is really nice! Thanks for working on this :)

Regarding the new max_leaf_nodes hyper-parameter: I already mentioned this on gtalk, but I'll say it here as well for the record. I am not sure using the current node impurity as priority is the best thing to do. In a true best-first strategy, I would rather prioritize the nodes by their potential weighted impurity decrease, that is:

n_node_samples[node] * impurity[node] 
    - n_node_samples[children_left[node]] * impurity[children_left[node]] 
    - n_node_samples[children_right[node]] * impurity[children_right[node]]
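As a pure-Python sketch, this priority is (array names mirror the Tree attributes used in the formula; the sample counts and impurity values below are made up for illustration):

```python
def weighted_impurity_decrease(node, n_node_samples, impurity,
                               children_left, children_right):
    """Potential weighted impurity decrease of expanding `node`."""
    left, right = children_left[node], children_right[node]
    return (n_node_samples[node] * impurity[node]
            - n_node_samples[left] * impurity[left]
            - n_node_samples[right] * impurity[right])

# Root with 10 samples split into children of 6 and 4 samples:
# 10 * 0.5 - 6 * 0.2 - 4 * 0.1 = 3.4
priority = weighted_impurity_decrease(
    0,
    n_node_samples=[10, 6, 4],
    impurity=[0.5, 0.2, 0.1],
    children_left=[1, -1, -1],
    children_right=[2, -1, -1],
)
```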

@@ -1459,9 +1682,9 @@ cdef class Tree:
return sizet_ptr_to_ndarray(self.n_node_samples, self.node_count)

Contributor:

Why do you need this? For easier inspection?

Member Author:

what are you referring to? the return value of pop?

Contributor:

I was referring to the max_depth attribute. But it seems to have disappeared?

Member Author:

ok, maybe you meant why I set max_depth at the end of Tree.build?

Contributor:

Yep. self.max_depth = max_depth_seen. Dunno why github messed with my comments :s

Contributor:

Btw, I am not against this addition, I was just surprised to see it. You could rename it to max_depth_ to enforce that it is learned.

Member Author:

It was just convenience to study the nature of the learned trees -- do you think we should keep it (renamed) or throw it out?

Member:

+1 for recording max_depth_, it's very useful IMHO.

@pprett (Member Author) commented Nov 3, 2013

@glouppe just to make sure I understand correctly

> Regarding the new max_leaf_nodes hyper-parameter, I just told on gtalk, but I would say it also here for the record: I am not sure using current node impurity as priority is the best thing to do. In a true best-first strategy, I would rather prioritize the nodes by their potential weighted impurity decreases, that is:
>
> n_node_samples[node] * impurity[node]
>     - n_node_samples[children_left[node]] * impurity[children_left[node]]
>     - n_node_samples[children_right[node]] * impurity[children_right[node]]

Suppose you create a new split node (node id node). Now you want to know which of its children to split first: using the above formula, you would need to treat each child as a split node itself (to get its left and right children) in order to evaluate which of the two you should indeed split first. Thus, you have to compute both splits and then discard one of them. This reminds me a lot of a "pruning" strategy... is this correct or am I missing something?

assert_almost_equal(0.84652100667116, score)
# est.fit(X_train, y_train)
# score = est.score(X_test, y_test)
# assert_almost_equal(score, 0.86908506408880637)
Member:

Why is this commented out?

Member Author:

well.. the honest answer is: because when you uncomment it, it fails - but I guess you know that...
The truth is that trees give slightly different results on 32-bit vs 64-bit architectures - during the sprint in Paris I made a change that made them equal, at the expense of 64-bit precision. I'd rather see the test removed.

Contributor:

+1 for removing the test. It is futile to try to get exactly the same results on different architectures, always, in all cases.

Member:

Does it pass with assert_almost_equal(score, 0.869, 2)?

@mblondel (Member) commented Jan 7, 2014

> Sorry for coming late to the party, but I'd be +1 on fit_more and -100 on adding another CV object as the API is fundamentally flawed.

Can you elaborate on the flawed API?

@amueller (Member) commented Jan 7, 2014

@mblondel if you have a pipeline with a CV object and want to do grid-searches, you will always do a nested cross-validation, i.e. GridSearchCV will do a train/test split and the CV object will do another split. But you actually just want to do one cross-validation where GridSearchCV can take advantage of some path algorithm (as used by the CV objects). See #1626.

remove troublesome 32bit test
@coveralls

Coverage Status

Coverage remained the same when pulling 73c7de3 on pprett:gbrt-enh-stackrec-greedy into b57b3a3 on scikit-learn:master.

@pprett (Member Author) commented Jan 8, 2014

@ogrisel I've addressed the two issues with the uninformative exceptions. I also removed the 32bit test - the difference is actually quite significant (64bit: 0.84758663315474814, 32bit: 0.82891483480560135).

If you're fine with that, I'd merge.

@coveralls

Coverage Status

Coverage remained the same when pulling 9a3c22a on pprett:gbrt-enh-stackrec-greedy into b57b3a3 on scikit-learn:master.

"%r" % max_leaf_nodes)
if -1 < max_leaf_nodes < 2:
raise ValueError(("max_leaf_nodes %r must be either smaller than 0 or "
"larger than 1").format(max_leaf_nodes))
Member:

Apparently this is not tested, as it would raise a formatting exception; it should be:

            raise ValueError("max_leaf_nodes {0} must be either smaller than 0 or "
                             "larger than 1".format(max_leaf_nodes))

Note that besides the %r / {0} placeholder fix, you also should not need the additional pair of parens. But better add a test first to check that this is actually a fix.

Member:

Actually coveralls reports this line as being tested, but the test does not cover the result of the formatting:

>>> "a{0}a".format('a')
'aaa'
>>> "a%ra".format('a')
'a%ra'

Member:

You can use:

from sklearn.utils.testing import assert_raise_message

assert_raise_message(ValueError, expected_msg, my_callable, args)

to check the content of a message.
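For reference, such a helper boils down to something like this (a self-contained stand-in, not the actual sklearn.utils.testing implementation; the validate function below is a hypothetical example mirroring the corrected max_leaf_nodes check discussed above):

```python
def check_raise_message(exc_type, expected_msg, callable_, *args, **kwargs):
    """Assert that callable_(*args, **kwargs) raises exc_type and that
    expected_msg is contained in the exception message."""
    try:
        callable_(*args, **kwargs)
    except exc_type as exc:
        assert expected_msg in str(exc), (
            "%r not found in %r" % (expected_msg, str(exc)))
    else:
        raise AssertionError("%s not raised" % exc_type.__name__)

def validate(max_leaf_nodes):
    # uses the corrected {0}-style formatting from above
    if -1 < max_leaf_nodes < 2:
        raise ValueError("max_leaf_nodes {0} must be either smaller than 0 or "
                         "larger than 1".format(max_leaf_nodes))

check_raise_message(ValueError, "max_leaf_nodes 0 must be", validate, 0)
```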

Member Author:

Apparently it doesn't raise a formatting exception, because the line is covered -- thanks for the additional pair of eyeballs, I completely missed that. I'd rather not alter the whole test suite to check the message of each exception, though...

Member:

I think it's OK not to rewrite the whole test suite to use assert_raise_message, but from now on it's better to use assert_raise_message instead of assert_raises whenever the message is stable: it makes the test suite more explicit (and easier to read, since the motivation is spelled out).

@coveralls

Coverage Status

Coverage remained the same when pulling 792a09c on pprett:gbrt-enh-stackrec-greedy into b57b3a3 on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 05d88c7 on pprett:gbrt-enh-stackrec-greedy into b57b3a3 on scikit-learn:master.

@ogrisel (Member) commented Jan 8, 2014

This PR looks good to me. +1 for merging on my side.

pprett added a commit that referenced this pull request Jan 8, 2014
[MRG] Gradient Boosting enhancements
@pprett pprett merged commit b8bea8e into scikit-learn:master Jan 8, 2014
@pprett (Member Author) commented Jan 8, 2014

merge all the things

@glouppe (Contributor) commented Jan 8, 2014

Phew... finally! Great work Peter :)

Next time, let's make our changes smaller ;)

@glouppe (Contributor) commented Jan 8, 2014

and thanks all for the reviews.

@GaelVaroquaux (Member)

It only took two months. Many PRs are open for longer.

@pprett (Member Author) commented Jan 8, 2014

yeah - thanks for the thorough reviews!

will do a smaller one next time... I added stuff successively... bad style

@arjoly (Member) commented Jan 8, 2014

Well done @pprett !!! 🍻

@emef commented May 4, 2014

@pprett really happy to see these changes I was hoping for last year; all really nice improvements! nice work
