
[MRG] Gradient Boosting enhancements #2570

Merged 74 commits from pprett:gbrt-enh-stackrec-greedy into scikit-learn:master on Jan 8, 2014

Conversation

@pprett (Member) commented Nov 3, 2013

This PR contains a number of enhancements for GBRT:

  1. The warm_start argument now allows adding additional trees to an already trained model.
  2. The fit parameter monitor allows injecting a callback into the learning procedure; the callback can implement early stopping, evaluate training progress on held-out data, create snapshots, ...
  3. Trees can now be grown using max_leaf_nodes as a stopping criterion.
    Setting max_leaf_nodes > 0 grows the tree in a best-first fashion. Nodes are pushed onto a PriorityQueue (implemented as a binary heap) and the node with the highest impurity improvement is expanded next.
    If max_leaf_nodes < 0 then trees are grown in a depth-first fashion by using a stack instead of the PriorityQueue.
    Trees grown with max_leaf_nodes > 0 have depth at most max_leaf_nodes - 1 and thus can model interactions of (at most) max_leaf_nodes - 1 features. Individual trees might, however, be shallower than that.
  4. New estimator ZeroEstimator if you want to start your GBRT model from scratch.
  5. Tree code now supports both fortran- and c-style inputs X.
    Tree ensemble estimators don't enforce a specific layout of X. Benchmarks indicate that fortran layout is indeed faster -- even for ExtraTrees.
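The best-first growth described in point 3 can be sketched in a few lines of Python using heapq as the priority queue (a minimal sketch only; the real implementation is Cython inside the tree builder, and the node ids and improvement values below are made up for illustration):

```python
import heapq

def grow_best_first(root_improvement, expand, max_leaf_nodes):
    """Expand nodes best-first until the tree has max_leaf_nodes leaves.

    `expand(node_id)` stands in for the splitter: it returns the impurity
    improvements of the two children, or [] if the node stays a leaf.
    """
    # heapq is a min-heap, so improvements are negated in order to pop
    # the node with the *highest* impurity improvement first.
    heap = [(-root_improvement, 0)]
    next_id, n_leaves, expanded = 1, 1, []
    while heap and n_leaves < max_leaf_nodes:
        _, node = heapq.heappop(heap)
        children = expand(node)
        if not children:
            continue                   # node is (almost) pure: keep as leaf
        expanded.append(node)
        n_leaves += len(children) - 1  # one leaf replaced by two
        for improvement in children:
            heapq.heappush(heap, (-improvement, next_id))
            next_id += 1
    return expanded

# Toy run: the root's left child (id 1, improvement 5.0) is expanded
# before the right child (id 2, improvement 1.0).
toy = {0: [5.0, 1.0], 1: [0.5, 0.2]}
order = grow_best_first(10.0, lambda n: toy.get(n, []), max_leaf_nodes=3)
```

Setting max_leaf_nodes < 0 corresponds to swapping the heap for a plain stack, which yields the depth-first growth described above.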

  Moved verbose output code to VerboseReporter class.
  Cosmit: logical structured code blocks
  Refactored: moved init state to method
  Added partial_fit to GradientBoosting
  Fix: is_classification not available; n_features not n_features_
  Tests
  Add ``partial_fit`` to docs
…rion.children_impurity. Pass partition impurity to stack to avoid re-computation (saves some runtime).

New tree parameter: complete; specifies whether complete binary trees are grown or if a greedy branch of max_depth is grown with at most max_depth + 1 leaves.
max_leaf_nodes instead of complete parameter
@pprett (Member Author) commented Nov 3, 2013

Some timings against master to show that there is no performance regression for forests.
ExtraTrees and RF use default params and 20 trees.
Covertype is a binary classification problem with 54 features, 0.5M training and 60k test samples.
Solar is a regression problem with 512 features, 0.2M training and 50k test samples.
All timings are the mean of 3 repetitions.

| branch | dataset   | estimator  | train    | test  | error   |
|--------|-----------|------------|----------|-------|---------|
| PR     | Covertype | ExtraTrees | 147.34s  | 0.45s | 0.0211  |
| PR     | Covertype | RF         | 160.29s  | 0.43s | 0.0302  |
| PR     | Solar     | ExtraTrees | 31.71s   | 0.54s | 2106559 |
| PR     | Solar     | RF         | 148.03s  | 0.43s | 2124727 |
| master | Covertype | ExtraTrees | 145.351s | 0.45s | 0.0211  |
| master | Covertype | RF         | 157.471s | 0.39s | 0.0302  |
| master | Solar     | ExtraTrees | 31.063s  | 0.48s | 2107144 |
| master | Solar     | RF         | 143.913s | 0.43s | 2122097 |

Remarks: training timings are similar (the PR is slightly slower, but IMHO not relevantly so).
On Solar, master and the PR lead to different results -- why? They use the same random seed.

@pprett (Member Author) commented Nov 3, 2013

I looked into the difference between the results on Solar: the parameters (incl. random states) are the same for master and the PR. Of all 20 trees there is only one tree that differs between the two versions. It seems there are minor differences in the impurity scores (because we now don't compute them when processing a node but pass them down from the parent split node). If a node's impurity score is below a threshold (1e-7) we do not try to split the node since it is almost pure -- I tracked the number of times this happens and it differs quite a bit between master and this PR.

If this is indeed the cause, it should be affected by the setting of min_samples_split - I'll investigate that.

@pprett (Member Author) commented Nov 3, 2013

Here is a quick benchmark of sklearn vs gbm on covertype:

I used the same parameters for both (min_samples_leaf=5, n_estimators=100, num_leafs=4, learning_rate=0.1)

| Classifier | train-time | test-time | error-rate |
|------------|------------|-----------|------------|
| GBM        | 226.6512s  | 1.2698s   | 0.2102     |
| sklearn c  | 287.1446s  | 0.2773s   | 0.2166     |
| sklearn f  | 211.1848s  | 0.2883s   | 0.2166     |

f and c stand for fortran or c-style memory layout of X.

@GaelVaroquaux (Member)

> Here is a quick benchmark of sklearn vs gbm on covertype:

Looks really good. Impressive!

@pprett (Member Author) commented Nov 3, 2013

@GaelVaroquaux @glouppe I've updated the GBRT benchmark - I modified our tree code to support both fortran- and c-style inputs X (previously only c-style was supported) - will update the PR in a minute.

As expected, fortran layout is significantly better -- but we expect the advantage to diminish when random feature sampling is used - I'll keep you posted.

@pprett (Member Author) commented Nov 3, 2013

Performance after the above changes:

| branch | dataset   | estimator    | train   | test  | error  |
|--------|-----------|--------------|---------|-------|--------|
| PR     | Covertype | ExtraTrees*c | 145.90s | 0.43s | 0.0211 |
| PR     | Covertype | ExtraTrees*f | 115.51s | 0.48s | 0.0211 |

@glouppe this might interest you

@glouppe (Contributor) commented Nov 3, 2013

@pprett That's great! Handling both c-contiguous and fortran-style inputs without making a new copy is really nice! Thanks for working on this :)

Regarding the new max_leaf_nodes hyper-parameter: I already mentioned this on gtalk, but I'll say it here as well for the record. I am not sure using the current node impurity as priority is the best thing to do. In a true best-first strategy, I would rather prioritize the nodes by their potential weighted impurity decrease, that is:

n_node_samples[node] * impurity[node] 
    - n_node_samples[children_left[node]] * impurity[children_left[node]] 
    - n_node_samples[children_right[node]] * impurity[children_right[node]]
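As a pure-Python sketch, this priority is (array names mirror the Tree attributes used in the formula; the sample counts and impurity values below are made up for illustration):

```python
def weighted_impurity_decrease(node, n_node_samples, impurity,
                               children_left, children_right):
    """Potential weighted impurity decrease of expanding `node`."""
    left, right = children_left[node], children_right[node]
    return (n_node_samples[node] * impurity[node]
            - n_node_samples[left] * impurity[left]
            - n_node_samples[right] * impurity[right])

# Root with 10 samples split into children of 6 and 4 samples:
# 10 * 0.5 - 6 * 0.2 - 4 * 0.1 = 3.4
priority = weighted_impurity_decrease(
    0,
    n_node_samples=[10, 6, 4],
    impurity=[0.5, 0.2, 0.1],
    children_left=[1, -1, -1],
    children_right=[2, -1, -1],
)
```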

@@ -1459,9 +1682,9 @@ cdef class Tree:
return sizet_ptr_to_ndarray(self.n_node_samples, self.node_count)

Contributor:

Why do you need this? For easier inspection?

Member Author:

what are you referring to? the return value of pop?

Contributor:

I was referring to the max_depth attribute. But it seems to have disappeared?

Member Author:

ok, maybe you meant why I set max_depth at the end of Tree.build?

Contributor:

Yep. self.max_depth = max_depth_seen. Dunno why github messed with my comments :s

Contributor:

Btw, I am not against this addition, I was just surprised to see it. You could rename it to max_depth_ to enforce that it is learned.

Member Author:

It was just convenience to study the nature of the learned trees -- do you think we should keep it (renamed) or throw it out?

Member:

+1 for recording max_depth_, it's very useful IMHO.

@pprett (Member Author) commented Nov 3, 2013

@glouppe just to make sure I understand correctly

> Regarding the new max_leaf_nodes hyper-parameter, I just told on gtalk, but I would say it also here for the record: I am not sure using current node impurity as priority is the best thing to do. In a true best-first strategy, I would rather prioritize the nodes by their potential weighted impurity decreases, that is:
>
> n_node_samples[node] * impurity[node]
>     - n_node_samples[children_left[node]] * impurity[children_left[node]]
>     - n_node_samples[children_right[node]] * impurity[children_right[node]]

Suppose you create a new split node (node id node). Now you want to know which of its children to split first: using the above formula, you would need to treat each child as a split node itself (to get its left and right children) in order to evaluate which of the two you should indeed split first. Thus, you have to compute both splits and then discard one of them. This reminds me a lot of a "pruning" strategy... is this correct or am I missing something?

assert_almost_equal(0.84652100667116, score)
# est.fit(X_train, y_train)
# score = est.score(X_test, y_test)
# assert_almost_equal(score, 0.86908506408880637)
Member:

Why is this commented out?

Member Author:

well.. the honest answer is: because when you uncomment it, it fails - but I guess you know that...
The truth is that trees give slightly different results on 32-bit vs 64-bit architectures - during the sprint in Paris I made a change that made them equal, at the expense of 64-bit precision. I'd rather see the test removed.

Contributor:

+1 for removing the test. It is futile to try to get exactly the same results on different architectures, always, in all cases.

Member:

Does it pass with assert_almost_equal(score, 0.869, 2)?

@mblondel (Member) commented Jan 7, 2014

> Sorry for coming late to the party, but I'd be +1 on fit_more and -100 on adding another CV object as the API is fundamentally flawed.

Can you elaborate on the flawed API?

@amueller (Member) commented Jan 7, 2014

@mblondel if you have a pipeline with a CV object and want to do grid-searches, you will always do a nested cross-validation, i.e. GridSearchCV will do a train/test split and the CV object will do another split. But you actually just want to do one cross-validation where GridSearchCV can take advantage of some path algorithm (as used by the CV objects). See #1626.

remove troublesome 32bit test
@coveralls

Coverage Status

Coverage remained the same when pulling 73c7de3 on pprett:gbrt-enh-stackrec-greedy into b57b3a3 on scikit-learn:master.

@pprett (Member Author) commented Jan 8, 2014

@ogrisel I've addressed the two issues with the uninformative exceptions. I also removed the 32bit test - the difference is actually quite significant (64bit: 0.84758663315474814, 32bit: 0.82891483480560135).

If you're fine with that, I'd merge.

@coveralls

Coverage Status

Coverage remained the same when pulling 9a3c22a on pprett:gbrt-enh-stackrec-greedy into b57b3a3 on scikit-learn:master.

"%r" % max_leaf_nodes)
if -1 < max_leaf_nodes < 2:
raise ValueError(("max_leaf_nodes %r must be either smaller than 0 or "
"larger than 1").format(max_leaf_nodes))
Member:

Apparently this is not tested, as it would raise a formatting exception; it should be:

            raise ValueError("max_leaf_nodes {0} must be either smaller than 0 or "
                             "larger than 1".format(max_leaf_nodes))

Note that besides the %r / {0} placeholder fix, you also should not need the additional pair of parens. But better add a test first to check that this is actually a fix.

Member:

Actually coveralls reports this line as being tested, but the test does not cover the result of the formatting:

>>> "a{0}a".format('a')
'aaa'
>>> "a%ra".format('a')
'a%ra'

Member:

You can use:

from sklearn.utils.testing import assert_raise_message

assert_raise_message(ValueError, expected_msg, my_callable, args)

to check the content of a message.
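For reference, such a helper boils down to something like this (a self-contained stand-in, not the actual sklearn.utils.testing implementation; the validate function below is a hypothetical example mirroring the corrected max_leaf_nodes check discussed above):

```python
def check_raise_message(exc_type, expected_msg, callable_, *args, **kwargs):
    """Assert that callable_(*args, **kwargs) raises exc_type and that
    expected_msg is contained in the exception message."""
    try:
        callable_(*args, **kwargs)
    except exc_type as exc:
        assert expected_msg in str(exc), (
            "%r not found in %r" % (expected_msg, str(exc)))
    else:
        raise AssertionError("%s not raised" % exc_type.__name__)

def validate(max_leaf_nodes):
    # uses the corrected {0}-style formatting from above
    if -1 < max_leaf_nodes < 2:
        raise ValueError("max_leaf_nodes {0} must be either smaller than 0 or "
                         "larger than 1".format(max_leaf_nodes))

check_raise_message(ValueError, "max_leaf_nodes 0 must be", validate, 0)
```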

Member Author:

Apparently it doesn't raise a formatting exception, because the line is covered -- thanks for the additional pair of eyeballs, I completely missed that. I'd rather not alter the whole test suite to check the message of each exception, though...

Member:

I think it's OK not to rewrite the whole test suite to use assert_raise_message, but from now on it's better to use assert_raise_message instead of assert_raises whenever the message is stable: it makes the test suite more explicit (and easier to read, since the motivation is spelled out).

@coveralls

Coverage Status

Coverage remained the same when pulling 792a09c on pprett:gbrt-enh-stackrec-greedy into b57b3a3 on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 05d88c7 on pprett:gbrt-enh-stackrec-greedy into b57b3a3 on scikit-learn:master.

@ogrisel (Member) commented Jan 8, 2014

This PR looks good to me. +1 for merging on my side.

pprett added a commit that referenced this pull request Jan 8, 2014
[MRG] Gradient Boosting enhancements
@pprett pprett merged commit b8bea8e into scikit-learn:master Jan 8, 2014
@pprett (Member Author) commented Jan 8, 2014

merge all the things

@glouppe (Contributor) commented Jan 8, 2014

Phew... finally! Great work Peter :)

Next time, let's make our changes smaller ;)

@glouppe (Contributor) commented Jan 8, 2014

and thanks all for the reviews.

@GaelVaroquaux (Member)

It only took two months. Many PRs are open for longer.

@pprett (Member Author) commented Jan 8, 2014

yeah - thanks for the thorough reviews!

will do a smaller one next time... I added stuff successively... bad style

@arjoly (Member) commented Jan 8, 2014

Well done @pprett !!! 🍻

@emef commented May 4, 2014

@pprett really happy to see these changes I was hoping for last year; all really nice improvements! nice work
