
FIX several bugs in initial predictions for GBDT #12983

Merged · 46 commits merged into scikit-learn:master on Mar 1, 2019 · 7 participants

@NicolasHug (Contributor) commented Jan 14, 2019

Reference Issues/PRs

Closes #12436, a continuation of the work from @jeremiedbb.

What does this implement/fix? Explain your changes.

This PR:

  • Fixes the use of the init estimator parameter in GBDTs.
  • Exposes init='zero', which was supported but undocumented.
  • Deprecates all the losses and init estimators in gradient_boosting.py:
    • the loss classes have been moved to a losses.py file and have a new get_init_raw_predictions method;
    • the init estimators are replaced by an equivalent instance of DummyClassifier or DummyRegressor.
  • Renames y_pred to raw_predictions, along with the relevant (private) methods.
  • Fixes some inaccurate docstrings.
  • "Fixes" the initial default prediction for multiclass classification: the raw prediction used to be the class priors, but it should be the log of the priors. In practice the difference in accuracy is minimal: over 100 runs, I observed an average improvement in test accuracy of 0.000388 (std = 0.008266) between the two methods, with n_estimators=1 and n_samples=10000.

The loss.get_init_raw_predictions(X, predictor) methods return raw predictions with the correct shape (n_samples, K), where K is the number of classes in multiclass classification, and 1 otherwise.

Those raw predictions are homogeneous to what the trees predict, not to what the ensemble predicts. For regression GBDTs these are the same thing, but for e.g. binary classification the raw prediction is homogeneous to a log-odds ratio.
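To make this concrete, here is a minimal sketch (illustrative only, not the PR's actual code; the dummy data and the clipping step are just for the example) of how a raw prediction homogeneous to a log-odds ratio can be derived from a fitted init estimator in the binary case:

import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((8, 1))                    # features are ignored by the dummy
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # P(y=1) = 0.25

init = DummyClassifier(strategy='prior').fit(X, y)
proba_pos = init.predict_proba(X)[:, 1]

# Clip to avoid log(0), then map the probabilities to log-odds, i.e. the
# scale the trees operate on for binary classification.
eps = np.finfo(np.float64).eps
proba_pos = np.clip(proba_pos, eps, 1 - eps)
raw_predictions = np.log(proba_pos / (1 - proba_pos)).reshape(-1, 1)
raw_predictions.shape  # (8, 1): K == 1 outside of multiclass classification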

For more, see #12436 (comment) and the following discussion (not sure if this is much clearer ^^).

Any other comments?

@NicolasHug changed the title from [WIP] Fix GBDT init estimators to [MRG] Fix GBDT init estimators on Jan 15, 2019

@glemaitre (Contributor) left a comment: Partial review

@@ -61,7 +63,15 @@
from ..exceptions import NotFittedError


class QuantileEstimator(object):
# 0.23

@glemaitre (Contributor), Jan 28, 2019:

You can add a FIXME here.

    dtype=np.float64)
else:
    try:
        self.init_.fit(X, y, sample_weight=sample_weight)

@glemaitre (Contributor), Jan 28, 2019:

I think we are using something like this to check the support for sample_weight:

support_sample_weight = has_fit_parameter(self.init_, "sample_weight")
if not support_sample_weight and not sample_weight_is_none:
    raise ValueError("xxxx")
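(For reference, a quick sketch of how has_fit_parameter behaves; the estimators here are just examples:)

from sklearn.utils.validation import has_fit_parameter
from sklearn.dummy import DummyRegressor
from sklearn.neighbors import KNeighborsRegressor

has_fit_parameter(DummyRegressor(), "sample_weight")       # True
has_fit_parameter(KNeighborsRegressor(), "sample_weight")  # False, fit(X, y) only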
If None it uses ``loss.init_estimator``.
init : estimator or 'zero', optional (default=None)
    An estimator object that is used to compute the initial predictions.
    ``init`` has to provide `fit` and `predict_proba`. If 'zero', the

@glemaitre (Contributor), Jan 28, 2019:

Are we using double or single backticks here?

@NicolasHug (Author), Jan 28, 2019:

I think we use single backticks for anything that has a glossary entry (which will automatically link it) and double backticks for everything else (to avoid Sphinx warnings).

@thomasjpfan (Contributor), Feb 7, 2019:

@glemaitre Once #13000 gets merged, we do not need to worry about this anymore! Single backticks without a reference will automatically be treated as double backticks!

Number of classes
"""
def init_estimator(self):
    # Predicts the median

@glemaitre (Contributor), Jan 28, 2019:

I think you can remove this comment. Strategy 'quantile' with 0.5 should be self-explanatory (and the same goes for Least Absolute Error).
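(A quick sketch of the point, with made-up data: DummyRegressor with strategy='quantile' and quantile=0.5 predicts the median by construction:)

import numpy as np
from sklearn.dummy import DummyRegressor

X = np.zeros((5, 1))  # features are ignored by DummyRegressor
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

est = DummyRegressor(strategy='quantile', quantile=0.5).fit(X, y)
est.predict(X[:1])  # array([3.]), the median of y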


@glemaitre self-requested a review on Jan 28, 2019

@glemaitre (Contributor) left a comment: Another partial review

raw_predictions : array, shape (n_samples, K)
    The raw_predictions (i.e. values from the tree leaves)
sample_weight : array-like, shape (n_samples,), optional

@glemaitre (Contributor), Jan 28, 2019:

Is it array-like?

the does not support probabilities raises AttributeError.
"""
raise TypeError(
    '%s does not support predict_proba' % type(self).__name__)

@glemaitre (Contributor), Jan 28, 2019:

self.__class__.__name__, this might be the same actually
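(Indeed, for ordinary classes the two spellings are equivalent; a quick sketch:)

class Foo:
    pass

f = Foo()
# type(f) and f.__class__ refer to the same class object here, so the two
# names match; they can only differ if __class__ is overridden.
assert type(f).__name__ == f.__class__.__name__ == 'Foo'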

@jnothman (Member) left a comment:

I can't say I've done my most thorough review ever, but I think this is very good. Not sure if the losses need more tests. Thoughts?

assert_equal(sw_out.shape, (y.shape[0], 1))

# check if predictions match
assert_array_equal(out, sw_out)
assert_allclose(out, sw_out, rtol=1e-2)


@jnothman (Member), Feb 10, 2019:

Should there be more extensive tests of the new interface to losses?

@NicolasHug (Author, Contributor) commented Feb 11, 2019

Thanks @jnothman

I've added sanity checks for the losses in 2c06512, and it made me realize some issues that I fixed in e2e1a5c, but I need your input (@glemaitre as well, please :) ):

I've added a check in the losses that raises a ValueError when the user passes an init estimator that wasn't trained on the same kind of data as the GradientBoosting instance. For example: training the initial estimator on 3 classes while only 2 classes are passed to the GBDT for training. I chose to raise a ValueError with an error message basically saying "this is the user's fault".

The thing is, the ValueError can also be raised because of early stopping: the training data is split into train and val, and some classes might not be present in train. The GBDT loss expects 3 classes but only 2 of them are present in train, which is what the default init estimator will be fitted on. In this case the error message is not clear on what the user should do.

Maybe splitting the data with train_test_split(..., stratify=y) would prevent this?

Note: please ignore the removed # FIXME comments in e2e1a5c; I'll add them back if I can, but the fix forces me to remove these checks.


@jnothman (Member) commented Feb 12, 2019

Stratified splitting could help, but might not if there is a very rare class. Is the problem merely that you don't know how to align the prediction to the validation data? Would initialising the classifier loss with a list of classes known from training facilitate fixing this problem?

@NicolasHug (Author, Contributor) commented Feb 12, 2019

Is the problem merely that you don't know how to align the prediction to the validation data?

No, the problem isn't related to the validation data; it's related to the training data.


Basically, the ValueErrors raised in e2e1a5c may happen in two scenarios:

  • The user has passed an init estimator that was trained on a different number of classes than the GBDT sees. In this case the error message is appropriate, since this is the user's fault.
  • The user has passed an init estimator, or is using the default init estimator, AND uses early stopping. If one of the classes is not represented in the actual training data (after train_test_split), the init predictor is going to be trained on this data, and init_predictor.predict_proba() will be of shape (n_samples, n_classes - 1) while the GBDT expects it to be of shape (n_samples, n_classes); see the sketch after this list. In this case the error message is not helpful because it doesn't tell the user what to do. Here we can either:
    • raise an error with a more helpful message (I think that using stratify in train_test_split would be the right solution), or
    • not raise an error, try to find which of the classes has been ignored, and reconstruct a valid probas array. That might be tricky and will definitely make the code more complex than it needs to be.
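Here is a minimal sketch of that second scenario (illustrative data, not the actual GBDT code):

import numpy as np
from sklearn.dummy import DummyClassifier

y_full = np.array([0, 0, 1, 1, 2, 2])  # the GBDT loss expects 3 classes
y_train = y_full[:4]                   # class 2 was lost in the split

init = DummyClassifier(strategy='prior').fit(np.zeros((4, 1)), y_train)
init.predict_proba(np.zeros((4, 1))).shape  # (4, 2), not the expected (4, 3)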

I'm sorry I find it hard to explain clearly and concisely.

I think that using stratify in train_test_split would solve the issue, because train_test_split raises an error when one of the classes is missing from y_train. Regardless of this issue, I think that early stopping splits should be stratified anyway. Would it be OK to use it and consider it a bugfix?
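For instance (a sketch with made-up data):

import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 96 + [1] * 96 + [2] * 8)  # class 2 is rare
X = np.random.randn(y.size, 3)

# Without stratification, class 2 can end up entirely in the validation set.
# With stratify=y, both sides keep all three classes (or an error is raised
# when a class is too rare to split).
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5,
                                                  stratify=y)
assert set(np.unique(y_train)) == {0, 1, 2}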

@jnothman (Member) commented Feb 12, 2019

The init predictor's output can be fixed by aligning it up through classes_. Does this happen elsewhere in the library with validation splits?

@NicolasHug (Author, Contributor) commented Feb 12, 2019

The init predictor's output can be fixed by aligning it up through classes_

Yes, but we still need to identify which of the classes are missing in y_train, and the loss is not supposed to know anything about y_pred at this point (only its init estimator knows about it, but it cannot itself know which classes are missing). That's what I mean by tricky / overly complicated.

Does this happen elsewhere in the library with validation splits?

I don't think the same issue happens in other estimators. This issue happens because our main estimator (GBDT) relies on another estimator (the init estimator), which is passed only a subset of the data, where some classes might be missing because of early stopping. GBDTs are the only estimators that use early stopping together with another estimator, as far as I know.

I took a look at different early stopping implementations:

  • All estimators inheriting from BaseSGD (Perceptron, PassiveAggressive, SGDClassifier) use stratified splits.
  • Multilayer perceptron and gradient boosting do not use stratified splits.

Would it be acceptable for me to open a new PR to "bugfix" multilayer perceptron and gradient boosting by making the splits stratified, and wait for it to be merged?

@amueller (Member) commented Feb 13, 2019

Would it be acceptable for me to open a new PR to "bugfix" multilayer perceptron and gradient boosting by making the splits stratified, and wait for it to be merged?

fine with me.

NicolasHug added some commits Feb 28, 2019

@glemaitre (Contributor) commented Feb 28, 2019

I am still +1 here

@jnothman changed the title from [MRG+1] Fix GBDT init estimators to FIX several bugs in initial predictions for GBDT on Mar 1, 2019

@jnothman merged commit ea63c56 into scikit-learn:master on Mar 1, 2019

@jnothman (Member) commented Mar 1, 2019

Thanks for a phenomenal effort, @NicolasHug!

@agramfort (Member) commented Mar 1, 2019
