
FIX several bugs in initial predictions for GBDT #12983

Merged (46 commits, Mar 1, 2019)

Conversation

@NicolasHug (Member) commented Jan 14, 2019

Reference Issues/PRs

Closes #12436; continuation of the work from @jeremiedbb.

What does this implement/fix? Explain your changes.

This PR:

  • Fixes the use of the init estimator parameter in GBDTs.
  • Exposes init='zero', which was supported but undocumented.
  • Deprecates all the losses and init estimators in gradient_boosting.py:
    • the loss classes have been moved to a losses.py file and have a new get_init_raw_predictions method
    • the init estimators are replaced by an equivalent instance of DummyClassifier or DummyRegressor
  • Renames y_pred to raw_predictions, along with the relevant (private) methods.
  • Fixes some inaccurate docstrings.
  • "Fixes" the initial default prediction for multiclass classification: the raw prediction used to be the class priors, but it should be the log of the priors. In practice the difference in accuracy is minimal: over 100 runs, I observed an average improvement in test accuracy of 0.000388 (std = 0.008266) between the two methods, with n_estimators=1 and n_samples=10000.

The loss.get_init_raw_predictions(X, predictor) methods return raw_predictions with the correct shape (n_samples, K), where K is the number of classes in multiclass classification and 1 otherwise.

Those raw_predictions are homogeneous to what the trees predict, not to what the ensemble predicts. For regression GBDTs these are the same thing, but for e.g. binary classification the raw_prediction is homogeneous to a log-odds ratio.
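To make the distinction concrete, here is a toy sketch (illustrative only, not scikit-learn's actual implementation) of how an initial raw prediction that is homogeneous to a log-odds ratio relates to the class prior in binary classification:

```python
import numpy as np

# Toy binary labels; the positive-class prior is 0.25.
y = np.array([0, 0, 0, 1])

prior = y.mean()
# The initial raw prediction is homogeneous to a log-odds ratio,
# i.e. what the trees predict, not a probability.
raw_init = np.log(prior / (1 - prior))

# Applying the sigmoid link maps the raw prediction back to the prior.
proba = 1.0 / (1.0 + np.exp(-raw_init))
print(raw_init, proba)
```

The inverse link recovers the prior exactly, which is why initializing with the log of the priors (rather than the priors themselves) is the consistent choice for classification losses.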

For more, see #12436 (comment) and following discussion (not sure if this is much clearer ^^)

Any other comments?

@NicolasHug NicolasHug changed the title [WIP] Fix GBDT init estimators [MRG] Fix GBDT init estimators Jan 15, 2019
@glemaitre left a comment

Partial review

@@ -61,7 +63,15 @@
from ..exceptions import NotFittedError


class QuantileEstimator(object):
# 0.23

You can add a FIXME:

dtype=np.float64)
else:
try:
self.init_.fit(X, y, sample_weight=sample_weight)

I think that we are using something like this to check the support for sample_weight:

    support_sample_weight = has_fit_parameter(self.init_, "sample_weight")
    if not support_sample_weight and not sample_weight_is_none:
        raise ValueError("xxxx")
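For reference, a quick sketch of the check suggested above: sklearn.utils.validation.has_fit_parameter simply inspects the signature of the estimator's fit method.

```python
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils.validation import has_fit_parameter

# LinearRegression.fit accepts a sample_weight argument,
# KNeighborsClassifier.fit does not.
print(has_fit_parameter(LinearRegression(), "sample_weight"))
print(has_fit_parameter(KNeighborsClassifier(), "sample_weight"))
```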

If None it uses ``loss.init_estimator``.
init : estimator or 'zero', optional (default=None)
An estimator object that is used to compute the initial predictions.
``init`` has to provide `fit` and `predict_proba`. If 'zero', the

are we using the double or single backticks?

@NicolasHug (PR author):

I think we use single backticks for anything that has a glossary entry (which will automatically link it) and double backticks for everything else (to avoid Sphinx warnings)


@glemaitre Once #13000 gets merged, we do not need to worry about this anymore! Single backticks without a reference will automatically be treated as double backticks!

(several review threads on sklearn/ensemble/losses.py, now resolved)
Number of classes
"""
def init_estimator(self):
# Predicts the median

I think that you can remove this comment. Strategy 'quantile' with quantile=0.5 should be self-explanatory (and this is the least absolute error) as well

@glemaitre glemaitre self-requested a review January 28, 2019 13:13
@glemaitre left a comment

Another partial review

raw_predictions : array, shape (n_samples, K)
The raw_predictions (i.e. values from the tree leaves)

sample_weight : array-like, shape (n_samples,), optional

is it array-like?

that does not support probabilities raises AttributeError.
"""
raise TypeError(
'%s does not support predict_proba' % type(self).__name__)

`self.__class__.__name__`, this might be the same actually

@NicolasHug commented Feb 12, 2019

Is the problem merely that you don't know how to align the prediction to the validation data?

No the problem isn't related to the validation data. It's related to the training data.


Basically the raised ValueErrors in e2e1a5c may happen in 2 scenarios:

  • User has passed an init estimator that was trained on a different number of classes than what the GBDT is trained on. In this case the error message is appropriate, since this is the user's fault.
  • User has passed an init estimator, or is using the default init estimator, AND uses early stopping. If one of the classes is not represented in the actual training data (after train_test_split), the init predictor is going to be trained on this data and init_predictor.predict_proba() will be of shape (n_samples, n_classes - 1), but the GBDT expects it to be of shape (n_samples, n_classes). In this case the error message is not helpful because it doesn't tell the user what to do. In this case we can either:
    • Raise an error with a more helpful message (I think that using stratify in train_test_split would be the right solution)
    • Not raise an error, try to find which of the classes has been ignored, and reconstruct a valid probas array. That might be tricky and will definitely make the code more complex than it needs to be.
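A toy sketch of the second scenario (illustrative; DummyClassifier stands in for the default init estimator):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Suppose the full problem has 3 classes, but class 2 did not survive
# the early-stopping train/test split: y_train only contains 0 and 1.
X_train = np.zeros((4, 1))
y_train = np.array([0, 0, 1, 1])

init = DummyClassifier(strategy="prior").fit(X_train, y_train)
probas = init.predict_proba(np.zeros((5, 1)))

# Shape is (n_samples, 2), while the GBDT expects (n_samples, 3).
print(probas.shape)
```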

I'm sorry I find it hard to explain clearly and concisely.

I think that using stratify in train_test_split would solve the issue because train_test_split raises an error when one of the classes is missing in y_train. Regardless of this issue, I think that early stopping splits should be stratified anyway. Would it be OK to use it and consider it a bugfix?
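A minimal sketch of the proposed fix: with stratify=y, train_test_split fails loudly when a class cannot appear on both sides of the split, instead of silently producing a y_train with a missing class:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((100, 1))
y = np.array([0] * 90 + [1] * 9 + [2])  # class 2 has a single sample

try:
    train_test_split(X, y, test_size=0.1, stratify=y)
    raised = False
except ValueError as exc:
    # stratified splitting rejects classes with too few members
    raised = True
    print(exc)
```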

@jnothman commented Feb 12, 2019 via email

@NicolasHug

The init predictor's output can be fixed by aligning it up through classes_

Yes, but we still need to identify which of the classes are missing in y_train, and the loss is not supposed to know anything about y_pred at this point (only the init estimator knows about it, and even it cannot know which classes are missing). That's what I mean by tricky / overly complicated.

Does this happen elsewhere in the library with validation splits?

I don't think the same issue happens in other estimators. This issue is happening because our main estimator (GBDT) relies on another estimator (the init estimator) which is passed only a subset of the data, where some classes might be missing, because of early stopping. GBDTs are the only estimators that use early stopping plus another estimator, as far as I know.

I took a look at different early stopping implementations:

  • All estimators inheriting from BaseSGD (Perceptron, PassiveAggressive, SGDClassifier) use stratified splits.
  • Multilayer perceptron and gradient boosting do not use stratified splits.

Would it be acceptable that I open a new PR to "bugfix" multilayer perceptron and gradient boosting by making the splits stratified, and wait for it to be merged?

@amueller

Would it be acceptable that I open a new PR to "bugfix" multilayer perceptron and gradient boosting by making the splits stratified, and wait for it to be merged?

fine with me.

@glemaitre commented Feb 28, 2019

I am still +1 here

@jnothman jnothman changed the title [MRG+1] Fix GBDT init estimators FIX several bugs in initial predictions for GBDT Mar 1, 2019
@jnothman jnothman merged commit ea63c56 into scikit-learn:master Mar 1, 2019
@jnothman commented Mar 1, 2019

Thanks for a phenomenal effort, @NicolasHug!

@agramfort commented Mar 1, 2019 via email
