
[MRG] FIX gradient boosting with sklearn estimator as init #12436

Closed
jeremiedbb wants to merge 11 commits into master from jeremiedbb:gbc-with-init

jeremiedbb (Contributor) commented Oct 22, 2018

Fixes #10302, Fixes #12429, Fixes #2691

Gradient Boosting used to fail when init was a sklearn estimator, which is a bit ironic :)
The issue was that the output of predict didn't have the expected shape. And apparently there was no test for the init parameter with estimators other than the default.

Edit: also accept an initial estimator which does not support sample weights, as long as the gradient boosting itself is not fit with sample weights.
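
For context, a minimal reproduction of the failure on master (a sketch; any sklearn regressor passed as init triggers the shape mismatch):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

X, y = make_regression(random_state=0)
# on master this raises, because init_.predict(X) returns shape
# (n_samples,) while the gradient boosting code expects (n_samples, 1)
GradientBoostingRegressor(init=LinearRegression()).fit(X, y)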

rth (Member) left a comment

I'm not too familiar with the GB code, but this looks good to me except for a minor comment below.

You also need to repeat "close" or "fix" in the description for every issue you want to auto-close, e.g. close #123, close #1234, etc.

I can confirm the added test fails on master.

Maybe @glemaitre would be able to do another review.

@@ -1421,7 +1421,9 @@ def fit(self, X, y, sample_weight=None, monitor=None):
             self.init_.fit(X, y, sample_weight)

             # init predictions
-            y_pred = self.init_.predict(X)
+            y_pred = self.init_.predict(X).astype(np.float64, copy=False)
+            if y_pred.shape == (X.shape[0],):

rth (Member) commented Oct 23, 2018

Wouldn't,

y_pred.ndim == 1

be enough?

jeremiedbb (Author, Contributor) commented Oct 23, 2018

probably :)

@@ -1421,7 +1421,9 @@ def fit(self, X, y, sample_weight=None, monitor=None):
             self.init_.fit(X, y, sample_weight)

             # init predictions
-            y_pred = self.init_.predict(X)
+            y_pred = self.init_.predict(X).astype(np.float64, copy=False)

rth (Member) commented Oct 23, 2018

What happens if you don't cast it to 64-bit?

jeremiedbb (Author, Contributor) commented Oct 23, 2018

An error is raised due to unsafe casting, because y_pred is updated later by the actual fit of the gradient boosting and filled with float64 values:

y_pred[:, k] += (learning_rate * tree.value[:, 0, 0].take(terminal_regions, axis=0))
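
A minimal sketch of that failure mode in isolation (illustrative values, not the actual GB code):

import numpy as np

y_pred = np.zeros(5, dtype=np.int64)   # e.g. class labels from init_.predict
update = 0.1 * np.ones(5)              # float64 contributions from the trees

# raises a casting error: the float64 result cannot be written back into
# the int64 array under numpy's 'same_kind' casting rule
y_pred += update
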
rth (Member) commented Oct 23, 2018

Also please add a what's new entry. I guess we could try to squeeze it into 0.20.1, since it's a long-standing issue that has been reported multiple times.

jeremiedbb (Author, Contributor) commented Oct 23, 2018

@stoddardg pointed out another related bug: when the init estimator does not support sample weights, it crashes even if one calls fit with sample_weight=None.
I'm removing the MRG tag temporarily, but I think I have a fix for that too.

@jeremiedbb jeremiedbb changed the title [MRG] FIX gradient boosting with sklearn estimator as init [WIP] FIX gradient boosting with sklearn estimator as init Oct 23, 2018

@jeremiedbb jeremiedbb force-pushed the jeremiedbb:gbc-with-init branch from 7d3c06e to 53abfd6 Oct 23, 2018

@jeremiedbb jeremiedbb changed the title [WIP] FIX gradient boosting with sklearn estimator as init [MRG] FIX gradient boosting with sklearn estimator as init Oct 24, 2018

-        # fit initial model - FIXME make sample_weight optional
-        self.init_.fit(X, y, sample_weight)
+        # fit initial model
+        if has_fit_parameter(self.init_, "sample_weight"):

jnothman (Member) commented Oct 28, 2018

We should avoid has_fit_parameter unless try-except really won't work.

jnothman (Member) commented Oct 28, 2018

It means keyword args won't work. We should only use this kind of thing when we really need to take a different approach when sample_weight is not supported. (Feel free to document this caveat more in a separate PR.)
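
For reference, the try/except shape the PR converges on further down in the diff looks roughly like this (a sketch; sample_weight_is_none is the flag introduced later in this review):

try:
    self.init_.fit(X, y, sample_weight=sample_weight)
except TypeError:
    if sample_weight_is_none:
        # the init estimator doesn't accept sample_weight, but none was
        # passed anyway, so fitting without it is fine
        self.init_.fit(X, y)
    else:
        raise ValueError(
            "The initial estimator {} does not support sample "
            "weights.".format(self.init_.__class__.__name__))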

@jeremiedbb jeremiedbb force-pushed the jeremiedbb:gbc-with-init branch from cc85fee to 12c18f4 Oct 29, 2018

jnothman (Member) left a comment

I'm not certain this should go in 0.20.1, but whatever...?

Nice work!


jeremiedbb (Author, Contributor) commented Nov 8, 2018

@rth I made some changes since your last review; maybe you want to take another look.

I'm not certain this should go in 0.20.1

It's not a regression from 0.19 so maybe it should go in 0.21 indeed.

@jeremiedbb jeremiedbb force-pushed the jeremiedbb:gbc-with-init branch 2 times, most recently from ae99693 to 92cb295 Nov 8, 2018

# init does not support sample weights
init_est = _NoSampleWeightWrapper(init())
gb(init=init_est).fit(X, y) # ok no sample weights
with pytest.raises(ValueError):

rth (Member) commented Nov 18, 2018

To be more specific about the error,

Suggested change
-        with pytest.raises(ValueError):
+        with pytest.raises(
+                ValueError,
+                match=r'.*estimator.*does not support sample weights.*'):
gb, dataset_maker, init = task

X, y = dataset_maker()
sample_weight = np.random.rand(100)

rth (Member) commented Nov 18, 2018

Let's use an RNG:

Suggested change
-    sample_weight = np.random.rand(100)
+    sample_weight = np.random.RandomState(42).rand(100)
# estimator.
# Check that an error is raised if trying to fit with sample weights but
# the initial estimator does not support sample weights
gb, dataset_maker, init = task

rth (Member) commented Nov 18, 2018

Maybe use a slightly more explicit name here: init_estimator.

else:
raise ValueError(
"The initial estimator {} does not support sample "
"weights yet.".format(self.init_.__class__.__name__))

rth (Member) commented Nov 18, 2018
Suggested change
-                "weights yet.".format(self.init_.__class__.__name__))
+                "weights.".format(self.init_.__class__.__name__))

jeremiedbb (Author, Contributor) commented Nov 18, 2018

It was to point out that it's not a definitive statement :)

if sample_weight is None:
sample_weight = np.ones(n_samples, dtype=np.float32)
sample_weight_was_none = True

rth (Member) commented Nov 18, 2018
Suggested change
-            sample_weight_was_none = True
+            sample_weight_is_none = True
else:
sample_weight = column_or_1d(sample_weight, warn=True)
sample_weight_was_none = False

rth (Member) commented Nov 18, 2018
Suggested change
-            sample_weight_was_none = False
+            sample_weight_is_none = False
# fit initial model
try:
self.init_.fit(X, y, sample_weight=sample_weight)
except TypeError:

rth (Member) commented Nov 18, 2018

Maybe calling inspect.signature(self.init_.fit) and adapting the kwargs accordingly would be cleaner? I wonder what happens if one gets a genuine, unrelated TypeError here; understanding the traceback would then be somewhat difficult.
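
A sketch of the signature-based alternative suggested here (illustrative; not what the PR ends up doing):

import inspect

fit_params = inspect.signature(self.init_.fit).parameters
if "sample_weight" in fit_params:
    self.init_.fit(X, y, sample_weight=sample_weight)
else:
    self.init_.fit(X, y)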

rth (Member) commented Nov 18, 2018

Never mind, I see: it was the case originally and Joel requested the opposite change in #12436 (comment).

NicolasHug (Contributor) left a comment

Happy to see this fixed!

The current fix does not support multiclass classification; I added some suggestions.

@pytest.mark.parametrize(
"task",
[(GradientBoostingClassifier, make_classification, DummyClassifier),
(GradientBoostingRegressor, make_regression, DummyRegressor)],

NicolasHug (Contributor) commented Nov 28, 2018

Testing for multiclass classification currently fails:

def make_multiclass():
    return make_classification(n_classes=3, n_clusters_per_class=1)


@pytest.mark.parametrize(
    "task",
    [(GradientBoostingClassifier, make_classification, DummyClassifier),
     (GradientBoostingClassifier, make_multiclass, DummyClassifier),
     (GradientBoostingRegressor, make_regression, DummyRegressor)],
    ids=["classification", "multiclass", "regression"])
def test_gradient_boosting_with_init(task):
    ...

# will fail with make_multiclass
-        y_pred = self.init_.predict(X)
+        y_pred = self.init_.predict(X).astype(np.float64, copy=False)
+        if y_pred.ndim == 1:
+            y_pred = y_pred[:, np.newaxis]

NicolasHug (Contributor) commented Nov 28, 2018
Suggested change
-            y_pred = y_pred[:, np.newaxis]
+            if self.loss_.K >= 2:
+                # multiclass classification needs y_pred to be of shape
+                # (n_samples, K) where K is the number of trees per
+                # iteration, i.e. the number of classes.
+                # see e.g. PriorProbabilityEstimator or ZeroEstimator
+                n_samples = y_pred.shape[0]
+                y_pred = np.tile(y_pred, self.n_classes_)
+                y_pred = y_pred.reshape(n_samples, self.loss_.K)
+            else:
+                y_pred = y_pred[:, np.newaxis]

jeremiedbb (Author, Contributor) commented Nov 30, 2018

Good catch. Wouldn't it be better to one-hot encode y_pred?
I pushed it; let me know which version is best.

NicolasHug (Contributor) commented Nov 30, 2018

Yes, one-hot seems to be the better option. I think my version would have been equivalent to a zero initialization (or any constant value).

BTW, I realized we could use if loss.is_multi_class to be more explicit.
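
A minimal sketch of the one-hot variant under discussion (the helper name is illustrative, not the PR's actual code):

import numpy as np

def one_hot_init_predictions(init_estimator, X, classes):
    # one column per class, 1.0 in the column of the predicted class
    predicted = init_estimator.predict(X)
    y_pred = np.zeros((X.shape[0], len(classes)), dtype=np.float64)
    for k, cls in enumerate(classes):
        y_pred[:, k] = (predicted == cls)
    return y_pred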

amueller (Member) commented Nov 28, 2018

Should we detect that this is happening and bail in the partial dependence plot? Non-constant init estimators will make the partial dependence plot misleading / wrong.

NicolasHug (Contributor) commented Nov 28, 2018

@amueller what do you mean by "this is happening"? I don't think this is related to the partial dependence plot; we're simply fixing a shape issue here.

amueller (Member) commented Nov 28, 2018

By "this" I mean "a non-constant init estimator". So far in our implementation init estimators were always constant. After this patch they will not be.
In the limited setting of before the patch, our partial dependence plot implementation was meaningful, in the more general setting after the patch, it is not.

rth dismissed their stale review Nov 30, 2018 (outdated review)

NicolasHug (Contributor) commented Dec 4, 2018

@amueller I guess we could merge this as-is, and simply add a warning in the doc in #12599 about this?

jnothman (Member) left a comment

LGTM again! Thanks @NicolasHug!

jeremiedbb added some commits Nov 30, 2018

@jeremiedbb jeremiedbb force-pushed the jeremiedbb:gbc-with-init branch from b5d1462 to 1129cb1 Dec 17, 2018

NicolasHug (Contributor) commented Dec 26, 2018

Thinking about this again, I don't think it makes any sense to have an init predictor in a classification setting (it's OK for regression).

This y_pred variable that we are initializing here contains the sum of the raw values from the trees, for each training sample. Those trees are all regression trees, even for classification problems, so y_pred contains continuous values. For classification, those raw values in y_pred first have to go through a sigmoid or a softmax to be converted into probabilities, and then into classes.

So currently init_.predict() will predict classes, while it should instead be predicting continuous values that somehow make sense. By 'making sense' here I mean values that minimize the loss, as done by the LogOddsEstimator class used by default.

This is not a problem in regression because ensemble.predict(X) == sum(tree.predict(X) for tree in estimators_): the prediction of the ensemble is homogeneous to the predictions of the trees, and doesn't need to go through another function (sigmoid or softmax) to be converted into a final prediction.
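
For instance, for a least-squares GradientBoostingRegressor gbr, the decomposition can be checked directly (a sketch, assuming this PR's fix so that init_ can be any fitted regressor):

import numpy as np

raw = gbr.init_.predict(X).ravel().astype(np.float64)
for stage in gbr.estimators_:          # one regression tree per stage
    raw += gbr.learning_rate * stage[0].predict(X)
np.testing.assert_allclose(raw, gbr.predict(X))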

I would propose to either:

  • deprecate the initial predictor for classification and ultimately remove it
  • only accept a predictor that has predict_proba and pass those probabilities through log(p / (1 - p)) (or the equivalent for multiclass classification).

Slightly related: lots of the docstrings actually state that y_pred contains the predicted labels, but that's not true: those are the values in the tree leaves. I'm to blame for this and will provide a fix later.

Hope it's clear... I can try to clarify if needed

jnothman (Member) commented Dec 27, 2018

Is a classifier with decision_function appropriate?

NicolasHug (Contributor) commented Dec 27, 2018

Is a classifier with decision_function appropriate?

That would be slightly less inappropriate than using predict but still not ideal, I think.

In binary classification:

  • at any point in fit(), y_pred contains the sum of the raw values from the trees for each training sample.
  • sigmoid(y_pred) is the probability that each sample belongs to the positive class, and thresholding sigmoid(y_pred) at 0.5 gives the predicted labels.
  • Before fitting, the best initial prediction we can have is just to predict the most common class in the training set. Denoting p = y_train.mean() the proportion of positive samples in the training set, the best we can do is to have predict_proba return p for each sample. To achieve this, we set y_pred to sigmoid^{-1}(p) = log(p / (1 - p)), since this results in predicted_proba = sigmoid(y_pred) = sigmoid(sigmoid^{-1}(p)) = p. That's what the default LogOddsEstimator does.

My proposal is to replace the constant p with some_estimator.predict_proba(X_train).

It does not make sense to set y_pred to some_estimator.predict(X_train), because then sigmoid(y_pred) isn't the probability of anything meaningful.
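
A minimal sketch of both initializations for binary classification (y_train, X_train and some_estimator as above; illustrative only):

import numpy as np

# default: constant log-odds of the positive-class proportion p
p = y_train.mean()
y_pred = np.full((len(y_train), 1), np.log(p / (1 - p)))

# proposed: per-sample log-odds from a fitted estimator's predict_proba
proba = some_estimator.predict_proba(X_train)[:, 1]
y_pred = np.log(proba / (1 - proba)).reshape(-1, 1)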

amueller (Member) commented Dec 27, 2018

From discussing with @NicolasHug, I think the solution in this PR is not correct for the classification case, and we should do what @NicolasHug proposes (which also cleans up the GradientBoosting interface).

NicolasHug (Contributor) commented Dec 31, 2018

I propose the following:

Each loss function should have a get_init_predictions(self, X_train, init_estimator) method that returns the initial (raw) predictions. init_estimator would be an estimator already fitted on X_train, y_train.

For the binary classification loss, this would look like

def get_init_predictions(self, X_train, init_estimator):
    p = init_estimator.predict_proba(X_train)
    return np.log(p / (1 - p))

while for least squares this would be

def get_init_predictions(self, X_train, init_estimator):
    return init_estimator.predict(X_train)

The built-in init estimators like LogOddsEstimator would need to be changed to only return p and not log(p / (1 - p)). We might even remove those classes completely, since they could be treated as the default when init_estimator=None in loss.get_init_predictions.

It would be similar to what we do in pygbm, except that pygbm does not accept custom init estimators.


@jeremiedbb you probably didn't sign up for this, so if you want, I can take it up. I was planning on refactoring the GB code anyway (putting the losses in another file, renaming y_pred to raw_predictions, and fixing the incorrect docstrings).

jnothman (Member) commented Jan 2, 2019

NicolasHug (Contributor) commented Jan 2, 2019

I don't think so, because y_pred is not a probability. Also, the sigmoid is only used in binary classification to transform the raw predictions (i.e. the output of decision_function) into probabilities; it isn't used in regression or in multiclass classification.

jeremiedbb (Author, Contributor) commented Jan 14, 2019

@jeremiedbb you probably didn't sign up for this so if you want, I can take it up.

@NicolasHug feel free to take over. I won't have much time to work on it, and this is going out of my comfort zone :)

NicolasHug referenced this pull request Jan 14, 2019: FIX several bugs in initial predictions for GBDT #12983 (merged, 8 of 8 tasks complete)