
ENH Poisson loss for HistGradientBoostingRegressor #16692

Merged: 23 commits from lorentzenchr:hgb_poisson into scikit-learn:master on Apr 23, 2020

Conversation

@lorentzenchr (Member) commented Mar 14, 2020:

Reference Issues/PRs

This PR partly addresses #16668 and #5975.

What does this implement/fix? Explain your changes.

This PR implements the Poisson loss for HistGradientBoostingRegressor, i.e. splitting based on improvement in Poisson deviance.
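
A minimal usage sketch of the new option (hedged: data and parameters are illustrative, and at the time of this PR the estimator still required the experimental import):

import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 3))
y = rng.poisson(lam=np.exp(X[:, 0]))  # non-negative count-like targets

# loss='poisson' fits raw predictions on the log scale and applies an
# exponential inverse link, so predictions are strictly positive.
model = HistGradientBoostingRegressor(loss='poisson').fit(X, y)
print(model.predict(X)[:5])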

@lorentzenchr lorentzenchr changed the title [WIP] Poisson loss for GradientBoostingRegressor [MRG] Poisson loss for HistGradientBoostingRegressor Mar 15, 2020
@lorentzenchr (Member, Author) commented:

ping @NicolasHug

@NicolasHug (Member) left a review:

Thanks @lorentzenchr, this looks good!

I mostly have minor comments.

Should we check that we always have y >= 0?
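
A minimal sketch of such a check (hypothetical helper; the PR may validate elsewhere, e.g. in fit):

import numpy as np

def _check_poisson_y(y):
    # The Poisson deviance is only defined for y >= 0 (and the log-link
    # baseline additionally needs mean(y) > 0).
    if not np.all(np.asarray(y) >= 0):
        raise ValueError("loss='poisson' requires non-negative y.")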

Please make a minor update to ensemble.rst (around line 955) to document the new loss.
This will also need an entry in the what's new.

Review thread on sklearn/ensemble/_hist_gradient_boosting/_loss.pyx (outdated, resolved).
Review thread on sklearn/ensemble/_hist_gradient_boosting/loss.py (outdated, resolved).
# than least squares measured in Poisson deviance as score.
rng = np.random.RandomState(42)
X, y, coef = make_regression(n_samples=500, coef=True, random_state=rng)
coef /= np.max(np.abs(coef))
Review comment (Member):

Why is this needed?

Also, since we're overriding y at this point, should we still be using make_regression?

assert_almost_equal(np.mean(y_baseline), y_train.mean())

# Test baseline for y_true = 0
y_train.fill(0.)
Review comment (Member):

Suggested change:
- y_train.fill(0.)
+ y_train = np.zeros(100)

@lorentzenchr (Member, Author) replied:

Why do you prefer it this way?

@@ -192,6 +192,24 @@ def test_least_absolute_deviation():
assert gbdt.score(X, y) > .9


def test_poisson_loss():
Review comment (Member):

Would it make sense to also test that the score is above a given threshold?

@lorentzenchr (Member, Author) replied:

Unlike the R² score, it is hard to give an absolute "good" value for the Poisson deviance. With the "D² score" of #15244 this would make more sense.
I added a DummyRegressor with the mean as prediction, which is (almost) equivalent to a D² score. And I added out-of-sample tests.
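
A sketch along those lines (not the exact test code; mean_poisson_deviance and DummyRegressor are existing scikit-learn APIs, the data generation is illustrative):

import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_poisson_deviance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.uniform(size=(500, 5))
y = rng.poisson(lam=np.exp(X @ rng.uniform(size=5)))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbdt = HistGradientBoostingRegressor(loss='poisson', random_state=0).fit(X_train, y_train)
dummy = DummyRegressor(strategy='mean').fit(X_train, y_train)

# Out of sample, the boosted model should beat the constant-mean baseline.
assert (mean_poisson_deviance(y_test, gbdt.predict(X_test))
        < mean_poisson_deviance(y_test, dummy.predict(X_test)))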

@lorentzenchr (Member, Author) commented:

@NicolasHug Thanks for your fast first review pass. I think I addressed all comments. I have to say, the histogram gradient boosting implementation seems like a piece of art. I wish it had been that easy to include Poisson for linear models 😄

@NicolasHug (Member) left a review:

Thanks @lorentzenchr, a few more nits but looks good!

Thanks for the fast work!

(I wonder why the CI doesn't show the test suite instances... tests pass locally at least)

pinging @ogrisel who will be interested.

@lorentzenchr (Member, Author) commented:

Can someone explain why the test suddenly fails and how to resolve it?

___ test_estimators[HistGradientBoostingRegressor()-check_estimators_unfitted] ____
...
def predict(self, X):
...
>       return self.loss_.inverse_link_function(self._raw_predict(X).ravel())
E       AttributeError: 'HistGradientBoostingRegressor' object has no attribute 'loss_'

An unfitted HistGradientBoostingRegressor does not have the attribute loss_, only the attribute loss.

@NicolasHug (Member) commented:

You'll need to call check_is_fitted in predict of HistGradientBoostingRegressor now.

I think the error comes from the fact that you added

return self.loss_.inverse_link_function(self._raw_predict(X).ravel())

so now the error is "this estimator doesn't have a loss_ attribute" instead of being "this estimator isn't fitted" (as would be raised by _raw_predict())

Alternatively this should also work

pred = self._raw_predict(X).ravel()
return self.loss_.inverse_link_function(pred)
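
Putting both points together, a minimal sketch of the fixed predict (the exact code in the PR may differ):

from sklearn.utils.validation import check_is_fitted

def predict(self, X):
    # Raise NotFittedError before touching fitted attributes such as loss_.
    check_is_fitted(self)
    return self.loss_.inverse_link_function(self._raw_predict(X).ravel())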

@lorentzenchr (Member, Author) commented:

@NicolasHug Thanks. That solves it. In particular, I was wondering why the tests passed before. Never mind.

@lorentzenchr lorentzenchr changed the title [MRG] Poisson loss for HistGradientBoostingRegressor [MRG+1] Poisson loss for HistGradientBoostingRegressor Mar 30, 2020
@thomasjpfan (Member) left a review:

Nice work here @lorentzenchr

Review thread on sklearn/ensemble/_hist_gradient_boosting/loss.py (outdated, resolved).
# return a view.
raw_predictions = raw_predictions.reshape(-1)
# TODO: For speed, we could remove the constant xlogy(y_true, y_true)
# Advantage of this form: minimum of zero at raw_predictions = y_true.
Review comment (Member):

Are we taking advantage of this advantage somewhere?

@lorentzenchr (Member, Author) replied:

Not that I know of. Might be interesting to see if it matters (at all).
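
For reference, the per-sample half Poisson deviance in the form discussed above, with raw_predictions on the log scale (a sketch, not the exact PR code; the data is illustrative):

from scipy.special import xlogy
import numpy as np

y_true = np.array([0., 1., 3.])            # observed counts
raw_predictions = np.log([0.5, 1.0, 3.0])  # model output, log scale

# loss_i = y_i*log(y_i) - y_i - y_i*raw_i + exp(raw_i)
# The xlogy(y, y) - y part is constant in raw_predictions; keeping it
# makes the loss exactly zero at the optimum exp(raw_i) == y_i.
loss = (xlogy(y_true, y_true) - y_true * (raw_predictions + 1)
        + np.exp(raw_predictions))
print(loss)  # last entry is 0 since exp(raw) == y there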

@rth rth self-requested a review April 21, 2020 08:42
y = rng.poisson(lam=np.exp(X @ coef))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=n_test,
                                                    random_state=rng)
gbdt1 = HistGradientBoostingRegressor(loss='poisson', random_state=rng)
Review comment (Member):

Suggested change:
- gbdt1 = HistGradientBoostingRegressor(loss='poisson', random_state=rng)
+ gbdt_pois = HistGradientBoostingRegressor(loss='poisson', random_state=rng)

And below.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=n_test,
                                                    random_state=rng)
gbdt1 = HistGradientBoostingRegressor(loss='poisson', random_state=rng)
gbdt2 = HistGradientBoostingRegressor(loss='least_squares',
Review comment (Member):

Suggested change:
- gbdt2 = HistGradientBoostingRegressor(loss='least_squares',
+ gbdt_ls = HistGradientBoostingRegressor(loss='least_squares',

And below.

# log(0)
assert y_train.sum() > 0
baseline_prediction = loss.get_baseline_prediction(y_train, None, 1)
assert baseline_prediction.shape == tuple() # scalar
Review comment (Member):

Nit:

Suggested change:
- assert baseline_prediction.shape == tuple()  # scalar
+ assert np.isscalar(baseline_prediction)
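
For context on the hunk above: for the Poisson loss the baseline raw prediction lives on the log scale, hence the guard against log(0) via y_train.sum() > 0. A sketch of the idea (the actual get_baseline_prediction may differ, e.g. in sample-weight handling):

import numpy as np

y_train = np.array([0, 1, 2, 4])  # illustrative targets with mean > 0

# A constant model predicting mean(y) corresponds, on the log scale, to:
baseline_prediction = np.log(y_train.mean())  # requires mean(y_train) > 0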

@thomasjpfan thomasjpfan changed the title [MRG+1] Poisson loss for HistGradientBoostingRegressor ENH Poisson loss for HistGradientBoostingRegressor Apr 23, 2020
@thomasjpfan thomasjpfan merged commit a93b15f into scikit-learn:master Apr 23, 2020
@thomasjpfan (Member) commented:

Thank you @lorentzenchr !

@lorentzenchr (Member, Author) commented:

@thomasjpfan Thank you for your review and merging. 👍

Now, the good old, brand new Poisson GLM will come out in the same release as this Poisson HGB. That is a strong competitor! 😄

@lorentzenchr lorentzenchr deleted the hgb_poisson branch April 23, 2020 15:16
@rth (Member) commented Apr 23, 2020:

Thanks for all the work by the three of you in this PR! Looking forward to the release :)

gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020