ENH Poisson loss for HistGradientBoostingRegressor #16692
Conversation
ping @NicolasHug
Thanks @lorentzenchr, this looks good! I mostly have minor comments.
Should we check that we always have y >= 0?
Please make a minor update to ensemble.rst (around line 955) to document the new loss. This will also need an entry in the what's new.
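A rough sketch of what such a target check could look like (the helper name and error message are assumptions for illustration, not the PR's actual code):

```python
import numpy as np

def _check_poisson_y(y):
    # Hypothetical helper: the Poisson deviance is only defined for
    # non-negative targets, so reject negative values before fitting.
    y = np.asarray(y, dtype=np.float64)
    if np.any(y < 0):
        raise ValueError(
            "loss='poisson' requires non-negative y, but y contains "
            "negative values.")
    return y
```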
# than least squares measured in Poisson deviance as score. | ||
rng = np.random.RandomState(42) | ||
X, y, coef = make_regression(n_samples=500, coef=True, random_state=rng) | ||
coef /= np.max(np.abs(coef)) |
Why is this needed?
Also, at this point, since we're also overriding y, should we still be using make_regression?
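For illustration, one way the test could draw Poisson targets without make_regression is sketched below (names and sizes are made up; the coefficient scaling keeps the means np.exp(X @ coef) in a moderate range, which seems to be the point of the division above):

```python
import numpy as np

rng = np.random.RandomState(42)
n_samples, n_features = 500, 10

# Draw features and coefficients directly instead of via make_regression,
# since y is regenerated as Poisson counts anyway.
X = rng.normal(size=(n_samples, n_features))
coef = rng.uniform(low=-1, high=1, size=n_features)

# Keep the linear predictor small so the Poisson means exp(X @ coef)
# stay in a reasonable range.
coef /= n_features
y = rng.poisson(lam=np.exp(X @ coef))
```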
assert_almost_equal(np.mean(y_baseline), y_train.mean())

# Test baseline for y_true = 0
y_train.fill(0.)
Suggested change:
- y_train.fill(0.)
+ y_train = np.zeros(100)
Why do you prefer it this way?
@@ -192,6 +192,24 @@ def test_least_absolute_deviation():
    assert gbdt.score(X, y) > .9


def test_poisson_loss():
Would it make sense to also test that the score is above a given threshold?
Unlike the R² score, it is hard to give an absolute "good" value for the Poisson deviance. With the "D² score" of #15244 this would make more sense.
I added a DummyRegressor with mean as prediction. This is (almost) equivalent to a D² score. And I added out-of-sample tests.
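Roughly, such a baseline comparison could look like the following self-contained sketch (data and names here are illustrative; on scikit-learn < 1.0 an additional `from sklearn.experimental import enable_hist_gradient_boosting` is needed):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_poisson_deviance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 5))
y = rng.poisson(lam=np.exp(X @ rng.uniform(-0.3, 0.3, size=5)))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbdt = HistGradientBoostingRegressor(loss='poisson', random_state=0)
gbdt.fit(X_train, y_train)
dummy = DummyRegressor(strategy="mean").fit(X_train, y_train)

# Out of sample, the boosted model should beat the constant-mean baseline
# in Poisson deviance; this is (almost) asking for a positive D^2 score.
dev_gbdt = mean_poisson_deviance(y_test, gbdt.predict(X_test))
dev_dummy = mean_poisson_deviance(y_test, dummy.predict(X_test))
assert dev_gbdt < dev_dummy
```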
@NicolasHug Thanks for your fast first review pass. I think I addressed all comments. I have to say, the histogram gradient boosting implementation seems like a piece of art. I wish it had been that easy to include Poisson for linear models 😄
Thanks @lorentzenchr, a few more nits, but this looks good! Thanks for the fast work!
(I wonder why the CI doesn't show the test suite instances... tests pass locally at least.)
Pinging @ogrisel, who will be interested.
Can someone explain why the test suddenly fails and how to resolve it? An unfitted estimator no longer raises the expected "not fitted" error.

You'll need to call check_is_fitted first. I think the error comes from the fact that predict now accesses self.loss_, so the error is "this estimator doesn't have a loss_ attribute" instead of "this estimator isn't fitted" (as would be raised by check_is_fitted). Alternatively, this should also work:

pred = self._raw_predict(X).ravel()
return self.loss_.inverse_link_function(pred)
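Putting that together, the predict method could look roughly like the sketch below (assuming the loss object exposes an inverse_link_function, as in the snippet above; this is illustrative, not the PR's final code):

```python
from sklearn.utils.validation import check_is_fitted

def predict(self, X):
    """Predict on the original scale, i.e. after applying the inverse link."""
    check_is_fitted(self)  # raise NotFittedError before touching self.loss_
    # _raw_predict returns predictions on the link scale (log scale for the
    # Poisson loss); the inverse link maps them back to non-negative means.
    raw_predictions = self._raw_predict(X).ravel()
    return self.loss_.inverse_link_function(raw_predictions)
```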
@NicolasHug Thanks. That solves it. In particular, I was wondering why the tests passed before. Never mind.
Nice work here @lorentzenchr
# return a view.
raw_predictions = raw_predictions.reshape(-1)
# TODO: For speed, we could remove the constant xlogy(y_true, y_true)
# Advantage of this form: minimum of zero at raw_predictions = y_true.
Are we taking advantage of this advantage somewhere?
Not that I know of. Might be interesting to see if it matters (at all).
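For concreteness, here is a rough NumPy version of the two forms being contrasted (a sketch, not the PR's Cython code): dropping the raw-prediction-independent constant changes neither the gradients nor the minimizer, only the offset that makes the minimum exactly zero.

```python
import numpy as np
from scipy.special import xlogy

def poisson_loss_with_constant(y_true, raw_predictions):
    # Per-sample loss including the constant term, so the loss is exactly
    # zero when exp(raw_predictions) == y_true.
    return (np.exp(raw_predictions) - y_true * raw_predictions
            + xlogy(y_true, y_true) - y_true)

def poisson_loss_without_constant(y_true, raw_predictions):
    # Same gradient (exp(raw_predictions) - y_true) and same minimizer;
    # only the constant offset is dropped, which is cheaper to evaluate.
    return np.exp(raw_predictions) - y_true * raw_predictions
```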
y = rng.poisson(lam=np.exp(X @ coef))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=n_test,
                                                    random_state=rng)
gbdt1 = HistGradientBoostingRegressor(loss='poisson', random_state=rng)
Suggested change:
- gbdt1 = HistGradientBoostingRegressor(loss='poisson', random_state=rng)
+ gbdt_pois = HistGradientBoostingRegressor(loss='poisson', random_state=rng)
And below.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=n_test,
                                                    random_state=rng)
gbdt1 = HistGradientBoostingRegressor(loss='poisson', random_state=rng)
gbdt2 = HistGradientBoostingRegressor(loss='least_squares',
Suggested change:
- gbdt2 = HistGradientBoostingRegressor(loss='least_squares',
+ gbdt_ls = HistGradientBoostingRegressor(loss='least_squares',
And below.
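With those renames, the comparison the test builds toward might read roughly as follows (continuing the fragments above, which already define X_train, X_test, y_train, and y_test; the clipping and the choice of mean_poisson_deviance are assumptions of this sketch):

```python
import numpy as np
from sklearn.metrics import mean_poisson_deviance

gbdt_pois.fit(X_train, y_train)
gbdt_ls.fit(X_train, y_train)

for X_eval, y_eval in [(X_train, y_train), (X_test, y_test)]:
    dev_pois = mean_poisson_deviance(y_eval, gbdt_pois.predict(X_eval))
    # Least-squares predictions can be negative or zero, so clip them to a
    # tiny positive value before evaluating the Poisson deviance.
    dev_ls = mean_poisson_deviance(
        y_eval, np.clip(gbdt_ls.predict(X_eval), 1e-15, None))
    # The Poisson-loss model should fit Poisson-distributed data better.
    assert dev_pois < dev_ls
```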
# log(0)
assert y_train.sum() > 0
baseline_prediction = loss.get_baseline_prediction(y_train, None, 1)
assert baseline_prediction.shape == tuple()  # scalar
Nit:
Suggested change:
- assert baseline_prediction.shape == tuple()  # scalar
+ assert np.isscalar(baseline_prediction)
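Note that the two checks are not strictly interchangeable: np.isscalar returns False for 0-d NumPy arrays, even though their shape is also the empty tuple. A quick illustration:

```python
import numpy as np

a = np.float64(3.0)  # NumPy scalar
b = np.array(3.0)    # 0-d array

assert np.isscalar(a) and a.shape == tuple()
assert not np.isscalar(b) and b.shape == tuple()
```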
Thank you @lorentzenchr!
@thomasjpfan Thank you for your review and merging. 👍 Now, the good old, brand new Poisson GLM will come out in the same release as this Poisson HGB. That is a strong competitor! 😄
Thanks for all the work by the three of you in this PR! Looking forward to the release :)
Reference Issues/PRs
This PR partly addresses #16668 and #5975.
What does this implement/fix? Explain your changes.
This PR implements the Poisson loss for HistGradientBoostingRegressor, i.e. splitting based on improvement in Poisson deviance.
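A minimal usage sketch of the new loss (made-up data; on scikit-learn < 1.0 the experimental enable_hist_gradient_boosting import is additionally required):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
# Counts with a log-linear mean: the natural use case for the Poisson loss.
y = rng.poisson(lam=np.exp(X @ np.array([0.3, -0.2, 0.1])))

model = HistGradientBoostingRegressor(loss='poisson', random_state=0)
model.fit(X, y)
print(model.predict(X[:5]))  # predicted means, always non-negative
```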