Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] implement least absolute deviation loss in GBDTs #13896

Merged
merged 19 commits into from Sep 9, 2019

Conversation

NicolasHug
Copy link
Member

@NicolasHug NicolasHug commented May 16, 2019

This PR implements the least absolute deviation (or MAE) loss for GBDTs.

It's not as trivial as it seems, since the leaves values have to be updated after the tree is trained. Indeed for MAE, we train the trees to fit the gradients (or more accuratly a Newton–Raphson step), but we need them to predict the median of the residuals. See the original paper for more details.

See also the current implementation of this for the old version of GBDTs, in particular _update_terminal_region() in sklearn/ensemble/gb_losses.py

Will ping when this is ready.


This is a bit slower than LightGBM, since the leaves values update is not done in parallel here.

python benchmarks/bench_hist_gradient_boosting.py --lightgbm --problem regression --n-samples-max 5000000 --n-trees 50 --loss least_absolute_deviation

mae

@NicolasHug NicolasHug changed the title [WIP] implement least absolute deviation loss in GBDTs [MRG] implement least absolute deviation loss in GBDTs May 16, 2019
@NicolasHug
Copy link
Member Author

@NicolasHug NicolasHug commented May 16, 2019

I think it's ready, ping @ogrisel @glemaitre @adrinjalali since you guys are the most familiar with the code ;)

# loss
# If the hessians are constant, we consider they are equal to 1.
# - This is correct for the half LS loss
# - For LAD loss, hessians are actually 0, but they are always
Copy link
Member

@adrinjalali adrinjalali Jun 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if that's the case, doesn't it make sense to actually set them to 0? I'm kinda worried about maintainability of this.

Copy link
Member

@adrinjalali adrinjalali Jun 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or even have a parameter in the loss class which is the function which gives you the constant hessians, i.e. for LAD it'd be np.zeros.

Copy link
Member Author

@NicolasHug NicolasHug Jun 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it's more maintainable to have the convention that constant hessians are always 1 rather than have custom constant hessians for each loss. Especially when these hessians are never used, like here.

Copy link
Member Author

@NicolasHug NicolasHug Jun 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, that's something that's always implicit and not explained in the papers, but hessians need to be treated as 1 (even though they're 0): in order for the leaf value computation to be an average instead of a sum.

Copy link
Contributor

@glemaitre glemaitre left a comment

It looks good otherwise. You will need an entry in what's new as well.

@adrinjalali
Copy link
Member

@adrinjalali adrinjalali commented Jun 19, 2019

I just went to check our docs, and noticed/remembered we still don't have good examples/user guide for these classes. But other than that, it seems the loss function names are not consistent with the other ensemble methods, for instance the GBR calls this LAD, I think.

('binary_crossentropy', 2, 1),
('categorical_crossentropy', 3, 3),
])
@pytest.mark.skipif(Y_DTYPE != np.float64,
reason='Need 64 bits float precision for numerical checks')
def test_numerical_gradients(loss, n_classes, prediction_dim):
@pytest.mark.parametrize('seed', range(1))
Copy link
Member

@adrinjalali adrinjalali Jun 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

O_o

Copy link
Member Author

@NicolasHug NicolasHug Jun 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^^

I can remove it, it's just convenient when testing locally

@NicolasHug
Copy link
Member Author

@NicolasHug NicolasHug commented Jun 19, 2019

we still don't have good examples/user guide for these classes.

don't worry that's still on my stack, I'm just waiting for a few more features

the loss function names are not consistent with the other ensemble methods

right, we broke name consistency with the other implementation already ('ls' vs 'least-squares', 'deviance' vs 'binary-crossentropy', etc.)

Copy link
Member

@adrinjalali adrinjalali left a comment

LGTM :)

@NicolasHug
Copy link
Member Author

@NicolasHug NicolasHug commented Aug 2, 2019

@glemaitre you were OK with this I think, do you wanna have another look so we can merge? Thanks!

@NicolasHug
Copy link
Member Author

@NicolasHug NicolasHug commented Aug 21, 2019

Ping @glemaitre ;)

Maybe @ogrisel too?

@@ -668,7 +672,8 @@ class HistGradientBoostingRegressor(BaseHistGradientBoosting, RegressorMixin):
Parameters
----------
loss : {'least_squares'}, optional (default='least_squares')
loss : {'least_squares', 'least_absolute_deviation'}, \
optional (default='least_squares')
Copy link
Contributor

@glemaitre glemaitre Aug 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we remove right away the optional and the parenthesis?

Copy link
Member Author

@NicolasHug NicolasHug Aug 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather keep the current docstring consistent with the other entries but I'm +1 in updating all of them in another PR

Copy link
Contributor

@glemaitre glemaitre Aug 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK fine with me

@glemaitre
Copy link
Contributor

@glemaitre glemaitre commented Aug 23, 2019

I am still fine with the PR. But before to merge, would it make sense to add a test in test_compare_lightgbm.py to check that we get a similar result?

@glemaitre
Copy link
Contributor

@glemaitre glemaitre commented Aug 23, 2019

and regarding the test, WDYT?

@NicolasHug
Copy link
Member Author

@NicolasHug NicolasHug commented Aug 23, 2019

I agree. It currently fails, I'm investigating why. This has to do with the initial predictions being different from lightgbm and sklearn.

@NicolasHug
Copy link
Member Author

@NicolasHug NicolasHug commented Aug 23, 2019

OK I sorted it out. Added a comment:

    # - We don't check the least_absolute_deviation loss here. This is because
    #   LightGBM's computation of the median (used for the initial value of
    #   raw_prediction) is a bit off (they'll e.g. return midpoints when there
    #   is no need to.). Since these tests only run 1 iteration, the
    #   discrepancy between the initial values leads to biggish differences in
    #   the predictions. These differences are much smaller with more
    #   iterations.

@glemaitre glemaitre merged commit 5cf88db into scikit-learn:master Sep 9, 2019
17 checks passed
@glemaitre
Copy link
Contributor

@glemaitre glemaitre commented Sep 9, 2019

OK make sense. Merging then. Thanks @NicolasHug

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants