
loss function name consistency in gradient boosting #3481

Closed
mblondel opened this issue Jul 24, 2014 · 19 comments
@mblondel
Member

It would be nice if the loss option in gradient boosting could be more consistent with the one in SGD. Rather than deprecating names in gradient boosting, I suggest adding aliases.

@agramfort
Member

maybe something for @dsullivan7

@dsullivan7
Contributor

Yes, I was just thinking this actually. I was hoping to extract the loss functions out from SGD so perhaps they can be shared between sgd_fast.pyx and _gradient_boosting.pyx? I'm not too comfortable with gradient_boosting but I'll take a look.

@agramfort
Member

it's more a question of API here.

@mblondel
Member Author

The underlying methods are completely different so I don't think we need to share code. It's just a matter of unifying the names.

lad -> absolute
bdeviance -> log
mdeviance -> multiclass_log

The absolute loss is missing from SGD right now, but it can be implemented by setting epsilon=0 in the epsilon-insensitive loss.
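To illustrate that equivalence, a minimal sketch in plain Python (the function `epsilon_insensitive` here is a stand-in I wrote for this comment, not scikit-learn code):

```python
def epsilon_insensitive(residual, epsilon):
    """Epsilon-insensitive loss: max(0, |residual| - epsilon)."""
    return max(abs(residual) - epsilon, 0.0)

# With epsilon = 0 the dead zone disappears, and the loss reduces
# to the absolute loss |residual| for every residual.
for r in (-2.0, -0.5, 0.0, 0.3, 1.7):
    assert epsilon_insensitive(r, 0.0) == abs(r)
```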

@mblondel
Member Author

Conversely, I think it should also be possible to add the squared_hinge and modified_huber losses to gradient boosting. The hinge loss is not differentiable, so it cannot be added.

@dsullivan7
Contributor

Ok sounds good, I'll take a crack at making the aliases then. I'll also check in on possibly adding squared_hinge and modified_huber. Is there a reason that the underlying methods are completely different? I haven't looked at it so I don't know.

@mblondel
Member Author

The elements of the stochastic (sub-)gradient in SGD are with respect to the feature coefficients coef[j], so the gradient is n_features-dimensional. The elements of the gradient in gradient boosting are with respect to the predictions y_pred[i], so the gradient is n_samples-dimensional. In addition, gradient boosting needs a method to update the underlying trees.
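A shape-only sketch of the distinction, using the squared loss for concreteness (my choice of loss, just to make the gradients easy to write down):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 5, 3
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)
coef = rng.normal(size=n_features)
y_pred = X @ coef

# Squared loss L = 0.5 * sum((y_pred - y) ** 2).
# SGD differentiates w.r.t. coef[j]: the gradient lives in feature space.
grad_wrt_coef = X.T @ (y_pred - y)   # shape (n_features,)

# Gradient boosting differentiates w.r.t. y_pred[i]: sample space.
grad_wrt_pred = y_pred - y           # shape (n_samples,)

assert grad_wrt_coef.shape == (n_features,)
assert grad_wrt_pred.shape == (n_samples,)
```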

@larsmans
Member

larsmans commented Aug 1, 2014

Do we want to fix LinearSVC as well? It calls its losses L1 and L2, which is quite confusing given that they mean hinge and squared hinge.

@dsullivan7
Contributor

Yikes, yes I'll take a look at that too

@mblondel
Member Author

mblondel commented Aug 1, 2014

+1e6 too

It was not clear in the SO answer, but the reason they're called the L1 and L2 losses is the constrained formulation of the soft-margin SVM: the sum over the ξ variables is an L1 norm for the hinge loss (the ξ variables are non-negative) and a squared L2 norm for the squared hinge loss.
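For reference, the two constrained formulations look like this (standard soft-margin SVM notation, written out here from memory rather than quoted from any source):

```latex
% L1 (hinge) SVM: the slack penalty \sum_i \xi_i equals \|\xi\|_1 since \xi_i \ge 0
\min_{w,\,\xi} \; \tfrac{1}{2}\|w\|_2^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad y_i\, w^\top x_i \ge 1 - \xi_i, \;\; \xi_i \ge 0

% L2 (squared hinge) SVM: the slack penalty \sum_i \xi_i^2 equals \|\xi\|_2^2
\min_{w,\,\xi} \; \tfrac{1}{2}\|w\|_2^2 + C \sum_i \xi_i^2
\quad \text{s.t.} \quad y_i\, w^\top x_i \ge 1 - \xi_i
```

Eliminating the slacks gives back the unconstrained hinge and squared-hinge objectives, which is where the L1/L2 naming comes from.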

@kastnerkyle
Member

On a side note - hinge loss is differentiable... see Charlie Tang's paper. It might not be worth the complexity to implement right now, but I think it is possible and it worked well for the tasks I tried it on (neural net image recognition).

@larsmans
Member

larsmans commented Aug 7, 2014

As a side note to the side note: Hinton mentioned something about LeCun having done max-margin neural nets in his Coursera course, and I gather he meant optimizing for hinge loss. This would have been ~two decades ago.

@mblondel
Member Author

mblondel commented Aug 9, 2014

@kastnerkyle Just to clarify, the hinge loss is only differentiable in the squared / l2 case. I'd love to add this loss in gradient boosting. I'd expect it to use fewer trees than the log loss.
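A quick numerical check of that differentiability claim (a sketch; the function names are mine): writing the margin as m = y * f(x), the squared hinge max(0, 1 - m)**2 has derivative -2 * max(0, 1 - m), and the two pieces agree (both are 0) at the kink m = 1.

```python
def sq_hinge(m):
    """Squared hinge loss at margin m."""
    return max(0.0, 1.0 - m) ** 2

def sq_hinge_grad(m):
    """Closed-form derivative; continuous everywhere, including m = 1."""
    return -2.0 * max(0.0, 1.0 - m)

# Compare against a central finite difference on both sides of the kink.
h = 1e-6
for m in (-1.0, 0.5, 1.0, 2.0):
    numeric = (sq_hinge(m + h) - sq_hinge(m - h)) / (2 * h)
    assert abs(numeric - sq_hinge_grad(m)) < 1e-4
```

The plain hinge max(0, 1 - m) fails the same check at m = 1, where its left and right derivatives differ.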

@kastnerkyle
Member

I have been going back and forth on this for a while now - do you think equation (10) in the paper is meant to be the differentiated L1 loss? It looks like it, but I don't know how to verify it.

When I did this last, I only implemented the L2 version because of their earlier statement in section 2.2 that only the L2 loss is differentiable. However, on the same page as eqs. (10) and (11), in section 2.4, they say they tested with both the L1 and L2 SVM, which would mean they got a gradient for both. I have been too spoiled by Theano's gradient magic...

@kastnerkyle
Member

@larsmans I would not be surprised if after the paper, Y. LeCun was like "oh by the way, nice paper but I did that 20 years ago". Seems to happen a lot - hopefully that means it was a good idea :)

@mblondel
Member Author

mblondel commented Aug 9, 2014

@kastnerkyle Eq. 10 is technically a sub-gradient, not a gradient, so one should use it with the sub-gradient method, not gradient descent. This has implications for convergence proofs, the choice of learning rate, etc. I haven't read the paper, but I'd guess it is lacking theoretical guarantees.
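For completeness, the subdifferential of the plain hinge ℓ(m) = max(0, 1 - m) is the standard set-valued object (stated here from memory, not taken from the paper):

```latex
\partial \ell(m) =
\begin{cases}
\{-1\}  & m < 1 \\
[-1, 0] & m = 1 \\
\{0\}   & m > 1
\end{cases}
```

Any single-valued selection from this set is a subgradient rather than a gradient, which is the distinction at issue with eq. (10).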

@kastnerkyle
Member

@mblondel That makes a lot of sense, and explains the confusion I had with the paper. Thanks! Ultimately they say "we tried both, but L2-SVM was always better on our tests" - which may or may not have to do with the difference between gradient/subgradient if they were using eq. 10 in the paper for backprop. Either way squared_hinge should be quite nice for GBRT I think, thanks for clarifying.

@dsullivan7
Contributor

It looks like mdeviance and bdeviance are deprecated, so it might not be a good idea to add aliases for them.

@lorentzenchr
Member

Superseded by #18248.
