
loss function name consistency in gradient boosting #3481

Closed
mblondel opened this issue Jul 24, 2014 · 19 comments
@mblondel
Member

It would be nice if the loss option in gradient boosting could be more consistent with the one in SGD. Rather than deprecating names in gradient boosting, I suggest adding aliases.

@agramfort
Member

maybe something for @dsullivan7

@dsullivan7
Contributor

Yes, I was just thinking this actually. I was hoping to extract the loss functions out from SGD so perhaps they can be shared between sgd_fast.pyx and _gradient_boosting.pyx? I'm not too comfortable with gradient_boosting but I'll take a look.

@agramfort
Member

it's more a question of API here.

@mblondel
Member Author

The underlying methods are completely different so I don't think we need to share code. It's just a matter of unifying the names.

lad -> absolute
bdeviance -> log
mdeviance -> multiclass_log

The absolute loss is missing from SGD right now, but it can be implemented by setting epsilon=0 in the epsilon-insensitive loss.
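To illustrate that equivalence, a minimal sketch in plain Python (the function `epsilon_insensitive` here is a stand-in I wrote for this comment, not scikit-learn code):

```python
def epsilon_insensitive(residual, epsilon):
    """Epsilon-insensitive loss: max(0, |residual| - epsilon)."""
    return max(abs(residual) - epsilon, 0.0)

# With epsilon = 0 the dead zone disappears, and the loss reduces
# to the absolute loss |residual| for every residual.
for r in (-2.0, -0.5, 0.0, 0.3, 1.7):
    assert epsilon_insensitive(r, 0.0) == abs(r)
```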

@mblondel
Member Author

Conversely, I think it should also be possible to add the squared_hinge and modified_huber losses to gradient boosting. The hinge loss is not differentiable, so it cannot be added.

@dsullivan7
Contributor

Ok sounds good, I'll take a crack at making the aliases then. I'll also check in on possibly adding squared_hinge and modified_huber. Is there a reason that the underlying methods are completely different? I haven't looked at it so I don't know.

@mblondel
Member Author

The elements of the stochastic (sub-)gradient in SGD are with respect to the feature coefficients coef[j], so the gradient is n_features-dimensional. The elements of the gradient in gradient boosting are with respect to the predictions y_pred[i], so the gradient is n_samples-dimensional. In addition, gradient boosting needs a method to update the underlying trees.
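A shape-only sketch of the distinction, using the squared loss for concreteness (my choice of loss, just to make the gradients easy to write down):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 5, 3
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)
coef = rng.normal(size=n_features)
y_pred = X @ coef

# Squared loss L = 0.5 * sum((y_pred - y) ** 2).
# SGD differentiates w.r.t. coef[j]: the gradient lives in feature space.
grad_wrt_coef = X.T @ (y_pred - y)   # shape (n_features,)

# Gradient boosting differentiates w.r.t. y_pred[i]: sample space.
grad_wrt_pred = y_pred - y           # shape (n_samples,)

assert grad_wrt_coef.shape == (n_features,)
assert grad_wrt_pred.shape == (n_samples,)
```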

@larsmans
Member

larsmans commented Aug 1, 2014

Do we want to fix LinearSVC as well? It calls its losses L1 and L2, which is quite confusing given that they mean hinge and squared hinge.

@dsullivan7
Contributor

Yikes, yes I'll take a look at that too

@mblondel
Member Author

mblondel commented Aug 1, 2014

+1e6 too

It was not clear in the SO answer, but the reason they're called the L1 and L2 losses is the constrained formulation of the soft-margin SVM: the sum over the ξ variables is an L1 norm for the hinge loss (the ξ variables are non-negative) and a squared L2 norm for the squared hinge loss.
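For reference, the two constrained formulations look like this (standard soft-margin SVM notation, written out here from memory rather than quoted from any source):

```latex
% L1 (hinge) SVM: the slack penalty \sum_i \xi_i equals \|\xi\|_1 since \xi_i \ge 0
\min_{w,\,\xi} \; \tfrac{1}{2}\|w\|_2^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad y_i\, w^\top x_i \ge 1 - \xi_i, \;\; \xi_i \ge 0

% L2 (squared hinge) SVM: the slack penalty \sum_i \xi_i^2 equals \|\xi\|_2^2
\min_{w,\,\xi} \; \tfrac{1}{2}\|w\|_2^2 + C \sum_i \xi_i^2
\quad \text{s.t.} \quad y_i\, w^\top x_i \ge 1 - \xi_i
```

Eliminating the slacks gives back the unconstrained hinge and squared-hinge objectives, which is where the L1/L2 naming comes from.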

@kastnerkyle
Member

On a side note - hinge loss is differentiable... see Charlie Tang's paper. It might not be worth the complexity to implement right now, but I think it is possible and it worked well for the tasks I tried it on (neural net image recognition).

@larsmans
Member

larsmans commented Aug 7, 2014

As a side note to the side note: Hinton mentioned something about LeCun having done max-margin neural nets in his Coursera course, and I gather he meant optimizing for hinge loss. This would have been ~two decades ago.

@mblondel
Member Author

mblondel commented Aug 9, 2014

@kastnerkyle Just to clarify, the hinge loss is only differentiable in the squared / l2 case. I'd love to add this loss in gradient boosting. I'd expect it to use fewer trees than the log loss.
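A quick numerical check of that differentiability claim (a sketch; the function names are mine): writing the margin as m = y * f(x), the squared hinge max(0, 1 - m)**2 has derivative -2 * max(0, 1 - m), and the two pieces agree (both are 0) at the kink m = 1.

```python
def sq_hinge(m):
    """Squared hinge loss at margin m."""
    return max(0.0, 1.0 - m) ** 2

def sq_hinge_grad(m):
    """Closed-form derivative; continuous everywhere, including m = 1."""
    return -2.0 * max(0.0, 1.0 - m)

# Compare against a central finite difference on both sides of the kink.
h = 1e-6
for m in (-1.0, 0.5, 1.0, 2.0):
    numeric = (sq_hinge(m + h) - sq_hinge(m - h)) / (2 * h)
    assert abs(numeric - sq_hinge_grad(m)) < 1e-4
```

The plain hinge max(0, 1 - m) fails the same check at m = 1, where its left and right derivatives differ.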

@kastnerkyle
Member

I have been going back and forth on this for a while now - do you think equation (10) in the paper is meant to be the differentiated L1 loss? It looks like it, but I don't know how to verify it.

When I did this last, I only implemented the L2 version because of their earlier statement in section 2.2 that only the L2 loss is differentiable. However, on the same page as eqs. (10) and (11), in section 2.4, they say they tested with both the L1 and L2 SVM, which would mean they got a gradient for both. I have been too spoiled by Theano's gradient magic...

@kastnerkyle
Member

@larsmans I would not be surprised if after the paper, Y. LeCun was like "oh by the way, nice paper but I did that 20 years ago". Seems to happen a lot - hopefully that means it was a good idea :)

@mblondel
Member Author

mblondel commented Aug 9, 2014

@kastnerkyle Eq. 10 is technically a sub-gradient, not a gradient, so one should use it with the sub-gradient method, not gradient descent. This has implications for convergence proofs, the choice of learning rate, etc. I haven't read the paper, but I'd guess it is lacking theoretical guarantees.
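For completeness, the subdifferential of the plain hinge ℓ(m) = max(0, 1 - m) is the standard set-valued object (stated here from memory, not taken from the paper):

```latex
\partial \ell(m) =
\begin{cases}
\{-1\}  & m < 1 \\
[-1, 0] & m = 1 \\
\{0\}   & m > 1
\end{cases}
```

Any single-valued selection from this set is a subgradient rather than a gradient, which is the distinction at issue with eq. (10).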

@kastnerkyle
Member

@mblondel That makes a lot of sense, and explains the confusion I had with the paper. Thanks! Ultimately they say "we tried both, but L2-SVM was always better on our tests" - which may or may not have to do with the difference between gradient/subgradient if they were using eq. 10 in the paper for backprop. Either way squared_hinge should be quite nice for GBRT I think, thanks for clarifying.

@dsullivan7
Contributor

It looks like mdeviance and bdeviance are deprecated, so it might not be a good idea to add aliases for them.

@lorentzenchr
Member

Superseded by #18248.
