
Documentation mistake of Adam in v1.6.0? #42843

Open

yuhaozhang opened this issue Aug 11, 2020 · 7 comments
Labels
module: docs Related to our documentation, both in docs/ and docblocks
module: optimizer Related to torch.optim
triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

yuhaozhang commented Aug 11, 2020

📚 Documentation

The documentation of Adam since v1.6.0 suggests that it uses the weight decay fix proposed in the paper "Decoupled Weight Decay Regularization":

The implementation of the L2 penalty follows changes proposed in
`Decoupled Weight Decay Regularization`_.

However, I found this not to be the case. The actual weight decay in Adam is still the old L2-style one:

if group['weight_decay'] != 0:
    grad = grad.add(p, alpha=group['weight_decay'])

In contrast, the new weight decay fix is implemented in AdamW, and is explained in the AdamW doc:

The AdamW variant was proposed in `Decoupled Weight Decay Regularization`_.

p.mul_(1 - group['lr'] * group['weight_decay'])

It is possible that I have misunderstood the current documentation of Adam, but I find it confusing: it suggests that there is no difference between the Adam and AdamW implementations, when in fact the difference still exists.
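To make the distinction concrete, here is a minimal scalar sketch of the two update rules as I understand them (my own illustration, not the torch.optim source; the function names and single-parameter setup are made up for clarity):

```python
import math

def adam_l2_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, wd=1e-2):
    """Adam with the classic L2 penalty: the decay is folded into the
    gradient, so it is rescaled by the adaptive denominator."""
    grad = grad + wd * p                        # grad.add(p, alpha=wd)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

def adamw_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-2):
    """AdamW: decoupled weight decay, applied directly to the parameter
    and untouched by the adaptive denominator."""
    p = p * (1 - lr * wd)                       # p.mul_(1 - lr * wd)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v
```

The decoupled version shrinks the parameter by (1 - lr * wd) directly, so the decay term is never divided by the adaptive denominator, which is exactly the change the Decoupled Weight Decay paper proposes.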

cc @jlin27 @vincentqb

@zou3519 added the module: optimizer and triaged labels Aug 11, 2020
zou3519 (Contributor) commented Aug 11, 2020

cc @vincentqb -- what do you think?

@zou3519 added the module: docs label Aug 11, 2020
vincentqb (Contributor) commented Aug 20, 2020

My understanding of the issue here is that the weight decay paper doesn't need to be referenced, and adding the reference in the documentation adds to confusion.

The reference to the follow-up paper was added via this issue, and the justification for the change is here. Does this address your comment @yuhaozhang? If not, how would you suggest rephrasing the documentation?

yuhaozhang (Author) commented Aug 20, 2020

Hi @vincentqb, thanks for the follow-up! Your understanding is correct: the current reference leads to more confusion about the actual weight decay applied in the Adam implementation.

I have looked at #41477 as you pointed out, and I am now more confused about why this reference was added in the first place. It looks to me that the current Adam implementation conforms exactly to the original Adam paper (Algorithm 1 on page 2), and there is no difference between how epsilon is applied in the original Adam paper and the more recent Decoupled Weight Decay paper.

Digging further, it looks like there was originally a difference between how Adam was implemented in early versions of PyTorch, which was then fixed (fed5ca1) with a reference to the more recent Decoupled Weight Decay paper. However, this reference is not necessary, since the treatment of epsilon is the same in both papers, and we could equally well reference the original Adam paper for it.
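For what it's worth, here is a scalar sketch of what I mean by the epsilon placement (my own paraphrase of my understanding, not the actual code or the diff in fed5ca1; the function names are made up):

```python
import math

def denom_paper(v, t, beta2=0.999, eps=1e-8):
    # Both the original Adam paper and the Decoupled Weight Decay paper use
    # sqrt(v_hat) + eps, where v_hat is the bias-corrected second moment.
    v_hat = v / (1 - beta2 ** t)
    return math.sqrt(v_hat) + eps

def denom_pre_fix(v, t, beta2=0.999, eps=1e-8):
    # My understanding of the pre-fix behavior: eps was added before the bias
    # correction, which is equivalent to
    #   sqrt(v_hat) + eps / sqrt(1 - beta2 ** t),
    # i.e. eps is effectively inflated in early steps.
    return (math.sqrt(v) + eps) / math.sqrt(1 - beta2 ** t)
```

Since both papers place epsilon outside the bias-corrected square root, citing either one would justify the current code equally well.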

yuhaozhang (Author) commented

To summarize, my proposed change would be to remove this note altogether, because: 1) the current code implements epsilon in the same way as the original Adam paper; 2) keeping this note may lead to further confusion about the weight decay in Adam. There is a chance that my understanding is wrong, in which case please correct me. Thanks!

dvolgyes commented

I absolutely think this documentation mistake is serious, and the documentation should be fixed as soon as possible. It misled at least me, and probably other people too.

Yura52 (Contributor) commented Apr 2, 2021

+1. I had always read the documentation as saying that Adam and AdamW are literally the same (which struck me as strange at the same time).
I think the remark can be safely removed; in its current form it adds confusion rather than making things clearer. If it is important to keep it, I would propose adding a note along these lines:
"Note: for decoupled weight decay, use AdamW."

RuABraun commented

While screen-sharing, I went to the Adam documentation to show someone that it was implemented with the fix, only to find that the documentation had changed, something I only realized after switching PyTorch versions on the documentation webpage.

Cannot believe this wasn't advertised more widely.

6 participants