
Documentation mistake of Adam in v1.6.0? #42843

Open

yuhaozhang opened this issue Aug 11, 2020 · 7 comments
Labels
module: docs Related to our documentation, both in docs/ and docblocks
module: optimizer Related to torch.optim
triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

yuhaozhang commented Aug 11, 2020

📚 Documentation

The documentation of Adam since v1.6.0 suggests that it uses the weight decay fix proposed in the paper "Decoupled Weight Decay Regularization":

The implementation of the L2 penalty follows changes proposed in
`Decoupled Weight Decay Regularization`_.

However, I found this not to be the case. The actual weight decay in Adam is still the old L2-style one:

if group['weight_decay'] != 0:
    grad = grad.add(p, alpha=group['weight_decay'])

In contrast, the new weight decay fix is implemented in AdamW, and is explained in the AdamW doc:

The AdamW variant was proposed in `Decoupled Weight Decay Regularization`_.

p.mul_(1 - group['lr'] * group['weight_decay'])

It is possible that I have misunderstood the current documentation of Adam, but I find it confusing: it suggests that there is no difference between the Adam and AdamW implementations, when in fact the difference still exists.
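To make the distinction concrete, here is a minimal scalar sketch of the two update rules as I understand them (my own illustration, not the torch.optim source; the function names and single-parameter setup are made up for clarity):

```python
import math

def adam_l2_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, wd=1e-2):
    """Adam with the classic L2 penalty: the decay is folded into the
    gradient, so it is rescaled by the adaptive denominator."""
    grad = grad + wd * p                        # grad.add(p, alpha=wd)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

def adamw_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-2):
    """AdamW: decoupled weight decay, applied directly to the parameter
    and untouched by the adaptive denominator."""
    p = p * (1 - lr * wd)                       # p.mul_(1 - lr * wd)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v
```

The decoupled version shrinks the parameter by (1 - lr * wd) directly, so the decay term is never divided by the adaptive denominator, which is exactly the change the Decoupled Weight Decay paper proposes.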

cc @jlin27 @vincentqb

@zou3519 added the module: optimizer and triaged labels Aug 11, 2020
zou3519 (Contributor) commented Aug 11, 2020

cc @vincentqb -- what do you think?

@zou3519 added the module: docs label Aug 11, 2020
vincentqb (Contributor) commented Aug 20, 2020

My understanding of the issue here is that the weight decay paper doesn't need to be referenced, and adding the reference in the documentation adds to confusion.

The reference to the follow-up paper was added via this issue, and the justification for the change is here. Does this address your comment @yuhaozhang? If not, how would you suggest rephrasing the documentation?

yuhaozhang (Author) commented Aug 20, 2020

Hi @vincentqb, thanks for the follow-up! Your understanding is correct: the current reference leads to more confusion about the actual weight decay applied in the Adam implementation.

I have looked at #41477 as you pointed out, and I am now more confused about why this reference was added in the first place. It looks to me that the current Adam implementation conforms exactly to the original Adam paper (Algorithm 1 on page 2), and there is no difference between how epsilon is applied in the original Adam paper and the more recent Decoupled Weight Decay paper.

Digging further, it looks like there was originally a difference between how Adam was implemented in early versions of PyTorch, which was then fixed (fed5ca1) with a reference to the more recent Decoupled Weight Decay paper. However, this reference is not necessary, since the treatment of epsilon is the same in both papers, and we could equally well reference the original Adam paper for it.
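For what it's worth, here is a scalar sketch of what I mean by the epsilon placement (my own paraphrase of my understanding, not the actual code or the diff in fed5ca1; the function names are made up):

```python
import math

def denom_paper(v, t, beta2=0.999, eps=1e-8):
    # Both the original Adam paper and the Decoupled Weight Decay paper use
    # sqrt(v_hat) + eps, where v_hat is the bias-corrected second moment.
    v_hat = v / (1 - beta2 ** t)
    return math.sqrt(v_hat) + eps

def denom_pre_fix(v, t, beta2=0.999, eps=1e-8):
    # My understanding of the pre-fix behavior: eps was added before the bias
    # correction, which is equivalent to
    #   sqrt(v_hat) + eps / sqrt(1 - beta2 ** t),
    # i.e. eps is effectively inflated in early steps.
    return (math.sqrt(v) + eps) / math.sqrt(1 - beta2 ** t)
```

Since both papers place epsilon outside the bias-corrected square root, citing either one would justify the current code equally well.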

yuhaozhang (Author) commented

To summarize, my proposed change would be to remove this note altogether, because: 1) the current code implements epsilon in the same way as the original Adam paper; 2) keeping this note may lead to further confusion about the weight decay in Adam. There is a chance that my understanding is wrong, in which case please correct me. Thanks!

dvolgyes commented

I absolutely think this documentation mistake is serious, and the documentation should be fixed as soon as possible. It misled at least me, and probably other people too.

Yura52 (Contributor) commented Apr 2, 2021

+1. I had always read the documentation as saying that Adam and AdamW are literally the same (which struck me as strange at the same time).
I think the remark can be safely removed; in its current form it adds confusion rather than making things clearer. If it is important to keep it, I would propose adding a note along these lines:
"Note: for decoupled weight decay, use AdamW."

RuABraun commented

While screen-sharing, I went to the Adam documentation to show someone that it was implemented with the fix, only to find that the documentation had changed, something I only realized after switching PyTorch versions on the documentation webpage.

Cannot believe this wasn't advertised more widely.

6 participants