Documentation mistake of Adam in v1.6.0? #42843
Comments
cc @vincentqb -- what do you think?
My understanding of the issue here is that the weight decay paper doesn't need to be referenced, and adding the reference in the documentation adds to confusion. The reference to the follow-up paper was added via this issue, and the justification for the change is here. Does this address your comment @yuhaozhang? If not, how would you suggest rephrasing the documentation?
Hi @vincentqb, thanks for the followup! Your understanding is correct in that this current reference leads to more confusion about the actual weight decay applied in the Adam implementation. I have looked at #41477 as you pointed to, and I am now more confused about why this reference was added in the first place. It looks to me that the current Adam implementation conforms exactly to the original Adam paper (Algorithm 1 on page 2), and there is no difference between how epsilon is applied in the original Adam paper and in the more recent Decoupled Weight Decay paper. Looking further down the line, there was originally a difference in how Adam was implemented in early versions of PyTorch, which was then fixed (fed5ca1) by referencing the more recent Decoupled Weight Decay paper. However, this reference is not necessary, since the treatment of epsilon is the same in both papers and we could equally well reference the original Adam paper for it.
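For concreteness, here is a minimal sketch of a single Adam update following Algorithm 1 of the original Adam paper. The function name and scalar-valued signature are illustrative only, not PyTorch's internals; the point is where epsilon enters: it is added to the square root of the bias-corrected second moment, which is exactly what the current `torch.optim.Adam` code does.

```python
import math

def adam_step(p, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update per Algorithm 1 of Kingma & Ba (2015).

    Scalars are used for clarity; real implementations operate on tensors.
    """
    m = beta1 * m + (1 - beta1) * grad          # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # biased second-moment estimate
    m_hat = m / (1 - beta1 ** step)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** step)             # bias-corrected second moment
    # eps is added to sqrt(v_hat), exactly as written in the original paper
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v
```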
To summarize, a proposed change would be to remove this note entirely, because: 1) the current code implements epsilon in the same way as the original Adam paper; 2) adding this note may lead to further confusion about the weight decay in Adam.
I think this documentation mistake is serious, and it should be fixed as soon as possible.
+1. I had always read the documentation as saying that Adam and AdamW are literally the same, which struck me as strange at the time.
While screensharing, I went to the Adam documentation to show someone that it was implemented with the fix, only to find that the documentation had changed, something I only noticed after switching PyTorch versions on the documentation webpage. I cannot believe this wasn't advertised more widely.
📚 Documentation
The documentation of Adam since v1.6.0 suggests that it uses the weight decay fix proposed by paper "Decoupled Weight Decay Regularization":
pytorch/torch/optim/adam.py
Lines 10 to 11 in 4b4273a
However I found this to not be the case. The actual weight decay in Adam is still the old one:
pytorch/torch/optim/adam.py
Lines 99 to 100 in 4b4273a
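That is, Adam folds the decay term into the gradient before computing the moment estimates, which is plain L2 regularization rather than the decoupled decay. A one-line sketch (the helper name is hypothetical, not part of `torch.optim`):

```python
def l2_weight_decay_grad(grad, param, weight_decay):
    """Adam-style weight decay: add weight_decay * param to the gradient
    *before* the moment estimates, i.e. ordinary L2 regularization."""
    return grad + weight_decay * param
```

Because the decayed gradient then flows through the adaptive moment estimates, the effective decay is rescaled per parameter, which is precisely the behavior the Decoupled Weight Decay paper argues against.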
In contrast, the new weight decay fix is implemented in AdamW, and explained by the AdamW doc:
pytorch/torch/optim/adamw.py
Line 10 in 4b4273a
pytorch/torch/optim/adamw.py
Line 73 in 4b4273a
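AdamW instead shrinks the parameter directly, independently of the gradient and the moment estimates. A sketch of that decoupled step (the helper name is hypothetical, not part of `torch.optim`):

```python
def decoupled_weight_decay(param, lr, weight_decay):
    """AdamW-style decay: multiply the parameter by (1 - lr * weight_decay)
    directly, keeping the decay out of the gradient and moment estimates."""
    return param * (1 - lr * weight_decay)
```

This is the key difference between the two optimizers: the decay never passes through the adaptive rescaling, so its strength is the same for every parameter.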
It is possible that I misunderstood the current documentation of Adam, but I found it confusing: it suggests that there is currently no difference between the Adam and AdamW implementations, when the difference still exists.
cc @jlin27 @vincentqb