Inconsistent description of AMSGrad with code #142323

@Tony-Y

Description

📚 The doc issue

pytorch/torch/optim/adam.py

Lines 469 to 476 in 0bd7b7a

if amsgrad:
    # Maintains the maximum of all 2nd moment running avg. till now
    torch.maximum(max_exp_avg_sqs[i], exp_avg_sq, out=max_exp_avg_sqs[i])
    # Use the max. for normalizing running avg. of gradient
    denom = (max_exp_avg_sqs[i].sqrt() / bias_correction2_sqrt).add_(eps)
else:
    denom = (exp_avg_sq.sqrt() / bias_correction2_sqrt).add_(eps)

In the code, the bias correction term $1-\beta_2^t$ is applied after the max operation. However, the documentation describes it as being applied before the max operation:
[image: documented Adam/AMSGrad algorithm, where $\widehat{v_t} = v_t / (1-\beta_2^t)$ is computed before the max operation]
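A toy sketch of why the ordering matters (illustrative values and variable names, not taken from adam.py): once the running maximum spans several steps, correcting before the max and correcting after the max give different denominators, because the stored maximum is accumulated over differently scaled quantities.

```python
import math
import torch

beta2, eps = 0.999, 1e-8
grads = [torch.tensor([1.0]), torch.tensor([0.1]), torch.tensor([0.1])]

exp_avg_sq = torch.zeros(1)        # v_t, shared by both variants
max_uncorrected = torch.zeros(1)   # running max of v_t (as in the code)
max_corrected = torch.zeros(1)     # running max of v_t / (1 - beta2**t) (as in the docs)

for step, grad in enumerate(grads, start=1):
    exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad * grad
    bias_correction2 = 1 - beta2 ** step

    # Code ordering: max over the uncorrected averages, bias correction applied afterwards.
    max_uncorrected = torch.maximum(max_uncorrected, exp_avg_sq)
    denom_code = max_uncorrected.sqrt() / math.sqrt(bias_correction2) + eps

    # Documented ordering: bias-correct first, then take the max.
    max_corrected = torch.maximum(max_corrected, exp_avg_sq / bias_correction2)
    denom_doc = max_corrected.sqrt() + eps

    print(step, denom_code.item(), denom_doc.item())
```

With this gradient sequence the two denominators agree at the first step and then diverge, since the documented ordering keeps the bias-inflated early maximum while the code re-applies the current, larger correction factor to the historical maximum.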

Suggest a potential alternative/fix

[image: amsgrad algo notes — suggested rewrite of the documented algorithm]
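For reference, one way the AMSGrad update could be stated so that it matches the implemented ordering (a sketch, not proposed documentation wording) is:

$$
v^{\max}_t = \max\left(v^{\max}_{t-1},\, v_t\right), \qquad
\theta_t = \theta_{t-1} - \gamma\,\frac{\widehat{m}_t}{\sqrt{v^{\max}_t / (1-\beta_2^t)} + \epsilon}
$$

where $v_t$ is the uncorrected second-moment running average (`exp_avg_sq`) and $\widehat{m}_t = m_t/(1-\beta_1^t)$; this corresponds to `(max_exp_avg_sqs[i].sqrt() / bias_correction2_sqrt).add_(eps)` in the code.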

cc @svekars @brycebortree @sekyondaMeta @AlannaBurke @vincentqb @jbschlosser @albanD @janeyx99 @crcrpar

Labels: module: docs, module: optimizer, triaged
