
Subtle difference with PyTorch AdamW? #35504

Closed
@kyleliang919

Description


denom = exp_avg_sq.sqrt().add_(group["eps"])
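For context, the surrounding update in the transformers AdamW looks roughly like this (a paraphrased sketch from memory, not the verbatim source, so treat the exact names such as correct_bias and state["step"] as approximate):

    step_size = group["lr"]
    if group["correct_bias"]:
        bias_correction1 = 1.0 - beta1 ** state["step"]
        bias_correction2 = 1.0 - beta2 ** state["step"]
        # the second-moment correction is folded into the step size,
        # after eps has already been added to denom above
        step_size = step_size * math.sqrt(bias_correction2) / bias_correction1
    p.addcdiv_(exp_avg, denom, value=-step_size)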

So the bias correction is applied after epsilon has been added to the denominator, whereas PyTorch applies the correction to exp_avg_sq.sqrt() first and only then adds epsilon:
https://pytorch.org/docs/stable/_modules/torch/optim/adamw.html#AdamW

step = _get_value(step_t)

bias_correction1 = 1 - beta1**step
bias_correction2 = 1 - beta2**step

step_size = lr / bias_correction1

bias_correction2_sqrt = bias_correction2**0.5

if amsgrad:
    # Maintains the maximum of all 2nd moment running avg. till now
    torch.maximum(max_exp_avg_sqs[i], exp_avg_sq, out=max_exp_avg_sqs[i])

    # Use the max. for normalizing running avg. of gradient
    denom = (max_exp_avg_sqs[i].sqrt() / bias_correction2_sqrt).add_(eps)
else:
    denom = (exp_avg_sq.sqrt() / bias_correction2_sqrt).add_(eps)

param.addcdiv_(exp_avg, denom, value=-step_size)
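The two orderings are not algebraically equivalent: in the transformers version eps is effectively rescaled by 1 / sqrt(bias_correction2), which matters most in early steps. A quick standalone check with hypothetical values (not taken from either library):

    import math

    v = 1e-4                      # one entry of exp_avg_sq
    eps = 1e-8
    beta2, step = 0.999, 10
    bias_correction2_sqrt = math.sqrt(1 - beta2 ** step)

    # transformers ordering: eps added first, sqrt(bias_correction2) folded into the
    # step size, so the update effectively divides by (sqrt(v) + eps) / sqrt(bias_correction2)
    effective_denom_hf = (math.sqrt(v) + eps) / bias_correction2_sqrt

    # pytorch ordering: correction applied to sqrt(v) first, then eps is added
    denom_pt = math.sqrt(v) / bias_correction2_sqrt + eps

    # nonzero difference, approximately eps * (1 / bias_correction2_sqrt - 1)
    print(effective_denom_hf - denom_pt)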
