
Subtle difference with PyTorch AdamW? #35504

Closed
@kyleliang919

Description


denom = exp_avg_sq.sqrt().add_(group["eps"])
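For context, the surrounding update in the transformers AdamW looks roughly like this (a paraphrased sketch from memory, not the verbatim source, so treat the exact names such as correct_bias and state["step"] as approximate):

    step_size = group["lr"]
    if group["correct_bias"]:
        bias_correction1 = 1.0 - beta1 ** state["step"]
        bias_correction2 = 1.0 - beta2 ** state["step"]
        # the second-moment correction is folded into the step size,
        # after eps has already been added to denom above
        step_size = step_size * math.sqrt(bias_correction2) / bias_correction1
    p.addcdiv_(exp_avg, denom, value=-step_size)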

So the bias correction is applied after epsilon has been added to the denominator, whereas PyTorch applies the correction to exp_avg_sq.sqrt() first and only then adds epsilon:
https://pytorch.org/docs/stable/_modules/torch/optim/adamw.html#AdamW

step = _get_value(step_t)

bias_correction1 = 1 - beta1**step
bias_correction2 = 1 - beta2**step

step_size = lr / bias_correction1

bias_correction2_sqrt = bias_correction2**0.5

if amsgrad:
    # Maintains the maximum of all 2nd moment running avg. till now
    torch.maximum(max_exp_avg_sqs[i], exp_avg_sq, out=max_exp_avg_sqs[i])

    # Use the max. for normalizing running avg. of gradient
    denom = (max_exp_avg_sqs[i].sqrt() / bias_correction2_sqrt).add_(eps)
else:
    denom = (exp_avg_sq.sqrt() / bias_correction2_sqrt).add_(eps)

param.addcdiv_(exp_avg, denom, value=-step_size)
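The two orderings are not algebraically equivalent: in the transformers version eps is effectively rescaled by 1 / sqrt(bias_correction2), which matters most in early steps. A quick standalone check with hypothetical values (not taken from either library):

    import math

    v = 1e-4                      # one entry of exp_avg_sq
    eps = 1e-8
    beta2, step = 0.999, 10
    bias_correction2_sqrt = math.sqrt(1 - beta2 ** step)

    # transformers ordering: eps added first, sqrt(bias_correction2) folded into the
    # step size, so the update effectively divides by (sqrt(v) + eps) / sqrt(bias_correction2)
    effective_denom_hf = (math.sqrt(v) + eps) / bias_correction2_sqrt

    # pytorch ordering: correction applied to sqrt(v) first, then eps is added
    denom_pt = math.sqrt(v) / bias_correction2_sqrt + eps

    # nonzero difference, approximately eps * (1 / bias_correction2_sqrt - 1)
    print(effective_denom_hf - denom_pt)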
