Closed
Description
transformers/src/transformers/optimization.py
Line 648 in e5fd865
It does correction after epsilon is added, whereas pytorch:
https://pytorch.org/docs/stable/_modules/torch/optim/adamw.html#AdamW
step = _get_value(step_t)
bias_correction1 = 1 - beta1**step
bias_correction2 = 1 - beta2**step
step_size = lr / bias_correction1
bias_correction2_sqrt = bias_correction2**0.5
if amsgrad:
# Maintains the maximum of all 2nd moment running avg. till now
torch.maximum(max_exp_avg_sqs[i], exp_avg_sq, out=max_exp_avg_sqs[i])
# Use the max. for normalizing running avg. of gradient
denom = (max_exp_avg_sqs[i].sqrt() / bias_correction2_sqrt).add_(eps)
else:
denom = (exp_avg_sq.sqrt() / bias_correction2_sqrt).add_(eps)
param.addcdiv_(exp_avg, denom, value=-step_size)
Metadata
Metadata
Assignees
Labels
No labels