Our TF-layers Nadam optimizer is basically the same as Adam except that we use `use_nesterov=True` for `training_ops.apply_adam`. It is based on the TF 1.15 `tensorflow/contrib/opt/python/training/nadam_optimizer.py`. So it also has the same options as normal Adam.
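For reference, a condensed sketch of the dense-update path of that contrib implementation (slightly abbreviated; the sparse and resource-variable variants are left out, and this assumes a TF 1.15 environment):

```python
from tensorflow.python.ops import math_ops
from tensorflow.python.training import adam, training_ops


class NadamOptimizer(adam.AdamOptimizer):
  """Adam with Nesterov momentum: same state and options as Adam,
  only the fused apply_adam kernel is called with use_nesterov=True."""

  def _apply_dense(self, grad, var):
    m = self.get_slot(var, "m")
    v = self.get_slot(var, "v")
    beta1_power, beta2_power = self._get_beta_accumulators()
    return training_ops.apply_adam(
        var, m, v,
        math_ops.cast(beta1_power, var.dtype.base_dtype),
        math_ops.cast(beta2_power, var.dtype.base_dtype),
        math_ops.cast(self._lr_t, var.dtype.base_dtype),
        math_ops.cast(self._beta1_t, var.dtype.base_dtype),
        math_ops.cast(self._beta2_t, var.dtype.base_dtype),
        math_ops.cast(self._epsilon_t, var.dtype.base_dtype),
        grad,
        use_locking=self._use_locking,
        use_nesterov=True,  # the only difference vs. plain Adam
    ).op
```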
I noticed that `tf.keras.optimizers.experimental.Nadam` has some different options.

Ok, I did not look further into this. The clipping and weight decay are probably added there to decouple them from the core update. `use_ema` is disabled by default, so the `ema_...` options are not used. So maybe it is mostly the same, except for a different epsilon default.
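To illustrate the point about defaults, here is how I would pin the Keras optimizer to behave like the old one (assuming a TF version where the experimental namespace exists; the 1e-7 vs. 1e-8 epsilon defaults are my reading of the respective docs):

```python
import tensorflow as tf

# Keras experimental Nadam: clipping, weight decay and EMA are all optional
# and off by default; epsilon defaults to 1e-7 (vs. 1e-8 in the TF 1.x kernel).
opt = tf.keras.optimizers.experimental.Nadam(
    learning_rate=1e-3,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-8,        # match the TF 1.15 default explicitly
    weight_decay=None,   # decoupled weight decay, disabled by default
    clipnorm=None,       # gradient clipping, disabled by default
    use_ema=False,       # weight EMA, disabled by default -> ema_* options unused
)
```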
See also:
#766 (comment)
keras-team/keras#15710
Now I noticed that in PyTorch, `torch.optim.NAdam` again has different options.

I specifically wonder about `momentum_decay`. What is this?
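From a quick look at the PyTorch docs (so this is just my reading, not verified against the code), `momentum_decay` seems to be the psi from Dozat's Nadam paper: it only enters the per-step momentum schedule mu_t = beta1 * (1 - 0.5 * 0.96**(t * psi)), so despite the name it effectively warms the momentum up from about beta1/2 towards beta1. A small sketch:

```python
import torch

model = torch.nn.Linear(8, 2)

# Documented defaults (at the time of writing): lr=2e-3, betas=(0.9, 0.999),
# eps=1e-8, weight_decay=0, momentum_decay=4e-3.
opt = torch.optim.NAdam(model.parameters(), lr=2e-3, momentum_decay=4e-3)

# My understanding of momentum_decay (psi): it only affects the momentum
# schedule mu_t = beta1 * (1 - 0.5 * 0.96 ** (t * psi)).
beta1, psi = 0.9, 4e-3
for t in (1, 10, 100, 1000):
    mu_t = beta1 * (1.0 - 0.5 * 0.96 ** (t * psi))
    print(t, mu_t)  # grows from ~0.45 towards 0.9
```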