beta2_power is applied incorrectly in Adam optimizer #9047
Comments
@georgedahl @allenlavoie @alextp: Any comments on this (or know whom to redirect to)?
Just so I'm clear, you do agree that the TensorFlow implementation matches the expression just before Section 2.1, with the "epsilon hat" in it? It's just that "epsilon hat" is a scaled version of the epsilon in Algorithm 1 in the paper, right? Is there a use-case for having the un-scaled epsilon as an option?
I think this is correct. If we look in the paper https://arxiv.org/pdf/1412.6980.pdf at the bottom of section 2 (Algorithm), it says "the efficiency of algorithm 1 can, at the expense of clarity, be improved..." and lists an update like what we are doing, no? |
Ah, the "epsilon hat" vs the regular epsilon is what I was missing (though "epsilon hat" is never defined that I can see, I assume it is a scaled version). I agree this is probably a bug. |
Oops, yeah – I think it should be something like:

`epsilon_hat = epsilon_t * tf.sqrt(1 - beta2_power)`

Otherwise, without this scaling, the effective epsilon is scaled up by `1 / sqrt(1 - beta2_power)`. With the default settings it probably barely matters (you get an epsilon of ~3e-7 instead of 1e-8... big deal), but if you follow the recommendations in the comments and e.g. set epsilon to 1 or 0.1, then you will effectively barely update the weights for the first few iterations, until `sqrt(1 - beta2_power)` grows close to 1. I can't really think of a reason why you'd want to incorrectly scale the epsilon in this case.
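To make the scaling concrete, here is a minimal standalone sketch (plain Python, not TensorFlow code; the function name is my own) of the effective epsilon when the raw epsilon is left unscaled, as in the current implementation:

```python
import math

def effective_epsilon(epsilon, beta2, t):
    # Without the epsilon_hat rescaling, epsilon is effectively divided
    # by the bias-correction factor sqrt(1 - beta2**t), which is small
    # for small t, so the effective epsilon is inflated early on.
    return epsilon / math.sqrt(1.0 - beta2 ** t)

eps, beta2 = 1e-8, 0.999
print(effective_epsilon(eps, beta2, 1))      # ~3.16e-7 at the first step
print(effective_epsilon(eps, beta2, 10000))  # ~1e-8 once beta2**t has decayed
```

With a large epsilon like 1.0, the same factor inflates it to ~31.6 at step 1, which dominates `sqrt(v_t)` and makes the first updates tiny.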
On the other hand, the recommendation in that comment is definitely referring to epsilon_hat being 1 or 0.1, not epsilon (i.e. the person who wrote it was using TensorFlow's implementation of Adam). @skywaLKer518: Any thoughts on epsilon vs. epsilon_hat? It looks like you authored that bit of advice on large values of epsilon for Adam. |
It is very possible that the optimizer gets modified after I tested and added the comments (I'm not exactly sure now -- the code is definitely at least re-organized and I do not recall the use of epsilon_hat clearly) so that the comment might not apply any more. But at the time we test it on Inception (summer 2015), we observe much better performance with large epsilon (e.g. 1.0) than the default values like 1e-8 (which sometimes even causes objective divergence if I remember correctly). |
epsilon and epsilon_hat end up approximately the same after a few thousand iterations anyway – not using epsilon_hat just means that training will start out very slow at the beginning when using a large value of epsilon. |
Interesting. I'm inclined to keep behavior as-is, since it probably doesn't matter for the on-label use of avoiding short-term numerical instability due to near-zero gradients. In fact we'd need to be careful that a correction didn't introduce numerical issues. The off-label advice for making long-term training more stable has developed for epsilon_hat, so we'd make people re-tune their hyperparameters for questionable benefit by changing the default (i.e. maybe very small updates for the first few thousand iterations is desirable). However, the documentation should certainly be updated to let people know that it's epsilon_hat rather than epsilon that they're setting. @taion: Does that sound reasonable? I'm happy to put together the documentation changes. |
Just checked a few other packages –
I mean, I don't know. The correct version seems unlikely to cause issues for the on-label case, or people using Lasagne would have had problems. I don't really want to train an InceptionNet from scratch to see what this does for the off-label case; I will just note that when I tried using a larger epsilon a while ago, my model seemed to not train at all even when I used a higher learning rate, which would suggest that the current implementation makes things worse in the off-label case of using a higher epsilon. |
Thank you for taking a look at other frameworks! That's an interesting list. In terms of numerical issues, I'm particularly worried about float16 users. AdamOptimizer already requires tuning the default epsilon_hat of 1e-8 to at least 1e-7 to avoid errors. If we naively divided it by anything more than 3, it would underflow. Specifying epsilon_hat directly at least makes reasoning about precision issues a bit easier. |
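The float16 concern can be checked numerically; a small NumPy sketch (an illustration of half-precision limits, not TensorFlow code) shows how close a tuned epsilon_hat of 1e-7 already is to the underflow boundary:

```python
import numpy as np

# float16's smallest positive subnormal is 2**-24 ~ 5.96e-8, so values
# much below that round to zero.
assert np.float16(1e-7) > 0        # a tuned epsilon_hat of 1e-7 survives
assert np.float16(1e-7 / 4) == 0   # dividing by ~4 underflows to zero
assert np.float16(1e-8) == 0       # the default 1e-8 already underflows
```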
Oh! I didn't think of the float16 case. Maybe the first-best would be to rename the current epsilon argument to make clear that it's really epsilon_hat.
The current formulation is sufficiently useful for numerical stability that we'll want to keep it around. I'll do the documentation fix and mark this as closed once that propagates. Feel free to open a feature request for a second epsilon parameter ("epsilon_nohat"?), or work on a pull request, but I don't think it's going to be a priority without experimental evidence that it's useful. |
Sounds good, thanks. |
Fix is submitted, should be synced within a day or so. Thank you for the report!
Thanks! |
In `adam.py` and in the `ApplyAdam` op, the denominator is effectively:

`sqrt(v_t) + epsilon_t`

However, this appears incorrect – per the paper, the correct EMA adjustment should give:

`sqrt(v_t) + epsilon_t * sqrt(1 - beta2_power)`

Otherwise, when `epsilon_t` is large relative to `tf.sqrt(v_t)`, the effective epsilon used in the denominator is also scaled up by the correction factor, which doesn't match what's in the paper. Does this seem right, or am I missing something here?
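For reference, here is a scalar sketch of the two single-step updates being compared (my own function names; the formulas follow Algorithm 1 of the paper and the folded-in form discussed in this thread, not TensorFlow internals). The two agree exactly when the folded form uses `epsilon * sqrt(1 - beta2**t)` in place of the raw epsilon:

```python
import math

def paper_update(m_t, v_t, lr, beta1, beta2, eps, t):
    # Algorithm 1: bias-correct m and v, then divide by sqrt(v_hat) + eps.
    m_hat = m_t / (1 - beta1 ** t)
    v_hat = v_t / (1 - beta2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

def folded_update(m_t, v_t, lr, beta1, beta2, eps, t):
    # The "improved efficiency" form: corrections folded into lr_t,
    # with eps left unscaled in the denominator.
    lr_t = lr * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    return lr_t * m_t / (math.sqrt(v_t) + eps)
```

Passing `eps * math.sqrt(1 - beta2 ** t)` as the epsilon of `folded_update` reproduces `paper_update`; passing the raw `eps` does not, which is the discrepancy reported here.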