beta2_power is applied incorrectly in Adam optimizer #9047
Comments
@georgedahl @allenlavoie @alextp: Any comments on this (or know whom to redirect to)?
Just so I'm clear, you do agree that the TensorFlow implementation matches the expression just before Section 2.1, with the "epsilon hat" in it? It's just that "epsilon hat" is a scaled version of the epsilon in Algorithm 1 in the paper, right? Is there a use-case for having the un-scaled epsilon as an option?
I think this is correct. If we look in the paper https://arxiv.org/pdf/1412.6980.pdf at the bottom of section 2 (Algorithm), it says "the efficiency of algorithm 1 can, at the expense of clarity, be improved..." and lists an update like what we are doing, no? |
Ah, the "epsilon hat" vs the regular epsilon is what I was missing (though "epsilon hat" is never defined that I can see, I assume it is a scaled version). I agree this is probably a bug. |
Oops, yeah – I think it should be something like:

`epsilon_hat = epsilon_t * tf.sqrt(1 - beta2_power)`

Otherwise, without this scaling, the effective epsilon is scaled up by `1 / sqrt(1 - beta2_power)`. With the default settings it probably barely matters (you get an epsilon of ~3e-7 instead of 1e-8... big deal), but if you follow the recommendations in the comments and e.g. set epsilon to 1 or 0.1, then you will effectively barely update the weights for the first few iterations, until `sqrt(1 - beta2_power)` grows close to 1. I can't really think of a reason why you'd want to incorrectly scale the epsilon in this case.
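To make the scaling concrete, here is a minimal standalone sketch (plain Python, not TensorFlow code; the function name is my own) of the effective epsilon when the raw epsilon is left unscaled, as in the current implementation:

```python
import math

def effective_epsilon(epsilon, beta2, t):
    # Without the epsilon_hat rescaling, epsilon is effectively divided
    # by the bias-correction factor sqrt(1 - beta2**t), which is small
    # for small t, so the effective epsilon is inflated early on.
    return epsilon / math.sqrt(1.0 - beta2 ** t)

eps, beta2 = 1e-8, 0.999
print(effective_epsilon(eps, beta2, 1))      # ~3.16e-7 at the first step
print(effective_epsilon(eps, beta2, 10000))  # ~1e-8 once beta2**t has decayed
```

With a large epsilon like 1.0, the same factor inflates it to ~31.6 at step 1, which dominates `sqrt(v_t)` and makes the first updates tiny.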
On the other hand, the recommendation in that comment is definitely referring to epsilon_hat being 1 or 0.1, not epsilon (i.e. the person who wrote it was using TensorFlow's implementation of Adam). @skywaLKer518: Any thoughts on epsilon vs. epsilon_hat? It looks like you authored that bit of advice on large values of epsilon for Adam. |
It is very possible that the optimizer gets modified after I tested and added the comments (I'm not exactly sure now -- the code is definitely at least re-organized and I do not recall the use of epsilon_hat clearly) so that the comment might not apply any more. But at the time we test it on Inception (summer 2015), we observe much better performance with large epsilon (e.g. 1.0) than the default values like 1e-8 (which sometimes even causes objective divergence if I remember correctly). |
epsilon and epsilon_hat end up approximately the same after a few thousand iterations anyway – not using epsilon_hat just means that training will start out very slow at the beginning when using a large value of epsilon. |
Interesting. I'm inclined to keep behavior as-is, since it probably doesn't matter for the on-label use of avoiding short-term numerical instability due to near-zero gradients. In fact we'd need to be careful that a correction didn't introduce numerical issues. The off-label advice for making long-term training more stable has developed for epsilon_hat, so we'd make people re-tune their hyperparameters for questionable benefit by changing the default (i.e. maybe very small updates for the first few thousand iterations is desirable). However, the documentation should certainly be updated to let people know that it's epsilon_hat rather than epsilon that they're setting. @taion: Does that sound reasonable? I'm happy to put together the documentation changes. |
Just checked a few other packages –
I mean, I don't know. The correct version seems unlikely to cause issues for the on-label case, or people using Lasagne would have had problems. I don't really want to train an InceptionNet from scratch to see what this does for the off-label case; I will just note that when I tried using a larger epsilon a while ago, my model seemed to not train at all even when I used a higher learning rate, which would suggest that the current implementation makes things worse in the off-label case of using a higher epsilon. |
Thank you for taking a look at other frameworks! That's an interesting list. In terms of numerical issues, I'm particularly worried about float16 users. AdamOptimizer already requires tuning the default epsilon_hat of 1e-8 to at least 1e-7 to avoid errors. If we naively divided it by anything more than 3, it would underflow. Specifying epsilon_hat directly at least makes reasoning about precision issues a bit easier. |
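The float16 concern can be checked numerically; a small NumPy sketch (an illustration of half-precision limits, not TensorFlow code) shows how close a tuned epsilon_hat of 1e-7 already is to the underflow boundary:

```python
import numpy as np

# float16's smallest positive subnormal is 2**-24 ~ 5.96e-8, so values
# much below that round to zero.
assert np.float16(1e-7) > 0        # a tuned epsilon_hat of 1e-7 survives
assert np.float16(1e-7 / 4) == 0   # dividing by ~4 underflows to zero
assert np.float16(1e-8) == 0       # the default 1e-8 already underflows
```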
Oh! I didn't think of the float16 case. Maybe the first-best would be to rename the current epsilon argument to make clear that it's really epsilon_hat.
The current formulation is sufficiently useful for numerical stability that we'll want to keep it around. I'll do the documentation fix and mark this as closed once that propagates. Feel free to open a feature request for a second epsilon parameter ("epsilon_nohat"?), or work on a pull request, but I don't think it's going to be a priority without experimental evidence that it's useful. |
Sounds good, thanks. |
Fix is submitted, should be synced within a day or so. Thank you for the report!
Thanks! |
In `adam.py` and in the `ApplyAdam` op, the denominator is effectively:

`sqrt(v_t) + epsilon_t`

However, this appears incorrect – per the paper, the correct EMA adjustment should give:

`sqrt(v_t) + epsilon_t * sqrt(1 - beta2_power)`

Otherwise, when `epsilon_t` is large relative to `tf.sqrt(v_t)`, the effective epsilon used in the denominator is also scaled up by the correction factor, which doesn't match what's in the paper. Does this seem right, or am I missing something here?
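For reference, here is a scalar sketch of the two single-step updates being compared (my own function names; the formulas follow Algorithm 1 of the paper and the folded-in form discussed in this thread, not TensorFlow internals). The two agree exactly when the folded form uses `epsilon * sqrt(1 - beta2**t)` in place of the raw epsilon:

```python
import math

def paper_update(m_t, v_t, lr, beta1, beta2, eps, t):
    # Algorithm 1: bias-correct m and v, then divide by sqrt(v_hat) + eps.
    m_hat = m_t / (1 - beta1 ** t)
    v_hat = v_t / (1 - beta2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

def folded_update(m_t, v_t, lr, beta1, beta2, eps, t):
    # The "improved efficiency" form: corrections folded into lr_t,
    # with eps left unscaled in the denominator.
    lr_t = lr * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    return lr_t * m_t / (math.sqrt(v_t) + eps)
```

Passing `eps * math.sqrt(1 - beta2 ** t)` as the epsilon of `folded_update` reproduces `paper_update`; passing the raw `eps` does not, which is the discrepancy reported here.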