beta2_power is applied incorrectly in Adam optimizer #9047

Closed · taion opened this issue Apr 7, 2017 · 16 comments
Labels: stat:awaiting tensorflower (Status - Awaiting response from tensorflower), type:bug (Bug)

Comments

taion (Contributor) commented Apr 7, 2017

In adam.py and in the ApplyAdam op, the denominator is effectively:

(tf.sqrt(v_t) + epsilon_t) / tf.sqrt(1 - beta2_power)

However, this appears incorrect – per the paper, the correct EMA adjustment should give:

tf.sqrt(v_t / (1 - beta2_power)) + epsilon_t

Otherwise, when epsilon_t is large relative to tf.sqrt(v_t), the effective epsilon used in the denominator is also scaled up by the correction factor, which doesn't match what's in the paper.

Does this seem right, or am I missing something here?
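
To make the difference concrete, here is a minimal sketch in plain NumPy (not the actual adam.py / ApplyAdam kernels). The value of v_t is hypothetical, and epsilon is set to 1.0 (the "large epsilon" case discussed later in the thread) to make the effect visible at step t = 1:

```python
import numpy as np

beta2, epsilon_t, t = 0.999, 1.0, 1   # hypothetical settings for illustration
v_t = 1e-12                           # hypothetical second-moment estimate
beta2_power = beta2 ** t

# Denominator as the TensorFlow formulation effectively computes it:
denom_tf = (np.sqrt(v_t) + epsilon_t) / np.sqrt(1 - beta2_power)
# Denominator per Algorithm 1 in the paper:
denom_paper = np.sqrt(v_t / (1 - beta2_power)) + epsilon_t

print(denom_tf)     # ~31.6: epsilon gets inflated by 1 / sqrt(1 - beta2^t)
print(denom_paper)  # ~1.0
```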

@asimshankar (Contributor)

@georgedahl @allenlavoie @alextp: Any comments on this (or do you know whom to redirect it to)?

asimshankar added the stat:awaiting tensorflower and type:bug labels on Apr 7, 2017
@allenlavoie (Member)

Just so I'm clear, you do agree that the TensorFlow implementation matches the expression just before Section 2.1, with the "epsilon hat" in it? It's just that "epsilon hat" is a scaled version of the epsilon in Algorithm 1 in the paper, right?

Is there a use-case for having the un-scaled epsilon as an option?

lvilnis commented Apr 7, 2017

I think the current behavior is correct. If we look at the paper (https://arxiv.org/pdf/1412.6980.pdf), at the bottom of Section 2 it says "the efficiency of algorithm 1 can, at the expense of clarity, be improved..." and lists an update like the one we are doing, no?

[screenshot of the paper's efficient update: α_t = α · √(1 − β₂^t) / (1 − β₁^t); θ_t ← θ_{t−1} − α_t · m_t / (√v_t + ε̂)]

lvilnis commented Apr 7, 2017

Ah, "epsilon hat" vs. the regular epsilon is what I was missing (though "epsilon hat" is never defined as far as I can see; I assume it is a scaled version). I agree this is probably a bug.
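
For reference, a short derivation from the paper's own definitions (not from the TensorFlow source) showing how the two epsilons relate:

```latex
% Algorithm 1, with \hat{m}_t = m_t / (1 - \beta_1^t) and \hat{v}_t = v_t / (1 - \beta_2^t):
\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
         = \theta_{t-1} - \underbrace{\frac{\alpha \sqrt{1 - \beta_2^t}}{1 - \beta_1^t}}_{\alpha_t}
           \cdot \frac{m_t}{\sqrt{v_t} + \epsilon \sqrt{1 - \beta_2^t}}
% So the "efficient" form before Section 2.1 reproduces Algorithm 1 only when
% \hat{\epsilon}_t = \epsilon \sqrt{1 - \beta_2^t} is rescaled each step; holding
% \hat{\epsilon} constant instead corresponds to an effective
% \epsilon = \hat{\epsilon} / \sqrt{1 - \beta_2^t}, which is large early in training.
```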

taion (Contributor, Author) commented Apr 7, 2017

Oops, yeah – I think it should be something like:

epsilon_hat = epsilon_t * tf.sqrt(1 - beta2_power)

Without this scaling, the effective epsilon is multiplied by 1. / tf.sqrt(1 - beta2_power).

With the default settings it probably barely matters (you get an effective epsilon of ~3e-7 instead of 1e-8... big deal), but if you follow the recommendations in the comments and set epsilon to, e.g., 1 or 0.1, then you will barely update the weights for the first few iterations, until beta2_power gets closer to 0. For example, with epsilon set to 1, the effective "epsilon" used on the first iteration would be ~31.6.

I can't really think of a reason why you'd want to incorrectly scale the epsilon in this case.
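
A back-of-the-envelope sketch of that scaling factor (plain Python, using the default beta2 = 0.999; the step counts are arbitrary):

```python
# Effective paper-style epsilon implied by a fixed epsilon_t under the current
# formulation: epsilon_t / sqrt(1 - beta2^t).
beta2 = 0.999
for epsilon_t in (1e-8, 1.0):
    for t in (1, 10, 100, 1000, 10000):
        effective = epsilon_t / (1 - beta2 ** t) ** 0.5
        print(f"epsilon_t={epsilon_t}, t={t}: effective epsilon ~{effective:.3g}")
# At t=1 this gives ~3.2e-7 for epsilon_t=1e-8 and ~31.6 for epsilon_t=1.0
# (the numbers quoted above); by t=10000 the factor has decayed to ~1.
```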

@allenlavoie (Member)

On the other hand, the recommendation in that comment is definitely referring to epsilon_hat being 1 or 0.1, not epsilon (i.e. the person who wrote it was using TensorFlow's implementation of Adam).

@skywaLKer518: Any thoughts on epsilon vs. epsilon_hat? It looks like you authored that bit of advice on large values of epsilon for Adam.

skywaLKer518 commented Apr 11, 2017

It is very possible that the optimizer was modified after I tested it and added the comments (I'm not exactly sure now; the code has at least been reorganized, and I do not clearly recall the use of epsilon_hat), so the comment might not apply any more. But when we tested it on Inception (summer 2015), we observed much better performance with a large epsilon (e.g. 1.0) than with defaults like 1e-8 (which, if I remember correctly, sometimes even caused the objective to diverge).

taion (Contributor, Author) commented Apr 11, 2017

epsilon and epsilon_hat end up approximately the same after a few thousand iterations anyway; not applying the scaling just means that training starts out very slowly when a large value of epsilon is used.

@allenlavoie (Member)

Interesting. I'm inclined to keep behavior as-is, since it probably doesn't matter for the on-label use of avoiding short-term numerical instability due to near-zero gradients. In fact we'd need to be careful that a correction didn't introduce numerical issues.

The off-label advice for making long-term training more stable was developed against epsilon_hat, so changing the default would make people re-tune their hyperparameters for questionable benefit (i.e. maybe very small updates for the first few thousand iterations are actually desirable).

However, the documentation should certainly be updated to let people know that it's epsilon_hat rather than epsilon that they're setting.

@taion: Does that sound reasonable? I'm happy to put together the documentation changes.

taion (Contributor, Author) commented Apr 11, 2017

Just checked a few other packages –

I mean, I don't know. The correct version seems unlikely to cause issues for the on-label case, or people using Lasagne would have had problems.

I don't really want to train an InceptionNet from scratch to see what this does for the off-label case. I will just note that when I tried using a larger epsilon a while ago, my model seemed not to train at all even with a higher learning rate, which suggests that the current implementation makes things worse in the off-label case of a larger epsilon.

@allenlavoie (Member)

Thank you for taking a look at other frameworks! That's an interesting list.

In terms of numerical issues, I'm particularly worried about float16 users. AdamOptimizer already requires tuning the default epsilon_hat of 1e-8 to at least 1e-7 to avoid errors. If we naively divided it by anything more than 3, it would underflow. Specifying epsilon_hat directly at least makes reasoning about precision issues a bit easier.
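
For context, a rough NumPy illustration of the headroom involved (this is about float16 itself, not the TF kernels):

```python
import numpy as np

# float16 leaves very little room below the current default epsilon_hat:
# the smallest positive subnormal is 2**-24 (~6e-8), the smallest normal ~6.1e-5.
print(np.float16(1e-8))           # rounds to 0.0
print(np.float16(1e-7))           # representable, but only as a subnormal
print(np.finfo(np.float16).tiny)  # ~6.1e-5, the smallest normal float16
```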

taion (Contributor, Author) commented Apr 12, 2017

Oh! I didn't think of the float16 case.

Maybe the best first step would be to rename the current epsilon to epsilon_hat (with a deprecation notice for the moment), and then possibly add a proper epsilon later on?

@allenlavoie (Member)

The current formulation is sufficiently useful for numerical stability that we'll want to keep it around. I'll do the documentation fix and mark this as closed once that propagates. Feel free to open a feature request for a second epsilon parameter ("epsilon_nohat"?), or work on a pull request, but I don't think it's going to be a priority without experimental evidence that it's useful.

taion (Contributor, Author) commented Apr 12, 2017

Sounds good, thanks.

@allenlavoie (Member)

The fix is being submitted and should be synced within a day or so. Thank you for the report!

drpngx closed this as completed in 1926b9d on Apr 13, 2017
taion (Contributor, Author) commented Apr 13, 2017

Thanks!
