Why does transformer use noam as the learning rate decay scheme? #280

Closed
anglil opened this Issue Sep 5, 2017 · 13 comments

anglil commented Sep 5, 2017

What's the benefit of "noam" in machine translation compared to exponential or square root decay?
Also, why is gradient clipping set to 0 in the base model of the Transformer?

chengyong3001 commented Oct 11, 2017

I am also confused about why they multiply the weight decay by such a large number (5000). It seems strange.

lukaszkaiser commented Oct 13, 2017

All the multiplications are performed because T2T uses normalized values: we try to make a learning rate of 0.1 work with various optimizers (normally Adam would use 0.002 or so), and we try to make weight decay per-parameter (people usually tune it per model, but then whenever you change hidden_size you need to change that too, and a number of other things as well). I can see that the normalized choices are sometimes confusing when compared to the literature, but it's very hard to make different models and hparam_sets work well without some form of normalization.
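As a back-of-the-envelope illustration (this is my own arithmetic, assuming the scheme folds in a hidden_size**-0.5 factor as in the noam formula, not the actual T2T code), the "normalized" 0.1 lands in the same ballpark as the Adam rates usually quoted in the literature:

```python
# Rough illustration only; assumes the normalization divides by sqrt(hidden_size).
hidden_size = 512            # transformer_base
normalized_lr = 0.1          # T2T's normalized default
effective_lr = normalized_lr * hidden_size ** -0.5
print(effective_lr)          # ~0.0044, close to the ~0.002 typically hand-tuned for Adam
```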

colmantse commented Oct 24, 2017

Is there any pointer to noam decay? I was googling but cannot find how exactly noam decay is done.

lukaszkaiser commented Oct 24, 2017

For now, the best pointer is Section 5.3 in the "Attention Is All You Need" paper (https://arxiv.org/abs/1706.03762). The equation is there already, but we'll try to add more details and intuition in a later version of the paper.
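For readers landing here later, here is a minimal sketch of that schedule, following the equation in Section 5.3 (the function name and defaults below are mine, not from the T2T code):

```python
def noam_learning_rate(step, d_model=512, warmup_steps=4000):
    """lrate = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)

    Linear warmup for the first warmup_steps steps, then decay proportional
    to the inverse square root of the step number.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```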

rafaelvalle commented Jan 22, 2018

@lukaszkaiser hey Lukasz, what's the explanation behind first increasing and then decreasing the learning rate, as described in the "Attention Is All You Need" paper?

martinpopel commented Jan 22, 2018

@rafaelvalle: Decreasing the learning rate, aka learning rate decay (usually exponential, piecewise-constant or inverse-time), has been standard practice in ML for decades. Increasing the learning rate in the early stages with a warmup (usually linear or exponential growth) is a more recent practice, popular especially in deep learning on ImageNet; see e.g. He et al. 2016 or Goyal et al. 2017.
The "noam" scheme is just a particular way of putting the warmup and decay together (linear warmup for a given number of steps, followed by decay proportional to the inverse square root of the step number).

Learning rate scheduling is an active research area. See e.g. the papers on cyclical learning rates (corresponding to learning_rate_decay_scheme=cosine available in tensor2tensor) and super-convergence, which also provide more insight into the theory behind the learning rate, batch size, gradient noise, etc.
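For contrast with the single-cycle noam scheme, here is a generic sketch of a cosine schedule with warm restarts (a cyclical shape in the spirit of those papers; this is not the tensor2tensor implementation):

```python
import math

def cosine_with_warmup(step, max_lr=0.2, warmup_steps=4000, cycle_steps=50000):
    """Linear warmup, then a cosine curve that restarts every cycle_steps steps."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    t = ((step - warmup_steps) % cycle_steps) / cycle_steps  # position within the cycle
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * t))
```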

rafaelvalle commented Jan 22, 2018

Thank you for sharing all this info @martinpopel. I'm aware of exponential decay and other schedules but hadn't paid much attention to "warmups". Are there any papers that study warmup and provide a theoretical explanation for it?

martinpopel commented Jan 22, 2018

@rafaelvalle: Yes. All four papers I referenced study warmup, and the last three provide some motivation/theory. The cyclical learning rate is a kind of generalization of warmup+decay (the "noam" scheme does just one cycle). I am not aware of any published experiments with CLR in NMT (I am running some now).

rolloff commented Jan 26, 2018

One hypothesis I have is that adding gradient clipping back in would allow us to use no warmup on these models. When I set warmup to 0, the loss goes to NaN. This could be caused by a very large gradient norm, which gradient clipping would help with.
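If someone wants to experiment with that, a minimal TF 1.x sketch of global-norm clipping looks like this (the toy variable and loss are placeholders, not the T2T model; if I remember correctly, T2T exposes the same idea via the clip_grad_norm hparam, which transformer_base sets to 0):

```python
import tensorflow as tf

# Toy example: clip the global gradient norm before applying the update.
w = tf.Variable([1.0, -2.0])
loss = tf.reduce_sum(tf.square(w))

optimizer = tf.train.MomentumOptimizer(learning_rate=0.4, momentum=0.9)
grads, variables = zip(*optimizer.compute_gradients(loss))
clipped_grads, _ = tf.clip_by_global_norm(list(grads), clip_norm=1.0)
train_op = optimizer.apply_gradients(list(zip(clipped_grads, variables)))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)
    print(sess.run(w))
```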

martinpopel commented Jan 26, 2018

I've tried various learning rate schedules with the Transformer and NMT. I haven't seen any NaN loss, but sometimes the (training and test) loss suddenly jumps too high (and either stays there and the training diverges, or jumps back, but the convergence is slowed down). In my experience this happens when the learning rate is too high in the early stages of training (and this is where warmup helps), but it seems it also depends on how fast the learning rate grows (i.e. on the slope of the warmup part of the learning_rate curve in TensorBoard). I've also tried no warmup and a constant learning rate of 0.2 (learning_rate_decay_scheme=none), and it had exactly the same results as noam (even after 1 day of training on 8 GPUs). When I increased the constant lr to 0.3, the training diverged.
I think both of these observations can be explained by the fact that the default optimizer is Adam, which should not be so sensitive to the learning rate (scheme). Intuitively, if your scheme makes the learning rate too high or too low, Adam can counteract it, unless your scheme makes the change too quickly (please let me know if I am wrong). At first, I was a bit surprised that T2T by default uses both a learning rate decay scheme and Adam, but there may be some reasons for this choice.
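To make the "adaptive nature" point concrete, here is a bare-bones sketch of the standard Adam update (the Kingma & Ba form, not the T2T code): the step is roughly lr * m_hat / sqrt(v_hat), so its magnitude is largely invariant to the scale of the gradients, but it still scales directly with the learning rate itself.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update in the standard form from Kingma & Ba (2015)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```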

rolloff commented Jan 26, 2018

Adam's test error is just as sensitive to the learning rate as SGD's. You must tune both the learning rate and the learning rate decay for Adam. It is a myth that Adam can counteract a poorly tuned learning rate. See this graph I made for CIFAR-10; from my preliminary experiments, the same holds for the Transformer:
[screenshot 2018-01-26: Adam vs. SGD test error on CIFAR-10]

rolloff commented Jan 27, 2018

On the en->de wmt32k base model, using batch_size=8192, warmup 0, lr=0.4, optimizer='Momentum', I see:
INFO:tensorflow:loss = 9.583992, step = 1
ERROR:tensorflow:Model diverged with loss = NaN.
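For anyone trying to reproduce this, the setting above roughly corresponds to an hparams override along these lines (a sketch only; I am assuming the hparam names from transformer_base, and the exact flag and problem names vary across T2T versions):

```python
# Hypothetical override string for t2t-trainer's --hparams flag.
hparams_override = ",".join([
    "batch_size=8192",
    "learning_rate_warmup_steps=0",
    "learning_rate=0.4",
    "optimizer=Momentum",
])
print(hparams_override)
```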

martinpopel commented Jan 27, 2018

@rolloff: I have never seen the "Model diverged with loss = NaN" error you report. So by "diverged training" I mean just that the loss suddenly jumps too high and the BLEU drops to almost 0 and stays there forever.

Thanks for sharing your graph of Adam vs. SGD. What does "step size" mean? The learning rate? The batch size? Or the number of steps after which the learning rate is halved in the piecewise-constant lr scheme?
I don't claim Adam can counteract a poor learning rate. I have just seen that very different learning rate schedules (with learning rates of e.g. 0.6 and 0.1 at a given moment) resulted in the same training loss curve (with all the bumps, as measured every 100 steps). I hypothesize that this is because of the adaptive nature of Adam. When I switched from Adam to Momentum, I was not able to replicate this phenomenon (moreover, the constant learning rate with Momentum resulted in very slow convergence).
On the other hand, quite a small difference in the initial learning rate or the slope of the warmup can result in a huge difference in BLEU, even with Adam.
