This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

Why does transformer use noam as the learning rate decay scheme? #280

Closed
anglil opened this issue Sep 5, 2017 · 17 comments

@anglil

anglil commented Sep 5, 2017

What's the benefit of "noam" in machine translation compared to exponential or square root decay?
Also, why is it that in the base model of the transformer, gradient clipping is set to 0?

@chengyong3001

I am also confused about why they multiply the weight decay by a very large number, 5000. It seems so weird.

@lukaszkaiser
Contributor

All the multiplications are performed because T2T uses normalized values: we try to make a learning rate of 0.1 work with various optimizers (normally Adam would use 0.002 or so), and we try to make weight decay per-parameter (people usually tune it per-model, but then whenever you change hidden_size you need to change that too, and a number of other things as well). I can see that the normalized choices are sometimes confusing when compared to the literature, but it's very hard to make different models and hparam_sets work well without some form of normalization.
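
A toy sketch of the normalization idea (purely illustrative, not T2T's actual code; the function names and scaling factors below are hypothetical):

```python
def raw_adam_learning_rate(normalized_lr, adam_scale=0.02):
    # Hypothetical: a "normalized" learning rate of 0.1 is rescaled by an
    # optimizer-specific factor, so Adam effectively sees ~0.002.
    return normalized_lr * adam_scale

def raw_weight_decay(per_param_decay, model_size_factor):
    # Hypothetical: a per-parameter weight decay is multiplied up by a
    # model-size-dependent factor before being applied, which is why a large
    # constant multiplier shows up in the hparams.
    return per_param_decay * model_size_factor
```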

@colmantse

Is there any pointer to noam decay? I was googling but cannot find how exactly noam decay is done.

@lukaszkaiser
Contributor

For now, the best pointer is Section 5.3 in the "Attention Is All You Need" paper (https://arxiv.org/abs/1706.03762). The equation is there already, but we'll try to add more details and intuition in a later version of the paper.
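
For reference, the equation given there is

$$\mathrm{lrate} = d_{\mathrm{model}}^{-0.5} \cdot \min\bigl(\mathrm{step\_num}^{-0.5},\ \mathrm{step\_num} \cdot \mathrm{warmup\_steps}^{-1.5}\bigr)$$

i.e. the learning rate increases linearly for the first warmup_steps training steps and then decays proportionally to the inverse square root of the step number.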

@rafaelvalle

@lukaszkaiser Hey Lukasz, what's the explanation behind first increasing and then decreasing the learning rate, as described in the "Attention Is All You Need" paper?

@martinpopel
Contributor

martinpopel commented Jan 22, 2018

@rafaelvalle: Decreasing the learning rate, aka learning rate decay (usually exponential, piecewise-constant or inverse-time), has been standard practice in ML for decades. Increasing the learning rate in the early stages with a warmup (usually linear or exponential growth) is a more recent practice, popular especially in deep learning on ImageNet; see e.g. He et al. 2016 or Goyal et al. 2017.
The "noam" scheme is just a particular way of putting the warmup and decay together (linear warmup for a given number of steps, followed by a decay proportional to the inverse square root of the step number).

Learning rate schedules are an active research area. See e.g. the papers on cyclical learning rates (corresponding to learning_rate_decay_scheme=cosine available in tensor2tensor) and super-convergence, which also provide more insight into the theory behind the learning rate, batch size, gradient noise, etc.
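
As a rough illustration of a cyclical schedule (a generic cosine annealing with restarts, not necessarily identical to T2T's cosine scheme; the constants are arbitrary):

```python
import math

def cyclical_cosine_lr(step, max_lr=0.2, min_lr=0.0, cycle_length=10000):
    """Cosine annealing from max_lr down to min_lr, restarting every cycle_length steps."""
    t = step % cycle_length
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t / cycle_length))
```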

@rafaelvalle

Thank you for sharing all this info @martinpopel. I'm aware of exponential decay and other schedules but hadn't paid much attention to "warmups". Are there any papers that study warmup and provide a theoretical explanation for it?

@martinpopel
Contributor

@rafaelvalle: Yes. All four papers I referenced study warmup, and the last three provide some motivation/theory. The cyclical learning rate is a kind of generalization of warmup+decay (the "noam" scheme does just one cycle). I am not aware of any published experiments with CLR in NMT (I am running some now).

@rolloff

rolloff commented Jan 26, 2018

One hypothesis I have is that adding gradient clipping back in would allow us to train these models with no warmup. When I set warmup to 0, the loss goes to NaN. This could be caused by a very large gradient norm, which gradient clipping would help with.
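
For reference, a minimal TF1-style sketch of global-norm gradient clipping (assuming a loss tensor defined elsewhere; the learning rate, momentum, and clip_norm values are just illustrative):

```python
import tensorflow as tf

# `loss` is assumed to be defined elsewhere (e.g. the model's training loss).
optimizer = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
grads_and_vars = optimizer.compute_gradients(loss)
grads, variables = zip(*grads_and_vars)
# Rescale all gradients so that their global norm is at most clip_norm.
clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)
train_op = optimizer.apply_gradients(zip(clipped_grads, variables))
```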

@martinpopel
Contributor

I've tried various learning rate schedules with the Transformer and NMT. I haven't seen any NaN loss, but sometimes the (training and test) loss suddenly jumps too high (and either stays there and the training diverges, or jumps back, but the convergence is slowed down). In my experience this happens when the learning rate is too high in the early stages of training (and this is where warmup helps), but it seems to also depend on how fast the learning rate grows (i.e. on the slope of the warmup part of the learning_rate curve in TensorBoard). I've also tried no warmup and a constant learning rate (learning_rate_decay_scheme=none) of 0.2, and it gave exactly the same results as noam (even after 1 day of training on 8 GPUs). When I increased the constant lr to 0.3, the training diverged.
I think both of these observations can be explained by the fact that the default optimizer is Adam, which should not be so sensitive to the learning rate (scheme). Intuitively, if your scheme makes the learning rate too high or too low, Adam can counteract it, unless your scheme makes the change too quickly (please let me know if I am wrong). At first, I was a bit surprised that T2T uses both a learning rate decay scheme and Adam by default, but there may be some reasons for this choice.
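
For reference, Adam's update rule (Kingma & Ba, 2015), whose per-parameter normalization by the second-moment estimate is what the intuition above appeals to:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$
$$\theta_t = \theta_{t-1} - \alpha \,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}, \qquad \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$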

@rolloff

rolloff commented Jan 26, 2018

Adam's test error is just as sensitive to the learning rate as SGD's. You must tune both the learning rate and the learning rate decay for Adam. It is a myth that Adam can counteract a poorly tuned learning rate. See this graph I made for CIFAR-10; from my preliminary experiments, the same holds for the Transformer:
(screenshot: Adam vs. SGD test error as a function of step size on CIFAR-10)

@rolloff

rolloff commented Jan 27, 2018

On the en->de wmt32k base model, using batch_size=8192, warmup=0, lr=0.4, and optimizer='Momentum', I see INFO:tensorflow:loss = 9.583992, step = 1 followed by ERROR:tensorflow:Model diverged with loss = NaN.

@martinpopel
Contributor

martinpopel commented Jan 27, 2018

@rolloff: I have never seen the "Model diverged with loss = NaN" error you report. So by "diverged training" I mean just that the loss suddenly jumps too high and the BLEU drops to almost 0 and stays there forever.

Thanks for sharing your graph of Adam vs. SGD. What does "step size" mean? The learning rate? The batch size? Or the number of steps after which the learning rate is halved in the piecewise-constant lr scheme?
I don't claim Adam can counteract a poor learning rate. I have just seen that very different learning rate schedules (with lr of e.g. 0.6 and 0.1 at one moment) resulted in the same training loss curve (with all the bumps, as measured every 100 steps). I hypothesize that this is because of the adaptive nature of Adam. When I switched from Adam to Momentum, I was not able to replicate this phenomenon (moreover, the constant learning rate with Momentum resulted in very slow convergence).
On the other hand, a quite small difference in the initial learning rate or the slope of the warmup can result in a huge difference in BLEU, even with Adam.

@Bollegala

Why is this called the "noam" schedule? The following paper does not name that equation as such.

> For now, the best pointer is Section 5.3 in the "Attention Is All You Need" paper (https://arxiv.org/abs/1706.03762). The equation is there already, but we'll try to add more details and intuition in a later version of the paper.

@martinpopel
Contributor

Noam Shazeer is the second author of the paper.

@Bollegala

Did not notice that. Thanks for the quick response!

@tonysy

tonysy commented Mar 2, 2020

Training Tips for the Transformer Model https://ufal.mff.cuni.cz/pbml/110/art-popel-bojar.pdf
