This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

Why does transformer use noam as the learning rate decay scheme? #280

Closed
anglil opened this issue Sep 5, 2017 · 17 comments

@anglil

anglil commented Sep 5, 2017

What's the benefit of "noam" in machine translation compared to exponential or square root decay?
Also, why is it that in the base model of the transformer, gradient clipping is set to 0?

@chengyong3001

I am also confused about why they multiply the weight decay by a very large number, 5000. It seems so weird.

@lukaszkaiser
Contributor

All the multiplications are performed because T2T uses normalized values: we try to make a learning rate of 0.1 work with various optimizers (normally Adam would use 0.002 or so), and we try to make weight decay per-parameter (people usually tune it per-model, but then whenever you change hidden_size you need to change that too, and a number of other things as well). I can see that the normalized choices are sometimes confusing when compared to the literature, but it's very hard to make different models and hparam_sets work well without some form of normalization.
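
A toy sketch of the normalization idea (purely illustrative, not T2T's actual code; the function names and scaling factors below are hypothetical):

```python
def raw_adam_learning_rate(normalized_lr, adam_scale=0.02):
    # Hypothetical: a "normalized" learning rate of 0.1 is rescaled by an
    # optimizer-specific factor, so Adam effectively sees ~0.002.
    return normalized_lr * adam_scale

def raw_weight_decay(per_param_decay, model_size_factor):
    # Hypothetical: a per-parameter weight decay is multiplied up by a
    # model-size-dependent factor before being applied, which is why a large
    # constant multiplier shows up in the hparams.
    return per_param_decay * model_size_factor
```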

@colmantse

Is there any pointer to noam decay? I was googling but cannot find how exactly noam decay is done.

@lukaszkaiser
Contributor

For now, the best pointer is Section 5.3 in the "Attention Is All You Need" paper (https://arxiv.org/abs/1706.03762). The equation is there already, but we'll try to add more details and intuition in a later version of the paper.
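
For reference, the equation given there is

$$\mathrm{lrate} = d_{\mathrm{model}}^{-0.5} \cdot \min\bigl(\mathrm{step\_num}^{-0.5},\ \mathrm{step\_num} \cdot \mathrm{warmup\_steps}^{-1.5}\bigr)$$

i.e. the learning rate increases linearly for the first warmup_steps training steps and then decays proportionally to the inverse square root of the step number.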

@rafaelvalle

@lukaszkaiser Hey Lukasz, what's the explanation behind first increasing and then decreasing the learning rate, as described in the "Attention Is All You Need" paper?

@martinpopel
Contributor

martinpopel commented Jan 22, 2018

@rafaelvalle: Decreasing the learning rate, aka learning rate decay (usually exponential, piecewise-constant or inverse-time), has been standard practice in ML for decades. Increasing the learning rate in the early stages with a warmup (usually linear or exponential growth) is a more recent practice, popular especially in deep learning on ImageNet; see e.g. He et al. 2016 or Goyal et al. 2017.
The "noam" scheme is just a particular way of putting the warmup and decay together (linear warmup for a given number of steps, followed by a decay proportional to the inverse square root of the step number).

Learning rate schedules are an active research area. See e.g. the papers on cyclical learning rates (corresponding to learning_rate_decay_scheme=cosine available in tensor2tensor) and super-convergence, which also provide more insight into the theory behind the learning rate, batch size, gradient noise, etc.
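
As a rough illustration of a cyclical schedule (a generic cosine annealing with restarts, not necessarily identical to T2T's cosine scheme; the constants are arbitrary):

```python
import math

def cyclical_cosine_lr(step, max_lr=0.2, min_lr=0.0, cycle_length=10000):
    """Cosine annealing from max_lr down to min_lr, restarting every cycle_length steps."""
    t = step % cycle_length
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t / cycle_length))
```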

@rafaelvalle

Thank you for sharing all this info @martinpopel. I'm aware of exponential decay and other schedules but hadn't paid much attention to "warmups". Are there any papers that study warmup and provide a theoretical explanation for it?

@martinpopel
Contributor

@rafaelvalle: Yes. All four papers I referenced study warmup, and the last three provide some motivation/theory. The cyclical learning rate is a kind of generalization of warmup+decay (the "noam" scheme does just one cycle). I am not aware of any published experiments with CLR in NMT (I am running some now).

@rolloff

rolloff commented Jan 26, 2018

One hypothesis I have is that adding gradient clipping back in would allow us to train these models with no warmup. When I set warmup to 0, the loss goes to NaN. This could be caused by a very large gradient norm, which gradient clipping would help with.
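
For reference, a minimal TF1-style sketch of global-norm gradient clipping (assuming a loss tensor defined elsewhere; the learning rate, momentum, and clip_norm values are just illustrative):

```python
import tensorflow as tf

# `loss` is assumed to be defined elsewhere (e.g. the model's training loss).
optimizer = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
grads_and_vars = optimizer.compute_gradients(loss)
grads, variables = zip(*grads_and_vars)
# Rescale all gradients so that their global norm is at most clip_norm.
clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)
train_op = optimizer.apply_gradients(zip(clipped_grads, variables))
```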

@martinpopel
Contributor

I've tried various learning rate schedules with the Transformer and NMT. I haven't seen any NaN loss, but sometimes the (training and test) loss suddenly jumps too high (and either stays there and the training diverges, or jumps back, but the convergence is slowed down). In my experience this happens when the learning rate is too high in the early stages of training (and this is where warmup helps), but it seems to also depend on how fast the learning rate grows (i.e. on the slope of the warmup part of the learning_rate curve in TensorBoard). I've also tried no warmup and a constant learning rate (learning_rate_decay_scheme=none) of 0.2, and it gave exactly the same results as noam (even after 1 day of training on 8 GPUs). When I increased the constant lr to 0.3, the training diverged.
I think both of these observations can be explained by the fact that the default optimizer is Adam, which should not be so sensitive to the learning rate (scheme). Intuitively, if your scheme makes the learning rate too high or too low, Adam can counteract it, unless your scheme makes the change too quickly (please let me know if I am wrong). At first, I was a bit surprised that T2T uses both a learning rate decay scheme and Adam by default, but there may be some reasons for this choice.
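
For reference, Adam's update rule (Kingma & Ba, 2015), whose per-parameter normalization by the second-moment estimate is what the intuition above appeals to:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$
$$\theta_t = \theta_{t-1} - \alpha \,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}, \qquad \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$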

@rolloff

rolloff commented Jan 26, 2018

Adam's test error is just as sensitive to the learning rate as SGD's. You must tune both the learning rate and the learning rate decay for Adam. It is a myth that Adam can counteract a poorly tuned learning rate. See this graph I made for CIFAR-10; from my preliminary experiments, the same holds for the Transformer:
(screenshot: Adam vs. SGD test error as a function of step size on CIFAR-10)

@rolloff

rolloff commented Jan 27, 2018

On the en->de wmt32k base model, using batch_size=8192, warmup=0, lr=0.4, and optimizer='Momentum', I see INFO:tensorflow:loss = 9.583992, step = 1 followed by ERROR:tensorflow:Model diverged with loss = NaN.

@martinpopel
Contributor

martinpopel commented Jan 27, 2018

@rolloff: I have never seen the "Model diverged with loss = NaN" error you report. So by "diverged training" I mean just that the loss suddenly jumps too high and the BLEU drops to almost 0 and stays there forever.

Thanks for sharing your graph of Adam vs. SGD. What does "step size" mean? The learning rate? The batch size? Or the number of steps after which the learning rate is halved in the piecewise-constant lr scheme?
I don't claim Adam can counteract a poor learning rate. I have just seen that very different learning rate schedules (with lr of e.g. 0.6 and 0.1 at one moment) resulted in the same training loss curve (with all the bumps, as measured every 100 steps). I hypothesize that this is because of the adaptive nature of Adam. When I switched from Adam to Momentum, I was not able to replicate this phenomenon (moreover, the constant learning rate with Momentum resulted in very slow convergence).
On the other hand, a quite small difference in the initial learning rate or the slope of the warmup can result in a huge difference in BLEU, even with Adam.

@Bollegala

Why is this called the "noam" schedule? The following paper does not name that equation as such.

> For now, the best pointer is Section 5.3 in the "Attention Is All You Need" paper (https://arxiv.org/abs/1706.03762). The equation is there already, but we'll try to add more details and intuition in a later version of the paper.

@martinpopel
Contributor

Noam Shazeer is the second author of the paper.

@Bollegala

Did not notice that. Thanks for the quick response!

@tonysy

tonysy commented Mar 2, 2020

Training Tips for the Transformer Model https://ufal.mff.cuni.cz/pbml/110/art-popel-bojar.pdf
