Why does transformer use noam as the learning rate decay scheme? #280
I am also confused about why they multiply the weight decay by a very large number, 5000. It seems weird.
All the multiplications are performed because T2T uses normalized values: we try to make a learning rate of 0.1 work with various optimizers (normally Adam would use 0.002 or so), and we try to make weight decay per-parameter (people usually tune it per model, but then whenever you change hidden_size you need to change it too, and a number of other things, and so on). I can see that the normalized choices are sometimes confusing when compared to the literature, but it's very hard to make different models and hparam_sets work well without some form of normalization.
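For intuition only, here is a hypothetical sketch of what such normalization can look like. The constants, names, and scaling rules below are invented for illustration and are not taken from tensor2tensor:

```python
# Hypothetical illustration of "normalized" hyperparameters; everything here is
# made up for this sketch and is NOT tensor2tensor's actual code.

NOMINAL_LEARNING_RATE = 0.1   # the value a user would put in hparams
NOMINAL_WEIGHT_DECAY = 1e-6   # tuned once, intended to stay usable across model sizes

# Per-optimizer scaling: Adam typically works around 0.002, so the nominal 0.1
# gets rescaled by an optimizer-specific constant (1/50 here is illustrative).
OPTIMIZER_LR_SCALE = {"adam": 1.0 / 50.0, "sgd": 1.0}


def effective_learning_rate(optimizer_name):
    return NOMINAL_LEARNING_RATE * OPTIMIZER_LR_SCALE[optimizer_name]


def effective_weight_decay(hidden_size):
    # Scaling the decay with model width is one way to keep a single hparam value
    # usable when hidden_size changes; a large multiplier like the 5000 mentioned
    # above plays a similar normalizing role.
    return NOMINAL_WEIGHT_DECAY * hidden_size


print(effective_learning_rate("adam"))   # 0.002
print(effective_weight_decay(512))       # 0.000512
```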
Is there any pointer to noam decay? I was googling but cannot find how exactly noam decay is done.
For now, the best pointer is Section 5.3 in the "Attention Is All You Need" paper (https://arxiv.org/abs/1706.03762). The equation is there already, but we'll try to add more details and intuition in a later version of the paper.
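For reference, the schedule from Section 5.3 can be written as a short Python function. This is a minimal sketch; the function name and defaults are just illustrative:

```python
def noam_learning_rate(step, d_model=512, warmup_steps=4000):
    """Schedule from Section 5.3 of "Attention Is All You Need":
    lr = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5).
    It grows linearly for the first warmup_steps, then decays as 1/sqrt(step)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The peak is reached at step == warmup_steps:
print(noam_learning_rate(4000))   # ~0.0007 for d_model=512, warmup_steps=4000
```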
@lukaszkaiser Hey Lukasz, what's the explanation behind increasing and then decreasing the learning rate as described in the "Attention Is All You Need" paper?
@rafaelvalle: Decreasing the learning rate, a.k.a. learning rate decay (usually exponential, piecewise-constant, or inverse-time), has been standard practice in ML for decades. Increasing the learning rate in the early stages with a warmup (usually linear or exponential growth) is a more recent practice, popular especially in deep learning on ImageNet; see e.g. He et al. 2016 or Goyal et al. 2017. Learning rate schedules are an active research area; see e.g. the papers on cyclical learning rates.
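To make these shapes concrete, here is a small illustrative sketch (not taken from the papers above, nor from T2T) of a linear warmup followed by a few common decay schemes:

```python
def lr_at_step(step, base_lr=0.1, warmup_steps=1000, decay="inverse_time"):
    """Linear warmup to base_lr, then one of several standard decay shapes."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps     # linear warmup
    t = step - warmup_steps
    if decay == "exponential":
        return base_lr * 0.5 ** (t / 10000.0)          # halve every 10k steps
    if decay == "inverse_time":
        return base_lr / (1.0 + t / 10000.0)
    if decay == "piecewise_constant":
        if t < 20000:
            return base_lr
        if t < 40000:
            return base_lr / 10
        return base_lr / 100
    raise ValueError("unknown decay: %s" % decay)


print(lr_at_step(500))     # mid-warmup: 0.0501
print(lr_at_step(11000))   # after warmup, inverse-time decay: 0.05
```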
Thank you for sharing all this info @martinpopel. I'm aware of exponential decay and other schedules but didn't pay much attention to "warmups". Are there any papers that study warmup and provide a theoretical explanation for it?
@rafaelvalle: Yes. All four papers I referenced study warmup, and the last three provide some motivation/theory. The cyclical learning rate is a kind of generalization of warmup+decay (the "noam" scheme does just one cycle). I am not aware of any published experiments with CLR in NMT (I am running some now).
One hypothesis I have is that adding gradient clipping back in would allow us to use no warmup on these models. When I set warmup to 0, the loss goes to NaN. This could be caused by a very large gradient norm, which gradient clipping would help with.
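For reference, gradient clipping by global norm looks roughly like this with the TF1-era API. This is only a sketch with a toy variable and loss standing in for the real model, not a claim about what T2T does internally:

```python
import tensorflow as tf

# Toy variable and loss standing in for the real model; the actual training loss
# would be used here instead.
w = tf.get_variable("w", shape=[10])
loss = tf.reduce_mean(tf.square(w))

# Clip gradients by their global norm before applying them; this bounds the size
# of any single update and can prevent the loss from blowing up to NaN.
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
grads_and_vars = optimizer.compute_gradients(loss)
grads, variables = zip(*grads_and_vars)
clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)
train_op = optimizer.apply_gradients(zip(clipped_grads, variables))
```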
I've tried various learning rate schedules with the Transformer and NMT. I haven't seen any NaN loss, but sometimes the (training and test) loss suddenly jumps too high (and either stays there and the training diverges, or jumps back, but convergence is slowed down). In my experience this happens when the learning rate is too high in the early stages of training (and this is where warmup helps), but it also seems to depend on how fast the learning rate grows (i.e. on the slope of the warmup part of the learning_rate curve in TensorBoard). I've also tried no warmup and a constant learning rate.
On the en->de wmt32k base model, using batch size 8192, warmup 0, lr=0.4, and optimizer='Momentum', I see "Model diverged with loss = NaN".
@rolloff: I have never seen the "Model diverged with loss = NaN" error you report. So by "diverged training" I mean just that the loss suddenly jumps too high and the BLEU drops to almost 0 and stays there forever. Thanks for sharing your graph of Adam vs. SGD. What does "step size" mean? The learning rate? The batch size? Or the number of steps after which the learning rate is halved in the piecewise-constant lr scheme?
Why is this called the "noam" schedule? The "Attention Is All You Need" paper does not name that equation as such.
Noam Shazeer is the second author of the paper.
Did not notice that. Thanks for the quick response!
Training Tips for the Transformer Model: https://ufal.mff.cuni.cz/pbml/110/art-popel-bojar.pdf
What's the benefit of "noam" in machine translation compared to exponential or square-root decay?
Also, why is gradient clipping set to 0 in the transformer's base model?