- External Links
- Quick Guide: important information on hyper-params.
- Update Plan
- Installation and usage
- A quick look at the algorithm
- Detailed Discussions
- Reproduce results in the paper
- Citation
Project Page, arXiv, Reddit, Twitter
- SN-GAN https://github.com/juntang-zhuang/SNGAN-AdaBelief
- Transformer (PyTorch 1.1) https://github.com/juntang-zhuang/transformer-adabelief
- Transformer (PyTorch 1.6) https://github.com/juntang-zhuang/fairseq-adabelief
- Reinforcement Learning (Toy) https://github.com/juntang-zhuang/rainbow-adabelief
In the next release of adabelief-pytorch, we will modify the defaults of several arguments to fit the needs of general tasks such as GAN and Transformer. When upgrading from version 0.0.5 to a higher version, please check whether you specify these arguments explicitly or rely on the defaults.
Version | epsilon | weight_decouple | rectify |
---|---|---|---|
adabelief-pytorch=0.0.5 | 1e-8 | False | False |
adabelief-pytorch=0.1.0 (latest, > 0.0.5) | 1e-16 | True | True |
In the next release, we will modify adabelief-tf to have the same features as adabelief-pytorch, including decoupled weight decay and learning rate rectification. Furthermore, we will add support for TensorFlow>=2.0 and Keras. A nightly version is in pypi_packages/adabelief_tf0.1.0; note that this version has not been pushed to PyPI yet.
AdaBelief uses a different denominator from Adam, and is orthogonal to other techniques such as rectification, decoupled weight decay, and weight averaging. This implies that if you need some of these techniques to get good results with Adam, you will likely still need them with AdaBelief.
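To make the "different denominator" concrete, here is a minimal scalar sketch of one step of each optimizer. It follows the update rule as described in the paper, not the packaged code: the function names are hypothetical, the exact placement of eps is an assumption, and the real implementation operates on tensors and may differ in details.

```python
import math

def adam_step(g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam step (illustrative); returns (update, m, v)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g              # EMA of g^2
    m_hat = m / (1 - beta1 ** t)                     # bias correction
    v_hat = v / (1 - beta2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + eps), m, v

def adabelief_step(g, m, s, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-16):
    """One scalar AdaBelief step (illustrative); returns (update, m, s)."""
    m = beta1 * m + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * (g - m) ** 2 + eps  # EMA of (g - m)^2
    m_hat = m / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    return lr * m_hat / (math.sqrt(s_hat) + eps), m, s

# Feed both optimizers the same constant gradient for 100 steps.
m_a = v_a = m_b = s_b = 0.0
for t in range(1, 101):
    adam_update, m_a, v_a = adam_step(1.0, m_a, v_a, t)
    belief_update, m_b, s_b = adabelief_step(1.0, m_b, s_b, t)
```

With a constant gradient, the observed gradient stays close to its running mean, so the AdaBelief denominator shrinks and its step ends up larger than Adam's. Everything else (momentum, bias correction) is shared, which is why the other techniques remain applicable.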
- epsilon in AdaBelief plays a different role than in Adam: typically, when you use epsilon=x in Adam, using epsilon=x*x gives similar results in AdaBelief. The default value epsilon=1e-8 is not a good option in many cases; we will change it to 1e-12 or 1e-16 later.
- If your task needs a "non-adaptive" optimizer, meaning SGD performs much better than Adam(W), as in image recognition, set a large epsilon (e.g. 1e-8 or 1e-10) to make AdaBelief more non-adaptive; if your task needs a really adaptive optimizer, meaning Adam performs much better than SGD, as in GAN training, the recommended epsilon for AdaBelief is small (1e-12, 1e-16, ...).
- If decoupled weight decay is very important for your task, meaning AdamW performs much better than Adam, set weight_decouple to True to turn on decoupled decay in AdaBelief. Note that many optimizers use decoupled weight decay without exposing it as an option, e.g. RAdam; we provide it as an option so users are aware of which technique is actually used.
- Don't use "gradient threshold" (clamping each element independently) with AdaBelief; it could result in division by 0 and an explosion in the update. "Gradient clip" (shrinking the amplitude of the gradient vector while keeping its direction) is fine, though in my limited experience the clip range sometimes needs to be the same as or larger than for Adam.
- Settings to reproduce results in this repository. Note that epsilon and rectify are quite important, and vary with tasks. For scenarios where "adaptivity" is crucial, such as SN-GAN and Transformer, use a small epsilon (1e-12 or 1e-16) and turn on rectify.
Task | lr | beta1 | beta2 | epsilon | weight_decay | weight_decouple | rectify | fixed_decay | amsgrad |
---|---|---|---|---|---|---|---|---|---|
Cifar | 1e-3 | 0.9 | 0.999 | 1e-8 | 5e-4 | False | False | False | False |
ImageNet | 1e-3 | 0.9 | 0.999 | 1e-8 | 1e-2 | True | False | False | False |
LSTM-1layer | 1e-3 | 0.9 | 0.999 | 1e-16 | 1.2e-6 | False | False | False | False |
LSTM-2,3layer | 1e-2 | 0.9 | 0.999 | 1e-12 | 1.2e-6 | False | False | False | False |
GAN (small) | 2e-4 | 0.5 | 0.999 | 1e-12 | 0 | True=False (decay=0) | False | False | False |
SN-GAN (large) | 2e-4 | 0.5 | 0.999 | 1e-16 | 0 | True=False (decay=0) | True | False | False |
Transformer | 5e-4 | 0.9 | 0.999 | 1e-16 | 1e-4 | True | True | False | False |
Reinforce | 1e-4 | 0.9 | 0.999 | 1e-10 | 0.0 | True=False (decay=0) | True | False | False |
- Someone (in the WeChat group Jiqizhixin) pointed out that the results on GAN are bad. This might be due to the choice of GAN model (we picked the simplest code example from the PyTorch docs without adding more tricks); we did not cherry-pick or intentionally worsen the baseline performance. We will update results on new GANs (e.g. SN-GAN) and release code later.
- Upload code for LSTM experiments.
- (10/23/2020) Transformer trains fine locally with PyTorch 1.1 and CUDA 9.0 (BLEU score 35.74 (highest is 35.85) on IWSLT14 DE-En with a small transformer), but works much worse on a server with PyTorch 1.4 and CUDA 10.0 (BLEU score < 26) using the same code. The code to reproduce the error is at: https://github.com/juntang-zhuang/transformer-adabelief
- Test AdaBelief on more examples, such as Transformer and Reinforcement Learning.
- Merge Tensorflow improvements
- Compare the rectified update; currently the implementation is slightly different from the RAdam implementation.
- Correct the coding error in RangerAdaBelief
- Updated results on SN-GAN are at https://github.com/juntang-zhuang/SNGAN-AdaBelief. AdaBelief achieves 12.36 FID (lower is better) on Cifar10, while Adam achieves 13.25 (number taken from the log of the official repository PyTorch-studioGAN).
- LSTM experiments uploaded to PyTorch_Experiments/LSTM
- Identified the problem of Transformer with PyTorch 1.4: an old version of fairseq is incompatible with the new version of PyTorch; it works fine with the latest fairseq.
  Code for Transformer to work with PyTorch 1.6 is at: https://github.com/juntang-zhuang/fairseq-adabelief
  Code for Transformer to work with PyTorch 1.1 and CUDA 9.0 is at: https://github.com/juntang-zhuang/transformer-adabelief
- Tested on a toy example of reinforcement learning.
(Results in the paper are all generated using the PyTorch implementation in the adabelief-pytorch package, which is the ONLY package that I have extensively tested for now.)
Please install the latest version (0.1.0); the previous version (0.0.5) uses different default arguments.
pip install adabelief-pytorch==0.1.0
from adabelief_pytorch import AdaBelief
optimizer = AdaBelief(model.parameters(), lr=1e-3, eps=1e-16, betas=(0.9,0.999), weight_decouple = True, rectify = False)
pip install ranger-adabelief==0.0.9
from ranger_adabelief import RangerAdaBelief
optimizer = RangerAdaBelief(model.parameters(), lr=1e-3, eps=1e-12, betas=(0.9,0.999))
The current tensorflow implementation is incomplete and does not support decoupled weight decay or rectification. It will be updated in release 0.1.0.
pip install adabelief-tf==0.0.1
from adabelief_tf import AdaBeliefOptimizer
optimizer = AdaBeliefOptimizer(learning_rate, epsilon=1e-12)
See folder PyTorch_Experiments; for each subfolder, execute sh run.sh. See readme.txt in each subfolder for visualization, or refer to the jupyter notebook for visualization.
Please install the latest version from pip; old versions might suffer from bugs. Source code for the up-to-date package is available in folder pypi_packages.
- Decoupling (argument weight_decouple appears in AdaBelief and RangerAdaBelief):
  Currently there are two ways to perform weight decay for adaptive optimizers: apply it directly to the gradient (Adam), or decouple weight decay from gradient descent (AdamW). This is passed to the optimizer by the argument weight_decouple (default: False).
- Fixed ratio (argument fixed_decay (default: False) appears in AdaBelief):
  (1) If weight_decouple == False, then this argument does not affect optimization.
  (2) If weight_decouple == True:
    - If fixed_decay == False, the weight is multiplied by 1 - lr x weight_decay
    - If fixed_decay == True, the weight is multiplied by 1 - weight_decay. This is implemented as an option but not used to produce results in the paper.
- What is the actual weight decay we are using?
  This is seldom discussed in the literature, but personally I think it's very important. When we set weight_decay=1e-4 for SGD, the weight is scaled by 1 - lr x weight_decay. Two points need to be emphasized: (1) lr in SGD is typically larger than in Adam (0.1 vs 0.001), so the weight decay in Adam needs to be set to a larger number to compensate. (2) lr decays, which means the effective weight decay (lr x weight_decay) is typically larger in early phases and smaller in late phases.
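The three weight-decay modes described above can be sketched for a single scalar weight as follows. This is an illustration of the rules as stated in this README, not the package's code; apply_weight_decay is a hypothetical helper, and in the actual optimizer the decay is applied to tensors inside the step.

```python
def apply_weight_decay(weight, grad, lr, weight_decay,
                       weight_decouple=False, fixed_decay=False):
    """Return (weight, grad) after weight decay, before the adaptive update."""
    if not weight_decouple:
        # Adam-style: fold the decay into the gradient; fixed_decay is ignored.
        return weight, grad + weight_decay * weight
    if fixed_decay:
        # Decoupled, fixed ratio: shrink by (1 - weight_decay) regardless of lr.
        return weight * (1.0 - weight_decay), grad
    # Decoupled, AdamW-style: shrink by (1 - lr * weight_decay).
    return weight * (1.0 - lr * weight_decay), grad

# Same inputs, three modes.
coupled = apply_weight_decay(2.0, 0.5, lr=0.1, weight_decay=0.01)
decoupled = apply_weight_decay(2.0, 0.5, lr=0.1, weight_decay=0.01,
                               weight_decouple=True)
fixed = apply_weight_decay(2.0, 0.5, lr=0.1, weight_decay=0.01,
                           weight_decouple=True, fixed_decay=True)
```

In the decoupled, non-fixed mode the shrink factor depends on lr, so a learning rate schedule also changes the effective decay over training.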
AdaBelief seems to require a different epsilon from Adam. In the CV tasks in this paper, epsilon is set to 1e-8. For GAN training and LSTM, it's set to 1e-12. We recommend trying different epsilon values in practice, sweeping through a large range, e.g. 1e-8, 1e-10, 1e-12, 1e-14, 1e-16, 1e-18. Typically a smaller epsilon makes it more adaptive.
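One possible way to read the rule of thumb that epsilon=x in Adam corresponds to roughly epsilon=x*x in AdaBelief is to look at the smallest value each denominator can take. The sketch below assumes, following the paper's formulation, that AdaBelief also adds epsilon inside the square root; the function names are hypothetical and the per-step accumulation of epsilon is simplified away.

```python
import math

def adam_denominator(v_hat, eps):
    # Adam: eps is added outside the square root only.
    return math.sqrt(v_hat) + eps

def adabelief_denominator(s_hat, eps):
    # AdaBelief (assumed, simplified): eps enters inside the square root too.
    return math.sqrt(s_hat + eps) + eps

# When the second-moment estimate is ~0, compare the two floors.
adam_floor = adam_denominator(0.0, eps=1e-8)
belief_floor = adabelief_denominator(0.0, eps=1e-16)
```

Under this assumption the two floors nearly coincide, since the square root of 1e-16 is 1e-8. A larger floor caps the per-coordinate scaling and pushes the update toward SGD-like behavior, which is consistent with a smaller epsilon being more adaptive.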
Whether to turn on the rectification as in RAdam. The rectification basically uses SGD in early phases for warmup, then switches to Adam. Rectification is implemented as an option, but is never used to produce results in the paper.
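For intuition about what rectification does, the term from the RAdam paper can be sketched as below. This follows the published RAdam formula, not necessarily the code in this package; rectification_term and its threshold parameter are hypothetical names, and implementations differ on the exact threshold.

```python
import math

def rectification_term(t, beta2=0.999, threshold=4.0):
    """RAdam-style rectification factor at step t, or None during warmup."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** t
    rho_t = rho_inf - 2.0 * t * beta2_t / (1.0 - beta2_t)
    if rho_t <= threshold:
        # Variance estimate is not yet tractable: skip the adaptive
        # denominator and use a momentum-only (SGD-like) update.
        return None
    return math.sqrt(((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
                     / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))

early = rectification_term(1)        # warmup phase
mid = rectification_term(10)         # adaptive step, strongly damped
late = rectification_term(100000)    # approaches the unrectified step
```

The factor multiplies the adaptive step size, so early steps are damped and the update converges to the unrectified one as t grows.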
Whether to take the max (over history) of denominator, same as AMSGrad. It's set as False for all experiments.
- Results in the paper are generated using the PyTorch implementation in the adabelief-pytorch package. This is the ONLY package that I have extensively tested for now.
- We also provide a modification of the ranger optimizer in ranger-adabelief, which combines RAdam + LookAhead + Gradient Centralization + AdaBelief, but this is not used in the paper and is not extensively tested.
- The adabelief-tf package is a naive implementation in Tensorflow. It lacks many features such as decoupled weight decay, and is not extensively tested. Currently I don't have plans to improve it since I seldom use Tensorflow; please contact me if you want to collaborate and improve it.
The experiments on Cifar are the same as the demo in AdaBound, the only difference being the optimizer. The ImageNet experiment uses a different learning rate schedule: typically the learning rate is decayed by 1/10 at epochs 30 and 60, and training ends at epoch 90. For reasons I have not extensively investigated, AdaBelief performs well when the learning rate is decayed at epochs 70 and 80 with training ending at epoch 90; using the default lr schedule produces a slightly worse result. If you have any ideas on this, please open an issue here or email me.
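The two ImageNet schedules can be written as a simple step function. This is a minimal sketch rather than the training script's code: step_lr is a hypothetical helper, and the base learning rate of 1e-3 is taken from the settings table.

```python
def step_lr(epoch, base_lr=1e-3, milestones=(70, 80), gamma=0.1):
    """Learning rate at a given epoch, decayed by gamma at each milestone."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

# Schedule that worked well for AdaBelief: decay at epochs 70 and 80.
late_decay = [step_lr(e) for e in range(90)]
# Common default schedule: decay at epochs 30 and 60.
default_decay = [step_lr(e, milestones=(30, 60)) for e in range(90)]
```

In PyTorch, torch.optim.lr_scheduler.MultiStepLR with the same milestones and gamma gives an equivalent schedule.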
I got some feedback on RNNs in the reddit discussion; here are a few tips:
- For RNNs, epsilon is suggested to be set to a smaller value (e.g. 1e-12, 1e-14, 1e-16), though the default is 1e-8. Please try different epsilon values; it varies from task to task.
- I might have confused "gradient threshold" with "gradient clip" in the previous readme; to clarify:
(1) By "gradient threshold" I refer to an element-wise operation that restricts values to a range [a,b]. Values below a are set to a, and values above b are set to b.
(2) By "gradient clip" I refer to an operation on a vector or tensor. Suppose X is a tensor; if ||X|| > thres, then X <- X/||X|| * thres. Taking X as a vector, "gradient clip" shrinks the amplitude but keeps the direction.
(3) "Gradient threshold" is incompatible with AdaBelief, because if gt is thresholded for a long time, then |gt-mt|~=0, and the division will explode; however, "gradient clip" is fine for AdaBelief, yet the clip range still needs tuning (perhaps AdaBelief needs a larger range than Adam).
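The difference between the two operations defined above can be shown on a small vector. The sketch uses plain Python lists and hypothetical function names; in PyTorch the corresponding utilities are torch.nn.utils.clip_grad_value_ (element-wise) and torch.nn.utils.clip_grad_norm_ (norm-based).

```python
import math

def gradient_threshold(grad, low, high):
    """Element-wise clamp: each entry is forced into [low, high]."""
    return [min(max(g, low), high) for g in grad]

def gradient_clip(grad, thres):
    """Norm-based clip: rescale the whole vector if its norm exceeds thres."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > thres:
        return [g / norm * thres for g in grad]
    return list(grad)

grad = [3.0, 4.0]
thresholded = gradient_threshold(grad, -1.0, 1.0)  # direction changes
clipped = gradient_clip(grad, 1.0)                 # direction preserved
```

Thresholding maps every large entry to the same constant, so a persistently thresholded gradient equals its own running mean and |gt-mt| collapses toward 0. Norm clipping only rescales, so the entries still vary from step to step.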
Please contact me at j.zhuang@yale.edu or open an issue here if you would like to help improve it, especially the tensorflow version, to discuss the theory, or to explore combinations with other methods to create a better optimizer. Any thoughts are welcome!
@article{zhuang2020adabelief,
title={AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients},
author={Zhuang, Juntang and Tang, Tommy and Ding, Yifan and Tatikonda, Sekhar and Dvornek, Nicha and Papademetris, Xenophon and Duncan, James},
journal={Conference on Neural Information Processing Systems},
year={2020}
}