To add warm-up scheduler to optim #60836
Conversation
💊 CI failures summary and remediations: as of commit 6d4aade (more details on the Dr. CI page), 🕵️ 1 new failure was recognized by patterns; it does not appear to be due to an upstream breakage.
Branch force-pushed from beeecf8 to 3463ced.
Codecov Report
```
@@            Coverage Diff             @@
##           master   #60836       +/-   ##
===========================================
+ Coverage   60.53%   76.26%    +15.72%
===========================================
  Files         684     2062      +1378
  Lines       88652   205622    +116970
===========================================
+ Hits        53666   156812    +103146
- Misses      34986    48810     +13824
```
Branch force-pushed from 88df5ed to fa80a5b.
Branch force-pushed from 94423e5 to fd66891.
Branch force-pushed from 825595a to 44afe75.
@iramazanli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@iramazanli Thanks for the PR!
Do you think it's worth providing an example of how this is done in the documentation? cc @fmassa, any thoughts on this?
Actually, here https://pytorch.org/docs/stable/optim.html we have an example for the general case of chaining.
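For context, the chaining pattern that documentation page refers to looks roughly like the sketch below. This is only an illustrative sketch, not code from this PR; the particular schedulers and milestones chosen here are assumptions.

```python
import torch
from torch.nn import Parameter
from torch.optim import SGD
from torch.optim.lr_scheduler import ExponentialLR, MultiStepLR

# Two schedulers driving the same optimizer; calling step() on each of them
# every epoch applies both schedules to the learning rate ("chaining").
model = [Parameter(torch.randn(2, 2, requires_grad=True))]
optimizer = SGD(model, 0.1)
scheduler1 = ExponentialLR(optimizer, gamma=0.9)
scheduler2 = MultiStepLR(optimizer, milestones=[3, 6], gamma=0.1)

for epoch in range(10):
    print(epoch, optimizer.param_groups[0]["lr"])
    optimizer.step()
    scheduler1.step()
    scheduler2.step()
```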
LGTM, thanks a lot Ilqar (and sorry for the delay in reviewing the PR).
As a side comment and for a follow-up PR, I think it would make it much easier for users to use WarmUpLR if we had an LRScheduler that combines multiple schedulers together.
This way, users don't need to change their training loop to add warmup.
Something along the lines of
`scheduler = CombinedScheduler([scheduler1, scheduler2])`
so that instead of doing
`scheduler1.step()`
`scheduler2.step()`
the user just needs to do
`scheduler.step()`
Also, @datumbox once this PR is merged, could you update the torchvision reference to use PyTorch's WarmUpLR?
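A minimal sketch of the kind of wrapper suggested above. The `CombinedScheduler` class here is hypothetical, not an existing PyTorch API and not part of this PR:

```python
class CombinedScheduler:
    """Hypothetical wrapper that steps several chained schedulers with one call."""

    def __init__(self, schedulers):
        self.schedulers = list(schedulers)

    def step(self):
        # Stepping each underlying scheduler in order reproduces the usual
        # scheduler1.step(); scheduler2.step() chaining pattern.
        for scheduler in self.schedulers:
            scheduler.step()
```

With something along these lines, the training loop only ever calls `scheduler.step()`, whether or not warmup is chained in. Later PyTorch releases include a built-in `ChainedScheduler` that serves essentially this purpose.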
Branch force-pushed from 44afe75 to 6d4aade.
@iramazanli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
I completely agree with the idea of a combined scheduler. Thanks for the suggestion! There will be a follow-up PR for a bug we discussed offline regarding chaining schedulers (the general case). After that bug fix, using the warmup scheduler in a chain will be safer.
@iramazanli merged this pull request in cec08e7.
Summary: Warm-up of learning rate scheduling was initially discussed by Priya Goyal et al. in the paper https://arxiv.org/pdf/1706.02677.pdf. In section 2.2 of the paper they propose warming up learning rate schedules in order to prevent large variance / noise in the learning rate. The idea has been discussed further in the following papers:

* Akilesh Gotmare et al.: https://arxiv.org/abs/1810.13243
* Bernstein et al.: http://proceedings.mlr.press/v80/bernstein18a/bernstein18a.pdf
* Liyuan Liu et al.: https://arxiv.org/pdf/1908.03265.pdf

There are two popularly used types of learning rate warm-up:

* Constant warmup (start with a very small constant learning rate)
* Linear warmup (start with a small learning rate and gradually increase it)

In this PR we add warm-up as a learning rate scheduler. Note that learning rate schedulers are chainable, which means that we can combine the warmup scheduler with any other learning rate scheduler to build a more sophisticated schedule.

## Linear Warmup

Linear warmup multiplies the learning rate by a pre-defined constant, warmup_factor, in the first epoch (epoch 0), and then increases this multiplicative factor linearly so that it reaches 1 after warmup_iters epochs. Hence the multiplicative factor at the i-th step is

`warmup_factor + (1 - warmup_factor) * i / warmup_iters`

Moreover, the ratio of this quantity at step i to the quantity at step i-1 is

`1 + (1.0 - warmup_factor) / [warmup_iters * warmup_factor + (i - 1) * (1 - warmup_factor)]`

which is what the get_lr() method uses in our implementation. Below we provide an example of how to use the linear warmup scheduler and show how it works.

```python
import torch
from torch.nn import Parameter
from torch.optim import SGD
from torch.optim.lr_scheduler import WarmUpLR

model = [Parameter(torch.randn(2, 2, requires_grad=True))]
optimizer = SGD(model, 0.1)
scheduler = WarmUpLR(optimizer, warmup_factor=0.1, warmup_iters=10, warmup_method="linear")

for epoch in range(15):
    print(epoch, scheduler.get_last_lr()[0])
    optimizer.step()
    scheduler.step()
```

```
0 0.010000000000000002
1 0.019000000000000003
2 0.028000000000000008
3 0.03700000000000001
4 0.04600000000000001
5 0.055000000000000014
6 0.06400000000000002
7 0.07300000000000002
8 0.08200000000000003
9 0.09100000000000004
10 0.10000000000000005
11 0.10000000000000005
12 0.10000000000000005
13 0.10000000000000005
14 0.10000000000000005
```

## Constant Warmup

Constant warmup follows a straightforward idea: multiply the learning rate by warmup_factor until we reach epoch warmup_iters, then do nothing for the following epochs.

```python
import torch
from torch.nn import Parameter
from torch.optim import SGD
from torch.optim.lr_scheduler import WarmUpLR

model = [Parameter(torch.randn(2, 2, requires_grad=True))]
optimizer = SGD(model, 0.1)
scheduler = WarmUpLR(optimizer, warmup_factor=0.1, warmup_iters=5, warmup_method="constant")

for epoch in range(10):
    print(epoch, scheduler.get_last_lr()[0])
    optimizer.step()
    scheduler.step()
```

```
0 0.010000000000000002
1 0.010000000000000002
2 0.010000000000000002
3 0.010000000000000002
4 0.010000000000000002
5 0.10000000000000002
6 0.10000000000000002
7 0.10000000000000002
8 0.10000000000000002
9 0.10000000000000002
```

Pull Request resolved: #60836

Reviewed By: saketh-are

Differential Revision: D29537615

Pulled By: iramazanli

fbshipit-source-id: d910946027acc52663b301f9c56ade686e62cb69
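As a quick sanity check of the ratio used in get_lr() above, the following short script verifies that applying the step-to-step ratio repeatedly reproduces the closed-form linear warmup factor. This is a sketch for illustration, not code from the PR; the warmup_factor and warmup_iters values are just the ones used in the example.

```python
# Check: the multiplicative update matches the closed-form factor
# warmup_factor + (1 - warmup_factor) * i / warmup_iters at every step.
warmup_factor, warmup_iters = 0.1, 10

factor = warmup_factor  # factor applied at epoch 0
for i in range(1, warmup_iters + 1):
    factor *= 1 + (1.0 - warmup_factor) / (
        warmup_iters * warmup_factor + (i - 1) * (1 - warmup_factor)
    )
    closed_form = warmup_factor + (1 - warmup_factor) * i / warmup_iters
    assert abs(factor - closed_form) < 1e-9, (i, factor, closed_form)

print("step-to-step ratio reproduces the closed-form warmup factor")
```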
Summary: It has been discussed in #60836 (comment) that we have observed an obstacle to chaining some types of learning rate schedulers. In particular, we observed that:

* Some of the learning rate schedulers return their initial learning rates at epoch 0 as

```
return self.base_lrs
```

* This can be a problem when two schedulers are chained as

```
scheduler1.step()
scheduler2.step()
```

because we completely ignore the effect of scheduler1 at epoch 0. This would not be an issue if scheduler1 were ineffective at epoch 0, as is the case for many schedulers; however, for schedulers such as warmup schedulers, whose multiplicative factor at epoch 0 is smaller than 1, this can lead to undesired behavior. The following code snippet illustrates the problem.

## Reproducing the bug

```python
import torch
from torch.nn import Parameter
from torch.optim import SGD
from torch.optim.lr_scheduler import WarmUpLR, ExponentialLR

model = [Parameter(torch.randn(2, 2, requires_grad=True))]
optimizer = SGD(model, 1.0)
scheduler1 = WarmUpLR(optimizer, warmup_factor=0.1, warmup_iters=5, warmup_method="constant")
scheduler2 = ExponentialLR(optimizer, gamma=0.9)

for epoch in range(10):
    print(epoch, scheduler2.get_last_lr()[0])
    optimizer.step()
    scheduler1.step()
    scheduler2.step()
```

### Current Result

```
0 1.0
1 0.9
2 0.81
3 0.7290000000000001
4 0.6561000000000001
5 5.904900000000001
6 5.314410000000001
7 4.782969000000001
8 4.304672100000001
9 3.874204890000001
```

### Expected Result

```
0 1.0
1 0.9
2 0.81
3 0.7290000000000001
4 0.6561000000000001
5 0.5904900000000001
6 0.5314410000000001
7 0.4782969000000001
8 0.4304672100000001
9 0.3874204890000001
```

Pull Request resolved: #63457

Reviewed By: datumbox

Differential Revision: D30424160

Pulled By: iramazanli

fbshipit-source-id: 3e15af8d278c872cd6f53406b55f4d3ce5002867
Summary: Partially unblocks pytorch/vision#4281.

Previously we added warmup schedulers to PyTorch core in PR #60836, which had two modes of execution, linear and constant, depending on the warmup function. In this PR we change this interface to a more direct form by separating the linear and constant modes into separate schedulers. In particular,

```python
scheduler1 = WarmUpLR(optimizer, warmup_factor=0.1, warmup_iters=5, warmup_method="constant")
scheduler2 = WarmUpLR(optimizer, warmup_factor=0.1, warmup_iters=5, warmup_method="linear")
```

will now look like

```python
scheduler1 = ConstantLR(optimizer, warmup_factor=0.1, warmup_iters=5)
scheduler2 = LinearLR(optimizer, warmup_factor=0.1, warmup_iters=5)
```

correspondingly.

Pull Request resolved: #64395

Reviewed By: datumbox

Differential Revision: D30753688

Pulled By: iramazanli

fbshipit-source-id: e47f86d12033f80982ddf1faf5b46873adb4f324
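For illustration, here is a short sketch of how one of the renamed schedulers might be driven in a training loop, assuming the constructor shown above. The parameter names follow this PR and may differ in later PyTorch releases.

```python
import torch
from torch.nn import Parameter
from torch.optim import SGD
from torch.optim.lr_scheduler import LinearLR  # renamed from WarmUpLR(..., warmup_method="linear")

model = [Parameter(torch.randn(2, 2, requires_grad=True))]
optimizer = SGD(model, 0.1)
# Parameter names as in this PR; later releases may use different names.
scheduler = LinearLR(optimizer, warmup_factor=0.1, warmup_iters=5)

for epoch in range(8):
    print(epoch, scheduler.get_last_lr()[0])
    optimizer.step()
    scheduler.step()
```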