
Adds Cyclical Learning Rates #2016

Closed · wants to merge 20 commits

Conversation

@thomasjpfan (Collaborator) commented Jul 8, 2017

Adds the feature requested in #1909. Mimics the parameters from https://github.com/bckenstler/CLR. Since Cyclical Learning Rates (CLR) require updating the learning rate after every batch, I added a batch_step method, which should be called after feeding a batch through the neural network.
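For reference, a rough usage sketch (illustrative only: the parameter names mirror https://github.com/bckenstler/CLR, the import path assumes this branch, and the data is synthetic):

import torch
from torch import nn, optim
from torch.optim.lr_scheduler import CyclicLR  # as added by this PR branch

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.001)
scheduler = CyclicLR(optimizer, base_lr=0.001, max_lr=0.006,
                     step_size=2000, mode='triangular')

data = torch.utils.data.TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
loader = torch.utils.data.DataLoader(data, batch_size=8)

for epoch in range(3):
    for x, y in loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.batch_step()  # CLR updates the learning rate after every batch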

@Jiaming-Liu (Contributor) commented Jul 10, 2017

Using a lambda would make instances of CyclicLR un-picklable. Also, it would be nice if this could be written as a subclass of _LRScheduler (or maybe three subclasses?).

@thomasjpfan (Collaborator, Author) commented Jul 10, 2017

The _LRScheduler interface seems to imply that step is called at the end of each epoch, whereas the cyclical learning rate policy updates the learning rate after every batch. If CyclicLR were to subclass _LRScheduler, it would have to raise an exception when step is called and ask the user to call batch_step instead. Basically it comes down to a design decision.

@Jiaming-Liu (Contributor) commented Jul 10, 2017

You are right -- some scheduling methods require a different calling frequency. The main discrepancies would be the timing of .step() and the names of some methods/attributes/parameters. It might be helpful to have another abstract class for batch-wise schedulers. @apaszke Any suggestion?

@soumith soumith added this to Ready for Review in PR Status Jul 12, 2017
@sai-prasanna commented Feb 28, 2018

Anything to be done to move this forward? Would love to help.

@thomasjpfan (Collaborator, Author) commented Feb 28, 2018

I think the only thing holding this PR back is the added complexity of the batch_step function in CyclicLR. All the other schedulers are updated after every epoch, while CyclicLR is updated after every batch.

@Aphaniteja commented Mar 22, 2018

I hope this gets added. Recent work (https://github.com/timgaripov/swa) is building upon this concept.

@thomasjpfan (Collaborator, Author) commented Mar 24, 2018

SWA (https://github.com/timgaripov/swa) could be implemented in this _LRScheduler framework. The step function would have to return the weight for the moving average, and the user would have to update the moving average of the model weights themselves.
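Roughly what I have in mind (a sketch only, not part of this PR; SWAWeight and its step() contract are hypothetical):

import copy
import torch
from torch import nn

class SWAWeight:
    # Hypothetical helper: step() returns the coefficient for an equal-weight running average.
    def __init__(self):
        self.n_averaged = 0

    def step(self):
        self.n_averaged += 1
        return 1.0 / self.n_averaged

model = nn.Linear(10, 2)
swa_model = copy.deepcopy(model)  # separate copy that holds the running average
swa_weight = SWAWeight()

# ...at the end of each averaging cycle, the user updates the average themselves:
w = swa_weight.step()
with torch.no_grad():
    for p_swa, p in zip(swa_model.parameters(), model.parameters()):
        p_swa.mul_(1.0 - w).add_(p, alpha=w)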

@sean-adler commented Apr 15, 2018

> You are right -- some scheduling methods require a different calling frequency. The main discrepancies would be the timing of .step() and the names of some methods/attributes/parameters. It might be helpful to have another abstract class for batch-wise schedulers. @apaszke Any suggestion?

Sorry to ask a newbie question - is it necessary to have a different API for updating the learning rate per-batch rather than per-epoch?

If CyclicLR were rewritten to subclass _LRScheduler and implement step() instead of batch_step(), it kind of seems like doing this would be fine:

for epoch in range(num_epochs):
    for input, target in data_loader:
        optimizer.zero_grad()
        output = model(input)
        loss = criterion(output, target)
        loss.backward()

        optimizer.step()
        scheduler.step()  # <- apply CLR per-batch

Because step_size is an argument to CyclicLR in this changeset, setting step_size to a multiple of len(data_loader) so that cycles follow a schedule similar to those in the paper seems like it would make the above work (and be reasonably straightforward).
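For concreteness, something like this (a sketch reusing the names from the snippet above and assuming the CyclicLR from this PR, just driven by step() per batch):

scheduler = CyclicLR(optimizer, base_lr=0.001, max_lr=0.006,
                     step_size=4 * len(data_loader))  # half cycle = 4 epochs, full cycle = 8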

(One downside of doing this is that CyclicLR would have a misleading last_epoch attribute, but I have no idea how big a deal that is.)

I have very little pytorch-specific context here, so any feedback would be hugely appreciated.

@thomasjpfan (Collaborator, Author) commented Apr 16, 2018

Here are the two options:

  1. Extending the step function to be used in batch and epoch scheduling may cause user confusion. A user trying out different schedulers would need to know to move the step function call when using CyclicLR. last_epoch would need to be interpreted as "the last batch index".

  2. Adding the batch_step function to CyclicLR explicitly states that the step function should be called per batch. Adding another LR scheduling interface may increase the complexity of the overall lr_scheduler module.

It all comes down to a stylistic decision. I choose option 2 for this PR.

@EtienneDesticourt commented May 9, 2018

Very useful stuff, can't wait for this PR to be merged!

I've been using it to train my models, and I think the step_size parameter is confusing. From its name you would expect it to be the amount of change in the learning rate at each step, i.e. step_size = (max_lr - min_lr) / num_steps_per_cycle, whereas in this implementation step_size is actually the number of steps per half cycle. (I also think the "number of training iterations per half cycle" part of the description is needlessly cryptic; speaking in terms of steps makes it much clearer, since the mental connection between training iterations and steps happens when you call batch_step either way.)

So to sum up, I propose something like this instead:
num_steps (int): Number of steps per half cycle. Authors suggest setting num_steps to 2-8 times the number of training steps per epoch. Default: 2000

@AutuanLiu commented May 9, 2018

  1. We can use lr_scheduler.LambdaLR to do this; here is an example:
import math


def cyclical_lr(step_sz, min_lr=0.001, max_lr=1, mode='triangular', scale_func=None, scale_md='cycles', gamma=1.):
    """Implements a cyclical learning rate policy (CLR) as an lr_lambda for LambdaLR.
    Note: the base learning rate of the optimizer should be 1.

    Parameters:
    ----------
    mode : str, optional
        one of {'triangular', 'triangular2', 'exp_range'}.
    scale_md : str, optional
        one of {'cycles', 'iterations'}.
    gamma : float, optional
        constant in the 'exp_range' scaling function: gamma**(cycle iterations)

    Examples:
    --------
    >>> # the learning rate of the optimizer should be 1
    >>> optimizer = optim.SGD(model.parameters(), lr=1.)
    >>> step_size = 2 * len(train_loader)
    >>> clr = cyclical_lr(step_size, min_lr=0.001, max_lr=0.005)
    >>> scheduler = lr_scheduler.LambdaLR(optimizer, [clr])
    >>> # some other operations
    >>> scheduler.step()
    >>> optimizer.step()
    """
    if scale_func is None:
        if mode == 'triangular':
            scale_fn = lambda x: 1.
            scale_mode = 'cycles'
        elif mode == 'triangular2':
            scale_fn = lambda x: 1 / (2. ** (x - 1))
            scale_mode = 'cycles'
        elif mode == 'exp_range':
            scale_fn = lambda x: gamma ** x
            scale_mode = 'iterations'
        else:
            raise ValueError(f'{mode} is not a valid mode!')
    else:
        scale_fn = scale_func
        scale_mode = scale_md

    # relative position within the triangular wave, scaled by scale_fn
    def rel_val(iteration, stepsize, mode):
        cycle = math.floor(1 + iteration / (2 * stepsize))
        x = abs(iteration / stepsize - 2 * cycle + 1)
        if mode == 'cycles':
            return max(0, (1 - x)) * scale_fn(cycle)
        elif mode == 'iterations':
            return max(0, (1 - x)) * scale_fn(iteration)
        else:
            raise ValueError(f'{mode} is not a valid scale mode!')

    lr_lambda = lambda iters: min_lr + (max_lr - min_lr) * rel_val(iters, step_sz, scale_mode)

    return lr_lambda
  2. Example usage:
optimizer = optim.SGD(model.parameters(), lr=1.)
clr = cyclical_lr(step_size, min_lr=0.001, max_lr=1, mode='triangular2')
scheduler = lr_scheduler.LambdaLR(optimizer, [clr])
scheduler.step()
optimizer.step()
  3. References:
    1. keras CLR
    2. clr.py

@Randl (Contributor) commented Jun 14, 2018

What's up with this PR?

@Randl (Contributor) commented Jun 17, 2018

@thomasjpfan @apaszke What if we add batch_step to _LRScheduler? It would add some complexity, but at least it would be uniform across the different schedulers. Code of the form

for epoch in epochs:
    scheduler.step()
    ...
    for batch in batches:
        scheduler.batch_step()
        ...

would work for all schedulers then.

@apaszke (Contributor) commented Jun 18, 2018

But that's so verbose... If we need a per-batch callback we might want to write a wrapper around an iterator that returns the batches (e.g. DataLoader)
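Something like this, perhaps (purely illustrative; ScheduledLoader and the batch_step hook are hypothetical):

class ScheduledLoader:
    # Wraps any batch iterable and advances the scheduler once per yielded batch.
    def __init__(self, loader, scheduler):
        self.loader = loader
        self.scheduler = scheduler

    def __iter__(self):
        for batch in self.loader:
            yield batch
            self.scheduler.batch_step()  # runs after the loop body has processed the batch

    def __len__(self):
        return len(self.loader)

# usage: for batch in ScheduledLoader(data_loader, scheduler): ...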

@sethah (Contributor) commented Aug 2, 2018

In general I don't see why step has to correspond to anything specific at all. In any case, the scheduler progresses the learning rate according to some shape, which is just a function of how many times step() has been called. If I wanted to use the ExponentialLR scheduler to decay the learning rate every batch instead of every epoch, then I could simply call step() after each batch.

The scheduler implementations define the shape; the users define the semantics of step. There was never any real need to link step and epoch in the first place, AFAICT. Am I missing something? If not, we could just use step to mean "progress the learning rate", and cyclical learning rates could be implemented by calling step after every batch. The idea that the step function needs to correspond to either batch or epoch or anything else seems misguided, especially since the user is free to call either method whenever they see fit.
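For instance, nothing stops you from doing this today with the existing ExponentialLR (a sketch with synthetic data; step() here just means "advance the schedule", called once per batch):

import torch
from torch import nn, optim
from torch.optim.lr_scheduler import ExponentialLR

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = ExponentialLR(optimizer, gamma=0.999)
loader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(20)]

for epoch in range(3):
    for x, y in loader:
        optimizer.zero_grad()
        nn.functional.cross_entropy(model(x), y).backward()
        optimizer.step()
        scheduler.step()  # decay per batch instead of per epoch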

@thomasjpfan (Collaborator, Author) commented Aug 2, 2018

@sethah Good point. I agree that it would be better to just let the user decide when to call step. I will update this PR with that in mind.

@slowbull commented Nov 13, 2018

Has anyone tried it on training CIFAR or ImageNet? Is any code available?


davidtvs added a commit to davidtvs/kaggle-hpaic that referenced this issue Dec 14, 2018
Slightly modified version of pull request #2016 from the pytorch repo: pytorch/pytorch#2016
The implementation follows the paper "Cyclical Learning Rates for Training Neural Networks": https://arxiv.org/abs/1506.01186
@Randl (Contributor) commented Dec 31, 2018

ping reviewers

@@ -4,6 +4,8 @@
from torch._six import inf
from bisect import bisect_right
from functools import partial

import numpy as np
@soumith (Member) commented Jan 9, 2019

numpy is an optional dependency of PyTorch.
Please remove the dependence on np in this file.

def test_lambda_lr(self):
epochs = 10
self.opt.param_groups[0]['lr'] = 0.05
self.opt.param_groups[1]['lr'] = 0.4
targets = [[0.05 * (0.9 ** x) for x in range(epochs)], [0.4 * (0.8 ** x) for x in range(epochs)]]
targets = [[0.05 * (0.9 ** x) for x in range(epochs)], [0.4 * (0.8 ** x)
@soumith (Member) commented Jan 9, 2019

Please only change lines relevant to your PR.
We have a line limit of 120 characters, so your changes here and above in this file are not needed and actively introduce noise.

scale_fn (function): Custom scaling policy defined by a single
argument lambda function, where
0 <= scale_fn(x) <= 1 for all x >= 0.
mode paramater is ignored
@soumith (Member) commented Jan 9, 2019

*parameter

and some scaling of the amplitude; therefore
max_lr may not actually be reached depending on
scaling function. Default: 0.006
step_size_up (int): Number of training iterations in the
@soumith (Member) commented Jan 9, 2019

this has a default value below, but doc is missing

scaling function. Default: 0.006
step_size_up (int): Number of training iterations in the
increasing half of a cycle.
step_size_down (int): Number of training iterations in the
@soumith (Member) commented Jan 9, 2019

this has a default value below, but doc is missing

@klicperajo commented Jan 18, 2019

Both the Leslie Smith paper that introduced the 1cycle schedule (https://arxiv.org/pdf/1803.09820.pdf) and the fast.ai library (https://docs.fast.ai/callbacks.one_cycle.html#Training-with-the-1cycle-policy) recommend decreasing the momentum as the learning rate increases, which helps with convergence. Would it be possible to add this?

@thomasjpfan (Collaborator, Author) commented Jan 18, 2019

The 1cycle policy can be configured by setting the step_size_up to half the total number of training iterations. Yes, it is possible to add the momentum piece to this PR.
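A rough sketch of what that momentum piece could look like (illustrative only; the helper names and default values are made up, with momentum moving inversely to the learning rate as in the 1cycle policy):

def triangular_position(iteration, step_size_up, step_size_down):
    # position within the cycle in [0, 1]: 0 at the ends, 1 at the peak
    cycle_len = step_size_up + step_size_down
    i = iteration % cycle_len
    return i / step_size_up if i <= step_size_up else (cycle_len - i) / step_size_down

def clr_and_momentum(iteration, base_lr=0.001, max_lr=0.006,
                     base_momentum=0.85, max_momentum=0.95,
                     step_size_up=2000, step_size_down=2000):
    x = triangular_position(iteration, step_size_up, step_size_down)
    lr = base_lr + (max_lr - base_lr) * x
    momentum = max_momentum - (max_momentum - base_momentum) * x  # high lr -> low momentum
    return lr, momentum

# applied once per batch, e.g.:
# lr, momentum = clr_and_momentum(batch_idx)
# for group in optimizer.param_groups:
#     group['lr'], group['momentum'] = lr, momentum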

@AlexMRuch commented Jan 18, 2019

Looking forward to this addition!

@zdevito zdevito removed their request for review Feb 13, 2019
@zou3519 zou3519 removed their request for review Feb 15, 2019
@gchanan gchanan removed their request for review Feb 28, 2019
@facebook-github-bot left a comment

@sampepose has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@soumith (Member) commented Mar 27, 2019

finished by #18001 which is landing

@soumith soumith closed this Mar 27, 2019
facebook-github-bot added a commit that referenced this issue Mar 28, 2019
Summary:
This implements a cyclical learning rate (CLR) schedule with an optional inverse cyclical momentum. More info about CLR: https://github.com/bckenstler/CLR

This is finishing what #2016 started. Resolves #1909.
Pull Request resolved: #18001

Differential Revision: D14451845

Pulled By: sampepose

fbshipit-source-id: 8f682e0c3dee3a73bd2b14cc93fcf5f0e836b8c9