
Adds Cyclical Learning Rates #2016

Open
wants to merge 20 commits into pytorch:master from thomasjpfan:master

Conversation

@thomasjpfan

thomasjpfan commented Jul 8, 2017

Adds the feature requested in #1909. It mimics the parameters from https://github.com/bckenstler/CLR. Since Cyclical Learning Rates (CLR) require updating the learning rate after every batch, I added a batch_step method, which should be called after feeding a batch through the neural network.
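
For reference, a minimal usage sketch of the API proposed here (assuming CyclicLR lands in torch.optim.lr_scheduler; base_lr, max_lr, step_size, and mode follow the bckenstler/CLR conventions mimicked above, and model, criterion, data_loader, and num_epochs are placeholders):

import torch.optim as optim
from torch.optim.lr_scheduler import CyclicLR  # assumed import location for this PR's scheduler

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = CyclicLR(optimizer, base_lr=0.001, max_lr=0.006,
                     step_size=2000, mode='triangular')

for epoch in range(num_epochs):
    for input, target in data_loader:
        optimizer.zero_grad()
        loss = criterion(model(input), target)
        loss.backward()
        optimizer.step()
        scheduler.batch_step()  # CLR updates the learning rate after every batch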

@Jiaming-Liu

Contributor

Jiaming-Liu commented Jul 10, 2017

Using a lambda would make instances of CyclicLR un-picklable. Also, it would be nice if this could be written as a subclass of _LRScheduler (or maybe three subclasses?).

@thomasjpfan

thomasjpfan commented Jul 10, 2017

The _LRScheduler interface seems to imply that step is called at the end of each epoch, while the cyclical learning rate policy updates the learning rate after every batch. If CyclicLR were to subclass _LRScheduler, it would have to raise an exception when step is called and ask the user to call batch_step instead. Basically it comes down to a design decision.
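
For illustration only, a rough hypothetical sketch of what that subclassing route would force:

class CyclicLR(_LRScheduler):
    def step(self, epoch=None):
        # epoch-wise stepping does not fit the per-batch CLR policy
        raise RuntimeError("CyclicLR is updated per batch; call batch_step() instead")

    def batch_step(self, batch_idx=None):
        # compute and apply the cyclical learning rate for this batch
        ...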

@Jiaming-Liu

Contributor

Jiaming-Liu commented Jul 10, 2017

You are right -- some scheduling methods require a different calling frequency. The main discrepancies would be the timing of .step() and the names of some methods/attributes/parameters. It might be helpful to have another abstract class for batch-wise schedulers. @apaszke Any suggestion?

@soumith soumith added this to Ready for Review in PR Status Jul 12, 2017

@sai-prasanna

sai-prasanna commented Feb 28, 2018

Anything to be done to move this forward? Would love to help.

@thomasjpfan thomasjpfan force-pushed the thomasjpfan:master branch from 169d535 to 7cce1d4 Feb 28, 2018

@thomasjpfan

thomasjpfan commented Feb 28, 2018

I think the only thing stopping this PR is the added complexity of the batch_step function to CyclicLR. All the other schedulers are updated after every epoch, while CyclicLR is updated every batch.

@Aphaniteja

Aphaniteja commented Mar 22, 2018

I hope this gets added. Recent papers (e.g. https://github.com/timgaripov/swa) are building upon this concept.

@thomasjpfan

thomasjpfan commented Mar 24, 2018

https://github.com/timgaripov/swa could be implemented in this _LRScheduler framework. The step function would have to return the weight of the moving average, and the user would have to update the moving average of the model weights themselves.
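
To sketch that contract (purely hypothetical names such as swa_scheduler, swa_model, and train_one_epoch, not part of this PR): step() would hand back the averaging coefficient, and the user would fold it into their own running average of the weights:

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)      # placeholder for the usual training loop
    alpha = swa_scheduler.step()           # hypothetical: step() returns the averaging weight
    if alpha is not None:                  # averaging is active for this epoch
        for p_avg, p in zip(swa_model.parameters(), model.parameters()):
            p_avg.data.mul_(1 - alpha).add_(alpha * p.data)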

zou3519 pushed a commit to zou3519/pytorch that referenced this pull request Mar 30, 2018

Rename caffe2_ref_test.py to c2_ref_test.py (pytorch#2016)
* Rename caffe2_ref_test.py to c2_ref_test.py

* Rename the module name doc too

@sean-adler

sean-adler commented Apr 15, 2018

> You are right -- some scheduling methods require a different calling frequency. The main discrepancies would be the timing of .step() and the names of some methods/attributes/parameters. It might be helpful to have another abstract class for batch-wise schedulers. @apaszke Any suggestion?

Sorry to ask a newbie question - is it necessary to have a different API for updating the learning rate per-batch rather than per-epoch?

If CyclicLR were rewritten to subclass _LRScheduler and implement step() instead of batch_step(), it kind of seems like doing this would be fine:

for epoch in range(num_epochs):
    for input, target in data_loader:
        optimizer.zero_grad()
        output = model(input)
        loss = criterion(output, target)
        loss.backward()

        optimizer.step()
        scheduler.step()  # <- apply CLR per-batch

Because step_size is an argument to CyclicLR in this changeset, setting step_size to a multiple of len(data_loader) would make cycles follow a schedule similar to those in the paper, so the above seems like it would work (and be reasonably straightforward).

(One downside of doing this is that CyclicLR would have a misleading last_epoch attribute, but I have no idea how big a deal that is.)
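
Concretely, under that reading the construction might look like this (the base_lr/max_lr values are placeholders), with scheduler.step() then called in the inner loop exactly as above:

# step_size counts scheduler.step() calls, which now happen once per batch,
# so a half cycle spanning two epochs would be:
scheduler = CyclicLR(optimizer, base_lr=0.001, max_lr=0.006,
                     step_size=2 * len(data_loader))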

I have very little pytorch-specific context here, so any feedback would be hugely appreciated.

@thomasjpfan

thomasjpfan commented Apr 16, 2018

Here are the options:

  1. Extending the step function to be used in batch and epoch scheduling may cause user confusion. A user trying out different schedulers would need to know to move the step function call when using CyclicLR. last_epoch would need to be interpreted as "the last batch index".

  2. Adding the batch_step function to CyclicLR explicitly states that the step function should be called per batch. Adding another LR scheduling interface may increase the complexity of the overall lr_scheduler module.

It all comes down to a stylistic decision. I chose option 2 for this PR.

@EtienneDesticourt

EtienneDesticourt commented May 9, 2018

Very useful stuff, can't wait for this PR to be merged!

I've been using it to train my models, but I think the step_size parameter is confusing. From its name you would expect it to be the amount by which the learning rate changes at each step, i.e. step_size = (max_lr - min_lr) / num_steps_per_cycle, whereas in this implementation step_size is actually the number of steps per half cycle. (I also think the "number of training iterations per half cycle" wording in the description is needlessly cryptic; speaking in terms of steps is much clearer, and the mental connection between training iterations and steps happens when you call batch_step either way.)

So to sum up, I propose something like this instead:
num_steps (int): Number of steps per half cycle. The authors suggest setting num_steps to 2-8 times the number of training steps per epoch. Default: 2000

@AutuanLiu

AutuanLiu commented May 9, 2018

  1. We can use lr_scheduler.LambdaLR to do this; here are the examples:
import math


def cyclical_lr(step_sz, min_lr=0.001, max_lr=1, mode='triangular',
                scale_func=None, scale_md='cycles', gamma=1.):
    """Implements a cyclical learning rate policy (CLR).

    Note: the learning rate of the optimizer should be 1, since LambdaLR
    multiplies the base learning rate by the lambda's return value.

    Parameters
    ----------
    mode : str, optional
        One of {'triangular', 'triangular2', 'exp_range'}.
    scale_md : str, optional
        One of {'cycles', 'iterations'}.
    gamma : float, optional
        Constant in the 'exp_range' scaling function: gamma**(cycle iterations).

    Examples
    --------
    >>> # the learning rate of the optimizer should be 1
    >>> optimizer = optim.SGD(model.parameters(), lr=1.)
    >>> step_size = 2 * len(train_loader)
    >>> clr = cyclical_lr(step_size, min_lr=0.001, max_lr=0.005)
    >>> scheduler = lr_scheduler.LambdaLR(optimizer, [clr])
    >>> # some other operations
    >>> scheduler.step()
    >>> optimizer.step()
    """
    if scale_func is None:
        if mode == 'triangular':
            scale_fn = lambda x: 1.
            scale_mode = 'cycles'
        elif mode == 'triangular2':
            scale_fn = lambda x: 1 / (2. ** (x - 1))
            scale_mode = 'cycles'
        elif mode == 'exp_range':
            scale_fn = lambda x: gamma ** x
            scale_mode = 'iterations'
        else:
            raise ValueError(f'{mode!r} is not a valid mode!')
    else:
        scale_fn = scale_func
        scale_mode = scale_md

    def rel_val(iteration, stepsize, mode):
        # relative position within the current triangular cycle
        cycle = math.floor(1 + iteration / (2 * stepsize))
        x = abs(iteration / stepsize - 2 * cycle + 1)
        if mode == 'cycles':
            return max(0, (1 - x)) * scale_fn(cycle)
        elif mode == 'iterations':
            return max(0, (1 - x)) * scale_fn(iteration)
        else:
            raise ValueError(f'{mode!r} is not a valid scale mode!')

    lr_lambda = lambda iters: min_lr + (max_lr - min_lr) * rel_val(iters, step_sz, scale_mode)

    return lr_lambda
  2. Example 2:
optimizer = optim.SGD(model.parameters(), lr=1.)
step_size = 2 * len(train_loader)
clr = cyclical_lr(step_size, min_lr=0.001, max_lr=1, mode='triangular2')
scheduler = lr_scheduler.LambdaLR(optimizer, [clr])
scheduler.step()
optimizer.step()
  3. References:
    1. keras CLR
    2. clr.py

@Randl

Contributor

Randl commented Jun 14, 2018

What's up with this PR?

@thomasjpfan thomasjpfan force-pushed the thomasjpfan:master branch from 7cce1d4 to c66f8e8 Jun 14, 2018

@Randl

Contributor

Randl commented Jun 17, 2018

@thomasjpfan @apaszke What if we add batch_step to _LRScheduler? It would add some complexity, but at least it would be uniform across different schedulers. Code of the form

for epoch in epochs:
    scheduler.step()
    ...
    for batch in batches:
        scheduler.batch_step()
        ...

would work for all schedulers then.

@apaszke

Member

apaszke commented Jun 18, 2018

But that's so verbose... If we need a per-batch callback, we might want to write a wrapper around an iterator that returns the batches (e.g. DataLoader).
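
For example, a thin wrapper could step the scheduler as it yields batches; this is purely a sketch of the idea (ScheduledLoader is not an existing API):

class ScheduledLoader:
    """Hypothetical wrapper: steps a batch-wise scheduler while iterating."""
    def __init__(self, data_loader, scheduler):
        self.data_loader = data_loader
        self.scheduler = scheduler

    def __iter__(self):
        for batch in self.data_loader:
            self.scheduler.batch_step()
            yield batch

for epoch in range(num_epochs):
    for batch in ScheduledLoader(data_loader, scheduler):
        ...  # forward/backward/optimizer.step() as usual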

@sethah

Contributor

sethah commented Aug 2, 2018

In general I don't see why step has to correspond to anything specific at all. In any case, the scheduler progresses the learning rate according to some shape, which is just a function of how many times step() has been called. If I wanted to use the ExponentialLR scheduler to decay the learning rate every batch instead of every epoch, then I could simply call step() after each batch.

The scheduler implementations define the shape; the users define the semantics of step. There was never any real need to link step and epoch in the first place, AFAICT. Am I missing something? If not, we could just use step to mean "advance the learning rate", and cyclical learning rates could be implemented by calling step after every batch. The idea that the step function needs to correspond to a batch, an epoch, or anything else seems misguided, especially since the user is free to call it whenever they see fit.
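
For instance, nothing stops a user from stepping an existing scheduler after every batch today; only the interpretation of last_epoch changes (it counts batches here). A sketch, with model, criterion, and data_loader as placeholders:

from torch.optim.lr_scheduler import ExponentialLR

scheduler = ExponentialLR(optimizer, gamma=0.9999)  # tiny per-step decay

for epoch in range(num_epochs):
    for input, target in data_loader:
        optimizer.zero_grad()
        loss = criterion(model(input), target)
        loss.backward()
        optimizer.step()
        scheduler.step()  # decay after every batch instead of once per epoch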

@thomasjpfan

thomasjpfan commented Aug 2, 2018

@sethah Good point. I agree that it would be better to just let the user decide when to call step. I will update this PR with that in mind.

@thomasjpfan thomasjpfan force-pushed the thomasjpfan:master branch from c66f8e8 to 31eb5d3 Aug 2, 2018

@thomasjpfan thomasjpfan requested review from SsnL and zou3519 as code owners Aug 2, 2018

@thomasjpfan thomasjpfan force-pushed the thomasjpfan:master branch from 58115f7 to 9a7115c Aug 2, 2018

@thomasjpfan thomasjpfan force-pushed the thomasjpfan:master branch from 15eb122 to b3de6d1 Aug 2, 2018

@thomasjpfan thomasjpfan force-pushed the thomasjpfan:master branch from 9e540bc to 2c26064 Aug 3, 2018

thomasjpfan added some commits Aug 23, 2018

@slowbull

slowbull commented Nov 13, 2018

Has anyone tried it on training CIFAR or ImageNet? Is any code available?

@suruoxi

suruoxi commented Nov 29, 2018

davidtvs added a commit to davidtvs/kaggle-hpaic that referenced this pull request Dec 14, 2018

Add CyclicLR implementation
Slightly modified version of pull request #2016 from the pytorch repo: pytorch/pytorch#2016
The implementation follows the paper "Cyclical Learning Rates for Training Neural Networks": https://arxiv.org/abs/1506.01186

@Randl

Contributor

Randl commented Dec 31, 2018

ping reviewers

@@ -4,6 +4,8 @@
from torch._six import inf
from bisect import bisect_right
from functools import partial

import numpy as np

@soumith

soumith Jan 9, 2019

Member

numpy is an optional dependency of PyTorch.
Please remove the dependence on np in this file.

def test_lambda_lr(self):
epochs = 10
self.opt.param_groups[0]['lr'] = 0.05
self.opt.param_groups[1]['lr'] = 0.4
targets = [[0.05 * (0.9 ** x) for x in range(epochs)], [0.4 * (0.8 ** x) for x in range(epochs)]]
targets = [[0.05 * (0.9 ** x) for x in range(epochs)], [0.4 * (0.8 ** x)

@soumith

soumith Jan 9, 2019

Member

please only change lines relevant to your PR.
We have a line limit of 120 characters, so your changes here and above in this file are not needed, and actively introduce noise.

scale_fn (function): Custom scaling policy defined by a single
argument lambda function, where
0 <= scale_fn(x) <= 1 for all x >= 0.
mode paramater is ignored

@soumith

soumith Jan 9, 2019

Member

*parameter

and some scaling of the amplitude; therefore
max_lr may not actually be reached depending on
scaling function. Default: 0.006
step_size_up (int): Number of training iterations in the

@soumith

soumith Jan 9, 2019

Member

this has a default value below, but doc is missing

scaling function. Default: 0.006
step_size_up (int): Number of training iterations in the
increasing half of a cycle.
step_size_down (int): Number of training iterations in the

@soumith

soumith Jan 9, 2019

Member

this has a default value below, but doc is missing

thomasjpfan added some commits Jan 9, 2019
