Conversation

zcain117
Collaborator

@zcain117 zcain117 commented Sep 7, 2019

Showing option #3 for the learning rate scheduling that we need for resnet50 to pass 76% accuracy.

See here for option #2 and here for option #1.

At first I was hesitant about this route since it differs from the PyTorch paradigm for how schedulers are intended to be used (which is to call scheduler.step() once per epoch).

However, the code is much cleaner in this version and it will be easier to assign models to different schedulers as we expand the number of models we support.
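
Roughly, the scheduler-style option with warmup looks like the sketch below. This is a minimal illustration only: the class name, constructor arguments, and decay rule are placeholders, not the actual implementation in this PR's diff; the point is that the scheduler wraps the optimizer and is stepped once per batch rather than once per epoch.

    import torch
    from torch.optim.lr_scheduler import _LRScheduler

    class WarmupThenDecay(_LRScheduler):
        """Illustrative placeholder, not this PR's class: linear warmup for
        `warmup_steps`, then multiply the LR by `gamma` every `decay_every`
        steps. `last_epoch` is reused as a per-batch step counter, so step()
        is called once per optimizer step."""

        def __init__(self, optimizer, warmup_steps, decay_every, gamma=0.1, last_epoch=-1):
            self.warmup_steps = warmup_steps
            self.decay_every = decay_every
            self.gamma = gamma
            super().__init__(optimizer, last_epoch)

        def get_lr(self):
            step = self.last_epoch
            if step < self.warmup_steps:
                scale = float(step + 1) / self.warmup_steps
            else:
                scale = self.gamma ** ((step - self.warmup_steps) // self.decay_every)
            return [base_lr * scale for base_lr in self.base_lrs]

    model = torch.nn.Linear(8, 8)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    # Stepped once per batch inside the train loop, not once per epoch.
    scheduler = WarmupThenDecay(optimizer, warmup_steps=100, decay_every=1000)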

@zcain117
Collaborator Author

zcain117 commented Sep 7, 2019

Note that none of the existing pytorch schedulers allow for warmup epochs: https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html
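
The closest one can get with the stock schedulers is to hand-roll the warmup multiplier through LambdaLR and step it every batch. A rough illustration of that workaround (not what this PR adds):

    import torch
    from torch.optim.lr_scheduler import LambdaLR

    model = torch.nn.Linear(4, 4)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    warmup_steps = 500

    def warmup_multiplier(step):
        # Linear ramp up to the base LR over warmup_steps, then constant.
        return min(1.0, (step + 1) / warmup_steps)

    scheduler = LambdaLR(optimizer, lr_lambda=warmup_multiplier)

    for step in range(1000):
        optimizer.step()   # (forward/backward pass omitted in this skeleton)
        scheduler.step()   # advances the multiplier by one batch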

@taylanbil
Collaborator

yeah I vote for this one.

Collaborator

@dlibenzi dlibenzi left a comment

In general, I have a slight preference for the optimizer style, as I see it cluttering the main loop less.
No very strong opinion, though.
What do others think?

@zcain117
Collaborator Author

zcain117 commented Sep 8, 2019

In general, I have a slight preference for the optimizer style, as I see it cluttering the main loop less.
No very strong opinion, though.
What do others think?

I think this scheduler style fits best with the PyTorch model. They use scheduler objects that wrap optimizers to update learning rates, and I now realize they are already moving toward a step-based scheduler update system.

They already have one scheduler where they hack in a step-dependent scheduler.step() call: https://github.com/pytorch/pytorch/blob/master/torch/optim/lr_scheduler.py#L715-L729

The version I've made in this PR is very similar, except that I explicitly provide the step arg instead of their hack of encoding step+epoch in the epoch arg in the example linked above. The resulting train loop in this PR is very similar to their example train loop for that step-dependent scheduler. Theirs looks like this:

    scheduler = CosineAnnealingWarmRestarts(optimizer, T_0, T_mult)
    iters = len(dataloader)
    for epoch in range(20):
        for i, sample in enumerate(dataloader):
            inputs, labels = sample['inputs'], sample['labels']
            scheduler.step(epoch + i / iters)
            optimizer.zero_grad()
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
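
For contrast, the loop in this PR follows the same shape but passes the global step explicitly. The stand-in scheduler below is illustrative only (the real class lives in this PR's diff); it just shows the loop structure:

    import torch

    def lr_for_step(step, base_lr=0.1, warmup_steps=500, decay_every=5000, gamma=0.1):
        # Illustrative warmup-then-decay policy expressed as a pure function of the step.
        if step < warmup_steps:
            return base_lr * (step + 1) / warmup_steps
        return base_lr * gamma ** ((step - warmup_steps) // decay_every)

    class ExplicitStepScheduler:
        """Stand-in for this PR's scheduler: step() takes the global step directly."""

        def __init__(self, optimizer):
            self.optimizer = optimizer

        def step(self, global_step):
            lr = lr_for_step(global_step)
            for group in self.optimizer.param_groups:
                group['lr'] = lr

    net = torch.nn.Linear(8, 2)
    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
    scheduler = ExplicitStepScheduler(optimizer)

    # Toy stand-in for a real dataloader.
    dataloader = [(torch.randn(4, 8), torch.randn(4, 2)) for _ in range(10)]
    steps_per_epoch = len(dataloader)

    for epoch in range(2):
        for i, (inputs, labels) in enumerate(dataloader):
            scheduler.step(epoch * steps_per_epoch + i)  # explicit step, not epoch + i / iters
            optimizer.zero_grad()
            loss = criterion(net(inputs), labels)
            loss.backward()
            optimizer.step()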

@jysohn23
Collaborator

jysohn23 commented Sep 8, 2019

As discussed offline on Friday, I prefer the lr_scheduler step type as opposed to having to subclass optimizers. +1 for this PR!

@dlibenzi
Collaborator

dlibenzi commented Sep 8, 2019

As discussed offline on Friday, I prefer the lr_scheduler step type as opposed to having to subclass optimizers. +1 for this PR!

Subclassing, like I commented in that PR, would not have been the way; wrapping would have.
We cannot have a different optimizer for each base optimizer.

As I said, I only have a slight preference for the optimizer wrapping, so I am essentially OK with both.

But let's see if @ailzhang can ask inside the pytorch community for guidance ...

@ailzhang
Contributor

ailzhang commented Sep 9, 2019

@fmassa and I both voted for scheduler-style I think.
@fmassa also mentioned PT should probably allow a warmup scheduler in the future.

@dlibenzi
Collaborator

dlibenzi commented Sep 9, 2019

@fmassa and I both voted for scheduler-style I think.
@fmassa also mentioned PT should probably allow a warmup scheduler in the future.

Thanks @ailzhang !
OK @zcain117 let's go with that!

Collaborator

@dlibenzi dlibenzi left a comment

Why did you update the tensorflow commit ID?

@zcain117
Collaborator Author

Why did you update the tensorflow commit ID?

I haven't touched anything in third_party/.
Not sure why it's showing up as a diff.

Collaborator

@dlibenzi dlibenzi left a comment

Make sure this works on XLA CPU and PyTorch CPU (pass [] as devices to DataParallel).

@zcain117
Collaborator Author

Make sure this works on XLA CPU and PyTorch CPU (pass [] as devices to DataParallel).

Just checked using --num_cores=0 and also

export XRT_DEVICE_MAP="CPU:0;/job:localservice/replica:0/task:0/device:XLA_CPU:0"
export XRT_WORKERS="localservice:0;grpc://localhost:40934"
unset XRT_TPU_CONFIG

Both versions were updating the optimizer learning rate correctly.

Collaborator

@taylanbil taylanbil left a comment

lgtm

@zcain117
Collaborator Author

I'm still working through a new accuracy issue here. When I kick off resnet50 on ImageNet now, the job gets to 12% accuracy on epoch 2 or 3 and then stays at that accuracy forever. I'm trying to tell whether this is related to this change or to new code that went in since I created the original red VM. I'm also trying to tell whether this is the result of building from head (my runs that reached good accuracy used the pytorch-nightly conda env, not a build from head).

@zcain117
Collaborator Author

An update:
We narrowed down the regression above: between August 28 and Sept 12, something regressed so that we max out around 15% accuracy with no learning rate scheduler. However, it seems that with learning rate scheduling, we still hit >76% accuracy.

I patched this PR into a fresh pytorch-nightly conda env in a new red VM and accuracy was >76% at 90 epochs. Therefore, this PR seems pretty safe to submit.

I also wanted to patch this PR and build from head using build_torch_wheels.sh. However, AFAICT, at the time I grabbed the source from master, upstream PyTorch had submitted this PR, which broke this one, and I wasn't able to run. I didn't have time to investigate until now, but they fixed it soon after here.

I want to make another red VM, pull the latest source, patch in this PR, build from head, and run for 90 epochs to verify that the Torch bug is fixed and that we can pass 76% accuracy using this PR + build_torch_wheels.sh.

@zcain117 zcain117 merged commit bc14eb2 into master Sep 16, 2019