Conversation

zcain117
Collaborator

@zcain117 zcain117 commented Sep 7, 2019

Showing option #3 for the learning rate scheduling that we need for resnet50 to pass 76% accuracy.

See here for option #2 and here for option #1.

At first I was hesitant about this route since it differs from the PyTorch paradigm for how schedulers are intended to be used (which is to call scheduler.step() once per epoch).

However, the code is much cleaner in this version and it will be easier to assign models to different schedulers as we expand the number of models we support.
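
Roughly, the scheduler-style option with warmup looks like the sketch below. This is a minimal illustration only: the class name, constructor arguments, and decay rule are placeholders, not the actual implementation in this PR's diff; the point is that the scheduler wraps the optimizer and is stepped once per batch rather than once per epoch.

    import torch
    from torch.optim.lr_scheduler import _LRScheduler

    class WarmupThenDecay(_LRScheduler):
        """Illustrative placeholder, not this PR's class: linear warmup for
        `warmup_steps`, then multiply the LR by `gamma` every `decay_every`
        steps. `last_epoch` is reused as a per-batch step counter, so step()
        is called once per optimizer step."""

        def __init__(self, optimizer, warmup_steps, decay_every, gamma=0.1, last_epoch=-1):
            self.warmup_steps = warmup_steps
            self.decay_every = decay_every
            self.gamma = gamma
            super().__init__(optimizer, last_epoch)

        def get_lr(self):
            step = self.last_epoch
            if step < self.warmup_steps:
                scale = float(step + 1) / self.warmup_steps
            else:
                scale = self.gamma ** ((step - self.warmup_steps) // self.decay_every)
            return [base_lr * scale for base_lr in self.base_lrs]

    model = torch.nn.Linear(8, 8)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    # Stepped once per batch inside the train loop, not once per epoch.
    scheduler = WarmupThenDecay(optimizer, warmup_steps=100, decay_every=1000)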

@zcain117
Collaborator Author

zcain117 commented Sep 7, 2019

Note that none of the existing pytorch schedulers allow for warmup epochs: https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html
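
The closest one can get with the stock schedulers is to hand-roll the warmup multiplier through LambdaLR and step it every batch. A rough illustration of that workaround (not what this PR adds):

    import torch
    from torch.optim.lr_scheduler import LambdaLR

    model = torch.nn.Linear(4, 4)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    warmup_steps = 500

    def warmup_multiplier(step):
        # Linear ramp up to the base LR over warmup_steps, then constant.
        return min(1.0, (step + 1) / warmup_steps)

    scheduler = LambdaLR(optimizer, lr_lambda=warmup_multiplier)

    for step in range(1000):
        optimizer.step()   # (forward/backward pass omitted in this skeleton)
        scheduler.step()   # advances the multiplier by one batch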

@taylanbil
Collaborator

yeah I vote for this one.

Collaborator

@dlibenzi dlibenzi left a comment

In general, I have a slight preference for the optimizer style, as I see it cluttering the main loop less.
No very strong opinion, though.
What do others think?

@zcain117
Collaborator Author

zcain117 commented Sep 8, 2019

In general, I have a slight preference for the optimizer style, as I see it cluttering the main loop less.
No very strong opinion, though.
What do others think?

I think this scheduler style fits best with the PyTorch model. They use scheduler objects that wrap optimizers to update learning rates, and I now realize they are already moving toward a step-based scheduler update system.

They already have one scheduler where they hack in a step-dependent scheduler.step() call: https://github.com/pytorch/pytorch/blob/master/torch/optim/lr_scheduler.py#L715-L729

The version I've made in this PR is very similar, except that I explicitly provide the step arg instead of their hack of encoding step+epoch in the epoch arg in the example linked above. The resulting train loop in this PR is very similar to their example train loop for that step-dependent scheduler. Theirs looks like this:

    scheduler = CosineAnnealingWarmRestarts(optimizer, T_0, T_mult)
    iters = len(dataloader)
    for epoch in range(20):
        for i, sample in enumerate(dataloader):
            inputs, labels = sample['inputs'], sample['labels']
            scheduler.step(epoch + i / iters)
            optimizer.zero_grad()
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
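
For contrast, the loop in this PR follows the same shape but passes the global step explicitly. The stand-in scheduler below is illustrative only (the real class lives in this PR's diff); it just shows the loop structure:

    import torch

    def lr_for_step(step, base_lr=0.1, warmup_steps=500, decay_every=5000, gamma=0.1):
        # Illustrative warmup-then-decay policy expressed as a pure function of the step.
        if step < warmup_steps:
            return base_lr * (step + 1) / warmup_steps
        return base_lr * gamma ** ((step - warmup_steps) // decay_every)

    class ExplicitStepScheduler:
        """Stand-in for this PR's scheduler: step() takes the global step directly."""

        def __init__(self, optimizer):
            self.optimizer = optimizer

        def step(self, global_step):
            lr = lr_for_step(global_step)
            for group in self.optimizer.param_groups:
                group['lr'] = lr

    net = torch.nn.Linear(8, 2)
    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
    scheduler = ExplicitStepScheduler(optimizer)

    # Toy stand-in for a real dataloader.
    dataloader = [(torch.randn(4, 8), torch.randn(4, 2)) for _ in range(10)]
    steps_per_epoch = len(dataloader)

    for epoch in range(2):
        for i, (inputs, labels) in enumerate(dataloader):
            scheduler.step(epoch * steps_per_epoch + i)  # explicit step, not epoch + i / iters
            optimizer.zero_grad()
            loss = criterion(net(inputs), labels)
            loss.backward()
            optimizer.step()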

@jysohn23
Collaborator

jysohn23 commented Sep 8, 2019

As discussed offline on Friday, I prefer the lr_scheduler step type as opposed to having to subclass optimizers. +1 for this PR!

@dlibenzi
Collaborator

dlibenzi commented Sep 8, 2019

As discussed offline on Friday, I prefer the lr_scheduler step type as opposed to having to subclass optimizers. +1 for this PR!

Subclassing, like I commented in that PR, would not have been the way; wrapping would have.
We cannot have a different optimizer for each base optimizer.

As I said, I only have a slight preference for the optimizer wrapping, so I am essentially OK with both.

But let's see if @ailzhang can ask inside the pytorch community for guidance ...

@ailzhang
Contributor

ailzhang commented Sep 9, 2019

@fmassa and I both voted for scheduler-style I think.
@fmassa also mentioned PT should probably allow a warmup scheduler in the future.

@dlibenzi
Collaborator

dlibenzi commented Sep 9, 2019

@fmassa and I both voted for scheduler-style I think.
@fmassa also mentioned PT should probably allow a warmup scheduler in the future.

Thanks @ailzhang !
OK @zcain117 let's go with that!

Collaborator

@dlibenzi dlibenzi left a comment

Why did you update the tensorflow commit ID?

@zcain117
Collaborator Author

Why did you update the tensorflow commit ID?

I haven't touched anything in third_party/.
Not sure why it's showing up as a diff.

Collaborator

@dlibenzi dlibenzi left a comment

Make sure this works on XLA CPU and PyTorch CPU (pass [] as devices to DataParallel).

@zcain117
Collaborator Author

Make sure this works on XLA CPU and PyTorch CPU (pass [] as devices to DataParallel).

Just checked using --num_cores=0 and also

export XRT_DEVICE_MAP="CPU:0;/job:localservice/replica:0/task:0/device:XLA_CPU:0"
export XRT_WORKERS="localservice:0;grpc://localhost:40934"
unset XRT_TPU_CONFIG

Both versions were updating the optimizer learning rate correctly.

Collaborator

@taylanbil taylanbil left a comment

lgtm

@zcain117
Collaborator Author

I'm still working through a new accuracy issue here. When I kick off resnet50 on ImageNet now, the job gets to 12% accuracy on epoch 2 or 3 and then stays at that accuracy forever. I'm trying to tell whether this is related to this change or to new code that went in since I created the original red VM. I'm also trying to tell whether this is the result of building from head (my runs that reached good accuracy used the pytorch-nightly conda env, not a build from head).

@zcain117
Collaborator Author

An update:
We narrowed down the regression above: between August 28 and Sept 12, something regressed so that we max out around 15% accuracy with no learning rate scheduler. However, it seems that with learning rate scheduling, we still hit >76% accuracy.

I patched this PR into a fresh pytorch-nightly conda env in a new red VM and accuracy was >76% at 90 epochs. Therefore, this PR seems pretty safe to submit.

I also wanted to patch this PR and build from head using build_torch_wheels.sh. However, AFAICT, at the time I grabbed the source from master, upstream PyTorch had submitted this PR, which broke this one, and I wasn't able to run. I didn't have time to investigate until now, but they fixed it soon after here.

I want to make another red VM, pull the latest source, patch in this PR, build from head, and run for 90 epochs to verify that the Torch bug is fixed and that we can pass 76% accuracy using this PR + build_torch_wheels.sh.

@zcain117 zcain117 merged commit bc14eb2 into master Sep 16, 2019