
Extend pytorch multi tensor rmsprop optimizer to support lr_in_momentum and decoupled_decay. #46118

Closed
wants to merge 1 commit

Conversation

@lly-zero-one (Contributor) commented Oct 9, 2020

Summary: Switch to the multi-tensor version of the RMSprop optimizer in the Classy Vision flow and make the numerics match the old implementation

Test Plan: Flow canary: f223946625

Differential Revision: D24102016

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D24102016

@dr-ci bot commented Oct 9, 2020

💊 CI failures summary and remediations

As of commit 82c0de1 (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



🚧 5 fixed upstream failures:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch:

If your commit is newer than viable/strict, you can try rebasing onto an older, stable commit:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase --onto FETCH_HEAD $(git merge-base origin/master HEAD)

If your commit is older than viable/strict:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.


This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 28 times.


"""

def __init__(self, params, lr=1e-2, alpha=0.99, eps=1e-8, weight_decay=0, momentum=0, centered=False):
def __init__(self, params, lr=1e-2, alpha=0.99, eps=1e-8, weight_decay=0, momentum=0, centered=False, lr_in_momentum=False):
Contributor

This is not standard in PyTorch, and doesn't seem like a standard type of argument for optimizers. What are the alternatives?

Is the goal only to improve performance? In that case it would be good to have a benchmark.
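
For context, a minimal sketch of what the proposed lr_in_momentum flag would change, assuming the TF-style semantics discussed later in this thread (illustrative only, not the PR's actual code):

import torch

def momentum_update(param, grad, square_avg, buf, lr, alpha, eps, momentum, lr_in_momentum):
    # Running average of squared gradients, as in torch.optim.RMSprop.
    square_avg.mul_(alpha).addcmul_(grad, grad, value=1 - alpha)
    avg = square_avg.sqrt().add_(eps)
    if lr_in_momentum:
        # TF-style: the learning rate is folded into the momentum buffer update.
        buf.mul_(momentum).addcdiv_(grad, avg, value=lr)
        param.add_(buf, alpha=-1)
    else:
        # Stock torch.optim.RMSprop: lr is applied when the parameter is updated.
        buf.mul_(momentum).addcdiv_(grad, avg)
        param.add_(buf, alpha=-lr)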

Contributor Author

Thanks, that makes sense. I think I need to change the title for OSS. Basically, we switched to the multi-tensor version for an internal flow and tried to make the accuracy match, but that flow-related change was not pulled out.

Contributor Author

@vincentqb could you help to review the extension and also the numeric change?

Contributor

For the API change, I do not see a reason to add this to the API of RMSProp. No other optimizers have that.

If there is a runtime speed improvement when running the step, I don't believe it would justify adding this API, since this is still a change to the RMSProp algorithm we have.

As I mentioned in my comment, if there is a discrepancy between RMSProp with and without multi-tensor, this needs to be investigated.

Contributor

The change relates to #23796 and the difference between PyTorch and TensorFlow, not the difference between the original implementation and the multi-tensor one.
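
For reference, the PyTorch/TensorFlow difference tracked in #23796 is the placement of eps relative to the square root; a rough illustration (not code from this PR):

import torch

grad, square_avg, eps = torch.randn(10), torch.rand(10), 1e-8

denom_pytorch = square_avg.sqrt().add(eps)  # torch.optim.RMSprop: eps added after the sqrt
denom_tf = square_avg.add(eps).sqrt()       # TF-style RMSProp: eps added inside the sqrt

update_pytorch = grad / denom_pytorch
update_tf = grad / denom_tf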

@@ -384,7 +384,7 @@ void ProcessGroupNCCL::WorkNCCL::synchronizeInternal(
  // Check for errors and throw appropriate exception.
  checkAndThrowException();
  std::this_thread::sleep_for(
-     std::chrono::milliseconds(kSynchronizeBusyWaitMillis));
+     std::chrono::microseconds(kSynchronizeBusyWaitMicros));
Contributor

We hope to get rid of this busy wait in #45236, but I guess it's fine to land this change if it is something urgent for Classy Vision.

Contributor Author

Sorry, I was not aware this file was pulled into this PR. I will remove it.

Contributor

No worries, although I was wondering if we were planning on making these changes as part of a separate PR?

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D24102016

@dzhulgakov (Collaborator)

Can you update the PR description and maybe include standalone benchmark results? (torch.utils.benchmark might be helpful)
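
A standalone benchmark could be sketched roughly as below with torch.utils.benchmark; torch.optim._multi_tensor.RMSprop is an assumption here, referring to the private namespace the multi-tensor optimizers lived in at the time:

import torch
import torch.utils.benchmark as benchmark

params = [torch.randn(1024, 1024, requires_grad=True) for _ in range(50)]
for p in params:
    p.grad = torch.randn_like(p)

optimizers = {
    "single-tensor": torch.optim.RMSprop(params, lr=1e-2, momentum=0.9),
    "multi-tensor": torch.optim._multi_tensor.RMSprop(params, lr=1e-2, momentum=0.9),
}

for label, opt in optimizers.items():
    # Time one optimizer step for each variant over the same parameter set.
    timer = benchmark.Timer(stmt="opt.step()", globals={"opt": opt})
    print(label, timer.blocked_autorange())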

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D24102016

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D24102016

1 similar comment
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D24102016

…torch#46118)

Summary:
Pull Request resolved: pytorch#46118

Switch to the multi-tensor version of the RMSprop optimizer in the Classy Vision flow and make the numerics match the old implementation

Test Plan: Flow canary: f223946625

Differential Revision: D24102016

fbshipit-source-id: 362d525bc3e1728ee736e9806a49c5931e8b1cd5
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D24102016

@lly-zero-one lly-zero-one changed the title Use pytorch multi tensor rmsprop optimizer for better performance Extend pytorch multi tensor rmsprop optimizer to support lr_in_momentum and decoupled_decay. Oct 22, 2020
@vincentqb (Contributor)

> Summary: Switch to the multi-tensor version of the RMSprop optimizer in the Classy Vision flow and make the numerics match the old implementation

@lly-zero-one -- I would like more clarity about the switch to multi-tensor and "make the numerics match the old implementation". Are you saying the implementation with multi-tensor doesn't match the implementation without? If so, then we need to add a test that can confirm this, though that is hard to do in open source without the internal models. We can sync up offline on this.

@ngimel (Collaborator) commented Oct 29, 2020

@vincentqb it looks like there's sufficient demand for a TF-style RMSProp optimizer and we are doing PyTorch users a disservice by not offering it. I understand that for BC reasons we may be reluctant to change the default RMSProp behavior, but then it makes sense to have an RMSProp_tf optimizer (the name can be anything) that does what's requested in #23796 and what @rwightman's linked optimizer does.

@lly-zero-one (Contributor Author)

I could add an option to support the two versions.

@lly-zero-one (Contributor Author)

> @vincentqb it looks like there's sufficient demand for a TF-style RMSProp optimizer and we are doing PyTorch users a disservice by not offering it. I understand that for BC reasons we may be reluctant to change the default RMSProp behavior, but then it makes sense to have an RMSProp_tf optimizer (the name can be anything) that does what's requested in #23796 and what @rwightman's linked optimizer does.

Some comments from the internal team:

It seems that PyTorch RMSProp requires a much smaller learning rate. Currently the typical lr for tf_rms is ~0.1-0.2, while using such a learning rate in pytorch_rms would lead to divergence. I don't have extensive experiments on pytorch_rms.

We have reproduced many SOTA models with tf_rms, but for pytorch_rms some effort is required to tune the hyper-parameters.

@facebook-github-bot (Contributor)

Hi @lly-zero-one!

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but we do not have a signature on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@ngimel (Collaborator) commented Nov 26, 2020

@vincentqb what's required to move this PR forward and provide an RMSProp version consistent with TF behavior? It seems to be requested both internally and externally.

@ngimel ngimel added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Nov 26, 2020
@vincentqb vincentqb added the module: optimizer (Related to torch.optim) label and removed the oncall: distributed (Add this issue/PR to distributed oncall triage queue) label Dec 18, 2020
@vincentqb (Contributor)

Thanks for the pull request @lly-zero-one. As far as I can see, there are mainly three changes requested in this PR.

@lly-zero-one, I will close this pull request for now, but please feel free to open an issue for those points that are still needed.

@vincentqb vincentqb closed this Dec 18, 2020
@@ -78,7 +102,7 @@ def step(self, closure=None):
  raise RuntimeError('RMSprop does not support sparse gradients')

  grads.append(p.grad)
- params_with_grad.append(p)
+ params_with_grad.append(p.data)
Contributor

We recently cleaned up the use of .data inside optimizers. Are you aware of its interaction with multi-tensor?
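
The cleanup referred to here replaced .data access with in-place updates performed under torch.no_grad(); a rough sketch of that pattern (not this PR's code):

import torch

@torch.no_grad()
def apply_updates(params_with_grad, updates, lr):
    # Mutate the Parameters directly; no .data indirection is needed
    # because gradient tracking is suspended by the decorator.
    for p, u in zip(params_with_grad, updates):
        p.add_(u, alpha=-lr)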

@rwightman commented Dec 18, 2020

@vincentqb decoupled decay works perfectly well with RMSProp and some other optimizers besides the AdamW/SGDW of that paper. I was just working on some JAX code fixing up the Flax RMSProp, and it's common to see optimizers there (in various JAX libs) where any weight_decay applied within the optimizer is (usually) decoupled decay, and the L2 penalty (equivalent to weight_decay here) is applied outside of the optimizer.

Since it doesn't look like these changes are going anywhere fast, I think I'll tackle the multi-tensor variant in timm ... I've been using it a lot, so it's probably worthwhile. It should be noted that this changeset was still missing one important difference for reproducing Google papers that use RMSProp + TF: the rms state init.
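
For reference, a sketch of the two behaviors mentioned above, loosely following the RMSpropTF optimizer in timm (decoupled weight decay and the TF-style rms state init); the helper names here are illustrative, not this PR's code:

import torch

def init_rms_state(p):
    # TF initializes the running square average to ones rather than zeros,
    # which noticeably changes the first few update steps.
    return {"square_avg": torch.ones_like(p)}

@torch.no_grad()
def decoupled_decay(p, lr, weight_decay):
    # Decoupled decay: shrink the weights directly instead of adding
    # weight_decay * p to the gradient (the L2-penalty formulation).
    p.mul_(1 - lr * weight_decay)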

@arikanev
Are there any plans to revive this, or do we just have to use, for example, @rwightman's methodology?

Labels
cla signed, fb-exported, module: optimizer (Related to torch.optim), open source, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
9 participants