Extend PyTorch multi-tensor RMSprop optimizer to support lr_in_momentum and decoupled_decay. #46118
Conversation
This pull request was exported from Phabricator. Differential Revision: D24102016
💊 CI failures summary and remediations
As of commit 82c0de1 (more details on the Dr. CI page): ✅ None of the CI failures appear to be your fault 💚
🚧 5 fixed upstream failures: these were probably caused by upstream breakages that were already fixed. Please rebase on the
torch/optim/_multi_tensor/rmsprop.py (Outdated)

     """
-    def __init__(self, params, lr=1e-2, alpha=0.99, eps=1e-8, weight_decay=0, momentum=0, centered=False):
+    def __init__(self, params, lr=1e-2, alpha=0.99, eps=1e-8, weight_decay=0, momentum=0, centered=False, lr_in_momentum=False):
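(For context on what the new flag would change: below is a minimal sketch of the two update orders, with illustrative names -- this is not the PR's actual implementation. TF/Caffe2-style RMSprop folds the learning rate into the momentum buffer, while stock PyTorch applies it only at the final parameter update.)

```python
import torch

def rmsprop_momentum_step(p, grad, square_avg, buf,
                          lr=1e-2, alpha=0.99, eps=1e-8,
                          momentum=0.9, lr_in_momentum=False):
    # Illustrative single-tensor step, not the PR's code.
    # EMA of squared gradients.
    square_avg.mul_(alpha).addcmul_(grad, grad, value=1 - alpha)
    avg = square_avg.sqrt().add_(eps)
    if lr_in_momentum:
        # TF/Caffe2 style: lr is accumulated inside the momentum buffer.
        buf.mul_(momentum).addcdiv_(grad, avg, value=lr)
        p.sub_(buf)
    else:
        # Stock PyTorch style: lr scales the final update only.
        buf.mul_(momentum).addcdiv_(grad, avg)
        p.add_(buf, alpha=-lr)

# Usage with synthetic state:
p, grad = torch.randn(5), torch.randn(5)
square_avg, buf = torch.zeros(5), torch.zeros(5)
rmsprop_momentum_step(p, grad, square_avg, buf, lr_in_momentum=True)
```

The two orderings agree for a single step from a zero buffer but diverge once the momentum history accumulates, which is why the flag affects accuracy parity and not just speed.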
This is not standard in pytorch, and doesn't seem like a standard type of argument for optimizers. What are alternatives?
Is the goal only to improve performance? In that case it would be good to have a benchmark.
Thanks, that makes sense. I think I need to change the title for OSS. Basically, we switched to the multi-tensor version for an internal flow and tried to make the accuracy match, but that flow-related change was not pulled out.
@vincentqb could you help review the extension and also the numeric change?
For the API change, I do not see a reason to add this to the API of RMSProp; no other optimizer has it.
If there is a runtime speed improvement for running the step, I don't believe that alone would justify adding this API, since this is still a change to the RMSProp algorithm we have.
As I mentioned in my comment, if there is a discrepancy between RMSProp with and without multi-tensor, it needs to be investigated.
The change relates to #23796 and the difference between PyTorch and TensorFlow -- not between the original implementation and the multi-tensor one.
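(For reference, #23796 concerns where eps enters the denominator; the following is a sketch of the two conventions as I read the issue, not code from this PR.)

```python
import torch

grad = torch.randn(10)
square_avg = torch.rand(10)  # EMA of squared gradients
eps = 1e-8

# PyTorch RMSprop: eps is added *after* the square root.
denom_pytorch = square_avg.sqrt() + eps

# TensorFlow RMSprop: eps is added *before* the square root,
# which behaves very differently when square_avg is near zero.
denom_tf = (square_avg + eps).sqrt()

update_pytorch = grad / denom_pytorch
update_tf = grad / denom_tf
```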
torch/lib/c10d/ProcessGroupNCCL.cpp (Outdated)

@@ -384,7 +384,7 @@ void ProcessGroupNCCL::WorkNCCL::synchronizeInternal(
   // Check for errors and throw appropriate exception.
   checkAndThrowException();
   std::this_thread::sleep_for(
-      std::chrono::milliseconds(kSynchronizeBusyWaitMillis));
+      std::chrono::microseconds(kSynchronizeBusyWaitMicros));
We hope to get rid of this busy wait in #45236, but I guess it's fine to land this change if it is something urgent for Classy Vision.
Sorry, I was not aware this file was pulled into this PR. I will remove it.
No worries, although I was wondering if we were planning on making these changes as part of a separate PR?
This pull request was exported from Phabricator. Differential Revision: D24102016
Force-pushed from b4c9a59 to 7378385
Can you update the PR description and maybe include standalone benchmark results? (torch.utils.benchmark might be helpful)
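(A standalone benchmark along those lines could look like the sketch below. It assumes the private torch.optim._multi_tensor namespace available at the time of this PR; names and sizes are illustrative.)

```python
import torch
from torch.utils.benchmark import Timer

# Synthetic parameters with pre-populated gradients.
params = [torch.randn(1024, 1024, requires_grad=True) for _ in range(50)]
for p in params:
    p.grad = torch.randn_like(p)

def make_step_fn(opt_cls):
    opt = opt_cls(params, lr=1e-2, momentum=0.9)
    return lambda: opt.step()

for name, opt_cls in [("single-tensor", torch.optim.RMSprop),
                      ("multi-tensor", torch.optim._multi_tensor.RMSprop)]:
    t = Timer(stmt="step()", globals={"step": make_step_fn(opt_cls)})
    print(name, t.blocked_autorange())
```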
Force-pushed from 7378385 to e454458
This pull request was exported from Phabricator. Differential Revision: D24102016
Force-pushed from e454458 to 2cea56d
This pull request was exported from Phabricator. Differential Revision: D24102016
1 similar comment
This pull request was exported from Phabricator. Differential Revision: D24102016
Force-pushed from 2cea56d to 0ce4c70
…torch#46118)

Summary: Pull Request resolved: pytorch#46118. Switch to the multi-tensor version of the RMSProp optimizer in the Classy Vision flow and make the numerics match the old one.
Test Plan: Flow canary: f223946625
Differential Revision: D24102016
fbshipit-source-id: 362d525bc3e1728ee736e9806a49c5931e8b1cd5
This pull request was exported from Phabricator. Differential Revision: D24102016
Force-pushed from 0ce4c70 to 82c0de1
@lly-zero-one -- I would like more clarity about the switch to multi-tensor and "make the numerics match the old one". Are you saying the multi-tensor implementation doesn't match the implementation without it? If so, we need to add a test that can confirm that, though it is hard to do in open source without the internal models. We can sync up offline on this.
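(A parity test along those lines could look like the sketch below, using synthetic data in place of the internal models; it assumes the private torch.optim._multi_tensor namespace from this era.)

```python
import torch

torch.manual_seed(0)
ref_params = [torch.randn(20, 20, requires_grad=True) for _ in range(4)]
mt_params = [p.detach().clone().requires_grad_(True) for p in ref_params]

ref_opt = torch.optim.RMSprop(ref_params, lr=1e-2, momentum=0.9)
mt_opt = torch.optim._multi_tensor.RMSprop(mt_params, lr=1e-2, momentum=0.9)

# Feed identical gradients to both optimizers for several steps.
for _ in range(10):
    grads = [torch.randn(20, 20) for _ in ref_params]
    for p, mp, g in zip(ref_params, mt_params, grads):
        p.grad = g.clone()
        mp.grad = g.clone()
    ref_opt.step()
    mt_opt.step()

# The two implementations should agree to numerical precision.
for p, mp in zip(ref_params, mt_params):
    torch.testing.assert_allclose(p, mp)
```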
@vincentqb it looks like there's sufficient demand for a TF-style RMSProp optimizer, and we are doing PyTorch users a disservice by not offering it. I understand that for BC reasons we may be reluctant to change the default RMSProp behavior, but then it makes sense to have an RMSProp_tf optimizer (the name can be anything) that does what's requested in #23796 and what @rwightman's linked optimizer does.
I could add an option to support the two versions.
Some comments from the internal team:
Hi @lly-zero-one! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but we do not have a signature on file. In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
@vincentqb what's required to move this PR forward and provide an RMSProp version consistent with TF behavior? It seems to be requested both internally and externally.
Thanks for the pull request @lly-zero-one. As far as I can see, there are mainly three changes requested in this PR.
@lly-zero-one, I will close this pull request for now, but please feel free to open an issue for those points that are still needed.
@@ -78,7 +102,7 @@ def step(self, closure=None):
                     raise RuntimeError('RMSprop does not support sparse gradients')

                 grads.append(p.grad)
-                params_with_grad.append(p)
+                params_with_grad.append(p.data)
We recently cleaned up the use of .data inside optimizers. Are you aware of its interaction with multi-tensor?
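(For reference, the cleanup referred to replaces .data access with updates under torch.no_grad(); a sketch of the two patterns, illustrative rather than this PR's code:)

```python
import torch

p = torch.randn(3, requires_grad=True)
p.grad = torch.randn(3)
lr = 0.01

# Older pattern: .data bypasses autograd, but also bypasses
# version counting and other safety checks.
p.data.add_(p.grad, alpha=-lr)

# Cleaned-up pattern: mutate the parameter under no_grad instead.
with torch.no_grad():
    p.add_(p.grad, alpha=-lr)
```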
@vincentqb decoupled decay works perfectly well with RMSprop and some other optimizers besides the AdamW/SGDW of that paper. I was just working on some JAX code fixing up the Flax RMSProp, and it's common to see optimizers there (in various JAX libs) where any weight_decay applied within the optimizer is (usually) decoupled decay, while an L2 penalty (equivalent to weight_decay here) is applied outside the optimizer. Since it doesn't look like these changes are going anywhere fast, I think I'll tackle the multi-tensor variant in
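(Decoupled decay in the Loshchilov & Hutter sense shrinks the weights directly instead of adding weight_decay * p to the gradient. A minimal sketch with illustrative names, not this PR's implementation:)

```python
import torch

def rmsprop_decoupled_decay_step(p, grad, square_avg,
                                 lr=1e-2, alpha=0.99, eps=1e-8,
                                 weight_decay=1e-4):
    # L2 penalty would instead do: grad = grad + weight_decay * p,
    # which then flows through the adaptive denominator.
    # Decoupled decay shrinks the parameter directly, independent
    # of the gradient statistics.
    p.mul_(1 - lr * weight_decay)
    square_avg.mul_(alpha).addcmul_(grad, grad, value=1 - alpha)
    p.addcdiv_(grad, square_avg.sqrt().add_(eps), value=-lr)
```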
Are there any plans to revive this, or do we just have to use, for example, @rwightman's methodology?
Summary: Switch to the multi-tensor version of the RMSProp optimizer in the Classy Vision flow and make the numerics match the old one.
Test Plan: Flow canary: f223946625
Differential Revision: D24102016