Ensure DDP + Pipe works with find_unused_parameters. #49908
Conversation
As described in #49891, DDP + Pipe doesn't work with find_unused_parameters. This PR adds a simple fix to enable this functionality. This only currently works for Pipe within a single host and needs to be re-worked once we support cross host Pipe. Differential Revision: [D25719922](https://our.internmc.facebook.com/intern/diff/D25719922/) [ghstack-poisoned]
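For context, here is a minimal single-host sketch of the pattern this PR enables; the backend choice, device layout, and tensor shapes are assumptions for illustration, not taken from the PR:

```python
import os
import torch
import torch.distributed as dist
import torch.distributed.rpc as rpc
import torch.nn as nn
from torch.distributed.pipeline.sync import Pipe
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    # Pipe requires the RPC framework to be initialized, even on a single host.
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Each DDP rank owns a two-stage pipeline on two local GPUs
    # (assumes 2 * world_size GPUs are available on this host).
    fc1 = nn.Linear(16, 16).cuda(2 * rank)
    fc2 = nn.Linear(16, 4).cuda(2 * rank + 1)
    model = Pipe(nn.Sequential(fc1, fc2), chunks=2)

    # What this PR enables: find_unused_parameters=True on a DDP-wrapped Pipe.
    ddp = DDP(model, find_unused_parameters=True)

    # Pipe.forward returns an RRef; unwrap it on this (owning) host.
    out = ddp(torch.randn(8, 16).cuda(2 * rank))
    out.local_value().sum().backward()

    rpc.shutdown()
```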
💊 CI failures summary and remediations
As of commit 9eb0386 (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns
The following CI failures do not appear to be due to upstream breakages:
Codecov Report

@@                Coverage Diff                 @@
##  gh/pritamdamania87/195/base  #49908  +/-  ##
=================================================
- Coverage      80.68%      80.67%      -0.01%
=================================================
  Files           1902        1899          -3
  Lines         206351      205952        -399
=================================================
- Hits          166496      166159        -337
+ Misses         39855       39793         -62
LGTM, just had a few questions inline
torch/nn/parallel/distributed.py
if RPC_AVAILABLE and isinstance(self.module, Pipe):
    # Unwrap RRef to get real output for Pipe.
    # TODO: Needs to be reworked for cross host pipelining.
    self.reducer.prepare_for_backward(list(_find_tensors(output.local_value())))
Is the RRef always guaranteed to be owned by the node that this runs on? How exactly do we guarantee this?
Also, is this the only special case where we'll have to unwrap to get the actual tensors needed in `prepare_for_backward`? I'm a bit concerned about coupling Pipe too tightly with DDP, when they should probably be as independent as they can be (i.e., ideally DDP won't have to deal with pipe-specific logic). Could we eventually refactor to make that so?
> Is the RRef always guaranteed to be owned by the node that this runs on? How exactly do we guarantee this?
This kinda works right now since Pipe is single host only, but as I mentioned in the comment above, this needs to be reworked for cross host pipelining.
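To make that single-host invariant explicit, here is a hedged sketch using the public RRef API; `unwrap_local` is a hypothetical helper, not code from this PR:

```python
from torch.distributed.rpc import RRef

def unwrap_local(rref: RRef):
    # Hypothetical helper: with Pipe confined to a single host, the output
    # RRef is always owned by the caller, so local_value() involves no RPC.
    # A cross-host Pipe would need to handle the non-owner case instead
    # (e.g. via rref.to_here()).
    assert rref.is_owner(), "non-owner RRef: cross-host pipelining not supported yet"
    return rref.local_value()
```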
> I'm a bit concerned about too tightly coupling Pipe with DDP, when they should probably be as independent as they can be (i.e., ideally DDP won't have to deal with pipe-specific logic). Could we eventually refactor to make that so?
This is a good point, and when I thought about it a bit more, I think we can make this independent of Pipe for now and also more generic: we can enhance `_find_tensors` to unwrap a local RRef and look for tensors inside whenever it encounters one.
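A minimal sketch of that idea, assuming a recursive traversal over containers; this is illustrative, not the merged `_find_tensors` implementation:

```python
import torch
from torch.distributed.rpc import RRef

def find_tensors(obj):
    # Recursively yield every tensor reachable from obj.
    if isinstance(obj, RRef) and obj.is_owner():
        # Unwrap a locally owned RRef and search the value it holds.
        yield from find_tensors(obj.local_value())
    elif isinstance(obj, torch.Tensor):
        yield obj
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            yield from find_tensors(item)
    elif isinstance(obj, dict):
        for value in obj.values():
            yield from find_tensors(value)
```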
@rohan-varma Updated the PR, could you take another look? Thanks!
Looks good, agreed that enhancing `_find_tensors` is a better approach. It looks like `test_backward_node_failure_python_udf` is failing on this PR, though it's probably flakiness and unrelated.
This pull request has been merged in f39f258.
Pipe + DDP has a few issues:

1) With static graph, gradients are not synchronized on the first backward pass (i.e. the delay allreduce is not run); broken since #55248.
2) When find_unused_parameters=True, gradient synchronization also does not happen; broken since #57081.

The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not invoke the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP with `Pipe`. This is because we run `backward` on the `rref.local_value()`, which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908 (see the sketch below).

To test: all tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is that all ranks received the same model inputs, so even if gradient synchronization did not occur, grads would still match because the model is the same on all ranks (guaranteed by DDP). Fixed the tests to use different inputs across ranks.

Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/)
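A hedged sketch of that unwrap-and-reconstruct idea; `run_ddp_sink` and its `sink_apply` argument are hypothetical names standing in for the real `DDPSink` plumbing in torch/nn/parallel/distributed.py:

```python
from torch.distributed.rpc import RRef

def run_ddp_sink(output, sink_apply):
    # Hypothetical wrapper: if the module output is an RRef (as with Pipe),
    # applying the sink to the RRef itself would skip DDPSink's custom
    # backward. Instead, unwrap the local value, record it in autograd via
    # the sink, and rewrap it so callers still see an RRef.
    if isinstance(output, RRef):
        recorded = sink_apply(output.local_value())
        # Constructing an RRef locally requires RPC to be initialized.
        return RRef(recorded)
    return sink_apply(output)
```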
…ic graph" Pipe + DDP has a few issues: 1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since #55248 2) when find_unused_parameters=True, also does not result in gradient synchronization. does not work since #57081 The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908. to test: All tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks. Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/) [ghstack-poisoned]
Pipe + DDP has a few issues: 1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since #55248 2) when find_unused_parameters=True, also does not result in gradient synchronization. does not work since #57081 The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908. to test: All tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks. Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/) [ghstack-poisoned]
…ic graph" Pipe + DDP has a few issues: 1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since #55248 2) when find_unused_parameters=True, also does not result in gradient synchronization. does not work since #57081 The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908. to test: All tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks. Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/) [ghstack-poisoned]
Pipe + DDP has a few issues: 1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since #55248 2) when find_unused_parameters=True, also does not result in gradient synchronization. does not work since #57081 The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908. to test: All tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks. Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/) [ghstack-poisoned]
…ic graph" Pipe + DDP has a few issues: 1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since #55248 2) when find_unused_parameters=True, also does not result in gradient synchronization. does not work since #57081 The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908. to test: All tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks. Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/) [ghstack-poisoned]
Pipe + DDP has a few issues: 1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since #55248 2) when find_unused_parameters=True, also does not result in gradient synchronization. does not work since #57081 The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908. to test: All tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks. Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/) [ghstack-poisoned]
…ic graph" Pipe + DDP has a few issues: 1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since #55248 2) when find_unused_parameters=True, also does not result in gradient synchronization. does not work since #57081 The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908. to test: All tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks. Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/) [ghstack-poisoned]
Pipe + DDP has a few issues: 1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since #55248 2) when find_unused_parameters=True, also does not result in gradient synchronization. does not work since #57081 The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908. to test: All tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks. Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/) [ghstack-poisoned]
Summary: Pull Request resolved: #60118 Pipe + DDP has a few issues: 1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since #55248 2) when find_unused_parameters=True, also does not results in gradient synchronization. does not work since #57081 The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908. to test: All tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks. ghstack-source-id: 131688187 Test Plan: CI Reviewed By: pritamdamania87 Differential Revision: D29167283 fbshipit-source-id: fe62310db2dc6de8519eb361b1df8ae4dfce3ab8