Ensure DDP + Pipe works with find_unused_parameters. #49908
Conversation
As described in #49891, DDP + Pipe doesn't work with find_unused_parameters. This PR adds a simple fix to enable this functionality. This only currently works for Pipe within a single host and needs to be re-worked once we support cross host Pipe. Differential Revision: [D25719922](https://our.internmc.facebook.com/intern/diff/D25719922/) [ghstack-poisoned]
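For context, here is a minimal single-host sketch of the pattern this PR enables; the backend choice, device layout, and tensor shapes are assumptions for illustration, not taken from the PR:

```python
import os
import torch
import torch.distributed as dist
import torch.distributed.rpc as rpc
import torch.nn as nn
from torch.distributed.pipeline.sync import Pipe
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    # Pipe requires the RPC framework to be initialized, even on a single host.
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Each DDP rank owns a two-stage pipeline on two local GPUs
    # (assumes 2 * world_size GPUs are available on this host).
    fc1 = nn.Linear(16, 16).cuda(2 * rank)
    fc2 = nn.Linear(16, 4).cuda(2 * rank + 1)
    model = Pipe(nn.Sequential(fc1, fc2), chunks=2)

    # What this PR enables: find_unused_parameters=True on a DDP-wrapped Pipe.
    ddp = DDP(model, find_unused_parameters=True)

    # Pipe.forward returns an RRef; unwrap it on this (owning) host.
    out = ddp(torch.randn(8, 16).cuda(2 * rank))
    out.local_value().sum().backward()

    rpc.shutdown()
```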
💊 CI failures summary and remediations
As of commit 9eb0386 (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns
The following CI failures do not appear to be due to upstream breakages:
Codecov Report

@@                Coverage Diff                 @@
##  gh/pritamdamania87/195/base  #49908  +/-  ##
=================================================
- Coverage      80.68%      80.67%      -0.01%
=================================================
  Files           1902        1899          -3
  Lines         206351      205952        -399
=================================================
- Hits          166496      166159        -337
+ Misses         39855       39793         -62
LGTM, just had a few questions inline
torch/nn/parallel/distributed.py
if RPC_AVAILABLE and isinstance(self.module, Pipe):
    # Unwrap RRef to get real output for Pipe.
    # TODO: Needs to be reworked for cross host pipelining.
    self.reducer.prepare_for_backward(list(_find_tensors(output.local_value())))
Is the RRef always guaranteed to be owned by the node that this runs on? How exactly do we guarantee this?
Also, is this the only special case where we'll have to unwrap to get the actual tensors needed in `prepare_for_backward`? I'm a bit concerned about coupling Pipe too tightly with DDP, when they should probably be as independent as they can be (i.e., ideally DDP won't have to deal with pipe-specific logic). Could we eventually refactor to make that so?
> Is the RRef always guaranteed to be owned by the node that this runs on? How exactly do we guarantee this?
This kinda works right now since Pipe is single host only, but as I mentioned in the comment above, this needs to be reworked for cross host pipelining.
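To make that single-host invariant explicit, here is a hedged sketch using the public RRef API; `unwrap_local` is a hypothetical helper, not code from this PR:

```python
from torch.distributed.rpc import RRef

def unwrap_local(rref: RRef):
    # Hypothetical helper: with Pipe confined to a single host, the output
    # RRef is always owned by the caller, so local_value() involves no RPC.
    # A cross-host Pipe would need to handle the non-owner case instead
    # (e.g. via rref.to_here()).
    assert rref.is_owner(), "non-owner RRef: cross-host pipelining not supported yet"
    return rref.local_value()
```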
> I'm a bit concerned about too tightly coupling Pipe with DDP, when they should probably be as independent as they can be (i.e., ideally DDP won't have to deal with pipe-specific logic). Could we eventually refactor to make that so?
This is a good point, and when I thought about it a bit more, I think we can make this independent of Pipe for now and also more generic: we can enhance `_find_tensors` to unwrap a local RRef and look for tensors inside whenever it encounters one.
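A minimal sketch of that idea, assuming a recursive traversal over containers; this is illustrative, not the merged `_find_tensors` implementation:

```python
import torch
from torch.distributed.rpc import RRef

def find_tensors(obj):
    # Recursively yield every tensor reachable from obj.
    if isinstance(obj, RRef) and obj.is_owner():
        # Unwrap a locally owned RRef and search the value it holds.
        yield from find_tensors(obj.local_value())
    elif isinstance(obj, torch.Tensor):
        yield obj
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            yield from find_tensors(item)
    elif isinstance(obj, dict):
        for value in obj.values():
            yield from find_tensors(value)
```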
@rohan-varma Updated the PR, could you take another look? Thanks!
Looks good, agreed that enhancing `_find_tensors` is a better approach. It looks like `test_backward_node_failure_python_udf` is failing on this PR, though it's probably flakiness and unrelated.
This pull request has been merged in f39f258.
Pipe + DDP has a few issues:

1) With static graph, gradients are not synchronized on the first backward pass (i.e. the delay allreduce is not run); broken since #55248.
2) When find_unused_parameters=True, gradient synchronization also does not happen; broken since #57081.

The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not invoke the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP with `Pipe`. This is because we run `backward` on the `rref.local_value()`, which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908 (see the sketch below).

To test: all tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is that all ranks received the same model inputs, so even if gradient synchronization did not occur, grads would still match because the model is the same on all ranks (guaranteed by DDP). Fixed the tests to use different inputs across ranks.

Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/)
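A hedged sketch of that unwrap-and-reconstruct idea; `run_ddp_sink` and its `sink_apply` argument are hypothetical names standing in for the real `DDPSink` plumbing in torch/nn/parallel/distributed.py:

```python
from torch.distributed.rpc import RRef

def run_ddp_sink(output, sink_apply):
    # Hypothetical wrapper: if the module output is an RRef (as with Pipe),
    # applying the sink to the RRef itself would skip DDPSink's custom
    # backward. Instead, unwrap the local value, record it in autograd via
    # the sink, and rewrap it so callers still see an RRef.
    if isinstance(output, RRef):
        recorded = sink_apply(output.local_value())
        # Constructing an RRef locally requires RPC to be initialized.
        return RRef(recorded)
    return sink_apply(output)
```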
…ic graph" Pipe + DDP has a few issues: 1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since #55248 2) when find_unused_parameters=True, also does not result in gradient synchronization. does not work since #57081 The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908. to test: All tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks. Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/) [ghstack-poisoned]
Pipe + DDP has a few issues: 1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since #55248 2) when find_unused_parameters=True, also does not result in gradient synchronization. does not work since #57081 The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908. to test: All tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks. Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/) [ghstack-poisoned]
…ic graph" Pipe + DDP has a few issues: 1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since #55248 2) when find_unused_parameters=True, also does not result in gradient synchronization. does not work since #57081 The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908. to test: All tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks. Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/) [ghstack-poisoned]
Pipe + DDP has a few issues: 1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since #55248 2) when find_unused_parameters=True, also does not result in gradient synchronization. does not work since #57081 The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908. to test: All tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks. Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/) [ghstack-poisoned]
…ic graph" Pipe + DDP has a few issues: 1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since #55248 2) when find_unused_parameters=True, also does not result in gradient synchronization. does not work since #57081 The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908. to test: All tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks. Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/) [ghstack-poisoned]
Pipe + DDP has a few issues: 1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since #55248 2) when find_unused_parameters=True, also does not result in gradient synchronization. does not work since #57081 The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908. to test: All tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks. Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/) [ghstack-poisoned]
…ic graph" Pipe + DDP has a few issues: 1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since #55248 2) when find_unused_parameters=True, also does not result in gradient synchronization. does not work since #57081 The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908. to test: All tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks. Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/) [ghstack-poisoned]
Pipe + DDP has a few issues: 1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since #55248 2) when find_unused_parameters=True, also does not result in gradient synchronization. does not work since #57081 The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908. to test: All tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks. Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/) [ghstack-poisoned]
Summary: Pull Request resolved: #60118 Pipe + DDP has a few issues: 1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since #55248 2) when find_unused_parameters=True, also does not results in gradient synchronization. does not work since #57081 The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908. to test: All tests in pipe_with_ddp_test pass. The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks. ghstack-source-id: 131688187 Test Plan: CI Reviewed By: pritamdamania87 Differential Revision: D29167283 fbshipit-source-id: fe62310db2dc6de8519eb361b1df8ae4dfce3ab8