Fix Pipe + DDP for unused parameters, static graph #60118

Closed · 5 commits

Conversation

@rohan-varma (Member) commented on Jun 16, 2021

Stack from ghstack:

Pipe + DDP has a few issues:

  1. With static graph enabled, it does not synchronize gradients on the first backward pass (i.e., the delayed allreduce is not run). Broken since "enable static graph training in DDP" (#55248). A sketch of the affected setup follows this list.
  2. With find_unused_parameters=True, it likewise does not perform gradient synchronization. Broken since "[DDP] Support not all outputs used in loss calculation" (#57081).
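A hypothetical sketch of the Pipe-inside-DDP setup these issues concern (illustrative only, not the test's actual code; assumes a process that owns GPUs 0 and 1 and has already called init_process_group and init_rpc):

```python
import torch
import torch.nn as nn
from torch.distributed.pipeline.sync import Pipe
from torch.nn.parallel import DistributedDataParallel as DDP

# A two-stage pipeline split across two local GPUs.
model = nn.Sequential(nn.Linear(16, 16).to(0), nn.Linear(16, 8).to(1))
pipe = Pipe(model, chunks=2)
ddp = DDP(pipe, find_unused_parameters=True)  # case 2; case 1 enables static graph instead

out_rref = ddp(torch.randn(8, 16).to(0))  # Pipe returns an RRef, not a Tensor
loss = out_rref.local_value().sum()
loss.backward()  # without the fix, DDPSink's custom backward never fires here
```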

The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not invoke the custom `backward` of `DDPSink` when `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP with `Pipe`. This is because we run `backward` on `rref.local_value()`, which does not carry this autograd recording.
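For intuition: per the description above, DDPSink is a custom autograd Function whose backward must run for DDP's delayed work to happen. A self-contained toy of that mechanism (ToySink is illustrative, not DDP's actual implementation):

```python
import torch

class ToySink(torch.autograd.Function):
    """Stand-in for DDPSink: identity in forward, side effect in backward."""

    @staticmethod
    def forward(ctx, t):
        return t.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # In real DDP, this is where the delayed allreduce / unused-parameter
        # handling is triggered.
        print("ToySink.backward fired")
        return grad_output

x = torch.randn(4, requires_grad=True)
y = ToySink.apply(x)
y.sum().backward()  # prints "ToySink.backward fired"

# But if backward is instead run on a tensor that lacks this autograd
# recording (analogous to calling backward on rref.local_value()), the
# sink's backward never fires -- which is the bug described above.
x.grad = None
x.sum().backward()  # no print: the sink is not on this graph
```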

To fix, we unwrap the RRef before it reaches `DDPSink` and reconstruct it afterwards as needed, similar to the fix in #49908 (see the `_tree_flatten_with_rref` helper in the review thread below).

To test:
All tests in pipe_with_ddp_test pass.
The reason these tests did not catch the errors earlier is that all ranks received the same model inputs, so even when gradient synchronization did not occur, grads still matched across ranks because the model is identical on all ranks (guaranteed by DDP). Fixed the tests to use different inputs across ranks, as sketched below.
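A hypothetical sketch of the per-rank input change (illustrative only, not the test's exact code):

```python
import torch
import torch.distributed as dist

def make_rank_distinct_inputs(batch_size=8, dim=16):
    # Seed the generator with the rank so each process draws different data.
    # Identical inputs would mask a broken allreduce: identical models plus
    # identical inputs produce identical local gradients on every rank.
    rank = dist.get_rank()  # assumes the process group is initialized
    gen = torch.Generator().manual_seed(1234 + rank)
    return torch.randn(batch_size, dim, generator=gen)
```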

Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/)

@facebook-github-bot added the `oncall: distributed` label on Jun 16, 2021
@facebook-github-bot (Contributor) commented on Jun 16, 2021

💊 CI failures summary and remediations

As of commit 1d9a066 (more details on the Dr. CI page and at hud.pytorch.org/pr/60118):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



@rohan-varma (Member, Author) commented:
`test_ddp_logging_data_cpu` is known to be flaky (#60130).

@pritamdamania87 (Contributor) left a comment:

Thanks for fixing this!

Comment on lines 28 to 34
```python
def _tree_flatten_with_rref(output):
    # Pipe wraps its module output in an (Owner)RRef. Flatten the underlying
    # local value so that DDPSink operates on tensors that carry autograd
    # history, and remember whether the output needs re-wrapping afterwards.
    output_is_rref = RPC_AVAILABLE and isinstance(output, RRef)
    if output_is_rref:
        output_tensor_list, treespec = tree_flatten(output.local_value())
    else:
        output_tensor_list, treespec = tree_flatten(output)
    return output_tensor_list, treespec, output_is_rref
```

Probably add some comments here about this logic.
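(For context, the reconstruction counterpart presumably looks something like the sketch below; the name `_tree_unflatten_with_rref`, the import path, and the re-wrap into a plain `RRef` are assumptions by symmetry with the helper above, not code confirmed from this diff.)

```python
from torch.distributed.rpc import RRef
from torch.utils._pytree import tree_unflatten  # assumed import path

def _tree_unflatten_with_rref(output, treespec, output_is_rref):
    # Rebuild the structure that tree_flatten took apart, then re-wrap it in
    # an RRef when the original module output was one, so Pipe callers still
    # receive the type they expect.
    output = tree_unflatten(output, treespec)
    if output_is_rref:
        output = RRef(output)
    return output
```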

@facebook-github-bot (Contributor):
This pull request has been merged in acd914f.

@facebook-github-bot deleted the gh/rohan-varma/331/head branch on June 21, 2021.
Labels: cla signed · Merged · oncall: distributed