Fix Pipe + DDP for unused parameters, static graph #60118
Commits on Jun 16, 2021
-
Fix Pipe + DDP for unused parameters, static graph
Pipe + DDP has a few issues:

1) With static graph, gradients are not synchronized on the first backward pass (i.e. the delayed allreduce is not run). Broken since #55248.
2) With `find_unused_parameters=True`, gradient synchronization also does not happen. Broken since #57081.

The root cause of both is that calling `DDPSink.apply(output_tensor)` does not invoke the custom `backward` of `DDPSink` when `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP under `Pipe`. The backward pass runs on `rref.local_value()`, which does not have this autograd recording. To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908.

To test: all tests in pipe_with_ddp_test pass. These tests did not catch the bug earlier because all ranks received the same model inputs, so even without gradient synchronization the grads still matched (the model is identical on all ranks, as guaranteed by DDP). The tests have been updated to use different inputs across ranks.

Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/)

[ghstack-poisoned]
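To make the failure mode concrete, here is a minimal single-process sketch. `Sink` and `Ref` below are hypothetical stand-ins for `DDPSink` and the `OwnerRRef` returned by `Pipe`, not the actual PyTorch internals; the point is only that a custom `backward` fires when the sink node sits on the tensor the backward pass actually traverses, which is why the fix applies the sink to the unwrapped local value and rebuilds the RRef around the result.

```python
import torch

class Sink(torch.autograd.Function):
    """Identity in forward; flips a flag in backward. A stand-in for DDPSink,
    whose custom backward triggers DDP's delayed allreduce."""
    fired = False

    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, grad):
        Sink.fired = True
        return grad

class Ref:
    """Stand-in for an RRef: wraps a value and exposes it via local_value()."""
    def __init__(self, value):
        self._value = value

    def local_value(self):
        return self._value

x = torch.ones(3, requires_grad=True)
out = x * 2

# Broken pattern: the sink is applied, but the ref still wraps the original
# output and backward runs on ref.local_value(). That graph never reaches the
# Sink node, so its custom backward is skipped.
ref = Ref(out)
_unused = Sink.apply(out)
ref.local_value().sum().backward(retain_graph=True)
print(Sink.fired)  # False

# Fixed pattern: unwrap, pass the actual output through the sink, and rebuild
# the ref around the sink's output so backward traverses the Sink node.
Sink.fired = False
ref = Ref(Sink.apply(out))
ref.local_value().sum().backward()
print(Sink.fired)  # True
```

In the actual change this corresponds, roughly, to applying `DDPSink` to the unwrapped `rref.local_value()` and reconstructing the RRef afterwards, as the description above states.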
-
Update on "Fix Pipe + DDP for unused parameters, static graph"
-
Update on "Fix Pipe + DDP for unused parameters, static graph"
Commits on Jun 17, 2021
-
Update on "Fix Pipe + DDP for unused parameters, static graph"
-
Update on "Fix Pipe + DDP for unused parameters, static graph"