Fix Pipe + DDP for unused parameters, static graph #60118

Closed · 5 commits

Conversation

@rohan-varma (Member) commented on Jun 16, 2021

Stack from ghstack:

Pipe + DDP has a few issues:

  1. With static graph enabled, it does not synchronize gradients on the first backward pass (i.e., the delayed allreduce is not run). Broken since "enable static graph training in DDP" (#55248). A sketch of the affected setup follows this list.
  2. With find_unused_parameters=True, it likewise does not perform gradient synchronization. Broken since "[DDP] Support not all outputs used in loss calculation" (#57081).
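A hypothetical sketch of the Pipe-inside-DDP setup these issues concern (illustrative only, not the test's actual code; assumes a process that owns GPUs 0 and 1 and has already called init_process_group and init_rpc):

```python
import torch
import torch.nn as nn
from torch.distributed.pipeline.sync import Pipe
from torch.nn.parallel import DistributedDataParallel as DDP

# A two-stage pipeline split across two local GPUs.
model = nn.Sequential(nn.Linear(16, 16).to(0), nn.Linear(16, 8).to(1))
pipe = Pipe(model, chunks=2)
ddp = DDP(pipe, find_unused_parameters=True)  # case 2; case 1 enables static graph instead

out_rref = ddp(torch.randn(8, 16).to(0))  # Pipe returns an RRef, not a Tensor
loss = out_rref.local_value().sum()
loss.backward()  # without the fix, DDPSink's custom backward never fires here
```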

The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not invoke the custom `backward` of `DDPSink` when `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP with `Pipe`. This is because we run `backward` on `rref.local_value()`, which does not carry this autograd recording.
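For intuition: per the description above, DDPSink is a custom autograd Function whose backward must run for DDP's delayed work to happen. A self-contained toy of that mechanism (ToySink is illustrative, not DDP's actual implementation):

```python
import torch

class ToySink(torch.autograd.Function):
    """Stand-in for DDPSink: identity in forward, side effect in backward."""

    @staticmethod
    def forward(ctx, t):
        return t.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # In real DDP, this is where the delayed allreduce / unused-parameter
        # handling is triggered.
        print("ToySink.backward fired")
        return grad_output

x = torch.randn(4, requires_grad=True)
y = ToySink.apply(x)
y.sum().backward()  # prints "ToySink.backward fired"

# But if backward is instead run on a tensor that lacks this autograd
# recording (analogous to calling backward on rref.local_value()), the
# sink's backward never fires -- which is the bug described above.
x.grad = None
x.sum().backward()  # no print: the sink is not on this graph
```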

To fix, we unwrap the RRef before it reaches `DDPSink` and reconstruct it afterwards as needed, similar to the fix in #49908 (see the `_tree_flatten_with_rref` helper in the review thread below).

To test:
All tests in pipe_with_ddp_test pass.
The reason these tests did not catch the errors earlier is that all ranks received the same model inputs, so even when gradient synchronization did not occur, grads still matched across ranks because the model is identical on all ranks (guaranteed by DDP). Fixed the tests to use different inputs across ranks, as sketched below.
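A hypothetical sketch of the per-rank input change (illustrative only, not the test's exact code):

```python
import torch
import torch.distributed as dist

def make_rank_distinct_inputs(batch_size=8, dim=16):
    # Seed the generator with the rank so each process draws different data.
    # Identical inputs would mask a broken allreduce: identical models plus
    # identical inputs produce identical local gradients on every rank.
    rank = dist.get_rank()  # assumes the process group is initialized
    gen = torch.Generator().manual_seed(1234 + rank)
    return torch.randn(batch_size, dim, generator=gen)
```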

Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/)

@facebook-github-bot added the `oncall: distributed` label on Jun 16, 2021
@facebook-github-bot (Contributor) commented on Jun 16, 2021

💊 CI failures summary and remediations

As of commit 1d9a066 (more details on the Dr. CI page and at hud.pytorch.org/pr/60118):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



@rohan-varma (Member, Author) commented:
`test_ddp_logging_data_cpu` is known to be flaky (#60130).

@pritamdamania87 (Contributor) left a comment:

Thanks for fixing this!

Comment on lines 28 to 34
```python
def _tree_flatten_with_rref(output):
    # Pipe wraps its module output in an (Owner)RRef. Flatten the underlying
    # local value so that DDPSink operates on tensors that carry autograd
    # history, and remember whether the output needs re-wrapping afterwards.
    output_is_rref = RPC_AVAILABLE and isinstance(output, RRef)
    if output_is_rref:
        output_tensor_list, treespec = tree_flatten(output.local_value())
    else:
        output_tensor_list, treespec = tree_flatten(output)
    return output_tensor_list, treespec, output_is_rref
```

Probably add some comments here about this logic.
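(For context, the reconstruction counterpart presumably looks something like the sketch below; the name `_tree_unflatten_with_rref`, the import path, and the re-wrap into a plain `RRef` are assumptions by symmetry with the helper above, not code confirmed from this diff.)

```python
from torch.distributed.rpc import RRef
from torch.utils._pytree import tree_unflatten  # assumed import path

def _tree_unflatten_with_rref(output, treespec, output_is_rref):
    # Rebuild the structure that tree_flatten took apart, then re-wrap it in
    # an RRef when the original module output was one, so Pipe callers still
    # receive the type they expect.
    output = tree_unflatten(output, treespec)
    if output_is_rref:
        output = RRef(output)
    return output
```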

@facebook-github-bot (Contributor):
This pull request has been merged in acd914f.

@facebook-github-bot deleted the gh/rohan-varma/331/head branch on June 21, 2021.
Labels: cla signed · Merged · oncall: distributed