
Ensure DDP + Pipe works with find_unused_parameters. #49908

Closed

Conversation

pritamdamania87
Contributor

@pritamdamania87 pritamdamania87 commented Dec 29, 2020

Stack from ghstack:

As described in #49891, DDP + Pipe doesn't work with `find_unused_parameters`.

This PR adds a simple fix to enable this functionality. It currently only works for Pipe within a single host and needs to be reworked once we support cross-host Pipe.

Differential Revision: D25719922
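
For context, a minimal sketch of the combination this PR targets: a single-host Pipe wrapped in DDP with `find_unused_parameters=True`. The device layout, process-group backend, and setup details below are illustrative assumptions, not code from this PR:

```python
# Sketch: 2-stage Pipe on one host, wrapped in DDP with find_unused_parameters=True.
import os
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.distributed.rpc as rpc
from torch.distributed.pipeline.sync import Pipe
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    # Pipe requires the RPC framework to be initialized, even on a single host.
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Assumed layout: each rank owns two GPUs and splits the model across them.
    fc1 = nn.Linear(16, 8).cuda(2 * rank)
    fc2 = nn.Linear(8, 4).cuda(2 * rank + 1)
    pipe = Pipe(nn.Sequential(fc1, fc2), chunks=2)

    # The case that used to fail: DDP wrapping Pipe with find_unused_parameters=True,
    # since Pipe's forward returns an RRef rather than plain tensors.
    model = DDP(pipe, find_unused_parameters=True)

    out = model(torch.rand(8, 16).cuda(2 * rank))  # an RRef wrapping the output
    out.local_value().sum().backward()             # unwrap before calling backward
    rpc.shutdown()
```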

@facebook-github-bot
Contributor

facebook-github-bot commented Dec 29, 2020

💊 CI failures summary and remediations

As of commit 9eb0386 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-CircleCI failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jan 08 04:07:50 AssertionError: False is not true : Scalars failed to compare as equal! Comparing -11 and 0 gives a difference of 11, but the allowed difference with rtol=0 and atol=0 is only 0!
Jan 08 04:07:50 ----------------------------------------------------------------------
Jan 08 04:07:50 Traceback (most recent call last):
Jan 08 04:07:50   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 280, in wrapper
Jan 08 04:07:50     self._join_processes(fn)
Jan 08 04:07:50   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 397, in _join_processes
Jan 08 04:07:50     self._check_return_codes(elapsed_time)
Jan 08 04:07:50   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 443, in _check_return_codes
Jan 08 04:07:50     i, first_process.exitcode, p.exitcode
Jan 08 04:07:50   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1225, in assertEqual
Jan 08 04:07:50     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
Jan 08 04:07:50 AssertionError: False is not true : Scalars failed to compare as equal! Comparing -11 and 0 gives a difference of 11, but the allowed difference with rtol=0 and atol=0 is only 0!
Jan 08 04:07:50 Expect process 1 exit code to match Process 0 exit code of 0, but got -11
Jan 08 04:07:50 
Jan 08 04:07:50 ----------------------------------------------------------------------
Jan 08 04:07:50 Ran 380 tests in 811.357s
Jan 08 04:07:50 
Jan 08 04:07:50 FAILED (failures=1, skipped=40)
Jan 08 04:07:50 
Jan 08 04:07:50 Generating XML reports...
Jan 08 04:07:50 Generated XML report: test-reports/dist-gloo/TEST-TensorPipeDdpComparisonTestWithSpawn-20210108035418.xml
Jan 08 04:07:50 Generated XML report: test-reports/dist-gloo/TEST-TensorPipeDdpUnderDistAutogradTestWithSpawn-20210108035418.xml


@facebook-github-bot facebook-github-bot added the cla signed and oncall: distributed labels Dec 29, 2020
pritamdamania87 pushed a commit that referenced this pull request Dec 29, 2020
pritamdamania87 pushed a commit that referenced this pull request Dec 29, 2020
pritamdamania87 pushed a commit that referenced this pull request Dec 30, 2020
pritamdamania87 pushed a commit that referenced this pull request Jan 5, 2021
@codecov

codecov bot commented Jan 5, 2021

Codecov Report

Merging #49908 (49544a4) into gh/pritamdamania87/195/base (fbdb782) will decrease coverage by 0.00%.
The diff coverage is 41.86%.

@@                       Coverage Diff                       @@
##           gh/pritamdamania87/195/base   #49908      +/-   ##
===============================================================
- Coverage                        80.68%   80.67%   -0.01%     
===============================================================
  Files                             1902     1899       -3     
  Lines                           206351   205952     -399     
===============================================================
- Hits                            166496   166159     -337     
+ Misses                           39855    39793      -62     

Member

@rohan-varma rohan-varma left a comment

LGTM, just had a few questions inline

if RPC_AVAILABLE and isinstance(self.module, Pipe):
    # Unwrap RRef to get real output for Pipe.
    # TODO: Needs to be reworked for cross host pipelining.
    self.reducer.prepare_for_backward(list(_find_tensors(output.local_value())))
Member

Is the RRef always guaranteed to be owned by the node that this runs on? How exactly do we guarantee this?

Member

Also, is this the only special case that we're going to have where we have to unwrap to get the actual tensors needed in prepare_for_backward? I'm a bit concerned about too tightly coupling Pipe with DDP, when they should probably be as independent as they can be (i.e., ideally DDP won't have to deal with pipe-specific logic). Could we eventually refactor to make that so?

Contributor Author

> Is the RRef always guaranteed to be owned by the node that this runs on? How exactly do we guarantee this?

This kinda works right now since Pipe is single host only, but as I mentioned in the comment above, this needs to be reworked for cross host pipelining.

> I'm a bit concerned about too tightly coupling Pipe with DDP, when they should probably be as independent as they can be (i.e., ideally DDP won't have to deal with pipe-specific logic). Could we eventually refactor to make that so?

This is a good point, and when I thought about it a bit more, I think we can make this independent of Pipe for now and also more generic. What we can do is enhance `_find_tensors` to unwrap a local RRef and look for tensors inside it when one is encountered.
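
A minimal sketch of that idea (the real `_find_tensors` lives in `torch/nn/parallel/distributed.py`; the version below is an approximation for illustration, not the exact code that landed):

```python
import itertools
import torch
from torch.distributed.rpc import RRef

def _find_tensors(obj):
    """Recursively find all tensors contained in the specified object."""
    if isinstance(obj, RRef):
        # Unwrap a locally-owned RRef and look for tensors inside it.
        # Remote RRefs would need cross-host handling and are not covered here.
        if obj.is_owner():
            return _find_tensors(obj.local_value())
    if isinstance(obj, torch.Tensor):
        return [obj]
    if isinstance(obj, (list, tuple)):
        return itertools.chain(*map(_find_tensors, obj))
    if isinstance(obj, dict):
        return itertools.chain(*map(_find_tensors, obj.values()))
    return []
```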

pritamdamania87 pushed a commit that referenced this pull request Jan 8, 2021
@pritamdamania87
Contributor Author

@rohan-varma Updated the PR, could you take another look? Thanks!

Member

@rohan-varma rohan-varma left a comment

Looks good, agreed that enhancing _find_tensors is a better approach. It looks like test_backward_node_failure_python_udf is failing on this PR, though it's probably flakiness and unrelated.

@facebook-github-bot
Contributor

This pull request has been merged in f39f258.

@facebook-github-bot facebook-github-bot deleted the gh/pritamdamania87/195/head branch January 15, 2021 15:17
rohan-varma added a commit that referenced this pull request Jun 16, 2021
Pipe + DDP has a few issues:

1) with static graph, does not synchronize gradients on the first backward pass (i.e. the delayed allreduce is not run). Does not work since #55248
2) when find_unused_parameters=True, also does not result in gradient synchronization. Does not work since #57081


The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording.

To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908.

to test:
All tests in pipe_with_ddp_test pass.
The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks.

Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/)

[ghstack-poisoned]
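
A rough sketch of the unwrap-and-rewrap idea in the message above. `_Sink` is a stand-in for DDP's internal `DDPSink`, and `apply_sink` is a hypothetical helper; creating an `RRef` assumes the RPC framework is initialized:

```python
import torch
from torch.distributed.rpc import RRef

class _Sink(torch.autograd.Function):
    """Stand-in for DDPSink: identity forward, custom backward hook."""
    @staticmethod
    def forward(ctx, tensor):
        return tensor

    @staticmethod
    def backward(ctx, grad):
        # DDP would queue its delayed allreduce / unused-parameter handling here.
        return grad

def apply_sink(output):
    """Hypothetical helper: ensure the sink's backward is recorded even when the
    forward output is an RRef (as it is for Pipe)."""
    if isinstance(output, RRef) and output.is_owner():
        # Applying the sink to the RRef itself would skip its backward, since
        # autograd only tracks the tensor inside rref.local_value(). Unwrap it...
        sunk = _Sink.apply(output.local_value())
        # ...and re-wrap so callers still receive an RRef, as Pipe's API expects.
        return RRef(sunk)
    return _Sink.apply(output)
```
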
rohan-varma added a commit that referenced this pull request Jun 16, 2021
rohan-varma added a commit that referenced this pull request Jun 16, 2021
rohan-varma added a commit that referenced this pull request Jun 16, 2021
rohan-varma added a commit that referenced this pull request Jun 16, 2021
rohan-varma added a commit that referenced this pull request Jun 17, 2021
rohan-varma added a commit that referenced this pull request Jun 17, 2021
rohan-varma added a commit that referenced this pull request Jun 17, 2021
rohan-varma added a commit that referenced this pull request Jun 17, 2021
facebook-github-bot pushed a commit that referenced this pull request Jun 17, 2021
Summary:
Pull Request resolved: #60118

Pipe + DDP has a few issues:

1) with static graph, does not synchronize gradients on the first backward pass (i.e. the delayed allreduce is not run). Does not work since #55248
2) when find_unused_parameters=True, also does not result in gradient synchronization. Does not work since #57081

The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording.

To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908.

to test:
All tests in pipe_with_ddp_test pass.
The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks.
ghstack-source-id: 131688187

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D29167283

fbshipit-source-id: fe62310db2dc6de8519eb361b1df8ae4dfce3ab8
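
A hedged sketch of the test-hardening point above: feed each rank a different input so that a skipped allreduce shows up as diverging gradients. The helper name, shapes, and device layout are assumptions, not the actual test code:

```python
import torch
import torch.distributed as dist

def check_grad_sync(model, rank, world_size):
    # Different inputs per rank; identical inputs would mask a missing allreduce,
    # because identical replicas would then produce identical local gradients anyway.
    torch.manual_seed(rank)
    inp = torch.rand(8, 16).cuda(2 * rank)
    model(inp).local_value().sum().backward()  # Pipe output is an RRef

    # After a correct DDP backward, the averaged gradients must match on all ranks.
    for p in model.parameters():
        if p.grad is None:
            continue
        grad = p.grad.detach().cpu()
        gathered = [torch.empty_like(grad) for _ in range(world_size)]
        dist.all_gather(gathered, grad)
        for other in gathered:
            assert torch.allclose(grad, other), "gradients diverged across ranks"
```
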
Labels
cla signed, Merged, oncall: distributed