
Avoid using default process group in ProcessGroupAgent. #39909

Closed

Conversation

@pritamdamania87 (Contributor) commented Jun 12, 2020

Stack from ghstack:

As described in #33583, ProcessGroupAgent initializes the default process group, which causes issues if the user initializes the default process group themselves: either the RPC initialization or the user's process group initialization fails.

To avoid this, I've changed ProcessGroupAgent init to create its own
ProcessGroupGloo and not use the default one at all.

Closes: #33583

Differential Revision: D22011868
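The failure mode can be illustrated with a toy singleton, loosely modeled on the "one default group per process" constraint; the function names below are hypothetical stand-ins, not the torch.distributed API:

```python
# Toy model of a process-group registry with a single "default" slot,
# loosely mirroring why two components cannot both own the default group.
_default_group = None

def init_default_group(owner):
    """Claim the default slot; fails if someone already owns it."""
    global _default_group
    if _default_group is not None:
        raise RuntimeError(
            f"default process group already initialized by {_default_group!r}"
        )
    _default_group = owner
    return owner

def new_group(owner):
    """Create a private group without touching the default slot."""
    return owner

# Before the fix: the RPC agent claimed the default slot, so a user who
# had already initialized the default group would hit a RuntimeError.
init_default_group("user-ddp")
try:
    init_default_group("rpc-agent")   # old ProcessGroupAgent behavior
except RuntimeError as e:
    conflict = str(e)

# After the fix: the agent creates its own private group, so both coexist.
rpc_group = new_group("rpc-agent-private")
```

The order of failure flips depending on who initializes first, which is why the PR description says either the RPC initialization or the user's initialization could be the one that fails.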

pritamdamania87 pushed a commit that referenced this pull request Jun 12, 2020
ghstack-source-id: 105754232
Pull Request resolved: #39909

dr-ci bot commented Jun 12, 2020

💊 CI failures summary and remediations

As of commit ff7714c (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



@mrshenli (Contributor) left a comment


This is great! Thanks for adding this fix! We will need to modify some DDP+RPC tests accordingly.


```
try:
    group = dc10d._get_default_group()
```

remove the unused dc10d import?

Comment on lines 183 to 191
```
if (rank != -1) and (rank != group.rank()):
    raise RuntimeError(
        "rank argument {} doesn't match pg rank {}".format(rank, group.rank())
    )
if (world_size != -1) and (world_size != group.size()):
    raise RuntimeError(
        "world_size argument {} doesn't match pg size {}".format(
            world_size, group.size()
        )
    )
```
Is it worth deduplicating this code with the version in the PG backend?
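The suggested deduplication could be a small shared helper; a minimal sketch, assuming plain-integer arguments rather than the actual PyTorch group object (the helper name is hypothetical):

```python
def validate_rank_and_world_size(group_rank, group_size, rank=-1, world_size=-1):
    """Check user-supplied rank/world_size against the group; -1 means 'unset'."""
    if rank != -1 and rank != group_rank:
        raise RuntimeError(
            "rank argument {} doesn't match pg rank {}".format(rank, group_rank)
        )
    if world_size != -1 and world_size != group_size:
        raise RuntimeError(
            "world_size argument {} doesn't match pg size {}".format(
                world_size, group_size
            )
        )

# Matching values pass silently; a mismatch raises with a descriptive message.
validate_rank_and_world_size(0, 4, rank=0, world_size=4)
try:
    validate_rank_and_world_size(0, 4, rank=1)
except RuntimeError as e:
    err = str(e)
```

Both the RPC agent and the PG backend could then call the same helper instead of repeating the two checks.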

pritamdamania87 pushed a commit that referenced this pull request Jun 12, 2020
Pull Request resolved: #39909
ghstack-source-id: 105814774
pritamdamania87 pushed a commit that referenced this pull request Jun 13, 2020
Pull Request resolved: #39909
ghstack-source-id: 105839679
@mrshenli (Contributor)

Test failure looks relevant

```
Jun 13 05:43:17 ======================================================================
Jun 13 05:43:17 ERROR [100.107s]: test_backward_ddp_outside (__main__.TestDdpUnderDistAutograd)
Jun 13 05:43:17 ----------------------------------------------------------------------
Jun 13 05:43:17 Traceback (most recent call last):
Jun 13 05:43:17   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 204, in wrapper
Jun 13 05:43:17     self._join_processes(fn)
Jun 13 05:43:17   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 306, in _join_processes
Jun 13 05:43:17     self._check_return_codes(elapsed_time)
Jun 13 05:43:17   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 344, in _check_return_codes
Jun 13 05:43:17     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time))
Jun 13 05:43:17 RuntimeError: Process 0 terminated or timed out after 100.07950329780579 seconds
```

pritamdamania87 pushed a commit that referenced this pull request Jun 16, 2020
Pull Request resolved: #39909
ghstack-source-id: 105949167
pritamdamania87 pushed a commit that referenced this pull request Jun 16, 2020
Pull Request resolved: #39909
ghstack-source-id: 105953303
@mrshenli (Contributor) left a comment


LGTM! Thanks!

@pritamdamania87 (Contributor, Author)

@mrshenli Looks like CI is clean now.

@facebook-github-bot (Contributor)

This pull request has been merged in 145df30.

xwang233 pushed a commit to xwang233/pytorch that referenced this pull request Jun 20, 2020
Summary: Pull Request resolved: pytorch#39909

Test Plan: waitforbuildbot

Differential Revision: D22011868

fbshipit-source-id: 7346a3fcb2821a0bc08e0bdc0625947abb5ae16f
@facebook-github-bot facebook-github-bot deleted the gh/pritamdamania87/128/head branch June 20, 2020 14:16
lw added a commit that referenced this pull request Jul 1, 2020
It turns out that the `@_skip_if_tensorpipe_agent` decorator was written in such a way that it accidentally turned the decorated test into a no-op (which thus always succeeded) for all agents. In other words, every test wrapped by that decorator was never actually run, for any agent.

My understanding of the root cause is that the following code:
```
@_skip_if_tensorpipe_agent
def test_foo(self):
    self.assertEqual(2 + 2, 4)
```
ended up behaving somewhat like this:
```
def test_foo(self):
    def original_test_func(self):
        self.assertEqual(2 + 2, 4)
    return unittest.skipIf(self.agent == "TENSORPIPE")(original_test_func)
```
which means that the test body of the decorated method was not actually calling the original test method.

This issue probably came from `@_skip_if_tensorpipe_agent` being copy-pasted from `@requires_process_group_agent` (which, however, is not a decorator but rather a decorator *factory*). An unfortunate name (the wrapped method was called `decorator`) further hindered readability and hid the issue.
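A self-contained sketch of the buggy shape next to a working decorator; the class and `flag` attribute are hypothetical, with `flag` standing in for the `self.agent == "TENSORPIPE"` check:

```python
import unittest
from functools import wraps

def _skip_if_flag(test_func):
    """Buggy shape: calling the 'test' merely builds and returns a
    skip-wrapped function -- the original test body never executes."""
    def wrapper(self):
        return unittest.skipIf(self.flag, "skipped")(test_func)
    return wrapper

def skip_if_flag(test_func):
    """Fixed shape: decide at call time, then actually run the test."""
    @wraps(test_func)
    def wrapper(self, *args, **kwargs):
        if self.flag:
            raise unittest.SkipTest("skipped for this agent")
        return test_func(self, *args, **kwargs)
    return wrapper

class Demo:
    flag = False      # stand-in for `self.agent == "TENSORPIPE"`
    calls = []

    @_skip_if_flag
    def test_buggy(self):
        Demo.calls.append("buggy body ran")

    @skip_if_flag
    def test_fixed(self):
        Demo.calls.append("fixed body ran")

d = Demo()
d.test_buggy()   # returns a function; the body is never executed
d.test_fixed()   # the body actually runs
```

Because the skip condition depends on `self`, the fix has to defer the check to call time (raising `unittest.SkipTest`) rather than applying `unittest.skipIf` at decoration time.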

Note that a couple of tests had become legitimately broken in the meantime and no one had noticed. The breakages were introduced in #39909 (a.k.a. D22011868).

Differential Revision: [D22332611](https://our.internmc.facebook.com/intern/diff/D22332611/)

lw added a commit that referenced this pull request Jul 2, 2020
facebook-github-bot pushed a commit that referenced this pull request Jul 3, 2020
Pull Request resolved: #40860
ghstack-source-id: 107045916
Test Plan: Discovered this as part of my refactoring, in D22332611. After fixing the decorator, two tests started breaking (for real reasons); after fixing them, all tests pass.

Differential Revision: D22332611

fbshipit-source-id: f88ca5574675fdb3cd09a9f6da12bf1e25203a14