Avoid using default process group in ProcessGroupAgent. #39909
As described in #33583, ProcessGroupAgent initializes the default process group, which causes problems if the user also initializes the default process group themselves: either the RPC initialization fails or the user's process group initialization fails.

To avoid this, I've changed the ProcessGroupAgent init to create its own ProcessGroupGloo and not use the default process group at all.

Closes: #33583

Differential Revision: [D22011868](https://our.internmc.facebook.com/intern/diff/D22011868/)
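The conflict described above can be illustrated with a toy model. This is only a sketch: `ToyGroup`, `ToyRuntime`, and the `init_agent_*` functions are hypothetical names invented for illustration and are not part of the PyTorch API. The model assumes a single "default" slot that can be claimed only once, and contrasts an agent that claims that slot with an agent that builds a private group.

```python
class ToyGroup:
    """Hypothetical stand-in for a process group; only records who created it."""

    def __init__(self, owner):
        self.owner = owner


class ToyRuntime:
    """Hypothetical model of a process with a single default-group slot."""

    def __init__(self):
        self.default_group = None

    def init_default_group(self, owner):
        # The default slot may be claimed only once.
        if self.default_group is not None:
            raise RuntimeError(
                "default group already initialized by {}".format(
                    self.default_group.owner
                )
            )
        self.default_group = ToyGroup(owner)
        return self.default_group


def init_agent_shared(runtime):
    """Modelled old behaviour: the agent claims the default slot."""
    return runtime.init_default_group("rpc agent")


def init_agent_private(runtime):
    """Modelled new behaviour: the agent builds its own group
    and leaves the default slot untouched."""
    return ToyGroup("rpc agent")


def try_both(init_agent):
    """The user initializes first, then the agent.

    Returns the error message if the agent init fails, otherwise None.
    """
    runtime = ToyRuntime()
    runtime.init_default_group("user")
    try:
        init_agent(runtime)
    except RuntimeError as exc:
        return str(exc)
    return None
```

In this model, `try_both(init_agent_shared)` reports a conflict, while `try_both(init_agent_private)` succeeds because the agent no longer competes for the shared slot.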
This is great! Thanks for adding this fix! We will need to modify some DDP+RPC tests accordingly.
try:
    group = dc10d._get_default_group()
remove the unused dc10d import?
if (rank != -1) and (rank != group.rank()):
    raise RuntimeError(
        "rank argument {} doesn't match pg rank {}".format(rank, group.rank())
    )
if (world_size != -1) and (world_size != group.size()):
    raise RuntimeError(
        "world_size argument {} doesn't match pg size {}".format(
            world_size, group.size()
        )
Is it worth deduplicating this code with the equivalent checks in the PG backend?
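One possible shape for such deduplication, as a minimal sketch: a shared helper that both call sites could use. The name `check_rank_and_world_size` and the `FakeGroup` class are hypothetical and exist only for illustration. The sketch assumes, based on the snippet above, that `-1` means "not specified" and that the group exposes `rank()` and `size()` methods.

```python
def check_rank_and_world_size(rank, world_size, group):
    """Raise RuntimeError if an explicitly given rank or world_size
    disagrees with the group. A value of -1 is treated as "not specified"."""
    if rank != -1 and rank != group.rank():
        raise RuntimeError(
            "rank argument {} doesn't match pg rank {}".format(rank, group.rank())
        )
    if world_size != -1 and world_size != group.size():
        raise RuntimeError(
            "world_size argument {} doesn't match pg size {}".format(
                world_size, group.size()
            )
        )


class FakeGroup:
    """Hypothetical stand-in exposing only rank() and size()."""

    def __init__(self, rank, size):
        self._rank = rank
        self._size = size

    def rank(self):
        return self._rank

    def size(self):
        return self._size
```

With `FakeGroup(rank=1, size=4)`, passing `-1` for both arguments or the matching values `1` and `4` returns silently, while a mismatching rank such as `0` raises `RuntimeError`.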
Test failure looks relevant
LGTM! Thanks!
@mrshenli Looks like CI is clean now.
This pull request has been merged in 145df30.
Summary:
Pull Request resolved: pytorch#39909

As described in pytorch#33583, ProcessGroupAgent initializes the default process group, which causes problems if the user also initializes the default process group themselves: either the RPC initialization fails or the user's process group initialization fails.

To avoid this, I've changed the ProcessGroupAgent init to create its own ProcessGroupGloo and not use the default process group at all.

Closes: pytorch#33583
ghstack-source-id: 105953303

Test Plan: waitforbuildbot

Differential Revision: D22011868

fbshipit-source-id: 7346a3fcb2821a0bc08e0bdc0625947abb5ae16f
…40860)

Summary:
Pull Request resolved: #40860

It turns out that the `@_skip_if_tensorpipe_agent` decorator was written in such a way that it accidentally caused the test to become a no-op (and thus always succeed) for all agents. What this means is that all tests wrapped by that decorator were never ever being run, for any agent.

My understanding of the root cause is that the following code:
```
@_skip_if_tensorpipe_agent
def test_foo(self):
    self.assertEqual(2 + 2, 4)
```
ended up behaving somewhat like this:
```
def test_foo(self):
    def original_test_func(self):
        self.assertEqual(2 + 2, 4)
    return unittest.skipIf(self.agent == "TENSORPIPE")(original_test_func)
```
which means that the test body of the decorated method was not actually calling the original test method.

This issue probably came from the `@_skip_if_tensorpipe_agent` being copy-pasted from `requires_process_group_agent` (which, however, is not a decorator but rather a decorator *factory*). An unfortunate naming (calling `decorator` what was in fact the wrapped method) then hindered readability and hid the issue.

Note that a couple of tests had become legitimately broken in the meantime and no one had noticed. The breakages have been introduced in #39909 (a.k.a., D22011868 (145df30)).

ghstack-source-id: 107045916

Test Plan: Discovered this as part of my refactoring, in D22332611. After fixing the decorator two tests started breaking (for real reasons). After fixing them all is passing.

Differential Revision: D22332611

fbshipit-source-id: f88ca5574675fdb3cd09a9f6da12bf1e25203a14
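The failure mode described in this commit message can be reproduced in isolation. The sketch below is not the actual PyTorch test code: `broken_skip_if_tensorpipe`, `fixed_skip_if_tensorpipe`, the demo test classes, and the `calls` list are hypothetical names, and the "fixed" variant is only one plausible way to write such a decorator, not necessarily the fix that landed. The broken wrapper builds a skip-decorated function and returns it without ever calling it, so the test body never runs.

```python
import functools
import unittest

calls = []  # records which test bodies actually executed


def broken_skip_if_tensorpipe(test_func):
    # Mirrors the described bug: the wrapper returns a (possibly
    # skip-decorated) function instead of calling the test body.
    def wrapper(self):
        return unittest.skipIf(
            self.agent == "TENSORPIPE", "not supported"
        )(test_func)

    return wrapper


def fixed_skip_if_tensorpipe(test_func):
    # One possible fix: decide at call time, then really call the body.
    @functools.wraps(test_func)
    def wrapper(self, *args, **kwargs):
        if self.agent == "TENSORPIPE":
            self.skipTest("not supported by the TensorPipe agent")
        return test_func(self, *args, **kwargs)

    return wrapper


class ProcessGroupDemoTest(unittest.TestCase):
    agent = "PROCESS_GROUP"

    @broken_skip_if_tensorpipe
    def test_broken(self):
        calls.append("broken")
        self.assertEqual(2 + 2, 4)

    @fixed_skip_if_tensorpipe
    def test_fixed(self):
        calls.append("fixed")
        self.assertEqual(2 + 2, 4)


class TensorPipeDemoTest(ProcessGroupDemoTest):
    agent = "TENSORPIPE"


def body_ran(case_cls, method_name):
    """Call one test method directly and report whether its body executed."""
    calls.clear()
    try:
        getattr(case_cls(method_name), method_name)()
    except unittest.SkipTest:
        pass  # a genuine skip: the body is intentionally not run
    return bool(calls)
```

Calling `body_ran(ProcessGroupDemoTest, "test_broken")` shows that the body is silently skipped even for the agent that should run it, which is why real breakages could go unnoticed; with the fixed decorator the body runs for the process group agent and is skipped only for TensorPipe.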
Stack from ghstack:
As described in #33583,
ProcessGroupAgent initializes the default process group and this causes issues
if the user initializes the default process group themselves. Either the RPC
initialization would fail or the user's process group initialization would
fail.
To avoid this, I've changed ProcessGroupAgent init to create its own
ProcessGroupGloo and not use the default one at all.
Closes: #33583
Differential Revision: D22011868