[c10d] Increment sequence numbers on collectives. #55718

Closed
wants to merge 11 commits into from

rohan-varma (Member) commented on Apr 9, 2021

Stack from ghstack:

Increments sequence numbers whenever ProcessGroupGloo::enqueue or
ProcessGroupNCCL::collective runs, since every collective goes through one of
these calls. The next step will be to log these numbers along with other
collective info in debug mode and to integrate them with the process group
wrapper.

Differential Revision: D27690690
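
As a rough illustration of the idea in the description, the sketch below bumps a counter at the one entry point that every collective passes through. It is a toy in plain Python, not the c10d code: `ToyProcessGroup`, `_enqueue`, and the method names are hypothetical stand-ins chosen for this example.

```python
class ToyProcessGroup:
    """Illustrative stand-in for a process group; not the c10d implementation."""

    def __init__(self):
        self._sequence_number = 0
        self._queue = []

    def _enqueue(self, op_name, payload):
        # Single choke point: every collective below calls this method,
        # so the counter advances exactly once per collective.
        self._sequence_number += 1
        self._queue.append((self._sequence_number, op_name, payload))
        return self._sequence_number

    def allreduce(self, values):
        return self._enqueue("allreduce", list(values))

    def broadcast(self, value, src=0):
        return self._enqueue("broadcast", (value, src))

    def sequence_number(self):
        return self._sequence_number


pg = ToyProcessGroup()
pg.allreduce([1, 2, 3])
pg.broadcast(7)
print(pg.sequence_number())  # 2
```

Placing the increment in the shared entry point, rather than in each collective, means a newly added collective picks up the numbering without any extra code.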

facebook-github-bot (Contributor) commented on Apr 9, 2021

💊 CI failures summary and remediations

As of commit 5a61cfb (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-scanned failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build (1/1)

Step: "Build"

/builder/manywheel/build_libtorch.sh: line 188: hexdump: command not found
++ echo '/pytorch /pytorch /pytorch ~/project
5a61cfb02ab1c548cf0346b99002ce5d7d1f32c5'
++ [[ gcc5.4_cxx11-abi == *\c\x\x\1\1\-\a\b\i* ]]
++ LIBTORCH_ABI=cxx11-abi-
++ set -x
++ mkdir -p /tmp/libtorch_housecpu
+++ cat debug/libtorch_cpu.so.dbg
+++ gzip -c
+++ tail -c8
+++ hexdump -n4 -e '"%x"'
/builder/manywheel/build_libtorch.sh: line 188: hexdump: command not found
++ CRC32=


Exited with code exit status 1


This comment was automatically generated by Dr. CI.

facebook-github-bot added the "oncall: distributed" and "cla signed" labels on Apr 9, 2021
rohan-varma added a commit that referenced this pull request Apr 9, 2021
ghstack-source-id: 126216756
Pull Request resolved: #55718

if dist.get_world_size(process_group) > 2:
    # Test when certain ranks don't call collectives
    if dist.get_rank(process_group) not in [0, 1]:

It may be better to also include a non-contiguous rank, such as 3, in the list [0, 1].
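
To make the suggestion concrete, here is a small simulation of per-rank counters when some ranks skip a collective. It is only a sketch: `simulate_skipped_collective` is a hypothetical helper, and it assumes the listed ranks are the ones that skip the call, which the truncated hunk above does not show.

```python
def simulate_skipped_collective(world_size, skipping_ranks):
    """Return each rank's counter after one collective that some ranks skip.

    Assumption for illustration: ranks in `skipping_ranks` do not issue the
    collective, so their counters stay at 0 while the others advance to 1.
    """
    counters = {rank: 0 for rank in range(world_size)}
    for rank in counters:
        if rank not in skipping_ranks:
            counters[rank] += 1
    return counters


contiguous = simulate_skipped_collective(4, [0, 1])
non_contiguous = simulate_skipped_collective(4, [0, 1, 3])
print(contiguous)      # {0: 0, 1: 0, 2: 1, 3: 1}
print(non_contiguous)  # {0: 0, 1: 0, 2: 1, 3: 0}
```

With [0, 1] alone, a check written as `rank < 2` would behave the same as a membership test, so a non-contiguous rank such as 3 exercises the membership logic more thoroughly.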


@skip_if_lt_x_gpu(4)
@requires_nccl()
def test_sequence_num_incremented_nccl_subgroup(self):

Nit: Could you create a helper function to avoid the boilerplate here, which is the same as in test_sequence_num_incremented_gloo_subgroup? It seems that only one argument differs between these two tests.
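
One possible shape for such a helper is sketched below. The two test names come from this thread; everything else is an assumption for illustration: `FakeSubgroup` is a toy stand-in rather than a torch.distributed type, and `_check_sequence_num_incremented_subgroup` is a hypothetical helper name, not the one used in the PR.

```python
import unittest


class FakeSubgroup:
    """Hypothetical stand-in for a process subgroup; not a torch.distributed type."""

    def __init__(self, backend):
        self.backend = backend
        self.sequence_number = 0

    def all_reduce(self):
        # Each collective bumps the counter once.
        self.sequence_number += 1


class SequenceNumSubgroupTest(unittest.TestCase):
    def _check_sequence_num_incremented_subgroup(self, backend_name):
        # Shared body: only the backend name differs between the callers.
        group = FakeSubgroup(backend_name)
        for expected in range(1, 4):
            group.all_reduce()
            self.assertEqual(group.sequence_number, expected)

    def test_sequence_num_incremented_gloo_subgroup(self):
        self._check_sequence_num_incremented_subgroup("gloo")

    def test_sequence_num_incremented_nccl_subgroup(self):
        self._check_sequence_num_incremented_subgroup("nccl")
```

In the real test file the backend-specific decorators (for example `@requires_nccl()`) would stay on the individual test methods, while the shared body moves into the helper.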

wayi1 (Contributor) left a comment

Thanks for improving the debuggability!

rohan-varma added a commit that referenced this pull request Apr 17, 2021
Pull Request resolved: #55718

ghstack-source-id: 126779220

rohan-varma added a commit that referenced this pull request Apr 19, 2021
Pull Request resolved: #55718

ghstack-source-id: 126819376

rohan-varma added a commit that referenced this pull request Apr 19, 2021
Pull Request resolved: #55718

ghstack-source-id: 126896607

rohan-varma added a commit that referenced this pull request Apr 20, 2021
Pull Request resolved: #55718

ghstack-source-id: 126971120

rohan-varma added a commit that referenced this pull request Apr 21, 2021
Pull Request resolved: #55718

ghstack-source-id: 127033533

rohan-varma added a commit that referenced this pull request Apr 21, 2021
Pull Request resolved: #55718

ghstack-source-id: 127099736

rohan-varma added a commit that referenced this pull request Apr 22, 2021
Pull Request resolved: #55718

ghstack-source-id: 127215077


facebook-github-bot (Contributor) commented:

This pull request has been merged in 7ff1990.

facebook-github-bot deleted the gh/rohan-varma/287/head branch on April 27, 2021, 14:16
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary:
Pull Request resolved: pytorch#55718

ghstack-source-id: 127215077

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27690690

fbshipit-source-id: cb284b7c760763b7c0f814a41f06656fabf806d6
Labels
cla signed, Merged, oncall: distributed
4 participants