[reland] Support torch.distributed.irecv(src=None, ...) #49383

Closed

pritamdamania87 commented Dec 15, 2020

Stack from ghstack:

Reland of #47137

Differential Revision: D25551910

facebook-github-bot added the "cla signed" and "oncall: distributed" labels Dec 15, 2020
pritamdamania87 pushed a commit that referenced this pull request Dec 15, 2020
Reland of #47137

Differential Revision: [D25551910](https://our.internmc.facebook.com/intern/diff/D25551910/)

ghstack-source-id: 118586219
Pull Request resolved: #49383
Comment on lines 889 to 907
# Each rank would have 2 * (world_size - 1) sends, verify that
# globally we receive the same amount on the other end.
recv_ranks_tensor = torch.cat((torch.tensor(recv_ranks), torch.tensor(irecv_ranks)), 0)
global_recv_ranks = [
    torch.empty_like(recv_ranks_tensor),
    torch.empty_like(recv_ranks_tensor),
    torch.empty_like(recv_ranks_tensor),
    torch.empty_like(recv_ranks_tensor),
]
dist.all_gather(global_recv_ranks, recv_ranks_tensor)
global_recv_ranks_list = []
for tensor in global_recv_ranks:
    global_recv_ranks_list += tensor.tolist()

from itertools import groupby
global_recv_ranks_list.sort()
frequency = [len(list(group)) for key, group in groupby(global_recv_ranks_list)]
self.assertEqual(dist.get_world_size(), len(frequency))
self.assertEqual([2 * (dist.get_world_size() - 1)] * dist.get_world_size(), frequency)
Contributor Author

The validation here is more robust than in #47137, since recvAnySource can potentially receive from any rank.
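
The global check in the snippet under review can be tried out without a process group. What follows is only a minimal sketch, assuming the gathered list holds the source rank of every message received across all ranks; `count_per_sender` and the sample data are hypothetical and not part of the PR.

```python
from itertools import groupby


def count_per_sender(source_ranks):
    """Return how many messages were received from each distinct sender rank."""
    ordered = sorted(source_ranks)  # groupby only merges adjacent equal items
    return [len(list(group)) for _, group in groupby(ordered)]


# Hypothetical stand-in for the all_gather result with world_size = 3.
world_size = 3
gathered = []
for receiver in range(world_size):
    senders = [rank for rank in range(world_size) if rank != receiver]
    # Two messages per sender (one matched by recv, one by irecv);
    # the second pass is reversed to mimic an arbitrary arrival order.
    gathered += senders + senders[::-1]

frequency = count_per_sender(gathered)
expected = [2 * (world_size - 1)] * world_size
assert frequency == expected
```

Because only per-sender totals are compared, the check does not depend on which individual receive call happened to match which sender.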

facebook-github-bot commented Dec 15, 2020

💊 CI failures summary and remediations

As of commit ff34456 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-CircleCI failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (1/1)

Step: "Build"

Dec 16 20:57:11 torch_xla/csrc/aten_xla_type_default.cpp:3698:10: error: binding reference of type 'at::Tensor' to value of type 'const at::Tensor' drops 'const' qualifier
Dec 16 20:57:11 Target //tensorflow/compiler/xla/xla_client:libxla_computation_client.so up-to-date:
Dec 16 20:57:11   bazel-bin/tensorflow/compiler/xla/xla_client/libxla_computation_client.so
Dec 16 20:57:11 INFO: Elapsed time: 348.708s, Critical Path: 20.39s
Dec 16 20:57:11 INFO: 7336 processes: 7335 remote cache hit, 1 local.
Dec 16 20:57:11 INFO: Build completed successfully, 7692 total actions
Dec 16 20:57:11 INFO: Build completed successfully, 7692 total actions
Dec 16 20:57:11 + popd
Dec 16 20:57:11 + mkdir -p torch_xla/lib
Dec 16 20:57:11 + chmod 0644 /var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-bin/tensorflow/compiler/xla/xla_client/libxla_computation_client.so
Dec 16 20:57:11 + cp /var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-bin/tensorflow/compiler/xla/xla_client/libxla_computation_client.so torch_xla/lib
Dec 16 20:57:11 torch_xla/csrc/aten_xla_type_default.cpp:3698:10: error: binding reference of type 'at::Tensor' to value of type 'const at::Tensor' drops 'const' qualifier
Dec 16 20:57:11   return self;
Dec 16 20:57:11          ^~~~
Dec 16 20:57:11 1 error generated.
Dec 16 20:57:11 /opt/conda/lib/python3.6/site-packages/torch/utils/cpp_extension.py:364: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
Dec 16 20:57:11   warnings.warn(msg.format('we could not find ninja.'))
Dec 16 20:57:11 error: command 'clang-9' failed with exit status 1
Dec 16 20:57:11 + cleanup
Dec 16 20:57:11 + retcode=1
Dec 16 20:57:11 + set +x
Dec 16 20:57:11 6_64/wheel/torch/include/ATen/native


pritamdamania87 pushed a commit that referenced this pull request Dec 16, 2020
Pull Request resolved: #49383

Reland of #47137
ghstack-source-id: 118669599

Differential Revision: [D25551910](https://our.internmc.facebook.com/intern/diff/D25551910/)
        return pg.recv_anysource([tensor], tag)
    else:
        if pg is GroupMember.WORLD:
            pg.recv([tensor], src, tag).wait()
Member

Why do we block in this case?

Contributor Author

Good catch, this is a bug.
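
To illustrate why the blocking call is a problem for a non-blocking API, here is a toy sketch using only the standard library. It assumes the intended behavior is to hand the work handle back to the caller, as the `recv_anysource` branch quoted above does; `FakeWork`, `start_recv`, `irecv_blocking`, and `irecv_nonblocking` are all hypothetical names, not PyTorch APIs.

```python
import threading


class FakeWork:
    """Hypothetical stand-in for the handle that pg.recv(...) returns."""

    def __init__(self):
        self._done = threading.Event()

    def complete(self):
        self._done.set()

    def is_completed(self):
        return self._done.is_set()

    def wait(self, timeout=None):
        # Blocks until complete() is called; returns False on timeout.
        return self._done.wait(timeout)


def start_recv(delay=0.05):
    """Hypothetical receive that completes after `delay` seconds."""
    work = FakeWork()
    timer = threading.Timer(delay, work.complete)
    timer.daemon = True
    timer.start()
    return work


def irecv_blocking():
    # Same shape as the quoted branch: waits and does not return the handle.
    start_recv().wait()


def irecv_nonblocking():
    # Returns the handle so the caller decides when (or whether) to wait.
    return start_recv()
```

With the blocking variant the caller gets nothing back and cannot overlap the receive with other work; with the non-blocking variant it can call `wait()` on the returned handle at a point of its choosing.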

pritamdamania87 pushed a commit that referenced this pull request Dec 16, 2020
Pull Request resolved: #49383

Reland of #47137
ghstack-source-id: 118686023

Differential Revision: [D25551910](https://our.internmc.facebook.com/intern/diff/D25551910/)
pritamdamania87 pushed a commit that referenced this pull request Dec 16, 2020
Pull Request resolved: #49383

Reland of #47137
ghstack-source-id: 118735407

Differential Revision: [D25551910](https://our.internmc.facebook.com/intern/diff/D25551910/)
@facebook-github-bot

This pull request has been merged in db2ecef.

@facebook-github-bot facebook-github-bot deleted the gh/pritamdamania87/191/head branch December 20, 2020 15:18
hwangdeyu pushed a commit to hwangdeyu/pytorch that referenced this pull request Jan 6, 2021
Summary:
Pull Request resolved: pytorch#49383

Reland of pytorch#47137
ghstack-source-id: 118735407

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D25551910

fbshipit-source-id: 2e1f2f77e7c69204056dfe6ed178e8ad7650ab32
Labels
cla signed, Merged, oncall: distributed