[distributed] calling nccl reduce with inconsistent dst hangs #39706

Open
ssnl opened this issue Jun 9, 2020 · 1 comment
Labels
module: deadlock (Problems related to deadlocks: hang without exiting)
module: nccl (Problems related to nccl support)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

ssnl (Collaborator) commented Jun 9, 2020

Not sure whether this is intended or avoidable, but if dst is inconsistent across ranks, reduce itself finishes, yet subsequent kernels seem to hang. E.g., with one process per rank:

# Process for rank 0 (passes dst=0)
import torch
torch.distributed.init_process_group('nccl', init_method='tcp://localhost:10402', world_size=2, rank=0)
x = torch.zeros(3, device=0)
torch.distributed.reduce(x, 0, torch.distributed.ReduceOp.SUM)
print(x)

# Process for rank 1 (passes dst=1, which disagrees with rank 0)
import torch
torch.distributed.init_process_group('nccl', init_method='tcp://localhost:10402', world_size=2, rank=1)
x = torch.zeros(3, device=1)
torch.distributed.reduce(x, 1, torch.distributed.ReduceOp.SUM)
print(x)
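
As a possible debugging aid (not a fix confirmed anywhere in this thread), each rank could compare its dst with every other rank's before calling reduce, so that a mismatch raises an error instead of hanging later. The sketch below assumes the process group is already initialized and that the tensor lives on the device the backend expects. The names find_dst_mismatch and checked_reduce are hypothetical helpers, not part of torch.distributed.

```python
def find_dst_mismatch(dsts):
    """Return None if every rank proposed the same dst, else a {rank: dst} dict.

    `dsts` is a list where index i holds the dst value passed by rank i.
    """
    if len(set(dsts)) <= 1:
        return None
    return {rank: dst for rank, dst in enumerate(dsts)}


def checked_reduce(tensor, dst, op=None):
    """Hypothetical wrapper: verify dst agrees on all ranks, then call reduce."""
    # Imported lazily so the pure helper above stays usable without torch.
    import torch
    import torch.distributed as dist

    if op is None:
        op = dist.ReduceOp.SUM

    # Every rank contributes its own dst; all_gather is symmetric, so it does
    # not depend on the ranks agreeing on a destination.
    mine = torch.tensor([dst], dtype=torch.int64, device=tensor.device)
    gathered = [torch.empty_like(mine) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, mine)

    mismatch = find_dst_mismatch([int(t.item()) for t in gathered])
    if mismatch is not None:
        raise ValueError(f"inconsistent dst across ranks: {mismatch}")

    dist.reduce(tensor, dst, op)
```

In the repro above, replacing torch.distributed.reduce(x, ...) with checked_reduce(x, ...) on both ranks should raise ValueError on each rank rather than proceeding to the reduce. The check costs one extra collective per call, so it is probably better suited to debugging than to production code.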

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar

albanD added the oncall: distributed, module: nccl, module: deadlock, and triaged labels Jun 9, 2020
mrshenli (Contributor) commented Jun 9, 2020

cc @agolynski
