[c10d] Change set timeout API name to _set_default_timeout #115197
Conversation
[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/115197
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure) As of commit 898fb31 with merge base 259a996. FLAKY: the following job failed but was likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
pg = dist.distributed_c10d._get_default_group()
pg.allreduce(torch.rand(10).cuda(self.rank))
self._check_nccl_timeout(timedelta(seconds=123))
pg._get_backend(torch.device(f"cuda:{self.rank}"))._reset_nccl_collective_timeout(23000)
Directly passing in 23000 leads to this error (a corrected call is sketched after the traceback below):
File "/data/users/fduwjj/pytorch/test/distributed/test_c10d_nccl.py", line 1265, in test_reset_nccl_pg_timeout
pg._get_backend(torch.device(f"cuda:{self.rank}"))._set_default_timeout(23000)
TypeError: _set_default_timeout(): incompatible function arguments. The following argument types are supported:
1. (self: torch._C._distributed_c10d.ProcessGroupNCCL, timeout_mil_sec: datetime.timedelta) -> None
Invoked with: <torch.distributed.distributed_c10d.ProcessGroupNCCL object at 0x7efd3e58b930>, 23000
To execute this test, run the following from the base repo dir:
python test/distributed/test_c10d_nccl.py -k test_reset_nccl_pg_timeout
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
cc: @H-Huang
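For reference, here is a minimal sketch of the corrected call, assuming a process group has already been initialized with the NCCL backend as in the test above; the rank value is purely illustrative:

from datetime import timedelta

import torch
import torch.distributed as dist

rank = 0  # illustrative; in the actual test this comes from self.rank

# Assumes dist.init_process_group(...) has already run with backend="nccl".
pg = dist.distributed_c10d._get_default_group()
backend = pg._get_backend(torch.device(f"cuda:{rank}"))

# The binding expects a datetime.timedelta, not a bare int such as 23000.
backend._set_default_timeout(timedelta(milliseconds=23000))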
I think this is OK. We use timedelta as our API for other dist. timeouts, so it's consistent (a quick comparison is sketched below).
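For comparison, a short sketch of another distributed timeout knob that already takes a timedelta; the rendezvous details are illustrative and assume env-var initialization:

from datetime import timedelta

import torch.distributed as dist

# init_process_group takes its timeout as a datetime.timedelta as well,
# so _set_default_timeout follows the existing convention.
# Assumes MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are set in the environment.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=5))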
Yeah, this is just to mention that passing an int directly does not work in this case. (Maybe a float might work?)
lgtm!
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 2 jobs have failed, first few of them are: .github/workflows/trunk.yml / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral), .github/workflows/trunk.yml / win-vs2019-cpu-py3 / test (default, 3, 3, windows.4xlarge.nonephemeral). Details for Dev Infra team: raised by workflow job.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Somehow the feedback does not show up; this PR addresses the comment in #115141. cc H-Huang awgu kwen2501 wanchaol fegin wz337 wconstab mrshenli zhaojuanmao rohan-varma kiukchung lucasllc XilunWu tianyu-l yf225 [ghstack-poisoned]
Successfully rebased
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…15197) Somehow the feedback does not show up; this PR addresses the comment in pytorch#115141. Pull Request resolved: pytorch#115197. Approved by: https://github.com/XilunWu, https://github.com/wconstab
Stack from ghstack (oldest at bottom):
Somehow the feedback does not show up; this PR addresses the comment in #115141.
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @wz337 @tianyu-l @wconstab @yf225 @kiukchung @LucasLLC