
Conversation


@fduwjj fduwjj commented Dec 5, 2023

@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Dec 5, 2023

pytorch-bot bot commented Dec 5, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/115197

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 898fb31 with merge base 259a996:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@fduwjj fduwjj added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 5, 2023
pg = dist.distributed_c10d._get_default_group()
pg.allreduce(torch.rand(10).cuda(self.rank))
self._check_nccl_timeout(timedelta(seconds=123))
pg._get_backend(torch.device(f"cuda:{self.rank}"))._reset_nccl_collective_timeout(23000)
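For context, the `23000` in the last quoted line is a millisecond count. Wrapping such a count in the `timedelta` that the c10d timeout APIs take is a one-liner; a minimal sketch (standalone, no torch required):

```python
from datetime import timedelta

# Convert a raw millisecond count (as used in the quoted test line)
# into the timedelta object that the c10d timeout APIs accept.
timeout_ms = 23000
timeout = timedelta(milliseconds=timeout_ms)

assert timeout.total_seconds() == 23.0
```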

@fduwjj fduwjj Dec 5, 2023


Directly passing in 23000 leads to an error:

  File "/data/users/fduwjj/pytorch/test/distributed/test_c10d_nccl.py", line 1265, in test_reset_nccl_pg_timeout
    pg._get_backend(torch.device(f"cuda:{self.rank}"))._set_default_timeout(23000)
TypeError: _set_default_timeout(): incompatible function arguments. The following argument types are supported:
    1. (self: torch._C._distributed_c10d.ProcessGroupNCCL, timeout_mil_sec: datetime.timedelta) -> None

Invoked with: <torch.distributed.distributed_c10d.ProcessGroupNCCL object at 0x7efd3e58b930>, 23000

To execute this test, run the following from the base repo dir:
     python test/distributed/test_c10d_nccl.py -k test_reset_nccl_pg_timeout

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

cc: @H-Huang


I think this is OK. We use timedelta as the API for other dist timeouts, so it's consistent.


Yeah, this is just to note that passing an int directly does not work in this case. (Maybe a float might work?)
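The type error from the traceback can be reproduced without torch; a minimal pure-Python sketch (`set_default_timeout` here is a hypothetical stand-in mimicking the binding's behavior, not the real `_set_default_timeout`):

```python
from datetime import timedelta

def set_default_timeout(timeout):
    # Mimic the pybind signature check: only a datetime.timedelta
    # is accepted, so a raw int (or float) raises TypeError.
    if not isinstance(timeout, timedelta):
        raise TypeError(
            "set_default_timeout(): incompatible function arguments; "
            "expected datetime.timedelta, got %s" % type(timeout).__name__
        )
    return timeout

# Passing a timedelta works; 23000 ms is a 23-second timeout.
ok = set_default_timeout(timedelta(milliseconds=23000))
assert ok.total_seconds() == 23.0

# Passing a bare int raises TypeError, as in the traceback above.
try:
    set_default_timeout(23000)
except TypeError:
    pass
```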


@XilunWu XilunWu left a comment


lgtm!

fduwjj commented Dec 5, 2023

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@fduwjj

fduwjj commented Dec 6, 2023

@pytorchbot rebase

@pytorchmergebot

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

Somehow the feedback does not show up; this PR addresses the comment in #115141.

cc H-Huang awgu kwen2501 wanchaol fegin wz337 wconstab mrshenli zhaojuanmao rohan-varma kiukchung lucasllc XilunWu tianyu-l yf225

[ghstack-poisoned]
@pytorchmergebot

Successfully rebased gh/fduwjj/116/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/115197)

pytorchmergebot pushed a commit that referenced this pull request Dec 6, 2023
@fduwjj

fduwjj commented Dec 6, 2023

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@albanD albanD added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Dec 8, 2023
@facebook-github-bot facebook-github-bot deleted the gh/fduwjj/116/head branch December 9, 2023 15:27
dmenig pushed a commit to dmenig/pytorch that referenced this pull request Dec 21, 2023
…15197)

Somehow the feedback does not show up, this PR is to address the comment in pytorch#115141.

Pull Request resolved: pytorch#115197
Approved by: https://github.com/XilunWu, https://github.com/wconstab
