
Conversation

kwen2501
Contributor

@kwen2501 kwen2501 commented Apr 18, 2024

Summary:

```
ncclGroupStart()
ncclCommInit(..)
ncclGroupEnd()
```

The pattern above is only needed when a *single thread* manages multiple GPUs.

In our case we always have one process managing one GPU, so the group calls are not needed.
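For illustration, here is a minimal sketch of the two cases (not code from this PR; the helper names and arguments are made up around the NCCL calls). In the single-thread case, `ncclCommInitRank()` blocks until every rank has joined, so the group wrapper is what keeps a loop over local GPUs from stalling on the first call; in the one-process-per-GPU case a single blocking init is enough.

```
// Case A: one thread initializing communicators for several local GPUs.
// ncclCommInitRank() is a blocking, collective call; the group defers the
// blocking work to ncclGroupEnd() so the loop can issue all inits first.
#include <cuda_runtime.h>
#include <nccl.h>

void init_multi_gpu_single_thread(int ndev, int nranks, int first_rank,
                                  ncclUniqueId id, ncclComm_t* comms) {
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    ncclCommInitRank(&comms[i], nranks, id, first_rank + i);
  }
  ncclGroupEnd();
}

// Case B: the setup this PR targets, one process (and thread) per GPU.
// A single blocking init is fine; the group start/end adds nothing.
void init_one_gpu_per_process(int device, int nranks, int rank,
                              ncclUniqueId id, ncclComm_t* comm) {
  cudaSetDevice(device);
  ncclCommInitRank(comm, nranks, id, rank);
}
```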

Test Plan: CI

Differential Revision: D56274975

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

Summary:
Pull Request resolved: pytorch#124363


pytorch-bot bot commented Apr 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124416

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit 11d40e0 with merge base 74bedbb:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ci-td-distributed, oncall: distributed, and release notes: distributed (c10d) labels Apr 18, 2024
@d4l3k
Member

d4l3k commented Apr 18, 2024

You can use multiple GPUs from a single PyTorch process; do we not support that with NCCL?

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Apr 19, 2024
@kwen2501
Contributor Author

NCCL supports that.
ProcessGroupNCCL supports that too if the multiple GPUs are each under a different thread.
The change here only concerns the case where one thread manages multiple GPUs, which is the case we have deprecated.
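Roughly, that still-supported per-thread model looks like the sketch below (illustrative only; the helper and its arguments are assumptions, not ProcessGroupNCCL code). Each thread owns one device and its own blocking init, so no group wrapper is needed.

```
// One process, several GPUs, one helper thread per GPU (assumed setup).
// Each thread's ncclCommInitRank() blocks independently, so no
// ncclGroupStart()/ncclGroupEnd() is required around the calls.
#include <thread>
#include <vector>
#include <cuda_runtime.h>
#include <nccl.h>

void init_per_thread(int ndev, int nranks, int first_rank, ncclUniqueId id,
                     std::vector<ncclComm_t>& comms) {
  comms.resize(ndev);
  std::vector<std::thread> workers;
  for (int i = 0; i < ndev; ++i) {
    workers.emplace_back([&comms, &id, i, nranks, first_rank] {
      cudaSetDevice(i);  // bind this thread to its GPU
      ncclCommInitRank(&comms[i], nranks, id, first_rank + i);
    });
  }
  for (auto& t : workers) t.join();
}
```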

@kwen2501
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
…rch#124416)


Co-authored-by: Cen Zhao <cenzhao@meta.com>
Pull Request resolved: pytorch#124416
Approved by: https://github.com/shuqiangzhang
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
…rch#124416)
