Make the NCCL PG Options and Config copyable and safe to init standalone #155700
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155700
Note: Links to docs will display an error until the docs builds have been completed.
⏳ 2 Pending, 4 Unrelated Failures
As of commit 55e61c4 with merge base 577baa4:
BROKEN TRUNK - The following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
})
.def(
    "__copy__",
    [](const ncclConfig_t& self) { return ncclConfig_t(self); })
.def(
    "__deepcopy__",
    [](const ncclConfig_t& self, const py::dict& memo) {
      return ncclConfig_t(self);
    },
    py::arg("memo"));
I introduced the (deep)copy methods because, in our application, we need to activate the new flag only in a subgroup but want to retrieve the rest of the config from the root PG; hence we need to make a copy of it in order to avoid modifying it in place.
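A minimal sketch of that intended usage (my own illustration, not code from this PR; it assumes the `config` attribute on Options is writable and uses `collnet_enable` as the example flag):

import copy
import torch.distributed as dist

# Clone the root PG's NCCL config so a subgroup can diverge from it
# without mutating the root's options in place.
root_opts = dist.ProcessGroupNCCL.Options()
sub_cfg = copy.deepcopy(root_opts.config)  # relies on the __deepcopy__ bound above
sub_cfg.collnet_enable = 1                 # flip the flag only for the subgroup

sub_opts = dist.ProcessGroupNCCL.Options()
sub_opts.config = sub_cfg                  # root_opts.config is left untouched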
Hi @lw, I believe we would need to make an update here as well: https://github.com/pytorch/pytorch/blob/main/.github/scripts/generate_binary_build_matrix.py#L56
Please see #155233
.def(py::init([]() {
  ncclConfig_t defaultCfg = NCCL_CONFIG_INITIALIZER;
  return std::make_unique<ncclConfig_t>(defaultCfg);
}))
Without this change, it was unsafe to create a NCCLConfig instance directly (one could only create it indirectly via ProcessGroupNCCL::Options)
Before:
In [1]: import torch.distributed
In [2]: opts = torch.distributed.ProcessGroupNCCL.Options()
In [3]: cfg = torch.distributed.ProcessGroupNCCL.NCCLConfig()
In [4]: opts.config.collnet_enable
Out[4]: -2147483648
In [5]: cfg.collnet_enable
Out[5]: 0
Now:
In [1]: import torch.distributed
In [2]: opts = torch.distributed.ProcessGroupNCCL.Options()
In [3]: cfg = torch.distributed.ProcessGroupNCCL.NCCLConfig()
In [4]: opts.config.collnet_enable
Out[4]: -2147483648
In [5]: cfg.collnet_enable
Out[5]: -2147483648
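For context (my reading of the numbers above): -2147483648 is INT_MIN, which NCCL_CONFIG_INITIALIZER uses as its "value not set" sentinel (NCCL_CONFIG_UNDEF_INT), whereas the old standalone path left fields zeroed, and 0 is a legitimate value for many of these knobs. A quick self-check against the session above:

# Both configs should now start from NCCL_CONFIG_INITIALIZER defaults,
# i.e. INT_MIN ("undef") rather than a misleading zero.
assert cfg.collnet_enable == opts.config.collnet_enable == -(2**31)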
@Skylion007 yeah, #155379 does most of what this PR does, but there are also a few extra safety/usability changes that I'll have to land anyway.
We are blocked on landing #155233 due to the CUDA 12.9 upgrade, since we are missing the updated NCCL libraries (and cuSparseLt libraries) for it.
.def(
    "__copy__",
    [](const ncclConfig_t& self) { return ncclConfig_t(self); })
.def(
    "__deepcopy__",
    [](const ncclConfig_t& self, const py::dict& memo) {
      return ncclConfig_t(self);
    },
    py::arg("memo"));
Wondering how we could make this look better? Can pybind find the default copy constructor of a struct to fulfill __copy__?
I couldn't find a smarter way of doing this, and even within this file there are other instances that are done the same way:
pytorch/torch/csrc/distributed/c10d/init.cpp, lines 869 to 876 in f45f483:
.def(
    "__copy__",
    [](const ::c10d::ReduceOp& self) { return ::c10d::ReduceOp(self); })
.def(
    "__deepcopy__",
    [](const ::c10d::ReduceOp& self, const py::dict& memo) {
      return ::c10d::ReduceOp(self);
    })
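As far as I know, pybind11 won't synthesize __copy__/__deepcopy__ from a type's copy constructor, but the duplication could be factored out with a small helper along these lines (a sketch, not part of this PR; the helper name is made up):

namespace py = pybind11;

// Hypothetical helper: binds the Python copy protocol for any
// copy-constructible T, replacing the repeated pair of lambdas above.
template <typename T, typename... Extra>
py::class_<T, Extra...>& def_copy_protocol(py::class_<T, Extra...>& cls) {
  cls.def("__copy__", [](const T& self) { return T(self); });
  cls.def(
      "__deepcopy__",
      [](const T& self, const py::dict& /*memo*/) { return T(self); },
      py::arg("memo"));
  return cls;
}

Usage would then be def_copy_protocol(config_class); at each registration site.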
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k