Contributor

@lw lw commented Jun 11, 2025

[ghstack-poisoned]
@lw lw requested a review from jeffdaily as a code owner June 11, 2025 13:53
pytorch-bot bot commented Jun 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155700

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 2 Pending, 4 Unrelated Failures

As of commit 55e61c4 with merge base 577baa4 (image):

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/inductor, oncall: distributed, and release notes: releng labels Jun 11, 2025
lw added a commit that referenced this pull request Jun 11, 2025
ghstack-source-id: d61468a
Pull Request resolved: #155700
Comment on lines +3222 to +3231
    })
    .def(
        "__copy__",
        [](const ncclConfig_t& self) { return ncclConfig_t(self); })
    .def(
        "__deepcopy__",
        [](const ncclConfig_t& self, const py::dict& memo) {
          return ncclConfig_t(self);
        },
        py::arg("memo"));
Contributor Author

I introduced the (deep)copy methods because in our application we need to activate the new flag only in a subgroup while retrieving the rest of the config from the root PG, so we need to make a copy of the config in order to avoid modifying it in place.
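
The use case described above can be sketched in plain Python. This is only an illustration: `StandInConfig` is a hypothetical stand-in for `ProcessGroupNCCL.NCCLConfig` (the real binding requires a PyTorch build with NCCL), and `UNDEF_INT` is assumed to match the -2147483648 value that appears elsewhere in this thread.

```python
import copy
from dataclasses import dataclass

# Assumption: INT_MIN plays the role of the "undefined, use the default" value.
UNDEF_INT = -(2**31)


@dataclass
class StandInConfig:
    """Hypothetical stand-in for ProcessGroupNCCL.NCCLConfig (not the real binding)."""

    collnet_enable: int = UNDEF_INT
    min_ctas: int = UNDEF_INT

    def __copy__(self):
        # Mirror the binding: return a fresh, independent object built from self.
        return StandInConfig(self.collnet_enable, self.min_ctas)

    def __deepcopy__(self, memo):
        # The config is a flat struct, so a deep copy is the same as a shallow one.
        return StandInConfig(self.collnet_enable, self.min_ctas)


root_cfg = StandInConfig(min_ctas=4)  # settings taken from the root PG

sub_cfg = copy.copy(root_cfg)         # independent copy for the subgroup
sub_cfg.collnet_enable = 1            # enable the flag only in the subgroup

print(root_cfg.collnet_enable, sub_cfg.collnet_enable, sub_cfg.min_ctas)
```

The root config keeps its original `collnet_enable` value, while the subgroup copy inherits `min_ctas` and carries the changed flag.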

@atalman
Contributor

atalman commented Jun 11, 2025

Hi @lw I believe we would need to make an update here as well: https://github.com/pytorch/pytorch/blob/main/.github/scripts/generate_binary_build_matrix.py#L56

@nWEIdia
Collaborator

nWEIdia commented Jun 11, 2025

Please see #155233
cc @Skylion007

[ghstack-poisoned]
Comment on lines +3200 to +3203
    .def(py::init([]() {
      ncclConfig_t defaultCfg = NCCL_CONFIG_INITIALIZER;
      return std::make_unique<ncclConfig_t>(defaultCfg);
    }))
Contributor Author

Without this change, it was unsafe to create an NCCLConfig instance directly (one could only create it indirectly via ProcessGroupNCCL::Options).

Before:

In [1]: import torch.distributed

In [2]: opts = torch.distributed.ProcessGroupNCCL.Options()

In [3]: cfg = torch.distributed.ProcessGroupNCCL.NCCLConfig()

In [4]: opts.config.collnet_enable
Out[4]: -2147483648

In [5]: cfg.collnet_enable
Out[5]: 0

Now:

In [1]: import torch.distributed

In [2]: opts = torch.distributed.ProcessGroupNCCL.Options()

In [3]: cfg = torch.distributed.ProcessGroupNCCL.NCCLConfig()

In [4]: opts.config.collnet_enable
Out[4]: -2147483648

In [5]: cfg.collnet_enable
Out[5]: -2147483648
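
Why the difference between 0 and -2147483648 matters can be sketched as follows. As I understand it, NCCL treats the INT_MIN value as "undefined, fall back to the library default", whereas 0 reads as an explicit setting. The helper `effective_value` and the default of 1 are hypothetical and exist only for illustration; they are not part of NCCL or PyTorch.

```python
# Assumption: the sentinel equals INT_MIN, as the -2147483648 output above suggests.
UNDEF_INT = -(2**31)


def effective_value(field: int, library_default: int) -> int:
    """Hypothetical helper: resolve a field the way a sentinel-based config would."""
    return library_default if field == UNDEF_INT else field


zero_initialized = 0              # what the directly created config held before
properly_initialized = UNDEF_INT  # what it holds after this change

# library_default=1 is an illustrative value, not NCCL's actual default.
print(effective_value(zero_initialized, library_default=1))      # explicit 0 wins
print(effective_value(properly_initialized, library_default=1))  # default is kept
```

Under this reading, a zero-initialized config silently overrides every default with 0, which is why the direct constructor was unsafe before.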

@Skylion007
Collaborator

@lw #155379 should expose the flags you want, right?

@lw
Contributor Author

lw commented Jun 12, 2025

@Skylion007 yeah, #155379 does most of what this PR does, but there are also a few extra safety/usability changes that I'll have to land anyway.

@Skylion007
Collaborator

We are blocked on landing #155233 due to the CUDA 12.9 upgrade since we are missing the updated NCCL libraries (and cuSparseLt libraries) for it.

[ghstack-poisoned]
@lw lw changed the title from "Update NCCL to 2.27.3 and expose collnetEnable config" to "Make the NCCL PG Options and Config copyable and safe to init standalone" Jun 17, 2025
lw added a commit that referenced this pull request Jun 17, 2025
Comment on lines +3263 to +3271
    .def(
        "__copy__",
        [](const ncclConfig_t& self) { return ncclConfig_t(self); })
    .def(
        "__deepcopy__",
        [](const ncclConfig_t& self, const py::dict& memo) {
          return ncclConfig_t(self);
        },
        py::arg("memo"));
Contributor

Wondering how we could make this look better? Can pybind find the default copy constructor of a struct to fulfill __copy__?

Contributor Author

I couldn't find a smarter way of doing this, and even within this file there are other instances which are done the same way:

    .def(
        "__copy__",
        [](const ::c10d::ReduceOp& self) { return ::c10d::ReduceOp(self); })
    .def(
        "__deepcopy__",
        [](const ::c10d::ReduceOp& self, const py::dict& memo) {
          return ::c10d::ReduceOp(self);
        })
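
For context on why both methods are spelled out, here is a rough pure-Python sketch of how the standard `copy` module dispatches to them. `FlatValue` is a hypothetical class standing in for a bound C++ struct; the `calls` list exists only to record which hook was invoked.

```python
import copy

calls = []  # records which hook the copy module invoked


class FlatValue:
    """Hypothetical flat value type standing in for a bound C++ struct."""

    def __init__(self, x: int):
        self.x = x

    def __copy__(self):
        calls.append("copy")
        return FlatValue(self.x)

    def __deepcopy__(self, memo):
        # copy.deepcopy passes its memo dict; a type with no nested
        # references can ignore it, as the bindings above do.
        calls.append(("deepcopy", type(memo).__name__))
        return FlatValue(self.x)


a = FlatValue(3)
b = copy.copy(a)      # dispatches to FlatValue.__copy__
c = copy.deepcopy(a)  # dispatches to FlatValue.__deepcopy__(memo)

print(calls)
```

Both hooks simply invoke the copy constructor, which matches the pattern quoted from this file.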

[ghstack-poisoned]
lw added a commit that referenced this pull request Jun 18, 2025
@lw lw added the topic: not user facing label and removed the release notes: releng label Jun 18, 2025
@lw
Contributor Author

lw commented Jun 18, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Jun 18, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

@lw lw deleted the gh/lw/16/head branch June 19, 2025 08:29
Labels

ciflow/inductor, ciflow/trunk, Merged, oncall: distributed, topic: not user facing

6 participants