Make the NCCL PG Options and Config copyable and safe to init standalone #155700
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155700
Note: Links to docs will display an error until the docs builds have been completed.
⏳ 2 Pending, 4 Unrelated Failures
As of commit 55e61c4 with merge base 577baa4:
BROKEN TRUNK - The following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
})
.def(
    "__copy__",
    [](const ncclConfig_t& self) { return ncclConfig_t(self); })
.def(
    "__deepcopy__",
    [](const ncclConfig_t& self, const py::dict& memo) {
      return ncclConfig_t(self);
    },
    py::arg("memo"));
I introduced the (deep)copy methods because, in our application, we need to activate the new flag only in a subgroup but want to retrieve the rest of the config from the root PG; hence we need to make a copy of it in order to avoid modifying it in place.
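A minimal sketch of that intended usage (my own illustration, not code from this PR; it assumes the `config` attribute on Options is writable and uses `collnet_enable` as the example flag):

import copy
import torch.distributed as dist

# Clone the root PG's NCCL config so a subgroup can diverge from it
# without mutating the root's options in place.
root_opts = dist.ProcessGroupNCCL.Options()
sub_cfg = copy.deepcopy(root_opts.config)  # relies on the __deepcopy__ bound above
sub_cfg.collnet_enable = 1                 # flip the flag only for the subgroup

sub_opts = dist.ProcessGroupNCCL.Options()
sub_opts.config = sub_cfg                  # root_opts.config is left untouched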
Hi @lw, I believe we would need to make an update here as well: https://github.com/pytorch/pytorch/blob/main/.github/scripts/generate_binary_build_matrix.py#L56
Please see #155233
.def(py::init([]() {
  ncclConfig_t defaultCfg = NCCL_CONFIG_INITIALIZER;
  return std::make_unique<ncclConfig_t>(defaultCfg);
}))
Without this change, it was unsafe to create a NCCLConfig instance directly (one could only create it indirectly via ProcessGroupNCCL::Options)
Before:
In [1]: import torch.distributed
In [2]: opts = torch.distributed.ProcessGroupNCCL.Options()
In [3]: cfg = torch.distributed.ProcessGroupNCCL.NCCLConfig()
In [4]: opts.config.collnet_enable
Out[4]: -2147483648
In [5]: cfg.collnet_enable
Out[5]: 0
Now:
In [1]: import torch.distributed
In [2]: opts = torch.distributed.ProcessGroupNCCL.Options()
In [3]: cfg = torch.distributed.ProcessGroupNCCL.NCCLConfig()
In [4]: opts.config.collnet_enable
Out[4]: -2147483648
In [5]: cfg.collnet_enable
Out[5]: -2147483648
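For context (my reading of the numbers above): -2147483648 is INT_MIN, which NCCL_CONFIG_INITIALIZER uses as its "value not set" sentinel (NCCL_CONFIG_UNDEF_INT), whereas the old standalone path left fields zeroed, and 0 is a legitimate value for many of these knobs. A quick self-check against the session above:

# Both configs should now start from NCCL_CONFIG_INITIALIZER defaults,
# i.e. INT_MIN ("undef") rather than a misleading zero.
assert cfg.collnet_enable == opts.config.collnet_enable == -(2**31)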
@Skylion007 yeah, #155379 does most of what this PR does, but there are also a few extra safety/usability changes that I'll have to land anyway.
We are blocked on landing #155233 due to the CUDA 12.9 upgrade, since we are missing the updated NCCL libraries (and cuSparseLt libraries) for it.
.def(
    "__copy__",
    [](const ncclConfig_t& self) { return ncclConfig_t(self); })
.def(
    "__deepcopy__",
    [](const ncclConfig_t& self, const py::dict& memo) {
      return ncclConfig_t(self);
    },
    py::arg("memo"));
Wondering how we could make this look better? Can pybind find the default copy constructor of a struct to fulfill __copy__?
I couldn't find a smarter way of doing this, and even within this file there are other instances that are done the same way:
pytorch/torch/csrc/distributed/c10d/init.cpp, lines 869 to 876 in f45f483:
.def(
    "__copy__",
    [](const ::c10d::ReduceOp& self) { return ::c10d::ReduceOp(self); })
.def(
    "__deepcopy__",
    [](const ::c10d::ReduceOp& self, const py::dict& memo) {
      return ::c10d::ReduceOp(self);
    })
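As far as I know, pybind11 won't synthesize __copy__/__deepcopy__ from a type's copy constructor, but the duplication could be factored out with a small helper along these lines (a sketch, not part of this PR; the helper name is made up):

namespace py = pybind11;

// Hypothetical helper: binds the Python copy protocol for any
// copy-constructible T, replacing the repeated pair of lambdas above.
template <typename T, typename... Extra>
py::class_<T, Extra...>& def_copy_protocol(py::class_<T, Extra...>& cls) {
  cls.def("__copy__", [](const T& self) { return T(self); });
  cls.def(
      "__deepcopy__",
      [](const T& self, const py::dict& /*memo*/) { return T(self); },
      py::arg("memo"));
  return cls;
}

Usage would then be def_copy_protocol(config_class); at each registration site.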
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k