-
Notifications
You must be signed in to change notification settings - Fork 21.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reenable isinstance
with torch.distributed.ReduceOp
#87303
Changes from all commits
8a4f648
70bbba4
e8b7d1c
8002908
8c25b2f
2223c28
16bdeae
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -515,10 +515,14 @@ An enum-like class for built-in communication hooks: ``ALLREDUCE`` and ``FP16_CO | |
R"(Sets the debug level of the torch.distributed package from the | ||
``TORCH_DISTRIBUTED_DEBUG`` environment variable.)"); | ||
|
||
// TODO(crcrpar): Hardening `ReduceOp`. | ||
// While keeping most op types as enum value, | ||
// making `PREMUL_SUM` callable, i.e., allowing for | ||
// `ReduceOp.PREMUL_SUM(scale)` might be better as per @wanchaol. | ||
// https://pybind11.readthedocs.io/en/stable/classes.html#enumerations-and-internal-types | ||
py::class_<::c10d::ReduceOp> reduce_op(module, "ReduceOp", R"( | ||
An enum-like class for available reduction operations: ``SUM``, ``PRODUCT``, | ||
``MIN``, ``MAX``, ``BAND``, ``BOR``, and ``BXOR``. | ||
``MIN``, ``MAX``, ``BAND``, ``BOR``, ``BXOR``, and ``PREMUL_SUM``. | ||
``BAND``, ``BOR``, and ``BXOR`` reductions are not available when | ||
using the ``NCCL`` backend. | ||
|
@@ -529,13 +533,16 @@ and only for NCCL versions 2.10 or later. | |
``PREMUL_SUM`` multiplies inputs by a given scalar locally before reduction. | ||
``PREMUL_SUM`` is only available with the ``NCCL`` backend, | ||
and only available for NCCL versions 2.11 or later. | ||
and only available for NCCL versions 2.11 or later. Users are supposed to | ||
use ``torch.distributed._make_nccl_premul_sum``. | ||
Additionally, ``MAX``, ``MIN`` and ``PRODUCT`` are not supported for complex tensors. | ||
The values of this class can be accessed as attributes, e.g., ``ReduceOp.SUM``. | ||
They are used in specifying strategies for reduction collectives, e.g., | ||
:func:`reduce`, :func:`all_reduce_multigpu`, etc.)"); | ||
:func:`reduce`, :func:`all_reduce_multigpu`, etc. | ||
This class does not support ``__members__`` property.)"); | ||
|
||
reduce_op.def(py::init<::c10d::ReduceOp::RedOpType>()) | ||
.def_readwrite("op", &::c10d::ReduceOp::op_); | ||
|
@@ -555,8 +562,14 @@ They are used in specifying strategies for reduction collectives, e.g., | |
[](const ::c10d::ReduceOp& self, const ::c10d::ReduceOp& other) { | ||
return self == other.op_; | ||
}) | ||
.def("__hash__", [](const ::c10d::ReduceOp& self) { return self.op_; }); | ||
|
||
.def("__hash__", [](const ::c10d::ReduceOp& self) { | ||
return static_cast<uint8_t>(self.op_); | ||
}); | ||
|
||
// note(crcrpar): Deliberately skip | ||
// [`export_values`](https://pybind11.readthedocs.io/en/stable/classes.html#enumerations-and-internal-types) | ||
// here and manually set values in Python side. See note "ReduceOp static | ||
// class attributes to support `isinstance`" | ||
py::enum_<::c10d::ReduceOp::RedOpType>(reduce_op, "RedOpType") | ||
.value("SUM", ::c10d::ReduceOp::RedOpType::SUM) | ||
.value("AVG", ::c10d::ReduceOp::RedOpType::AVG) | ||
|
@@ -566,10 +579,10 @@ They are used in specifying strategies for reduction collectives, e.g., | |
.value("BAND", ::c10d::ReduceOp::RedOpType::BAND) | ||
.value("BOR", ::c10d::ReduceOp::RedOpType::BOR) | ||
.value("BXOR", ::c10d::ReduceOp::RedOpType::BXOR) | ||
.value("PREMUL_SUM", ::c10d::ReduceOp::RedOpType::PREMUL_SUM) | ||
.export_values(); | ||
.value("PREMUL_SUM", ::c10d::ReduceOp::RedOpType::PREMUL_SUM); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we just make pybind API on I feel maybe we can do sth like this |
||
|
||
// Ref: [Implicit | ||
// note(crcrpar): This could be removed because users will not pass | ||
// `RedOpType` to reduce collective ops Ref: [Implicit | ||
// conversions](https://pybind11.readthedocs.io/en/stable/advanced/classes.html#implicit-conversions) | ||
// Let us skip the explicit construction of `c10d::ReduceOp` from | ||
// `c10d::ReduceOp::RedOpType` in Python. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -233,6 +233,17 @@ def register_backend(cls, name, func, extended_api=False): | |
dist_backend = Backend | ||
|
||
|
||
# NOTE(crcrpar): [ReduceOp static class attributes to support `isinstance`] | ||
# A ReduceOp instance of `PREMUL_SUM` is supposed to be created via `_make_nccl_premul_sum` | ||
# while the other `op`s (meaning RedOpType members) can be directly passed to c10d reduce collectives. | ||
# I changed `ReduceOp` to struct from enum class and introduced RedOpType enum class for PREMUL_SUM, | ||
# which broke an implicit contract of ReduceOp being enum-like with which users apply isinstance to | ||
# `op`, for example, `isinstance(ReduceOp.SUM, ReduceOp)`: https://github.com/pytorch/pytorch/issues/87191 | ||
DENY_LIST = ("PREMUL_SUM", ) | ||
for _red_op_name, _red_op_value in ReduceOp.RedOpType.__members__.items(): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @crcrpar I think this PR break the serialization of c10d.ReduceOp, with the newest nightly this now crashes:
I checked the nightly a few days ago and it works all fine, and the issue seems coming from the changes in this file, if I comment this changes, then it works again, could you help take a look on this and fix it? Thanks! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm wondering if we should try to override There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can confirm that I'm also running into the issue @wanchaol mentioned |
||
setattr(ReduceOp, _red_op_name, _red_op_value if _red_op_name in DENY_LIST else ReduceOp(_red_op_value)) | ||
|
||
|
||
class _reduce_op(object): | ||
r""" | ||
Deprecated enum-like class for reduction operations: ``SUM``, ``PRODUCT``, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the reason user need to use this API to construct a premul_sum? could user just create this reduce op type like other reduce ops? The fact that there are two APIs constructing different reduce ops are confusing