Reenable isinstance with torch.distributed.ReduceOp #87303

Conversation
Ref: https://stackoverflow.com/questions/40244413/python-static-class-attribute-of-the-class-itself Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/87303
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 Failures
As of commit 16bdeae, the following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Force-pushed from 1d5844e to e8b7d1c
The .pyi changes are pure syntactic sugar and are not used for anything in particular... Let me think of a way to fix type ownership.
Thanks for caring about BC! I left some comments. I'm wondering if we should just do something like `ReduceOp.PREMUL_SUM(premul_value)`?
@@ -46,7 +47,9 @@ struct TORCH_API ReduceOp : torch::CustomClassHolder {

   ReduceOp(RedOpType op) : op_(op) {
     TORCH_INTERNAL_ASSERT(
-        op_ != PREMUL_SUM, "PREMUL_SUM requires a scale factor tensor or scalar argument");
+        op_ != PREMUL_SUM,
+        "Use `torch.distributed._make_nccl_premul_sum` to create an instance of ReduceOp with PREMUL_SUM"
What's the reason a user needs to use this API to construct a premul_sum? Could the user just create this reduce op like the other reduce ops? The fact that there are two APIs constructing different reduce ops is confusing.
@@ -566,8 +571,7 @@ They are used in specifying strategies for reduction collectives, e.g.,
       .value("BAND", ::c10d::ReduceOp::RedOpType::BAND)
       .value("BOR", ::c10d::ReduceOp::RedOpType::BOR)
       .value("BXOR", ::c10d::ReduceOp::RedOpType::BXOR)
-      .value("PREMUL_SUM", ::c10d::ReduceOp::RedOpType::PREMUL_SUM)
-      .export_values();
+      .value("PREMUL_SUM", ::c10d::ReduceOp::RedOpType::PREMUL_SUM);
Can we just put the pybind API on the `ReduceOp` class directly, so that users can still do things like `ReduceOp.SUM` (i.e. as a class attribute)? I feel we could do something like `ReduceOp.PREMUL_SUM(premul_value)`, which looks cleaner than `_make_nccl_premul_sum`.
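Sketched out, the ergonomics being discussed look roughly like this; the `ReduceOp.PREMUL_SUM(premul_value)` spelling is the reviewer's hypothetical suggestion, not an API that exists in this PR:

```python
import torch.distributed as dist

# Enum-like class attributes should keep working, including isinstance checks.
op = dist.ReduceOp.SUM
assert isinstance(op, dist.ReduceOp)

# What this PR provides for PREMUL_SUM: a private factory helper.
premul = dist._make_nccl_premul_sum(2.0)

# The suggested alternative spelling (hypothetical, not implemented here):
# premul = dist.ReduceOp.PREMUL_SUM(2.0)
```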
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Given that this is the last day of cherry-picking, let's get this in once the CI passes. Could you please create an issue to follow up on a more user-friendly API once you are confident enough to expose premul_sum as a public API? Thanks!
test/distributed/test_c10d_common.py (outdated)
):
    self.assertTrue(isinstance(reduce_op, c10d.ReduceOp))
for scale in ([torch.tensor(1.0)], 2.0):
    self.assertTrue(dist._make_nccl_premul_sum(scale), c10d.ReduceOp)
Wouldn't we also need `isinstance` here?
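The point being raised: `assertTrue(x, y)` treats the second argument as a failure message, so the line above always passes as long as the helper returns something truthy. A sketch of the presumably intended check (the test class and imports are illustrative, not the actual test file layout):

```python
import unittest

import torch
import torch.distributed as dist

class PremulSumInstanceCheck(unittest.TestCase):
    def test_premul_sum_is_reduce_op(self):
        # Verify the returned op is actually a ReduceOp, for both scale forms.
        for scale in ([torch.tensor(1.0)], 2.0):
            self.assertTrue(isinstance(dist._make_nccl_premul_sum(scale), dist.ReduceOp))

if __name__ == "__main__":
    unittest.main()
```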
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
I need to get more familiar with pybind11. Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
@pytorchbot merge -f "needed to fix regression in 1.13, failures unrelated"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Hey @crcrpar.
tentatively marking as draft as I haven't gotten a comprehensive list of side effects...
Ref: https://stackoverflow.com/questions/40244413/python-static-class-attribute-of-the-class-itself
Rel: pytorch#87191
cc @kwen2501
Pull Request resolved: pytorch#87303
Approved by: https://github.com/wanchaol
# which broke an implicit contract of ReduceOp being enum-like with which users apply isinstance to
# `op`, for example, `isinstance(ReduceOp.SUM, ReduceOp)`: https://github.com/pytorch/pytorch/issues/87191
DENY_LIST = ("PREMUL_SUM", )
for _red_op_name, _red_op_value in ReduceOp.RedOpType.__members__.items():
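The loop body is cut off in the excerpt; a plausible completion, based on the comment above it and the DENY_LIST (an assumption, not the verbatim source):

```python
# Re-expose every RedOpType member on ReduceOp as a real ReduceOp instance so that
# `isinstance(ReduceOp.SUM, ReduceOp)` holds again; PREMUL_SUM is skipped because it
# cannot be constructed without a scale factor.
for _red_op_name, _red_op_value in ReduceOp.RedOpType.__members__.items():
    if _red_op_name in DENY_LIST:
        continue
    setattr(ReduceOp, _red_op_name, ReduceOp(_red_op_value))
```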
@crcrpar I think this PR breaks the serialization of c10d.ReduceOp; with the newest nightly this now crashes:

import torch
import copy

a = torch.distributed.distributed_c10d.ReduceOp.SUM
copy.deepcopy(a)

I checked the nightly a few days ago and it worked fine, and the issue seems to come from the changes in this file; if I comment out these changes, it works again. Could you help take a look at this and fix it? Thanks!
I'm wondering if we should try to override `__instancecheck__` instead of overriding the attributes on the class?
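A pure-Python sketch of that direction (the approach later taken in #88275): put a custom `__instancecheck__` on a metaclass so bare RedOpType members still count as ReduceOp instances. The names here are illustrative, not the actual binding code:

```python
from enum import Enum

class RedOpType(Enum):          # stands in for ReduceOp.RedOpType
    SUM = 0
    PREMUL_SUM = 1

class _ReduceOpMeta(type):
    def __instancecheck__(cls, obj):
        # Accept both real wrapper instances and bare enum members.
        return super().__instancecheck__(obj) or isinstance(obj, RedOpType)

class ReduceOp(metaclass=_ReduceOpMeta):
    RedOpType = RedOpType

assert isinstance(RedOpType.SUM, ReduceOp)   # satisfied via __instancecheck__
assert isinstance(ReduceOp(), ReduceOp)      # ordinary instances still pass
```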
I can confirm that I'm also running into the issue @wanchaol mentioned
A recent change of c10d.ReduceOp crashes deepcopy of it, pin pt version temporarily to fix CI. see pytorch/pytorch#87303 (comment)
Summary:
- Customize the metaclass of `torch.distributed.distributed_c10d.ReduceOp` for the sake of a custom `__instancecheck__`
- Add `copy.copy`, `copy.deepcopy`, and `pickle` support with tests

Rel:
- #81272
- #84243
- #87191
- #87303
- #87555

Ref:
- pybind/pybind11#2696

Pull Request resolved: #88275
Approved by: https://github.com/wanchaol
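A sketch of the kind of round-trip the follow-up is meant to guarantee (assumed shape, not the actual tests from #88275; the equality checks assume a ReduceOp compares equal to an equivalent op):

```python
import copy
import pickle

import torch
import torch.distributed as dist

op = dist.ReduceOp.SUM

# copy / deepcopy should return an equivalent op instead of crashing.
assert copy.copy(op) == op
assert copy.deepcopy(op) == op

# A pickle round-trip should preserve the op as well.
assert pickle.loads(pickle.dumps(op)) == op
```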