Reenable isinstance with torch.distributed.ReduceOp #87303

Conversation
Ref: https://stackoverflow.com/questions/40244413/python-static-class-attribute-of-the-class-itself Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/87303
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 Failures
As of commit 16bdeae, the following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Force-pushed from 1d5844e to e8b7d1c
The .pyi changes are pure syntactic sugar and are not used for anything in particular... Let me think of a way to fix type ownership.
Thanks for caring about BC! I left some comments. I'm wondering if we should just do something like `ReduceOp.PREMUL_SUM(premul_value)`?
@@ -46,7 +47,9 @@ struct TORCH_API ReduceOp : torch::CustomClassHolder {

   ReduceOp(RedOpType op) : op_(op) {
     TORCH_INTERNAL_ASSERT(
-        op_ != PREMUL_SUM, "PREMUL_SUM requires a scale factor tensor or scalar argument");
+        op_ != PREMUL_SUM,
+        "Use `torch.distributed._make_nccl_premul_sum` to create an instance of ReduceOp with PREMUL_SUM"
What's the reason a user needs to use this API to construct a premul_sum? Could the user just create this reduce op like the other reduce ops? The fact that there are two APIs constructing different reduce ops is confusing.
@@ -566,8 +571,7 @@ They are used in specifying strategies for reduction collectives, e.g.,
       .value("BAND", ::c10d::ReduceOp::RedOpType::BAND)
       .value("BOR", ::c10d::ReduceOp::RedOpType::BOR)
       .value("BXOR", ::c10d::ReduceOp::RedOpType::BXOR)
-      .value("PREMUL_SUM", ::c10d::ReduceOp::RedOpType::PREMUL_SUM)
-      .export_values();
+      .value("PREMUL_SUM", ::c10d::ReduceOp::RedOpType::PREMUL_SUM);
Can we just put the pybind API on the `ReduceOp` class directly, so that users can still do things like `ReduceOp.SUM` (i.e. as a class attribute)? I feel we could do something like `ReduceOp.PREMUL_SUM(premul_value)`, which looks cleaner than `_make_nccl_premul_sum`.
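Sketched out, the ergonomics being discussed look roughly like this; the `ReduceOp.PREMUL_SUM(premul_value)` spelling is the reviewer's hypothetical suggestion, not an API that exists in this PR:

```python
import torch.distributed as dist

# Enum-like class attributes should keep working, including isinstance checks.
op = dist.ReduceOp.SUM
assert isinstance(op, dist.ReduceOp)

# What this PR provides for PREMUL_SUM: a private factory helper.
premul = dist._make_nccl_premul_sum(2.0)

# The suggested alternative spelling (hypothetical, not implemented here):
# premul = dist.ReduceOp.PREMUL_SUM(2.0)
```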
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Given that this is the last day of cherry-picking, let's get this in once the CI passes. Could you please create an issue to follow up on a more user-friendly API once you are confident enough to expose premul_sum as a public API? Thanks!
test/distributed/test_c10d_common.py (outdated)
):
    self.assertTrue(isinstance(reduce_op, c10d.ReduceOp))
for scale in ([torch.tensor(1.0)], 2.0):
    self.assertTrue(dist._make_nccl_premul_sum(scale), c10d.ReduceOp)
Wouldn't we also need `isinstance` here?
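The point being raised: `assertTrue(x, y)` treats the second argument as a failure message, so the line above always passes as long as the helper returns something truthy. A sketch of the presumably intended check (the test class and imports are illustrative, not the actual test file layout):

```python
import unittest

import torch
import torch.distributed as dist

class PremulSumInstanceCheck(unittest.TestCase):
    def test_premul_sum_is_reduce_op(self):
        # Verify the returned op is actually a ReduceOp, for both scale forms.
        for scale in ([torch.tensor(1.0)], 2.0):
            self.assertTrue(isinstance(dist._make_nccl_premul_sum(scale), dist.ReduceOp))

if __name__ == "__main__":
    unittest.main()
```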
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
I need to get more familiar with pybind11. Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
@pytorchbot merge -f "needed to fix regression in 1.13, failures unrelated"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Hey @crcrpar.
tentatively marking as draft as I haven't gotten a comprehensive list of side effects...
Ref: https://stackoverflow.com/questions/40244413/python-static-class-attribute-of-the-class-itself
Rel: pytorch#87191
cc @kwen2501
Pull Request resolved: pytorch#87303
Approved by: https://github.com/wanchaol
# which broke an implicit contract of ReduceOp being enum-like with which users apply isinstance to
# `op`, for example, `isinstance(ReduceOp.SUM, ReduceOp)`: https://github.com/pytorch/pytorch/issues/87191
DENY_LIST = ("PREMUL_SUM", )
for _red_op_name, _red_op_value in ReduceOp.RedOpType.__members__.items():
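The loop body is cut off in the excerpt; a plausible completion, based on the comment above it and the DENY_LIST (an assumption, not the verbatim source):

```python
# Re-expose every RedOpType member on ReduceOp as a real ReduceOp instance so that
# `isinstance(ReduceOp.SUM, ReduceOp)` holds again; PREMUL_SUM is skipped because it
# cannot be constructed without a scale factor.
for _red_op_name, _red_op_value in ReduceOp.RedOpType.__members__.items():
    if _red_op_name in DENY_LIST:
        continue
    setattr(ReduceOp, _red_op_name, ReduceOp(_red_op_value))
```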
@crcrpar I think this PR breaks the serialization of c10d.ReduceOp; with the newest nightly this now crashes:

import torch
import copy

a = torch.distributed.distributed_c10d.ReduceOp.SUM
copy.deepcopy(a)

I checked the nightly a few days ago and it worked fine, and the issue seems to come from the changes in this file; if I comment out these changes, it works again. Could you help take a look at this and fix it? Thanks!
I'm wondering if we should try to override `__instancecheck__` instead of overriding the attributes on the class?
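A pure-Python sketch of that direction (the approach later taken in #88275): put a custom `__instancecheck__` on a metaclass so bare RedOpType members still count as ReduceOp instances. The names here are illustrative, not the actual binding code:

```python
from enum import Enum

class RedOpType(Enum):          # stands in for ReduceOp.RedOpType
    SUM = 0
    PREMUL_SUM = 1

class _ReduceOpMeta(type):
    def __instancecheck__(cls, obj):
        # Accept both real wrapper instances and bare enum members.
        return super().__instancecheck__(obj) or isinstance(obj, RedOpType)

class ReduceOp(metaclass=_ReduceOpMeta):
    RedOpType = RedOpType

assert isinstance(RedOpType.SUM, ReduceOp)   # satisfied via __instancecheck__
assert isinstance(ReduceOp(), ReduceOp)      # ordinary instances still pass
```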
I can confirm that I'm also running into the issue @wanchaol mentioned
A recent change of c10d.ReduceOp crashes deepcopy of it, pin pt version temporarily to fix CI. see pytorch/pytorch#87303 (comment)
Summary:
- Customize the metaclass of `torch.distributed.distributed_c10d.ReduceOp` for the sake of a custom `__instancecheck__`
- Add `copy.copy`, `copy.deepcopy`, and `pickle` support with tests

Rel:
- #81272
- #84243
- #87191
- #87303
- #87555

Ref:
- pybind/pybind11#2696

Pull Request resolved: #88275
Approved by: https://github.com/wanchaol
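A sketch of the kind of round-trip the follow-up is meant to guarantee (assumed shape, not the actual tests from #88275; the equality checks assume a ReduceOp compares equal to an equivalent op):

```python
import copy
import pickle

import torch
import torch.distributed as dist

op = dist.ReduceOp.SUM

# copy / deepcopy should return an equivalent op instead of crashing.
assert copy.copy(op) == op
assert copy.deepcopy(op) == op

# A pickle round-trip should preserve the op as well.
assert pickle.loads(pickle.dumps(op)) == op
```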