
adding complex support for distributed functions. fix #45760 #45879

Closed
wants to merge 14 commits into gh/bdhirsh/22/base from gh/bdhirsh/22/head

Conversation

Contributor @bdhirsh commented Oct 6, 2020

Stack from ghstack:

Differential Revision: D24127949

@facebook-github-bot added the oncall: distributed label (Add this issue/PR to distributed oncall triage queue) on Oct 6, 2020
if value is None:
    value = size
return torch.FloatTensor(size=[dim_size for _ in range(dim)]).fill_(value)
Contributor Author (bdhirsh):

What's the use case for Tensor classes that are typed on their dtype, vs. the generic torch.Tensor class?

Contributor:

Not sure about the history here, but I think your solution is better.
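
For context, a minimal sketch of the two construction styles being compared (the torch.full call is one possible "generic" equivalent for illustration, not necessarily what the PR uses):

import torch

dim, dim_size, value = 2, 3, 1.0

# dtype-specific legacy constructor, as in the snippet above
legacy = torch.FloatTensor(*([dim_size] * dim)).fill_(value)

# generic factory with the dtype passed explicitly
generic = torch.full([dim_size] * dim, value, dtype=torch.float)

assert torch.equal(legacy, generic)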

@@ -1060,6 +1057,41 @@ def test_all_reduce_sum_cuda(self):
            rank_to_GPU,
        )

    @unittest.skipIf(BACKEND == "nccl", "Nccl does not support CPU tensors")
    def test_all_reduce_sum_complex(self):
Contributor Author (bdhirsh):

I only added complex tests for a single reduction op (sum), since the complex handling is unrelated to the actual op logic.

Contributor Author (bdhirsh):

Update to that, since the above comment isn't really true anymore: I added explicit tests that a complex-unsupported ReduceOp like Max errors out properly :)
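
As a rough sketch of what such a test might look like (the test class, test name, and exact error type here are assumptions, not the PR's actual code):

import unittest
import torch
import torch.distributed as dist
from torch.distributed import ReduceOp

class ComplexReduceOpTests(unittest.TestCase):  # hypothetical test class
    def test_all_reduce_complex_unsupported_ops(self):
        # Ops like MAX/MIN have no meaningful ordering for complex values, so
        # all_reduce on a complex tensor should raise instead of silently
        # reducing garbage (assumes the process group is already initialized).
        tensor = torch.tensor([1 + 1j, 2 + 2j], dtype=torch.cfloat)
        with self.assertRaises(RuntimeError):
            dist.all_reduce(tensor, op=ReduceOp.MAX)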

bdhirsh added a commit that referenced this pull request Oct 6, 2020
ghstack-source-id: ff291d0f75451c20b1cdfd7d93738fb252a60fc9
Pull Request resolved: #45879
codecov bot commented Oct 6, 2020

Codecov Report

Merging #45879 into gh/bdhirsh/22/base will decrease coverage by 0.02%.
The diff coverage is 32.94%.

Impacted file tree graph

@@                  Coverage Diff                   @@
##           gh/bdhirsh/22/base   #45879      +/-   ##
======================================================
- Coverage               68.20%   68.17%   -0.03%     
======================================================
  Files                     410      410              
  Lines                   53453    53516      +63     
======================================================
+ Hits                    36457    36484      +27     
- Misses                  16996    17032      +36     
Impacted Files Coverage Δ
torch/distributed/distributed_c10d.py 27.29% <5.88%> (-0.54%) ⬇️
.../testing/_internal/distributed/distributed_test.py 29.72% <39.70%> (+0.54%) ⬆️
torch/testing/_internal/expecttest.py 78.57% <0.00%> (+1.02%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a814231...375fe36. Read the comment docs.

bdhirsh added a commit that referenced this pull request Oct 6, 2020
updated docs

ghstack-source-id: 6867916d2a3316d5896421a8906b1f64cb1495ad
Pull Request resolved: #45879
@@ -929,11 +935,32 @@ def all_reduce(tensor,
Async work handle, if async_op is set to True.
None, if not async_op or if not part of the group

Example:
Tensors are all of dtype torch.int64.
Contributor:

Use double backticks on all code:

``torch.int64``
``tensor = [[1, 1], [2, 2]]``

rank 1 passes:
tensor = [[3+3i, 3+3i], [4+4i, 4+4i]]
both rank 0 and 1 get:
tensor = [[4+4i, 4+4i], [6+6i, 6+6i]]
Contributor:

Could you please build the doc to verify that this renders correctly?

Contributor Author (bdhirsh):

Synced privately on this: I updated the doc to use Python REPL formatting for the examples and tested building the doc:

[screenshot of the rendered all_reduce documentation]
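
For readers without the screenshot, roughly what the REPL-formatted docstring example looks like (a sketch reconstructed from the snippets quoted elsewhere in this thread; the function signature is abbreviated and the exact wording in the merged docs may differ):

def all_reduce(tensor, op=ReduceOp.SUM, group=None, async_op=False):  # signature abbreviated
    """
    ...
    Examples:
        >>> # All tensors below are of torch.cfloat type.
        >>> # We have 2 process groups, 2 ranks.
        >>> tensor = torch.tensor([1+1j, 2+2j], dtype=torch.cfloat) + 2 * rank * (1+1j)
        >>> dist.all_reduce(tensor, op=ReduceOp.SUM)
        >>> tensor
        tensor([4.+4.j, 6.+6.j]) # Rank 0
        tensor([4.+4.j, 6.+6.j]) # Rank 1
    """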


bdhirsh added a commit that referenced this pull request Oct 7, 2020
updated docs
used standard python repl examples in the docs, tested the way
that they render in the browser

ghstack-source-id: f0fe2fdb0a1abf6f7a30019d132cc3d5900e8fd6
Pull Request resolved: #45879

>>> # Tensors are all of dtype torch.complex64.
>>> # We have 2 process groups, 2 ranks.
>>> tensor = torch.tensor([complex(1, 1), complex(2, 2)], dtype=torch.complex64) + 2 * complex(rank, rank)
Contributor:

tensor = torch.tensor([1+1j, 2+2j], dtype=torch.cdouble) + 2 * rank * (1+1j)

Unlike C++, Python can interpret the imaginary literal j directly. Also, maybe we should use torch.complex128 / torch.cdouble, since above we show an example with torch.int64 and not torch.int32.
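
A quick REPL illustration of that point (a sketch; rank stands in for the process rank, as in the surrounding doc examples):

>>> import torch
>>> complex(1, 1) == 1 + 1j       # Python parses the imaginary literal j directly
True
>>> rank = 1
>>> torch.tensor([1+1j, 2+2j], dtype=torch.cdouble) + 2 * rank * (1+1j)
tensor([3.+3.j, 4.+4.j], dtype=torch.complex128)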

tensor([4, 6]) # Rank 0
tensor([4, 6]) # Rank 1

>>> # Tensors are all of dtype torch.complex64.
Contributor:

nit - All tensors below are of torch.complex64 dtype

@@ -929,11 +935,36 @@ def all_reduce(tensor,
Async work handle, if async_op is set to True.
None, if not async_op or if not part of the group

Examples:
>>> # Tensors are all of dtype torch.int64.
Contributor:

nit - All tensors below are of torch.int64 dtype

@@ -1408,12 +1450,44 @@ def all_gather(tensor_list,
Async work handle, if async_op is set to True.
None, if not async_op or if not part of the group

Examples:
>>> # Tensors are all of dtype torch.int64.
Contributor:

nit - All tensors below are of torch.int64 dtype

[tensor([1, 2]), tensor([3, 4])] # Rank 0
[tensor([1, 2]), tensor([3, 4])] # Rank 1

>>> # Tensors are all of dtype torch.complex64.
Contributor:

nit - All tensors below are of torch.complex64 dtype

>>> tensor_list = [torch.zero(2, dtype=torch.complex64) for _ in range(2)]
>>> tensor_list
[tensor([0.+0.j, 0.+0.j]), tensor([0.+0.j, 0.+0.j])] # Rank 0 and 1
>>> tensor = torch.tensor([complex(1, 1), complex(2, 2)], dtype=torch.complex64) + 2 * complex(rank, rank)
Contributor:

tensor = torch.tensor([1+1j, 2+2j], dtype=torch.cdouble) + 2 * rank * (1+1j)

)

@staticmethod
def _all_reduce_coalesced_min_test_cases(group_size):
return (
[1, 4],
[2, 3],
[1, 3]
[1, 3],
[torch.float, torch.float],
Contributor @anjali411 (Oct 7, 2020):

It might be useful to explicitly mention in the documentation here: https://pytorch.org/docs/stable/distributed.html#torch.distributed.ReduceOp that MIN and MAX are not supported for complex tensors.

Contributor Author (bdhirsh):

Agreed. I also added explicit error checking for that case.

Contributor:

Sorry, I messed up the copy-paste earlier. Updated the comment with the link to the doc I was referring to.

Contributor Author (bdhirsh):

Yep, that's probably a better place to put it. Added.

@anjali411 added the module: complex label (Related to complex number support in PyTorch) on Oct 7, 2020
Contributor @gchanan left a comment:

Does complex support all of the reduce ops? I don't think we support max with complex, and viewing the tensor as real doesn't seem like it would give you something that makes much sense for max.


>>> # All tensors below are of torch.cdouble type.
>>> # We have 2 process groups, 2 ranks.
>>> tensor = torch.tensor([1+1j, 2+2j], dtype=torch.cdouble) + 2 * rank * (1+1j)
Contributor:

Actually, maybe it's best to let it be torch.cfloat, because that's the default complex type when the default dtype is set to torch.float. This is consistent with the above example, since the default int dtype is torch.long (torch.int64).
If we want to use torch.cdouble, all the following prints of tensor would look like:
tensor([1.+1.j, 2.+2.j], dtype=torch.complex128)

Contributor Author (bdhirsh):

Makes sense. Done.
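
For reference, a small REPL sketch of the dtype defaults being discussed (illustrative only; assumes the default dtype is torch.float32):

>>> import torch
>>> torch.tensor([1+1j, 2+2j]).dtype      # complex dtype inferred from Python complex literals
torch.complex64
>>> torch.cfloat == torch.complex64 and torch.cdouble == torch.complex128
True
>>> torch.tensor([1+1j, 2+2j], dtype=torch.cdouble)   # non-default dtype is spelled out in the repr
tensor([1.+1.j, 2.+2.j], dtype=torch.complex128)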


>>> # All tensors below are of torch.cfloat dtype.
>>> # We have 2 process groups, 2 ranks.
>>> tensor_list = [torch.zero(2, dtype=torch.float) for _ in range(2)]
Contributor:

nit - torch.zero(2, dtype=torch.cfloat)

bdhirsh added a commit that referenced this pull request Oct 8, 2020
updated docs
used standard python repl examples in the docs, tested the way
that they render in the browser

more doc fixes. Add an explicit error check for ReduceOps that
do not support complex (Max and Min), + tests for that case

ghstack-source-id: 4920be3e0cb551612c2f76a4fbcea2444f097558
Pull Request resolved: #45879
dr-ci bot commented Oct 8, 2020

💊 CI failures summary and remediations

As of commit 375fe36 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

codecov.io: 1 failed


This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your pull requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 13 times.

Contributor @mrshenli left a comment:

Code LGTM! Please also get a stamp from @anjali411

Also, there are a lot of unrelated test failures. Please rebase and rerun the tests before landing.

Comment on lines 54 to 56
if reduceOp == ReduceOp.MAX or reduceOp == ReduceOp.MIN or reduceOp == ReduceOp.PRODUCT:
    return False
return True
Contributor:

Is this the same as:

return True if reduceOp == ReduceOp.SUM else False?

Contributor Author (bdhirsh):

Cleaned this up a little to make it more Pythonic.

bdhirsh added a commit that referenced this pull request Oct 9, 2020
updated docs
used standard python repl examples in the docs, tested the way
that they render in the browser

more doc fixes. Add an explicit error check for ReduceOps that
do not support complex (Max and Min), + tests for that case

make error checking a bit more pythonic

ghstack-source-id: 57babd5380cf8eb464b66114a6532011b9aea4ac
Pull Request resolved: #45879
# We'd like calls to unsupported ops to error out accordingly,
# rather than returning garbage values.
def supports_complex(reduceOp: ReduceOp) -> bool:
    denyList = [ReduceOp.MAX, ReduceOp.MIN, ReduceOp.PRODUCT]
Contributor:

This looks great!
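
Putting the pieces together, a hedged sketch of how the deny-list check plausibly wires into all_reduce (the _prepare_for_reduce helper and the error message are hypothetical; the PR's actual code may differ):

import torch
from torch.distributed import ReduceOp

def supports_complex(reduceOp: ReduceOp) -> bool:
    denyList = [ReduceOp.MAX, ReduceOp.MIN, ReduceOp.PRODUCT]
    return reduceOp not in denyList

def _prepare_for_reduce(tensor: torch.Tensor, op: ReduceOp) -> torch.Tensor:
    # Complex tensors are viewed as real before being handed to the backend;
    # unsupported ops error out instead of silently reducing garbage.
    if tensor.is_complex():
        if not supports_complex(op):
            raise RuntimeError(f"all_reduce does not support {op} on complex tensors")
        return torch.view_as_real(tensor)
    return tensor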

Contributor @anjali411 left a comment:

LGTM overall, thanks Brian! My only other comment would be that we should add tests to ensure BAND, BOR, and BXOR work for complex.

@mrshenli will torch.distributed.autograd also work after this change?

bdhirsh added a commit that referenced this pull request Oct 12, 2020
updated docs
used standard python repl examples in the docs, tested the way
that they render in the browser

more doc fixes. Add an explicit error check for ReduceOps that
do not support complex (Max and Min), + tests for that case

make error checking a bit more pythonic

ghstack-source-id: bb0e96bc28664a059d8415124cc809556010ac6b
Pull Request resolved: #45879
Contributor @facebook-github-bot:
@bdhirsh merged this pull request in c02efde.


@facebook-github-bot deleted the gh/bdhirsh/22/head branch on October 16, 2020 14:21
@@ -44,6 +44,17 @@
except ImportError:
    _GLOO_AVAILABLE = False

# Some reduce ops are not supported by complex numbers.
Contributor:

The way this comment is written, it reads as if we allow calling them.

Contributor Author (bdhirsh):

@gchanan quick fix here: #46599

Labels
Merged · module: complex (Related to complex number support in PyTorch) · oncall: distributed (Add this issue/PR to distributed oncall triage queue)

5 participants