Add XLA backend for torch.distributed (#3339) #3378
Conversation
LGTM - Thanks @JackCaoG (pending the currently running CI tests passing).
    expected = torch.ones((2, 3)) * i
    assert torch.all(o.cpu() == expected), f'{o} != {expected}'
    expected0 = torch.zeros_like(input)
    assert torch.all(xoutput0.cpu() == expected0), f'{xoutput0} != {expected0}'
Not sure if this is needed?
We need to make sure the tensor has the expected value after all_gather.
Yes, but then shouldn't we reverse the order here?

    xoutput0 = xoutputs[0]  # copy
    dist.all_gather(xoutputs, xinput)
Oh, it is a special use case we need to support: we need to make sure xoutput0 also gets updated when we do all_gather on xoutputs.
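To make the aliasing requirement concrete, here is a minimal sketch of the property under discussion. It uses a single-process gloo group so it runs standalone; the XLA-backend test in this PR checks the same property on XLA tensors, and the port and fill value chosen here are arbitrary.

    import os
    import torch
    import torch.distributed as dist

    # Single-process group only, so the sketch runs without torch_xla.
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('gloo', rank=0, world_size=1)

    input = torch.full((2, 3), 7.0)
    outputs = [torch.zeros_like(input) for _ in range(dist.get_world_size())]
    output0 = outputs[0]  # reference taken before the collective
    dist.all_gather(outputs, input)
    # all_gather fills the output tensors in place, so a reference taken
    # beforehand must observe the gathered value as well.
    assert torch.all(output0 == input), f'{output0} != {input}'

    dist.destroy_process_group()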
    @@ -0,0 +1,38 @@
    import os
Probably not easy in the current test structure, but it would be more consistent if we could group the all_reduce tests in a single test file using the unit test framework. It might also be better to have a separate distributed test folder under test/.
Yeah, we should eventually have a separate repo for distributed tests.
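For reference, a grouped test file could look roughly like the sketch below. The file path, class name, and the single-process gloo setup used to keep it runnable are assumptions, not the PR's code; the real tests would initialize the 'xla' backend and move tensors to the XLA device.

    # Hypothetical grouped test file, e.g. test/distributed/test_xla_backend.py.
    import os
    import unittest

    import torch
    import torch.distributed as dist


    class CollectiveOpsTest(unittest.TestCase):

      @classmethod
      def setUpClass(cls):
        # Single-process gloo group so the sketch runs anywhere; a real test
        # would init_process_group('xla', ...) and use the XLA device.
        os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
        os.environ.setdefault('MASTER_PORT', '29501')
        dist.init_process_group('gloo', rank=0, world_size=1)

      @classmethod
      def tearDownClass(cls):
        dist.destroy_process_group()

      def test_all_reduce(self):
        world = dist.get_world_size()
        t = torch.ones((2, 3)) * dist.get_rank()
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        self.assertTrue(torch.all(t == torch.ones((2, 3)) * sum(range(world))))

      def test_all_gather(self):
        world = dist.get_world_size()
        t = torch.ones((2, 3)) * dist.get_rank()
        outputs = [torch.zeros_like(t) for _ in range(world)]
        dist.all_gather(outputs, t)
        for i, o in enumerate(outputs):
          self.assertTrue(torch.all(o == torch.ones((2, 3)) * i))


    if __name__ == '__main__':
      unittest.main()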
    '''ProcessGroup for XLA devices. See ProcessGroup for doc.

    Here we are implementing only a Python subclass. For implementing a
    C++/Python extension, see
Do we have the C++ binding in a separate PR?
Not sure if AWS plans to do that; it is also unclear to me how a C++ binding is going to help.
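For context, the Python-only route looks roughly like the sketch below: subclass ProcessGroup in Python and register a creator function under the 'xla' backend name. The creator signature (prefix_store, rank, size, timeout) is what torch.distributed around the 1.10/1.11 releases passes to registered third-party backends; treat the exact names here as assumptions rather than this PR's code.

    import torch.distributed as dist


    class ProcessGroupXla(dist.ProcessGroup):
      '''ProcessGroup for XLA devices, implemented purely in Python.'''

      def __init__(self, prefix_store, rank, size, timeout):
        super().__init__(rank, size)
        self.prefix_store = prefix_store
        self.timeout = timeout

      # Collective overrides (allreduce, allgather, ...) would go here.


    def _create_xla_process_group(prefix_store, rank, size, timeout):
      return ProcessGroupXla(prefix_store, rank, size, timeout)


    # After registration, dist.init_process_group('xla', ...) constructs the
    # Python subclass above; no C++ extension is involved.
    dist.Backend.register_backend('xla', _create_xla_process_group)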
    return WorkXla(output_tensors)

    # Call site:
    # https://github.com/pytorch/pytorch/blob/70f57bcb1e45d21532bdb1c44d3aab018d1cbe88/torch/distributed/distributed_c10d.py#L2683
Replace the reference link with https://github.com/pytorch/pytorch/blob/release/1.11/torch/distributed/distributed_c10d.py#L2774
I think it is OK; this PR was already submitted to master, and we don't need to replace the links for every release branch.
    else:
      raise ValueError(f'Invalid reduce op {reduce_op}')

    def allreduce(self, tensors, all_reduce_options):
Nit: inconsistent naming convention; should this use all_reduce() and all_gather(), like reduce_scatter below?
This API, if I am not mistaken, is inherited from torch.distributed, so it has to be allreduce. Example: https://github.com/pytorch/pytorch/blob/release/1.10/torch/distributed/distributed_c10d.py#L1217
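A minimal illustration of why the spelling is fixed (illustration only, not this PR's code): the frontend functions in distributed_c10d.py dispatch to the process group's allreduce()/allgather() methods, while reduce_scatter already happens to use the underscored spelling in the c10d interface.

    import torch.distributed as dist


    class ProcessGroupSketch(dist.ProcessGroup):
      '''Illustration only: the override names come from the c10d interface.'''

      def allreduce(self, tensors, opts):
        # dist.all_reduce(tensor) ends up calling group.allreduce([tensor], opts).
        raise NotImplementedError

      def allgather(self, output_tensors, input_tensors, opts):
        # dist.all_gather(tensor_list, tensor) calls group.allgather(...).
        raise NotImplementedError

      def reduce_scatter(self, output_tensors, input_tensors, opts):
        # Already underscored in the base interface, hence the inconsistency.
        raise NotImplementedError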
    raise NotImplementedError

    class WorkXla(Work):
Nit: we should comment/describe the class better, as its purpose is not obvious from the name.
We should fix that on master.
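A rough sketch of the kind of class description the nit asks for is below. The import path for Work, the constructor arguments, and the mark_step-based wait() are inferred from the diff context and from how lazy XLA execution generally works; they are assumptions, not the exact code in this PR.

    # Assumed import path for the c10d work handle; adjust to wherever Work
    # is exposed in your torch version.
    from torch._C._distributed_c10d import Work

    import torch_xla.core.xla_model as xm


    class WorkXla(Work):
      '''Work handle returned by ProcessGroupXla collectives.

      c10d callers use this handle to wait for a collective to finish. Because
      XLA executes lazily, wait() forces the pending graph (which contains the
      collective) to run, rather than polling an in-flight transport operation.
      '''

      def __init__(self, cc_tensors):
        super().__init__()
        # Tensors produced by the collective; kept alive until the work is done.
        self.cc_tensors = cc_tensors

      def wait(self):
        xm.mark_step()
        return True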
Mostly minor comments; approving, as the original PR was already approved.
* Add XLA backend for torch.distributed
* Add XLA backend multiprocessing tests.
* Linter fixes.
* Address Jack's comments.
* Fix multiprocessing tests: forgot to import the backend.
* Addressing Shen and Jack's comments.
* Fix a search/replace error.
* Fix typo in test_mp_all_gather_xla_backend to make it real.
* Use new reduce_scatter output param and use tensor.copy_ for all_gather result tensor to avoid graph execution.
* Lint fix.
* Fix TODO(alias).
* Add XRT_WORKERS and XRT_DEVICE_MAP setting back to the unit test as we do not aim to exercise GPU specific code in the unit test.
* Lint fix.
* Skip XLA backend unit tests for GPU/TPU.
* Address Jack's comments.
* Rename tests according to Jack's comment.
Force-pushed from d6f3e85 to ba3ed7a.
Cherry-picking #3339 to the release 1.11 branch.