implement send and recv using collective_permute #9373
Conversation
WARNING: This function is not very reliable, may produce wrong results under
certain inputs. Use it at your own risk.
As discussed in #8815, there's no context for this ancient warning. Given its age, the lack of details, and the absence of any other reported bugs, I think it's best to remove it. If we get a specific bug report, we can act on that.
dist.init_process_group("xla", init_method='xla://')
device = torch_xla.device()
world_size = xr.world_size()
cutoff = world_size // 2
I think if the world size is not even, this test will hang. For example, if the world size is 3, then index 0 will send to 1 and index 1 will recv from 0, but index 2 will try to recv from 1 without an associated send.
Good point. I'll update the test so that it is more defensive
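For instance, the guard could look something like this (a rough sketch, not the PR's actual test code; skipping is just one option, the test could also pair the leftover rank with itself):

```python
import unittest

import torch_xla.runtime as xr

world_size = xr.world_size()
if world_size % 2 != 0:
  # Bail out instead of hanging: with an odd world size the last rank would
  # wait on a recv that no rank ever sends.
  raise unittest.SkipTest(
      f"send/recv pairing requires an even world size, got {world_size}")
cutoff = world_size // 2
```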
torch_xla/distributed/xla_backend.py
Outdated
logging.warning(
    "Individual send/recv ops are inefficient on an XLA device. Consider using xla_model.collective_permute()."
)
Does it happen to print every time we trace?
Probably. I'm not sure how to make it print only once -- I'll look into it.
I checked around and couldn't find a built-in way to do this through logging.warning. Given this is at warning level and can be filtered out, is it worth seeking a solution?
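One possible workaround, if it seems worth it (an illustrative sketch, not something the logging module provides out of the box): memoize the warning so repeated tracing only emits it once per process.

```python
import functools
import logging


@functools.lru_cache(maxsize=None)
def _warn_once(message: str) -> None:
  # lru_cache means each distinct message is logged at most once per process,
  # no matter how many times send()/recv() are traced.
  logging.warning(message)


_warn_once("Individual send/recv ops are inefficient on an XLA device. "
           "Consider using xla_model.collective_permute().")
```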
# test/test_torch_distributed_xla_backend.py for an example.
def make_recv_channel_id(self, src_rank, tag):
  raise NotImplementedError

# Call site e.g.
# https://github.com/pytorch/pytorch/blob/release/1.10/torch/distributed/distributed_c10d.py#L913
def recv(self, out_tensors, src_rank, tag=0):
Do we need the warning on the recv end too, so each host has it?
# test/test_torch_distributed_xla_backend.py for an example.
def make_send_channel_id(self, dst_rank, tag):
  raise NotImplementedError

# Call site e.g.
# https://github.com/pytorch/pytorch/blob/release/1.10/torch/distributed/distributed_c10d.py#L877
def send(self, tensors, dst_rank, tag=0):
If we're warning users to use collective_permute, but the op already ends up using a collective_permute under the hood, should the warning itself be clearer that this is what's happening?
I could word this better. The real advice is to restructure your code so that each process calls collective_permute with all of the send/recv pairs.
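Roughly, that restructuring looks like this (a hedged sketch; the first-half-to-second-half pairing mirrors the test above, everything else is illustrative):

```python
import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr

device = torch_xla.device()
world_size = xr.world_size()
cutoff = world_size // 2
tensor = torch.full((4,), float(xr.global_ordinal()), device=device)

# Every process traces the same op with the full list of (source, target)
# pairs: rank i sends to rank i + cutoff for the first half of the devices.
pairs = [[i, i + cutoff] for i in range(cutoff)]
shifted = xm.collective_permute(tensor, pairs)
torch_xla.sync()
```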
@@ -326,6 +326,28 @@ def test_all_to_all_single(self, use_dynamo):
expected.sort().values),
f"Got {val}, expected {expected}")

@staticmethod
Last time we checked, we also noticed that https://github.com/pytorch/xla/blob/master/test/test_mp_collective_permute.py didn't work on the CPU, but send/recv did. We might want to double check it.
Is test/test_torch_distributed_xla_backend.py tested for CPU and Neuron? Would it be possible to test it and see if the change is compatible?
Is test/test_torch_distributed_xla_backend.py tested for CPU and Neuron? Would it be possible to test it and see if the change is compatible?
It is, but it just checks that the expected IR is emitted. It doesn't run anything. And in this case it wasn't a reliable test because, at least for TPU, that IR does not actually run.
test_mp_collective_permute is run for both TPU and Neuron. I don't think it works for CPU but neither do send/recv. The success of test_mp_collective_permute indicates this change should work for Neuron, but to be more certain I could add a test that covers a pipeline-like transfer in addition to the existing test of a permutation-like transfer.
The most direct test would be something like what's in test_collective_ops_tpu.py, which runs the ops to completion, for Neuron.
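For clarity, the two shapes of transfer I mean, expressed as collective_permute source/target pairs (a 4-device example, purely illustrative):

```python
# Pipeline-like: only the first half sends, only the second half receives.
pipeline_pairs = [[0, 2], [1, 3]]
# Permutation-like: every device both sends and receives.
permutation_pairs = [[0, 1], [1, 2], [2, 3], [3, 0]]
```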
The most direct test would be something like what's in test_collective_ops_tpu.py, which runs the ops to completion, for Neuron.
This would be great. Any chance we can move it outside of this file and make it general? I can help test it out if so. Otherwise, I'll need to follow up on whether we can port this entire file to Neuron. I see tpu.num_expected_global_devices and pjrt.run_multiprocess, but haven't seen/used these before.
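For reference, the pattern in that file is roughly the following (a simplified sketch from memory; treat the module paths and the dict-of-results return shape of run_multiprocess as assumptions to verify):

```python
import torch
import torch.distributed as dist
import torch_xla
import torch_xla.distributed.xla_backend  # noqa: F401  (registers the "xla" backend)
import torch_xla.runtime as xr
from torch_xla._internal import pjrt
from absl.testing import absltest


class SendRecvSmokeTest(absltest.TestCase):

  @staticmethod
  def _send_recv():
    # Runs inside each spawned worker, one process per device.
    dist.init_process_group("xla", init_method="xla://")
    device = torch_xla.device()
    rank = xr.global_ordinal()
    world_size = xr.world_size()
    cutoff = world_size // 2
    tensor = torch.full((4,), float(rank), device=device)
    if rank < cutoff:
      dist.send(tensor, rank + cutoff)
    else:
      dist.recv(tensor, rank - cutoff)
    torch_xla.sync()
    return tensor.cpu()

  def test_send_recv(self):
    # run_multiprocess spawns one worker per local device and gathers each
    # worker's return value, keyed by its global ordinal.
    results = pjrt.run_multiprocess(self._send_recv)
    cutoff = len(results) // 2
    for rank, value in results.items():
      expected = float(rank) if rank < cutoff else float(rank - cutoff)
      torch.testing.assert_close(value, torch.full((4,), expected))


if __name__ == "__main__":
  absltest.main()
```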
test/pjrt/test_collective_ops_tpu.py
Outdated
dist.recv(tensor, index - cutoff)
return tensor.cpu()

def test_send_recv(self):
The original tests exercised send and receive separately. While this version is more concise, it might be harder to debug because it won't be obvious which op is at fault.
I think keeping a test for the full interaction is valid, but is there a way to also replicate the two tests that existed previously?
send and recv don't work independently. The original test was a "dry run" -- it checked the IR but didn't execute it. If it did execute, it would fail.
torch_xla/distributed/xla_backend.py
Outdated
# in the sending process it is unchanged. The solution used here is to
# have every process copy a linear combination of the two tensors, but
# send/recv use different coefficients to achieve different outcomes.
This took a couple of reads until I understood what was going on here. My understanding is that by having both sides compute result_t * X + t * Y, the two operations trace the same IR, since X and Y are constants; that way, when the IRs are compared, they will be equivalent.
If this understanding is correct, could you add a little bit more here to make it more apparent?
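For illustration, here's how I read the trick (a sketch, not the PR's code; the exact coefficient choice and pair layout are my assumptions):

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr


def _send_recv_via_permute(t, src_rank, dst_rank):
  # Both ranks trace the same sequence of ops, so the IR has the same
  # structure; only the scalar coefficient constants differ between them.
  result = xm.collective_permute(t, [[src_rank, dst_rank]])
  rank = xr.global_ordinal()
  # Receiver keeps the permuted result (a=1, b=0); sender keeps its own
  # tensor unchanged (a=0, b=1).
  a, b = (1.0, 0.0) if rank == dst_rank else (0.0, 1.0)
  t.copy_(a * result + b * t)
  return t
```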
# test/test_torch_distributed_xla_backend.py for an example.
def make_recv_channel_id(self, src_rank, tag):
  raise NotImplementedError

# Call site e.g.
# https://github.com/pytorch/pytorch/blob/release/1.10/torch/distributed/distributed_c10d.py#L913
def recv(self, out_tensors, src_rank, tag=0):
We should not assume someone reading "recv" will have read the documentation for "send", so I think we should add documentation here too. I would then add a note on each of their comments about what the IR expectation is for "send" and "recv".
The approach implemented here works for a "pipeline"-type operation but does not work for a "permutation"-type operation. The way this is commonly done in native PyTorch to avoid deadlocks is that half of the devices send and the other half receive, then they switch roles. This means the sending and receiving tensors must be different, and one half of the devices ends up with different IR than the other half, resulting in a deadlock. I'm still searching for a way around this.
The only way I was able to make a "permutation"-type op (every device sends and every device receives) work is by inserting a sync after each set of send/recv. This is not ideal. It's better than the status quo for TPU, which is that send/recv don't work at all. But since Neuron does have something working, I'll defer to you @rpsilva-aws. We can put this on ice until the Send/Recv XLA ops can be called directly.
Hm, that does complicate things... I have it working on TRN, though I deviated a bit with multi-operands to capture tokens. I'll end up creating a PR for this one, which would build upon the work you had in the prior commits. Actually, TRN has the same limitation for send/recv, requiring a graph break. Do you think we can merge this PR without the sync, since it's working for existing devices (e.g. TRN), and revisit as we figure out the underlying issues with TPU? If you want to defer until the new ops, or we re-raise the need as we bring in our work, both are OK with me.
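For reference, the half-send/half-recv pattern plus the sync workaround discussed above looks roughly like this (an illustrative sketch assuming an even world size; not the code in either PR):

```python
import torch
import torch.distributed as dist
import torch_xla


def ring_exchange(tensor, rank, world_size):
  # Native-PyTorch style: even ranks send first and receive second, odd
  # ranks do the opposite, so no rank blocks on an unmatched op.
  received = torch.empty_like(tensor)
  right = (rank + 1) % world_size
  left = (rank - 1) % world_size
  if rank % 2 == 0:
    dist.send(tensor, right)
  else:
    dist.recv(received, left)
  # The graph break discussed above: without it the two halves trace
  # different IR and the collective deadlocks.
  torch_xla.sync()
  if rank % 2 == 0:
    dist.recv(received, left)
  else:
    dist.send(tensor, right)
  torch_xla.sync()
  return received
```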
There are two tests in the PR,
I'd be interested in seeing that.
#9315