Add new tpu backend for torch by siyuanfoundation · Pull Request #63442 · ray-project/ray

siyuanfoundation · 2026-05-18T15:53:03Z

Description

Add a new Torch trainer backend for torch_tpu.

Related to this announcement: https://developers.googleblog.com/torchtpu-running-pytorch-natively-on-tpus-at-google-scale/

Related issues

Additional information

gemini-code-assist

Code Review

This pull request introduces TPU support for Ray Train and Ray AIR, including a new TPUTorchDeviceManager and logic to inject environment variables for distributed TPU training. Feedback highlights a missing use_tpu field in ScalingConfig that will cause runtime errors and an empty registration method for the tpu_dist backend. It is also recommended to use math.prod instead of functools.reduce for more idiomatic calculations of TPU chip products.
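The `math.prod` suggestion can be illustrated with a small sketch. The PR's actual chip-count code is not shown in this thread, so the topology string format (`"2x2x4"`) and both function names below are assumptions made only for illustration.

```python
import math
import operator
from functools import reduce


def num_chips_reduce(topology: str) -> int:
    """Chip count via functools.reduce (the style the review flags)."""
    # Hypothetical input format: dimensions separated by "x", e.g. "2x2x4".
    return reduce(operator.mul, (int(dim) for dim in topology.split("x")), 1)


def num_chips_prod(topology: str) -> int:
    """Same computation with math.prod (available since Python 3.8)."""
    return math.prod(int(dim) for dim in topology.split("x"))
```

Both functions return the same value for any well-formed topology string; `math.prod` simply states the intent directly and drops the explicit initial value and the `operator` import.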

Signed-off-by: siyuanfoundation <sizhang@google.com>

ryanaoleary · 2026-05-19T21:14:30Z

+        return
+
    @staticmethod
    def set_current_process_visible_accelerator_ids(


This is only called from ray.util.set_visible_accelerator_ids, so it won't be set automatically, which I think is the intended goal. I would prefer adding something like set_accelerator_env_vars to AcceleratorManager, so that the path for TPU is:

1. On Ray node init, we call ResourceAndLabelSpec to pass required resources/labels to the Raylet.

2. Call set_accelerator_env_vars if the AcceleratorManager is not None.

3. For the TPU implementation, we check for the expected Torch and/or JAX env vars and set them in the raylet process if missing.

We may want to consider gating the above on whether the user passes some flag or env var indicating that they're going to use Torch TPU.

This would keep it extensible to other accelerator types.
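A rough sketch of the flow proposed above, under stated assumptions: the classes below are stand-ins and not Ray's real `AcceleratorManager` or TPU manager, `set_accelerator_env_vars` does not exist in Ray today, and the opt-in flag and the variable names and values are placeholders rather than real Torch or JAX settings.

```python
import os
from typing import Dict, MutableMapping, Optional, Type


class AcceleratorManagerSketch:
    """Stand-in base class; the default hook sets nothing."""

    @staticmethod
    def set_accelerator_env_vars(env: MutableMapping[str, str]) -> Dict[str, str]:
        return {}


class TPUManagerSketch(AcceleratorManagerSketch):
    # Hypothetical opt-in flag and placeholder defaults (assumptions).
    OPT_IN_FLAG = "EXAMPLE_ENABLE_TORCH_TPU_ENV"
    EXPECTED_DEFAULTS = {"EXAMPLE_TPU_VAR_A": "1", "EXAMPLE_TPU_VAR_B": "0"}

    @staticmethod
    def set_accelerator_env_vars(env: MutableMapping[str, str]) -> Dict[str, str]:
        # Gate on an explicit user opt-in so other workloads are unaffected.
        if env.get(TPUManagerSketch.OPT_IN_FLAG) != "1":
            return {}
        applied: Dict[str, str] = {}
        for key, value in TPUManagerSketch.EXPECTED_DEFAULTS.items():
            # Only fill in missing variables; never override user settings.
            if key not in env:
                env[key] = value
                applied[key] = value
        return applied


def on_node_init(
    manager: Optional[Type[AcceleratorManagerSketch]],
    env: Optional[MutableMapping[str, str]] = None,
) -> Dict[str, str]:
    """Call the hook only when an accelerator manager was detected."""
    target = os.environ if env is None else env
    if manager is None:
        return {}
    return manager.set_accelerator_env_vars(target)
```

Because the hook lives on the base class with a no-op default, other accelerator types could override it without any change to the node-init call site, which is the extensibility point the comment raises.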

ryanaoleary · 2026-05-19T21:15:33Z

+            )
+
+    def set_device(self, device: Union[torch.device, int, str, None]):
+        # TPU device setting is typically handled by torch_tpu.api.tpu_device()


Avoid a silent pass here: if we expect set_device to be called but want to force it to be a no-op, add some debug logging.
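One possible shape for a logged no-op is sketched below. The class name and logger name are hypothetical, and `torch.device` is left out of the type hint only to keep the sketch free of dependencies; the PR's real signature includes it.

```python
import logging
from typing import Union

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("tpu_device_manager_sketch")


class TPUDeviceManagerSketch:
    """Stand-in for the PR's TPUTorchDeviceManager; not the real class."""

    def set_device(self, device: Union[int, str, None]) -> None:
        # Intentionally a no-op: per the code comment in the PR, TPU device
        # selection is typically handled by torch_tpu.api.tpu_device().
        # Logging at debug level makes the skipped call visible when tracing.
        logger.debug("set_device(%r) is a no-op on TPU; ignoring.", device)
```

With debug logging enabled, a caller who expects `set_device` to take effect can see that the call was received and deliberately ignored.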

gemini-code-assist Bot reviewed May 18, 2026


siyuanfoundation force-pushed the torchtpu branch from ed271a8 to 1cafa9f Compare May 19, 2026 19:40

Add new tpu backend for torch

6fa7706

Signed-off-by: siyuanfoundation <sizhang@google.com>

siyuanfoundation force-pushed the torchtpu branch from 1cafa9f to 6fa7706 Compare May 19, 2026 19:51

ryanaoleary reviewed May 19, 2026

siyuanfoundation wants to merge 1 commit into ray-project:master from siyuanfoundation:torchtpu
