Add new tpu backend for torch #63442
Code Review
This pull request introduces TPU support for Ray Train and Ray AIR, including a new TPUTorchDeviceManager and logic to inject environment variables for distributed TPU training. Feedback highlights two issues: a missing use_tpu field in ScalingConfig, which will cause runtime errors, and an empty registration method for the tpu_dist backend. It also recommends math.prod over functools.reduce as the more idiomatic way to calculate TPU chip products.
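To illustrate the math.prod suggestion, here is a minimal sketch comparing the two spellings. The function names and the example bounds tuple are hypothetical and are not taken from the PR's code; only math.prod, functools.reduce, and operator.mul are real standard-library calls (math.prod requires Python 3.8 or later).

```python
import math
import operator
from functools import reduce
from typing import Sequence


def chip_count_with_reduce(bounds: Sequence[int]) -> int:
    # The older spelling: fold multiplication over the dimensions.
    # The explicit initial value 1 keeps an empty sequence from raising.
    return reduce(operator.mul, bounds, 1)


def chip_count_with_prod(bounds: Sequence[int]) -> int:
    # The more idiomatic spelling: math.prod multiplies all items
    # and returns 1 for an empty sequence.
    return math.prod(bounds)


# Hypothetical per-host chip bounds, for illustration only.
example_bounds = (2, 2, 1)
total_chips = chip_count_with_prod(example_bounds)
```

Both functions return the same value for any sequence of integers; math.prod simply drops the extra imports and the explicit initial value.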
ed271a8 to 1cafa9f
Signed-off-by: siyuanfoundation <sizhang@google.com>
1cafa9f to 6fa7706
        return

    @staticmethod
    def set_current_process_visible_accelerator_ids(
This is only called from ray.util.set_visible_accelerator_ids, so it won't be set automatically, which I think is the intended goal. I would prefer adding something like set_accelerator_env_vars to AcceleratorManager, and then the path for TPU is:
- On Ray node init, we call ResourceAndLabelSpec to pass required resources/labels to the Raylet.
- Call set_accelerator_env_vars if the AcceleratorManager is not None.
- For the TPU implementation, we check for the expected Torch and/or JAX env vars and set them in the raylet process if missing.
- We may want to consider gating the above on whether the user passes some flag or env var that indicates they're going to use Torch TPU.
This would keep it extensible to other accelerator types.
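The flow proposed above could look roughly like the following. This is a sketch under stated assumptions, not Ray's actual API: the class names, the default_env_vars helper, the on_node_init function, the opt-in flag, and the EXAMPLE_* variable names are all hypothetical placeholders, since the comment does not name the specific Torch or JAX variables.

```python
import os
from typing import Dict, MutableMapping, Optional


class AcceleratorManagerSketch:
    """Hypothetical stand-in for an accelerator manager base class."""

    @classmethod
    def default_env_vars(cls, env: MutableMapping[str, str]) -> Dict[str, str]:
        # Accelerators with nothing to export keep this default.
        return {}

    @classmethod
    def set_accelerator_env_vars(
        cls, env: Optional[MutableMapping[str, str]] = None
    ) -> Dict[str, str]:
        """Set only the missing variables and return the ones that were added."""
        target = os.environ if env is None else env
        added: Dict[str, str] = {}
        for name, value in cls.default_env_vars(target).items():
            if name not in target:  # never overwrite a user-provided value
                target[name] = value
                added[name] = value
        return added


class TPUManagerSketch(AcceleratorManagerSketch):
    # Hypothetical opt-in flag; the real gating mechanism is undecided.
    OPT_IN_FLAG = "EXAMPLE_USE_TORCH_TPU"

    @classmethod
    def default_env_vars(cls, env: MutableMapping[str, str]) -> Dict[str, str]:
        # Gate on the opt-in flag so other workloads are unaffected.
        if env.get(cls.OPT_IN_FLAG) != "1":
            return {}
        # Placeholder names and values, for illustration only.
        return {"EXAMPLE_TPU_VAR_A": "0", "EXAMPLE_TPU_VAR_B": "1"}


def on_node_init(manager, env: Optional[MutableMapping[str, str]] = None):
    """Mirror the 'call it if the manager is not None' step."""
    if manager is None:
        return {}
    return manager.set_accelerator_env_vars(env)
```

Keeping the "set only if missing" loop in the base class means each accelerator type only has to declare which variables it expects, which is what would make the hook extensible beyond TPU.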
        )

    def set_device(self, device: Union[torch.device, int, str, None]):
        # TPU device setting is typically handled by torch_tpu.api.tpu_device()
Avoid a silent pass here. If we expect set_device to be called but want to force it to be a no-op, add some debug logging.
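One way to act on this suggestion is sketched below. The class name and logger name are hypothetical, and the device parameter is typed loosely so the snippet runs without torch installed; the real method in the PR takes Union[torch.device, int, str, None].

```python
import logging
from typing import Any

# Hypothetical logger name, for illustration only.
logger = logging.getLogger("tpu_device_manager_sketch")


class TPUDeviceManagerSketch:
    """Hypothetical stand-in for the PR's TPUTorchDeviceManager."""

    def set_device(self, device: Any) -> None:
        # Intentionally a no-op: per the PR's comment, TPU device setting is
        # typically handled by torch_tpu.api.tpu_device(). Logging at debug
        # level makes the skipped call visible instead of passing silently.
        logger.debug(
            "set_device(%r) is a no-op on TPU; device selection is handled "
            "by the TPU runtime.",
            device,
        )
```

Debug level keeps normal runs quiet while still leaving a trace for anyone investigating why a device argument had no effect.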
Description
Add a new torch trainer backend for torch_tpu.
Related to this announcement: https://developers.googleblog.com/torchtpu-running-pytorch-natively-on-tpus-at-google-scale/