[TPU] add tpu_inference #27277
Conversation
Code Review
This pull request transitions the TPU backend to use the tpu-inference library, updating dependencies and import paths accordingly. The changes correctly adapt vLLM to the new library version. However, I've identified a critical issue with the dependency management in requirements/tpu.txt. By removing torch_xla, the fallback mechanism for when tpu-inference is unavailable becomes non-functional, which could lead to confusing runtime errors. My review includes a suggestion to address this to ensure the system remains robust.
requirements/tpu.txt (Outdated)

tpu-inference==0.11.1
numba
Removing torch_xla from this requirements file makes the fallback path in the code fragile. The codebase contains logic to fall back to a torch_xla-based implementation if tpu-inference is not found or fails to import (e.g., in vllm/platforms/tpu.py).
With torch_xla removed from the dependencies, if tpu-inference is installed but fails to import for any reason (like a transitive dependency issue), the fallback will immediately fail with an ImportError for torch_xla. This creates a brittle setup and can cause confusing errors for users.
To ensure the fallback mechanism is robust, please re-add torch_xla to the dependencies. This ensures that even if the primary tpu-inference path fails, the system can gracefully fall back to the torch_xla-based implementation.
tpu-inference==0.11.1
numba
# Install torch_xla for fallback
torch_xla[tpu, pallas]==2.8.0
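For context, the fallback being discussed is an import-time guard of roughly the following shape. This is a minimal sketch only; the actual code in vllm/platforms/tpu.py may be structured differently, and everything other than the tpu_inference, torch_xla, and USE_TPU_INFERENCE names taken from this PR is illustrative.

# Sketch of the fallback guard the comment refers to (illustrative, not the
# exact vLLM code).
try:
    import tpu_inference  # noqa: F401  # preferred TPU backend
    USE_TPU_INFERENCE = True
except ImportError:
    USE_TPU_INFERENCE = False

if not USE_TPU_INFERENCE:
    # Without torch_xla listed in requirements/tpu.txt, this fallback import
    # itself raises ImportError, so the "graceful" fallback path never runs.
    import torch_xla.core.xla_model as xm  # noqa: F401

Re-adding torch_xla to the requirements keeps the except/fallback branch executable rather than turning it into a second ImportError.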
💡 Codex Review
vllm/vllm/distributed/device_communicators/tpu_communicator.py
Lines 22 to 62 in 89ad99a
if not USE_TPU_INFERENCE:
    logger.info("tpu_inference not found, using vLLM's TpuCommunicator")
    if current_platform.is_tpu():
        import torch_xla
        import torch_xla.core.xla_model as xm
        import torch_xla.runtime as xr
        from torch_xla._internal import pjrt
        from torch_xla.distributed.xla_multiprocessing import (
            create_optimized_replica_groups,
        )
    if USE_RAY:
        from vllm.executor import ray_utils


class TpuCommunicator(DeviceCommunicatorBase):
    def __init__(
        self,
        cpu_group: ProcessGroup,
        device: torch.device | None = None,
        device_group: ProcessGroup | None = None,
        unique_name: str = "",
    ):
        super().__init__(cpu_group, device, device_group, unique_name)

        # NOTE(woosuk): When using TP > 1 on TPUs, every TPU on the same node
        # must be used together. Therefore, the local rank and world size can
        # be simply calculated as follows.
        global_rank = self.global_rank
        global_world_size = self.global_world_size

        if USE_RAY:
            logger.info("TpuCommunicator initialized with RAY")
            # Calculate how many TPU nodes are in the current deployment. This
            # is the Ray placement group if it is deployed with Ray. Default
            # to the number of TPU nodes in the Ray cluster. The number of TPU
            # nodes is computed by the total number of TPUs divided by the
            # number of TPU accelerators per node, to account for clusters
            # with both CPUs and TPUs.
            num_nodes = ray_utils.get_num_tpu_nodes()
            num_nodes_in_pg = ray_utils.get_num_nodes_in_placement_group()
With tpu_inference installed USE_TPU_INFERENCE becomes True, but this module now always exposes the local TpuCommunicator instead of aliasing to the implementation in tpu_inference. The conditional import of ray_utils is still wrapped inside if not USE_TPU_INFERENCE, yet __init__ uses ray_utils whenever the distributed executor backend is Ray. In a TPU inference deployment that uses the Ray backend, ray_utils will be undefined and TpuCommunicator.__init__ will raise NameError before any communication is set up. The Ray utilities need to be imported regardless of USE_TPU_INFERENCE (or the Ray-specific logic needs a similar guard).
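One way to address this, sketched below using only the names visible in the snippet above, is to hoist the Ray import out of the USE_TPU_INFERENCE guard so it depends on USE_RAY alone; alternatively, the Ray-specific branch in __init__ could get an equivalent guard.

# Sketch: make the ray_utils import depend only on USE_RAY, so
# TpuCommunicator.__init__ can reach it whichever TPU backend is active.
if USE_RAY:
    from vllm.executor import ray_utils

if not USE_TPU_INFERENCE:
    logger.info("tpu_inference not found, using vLLM's TpuCommunicator")
    if current_platform.is_tpu():
        import torch_xla
        import torch_xla.core.xla_model as xm
        import torch_xla.runtime as xr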
Force-pushed from cb27073 to db10aa5
💡 Codex Review
Here are some automated review suggestions for this pull request.
def all_gather(self, input_: torch.Tensor, dim: int = -1) -> torch.Tensor:
    assert dim == -1, "TPUs only support dim=-1 for all-gather."
    return xm.all_gather(input_, dim=dim)
Communicator now crashes when tpu_inference is present
With this change the alias that swapped in tpu_inference's communicator was removed, but the guard at lines 22-31 still skips importing torch_xla, xm, xr, etc. whenever USE_TPU_INFERENCE is True. Because tpu-inference is now listed in requirements/tpu.txt, that flag is effectively always True on TPU installs. The vLLM TpuCommunicator defined in this module is therefore still used, yet its methods call xm.* / torch_xla.* (lines 92-99) even though those symbols were never imported, causing a NameError as soon as the communicator is instantiated or all_reduce/all_gather is called. Either the communicator needs to keep delegating to tpu_inference, or the XLA imports must happen unconditionally; otherwise TPU runs fail immediately when tpu-inference is installed.
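For illustration, the second option mentioned above (performing the XLA imports unconditionally on TPU) would look roughly like the sketch below; it uses only names already present in the snippet, and note that it would also require keeping torch_xla in requirements/tpu.txt, as discussed earlier.

# Sketch: import the XLA symbols whenever we are on a TPU, independent of
# USE_TPU_INFERENCE, so all_reduce/all_gather can always resolve xm.
if current_platform.is_tpu():
    import torch_xla
    import torch_xla.core.xla_model as xm
    import torch_xla.runtime as xr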
Force-pushed from 50e1f4a to dabc072
Signed-off-by: Johnny Yang <johnnyyang@google.com>
Force-pushed from dabc072 to d65756c
yaochengji left a comment
LGTM, thanks!
Signed-off-by: Johnny Yang <johnnyyang@google.com> Signed-off-by: Benjamin Feuer <penfever@gmail.com>
Signed-off-by: Johnny Yang <johnnyyang@google.com>
Signed-off-by: Johnny Yang <johnnyyang@google.com> Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
Signed-off-by: Johnny Yang <johnnyyang@google.com> Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com>
Purpose
Add tpu-inference for vLLM TPU.
Test Plan
E2E tests on v6e.
Test Result
Successful
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.