
Conversation

@kyuyeunk (Collaborator) commented Aug 19, 2025

Description

Instead of overriding vLLM layers with custom JAX layers, this PR utilizes the existing vLLM layers and uses their provided APIs and templates (such as get_quantization_config and process_weights_after_loading) to call our own JAX code.

This loses some flexibility but has the following benefits:

  1. We don't have to implement everything from scratch and can leverage pre-existing vLLM APIs.
  2. It plays more nicely with existing vLLM configs and features, which helps customers migrate from vLLM GPU to vLLM TPU.

Using this change, I was able to implement FP8 model support for torchax with relative ease; I'll create a follow-up PR for it.
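To illustrate the overall shape (not the exact code in this PR; the class and helper names below are placeholders, and the vLLM base-class signatures may differ slightly between versions), a quantization method that keeps vLLM's linear layers but routes weight post-processing and the matmul through JAX would look roughly like:

```python
# Rough sketch of the pattern, not the code in this PR. JaxLinearMethod,
# prepare_weight_for_jax and jax_linear are placeholder names; exact vLLM
# base-class signatures may differ between versions.
import torch
from vllm.model_executor.layers.linear import LinearMethodBase
from vllm.model_executor.utils import set_weight_attrs


def prepare_weight_for_jax(weight: torch.Tensor) -> torch.Tensor:
    # Placeholder: convert / re-lay-out the loaded torch weight for JAX kernels.
    return weight.contiguous()


def jax_linear(x: torch.Tensor, weight: torch.Tensor, bias=None) -> torch.Tensor:
    # Placeholder: stands in for the JAX matmul path (e.g. via torchax).
    return torch.nn.functional.linear(x, weight, bias)


class JaxLinearMethod(LinearMethodBase):
    """Reuses vLLM's linear layers but routes weight prep and compute to JAX."""

    def create_weights(self, layer, input_size_per_partition,
                       output_partition_sizes, input_size, output_size,
                       params_dtype, **extra_weight_attrs):
        # Let vLLM allocate and load the weights exactly as it normally does.
        weight = torch.nn.Parameter(
            torch.empty(sum(output_partition_sizes), input_size_per_partition,
                        dtype=params_dtype),
            requires_grad=False)
        set_weight_attrs(weight, extra_weight_attrs)
        layer.register_parameter("weight", weight)

    def process_weights_after_loading(self, layer):
        # Called once after checkpoint loading: convert weights for JAX here.
        layer.weight = torch.nn.Parameter(
            prepare_weight_for_jax(layer.weight.data), requires_grad=False)

    def apply(self, layer, x, bias=None):
        # Forward pass goes through our JAX path instead of vLLM's default.
        return jax_linear(x, layer.weight, bias)
```

A matching QuantizationConfig subclass would then hand this method out from its get_quant_method hook for linear layers.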

Tests

Ran the following models and verified that performance has not changed:

| model | before | after |
| --- | --- | --- |
| Qwen/Qwen2.5-14B-Instruct | 18.36 | 18.37 |
| Qwen/Qwen2.5-32B | 12.42 | 12.65 |
| Qwen/Qwen3-32B | 12.07 | 12.14 |
| RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w8a8 | 8.82 | 8.72 |
| google/gemma-3-27b-it | 19.22 | 19.19 |
| meta-llama/Llama-3.1-70B-Instruct | 7.55 | 7.55 |
| mistralai/Codestral-22B-v0.1 | 14.94 | 14.98 |
| mistralai/Mistral-Small-24B-Instruct-2501 | 20.39 | 20.85 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 18.51 | 18.61 |

Checklist

Before submitting this PR, please make sure:

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have made or will make corresponding changes to any relevant documentation.

@github-actions

Description

Start with a short description of what the PR does and how this is a change from
the past.

The rest of the description includes relevant details and context, examples:

  • why is this change being made,
  • the problem being solved and any relevant context,
  • why this is a good solution,
  • some information about the specific implementation,
  • shortcomings of the solution and possible future improvements.

If the change fixes a bug or a Github issue, please include a link, e.g.,:
FIXES: b/123456
FIXES: #123456

Tests

Please describe how you tested this change, and include any instructions and/or
commands to reproduce.

Checklist

Before submitting this PR, please make sure:

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have made or will make corresponding changes to any relevant documentation.

@kyuyeunk (Collaborator, Author)

@hfan @yaochengji @vanbasten23 work in progress. Pinging to get some early feedback.

@kyuyeunk force-pushed the torchax_quant_api branch 2 times, most recently from d3a1945 to 577f5cd on August 21, 2025 06:47
@kyuyeunk changed the title from "Support Quantization API" to "Refactor torchax layers to use vLLM APIs" on Aug 21, 2025
@kyuyeunk requested a review from hfan on August 21, 2025 06:48
@kyuyeunk marked this pull request as ready for review on August 21, 2025 06:48
@kyuyeunk (Collaborator, Author) commented Aug 21, 2025

> @hfan @yaochengji @vanbasten23 work in progress. Pinging to get some early feedback.

+ @yarongmu-google @bythew3i @QiliangCui @lsy323

This PR is ready for review. This is a pretty big change so please feel free to request any additional test results or design discussion.

@kyuyeunk force-pushed the torchax_quant_api branch 4 times, most recently from 28c1e0a to f65a27b on August 21, 2025 07:31
@kyuyeunk force-pushed the torchax_quant_api branch 2 times, most recently from c47c114 to abe3c95 on August 21, 2025 16:44
@vanbasten23 (Collaborator)

Considering it's a large change, could you enable the torchax+jax_runner tests in the CI, make sure they pass, and then merge? AFAIK, the torchax+jax_runner tests are disabled in the CI today.

@yarongmu-google (Collaborator)

Thanks @kyuyeunk. cc @bvrockwell: this is the refactor that brings torchax much closer to vLLM upstream to maximize reuse.

@vanbasten23 (Collaborator)

Just out of curiosity, do you know why we created our own layers such as JaxMergedColumnParallelLinear?

@kyuyeunk force-pushed the torchax_quant_api branch 3 times, most recently from e8b978d to 5687edb on August 27, 2025 04:14
@kyuyeunk (Collaborator, Author) left a comment

@hfan As mentioned in one of the comments, I've separated part of the logic out into a separate PR: #590

I intend to merge that PR first and then submit this PR.

Additionally, I'm running performance benchmarks on all models that we are tracking to confirm that performance has not changed. I will report back when I'm finished.

token_num = x.shape[0]
# NOTE(chengjiyao): make sure the sharded token_num is larger than TPU_SECOND_LAST_MINOR
if token_num // self.mesh.shape['model'] >= TPU_SECOND_LAST_MINOR:
    out.shard_(NamedSharding(self.mesh, P('model', None)))
@kyuyeunk (Collaborator, Author)

Fixed it!

- with torchax.default_env():
+ with torchax.default_env(), set_vllm_model_wrapper_context(
+         kv_caches=None, mesh=self.mesh), set_forward_context(
+         attn_metadata=None, vllm_config=self.vllm_config):
@kyuyeunk (Collaborator, Author)

If we don't add set_forward_context, this part of the vLLM code complains that the forward context is not set: https://github.com/vllm-project/vllm/blob/585e0bde36abdb2ab2967fd42005cbe62459020e/vllm/attention/layer.py#L268

I've considered wrapping set_forward_context inside set_vllm_model_wrapper_context so that a single call sets both vllm_model_wrapper_context and forward_context, but I wasn't sure whether that's a good interface design, so I left it as-is.
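For reference, the wrapper I had in mind would be something like the sketch below (set_model_contexts is a hypothetical name; it just nests the two existing context managers, which are assumed to be imported in this module as in the snippet above):

```python
from contextlib import contextmanager


# Hypothetical helper (not in this PR): nests the two existing context managers
# so one call sets both vllm_model_wrapper_context and forward_context.
@contextmanager
def set_model_contexts(kv_caches, mesh, vllm_config, attn_metadata=None):
    with set_vllm_model_wrapper_context(kv_caches=kv_caches, mesh=mesh), \
         set_forward_context(attn_metadata=attn_metadata,
                             vllm_config=vllm_config):
        yield
```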

@kyuyeunk (Collaborator, Author)

Referenced https://github.com/QiliangCui/bm-infra/blob/main/cases/hourly_torchax_jax.csv and benchmarked the following models. Confirmed that the performance has not changed.

| model | before | after |
| --- | --- | --- |
| Qwen/Qwen2.5-14B-Instruct | 18.36 | 18.37 |
| Qwen/Qwen2.5-32B | 12.42 | 12.65 |
| Qwen/Qwen3-32B | 12.07 | 12.14 |
| RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w8a8 | 8.82 | 8.72 |
| google/gemma-3-27b-it | 19.22 | 19.19 |
| meta-llama/Llama-3.1-70B-Instruct | 7.55 | 7.55 |
| mistralai/Codestral-22B-v0.1 | 14.94 | 14.98 |
| mistralai/Mistral-Small-24B-Instruct-2501 | 20.39 | 19.72 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 18.51 | 18.61 |

cc: @hfan

@hfan (Collaborator) commented Aug 27, 2025

> Referenced https://github.com/QiliangCui/bm-infra/blob/main/cases/hourly_torchax_jax.csv and benchmarked the following models. Confirmed that the performance has not changed.

Thanks! The performance diff for mistralai/Mistral-Small-24B-Instruct-2501 seems a bit larger than noise? Is it reproducible?

@kyuyeunk (Collaborator, Author) left a comment

> Thanks! The performance diff for mistralai/Mistral-Small-24B-Instruct-2501 seems a bit larger than noise? Is it reproducible?

Turns out I was using the wrong --max-num-batched-tokens. Fixing it closed the gap:

| model | before | after |
| --- | --- | --- |
| mistralai/Mistral-Small-24B-Instruct-2501 | 20.39 | 20.85 |

- with torchax.default_env():
+ with torchax.default_env(), set_vllm_model_wrapper_context(
+         kv_caches=None, mesh=self.mesh), set_forward_context(
+         attn_metadata=None, vllm_config=self.vllm_config):
@kyuyeunk (Collaborator, Author)

Oh yeah. You were right. Deleted set_forward_context.

token_num = x.shape[0]
# NOTE(chengjiyao): make sure the sharded token_num is larger than TPU_SECOND_LAST_MINOR
if token_num // self.mesh.shape['model'] >= TPU_SECOND_LAST_MINOR:
    out.shard_(NamedSharding(self.mesh, P('model', None)))
@kyuyeunk (Collaborator, Author)

btw, I'm slightly confused about the difference between sharding the input vs. the output. After sharding propagation, isn't the end result essentially the same?

Here is an example of RowParallelLinear:

# RowParallelLinear - Shard input

in # ('model', None) 
weight # (None, 'model') 
out  # not specified

After sharding propagation, out is sharded as ('model', None)

in # ('model', None) 
weight # (None, 'model')
weight = allgather(weight, 'model') # (None, None)
out = in * weight # ('model', None)


# RowParallelLinear - Shard output

in # not specified
weight # (None, 'model') 
out  # ('model', None) 

After sharding propagation, in is sharded as ('model', None)

in # ('model', None)
weight # (None, 'model')
weight = allgather(weight, 'model') # (None, None)
out = in * weight # ('model', None)

And here is the scenario for ColumnParallelLinear:

# ColumnParallelLinear - Shard input

in # ('model', None) 
weight # ('model', None) 
out  # not specified

After sharding propagation, out is sharded as ('model', None)

in # ('model', None) 
weight # ('model', None)
weight = allgather(weight, 'model') # (None, None)
out = in * weight # ('model', None)


# ColumnParallelLinear - Shard output

in # not specified
weight # ('model', None)
out  # ('model', None) 

After sharding propagation, in is sharded as ('model', None)

in # ('model', None)
weight # ('model', None)
weight = allgather(weight, 'model') # (None, None)
out = in * weight # ('model', None)

Things might differ based on how previous / next layers are sharded, but that's my understanding. @yaochengji, do you have any insights? Or some unit tests to verify sharding?
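One way I could check this in isolation is with plain JAX rather than the torchax wrappers: jit a bare matmul, pin either the operands or the output, and inspect what propagation picks. A minimal sketch (assumes a multi-device host such as a TPU VM and a 1-D 'model' mesh axis; shapes here are arbitrary examples):

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Assumes a multi-device host (e.g. a TPU VM); 'model' mirrors the mesh axis in the PR.
mesh = Mesh(np.array(jax.devices()), ('model',))

# RowParallelLinear-style shapes: activations plus a weight sharded on its contracting dim.
x = jax.device_put(jnp.ones((128, 512)), NamedSharding(mesh, P('model', None)))
w = jax.device_put(jnp.ones((512, 256)), NamedSharding(mesh, P(None, 'model')))


@jax.jit
def shard_input_only(x, w):
    # Only the operands carry shardings; the output is left to propagation.
    return x @ w


@jax.jit
def shard_output_too(x, w):
    # Additionally constrain the output and let propagation reconcile the rest.
    out = x @ w
    return jax.lax.with_sharding_constraint(out, NamedSharding(mesh, P('model', None)))


print(shard_input_only(x, w).sharding)   # what propagation picked for the output
print(shard_output_too(x, w).sharding)   # the explicitly requested output sharding
# To compare the collectives XLA actually inserts:
# print(shard_input_only.lower(x, w).compile().as_text())
```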

@kyuyeunk (Collaborator, Author)

if token_num // self.jax_config.mesh.shape[
        'model'] >= TPU_SECOND_LAST_MINOR:
    x.shard_(
        NamedSharding(self.jax_config.mesh, P('model', None)))
Collaborator

I can approve this PR except for the XLA collective matmul handling.

Will leave it to @yaochengji

@hfan self-requested a review on August 28, 2025 19:22
@kyuyeunk force-pushed the torchax_quant_api branch 4 times, most recently from efa22c2 to 39a7a2e on August 29, 2025 05:55
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com>
@kyuyeunk changed the title from "Refactor torchax layers to use vLLM APIs" to "[Refactor torchax layers to use vLLM APIs" on Aug 29, 2025
@kyuyeunk changed the title from "[Refactor torchax layers to use vLLM APIs" to "[Torchax] Refactor torchax layers to use vLLM APIs" on Aug 29, 2025
@yaochengji (Collaborator) left a comment

LGTM, thanks for the awesome improvement!

@kyuyeunk merged commit 51a2e09 into main on Aug 29, 2025
1 of 2 checks passed
@kyuyeunk deleted the torchax_quant_api branch on August 29, 2025 21:32