Support sequence parallelism and collective matmul #520
Conversation
cc @hfan could you also take a look? cc @xiangxu-google, you may also apply this on JAX models.
Signed-off-by: Chengji Yao <chengjiyao@gmail.com>
@kyuyeunk, does this affect your refactor too?
def forward(self, input: torch.Tensor):
    with jax.named_scope(self.name):
        if self.enable_sequence_parallelism:
A unit test is probably needed.
I intended to add an e2e test, but couldn't find a good example.
tests/models/vllm/test_jax_XXX_linear.py are the existing unit tests.
Thanks, @hfan! The unit test is added and was manually triggered in https://buildkite.com/tpu-commons/tpu-commons-ci/builds/2043
Anything preventing it from being turned on by default?
@hfan I'm not sure this benefits all the models; maybe @QiliangCui can first try it in auto-tuning.
Yes, this does get affected. @yaochengji, I have created a draft PR so you can take an early look: #512
As discussed offline, quantized matmuls use kernels whose collectives can't be handled automatically by XLA, so we probably have to wait for your collective-matmul kernel to be safe (or only make this the default for the non-quantized code path).
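For context, the idea behind a collective (all-gather) matmul for this sequence-parallel, column-parallel case can be sketched in plain JAX with `shard_map`: each device multiplies the activation chunk it currently holds by its local weight shard, then rotates the chunk around the ring. This is only an illustration of the decomposition, not the kernel discussed above; the mesh axis name, shapes, and `check_rep=False` are assumptions.

```python
from functools import partial

import jax
import jax.numpy as jnp
import numpy as np
from jax.experimental.shard_map import shard_map
from jax.sharding import Mesh, PartitionSpec as P

devices = jax.devices()
n = len(devices)
mesh = Mesh(np.array(devices), ('model',))

@jax.jit
@partial(shard_map, mesh=mesh,
         in_specs=(P('model', None), P(None, 'model')),
         out_specs=P(None, 'model'),
         check_rep=False)  # skip replication checking for this sketch
def collective_matmul(x_blk, w_blk):
    # x_blk: [tokens / n, hidden], the sequence-sharded activation chunk.
    # w_blk: [hidden, out / n], the column-parallel weight shard.
    idx = jax.lax.axis_index('model')
    tokens_per_shard = x_blk.shape[0]
    out = jnp.zeros((tokens_per_shard * n, w_blk.shape[1]), x_blk.dtype)
    chunk = x_blk
    perm = [(j, (j + 1) % n) for j in range(n)]
    for i in range(n):
        # After i rotations this device holds the chunk that started on
        # device (idx - i) % n; write its partial product into those rows.
        src = (idx - i) % n
        out = jax.lax.dynamic_update_slice(
            out, chunk @ w_blk, (src * tokens_per_shard, 0))
        if i != n - 1:
            chunk = jax.lax.ppermute(chunk, 'model', perm)
    return out
```

A real kernel (or a compiler collective-matmul pass) overlaps the `ppermute` with the per-chunk matmul so communication hides behind compute; the loop above only spells out the data movement.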
Signed-off-by: Chengji Yao <chengjiyao@gmail.com>
with jax.named_scope(self.name):
    if self.enable_sequence_parallelism:
        token_num = input.shape[0]
        # NOTE(chengjiyao): make sure the sharded token_num is larger than TPU_SECOND_LAST_MINOR
If the sharded token_num has to be larger than TPU_SECOND_LAST_MINOR, I guess the downside otherwise is wasted memory?
There will be more communication for the final result; you can give it a try.
# NOTE(chengjiyao): make sure the sharded token_num is larger than TPU_SECOND_LAST_MINOR
if token_num // self.mesh.shape['model'] >= TPU_SECOND_LAST_MINOR:
    input.shard_(NamedSharding(self.mesh, P('model', None)))
very neat and convenient way of doing self.apply_jax_(jax.lax.with_sharding_constraint, sharding)
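For readers following along, a minimal pure-JAX equivalent of the guarded sharding above might look like the sketch below; `TPU_SECOND_LAST_MINOR = 8` and the single 'model' mesh axis are assumptions, not values taken from this PR.

```python
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

TPU_SECOND_LAST_MINOR = 8  # assumed value of the tiling constraint referenced above

mesh = Mesh(np.array(jax.devices()), ('model',))

def maybe_shard_tokens(x):
    # Constrain the token dim to the 'model' axis only when every shard keeps
    # at least TPU_SECOND_LAST_MINOR rows; otherwise leave the layout to XLA.
    if x.shape[0] // mesh.shape['model'] >= TPU_SECOND_LAST_MINOR:
        x = jax.lax.with_sharding_constraint(
            x, NamedSharding(mesh, P('model', None)))
    return x
```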
if device_num == 8:
    ordered_devices = np.array([
        devices[0],
        devices[2],
I wonder why (1,0,0) maps to devices[2].
Also, does the order
# (0,0,0)
# (1,0,0)
# (0,1,0)
# (1,1,0)
# (0,2,0)
# (1,2,0)
# (0,3,0)
# (1,3,0)
matter?
Yeah, judging from how the order is constructed, it matters.
Why does (1,0,0) map to devices[2]?
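One hedged reading of the hard-coded permutation: the default `jax.devices()` order does not walk the chip coordinates in the ring order the diff comment describes, so devices[2] is presumably the chip at (1,0,0). A generic sketch that derives the same order from the TPU `coords` attribute (the sort key is a guess at the intended topology, not this PR's logic):

```python
import jax
import numpy as np
from jax.sharding import Mesh

devices = jax.devices()
if len(devices) == 8:
    # Sort by (y, x, z) coords so the 'model' axis visits
    # (0,0,0), (1,0,0), (0,1,0), (1,1,0), ... as listed in the diff comment.
    devices = sorted(devices, key=lambda d: (d.coords[1], d.coords[0], d.coords[2]))
mesh = Mesh(np.array(devices), ('model',))
```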
output = merged_column_linear(input_tensor).to(dtype)
# Set jax default device to workaround a layout bug in JAX 0.7.0 and earlier
with torchax.default_env(), jax.default_device(jax.devices("tpu")[0]):
Is it using one TPU device? Do you need to test when there are multiple TPU devices?
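A minimal multi-device check along these lines could shard the token dim over every available chip and compare against the unsharded result; all names and shapes here are made up for illustration.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

def test_sequence_parallel_matmul_matches_reference():
    mesh = Mesh(np.array(jax.devices()), ('model',))
    x = jnp.asarray(np.random.randn(64, 128), dtype=jnp.float32)
    w = jnp.asarray(np.random.randn(128, 256), dtype=jnp.float32)

    @jax.jit
    def sharded_matmul(x, w):
        x = jax.lax.with_sharding_constraint(
            x, NamedSharding(mesh, P('model', None)))
        return x @ w

    # The token-sharded result must match the plain (unsharded) matmul.
    np.testing.assert_allclose(sharded_matmul(x, w), x @ w, rtol=1e-5, atol=1e-5)
```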
# NOTE(chengjiyao): make sure the sharded token_num is larger than TPU_SECOND_LAST_MINOR
if token_num // self.mesh.shape['model'] >= TPU_SECOND_LAST_MINOR:
    input.shard_(NamedSharding(self.mesh, P('model', None)))
It seems that SP is implemented by sharding the num_tokens dimension. Do you need to do an all-gather at the very end? I couldn't find it in your PR.
Sharding propagation can handle this and make it correct.
Chatted offline: the all-gather is not needed because some later ops (e.g. selecting which tokens to compute logits for, which needs num_reqs tokens) hint the compiler to build the global view. At that point the compiler inserts an all-gather implicitly.
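A small sketch of that behavior: the hidden states stay token-sharded and no all-gather is written by hand, yet selecting a handful of token rows for the logits forces the compiler to materialize the global view (function and shapes are illustrative, not from this PR).

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), ('model',))

@jax.jit
def select_tokens_and_project(hidden, lm_head, token_indices):
    # hidden: [num_tokens, d] stays sharded on the token dim through the model.
    hidden = jax.lax.with_sharding_constraint(
        hidden, NamedSharding(mesh, P('model', None)))
    # Gathering num_reqs rows needs the global token view, so the compiler
    # inserts the all-gather here on its own; nothing is written explicitly.
    selected = jnp.take(hidden, token_indices, axis=0)
    return selected @ lm_head  # [num_reqs, vocab]
```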
# NOTE(chengjiyao): make sure the sharded token_num is larger than TPU_SECOND_LAST_MINOR
if token_num // self.mesh.shape['model'] >= TPU_SECOND_LAST_MINOR:
    output.shard_(NamedSharding(self.mesh, P('model', None)))
Doesn't sequence parallelism usually use 'data' to shard the batch dim? Meaning, we could use both sequence and model parallelism and shard the inputs/outputs using P('data', 'model').
In this case, the mesh 'model' axis alone is enough. Here, sequence parallelism is applied to the layer_norm and model parallelism is applied to the matmul, as described in this paper: https://arxiv.org/abs/2205.05198
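A sketch of that split from the paper, expressed only with sharding constraints so XLA can lower the boundary to the usual all-gather / reduce-scatter pair; the RMSNorm variant and shapes are assumptions for illustration.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), ('model',))

def rms_norm(x, scale, eps=1e-6):
    return x * jax.lax.rsqrt(jnp.mean(x * x, axis=-1, keepdims=True) + eps) * scale

@jax.jit
def norm_then_column_parallel(x, scale, w_up):
    # Sequence parallelism: the norm runs on token-sharded activations.
    x = jax.lax.with_sharding_constraint(
        x, NamedSharding(mesh, P('model', None)))
    x = rms_norm(x, scale)
    # Tensor (model) parallelism: the matmul shards the output features, so
    # the compiler gathers the tokens right at this boundary.
    w_up = jax.lax.with_sharding_constraint(
        w_up, NamedSharding(mesh, P(None, 'model')))
    return x @ w_up
```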
Description
-O '{"pass_config": {"enable_sequence_parallelism": true}}'
Tests
Checklist
Before submitting this PR, please make sure: