[perf] Use CPU tensor to reduce GPU->CPU sync #25884
Conversation
Signed-off-by: Lehua Ding <lehuading@tencent.com>
Code Review
This pull request introduces a performance optimization that reduces GPU-to-CPU synchronization during speculative decoding. The change replaces a call to .max() on a GPU tensor (seq_lens) with the same reduction on its CPU counterpart (seq_lens_cpu) inside a conditional check. This avoids a blocking operation, which is particularly beneficial for asynchronous scheduling. The change is correct and aligns with the stated goal of improving performance. I have no further comments.
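For context, here is a minimal PyTorch sketch of the behavior being described (not the vLLM code itself; the tensor values and the num_draft_tokens term are made up for illustration). Converting the result of a reduction on a CUDA tensor into a Python bool forces the host to wait for the device, while the same check on a CPU mirror never touches the GPU:

```python
import torch

max_model_len = 512
num_draft_tokens = 4  # illustrative stand-in, not a real vLLM field

seq_lens = torch.tensor([120, 480, 256], device="cuda")  # GPU tensor
seq_lens_cpu = seq_lens.cpu()                            # CPU mirror

# GPU path: .max() yields a 0-dim CUDA tensor; turning the comparison into a
# Python bool blocks the host until the device has produced the value.
fits_gpu = bool(seq_lens.max() + num_draft_tokens <= max_model_len)

# CPU path: the same reduction on the CPU mirror runs entirely on the host,
# so no blocking synchronization is introduced.
fits_cpu = bool(seq_lens_cpu.max() + num_draft_tokens <= max_model_len)

assert fits_gpu == fits_cpu
```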
This check was introduced by #24662; @AlonKejzman, can you help review this too? Thanks.
LGTM, thanks!
vllm/v1/worker/gpu_model_runner.py
    self.speculative_config.draft_model_config.max_model_len)
input_fits_in_drafter = spec_decode_common_attn_metadata and (
-    spec_decode_common_attn_metadata.seq_lens.max() +
+    spec_decode_common_attn_metadata.seq_lens_cpu.max() +
Is it possible to use .max_seq_len here?
Indeed, and it is simpler! spec_decode_common_attn_metadata.max_seq_len is computed from self.seq_lens.np[:num_reqs].max().item(), so they are equivalent.
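For reference, a small self-contained sketch of the equivalence described above (made-up values; not vLLM code): the max_seq_len value, taken from the NumPy view of the CPU buffer, matches reducing seq_lens_cpu directly, with no GPU work either way.

```python
import torch

num_reqs = 3
# Padded CPU buffer of sequence lengths, as the runner is described to keep one.
seq_lens_cpu = torch.tensor([120, 480, 256, 0], dtype=torch.int32)

# How the metadata field is described in the thread: a plain Python int
# computed from the NumPy view of the CPU buffer.
max_seq_len = seq_lens_cpu.numpy()[:num_reqs].max().item()

# Same value as reducing the CPU tensor directly; neither path touches the GPU.
assert max_seq_len == int(seq_lens_cpu[:num_reqs].max())  # 480
```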
Signed-off-by: Lehua Ding <lehuading@tencent.com>
Signed-off-by: Lehua Ding <lehuading@tencent.com>
Signed-off-by: Lehua Ding <lehuading@tencent.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Use seq_lens_cpu instead of seq_lens to reduce GPU->CPU sync.
Purpose
Remove an unnecessary GPU->CPU sync, since it hurts the performance of Async Scheduling + MTP.

Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.