[Spec Decode] Utilities and refactor to support qlen>1 decode kernels for spec decode #25183
Conversation
Code Review
This pull request introduces several utilities and refactors to support speculative decoding with query lengths greater than one. The changes include making `reorder_batch_threshold` an instance variable for dynamic configuration, adding a 'uniform' mode for batch splitting, and providing helper functions for tensor reshaping. The new logic for batch splitting is well-tested and appears correct. The refactoring improves code clarity and prepares the codebase for new speculative decoding backends. I have one suggestion to improve an assertion in a new helper function for clarity and correctness.
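To make the 'uniform' batch-splitting mode concrete, here is a minimal sketch. It assumes 'uniform' means the leading decode portion of a reordered batch is only kept when every decode request has the same query length (so its tokens can later be viewed as a dense [batch, qlen, ...] tensor); that interpretation, and all names below, are illustrative rather than taken from the PR.

```python
import torch


def split_decodes_uniform(query_lens: torch.Tensor, threshold: int) -> int:
    """Return how many leading requests to treat as decodes.

    Illustrative sketch only, not the PR's implementation. Assumes requests
    are already reordered so that short (decode) requests come first. A
    request counts as a decode if its query length is <= threshold (e.g.
    num_speculative_tokens + 1 for spec-verify). In this "uniform" reading,
    the decode split is only kept when all decode requests share one query
    length; otherwise everything falls back to the prefill path.
    """
    num_decodes = 0
    for qlen in query_lens.tolist():
        if qlen <= threshold:
            num_decodes += 1
        else:
            break
    if num_decodes == 0:
        return 0
    first = int(query_lens[0].item())
    if not bool((query_lens[:num_decodes] == first).all()):
        return 0  # non-uniform query lengths: route them all through prefill
    return num_decodes
```

For example, with num_speculative_tokens = 3 the threshold would be 4, so query lengths [4, 4, 4, 17] would yield three decode requests and one prefill.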
@@ -766,6 +798,40 @@ def reorder_batch_to_split_decodes_and_prefills(
    return modified_batch


def reshape_query_for_spec_decode(query: torch.Tensor,
Are these not used yet? Should we just include them in the follow-up once they are actually used? Or maybe we should add FlashMLA support in this PR, just so everything is used (and tested, since we can do a FlashMLA + MTP lm_eval run).
I think it is worth committing now and using it in subsequent PRs, mostly because it will be used by FlashMLA and also FlashInfer-MLA, and maybe more. Merging it here as a helper means that all the downstream PRs can reuse the same code from main instead of duplicating it in each.
But I don't feel particularly strongly about this, and can remove it if you think it's better to add separately.
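For reference, here is a hypothetical sketch of what a query-reshaping helper like the one in the hunk above might look like; only the first line of the real signature is visible in the diff, so the second parameter and the body are assumptions.

```python
import torch


def reshape_query_for_spec_decode(query: torch.Tensor,
                                  batch_size: int) -> torch.Tensor:
    """View a flattened [num_tokens, num_heads, head_dim] query as
    [batch_size, qlen, num_heads, head_dim] for qlen>1 decode kernels.

    Hypothetical sketch: assumes every decode request contributed the same
    (uniform) number of query tokens, so num_tokens divides evenly.
    """
    num_tokens, num_heads, head_dim = query.shape
    assert num_tokens % batch_size == 0, (
        f"num_tokens ({num_tokens}) must be divisible by batch_size "
        f"({batch_size}) for a uniform decode batch")
    qlen = num_tokens // batch_size
    # view() keeps the same storage; the query is assumed contiguous here.
    return query.view(batch_size, qlen, num_heads, head_dim)
```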
Co-authored-by: lhsjohn <huashuoli@tencent.com> Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Force-pushed from 6eae35e to cfa3273
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Closing, see #25196
Purpose
This PR makes the common changes needed to enable FlashInfer, FlashInferMLA, FlashMLA, and other new backends for speculative decoding; it does not add explicit support for any one of these.
Included in this PR is:
- A refactor of `reorder_batch_threshold`, making it no longer a `ClassVar`. This is because it can now be specialized at initialization time depending on whether or not speculative decoding is enabled: when it is, we can set it to `num_speculative_tokens + 1` so that all spec-verify steps can be classified as decodes. A helper function is also included to facilitate this.
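A rough illustration of that initialization-time specialization follows; the class name and surrounding structure are assumptions made for the example, not the PR's actual code.

```python
# Hypothetical sketch: reorder_batch_threshold becomes an instance attribute
# chosen at construction time instead of a fixed class-level ClassVar.
class AttentionMetadataBuilderSketch:
    def __init__(self, num_speculative_tokens: int | None = None):
        if num_speculative_tokens is not None:
            # A spec-verify step processes up to num_speculative_tokens draft
            # tokens plus one bonus token, so any request with query length
            # <= this threshold can still be routed down the decode path.
            self.reorder_batch_threshold = num_speculative_tokens + 1
        else:
            # Plain decoding: one query token per request.
            self.reorder_batch_threshold = 1
```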
Test Plan
See `tests/v1/attention/test_attention_splitting.py`
Test Result
All passing locally