[Spec Decode] Enable FlashInfer Spec Decoding #25196
Merged
benchislett merged 12 commits into vllm-project:main from CentML:enable-flashinfer-trtllm-spec-kernels on Sep 24, 2025
Commits (12):
f557af9 Refactor reorder_batch_threshold for spec (benchislett)
ab6e3e9 unit test for batch splitting (benchislett)
cfa3273 minor tweak to helper func (benchislett)
51ec118 flashinfer spec support (benchislett)
822d3dc fix mixed-batch flashinfer (benchislett)
f7c3c13 remove supports_spec_as_decode from classvar (benchislett)
ff866e2 remove the attn backend requirement for EAGLE (benchislett)
0d25335 Merge branch 'main' into enable-flashinfer-trtllm-spec-kernels (benchislett)
2d7e8b1 Merge branch 'main' into enable-flashinfer-trtllm-spec-kernels (benchislett)
0c8a509 use a separate workspace buffer for trtllm gen (benchislett)
443f35c Merge branch 'flashinfer-trtllm_gen-fix-workspace' into enable-flashi… (benchislett)
ad619d7 Merge branch 'main' into enable-flashinfer-trtllm-spec-kernels (benchislett)
Probably naive q: Can there be cases in normal decode where num_decodes < num_decode_tokens?
Usually, `reorder_batch_size == 1`, so `num_decodes == num_decode_tokens`.
However, we're using a padded-batch speculative decoding implementation where we can use the trtllm-gen `batch_decode` kernel for a batch of requests as long as they all have the same `q_len`, which can be larger than 1.
So we need to fix a bunch of cases like this one, where we can have `max_q_len * num_decodes` tokens in the decode pathway.
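
To make the arithmetic in the reply concrete, here is a minimal sketch of how the two counts diverge once the decode pathway accepts requests with more than one query token. It is not the code from this PR: `count_decode_split` and `decode_threshold` are hypothetical names chosen for illustration, and the batch is assumed to be already reordered so that decode-like requests come first.

```python
from typing import List, Tuple


def count_decode_split(
    query_lens: List[int], decode_threshold: int = 1
) -> Tuple[int, int]:
    """Count leading decode-like requests and the tokens they contribute.

    Hypothetical helper for illustration only. Assumes the batch is ordered
    with decode-like requests first, and treats a request as a decode when
    its query length is at most `decode_threshold`.
    """
    num_decodes = 0
    num_decode_tokens = 0
    for q_len in query_lens:
        if q_len > decode_threshold:
            break  # everything from here on is treated as prefill
        num_decodes += 1
        num_decode_tokens += q_len
    return num_decodes, num_decode_tokens


# Normal decode: one query token per request, so both counts match.
normal = count_decode_split([1, 1, 1, 7], decode_threshold=1)

# Padded speculative decode: every decode request carries q_len = 4 tokens,
# so the decode pathway sees q_len * num_decodes tokens.
spec = count_decode_split([4, 4, 4, 7], decode_threshold=4)

print(normal, spec)
```

With a threshold of 1 the two counts are equal, which is the usual case the question assumes. With a uniform `q_len` of 4, three decode requests contribute twelve tokens, so any buffer or slice sized by `num_decodes` alone would be too small.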