
Conversation


@benchislett benchislett commented Sep 18, 2025

Purpose

This PR enables FlashInfer for speculative decoding. When possible, the decode-optimized trtllm-gen kernel is used; otherwise, the fallback is the prefill kernel, which can handle arbitrary query lengths but is less performant.
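
As a rough sketch of the dispatch rule (illustrative only; the function and variable names below are not the PR's actual identifiers):

```python
# Hypothetical sketch of the kernel selection described above, not the PR's code.
def can_use_trtllm_decode(query_lens: list[int], uniform_decode_q_len: int) -> bool:
    # The decode-optimized trtllm-gen kernel is used only when every request in
    # the (reordered) decode batch has the same query length, e.g.
    # 1 + num_speculative_tokens with padded speculative decoding.
    return bool(query_lens) and len(set(query_lens)) == 1 \
        and query_lens[0] <= uniform_decode_q_len

# Anything else (mixed query lengths, long prefills) falls back to the prefill
# kernel, which handles arbitrary query lengths but is slower for decode.
```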

This PR depends on #25183 for the refactor of the batch reordering threshold variable.

Here's an example launch command for EAGLE3 on 1xB200:

vllm serve meta-llama/Llama-3.1-8B-Instruct --speculative-config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 4}' --max-model-len 2048 --no-enable-prefix-caching

Benchmarking with 200 requests from ShareGPT gives the following TPS numbers:

  • Padding enabled + FlashInfer: 530 TPS (1.14x)
  • Padding disabled + FlashInfer: 465 TPS (1.0x)
  • Padding enabled + FlashAttention: 492 TPS (1.06x)
  • Padding disabled + FlashAttention: 466 TPS (1.0x)

Examination of nsys traces shows that the main model's attention kernels are about 2.5x faster when using the decode-optimized trtllm-gen kernels.

Correctness Testing

Tested on GSM8K (limit 500) with and without speculative decoding, and with and without FlashInfer. All runs were successful.
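
(The tables below are lm-evaluation-harness output. The exact eval command is not included in the PR, but an invocation along the following lines should reproduce the setup; the base_url/port and model_args here are assumptions:)

lm_eval --model local-completions --model_args model=meta-llama/Llama-3.1-8B-Instruct,base_url=http://localhost:8049/v1/completions --tasks gsm8k --num_fewshot 5 --limit 500 --batch_size 128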

FlashInfer + Padded-Batch (Default)

CUDA_VISIBLE_DEVICES=3 vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 2048 --no-enable-prefix-caching --port 8049 --speculative-config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 4}' 
limit: 500.0, num_fewshot: 5, batch_size: 128
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.796|±  |0.0180|
|     |       |strict-match    |     5|exact_match|↑  |0.774|±  |0.0187|

FlashInfer + Non-uniform Batch (opt-in / fallback option)

CUDA_VISIBLE_DEVICES=3 vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 2048 --no-enable-prefix-caching --port 8049 --speculative-config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 4, "disable_padded_drafter_batch": true}' 
limit: 500.0, num_fewshot: 5, batch_size: 128
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.792|±  |0.0182|
|     |       |strict-match    |     5|exact_match|↑  |0.774|±  |0.0187|

Baseline FlashInfer

CUDA_VISIBLE_DEVICES=3 vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 2048 --no-enable-prefix-caching --port 8049
limit: 500.0, num_fewshot: 5, batch_size: 128
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.796|±  |0.0180|
|     |       |strict-match    |     5|exact_match|↑  |0.778|±  |0.0186|

Baseline FlashAttn

CUDA_VISIBLE_DEVICES=3 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 2048 --no-enable-prefix-caching --port 8049
limit: 500.0, num_fewshot: 5, batch_size: 128
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.794|±  |0.0181|
|     |       |strict-match    |     5|exact_match|↑  |0.772|±  |0.0188|

Resolved Issues

When CUDA graphs are enabled and the padded drafter batch is disabled, there is a crash at high concurrency. This PR fixes a couple of issues in FlashInfer where the planning and building of metadata assumed that num_decode_tokens and num_decodes are the same. There is likely another such issue in the planning logic for CUDA graph padding; that remaining issue can be worked around by using enforce-eager or by recording a CUDA graph for each input batch size.
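
As an illustration of the distinction (a minimal sketch with made-up numbers and simplified names, not the actual FlashInfer metadata-builder code):

```python
# Hypothetical sketch: per-request metadata is split at num_decodes, while
# per-token metadata is split/rebased at num_decode_tokens. The two offsets
# only coincide when each decode request has a single query token.
import torch

num_decodes, q_len = 3, 5                    # 3 decode requests, 5 query tokens each
num_decode_tokens = num_decodes * q_len      # 15, not 3

seq_lens = torch.tensor([37, 12, 98, 400, 250])           # one entry per request
query_start_loc = torch.tensor([0, 5, 10, 15, 80, 300])   # cumulative token offsets per request

seq_lens_prefill = seq_lens[num_decodes:]                              # split by request count
qo_indptr_prefill = query_start_loc[num_decodes:] - num_decode_tokens  # rebase by token count
```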

Update after some investigation: there are illegal memory accesses in the trtllm-gen kernels that produce difficult-to-reproduce crashes. They seem unrelated to the logic in this PR, which I have marked as ready-for-review. It is possible that some state corruption or race condition is independently causing issues with these kernels on Blackwell. I will continue investigating.

Update: fixed. The issue was max_q_len > max(query_lens), causing an illegal access for non-uniform batches such as [2, 1], where the prefill max query size was smaller than the total max query size. Fixed by manually calculating max_q_len_prefill after splitting the batch.
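
A minimal sketch of the fix (illustrative names and numbers, not the exact PR code):

```python
# Hypothetical sketch: the prefill plan must use the max query length of the
# prefill portion only, not the max over the whole (reordered) batch.
query_lens_decode = [2]     # decode portion, reordered to the front
query_lens_prefill = [1]    # prefill portion

max_q_len = max(query_lens_decode + query_lens_prefill)  # 2: global max
max_q_len_prefill = max(query_lens_prefill)              # 1: what prefill planning should use

# Planning the prefill kernel with max_q_len (2) instead of max_q_len_prefill (1)
# is what triggered the illegal memory access for batches like [2, 1].
```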

benchislett and others added 4 commits September 18, 2025 18:06

@LucasWilkinson (Collaborator) left a comment

LGTM, might be nice to get some extra eyes on the FlashInfer bits @pavanimajety @mgoin

Comment on lines +877 to +878
    …[num_decodes:]
    seq_lens_prefill = attn_metadata.seq_lens[num_decodes:]

Contributor

Probably naive q: Can there be cases in normal decode where num_decodes < num_decode_tokens?

Collaborator Author (@benchislett)

Usually, reorder_batch_size == 1 so num_decodes == num_decode_tokens.

However, we're using a padded-batch speculative decoding implementation where we can use the trtllm-gen batch_decode kernel for a batch of requests as long as they all have the same q_len, which can be larger than 1.

So we need to fix a bunch of cases like this one, where we can have max_q_len * num_decodes tokens in the decode pathway.
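
For example, with the EAGLE3 config from the description (a hedged illustration of the arithmetic, not actual code):

```python
num_speculative_tokens = 4
q_len = 1 + num_speculative_tokens        # uniform query length per padded decode request
num_decodes = 16                          # decode requests in the batch (arbitrary example)
num_decode_tokens = num_decodes * q_len   # 80 != 16, unlike the classic q_len == 1 case
```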

@pavanimajety (Contributor) left a comment

LGTM. Could you share some accuracy results with spec decoding off and with spec decoding on?

@benchislett

@pavanimajety added correctness testing to the description. All successful.

@benchislett

Also fixed the kernel crashes I was seeing by adding max_q_len_prefill to FlashInferMetadata. Open to discussing better ways to handle this; maybe we could use max_q_len throughout instead? This seems to be the only place it is used, but its meaning is ambiguous.

@LucasWilkinson (Collaborator) left a comment

LGTM; left one nit

@benchislett enabled auto-merge (squash) September 23, 2025 15:23
@github-actions bot added the ready label Sep 23, 2025

mergify bot commented Sep 23, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @benchislett.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Sep 23, 2025
@mergify bot removed the needs-rebase label Sep 23, 2025
@benchislett merged commit c30b405 into vllm-project:main Sep 24, 2025
43 checks passed
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Labels
ready, speculative-decoding, v1