[Attention][FlashInfer] Enable FP8 FlashInfer (TRTLLM) MLA decode #24705
Conversation
bmm1_scale = layer._q_scale.item() * layer._k_scale.item() * self.scale
bmm2_scale = layer._v_scale.item()
Could you use the _float versions of these?
# We also keep q/k/v_scale on host (cpu) memory for attention
# backends that require the scales to be on host instead of on device.
# e.g. Flashinfer
self._q_scale_float = 1.0
self._k_scale_float = 1.0
self._v_scale_float = 1.0
Also could this be calculated ahead of time rather than each forward pass?
I can definitely switch to the float ones. It's possible that each layer could have a different scale though, right?
Yes, definitely. Sorry, I wasn't looking carefully at the difference between layer and self here; we can keep the local computation. I was just trying to avoid CPU ops if possible.
No problem, thanks for the review!
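For reference, a minimal sketch of the change this thread converges on, assuming the cached host-side _q_scale_float / _k_scale_float / _v_scale_float attributes quoted above (the exact code merged in the PR may differ):

# Hypothetical sketch: read the host-side float copies of the scales instead of
# calling .item() on device tensors, avoiding a GPU->CPU sync on each forward pass.
bmm1_scale = layer._q_scale_float * layer._k_scale_float * self.scale
bmm2_scale = layer._v_scale_float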
Purpose
Enable FP8 kv cache for the FLASHINFER_MLA backend.
Test Plan
Correctness
VLLM_ATTENTION_BACKEND=FLASHINFER_MLA lm_eval --model vllm --model_args '{"pretrained": "deepseek-ai/DeepSeek-V2-Lite-Chat", "trust_remote_code": true, "kv_cache_dtype": <dtype>}' --tasks gsm8k --batch_size auto
Performance
VLLM_ATTENTION_BACKEND=FLASHINFER_MLA vllm bench throughput --model=deepseek-ai/DeepSeek-V2-Lite-Chat --dataset-name=random --input-len=8192 --output-len=1024 --num-prompts=100 --kv-cache-dtype=<dtype>
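For a quick interactive check, a rough offline-inference equivalent of the commands above using the vLLM Python API is sketched below; the backend is still selected via the VLLM_ATTENTION_BACKEND environment variable, and the prompt is only illustrative:

import os

# Select the FlashInfer MLA attention backend before vLLM initializes.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER_MLA"

from vllm import LLM, SamplingParams

# kv_cache_dtype="fp8" enables the FP8 KV cache path; use "auto" for the baseline run.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",
    trust_remote_code=True,
    kv_cache_dtype="fp8",
)

outputs = llm.generate(["Explain the FP8 KV cache in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)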
Test Result
Correctness
with <dtype>="auto":
with <dtype>="fp8":
Performance
with <dtype>="auto": Throughput: 3.22 requests/s, 29668.09 total tokens/s, 3296.62 output tokens/s
with <dtype>="fp8": Throughput: 3.55 requests/s, 32757.46 total tokens/s, 3639.90 output tokens/s