Conversation

LucasWilkinson
Collaborator

@LucasWilkinson LucasWilkinson commented Oct 2, 2025

PR

VLLM_LOGGING_LEVEL=DEBUG vllm serve deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct
...
(EngineCore_DP0 pid=3662669) DEBUG 10-01 17:41:11 [v1/attention/.../mla/common.py:980] Using FlashInfer prefill for MLA
...
========================================================================
(vllm) lwilkinson@dgxB200-09:~/code/vllm$ python tests/evals/gsm8k/gsm8k_eval.py
Running GSM8K evaluation: 1319 questions, 5-shot
Evaluating: 100%|█████████████████████████████████████████████████████████████████████| 1319/1319 [00:28<00:00, 46.50it/s]

Results:
Accuracy: 0.782
Invalid responses: 0.000
Total latency: 28.381 s
Questions per second: 46.475

Main

(vllm) lwilkinson@dgxB200-09:~/code/vllm$ python tests/evals/gsm8k/gsm8k_eval.py
Running GSM8K evaluation: 1319 questions, 5-shot
Evaluating: 100%|█████████████████████████████████████████████████████████████████████| 1319/1319 [00:26<00:00, 49.12it/s]

Results:
Accuracy: 0.208
Invalid responses: 0.006
Total latency: 26.864 s
Questions per second: 49.100
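
As a side note on how these numbers are typically produced: GSM8K scorers usually extract the final number from each completion and compare it to the reference answer, counting completions with no extractable number as invalid. A minimal sketch of that extraction style, assuming this is roughly what `tests/evals/gsm8k/gsm8k_eval.py` does (the actual script may differ):

```python
import re

def extract_answer(text: str) -> str | None:
    # Take the last number in the reply; none found means an invalid response.
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

print(extract_answer("So the total is 1,234 apples."))  # 1234
print(extract_answer("I cannot answer that."))          # None
```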

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@mergify mergify bot added the v1 label Oct 2, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses an accuracy issue with the FlashInfer MLA prefill implementation. The root cause appears to be an inconsistent shape of the log-sum-exp (LSE) tensor returned by FlashInfer, which is (q_len, num_heads) instead of the expected (num_heads, q_len). The changes correctly transpose the LSE tensor in both _run_prefill_new_tokens_fi and _run_prefill_context_chunk_fi to align with other backends, which should resolve the accuracy problem. The fix is logical and well-targeted. I have one suggestion to improve the code's robustness by using isinstance() for type checking, in line with Python best practices.
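
To make the described fix concrete, here is a minimal sketch, assuming the FlashInfer prefill wrapper exposes a `return_lse` flag and that the surrounding merge path expects `(num_heads, q_len)`; the function name is hypothetical, and the real code in `_run_prefill_new_tokens_fi` / `_run_prefill_context_chunk_fi` may differ:

```python
import torch

def _run_prefill_fi_sketch(prefill_wrapper, q, k, v):
    # Assumption: the wrapper can return the log-sum-exp (LSE)
    # alongside the attention output.
    out, lse = prefill_wrapper.run(q, k, v, return_lse=True)
    # FlashInfer hands the LSE back as (q_len, num_heads); transpose it
    # to the (num_heads, q_len) layout the other MLA backends produce so
    # the chunked-context merge combines the right values.
    return out, lse.transpose(0, 1)
```

The `isinstance()` side note refers to a general Python practice: comparing `type(x)` to a class ignores subclasses, while `isinstance()` honors them.

```python
class Base: ...
class Child(Base): ...

obj = Child()
print(type(obj) is Base)      # False: exact-type check misses subclasses
print(isinstance(obj, Base))  # True: isinstance() accepts subclasses
```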

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
@benchislett
Collaborator

I ran this to validate on 8xB200:

VLLM_ATTENTION_BACKEND=CUTLASS_MLA vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8 --max-model-len 8192 --no-enable-prefix-caching --port 8049 --max-num-batched-tokens 256

Where it completely failed before, it now passes GSM8K.

I also ran it with MTP on my development branch (#25984), which relies more heavily on this prefill functionality, and it passed as well:

VLLM_ATTENTION_BACKEND=CUTLASS_MLA vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8 --max-model-len 8192 --no-enable-prefix-caching --port 8049 --max-num-batched-tokens 256 --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9538|±  |0.0058|
|     |       |strict-match    |     5|exact_match|↑  |0.9500|±  |0.0060|
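
For reference, that table is in the lm-evaluation-harness output format; a command roughly like the following, pointed at the server started above (port 8049), would produce it; treat the exact flags as illustrative rather than verified:

```
lm_eval --model local-completions \
  --model_args model=deepseek-ai/DeepSeek-R1-0528,base_url=http://localhost:8049/v1/completions \
  --tasks gsm8k --num_fewshot 5
```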

@benchislett benchislett added the bug Something isn't working label Oct 2, 2025
Collaborator

@benchislett benchislett left a comment


I, too, am curious why this didn't (and doesn't) cause a shape mismatch. That said, it clearly solves the problem.

I would approve if I understood why this solves the issue.
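
One plausible (unverified) explanation for the missing shape error: `(q_len, num_heads)` and `(num_heads, q_len)` contain the same number of elements, so any downstream step that flattens, views, or reshapes the LSE accepts either layout without complaint, and the merge then pairs values with the wrong heads and positions: a silent accuracy bug rather than a crash. A self-contained illustration in PyTorch:

```python
import torch

num_heads, q_len = 16, 8
lse_expected = torch.randn(num_heads, q_len)                # layout the merge assumes
lse_flashinfer = lse_expected.transpose(0, 1).contiguous()  # layout FlashInfer returned

# Viewing it back to the assumed layout succeeds, since numel() matches...
reinterpreted = lse_flashinfer.view(num_heads, q_len)
# ...but the values no longer line up, so results are silently wrong.
print(torch.equal(reinterpreted, lse_expected))  # False
```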

@benchislett
Collaborator

FIX #26042

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 2, 2025
@benchislett benchislett enabled auto-merge (squash) October 2, 2025 15:12
@benchislett benchislett merged commit decf7f7 into vllm-project:main Oct 2, 2025
49 checks passed
simon-mo pushed a commit that referenced this pull request Oct 2, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
@mgoin mgoin deleted the lwilkinson/fix-fi-accuracy-issue branch October 2, 2025 18:36
pdasigi pushed a commit to pdasigi/vllm that referenced this pull request Oct 2, 2025
…t#26063)

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>