Conversation

LucasWilkinson
Collaborator

@LucasWilkinson LucasWilkinson commented Oct 2, 2025

PR

VLLM_LOGGING_LEVEL=DEBUG vllm serve deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct
...
(EngineCore_DP0 pid=3662669) DEBUG 10-01 17:41:11 [v1/attention/.../mla/common.py:980] Using FlashInfer prefill for MLA
...
========================================================================
(vllm) lwilkinson@dgxB200-09:~/code/vllm$ python tests/evals/gsm8k/gsm8k_eval.py
Running GSM8K evaluation: 1319 questions, 5-shot
Evaluating: 100%|█████████████████████████████████████████████████████████████████████| 1319/1319 [00:28<00:00, 46.50it/s]

Results:
Accuracy: 0.782
Invalid responses: 0.000
Total latency: 28.381 s
Questions per second: 46.475

Main

(vllm) lwilkinson@dgxB200-09:~/code/vllm$ python tests/evals/gsm8k/gsm8k_eval.py
Running GSM8K evaluation: 1319 questions, 5-shot
Evaluating: 100%|█████████████████████████████████████████████████████████████████████| 1319/1319 [00:26<00:00, 49.12it/s]

Results:
Accuracy: 0.208
Invalid responses: 0.006
Total latency: 26.864 s
Questions per second: 49.100
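
As a side note on how these numbers are typically produced: GSM8K scorers usually extract the final number from each completion and compare it to the reference answer, counting completions with no extractable number as invalid. A minimal sketch of that extraction style, assuming this is roughly what `tests/evals/gsm8k/gsm8k_eval.py` does (the actual script may differ):

```python
import re

def extract_answer(text: str) -> str | None:
    # Take the last number in the reply; none found means an invalid response.
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

print(extract_answer("So the total is 1,234 apples."))  # 1234
print(extract_answer("I cannot answer that."))          # None
```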

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@mergify mergify bot added the v1 label Oct 2, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses an accuracy issue with the FlashInfer MLA prefill implementation. The root cause appears to be an inconsistent shape of the log-sum-exp (LSE) tensor returned by FlashInfer, which is (q_len, num_heads) instead of the expected (num_heads, q_len). The changes correctly transpose the LSE tensor in both _run_prefill_new_tokens_fi and _run_prefill_context_chunk_fi to align with other backends, which should resolve the accuracy problem. The fix is logical and well-targeted. I have one suggestion to improve the code's robustness by using isinstance() for type checking, in line with Python best practices.
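
To make the described fix concrete, here is a minimal sketch, assuming the FlashInfer prefill wrapper exposes a `return_lse` flag and that the surrounding merge path expects `(num_heads, q_len)`; the function name is hypothetical, and the real code in `_run_prefill_new_tokens_fi` / `_run_prefill_context_chunk_fi` may differ:

```python
import torch

def _run_prefill_fi_sketch(prefill_wrapper, q, k, v):
    # Assumption: the wrapper can return the log-sum-exp (LSE)
    # alongside the attention output.
    out, lse = prefill_wrapper.run(q, k, v, return_lse=True)
    # FlashInfer hands the LSE back as (q_len, num_heads); transpose it
    # to the (num_heads, q_len) layout the other MLA backends produce so
    # the chunked-context merge combines the right values.
    return out, lse.transpose(0, 1)
```

The `isinstance()` side note refers to a general Python practice: comparing `type(x)` to a class ignores subclasses, while `isinstance()` honors them.

```python
class Base: ...
class Child(Base): ...

obj = Child()
print(type(obj) is Base)      # False: exact-type check misses subclasses
print(isinstance(obj, Base))  # True: isinstance() accepts subclasses
```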

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
@benchislett
Collaborator

I ran this to validate on 8xB200:

VLLM_ATTENTION_BACKEND=CUTLASS_MLA vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8 --max-model-len 8192 --no-enable-prefix-caching --port 8049 --max-num-batched-tokens 256

Where it completely failed before, it now passes GSM8K.

I also ran it with MTP on my development branch (#25984), which relies more heavily on this prefill functionality, and it passed as well:

VLLM_ATTENTION_BACKEND=CUTLASS_MLA vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8 --max-model-len 8192 --no-enable-prefix-caching --port 8049 --max-num-batched-tokens 256 --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9538|±  |0.0058|
|     |       |strict-match    |     5|exact_match|↑  |0.9500|±  |0.0060|
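
For reference, that table is in the lm-evaluation-harness output format; a command roughly like the following, pointed at the server started above (port 8049), would produce it; treat the exact flags as illustrative rather than verified:

```
lm_eval --model local-completions \
  --model_args model=deepseek-ai/DeepSeek-R1-0528,base_url=http://localhost:8049/v1/completions \
  --tasks gsm8k --num_fewshot 5
```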

@benchislett benchislett added the bug Something isn't working label Oct 2, 2025
Collaborator

@benchislett benchislett left a comment


I, too, am curious why this didn't (and doesn't) cause a shape mismatch. That said, it clearly solves the problem.

I would approve if I understood why this solves the issue.
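
One plausible (unverified) explanation for the missing shape error: `(q_len, num_heads)` and `(num_heads, q_len)` contain the same number of elements, so any downstream step that flattens, views, or reshapes the LSE accepts either layout without complaint, and the merge then pairs values with the wrong heads and positions: a silent accuracy bug rather than a crash. A self-contained illustration in PyTorch:

```python
import torch

num_heads, q_len = 16, 8
lse_expected = torch.randn(num_heads, q_len)                # layout the merge assumes
lse_flashinfer = lse_expected.transpose(0, 1).contiguous()  # layout FlashInfer returned

# Viewing it back to the assumed layout succeeds, since numel() matches...
reinterpreted = lse_flashinfer.view(num_heads, q_len)
# ...but the values no longer line up, so results are silently wrong.
print(torch.equal(reinterpreted, lse_expected))  # False
```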

@benchislett
Collaborator

FIX #26042

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 2, 2025
@benchislett benchislett enabled auto-merge (squash) October 2, 2025 15:12
@benchislett benchislett merged commit decf7f7 into vllm-project:main Oct 2, 2025
49 checks passed
simon-mo pushed a commit that referenced this pull request Oct 2, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
@mgoin mgoin deleted the lwilkinson/fix-fi-accuracy-issue branch October 2, 2025 18:36
pdasigi pushed a commit to pdasigi/vllm that referenced this pull request Oct 2, 2025
…t#26063)

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>