Conversation

@bnellnm bnellnm commented Sep 19, 2025

Purpose

Add fp8 blocked quantization support to compressed tensors MoE.
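For context, blocked (block-wise) fp8 quantization stores one scale factor per fixed-size block of a weight tensor, rather than one scale per tensor or per channel. The following is a minimal pure-Python sketch of the bookkeeping only: it uses toy 2-element blocks instead of the 128x128 weight blocks the real kernels use, and `round()` stands in for the actual fp8 e4m3 cast. All names here are illustrative, not vLLM APIs.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in fp8 e4m3

def quantize_blockwise(weights, block_size=2):
    """Split a 1-D weight list into blocks; store one scale per block."""
    q, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        amax = max(abs(w) for w in block) or 1.0  # avoid div-by-zero blocks
        scale = amax / FP8_E4M3_MAX               # per-block scale factor
        scales.append(scale)
        # round() is a stand-in for the real fp8 cast
        q.extend(round(w / scale) for w in block)
    return q, scales

def dequantize_blockwise(q, scales, block_size=2):
    """Recover approximate weights by applying each block's scale."""
    return [v * scales[i // block_size] for i, v in enumerate(q)]

w = [0.5, -1.0, 300.0, -150.0]
q, s = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, s)
```

The point of the per-block scale is that a single outlier (like the 300.0 above) only degrades the precision of its own block, not the whole tensor.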

Test Plan

Run lm-eval on the following models:

  • RedHatAI/Qwen3-30B-A3B-FP8-BLOCK
  • nm-testing/Llama-4-Maverick-17B-128E-Instruct-FP8-BLOCK
  • nm-testing/Llama-4-Scout-17B-16E-Instruct-FP8-BLOCK

Test Result

  • RedHatAI/Qwen3-30B-A3B-FP8-BLOCK
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.86|±  |0.0349|
|     |       |strict-match    |     5|exact_match|↑  | 0.95|±  |0.0219|
  • nm-testing/Llama-4-Maverick-17B-128E-Instruct-FP8-BLOCK
OOM'd on an 8-GPU machine.
  • nm-testing/Llama-4-Scout-17B-16E-Instruct-FP8-BLOCK
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.94|±  |0.0239|
|     |       |strict-match    |     5|exact_match|↑  | 0.92|±  |0.0273|

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for blocked FP8 quantization for Mixture of Experts (MoE) layers, primarily to enable DeepGEMM kernels. The changes are mostly in vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py. My review identifies a critical issue with parameter registration that could lead to incorrect behavior and makes the code confusing. I've also noted a minor issue in vllm/model_executor/warmup/deep_gemm_warmup.py.

Signed-off-by: Bill Nell <bnell@redhat.com>
@tlrmchlsmth tlrmchlsmth added this to the v0.10.3 milestone Sep 19, 2025
Signed-off-by: Bill Nell <bnell@redhat.com>
bnellnm commented Sep 22, 2025

@mgoin , @tlrmchlsmth I think this is ready to go as a first cut.


@yewentao256 yewentao256 left a comment


Looks good and thanks for the work!
Could you also add performance metrics using vllm bench throughput...?

Signed-off-by: Bill Nell <bnell@redhat.com>
bnellnm commented Sep 22, 2025

Benchmark with DP=4 + deepep_high_throughput

```
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  60.33
Total input tokens:                      1021533
Total generated tokens:                  110757
Request throughput (req/s):              16.58
Output token throughput (tok/s):         1835.83
Peak output token throughput (tok/s):    4559.00
Peak concurrent requests:                1000.00
Total Token throughput (tok/s):          18768.05
---------------Time to First Token----------------
Mean TTFT (ms):                          20646.11
Median TTFT (ms):                        23193.82
P99 TTFT (ms):                           50440.13
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          176.28
Median TPOT (ms):                        183.99
P99 TPOT (ms):                           310.81
---------------Inter-token Latency----------------
Mean ITL (ms):                           173.92
Median ITL (ms):                         114.73
P99 ITL (ms):                            2261.17
==================================================
```
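As a quick consistency check (my own arithmetic, not part of the PR), the headline throughput figures above follow directly from the raw request/token counts and the benchmark duration, up to rounding of the reported duration:

```python
# Raw counts taken from the serving benchmark output above.
requests = 1000
duration_s = 60.33
input_tokens = 1_021_533
output_tokens = 110_757

req_per_s = requests / duration_s                              # ~16.58 req/s
out_tok_per_s = output_tokens / duration_s                     # ~1835.8 tok/s
total_tok_per_s = (input_tokens + output_tokens) / duration_s  # ~18768 tok/s
```

The small discrepancies in the last digit come from the duration being rounded to two decimals in the report.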

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 23, 2025
@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) September 23, 2025 01:48
@tlrmchlsmth tlrmchlsmth merged commit f11e3c5 into vllm-project:main Sep 23, 2025
47 checks passed
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
…vllm-project#25219)

Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
…#25219)

Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>