[Kernels] Support blocked fp8 quantization for compressed tensors MoE #25219
Conversation
Code Review
This pull request adds support for blocked FP8 quantization for Mixture of Experts (MoE) layers, primarily to enable DeepGEMM kernels. The changes are mostly in vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py. My review identifies a critical issue with parameter registration that could lead to incorrect behavior and makes the code confusing. I've also noted a minor issue in vllm/model_executor/warmup/deep_gemm_warmup.py.
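For readers following along: blocked fp8 quantization of this kind stores one scale per 128×128 weight tile, which is the layout the DeepGEMM kernels consume (applied per expert for MoE weights). Below is a minimal PyTorch sketch of the idea, not the actual vLLM implementation; the helper name is hypothetical.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quantize_weight_blocked(w: torch.Tensor, block: int = 128):
    """Quantize a 2-D weight to fp8 with one scale per (block x block) tile."""
    n, k = w.shape
    n_t, k_t = -(-n // block), -(-k // block)  # ceil-div tile counts
    # Zero-pad so the matrix tiles evenly, then view it as tiles.
    padded = torch.zeros(n_t * block, k_t * block, dtype=w.dtype, device=w.device)
    padded[:n, :k] = w
    tiles = padded.view(n_t, block, k_t, block)
    # Per-tile scale chosen so each tile's absmax maps onto the fp8 max.
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).float().clamp(min=1e-12)
    scale = amax / FP8_MAX
    q = (tiles.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    q = q.view(n_t * block, k_t * block)[:n, :k].contiguous()
    # Dequantization is w ≈ q.float() * scale, broadcast per tile.
    return q, scale.reshape(n_t, k_t)
```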
Force-pushed from 216add1 to 1ce4114.
@mgoin, @tlrmchlsmth I think this is ready to go as a first cut.
Looks good, and thanks for the work! Could you also add performance metrics using `vllm bench throughput ...`?
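For reference, such a run might look like the following; the model path is a placeholder and the exact flags are assumptions that can differ across vLLM versions.

```bash
vllm bench throughput \
    --model <blocked-fp8-compressed-tensors-moe-model> \
    --input-len 1000 \
    --output-len 100 \
    --num-prompts 500
```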
Benchmark with DP=4 + deepep_high_throughput:
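A sketch of how such a run could be launched; the environment variable and flags match recent vLLM, but treat the exact invocation as an assumption, and the model path as a placeholder.

```bash
VLLM_ALL2ALL_BACKEND=deepep_high_throughput \
vllm serve <blocked-fp8-moe-model> \
    --data-parallel-size 4 \
    --enable-expert-parallel
```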
…vllm-project#25219)
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>

…#25219)
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Purpose
Add fp8 blocked quantization support to compressed tensors MoE.
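For illustration, a blocked fp8 scheme in a compressed-tensors checkpoint's quantization_config might look roughly like this; the field names follow the compressed-tensors format as I understand it, and the specific values are assumptions rather than anything taken from this PR.

```json
{
  "quantization_config": {
    "format": "float-quantized",
    "config_groups": {
      "group_0": {
        "targets": ["Linear"],
        "weights": {
          "num_bits": 8,
          "type": "float",
          "strategy": "block",
          "block_structure": [128, 128],
          "symmetric": true,
          "dynamic": false
        },
        "input_activations": {
          "num_bits": 8,
          "type": "float",
          "strategy": "group",
          "group_size": 128,
          "dynamic": true
        }
      }
    }
  }
}
```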
Test Plan
Run lm-eval on the following models:
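A run of this shape would be typical for each model; the model path is a placeholder and the task and flags are illustrative assumptions, not taken from this PR.

```bash
lm_eval --model vllm \
    --model_args pretrained=<model>,tensor_parallel_size=1 \
    --tasks gsm8k \
    --batch_size auto
```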
Test Result