Skip to content

Conversation

yewentao256
Copy link
Member

@yewentao256 yewentao256 commented Sep 25, 2025

Purpose

A follow up for #19085

We now have Warm up for DeepGEMM so JIT won't be a problem.

And we don't use torch compile for DeepGEMM after checking the code, so it is safe to remove the redundant logic

Test

Acc

lm_eval --model vllm --model_args "pretrained=Qwen/Qwen3-30B-A3B-FP8,max_model_len=32768,enforce_eager=True" --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto

# now
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.8529|±  |0.0098|
|     |       |strict-match    |     5|exact_match||0.8855|±  |0.0088|
# main
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.8529|±  |0.0098|
|     |       |strict-match    |     5|exact_match||0.8855|±  |0.0088|

Perf

vllm bench throughput --model Qwen/Qwen3-30B-A3B-FP8 --load-format dummy --input-len 1000 --output-len 100 --trust_remote_code --enable-expert-parallel

# now
Throughput: 41.33 requests/s, 45353.06 total tokens/s, 4133.26 output tokens/s
# main
Throughput: 41.28 requests/s, 45299.81 total tokens/s, 4128.41 output tokens/s

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a good refactoring that simplifies the DeepGEMM integration. It removes the redundant custom operator registration for w8a8_block_fp8_matmul_deepgemm by deleting the vllm/model_executor/layers/quantization/deepgemm.py file. The usage of the custom op is correctly replaced with a direct call to fp8_gemm_nt in vllm/model_executor/layers/quantization/utils/fp8_utils.py. This change removes unnecessary indirection and makes the code cleaner and easier to follow. The justification provided in the pull request description is sound, and the changes appear correct.

@yewentao256 yewentao256 changed the title [Refactor] Remove DeepGEMM Register [Refactor] Remove DeepGEMM OP Register Sep 25, 2025
@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 25, 2025
Copy link
Member

@mgoin mgoin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, unnecessary overhead

@mgoin mgoin merged commit 9fe4c2b into main Sep 26, 2025
63 checks passed
@mgoin mgoin deleted the wentao-remove-deepgemm-register branch September 26, 2025 00:13
yewentao256 added a commit that referenced this pull request Oct 3, 2025
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants