[Perf] Optimize `moe_align_block_size` CUDA kernel #19572
Conversation
Summary of Changes
Hello @yewentao256, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request optimizes the `moe_align_block_size` CUDA kernel, a critical component in Mixture-of-Experts (MoE) models. It replaces disparate implementations with a unified, high-performance C++ kernel, leading to notable speedups for configurations with many experts. The change also simplifies the codebase by deprecating older kernels and includes new tests and benchmarks to validate the improvements.
Highlights
- Kernel Unification: Consolidated multiple CUDA kernel implementations for `moe_align_block_size` into a single, optimized C++ implementation. The previous `moe_align_block_size_kernel`, `moe_align_block_size_global_mem_kernel`, and `sgl_moe_align_block_size_kernel` have been replaced or refactored. (A usage sketch of the unified entry point follows this list.)
- Performance Improvement: The new C++ kernel shows significant performance improvements, particularly for larger numbers of experts (e.g., 224, 256, 512), as demonstrated by the provided benchmark results on B200.
- Deprecation: The Python `moe_align_block_size_triton` and the C++ `sgl_moe_align_block_size` functions are marked for deprecation, simplifying the codebase.
- Testing and Benchmarking: Added new unit tests and a benchmark script specifically for `moe_align_block_size` to verify correctness and measure performance against the Triton implementation.
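For context on what this op produces, here is a minimal usage sketch of the unified Python entry point. The import path is taken from the module named in this PR, but the exact signature and the `(sorted_ids, expert_ids, num_tokens_post_pad)` return convention are assumptions, so treat this as illustrative rather than authoritative.

```python
import torch

# Assumed import path, based on the module referenced in this PR.
from vllm.model_executor.layers.fused_moe.moe_align_block_size import (
    moe_align_block_size)

# topk_ids[i, j] is the j-th expert selected for token i.
topk_ids = torch.randint(0, 64, (4096, 8), dtype=torch.int32, device="cuda")

# Group token-expert pairs by expert and pad each expert's group to a
# multiple of block_size so the fused MoE GEMM can work on uniform blocks.
sorted_ids, expert_ids, num_tokens_post_pad = moe_align_block_size(
    topk_ids, block_size=128, num_experts=64)
```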
Code Review
This pull request significantly refactors and optimizes the `moe_align_block_size` CUDA kernel. Key changes include:
- Kernel Consolidation: The existing `sgl_moe_align_block_size_kernel` is enhanced and becomes the primary CUDA kernel, while older shared-memory and global-memory specific versions are removed. A new specialized kernel (`moe_align_block_size_small_batch_expert_kernel`) is introduced for small batch/expert scenarios.
- Unified C++ API: The host-side C++ function `moe_align_block_size` now intelligently dispatches to either the small-batch kernel or the main two-kernel path (alignment count + sort); see the reference sketch after this review.
- Python API Simplification: The Python wrapper in `vllm.model_executor.layers.fused_moe.moe_align_block_size` is simplified, removing complex dispatch logic and relying on the C++ backend.
- Deprecation: Older/redundant functions such as `sgl_moe_align_block_size` (Python and C++) and `moe_align_block_size_triton` are marked for deprecation.
- Performance: Benchmark results indicate substantial performance improvements with the new implementation.
- Testing: New benchmark and unit test files are added to verify correctness and performance against the Triton implementation.
The changes appear well-structured and the CUDA kernel logic follows established parallel programming patterns. The simplification of the Python-level dispatch is a welcome improvement for maintainability.
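To make the "alignment count + sort" path above concrete, the following is a pure-PyTorch sketch of the semantics as I read them from this review: phase one counts tokens per expert and pads each count to a multiple of `block_size`, phase two scatters token indices into the padded segments. The sentinel value and output dtypes are assumptions; the real kernel does this in parallel on the GPU.

```python
import torch

def moe_align_block_size_ref(topk_ids: torch.Tensor, block_size: int,
                             num_experts: int):
    """CPU reference: bucket token-expert pairs by expert and pad each
    expert's bucket to a multiple of block_size."""
    flat = topk_ids.flatten()

    # Phase 1 ("alignment count"): per-expert counts, rounded up to a
    # multiple of block_size, turned into per-expert start offsets.
    counts = torch.bincount(flat, minlength=num_experts)
    padded = ((counts + block_size - 1) // block_size) * block_size
    offsets = torch.cumsum(padded, dim=0) - padded
    num_tokens_post_pad = int(padded.sum())

    # Phase 2 ("sort"): scatter each token-expert pair into its expert's
    # padded segment; unused slots keep a sentinel value (here flat.numel()).
    sorted_ids = torch.full((num_tokens_post_pad,), flat.numel(),
                            dtype=torch.int64)
    fill = torch.zeros(num_experts, dtype=torch.int64)
    for i, e in enumerate(flat.tolist()):
        sorted_ids[offsets[e] + fill[e]] = i
        fill[e] += 1

    # One expert id per block of block_size slots.
    expert_ids = torch.repeat_interleave(
        torch.arange(num_experts), padded // block_size)
    return sorted_ids, expert_ids, num_tokens_post_pad
```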
@mgoin Please take a look
Looks pretty good! I would like an eval for Qwen3 and an end-to-end benchmark on DeepSeek (dummy load is ok) to make sure everything is in order, in addition to CI. Then I think this is good to go
This is what I got:
lm_eval --model vllm --model_args pretrained=Qwen/Qwen3-30B-A3B-FP8,max_model_len=32768 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
Origin
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8264|± |0.0104|
| | |strict-match | 5|exact_match|↑ |0.8923|± |0.0085|
Now
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8378|± |0.0102|
| | |strict-match | 5|exact_match|↑ |0.8916|± |0.0086|
vllm bench throughput --model deepseek-ai/deepseek-llm-7b-base --load-format dummy --input-len 1000 --output-len 100
Origin
Throughput: 45.27 requests/s, 49710.98 total tokens/s, 4527.47 output tokens/s
Throughput: 45.22 requests/s, 49653.42 total tokens/s, 4522.34 output tokens/s
Now
Throughput: 45.07 requests/s, 49545.56 total tokens/s, 4506.91 output tokens/s
Throughput: 45.57 requests/s, 50064.75 total tokens/s, 4557.25 output tokens/s
vllm bench throughput --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B --load-format dummy --input-len 1000 --output-len 100
origin
Throughput: 6.47 requests/s, 7110.63 total tokens/s, 647.04 output tokens/s
Now
Throughput: 6.51 requests/s, 7158.93 total tokens/s, 651.48 output tokens/s
More benchmark tests added:
# E=64 (improvement)
vllm bench throughput --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --load-format dummy --input-len 1000 --output-len 100 --trust_remote_code
Origin
Throughput: 54.46 requests/s, 59832.15 total tokens/s, 5445.72 output tokens/s
Now
Throughput: 58.19 requests/s, 63906.07 total tokens/s, 5819.48 output tokens/s
# This should be the same since E=256
vllm bench throughput --model RedHatAI/DeepSeek-R1-0528-quantized.w4a16 --load-format dummy --input-len 1000 --output-len 100 -tp 4
Origin
Throughput: 12.22 requests/s, 13414.60 total tokens/s, 1221.75 output tokens/s
Now
Throughput: 12.24 requests/s, 13452.07 total tokens/s, 1224.20 output tokens/s
itertools.product(
    [32, 64, 128, 256],  # block_size
    [1, 4, 16, 64, 256, 1024, 4096],  # num_tokens
    [1, 4, 16, 64],  # topk
    [64, 160, 256, 257, 260, 264],  # num_experts
)),
How long do these tests take? Should we explicitly list out the problem sizes to test?
And it would be very good to add some tests for non-power-of-two num_tokens, including some odd problem sizes.
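One possible way to act on that suggestion is sketched below; the test name, the added sizes, and the elided body are placeholders rather than what the PR ends up using.

```python
import itertools

import pytest

# Hypothetical parametrization that mixes in non-power-of-two and odd
# token counts alongside the existing power-of-two cases.
@pytest.mark.parametrize(
    "block_size,num_tokens,topk,num_experts",
    list(itertools.product(
        [32, 64, 128, 256],              # block_size
        [1, 3, 7, 83, 255, 1023, 5000],  # num_tokens, incl. odd sizes
        [1, 4, 16, 64],                  # topk
        [64, 160, 256, 257, 264],        # num_experts
    )),
)
def test_moe_align_block_size_odd_sizes(block_size, num_tokens, topk,
                                        num_experts):
    ...  # compare the CUDA kernel output against a reference implementation
```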
Great work, thank you for your investigation.
Purpose
Fixes #19517
The implementation is taken from https://github.com/sgl-project/sglang/blob/8b5f83ed3b7d2a49ad5c5cd5aa61c5d502f47dbc
Special thanks to the SGL developers!
Notes:
- Unified into a single optimized `moe_align_block_size` kernel now.
- Marked `moe_align_block_size_triton` and `sgl_moe_align_block_size` for deprecation.
Test
Tested on B200
Unit test
Benchmark (with Triton)
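The kernel-level numbers come from the benchmark script added in this PR; a minimal timing harness along these lines can reproduce a similar measurement locally. The import path and call signature are assumptions carried over from the discussion above, and the shapes are illustrative.

```python
import torch

# Assumed import path; see the Python wrapper discussed in the review.
from vllm.model_executor.layers.fused_moe.moe_align_block_size import (
    moe_align_block_size)

def time_align(num_tokens: int, topk: int, num_experts: int,
               block_size: int = 128, iters: int = 100) -> float:
    topk_ids = torch.randint(0, num_experts, (num_tokens, topk),
                             dtype=torch.int32, device="cuda")
    for _ in range(10):  # warm-up launches
        moe_align_block_size(topk_ids, block_size, num_experts)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        moe_align_block_size(topk_ids, block_size, num_experts)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

for e in (64, 224, 256, 512):
    print(f"E={e}: {time_align(4096, 8, e):.4f} ms")
```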
End to end Throughput
Throughput(fp16)
vllm bench throughput --model Qwen/Qwen3-30B-A3B --load-format dummy --input-len 1000 --output-len 100
Throughput(fp8)
vllm bench throughput --model Qwen/Qwen3-30B-A3B-FP8 --load-format dummy --input-len 1000 --output-len 100