[feat]: CUTLASS block scaled group gemm for SM100 #19757
Conversation
Summary of Changes
Hello @djmmoss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a specialized CUTLASS block-scaled grouped GEMM kernel optimized for NVIDIA's SM100 (Blackwell) architecture. This new implementation serves as an optimized alternative to existing DeepGEMM kernels, particularly for FP8 quantized Mixture-of-Experts (MoE) operations. The changes involve integrating the kernel into the build system, exposing it through the Python API, and updating the MoE layer dispatch logic to leverage this new, hardware-specific optimization, all controlled by a new environment variable.
Highlights
- New Kernel Implementation: Introduces a new CUTLASS block-scaled grouped GEMM kernel specifically optimized for SM100 (Blackwell) architectures. This kernel is designed as an alternative to DeepGEMM for FP8 quantized Mixture-of-Experts (MoE) models, aiming for improved performance on the latest NVIDIA hardware.
- Build System Integration: Updates `CMakeLists.txt` to conditionally build the new kernel. Compilation is enabled only when the CUDA compiler version is 12.8 or greater and the target architecture includes SM100, controlled by a new `ENABLE_CUTLASS_MOE_SM100` flag.
- Python API Exposure: The new `cutlass_blockwise_scaled_grouped_mm` operation is exposed to the Python API. This involves declaring the function in `csrc/ops.h`, binding it in `csrc/torch_bindings.cpp`, and providing a Python wrapper in `vllm/_custom_ops.py`.
- MoE Layer Integration: The new kernel is integrated into the `fused_experts` logic within `vllm/model_executor/layers/fused_moe/`. It can be conditionally used for FP8 quantized MoE layers based on problem size alignment and a new environment variable, orchestrating input and intermediate FP8 quantization.
- Feature Control: A new environment variable, `VLLM_USE_CUTLASS_BLOCKSCALED_GROUPED_GEMM`, has been added in `vllm/envs.py`. This allows users to explicitly enable or disable the new, hardware-specific kernel (a usage sketch follows this list).
- Comprehensive Testing: A new unit test file, `tests/kernels/moe/test_cutlass_grouped_gemm.py`, has been added. It includes helper functions for FP8 quantization and validates the correctness and accuracy of the new grouped GEMM kernel against a baseline across various problem sizes.
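As a rough orientation, here is a minimal, hypothetical sketch of how a user would opt in via the environment variable mentioned in the Feature Control bullet. The variable name comes from this PR; the model id is a placeholder, and later discussion in this thread drops the flag in favor of default-on behavior.

```python
# Hypothetical opt-in sketch, not code from this PR.
import os

# Set before importing vLLM so vllm/envs.py picks it up.
os.environ["VLLM_USE_CUTLASS_BLOCKSCALED_GROUPED_GEMM"] = "1"

from vllm import LLM

# Placeholder FP8-quantized MoE checkpoint; substitute a real model id.
llm = LLM(model="path/to/fp8-quantized-moe-checkpoint")
outputs = llm.generate(["Hello from Blackwell!"])
print(outputs[0].outputs[0].text)
```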
Code Review
This pull request introduces a new CUTLASS block-scaled grouped GEMM kernel specifically for SM100 (Blackwell) architectures, intended as an alternative to DeepGEMM for FP8 MoE models. The changes include new CUDA C++ kernels, CMake build system updates, Python bindings, and corresponding tests.
Overall, the implementation looks promising. Key areas for attention include ensuring the CMake build correctly handles shared source files like `moe_data.cu` to prevent build failures, clarifying some conditions and messages, and addressing a potential issue in the test logic related to uninitialized tensor usage.
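As an aside on the last point, here is a tiny illustration of the uninitialized-tensor pitfall being flagged (hypothetical shapes, not the PR's actual test code):

```python
import torch

# torch.empty returns whatever bytes happen to be in the allocation, so any
# comparison against it is nondeterministic unless the kernel under test
# overwrites every element.
out = torch.empty(4, 8)

# Safer: zero-initialize the output buffer (or make sure the kernel fills the
# whole buffer) before comparing against the reference.
out = torch.zeros(4, 8)
ref = torch.zeros(4, 8)
torch.testing.assert_close(out, ref)
```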
@@ -1182,6 +1187,18 @@ def fused_experts(hidden_states: torch.Tensor,
        a2_scale=a2_scale,
        apply_router_weight_on_input=apply_router_weight_on_input,
    )
    elif (allow_cutlass_block_scaled_grouped_gemm and use_fp8_w8a8 and N > 512
The `N > 512` check for deep gemm was only for performance reasons. Does the triton kernel actually beat cutlass for N <= 512?
Removed; at the moment cutlass is performing better.
    return True


def run_cutlass_block_scaled_fused_experts(
Can you also integrate this with `triton_deep_gemm_moe.py` (maybe we should change this name) so it can be used with EP?
        expert_offsets[:-1],
    )

    assert calc_diff(ref_out, out) < 1e-3, f"Cutlass grouped gemm is not accurate"
Can you use `torch.testing.assert_close` here?
Done, although given the sizes of the test there are some outliers compared to the FP32 baseline, which is why the `atol` is fairly lenient.
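For reference, a minimal sketch of the suggested pattern with a deliberately loose tolerance (the shapes and tolerance values here are illustrative, not the ones used in the test):

```python
import torch

# Stand-ins for the kernel output and the FP32 reference.
ref_out = torch.randn(256, 512, dtype=torch.float32)
out = ref_out + 1e-3 * torch.randn_like(ref_out)

# assert_close reports the largest absolute/relative mismatch on failure,
# which is more informative than a bare boolean assert; a loose atol/rtol
# accommodates the FP8 outliers mentioned above.
torch.testing.assert_close(out, ref_out, atol=1e-2, rtol=1e-2)
```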
vllm/envs.py
    # Allow use of Cutlass Blockwise Scaled Grouped GEMM kernels for fused moe ops.
    "VLLM_USE_CUTLASS_BLOCKSCALED_GROUPED_GEMM":
    lambda: bool(int(os.getenv(
        "VLLM_USE_CUTLASS_BLOCKSCALED_GROUPED_GEMM", "0"))),
Do we need an environment variable? I would think we want to use it by default if it is available.
Changed it to the default behavior.
elif (allow_cutlass_block_scaled_grouped_gemm and use_fp8_w8a8
        and _valid_cutlass_block_scaled_grouped_gemm(hidden_states, w1, w2)):
    assert apply_router_weight_on_input is False
    return run_cutlass_block_scaled_fused_experts(
        a=hidden_states,
        w1=w1,
        w2=w2,
        w1_scale=w1_scale,
        w2_scale=w2_scale,
        topk_weights=topk_weights,
        topk_ids=topk_ids
    )
Would it make sense for the usage of CUTLASS to be decided at the quantization method level? For example, the same way CUTLASS MoE is picked in `CompressedTensorsMoEMethod`'s `get_moe_method()` function.
I agree with @ElizaWszola. I think it would be better to move the dispatching decisions to the `MoeMethod` level.
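A rough, hedged sketch of the kind of quantization-method-level dispatch being suggested (not vLLM's actual code; every name below other than the ones mentioned above is a placeholder):

```python
# Hypothetical dispatch at the quantization-method level, mirroring the
# get_moe_method() pattern referenced above. All names are placeholders.
class TritonOrDeepGemmMoEMethod:
    """Existing fallback path (placeholder)."""

class CutlassBlockScaledMoEMethod:
    """SM100 block-scaled CUTLASS path (placeholder)."""

def is_sm100() -> bool:
    # Placeholder capability check, e.g. based on
    # torch.cuda.get_device_capability() on real hardware.
    return False

def get_moe_method(use_fp8_w8a8: bool, shapes_supported: bool):
    # Pick the CUTLASS-backed method only when the hardware and problem
    # shapes support it; otherwise keep the existing path.
    if use_fp8_w8a8 and is_sm100() and shapes_supported:
        return CutlassBlockScaledMoEMethod()
    return TritonOrDeepGemmMoEMethod()

print(type(get_moe_method(True, True)).__name__)
```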
This PR from @ElizaWszola already does the integration; I personally don't mind waiting for that PR to get merged and then updating this PR in a similar fashion.
Can you also add a test for the full fused MoE operation?
Are there any benchmarks for these kernels that you could report @djmmoss?
          typename LayoutSFA,
          typename LayoutSFB,
          typename ScaleConfig>
__global__ void get_ggemm_starts(
Does this do the same thing as `__get_group_gemm_starts_blockscale_fp8` in #19983? Checking to see what we can consolidate between the two PRs.
static constexpr int AlignmentB = 128 / cutlass::sizeof_bits<ElementB>::value;
static constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value;

using ArchTag = cutlass::arch::Sm100;
Could we make `ArchTag` a template parameter? And then reuse this class for both SM90 and SM100?
csrc/torch_bindings.cpp
@@ -393,6 +393,13 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
      {stride_tag});
  ops.impl("cutlass_scaled_fp4_mm", torch::kCUDA, &cutlass_scaled_fp4_mm);

  // cutlass blockwise scaledgroup GEMM
Suggested change:
- // cutlass blockwise scaledgroup GEMM
+ // cutlass blockwise scaled group GEMM
@tlrmchlsmth I had a look over #19983; I could likely integrate the SM100 changes fairly simply if the ...

In regard to performance, I'm mainly looking at GB200 DS-R1. For single-node (TP4) you can expect roughly a 1.6x speedup in max-throughput and a ~20% improvement in min-latency. For two-node (TP8) the max-throughput improvement is around 1.4x, with the same ~20% improvement in min-latency.
FYI I include moe_data for sm100 here: #20086
CMakeLists.txt
- set(CUTLASS_REVISION "v3.9.2" CACHE STRING "CUTLASS revision to use")
+ set(CUTLASS_REVISION "f115c3f85467d5d9619119d1dbeb9c03c3d73864" CACHE STRING "CUTLASS revision to use")
This just needs CUTLASS 4.0 right? It would be nice to wait for the tag -- is there an estimated date for that?
The tag was added last Friday; I've made the update 👍
CMakeLists.txt
@@ -296,6 +296,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
    "csrc/quantization/fp4/nvfp4_quant_entry.cu"
    "csrc/quantization/fp4/nvfp4_scaled_mm_entry.cu"
    "csrc/quantization/fp4/nvfp4_blockwise_moe_kernel.cu"
    "csrc/quantization/cutlass_w8a8/moe/moe_data.cu"
I don't think this is right. Have you seen #20086, where I pulled `moe_data.cu` into its own case?
Ah, I missed that. I've fixed it now, but there are some issues related to the CUTLASS v4.0.0 upgrade that I'm working through.
This pull request has merge conflicts that must be resolved before it can be merged.
Hey @djmmoss, I enabled the full CI and there are failures on the cutlass moe entrypoint: https://buildkite.com/vllm/ci/builds/23192/steps/canvas?jid=0197d1b9-6de4-4bef-8c61-59fa059d4c44
I've pushed up the fix.
Essential Elements of an Effective PR Description Checklist
- (Optional) Documentation update, such as `supported_models.md` and `examples` for a new model.

Purpose
This PR adds a CUTLASS block scaled group GEMM implementation for SM100. This is intended to be used as an alternative to DeepGEMM for Blackwell devices.

Test Plan
Adds a unit test for the new function: `tests/kernels/moe/test_cutlass_grouped_gemm.py`

Running the test:

Test Result

(Optional) Documentation Update