[Kernels][Bugfix] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs. #19717

Merged: 18 commits merged into vllm-project:main from the torch-compile-stuff branch on Jun 25, 2025

Conversation

bnellnm (Contributor) commented Jun 16, 2025

In order to avoid any issues with activation chunking and torch.compile, I've moved the remainder of the kernel calls in the FusedMoE layer into the moe_forward custom op. I've also added some additional testing for cudagraphs at the pytest level for triton, deep gemm and cutlass.
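(For readers unfamiliar with the pattern, the sketch below shows the general shape of hiding a multi-kernel forward behind a single custom op so torch.compile treats it as opaque. The op name, signatures, and body are illustrative assumptions, not the actual vLLM registration.)

```python
import torch

# Illustrative sketch only: wrap the whole MoE forward (routing, expert
# kernels, and any activation-chunking loop) behind one custom op so
# torch.compile never traces into the chunking logic or specializes on it.
@torch.library.custom_op("examples::moe_forward", mutates_args=())
def example_moe_forward(hidden_states: torch.Tensor,
                        router_logits: torch.Tensor) -> torch.Tensor:
    # ... call the triton / deep gemm / cutlass kernels here ...
    return torch.empty_like(hidden_states)  # placeholder body for the sketch

@example_moe_forward.register_fake
def _(hidden_states: torch.Tensor,
      router_logits: torch.Tensor) -> torch.Tensor:
    # Shape/dtype-only implementation used during tracing; dynamic dims
    # (e.g. the token dimension) stay symbolic instead of being baked in.
    return torch.empty_like(hidden_states)
```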

Some sizes are disabled for the CUTLASS path since they produce incorrect results.

This should help address #19631

I'm going to include these changes in #19636.

cc @tlrmchlsmth , @varun-sundar-rabindranath , @ElizaWszola

bnellnm added 8 commits June 16, 2025 17:39


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

bnellnm added 4 commits June 16, 2025 23:00
@houseroad houseroad requested a review from zou3519 June 17, 2025 03:33
bnellnm (Contributor Author) commented Jun 17, 2025

After updating the unit tests with mark_dynamic, I'm running into torch.compile problems. At this point, I'm going to back off on some of the changes and wrap the whole loop in a custom op.
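(For reference, mark_dynamic here refers to the standard Dynamo API; a minimal sketch of how a test might use it, with placeholder names for the layer and inputs:)

```python
import torch

def run_compiled(moe_layer, hidden_states, router_logits):
    # Mark the token dimension as dynamic so torch.compile does not
    # specialize and recompile for every batch size exercised by the test.
    torch._dynamo.mark_dynamic(hidden_states, 0)
    torch._dynamo.mark_dynamic(router_logits, 0)
    compiled = torch.compile(moe_layer)
    return compiled(hidden_states, router_logits)
```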

zou3519 (Collaborator) commented Jun 17, 2025

> After updating the unit tests with mark_dynamic, I'm running into torch.compile problems. At this point, I'm going to back off on some of the changes and wrap the whole loop in a custom op.

Are you referring to #19631 or something else?

Yes, the way to avoid specialization is to hide some things in a custom op

bnellnm (Contributor Author) commented Jun 18, 2025

> > After updating the unit tests with mark_dynamic, I'm running into torch.compile problems. At this point, I'm going to back off on some of the changes and wrap the whole loop in a custom op.
>
> Are you referring to #19631 or something else?
>
> Yes, the way to avoid specialization is to hide some things in a custom op

Yeah, #19631.

bnellnm added 3 commits June 18, 2025 19:59
@bnellnm changed the title from "[Kernel] Clean up and test torch.compile + cudagraph support for triton, cutlass + deep gemm MoE ops." to "[Kernel] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs." on Jun 18, 2025
@bnellnm changed the title to "[Kernel][Bugfix] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs." on Jun 18, 2025
@bnellnm changed the title to "[Kernels][Bugfix] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs." on Jun 18, 2025
tlrmchlsmth (Collaborator) left a comment:

To document somewhere in the PR description why we change from .shape to .size(), pasting from a DM between me and @bnellnm:

> i was messing with torch compile and wanted to make sure any dynamic dims got preserved properly. .size() preserves SymInts where .shape doesn't. i figured it was safer to just switch all of them
> safer+more consistent
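(Concretely, the mechanical change looks like the sketch below; the names are illustrative, not the PR's exact identifiers.)

```python
import torch

def num_chunks(hidden_states: torch.Tensor, chunk_size: int) -> int:
    # Read dimensions via .size(dim) rather than .shape[dim] in code that
    # runs under torch.compile, so a dynamic token dimension is carried
    # through symbolically rather than specialized.
    num_tokens = hidden_states.size(0)  # previously: hidden_states.shape[0]
    return (num_tokens + chunk_size - 1) // chunk_size
```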

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 24, 2025
@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) June 24, 2025 20:00
auto-merge was automatically disabled June 24, 2025 21:48

Head branch was pushed to by a user without write access

@vllm-bot vllm-bot merged commit 015fab8 into vllm-project:main Jun 25, 2025
69 of 71 checks passed
@bnellnm bnellnm deleted the torch-compile-stuff branch June 25, 2025 06:24
zou3519 (Collaborator) commented Jun 25, 2025

> To document somewhere in the PR description why we change from .shape to .size(), pasting from a DM between me and @bnellnm:
>
> > i was messing with torch compile and wanted to make sure any dynamic dims got preserved properly. .size() preserves SymInts where .shape doesn't. i figured it was safer to just switch all of them
> > safer+more consistent

I don't think there should be a difference between these two. If there is, then it's a bug in PyTorch -- both of them should preserve SymInts. cc @bobrenjc93

bnellnm (Contributor Author) commented Jun 25, 2025

> > To document somewhere in the PR description why we change from .shape to .size(), pasting from a DM between me and @bnellnm:
> >
> > > i was messing with torch compile and wanted to make sure any dynamic dims got preserved properly. .size() preserves SymInts where .shape doesn't. i figured it was safer to just switch all of them
> > > safer+more consistent
>
> I don't think there should be a difference between these two. If there is, then it's a bug in PyTorch -- both of them should preserve SymInts. cc @bobrenjc93

Maybe this has changed in newer versions of PyTorch, but if I recall correctly, at one point using .size() was recommended over .shape for preserving dynamic dimensions.

@@ -29,7 +29,10 @@
(224, 1024, 1536),
(224, 3072, 1024),
(224, 3072, 1536),
(1024 * 128, 1024, 1024),
(32768, 1024, 1024),
# These sizes trigger wrong answers.
A collaborator commented on this hunk:

@bnellnm could you share some e2e numerical testing results here? Does this mean the current approach still has correctness problems?

bnellnm (Contributor Author) replied:

This is an issue with the cutlass kernels themselves (separate from chunking). I'm not sure if the tolerances need to be changed or if there's a real problem with cutlass. @ElizaWszola can probably provide some more insight here.
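(For context, the check in question is of this general form — a sketch with assumed names and tolerances. Widening rtol/atol would distinguish the two explanations: accumulation-order noise passes with a slightly looser tolerance, while a real kernel bug keeps failing.)

```python
import torch

def check_moe_output(out: torch.Tensor, ref: torch.Tensor,
                     rtol: float = 1e-2, atol: float = 1e-2) -> None:
    # Compare the CUTLASS MoE output against a reference implementation.
    # If only the tolerances are too tight, loosening rtol/atol makes the
    # failing shapes pass; a genuine kernel bug fails at any sane tolerance.
    torch.testing.assert_close(out, ref, rtol=rtol, atol=atol)
```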

tdoublep (Member) commented Jun 26, 2025

This PR broke the following two tests (they are not run by default during CI):

FAILED models/language/generation/test_common.py::test_models[False-5-32-TitanML/tiny-mixtral] - AssertionError: Test0:
--
  | [2025-06-26T17:43:39Z] Matched tokens:	[]
  | [2025-06-26T17:43:39Z] hf:	'ottdone бли施 Besodigd De Account me merc De me ге酒ott Bes酒ottott酒наль酒ott Potavirus酒酒酒酒酒 Davisavirus'	{1562: -7.970005989074707, 29191: -8.063755989074707, 13654: -8.188755989074707, 30658: -8.235630989074707, 20671: -8.360630989074707}
  | [2025-06-26T17:43:39Z] vllm:	' orth surg блиDIMemulumodigdría denom Васи Васи scattered whispsi黄tiestiestiesacon crienthenthenthenthSingriegpsi"},iface chefarn寸'	{17396: Logprob(logprob=-8.14117431640625, rank=1, decoded_token='▁orth'), 17358: Logprob(logprob=-8.21929931640625, rank=2, decoded_token='▁conflic'), 15630: Logprob(logprob=-8.25054931640625, rank=3, decoded_token='odigd'), 8393: Logprob(logprob=-8.25054931640625, rank=4, decoded_token='▁Bes'), 18018: Logprob(logprob=-8.31304931640625, rank=5, decoded_token='▁preserve')}
  | [2025-06-26T17:43:39Z] 

FAILED models/language/generation/test_granite.py::test_models[5-64-bfloat16-ibm/PowerMoE-3b] - AssertionError: Test0:
  | [2025-06-26T17:43:39Z] Matched tokens:	[203]
  | [2025-06-26T17:43:39Z] hf:	'\n## Installation\n\n### Requirements\n\n- Python 3.7+\n- PyTorch 1.8+\n- Transformers 4.10+\n- Hugging Face Transformers 4.10+\n- PyTorch-Lightning 1.4+\n- PyTorch-Lightning Data'	{433: -0.1428334265947342, 1318: -2.1428334712982178, 1011: -4.892833232879639, 1482: -5.142833232879639, 383: -6.642833232879639}
  | [2025-06-26T17:43:39Z] vllm:	'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThe 1\n\n\n\n\n\n\n\nThe 1\n\n\n\n\nThe 1\n\n\n\n\nThe 1\n\n\n\n\nThe 1\n\n\n\nThe 1\n'	{203: Logprob(logprob=-1.6709017753601074, rank=1, decoded_token='Ċ'), 31: Logprob(logprob=-2.5459017753601074, rank=2, decoded_token='-'), 1318: Logprob(logprob=-3.6396517753601074, rank=3, decoded_token='The'), 51: Logprob(logprob=-4.077151775360107, rank=4, decoded_token='A'), 15698: Logprob(logprob=-4.577151775360107, rank=5, decoded_token='âĢ¢')}

I have confirmed that both tests pass using the commit before this PR was merged (f59fc60). They also pass with the changes from this PR but only if I enable eager mode.

The failure seems to come specifically from always using the torch op. If I take f59fc60 and set use_direct_call=False then it fails in the same way.

bnellnm (Contributor Author) commented Jun 26, 2025

[Fix] Fix gemma CI test failing on main #20124

The solution for failures like this is to make sure a forward_context is always set up while running a model. It may be possible to check this and use a dummy context instead.

I'll either update the test or make the MoE layer tolerate a missing context.
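(Roughly, the test-side fix would look like the sketch below. set_forward_context lives in vllm.forward_context, but its exact signature and the surrounding names used here are assumptions and vary between vLLM versions.)

```python
from vllm.forward_context import set_forward_context

def run_with_forward_context(model, vllm_config, input_ids, positions):
    # Sketch: the fused MoE path reads state from the current forward
    # context, so a bare model invocation (as in these generation tests)
    # needs one established around the forward pass -- here with no
    # attention metadata.
    with set_forward_context(None, vllm_config):
        return model(input_ids, positions)
```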

Labels: ready (ONLY add when PR is ready to merge/full CI is needed)

6 participants