[Kernels][Bugfix] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs. #19717

Merged: 18 commits merged into vllm-project:main from the torch-compile-stuff branch on Jun 25, 2025

Conversation

bnellnm (Contributor) commented Jun 16, 2025

In order to avoid any issues with activation chunking and torch.compile, I've moved the remainder of the kernel calls in the FusedMoE layer into the moe_forward custom op. I've also added some additional testing for cudagraphs at the pytest level for triton, deep gemm and cutlass.
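(For readers unfamiliar with the pattern, the sketch below shows the general shape of hiding a multi-kernel forward behind a single custom op so torch.compile treats it as opaque. The op name, signatures, and body are illustrative assumptions, not the actual vLLM registration.)

```python
import torch

# Illustrative sketch only: wrap the whole MoE forward (routing, expert
# kernels, and any activation-chunking loop) behind one custom op so
# torch.compile never traces into the chunking logic or specializes on it.
@torch.library.custom_op("examples::moe_forward", mutates_args=())
def example_moe_forward(hidden_states: torch.Tensor,
                        router_logits: torch.Tensor) -> torch.Tensor:
    # ... call the triton / deep gemm / cutlass kernels here ...
    return torch.empty_like(hidden_states)  # placeholder body for the sketch

@example_moe_forward.register_fake
def _(hidden_states: torch.Tensor,
      router_logits: torch.Tensor) -> torch.Tensor:
    # Shape/dtype-only implementation used during tracing; dynamic dims
    # (e.g. the token dimension) stay symbolic instead of being baked in.
    return torch.empty_like(hidden_states)
```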

Some sizes are disabled for the CUTLASS path since they produce incorrect results.

This should help address #19631

I'm going to include these changes in #19636.

cc @tlrmchlsmth , @varun-sundar-rabindranath , @ElizaWszola

bnellnm added 8 commits June 16, 2025 17:39


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

bnellnm added 4 commits June 16, 2025 23:00
@houseroad houseroad requested a review from zou3519 June 17, 2025 03:33
bnellnm (Contributor Author) commented Jun 17, 2025

After updating the unit tests with mark_dynamic, I'm running into torch.compile problems. At this point, I'm going to back off on some of the changes and wrap the whole loop in a custom op.
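(For reference, mark_dynamic here refers to the standard Dynamo API; a minimal sketch of how a test might use it, with placeholder names for the layer and inputs:)

```python
import torch

def run_compiled(moe_layer, hidden_states, router_logits):
    # Mark the token dimension as dynamic so torch.compile does not
    # specialize and recompile for every batch size exercised by the test.
    torch._dynamo.mark_dynamic(hidden_states, 0)
    torch._dynamo.mark_dynamic(router_logits, 0)
    compiled = torch.compile(moe_layer)
    return compiled(hidden_states, router_logits)
```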

zou3519 (Collaborator) commented Jun 17, 2025

> After updating the unit tests with mark_dynamic, I'm running into torch.compile problems. At this point, I'm going to back off on some of the changes and wrap the whole loop in a custom op.

Are you referring to #19631 or something else?

Yes, the way to avoid specialization is to hide some things in a custom op

bnellnm (Contributor Author) commented Jun 18, 2025

> > After updating the unit tests with mark_dynamic, I'm running into torch.compile problems. At this point, I'm going to back off on some of the changes and wrap the whole loop in a custom op.
>
> Are you referring to #19631 or something else?
>
> Yes, the way to avoid specialization is to hide some things in a custom op

Yeah, #19631.

bnellnm added 3 commits June 18, 2025 19:59
@bnellnm changed the title from "[Kernel] Clean up and test torch.compile + cudagraph support for triton, cutlass + deep gemm MoE ops." to "[Kernel] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs." on Jun 18, 2025
@bnellnm changed the title to "[Kernel][Bugfix] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs." on Jun 18, 2025
@bnellnm changed the title to "[Kernels][Bugfix] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs." on Jun 18, 2025
tlrmchlsmth (Collaborator) left a comment:

To document somewhere in the PR description why we change from .shape to .size(), pasting from a DM between me and @bnellnm:

> i was messing with torch compile and wanted to make sure any dynamic dims got preserved properly. .size() preserves SymInts where .shape doesn't. i figured it was safer to just switch all of them
> safer+more consistent
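(Concretely, the mechanical change looks like the sketch below; the names are illustrative, not the PR's exact identifiers.)

```python
import torch

def num_chunks(hidden_states: torch.Tensor, chunk_size: int) -> int:
    # Read dimensions via .size(dim) rather than .shape[dim] in code that
    # runs under torch.compile, so a dynamic token dimension is carried
    # through symbolically rather than specialized.
    num_tokens = hidden_states.size(0)  # previously: hidden_states.shape[0]
    return (num_tokens + chunk_size - 1) // chunk_size
```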

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 24, 2025
@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) June 24, 2025 20:00
auto-merge was automatically disabled June 24, 2025 21:48

Head branch was pushed to by a user without write access

@vllm-bot vllm-bot merged commit 015fab8 into vllm-project:main Jun 25, 2025
69 of 71 checks passed
@bnellnm bnellnm deleted the torch-compile-stuff branch June 25, 2025 06:24
zou3519 (Collaborator) commented Jun 25, 2025

> To document somewhere in the PR description why we change from .shape to .size(), pasting from a DM between me and @bnellnm:
>
> > i was messing with torch compile and wanted to make sure any dynamic dims got preserved properly. .size() preserves SymInts where .shape doesn't. i figured it was safer to just switch all of them
> > safer+more consistent

I don't think there should be a difference between these two. If there is, then it's a bug in PyTorch -- both of them should preserve SymInts. cc @bobrenjc93

bnellnm (Contributor Author) commented Jun 25, 2025

> > To document somewhere in the PR description why we change from .shape to .size(), pasting from a DM between me and @bnellnm:
> >
> > > i was messing with torch compile and wanted to make sure any dynamic dims got preserved properly. .size() preserves SymInts where .shape doesn't. i figured it was safer to just switch all of them
> > > safer+more consistent
>
> I don't think there should be a difference between these two. If there is, then it's a bug in PyTorch -- both of them should preserve SymInts. cc @bobrenjc93

Maybe this has changed in newer versions of PyTorch, but if I recall correctly, at one point using .size() was recommended over .shape for preserving dynamic dimensions.

@@ -29,7 +29,10 @@
(224, 1024, 1536),
(224, 3072, 1024),
(224, 3072, 1536),
(1024 * 128, 1024, 1024),
(32768, 1024, 1024),
# These sizes trigger wrong answers.
A collaborator commented on this hunk:

@bnellnm could you share some e2e numerical testing results here? Does this mean the current approach still has correctness problems?

bnellnm (Contributor Author) replied:

This is an issue with the cutlass kernels themselves (separate from chunking). I'm not sure if the tolerances need to be changed or if there's a real problem with cutlass. @ElizaWszola can probably provide some more insight here.
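(For context, the check in question is of this general form — a sketch with assumed names and tolerances. Widening rtol/atol would distinguish the two explanations: accumulation-order noise passes with a slightly looser tolerance, while a real kernel bug keeps failing.)

```python
import torch

def check_moe_output(out: torch.Tensor, ref: torch.Tensor,
                     rtol: float = 1e-2, atol: float = 1e-2) -> None:
    # Compare the CUTLASS MoE output against a reference implementation.
    # If only the tolerances are too tight, loosening rtol/atol makes the
    # failing shapes pass; a genuine kernel bug fails at any sane tolerance.
    torch.testing.assert_close(out, ref, rtol=rtol, atol=atol)
```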

tdoublep (Member) commented Jun 26, 2025

This PR broke the following two tests (they are not run by default during CI):

FAILED models/language/generation/test_common.py::test_models[False-5-32-TitanML/tiny-mixtral] - AssertionError: Test0:
--
  | [2025-06-26T17:43:39Z] Matched tokens:	[]
  | [2025-06-26T17:43:39Z] hf:	'ottdone бли施 Besodigd De Account me merc De me ге酒ott Bes酒ottott酒наль酒ott Potavirus酒酒酒酒酒 Davisavirus'	{1562: -7.970005989074707, 29191: -8.063755989074707, 13654: -8.188755989074707, 30658: -8.235630989074707, 20671: -8.360630989074707}
  | [2025-06-26T17:43:39Z] vllm:	' orth surg блиDIMemulumodigdría denom Васи Васи scattered whispsi黄tiestiestiesacon crienthenthenthenthSingriegpsi"},iface chefarn寸'	{17396: Logprob(logprob=-8.14117431640625, rank=1, decoded_token='▁orth'), 17358: Logprob(logprob=-8.21929931640625, rank=2, decoded_token='▁conflic'), 15630: Logprob(logprob=-8.25054931640625, rank=3, decoded_token='odigd'), 8393: Logprob(logprob=-8.25054931640625, rank=4, decoded_token='▁Bes'), 18018: Logprob(logprob=-8.31304931640625, rank=5, decoded_token='▁preserve')}
  | [2025-06-26T17:43:39Z] 

FAILED models/language/generation/test_granite.py::test_models[5-64-bfloat16-ibm/PowerMoE-3b] - AssertionError: Test0:
  | [2025-06-26T17:43:39Z] Matched tokens:	[203]
  | [2025-06-26T17:43:39Z] hf:	'\n## Installation\n\n### Requirements\n\n- Python 3.7+\n- PyTorch 1.8+\n- Transformers 4.10+\n- Hugging Face Transformers 4.10+\n- PyTorch-Lightning 1.4+\n- PyTorch-Lightning Data'	{433: -0.1428334265947342, 1318: -2.1428334712982178, 1011: -4.892833232879639, 1482: -5.142833232879639, 383: -6.642833232879639}
  | [2025-06-26T17:43:39Z] vllm:	'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThe 1\n\n\n\n\n\n\n\nThe 1\n\n\n\n\nThe 1\n\n\n\n\nThe 1\n\n\n\n\nThe 1\n\n\n\nThe 1\n'	{203: Logprob(logprob=-1.6709017753601074, rank=1, decoded_token='Ċ'), 31: Logprob(logprob=-2.5459017753601074, rank=2, decoded_token='-'), 1318: Logprob(logprob=-3.6396517753601074, rank=3, decoded_token='The'), 51: Logprob(logprob=-4.077151775360107, rank=4, decoded_token='A'), 15698: Logprob(logprob=-4.577151775360107, rank=5, decoded_token='âĢ¢')}

I have confirmed that both tests pass using the commit before this PR was merged (f59fc60). They also pass with the changes from this PR but only if I enable eager mode.

The failure seems to come specifically from always using the torch op. If I take f59fc60 and set use_direct_call=False then it fails in the same way.

bnellnm (Contributor Author) commented Jun 26, 2025

[Fix] Fix gemma CI test failing on main #20124

The solution for failures like this is to make sure a forward_context is always set up while running a model. It may be possible to check this and use a dummy context instead.

I'll either update the test or make the MoE layer tolerate a missing context.
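(Roughly, the test-side fix would look like the sketch below. set_forward_context lives in vllm.forward_context, but its exact signature and the surrounding names used here are assumptions and vary between vLLM versions.)

```python
from vllm.forward_context import set_forward_context

def run_with_forward_context(model, vllm_config, input_ids, positions):
    # Sketch: the fused MoE path reads state from the current forward
    # context, so a bare model invocation (as in these generation tests)
    # needs one established around the forward pass -- here with no
    # attention metadata.
    with set_forward_context(None, vllm_config):
        return model(input_ids, positions)
```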

Labels: ready (ONLY add when PR is ready to merge/full CI is needed)

6 participants