Collaborator

@LucasWilkinson LucasWilkinson commented Oct 1, 2025

The early return in compute(...) calls arrive():

      cutlass::arch::NamedBarrier(
          (kNumComputeWarps + kNumLoadWarps) * NumThreadsPerWarp,
          kNamedBarrierEpilogue
      ).arrive();

but there was no barrier before it looped around and called arrive() again, causing a deadlock when the load warps wait on:

      cutlass::arch::NamedBarrier(
          (kNumComputeWarps + kNumLoadWarps) * NumThreadsPerWarp,
          kNamedBarrierEpilogue
      ).arrive_and_wait();
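The failure mode described above can be illustrated with a toy, single-threaded model of a counting barrier. This is only a sketch under stated assumptions: `ToyBarrier`, `Agent`, `run`, and `finishes` are hypothetical names, the warp counts (one compute warp, one load warp) are made up, and the model does not reproduce the actual hardware or CUTLASS semantics. It only shows how a participant that arrives without waiting can run ahead and leave a waiting participant stuck.

```cpp
#include <cassert>

// Toy model (hypothetical): a phase completes once `expected` arrivals
// have been recorded, after which the count starts over.
struct ToyBarrier {
    int expected;
    int count;
    int completed_phases;
    explicit ToyBarrier(int e) : expected(e), count(0), completed_phases(0) {}

    void arrive(int threads) {
        count += threads;
        if (count >= expected) {
            count -= expected;
            ++completed_phases;
        }
    }
};

// One group of warps that hits the barrier once per tile.
struct Agent {
    int threads;      // threads contributed per arrival
    bool waits;       // true: arrive_and_wait(), false: arrive() only
    int tiles_done;
    int waiting_for;  // phase index this agent is blocked on, -1 if runnable
    Agent(int t, bool w) : threads(t), waits(w), tiles_done(0), waiting_for(-1) {}
};

// Advance one agent as far as it can go; returns true if it made progress.
bool run(Agent& a, ToyBarrier& b, int num_tiles) {
    bool progressed = false;
    while (a.tiles_done < num_tiles) {
        if (a.waiting_for >= 0) {
            if (b.completed_phases <= a.waiting_for) break;  // still blocked
            a.waiting_for = -1;  // the awaited phase has completed
            ++a.tiles_done;
            progressed = true;
            continue;
        }
        int phase = b.completed_phases;  // phase this arrival belongs to
        b.arrive(a.threads);
        progressed = true;
        if (a.waits) {
            a.waiting_for = phase;  // block until that phase completes
        } else {
            ++a.tiles_done;  // arrive() only: loop around immediately
        }
    }
    return progressed;
}

// Returns true if both agents get through all tiles, false if one is stuck.
// The schedule lets the compute agent run ahead as far as it can.
bool finishes(bool compute_waits, int num_tiles) {
    const int kThreadsPerWarp = 32;
    ToyBarrier bar(2 * kThreadsPerWarp);  // assumed: 1 compute + 1 load warp
    Agent compute(kThreadsPerWarp, compute_waits);
    Agent load(kThreadsPerWarp, true);  // the load side always waits
    bool progress = true;
    while (progress) {
        bool c = run(compute, bar, num_tiles);
        bool l = run(load, bar, num_tiles);
        progress = c || l;
    }
    return compute.tiles_done == num_tiles && load.tiles_done == num_tiles;
}
```

In this model, `finishes(false, 2)` returns false: the compute agent arrives twice and completes a phase on its own, so the load agent's first wait never completes. `finishes(true, 2)` returns true because both sides stay in lockstep.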

Co-authored-by: Robert Shaw <robshaw@redhat.com>

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a potential hang in the CUTLASS MLA kernel under load. The changes are mostly stylistic, involving code re-indentation. However, there is one critical bug fix that changes a cutlass::arch::NamedBarrier::arrive() call to arrive_and_wait(). This change correctly resolves a race condition that could lead to a deadlock between compute and load warps, which is the likely cause of the hang. My review confirms this fix is correct and critical.

@robertgshaw2-redhat
Collaborator

TODOs (follow-up)

  • investigate whether this can solve the hangs for num_kv_splits>1
  • investigate a dynamic scheduler

@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/fix-cutlass-mla-hang branch from 746ef1a to a36b036 Compare October 1, 2025 14:10
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/fix-cutlass-mla-hang branch from a36b036 to 3a8a634 Compare October 1, 2025 14:12
@mgoin mgoin added the bug, ready, and deepseek labels Oct 1, 2025
      (kNumComputeWarps + kNumLoadWarps) * NumThreadsPerWarp,
      kNamedBarrierEpilogue
-   ).arrive();
+   ).arrive_and_wait();
Collaborator

Note for reviewer: this is the line that fixes the deadlock.

@robertgshaw2-redhat robertgshaw2-redhat changed the title from "[BugFix] Fix CUTLASS MLA hang under load" to "[BugFix][DP/EP] Fix CUTLASS MLA hang under load" Oct 1, 2025
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@mgoin
Member

mgoin commented Oct 1, 2025

The failing tests are known failures on main and are being resolved. Blackwell tests are green, so merging.

@vllm-bot vllm-bot merged commit 1726e93 into vllm-project:main Oct 1, 2025
79 of 84 checks passed
soldni pushed a commit to soldni/vllm that referenced this pull request Oct 1, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
simon-mo pushed a commit that referenced this pull request Oct 2, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
pdasigi pushed a commit to pdasigi/vllm that referenced this pull request Oct 2, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>