[Bugfix] Fix marlin kernel crash on H100 #4218

Merged (1 commit) on Apr 24, 2024

alexm-neuralmagic
Contributor

This PR addresses the Marlin kernel H100 crash that was reported here: neuralmagic#187.
The reason for the crash was the inline PTX assembly that introduced the async_copy with streaming behavior. The solution is to use the more standard PTX for async_copy (without the fractional L2 policy for "evict_first"). There is no performance difference between standard async_copy PTX and the previous one.
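
For readers who have not seen the two variants side by side, here is a minimal sketch of what "standard PTX for async_copy" can look like. It is not the diff from this PR: the helper names (`cp_async16`, `cp_async_commit`, `cp_async_wait_all`) and the test kernel are hypothetical, and the commented-out streaming variant is only an assumption about the general shape of a `createpolicy` plus `L2::cache_hint` copy, based on the description above. It assumes an sm_80 or newer target (for example `nvcc -arch=sm_80`).

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: 16-byte asynchronous copy from global to shared memory
// using plain cp.async.cg, with no L2 cache-policy operand.
__device__ inline void cp_async16(void* smem_ptr, const void* glob_ptr) {
  const int BYTES = 16;
  uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  asm volatile("cp.async.cg.shared.global [%0], [%1], %2;\n"
               :: "r"(smem), "l"(glob_ptr), "n"(BYTES));
  // Assumed shape of the removed streaming variant (not used here):
  //   createpolicy.fractional.L2::evict_first.b64 p, 1.0;
  //   cp.async.cg.shared.global.L2::cache_hint [%0], [%1], %2, p;
}

// Commit all previously issued cp.async operations into one group.
__device__ inline void cp_async_commit() {
  asm volatile("cp.async.commit_group;\n" ::);
}

// Block this thread until all of its outstanding cp.async operations finish.
__device__ inline void cp_async_wait_all() {
  asm volatile("cp.async.wait_all;\n" ::);
}

// Each thread stages one int4 through shared memory and writes it back out.
__global__ void copy_kernel(const int4* in, int4* out) {
  __shared__ int4 tile[32];
  cp_async16(&tile[threadIdx.x], &in[threadIdx.x]);
  cp_async_commit();
  cp_async_wait_all();
  __syncthreads();
  out[threadIdx.x] = tile[threadIdx.x];
}

int main() {
  const int N = 32;
  int4 h_in[N], h_out[N];
  for (int i = 0; i < N; ++i) h_in[i] = make_int4(i, i + 1, i + 2, i + 3);

  int4 *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, sizeof(h_in));
  cudaMalloc(&d_out, sizeof(h_out));
  cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

  copy_kernel<<<1, N>>>(d_in, d_out);
  cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

  // Report success only if the launch succeeded and the data round-tripped.
  bool ok = cudaGetLastError() == cudaSuccess;
  for (int i = 0; i < N; ++i)
    ok = ok && h_out[i].x == h_in[i].x && h_out[i].w == h_in[i].w;
  printf(ok ? "copy ok\n" : "copy mismatch\n");

  cudaFree(d_in);
  cudaFree(d_out);
  return ok ? 0 : 1;
}
```

The plain form drops the policy register and the `.L2::cache_hint` qualifier entirely, so the copy follows the default L2 eviction behavior rather than the "evict_first" streaming hint.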

Collaborator

@mgoin mgoin left a comment

Is there any way to keep the cache hint? It seems pretty useful, but if you measured no difference then it might be alright.

@alexm-neuralmagic
Contributor Author

I tried various modifications to the PTX to keep the cache-hint, but it did not work.

Collaborator

@pcmoritz pcmoritz left a comment

Thanks for fixing this, I validated the fix with the reproduction in neuralmagic#187. Always great to see fixes that make things simpler ❤️

@pcmoritz pcmoritz merged commit aae0824 into vllm-project:main Apr 24, 2024
47 checks passed
xjpang pushed a commit to xjpang/vllm that referenced this pull request Apr 25, 2024
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 26, 2024
alexeykondrat pushed a commit to alexeykondrat/ci-vllm that referenced this pull request May 1, 2024
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request May 7, 2024
mgoin added a commit to neuralmagic/nm-vllm that referenced this pull request May 16, 2024
Ported from dense marlin:
vllm-project#4218
mawong-amd pushed a commit to ROCm/vllm that referenced this pull request Jun 3, 2024