[V1] Resolve failed concurrent structured output requests #19565
Closes vllm-project#19493
Closes vllm-project#18376
Related to vllm-project#18780

Several people have noticed errors when using both the `xgrammar` and `guidance` backends where we would start generating invalid tokens for a request and they would be continuously rejected by the backend currently in use. The conditions seemed to be:

- Only impacts certain models
- Occurs with concurrent structured output requests

After further investigation once an easy way to reproduce was provided via vllm-project#19493, I identified more details about the failure:

- When the failure occurred in my test using a concurrency of 2, whichever request came in first was always successful. It was the second request that would fail.

Debugging further identified that the bitmask was not being applied correctly, but only for that second request. In the GPU model runner, this translates to the 2nd row in the bitmask tensor and the 2nd row of the logits tensor. I could see that a couple of bytes were left unmasked.

I suspect the reason the issue appears to be model specific has to do with the vocab and which tokens were left unmasked. I have not verified this part for sure.

The reason it occurred with both structured output backends is that we use the `xgrammar` library's implementation of applying the bitmask in all cases.

Xgrammar on CUDA, by default, uses a Triton kernel for applying the bitmask. I identified that forcing it to use the `torch.compile` implementation instead resolves the problem. The torch implementation is used for all other accelerator types in Xgrammar's logic, so it seems fine to just force the use of that implementation.

I have not yet narrowed down the problem in the Triton kernel, but this change works around the problem for vLLM. We can move back to Xgrammar's wrapper that chooses which implementation to use once we can verify everything is working properly again.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
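To make the mechanism concrete, here is a minimal reference sketch of what "applying the token bitmask" means in this discussion. It is not xgrammar's code and not the actual vLLM change; it assumes the commonly used packed layout where bit j of int32 word i covers token i * 32 + j and a set bit means "allowed". Both xgrammar's Triton kernel and its `torch.compile` implementation compute this same operation; the workaround only changes which one vLLM calls on CUDA.

```python
# Reference sketch only (assumed packed-int32 layout); xgrammar's real kernels
# implement the same masking far more efficiently.
import torch


def apply_token_bitmask_reference(logits: torch.Tensor, bitmask: torch.Tensor) -> None:
    """In-place: set logits of disallowed tokens to -inf.

    logits:  (batch, vocab) float tensor
    bitmask: (batch, ceil(vocab / 32)) int32 tensor
    """
    batch, vocab = logits.shape
    bit_positions = torch.arange(32, device=bitmask.device, dtype=torch.int32)
    # Expand each 32-bit word into 32 per-token "allowed" flags.
    allowed = ((bitmask.unsqueeze(-1) >> bit_positions) & 1).bool()
    allowed = allowed.reshape(batch, -1)[:, :vocab]  # drop padding bits
    logits.masked_fill_(~allowed, float("-inf"))
```

The failure described above corresponds to some of these flags not being honored for the second row of the batch, leaving a few grammar-disallowed tokens with finite logits that could still be sampled.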
Summary of Changes

Hello @russellb, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug affecting concurrent structured output requests, which was traced back to an issue within the `xgrammar` library's Triton kernel for applying token bitmasks. The change implements a temporary workaround by explicitly selecting the `torch.compile`-based kernel within vLLM, mitigating the failure while the root cause in the Triton kernel is further investigated.
Highlights
- Bug Fix: Resolves an issue causing concurrent structured output requests to fail for certain models when using the `xgrammar` or `guidance` backends.
- Workaround: Implements a workaround for a suspected bug in the `xgrammar` Triton kernel used for applying the token bitmask on CUDA.
- Kernel Selection: Forces the use of the `xgrammar` `torch.compile` implementation for applying the token bitmask in `vllm/v1/worker/gpu_model_runner.py` instead of relying on `xgrammar`'s default selection logic (which uses the problematic Triton kernel on CUDA).
Code Review
This pull request addresses a bug related to concurrent structured output requests failing due to issues with `xgrammar`'s Triton kernel for bitmask application. The proposed workaround, forcing the use of `xgrammar`'s `torch.compile` implementation, is a pragmatic solution and seems well-investigated. The change is minimal and targeted.
I've suggested adding an inline comment to document the workaround for future maintainability. Additionally, there's a minor typo in the PR title: "structred" should be "structured".
LGTM. Thanks for figuring that out!
…nd stride doesn't match (#390)

# Motivation
In vLLM, in order to address [Issue #19493](vllm-project/vllm#19493), [PR #19565](vllm-project/vllm#19565) served as a workaround to mitigate the issue by switching to the torch.compile version instead of the Triton version.

# Root cause
With some investigation, we found that the error would happen when logits.stride()[0] != logits.shape[-1].

# Changes
- Fix the CPU and Triton versions of apply_grammar_bitmask for this scenario: use stride[0] instead of shape[-1] when accessing rows
- Add a unit test to reproduce the errors

# Tests
We've verified:
- The new unit tests fail on trunk
- The new unit tests pass with the change (no new failures; however, there are already many failing tests in trunk: 47 failed in tests/python/test_token_bitmask_operations.py)

# Followup
We will follow up with a benchmark comparison to see whether we should bring back the Triton version or use the new CUDA version on the vLLM side.

Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
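The stride condition is easy to reproduce outside of vLLM. The sketch below uses hypothetical tensors (not the actual model-runner buffers) to show a logits view whose row stride differs from its vocabulary size, and why computing row offsets from shape[-1] instead of stride(0) lands in the wrong storage for every row after the first, matching the "first request succeeds, second request fails" symptom in this PR.

```python
# Hypothetical repro of logits.stride()[0] != logits.shape[-1]; the real vLLM
# tensors differ, but the indexing argument is the same.
import torch

vocab = 100
buf = torch.randn(4, 3 * vocab)   # pretend: a wider, contiguous parent buffer
logits = buf[:, :vocab]           # per-request logits as a non-contiguous view

assert logits.shape[-1] == vocab         # 100
assert logits.stride(0) == 3 * vocab     # 300: row stride of the parent buffer
assert not logits.is_contiguous()

# Element (i, j) of the view lives at flat offset i * stride(0) + j in storage.
storage = buf.reshape(-1)
i, j = 1, 7
right = storage[i * logits.stride(0) + j]   # buf[1, 7] == logits[1, 7]
wrong = storage[i * logits.shape[-1] + j]   # buf[0, 107]: still row 0!
assert torch.equal(right, logits[i, j])
assert torch.equal(wrong, buf[0, i * vocab + j])
```

Row 0 is unaffected because its offset is zero either way, which is consistent with the first concurrent request always succeeding; every later row gets masked at the wrong addresses.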
…lm-project#19565)" This reverts commit c57bb19. Signed-off-by: Russell Bryant <rbryant@redhat.com>