Leave response and sendError when request is canceled #3267
Conversation
Hi @slashvar Thanks for sending this PR! Do you have the steps for how to repro this problem?
Hi @agunapal! For now, I don't have an easy repro step. On our side, we see it happening under heavy load or when the number of workers/instances is a bit too small. We have a tight time budget (about 10 to 50 ms max) and usually a very large volume of requests (something like 100 to 1k requests per second, spread over several instances with multiple workers). In any case, the scenario is always the same:
And after that, the worker stops answering requests. This is the same problem described in the linked issue #3087, but also in #3168 and #3091. I'm working on a simple reproduction step that I can share.
@slashvar Thank you for the details. Have you verified this fix works for you?
Hello, sorry for the delay: setting up an environment where I could properly test the issue with a build from source took longer than expected. TL;DR: the fix works. I've managed to reproduce the problem using the mnist example, with a gRPC client in Go (adapting the code that reproduces the issue in our environment) and an explicit timeout on the request. Without the fix:
With the fix:
One note: there is still another exception from time to time, but this one happens when closing the request (when calling …). Here is my test code: https://gist.github.com/slashvar/8874c52d88895a922398289f81cd7a08
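For readers who don't open the gist, the sketch below shows the general shape of that kind of deadline-bound client. It is written in Java rather than Go (to match the frontend code discussed in this PR), and the port, generated stub and message names (`InferenceAPIsServiceGrpc`, `PredictionsRequest`), model name, and input key are assumptions about a typical TorchServe gRPC setup, not details taken from the gist.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

import com.google.protobuf.ByteString;

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

public final class TightDeadlineClient {
    public static void main(String[] args) throws Exception {
        // Assumed inference gRPC endpoint; adjust host/port to your deployment.
        ManagedChannel channel =
                ManagedChannelBuilder.forAddress("localhost", 7070).usePlaintext().build();
        // Blocking stub generated from the inference proto (assumed names).
        InferenceAPIsServiceGrpc.InferenceAPIsServiceBlockingStub stub =
                InferenceAPIsServiceGrpc.newBlockingStub(channel);

        PredictionsRequest request = PredictionsRequest.newBuilder()
                .setModelName("mnist")
                .putInput("data", ByteString.copyFrom(Files.readAllBytes(Paths.get("0.png"))))
                .build();

        try {
            // A deadline in the 10-50 ms budget described above: under load the RPC
            // is cancelled client-side before the worker has answered.
            stub.withDeadlineAfter(50, TimeUnit.MILLISECONDS).predictions(request);
        } catch (StatusRuntimeException e) {
            if (e.getStatus().getCode() != Status.Code.DEADLINE_EXCEEDED) {
                throw e; // only the deadline-triggered cancellation is expected here
            }
            System.out.println("request cancelled by its deadline");
        } finally {
            channel.shutdownNow();
        }
    }
}
```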
Awesome. Thanks! I'll try your fix with a test case I tried: load an LLM with `max_new_tokens` set to a large value and send an inference request.
@slashvar I'm hoping my repro is reliable. I basically cancel one request and then send another request, and I see it crash. I tried with your fix and I don't see the crash any more. My colleague @namannandan has identified other cases to be fixed. For now, I'll merge your PR. Please try with the nightlies to see if your issue is resolved. We can merge the other PR if the issue is not fully resolved.
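A rough sketch of that cancel-then-retry sequence is shown below, reusing the same assumed stubs and message names as the previous snippet. The actual repro used above is not shown in this thread, so treat this as an illustration only; in grpc-java, cancelling the `ListenableFuture` returned by a future stub cancels the underlying RPC.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

import com.google.common.util.concurrent.ListenableFuture;
import com.google.protobuf.ByteString;

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public final class CancelThenRetry {
    public static void main(String[] args) throws Exception {
        ManagedChannel channel =
                ManagedChannelBuilder.forAddress("localhost", 7070).usePlaintext().build();
        PredictionsRequest request = PredictionsRequest.newBuilder()
                .setModelName("mnist")
                .putInput("data", ByteString.copyFrom(Files.readAllBytes(Paths.get("0.png"))))
                .build();

        // 1. Fire a request asynchronously, give it a moment to reach the server,
        //    then cancel it from the client side.
        ListenableFuture<PredictionResponse> inFlight =
                InferenceAPIsServiceGrpc.newFutureStub(channel).predictions(request);
        Thread.sleep(5);
        inFlight.cancel(true); // cancels the in-flight RPC

        // 2. Send a second, normal request. Before the fix, the worker that handled
        //    the cancelled call could already be wedged, so this call failed or hung.
        InferenceAPIsServiceGrpc.newBlockingStub(channel)
                .withDeadlineAfter(10, TimeUnit.SECONDS)
                .predictions(request);
        System.out.println("second request answered");

        channel.shutdownNow();
    }
}
```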
LGTM. Tried the fix, and it seems to address the crash.
Description
Leave the `response` and `sendError` methods when `responseObserver` is canceled, to avoid continuing on canceled requests. This was the previous behavior, introduced by #2420 and lost during a later refactoring. A similar early return on cancel exists for the `OIPPREDICT` case in the `response` method.

Fixes #3087
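For context, the guard this PR restores follows the common grpc-java pattern of checking `ServerCallStreamObserver.isCancelled()` before writing to the stream. The sketch below is a simplified, hypothetical illustration of that pattern; the class and method names are invented for the example and this is not the actual TorchServe frontend code.

```java
import io.grpc.Status;
import io.grpc.stub.ServerCallStreamObserver;
import io.grpc.stub.StreamObserver;

// Hypothetical helper: T stands in for the frontend's prediction response type.
public final class CancelAwareResponder<T> {

    /** Sends a normal response unless the client has already cancelled the call. */
    public void respond(StreamObserver<T> responseObserver, T reply) {
        ServerCallStreamObserver<T> serverObserver =
                (ServerCallStreamObserver<T>) responseObserver;
        if (serverObserver.isCancelled()) {
            // The client went away (deadline exceeded or explicit cancel): leave early,
            // since writing to the observer now would throw on a dead stream.
            return;
        }
        responseObserver.onNext(reply);
        responseObserver.onCompleted();
    }

    /** Reports an error unless the call was already cancelled by the client. */
    public void sendError(StreamObserver<T> responseObserver, Status status, String message) {
        ServerCallStreamObserver<T> serverObserver =
                (ServerCallStreamObserver<T>) responseObserver;
        if (serverObserver.isCancelled()) {
            return;
        }
        responseObserver.onError(status.withDescription(message).asRuntimeException());
    }
}
```

grpc-java also provides `ServerCallStreamObserver.setOnCancelHandler(Runnable)` for reacting to cancellation eagerly; checking `isCancelled()` at send time, as the PR does, is the smaller change.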
Type of change

- Bug fix (non-breaking change which fixes an issue)
Feature/Issue validation/testing
To my knowledge, there are no dedicated tests for this issue (it was fixed before and then reintroduced).
I'm currently trying to figure out how to properly test it.