-
-
Notifications
You must be signed in to change notification settings - Fork 11.8k
[Feature] Add abort request endpoint to handle request cancellations #26635
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new /abort_request endpoint to allow for the cancellation of requests and the release of their associated resources, particularly for delay_free requests in disaggregated prefill scenarios. The changes are well-implemented across the API server, protocol, scheduler, and output processor. However, I've identified a critical security vulnerability: the new endpoint is not authenticated. My review includes a comment with a suggested fix for this issue.
| return JSONResponse(content=response.model_dump()) | ||
|
|
||
|
|
||
| @router.post("/abort_request", dependencies=[Depends(validate_json_request)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new /abort_request endpoint is not protected by authentication. The AuthenticationMiddleware in this file only protects routes that start with /v1. This could allow unauthorized users to abort arbitrary requests if they can guess or obtain a request_id.
To fix this, the endpoint should be moved under the /v1 path.
| @router.post("/abort_request", dependencies=[Depends(validate_json_request)]) | |
| @router.post("/v1/abort_request", dependencies=[Depends(validate_json_request)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
| # Invalid request ID. | ||
| continue | ||
|
|
||
| if request.is_finished(): | ||
| # If the request is already finished, only FINISHED_ABORTED is | ||
| # allowed, which is used to force resource cleanup. | ||
| assert finished_status == RequestStatus.FINISHED_ABORTED, ( | ||
| "Only FINISHED_ABORTED is allowed for requests that are " | ||
| "already finished." | ||
| ) | ||
| logger.info("Aborting request %s, freeing blocks.", req_id) | ||
| self._free_blocks(request) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handle finished requests before early return
The new block that frees KV blocks for already finished requests can never run because the loop above still treats request.is_finished() as an invalid request and immediately continues. When the abort endpoint is invoked for a request that has already finished, this method skips the request entirely and no cleanup is performed, leaving the KV cache allocated. Move the finished-request logic ahead of the early return or drop the request.is_finished() check in the guard so the cleanup path can execute.
Useful? React with 👍 / 👎.
|
What component would call this endpoint? How can we authenticate these requests? Who should have permission to abort a given request? In this thread we decided to rely on (at least in the NIXL case) the P worker timing out stranded KV blocks, even in the case of the client disconnecting. /cc @njhill @NickLucche |
The endpoint should be called by the proxy component that handles the prefill and decode request forwarding. When this proxy disconnects from the user, it can actively call this endpoint on the prefill node to ensure timely release of the KV cache associated with that request. While the timeout mechanism can achieve similar functionality, it may introduce unnecessary delays. Reducing the timeout duration could potentially lead to functional issues. Therefore, I believe the timeout mechanism is better suited as a fallback solution rather than the primary method for actively releasing KV cache. Regarding authentication, the current vLLM implementation doesn't actually implement endpoint authorization. While the |
|
@jianzs there are various downsides to having an out-of-band abort endpoint. It breaks the kv connector abstraction a bit since in this situation it's the connector which "owns" the blocks and it should therefore be the one to release them. In theory, the window that we have to rely on the timeout fallback for should be very small:
So the timeout should only end up being used if the cancellation happens between the proxy receiving the prefill response and sending the decode request (in practice this should be immediate), or if the decode worker dies before it's generated the first token for the request. I think these cases should be rare enough to leave to the fallback. However @markmc found that (1.) above is actually not working as intended, so if you observe that these timeout aborts are happening more often than expected (and affecting kvcache usage), that's probably the reason. It should be fairly straightforward to fix this though, see #26400. |
Purpose
In scenarios with disaggregated prefill, if an error happens during decoding or the user manually cancels the process before the KV cache is retrieved, the KV cache stored on the prefill node can't be released. This occurs because the HTTP stream between the prefill node and the user has already closed. To solve this issue, this pull request adds an active abort interface that proactively frees KV cache for
delay_freerequests.Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.