
Conversation

@jianzs (Contributor) commented Oct 11, 2025

Purpose

In disaggregated prefill scenarios, if an error occurs during decoding, or the user cancels the request before the KV cache is retrieved, the KV cache stored on the prefill node is never released, because the HTTP stream between the prefill node and the user has already been closed. To solve this, this pull request adds an active abort interface that proactively frees the KV cache for delay_free requests.
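The lifecycle this PR targets can be pictured with a minimal sketch: a prefill node that keeps KV blocks alive for delay_free requests until the decode side pulls them, plus the new proactive abort path. All names here (`KVBlockPool`, `PrefillNode`, `abort`) are illustrative stand-ins, not vLLM's actual classes.

```python
class KVBlockPool:
    """Toy KV-cache block allocator (illustrative, not vLLM's BlockPool)."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.allocated: dict[str, list[int]] = {}

    def allocate(self, request_id: str, n: int) -> list[int]:
        blocks = [self.free_blocks.pop() for _ in range(n)]
        self.allocated[request_id] = blocks
        return blocks

    def free(self, request_id: str) -> None:
        self.free_blocks.extend(self.allocated.pop(request_id, []))


class PrefillNode:
    """Keeps blocks of finished prefill requests alive until pulled or aborted."""

    def __init__(self, pool: KVBlockPool):
        self.pool = pool
        self.delay_free: set[str] = set()

    def finish_prefill(self, request_id: str) -> None:
        # Prefill done; blocks must survive until the decode node reads them.
        self.delay_free.add(request_id)

    def on_kv_pulled(self, request_id: str) -> None:
        # Normal path: the decode node retrieved the KV cache.
        self.delay_free.discard(request_id)
        self.pool.free(request_id)

    def abort(self, request_id: str) -> None:
        # New proactive path: free blocks even though the KV cache was never
        # pulled (decode-side error or client cancellation). Without this,
        # the blocks stay allocated because the HTTP stream is already gone.
        if request_id in self.delay_free:
            self.delay_free.discard(request_id)
            self.pool.free(request_id)
```

Without the `abort` path, a cancelled request would leave its four blocks allocated indefinitely; with it, the pool returns to full capacity immediately.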

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a new /abort_request endpoint to allow for the cancellation of requests and the release of their associated resources, particularly for delay_free requests in disaggregated prefill scenarios. The changes are well-implemented across the API server, protocol, scheduler, and output processor. However, I've identified a critical security vulnerability: the new endpoint is not authenticated. My review includes a comment with a suggested fix for this issue.

```python
    return JSONResponse(content=response.model_dump())


@router.post("/abort_request", dependencies=[Depends(validate_json_request)])
```

critical

The new /abort_request endpoint is not protected by authentication. The AuthenticationMiddleware in this file only protects routes that start with /v1. This could allow unauthorized users to abort arbitrary requests if they can guess or obtain a request_id.

To fix this, the endpoint should be moved under the /v1 path.

Suggested change:

```diff
-@router.post("/abort_request", dependencies=[Depends(validate_json_request)])
+@router.post("/v1/abort_request", dependencies=[Depends(validate_json_request)])
```
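The vulnerability hinges on a simple path-prefix check. Assuming the middleware guards only paths under `/v1` (the predicate below is an illustration of that policy, not the middleware's actual code), the endpoint's path alone decides whether a caller must present an API key:

```python
# Illustrative model of a prefix-based auth policy like the one the
# reviewer describes: only /v1 routes require authentication.
def requires_auth(path: str) -> bool:
    return path.startswith("/v1")

# The endpoint as proposed bypasses authentication entirely;
# moving it under /v1 puts it behind the API-key check.
assert not requires_auth("/abort_request")
assert requires_auth("/v1/abort_request")
```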

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 1193 to +1204
```python
            # Invalid request ID.
            continue

        if request.is_finished():
            # If the request is already finished, only FINISHED_ABORTED is
            # allowed, which is used to force resource cleanup.
            assert finished_status == RequestStatus.FINISHED_ABORTED, (
                "Only FINISHED_ABORTED is allowed for requests that are "
                "already finished."
            )
            logger.info("Aborting request %s, freeing blocks.", req_id)
            self._free_blocks(request)
```


P1: Handle finished requests before early return

The new block that frees KV blocks for already finished requests can never run because the loop above still treats request.is_finished() as an invalid request and immediately continues. When the abort endpoint is invoked for a request that has already finished, this method skips the request entirely and no cleanup is performed, leaving the KV cache allocated. Move the finished-request logic ahead of the early return or drop the request.is_finished() check in the guard so the cleanup path can execute.
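The reordering Codex suggests can be sketched as follows, with minimal stand-ins for the scheduler types (`RequestStatus`, `Request`, and the free callback here are simplified illustrations, not the real vLLM classes):

```python
from enum import Enum, auto


class RequestStatus(Enum):
    RUNNING = auto()
    FINISHED_ABORTED = auto()


class Request:
    def __init__(self, req_id: str, finished: bool = False):
        self.req_id = req_id
        self._finished = finished

    def is_finished(self) -> bool:
        return self._finished


def finish_requests(requests, req_ids, finished_status, free_blocks):
    """Finish/abort each request; cleanup runs even for finished requests."""
    for req_id in req_ids:
        request = requests.get(req_id)
        if request is None:
            # Invalid request ID.
            continue

        # Check finished requests *before* any early return, so a forced
        # FINISHED_ABORTED can still reclaim stranded KV blocks. Guarding
        # with `request.is_finished()` in the invalid-ID check above would
        # skip this path entirely, which is the bug being flagged.
        if request.is_finished():
            assert finished_status == RequestStatus.FINISHED_ABORTED, (
                "Only FINISHED_ABORTED is allowed for requests that are "
                "already finished.")
            free_blocks(request)
            continue

        # ... normal finish path for still-running requests would go here ...
```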


@markmc (Member) commented Oct 13, 2025

What component would call this endpoint? How can we authenticate these requests? Who should have permission to abort a given request?

In this thread we decided to rely on (at least in the NIXL case) the P worker timing out stranded KV blocks, even in the case of the client disconnecting.

/cc @njhill @NickLucche

@jianzs (Contributor, Author) commented Oct 13, 2025

> What component would call this endpoint? How can we authenticate these requests? Who should have permission to abort a given request?
>
> In this thread we decided to rely on (at least in the NIXL case) the P worker timing out stranded KV blocks, even in the case of the client disconnecting.
>
> /cc @njhill @NickLucche

The endpoint should be called by the proxy component that forwards prefill and decode requests. When this proxy disconnects from the user, it can actively call this endpoint on the prefill node to ensure the KV cache associated with that request is released promptly. While the timeout mechanism can achieve similar functionality, it may introduce unnecessary delays, and shortening the timeout could itself cause functional issues. Therefore, I believe the timeout mechanism is better suited as a fallback rather than the primary method for releasing KV cache.
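The proxy-side flow described here can be sketched as a try/except around the relayed decode stream: on client disconnect, the proxy fires an abort at the prefill node instead of waiting for the timeout. `send_abort` stands in for an HTTP POST to the prefill node's `/abort_request` endpoint; all names are illustrative.

```python
class ClientDisconnected(Exception):
    """Raised when the downstream client drops the HTTP stream."""


def proxy_decode_stream(request_id, stream_tokens, send_abort):
    """Relay decode tokens; on client disconnect, abort on the prefill node.

    stream_tokens: generator function yielding decode tokens.
    send_abort: callable standing in for POST /abort_request on the
    prefill node (hypothetical wiring, not vLLM proxy code).
    """
    delivered = []
    try:
        for token in stream_tokens():
            delivered.append(token)
        return delivered
    except ClientDisconnected:
        # Proactive cleanup: tell the prefill node to free this request's
        # KV blocks now, rather than relying on the timeout fallback.
        send_abort(request_id)
        raise
```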

Regarding authentication, the current vLLM implementation doesn't actually implement endpoint authorization. While the /v1 series of endpoints can be restricted using --api-key or the VLLM_API_KEY environment variable, this is rarely configured in practice. If authentication is a concern, the abort_request endpoint could also be secured using the same configuration.

@njhill (Member) commented Oct 13, 2025

@jianzs there are various downsides to having an out-of-band abort endpoint. It breaks the kv connector abstraction a bit since in this situation it's the connector which "owns" the blocks and it should therefore be the one to release them.

In theory, the window that we have to rely on the timeout fallback for should be very small:

  1. If the proxy cancels the prefill request before it returns, then it should not reserve blocks.
  2. If the proxy cancels the in-flight decode request it should still notify the prefill worker to release the blocks, even if the request is in the waiting queue.

So the timeout should only end up being used if the cancellation happens between the proxy receiving the prefill response and sending the decode request (in practice this should be immediate), or if the decode worker dies before it's generated the first token for the request. I think these cases should be rare enough to leave to the fallback.

However @markmc found that (1.) above is actually not working as intended, so if you observe that these timeout aborts are happening more often than expected (and affecting kvcache usage), that's probably the reason. It should be fairly straightforward to fix this though, see #26400.
