
Conversation

@jianzs (Contributor) commented Oct 11, 2025

Purpose

In disaggregated prefill scenarios, if an error occurs during decoding, or the user cancels the request before the KV cache is retrieved, the KV cache stored on the prefill node is never released, because the HTTP stream between the prefill node and the user has already been closed. To solve this, this pull request adds an active abort interface that proactively frees the KV cache for delay_free requests.
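The lifecycle this PR targets can be pictured with a minimal sketch: a prefill node that keeps KV blocks alive for delay_free requests until the decode side pulls them, plus the new proactive abort path. All names here (`KVBlockPool`, `PrefillNode`, `abort`) are illustrative stand-ins, not vLLM's actual classes.

```python
class KVBlockPool:
    """Toy KV-cache block allocator (illustrative, not vLLM's BlockPool)."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.allocated: dict[str, list[int]] = {}

    def allocate(self, request_id: str, n: int) -> list[int]:
        blocks = [self.free_blocks.pop() for _ in range(n)]
        self.allocated[request_id] = blocks
        return blocks

    def free(self, request_id: str) -> None:
        self.free_blocks.extend(self.allocated.pop(request_id, []))


class PrefillNode:
    """Keeps blocks of finished prefill requests alive until pulled or aborted."""

    def __init__(self, pool: KVBlockPool):
        self.pool = pool
        self.delay_free: set[str] = set()

    def finish_prefill(self, request_id: str) -> None:
        # Prefill done; blocks must survive until the decode node reads them.
        self.delay_free.add(request_id)

    def on_kv_pulled(self, request_id: str) -> None:
        # Normal path: the decode node retrieved the KV cache.
        self.delay_free.discard(request_id)
        self.pool.free(request_id)

    def abort(self, request_id: str) -> None:
        # New proactive path: free blocks even though the KV cache was never
        # pulled (decode-side error or client cancellation). Without this,
        # the blocks stay allocated because the HTTP stream is already gone.
        if request_id in self.delay_free:
            self.delay_free.discard(request_id)
            self.pool.free(request_id)
```

Without the `abort` path, a cancelled request would leave its four blocks allocated indefinitely; with it, the pool returns to full capacity immediately.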

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a new /abort_request endpoint to allow for the cancellation of requests and the release of their associated resources, particularly for delay_free requests in disaggregated prefill scenarios. The changes are well-implemented across the API server, protocol, scheduler, and output processor. However, I've identified a critical security vulnerability: the new endpoint is not authenticated. My review includes a comment with a suggested fix for this issue.

```python
    return JSONResponse(content=response.model_dump())


@router.post("/abort_request", dependencies=[Depends(validate_json_request)])
```

critical

The new /abort_request endpoint is not protected by authentication. The AuthenticationMiddleware in this file only protects routes that start with /v1. This could allow unauthorized users to abort arbitrary requests if they can guess or obtain a request_id.

To fix this, the endpoint should be moved under the /v1 path.

Suggested change:

```diff
-@router.post("/abort_request", dependencies=[Depends(validate_json_request)])
+@router.post("/v1/abort_request", dependencies=[Depends(validate_json_request)])
```
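The vulnerability hinges on a simple path-prefix check. Assuming the middleware guards only paths under `/v1` (the predicate below is an illustration of that policy, not the middleware's actual code), the endpoint's path alone decides whether a caller must present an API key:

```python
# Illustrative model of a prefix-based auth policy like the one the
# reviewer describes: only /v1 routes require authentication.
def requires_auth(path: str) -> bool:
    return path.startswith("/v1")

# The endpoint as proposed bypasses authentication entirely;
# moving it under /v1 puts it behind the API-key check.
assert not requires_auth("/abort_request")
assert requires_auth("/v1/abort_request")
```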

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 1193 to +1204
```python
            # Invalid request ID.
            continue

        if request.is_finished():
            # If the request is already finished, only FINISHED_ABORTED is
            # allowed, which is used to force resource cleanup.
            assert finished_status == RequestStatus.FINISHED_ABORTED, (
                "Only FINISHED_ABORTED is allowed for requests that are "
                "already finished."
            )
            logger.info("Aborting request %s, freeing blocks.", req_id)
            self._free_blocks(request)
```


P1: Handle finished requests before early return

The new block that frees KV blocks for already finished requests can never run because the loop above still treats request.is_finished() as an invalid request and immediately continues. When the abort endpoint is invoked for a request that has already finished, this method skips the request entirely and no cleanup is performed, leaving the KV cache allocated. Move the finished-request logic ahead of the early return or drop the request.is_finished() check in the guard so the cleanup path can execute.
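The reordering Codex suggests can be sketched as follows, with minimal stand-ins for the scheduler types (`RequestStatus`, `Request`, and the free callback here are simplified illustrations, not the real vLLM classes):

```python
from enum import Enum, auto


class RequestStatus(Enum):
    RUNNING = auto()
    FINISHED_ABORTED = auto()


class Request:
    def __init__(self, req_id: str, finished: bool = False):
        self.req_id = req_id
        self._finished = finished

    def is_finished(self) -> bool:
        return self._finished


def finish_requests(requests, req_ids, finished_status, free_blocks):
    """Finish/abort each request; cleanup runs even for finished requests."""
    for req_id in req_ids:
        request = requests.get(req_id)
        if request is None:
            # Invalid request ID.
            continue

        # Check finished requests *before* any early return, so a forced
        # FINISHED_ABORTED can still reclaim stranded KV blocks. Guarding
        # with `request.is_finished()` in the invalid-ID check above would
        # skip this path entirely, which is the bug being flagged.
        if request.is_finished():
            assert finished_status == RequestStatus.FINISHED_ABORTED, (
                "Only FINISHED_ABORTED is allowed for requests that are "
                "already finished.")
            free_blocks(request)
            continue

        # ... normal finish path for still-running requests would go here ...
```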


@markmc (Member) commented Oct 13, 2025

What component would call this endpoint? How can we authenticate these requests? Who should have permission to abort a given request?

In this thread we decided to rely on (at least in the NIXL case) the P worker timing out stranded KV blocks, even in the case of the client disconnecting.

/cc @njhill @NickLucche

@jianzs (Contributor, Author) commented Oct 13, 2025

> What component would call this endpoint? How can we authenticate these requests? Who should have permission to abort a given request?
>
> In this thread we decided to rely on (at least in the NIXL case) the P worker timing out stranded KV blocks, even in the case of the client disconnecting.
>
> /cc @njhill @NickLucche

The endpoint should be called by the proxy component that forwards prefill and decode requests. When this proxy disconnects from the user, it can actively call this endpoint on the prefill node to ensure the KV cache associated with that request is released promptly. While the timeout mechanism can achieve similar functionality, it may introduce unnecessary delays, and shortening the timeout could itself cause functional issues. Therefore, I believe the timeout mechanism is better suited as a fallback rather than the primary method for releasing KV cache.
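The proxy-side flow described here can be sketched as a try/except around the relayed decode stream: on client disconnect, the proxy fires an abort at the prefill node instead of waiting for the timeout. `send_abort` stands in for an HTTP POST to the prefill node's `/abort_request` endpoint; all names are illustrative.

```python
class ClientDisconnected(Exception):
    """Raised when the downstream client drops the HTTP stream."""


def proxy_decode_stream(request_id, stream_tokens, send_abort):
    """Relay decode tokens; on client disconnect, abort on the prefill node.

    stream_tokens: generator function yielding decode tokens.
    send_abort: callable standing in for POST /abort_request on the
    prefill node (hypothetical wiring, not vLLM proxy code).
    """
    delivered = []
    try:
        for token in stream_tokens():
            delivered.append(token)
        return delivered
    except ClientDisconnected:
        # Proactive cleanup: tell the prefill node to free this request's
        # KV blocks now, rather than relying on the timeout fallback.
        send_abort(request_id)
        raise
```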

Regarding authentication, the current vLLM implementation doesn't actually implement endpoint authorization. While the /v1 series of endpoints can be restricted using --api-key or the VLLM_API_KEY environment variable, this is rarely configured in practice. If authentication is a concern, the abort_request endpoint could also be secured using the same configuration.

@njhill (Member) commented Oct 13, 2025

@jianzs there are various downsides to having an out-of-band abort endpoint. It breaks the kv connector abstraction a bit since in this situation it's the connector which "owns" the blocks and it should therefore be the one to release them.

In theory, the window that we have to rely on the timeout fallback for should be very small:

  1. If the proxy cancels the prefill request before it returns, then it should not reserve blocks.
  2. If the proxy cancels the in-flight decode request it should still notify the prefill worker to release the blocks, even if the request is in the waiting queue.

So the timeout should only end up being used if the cancellation happens between the proxy receiving the prefill response and sending the decode request (in practice this should be immediate), or if the decode worker dies before it's generated the first token for the request. I think these cases should be rare enough to leave to the fallback.

However @markmc found that (1.) above is actually not working as intended, so if you observe that these timeout aborts are happening more often than expected (and affecting kvcache usage), that's probably the reason. It should be fairly straightforward to fix this though, see #26400.
