Description
🚀 The feature, motivation and pitch
RFC: Clarifying vLLM Shutdown Semantics
Introduction
This RFC defines the expected behavior of vLLM during shutdown, addressing two distinct use cases:
- Library API usage (primarily offline inference).
- Online serving usage (long-running HTTP server deployments).
Documenting these expectations will reduce surprises for users and help us achieve a consistent implementation across all features. This RFC was motivated by #22295 and #22828.
Library API Shutdown
Library users expect deterministic, resource-safe cleanup when disposing of LLM instances.
Expectations
- Library users may create and destroy `vllm.LLM` instances.
- Deleting an instance should:
  - Release all associated resources (memory, file descriptors, temporary files, etc.).
  - Shut down any child processes gracefully.
- No library should define global signal handlers.
  - Signal handlers are global and may interfere with applications embedding vLLM.
  - Applications must configure their own signal-handling behavior.
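
As a rough illustration of these expectations, the sketch below shows an object that owns a child process and a scratch directory and releases both on `close()`, without installing any signal handlers. `EngineHandle` is a hypothetical stand-in, not part of the vLLM API, and the "worker" is just a Python one-liner that exits when its stdin reaches EOF.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


class EngineHandle:
    """Hypothetical owner of one worker process and a scratch directory."""

    def __init__(self) -> None:
        self.workdir = Path(tempfile.mkdtemp(prefix="engine-"))
        # Stand-in worker: blocks until its stdin reaches EOF, then exits.
        self.worker = subprocess.Popen(
            [sys.executable, "-c", "import sys; sys.stdin.read()"],
            stdin=subprocess.PIPE,
        )
        self._closed = False

    def close(self, timeout_s: float = 5.0) -> None:
        """Idempotent cleanup: ask the worker to exit, reap it, remove files."""
        if getattr(self, "_closed", True):
            return
        self._closed = True
        self.worker.stdin.close()  # EOF is the "please exit" message in this sketch
        try:
            self.worker.wait(timeout=timeout_s)  # also reaps the child
        except subprocess.TimeoutExpired:
            self.worker.kill()  # last resort if the worker does not exit in time
            self.worker.wait()
        shutil.rmtree(self.workdir, ignore_errors=True)

    def __enter__(self) -> "EngineHandle":
        return self

    def __exit__(self, *exc_info) -> None:
        self.close()

    def __del__(self) -> None:
        # Best-effort fallback; an explicit close() or `with` block is more predictable.
        self.close()
```

Using the handle in a `with` block makes the cleanup point explicit, while `__del__` only acts as a safety net. The exit request travels over a pipe rather than a signal, so the embedding application keeps full control of its own signal handling.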
Online Serving Shutdown
Online serving users expect graceful termination semantics, consistent with production environments like Kubernetes, where shutdown is routine due to auto-scaling or rolling upgrades.
Note: vLLM does not support partial restarts, such as a "reload model" or "restart workers" capability.
Context
- `vllm serve` launches an HTTP server for long-running online inference.
- Each request is relatively expensive (100ms+).
- Deployments typically involve multiple replicas behind a load balancer (e.g. Kubernetes Service).
Expectations
- Signal Handling
  - Parent process receives a `SIGTERM` on shutdown.
  - `SIGINT` (e.g. Ctrl-C) is treated the same as `SIGTERM`.
  - Signals are sent only to the parent; child processes must not act on signals directly.
- Graceful Quiescence
  - Stop accepting new HTTP requests immediately.
  - Allow in-flight requests to complete, bounded by a configurable max wait time.
  - Afterward, instruct child processes to exit.
- Child Processes
  - Child processes must not make independent shutdown decisions affecting in-flight requests; the parent process determines whether requests are aborted or completed.
  - Child processes may delay shutdown for other reasons (e.g. KV transfer).
  - Each process is responsible for reaping its own children via `waitpid()`.
- Resource Cleanup
  - Close network sockets, release locks, delete temporary files, etc.
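
The signal-handling and quiescence steps could look roughly like the following in the parent process. This is a minimal sketch under assumptions, not vLLM's implementation: `RequestTracker`, `install_shutdown_signals`, and `run_until_stopped` are hypothetical names, and requests are modelled as plain asyncio tasks.

```python
import asyncio
import signal


class RequestTracker:
    """Hypothetical helper: tracks in-flight requests so intake can stop and drain."""

    def __init__(self, max_wait_s: float) -> None:
        self.max_wait_s = max_wait_s  # the configurable max wait time
        self.accepting = True
        self._inflight = set()

    def submit(self, coro) -> asyncio.Task:
        """Start handling a request, or reject it once shutdown has begun."""
        if not self.accepting:
            coro.close()  # discard the unstarted coroutine cleanly
            raise RuntimeError("server is shutting down")
        task = asyncio.get_running_loop().create_task(coro)
        self._inflight.add(task)
        task.add_done_callback(self._inflight.discard)
        return task

    async def drain(self) -> int:
        """Stop intake, wait up to max_wait_s, then abort what is left.

        Returns the number of requests that had to be aborted.
        """
        self.accepting = False
        pending = set(self._inflight)
        if pending:
            _, pending = await asyncio.wait(pending, timeout=self.max_wait_s)
        for task in pending:
            task.cancel()
        if pending:
            await asyncio.gather(*pending, return_exceptions=True)
        return len(pending)


def install_shutdown_signals(loop: asyncio.AbstractEventLoop, stop: asyncio.Event) -> None:
    """Parent process only: SIGINT is handled exactly like SIGTERM."""
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop.set)  # Unix event loops only


async def run_until_stopped(tracker: RequestTracker, stop: asyncio.Event) -> int:
    """Wait for the shutdown signal, then quiesce."""
    await stop.wait()
    aborted = await tracker.drain()
    # Next, the parent would instruct child processes to exit and reap them.
    return aborted
```

The handlers only set an event; the decision to complete or abort requests is made in one place (`drain`), which matches the rule that the parent, not the children, decides the fate of in-flight requests.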
Kubernetes Integration
Kubernetes shutdown semantics should align naturally with vLLM’s shutdown design.
- When a Pod enters Terminating:
  - Kubernetes removes its endpoints from the Service LB → no new traffic.
  - The container runtime sends `SIGTERM` to the container’s parent process.
  - After `terminationGracePeriodSeconds` (default 30s), `SIGKILL` is sent if still running.
- Optionally, a `preStop` hook can be used to delay `SIGTERM` (e.g. sleep 5s) to allow straggling requests before quiescence.
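
Because the Kubernetes grace-period countdown starts before the `preStop` hook runs, the hook's delay eats into the time available for draining. The arithmetic can be sketched as follows; `drain_budget_s` and its parameter names are hypothetical, and `child_exit_s` is an assumed allowance for child processes to exit after draining.

```python
def drain_budget_s(grace_period_s: float, pre_stop_s: float, child_exit_s: float) -> float:
    """Seconds left for in-flight requests before SIGKILL could arrive.

    grace_period_s: the Pod's terminationGracePeriodSeconds
    pre_stop_s:     time spent in the preStop hook (counts against the grace period)
    child_exit_s:   assumed allowance for child processes to exit after draining
    """
    budget = grace_period_s - pre_stop_s - child_exit_s
    if budget <= 0:
        raise ValueError("no time left to drain; increase terminationGracePeriodSeconds")
    return budget


# With the default 30s grace period, a 5s preStop sleep, and 5s for children:
print(drain_budget_s(30, 5, 5))  # 20
```

A max wait time longer than this budget would risk `SIGKILL` arriving while requests are still in flight.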
KV Transfer Considerations
In disaggregated deployments (e.g. NIXL prefill/decode):
- Prefill workers may need to delay shutdown until KV transfer is complete.
- The `KVConnector.shutdown()` method should allow the connector to defer shutdown of the worker process (and its parents) until all pending KV transfers complete or time out.
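
One way to picture "complete or time out" is a shutdown call that blocks on a pending-transfer counter. The sketch below is illustrative only: `SketchConnector`, its bookkeeping methods, and the `timeout_s` parameter are hypothetical and do not describe the actual `KVConnector` interface.

```python
import threading


class SketchConnector:
    """Hypothetical connector; only the shutdown-deferral idea is illustrated."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._pending = 0

    def transfer_started(self) -> None:
        with self._cond:
            self._pending += 1

    def transfer_finished(self) -> None:
        with self._cond:
            self._pending -= 1
            self._cond.notify_all()

    def shutdown(self, timeout_s: float) -> bool:
        """Block until no transfers are pending or timeout_s elapses.

        Returns True if all transfers completed, False if the wait timed out.
        """
        with self._cond:
            return self._cond.wait_for(lambda: self._pending == 0, timeout=timeout_s)
```

The boolean result lets the caller log or report transfers that were abandoned at the deadline, while the bound keeps the worker from delaying shutdown indefinitely.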
Open Questions
- Online serving API: consider how shutdown guarantees are reflected in the library API for online serving.
- Ray deployments: do Ray users have specific shutdown expectations?
- Observability: should the "I'm shutting down" status of the API server be observable, e.g. via the /health API?