In #32455, we broke down vLLM's goals into various special interest groups (SIGs). Below you can find each SIG's area of focus and its roadmap. Regular meetings of these SIGs are listed on this public calendar.
Core
Slack Channel: #sig-core
Members: @WoosukKwon @njhill
The team focuses on the vLLM Engine Core, including the Scheduler, KV Cache Manager, Distributed, Model Runner, and KV Connector code paths.
Model Runner V2 hardening and making it default:
expand testing coverage
support wide-EP out of the box
Continuing to fill gaps in the Model Runner V2 design docs. SIG Core's current goal is a stable and efficient core that is principled, modular, and clean. This means MRV1 will stay in Q2 to handle long-tail use cases as we enable more use cases for MRV2.
KV cache manager rethink for complex KV cache layout
Offloading: CPU offloading + disk offloading + the overall connector API on this part of the path
Address known scheduler issues (avoid excessive preemption, prefill head-of-line blocking); see Scheduler Items and the tuning sketch after this list
Further process management hardening/simplification
Work out auto-tuning / out-of-the-box performance improvements
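For context on the scheduler items above, here is a minimal tuning sketch of the knobs that interact with preemption and prefill head-of-line blocking in the offline LLM API. Argument values are illustrative only, and defaults and exact behavior may differ across vLLM versions.

```python
from vllm import LLM, SamplingParams

# Hedged sketch: chunked prefill caps the number of prefill tokens scheduled
# per step so long prompts do not starve running decodes (prefill head-of-line
# blocking), and a larger KV-cache budget reduces preemption.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,   # split long prefills across scheduler steps
    max_num_batched_tokens=8192,   # per-step token budget shared by prefill and decode
    max_num_seqs=256,              # cap concurrent sequences to limit KV-cache pressure
    gpu_memory_utilization=0.90,   # more headroom for KV cache -> fewer preemptions
)

out = llm.generate(["Explain paged attention in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```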
Large Scale Serving
Slack Channel: #sig-large-scale-serving
Project board: Large-Scale Serving
Members: @tlrmchlsmth
The team focuses on pushing vLLM to the speed of light in disaggregated, wide-EP, and elastic settings on clusters of GB200, B200, and H200. The team is also responsible for interfacing with ecosystem projects such as llm-d, Dynamo, and the AMD team.
Call for experiments/prototypes:
Model Performance
Channel: #sig-model-performance
Members: @robertgshaw2-redhat @simon-mo
The team focuses on pure performance and reliability engineering within vLLM. The work involves capturing performance traces (a hedged example is sketched below), enabling the right set of kernels by default, and continuously monitoring them. The work also covers monitoring and logging for production stability.
Nightly performance evaluation for prioritized models on hardware cluster
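One common way to capture the performance traces mentioned above is a plain torch.profiler run around a generation call. This is a generic PyTorch sketch, not the SIG's dedicated tooling, and the model name is only an example.

```python
from torch.profiler import profile, ProfilerActivity
from vllm import LLM, SamplingParams

# Hedged sketch: wrap a generation call in torch.profiler and export a Chrome
# trace for inspection in chrome://tracing or Perfetto.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=128)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    llm.generate(["Summarize the history of GPUs in two sentences."], params)

prof.export_chrome_trace("vllm_generate_trace.json")
```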
Quantization
Meeting Time/Link: Every week, meeting
Channel: #sig-quantization
Members: @mgoin @dsikka
The team owns vLLM's quantization support, including native online quantization, LLM Compressor, and external integrations such as ModelOpt.
Complete the vLLM online quantization refactor so it becomes more flexible, more memory-efficient, and suitable for production/RL workloads.
Build on INT8 dynamic per-token KV-cache quantization as an initial foundation for future dynamic KV-cache compression such as per-token FP8, NVFP4, etc. (a worked per-token quantization sketch follows this list)
Make quantization backend dispatch deterministic, inspectable, and maintainable by implementing a single-source-of-truth dispatch oracle, explicit backend capability checks, and clearer unsupported-configuration errors.
Investigate and integrate efficient transforms/rotations where they improve low-bit quantization accuracy, especially for MXFP4 and sensitive layers such as attention projections
Continue improving weight reloading for RL by reducing memory usage further and supporting reload of already-quantized/swizzled weights.
Expand the path toward broader bitwidth kernel support for W{1-8}A{16/8/4}, including integration with humming-kernel.
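To make the per-token KV-cache item above concrete, here is a minimal sketch of dynamic per-token INT8 quantization: one scale per token, computed on the fly. It illustrates the general scheme only, not vLLM's actual kernels or cache layout.

```python
import torch

def quantize_per_token_int8(x: torch.Tensor):
    """Dynamic per-token INT8 quantization sketch.

    x: [num_tokens, head_dim] activations (e.g., K or V for one head).
    Returns int8 values plus one floating-point scale per token. Illustrative
    only; vLLM's real KV-cache kernels and memory layout differ.
    """
    # One scale per token: map the per-token absmax onto the int8 range.
    absmax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8)
    scale = absmax / 127.0
    q = torch.round(x / scale).clamp(-128, 127).to(torch.int8)
    return q, scale

def dequantize_per_token_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

x = torch.randn(4, 128)                      # 4 tokens, head_dim = 128
q, scale = quantize_per_token_int8(x)
err = (dequantize_per_token_int8(q, scale) - x).abs().max().item()
print(f"max abs error: {err:.4f}")           # small for well-behaved activations
```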
Speculative Decoding
Meeting Time/Link: Every week, meeting
Channels: #sig-spec-decode, #speculators
Members: @benchislett @fynnsu @mgoin
We aim to pay off technical debt accumulated while developing V1's speculative decoding, harden production-ready speculative decoding features, optimize for large-scale, high-throughput speculation as well as extreme speculation for low latency, and support and grow the Speculators training pipeline. A hedged example of enabling speculative decoding is sketched below.
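For readers less familiar with the feature, here is a minimal sketch of turning on speculative decoding in the offline API, using the draft-model-free n-gram (prompt-lookup) method. The config keys reflect recent vLLM releases and may change between versions.

```python
from vllm import LLM, SamplingParams

# Hedged sketch: n-gram (prompt-lookup) speculation needs no draft model; the
# target model verifies up to num_speculative_tokens proposed tokens per step.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,
    },
)

out = llm.generate(
    ["Repeat: the quick brown fox jumps over the lazy dog. The quick"],
    SamplingParams(max_tokens=32),
)
print(out[0].outputs[0].text)
```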
Torch Compile
Channel: #sig-torch-compile
Members: @ProExpertProg @zou3519
The team focuses on improving performance, portability, and developer productivity via PyTorch compilation integration. Work includes custom compile and fusion passes, a vLLM IR for kernel registration, reducing compile time via caching, improving developer UX with torch.compile, and co-development of new torch.compile features.
Improve torch.compile compilation times overall (a cold-vs-warm timing sketch follows this list):
Targeting up to 1.3x cold compile time speedups (with PyTorch 2.12)
Reduce warm compile time to <= 2s (i.e., up to a 5x speedup) (with PyTorch 2.12)
Add an option to overlap weight loading and compilation (unstable feature for Q2, stable in Q3)
Full vLLM IR migration
Ship the improved perf dashboard to track compile speedups and break down warm and cold start times.
vLLM begins using at least one custom Helion kernel by default
Support for torch.compile x CUDA streams (in PyTorch 2.12)
Support for torch.compile x nvsymmetric memory integration (in PyTorch 2.12)
Unwrap wrapped custom ops (MLA, Fused MoE) - exposes more operators to Inductor and custom passes for optimization.
Continue enabling more optimizations by default (Inductor partition, attn+quant fusion, Async TP)
Roughly one quarter of vLLM's multimodal models have encoder compilation support
Drive alignment with the OSS community on a backed → unbacked migration plan, and execute initial adoption by enabling X+ models to use unbacked shapes by default.
Inductor generates more fusions natively. This may include padding/quant/collective fusions. (PyTorch 2.12)
Roll out Inductor PDL where profitable, improve implementation
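As a rough illustration of the cold-versus-warm compile times targeted above, here is a standalone torch.compile timing sketch (generic PyTorch, not vLLM's compilation path). The first run of the script pays the cold Inductor compile; rerunning it in a fresh process reuses the on-disk cache, which is roughly what the warm-compile targets refer to. The cache directory is an arbitrary example path.

```python
import os
import time

# Hedged sketch: pin the Inductor cache to a known directory before importing
# torch so a second run of this script can reuse compiled artifacts.
os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", "/tmp/inductor_cache_demo")

import torch

class MLP(torch.nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.fc1 = torch.nn.Linear(dim, 4 * dim)
        self.fc2 = torch.nn.Linear(4 * dim, dim)

    def forward(self, x):
        return self.fc2(torch.nn.functional.gelu(self.fc1(x)))

model = torch.compile(MLP())
x = torch.randn(8, 1024)

start = time.perf_counter()
model(x)  # triggers compilation: cold on first run, warm via on-disk cache afterwards
print(f"first call (includes compile): {time.perf_counter() - start:.2f}s")
```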
RL
Channel: #sig-post-training
The team focuses on delivering the best engine features in vLLM for RL rollout, including weight sync, KV cache reset, and ease of modification; a hedged rollout-loop sketch follows.
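A minimal sketch of how these features typically slot into an RL rollout phase. `sync_weights_to_engine`, the `trainer` object, and `rollout_phase` are hypothetical placeholders for whatever your RL framework provides; engine method names such as `reset_prefix_cache()` reflect recent vLLM releases and may change.

```python
from vllm import LLM, SamplingParams

def sync_weights_to_engine(llm: LLM, trainer) -> None:
    """Hypothetical placeholder: push the trainer's latest weights into the
    rollout engine (e.g., via a custom worker extension or checkpoint reload).
    The real mechanism depends on your RL framework."""
    raise NotImplementedError

def rollout_phase(llm: LLM, trainer, prompts: list[str]) -> list:
    """Hedged sketch of one iteration's rollout after an optimizer step."""
    sync_weights_to_engine(llm, trainer)   # weight sync with the updated policy
    llm.reset_prefix_cache()               # drop KV entries computed with stale weights
    params = SamplingParams(temperature=1.0, max_tokens=256)
    return llm.generate(prompts, params)   # fresh rollouts for the next policy update
```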
MultiModality & Omni Modality
Channel: #sig-multi-modality
Members: @ywang96 @DarkLight1337
The team supports the abstractions, model support, and optimizations for multi-modality input.
On the vllm-omni side,
CI, Build, and Release
Channel: #sig-ci
Members: @khluu
The team focuses on developing world-class infrastructure for vLLM's CI system and ensuring we have a secure and reliable build and release process.
Time to signal -> 30 mins
Model eval coverage for popular models x hardware matrix
Automatic test target determination
Improving signals with nightly torch
More AMD test coverage
Improving release gating signals, going beyond a green build and tests:
All CI tests (that are not soft fail)
Separate e2e integration tests into a long-running release test suite
Model eval
Perf benchmark for regression
---
This roadmap covers the majority of tracked items. The vLLM team continues to review issues and pull requests and is open to wider collaboration on expanding model, hardware, and optimization coverage. Please feel free to leave feedback and comments, and work directly with the SIG areas for deeper collaboration.