[Roadmap] [Draft] vLLM Roadmap Q2 2026 #39749

@simon-mo

Description

In #32455, we broke down vLLM’s goals into various special interest groups (SIGs). Below you can find each SIG’s area and its roadmap. Regular meetings of these SIGs are listed on this public calendar.

Core

Slack Channel: #sig-core
Members: @WoosukKwon @njhill

The team focuses on the vLLM Engine Core including Scheduler, KV Cache Manager, Distributed, Model Runner, KV Connector code path.

  • Model Runner V2 hardening and making it the default:
    • Expand testing coverage
    • Support wide-EP out of the box
    • Continue filling gaps in the Model Runner V2 Design Docs
  • SIG Core’s current goal is a stable and efficient core that is principled, modular, and clean. This means MRV1 will stay through Q2 to handle long-tail use cases as we enable more use cases for MRV2.
  • KV cache manager rethink for complex KV cache layouts
  • Offloading: CPU offloading + disk, plus the overall connector API on this part of the path
  • Address known scheduler issues (avoid excessive preemption, prefill head-of-line blocking): Scheduler Items
  • Further process management hardening/simplification
  • Work out auto-tuning / out-of-the-box performance improvements
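To make the preemption item concrete, here is a minimal sketch (not vLLM's actual scheduler) of the admission-control idea behind avoiding excessive preemption: only admit a waiting request when enough free KV-cache blocks remain, so running requests are never evicted. All class and field names are illustrative.

```python
# Hedged sketch: FIFO admission control over a fixed KV-cache block pool.
# Illustrative only; vLLM's scheduler is far more sophisticated.
from collections import deque
from dataclasses import dataclass

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

@dataclass
class Request:
    req_id: str
    num_tokens: int  # tokens needing KV-cache space

    @property
    def blocks_needed(self) -> int:
        return -(-self.num_tokens // BLOCK_SIZE)  # ceil division

class SimpleScheduler:
    def __init__(self, total_blocks: int):
        self.free_blocks = total_blocks
        self.waiting = deque()
        self.running = []

    def add(self, req: Request) -> None:
        self.waiting.append(req)

    def schedule(self) -> list:
        """Admit waiting requests FIFO while blocks remain; never preempt."""
        while self.waiting and self.waiting[0].blocks_needed <= self.free_blocks:
            req = self.waiting.popleft()
            self.free_blocks -= req.blocks_needed
            self.running.append(req)
        return self.running
```

A real scheduler must also handle chunked prefill and block sharing via prefix caching; the sketch only shows why admission control trades head-of-line blocking against preemption.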

Large Scale Serving

Slack Channel: #sig-large-scale-serving
Project board: Large-Scale Serving
Members: @tlrmchlsmth

The team focuses on pushing vLLM to the speed of light in disaggregated, wide-EP, and elastic settings on clusters of GB200, B200, and H200. The team is also responsible for interfacing with ecosystem projects such as llm-d, Dynamo, and the AMD team.

  • Zero cost async EPLB
  • Experimental fault tolerant EP
  • Elastic EP (scale up/down) production ready
  • Bidirectional KV transfers
  • Numerics monitoring/debug harness
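To illustrate the load-balancing problem EPLB addresses, here is a hedged sketch of a greedy expert-placement heuristic: put the hottest experts on the least-loaded ranks. The function name and data shapes are made up for illustration; vLLM's real EPLB also handles expert replication and runs asynchronously.

```python
# Hedged sketch of expert load balancing via the longest-processing-time
# (LPT) heuristic. Not vLLM's actual EPLB implementation.
import heapq

def balance_experts(expert_loads: dict, num_ranks: int) -> dict:
    """Map expert id -> rank, placing heavy experts on light ranks."""
    heap = [(0.0, rank) for rank in range(num_ranks)]  # (total load, rank)
    heapq.heapify(heap)
    assignment = {}
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        rank_load, rank = heapq.heappop(heap)
        assignment[expert] = rank
        heapq.heappush(heap, (rank_load + load, rank))
    return assignment
```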

Call for experiments/prototypes:

  • Experimental AFD
  • Pipeline parallel optimizations?

Model Performance

Channel: #sig-model-performance
Members: @robertgshaw2-redhat @simon-mo

The team focuses on pure performance and reliability engineering within vLLM. The work involves capturing performance traces, enabling the right set of kernels by default, and continuously monitoring them. The work also covers monitoring and logging for production stability.

  • Nightly performance evaluation for prioritized models on hardware cluster
    • Models: Kimi K2.5, Qwen 3.5, DeepSeek V3.2, Minimax 2.7, GLM 5.1
    • Hardware: GB200, B300, H200, (maybe) MI355
    • Workload: InferenceX, and bottom up workload (bs=1, bs=16, etc)
  • Weekly progress on performance gaps and trace sharing
  • No accuracy regressions as performance enhancements are turned on by default, verified by a nightly accuracy sweep.
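A "bottom-up" workload sweep like the one above (bs=1, bs=16, ...) can be sketched as a tiny harness that times a generate callable at fixed batch sizes and reports tokens per second. `generate_fn` is a stand-in for a real engine call, not a vLLM API.

```python
# Hedged sketch of a bottom-up benchmark sweep. Illustrative only.
import time

def sweep(generate_fn, batch_sizes=(1, 16), tokens_per_request=32):
    """Return {batch size: tokens/s} for a generation callable."""
    results = {}
    for bs in batch_sizes:
        start = time.perf_counter()
        generate_fn(batch_size=bs, max_tokens=tokens_per_request)
        elapsed = time.perf_counter() - start
        results[bs] = (bs * tokens_per_request) / elapsed
    return results
```

A real harness would also warm up the engine, repeat trials, and separate prefill from decode throughput.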

Quantization

Meeting Time/Link: Every week, meeting
Channel: #sig-quantization
Members: @mgoin @dsikka

vLLM's quantization support, including native online quantization, LLM Compressor, and external integrations like ModelOpt.

  • Complete the vLLM online quantization refactor so it becomes more flexible, more memory-efficient, and suitable for production/RL workloads.
  • Build on INT8 dynamic per-token KV-cache quantization as an initial foundation for future dynamic KV-cache compression such as per-token FP8 and NVFP4
  • Make quantization backend dispatch deterministic, inspectable, and maintainable by implementing a single-source-of-truth dispatch oracle, explicit backend capability checks, and clearer unsupported-configuration errors.
  • Investigate and integrate efficient transforms/rotations where they improve low-bit quantization accuracy, especially for MXFP4 and sensitive layers such as attention projections
  • Continue improving weight reloading for RL by reducing memory usage further and supporting reload of already-quantized/swizzled weights.
  • Expand the path toward broader bitwidth kernel support for W{1-8}A{16/8/4}, including integration with humming-kernel.
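The per-token scheme mentioned above can be illustrated with a minimal sketch: each token's vector gets its own scale, so a single outlier token does not degrade quantization accuracy for the rest of the cache. This is pure-Python illustration, not vLLM's kernels.

```python
# Hedged sketch of per-token INT8 quantization for a [tokens, dim]
# matrix of KV-cache entries. Illustrative only.

def quantize_per_token(x):
    """Return (int8 values, per-token scales)."""
    q, scales = [], []
    for row in x:
        scale = max(abs(v) for v in row) / 127.0 or 1.0  # avoid div by 0
        scales.append(scale)
        q.append([max(-128, min(127, round(v / scale))) for v in row])
    return q, scales

def dequantize(q, scales):
    return [[v * s for v in row] for row, s in zip(q, scales)]
```

The per-token error is bounded by half a quantization step, i.e. `scale / 2` for that token's row.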

Speculative Decoding

Meeting Time/Link: Every week, meeting
Channels: #sig-spec-decode, #speculators
Members: @benchislett @fynnsu @mgoin

We aim to pay off technical debt accumulated during the development of V1's speculative decoding, harden production-ready speculative decoding features, optimize both for large-scale high-throughput speculation and for extreme low-latency speculation, and support and grow the Speculators training pipeline.

  • Supporting and scaling Speculators
  • Extensible speculator interfaces
    • First-class support for many speculation backends in ModelRunnerV2
    • Composable speculation e.g. hybrid ngram-eagle speculation
  • Hardening speculative decoding
    • Broader E2E coverage of EAGLE, DFlash, and MTP
    • Improvements, fixes, and tests for RL and large-scale serving
  • Optimizing speculation across the concurrency range
    • Full CUDA Graph support for drafting
    • Broader support of attention backends and fine-grained selection for drafting
    • Dynamic speculation based on batch size
    • Optimized attention kernels for heterogeneous speculation within a batch
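The n-gram half of the hybrid ngram-eagle idea above can be sketched as prompt-lookup drafting: propose the tokens that followed the most recent earlier occurrence of the current n-gram suffix. Illustrative only; vLLM's ngram proposer differs in details.

```python
# Hedged sketch of n-gram (prompt lookup) draft token proposal.

def ngram_draft(tokens, n=2, k=3):
    """Return up to k draft tokens by matching the last n tokens of context."""
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    # Search right-to-left for the most recent earlier occurrence.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == suffix:
            return tokens[i + n:i + n + k]
    return []
```

The drafts are then verified in one target-model forward pass, which is what makes hybridizing it with a learned drafter like EAGLE attractive: the n-gram path is free when it hits.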

Torch Compile

Channel: #sig-torch-compile
Members: @ProExpertProg @zou3519

The team focuses on improving performance, portability, and developer productivity via PyTorch compilation integration. Work includes custom compile & fusion passes, vLLM IR for kernel registration, reducing compile time via caching, improving developer UX with torch.compile, and co-development of new torch.compile features.

  • Improve torch.compile compilation times overall.
    • Targeting up to 1.3x cold compile time speedups (with PyTorch 2.12)
    • Reduce warm compile time down to <= 2s (aka up to 5x speedup) (with PyTorch 2.12)
    • Add an option to overlap weight loading and compilation (unstable feature for Q2, stable in Q3)
  • Full vLLM IR migration
  • Ship the improved perf dashboard to track compile speedups and break down warm and cold start times.
  • vLLM begins using at least one custom Helion kernel by default
  • Support for torch.compile x CUDA streams (in PyTorch 2.12)
  • Support for torch.compile x nvsymmetric memory integration (in PyTorch 2.12)
  • Unwrap wrapped custom ops (MLA, Fused MoE) - exposes more operators to Inductor and custom passes for optimization.
  • Continue enabling more optimizations by default (Inductor partition, attn+quant fusion, Async TP)
  • Roughly 1/4th of vLLM multimodal models have encoder compilation supported
  • Drive alignment with the OSS community on a backed → unbacked migration plan, and execute initial adoption by enabling X+ models to use unbacked shapes by default.
  • Inductor generates more fusions natively. This may include padding/quant/collective fusions. (PyTorch 2.12)
  • Roll out Inductor PDL where profitable, improve implementation
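The warm-compile items above hinge on cache validity: a compiled artifact can only be reused if everything that could invalidate it is captured in the cache key. Here is a hedged sketch of how such a key could be derived; this is not vLLM's actual caching scheme, and all field names are illustrative.

```python
# Hedged sketch of a compile-cache key: hash the factors that would
# invalidate a compiled artifact. Illustrative only.
import hashlib
import json

def compile_cache_key(model_config, torch_version, flags):
    payload = json.dumps(
        {"model": model_config, "torch": torch_version, "flags": flags},
        sort_keys=True,  # deterministic ordering -> deterministic hash
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]
```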

RL

Channel: #sig-post-training

The team focuses on delivering the best engine features in vLLM for RL rollouts, including weight sync, KV cache reset, and ease of modification.
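The weight-sync idea can be sketched minimally: copy trainer weights into the inference engine's existing parameter buffers in place, so anything bound to the old storage remains valid. Plain dicts of lists stand in for named tensors here; none of these names are real vLLM APIs.

```python
# Hedged sketch of in-place RL weight sync. Illustrative only.

def sync_weights(engine_params, trainer_params):
    """Copy trainer weights into engine buffers without reallocating them."""
    for name, new_vals in trainer_params.items():
        buf = engine_params[name]
        if len(buf) != len(new_vals):
            raise ValueError(f"shape mismatch for {name}")
        buf[:] = new_vals  # in-place update keeps the same storage
```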

MultiModality & Omni Modality

Channel: #sig-multi-modality
Members: @ywang96 @DarkLight1337

The team supports the abstractions, model support, and optimizations for multi-modal inputs.

  • Enhance testing coverage of ViT CUDA graph + torch.compile
  • Turn encoder optimizations from the MLPerf sprint on by default whenever available
  • Make the API more flexible with fewer abstractions

On the vllm-omni side:

  • Large-scale serving: support “PD”-style disaggregation for vLLM-Omni, where individual “stages” can be initialized with different numbers of replicas
  • Support large scale users of vLLM-Omni

CI, Build, and Release

Channel: #sig-ci
Members: @khluu

The team focuses on developing world class infrastructure for vLLM’s CI system and ensuring we have a secure and reliable build and release process.

  • Reduce time-to-signal to 30 minutes
  • Model eval coverage for popular models x hardware matrix
  • Automatic test target determination
  • Improving signals with nightly torch
  • More AMD test coverage
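Automatic test target determination can be sketched as a mapping from changed file paths to CI suites via prefix rules, falling back to the full suite for unmapped paths. The paths and suite names below are made up for illustration; they are not vLLM's actual CI layout.

```python
# Hedged sketch of changed-file -> CI test target selection.
# All rules and suite names are hypothetical.
TARGET_RULES = [
    ("vllm/attention/", "attention-tests"),
    ("vllm/model_executor/", "model-tests"),
    ("csrc/", "kernel-tests"),
]

def select_targets(changed_files):
    targets = set()
    for path in changed_files:
        matched = False
        for prefix, suite in TARGET_RULES:
            if path.startswith(prefix):
                targets.add(suite)
                matched = True
        if not matched:
            targets.add("full-suite")  # unknown path: be conservative
    return targets
```

The conservative fallback is what keeps a rules-based selector safe: missing a rule costs CI time, never signal.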

To improve release gating signals beyond a green build and passing tests:

  • All CI tests (that are not soft fail)
  • Separate e2e integration tests into long-running release tests suite
  • Model eval
  • Perf benchmark for regression

---
This roadmap covers the majority of tracked items. The vLLM team continues to review issues and pull requests and remains open to wider collaboration on expanding model, hardware, and optimization coverage. Please feel free to leave feedback and comments, and work directly with the SIG areas for deeper collaboration.
