[Roadmap] vLLM Roadmap Q2 2025 #15735

Open · 66 tasks · simon-mo opened this issue Mar 29, 2025 · 8 comments

@simon-mo (Collaborator) commented Mar 29, 2025

This page is accessible via roadmap.vllm.ai

This is a living document! For each item here, we intend to link the RFC as well as the discussion channel in the vLLM Slack.


Core Themes

Path to vLLM v1.0.0
We want to fully remove the V0 engine and clean up the codebase by removing unpopular and unsupported features. The v1.0.0 release of vLLM will be performant and easy to maintain, as well as modular and extensible, with backward compatibility. (A minimal sketch of opting into the V1 engine today follows the list below.)

  • V1 core feature set
    • Hybrid memory allocators
    • Jump decoding
    • Redesigned native support for pipeline parallelism
    • Redesigned spec decode
    • Redesigned sampler with modularity support
  • Close the feature gaps and fully remove V0
    • Attention backends
    • Pooling models
    • Mamba/Hybrid models
    • (TBD) Encoder and encoder-decoder models
    • Hardware support
  • Performance
    • Further lower scheduler overhead
    • Further enhance LoRA performance
    • API Server Scale-out
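
For reference, the V0/V1 split can currently be toggled explicitly. The snippet below is a minimal sketch assuming the VLLM_USE_V1 environment variable as documented for the 0.8.x releases; once V0 is fully removed this toggle is expected to disappear.

```python
# Minimal sketch: explicitly opting into the V1 engine (0.8.x behavior).
# VLLM_USE_V1 is an assumption based on current releases and will likely
# go away once the V0 engine is removed.
import os
os.environ["VLLM_USE_V1"] = "1"  # must be set before importing vllm

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # any supported model works here
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```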

Cluster Scale Serving
As models grow in size, serving them with multi-node scale-out and disaggregated prefill and decode becomes the way to go. We are fully committed to making vLLM the best engine for cluster-scale serving. (A minimal disaggregated-prefill sketch follows the list below.)

  • Data Parallelism
    • Single node DP
    • API Server and Engine decoupling (any to any communication)
  • Expert Parallelism
    • DeepEP or other library integrations
    • Transition from fused_moe to CUTLASS-based grouped GEMM.
  • Online Reconfiguration (e.g. EPLB)
    • Online reconfiguration
    • Zero-overhead expert movement
  • Prefill Decode Disaggregation
    • 1P1D in V1: both symmetric TP/PP and asymmetric TP/PP
    • XPYD
    • Data Parallel Compatibility
    • NIXL integration
    • Overhead Reduction & Performance Enhancements
  • KV Cache Storage
    • Offload KV cache to CPU
    • Offload KV cache to disk
    • Integration with Mooncake and LMCache
  • DeepSeek Specific Enhancements
    • MLA enhancements: TP, FlashAttention, FlashInfer, Blackwell Kernels.
    • MTP enhancements: V1 support, further lower overhead.
  • Others
    • Investigate communication and compute pipelining
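
To make the 1P1D shape above concrete, here is a minimal prefill-side sketch using the experimental KV transfer configuration shipped in recent 0.8.x releases (KVTransferConfig with the PyNcclConnector). The connector names and fields are still evolving as the NIXL, Mooncake, and LMCache integrations land, so treat this as illustrative rather than the final interface.

```python
# Minimal 1P1D sketch, prefill side. Assumes the experimental KV transfer API
# from vLLM 0.8.x (KVTransferConfig + PyNcclConnector); names may change.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

ktc = KVTransferConfig.from_cli(
    '{"kv_connector": "PyNcclConnector", "kv_role": "kv_producer", '
    '"kv_rank": 0, "kv_parallel_size": 2}'
)

# This instance produces KV caches; a matching "kv_consumer" instance with
# kv_rank=1 would run decode and receive the caches over NCCL.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct",
          kv_transfer_config=ktc)
llm.generate(
    ["Explain disaggregated prefill in one sentence."],
    SamplingParams(temperature=0, max_tokens=1),  # prefill only needs one step
)
```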

vLLM for Production
vLLM is designed for production. We will continue to enhance stability and tune the systems around vLLM for optimal performance.

  • Testing:
    • Comprehensive performance suite
    • Enhance accuracy testing coverage
    • Large-scale deployment + testing
    • Stress and longevity testing
  • Offer tuned recipes and analysis for different model and hardware combinations.
  • Multi-platform wheels and containers for production use cases.

Features

Models

  • Scaling Omni Modality
  • Long Context
  • Stable OOT model registration interface (see the sketch after this list)
  • Attention sparsity: support sparse attention mechanisms for new models.
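
As a reference point for the OOT registration item above, this is a minimal sketch of how an out-of-tree model is registered through ModelRegistry today. The lazy "module:ClassName" string form and the plugin entry-point group follow the current plugin documentation; the package and class names below are placeholders, and the interface may be refined as part of stabilizing it.

```python
# Minimal sketch: registering an out-of-tree (OOT) model with the current
# ModelRegistry API. "my_pkg" and "MyCustomForCausalLM" are placeholders.
from vllm import ModelRegistry

def register():
    # Lazy registration: my_pkg.modeling is only imported when the
    # "MyCustomForCausalLM" architecture is actually requested.
    ModelRegistry.register_model(
        "MyCustomForCausalLM", "my_pkg.modeling:MyCustomForCausalLM"
    )
```

Packaged as a plugin, register() would be exposed under the vllm.general_plugins entry-point group so it runs automatically when the engine starts.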

Use Case

  • Enhance testing and performance related to the RLHF workflow
  • Add data parallel routing for large-scale batch inference
  • Investigate batch size invariance and train/inference equivalence.

Hardware

  • Stable plugin architecture for hardware platforms (see the sketch after this list)
  • Blackwell Enhancements
  • Full production readiness for AMD, TPU, and Neuron.
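
For the plugin-architecture item above, the sketch below shows the rough shape of a platform plugin as it works today (the mechanism out-of-tree backends such as vllm-ascend use). The vllm.platform_plugins entry-point group name is taken from the current plugin system and could change as the interface is stabilized; the package and class names are placeholders.

```python
# setup.py of a hypothetical out-of-tree hardware backend. The entry-point
# group name is assumed from the existing plugin system; "vllm_myaccel" and
# "MyAccelPlatform" are placeholders.
from setuptools import setup

setup(
    name="vllm-myaccel",
    packages=["vllm_myaccel"],
    entry_points={
        "vllm.platform_plugins": ["myaccel = vllm_myaccel:register"],
    },
)
```

Here vllm_myaccel.register() would return the dotted path of the backend's Platform subclass, for example "vllm_myaccel.platform.MyAccelPlatform", which vLLM then loads at startup.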

Optimizations

  • EAGLE3
  • FP4 enhancements
  • FlexAttention
  • Investigate: fbgemm, torchao, cuTile

Community

  • Blogs
  • Case Studies
  • Website
  • Onboarding tasks and new contributors training program

vLLM Ecosystem


If any item you want is not on the roadmap, your suggestions and contributions are strongly welcomed! Please feel free to comment in this thread, open a feature request, or create an RFC.

Historical Roadmap: #11862, #9006, #5805, #3861, #2681, #244

@wangxiyuan (Contributor)

Great! Thanks for the work.

And here is the Q2 roadmap of vllm-ascend (vllm-project/vllm-ascend#448) as a follow-up. Could you please add the link to the Hardware or Ecosystem section? Thanks!

@manzke commented Apr 10, 2025

For V1 you should also consider the security side. I guess a lot of people are using vLLM via the Docker images, some of which are based on Ubuntu 20.04 and some on 22.04.
The 0.8.2 image (and I haven't seen changes for 0.8.3) has nearly 50 CVEs marked HIGH as well as more than 2,500 marked MEDIUM.

@hackey commented Apr 12, 2025

Hi! With the switch to the new engine, I am very interested in how AMD ROCm support will fare, in particular Navi 3 (RDNA 3). I have been waiting almost two months for a fix to a bug affecting the codestral-mamba model, and the model itself was released back in 2024, but it seems that no one is fixing the regression that was introduced.

#13678 (comment)

@MrVolts commented Apr 13, 2025

It would be great to see FP8 support for sm120 (Blackwell devices) now that CUTLASS has added support for sm120 and sm120a as of v3.9. This would mean Blackwell users can take full advantage of native INT4 and INT8 support for extra speed. Currently only sm100 and prior are supported.

@skylee-01 (Contributor)

Does "Redesigned spec decode" mean redesigning the implementation of v0? What are the shortcomings of v0's implementation?

@skylee-01 (Contributor)

Regarding "Further lower scheduler overhead": we tested V1 and found the effect was quite good. Where else can the scheduler be optimized?

@skylee-01 (Contributor)

Regarding "API Server Scale-out": I don't quite understand this; can you explain it further?

@ANormalMan12

It is quite cool.
