
[Roadmap] vLLM Roadmap Q4 2024 #9006

Open · 37 tasks
simon-mo opened this issue Oct 1, 2024 · 4 comments

Comments

@simon-mo
Collaborator

simon-mo commented Oct 1, 2024

This page is accessible via roadmap.vllm.ai

Themes

As before, we have categorized our roadmap into six broad themes: broad model support, wide hardware coverage, state-of-the-art performance optimization, a production-level engine, a strong OSS community, and an extensible architecture.

Broad Model Support

Help wanted:

Hardware Support

  • A feature matrix for all the hardware that vLLM supports, and their maturity level
  • Expanding feature support across hardware platforms
    • Fast PagedAttention and Chunked Prefill on Inferentia
    • Upstreaming Intel Gaudi support
    • Enhancements in TPU Support
    • Upstream enhancements in AMD MI300x
    • Performance enhancement and measurement for NVIDIA H200
    • New accelerator support: IBM Spyre

Help wanted:

  • Design for a pluggable, out-of-tree hardware backend similar to PyTorch’s PrivateUse1 API (see the sketch after this list)
  • Prototype JAX support
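
For context on the pluggable-backend item above, here is a minimal sketch of the PyTorch PrivateUse1 hooks it references. The `myaccel` backend name and its stub module are hypothetical; a real backend would implement the full device-module surface and register kernels for the dispatch key.

```python
import types

import torch

# Rename the reserved PrivateUse1 dispatch key to a custom backend name,
# so tensors can eventually be created with device="myaccel".
torch.utils.rename_privateuse1_backend("myaccel")

# Register a device module so torch.myaccel.* lookups resolve.
# A real backend would implement device_count(), current_device(), etc.
myaccel = types.ModuleType("myaccel")
myaccel.is_available = lambda: False  # hypothetical stub
torch._register_device_module("myaccel", myaccel)

print(torch.myaccel.is_available())  # -> False
```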

Performance Optimizations

  • Turn on chunked prefill, prefix caching, and speculative decoding by default (usage sketch after this list)
  • Optimizations for structured outputs
  • Fused GEMM/all-reduce leveraging Flux and AsyncTP
  • Enhancements and overhead removal in offline LLM use cases
  • Better kernels (FA3, FlashInfer, FlexAttention, Triton)
  • Native integration with torch.compile
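
As a usage sketch for the defaults item above, these are the explicit opt-in flags in current vLLM releases (flag names as of the v0.6.x series and may change once the behavior becomes the default; the model is just a small placeholder, and speculative decoding is configured via separate engine arguments not shown here):

```python
from vllm import LLM, SamplingParams

# Explicitly opt in to the features the roadmap wants on by default.
llm = LLM(
    model="facebook/opt-125m",     # small placeholder model
    enable_chunked_prefill=True,   # split long prompts across scheduler steps
    enable_prefix_caching=True,    # reuse KV blocks across shared prefixes
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```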

Help wanted:

Production Features

  • KV cache offload to CPU and disk (illustrated after this list)
  • Disaggregated Prefill
  • More control over prefix caching and scheduler policies
  • Automated speculative decoding policy, see Dynamic Speculative Decoding
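
To make the offload item concrete, here is an illustrative sketch (not a vLLM API) of the basic mechanism: copying a KV-cache block into pinned host memory so its GPU block can be freed, then restoring it before reuse.

```python
import torch

def offload_block(block: torch.Tensor) -> torch.Tensor:
    """Copy one KV-cache block into pinned host memory (pinning needs CUDA)."""
    host = torch.empty(block.shape, dtype=block.dtype,
                       device="cpu", pin_memory=True)
    host.copy_(block, non_blocking=True)  # async DMA over PCIe
    return host

def restore_block(host: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Bring an offloaded block back onto the accelerator before reuse."""
    return host.to(device, non_blocking=True)
```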

Help wanted:

  • Support multiple models in the same server

OSS Community

  • Enhancements in performance benchmark: more realistic workload, more hardware backends (H200s)
  • Better developer documentation for getting started with contributions and research

Help wanted:

  • Documentation enhancements in general (styling, UI, explainers, tutorials, examples, etc.)

Extensible Architecture

  • Full support for torch.compile (toy example after this list)
  • vLLM Engine V2: Asynchronous Scheduling and Prefix Caching Centric Design (vLLM's V2 Engine Architecture #8779)
  • A generic memory manager supporting multi-modality, sparsity, and others
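
For the torch.compile item, a toy illustration of the integration direction: compiling a module’s forward so TorchDynamo/Inductor can fuse operations and generate kernels. This uses only the public torch.compile API, not vLLM’s internal integration.

```python
import torch

class ToyMLP(torch.nn.Module):
    """Stand-in for a transformer MLP block."""
    def __init__(self, d: int = 256):
        super().__init__()
        self.up = torch.nn.Linear(d, 4 * d)
        self.down = torch.nn.Linear(4 * d, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(torch.nn.functional.silu(self.up(x)))

compiled = torch.compile(ToyMLP())          # compiled lazily on first call
print(compiled(torch.randn(8, 256)).shape)  # torch.Size([8, 256])
```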

If an item you want is not on the roadmap, your suggestions and contributions are still welcome! Please feel free to comment in this thread, open a feature request, or create an RFC.

Historical Roadmap: #5805, #3861, #2681, #244

@simon-mo simon-mo changed the title [Roadmap]: vLLM Roadmap Q4 2024 [Roadmap] vLLM Roadmap Q4 2024 Oct 1, 2024
@simon-mo simon-mo pinned this issue Oct 1, 2024
@IsaacRe

IsaacRe commented Oct 2, 2024

Support for KV cache compression

@ksjadeja

ksjadeja commented Oct 4, 2024

Do we have plans to support #5540? We have a production-level use case and would really appreciate it if someone could look into it in Q4 or beyond.

@sylviayangyy

sylviayangyy commented Oct 12, 2024

Hi, do we have any follow-up issue or Slack channel for the "KV cache offload to CPU and disk" task? Our team has previously explored some KV cache offload work based on vLLM, and we’d be happy to join any relevant discussion or contribute to the development if there’s a chance.

Personally, I’m also looking forward to learning more about the "More control over prefix caching and scheduler policies" part 😊.

@zeroorhero

@simon-mo Hi, regarding the topic "KV cache offload to CPU and disk": I previously implemented a version that stores the KV cache in a local file (#8018), with the relevant abstractions in place so other storage media can be added. Is there a Slack channel for this? We could discuss the specific design there; I am also quite interested in this feature.
