[Roadmap] vLLM Roadmap Q3 2024 #5805

simon-mo · 2024-06-25T00:08:09Z

Anything you want to discuss about vllm.

This document includes the features in vLLM's roadmap for Q3 2024. Please feel free to discuss and contribute, as this roadmap is shaped by the vLLM community.

Themes.

As before, we categorized our roadmap into 6 broad themes:

Broad model support: vLLM should support a wide range of transformer based models. It should be kept up to date as much as possible. This includes new auto-regressive decoder models, encoder-decoder models, hybrid architectures, and models supporting multi-modal inputs.
Excellent hardware coverage: vLLM should run on a wide range of accelerators for production AI workload. This includes GPUs, tensor accelerators, and CPUs. We will work closely with hardware vendors to ensure vLLM utilizes the greatest performance out of the chip.
Performance optimization:vLLM should be kept up to date with the latest performance optimization techniques. Users of vLLM can trust its performance to be competitive and strong.
Production level engine: vLLM should be the go-to choice for production level serving engine with a suite of features bridging the gaps from single forward pass to 24/7 service.
Strong OSS product: vLLM is and will be a true community project. We want it to be a healthy project with regular release cadence, good documentation, and adding new reviewers to the codebase.
Extensible architectures: For vLLM to grow at an even faster pace, it needs good abstractions to support a wide range of scheduling policies, hardware backends, and inference optimizations. We will work on refactoring the codebase to support that.

Broad Model Support

Support Large Models (Arctic, Nemotron4, Llama3 400B+ when released)
- Via Pipeline Parallelism [Core] [2/N] Pipeline Parallel Support #4412
- Via FP8
New Attention Mechanism (Jamba, Phi3-Small, etc)
Encoder Decoder ([Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) #4837, [Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) #4888, [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) #4942)
Multi-Modal [RFC]: Multi-modality Support Refactoring #4194

Help wanted:

Whisper and the audio API
Arbitrary HF model
Chameleon ([Model] Initial Support for Chameleon #5770)
Multi token prediction
Reward model API
Embedding Model Expansion (Bert, XLMRoberta) ([Model] Bert Embedding Model #5447)

Hardware Support

A feature matrix for all the hardware that vLLM supports, and their maturity level
Enhanced performance benchmark across hardwares
Expanding features support on various hardwares
- PagedAttention and Chunked Prefill on Inferentia
- Chunked Prefill on Intel CPU/GPU
- PagedAttention on Intel Gaudi
- TP and INT8 on TPU
- Bug fixes and GEMM tuning on AMD GPUs

Performance Optimizations

Production Features

Help wanted

Support multiple models in the same server
[Feedback wanted] Disaggregated prefill: please discuss with us your use case and in what scenario it is preferred over chunked prefill.

OSS Community

Reproducible performance benchmark on realistic workload
CI enhancements
Release process: minimize breaking changes and include deprecations

Help wanted

Documentation enhancements in general (styling, UI, explainers, tutorials, examples, etc)

Extensible Architecture

KV cache transfer [RFC]: Implement disaggregated prefilling via KV cache transfer #5557
Distributed execution [RFC]: A Flexible Architecture for Distributed Inference #5775
Improvements to scheduler and memory manager supporting new attention mechanisms
Performance enhancement for multi-modal processing

If any of the item you wanted is not on the roadmap, your suggestion and contribution is still welcomed! Please feel free to comment in this thread, open feature request, or create an RFC.

Jeffwan · 2024-06-25T01:07:41Z

Support multiple models in the same server

Does vLLM need the multi-model support similar like what FastChat does or something else?

CSEEduanyu · 2024-06-25T02:11:15Z

#2809 hello,how about this？

jeejeelee · 2024-06-26T15:09:54Z

Hi, the issues were mentioned in #5036 and should be taken into account.

MeJerry215 · 2024-06-27T06:49:32Z

Will vLLM use Triton more to optimize operators' performance in future, or will it consider using the torch.compile mechanism more?

And are there any plans for this?

ashim-mahara · 2024-06-27T19:45:36Z

Hi! Is there or will there be support for the OpenAI Batch API ?

huseinzol05 · 2024-06-28T11:22:27Z

I am doing for Whisper, my fork at https://github.com/mesolitica/vllm-whisper, the frontend later should compatible with OpenAI API plus able to stream output tokens, few hiccups, still trying to figure out based on T5 branch,

vllm/vllm/model_executor/layers/enc_dec_attention.py

Line 83 in 9f20ccf

out = xops.memory_efficient_attention_forward(

still try to figure out kv cache for Encoder hidden state or else each steps will recompute Encoder hidden state.
No non causal attention for Encoder and Cross Attention in Decoder, seems like all attention implementation in VLLM is for causal
Reuse KV Cache Cross Attention from the first step for the next steps.

huseinzol05 · 2024-06-28T14:44:30Z

Able to load and infer, https://github.com/mesolitica/vllm-whisper/blob/main/examples/whisper_example.py, but the output is still trash, might be bugs related to weights or the attention, still debugging

simon-mo added misc and removed misc labels Jun 25, 2024

simon-mo pinned this issue Jun 25, 2024

simon-mo mentioned this issue Jun 25, 2024

[Roadmap] vLLM Roadmap Q2 2024 #3861

Closed

65 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Roadmap] vLLM Roadmap Q3 2024 #5805

[Roadmap] vLLM Roadmap Q3 2024 #5805

simon-mo commented Jun 25, 2024 •

edited by ywang96

Loading

Jeffwan commented Jun 25, 2024

CSEEduanyu commented Jun 25, 2024

jeejeelee commented Jun 26, 2024

MeJerry215 commented Jun 27, 2024 •

edited

Loading

ashim-mahara commented Jun 27, 2024

huseinzol05 commented Jun 28, 2024 •

edited

Loading

huseinzol05 commented Jun 28, 2024

[Roadmap] vLLM Roadmap Q3 2024 #5805

[Roadmap] vLLM Roadmap Q3 2024 #5805

Comments

simon-mo commented Jun 25, 2024 • edited by ywang96 Loading

Anything you want to discuss about vllm.

Themes.

Broad Model Support

Hardware Support

Performance Optimizations

Production Features

OSS Community

Extensible Architecture

Jeffwan commented Jun 25, 2024

CSEEduanyu commented Jun 25, 2024

jeejeelee commented Jun 26, 2024

MeJerry215 commented Jun 27, 2024 • edited Loading

ashim-mahara commented Jun 27, 2024

huseinzol05 commented Jun 28, 2024 • edited Loading

huseinzol05 commented Jun 28, 2024

simon-mo commented Jun 25, 2024 •

edited by ywang96

Loading

MeJerry215 commented Jun 27, 2024 •

edited

Loading

huseinzol05 commented Jun 28, 2024 •

edited

Loading