vllm-omni post init and roadmap updates #123
Merged · 22 commits · +100 −0
Commits:
- 25adc42 vllm-omni post init and roadmap updates (hsliuustc0106)
- e178c1b update figures (hsliuustc0106)
- 190aef7 update logo (hsliuustc0106)
- 4f1c747 update logo (hsliuustc0106)
- 16b3625 update logo (hsliuustc0106)
- 11bb5a8 update logo (hsliuustc0106)
- d71d4ce update pngs (hsliuustc0106)
- 06c9ffb Apply suggestion from @ywang96 (hsliuustc0106)
- 23d74a2 Apply suggestion from @ywang96 (hsliuustc0106)
- a015fb7 Apply suggestion from @ywang96 (hsliuustc0106)
- d8f1195 Apply suggestion from @ywang96 (hsliuustc0106)
- e0ddbaf Apply suggestion from @ywang96 (hsliuustc0106)
- cf0aaa9 Apply suggestion from @ywang96 (hsliuustc0106)
- 0241604 Apply suggestion from @ywang96 (hsliuustc0106)
- 874da7f Apply suggestion from @ywang96 (hsliuustc0106)
- 9865478 update omni-modality model arch & fix meeting time to PDT & fix typos (hsliuustc0106)
- 5e32cb8 fix pdt time (hsliuustc0106)
- 390c498 update (ywang96)
- 642129f cleanup (ywang96)
- b40de50 update (ywang96)
- 33844ff refine user-interface (hsliuustc0106)
- c62099f update async stage png (hsliuustc0106)
---
layout: post
title: "Announcing vLLM-Omni: Easy, Fast, and Cheap Omni-Modality Model Serving"
author: "vLLM-Omni Team"
image: /assets/figures/2025-11-30-vllm-omni/vllm-omni-logo-text-dark.png
---

We are excited to announce the official release of [**vLLM-Omni**](https://github.com/vllm-project/vllm-omni), a major extension of the vLLM ecosystem designed to support the next generation of AI: omni-modality models.

<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/vllm-omni-logo-text-dark.png" alt="vllm-omni logo" width="80%">
</p>
Since its inception, vLLM has focused on high-throughput, memory-efficient serving for Large Language Models (LLMs). However, the landscape of generative AI is shifting rapidly. Models are no longer just text-in, text-out. Today's state-of-the-art models reason across text, images, audio, and video, and they generate heterogeneous outputs using diverse architectures.

**vLLM-Omni** is one of the first open-source frameworks for omni-modality model serving, extending vLLM's exceptional performance to the world of multi-modal and non-autoregressive inference.

<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/omni-modality-model-architecture.png" alt="omni-modality model architecture" width="80%">
</p>
## **Why vLLM-Omni?**

Traditional serving engines were optimized for text-based autoregressive (AR) tasks. As models evolve into "omni" agents that can see, hear, and speak, the serving infrastructure must evolve with them.

vLLM-Omni addresses three critical shifts in model architecture:

1. **True Omni-Modality:** Processing and generating text, images, video, and audio seamlessly.
2. **Beyond Autoregression:** Extending vLLM's efficient memory management to **Diffusion Transformers (DiT)** and other parallel generation models.
3. **Heterogeneous Model Pipelines:** Orchestrating complex workflows in which a single request can invoke multiple heterogeneous model components (e.g., multimodal encoding, AR reasoning, and diffusion-based multimodal generation).
## **Inside the Architecture**

vLLM-Omni is more than a wrapper; it is a re-imagining of data flow within and beyond vLLM. It introduces a fully disaggregated pipeline that allows dynamic resource allocation across the different stages of generation. As shown above, the architecture unifies distinct phases (a conceptual sketch follows the list):

* **Modality Encoders:** Efficiently encoding multimodal inputs (ViT, Whisper, etc.).
* **LLM Core:** Leveraging vLLM for autoregressive text and hidden-state generation with one or more language models.
* **Modality Generators:** High-performance serving for DiT and other decoding heads to produce rich media outputs.
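To make the phases above concrete, here is a minimal conceptual sketch of how a single omni request could flow through the three stages. This is illustrative pseudocode only, not the vLLM-Omni API: the `OmniRequest` and `Stage` names are placeholders we introduce for explanation.

```python
# Conceptual sketch only: NOT the vLLM-Omni API. It shows a single request
# flowing through heterogeneous stages (encoders -> LLM core -> generators).
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OmniRequest:
    prompt: str
    images: list = field(default_factory=list)
    audio: list = field(default_factory=list)


class Stage:
    """Stand-in for one pipeline stage (encoder, LLM core, or generator)."""

    def __init__(self, name: str):
        self.name = name

    def run(self, payload: Any) -> Any:
        # A real stage would dispatch to a model (ViT, Whisper, an LLM, a DiT, ...).
        print(f"[{self.name}] processing")
        return payload


def serve(request: OmniRequest) -> Any:
    encoders = Stage("modality-encoders")
    llm_core = Stage("llm-core")
    generator = Stage("modality-generators")

    embeddings = encoders.run(request)   # encode image/audio/video inputs
    hidden = llm_core.run(embeddings)    # autoregressive reasoning in vLLM
    return generator.run(hidden)         # decode to text, image, or audio


if __name__ == "__main__":
    serve(OmniRequest(prompt="Describe this image and narrate it aloud."))
```

In the real system these stages are disaggregated, so each can be scaled and scheduled independently rather than chained in a single process as above.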
### **Key Features**

<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/vllm-omni-user-interface.png" alt="vllm-omni user interface" width="80%">
</p>

* **Simplicity:** If you know how to use vLLM, you know how to use vLLM-Omni. We maintain seamless integration with Hugging Face models and offer an OpenAI-compatible API server (see the client sketch after this list).

* **Flexibility:** The OmniStage abstraction provides a simple, straightforward way to support a variety of omni-modality models, including Qwen-Omni, Qwen-Image, and other state-of-the-art models.

* **Performance:** We use pipelined stage execution to overlap computation across stages for high-throughput serving, ensuring that while one stage is processing, the others aren't idle.
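For a feel of the OpenAI-compatible interface, here is a minimal client sketch using the official `openai` Python client. It assumes a vLLM-Omni server is already running locally on port 8000; the model identifier, port, and exact request fields depend on what you deploy, so treat this as illustrative rather than authoritative.

```python
# Minimal sketch: query an OpenAI-compatible endpoint with a multimodal prompt.
# Assumes a local server at http://localhost:8000/v1; the model name below is a
# placeholder, substitute the model you actually serve.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is happening in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/demo.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```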
<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/vllm-omni-pipeline-async-stage.png" alt="vllm-omni pipelined stage execution" width="80%">
</p>
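To make the overlap idea concrete, below is a small, generic `asyncio` sketch (not vLLM-Omni internals) of pipelined stage execution: while request N is in the generation stage, request N+1 can already be in the LLM stage and request N+2 in the encoding stage.

```python
# Generic illustration of pipelined stage execution, not vLLM-Omni code.
import asyncio


async def stage(name, inbox, outbox, latency):
    """One pipeline stage: pull a request, 'compute', pass it downstream."""
    while True:
        item = await inbox.get()
        if item is None:              # shutdown signal: forward it and stop
            if outbox is not None:
                await outbox.put(None)
            break
        await asyncio.sleep(latency)  # stand-in for model compute
        print(f"{name} finished request {item}")
        if outbox is not None:
            await outbox.put(item)


async def main():
    q_encode, q_llm, q_gen = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    workers = [
        stage("encoder", q_encode, q_llm, 0.1),
        stage("llm-core", q_llm, q_gen, 0.2),
        stage("generator", q_gen, None, 0.3),
    ]
    for i in range(4):                # four requests enter the pipeline back-to-back
        q_encode.put_nowait(i)
    q_encode.put_nowait(None)
    await asyncio.gather(*workers)


asyncio.run(main())
```

Running it shows the three stages working on different requests at the same time; the same principle, applied to real encoder, LLM, and generator workloads, is what keeps GPU stages from sitting idle.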
We benchmarked vLLM-Omni against Hugging Face Transformers to demonstrate the efficiency gains in omni-modal serving.

<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/vllm-omni-vs-hf.png" alt="vLLM-Omni against Hugging Face Transformers" width="80%">
</p>
## **Future Roadmap**

vLLM-Omni is evolving rapidly. Our roadmap focuses on expanding model support, pushing the boundaries of efficient inference, and building the right framework to empower future research on omni-modality models.

* **Expanded Model Support:** We plan to support a wider range of open-source omni-models and diffusion transformers as they emerge.
* **Adaptive Framework Refinement:** We will continue to evolve the framework to support emerging omni-modality models and execution patterns, ensuring it remains a reliable foundation for both production workloads and cutting-edge research.
* **Deeper vLLM Integration:** Merging core omni features upstream to make multi-modality a first-class citizen across the vLLM ecosystem.
* **Diffusion Acceleration:** Parallel inference (DP/TP/SP/USP, ...), cache acceleration (TeaCache/DBCache, ...), and compute acceleration (quantization, sparse attention, ...).
* **Full Disaggregation:** Building on the OmniStage abstraction, we plan to support full disaggregation (encoder/prefill/decode/generation) across inference stages to improve throughput and reduce latency.
* **Hardware Support:** Following the hardware plugin system, we plan to expand support for various hardware backends so vLLM-Omni runs efficiently everywhere.
## **Getting Started**

Getting started with vLLM-Omni is straightforward. The initial vllm-omni v0.11.0rc release is built on top of vLLM v0.11.0.

### **Installation**

Check out our [Installation Doc](https://vllm-omni.readthedocs.io/en/latest/getting_started/installation/) for details.
### **Serving omni-modality models**

Check out our [examples directory](https://github.com/vllm-project/vllm-omni/tree/main/examples) for specific scripts to launch image, audio, and video generation workflows. vLLM-Omni also provides Gradio support for a better user experience; below is a demo of serving Qwen-Image:
<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/vllm-omni-gradio-serving-demo.png" alt="vllm-omni serving qwen-image with gradio" width="80%">
</p>
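If you want to wire up a similar Gradio front end yourself, the pattern is a thin UI layer over whatever client call you use to reach the server. The sketch below is generic: `generate_image` is a hypothetical placeholder for your actual vLLM-Omni client call, and the launch settings are just Gradio defaults; see the examples directory for the supported entrypoints.

```python
# Generic Gradio front-end sketch. `generate_image` is a hypothetical placeholder:
# swap in the actual call you use to reach your vLLM-Omni deployment (for example,
# a request to its OpenAI-compatible endpoint).
import gradio as gr
from PIL import Image


def generate_image(prompt: str) -> Image.Image:
    # Placeholder: call your serving endpoint here and return the decoded image.
    return Image.new("RGB", (512, 512), color="gray")


demo = gr.Interface(
    fn=generate_image,
    inputs=gr.Textbox(label="Prompt", placeholder="A watercolor fox in the snow"),
    outputs=gr.Image(label="Generated image"),
    title="Qwen-Image on vLLM-Omni (demo sketch)",
)

if __name__ == "__main__":
    demo.launch()  # serves the UI at http://127.0.0.1:7860 by default
```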
## **Join the Community**

This is just the beginning for omni-modality serving. We are actively developing support for more architectures and invite the community to help shape the future of vLLM-Omni.

* **Code & Docs:** [GitHub Repository](https://github.com/vllm-project/vllm-omni) - [Documentation](https://vllm-omni.readthedocs.io/en/latest/)
* **Slack:** Ask questions and share feedback in the `#sig-omni` channel at [slack.vllm.ai](https://slack.vllm.ai).
* **Weekly Meeting:** Join us every Tuesday at 19:30 PDT to discuss the roadmap and features. [Join here](https://tinyurl.com/vllm-omni-meeting).

Let's build the future of omni-modal serving together!
Binary files added:
- assets/figures/2025-11-30-vllm-omni/omni-modality-model-architecture.png (+52.8 KB)
- assets/figures/2025-11-30-vllm-omni/vllm-omni-gradio-serving-demo.png (+1.06 MB)
- assets/figures/2025-11-30-vllm-omni/vllm-omni-pipeline-async-stage.png (+107 KB)