v0.23.0rc1
Pre-release
Highlights
This release candidate features 79 commits from 68 contributors, including 12 new contributors.
vLLM-Omni v0.23.0rc1 is a release candidate aligned with upstream vLLM v0.23.0. It focuses on validating the next release line by expanding TTS/audio model coverage, improving speech serving latency and correctness, strengthening diffusion/image/video generation paths, and broadening quantization and hardware backend readiness across CUDA, Blackwell, ROCm, NPU, and XPU.
This release candidate is intended to validate the vLLM 0.23 integration, the refreshed TTS serving adapter path, Higgs Audio V3 and Qwen3-TTS production behavior, and diffusion/video quantization and platform coverage before the final v0.23.0 release.
Key Improvements
- Rebased to upstream vLLM v0.23.0, refreshing the base runtime for the v0.23 release line. (#4286)
- Expanded speech and TTS model coverage, adding Higgs Audio V3 TTS, Step-Audio2, Ming-omni-tts dense 0.5B, broader Qwen3-TTS language support, and LoRA support for SenseNova-U1. (#4169, #464, #2906, #4210, #3971)
- Improved TTS serving performance and reliability, with Higgs Audio V3 reference-audio caching and prefix caching, Qwen3-TTS hot-path optimizations and prefix-cache fixes, TensorRT acceleration for CosyVoice, and CUDA Graph support for MOSS-TTS. (#4200, #4199, #4204, #3689, #4317, #4168, #4157)
- Strengthened diffusion, image, and video generation, including Cosmos3 video-to-video, WAN2.2-S2V image+audio server API support, HunyuanImage3 resolution and latency improvements, LTX-2.3 VAE decode parallelism, and Wan2.2 sequence-parallel performance fixes. (#4266, #3394, #4004, #4333, #4277, #3763)
- Expanded quantization and hardware coverage, including ModelOpt FP8 for Wan2.2 and HunyuanVideo-1.5, Blackwell FP8 GEMM fused-bias kernels, XPU sage attention, NPU VoxCPM2 support, and Ascend 310P support for Qwen3-TTS. (#3305, #4245, #4241, #3785, #4310, #4283)
- Improved runtime architecture and developer experience, with the TTS serving adapter framework, multimodal output-channel separation, audio-in-video refactoring, Diffusers pipeline cleanup, and stronger CI/test gating. (#4330, #2744, #3566, #1932, #4313, #4365)
Core Architecture & Runtime
- Rebased vLLM-Omni to upstream vLLM v0.23.0, keeping the Omni runtime aligned with the latest upstream release line. (#4286)
- Introduced the TTS serving adapter framework and migrated TTS models onto the refreshed adapter path, making speech model serving easier to extend and maintain. (#4330)
- Continued multimodal output processor work by separating multimodal output channels, improving the architecture for heterogeneous text/audio/video outputs. (#2744)
- Cleaned up diffusion pipeline loading by removing dead legacy pipeline.yaml loading paths and duplicate diffusion logic, and by simplifying DiffusersPipelineLoader. (#4023, #1932)
- Refactored audio-in-video implementation and added the MoriIO transfer engine, improving maintainability for multimodal data routing and media-transfer workflows. (#3566, #1742)
- Refined guardrail error handling with explicit 400-level error behavior for invalid requests. (#4297)
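
The guardrail change above (#4297) can be pictured as mapping request-validation failures to a 400-level status instead of a generic server error. The following is a minimal sketch under that assumption; `InvalidRequestError`, `ErrorResponse`, and `to_error_response` are hypothetical names for illustration and are not taken from the vLLM-Omni codebase.

```python
from dataclasses import dataclass
from http import HTTPStatus


class InvalidRequestError(ValueError):
    """Hypothetical exception raised when a guardrail rejects a request."""


@dataclass
class ErrorResponse:
    status_code: int
    message: str


def to_error_response(exc: Exception) -> ErrorResponse:
    """Map an exception to an HTTP-style error payload (illustrative only)."""
    # Problems caused by the caller map to 400 so clients can tell them
    # apart from server-side faults.
    if isinstance(exc, InvalidRequestError):
        return ErrorResponse(HTTPStatus.BAD_REQUEST.value, str(exc))
    # Anything else is treated as an internal failure; details are not leaked.
    return ErrorResponse(HTTPStatus.INTERNAL_SERVER_ERROR.value, "internal error")
```

A caller would wrap request handling in a `try`/`except` and pass the caught exception to `to_error_response`.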
Model Support
- Added support for bosonai/higgs-audio-v3-tts-4b, bringing Higgs Audio V3 TTS into the vLLM-Omni serving stack. (#4169)
- Added Step-Audio2 support and Step-Audio R1 reasoning parser support. (#464, #2846)
- Added Ming-omni-tts dense 0.5B pipeline support and follow-up compatibility fixes for Mapping-style input checks. (#2906, #4397)
- Expanded Qwen3-TTS language support for fine-tuned checkpoints and added non_streaming_mode for Qwen3-TTS base models during online inference. (#4210, #4198)
- Added LoRA support for SenseNova-U1. (#3971)
- Added more resolution support for HunyuanImage3.0. (#4004)
- Added Cosmos3 video-to-video generation and Cosmos3-Nano baselines. (#4266, #4301)
- Added WAN2.2-S2V server API support for image + audio input workflows. (#3394)
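
The Mapping-style input check mentioned for Ming-omni-tts (#4397) reflects a general Python pattern: testing against `collections.abc.Mapping` accepts dict-like objects that a plain `dict` check would reject. A minimal sketch follows; `get_prompt_text` and the `"prompt"` key are hypothetical and only illustrate the pattern, not the actual pipeline code.

```python
from collections.abc import Mapping
from types import MappingProxyType


def get_prompt_text(inputs: object) -> str:
    """Hypothetical helper: pull a 'prompt' field out of dict-like inputs."""
    # isinstance(inputs, dict) would reject read-only or custom mappings;
    # Mapping accepts anything that implements the mapping protocol.
    if isinstance(inputs, Mapping):
        return str(inputs.get("prompt", ""))
    if isinstance(inputs, str):
        return inputs
    raise TypeError(f"unsupported input type: {type(inputs).__name__}")


# A read-only mapping view is a Mapping but not a dict.
frozen_inputs = MappingProxyType({"prompt": "hello"})
print(get_prompt_text(frozen_inputs))
```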
Audio, Speech & Omni Production Optimization
- Improved Higgs Audio V3 serving with LRU caching for voice-clone reference-audio encoding, Stage-0 prefix caching enabled by default, and serving-path optimizations. (#4200, #4199, #4204)
- Fixed Higgs Audio V3 Stage0 talker ramp-down and buffer-state crashes, improving stability for long-running speech serving. (#4219)
- Optimized Qwen3-TTS hot paths with prefix-cache OOM guards, orchestrator/talker micro-optimizations, and fixes for prefix-cache corruption and cross-request codes_ref leakage. (#3689, #4317, #4373)
- Improved Qwen3-TTS Gradio streaming TTFP by using audio/pcm. (#4346)
- Optimized CosyVoice TTFP and throughput using TensorRT. (#4168)
- Added CUDA Graph support for the MOSS-TTS codec decoder and made MOSS-TTS-Nano eager-init compatible with load_format: dummy. (#4157, #3230)
- Fixed Fish Speech serving issues, including a Gradio default-voice 400 error and prefix-cache collisions from missing cache_salt. (#3941, #4008)
- Improved VoxCPM2 robustness with KV-cache pinning for smaller GPUs and fixes for concurrent speech quality. (#4279, #4319)
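
The LRU caching of voice-clone reference-audio encodings listed above (#4200) follows a familiar idea: repeated requests with the same reference clip should not pay for encoding twice. The sketch below shows one way such a cache could look. It is an assumption-laden illustration, not the Higgs Audio V3 implementation: the class name `RefAudioLRUCache`, the SHA-256 content key, the `max_entries` default, and the `encode_fn` callback are all hypothetical.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class RefAudioLRUCache:
    """Hypothetical LRU cache keyed by a hash of the reference-audio bytes."""

    def __init__(self, encode_fn: Callable[[bytes], list], max_entries: int = 32):
        self._encode_fn = encode_fn      # the expensive encoder (assumed)
        self._max_entries = max_entries  # capacity is an illustrative default
        self._entries = OrderedDict()    # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def get(self, audio_bytes: bytes) -> list:
        # Key on content so identical clips share one cache entry.
        key = hashlib.sha256(audio_bytes).hexdigest()
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        codes = self._encode_fn(audio_bytes)
        self._entries[key] = codes
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return codes
```

Keying on a content hash rather than a file name means two uploads of the same clip hit the same entry, at the cost of hashing the bytes on every lookup.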
Diffusion, Image & Video Generation
- Added Cosmos3 video-to-video generation support and Cosmos3-Nano baseline coverage. (#4266, #4301)
- Added WAN2.2-S2V server API support for image+audio generation workflows. (#3394)
- Improved HunyuanImage3 behavior with offline CoT fixes, CoT truncation fixes, stream-mode accuracy fixes, additional resolution support, and a prepare_attention_mask optimization that reduces end-to-end latency by about 6%. (#4174, #4260, #4265, #4004, #4333)
- Improved HunyuanVideo-1.5 quantization propagation so I2V transformer FP8 layers can be enabled correctly. (#4245)
- Improved LTX-2.3 production behavior by keeping auxiliary modules resident by default, adding VAE decode parallelism, and fixing RMSNorm identity-weight registration. (#4144, #4277, #4278)
- Optimized Wan2.2 sequence-parallel behavior by skipping attention masks for zero-padded SP sequences to avoid the varlen path. (#3763)
- Improved Lance text-to-image and image-to-image performance. (#4214)
- Fixed an indentation bug in the BAGEL sequence-parallel (SP) denoise path. (#4328)
Quantization & Memory Efficiency
- Added ModelOpt FP8 support for Wan2.2 and HunyuanVideo-1.5 video generation. (#3305)
- Propagated quantization configuration into HunyuanVideo-1.5 I2V transformer paths to enable FP8 layers. (#4245)
- Defaulted Blackwell FP8 GEMM to the quack CuteDSL fused-bias kernel, improving the default quantized execution path on Blackwell. (#4241)
- Pinned VoxCPM2 KV cache to reduce OOM risk on smaller GPUs. (#4279)
Platforms, Distributed Execution & Hardware Coverage
- Added XPU sage_attn backend support. (#3785)
- Removed CUDA hardcoding in Cosmos3 XPU paths and made VLLM_VIDEO_SYNC_TIMEOUT tunable. (#4360)
- Updated DreamZero to support non-CUDA hardware paths. (#4399)
- Added NPU support for VoxCPM2 and adapted the VoxCPM2 audio encoder for non-CUDA backends. (#4310, #4374)
- Adapted Qwen3-TTS for Ascend 310P. (#4283)
- Added ROCm CI group/env features. (#4208)
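
Making VLLM_VIDEO_SYNC_TIMEOUT tunable (#4360) suggests a setting that is read at runtime with a fallback. The sketch below assumes it is read as an environment variable holding a number of seconds. Both of those points, the default of 600 seconds, and the helper name `get_video_sync_timeout` are assumptions for illustration; the notes do not state the actual default, unit, or lookup mechanism.

```python
import os
from typing import Mapping

# Assumed fallback; the real default is not stated in these notes.
DEFAULT_VIDEO_SYNC_TIMEOUT_S = 600.0


def get_video_sync_timeout(env: Mapping[str, str] = os.environ) -> float:
    """Hypothetical reader for VLLM_VIDEO_SYNC_TIMEOUT (unit assumed: seconds)."""
    raw = env.get("VLLM_VIDEO_SYNC_TIMEOUT")
    if raw is None or not raw.strip():
        return DEFAULT_VIDEO_SYNC_TIMEOUT_S
    value = float(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError("VLLM_VIDEO_SYNC_TIMEOUT must be positive")
    return value
```

Passing the environment as a parameter keeps the function easy to test without mutating `os.environ`.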
Reliability, Tooling & Developer Experience
- Purged chunk-transfer zombies on every scheduler tick so engine-core stays alive after request aborts. (#3774)
- Removed the pydub dependency for Python 3.13 compatibility. (#4035)
- Surfaced all-rank diffusion RPC failures so distributed diffusion errors are easier to diagnose. (#4403)
- Added automatic cleanup for generated audio files in tests and expanded realtime invalid-parameter coverage. (#4294)
- Improved CI efficiency with diff-aware L2/L3 gating, skipped unrelated merge/ready tests, single-GPU queue migration, and temporary skips for unstable OOM/MOSS-TTS-Nano cases. (#4291, #4313, #4365, #4311, #4391)
- Added Voxtral TTS tests. (#3738)
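
The diff-aware L2/L3 gating above (#4291, #4313, #4365) amounts to choosing test groups from the set of changed files. A minimal sketch of that idea follows. The path patterns, group names, and the `select_test_groups` function are hypothetical and do not describe the project's actual CI configuration.

```python
from fnmatch import fnmatch

# Hypothetical mapping from path patterns to the test groups they trigger.
PATH_RULES = {
    "vllm_omni/diffusion/*": {"diffusion_e2e"},
    "vllm_omni/tts/*": {"tts_e2e"},
    "docs/*": set(),  # documentation-only changes trigger no E2E groups
}
ALL_GROUPS = {"diffusion_e2e", "tts_e2e"}


def select_test_groups(changed_files):
    """Return the test groups to run for a list of changed file paths."""
    selected = set()
    for path in changed_files:
        matched = [groups for pattern, groups in PATH_RULES.items()
                   if fnmatch(path, pattern)]
        if not matched:
            # Unknown path: be conservative and run everything.
            return set(ALL_GROUPS)
        for groups in matched:
            selected |= groups
    return selected
```

Falling back to the full suite for unmatched paths trades some CI time for safety: a new directory never silently skips its tests.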
CI, Benchmarks & Documentation
- Updated README and supported model documentation for TTS and diffusion categories. (#4233, #4300)
- Added a Stable Diffusion 3.5 recipe for 1× RTX A6000 48GB. (#4052)
- Added CUDA verification notes for inclusionAI/Ming-omni-tts-0.5B. (#4324)
- Added failure-mode documentation and cleaned up the PR template. (#3926, #4336)
- Added Claude skills for precheck-PR and quantization workflows. (#4216, #4252)
What's Changed
- [TTS][New Model] support bosonai/higgs-audio-v3-tts-4b by @yuekaizhang in #4169
- [Perf][Higgs-Audio-V3] LRU cache for voice-clone ref-audio encode by @linyueqian in #4200
- [Perf][Higgs-Audio-V3] Turn on Stage-0 prefix caching by default by @linyueqian in #4199
- [Doc] Add Stable-Diffusion-3.5 recipe for 1x RTX A6000 48GB (#2645) by @yangyonggit in #4052
- [Bugfix] Fish Speech Gradio hardcoded default voice causes 400 error by @nagelanping in #3941
- [Model] Support languages added by fine-tuned Qwen3-TTS checkpoints by @n0n4m39911 in #4210
- [Quant] ModelOpt FP8 for Wan2.2 & HunyuanVideo-1.5 video-gen by @lishunyang12 in #3305
- [BugFix] fix hunyuan image3 offline cot by @BLANKETusers in #4174
- [skip ci] docs: update WeChat QR code by @david6666666 in #4242
- [Refactor]Refactoring audio_in_video implementation by @amy-why-3459 in #3566
- [Doc] Add precheck-pr Claude Code skill by @hsliuustc0106 in #4216
- [Perf] Keep LTX2.3 auxiliary modules resident by default by @mglyn in #4144
- [Fix][HiggsAudioV3] Fix ramp-down off-by-one crash and buffer state bugs in Stage0 talker by @Sy0307 in #4219
- Cosmos3 video to video generation by @MaciejBalaNV in #4266
- [Quant][Perf] Default Blackwell FP8 GEMM to quack CuteDSL fused-bias kernel by @lishunyang12 in #4241
- [Refactor] Remove dead legacy pipeline.yaml loader and duplicate diff… by @AbelSara in #4023
- [Core] Cleanup DiffusersPipelineLoader by @alex-jw-brooks in #1932
- docs: update README and supported models for v0.22.0 release by @hsliuustc0106 in #4233
- fix: Propagate Quantization Configuration to HunyuanVideo-1.5 I2V Transformer to Enable FP8 Layers by @weizhoublue in #4245
- [Perf] lance perf optimize (t2i & i2i) by @yangjianjuan in #4214
- [Bugfix] Register LTX RMSNorm identity weight as buffer by @mglyn in #4278
- Fix README.md typo by @SamitHuang in #4292
- [XPU] Add sage_attn backend by @xuechendi in #3785
- [Feat] Add vae-decode-parallel for LTX-2.3 by @mglyn in #4277
- [ROCm] [CI] Add group feature and envs feature by @tjtanaa in #4208
- [Feature] LoRA support for SenseNova-U1 by @leohuang257 in #3971
- [BugFix] Fix HunyuanImage3 CoT truncation when stop token stripped by detokenizer by @zengchuang-hw in #4260
- [Perf][Bugfix] qwen3-tts hot path: prefix-cache OOM guards + talker/orchestrator micro-opts by @JuanPZuluaga in #3689
- add moriio transfer engine by @inkcherry in #1742
- [skip ci] fix(config): pin VoxCPM2 KV cache to avoid OOM on small GPUs by @linyueqian in #4279
- [Bugfix] purge chunk-transfer zombies on every schedule tick to keep engine-core alive on aborts (fixes #3736) by @abinggo in #3774
- [BugFix] Remove pydub dependency for Python 3.13 compatibility by @FED4 in #4035
- [CI] skip unrelated L3 merge tests via diff-aware upload by @yenuo26 in #4291
- [Perf][TTS] Optimize cosyvoice TTFP and throughput using TensorRT by @yuekaizhang in #4168
- [Bugfix] Fix Fish Speech prefix cache collision from missing cache_salt by @nagelanping in #4008
- [Test] skip oom test case for issue #4285 by @zhumingjue138 in #4311
- [Docs] Update README supported models section with TTS and Diffusion categories by @hsliuustc0106 in #4300
- [Model] Add Ming-omni-tts dense 0.5B pipeline by @akshatvishu in #2906
- [CI/Build] Voxtral TTS Tests by @clodaghwalsh17 in #3738
- [BugFix] moss_tts_nano: eager-init lm + audio_tokenizer in init so load_format: dummy works by @leohuang257 in #3230
- [skip ci] Add width and height args to offline i2i example script by @fhfuih in #4031
- [Bugfix] qwen3-tts prefix cache: drop per-key size cap that corrupted… by @JuanPZuluaga in #4317
- Step audio R1 reasoning parser by @QiuMike in #2846
- [Test] Automatically clean up audio files generated from requests, and realtime invalid-param coverage by @yenuo26 in #4294
- [Skills] Add quantization Claude skill by @david6666666 in #4252
- [Docs] add doc for failure mode by @zhumingjue138 in #3926
- [Bugfix][HunyuanImage] fix accuracy in stream mode by @Bounty-hunter in #4265
- [Bugfix][VoxCPM2] Fix VoxCPM2 concurrent speech quality by @Shirley125 in #4319
- [Bugfix] Fix SP denoise indentation bug on BAGEL by @kushanam in #4328
- [CI] skip unrelated L2 ready tests via diff-aware upload, aligned L3 tweak and fix issue 4334 by @yenuo26 in #4313
- [NPU] Support VoxCPM2 model by @tanhaoan333 in #4310
- [Doc] Clean up PR template by @hsliuustc0106 in #4336
- [Bugfix] Add Cosmos3-Nano baselines and fix USP gather by @david6666666 in #4301
- [WAN2.2-S2V] Add server API for image + audio by @xuechendi in #3394
- [HunyuanImage][Perf] opt prepare_attention_mask for e2e latency 6% reduction by @Bounty-hunter in #4333
- [Perf] Optimize Higgs Audio v3 serving by @Sy0307 in #4204
- [refactor] Refactor guardrail error handling - add 400 error code by @MaciejBalaNV in #4297
- Feat: non_streaming_mode for Qwen3-TTS Base Models During Online Inference by @nagisa-kunhah in #4198
- feat(moss-tts): add CUDA Graph support for codec decoder by @yangyonggit in #4157
- [Bugfix] Fix Qwen3-TTS gradio streaming TTFP by using audio/pcm by @yuxinyuan in #4346
- [XPU][COSMOS3] removing cuda hardcode and make VLLM_VIDEO_SYNC_TIMEOUT a tunable config by @xuechendi in #4360
- [Model] Add more resolution support for HunyuanImage3.0 by @Semmer2 in #4004
- [Refactor] Output Processor Phase 2: separate multimodal output channel (#1601) by @meghaagr13 in #2744
- [platform] fix: set UnspecifiedOmniPlatform device_type to cpu by @SamitHuang in #4357
- [bugfix]VoxCPM2 audio encoder adapt other than CUDA by @tanhaoan333 in #4374
- [ci]skip voxcpm2 pcm hnr test by @Shirley125 in #4375
- [Hardware][Ascend] Adapt Qwen3 TTS for 310P by @zyz111222 in #4283
- [CI] Diff-gate L2/L3 E2E jobs and migrate single-GPU tests to gpu_1_queue by @yenuo26 in #4365
- [CI] skip MOSS-TTS-Nano E2E tests pending issue#4361 by @yenuo26 in #4391
- [Perf][Wan2.2] Skip attention mask for zero-padded SP sequences to avoid varlen path by @jjuvonen-amd in #3763
- [Bugfix] Fix cross-request codes.ref leak in Qwen3-TTS make_omni_output by @henryj in #4373
- [New Model] Step-Audio2 by @wuli666 in #464
- [Bugfix] Ming-omni-tts: generalize dict checks to Mapping by @akshatvishu in #4397
- [XPU][CI] Fix docker build slowness by @xuechendi in #4402
- [Refactor] TTS serving adapter framework + migrate TTS models (excl. ming_flash_omni) by @linyueqian in #4330
- [Bugfix] Surface all-rank diffusion RPC failures by @hsliuustc0106 in #4403
- [Doc][Recipe] Update CUDA verifications for inclusionAI/Ming-omni-tts-0.5B by @yuanheng-zhao in #4324
- [XPU] update Dreamzero to support any HW by @xuechendi in #4399
- [Rebase] Rebase to vllm 0.23.0 by @tzhouam in #4286
New Contributors
- @yangyonggit made their first contribution in #4052
- @nagelanping made their first contribution in #3941
- @n0n4m39911 made their first contribution in #4210
- @inkcherry made their first contribution in #1742
- @abinggo made their first contribution in #3774
- @FED4 made their first contribution in #4035
- @clodaghwalsh17 made their first contribution in #3738
- @kushanam made their first contribution in #4328
- @yuxinyuan made their first contribution in #4346
- @zyz111222 made their first contribution in #4283
- @jjuvonen-amd made their first contribution in #3763
- @henryj made their first contribution in #4373
Full Changelog: v0.22.0...v0.23.0rc1