v0.24.0rc1

Pre-release

@david6666666 david6666666 released this 30 Jun 11:45
a560ed1

Highlights

This release candidate features 158 merged changes from 82 contributors, including 21 new contributors.

vLLM-Omni v0.24.0rc1 is a release candidate aligned with the vLLM 0.24 release line. It expands diffusion, video, TTS, and robot-serving coverage while tightening the multistage runtime, request batching, quantization, cache behavior, and hardware backend support. This release candidate is intended to validate the vLLM 0.24 rebase and the new production-serving paths before the final cut.

Key Improvements

  • Aligned with the vLLM 0.24 release line, including the vLLM 0.24.0 rebase and the new v0.24.0rc1 release branch/tag baseline. (#4709)
  • Expanded model and modality coverage, adding or improving SDXL, GR00T-N1.7 with OpenPI serving, IndexTTS2, MOSS-TTS-local-v1.5, Ming-omni-tts MoE 16.8B-A3B, soulx-singer, Aura, and JoyAI-VL-Interaction. (#4331, #3798, #3838, #4664, #4341, #3862, #4257, #4575, #4623)
  • Strengthened diffusion, image, and video generation, with major Cosmos3, HunyuanImage3, DreamZero, BAGEL, Wan, streaming-video, and image-response improvements. (#4379, #4514, #4627, #4467, #4614, #4041, #4213, #4098, #4667, #3737, #1673)
  • Made audio and TTS serving more production-ready, including Qwen3-TTS correctness and streaming fixes, word-level timestamps, SSE audio streaming, CosyVoice3 prompt handling, MOSS-TTS stability, and new TTS model support. (#4429, #4465, #4525, #4673, #4731, #4034, #4490, #4679, #4756, #4415, #3838)
  • Improved core runtime and serving infrastructure, with the Omni stage runtime control plane, orchestrator output-path fixes, async output materialization, diffusion request-level batching, HF-config-based pipeline resolution, structured config, and video storage backends. (#3855, #4527, #4476, #4079, #3760, #4425, #2531)
  • Expanded quantization, cache, and hardware coverage, including Qwen3-Omni NVFP4 W4A4 on Blackwell, Qwen-Image AutoRound W4A16, HSDP+FP8 compatibility, CacheDiT cleanup, prefix-cache fixes, memory metrics, and NPU/ROCm/XPU fixes. (#4025, #4528, #3588, #4494, #2527, #4449, #4106, #4477, #4454, #4712, #4574)

Core Architecture & Runtime

  • Refactored the Omni stage runtime and distributed replica control plane, then reduced orchestrator bottlenecks by separating inter-stage outputs from client-facing outputs. These changes improve the runtime foundation for multistage omni serving. (#3855, #4527)
  • Added async Omni output materialization and request-level batching for diffusion pipelines, giving serving paths a cleaner way to handle generated artifacts and batch compatible diffusion work at the request level. (#4476, #4079)
  • Improved pipeline and configuration resolution with HF-config-based pipeline selection, eager pipeline registry behavior, structured VllmOmniConfig classes, deploy-config restoration in StageConfigFactory, and component discovery for offload. (#3760, #4425, #4729, #3076)
  • Cleaned up diffusion and stage boundaries by extracting output formatting, removing dead custom stage input hooks, fixing multi-replica stage identity, and aligning multimodal generation output construction. (#4407, #4531, #4410, #4579)
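
As a rough illustration of what request-level batching for diffusion work can look like, the sketch below buckets pending requests by a compatibility key and splits each bucket into bounded batches. It is not the vLLM-Omni implementation: `DiffusionRequest`, its fields, `group_compatible`, and the choice of (height, width, step count) as the compatibility key are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DiffusionRequest:
    # Hypothetical request shape; the field names are illustrative only.
    request_id: str
    height: int
    width: int
    num_inference_steps: int


def group_compatible(requests, max_batch_size=4):
    """Bucket requests that share resolution and step count, then split each
    bucket into batches of at most max_batch_size, keeping arrival order."""
    buckets = defaultdict(list)
    for req in requests:
        # Assumed compatibility key: requests can share a denoising loop
        # only if their latent shapes and step counts match.
        key = (req.height, req.width, req.num_inference_steps)
        buckets[key].append(req)

    batches = []
    for bucket in buckets.values():
        for start in range(0, len(bucket), max_batch_size):
            batches.append(bucket[start:start + max_batch_size])
    return batches


if __name__ == "__main__":
    pending = [
        DiffusionRequest("a", 512, 512, 20),
        DiffusionRequest("b", 1024, 1024, 20),
        DiffusionRequest("c", 512, 512, 20),
    ]
    for batch in group_compatible(pending, max_batch_size=2):
        print([r.request_id for r in batch])
```

A real scheduler would likely also weigh waiting time and memory limits; the sketch only shows the grouping step.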

Model Support

  • Added or enabled new model families and serving paths, including SDXL, GR00T-N1.7 with OpenPI serving, IndexTTS2, MOSS-TTS-local-v1.5, Ming-omni-tts MoE 16.8B-A3B with CFM CUDA Graph, soulx-singer, Aura non-async-chunk serving, and JoyAI-VL-Interaction streaming interaction serving. (#4331, #3798, #3838, #4664, #4341, #3862, #4257, #4575)
  • Migrated or refreshed examples for Helios, Magi human, VACE, MammothModa2-Preview, AudioX, Cosmos3, and Qwen image recipes, making more model entries follow the standard task-example and model_extra pattern. (#4569, #4572, #4648, #4691, #4607, #4581, #4236)
  • Improved existing model behavior across HunyuanImage3, LTX-2.3, Qwen-Image, Qwen-Image-Edit, MammothModa2, Fish Speech, Voxtral, and SenseNova-U1. (#4416, #4439, #4474, #4293, #4299, #4428, #4380, #4188)

Audio, Speech & Omni Production Optimization

  • Improved Qwen3-TTS serving correctness across voice-clone reference-audio trimming, degenerate full-payload completion, Code2Wav CUDA Graph output length, WebSocket input handling, and /v1/audio/speech token accounting. (#4429, #4465, #4525, #4731, #4673)
  • Added streaming word-level timestamps through a forced aligner, added SSE stream_format support for speech streaming, and made SSE the default speech-streaming mode. (#4034, #4490, #4679)
  • Stabilized other speech and audio paths, including MOSS-TTS cross-request audio correctness, MOSS-TTS talker performance, MOSS-TTS loading compatibility, CosyVoice3 reference-text templating, Higgs Audio v3 lazy codec loading, and Fish Speech KV-cache unpacking for vLLM 0.23.0 compatibility. (#4415, #4230, #4398, #4756, #4368, #4428)
  • Updated TTS and audio recipes and baselines for Qwen3-TTS, VoxCPM, Voxtral, Stable-Audio-Open, AudioX, and related recipe references. (#4026, #4521, #4051, #3664, #4607, #4567)
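
For readers new to Server-Sent Events, the sketch below shows how a client might reassemble audio from an SSE text stream. The SSE framing (`data:` fields, blank-line event boundaries, `:` comment lines) follows the general SSE format. The JSON payload with a base64 `audio` field is purely an assumption for illustration, since these notes do not describe the event schema that vLLM-Omni emits.

```python
import base64
import json


def iter_sse_data(lines):
    """Yield the data payload of each SSE event from an iterable of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # one leading space after the colon is dropped
        if field == "data":
            buffer.append(value)


def collect_audio(lines):
    """Concatenate audio bytes from events shaped like {"audio": "<base64>"}.

    The "audio" key is a hypothetical schema chosen for this example.
    """
    chunks = []
    for payload in iter_sse_data(lines):
        event = json.loads(payload)
        chunks.append(base64.b64decode(event["audio"]))
    return b"".join(chunks)


if __name__ == "__main__":
    stream = [
        'data: {"audio": "AAEC"}\n',
        "\n",
        ": keep-alive\n",
        'data: {"audio": "AwQ="}\n',
        "\n",
    ]
    print(collect_audio(stream))
```

In practice the lines would come from a streaming HTTP response rather than a list, and the payload keys should be taken from the server's documented response format.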

Diffusion, Image & Video Generation

  • Improved Cosmos3 quality, transfer behavior, performance, and distributed correctness, including Cosmos3 transfer, v2v quality, regression fixes, I2V conditioning transfer optimization, skipped unused I2V conditioning latents, HSDP regional compile, sequence-parallel sound latent padding, and Cosmos3 L2 serving tests. (#4379, #4514, #4627, #4467, #4614, #4485, #4678, #4535)
  • Improved HunyuanImage3 performance and correctness with DiT grouped step batching, AR RGB conversion alignment, sync removal, timestep scalar-sync avoidance, prefetch KV, and streaming CoT display for AR generation. (#4041, #4502, #4401, #4363, #4448, #4148)
  • Advanced DreamZero and BAGEL generation paths with DreamZero TP and cross-attention cache fixes, CUDA Graph/torch.compile/DiT caching support, BAGEL denoising schedule fixes, reimplemented batched CFG forward, and CFG-parallel mixin reuse. (#4154, #4213, #4509, #4098, #4768)
  • Improved Wan and video-serving behavior with Wan2.2 graph-break and CacheDiT/Ulysses fixes, Wan2.2-VACE-Fun cache and lifecycle fixes, Wan VAE spatially sharded decode, Wan S2V SP/HSDP support, and streaming diffusion video output. (#4053, #3927, #4667, #4620, #4276, #4458, #3737)
  • Expanded image-generation serving and correctness with file-style image response_format, Qwen-Image RoPE and edit-path performance fixes, GLM-Image CFG-parallel dtype fixes, MammothModa2 text-to-image fixes, and SenseNova-U1 CFG parallel, fused RMSNorm+3D RoPE, and TeaCache support. (#1673, #4474, #4293, #3956, #4299, #4188, #4669, #4164)

Quantization & Memory Efficiency

  • Added Qwen3-Omni NVFP4 W4A4 serving on Blackwell and fixed Thinker lm_head prefix handling so NVFP4 exclude lists are honored. (#4025, #4528)
  • Expanded diffusion/image quantization coverage with Qwen-Image AutoRound W4A16, Qwen2.5-Omni AutoRound loading fixes, HSDP plus FP8 online quantization compatibility, and Cosmos3-Nano/Super online FP8 validation docs. (#3588, #4781, #4494, #4393, #4584)
  • Improved cache and memory behavior by simplifying CacheDiT integration, fixing DFlash prefix-cache corruption, avoiding per-step blocking writes in OmniTensorPrefix, adding Bagel memory metrics, and using module discovery for nested HSDP DiT sharding. (#2527, #4449, #4106, #4477, #3456)

RL, Serving & Integrations

  • Added DROID policy server support for Cosmos3 OpenPI and GR00T-N1.7 pipeline support with OpenPI serving, extending robot-policy serving coverage in this release line. (#4282, #3798)
  • Improved frontend and OpenAI-compatible serving behavior with configurable video storage backends and TTL, correct HTTP status codes for audio voice endpoints, vLLM-aligned Omni request signatures, and fixes for multimodal generation-stage outputs. (#2531, #3969, #4568, #4579)
  • Added async output materialization and streaming diffusion-video output so artifact-producing requests can be surfaced through more flexible serving flows. (#4476, #3737)
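
To make the idea of a storage backend with a TTL concrete, here is a minimal in-memory sketch that expires entries lazily on read. It is not the backend interface added in #2531: the class name `InMemoryTTLStore`, its `put`/`get` methods, and the injectable clock are assumptions chosen for this example.

```python
import time


class InMemoryTTLStore:
    """Hypothetical artifact store whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._items = {}

    def put(self, key, data):
        # Record the absolute expiry time alongside the payload.
        self._items[key] = (self._clock() + self._ttl, data)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, data = entry
        if self._clock() >= expires_at:
            # Lazy eviction: drop the entry the first time it is read after expiry.
            del self._items[key]
            return None
        return data


if __name__ == "__main__":
    store = InMemoryTTLStore(ttl_seconds=3600)
    store.put("video-123", b"...encoded video bytes...")
    print(store.get("video-123") is not None)
```

A production backend would typically also evict in the background and persist to disk or object storage; lazy eviction keeps the sketch short.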

Platforms, Distributed Execution & Hardware Coverage

  • Improved Ascend NPU coverage with model-runner unpacking fixes, OmniMRotaryEmbedding support, diffusion-worker Ascend config initialization, and custom-op registration, and restored pre-#9572 graph behavior by capping the NPU cudagraph_mode at PIECEWISE. (#4454, #3609, #4386, #4712, #4674)
  • Improved ROCm, CUDA, and XPU stability through Voxtral and diffusion test fixes, XPU marker fixes, and replacement of hardcoded CUDA device selection with platform-agnostic APIs. (#4380, #4574, #4451, #4496)
  • Expanded distributed diffusion paths with HSDP/SP-related fixes and features across Cosmos3, Wan S2V, Wan VAE, and nested DiT sharding. (#4452, #4678, #4276, #4458, #4620, #3456)

CI, Benchmarks & Documentation

  • Strengthened rebase, ready/merge, and nightly coverage with full E2E rebase pipelines, fixed pipeline upload behavior, nightly L2/L3 E2E groups, split diffusion X2V tests, Cosmos3 L2 serving tests, and reduced ready-CI duration. (#4478, #4532, #4693, #4734, #4744, #4535, #4354)
  • Updated performance baselines and reliability coverage for VoxCPM, Qwen3-TTS, HunyuanImage3, VoxCPM2, and related perf JSONs. (#4289, #4521, #4600)
  • Improved contributor tooling and docs with a vLLM-Omni test-agent skill, code-quality guidance for the PR precheck skill, bug-report field IDs, CUDA custom Docker build guidance, and refreshed model recipes. (#4434, #4697, #4605, #1386, #4567)

What's Changed

  • [BugFix] Fix HunyuanImage3 size match issue by @Semmer2 in #4416
  • [Perf] Restore parallel stage initialization for AR+DiT pipelines by @zengchuang-hw in #3641
  • [Doc] validate Cosmos3-Nano online FP8 by @leohuang257 in #4393
  • [Perf] change omniconnector info log to debug log to avoid mass printing by @R2-Y in #4426
  • Update WeChat QR code by @david6666666 in #4431
  • [Bugfix] Qwen3-TTS: trim reference audio in no_async_chunk voice clone by @linyueqian in #4429
  • [Bugfix] Fix LTX-2.3 tensor-parallel gated attention by @mglyn in #4439
  • [Feature] Add HunyuanImage3 DiT grouped step batching by @TaffyOfficial in #4041
  • Support DROID policy server for Cosmos3 OpenPI by @yuzhudong in #4282
  • [CI/Build] Skip L2/L3 CI for pytest skip-mark changes by @yenuo26 in #4422
  • [Refactor] Extract diffusion output formatting boundary by @hsliuustc0106 in #4407
  • [Perf][HunyuanImage] remove sync logic by @Bounty-hunter in #4401
  • [XPU][CI] Fix ERROR ...test_glm_image_sp.py - Failed: 'sp' not found in markers by @xuechendi in #4451
  • [Enhancement] Introduce High-Performance MoT (Mixture-of-Tokens) Kernels: Triton Implementation by @princepride in #3960
  • [Bugfix] Cosmos3: fix HSDP dtype mismatch on time_embedder via _hsdp_ignored_modules by @KwokhoTsui in #4452
  • [BugFix] higgs-audio-v3: load codec lazily under load_format=dummy by @yuekaizhang in #4368
  • [Bugfix] Fix multi-replica stage identity and clean up redundant deploy config by @ZhengWG in #4410
  • fix(models): remove invalid sampling_metadata parameter from logits_processor by @akshatvishu in #4456
  • [Bugfix] [CI] [ROCm] [CUDA] Fix voxtral test by @tjtanaa in #4380
  • [CI][Rebase]: run full e2e suite in rebase Ready/Merge pipelines in the rebase pipeline[skip-ci] by @tzhouam in #4478
  • [BugFix][NPU]: Fix npu model runner too many values to unpack ($[1][0][4]) by @FayeSpica in #4454
  • [BugFix][Qwen-Image] align txt_seq_lens RoPE width with padded embeds by @SamitHuang in #4474
  • [Bugfix] Fix DFlash prefix cache corruption due to missing lookahead block by @NickCao in #4449
  • [SDXL] SDXL model enabling by @xuechendi in #4331
  • [Perf][qwen-image-edit] Skip attention-mask to avoid varlen path by @kTorp in #4293
  • [fix bug] add default limit for image generation by @LJH-LBJ in #3381
  • [Doc] Add guides for custom docker image build on NVIDIA CUDA [Skip-CI] by @loveysuby in #1386
  • [BugFix] Fix the issue of incorrect generated tokens by @amy-why-3459 in #4461
  • [CI] Update perf json baselines by @congw729 in #4289
  • [Bugfix] Fix missing memory metrics info for bagel (ar->dit diffusion) by @natureofnature in #4477
  • [Refactor] Omni Stage Runtime and Distributed Replica Control Plane by @yinpeiqi in #3855
  • [Wan2.2] Fix "uninitialized nn.Module of type RotaryEmbeddingWan" graph-break by @kTorp in #4053
  • [Perf][PrefixCache] Avoid per-step blocking write in OmniTensorPrefix… by @LHXuuu in #4106
  • [Doc] Qwen image 2512 recipe by @AbelSara in #4236
  • [bugfix] Shim PreTrainedModel._tp_plan to fix MOSS-TTS-Nano load crash by @akshatvishu in #4398
  • [Docs] Fix hallucinated stage CLI flags in documentation and comments by @akshatvishu in #4512
  • [Bugfix] Wire log_stats to AsyncOmni and add missing token metrics for non-text requests by @blondeCS in #4482
  • [Refactor] Migrate existing pipelines to use SupportsComponentDiscovery for offload discovery by @NickCao in #3076
  • Refactor: extract OmniStreamingVideoHandler base and QwenOmniStreamingVideoHandler by @NumberWan in #4424
  • [DEBUG]Add has_preprocess branch to _dummy_run for models with preprocess stage by @Wallbreazzz in #4189
  • [Quantization][Qwen3-Omni] Enable NVFP4 W4A4 serving on Blackwell by @YIHONG-JIN in #4025
  • [BugFix]: Fix Bagel generate_image denoising schedule by @princepride in #4509
  • Remove unregister_vllm_metrics no-op patch by @vraiti in #4396
  • [Perf]Enable regional compile for Cosmos3 HSDP blocks by @bjf-frz in #4485
  • Update VoxCPM and Qwen3TTS perf baseline. by @congw729 in #4521
  • Add NPU support for OmniMRotaryEmbedding by @Wallbreazzz in #3609
  • [SKILLS] Add vllm-omni-test agent skill for CI-aligned test generation by @yenuo26 in #4434
  • [Refactor]Base class for dit pipelines with unified parameter declaration[1/N] by @princepride in #4225
  • [Doc] Qwen3-TTS: add 0.6B on 1x RTX 4090 and fix broken links by @zeningc in #4026
  • [CI] skip ci for issue 4537 by @zhumingjue138 in #4538
  • [Feat] DreamZero fix tp & cross attn cache by @AbelSara in #4154
  • [Bugfix] MOSS-TTS: fix cross-request audio corruption under batching by @yangyonggit in #4415
  • [Doc] Add 1x RTX A6000 48GB section to Voxtral TTS recipe (#2645) by @yangyonggit in #4051
  • [Cleanup] Remove duplicate HiDreamI1ImagePipeline (fixes #4009) by @yangyonggit in #4045
  • [Perf] Stacked audio ops: batched matmul and embedding gather in MOSS-TTS talker by @yangyonggit in #4230
  • [Perf][DreamZero]dreamzero supports CUDA graph, torch.compile, and DiT Caching. by @amy-why-3459 in #4213
  • [Chore][TTS] Reference recipes.vllm.ai recipes, remove dead code, and fix stale references by @linyueqian in #4567
  • [Bugfix] Fix HSDP + FP8 online quantization compatibility by @KwokhoTsui in #4494
  • [Frontend] Configurable video storage backends with TTL by @ieaves in #2531
  • [Cache Refactor 1/N] Simplify CacheDiT Integration by @alex-jw-brooks in #2527
  • [ROCm] [CI] Fix diffusion test by @tjtanaa in #4574
  • [Diffusion] add GR00T-N1.7 pipeline with OpenPI serving by @timzsu in #3798
  • [Bugfix] Align vLLM Omni Request input signature with vLLM. by @timzsu in #4568
  • [Bugfix][Qwen3-Omni] Fix Code2Wav CUDA-graph output length surplus (#4466) by @linyueqian in #4525
  • Cosmos3 transfer by @MaciejBalaNV in #4379
  • [Feature]Migrate helios example by @princepride in #4569
  • [Feature] JoyAI-VL-Interaction streaming interaction serving layer by @lishunyang12 in #4575
  • [BugFix]Fix diffusion profile RPC None results by @yixiaoer in #4372
  • [Model] Aura support - Non async chunk path by @R2-Y in #4257
  • Improve Cosmos3 v2v quality by @MaciejBalaNV in #4514
  • [Perf/Fix] Reimplement Batched CFG Forward for Bagel by @alex-jw-brooks in #4098
  • [Bugfix] Fix Fish Speech kvcache attention KV cache unpacking for vLLM 0.23.0 by @nagelanping in #4428
  • [Cleanup] Remove dead/stale modules with no repo references (fixes #4009) by @yangyonggit in #4044
  • [Perf][Cosmos3] Optimize I2V conditioning transfer (8.63% faster x-inference) by @david6666666 in #4467
  • [chore][skills]: tightens the add-diffusion-model skill under .claude/skills to better match the current vllm-omni codebase and repo conventions by @RuixiangMa in #4110
  • [Doc] validate Cosmos3-Super online FP8 by @KwokhoTsui in #4584
  • Update WeChat group QR code by @david6666666 in #4601
  • [Bugfix][Qwen3-Omni] Pass Thinker lm_head prefix so NVFP4 exclude-list is honored by @YIHONG-JIN in #4528
  • [Feat] Streaming TTS word-level timestamps via forced aligner by @wjinxu in #4034
  • Add SSE stream_format support for audio speech streaming by @syd520zy in #4490
  • [Feature]: Migrate magi human example by @princepride in #4572
  • [Perf]perf: skip unused Cosmos3 I2V conditioning latents by @bjf-frz in #4614
  • [CI][NPU] Update VLLM ASCEND IMAGE by @FayeSpica in #4602
  • [CI][Rebase] Fix Ready/Merge pipeline upload by stripping source_file_dependencies by @tzhouam in #4532
  • [Feature] Support Async Omni output materialization by @fake0fan in #4476
  • [Bugfix][HunyuanImage3] Align AR RGB conversion with official semantics by @TaffyOfficial in #4502
  • [Model] Add IndexTTS2 text-to-speech support by @BeatSeat in #3838
  • fix: return proper HTTP status codes for audio voice endpoints by @ieaves in #3969
  • [NFC] Update misleading seed-tts-eval optional dependency guidance by @yuanheng-zhao in #4619
  • Qwen3-TTS full-payload: emit empty-finished payload on degenerate take instead of dropping (fixes Stage-1 300s hang) by @henryj in #4465
  • [Doc] Stability AI/Stable-Audio-Open by @Ronnie-Rui in #3664
  • [BugFix] Use ModuleDiscovery for HSDP DiT sharding to support nested pipelines by @NickCao in #3456
  • [NPU] Initialize Ascend config for diffusion workers by @Fishermanykx in #4386
  • [Refactor] Replace hardcoded CUDA device selection with platform-agnostic API by @NickCao in #4496
  • [Refactor] Migrate cosmos3 example to standard task example + model_extra by @leohuang257 in #4581
  • [Bugfix] Fix make_request_output TypeError on multimodal generation-stage outputs by @QianCyrus in #4579
  • [feat] Streaming diffusion video generation output by @fhfuih in #3737
  • Add field IDs to bug report issue template by @yenuo26 in #4605
  • [Fix] JoyVL serving: align with reference engine (bounded long-term memory, timestamps, max_pixels) + recipe updates by @lishunyang12 in #4623
  • Fixed Cosmos3 regression by @MaciejBalaNV in #4627
  • [Feature] add a new response_format is file type when image generate by @lengrongfu in #1673
  • [BugFix] Delete unnecessary log printing by @amy-why-3459 in #4566
  • [Feat] Support Pipeline Resolution Based on HF Config & Make Pipeline Registry Eager by @alex-jw-brooks in #3760
  • [perf] avoid HunyuanImage3 timestep scalar sync by @TaffyOfficial in #4363
  • [Model] add soulx-singer support by @MrDongsls in #3862
  • [S2V] Add sequence parallelism support to WanS2VTransformer3DModel by @xuechendi in #4276
  • [refactor] Flatten Ming-flash-omni-2.0 imagegen args by @yuanheng-zhao in #4587
  • [Bugfix] [AURA] Correct aura results when use qwen3-tts in custom voice mode by @R2-Y in #4650
  • [Refactor] Migrate VACE example to standard task examples + model_extras by @ForestWisdom in #4648
  • [Feat] Enable streaming CoT display for HunyuanImage-3.0 AR generation by @zengchuang-hw in #4148
  • [Bugfix] Use MediaConnector for image/video URL fetching to prevent SSRF by @NickCao in #2565
  • [Test] Skip MOSS TTS offline E2E tests (issue #4700) by @yenuo26 in #4701
  • [Bugfix][CI]Fix HSDP regional compile boundary by @bjf-frz in #4668
  • [BugFix] Fix Wan2.2-VACE-Fun cache and serving lifecycle by @david6666666 in #4667
  • [Test] add reliability test case for HunyuanImage-3.0-Instruct and reliability/stability test for VoxCPM2 by @zhumingjue138 in #4600
  • [BugFix] Fix MammothModa2 Text-to-Image Bug by @menjiantong in #4299
  • [Bugfix] Fix CFG-parallel broadcast dtype mismatch in GLM-Image by @dsocek in #3956
  • [BugFix] Remove prefix-cache test cases by @amy-why-3459 in #4703
  • [Bugfix]Unified import of PretrainedConfig by @R2-Y in #4692
  • [CI] Run L2/L3 E2E group on main nightly builds by @yenuo26 in #4693
  • [Model] Support MOSS-TTS-local-v1.5 by @gcanlin in #4664
  • [Config] Add structured VllmOmniConfig classes by @Acerak01-fy in #4425
  • [Bugfix][NPU] Register vllm-ascend custom ops in NPUOmniPlatform.set_device by @FayeSpica in #4712
  • [Core] hunyuan-image prefetch kv by @AbelSara in #4448
  • [Model] Add Ming-omni-tts MoE 16.8B-A3B + CFM CUDAGraph by @LHXuuu in #4341
  • [skip ci][Misc] LVSA showcase (training-free block-sparse attention) by @gglorian in #4192
  • Wan2.2: Fix the bug of using cache-dit with ulysses in t2v and i2v by @mengker33 in #3927
  • [Bugfix] Add back resolving from deploy config written pipeline in StageConfigFactory by @yuanheng-zhao in #4729
  • [Bugfix] Stream Qwen3-TTS WebSocket input as one request by @Sy0307 in #4731
  • [Doc] Add code-quality dimension to precheck-pr skill by @hsliuustc0106 in #4697
  • [CI][Bugfix] Fix nightly L2/L3 E2E-only upload and Buildkite EXTRA escaping by @yenuo26 in #4734
  • Revert "[Bugfix] Use MediaConnector for image/video URL fetching to prevent SSRF" by @Gaohan123 in #4751
  • [Bugfix] Fix /v1/audio/speech usage token accounting for Qwen3-TTS (#4646) by @linyueqian in #4673
  • Updated Cosmos3 docstrings by @MaciejBalaNV in #4727
  • [Feature] Spatially-sharded (SP) decode for the Wan VAE by @rahul-steiger-nv in #4620
  • [Test] Skip Qwen3-TTS batch E2E tests by @yenuo26 in #4758
  • [Test] Un-skip Qwen3-TTS batch E2E; match documented omit-null response shape (#4757) by @linyueqian in #4759
  • [Bugfix] CosyVoice3: wrap ref_text in instruction template (#4644) by @linyueqian in #4756
  • [WAN_S2V] Enable HSDP for wan s2v by @xuechendi in #4458
  • [CI][Bugfix]Split nightly Diffusion X2V function tests by T2V/I2V to fix timeout by @yenuo26 in #4744
  • [Refactor] Reuse CFGParallelMixin in Bagel for CFG-parallel denoising by @suyanli220 in #4768
  • [CI]Add Cosmos3 L2 serving tests by @bjf-frz in #4535
  • [Core] Remove dead custom_process_input_func hooks in stage input processors by @Nughm3 in #4531
  • Update WeChat QR code asset by @david6666666 in #4778
  • [CI] Reducing mithril-h100-pool in ready&merge CI, optimize ready CI duration by @yenuo26 in #4354
  • Fix Qwen2.5-Omni AutoRound loading by @ayaka14732 in #4781
  • [BugFix] Restore pre-#9572 NPU graph behavior (cap cudagraph_mode to PIECEWISE) by @FayeSpica in #4674
  • [Core] Performance fix for orchestrator bottleneck: separates inter-stage from client outputs by @yinpeiqi in #4527
  • [Core][Frontend] Support request-level batching for diffusion pipelines by @yJader in #4079
  • Make speech streaming default to SSE by @syd520zy in #4679
  • [Model] Migrate MammothModa2-Preview example to standard task example + model_extras by @moisf56 in #4691
  • Migrate audiox examples by @zzehli in #4607
  • [Feat] Support CFG-Parallel For Sensenova-U1-8B-MoT by @sphinxkkkbc in #4188
  • [Quant][AutoRound] Add W4A16 support for Qwen-Image by @yiliu30 in #3588
  • [Perf]: Fused RMSNorm+3D Rope kernel in Sensenova-U1 by @sphinxkkkbc in #4669
  • [BugFix][Cosmos3] Pad sound latents so video+sound runs under sequence parallelism by @lishunyang12 in #4678
  • [Feat] Support teacache for SenseNova-U1 by @fywc in #4164
  • [Rebase] Rebase to vllm v0.24.0 by @tzhouam in #4709

New Contributors

Full Changelog: v0.23.0rc1...v0.24.0rc1