v0.21.0
This release brings several new features and improvements for vLLM TPU Inference.
Highlights
Features
- RPA Kernel Heuristic: Adds a heuristic for selecting optimal block sizes for RPA v3. Provides better performance out-of-the-box for all GQA-based models. (#2282)
Models
- Gemma 4:
  - Add initial support for Gemma 4 31B and 26B-A4.
  - Add support for Gemma’s new MTP drafter.
- Kimi-K2.6 (text-only): Add initial support for the newest Kimi K2.x models.
What's Changed
- Revert "Add model loading test for int4 MoE models (#2424)" by @richardsliu in #2480
- [Spec decoding] Hardcoded env variables to fix spec decoding e2e tests. by @Lumosis in #2476
- [Gemma4] Fix for the updated interface (due to #2343) by @lk-chen in #2483
- [Fix][Jax] Qwen3 perf benchmarks: implement named_modules() on JaxModule by @QiliangCui in #2487
- Revert changes related to disabling weight tracking by @richardsliu in #2488
- Unjit unpack_array by @pv97 in #2492
- [CI] Rename host scale descriptions to Small and Large scale queue by @boe20211 in #2491
- [CI] Handle upload failures and skip dependent matrix steps by @boe20211 in #2445
- [FIX][TPU Offloading] avoid recompile when MODEL_IMPL_TYPE=vllm by @juncgu-google in #2484
- [CI] Enhance model pipeline generation and fix Python compatibility by @meiyeh123 in #2455
- Bump RunAI model streamer to 0.15.9 to pick up GRPC GCS client support by @amacaskill in #2496
- Revert "Add env variable for overriding rpa block sizes" by @kyuyeunk in #2490
- Enable bf16 SSM state in fused GDN kernels by default on TPU by @qizzzh in #2482
- Add bm test case to ci pipeline by @CienetStingLin in #2494
- Skip post-rank-1 matmul in fused GDN kernels via algebraic identity by @qizzzh in #2498
- Support StepPool and Qwen3-Embedding (8B) for long-sequence workloads by @anthonsu in #2420
- [Spec decoding] Fix flaky test by @gxd3 in #2511
- add qwen3.5 397b prefill benchmark by @yiw-wang in #2512
- Populate support matrices for v0.19.0 by @jcyang43 in #2514
- [Spec decoding] Fix e2e tests to make MODEL_IMPL_TYPE and DRAFT_MODEL_IMPL_TYPE consistent. by @Lumosis in #2515
- update docs for v0.19.0 by @jcyang43 in #2516
- [CI] Enhance pipeline metadata validation logic by @meiyeh123 in #2389
- [FIX][TPU Offload] rm dummy code by @juncgu-google in #2520
- [MTP] Fix conv_state shape handling in gdn_attention_core_tpu for MTP by @Lumosis in #2522
- [FP8] Resolving HBM spike by forcing loading on CPU before processing by @aashishrampal-lab in #2413
- [TPU KV Offload][Feat] Option to override host memory kind by @sangam-jindal in #2426
- Prevent double patch for routed_expert IDs by @pv97 in #2519
- [Multimodal] Refactor multimodal_manager and embed_input_ids to align with vLLM by @kwang3939 in #2504
- Pipeline psum and sc kernel by @clee1994 in #2083
- [Batched RPA] Fix "Reshape should have supported layout" during Gemma4 inference by @lk-chen in #2506
- Fix crash in phased profiling due to composition stats by @helloworld1 in #2528
- Add benchmark case type for separate case folder by type by @CienetStingLin in #2495
- Fix the potential duplicate group key and step key issue in the Benchmark Buildkite pipeline by @CienetStingLin in #2428
- Fix upstream vllm break for MLA attention by @richardsliu in #2532
- [Spec decoding] make the mtp drafter propose() as 1 jited fn by @gxd3 in #2533
- Profiler: Automatically merge multi-host JAX profiling subdirectories… by @junyanxu in #2521
- [CI] refactor: replace pipeline name with CI_TARGET in Buildkite and remove pipeline-type by @boe20211 in #2446
- [Qwen3.5] Using approx top k for expert selection by @wyzhang in #2518
- [CI] Upload the yml based on MODEL_IMPL_TYPE by @meiyeh123 in #2456
- [Spec decoding] make drafter's prepare_inputs() a jited fn by @gxd3 in #2535
- [CI] Group benchmark cases and resolve Buildkite key length limits by @boe20211 in #2540
- [FIX][KV Cache Offloading] Fix async-scheduling placeholder token issue by @juncgu-google in #2449
- Fix security bug in daily_run_disagg script by @richardsliu in #2544
- Add a check for benchmark file changes by @CienetStingLin in #2524
- [Gemma4] bump timeout in nightly performance test by @lk-chen in #2545
- [MTP] Support speculative tokens in BlockTable and fix shared layer check by @Lumosis in #2536
- Parameterize sql for nightly_benchmark.sh by @jcyang43 in #2548
- Escape single quotes for sql in parse_gke_results.sh by @jcyang43 in #2549
- [Bench] Add Gemma-4 to benchmark by @lk-chen in #2534
- [GMM_V2] account for full VMEM footprint in calculate_tiling by @AahilA in #2550
- [benchmark] Fix Total input tokens reporting for chat/multimodal datasets by @lk-chen in #2526
- Gdn scan kernel by @coolkp in #2432
- [FP8 TorchAX Quantization] Add support for bypassing requantization for FP8 2D pre-quantized checkpoints by @jrplatin in #2405
- [Model Loading] Make Qwen3.5 default to TorchAX path by @jrplatin in #2559
- Nightly CI Fix: Stabilize Qwen3-Embedding with Sharding-Aware Pre-warming and E2E test Initialization by @anthonsu in #2554
- Implement SHA256 integrity verification and automatic Pickle-to-Parquet conversion for MLPerf dataset by @boe20211 in #2555
- Fix SQL security issues in report_result.sh by @CienetStingLin in #2557
- Remove expert routing monkey patch and return routed_experts_dict from TPU runner by @weiyu0824 in #2552
- Implement Prometheus metrics collection for TPU KV connector by @richardsliu in #2562
- Update JAX to 0.10.0 by @kyuyeunk in #2291
- [MoE] Extend MOE_REQUANTIZE_WEIGHT_DTYPE for unquantized checkpoint by @lk-chen in #2509
- Add tpu field into tables to track tpu config that kernel tuning uses by @patrickji2014 in #2477
- Fix XLA Compilation warning by @kyuyeunk in #2402
- Replace empty with zeros by @kyuyeunk in #2556
- Replace deprecated pltpu by @kyuyeunk in #2558
- Use ragged-gather-reduce in MOE layer by @gxd3 in #2564
- Update owners and add gxd3 by @kyuyeunk in #2565
- Use jnp.tile instead of the deprecated pltpu.repeat by @QiliangCui2023 in #2566
- [MOE] Split w1/w3 before padding for quantization by @AahilA in #2560
- Ignore benchmark bk pipeline generation error by @CienetStingLin in #2575
- Add README content for BM_CASE_TYPE by @CienetStingLin in #2539
- Move Qwen3.5_397B_prefill.json to daily/ dir by @meiyeh123 in #2578
- [CI] Correct regex for Buildkite pipeline keys by @meiyeh123 in #2576
- Fix upstream vllm integration break for KVCacheConfig by @richardsliu in #2580
- [Multimodal] [Torchax] Jit wrap vision encoder for Qwen3-VL by @muskansh-google in #2561
- generate the dataset to do decode only benchmark by @yiw-wang in #2508
- update qwen3.5 397B prefill bm setup by @yiw-wang in #2582
- recurrent_scan_v2: fix has_initial_state and Mosaic tile alignment by @qizzzh in #2574
- Unify prepare_inputs_dp and prepare_inputs_non_dp by @wenxindongwork in #2583
- jitted _select_kv_caches_jit for qwen 3.5 by @caojx-google in #2573
- Fix upstream vllm integration break in MLAAttention by @weiyu0824 in #2586
- [Gemma4] Fix K/V_proj sharding by @lk-chen in #2585
- [DSv3] Reduce MLA copy latency with custom call transposes by @gpolovets1 in #2551
- Add support for mistral 3 large by @karan in #2422
- [Fix nightly] Fix nightly DP performance regression by @wenxindongwork in #2598
- [MLA] Add option to bypass Q activation quantization by @jrplatin in #2593
- Ignore intentional exit for benchmark error logging by @yiw-wang in #2599
- [spec decoding] drafter input preparation logic no need to retrieve sampled tokens from the main model to host first by @gxd3 in #2581
- Add Hierarchical Reduce-Scatter kernel by @dawnhan1111 in #2500
- [runner] kv_cache_manager: handle None text_config attributes by @QiliangCui in #2602
- NVFP4 dequant-in-VMEM by @QiliangCui in #2503
- [Benchmark] Fix dtype in Gemma4 nightly run by @lk-chen in #2594
- Add DEV_MODE to streamline the development process by @theminghuang in #2563
- Fix upstream vllm break during image building by @richardsliu in #2595
- Add support for bucketized req size for attention metadata by @helloworld1 in #2513
- [CI] Skip DB reporting on benchmark failure by @meiyeh123 in #2579
- [DSv3] Update accuracy coverage in support matrix by @gpolovets1 in #2616
- [Perf] flax_nnx: pre-flatten nnx.State to skip per-dispatch _variable_flatten by @lk-chen in #2615
- [MLA] backfill unit test for MLA v2 by @gxd3 in #2618
- [kernel] RPA v3: add update_kv_cache=False for KV-shared layers by @QiliangCui in #2601
- Remove Attention page size override for hybrid models by @pritha90 in #2627
- Support Yield Kernel Tuning Job to Higher Priority Jobs by @patrickji2014 in #2587
- fix the flaky single host P/D issue by @mrjunwan-lang in #2629
- Fix routed_expert performance by @pv97 in #2626
- [CI] Fix test_multi_modal_inference.py by @ShobhitBehl in #2630
- [Clean Up] Remove unused tpu_int8 quant method by @jrplatin in #2637
- Disable d2h by default and remove sudo as it does not exist in buildkite by @mrjunwan-lang in #2636
- [Spec decoding] make spec decoding compatible with async scheduling by @gxd3 in #2608
- Enable correct DP attention for hybrid attention+mamba models by @qizzzh in #2577
- Revert "[kernel] RPA v3: add update_kv_cache=False for KV-shared layers (#2601)" by @QiliangCui2023 in #2628
- Bump vLLM LKG by @ShobhitBehl in #2638
- [NVFP4] Add NVFP4 for non-standard config models by @jrplatin in #2640
- Cherry-pick: Update installation to specify --no-build-isolation (#2624) by @weiyu0824 in #2658
- Cherry-pick: Revert "Update JAX to 0.10.0" (#2648) by @lk-chen in #2655
- [releases/v0.21.0] Cherry-pick #2686: Add EP to Kimi-K2.6 unit test by @QiliangCui2023 in #2715
- [releases/v0.21.0][CI] Bump gemma-4-31B-it perf benchmark startup timeout to 1800s by @QiliangCui2023 in #2722
New Contributors
- @aashishrampal-lab made their first contribution in #2413
- @sangam-jindal made their first contribution in #2426
- @caojx-google made their first contribution in #2573
- @theminghuang made their first contribution in #2563
Full Changelog: v0.20.0...v0.21.0