Release v0.21.0 · vllm-project/tpu-inference

This release brings several new features and improvements for vLLM TPU Inference.

PyPI
Docker

Highlights

Features

RPA Kernel Heuristic: Adds a heuristic for selecting optimal block sizes for RPA v3. Provides better performance out-of-the-box for all GQA-based models. (#2282)
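
To give a feel for what a block-size heuristic of this kind might do, here is a minimal sketch. It is not the tpu-inference implementation: the function `pick_block_sizes`, the `BlockSizes` fields, the 16 MiB VMEM budget, the caps of 128, and the double-buffering factor are all illustrative assumptions. The idea shown is only the general one: start from large power-of-two blocks, scale the query block down as the GQA ratio grows, and shrink until the estimated on-chip footprint fits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockSizes:
    # Hypothetical field names, chosen for illustration only.
    num_kv_pages_per_block: int
    num_queries_per_block: int


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (and >= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def pick_block_sizes(
    num_q_heads: int,
    num_kv_heads: int,
    head_dim: int,
    page_size: int,
    max_model_len: int,
    vmem_budget_bytes: int = 16 * 1024 * 1024,  # assumed budget, not a real TPU spec
    dtype_bytes: int = 2,  # bf16
) -> BlockSizes:
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("GQA requires num_q_heads to be a multiple of num_kv_heads")
    q_per_kv = num_q_heads // num_kv_heads

    # Start with enough KV pages to cover the context, capped at an assumed 128.
    pages_needed = -(-max_model_len // page_size)  # ceiling division
    kv_pages = min(next_power_of_two(pages_needed), 128)

    # More query heads per KV head means more work per query, so use fewer queries.
    queries = max(8, 128 // next_power_of_two(q_per_kv))

    def footprint(kv_pages: int, queries: int) -> int:
        # K and V tiles plus the Q tile, doubled to model double buffering.
        kv_bytes = 2 * kv_pages * page_size * num_kv_heads * head_dim * dtype_bytes
        q_bytes = queries * num_q_heads * head_dim * dtype_bytes
        return 2 * (kv_bytes + q_bytes)

    # Halve the KV block first, then the query block, until the estimate fits.
    while footprint(kv_pages, queries) > vmem_budget_bytes and (kv_pages > 1 or queries > 8):
        if kv_pages > 1:
            kv_pages //= 2
        else:
            queries //= 2

    return BlockSizes(kv_pages, queries)


if __name__ == "__main__":
    print(pick_block_sizes(num_q_heads=32, num_kv_heads=8, head_dim=128,
                           page_size=16, max_model_len=4096))
```

A real heuristic would be tuned against measured kernel performance per TPU generation; the sketch only illustrates why a default of this kind can remove the need for manual block-size overrides.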

Models

Gemma 4:
- Add initial support for Gemma 4 31B and 26B-A4.
- Add support for Gemma’s new MTP drafter.
Kimi-K2.6 (text-only): Add initial support for the newest Kimi K2.x models.

What's Changed

Revert "Add model loading test for int4 MoE models (#2424)" by @richardsliu in #2480
[Spec decoding] Hardcoded env variables to fix spec decoding e2e tests. by @Lumosis in #2476
[Gemma4] Fix for the updated interface (due to #2343) by @lk-chen in #2483
[Fix][Jax] Qwen3 perf benchmarks: implement named_modules() on JaxModule by @QiliangCui in #2487
Revert changes related to disabling weight tracking by @richardsliu in #2488
Unjit unpack_array by @pv97 in #2492
[CI] Rename host scale descriptions to Small and Large scale queue by @boe20211 in #2491
[CI] Handle upload failures and skip dependent matrix steps by @boe20211 in #2445
[FIX][TPU Offloading] avoid recompile when MODEL_IMPL_TYPE=vllm by @juncgu-google in #2484
[CI] Enhance model pipeline generation and fix Python compatibility by @meiyeh123 in #2455
Bump RunAI model streamer to 0.15.9 to pick up GRPC GCS client support by @amacaskill in #2496
Revert "Add env variable for overriding rpa block sizes" by @kyuyeunk in #2490
Enable bf16 SSM state in fused GDN kernels by default on TPU by @qizzzh in #2482
Add bm test case to ci pipeline by @CienetStingLin in #2494
Skip post-rank-1 matmul in fused GDN kernels via algebraic identity by @qizzzh in #2498
Support StepPool and Qwen3-Embedding (8B) for long-sequence workloads by @anthonsu in #2420
[Spec decoding] Fix flaky test by @gxd3 in #2511
add qwen3.5 397b prefill benchmark by @yiw-wang in #2512
Populate support matrices for v0.19.0 by @jcyang43 in #2514
[Spec decoding] Fix e2e tests to make MODEL_IMPL_TYPE and DRAFT_MODEL_IMPL_TYPE consistent. by @Lumosis in #2515
update docs for v0.19.0 by @jcyang43 in #2516
[CI] Enhance pipeline metadata validation logic by @meiyeh123 in #2389
[FIX][TPU Offload] rm dummy code by @juncgu-google in #2520
[MTP] Fix conv_state shape handling in gdn_attention_core_tpu for MTP by @Lumosis in #2522
[FP8] Resolving HBM spike by forcing loading on CPU before processing by @aashishrampal-lab in #2413
[TPU KV Offload][Feat] Option to override host memory kind by @sangam-jindal in #2426
Prevent double patch for routed_expert IDs by @pv97 in #2519
[Multimodal] Refactor multimodal_manager and embed_input_ids to align with vLLM by @kwang3939 in #2504
Pipeline psum and sc kernel by @clee1994 in #2083
[Batched RPA] Fix "Reshape should have supported layout" during Gemma4 inference by @lk-chen in #2506
Fix crash in phased profiling due to composition stats by @helloworld1 in #2528
Add benchmark case type for separate case folder by type by @CienetStingLin in #2495
Fix the potential duplicate group key and step key issue in the Benchmark Buildkite pipeline by @CienetStingLin in #2428
Fix upstream vllm break for MLA attention by @richardsliu in #2532
[Spec decoding] make the mtp drafter propose() as 1 jited fn by @gxd3 in #2533
Profiler: Automatically merge multi-host JAX profiling subdirectories… by @junyanxu in #2521
[CI] refactor: replace pipeline name with CI_TARGET in Buildkite and remove pipeline-type by @boe20211 in #2446
[Qwen3.5] Using approx top k for expert selection by @wyzhang in #2518
[CI] Upload the yml based on MODEL_IMPL_TYPE by @meiyeh123 in #2456
[Spec decoding] make drafter's prepare_inputs() a jited fn by @gxd3 in #2535
[CI] Group benchmark cases and resolve Buildkite key length limits by @boe20211 in #2540
[FIX][KV Cache Offloading] Fix async-scheduling placeholder token issue by @juncgu-google in #2449
Fix security bug in daily_run_disagg script by @richardsliu in #2544
Add a check for benchmark file changes by @CienetStingLin in #2524
[Gemma4] bump timeout in nightly performance test by @lk-chen in #2545
[MTP] Support speculative tokens in BlockTable and fix shared layer check by @Lumosis in #2536
Parameterize sql for nightly_benchmark.sh by @jcyang43 in #2548
Escape single quotes for sql in parse_gke_results.sh by @jcyang43 in #2549
[Bench] Add Gemma-4 to benchmark by @lk-chen in #2534
[GMM_V2] account for full VMEM footprint in calculate_tiling by @AahilA in #2550
[benchmark] Fix Total input tokens reporting for chat/multimodal datasets by @lk-chen in #2526
Gdn scan kernel by @coolkp in #2432
[FP8 TorchAX Quantization] Add support for bypassing requantization for FP8 2D pre-quantized checkpoints by @jrplatin in #2405
[Model Loading] Make Qwen3.5 default to TorchAX path by @jrplatin in #2559
Nightly CI Fix: Stabilize Qwen3-Embedding with Sharding-Aware Pre-warming and E2E test Initialization by @anthonsu in #2554
Implement SHA256 integrity verification and automatic Pickle-to-Parquet conversion for MLPerf dataset by @boe20211 in #2555
Fix SQL security issues in report_result.sh by @CienetStingLin in #2557
Remove expert routing monkey patch and return routed_experts_dict from TPU runner by @weiyu0824 in #2552
Implement Prometheus metrics collection for TPU KV connector by @richardsliu in #2562
Update JAX to 0.10.0 by @kyuyeunk in #2291
[MoE] Extend MOE_REQUANTIZE_WEIGHT_DTYPE for unquantized checkpoint by @lk-chen in #2509
Add tpu field into tables to track tpu config that kernel tuning uses by @patrickji2014 in #2477
Fix XLA Compilation warning by @kyuyeunk in #2402
Replace empty with zeros by @kyuyeunk in #2556
Replace deprecated pltpu by @kyuyeunk in #2558
Use ragged-gather-reduce in MOE layer by @gxd3 in #2564
Update owners and add gxd3 by @kyuyeunk in #2565
Use jnp.tile instead of the deprecated pltpu.repeat by @QiliangCui2023 in #2566
[MOE] Split w1/w3 before padding for quantization by @AahilA in #2560
Ignore benchmark bk pipeline generation error by @CienetStingLin in #2575
Add README content for BM_CASE_TYPE by @CienetStingLin in #2539
Move Qwen3.5_397B_prefill.json to daily/ dir by @meiyeh123 in #2578
[CI] Correct regex for Buildkite pipeline keys by @meiyeh123 in #2576
Fix upstream vllm integration break for KVCacheConfig by @richardsliu in #2580
[Multimodal] [Torchax] Jit wrap vision encoder for Qwen3-VL by @muskansh-google in #2561
generate the dataset to do decode only benchmark by @yiw-wang in #2508
update qwen3.5 397B prefill bm setup by @yiw-wang in #2582
recurrent_scan_v2: fix has_initial_state and Mosaic tile alignment by @qizzzh in #2574
Unify prepare_inputs_dp and prepare_inputs_non_dp by @wenxindongwork in #2583
jitted _select_kv_caches_jit for qwen 3.5 by @caojx-google in #2573
Fix upstream vllm integration break in MLAAttention by @weiyu0824 in #2586
[Gemma4] Fix K/V_proj sharding by @lk-chen in #2585
[DSv3] Reduce MLA copy latency with custom call transposes by @gpolovets1 in #2551
Add support for mistral 3 large by @karan in #2422
[Fix nightly] Fix nightly DP performance regression by @wenxindongwork in #2598
[MLA] Add option to bypass Q activation quantization by @jrplatin in #2593
Ignore intentional exit for benchmark error logging by @yiw-wang in #2599
[spec decoding] drafter input preparation logic no need to retrieve sampled tokens from the main model to host first by @gxd3 in #2581
Add Hierarchical Reduce-Scatter kernel by @dawnhan1111 in #2500
[runner] kv_cache_manager: handle None text_config attributes by @QiliangCui in #2602
NVFP4 dequant-in-VMEM by @QiliangCui in #2503
[Benchmark] Fix dtype in Gemma4 nightly run by @lk-chen in #2594
Add DEV_MODE to streamline the development process by @theminghuang in #2563
Fix upstream vllm break during image building by @richardsliu in #2595
Add support for bucketized req size for attention metadata by @helloworld1 in #2513
[CI] Skip DB reporting on benchmark failure by @meiyeh123 in #2579
[DSv3] Update accuracy coverage in support matrix by @gpolovets1 in #2616
[Perf] flax_nnx: pre-flatten nnx.State to skip per-dispatch _variable_flatten by @lk-chen in #2615
[MLA] backfill unit test for MLA v2 by @gxd3 in #2618
[kernel] RPA v3: add update_kv_cache=False for KV-shared layers by @QiliangCui in #2601
Remove Attention page size override for hybrid models by @pritha90 in #2627
Support Yield Kernel Tuning Job to Higher Priority Jobs by @patrickji2014 in #2587
fix the flaky single host P/D issue by @mrjunwan-lang in #2629
Fix routed_expert performance by @pv97 in #2626
[CI] Fix test_multi_modal_inference.py by @ShobhitBehl in #2630
[Clean Up] Remove unused tpu_int8 quant method by @jrplatin in #2637
Disable d2h by default and remove sudo as it does not exist in buildkite by @mrjunwan-lang in #2636
[Spec decoding] make spec decoding compatible with async scheduling by @gxd3 in #2608
Enable correct DP attention for hybrid attention+mamba models by @qizzzh in #2577
Revert "[kernel] RPA v3: add update_kv_cache=False for KV-shared layers (#2601)" by @QiliangCui2023 in #2628
Bump vLLM LKG by @ShobhitBehl in #2638
[NVFP4] Add NVFP4 for non-standard config models by @jrplatin in #2640
Cherry-pick: Update installation to specify --no-build-isolation (#2624) by @weiyu0824 in #2658
Cherry-pick: Revert "Update JAX to 0.10.0" (#2648) by @lk-chen in #2655
[releases/v0.21.0] Cherry-pick #2686: Add EP to Kimi-K2.6 unit test by @QiliangCui2023 in #2715
[releases/v0.21.0][CI] Bump gemma-4-31B-it perf benchmark startup timeout to 1800s by @QiliangCui2023 in #2722

New Contributors

@aashishrampal-lab made their first contribution in #2413
@sangam-jindal made their first contribution in #2426
@caojx-google made their first contribution in #2573
@theminghuang made their first contribution in #2563

Full Changelog: v0.20.0...v0.21.0
