v0.21.0

@CienetStingLin released this 05 Jun 06:15
· 461 commits to main since this release

This release brings several new features and improvements for vLLM TPU Inference.

Highlights

Features

  • RPA Kernel Heuristic: Adds a heuristic for selecting optimal block sizes for RPA v3. Provides better performance out-of-the-box for all GQA-based models. (#2282)

Models

  • Gemma 4:
    • Add initial support for Gemma 4 31B and 26B-A4.
    • Add support for Gemma’s new MTP drafter.
  • Kimi-K2.6 (text-only): Add initial support for the newest Kimi K2.x models.

What's Changed

  • Revert "Add model loading test for int4 MoE models (#2424)" by @richardsliu in #2480
  • [Spec decoding] Hardcoded env variables to fix spec decoding e2e tests. by @Lumosis in #2476
  • [Gemma4] Fix for the updated interface (due to #2343) by @lk-chen in #2483
  • [Fix][Jax] Qwen3 perf benchmarks: implement named_modules() on JaxModule by @QiliangCui in #2487
  • Revert changes related to disabling weight tracking by @richardsliu in #2488
  • Unjit unpack_array by @pv97 in #2492
  • [CI] Rename host scale descriptions to Small and Large scale queue by @boe20211 in #2491
  • [CI] Handle upload failures and skip dependent matrix steps by @boe20211 in #2445
  • [FIX][TPU Offloading] avoid recompile when MODEL_IMPL_TYPE=vllm by @juncgu-google in #2484
  • [CI] Enhance model pipeline generation and fix Python compatibility by @meiyeh123 in #2455
  • Bump RunAI model streamer to 0.15.9 to pick up GRPC GCS client support by @amacaskill in #2496
  • Revert "Add env variable for overriding rpa block sizes" by @kyuyeunk in #2490
  • Enable bf16 SSM state in fused GDN kernels by default on TPU by @qizzzh in #2482
  • Add bm test case to ci pipeline by @CienetStingLin in #2494
  • Skip post-rank-1 matmul in fused GDN kernels via algebraic identity by @qizzzh in #2498
  • Support StepPool and Qwen3-Embedding (8B) for long-sequence workloads by @anthonsu in #2420
  • [Spec decoding] Fix flaky test by @gxd3 in #2511
  • add qwen3.5 397b prefill benchmark by @yiw-wang in #2512
  • Populate support matrices for v0.19.0 by @jcyang43 in #2514
  • [Spec decoding] Fix e2e tests to make MODEL_IMPL_TYPE and DRAFT_MODEL_IMPL_TYPE consistent. by @Lumosis in #2515
  • update docs for v0.19.0 by @jcyang43 in #2516
  • [CI] Enhance pipeline metadata validation logic by @meiyeh123 in #2389
  • [FIX][TPU Offload] rm dummy code by @juncgu-google in #2520
  • [MTP] Fix conv_state shape handling in gdn_attention_core_tpu for MTP by @Lumosis in #2522
  • [FP8] Resolving HBM spike by forcing loading on CPU before processing by @aashishrampal-lab in #2413
  • [TPU KV Offload][Feat] Option to override host memory kind by @sangam-jindal in #2426
  • Prevent double patch for routed_expert IDs by @pv97 in #2519
  • [Multimodal] Refactor multimodal_manager and embed_input_ids to align with vLLM by @kwang3939 in #2504
  • Pipeline psum and sc kernel by @clee1994 in #2083
  • [Batched RPA] Fix "Reshape should have supported layout" during Gemma4 inference by @lk-chen in #2506
  • Fix crash in phased profiling due to composition stats by @helloworld1 in #2528
  • Add benchmark case type for separate case folder by type by @CienetStingLin in #2495
  • Fix the potential duplicate group key and step key issue in the Benchmark Buildkite pipeline by @CienetStingLin in #2428
  • Fix upstream vllm break for MLA attention by @richardsliu in #2532
  • [Spec decoding] make the mtp drafter propose() as 1 jited fn by @gxd3 in #2533
  • Profiler: Automatically merge multi-host JAX profiling subdirectories… by @junyanxu in #2521
  • [CI] refactor: replace pipeline name with CI_TARGET in Buildkite and remove pipeline-type by @boe20211 in #2446
  • [Qwen3.5] Using approx top k for expert selection by @wyzhang in #2518
  • [CI] Upload the yml based on MODEL_IMPL_TYPE by @meiyeh123 in #2456
  • [Spec decoding] make drafter's prepare_inputs() a jited fn by @gxd3 in #2535
  • [CI] Group benchmark cases and resolve Buildkite key length limits by @boe20211 in #2540
  • [FIX][KV Cache Offloading] Fix async-scheduling placeholder token issue by @juncgu-google in #2449
  • Fix security bug in daily_run_disagg script by @richardsliu in #2544
  • Add a check for benchmark file changes by @CienetStingLin in #2524
  • [Gemma4] bump timeout in nightly performance test by @lk-chen in #2545
  • [MTP] Support speculative tokens in BlockTable and fix shared layer check by @Lumosis in #2536
  • Parameterize sql for nightly_benchmark.sh by @jcyang43 in #2548
  • Escape single quotes for sql in parse_gke_results.sh by @jcyang43 in #2549
  • [Bench] Add Gemma-4 to benchmark by @lk-chen in #2534
  • [GMM_V2] account for full VMEM footprint in calculate_tiling by @AahilA in #2550
  • [benchmark] Fix Total input tokens reporting for chat/multimodal datasets by @lk-chen in #2526
  • Gdn scan kernel by @coolkp in #2432
  • [FP8 TorchAX Quantization] Add support for bypassing requantization for FP8 2D pre-quantized checkpoints by @jrplatin in #2405
  • [Model Loading] Make Qwen3.5 default to TorchAX path by @jrplatin in #2559
  • Nightly CI Fix: Stabilize Qwen3-Embedding with Sharding-Aware Pre-warming and E2E test Initialization by @anthonsu in #2554
  • Implement SHA256 integrity verification and automatic Pickle-to-Parquet conversion for MLPerf dataset by @boe20211 in #2555
  • Fix SQL security issues in report_result.sh by @CienetStingLin in #2557
  • Remove expert routing monkey patch and return routed_experts_dict from TPU runner by @weiyu0824 in #2552
  • Implement Prometheus metrics collection for TPU KV connector by @richardsliu in #2562
  • Update JAX to 0.10.0 by @kyuyeunk in #2291
  • [MoE] Extend MOE_REQUANTIZE_WEIGHT_DTYPE for unquantized checkpoint by @lk-chen in #2509
  • Add tpu field into tables to track tpu config that kernel tuning uses by @patrickji2014 in #2477
  • Fix XLA Compilation warning by @kyuyeunk in #2402
  • Replace empty with zeros by @kyuyeunk in #2556
  • Replace deprecated pltpu by @kyuyeunk in #2558
  • Use ragged-gather-reduce in MOE layer by @gxd3 in #2564
  • Update owners and add gxd3 by @kyuyeunk in #2565
  • Use jnp.tile instead of the deprecated pltpu.repeat by @QiliangCui2023 in #2566
  • [MOE] Split w1/w3 before padding for quantization by @AahilA in #2560
  • Ignore benchmark bk pipeline generation error by @CienetStingLin in #2575
  • Add README content for BM_CASE_TYPE by @CienetStingLin in #2539
  • Move Qwen3.5_397B_prefill.json to daily/ dir by @meiyeh123 in #2578
  • [CI] Correct regex for Buildkite pipeline keys by @meiyeh123 in #2576
  • Fix upstream vllm integration break for KVCacheConfig by @richardsliu in #2580
  • [Multimodal] [Torchax] Jit wrap vision encoder for Qwen3-VL by @muskansh-google in #2561
  • generate the dataset to do decode only benchmark by @yiw-wang in #2508
  • update qwen3.5 397B prefill bm setup by @yiw-wang in #2582
  • recurrent_scan_v2: fix has_initial_state and Mosaic tile alignment by @qizzzh in #2574
  • Unify prepare_inputs_dp and prepare_inputs_non_dp by @wenxindongwork in #2583
  • jitted _select_kv_caches_jit for qwen 3.5 by @caojx-google in #2573
  • Fix upstream vllm integration break in MLAAttention by @weiyu0824 in #2586
  • [Gemma4] Fix K/V_proj sharding by @lk-chen in #2585
  • [DSv3] Reduce MLA copy latency with custom call transposes by @gpolovets1 in #2551
  • Add support for mistral 3 large by @karan in #2422
  • [Fix nightly] Fix nightly DP performance regression by @wenxindongwork in #2598
  • [MLA] Add option to bypass Q activation quantization by @jrplatin in #2593
  • Ignore intentional exit for benchmark error logging by @yiw-wang in #2599
  • [spec decoding] drafter input preparation logic no need to retrieve sampled tokens from the main model to host first by @gxd3 in #2581
  • Add Hierarchical Reduce-Scatter kernel by @dawnhan1111 in #2500
  • [runner] kv_cache_manager: handle None text_config attributes by @QiliangCui in #2602
  • NVFP4 dequant-in-VMEM by @QiliangCui in #2503
  • [Benchmark] Fix dtype in Gemma4 nightly run by @lk-chen in #2594
  • Add DEV_MODE to streamline the development process by @theminghuang in #2563
  • Fix upstream vllm break during image building by @richardsliu in #2595
  • Add support for bucketized req size for attention metadata by @helloworld1 in #2513
  • [CI] Skip DB reporting on benchmark failure by @meiyeh123 in #2579
  • [DSv3] Update accuracy coverage in support matrix by @gpolovets1 in #2616
  • [Perf] flax_nnx: pre-flatten nnx.State to skip per-dispatch _variable_flatten by @lk-chen in #2615
  • [MLA] backfill unit test for MLA v2 by @gxd3 in #2618
  • [kernel] RPA v3: add update_kv_cache=False for KV-shared layers by @QiliangCui in #2601
  • Remove Attention page size override for hybrid models by @pritha90 in #2627
  • Support Yield Kernel Tuning Job to Higher Priority Jobs by @patrickji2014 in #2587
  • fix the flaky single host P/D issue by @mrjunwan-lang in #2629
  • Fix routed_expert performance by @pv97 in #2626
  • [CI] Fix test_multi_modal_inference.py by @ShobhitBehl in #2630
  • [Clean Up] Remove unused tpu_int8 quant method by @jrplatin in #2637
  • Disable d2h by default and remove sudo as it does not exist in buildkite by @mrjunwan-lang in #2636
  • [Spec decoding] make spec decoding compatible with async scheduling by @gxd3 in #2608
  • Enable correct DP attention for hybrid attention+mamba models by @qizzzh in #2577
  • Revert "[kernel] RPA v3: add update_kv_cache=False for KV-shared layers (#2601)" by @QiliangCui2023 in #2628
  • Bump vLLM LKG by @ShobhitBehl in #2638
  • [NVFP4] Add NVFP4 for non-standard config models by @jrplatin in #2640
  • Cherry-pick: Update installation to specify --no-build-isolation (#2624) by @weiyu0824 in #2658
  • Cherry-pick: Revert "Update JAX to 0.10.0" (#2648) by @lk-chen in #2655
  • [releases/v0.21.0] Cherry-pick #2686: Add EP to Kimi-K2.6 unit test by @QiliangCui2023 in #2715
  • [releases/v0.21.0][CI] Bump gemma-4-31B-it perf benchmark startup timeout to 1800s by @QiliangCui2023 in #2722
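
Several entries above (#2548, #2549, #2557) harden how the benchmark reporting scripts build SQL. The sketch below illustrates the two general techniques those titles name: binding values as query parameters, and doubling single quotes when a value has to be inlined into a SQL string. It is not code from the repository. The scripts in question are shell scripts whose database is not stated here, so the use of Python's `sqlite3`, the `results` table, and the helper names `record_result`, `lookup`, and `escape_single_quotes` are all assumptions made for illustration.

```python
import sqlite3


def record_result(conn, model, throughput):
    """Insert one benchmark row using bound parameters (hypothetical schema)."""
    # The "?" placeholders let the driver bind the values, so a quote inside
    # `model` cannot terminate the SQL string literal.
    conn.execute(
        "INSERT INTO results (model, throughput) VALUES (?, ?)",
        (model, throughput),
    )


def lookup(conn, model):
    """Return all recorded throughput values for a model name."""
    rows = conn.execute(
        "SELECT throughput FROM results WHERE model = ?", (model,)
    ).fetchall()
    return [row[0] for row in rows]


def escape_single_quotes(value):
    """Fallback for tools that only accept a literal SQL string.

    Standard SQL represents a single quote inside a string literal by
    doubling it.
    """
    return value.replace("'", "''")


# In-memory database standing in for the real reporting backend.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (model TEXT, throughput REAL)")

hostile_name = "gemma'; DROP TABLE results; --"
record_result(conn, hostile_name, 123.5)
record_result(conn, "qwen3", 98.0)
```

Bound parameters are generally the safer choice where the client supports them. Quote doubling is a fallback for cases where the statement must be passed as a single string.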

Full Changelog: v0.20.0...v0.21.0