v0.22.1

CienetStingLin released this 16 Jun 08:21 · 276 commits to main since this release · commit 0045517

Highlights

  • MTP (Multi-Token Prediction) Speculative Decoding (#2588) — Torch/vLLM MTP implementation. Also: spec decoding is now compatible with attention DP (#2745).
  • Prompt Logprobs support (#2750) — return token-level logprobs over the prompt, not just the generated tokens.
  • MPMD DP support (#2700) — multi-program multi-data parallelism for data-parallel deployments, including merged xprof captures across ranks (#2705, #2723, #2738).
  • GDN decode kernel optimization (#2671, #2741) — reduced spill, decode-side throughput improvement on GDN-based models.
  • Punica LoRA ops + custom module parsing (#2471) — fills in previously missing Punica LoRA ops.
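
To make the prompt-logprobs item concrete, here is a minimal pure-Python sketch of what "token-level logprobs over the prompt" means. It is an illustration only, not the code from #2750: the helper names `log_softmax` and `prompt_logprobs` and the toy logits are assumptions. The first prompt token gets `None` because it has no preceding context, a common convention for this feature.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def prompt_logprobs(prompt_ids, logits_per_position):
    """Return log P(prompt[i] | prompt[:i]) for every prompt position.

    logits_per_position[i] holds the next-token logits produced after the
    model has read prompt[: i + 1]. Hypothetical helper for illustration.
    """
    out = [None]  # the first token has no context to condition on
    for i in range(1, len(prompt_ids)):
        row = log_softmax(logits_per_position[i - 1])
        out.append(row[prompt_ids[i]])
    return out


# Toy example: vocabulary of 3 tokens, prompt of 3 tokens.
prompt = [0, 2, 1]
logits = [
    [0.0, 0.0, 0.0],  # after token 0: uniform over the vocabulary
    [1.0, 3.0, 1.0],  # after tokens 0, 2: favors token 1
    [0.5, 0.5, 0.5],  # after the full prompt: used for generation, not here
]
result = prompt_logprobs(prompt, logits)
```

In this toy case the second prompt token scores `log(1/3)` because the distribution is uniform, and the third scores higher because the logits favor it.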

New features

Model support

  • Gemma-4 E2B-it on the JAX path with PLE + KV-share + double-wide MLP (#2572)
  • Gemma-4 E4B-it on the JAX path (#2690)
  • Qwen3-VL Deepstack stateless JIT precompilation on TPU (#2673)
  • Qwen3-VL-Embedding-8B via meta-tensor materialization + Deepstack device patching (#2605)
  • Qwen3-VL multimodal video: video_grid_thw plumbing (#2404)

Capabilities

  • Prompt logprobs (#2750)
  • MTP (Multi-Token Prediction) speculative decoding (#2588), with a draft-token profiler tweak (#2614) and a drafter-padding fix (#2610)
  • Punica LoRA ops + custom module parsing (#2471)
  • JaxLmHead layer added (#2693)
  • MPMD DP support (#2700) with xprof merging across ranks (#2705, #2723, #2738)
  • Bucketized request size with attention DP (#2670)
  • Per-request profiling: request-ID tracing in the model forward pass (#2622); SparseCore tracing limited to 1 core / 1 tile (#2698); xprof advanced options via env (#2538); AggregatedStatsLogger decoupled from PhaseBasedProfiler (#2662); batch stats instrumentation (#2283)
  • Removed forced --disable-chunked-mm-input for multimodal models (#2660)
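
For readers new to speculative decoding, the sketch below shows the generic greedy verification step that any drafter, MTP included, feeds into. It is not the implementation from #2588: the function name `accept_draft_tokens` is hypothetical, and real systems may use probabilistic (rejection-sampling) acceptance instead of exact matching.

```python
def accept_draft_tokens(draft_tokens, target_tokens):
    """Greedy verification of drafted tokens against the target model.

    draft_tokens: k tokens proposed by the drafter.
    target_tokens: k + 1 tokens the target model picks (argmax) at the k
        draft positions plus one extra position after them.
    Returns the tokens emitted in this step.
    """
    assert len(target_tokens) == len(draft_tokens) + 1
    emitted = []
    for i, draft in enumerate(draft_tokens):
        if draft != target_tokens[i]:
            # First mismatch: keep the target's token and discard the rest.
            emitted.append(target_tokens[i])
            return emitted
        emitted.append(draft)
    # Every draft matched, so the target's extra prediction is kept too.
    emitted.append(target_tokens[-1])
    return emitted


partial = accept_draft_tokens([5, 7, 9], [5, 7, 2, 4])
full = accept_draft_tokens([5, 7, 9], [5, 7, 9, 4])
```

Each step emits at least one token and at most k + 1, which is where the speedup comes from when the drafter is usually right.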

Performance & kernel improvements

Linear / matmul

  • Switch blockwise_matmul → gmm_v2 and establish canonical (k, n) weight layout (#2706)
  • QKVParallelLinear: 2D mesh of attn_dp / attn_dp (#2702); derive TP size from sharding_config instead of parallel_config (#2682)
  • flax_nnx: extend pre-flatten to compute_logits / embed_input_ids / combine_hidden_states / embed_multimodal / run_draft_model (#2657)
  • Qwen3.5 small-batch permute/unpermute via one-hot + matmul (#2674)
  • Replicate KV head when TP > kv_head (#2661)
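
The "permute/unpermute via one-hot + matmul" idea can be illustrated without any accelerator code. The sketch below is a plain-Python toy, not the kernel from #2674; all helper names are assumptions. It shows that multiplying by a one-hot matrix reorders rows exactly like a gather, and that multiplying by its transpose undoes the reordering.

```python
def one_hot_rows(indices, width):
    """Build a matrix whose row i is the one-hot encoding of indices[i]."""
    return [[1.0 if j == idx else 0.0 for j in range(width)] for idx in indices]


def matmul(a, b):
    """Naive dense matrix multiply for small lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(col) for col in zip(*m)]


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
order = [2, 0, 1]  # permuted[i] should equal tokens[order[i]]

perm = one_hot_rows(order, len(tokens))
permuted = matmul(perm, tokens)                 # permute: gather as a matmul
restored = matmul(transpose(perm), permuted)    # unpermute: transpose inverts it
```

On hardware with fast matrix units, a small dense matmul like this can be cheaper than a gather or scatter for small batches, which is presumably the motivation behind the change.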

Attention / MLA

  • MLA transpose via pad-and-slice so the Pallas-kernel version is triggered (#2725)
  • MLA kernel-tuner + unit tests for kernel tuners (#2672)
  • MLA_XPOSE_NTILE option exposed; w4a8 requant support in TorchAX (#2649)
  • RPA v3 + batched: support update_kv_cache=False (KV-share) on both kernels (#2632); KV reshape optimization in prepare_inputs (#2653)

GDN

  • GDN decode kernel optimization to reduce spill (#2671, #2741)
  • GDN implementation cleanup + reorganization (#2699)

MoE / DP

  • Optimize attn-DP + MoE-EP ReduceScatter using psum_scatter, avoid padding (#2679)
  • dp_scheduler / batch_prefill: flush timeout (#2651)
  • MPMD DP (#2700) and rank-0-elected canonical timestamps for MPMD xprof merge (#2723)
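
As background for the ReduceScatter item, the following pure-Python simulation shows what a sum reduce-scatter computes across devices, which is the collective that `jax.lax.psum_scatter` exposes in JAX. It is a semantic sketch only: devices are modeled as list entries, `reduce_scatter` is a hypothetical helper, and nothing here reflects how #2679 avoids padding.

```python
def reduce_scatter(per_device):
    """Simulate a sum reduce-scatter over equally sized per-device vectors.

    per_device[d] is the full-length vector held by device d. After the
    collective, device d holds only its contiguous shard of the elementwise sum.
    """
    n = len(per_device)
    length = len(per_device[0])
    assert all(len(vec) == length for vec in per_device)
    assert length % n == 0, "vector length must divide evenly across devices"
    shard = length // n
    summed = [sum(vec[i] for vec in per_device) for i in range(length)]
    return [summed[d * shard:(d + 1) * shard] for d in range(n)]


# Two simulated devices, each contributing a length-4 vector.
shards = reduce_scatter([[1, 2, 3, 4], [10, 20, 30, 40]])
```

Compared with an all-reduce, each device ends up with 1/n of the result, so less data moves when the next operation only needs the local shard.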

Quantization

  • NVFP4: fast sharding / requant path (#2719)
  • CompressedTensors w4a8: faster sharding + requantization (#2659)
  • FP8/CT: MoE weight re-quantization on TPU (#2612)

Caching

  • Centralized JAX persistent compilation cache (#2644); moved to attached disk (#2697)
  • Pre-warm JAX compilation cache to stabilize cache identity in tests (#2688)

Misc

  • SparseCore: offload index gathering in the ragged gather-reduce path (#2634; reverted by #2648 mid-window and re-introduced via #2683 — see Note)

Bug fixes

Correctness

  • Kimi-K2.6 / DeepSeek MLA: transpose unquantized kv_b_proj for upstream layout (#2789 / #2786) — fixes the kv_b_proj_weight.shape assertion
  • Gemma-4 multimodal DP: fix data parallelism for multimodal inputs (#2643)
  • Clamp scratch-ref indices in rpa_body for padded steps (#2734)
  • Onehot regression: pass scatter_r… after merge with onehot PR (#2740)
  • Spec-decoding drafter padding: set seq-len to 0 for padding sequences in the drafter's input to avoid wasted paged-attention work (#2610)
  • Kimi-K2.6 EP unit test correction (#2686)
  • Qwen3-VL fused weight loading without shard_id (#2647)

Infra, CI, docs

  • Harden pipeline-upload scripts against YAML / command injection (#2732)
  • Validate TIMEOUT_SECONDS and HEAD_NODE_ADDRESS early (#2735)
  • Enforce e2el in percentile-metrics for vllm_bench_serve (#2724)
  • Add kernel_tuner_runnner_test to CI (#2726)
  • Add batched-RPA E2E test using Qwen Coder (#2694)
  • Add MMLU test case via lm_eval (#2429)
  • Documentation: nightly README + support-matrix snippets (#2765); installation flag --no-build-isolation clarified (#2624); v0.20.0 release docs (#2695)