Releases
v0.22.1
Highlights
MTP (Multi-Token Prediction) speculative decoding (#2588): Torch/vLLM MTP implementation. Speculative decoding is also now compatible with attention DP (#2745).
Prompt logprobs support (#2750): returns token-level logprobs over the prompt, not just the generated tokens.
MPMD DP support (#2700): multi-program multi-data parallelism for data-parallel deployments, including merged xprof captures across ranks (#2705, #2723, #2738).
GDN decode kernel optimization (#2671, #2741): reduced spill and improved decode-side throughput on GDN-based models.
Punica LoRA ops + custom module parsing (#2471): fills in previously missing Punica LoRA ops.
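As background for the speculative-decoding items above, the following is a minimal sketch of greedy draft verification, the general idea behind accepting or rejecting drafted tokens. It is not the implementation from #2588; the function name and the assumption that the target model scores all draft positions in one pass are illustrative only.

```python
from typing import List


def verify_draft_greedy(draft_tokens: List[int], target_tokens: List[int]) -> List[int]:
    """Return the tokens emitted by one speculative step (hypothetical helper).

    draft_tokens:  k tokens proposed by the drafter.
    target_tokens: k + 1 greedy tokens from the target model, where
                   target_tokens[i] is the target's choice at position i
                   given the prefix followed by draft_tokens[:i].
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have exactly one more entry than draft_tokens")

    emitted: List[int] = []
    for i, draft in enumerate(draft_tokens):
        if draft != target_tokens[i]:
            # First mismatch: keep the target's token and discard the rest of the draft.
            emitted.append(target_tokens[i])
            return emitted
        emitted.append(draft)

    # Every draft token matched, so the extra target position is a free bonus token.
    emitted.append(target_tokens[-1])
    return emitted
```

Each step therefore emits between 1 and k + 1 tokens for a single target-model pass, which is where the throughput gain comes from.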
New features
Model support
Gemma-4 E2B-it on the JAX path with PLE + KV-share + double-wide MLP (#2572)
Gemma-4 E4B-it on the JAX path (#2690)
Qwen3-VL Deepstack stateless JIT precompilation on TPU (#2673)
Qwen3-VL-Embedding-8B via meta-tensor materialization + Deepstack device patching (#2605)
Qwen3-VL multimodal video: video_grid_thw plumbing (#2404)
Capabilities
Prompt logprobs (#2750)
MTP (Multi-Token Prediction) speculative decoding (#2588), with a draft-token profiler tweak (#2614) and a drafter-padding fix (#2610)
Punica LoRA ops + custom module parsing (#2471)
JaxLmHead layer added (#2693)
MPMD DP support (#2700) with xprof merging across ranks (#2705, #2723, #2738)
Bucketized request size with attention DP (#2670)
Per-request profiling: request-ID tracing in the model forward pass (#2622); SparseCore tracing limited to 1 core / 1 tile (#2698); xprof advanced options via env (#2538); AggregatedStatsLogger decoupled from PhaseBasedProfiler (#2662); batch stats instrumentation (#2283)
Removed forced --disable-chunked-mm-input for multimodal models (#2660)
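To make the prompt-logprobs capability concrete, here is a small sketch of what "logprobs over the prompt" means numerically. It assumes you already have next-token logits for each prompt position; the function names and data layout are hypothetical and do not reflect the code in #2750.

```python
import math
from typing import List, Optional, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum_exp for x in logits]


def prompt_logprobs(
    prompt_ids: Sequence[int],
    logits_per_position: Sequence[Sequence[float]],
) -> List[Optional[float]]:
    """Log-probability of each prompt token given the tokens before it.

    logits_per_position[i] holds the next-token logits produced after the
    model has consumed prompt_ids[: i + 1] (an assumed layout).
    """
    # The first token has no preceding context, so nothing can score it.
    result: List[Optional[float]] = [None]
    for i in range(1, len(prompt_ids)):
        logprobs = log_softmax(logits_per_position[i - 1])
        result.append(logprobs[prompt_ids[i]])
    return result
```

Summing the non-None entries gives the log-likelihood of the prompt under the model, which is the usual reason to request these values (for example, scoring fixed continuations in an evaluation harness).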
Performance & kernel improvements
Linear / matmul
Switch blockwise_matmul → gmm_v2 and establish canonical (k, n) weight layout (#2706)
QKVParallelLinear: 2D mesh of attn_dp / attn_dp (#2702); derive TP size from sharding_config instead of parallel_config (#2682)
flax_nnx: extend pre-flatten to compute_logits / embed_input_ids / combine_hidden_states / embed_multimodal / run_draft_model (#2657)
Qwen3.5 small-batch permute/unpermute via one-hot + matmul (#2674)
Replicate KV head when TP > kv_head (#2661)
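The one-hot + matmul item above relies on a general identity: a row gather can be written as a multiplication by a one-hot permutation matrix, and the inverse gather as a multiplication by its transpose. The sketch below shows that identity in plain Python; it is an illustration of the technique, not the kernel from #2674, and all names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain row-by-column matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(row) for row in zip(*m)]


def one_hot_permutation(perm: List[int]) -> Matrix:
    """Row i has a single 1 at column perm[i], so (P @ x)[i] == x[perm[i]]."""
    n = len(perm)
    return [[1.0 if j == perm[i] else 0.0 for j in range(n)] for i in range(n)]


def permute(x: Matrix, perm: List[int]) -> Matrix:
    """Gather rows of x in the order given by perm, expressed as a matmul."""
    return matmul(one_hot_permutation(perm), x)


def unpermute(y: Matrix, perm: List[int]) -> Matrix:
    """Undo permute: a permutation matrix's transpose is its inverse."""
    return matmul(transpose(one_hot_permutation(perm)), y)
```

For small batches the one-hot matrix is tiny, so trading a gather/scatter for a dense matmul can be worthwhile on hardware where matrix units are much faster than indexed memory access.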
Attention / MLA
MLA transpose via pad-and-slice so the Pallas-kernel version is triggered (#2725)
MLA kernel-tuner + unit tests for kernel tuners (#2672)
MLA_XPOSE_NTILE option exposed; w4a8 requant support in TorchAX (#2649)
RPA v3 + batched: support update_kv_cache=False (KV-share) on both kernels (#2632); KV reshape optimization in prepare_inputs (#2653)
GDN
GDN decode kernel optimization to reduce spill (#2671, #2741)
GDN implementation cleanup + reorganization (#2699)
MoE / DP
Optimize attn-DP + MoE-EP ReduceScatter using psum_scatter, avoid padding (#2679)
dp_scheduler / batch_prefill: flush timeout (#2651)
MPMD DP (#2700) and rank-0-elected canonical timestamps for MPMD xprof merge (#2723)
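For readers unfamiliar with the ReduceScatter collective mentioned above (exposed in JAX as jax.lax.psum_scatter), the following pure-Python simulation shows its semantics: every rank contributes a full buffer, and each rank ends up with only its own shard of the element-wise sum. This is a sketch of the collective's meaning, not of the change in #2679; it assumes the buffer length divides evenly by the number of ranks, and the function names are hypothetical.

```python
from typing import List


def reduce_scatter(per_rank: List[List[float]]) -> List[List[float]]:
    """Simulate ReduceScatter: result[r] is rank r's shard of the summed buffer."""
    num_ranks = len(per_rank)
    length = len(per_rank[0])
    if any(len(buf) != length for buf in per_rank):
        raise ValueError("all ranks must contribute buffers of the same length")
    if length % num_ranks != 0:
        raise ValueError("this sketch assumes the length divides evenly across ranks")

    shard = length // num_ranks
    return [
        [sum(buf[i] for buf in per_rank) for i in range(r * shard, (r + 1) * shard)]
        for r in range(num_ranks)
    ]


def all_reduce_then_slice(per_rank: List[List[float]], rank: int) -> List[float]:
    """Reference path: sum everything on every rank, then keep one shard."""
    total = [sum(vals) for vals in zip(*per_rank)]
    shard = len(total) // len(per_rank)
    return total[rank * shard:(rank + 1) * shard]
```

Both paths give the same per-rank result, but ReduceScatter never materializes the full summed buffer on every rank, which is why it is preferred when each rank only needs its own slice.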
Quantization
NVFP4: fast sharding / requant path (#2719)
CompressedTensors w4a8: faster sharding + requantization (#2659)
FP8/CT: MoE weight re-quantization on TPU (#2612)
Caching
Centralized JAX persistent compilation cache (#2644); moved to attached disk (#2697)
Pre-warm JAX compilation cache to stabilize cache identity in tests (#2688)
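As a rough illustration of pointing JAX at one shared persistent cache directory, the sketch below sets the JAX_COMPILATION_CACHE_DIR environment variable, which JAX reads for its persistent compilation cache. The helper name, the subdirectory name, and the example mount point are assumptions made for illustration; the actual layout chosen in #2644 and #2697 is not described in these notes.

```python
import os
from pathlib import Path


def configure_jax_cache(cache_root: str, subdir: str = "jax_cache") -> Path:
    """Create a shared cache directory and point JAX at it (hypothetical helper).

    Call this before importing jax so the setting is picked up at import time.
    """
    cache_dir = Path(cache_root) / subdir
    cache_dir.mkdir(parents=True, exist_ok=True)
    # JAX reads this variable to locate its persistent compilation cache.
    os.environ["JAX_COMPILATION_CACHE_DIR"] = str(cache_dir)
    return cache_dir
```

A typical call would be `configure_jax_cache("/mnt/disks/persist")`, where the path is a placeholder for wherever the attached disk is mounted. Keeping the directory on a disk that outlives the VM or container lets later runs reuse compiled executables instead of recompiling.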
Misc
SparseCore: offload index gathering in the ragged gather-reduce path (#2634; reverted by #2648 mid-window and re-introduced via #2683; see Note)
Bug fixes
Correctness
Kimi-K2.6 / DeepSeek MLA: transpose unquantized kv_b_proj for upstream layout (#2789 / #2786); fixes the kv_b_proj_weight.shape assertion failure
Gemma-4 multimodal DP: fix DP handling of multimodal input (#2643)
Clamp scratch-ref indices in rpa_body for padded steps (#2734)
Onehot regression: pass scatter_r… after merge with onehot PR (#2740)
Spec-decoding drafter padding: set seq-len to 0 for padding sequences in the drafter's input to avoid wasted paged-attention work (#2610)
Kimi-K2.6 EP unit test correction (#2686)
Qwen3-VL fused weight loading without shard_id (#2647)
Infra, CI, docs
Harden pipeline-upload scripts against YAML / command injection (#2732)
Validate TIMEOUT_SECONDS and HEAD_NODE_ADDRESS early (#2735)
Enforce e2el in percentile-metrics for vllm_bench_serve (#2724)
Add kernel_tuner_runnner_test to CI (#2726)
Add batched-RPA E2E test using Qwen Coder (#2694)
Add MMLU test case via lm_eval (#2429)
Documentation: nightly README + support-matrix snippets (#2765); installation flag --no-build-isolation clarified (#2624); v0.20.0 release docs (#2695)