Release v0.22.1 · vllm-project/tpu-inference

PyPI
Docker

Highlights

MTP (Multi-Token Prediction) Speculative Decoding (#2588) — Torch/vLLM MTP implementation. Also: spec decoding is now compatible with attention DP (#2745).
Prompt Logprobs support (#2750) — return token-level logprobs over the prompt, not just the generated tokens.
MPMD DP support (#2700) — multi-program multi-data parallelism for data-parallel deployments, including merged xprof captures across ranks (#2705, #2723, #2738).
GDN decode kernel optimization (#2671, #2741) — reduced spill, decode-side throughput improvement on GDN-based models.
Punica LoRA ops + custom module parsing (#2471) — fills in previously missing Punica LoRA ops.
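As a rough illustration of what "token-level logprobs over the prompt" means, the sketch below scores each prompt token under the distribution predicted from the tokens before it. The function names and the toy logits are assumptions made for this example; it is not the code from #2750.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def prompt_logprobs(prompt_ids, logits_per_position):
    """Return log p(prompt[i + 1] | prompt[: i + 1]) for each i.

    logits_per_position[i] is the (hypothetical) model output after
    consuming prompt_ids[: i + 1]. The first prompt token has no
    preceding context, so its entry is None.
    """
    out = [None]
    for i in range(len(prompt_ids) - 1):
        row = log_softmax(logits_per_position[i])
        out.append(row[prompt_ids[i + 1]])
    return out

# Toy example: vocabulary of 3 tokens, prompt of 3 tokens.
toy_logits = [
    [0.0, 0.0, 0.0],  # uniform prediction for prompt position 1
    [2.0, 0.0, 0.0],  # favors token 0 for prompt position 2
    [0.0, 1.0, 0.0],  # unused here: it predicts the first generated token
]
result = prompt_logprobs([1, 2, 0], toy_logits)
```

The last row of logits is ignored because it predicts the first generated token, which is already covered by ordinary sampling logprobs.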

New features

Model support

Gemma-4 E2B-it on the JAX path with PLE + KV-share + double-wide MLP (#2572)
Gemma-4 E4B-it on the JAX path (#2690)
Qwen3-VL Deepstack stateless JIT precompilation on TPU (#2673)
Qwen3-VL-Embedding-8B via meta-tensor materialization + Deepstack device patching (#2605)
Qwen3-VL multimodal video: video_grid_thw plumbing (#2404)

Capabilities

Prompt logprobs (#2750)
MTP (Multi-Token Prediction) speculative decoding (#2588), with a draft-token profiler tweak (#2614) and a drafter-padding fix (#2610)
Punica LoRA ops + custom module parsing (#2471)
JaxLmHead layer added (#2693)
MPMD DP support (#2700) with xprof merging across ranks (#2705, #2723, #2738)
Bucketized request size with attention DP (#2670)
Per-request profiling: request-ID tracing in the model forward pass (#2622); SparseCore tracing limited to 1 core / 1 tile (#2698); xprof advanced options via env (#2538); AggregatedStatsLogger decoupled from PhaseBasedProfiler (#2662); batch stats instrumentation (#2283)
Removed forced --disable-chunked-mm-input for multimodal models (#2660)
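For readers new to speculative decoding, the sketch below shows the generic greedy verification step: a drafter proposes several tokens, the target model scores them in one pass, and the longest matching prefix is kept. This is a textbook illustration under that assumption, with a hypothetical function name; the notes do not describe how the MTP drafter in #2588 verifies tokens.

```python
def verify_draft_greedy(draft_tokens, target_tokens):
    """Accept the longest draft prefix that matches the target model.

    draft_tokens:  k tokens proposed by the drafter.
    target_tokens: k + 1 greedy tokens from the target model, where
                   target_tokens[i] is its choice given the context
                   plus draft_tokens[:i].
    Returns the tokens emitted this step: the accepted prefix plus one
    token from the target (a correction or a bonus token).
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target must score one more position than the draft")
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_tokens[i]:
            break
        accepted.append(tok)
    # Target's token at the first mismatch, or the bonus token when all match.
    accepted.append(target_tokens[len(accepted)])
    return accepted
```

Every step emits at least one token, so a poor drafter costs extra compute but never changes the greedy output.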

Performance & kernel improvements

Linear / matmul

Switch blockwise_matmul → gmm_v2 and establish canonical (k, n) weight layout (#2706)
QKVParallelLinear: 2D mesh of attn_dp / attn_dp (#2702); derive TP size from sharding_config instead of parallel_config (#2682)
flax_nnx: extend pre-flatten to compute_logits / embed_input_ids / combine_hidden_states / embed_multimodal / run_draft_model (#2657)
Qwen3.5 small-batch permute/unpermute via one-hot + matmul (#2674)
Replicate KV head when TP > kv_head (#2661)
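The "one-hot + matmul" idea from #2674 can be illustrated in a few lines: reordering rows with an index is equivalent to multiplying by a one-hot permutation matrix, and the transpose of that matrix undoes the reordering. The sketch below uses plain Python lists and hypothetical helper names; it shows only the algebra, not the kernel in the PR.

```python
def one_hot_matrix(perm):
    """Row i has a single 1.0 at column perm[i]."""
    n = len(perm)
    return [[1.0 if j == perm[i] else 0.0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain dense matmul on lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 tokens, hidden size 2
perm = [2, 0, 1]                               # output row i takes input row perm[i]

p = one_hot_matrix(perm)
permuted = matmul(p, tokens)                # same rows as [tokens[i] for i in perm]
restored = matmul(transpose(p), permuted)   # P^T undoes P for a permutation
```

The one-hot matrix grows quadratically with the number of rows, which is consistent with the PR applying this only to small batches.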

Attention / MLA

MLA transpose via pad-and-slice so the Pallas-kernel version is triggered (#2725)
MLA kernel-tuner + unit tests for kernel tuners (#2672)
MLA_XPOSE_NTILE option exposed; w4a8 requant support in TorchAX (#2649)
RPA v3 + batched: support update_kv_cache=False (KV-share) on both kernels (#2632); KV reshape optimization in prepare_inputs (#2653)

GDN

GDN decode kernel optimization to reduce spill (#2671, #2741)
GDN implementation cleanup + reorganization (#2699)

MoE / DP

Optimize attn-DP + MoE-EP ReduceScatter using psum_scatter, avoid padding (#2679)
dp_scheduler / batch_prefill: flush timeout (#2651)
MPMD DP (#2700) and rank-0-elected canonical timestamps for MPMD xprof merge (#2723)
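As background for the ReduceScatter item, the sketch below simulates a sum reduce-scatter on one host: every rank contributes a full-length vector and keeps only its own chunk of the elementwise sum. jax.lax.psum_scatter is the JAX collective with this behavior. The helper name and the toy data are assumptions for illustration, and the padding-avoidance logic of #2679 is not modeled.

```python
def reduce_scatter(per_rank_values):
    """Simulate a sum reduce-scatter across ranks on one host.

    per_rank_values[r] is the full-length vector held by rank r.
    Rank r receives only chunk r of the elementwise sum, so the
    vector length must divide evenly by the number of ranks.
    """
    num_ranks = len(per_rank_values)
    length = len(per_rank_values[0])
    if length % num_ranks != 0:
        raise ValueError("vector length must be divisible by the number of ranks")
    chunk = length // num_ranks
    summed = [sum(vals) for vals in zip(*per_rank_values)]
    return [summed[r * chunk:(r + 1) * chunk] for r in range(num_ranks)]

shards = reduce_scatter([
    [1, 2, 3, 4],      # rank 0
    [10, 20, 30, 40],  # rank 1
])
```

Compared with an all-reduce followed by a local slice, each rank ends up holding only the chunk it needs.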

Quantization

NVFP4: fast sharding / requant path (#2719)
CompressedTensors w4a8: faster sharding + requantization (#2659)
FP8/CT: MoE weight re-quantization on TPU (#2612)

Caching

Centralized JAX persistent compilation cache (#2644); moved to attached disk (#2697)
Pre-warm JAX compilation cache to stabilize cache identity in tests (#2688)
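One way to point every process at a single persistent cache is to set JAX's JAX_COMPILATION_CACHE_DIR environment variable before JAX is imported. The sketch below is a minimal example of that approach; the helper name, the directory layout, and the example path are assumptions, and the notes do not say how #2644 or #2697 configure the cache.

```python
import os
from pathlib import Path

def configure_jax_cache(root, project="tpu-inference"):
    """Point JAX at one shared on-disk compilation cache.

    `root` and `project` are hypothetical values for this sketch.
    Call this before importing jax so the JAX_COMPILATION_CACHE_DIR
    environment variable is picked up when JAX reads its config.
    """
    cache_dir = Path(root) / project / "jax_cache"
    cache_dir.mkdir(parents=True, exist_ok=True)
    os.environ["JAX_COMPILATION_CACHE_DIR"] = str(cache_dir)
    return cache_dir
```

For example, configure_jax_cache("/mnt/disks/persist") would place the cache on an attached disk mounted at that (hypothetical) path, so compiled programs survive restarts of the host process.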

Misc

SparseCore: offload index gathering in the ragged gather-reduce path (#2634; reverted by #2648 mid-window and re-introduced via #2683 — see Note)

Bug fixes

Correctness

Kimi-K2.6 / DeepSeek MLA: transpose unquantized kv_b_proj for upstream layout (#2789 / #2786) — fixes the kv_b_proj_weight.shape assertion
Gemma-4 multimodal DP: fix DP handling of multimodal input (#2643)
Clamp scratch-ref indices in rpa_body for padded steps (#2734)
Onehot regression: pass scatter_r… after merge with onehot PR (#2740)
Spec-decoding drafter padding: set seq-len to 0 for padding sequences in the drafter's input to avoid wasted paged-attention work (#2610)
Kimi-K2.6 EP unit test correction (#2686)
Qwen3-VL fused weight loading without shard_id (#2647)
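The index-clamping fix (#2734) follows a common pattern for loops padded to a fixed trip count: padded iterations still compute an index, so the index is clamped into bounds and the padded result is masked out. The sketch below shows that general pattern with hypothetical names; it is not the rpa_body code.

```python
def clamp(idx, size):
    """Clamp idx into the valid range [0, size - 1]."""
    return max(0, min(idx, size - 1))

def run_padded_loop(scratch, num_steps, padded_steps):
    """Sum scratch[step] over a loop padded to a fixed trip count.

    Steps >= num_steps are padding: their index is clamped so the read
    stays in bounds, and their value is masked out of the result.
    """
    total = 0
    for step in range(padded_steps):
        value = scratch[clamp(step, len(scratch))]
        if step < num_steps:  # only real steps contribute
            total += value
    return total
```

In plain Python an unclamped read past the end raises IndexError; in a compiled kernel the same mistake can read arbitrary memory instead, which is why padded steps need a safe index even though their result is discarded.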

Infra, CI, docs

Harden pipeline-upload scripts against YAML / command injection (#2732)
Validate TIMEOUT_SECONDS and HEAD_NODE_ADDRESS early (#2735)
Enforce e2el in percentile-metrics for vllm_bench_serve (#2724)
Add kernel_tuner_runnner_test to CI (#2726)
Add batched-RPA E2E test using Qwen Coder (#2694)
Add MMLU test case via lm_eval (#2429)
Documentation: nightly README + support-matrix snippets (#2765); installation flag --no-build-isolation clarified (#2624); v0.20.0 release docs (#2695)
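Early validation of settings such as TIMEOUT_SECONDS and HEAD_NODE_ADDRESS (#2735) usually means rejecting malformed values before they reach a shell command or a generated YAML file. The sketch below is a minimal Python version of that idea; the accepted formats and the function name are assumptions, and the project's actual scripts may be written differently.

```python
import re

def validate_env(env):
    """Fail fast on malformed settings and return the parsed values.

    Accepts a mapping such as os.environ. The accepted formats are
    assumptions made for this sketch.
    """
    raw_timeout = env.get("TIMEOUT_SECONDS", "")
    if not re.fullmatch(r"[0-9]+", raw_timeout) or int(raw_timeout) <= 0:
        raise ValueError(
            f"TIMEOUT_SECONDS must be a positive integer, got {raw_timeout!r}"
        )
    address = env.get("HEAD_NODE_ADDRESS", "")
    # Host-name or IPv4 characters only, which also rejects shell metacharacters.
    if not re.fullmatch(r"[A-Za-z0-9]([A-Za-z0-9.-]*[A-Za-z0-9])?", address):
        raise ValueError(
            f"HEAD_NODE_ADDRESS is not a plain host name or IP: {address!r}"
        )
    return int(raw_timeout), address
```

Calling validate_env(os.environ) at the top of a script surfaces a bad value immediately instead of partway through a long pipeline run.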
