v1.6.0 — vLLM 0.18-0.20.x modernization + routed_experts refactor

github-actions released this 05 May 03:10 · 13 commits to master since this release

vLLM 0.18 — 0.20.x integration (PRD #20, #21)

abliterix's vLLM backend is now version-aware: it refuses to start against vLLM < 0.18 (VLLM_ALLOW_INSECURE_SERIALIZATION is required for MoE editing) and warns on vLLM >= 0.21 until that version has been smoke-tested.
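
The gating rule above can be sketched in a few lines. This is an illustrative sketch only, not the code in abliterix.core.vllm_compat: the function names, the return values, and the choice to compare only major.minor are assumptions made for the example.

```python
import re
import warnings

MIN_SUPPORTED = (0, 18)   # below this, startup is refused
FIRST_UNTESTED = (0, 21)  # at or above this, a warning is issued


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from strings such as '0.20.1' or '0.19.0rc1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised vLLM version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_vllm_version(version: str) -> str:
    """Return 'ok' or 'untested'; raise RuntimeError for unsupported versions.

    Hypothetical helper: the real gate may report its result differently.
    """
    major_minor = parse_major_minor(version)
    if major_minor < MIN_SUPPORTED:
        raise RuntimeError(f"vLLM {version} is older than 0.18; refusing to start")
    if major_minor >= FIRST_UNTESTED:
        warnings.warn(f"vLLM {version} has not been smoke-tested")
        return "untested"
    return "ok"
```

Under these assumptions, check_vllm_version("0.20.1") returns "ok", "0.21.0" returns "untested" after emitting a warning, and "0.17.2" raises RuntimeError.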

New [model] config fields, all with safe defaults so existing recipes keep working unchanged:

  • attention_backend — explicit override; None (default) auto-routes MLA models (DeepSeek-V2/V3, MiniMax-M2.x) to FLASH_ATTN_MLA, gpt-oss to TRITON_ATTN, rest to vLLM's default.
  • moe_backend (default "triton") — replaces the deprecated VLLM_FUSED_MOE_UNQUANTIZED_BACKEND env var. Skips the FlashInfer cutlass per-expert-group JIT compile that costs 30+ minutes on first sm_90 cold start.
  • disable_custom_all_reduce: None auto-detects the Blackwell PCIe (sm_120) deadlock case; NVLink Hopper / SXM Blackwell keep the perf win.
  • limit_mm_per_prompt — defaults to {image:0, video:0, audio:0} so the Punica LoRA wrapper accepts hybrid VLM/MoE architectures without crashing on visual.* modules.
  • vllm_max_loras / vllm_max_lora_rank / lora_target_modules — LoRA pool + restriction knobs (--lora-target-modules from vLLM PR #34984, v0.19.0+).
  • vllm_compile_mode: "eager" (default) / "full_compile" selector for vLLM's compilation_config. The third option "moe_eager_rest_compile" is rejected at config load (closed wontfix in #23 after GPU smoke confirmed PIECEWISE bypasses module hooks).
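
The attention_backend auto-routing described in the list above might look roughly like the following. This is a sketch under stated assumptions: the function name and the substring matching on the model identifier are hypothetical, and the actual resolver may inspect the model config rather than its name.

```python
from typing import Optional

# Hypothetical name fragments used only for this illustration.
MLA_HINTS = ("deepseek-v2", "deepseek-v3", "minimax-m2")
GPT_OSS_HINT = "gpt-oss"


def resolve_attention_backend(model_id: str,
                              override: Optional[str] = None) -> Optional[str]:
    """Pick an attention backend name, or None to keep vLLM's default."""
    if override is not None:
        # An explicit attention_backend value always wins.
        return override
    name = model_id.lower()
    if any(hint in name for hint in MLA_HINTS):
        return "FLASH_ATTN_MLA"
    if GPT_OSS_HINT in name:
        return "TRITON_ATTN"
    return None
```

With this sketch, "deepseek-ai/DeepSeek-V2-Lite-Chat" resolves to "FLASH_ATTN_MLA", a gpt-oss model resolves to "TRITON_ATTN", and anything else returns None so that vLLM chooses.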

New deep modules:

  • abliterix.core.vllm_compat — version gating + idempotent env var setup (FLASHINFER_DISABLE_VERSION_CHECK=1 always, VLLM_ALLOW_INSECURE_SERIALIZATION=1 only when needed).
  • abliterix.core.vllm_compilation_config — single-purpose builder for the compilation_config dict; centralises schema knowledge so the rest of the codebase expresses intent ("eager" vs "full_compile") instead of juggling mode / cudagraph_mode integers.
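
A minimal sketch of what idempotent env var setup could mean in practice, assuming the helper name, its signature, and the use of setdefault (which leaves a value the user already exported untouched). The real module may overwrite values instead.

```python
import os
from typing import MutableMapping


def setup_vllm_env(needs_insecure_serialization: bool,
                   env: MutableMapping[str, str] = os.environ) -> None:
    """Set the env vars named in the release notes; safe to call repeatedly.

    Hypothetical helper: `env` is injectable so the logic can be tested
    against a plain dict instead of the process environment.
    """
    # Always set, per the release notes.
    env.setdefault("FLASHINFER_DISABLE_VERSION_CHECK", "1")
    # Only set when the run actually needs it (for example MoE editing).
    if needs_insecure_serialization:
        env.setdefault("VLLM_ALLOW_INSECURE_SERIALIZATION", "1")
```

Calling the function twice with the same arguments leaves the mapping in the same state as calling it once, which is the property the word "idempotent" refers to.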

_build_llm_kwargs extracted from VLLMGenerator.__init__ as a pure function (PR #21 review #7) — every conditional kwarg branch is now unit-testable without importing vLLM. Coverage: 51 new CPU-only tests across tests/test_vllm_compat.py, tests/test_vllm_compilation_config.py, tests/test_vllm_backend_kwargs.py.
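
To illustrate why a pure kwargs builder is testable without importing vLLM, here is a simplified sketch. The ModelConfig dataclass and build_llm_kwargs are hypothetical stand-ins, not the project's actual _build_llm_kwargs. The sketch also omits the hardware auto-detection that the release notes describe for disable_custom_all_reduce = None and simply leaves the kwarg unset in that case.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ModelConfig:
    """Hypothetical stand-in for a few fields of the [model] section."""
    model_id: str
    disable_custom_all_reduce: Optional[bool] = None
    limit_mm_per_prompt: dict[str, int] = field(
        default_factory=lambda: {"image": 0, "video": 0, "audio": 0})
    vllm_max_loras: Optional[int] = None
    vllm_max_lora_rank: Optional[int] = None


def build_llm_kwargs(cfg: ModelConfig) -> dict[str, Any]:
    """Translate config into engine kwargs without touching vLLM itself."""
    kwargs: dict[str, Any] = {
        "model": cfg.model_id,
        "limit_mm_per_prompt": dict(cfg.limit_mm_per_prompt),
    }
    if cfg.disable_custom_all_reduce is not None:
        kwargs["disable_custom_all_reduce"] = cfg.disable_custom_all_reduce
    if cfg.vllm_max_loras is not None:
        # Assumption: a configured LoRA pool size implies enabling LoRA.
        kwargs["enable_lora"] = True
        kwargs["max_loras"] = cfg.vllm_max_loras
        if cfg.vllm_max_lora_rank is not None:
            kwargs["max_lora_rank"] = cfg.vllm_max_lora_rank
    return kwargs
```

Because the function only builds a dict, each conditional branch can be asserted on directly in a CPU-only test.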

Architectural cleanup — routed_experts replaces collective_rpc MoE probe (#22, #24)

vLLM 0.20.x exposes per-token routed expert IDs on RequestOutput.outputs[0].routed_experts (numpy ndarray of shape (prompt_tokens, n_layers, top_k)). abliterix's MoE safety-expert profiler now reads this directly and aggregates driver-side.

  • Deletes 4 worker rpc functions + ~150 LoC of forward-hook plumbing.
  • New [model].vllm_return_routed_experts (default True); set it to false to fall back to the legacy hook path.
  • Removes one of the two reasons abliterix needed VLLM_ALLOW_INSECURE_SERIALIZATION=1 (the other — VLLMMoEEditor.apply() per-trial suppression — still requires it).
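
Driver-side aggregation over a routed_experts array of shape (prompt_tokens, n_layers, top_k) could be sketched as below. The function name is hypothetical and this is not the profiler's actual code. To stay dependency-free, the sketch accepts any nested sequence indexed [token][layer][slot], which includes the ndarray described above.

```python
from collections import Counter
from typing import Sequence


def count_routed_experts(
        routed_experts: Sequence[Sequence[Sequence[int]]]) -> list[Counter]:
    """Count, per layer, how often each expert ID was selected.

    Returns one Counter per layer, keyed by expert ID.
    """
    if len(routed_experts) == 0:
        return []
    n_layers = len(routed_experts[0])
    per_layer = [Counter() for _ in range(n_layers)]
    for token in routed_experts:
        for layer_idx, slots in enumerate(token):
            # int() normalises numpy integer scalars to plain Python ints.
            per_layer[layer_idx].update(int(expert) for expert in slots)
    return per_layer


# Two prompt tokens, two layers, top_k = 2.
example = [
    [[0, 1], [2, 3]],
    [[0, 2], [3, 1]],
]
counts = count_routed_experts(example)
```

Counters from several requests can be merged with `+`, so totals across a whole prompt set need no worker-side state.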

GPU-verified parity on DeepSeek-V2-Lite-Chat (issue #22 Phase B7b on H100 NVL): top-1 expert match 22/26 layers (84.6%), top-3 set match 23/26 (88.5%). All 3 layers whose top-3 sets diverge still share the same top-1 expert; the differences appear only in near-tie third-slot positions, consistent with the new path reading post-tie-break selections while the old hook read raw router logits.
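
For readers who want to reproduce this kind of comparison, the two parity metrics can be computed as follows. This is a sketch: the function name and the input layout (one ranked list of expert IDs per layer, most-selected first) are assumptions, and the sample data is invented for illustration rather than taken from the run above.

```python
def parity(old_ranked: list[list[int]],
           new_ranked: list[list[int]],
           k: int = 3) -> tuple[float, float]:
    """Return (top-1 match rate, top-k set match rate) across layers."""
    if len(old_ranked) != len(new_ranked) or not old_ranked:
        raise ValueError("both inputs need the same non-zero number of layers")
    n_layers = len(old_ranked)
    # Top-1: the single most-selected expert must be identical.
    top1 = sum(o[0] == n[0] for o, n in zip(old_ranked, new_ranked))
    # Top-k set: the same k experts, in any order.
    topk = sum(set(o[:k]) == set(n[:k]) for o, n in zip(old_ranked, new_ranked))
    return top1 / n_layers, topk / n_layers


# Invented three-layer example.
old = [[4, 1, 7], [2, 5, 9], [3, 0, 6]]
new = [[4, 7, 1], [2, 5, 8], [0, 3, 6]]
top1_rate, top3_rate = parity(old, new)
```

The reported percentages follow from the same arithmetic: 22/26 rounds to 84.6% and 23/26 rounds to 88.5%.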

Recipes & docs

  • 110+ new ModelSpec entries in scripts/generate_configs.py (Gemma-4-31B 3/100 vLLM in-place, MiniMax-M2.7 LoRA, Qwen3.6-27B variants, etc.)
  • New docs/vllm.md § "vLLM 0.18 — 0.20.x Integration Knobs" documenting auto-set env vars, attention resolver, moe_backend, compile mode, LoRA pool, custom-all-reduce auto-detect, and the new routed_experts metadata flag.

Closed wontfix (documented limitation)

  • #23 (moe_eager_rest_compile) — GPU smoke confirmed vLLM 0.20.x PIECEWISE compile captures router/expert nn.Module forward calls into graph segments, bypassing PyTorch register_forward_hook. splitting_ops operates at the op level, hooks operate at the Module level — incompatible. The promised throughput recovery for MoE-editor users is unattainable under vLLM 0.20.x's compilation model. enforce_eager=true (or vllm_compile_mode='eager') remains required for any run using VLLMMoEEditor.
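
The resulting constraint can be expressed as a small config-load check. This is a sketch, not the project's validator: the function name and the uses_moe_editor flag are hypothetical, and whether abliterix raises an error (rather than, say, overriding the mode) when the MoE editor is combined with a non-eager mode is an assumption.

```python
ALLOWED_COMPILE_MODES = ("eager", "full_compile")


def validate_compile_mode(mode: str, uses_moe_editor: bool = False) -> str:
    """Return the mode if it is usable, otherwise raise ValueError."""
    if mode == "moe_eager_rest_compile":
        raise ValueError(
            "moe_eager_rest_compile is not supported (closed wontfix in #23)")
    if mode not in ALLOWED_COMPILE_MODES:
        raise ValueError(f"unknown vllm_compile_mode: {mode!r}")
    if uses_moe_editor and mode != "eager":
        # Forward hooks are bypassed under compilation, so the editor
        # needs eager execution (assumed here to be a hard error).
        raise ValueError("VLLMMoEEditor requires vllm_compile_mode='eager'")
    return mode
```

Failing at config load keeps the incompatibility visible before any GPU time is spent.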

Housekeeping

  • pre-commit ruff bumped v0.14.5 → v0.14.8 + pinned to python3.12
  • Standalone ruff format pass over scripts/ + tests/ (no semantic change)
  • README link/metric fixes for Qwen3.6-27B-abliterated

Full Changelog: v1.5.0...v1.6.0