v1.6.0 — vLLM 0.18-0.20.x modernization + routed_experts refactor
vLLM 0.18 — 0.20.x integration (PRD #20, #21)
abliterix's vLLM backend is now version-aware: it refuses to start against vllm < 0.18 (VLLM_ALLOW_INSECURE_SERIALIZATION is required for MoE editing) and warns on vllm >= 0.21 until those releases have been smoke-tested.
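The gating rule above can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the project's actual code: the names `check_vllm_version` and `parse_major_minor` are hypothetical, and only the two thresholds (0.18 and 0.21) come from these notes.

```python
import re
import warnings

MIN_SUPPORTED = (0, 18)   # below this the backend refuses to start
FIRST_UNTESTED = (0, 21)  # at or above this it only warns


def parse_major_minor(version: str) -> tuple:
    """Extract (major, minor) from strings such as '0.20.1' or '0.19.0rc1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised vLLM version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_vllm_version(version: str) -> None:
    """Hypothetical gate: raise on too-old versions, warn on untested ones."""
    major_minor = parse_major_minor(version)
    if major_minor < MIN_SUPPORTED:
        raise RuntimeError(f"vLLM {version} is older than 0.18; refusing to start")
    if major_minor >= FIRST_UNTESTED:
        warnings.warn(f"vLLM {version} has not been smoke-tested", RuntimeWarning)
```

Comparing `(major, minor)` tuples sidesteps string comparison pitfalls such as `"0.9" > "0.18"`.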
New [model] config fields, all with safe defaults so existing recipes keep working unchanged:
- attention_backend: explicit override; None (default) auto-routes MLA models (DeepSeek-V2/V3, MiniMax-M2.x) to FLASH_ATTN_MLA, gpt-oss to TRITON_ATTN, and everything else to vLLM's default.
- moe_backend (default "triton"): replaces the deprecated VLLM_FUSED_MOE_UNQUANTIZED_BACKEND env var. Skips the FlashInfer cutlass per-expert-group JIT compile that costs 30+ minutes on first sm_90 cold start.
- disable_custom_all_reduce: None auto-detects the Blackwell PCIe (sm_120) deadlock case; NVLink Hopper / SXM Blackwell keep the perf win.
- limit_mm_per_prompt: defaults to {image:0, video:0, audio:0} so the Punica LoRA wrapper accepts hybrid VLM/MoE architectures without crashing on visual.* modules.
- vllm_max_loras / vllm_max_lora_rank / lora_target_modules: LoRA pool + restriction knobs (--lora-target-modules from vLLM PR #34984, v0.19.0+).
- vllm_compile_mode: "eager" (default) / "full_compile" selector for vLLM's compilation_config. The third option "moe_eager_rest_compile" is rejected at config load (closed wontfix in #23 after GPU smoke confirmed PIECEWISE bypasses module hooks).
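As an illustration of the attention_backend auto-routing described above, the following sketch resolves a backend from a model identifier. It assumes detection by name fragment, which is a simplification made for this example; the real resolver may inspect the model config instead. The function name and the marker tuples are hypothetical.

```python
from typing import Optional

# Hypothetical name fragments used only for this sketch.
MLA_MARKERS = ("deepseek-v2", "deepseek-v3", "minimax-m2")
GPT_OSS_MARKERS = ("gpt-oss",)


def resolve_attention_backend(model_id: str, override: Optional[str] = None) -> Optional[str]:
    """Return the attention backend name, or None to keep vLLM's default."""
    if override is not None:
        return override  # an explicit setting always wins
    name = model_id.lower()
    if any(marker in name for marker in MLA_MARKERS):
        return "FLASH_ATTN_MLA"
    if any(marker in name for marker in GPT_OSS_MARKERS):
        return "TRITON_ATTN"
    return None  # fall through to vLLM's own selection
```

For example, `resolve_attention_backend("deepseek-ai/DeepSeek-V2-Lite-Chat")` yields `"FLASH_ATTN_MLA"`, while an unrecognised model returns `None`.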
New deep modules:
- abliterix.core.vllm_compat: version gating + idempotent env var setup (FLASHINFER_DISABLE_VERSION_CHECK=1 always, VLLM_ALLOW_INSECURE_SERIALIZATION=1 only when needed).
- abliterix.core.vllm_compilation_config: single-purpose builder for the compilation_config dict; centralises schema knowledge so the rest of the codebase expresses intent ("eager" vs "full_compile") instead of juggling mode / cudagraph_mode integers.
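The idempotent env var setup can be pictured roughly as follows. This is a sketch, not the contents of abliterix.core.vllm_compat: the function name `setup_vllm_env` and its `needs_insecure_serialization` parameter are hypothetical, and whether the real module overwrites a value the user already exported is an assumption here.

```python
import os
from typing import MutableMapping


def setup_vllm_env(
    needs_insecure_serialization: bool,
    env: MutableMapping = os.environ,
) -> None:
    """Set the env vars named in the release notes; safe to call repeatedly."""
    # Always set, per the notes.
    env["FLASHINFER_DISABLE_VERSION_CHECK"] = "1"
    # Only set when the caller actually needs it (assumed: MoE editing paths).
    if needs_insecure_serialization:
        env["VLLM_ALLOW_INSECURE_SERIALIZATION"] = "1"
```

Passing a plain dict as `env` keeps the function testable without touching the real process environment.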
_build_llm_kwargs extracted from VLLMGenerator.__init__ as a pure function (PR #21 review #7) — every conditional kwarg branch is now unit-testable without importing vLLM. Coverage: 51 new CPU-only tests across tests/test_vllm_compat.py, tests/test_vllm_compilation_config.py, tests/test_vllm_backend_kwargs.py.
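To show why a pure kwargs builder is testable without vLLM installed, here is a reduced sketch. The `ModelSettings` dataclass and `build_llm_kwargs` are hypothetical stand-ins, not the real `_build_llm_kwargs`; the mapping from config fields to engine keyword names is an assumption for illustration and covers only a few of the fields listed above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ModelSettings:
    """Hypothetical stand-in for a subset of the [model] config section."""
    model: str
    disable_custom_all_reduce: Optional[bool] = None
    limit_mm_per_prompt: Dict[str, int] = field(
        default_factory=lambda: {"image": 0, "video": 0, "audio": 0}
    )
    vllm_max_loras: Optional[int] = None
    vllm_max_lora_rank: Optional[int] = None


def build_llm_kwargs(cfg: ModelSettings) -> Dict[str, Any]:
    """Build engine kwargs from config; no vLLM import, so it runs on CPU-only CI."""
    kwargs: Dict[str, Any] = {
        "model": cfg.model,
        "limit_mm_per_prompt": dict(cfg.limit_mm_per_prompt),
    }
    # Optional knobs are only forwarded when the user (or auto-detect) set them.
    if cfg.disable_custom_all_reduce is not None:
        kwargs["disable_custom_all_reduce"] = cfg.disable_custom_all_reduce
    if cfg.vllm_max_loras is not None:
        kwargs["enable_lora"] = True
        kwargs["max_loras"] = cfg.vllm_max_loras
    if cfg.vllm_max_lora_rank is not None:
        kwargs["max_lora_rank"] = cfg.vllm_max_lora_rank
    return kwargs
```

Because the function returns a plain dict, each conditional branch can be asserted on directly in a unit test.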
Architectural cleanup — routed_experts replaces collective_rpc MoE probe (#22, #24)
vLLM 0.20.x exposes per-token routed expert IDs on RequestOutput.outputs[0].routed_experts (numpy ndarray of shape (prompt_tokens, n_layers, top_k)). abliterix's MoE safety-expert profiler now reads this directly and aggregates driver-side.
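Driver-side aggregation over an array of shape (prompt_tokens, n_layers, top_k) might look like the sketch below. It is written against plain nested sequences so it runs without NumPy; an ndarray can be passed as-is or via `.tolist()`. The function name `count_routed_experts` is hypothetical, and the profiler's real aggregation may weight or normalise counts differently.

```python
from collections import Counter
from typing import List


def count_routed_experts(routed_experts) -> List[Counter]:
    """Count how often each expert ID was selected, per layer.

    routed_experts: nested sequence shaped (prompt_tokens, n_layers, top_k).
    Returns one Counter per layer, indexed by layer position.
    """
    per_layer: List[Counter] = []
    for token in routed_experts:
        for layer_idx, expert_ids in enumerate(token):
            if layer_idx == len(per_layer):
                per_layer.append(Counter())  # first time this layer is seen
            per_layer[layer_idx].update(int(e) for e in expert_ids)
    return per_layer
```

Calling `.most_common(3)` on a layer's Counter then gives that layer's most frequently routed experts.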
- Deletes 4 worker rpc functions + ~150 LoC of forward-hook plumbing.
- New [model].vllm_return_routed_experts (default True); set false to fall back to the legacy hook path.
- Removes one of the two reasons abliterix needed VLLM_ALLOW_INSECURE_SERIALIZATION=1 (the other, VLLMMoEEditor.apply() per-trial suppression, still requires it).
GPU-verified parity on DeepSeek-V2-Lite-Chat (issue #22 Phase B7b on H100 NVL): top-1 expert match 22/26 layers (84.6%), top-3 set match 23/26 (88.5%). All 3 divergent layers share top-1; differences appear only in near-tie 3rd slot positions, consistent with the new path reading post-tie-break selections vs the old hook reading raw router logits.
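The two parity figures are per-layer match rates (22/26 ≈ 84.6%, 23/26 ≈ 88.5%). A minimal sketch of how such rates can be computed from two per-layer rankings follows; the function name `parity` and the input layout (one list of expert IDs per layer, most-activated first) are assumptions for illustration, not the verification script used in issue #22.

```python
from typing import List, Tuple


def parity(old: List[List[int]], new: List[List[int]], k: int = 3) -> Tuple[float, float]:
    """Return (top-1 match rate, top-k set match rate) across layers.

    old / new: per-layer expert IDs, ranked most-activated first.
    """
    if len(old) != len(new) or not old:
        raise ValueError("both profiles must cover the same, non-empty set of layers")
    n_layers = len(old)
    # Top-1: the single highest-ranked expert must be identical.
    top1_hits = sum(a[0] == b[0] for a, b in zip(old, new))
    # Top-k: order is ignored, only set membership is compared.
    topk_hits = sum(set(a[:k]) == set(b[:k]) for a, b in zip(old, new))
    return top1_hits / n_layers, topk_hits / n_layers
```

Comparing sets for the top-k figure means a swap between the 2nd and 3rd slot still counts as a match, while a different expert entering the 3rd slot does not.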
Recipes & docs
- 110+ new ModelSpec entries in scripts/generate_configs.py (Gemma-4-31B 3/100 vLLM in-place, MiniMax-M2.7 LoRA, Qwen3.6-27B variants, etc.)
- New docs/vllm.md § "vLLM 0.18 — 0.20.x Integration Knobs" documenting auto-set env vars, attention resolver, moe_backend, compile mode, LoRA pool, custom-all-reduce auto-detect, and the new routed_experts metadata flag.
Closed wontfix (documented limitation)
- #23 (moe_eager_rest_compile): GPU smoke confirmed that vLLM 0.20.x PIECEWISE compile captures router/expert nn.Module forward calls into graph segments, bypassing PyTorch register_forward_hook. splitting_ops operates at the op level while hooks operate at the Module level, so the two are incompatible. The promised throughput recovery for MoE-editor users is unattainable under vLLM 0.20.x's compilation model. enforce_eager=true (or vllm_compile_mode='eager') remains required for any run using VLLMMoEEditor.
Housekeeping
- pre-commit ruff bumped v0.14.5 → v0.14.8 + pinned to python3.12
- Standalone ruff format pass over scripts/ + tests/ (no semantic change)
- README link/metric fixes for Qwen3.6-27B-abliterated
Full Changelog: v1.5.0...v1.6.0