Skip to content

v1.7.0 — custom encoder framework + DeepSeek-V4-Flash + Granite 4.1 8B recipes

Choose a tag to compare

@github-actions github-actions released this 26 May 03:54
· 10 commits to master since this release

Custom encoder framework (#52)

Models that ship a Python encoding_*.py instead of a Jinja chat_template are now first-class. Two new ModelConfig fields, both None-defaulted so existing recipes keep working unchanged:

  • custom_encoder_module — filesystem path to a Python module exporting encode_messages(messages: list[dict], **kw) -> str.
  • custom_encoder_kwargs — kwargs forwarded to encode_messages (e.g. {thinking_mode = "non-thinking"} for DeepSeek-V4).

On engine init, abliterix imports the module and monkey-patches tokenizer.apply_chat_template so the rest of the pipeline (tokenisation, hidden-state extraction, steering) stays unchanged. Fails fast on a missing file or missing encode_messages symbol.

DeepSeek-V4-Flash recipe (#52)

DSV4-Flash (284B / 13B active, 256 routed experts top-6, 43 layers) end-to-end:

  • configs/deepseek_v4_flash.toml — HF + EGA path, 4× B200 sizing (165 GiB/card cap per b200_real_vram.md), experts_implementation="eager" per blackwell_grouped_mm_eager.md.
  • quick_start/deploy_dsv4_flash.sh — pod prereq checks, resolves BF16 source (prefers unsloth/DeepSeek-V4-Flash, falls back to deepseek-ai/DeepSeek-V4-Flash + offline dequant), pre-flight 1-token smoke test, launch.
  • quick_start/_dsv4_dequant_fp4_experts.py — FP4 expert dequant helper. Strategy 1: trust the modeling code (Mxfp4Config(dequantize=True)). Strategy 2: manual NVFP4 e2m1 unpack on safetensors shards. Strategy 3: diagnostic dump.
  • quick_start/probe_dsv4_residual.py — mHC residual sanity probe answering one go/no-go question (does Sinkhorn-iterated residual mixing preserve a clean rank-1 refusal direction?) before burning 50 trials of optimiser budget. Reports per-layer residual delta, SVD top-1 concentration ratio, and consecutive-layer cosine drift with a reference profile from MiniMax-M2.7.

Engine-level detection: when transformers' AutoConfig reports quant_method="fp8" AND expert_dtype="fp4", abliterix prints a yellow warning pointing the user at the pre-dequant path (the hybrid layout can't be dequanted in-memory and EGA needs writable BF16 expert tensors).

Why HF + BF16 + DIRECT + EGA, not vLLM (verified 2026-05-05): vLLM 0.20.0+ registers DeepseekV4ForCausalLM and runs inference natively in FP8+FP4 (vllm-project/vllm#40760), but its Unsupported list excludes LoRA, EP, PP, in-place weight editing, AND hooks — every abliterix steering primitive. HF on BF16-dequanted weights remains the only working path.

Granite 4.1 8B recipes (#53)

Six BF16 LoRA configs for ibm-granite/granite-4.1-8b (dense GraniteForCausalLM, 40 layers, Apache-2.0) covering the V1 broad sweep + five quality-tuned Pareto-frontier strategies:

  • granite4.1_8b.toml — baseline 50-trial mean direction
  • granite4.1_8b_lowkl.toml — narrow search around V1's trial-7 Pareto point (13/200 refusals @ KL 0.3185)
  • granite4.1_8b_klfirst.toml — KL-first SRA below the refusal cliff (residual-output path only)
  • granite4.1_8b_sra_quality.toml — SRA (Surgical Refusal Ablation) over the mean signal
  • granite4.1_8b_ot_quality.toml — optimal_transport direction with k/v pruning
  • granite4.1_8b_cosmic_quality.toml — COSMIC cosine-similarity direction selection

quick_start/deploy_granite41_8b.sh runs on a single GPU (≥24 GiB VRAM, 48/80 GiB recommended) with GPU/disk/dataset pre-flight checks before installing deps.

Housekeeping

  • Fix: stale scripts/dequant_dsv4_fp4.py path in the DSV4 hybrid-quant warning corrected to quick_start/_dsv4_dequant_fp4_experts.py.
  • .gitignore: added /.scratch/ for one-off scratch dirs.

Full Changelog: v1.6.0...v1.7.0