
v1.1.0 — Direct Steering, EGA, Gemma 4


wuwangzhang1216 released this 10 Apr 02:37

What's New

Direct Weight Editing (for double-norm architectures)

  • Norm-preserving orthogonal projection, computed in float32. This is required for Gemma 4's 4× RMSNorm + PLE architecture, where LoRA and hook-based steering are completely ineffective
  • Q/K/V/O projections added as steerable targets (5 components per layer, up from 2 previously)
  • Wider strength search ranges [1.0, 6.0] to push through low-KL plateaus
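
One way to read "norm-preserving orthogonal projection" is sketched below with NumPy. This is an illustration under stated assumptions, not the project's actual code: the function name `ablate_direction`, the `(d_out, d_in)` weight layout, and the choice to preserve column norms (rather than row or Frobenius norms) are all hypothetical.

```python
import numpy as np


def ablate_direction(weight, direction, strength=1.0):
    """Project `direction` out of a weight's output space, keeping column norms.

    weight:    (d_out, d_in) matrix that writes into the residual stream.
    direction: (d_out,) refusal direction; normalized internally.
    """
    orig_dtype = weight.dtype
    w = weight.astype(np.float32)  # do the arithmetic in float32
    r = direction.astype(np.float32)
    r /= np.linalg.norm(r)

    # Each column is one vector the layer can write; remember its length.
    col_norms = np.linalg.norm(w, axis=0, keepdims=True)

    # W <- W - strength * r (r^T W); strength=1.0 is an exact orthogonal projection.
    w = w - strength * np.outer(r, r @ w)

    # Rescale every column back to its original length.
    new_norms = np.linalg.norm(w, axis=0, keepdims=True)
    w *= col_norms / np.maximum(new_norms, 1e-12)
    return w.astype(orig_dtype)
```

With `strength=1.0` the edited matrix writes nothing along the direction while each column keeps its original length, so the scale seen by the downstream norm layers is unchanged. Strengths above 1.0 overshoot the projection before the rescale.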

Expert-Granular Abliteration (EGA)

  • Projects the refusal direction out of every expert's down_proj slice in every MoE layer
  • Unlike top-N approaches, EGA addresses the refusal signal distributed across all experts
  • Gemma 4 26B-A4B support: router.proj + experts.down_proj paths
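
The per-expert projection described above can be sketched as follows. It is a minimal illustration, assuming the experts' down_proj weights are stored as one fused array of shape `(num_experts, d_model, d_ff)`; that layout and the function name `ega_ablate` are hypothetical, and the sketch does not touch router.proj.

```python
import numpy as np


def ega_ablate(expert_down_proj, direction):
    """Remove `direction` from the output of every expert's down_proj slice.

    expert_down_proj: (num_experts, d_model, d_ff) fused tensor (assumed layout).
    direction:        (d_model,) refusal direction; normalized internally.
    """
    w = expert_down_proj.astype(np.float32)
    r = direction.astype(np.float32)
    r /= np.linalg.norm(r)

    edited = np.empty_like(w)
    for e in range(w.shape[0]):  # every expert, not just a top-N subset
        edited[e] = w[e] - np.outer(r, r @ w[e])
    return edited.astype(expert_down_proj.dtype)
```

Because the loop covers every slice, no expert can reintroduce the direction when the router happens to select it, which is the gap a top-N edit would leave open.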

Gemma 4 Results

Honest Evaluation Methodology

  • Fixed the eval detector: it now reads max_gen_tokens from the config instead of using a hardcoded 50 tokens
  • Documented the "delayed refusal" problem that inflates scores across the community
  • Cross-validated: a prominent "3/100 refusals" model actually scores 60/100 under our pipeline
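
The "delayed refusal" effect can be shown with a toy detector. This is a sketch only: the marker list, the whitespace tokenization, and the function name `is_refusal` are stand-ins chosen for illustration, not the project's detector.

```python
# Hypothetical refusal phrases; a real detector would use a richer list or a classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def is_refusal(response, max_gen_tokens):
    """Flag a response as a refusal if a marker appears in the first N tokens."""
    tokens = response.split()  # whitespace split as a stand-in for model tokens
    window = " ".join(tokens[:max_gen_tokens]).lower()
    return any(marker in window for marker in REFUSAL_MARKERS)


# A response that looks compliant at first and only refuses after 61 tokens.
delayed = " ".join(["Sure,"] + ["here"] * 60 + ["However,", "I", "cannot", "help", "with", "that."])

missed = is_refusal(delayed, max_gen_tokens=50)   # short window never sees the refusal
caught = is_refusal(delayed, max_gen_tokens=256)  # full window does
```

A detector limited to a short prefix counts this response as compliant, which is how a low refusal count can be reported for a model that still refuses later in its output.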

Other

  • SGLang backend support
  • vLLM native hidden state extraction
  • Step-3.5-Flash MTP layer_types patch
  • Removed sglang extra from pyproject (imageio conflict)
  • CI fully green: lint, typecheck, tests (3.10/3.11/3.12), build
  • Cleaned up obsolete configs, scripts, deploy files (-3000 lines)