v1.1.0 — Direct Steering, EGA, Gemma 4

wuwangzhang1216 released this 10 Apr 02:37

6664a4a

What's New

Direct Weight Editing (for double-norm architectures)

Norm-preserving orthogonal projection in float32 — required for Gemma 4's 4× RMSNorm + PLE architecture where LoRA and hook-based steering are completely ineffective
Q/K/V/O projections as steerable targets (5 components per layer vs 2 previously)
Wider strength search ranges [1.0, 6.0] to push through low-KL plateaus
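The release does not spell out the projection itself, so the following is only a minimal sketch of one plausible reading: remove the component of a weight matrix's output along a refusal direction, then rescale each column back to its original norm, doing the arithmetic in float32. The function name, the `(d_out, d_in)` layout, and the choice of per-column norms are assumptions for illustration, not the project's actual code.

```python
import numpy as np

def project_out_direction(weight, direction, strength=1.0, eps=1e-8):
    """Remove `direction` from the output space of `weight`, keeping column norms.

    Assumed layout (hypothetical): weight is (d_out, d_in), direction is (d_out,).
    """
    orig_dtype = weight.dtype
    # Do the arithmetic in float32 even if the checkpoint is stored in half precision.
    w = weight.astype(np.float32)
    r = direction.astype(np.float32)
    r = r / np.linalg.norm(r)

    # Norm of each column before editing, kept for the rescaling step.
    col_norms = np.linalg.norm(w, axis=0, keepdims=True)

    # Orthogonal projection: subtract the part of every column that lies along r.
    # strength=1.0 removes it exactly; larger values overshoot past zero.
    w_proj = w - strength * np.outer(r, r @ w)

    # Restore the original column norms. Scaling a column does not change its
    # direction, so at strength=1.0 it stays orthogonal to r.
    new_norms = np.linalg.norm(w_proj, axis=0, keepdims=True)
    w_proj = w_proj * (col_norms / np.maximum(new_norms, eps))

    return w_proj.astype(orig_dtype)
```

Rescaling columns rather than rows is a deliberate choice in this sketch: it keeps the edited matrix exactly orthogonal to the direction at strength 1.0, whereas row-wise rescaling would reintroduce a small component along it.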

Expert-Granular Abliteration (EGA)

Projects the refusal direction out of every expert's down_proj slice in every MoE layer
Unlike top-N approaches, EGA addresses refusal signal distributed across all experts
Gemma 4 26B-A4B support: router.proj + experts.down_proj paths
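As a rough illustration of the "all experts" idea, the sketch below removes a direction from every expert slice of a stacked down_proj tensor in one pass. The `(num_experts, hidden, d_ff)` layout and the function name are assumptions made for the example; the real Gemma 4 parameter layout may differ.

```python
import numpy as np

def ablate_all_experts(down_proj, direction):
    """Project `direction` out of the output space of every expert slice.

    Assumed layout (hypothetical): down_proj is (num_experts, hidden, d_ff),
    direction is (hidden,).
    """
    w = down_proj.astype(np.float32)
    r = direction.astype(np.float32)
    r = r / np.linalg.norm(r)

    # For each expert e and each input feature f, the coefficient of that
    # column along r. Shape: (num_experts, d_ff).
    coeff = np.einsum("h,ehf->ef", r, w)

    # Subtract r * coeff from every expert at once, with no top-N selection.
    w = w - np.einsum("h,ef->ehf", r, coeff)

    return w.astype(down_proj.dtype)
```

A top-N variant would apply the same subtraction to a subset of the expert axis only, leaving the remaining experts free to carry the direction.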

Gemma 4 Results

Gemma 4 31B: 18/100 refusals (baseline 99/100) with only 20 warmup trials
Uploaded to HuggingFace: wangzhang/gemma-4-31B-it-abliterated

Honest Evaluation Methodology

Fixed eval detector: uses max_gen_tokens from config instead of hardcoded 50 tokens
Documented the "delayed refusal" problem that inflates scores across the community
Cross-validated: a prominent "3/100 refusals" model actually scores 60/100 under our pipeline
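The "delayed refusal" effect can be shown with a toy detector. This is a simplified sketch, not the project's evaluator: the marker list is hypothetical, and whitespace splitting stands in for a real tokenizer. Only the `max_gen_tokens` name comes from the release.

```python
# Hypothetical marker phrases; a real detector would use a larger, tuned list.
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't")

def is_refusal(response, max_gen_tokens, markers=REFUSAL_MARKERS):
    """Return True if a refusal marker appears within the first max_gen_tokens tokens."""
    tokens = response.split()  # crude stand-in for model tokenization
    window = " ".join(tokens[:max_gen_tokens]).lower()
    return any(marker in window for marker in markers)

def count_refusals(responses, max_gen_tokens):
    """Count how many responses are flagged as refusals."""
    return sum(is_refusal(r, max_gen_tokens) for r in responses)

# A response that appears to comply at first and only refuses after 60 tokens.
delayed = " ".join(["filler"] * 60) + " I cannot continue with this request."

short_window = is_refusal(delayed, max_gen_tokens=50)   # refusal lies outside the window
full_window = is_refusal(delayed, max_gen_tokens=256)   # refusal lies inside the window
```

With the 50-token window the delayed refusal is missed and the model looks more compliant than it is. Reading the window size from the generation config avoids that mismatch.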

Other

SGLang backend support
vLLM native hidden state extraction
Step-3.5-Flash MTP layer_types patch
Removed sglang extra from pyproject (imageio conflict)
CI fully green: lint, typecheck, tests (3.10/3.11/3.12), build
Cleaned up obsolete configs, scripts, deploy files (-3000 lines)
