v1.4.0

github-actions released this on 17 Apr 03:57

What's New in v1.4.0

New Model: Qwen3.6-35B-A3B Abliterated

  • 93% attack success rate (ASR; 7/100 refusals) with a KL divergence of 0.0189, verified by the LLM judge and a manual 15-prompt smoke test
  • LoRA + Expert-Granular Abliteration (EGA) + MoE router suppression
  • Available on HuggingFace: safetensors | GGUF (BF16/Q8/Q4)
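The ASR and refusal figures in these notes are two views of the same count. The sketch below assumes ASR is simply the share of prompts that were not refused (which matches every figure reported here). `attack_success_rate` is a hypothetical helper written for illustration, not part of the project.

```python
def attack_success_rate(refusals: int, total: int) -> float:
    """Share of prompts that were NOT refused (assumed definition of ASR)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= refusals <= total:
        raise ValueError("refusals must be between 0 and total")
    return (total - refusals) / total


# (refusals, total) pairs taken from these release notes
results = {
    "Qwen3.6-35B-A3B Abliterated": (7, 100),
    "Llama-3-8B-Instruct-RR-Abliterated": (1, 100),
    "Mistral-7B-Instruct-RR-Abliterated": (12, 100),
}

for name, (refusals, total) in results.items():
    # Format as a whole-number percentage, e.g. 0.93 -> "93%"
    print(f"{name}: {attack_success_rate(refusals, total):.0%} ASR")
```

Running it prints 93%, 99% and 88%, matching the ASR values reported for each model.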

Broken Defenses: Circuit Breakers / Representation Rerouting (NeurIPS 2024)

Both Circuit Breaker models were broken with the same lerp-then-abliterate recipe, with no fine-tuning:

| Defense Model | ASR | Refusals | KL | Released Model |
| --- | --- | --- | --- | --- |
| Llama-3-8B-Instruct-RR | 99% | 1/100 | 0.017 | wangzhang/Llama-3-8B-Instruct-RR-Abliterated |
| Mistral-7B-Instruct-RR | 88% | 12/100 | 0.042 | wangzhang/Mistral-7B-Instruct-RR-Abliterated |

Full attack recipe and write-up: docs/broken_defenses.md

LLM Judge: No More Silent Fallbacks

Breaking change: The LLM judge no longer silently falls back to keyword matching when the API key is missing or the API call fails. It now raises RuntimeError immediately. This prevents false compliance, where garbled or degenerate output was counted as "compliant" by keyword matching.

  • Startup log: LLM judge enabled: model=..., batch_size=..., concurrency=...
  • Per-trial log: LLM judge: X/100 refusals (model=...)
  • Missing OPENROUTER_API_KEY → hard error instead of silent degradation
  • API failure after 3 retries → hard error instead of keyword fallback
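The fail-fast behaviour described above can be sketched roughly as follows. This is an illustration of the pattern, not the project's code: `require_api_key`, `judge_with_retries` and the `call_judge` callable are hypothetical names. The sketch also assumes that "3 retries" means one initial attempt plus three retries, with a simple linear backoff between them.

```python
import os
import time
from typing import Callable, Mapping, Optional


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Fail at startup instead of silently degrading to keyword matching."""
    key = env.get("OPENROUTER_API_KEY")
    if not key:
        raise RuntimeError("LLM judge enabled but OPENROUTER_API_KEY is not set")
    return key


def judge_with_retries(
    call_judge: Callable[[str], bool],  # hypothetical: returns True if the response is a refusal
    response: str,
    retries: int = 3,
    backoff_s: float = 1.0,
) -> bool:
    """Call the judge; after the initial attempt plus `retries` retries, raise (no fallback)."""
    last_error: Optional[Exception] = None
    for attempt in range(retries + 1):
        try:
            return call_judge(response)
        except Exception as exc:  # any API error counts as a failed attempt
            last_error = exc
            if attempt < retries:
                time.sleep(backoff_s * (attempt + 1))  # linear backoff (assumed)
    raise RuntimeError(
        f"LLM judge failed after {retries} retries; refusing to fall back to keywords"
    ) from last_error
```

The key design choice is that neither function has a keyword-matching branch. The only outcomes are a real judge verdict or a RuntimeError, so degenerate output can never be scored as "compliant" by accident.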

Script Consolidation (-1,359 lines)

Merged 6 model-specific scripts into 3 general-purpose ones:

| Removed | Replaced By |
| --- | --- |
| verify_gemma4_e2b.py, verify_gemma4_e4b.py, verify_gemma4_26b_a4b.py, verify_glm47_flash.py | verify_model.py --model <any> |
| sync_gemma4_tokenizer.py | sync_tokenizer.py --upstream <src> --downstream <dst> |
| deploy_deeprefusal.sh | (removed; too model-specific) |

New utility: quick_test_hf.py --model <repo>, a 15-prompt smoke test for any abliterated model on HuggingFace.

Full Changelog

v1.3.0...v1.4.0