Release v1.4.0 · wuwangzhang1216/abliterix

What's New in v1.4.0

New Model: Qwen3.6-35B-A3B Abliterated

93% ASR (7/100 refusals), KL divergence 0.0189 — verified by LLM judge + manual 15-prompt smoke test
LoRA + Expert-Granular Abliteration (EGA) + MoE router suppression
Available on HuggingFace: safetensors | GGUF (BF16/Q8/Q4)
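
The ASR figure follows directly from the refusal count: 7 refusals out of 100 prompts leaves 93 non-refused responses. A minimal sketch of that arithmetic, where the function name `attack_success_rate` is hypothetical and not part of the abliterix codebase:

```python
def attack_success_rate(refusals: int, total: int) -> float:
    """Return the fraction of prompts that were NOT refused."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= refusals <= total:
        raise ValueError("refusals must be between 0 and total")
    return (total - refusals) / total


# 7 refusals out of 100 prompts, as reported for this release
print(f"{attack_success_rate(7, 100):.0%}")  # prints 93%
```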

Broken Defenses: Circuit Breakers / Representation Rerouting (NeurIPS 2024)

Both Circuit Breaker models broken with the same lerp-then-abliterate recipe — zero fine-tuning:

| Defense Model | ASR | Refusals | KL | Released Model |
| --- | --- | --- | --- | --- |
| Llama-3-8B-Instruct-RR | 99% | 1/100 | 0.017 | wangzhang/Llama-3-8B-Instruct-RR-Abliterated |
| Mistral-7B-Instruct-RR | 88% | 12/100 | 0.042 | wangzhang/Mistral-7B-Instruct-RR-Abliterated |

Full attack recipe and write-up: docs/broken_defenses.md

LLM Judge: No More Silent Fallbacks

Breaking change: The LLM judge no longer silently falls back to keyword matching when the API key is missing or the API fails. It now raises RuntimeError immediately. This prevents the false-compliance problem where garbled/degenerate output was counted as "compliant" by keyword matching.

Startup log: LLM judge enabled: model=..., batch_size=..., concurrency=...
Per-trial log: LLM judge: X/100 refusals (model=...)
Missing OPENROUTER_API_KEY → hard error instead of silent degradation
API failure after 3 retries → hard error instead of keyword fallback
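
The fail-fast behavior described above can be illustrated with a short sketch. This is not the project's actual implementation: `require_api_key`, `judge_refusals`, and the `call_judge` callback are hypothetical names, and the only details taken from the release notes are the `OPENROUTER_API_KEY` variable, the 3-retry limit, and the `RuntimeError`.

```python
import os
import time
from typing import Callable, Mapping, Sequence

MAX_RETRIES = 3  # from the release notes: hard error after 3 retries


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key, or raise instead of degrading to keyword matching."""
    key = env.get("OPENROUTER_API_KEY")
    if not key:
        raise RuntimeError("OPENROUTER_API_KEY is not set; LLM judge cannot run")
    return key


def judge_refusals(
    responses: Sequence[str],
    call_judge: Callable[[Sequence[str]], Sequence[bool]],  # hypothetical API wrapper
    retries: int = MAX_RETRIES,
    backoff_s: float = 0.0,
) -> int:
    """Count refusals via the judge; raise RuntimeError once retries are exhausted."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            verdicts = call_judge(responses)  # True means "refusal"
            return sum(1 for v in verdicts if v)
        except Exception as exc:  # any API failure counts toward the retry budget
            last_error = exc
            time.sleep(backoff_s * attempt)
    # No keyword fallback: surface the failure to the caller.
    raise RuntimeError(f"LLM judge failed after {retries} attempts") from last_error
```

Raising here means a trial with an unreachable judge stops the run rather than reporting a refusal count that keyword matching might have underestimated.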

Script Consolidation (-1,359 lines)

Merged 6 model-specific scripts into 3 general-purpose ones:

| Removed | Replaced By |
| --- | --- |
| `verify_gemma4_e2b.py`, `verify_gemma4_e4b.py`, `verify_gemma4_26b_a4b.py`, `verify_glm47_flash.py` | `verify_model.py --model <any>` |
| `sync_gemma4_tokenizer.py` | `sync_tokenizer.py --upstream <src> --downstream <dst>` |
| `deploy_deeprefusal.sh` | (removed; too model-specific) |
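
The consolidation pattern amounts to moving the model name out of the filename and into a command-line argument. A minimal sketch of that idea, assuming an `argparse`-based entry point: only the `--model` flag comes from the release notes, while the parser layout and the `main` function are illustrative and do not reflect the real `verify_model.py`.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one required --model flag instead of one script per model."""
    parser = argparse.ArgumentParser(
        prog="verify_model.py",
        description="Verify any model selected at the command line (illustrative).",
    )
    parser.add_argument("--model", required=True, help="model path or repo id")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> str:
    """Parse arguments and return the selected model identifier."""
    args = build_parser().parse_args(argv)
    # Real verification logic would go here; this sketch only echoes the target.
    print(f"verifying {args.model}")
    return args.model


# Example call with a placeholder model id (not a real repository)
main(["--model", "some-org/some-model"])
```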

New utility: quick_test_hf.py --model <repo> — 15-prompt smoke test for any abliterated model on HuggingFace.

Full Changelog

v1.3.0...v1.4.0
