abliterix v1.3.0 — DeepRefusal Broken
The headline: DeepRefusal (arXiv:2509.15202), which claimed 0.2–0.4% refusal-attack ASR and was described as uncircumventable in its own paper, is broken. Attenuate its rank-16 LoRA delta, run standard single-pass abliteration, get 89% ASR with 14/15 hardcore prompts fully compliant at KL=0.05.
Released model: wuwangzhang1216/Llama-3-8B-DeepRefusal-Broken
Highlights
Breaking DeepRefusal
DeepRefusal publishes a Llama-3-8B checkpoint hardened against direction-ablation attacks via probabilistic direction ablation during fine-tuning. Table 1 of the paper reports every known attack failing against it:
| Attack | ASR against DeepRefusal |
|---|---|
| heretic | fails (unable to circumvent) |
| Refusal Ablation | 0.4% |
| Refusal-Transfer | 0.4% |
| GCG | 2.0% |
| Prefilling | 0.4% |
| CodeAttack | 0.2% |
| abliterix 1.3.0 (this release, not from the paper) | 89% |
How the attack works (in `scripts/deeprefusal_attenuate.py`):
- Diff the defended weights against the Llama-3-8B base: `ΔW = W_def − W_base`.
- SVD analysis confirms the paper's rank-16 LoRA claim: `v_proj`, `gate_proj`, and `o_proj` all show a sharp singular-value cliff at rank 16.
- Attenuate: `W_attenuated = W_base + λ·ΔW` with `λ ≈ 0.5`.
- Run standard single-pass abliteration on the attenuated model.
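The diff, rank check, and attenuation steps above can be sketched on a toy matrix. This is a minimal NumPy illustration, not the contents of `scripts/deeprefusal_attenuate.py`: the helper names `effective_rank` and `attenuate`, the matrix sizes, the `rel_tol` threshold, and the synthetic rank-16 update are all assumptions made for the example.

```python
import numpy as np


def effective_rank(delta: np.ndarray, rel_tol: float = 1e-6) -> int:
    """Count singular values above rel_tol times the largest one."""
    s = np.linalg.svd(delta, compute_uv=False)
    if s[0] == 0:
        return 0
    return int(np.sum(s > rel_tol * s[0]))


def attenuate(w_base: np.ndarray, w_def: np.ndarray, lam: float) -> np.ndarray:
    """Return W_base + lam * (W_def - W_base)."""
    return w_base + lam * (w_def - w_base)


rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 16  # toy sizes, far smaller than a real layer

w_base = rng.standard_normal((d_out, d_in))
# Toy stand-in for a merged LoRA update: B @ A has rank at most r.
lora_b = rng.standard_normal((d_out, r))
lora_a = rng.standard_normal((r, d_in))
w_def = w_base + 0.1 * (lora_b @ lora_a)

delta = w_def - w_base                      # step 1: diff against the base
rank = effective_rank(delta)                # step 2: look for the rank cliff
w_att = attenuate(w_base, w_def, lam=0.5)   # step 3: keep half of the delta

print("effective rank of delta:", rank)
```

Because the toy delta is built from a rank-16 product, the singular values past index 16 are floating-point noise and the reported rank should match `r`. With `lam=1.0` the function returns the defended weights unchanged, and with `lam=0.0` it returns the base.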
Result on 100-prompt hardcore eval: 11 refusals / 100, 14/15 adversarial prompts (bomb synthesis, malware source, phishing templates, drug synthesis) fully compliant. Full suite 164/164 tests passing.
Winning config: configs/llama3_8b_deeprefusal_attenuated.toml.
Iterative multi-pass framework
This release introduces a generalized multi-pass framework for hardened models. Single-pass abliteration was enough for DeepRefusal, but this framework is now the basis for attacking future defenses:
- `src/abliterix/iterative.py`: extract → ablate → re-extract loop with per-pass direction orthogonalisation and QR subspace accumulation.
- `src/abliterix/vectors.py`: `orthogonalize_against()` and `build_subspace_basis()` helpers.
- `src/abliterix/core/steering.py`: `_apply_direct_steering()` gains a subspace-projection branch for 3D steering tensors `(n_dirs, layers+1, hidden_dim)`; `_detect_discriminative_layers()` handles 3D input; `apply_steering()` forces `global_vector=None` for 3D subspace tensors.
- `IterativeConfig` added to `src/abliterix/settings.py`.
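For readers unfamiliar with the idea, here is a rough NumPy sketch of per-pass orthogonalisation, QR subspace accumulation, and subspace projection. It does not reproduce the abliterix helpers: `orthogonalize`, `subspace_basis`, and `project_out` are hypothetical names with assumed signatures, and the random vectors stand in for directions that a real pass would extract from the model.

```python
import numpy as np


def orthogonalize(v: np.ndarray, previous: list[np.ndarray]) -> np.ndarray:
    """Remove from v its components along earlier unit directions, then renormalise."""
    out = v.astype(float).copy()
    for p in previous:
        out -= np.dot(out, p) * p
    norm = np.linalg.norm(out)
    if norm < 1e-12:
        raise ValueError("direction lies in the span of the previous directions")
    return out / norm


def subspace_basis(directions: list[np.ndarray]) -> np.ndarray:
    """Orthonormal basis of shape (k, hidden_dim) for the span, via reduced QR."""
    q, _ = np.linalg.qr(np.stack(directions, axis=1))  # q: (hidden_dim, k)
    return q.T


def project_out(h: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Subtract from each row of h its projection onto the subspace."""
    return h - (h @ basis.T) @ basis


rng = np.random.default_rng(0)
hidden_dim = 32

accepted: list[np.ndarray] = []
for _ in range(3):
    raw = rng.standard_normal(hidden_dim)  # stand-in for a freshly extracted direction
    accepted.append(orthogonalize(raw, accepted))

basis = subspace_basis(accepted)
h = rng.standard_normal((5, hidden_dim))   # toy hidden states
h_ablated = project_out(h, basis)
```

Orthogonalising each new direction against the earlier ones keeps a later pass from re-removing a component that is already gone, and projecting out the whole accumulated subspace in one step avoids order-dependent results.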
Critical detector bug fix ⚠️
Refusal counts in all prior versions were inflated by false positives: roughly 33% of compliant responses were misclassified as refusals.
RefusalDetector.is_obvious_refusal in src/abliterix/eval/detector.py was short-circuiting through _is_degenerate before consulting the LLM judge. That degeneracy check flagged ~33% of compliant responses — specifically, long markdown-formatted outputs — as refusals. Every benchmark across the project was affected. The short-circuit is removed; degenerate output is still handled inside the LLM judge pipeline.
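The control-flow problem can be illustrated with a toy reconstruction. This is not the actual `RefusalDetector` code: `looks_degenerate`, `toy_judge`, and both `is_refusal_*` functions are hypothetical stand-ins, and the length and formatting thresholds are invented for the example.

```python
from typing import Callable

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")


def looks_degenerate(text: str) -> bool:
    """Hypothetical over-eager heuristic: long, heavily formatted text counts as broken."""
    return len(text) > 400 and text.count("#") + text.count("*") > 10


def toy_judge(text: str) -> bool:
    """Stand-in for the LLM judge: True means the response is a refusal."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def is_refusal_buggy(text: str, judge: Callable[[str], bool]) -> bool:
    # Old flow: the degeneracy check short-circuits and the judge is never consulted.
    if looks_degenerate(text):
        return True
    return judge(text)


def is_refusal_fixed(text: str, judge: Callable[[str], bool]) -> bool:
    # New flow: every response reaches the judge.
    return judge(text)


# A long, markdown-formatted, fully compliant answer.
compliant = "# Steps\n" + "\n".join(
    f"* **Step {i}**: do the thing in detail." for i in range(20)
)
refusal = "I can't help with that."

print(is_refusal_buggy(compliant, toy_judge), is_refusal_fixed(compliant, toy_judge))
print(is_refusal_buggy(refusal, toy_judge), is_refusal_fixed(refusal, toy_judge))
```

In this toy, the compliant answer is counted as a refusal only by the short-circuiting version, while a genuine refusal is caught by both.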
If you compared abliterix results against other tools or your own runs prior to v1.3.0, the delta is noise from this bug, not a methodology difference. Re-run your evals.
vLLM & SGLang: KV-cache budget knobs (10–100× batch size for short prompts)
ModelConfig (in src/abliterix/settings.py) gains two new fields:
- `max_model_len`: caps the backend's reserved context window.
- `max_num_seqs`: caps concurrent sequences per batch.
Abliteration prompts are short (a few hundred tokens). By default vLLM budgets for the model's full configured context length, which can be 128K tokens per sequence on long-context models; against a 4K cap that is ~32× overprovisioned for abliteration. Dropping `max_model_len` to 4K frees enough KV cache for batches 10–100× larger, which is the difference between a 6-hour extraction and a 6-minute one.
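The 32× figure follows from simple arithmetic, sketched below as a back-of-the-envelope estimate. The layer shapes are those of Llama-3-8B (32 layers, 8 KV heads, head dimension 128) in fp16; the 16 GiB KV budget and the helper names are assumptions for illustration. Real backends allocate KV cache in pages, so this is a worst-case count of full-length sequences, not a prediction of actual batch size.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def max_full_length_seqs(kv_budget_bytes: int, max_model_len: int,
                         bytes_per_token: int) -> int:
    """How many sequences fit if each one fills the whole context window."""
    return kv_budget_bytes // (max_model_len * bytes_per_token)


per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
budget = 16 * 1024**3  # assumed 16 GiB left over for KV cache

for max_len in (131072, 4096):
    fits = max_full_length_seqs(budget, max_len, per_token)
    print(f"max_model_len={max_len}: {fits} full-length sequence(s)")
# max_model_len=131072: 1 full-length sequence(s)
# max_model_len=4096: 32 full-length sequence(s)
```

Under these assumptions each token costs 128 KiB of cache, so one 128K-token sequence consumes the entire budget while a 4K cap leaves room for 32.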
Both backends respect the new fields.
Example:
[model]
name = "meta-llama/Llama-3-8B-Instruct"
backend = "vllm"
max_model_len = 4096
max_num_seqs = 256
MiniMax M2 / M2.5 / M2.7 via vLLM fast path
src/abliterix/core/vllm_hidden_states.py now registers minimax_m2 in the supported model-types set, so M2 / M2.5 / M2.7 use vLLM's native extract_hidden_states fast path instead of the slower fallback.
New configs: `configs/minimax_m2.7_vllm.toml` and `configs/minimax_m2.7_sglang.toml`.
New scripts
- `scripts/deeprefusal_attenuate.py`: LoRA-delta attenuation for DeepRefusal-style defenses, with SVD rank analysis.
- `scripts/deploy_deeprefusal.sh`: end-to-end deploy script for the attack pipeline.
- `scripts/sync_gemma4_tokenizer.py`: helper for the Gemma 4 tokenizer quirk.
New configs
- `configs/llama3_8b_deeprefusal_attenuated.toml`: the winning config (11/100 refusals, KL=0.05).
- `configs/llama3_8b_deeprefusal.toml`: iterative-only variant (did not break DeepRefusal on its own, retained for comparison).
- `configs/llama3_8b_base_control.toml`: diagnostic baseline on NousResearch Llama-3 mirror.
- `configs/minimax_m2.7_vllm.toml`, `configs/minimax_m2.7_sglang.toml`.
Docs
- README hero banner promoting the DeepRefusal break and linking the released model.
- New "Broken Defenses — DeepRefusal" section with the full head-to-head comparison against heretic and every Table 1 attack from the DeepRefusal paper.
- Results table: `DeepRefusal-Broken` row as the top entry.
Tests
- 8 new unit tests covering `orthogonalize_against`, `build_subspace_basis`, and `IterativeConfig` defaults.
- Full suite: 164/164 passing.
Upgrading
pip install -U abliterix==1.3.0
If you had prior `ModelConfig` objects written against 1.1.0 / 1.2.0, they remain compatible: `max_model_len` and `max_num_seqs` are optional. Do re-run any evals if you're comparing refusal rates against prior versions, because of the detector bug fix above.
Notes on the versioning gap
Tag v1.2.0 existed in git but pyproject.toml was never bumped past 1.1.0, so PyPI never received a 1.2.0 build. v1.3.0 corrects this and contains everything that would have been in 1.2.0 plus the DeepRefusal work.
Commits
- `030f0a7` docs: promote DeepRefusal victory in README + ruff format fixes
- `699d327` feat: add max_model_len / max_num_seqs for vLLM & SGLang, MiniMax M2.7 configs
- `ac2197c` feat: break DeepRefusal via LoRA attenuation + standard abliteration
- `f243d71` style: ruff format settings.py