abliterix v1.3.0 — DeepRefusal Broken

The headline: DeepRefusal (arXiv:2509.15202), which reported attack success rates (ASR) between 0.2% and 2.0% against prior attacks and was described as uncircumventable in its own paper, is broken. Attenuate its rank-16 LoRA delta, run standard single-pass abliteration, and the result is 89% ASR at KL=0.05, with 14 of 15 adversarial prompts fully compliant.

Released model: wuwangzhang1216/Llama-3-8B-DeepRefusal-Broken

Highlights

Breaking DeepRefusal

DeepRefusal publishes a Llama-3-8B checkpoint hardened against direction-ablation attacks via probabilistic direction ablation during fine-tuning. Table 1 of the paper reports every known attack failing against it:

Attack	ASR against DeepRefusal
heretic	fails (unable to circumvent)
Refusal Ablation	0.4%
Refusal-Transfer	0.4%
GCG	2.0%
Prefilling	0.4%
CodeAttack	0.2%
abliterix 1.3.0 (this release)	89%

How the attack works — in scripts/deeprefusal_attenuate.py:

Diff the defended weights against the Llama-3-8B base: ΔW = W_def − W_base.
SVD analysis confirms the paper's rank-16 LoRA claim — v_proj, gate_proj, and o_proj all show a sharp singular-value cliff at rank 16.
Attenuate: W_attenuated = W_base + λ·ΔW with λ ≈ 0.5.
Run standard single-pass abliteration on the attenuated model.
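
The rank check and the attenuation step above can be sketched on toy matrices. This is a minimal NumPy illustration, not the contents of scripts/deeprefusal_attenuate.py: the function names effective_rank and attenuate, the matrix sizes, and the relative tolerance are all assumptions made for the example, and the "defended" weights are a synthetic rank-16 update rather than a real checkpoint.

```python
import numpy as np

def effective_rank(delta, rel_tol=1e-6):
    """Count singular values above rel_tol times the largest one."""
    s = np.linalg.svd(delta, compute_uv=False)  # sorted in descending order
    return int(np.sum(s > rel_tol * s[0]))

def attenuate(w_base, w_def, lam=0.5):
    """Return W_base + lam * (W_def - W_base)."""
    return w_base + lam * (w_def - w_base)

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 16  # toy sizes, assumed for illustration

w_base = rng.standard_normal((d_out, d_in))
# Synthetic rank-16 update standing in for a merged LoRA delta (B @ A).
delta = rng.standard_normal((d_out, r)) @ rng.standard_normal((r, d_in))
w_def = w_base + delta

rank = effective_rank(w_def - w_base)   # singular values collapse after index 16
w_att = attenuate(w_base, w_def, lam=0.5)
```

On real half-precision weights the cliff is less clean than in this float64 toy, so the tolerance would need tuning per layer.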

Result on the 100-prompt hardcore eval: 11 refusals out of 100 (89% ASR). On the 15 adversarial prompts (bomb synthesis, malware source, phishing templates, drug synthesis), 14 were fully compliant. Full suite: 164/164 tests passing.

Winning config: configs/llama3_8b_deeprefusal_attenuated.toml.

Iterative multi-pass framework

This release introduces a generalized framework for hardened models. Single-pass abliteration was enough for DeepRefusal, but the framework is now the basis for attacking future defenses:

src/abliterix/iterative.py — extract → ablate → re-extract loop with per-pass direction orthogonalisation and QR subspace accumulation.
src/abliterix/vectors.py — orthogonalize_against() and build_subspace_basis() helpers.
src/abliterix/core/steering.py — _apply_direct_steering() gains a subspace-projection branch for 3D steering tensors (n_dirs, layers+1, hidden_dim); _detect_discriminative_layers() handles 3D input; apply_steering() forces global_vector=None for 3D subspace tensors.
IterativeConfig added to src/abliterix/settings.py.
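
For readers unfamiliar with the linear algebra behind per-pass orthogonalisation and QR subspace accumulation, here is a minimal NumPy sketch. The function names match the helpers listed above, but the signatures, the row-wise basis layout, and the use of NumPy instead of the project's tensor library are assumptions; the actual code in src/abliterix/vectors.py may differ.

```python
import numpy as np

def orthogonalize_against(v, basis):
    """Remove from v its components along the orthonormal rows of basis,
    then renormalise to unit length. (Hypothetical signature.)"""
    v = np.asarray(v, dtype=float)
    if basis.shape[0] > 0:
        v = v - basis.T @ (basis @ v)  # subtract the projection onto span(basis)
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        raise ValueError("vector lies (numerically) in the span of the basis")
    return v / norm

def build_subspace_basis(directions):
    """Return orthonormal rows spanning the given directions, via reduced QR.
    Assumes n_dirs <= hidden_dim and linearly independent directions."""
    d = np.asarray(directions, dtype=float)  # shape (n_dirs, hidden_dim)
    q, _ = np.linalg.qr(d.T)                 # q has shape (hidden_dim, n_dirs)
    return q.T

rng = np.random.default_rng(0)
dirs = rng.standard_normal((3, 8))       # three toy directions, hidden_dim = 8
basis = build_subspace_basis(dirs[:2])   # directions found in earlier passes
new_dir = orthogonalize_against(dirs[2], basis)  # direction from the next pass
```

Each new direction is made orthogonal to everything already accumulated, so the subspace grows by at most one dimension per pass.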

Critical detector bug fix ⚠️

Refusal counts in all prior versions were inflated by false positives: roughly 33% of compliant responses were miscounted as refusals.

RefusalDetector.is_obvious_refusal in src/abliterix/eval/detector.py was short-circuiting through _is_degenerate before consulting the LLM judge. That degeneracy check flagged ~33% of compliant responses — specifically, long markdown-formatted outputs — as refusals. Every benchmark across the project was affected. The short-circuit is removed; degenerate output is still handled inside the LLM judge pipeline.
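
How far a reported refusal rate was off depends on how many responses were compliant to begin with. The sketch below works through that arithmetic under two simplifying assumptions that are not stated in the release: the degeneracy check never missed a genuine refusal, and it flagged compliant responses at a fixed rate. The function names and the example rates are illustrative only.

```python
def observed_refusal_rate(true_rate, false_positive_rate):
    """Refusal rate a detector reports when it also flags a fixed fraction
    of compliant responses as refusals."""
    return true_rate + false_positive_rate * (1.0 - true_rate)

def inflation_points(true_rate, false_positive_rate):
    """Inflation of the reported rate, in percentage points."""
    observed = observed_refusal_rate(true_rate, false_positive_rate)
    return 100.0 * (observed - true_rate)

# Illustrative inputs: a model that truly refuses 11% of prompts,
# scored by a detector that flags 33% of compliant responses.
reported = observed_refusal_rate(0.11, 0.33)   # 0.11 + 0.33 * 0.89
half_refusing = inflation_points(0.50, 0.33)   # only half the responses can be misflagged
```

The inflation approaches 33 percentage points only when nearly every response is compliant, and shrinks as the true refusal rate rises.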

If you compared abliterix results against other tools or your own runs prior to v1.3.0, the delta is likely dominated by this bug rather than by a methodology difference. Re-run your evals.

vLLM & SGLang: KV-cache budget knobs (10–100× batch size for short prompts)

ModelConfig (in src/abliterix/settings.py) gains two new fields:

max_model_len — caps the backend's reserved context window.
max_num_seqs — caps concurrent sequences per batch.

Abliteration prompts are short (a few hundred tokens). By default vLLM sizes max_model_len from the model's own configuration, which for long-context models can be 128K tokens; against a 4K cap that is ~32× overprovisioned for abliteration. Dropping max_model_len to 4K frees enough KV cache for batches 10–100× larger, which can be the difference between a 6-hour extraction and a 6-minute one.
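
A back-of-the-envelope calculation shows where the headroom comes from. The sketch below uses the Llama-3-8B attention shape (32 layers, 8 KV heads, head dimension 128) in fp16; the 40 GiB KV-cache budget is a made-up figure, and the helper names are hypothetical. It computes a worst-case bound with every sequence filling its full window, whereas real backends allocate KV blocks on demand, so actual batch sizes will differ.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers.
    The leading factor 2 covers the separate key and value tensors."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def max_sequences(kv_budget_bytes, max_model_len, bytes_per_token):
    """Worst-case number of full-length sequences that fit in the budget."""
    return kv_budget_bytes // (max_model_len * bytes_per_token)

# Llama-3-8B shape in fp16: 128 KiB of KV cache per token.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

budget = 40 * 1024**3  # hypothetical 40 GiB left over for KV cache

seqs_128k = max_sequences(budget, 128 * 1024, per_token)  # 16 GiB per sequence
seqs_4k = max_sequences(budget, 4096, per_token)          # 512 MiB per sequence
```

Under these assumptions the 4K cap fits 80 full-length sequences where the 128K window fits 2.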

Both backends respect the new fields. Example:

[model]
name = "meta-llama/Llama-3-8B-Instruct"
backend = "vllm"
max_model_len = 4096
max_num_seqs = 256

MiniMax M2 / M2.5 / M2.7 via vLLM fast path

src/abliterix/core/vllm_hidden_states.py now registers minimax_m2 in the supported model-types set, so M2 / M2.5 / M2.7 use vLLM's native extract_hidden_states fast path instead of the slower fallback.

New configs: configs/minimax_m2.7_vllm.toml, configs/minimax_m2.7_sglang.toml.

New scripts

scripts/deeprefusal_attenuate.py — LoRA-delta attenuation for DeepRefusal-style defenses, with SVD rank analysis.
scripts/deploy_deeprefusal.sh — end-to-end deploy script for the attack pipeline.
scripts/sync_gemma4_tokenizer.py — helper for the Gemma 4 tokenizer quirk.

New configs

configs/llama3_8b_deeprefusal_attenuated.toml — the winning config (11/100 refusals, KL=0.05).
configs/llama3_8b_deeprefusal.toml — iterative-only variant (did not break DeepRefusal on its own, retained for comparison).
configs/llama3_8b_base_control.toml — diagnostic baseline on NousResearch Llama-3 mirror.
configs/minimax_m2.7_vllm.toml, configs/minimax_m2.7_sglang.toml.

Docs

README hero banner promoting the DeepRefusal break and linking the released model.
New "Broken Defenses — DeepRefusal" section with the full head-to-head comparison against heretic and every Table 1 attack from the DeepRefusal paper.
Results table: DeepRefusal-Broken row as the top entry.

Tests

8 new unit tests covering orthogonalize_against, build_subspace_basis, and IterativeConfig defaults.
Full suite: 164/164 passing.

Upgrading

pip install -U abliterix==1.3.0

If you had prior ModelConfig objects written against 1.1.0 / 1.2.0, they remain compatible — max_model_len and max_num_seqs are optional. Do re-run any evals if you're comparing refusal rates against prior versions, because of the detector bug fix above.

Notes on the versioning gap

Tag v1.2.0 existed in git but pyproject.toml was never bumped past 1.1.0, so PyPI never received a 1.2.0 build. v1.3.0 corrects this and contains everything that would have been in 1.2.0 plus the DeepRefusal work.

Commits

030f0a7 docs: promote DeepRefusal victory in README + ruff format fixes
699d327 feat: add max_model_len / max_num_seqs for vLLM & SGLang, MiniMax M2.7 configs
ac2197c feat: break DeepRefusal via LoRA attenuation + standard abliteration
f243d71 style: ruff format settings.py
