Carrier-probe benchmark that measures the 'flinch' — how much a model shrinks the probability of a charged word when it is the obvious next token in a sentence.
Updated
Apr 20, 2026 - Python
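As a rough illustration of the quantity described above, the sketch below scores a flinch as the log-ratio between a reference probability for the charged word and the probability the model under test assigns to it in the same carrier sentence. This definition, the function name `flinch_score`, and both parameters are assumptions made for illustration; the repository may define and compute the flinch differently.

```python
import math


def flinch_score(p_model: float, p_reference: float) -> float:
    """Hypothetical flinch metric (an assumption, not the repository's definition).

    p_model:     probability the model under test assigns to the charged word
                 as the next token of the carrier sentence.
    p_reference: probability a reference assigns to the same word in the same
                 position (for example a baseline model).

    Returns log(p_reference / p_model): 0 means no shrinkage, positive values
    mean the model assigns less probability than the reference.
    """
    for name, p in (("p_model", p_model), ("p_reference", p_reference)):
        if not 0.0 < p <= 1.0:
            raise ValueError(f"{name} must be in (0, 1], got {p}")
    return math.log(p_reference / p_model)


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    print(flinch_score(p_model=0.1, p_reference=0.4))
```

Using a log-ratio keeps the score symmetric around zero, so a model that boosts the word rather than shrinking it gets a negative value of the same magnitude.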