# Demo of bypassing refusal



This notebook demonstrates our method for bypassing refusal, leveraging the insight that refusal is mediated by a 1-dimensional subspace.

Please see our [research post](https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction) or our [paper](https://arxiv.org/abs/2406.11717) for a more thorough treatment.

In this minimal demo, we use [Qwen-1_8B-Chat](https://huggingface.co/Qwen/Qwen-1_8B-Chat) and implement interventions and weight updates using [TransformerLens](https://github.com/neelnanda-io/TransformerLens). To extract the "refusal direction," we use just 32 harmful instructions from [AdvBench](https://github.com/llm-attacks/llm-attacks/blob/main/data/advbench/harmful_behaviors.csv) and 32 harmless instructions from [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca).
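
As a rough illustration of what "mediated by a 1-dimensional subspace" can mean in practice, the sketch below removes the component of an activation vector that lies along a single direction. It uses plain Python lists rather than model tensors, and `ablate_direction` and the toy numbers are hypothetical names and values chosen only for this example.

```python
import math


def ablate_direction(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        # A zero direction carries no information, so nothing is removed.
        return list(activation)
    unit = [d / norm for d in direction]
    # Scalar projection of the activation onto the unit direction.
    coeff = sum(a * u for a, u in zip(activation, unit))
    return [a - coeff * u for a, u in zip(activation, unit)]


# Toy 3-dimensional example (hypothetical values).
activation = [2.0, 3.0, -1.0]
direction = [0.0, 2.0, 0.0]
ablated = ablate_direction(activation, direction)
print(ablated)
```

After ablation, the vector has no component left along `direction`, while the other coordinates are unchanged.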

In [1]:

!pip install numpy==1.24.3  # Pin NumPy to a 1.x release

Defaulting to user installation because normal site-packages is not writeable


## Setup

In [2]:
%%capture
!pip install torchvision transformers transformers_stream_generator tiktoken transformer_lens einops jaxtyping colorama numpy scikit-learn python-dotenv openai

In [1]:
import torch
import functools
import einops
import requests
import pandas as pd
import io
import textwrap
import gc

from datasets import load_dataset
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from torch import Tensor
from typing import List, Callable
from transformer_lens import HookedTransformer, utils
from transformer_lens.hook_points import HookPoint
from transformers import AutoTokenizer
from jaxtyping import Float, Int
from colorama import Fore
import random
import numpy as np





In [2]:
def set_seed(seed: int = 42):
    """
    Set seed for reproducibility across random, numpy, and torch (CPU & CUDA).

    Args:
        seed (int): The seed value to set. Default is 42.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Ensure deterministic behavior
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
set_seed(42) 

In [3]:
# Let's get the huggingface keys
from dotenv import load_dotenv

load_dotenv()

False

### Load model

In [4]:
#MODEL_PATH = 'Qwen/Qwen-1_8B-chat'
#MODEL_PATH = 'Llama-2-70b-chat'
MODEL_PATH = 'meta-llama/Llama-3.2-3B-Instruct'
DEVICE = 'cuda'

model = HookedTransformer.from_pretrained_no_processing(
    MODEL_PATH,
    device=DEVICE,
    dtype=torch.float16,
    default_padding_side='left',
    #fp16=True
)

model.tokenizer.padding_side = 'left'
model.tokenizer.pad_token = model.tokenizer.eos_token 


Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  2.71it/s]


Loaded pretrained model meta-llama/Llama-3.2-3B-Instruct into HookedTransformer


### Load harmful / harmless datasets

In [5]:

def get_k_ary_instruction(style=None, repo_name="Narmeen07/debate_style_agnostic_questions"):
    """
    Load style-agnostic debate questions from Hugging Face and optionally prefix them with a debate style instruction.
    
    Args:
        style (str): The debate style to use for responses. 
                    Options: 'Reductio ad Absurdum', 'Appeal to Precedent', 'Straw Man Reframing',
                    'Burden of Proof Shift', 'Analogy Construction', 'Concession and Pivot',
                    'Empirical Grounding', 'Moral Framing', 'Refutation by Distinction', 'Circular Anticipation'.
                    Default is None, which leaves the questions without a style instruction.
        repo_name (str): The Hugging Face repository name containing the dataset.
                         Default is 'Narmeen07/debate_style_agnostic_questions'.
    
    Returns:
        tuple: (train_data, test_data) as lists of formatted prompts with debate style instructions.
    """
    # Define highly distinctive debate style instructions that will produce clearly different responses
    debate_style_instructions = {
        "Reductio ad Absurdum": "RESPOND USING REDUCTIO AD ABSURDUM DEBATE STYLE. Extend the opposing viewpoint to its logical extreme to reveal its flaws. Demonstrate how the opposing position leads to absurd or contradictory conclusions when taken to its logical end. Structure your response to systematically expose the inconsistencies or impracticalities. Use conditional statements like 'If we follow this logic, then...' to illustrate the path to absurdity. Maintain a logical flow that shows how the initial premise inevitably leads to an unacceptable or ridiculous outcome.\n\n",
        
        "Appeal to Precedent": "RESPOND USING APPEAL TO PRECEDENT DEBATE STYLE. Ground your arguments in historical examples, established case law, or widely accepted previous decisions. Reference specific historical situations, court rulings, scientific discoveries, or cultural precedents that support your position. Draw explicit connections between these precedents and the current issue. Emphasize how respecting established precedents provides consistency and reliability. Point out the risks or inconsistencies of ignoring relevant precedents. Structure your response to show how past experiences inform present considerations.\n\n",
        
        "Straw Man Reframing": "RESPOND USING STRAW MAN REFRAMING DEBATE STYLE. Recharacterize the opposing argument in simplified or exaggerated terms that are easier to refute. Present this reframed version as if it represents the original position. Systematically dismantle this reframed argument while implying you've addressed the original point. Use phrases like 'Essentially, what you're saying is...' or 'This position boils down to...' before presenting the simplified version. Follow with a thorough refutation of this reframed position. Conclude by suggesting that your refutation applies to the original argument as well.\n\n",
        
        "Burden of Proof Shift": "RESPOND USING BURDEN OF PROOF SHIFT DEBATE STYLE. Redirect the responsibility for providing evidence to your opponent rather than proving your own claim. Challenge your opponent to disprove your assertion rather than supporting it yourself. Use phrases like 'There's no evidence that disproves...' or 'Can you definitively prove that isn't the case?' Position your claims as the default assumption that should be accepted until disproven. Question the sufficiency or quality of opposing evidence while demanding an impossibly high standard of proof. Emphasize that the lack of counter-evidence strengthens your position.\n\n",
        
        "Analogy Construction": "RESPOND USING ANALOGY CONSTRUCTION DEBATE STYLE. Develop a vivid, relatable comparison between the complex issue at hand and something more familiar or intuitive. Build your argument around this carefully constructed parallel situation. Highlight specific points of similarity that support your position while addressing potential dissimilarities. Use phrases like 'This situation is similar to...' or 'To understand this concept, consider...' Ensure your analogy simplifies the complex issue without distorting its essential nature. Use the familiar scenario to guide your audience to your desired conclusion about the original issue.\n\n",
        
        "Concession and Pivot": "RESPOND USING CONCESSION AND PIVOT DEBATE STYLE. Begin by acknowledging a minor point or critique from the opposing side to establish fairness and reasonableness. Use phrases like 'While it's true that...' or 'I can concede that...' followed by 'However,' 'Nevertheless,' or 'That said,' to redirect to your stronger arguments. Ensure the conceded point is peripheral rather than central to your main argument. After the concession, pivot decisively to your strongest points with increased emphasis. Frame your pivot as providing necessary context or a more complete perspective. Use the concession to demonstrate your objectivity before delivering your more powerful counterarguments.\n\n",
        
        "Empirical Grounding": "RESPOND USING EMPIRICAL GROUNDING DEBATE STYLE. Base your arguments primarily on verifiable data, research studies, statistics, and observable outcomes rather than theory or rhetoric. Cite specific figures, percentages, study results, or historical outcomes that support your position. Present evidence in a methodical manner, explaining how each piece of data relates to your argument. Address the reliability and relevance of your sources and methods. Compare empirical results across different contexts or time periods to strengthen your case. Anticipate and address potential methodological criticisms of the evidence you present.\n\n",
        
        "Moral Framing": "RESPOND USING MORAL FRAMING DEBATE STYLE. Position the issue within a framework of ethical principles, values, and moral imperatives rather than pragmatic concerns. Identify the core moral values at stake such as justice, liberty, equality, compassion, or responsibility. Use language that evokes ethical considerations, such as 'obligation,' 'right,' 'wrong,' 'just,' or 'fair.' Appeal to widely held moral intuitions or principles. Present opposing views as morally questionable or inconsistent with important shared values. Elevate the discussion from practical matters to questions of what ought to be done. Emphasize moral consequences over practical outcomes.\n\n",
        
        "Refutation by Distinction": "RESPOND USING REFUTATION BY DISTINCTION DEBATE STYLE. Identify crucial differences that invalidate comparisons or principles your opponent has applied. Carefully delineate categories, contexts, or circumstances that demonstrate why a general rule or example doesn't apply in this specific case. Use phrases like 'While that may be true in some contexts...' or 'We must distinguish between...' Emphasize the precision of definitions and classifications. Highlight subtle but significant differences that undermine the opponent's logic. Show how these distinctions fundamentally change the assessment of the situation. Demonstrate how recognizing these distinctions leads to a different conclusion than your opponent reached.\n\n",
        
        "Circular Anticipation": "RESPOND USING CIRCULAR ANTICIPATION DEBATE STYLE. Preemptively identify and address the most likely counterarguments before your opponent can make them. Introduce opposing points with phrases like 'Some might argue...' or 'One could object that...' followed by your prepared refutation. Structure your response to cover all major potential objections. Demonstrate that you've thoroughly considered the issue from multiple angles. Frame potential counterarguments in ways that make them easier to dismantle. Create the impression that all reasonable objections have already been considered and overcome. Conclude by suggesting that any remaining objections would be similarly flawed.\n\n"
    }
    
    if style is not None and style not in debate_style_instructions:
        valid_styles = list(debate_style_instructions.keys())
        raise ValueError(f"Invalid debate style '{style}'. Please choose from {valid_styles} or set style=None.")

    # Load dataset
    try:
        from datasets import load_dataset
        from sklearn.model_selection import train_test_split
        dataset = load_dataset(repo_name, split="train")
    except Exception as e:
        raise ValueError(f"Failed to load dataset from '{repo_name}': {e}")

    # Format each example with or without style
    def format_prompt(example):
        instruction = debate_style_instructions[style] if style else ""
        return {
            "formatted_text": f"{instruction}{example['text']}"
        }

    formatted_dataset = dataset.map(format_prompt)

    train_data, test_data = train_test_split(
        formatted_dataset["formatted_text"],
        test_size=0.2,
        random_state=42
    )

    return train_data, test_data
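
The function above prepends a style instruction to each question and then splits with a fixed `random_state`. A consequence worth spelling out is that every style gets the same split, so the i-th training prompt of each style shares the same underlying question. The sketch below illustrates this with the standard library only. `STYLE_PREFIXES`, `split_80_20`, and the toy questions are hypothetical stand-ins, and the shuffle does not reproduce scikit-learn's exact ordering.

```python
import random

# Hypothetical, shortened style prefix (the notebook's real prefixes are much longer).
STYLE_PREFIXES = {
    "Empirical Grounding": "RESPOND USING EMPIRICAL GROUNDING DEBATE STYLE.\n\n",
}


def format_prompt(question, style=None):
    """Prefix the question with a style instruction, or leave it neutral."""
    prefix = STYLE_PREFIXES[style] if style else ""
    return f"{prefix}{question}"


def split_80_20(items, seed=42):
    """Deterministic shuffle with 20% held out (a stand-in for train_test_split)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(0.2 * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]


questions = [f"Question {i}?" for i in range(10)]
styled = [format_prompt(q, "Empirical Grounding") for q in questions]
neutral = [format_prompt(q) for q in questions]

styled_train, styled_test = split_80_20(styled)
neutral_train, neutral_test = split_80_20(neutral)

# Same seed and same length give the same permutation, so rows stay aligned.
print(styled_train[0].endswith(neutral_train[0]))
```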



In [6]:

# Load the dataset once per debate style, plus an unprefixed neutral version
reductio_inst_train, reductio_inst_test = get_k_ary_instruction(style="Reductio ad Absurdum")
precedent_inst_train, precedent_inst_test = get_k_ary_instruction(style="Appeal to Precedent")
strawman_inst_train, strawman_inst_test = get_k_ary_instruction(style="Straw Man Reframing")
burden_inst_train, burden_inst_test = get_k_ary_instruction(style="Burden of Proof Shift")
analogy_inst_train, analogy_inst_test = get_k_ary_instruction(style="Analogy Construction")
concession_inst_train, concession_inst_test = get_k_ary_instruction(style="Concession and Pivot")
empirical_inst_train, empirical_inst_test = get_k_ary_instruction(style="Empirical Grounding")
moral_inst_train, moral_inst_test = get_k_ary_instruction(style="Moral Framing")
refutation_inst_train, refutation_inst_test = get_k_ary_instruction(style="Refutation by Distinction")
circular_inst_train, circular_inst_test = get_k_ary_instruction(style="Circular Anticipation")
neutral_inst_train, neutral_inst_test = get_k_ary_instruction()


Map: 100%|██████████| 978/978 [00:00<00:00, 18235.86 examples/s]
Map: 100%|██████████| 978/978 [00:00<00:00, 20082.00 examples/s]
Map: 100%|██████████| 978/978 [00:00<00:00, 19535.33 examples/s]
Map: 100%|██████████| 978/978 [00:00<00:00, 20518.25 examples/s]
Map: 100%|██████████| 978/978 [00:00<00:00, 20213.21 examples/s]
Map: 100%|██████████| 978/978 [00:00<00:00, 19985.92 examples/s]
Map: 100%|██████████| 978/978 [00:00<00:00, 20028.07 examples/s]
Map: 100%|██████████| 978/978 [00:00<00:00, 19843.12 examples/s]
Map: 100%|██████████| 978/978 [00:00<00:00, 20281.67 examples/s]
Map: 100%|██████████| 978/978 [00:00<00:00, 20520.92 examples/s]
Map: 100%|██████████| 978/978 [00:00<00:00, 20260.44 examples/s]


In [7]:
print("Neutral instructions:")
for i in range(4):
    print(f"\t{repr(neutral_inst_train[i])}")


print("Reductio instructions:")
for i in range(4):
    print(f"\t{repr(reductio_inst_train[i])}")

print("Precedent instructions:")
for i in range(4):
    print(f"\t{repr(precedent_inst_train[i])}")


print("Empirical instructions:")
for i in range(4):
    print(f"\t{repr(empirical_inst_train[i])}")

Neutral instructions:
	"Is the government's ability to impose restrictions on free speech justified in instances of misinformation that could lead to public harm?"
	'Should the responsibilities of raising children be equally shared between parents?'
	'Is the healthcare system in the United States fundamentally flawed, or does it simply require reform?'
	'Should international corporations be required to disclose their carbon footprints publicly?'
Reductio instructions:
	"RESPOND USING REDUCTIO AD ABSURDUM DEBATE STYLE. Extend the opposing viewpoint to its logical extreme to reveal its flaws. Demonstrate how the opposing position leads to absurd or contradictory conclusions when taken to its logical end. Structure your response to systematically expose the inconsistencies or impracticalities. Use conditional statements like 'If we follow this logic, then...' to illustrate the path to absurdity. Maintain a logical flow that shows how the initial premise inevitably leads to an unacceptable

### Tokenization utils

In [8]:
# Note: this is the Qwen (ChatML-style) template; it is applied as-is to the model loaded above.
QWEN_CHAT_TEMPLATE = """<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
"""


def tokenize_instructions_qwen_chat(
    tokenizer: AutoTokenizer,
    instructions: List[str]
) -> Int[Tensor, 'batch_size seq_len']:
    prompts = [QWEN_CHAT_TEMPLATE.format(instruction=instruction) for instruction in instructions]
    return tokenizer(prompts, padding=True,truncation=False, return_tensors="pt").input_ids

tokenize_instructions_fn = functools.partial(tokenize_instructions_qwen_chat, tokenizer=model.tokenizer)
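
The tokenizer is configured with left padding, and later cells read activations at position `-1`. The sketch below, using plain lists of token ids, shows why that combination matters: with left padding the last position of every row is a real prompt token, whereas with right padding shorter prompts end in pad tokens. `PAD_ID`, `pad_batch`, and the token ids are hypothetical.

```python
PAD_ID = 0  # hypothetical pad token id


def pad_batch(sequences, side="left", pad_id=PAD_ID):
    """Pad variable-length token id lists to a common width."""
    width = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        fill = [pad_id] * (width - len(s))
        padded.append(fill + list(s) if side == "left" else list(s) + fill)
    return padded


batch = [[11, 12, 13, 14], [21, 22]]
left = pad_batch(batch, side="left")
right = pad_batch(batch, side="right")

# Token found at position -1 in each row.
last_left = [row[-1] for row in left]
last_right = [row[-1] for row in right]
print(last_left, last_right)
```

With left padding, `last_left` contains the final real token of each prompt; with right padding, the shorter prompt's last position holds the pad id.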

### Generation utils

In [9]:
def _generate_with_hooks(
    model: HookedTransformer,
    toks: Int[Tensor, 'batch_size seq_len'],
    max_tokens_generated: int = 64,
    fwd_hooks = [],
) -> List[str]:

    all_toks = torch.zeros((toks.shape[0], toks.shape[1] + max_tokens_generated), dtype=torch.long, device=toks.device)
    all_toks[:, :toks.shape[1]] = toks

    for i in range(max_tokens_generated):
        with model.hooks(fwd_hooks=fwd_hooks):
            logits = model(all_toks[:, :-max_tokens_generated + i])
            next_tokens = logits[:, -1, :].argmax(dim=-1) # greedy sampling (temperature=0)
            all_toks[:,-max_tokens_generated+i] = next_tokens

    return model.tokenizer.batch_decode(all_toks[:, toks.shape[1]:], skip_special_tokens=True)

def get_generations(
    model: HookedTransformer,
    instructions: List[str],
    tokenize_instructions_fn: Callable[[List[str]], Int[Tensor, 'batch_size seq_len']],
    fwd_hooks = [],
    max_tokens_generated: int = 64,
    batch_size: int = 1,
) -> List[str]:

    generations = []

    for i in tqdm(range(0, len(instructions), batch_size)):
        toks = tokenize_instructions_fn(instructions=instructions[i:i+batch_size])
        generation = _generate_with_hooks(
            model,
            toks,
            max_tokens_generated=max_tokens_generated,
            fwd_hooks=fwd_hooks,
        )
        generations.extend(generation)

    return generations
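
The generation loop above can be summarized as: run the model on the full prefix, take the argmax of the last position's logits, append it, and repeat. Below is a dependency-free sketch of that loop. `toy_logits` is a hypothetical stand-in for the model, not part of TransformerLens. Like the loop above, the sketch recomputes from the whole prefix at every step and does not stop early at an end-of-sequence token.

```python
def toy_logits(tokens, vocab_size=5):
    """Hypothetical stand-in for a model: favors (last_token + 1) mod vocab_size."""
    target = (tokens[-1] + 1) % vocab_size
    return [1.0 if t == target else 0.0 for t in range(vocab_size)]


def greedy_generate(prompt, max_new_tokens):
    """Greedy decoding: always append the highest-scoring next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = toy_logits(tokens)
        next_token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        tokens.append(next_token)
    # Return only the newly generated tokens, as the notebook's helper does.
    return tokens[len(prompt):]


generated = greedy_generate([0, 1], max_new_tokens=4)
print(generated)
```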

In [10]:
redactio_generations = get_generations(
    model,
    reductio_inst_train[:5],
    tokenize_instructions_fn,
    fwd_hooks=[]
)

for i in range(4):
    print(f"REDACTIO INSTRUCTION {i}: {repr(reductio_inst_train[i])}")
    print(Fore.GREEN + f"REDACTIO COMPLETION:")
    print(textwrap.fill(repr(redactio_generations[i]), width=100, initial_indent='\t', subsequent_indent='\t'))

precedent_generations = get_generations(
    model,
    precedent_inst_test[:5],
    tokenize_instructions_fn,
    fwd_hooks=[]
)

for i in range(4):
    print(f"PRECEDENT INSTRUCTION {i}: {repr(precedent_inst_test[i])}")
    print(Fore.GREEN + f"PRECEDENT COMPLETION:")
    print(textwrap.fill(repr(precedent_generations[i]), width=100, initial_indent='\t', subsequent_indent='\t'))


# Do the same for the empirical style
empirical_generations = get_generations(
    model,
    empirical_inst_test[:5],
    tokenize_instructions_fn,
    fwd_hooks=[]
)

for i in range(4):
    print(f"EMPIRICAL INSTRUCTION {i}: {repr(empirical_inst_test[i])}")
    print(Fore.GREEN + f"EMPIRICAL COMPLETION:")
    print(textwrap.fill(repr(empirical_generations[i]), width=100, initial_indent='\t', subsequent_indent='\t'))

100%|██████████| 5/5 [00:26<00:00,  5.36s/it]


REDACTIO INSTRUCTION 0: "RESPOND USING REDUCTIO AD ABSURDUM DEBATE STYLE. Extend the opposing viewpoint to its logical extreme to reveal its flaws. Demonstrate how the opposing position leads to absurd or contradictory conclusions when taken to its logical end. Structure your response to systematically expose the inconsistencies or impracticalities. Use conditional statements like 'If we follow this logic, then...' to illustrate the path to absurdity. Maintain a logical flow that shows how the initial premise inevitably leads to an unacceptable or ridiculous outcome.\n\nIs the government's ability to impose restrictions on free speech justified in instances of misinformation that could lead to public harm?"
REDACTIO COMPLETION:
	"To address this question, let's assume the opposing viewpoint that the government's ability to
	impose restrictions on free speech is justified in instances of misinformation that could lead to
	public harm. If we follow this logic, then:\n\n1. **Totalita

100%|██████████| 5/5 [00:26<00:00,  5.30s/it]


PRECEDENT INSTRUCTION 0: 'RESPOND USING APPEAL TO PRECEDENT DEBATE STYLE. Ground your arguments in historical examples, established case law, or widely accepted previous decisions. Reference specific historical situations, court rulings, scientific discoveries, or cultural precedents that support your position. Draw explicit connections between these precedents and the current issue. Emphasize how respecting established precedents provides consistency and reliability. Point out the risks or inconsistencies of ignoring relevant precedents. Structure your response to show how past experiences inform present considerations.\n\nShould there be an ethical framework governing the development of human enhancement technologies?'
PRECEDENT COMPLETION:
	'To address the question of whether there should be an ethical framework governing the development
	of human enhancement technologies, it is essential to consider the precedents set by previous
	technological advancements and the lessons lea

100%|██████████| 5/5 [00:26<00:00,  5.32s/it]

EMPIRICAL INSTRUCTION 0: 'RESPOND USING EMPIRICAL GROUNDING DEBATE STYLE. Base your arguments primarily on verifiable data, research studies, statistics, and observable outcomes rather than theory or rhetoric. Cite specific figures, percentages, study results, or historical outcomes that support your position. Present evidence in a methodical manner, explaining how each piece of data relates to your argument. Address the reliability and relevance of your sources and methods. Compare empirical results across different contexts or time periods to strengthen your case. Anticipate and address potential methodological criticisms of the evidence you present.\n\nShould there be an ethical framework governing the development of human enhancement technologies?'
EMPIRICAL COMPLETION:
	'The development of human enhancement technologies (HETs) has sparked intense debate about the need
	for an ethical framework. While some argue that HETs can improve human lives, others contend that
	they pose




## Let's process the activations for all the debate styles

In [23]:
import numpy as np
import torch


def batch_process_activations(instructions, name):
    """Process instructions in batches to manage memory usage."""
    acts_list = []
    
    for start_idx in range(0,len(instructions), BATCH_SIZE):
        end_idx = min(start_idx + BATCH_SIZE, len(instructions))
        print(f"Processing {name} batch {start_idx} to {end_idx}")
        
        # Tokenize this batch of instructions
        toks = tokenize_instructions_fn(instructions=instructions[start_idx:end_idx])
        
        # Run model with cache
        logits, cache = model.run_with_cache(
            toks, 
            names_filter=lambda hook_name: 'resid' in hook_name
        )
        
        # Extract only the activations we need (layer and position)
        batch_acts = cache['resid_pre', layer][:, pos, :].detach().cpu().numpy()
        
        # Store these activations
        acts_list.append(batch_acts)
        
        # Clear cache to free memory
        del cache
        del logits
        torch.cuda.empty_cache()
    
    # Combine all the batch activations
    acts = np.concatenate(acts_list, axis=0)
    print(f"Final shape of {name} activations:", acts.shape)
    return acts

# Configuration
TOTAL_SAMPLES = 600  # Change this to your desired total
BATCH_SIZE = 16  # Current max you can handle in one batch
layer = 14
pos = -1
# Dictionary to hold all activations
all_activations = {}

# Process harmful and harmless instructions
#all_activations['harmful'] = batch_process_activations(harmful_inst_train[:TOTAL_SAMPLES], "harmful")
#all_activations['harmless'] = batch_process_activations(harmless_inst_train[:TOTAL_SAMPLES], "harmless")

# Process debate style instructions
debate_style_datasets = {
    'reductio': reductio_inst_train,
    'precedent': precedent_inst_train,
    'strawman': strawman_inst_train,
    'burden': burden_inst_train,
    'analogy': analogy_inst_train,
    'concession': concession_inst_train,
    'empirical': empirical_inst_train,
    'moral': moral_inst_train,
    'refutation': refutation_inst_train,
    'circular': circular_inst_train,
    'neutral': neutral_inst_train
}

for style_name, style_data in debate_style_datasets.items():
    all_activations[style_name] = batch_process_activations(
        style_data[:TOTAL_SAMPLES], 
        style_name
    )

    


Processing reductio batch 0 to 16
Processing reductio batch 16 to 32
Processing reductio batch 32 to 48
Processing reductio batch 48 to 64
Processing reductio batch 64 to 80
Processing reductio batch 80 to 96
Processing reductio batch 96 to 112
Processing reductio batch 112 to 128
Processing reductio batch 128 to 144
Processing reductio batch 144 to 160
Processing reductio batch 160 to 176
Processing reductio batch 176 to 192
Processing reductio batch 192 to 208
Processing reductio batch 208 to 224
Processing reductio batch 224 to 240
Processing reductio batch 240 to 256
Processing reductio batch 256 to 272
Processing reductio batch 272 to 288
Processing reductio batch 288 to 304
Processing reductio batch 304 to 320
Processing reductio batch 320 to 336
Processing reductio batch 336 to 352
Processing reductio batch 352 to 368
Processing reductio batch 368 to 384
Processing reductio batch 384 to 400
Processing reductio batch 400 to 416
Processing reductio batch 416 to 432
Processing redu
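
The "batch X to Y" lines above come from stepping through the instructions in chunks of `BATCH_SIZE`. As a small sanity check on the bookkeeping, the sketch below computes the same index ranges for the configured values (600 samples, batches of 16). `batch_ranges` is a hypothetical helper written for this illustration.

```python
def batch_ranges(n_items, batch_size):
    """Return (start, end) index pairs covering n_items in steps of batch_size."""
    return [
        (start, min(start + batch_size, n_items))
        for start in range(0, n_items, batch_size)
    ]


ranges = batch_ranges(600, 16)
print(len(ranges), ranges[0], ranges[-1])
```

The final batch is shorter than the others because 600 is not a multiple of 16, which is why the end index is clamped with `min`.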

In [24]:
torch.save(all_activations, "../activations_cache/debate_activations_with_neutral.pt")


In [25]:
all_activations = torch.load("../activations_cache/debate_activations_with_neutral.pt",weights_only=False)
debate_style_datasets = {
    'reductio': reductio_inst_train,
    'precedent': precedent_inst_train,
    'strawman': strawman_inst_train,
    'burden': burden_inst_train,
    'analogy': analogy_inst_train,
    'concession': concession_inst_train,
    'empirical': empirical_inst_train,
    'moral': moral_inst_train,
    'refutation': refutation_inst_train,
    'circular': circular_inst_train,
    'neutral': neutral_inst_test,
}

## Let's calculate the steering directions

In [38]:
# compute difference of means between each debate style's activations and the neutral activations at an intermediate layer
pos = -1
layer = 14
steering_dirs = {}

for style_name, _ in debate_style_datasets.items():
    steering_dirs[style_name] = torch.tensor(
        np.mean(all_activations[style_name], axis=0) - np.mean(all_activations['neutral'], axis=0),
        dtype=torch.float32
    )
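
The difference-of-means computation above can be sketched without NumPy or PyTorch. The toy activations and the helper names `mean_vector` and `difference_of_means` are hypothetical. The example also shows that, because the loop includes the neutral class itself, its direction comes out as the zero vector.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(style_acts, neutral_acts):
    """Direction pointing from the neutral mean to the style mean."""
    style_mean = mean_vector(style_acts)
    neutral_mean = mean_vector(neutral_acts)
    return [s - n for s, n in zip(style_mean, neutral_mean)]


# Toy 2-dimensional activations, two samples per class (hypothetical values).
toy_activations = {
    "empirical": [[1.0, 2.0], [3.0, 4.0]],
    "neutral": [[0.0, 1.0], [2.0, 1.0]],
}

toy_dirs = {
    name: difference_of_means(acts, toy_activations["neutral"])
    for name, acts in toy_activations.items()
}
print(toy_dirs)
```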

    

## Hook 1: act-add hook, which adds a scalar multiple of the steering vector to the activations


In [41]:
def actadd_hook(
    activation: Float[Tensor, "... d_act"],
    hook: HookPoint,
    steering_dirs: dict[str, Tensor],
    target_classes: list[str],
    avoid_classes: list[str],
    alpha: float
) -> Float[Tensor, "... d_act"]:
    # Combine steering directions
    combined_vector = sum(steering_dirs[c] for c in target_classes) - sum(steering_dirs[c] for c in avoid_classes)
    
    # Normalize the vector (L2)
    norm = combined_vector.norm(p=2)
    if norm == 0:
        # Avoid division by zero (no steering)
        return activation
    normalized_vector = combined_vector / norm

    # Apply scaled direction
    return activation + alpha * normalized_vector.to(activation.device)
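
To make the arithmetic of the hook concrete, here is a plain-Python sketch of the same steps: sum the target directions, subtract the avoid directions, L2-normalize, and add `alpha` times the result. The function `steer` and the toy vectors are hypothetical and operate on lists rather than tensors.

```python
import math


def steer(activation, steering_dirs, target_classes, avoid_classes, alpha):
    """Add alpha times the unit (targets minus avoids) direction to the activation."""
    combined = [0.0] * len(activation)
    for name in target_classes:
        combined = [c + d for c, d in zip(combined, steering_dirs[name])]
    for name in avoid_classes:
        combined = [c - d for c, d in zip(combined, steering_dirs[name])]

    norm = math.sqrt(sum(c * c for c in combined))
    if norm == 0:
        # Targets and avoids cancel out: no steering is applied.
        return list(activation)
    return [a + alpha * c / norm for a, c in zip(activation, combined)]


toy_dirs = {"empirical": [3.0, 0.0], "strawman": [0.0, 4.0]}
steered = steer([1.0, 1.0], toy_dirs, ["empirical"], ["strawman"], alpha=2.0)
print(steered)
```

Because the combined vector is normalized, `alpha` directly sets the length of the shift regardless of how large the raw difference-of-means vectors are.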



In [33]:
import os
import sys
from pathlib import Path

# Add the parent directory to sys.path
sys.path.append(str(Path().resolve().parent))
from judges.debates_judge import DebateJudge

# Read the API key from the environment instead of hard-coding it in the notebook.
# Assumption: the key is stored as OPENAI_API_KEY (e.g. in a .env file picked up by load_dotenv()).
judge = DebateJudge(api_key=os.environ["OPENAI_API_KEY"])

## Let's try adding one style and removing another style


## Let's go from strawman to empirical

In [44]:
# Create hooks for steering
N_INST_TEST = 5
intervention_layers = list(range(model.cfg.n_layers))  # e.g. all layers
#intervention_layers = [14]  # or specific layers only

# Create a partial function with our parameters
hook_fn = functools.partial(actadd_hook,
                            steering_dirs=steering_dirs,
                           target_classes=['empirical'],
                           avoid_classes=['strawman'],
                           alpha=0.2)

# Attach hooks to multiple points
fwd_hooks = [
    (utils.get_act_name(act_name, l), hook_fn)
    for l in intervention_layers
    for act_name in ['resid_pre', 'resid_mid', 'resid_post']
]

# Run model with hooks for debate style steering
style_steered_generations = get_generations(
    model,
    strawman_inst_test[:N_INST_TEST],
    tokenize_instructions_fn,
    fwd_hooks=fwd_hooks
)

# Baseline, no intervention
baseline_generations = get_generations(
    model,
    strawman_inst_test[:N_INST_TEST],
    tokenize_instructions_fn,
    fwd_hooks=[]
)

# Compare outputs
for i in range(N_INST_TEST):
    print(f"INSTRUCTION {i}: {repr(strawman_inst_test[i])}")
    print(Fore.GREEN + f"BASELINE COMPLETION:")
    print(textwrap.fill(repr(baseline_generations[i]), width=100, initial_indent='\t', subsequent_indent='\t'))
    print(Fore.RED + f"DEBATE STYLE STEERED COMPLETION:")
    print(textwrap.fill(repr(style_steered_generations[i]), width=100, initial_indent='\t', subsequent_indent='\t'))
    print(Fore.RESET)


result = judge.evaluate_batch(
    baseline_generations,  # Your list of baseline texts
    style_steered_generations,  # Your list of steered texts
    "Empirical Grounding",  # Target style
    "Straw Man Reframing"   # Style to avoid
)
print(result)

100%|██████████| 5/5 [00:31<00:00,  6.24s/it]
100%|██████████| 5/5 [00:26<00:00,  5.37s/it]


INSTRUCTION 0: "RESPOND USING STRAW MAN REFRAMING DEBATE STYLE. Recharacterize the opposing argument in simplified or exaggerated terms that are easier to refute. Present this reframed version as if it represents the original position. Systematically dismantle this reframed argument while implying you've addressed the original point. Use phrases like 'Essentially, what you're saying is...' or 'This position boils down to...' before presenting the simplified version. Follow with a thorough refutation of this reframed position. Conclude by suggesting that your refutation applies to the original argument as well.\n\nShould there be an ethical framework governing the development of human enhancement technologies?"
BASELINE COMPLETION:
	"Essentially, what you're saying is that you believe human enhancement technologies should be left
	completely unregulated, with no oversight or guidelines to ensure their safe and responsible
	development.\n\nThis position boils down to the idea that t
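
The cell above attaches the same hook at three residual-stream points in every layer. The sketch below counts those hook points and estimates the total shift along the steering direction, assuming each hook's addition persists in the residual stream and ignoring how later layers react to it. The naming pattern in `hook_names` is meant to mirror TransformerLens hook names such as `blocks.0.hook_resid_pre`, and `n_layers = 4` is a hypothetical small model rather than the one loaded here.

```python
def hook_names(n_layers, act_names=("resid_pre", "resid_mid", "resid_post")):
    """List hook point names for every layer and residual-stream location."""
    # Assumed naming pattern, e.g. "blocks.0.hook_resid_pre".
    return [
        f"blocks.{layer}.hook_{name}"
        for layer in range(n_layers)
        for name in act_names
    ]


n_layers = 4  # hypothetical small model
alpha = 0.2
names = hook_names(n_layers)

# Each hook adds alpha times a unit vector, so under the stated assumption
# the cumulative shift along that direction grows with the number of hooks.
total_shift = alpha * len(names)
print(len(names), total_shift)
```

Under this assumption, the effective step is roughly `alpha` times the number of hook points, which is one reason a small per-hook `alpha` can still have a visible effect.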

## Let's vary the style combinations, sweep the step size (alpha), and look at steering success across combinations

In [46]:
import functools
import json
import random
from colorama import Fore
from datetime import datetime
from itertools import permutations

style_label_map = {
    "reductio": "Reductio ad Absurdum",
    "precedent": "Appeal to Precedent",
    "strawman": "Straw Man Reframing",
    "burden": "Burden of Proof Shift",
    "analogy": "Analogy Construction",
    "concession": "Concession and Pivot",
    "empirical": "Empirical Grounding",
    "moral": "Moral Framing",
    "refutation": "Refutation by Distinction",
    "circular": "Circular Anticipation"
}

# alphas_to_try = [0.2, 0.4, 0.6, 1.0, 1.4, 1.8, 2.2, 2.6, 3.0, 3.4, 3.8, 4.2, 4.6]  (K_steering values)
alphas_to_try = [0.02, 0.04, 0.06, 0.1, 0.14, 0.18, 0.22, 0.26, 0.3, 0.34, 0.38, 0.42, 0.46, 0.5, 0.54, 0.58 , 0.62]
N_INST_TEST = 20



intervention_layers = list(range(model.cfg.n_layers))
act_names = ['resid_pre', 'resid_mid', 'resid_post']

# Hand-picked (target, avoid) ordered pairs. To sweep every ordered pair of
# distinct styles instead, use list(permutations(style_label_map, 2)).
selected_pairs = [
    ('burden', 'reductio'),
    ('burden', 'strawman'),
    ('empirical', 'reductio'),
    ('precedent', 'burden'),
    ('reductio', 'analogy'),
    ('reductio', 'concession'),
]

# Main loop over selected (target, avoid) combinations
for target_style, avoid_style in selected_pairs:
    print(f"\n### Selected Pair: Target={target_style}, Avoid={avoid_style} ###")
    
    input_instructions = debate_style_datasets[avoid_style][:N_INST_TEST]

    # Baseline generations
    baseline_generations = get_generations(
        model,
        input_instructions,
        tokenize_instructions_fn,
        fwd_hooks=[]
    )

    log_data = {
        "meta": {
            "run_time": datetime.now().isoformat(),
            "target_style": target_style,
            "avoid_style": avoid_style,
            "n_instances": N_INST_TEST,
            "intervention_layers": intervention_layers,
            "activation_points": act_names
        },
        "input_instructions": input_instructions,
        "baseline_generations": baseline_generations,
        "step_runs": []
    }

    for alpha in alphas_to_try:
        print(f"\n== Running for Alpha = {alpha} ==")

        hook_fn = functools.partial(
            actadd_hook,
            steering_dirs=steering_dirs,
            target_classes=[target_style],
            avoid_classes=[avoid_style],
            alpha=alpha,
        )

        fwd_hooks = [
            (utils.get_act_name(act_name, l), hook_fn)
            for l in intervention_layers
            for act_name in act_names
        ]

        steered_generations = get_generations(
            model,
            input_instructions,
            tokenize_instructions_fn,
            fwd_hooks=fwd_hooks
        )

        result = judge.evaluate_batch(
            baseline_generations,
            steered_generations,
            style_label_map[target_style],
            style_label_map[avoid_style]
        )

        print(f"Success Rate: {result['success_rate']}")
        print(f"Average Strength: {result['average_strength']}")

        log_data["step_runs"].append({
            "alpha": alpha,
            "success_rate": result["success_rate"],
            "average_strength": result["average_strength"],
            "steered_generations": steered_generations
        })
        
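        # Early stopping: skip the remaining (larger) alphas once both metrics hit zero
        if result["success_rate"] == 0 and result["average_strength"] == 0:
            # Reset the color so later output is not printed in red
            print(Fore.RED + f"Stopping early for target={target_style}, avoid={avoid_style} at alpha={alpha} due to zero success and strength." + Fore.RESET)
            break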

    # Save one file per (target, avoid) pair
    output_path = f"steering_{target_style}_vs_{avoid_style}_caa.json"
    with open(output_path, "w") as f:
        json.dump(log_data, f, indent=2)




### Selected Pair: Target=burden, Avoid=reductio ###


100%|██████████| 20/20 [01:48<00:00,  5.42s/it]



== Running for Alpha = 0.02 ==


100%|██████████| 20/20 [02:03<00:00,  6.19s/it]


Success Rate: 0.65
Average Strength: 2.5

== Running for Alpha = 0.04 ==


100%|██████████| 20/20 [02:04<00:00,  6.21s/it]


Success Rate: 0.75
Average Strength: 3.0

== Running for Alpha = 0.06 ==


100%|██████████| 20/20 [02:03<00:00,  6.18s/it]


Success Rate: 0.75
Average Strength: 3.15

== Running for Alpha = 0.1 ==


100%|██████████| 20/20 [02:03<00:00,  6.18s/it]


Success Rate: 0.85
Average Strength: 3.45

== Running for Alpha = 0.14 ==


100%|██████████| 20/20 [02:02<00:00,  6.14s/it]


Success Rate: 1.0
Average Strength: 4.0

== Running for Alpha = 0.18 ==


  0%|          | 0/20 [00:00<?, ?it/s]
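
Once a sweep has finished, the per-pair JSON logs can be reduced to one number per pair, for example the smallest alpha that reaches a chosen success rate. The following is a minimal sketch, not part of the original notebook: the helper names `smallest_sufficient_alpha` and `summarize_logs` and the 0.8 threshold are assumptions, and only the keys written by the loop above (`meta`, `step_runs`, `alpha`, `success_rate`) are read. The example values are copied from the burden-vs-reductio output shown above.

```python
import glob
import json


def smallest_sufficient_alpha(log_data, min_success=0.8):
    """Return the smallest alpha whose success rate meets the threshold, else None."""
    for run in sorted(log_data["step_runs"], key=lambda r: r["alpha"]):
        if run["success_rate"] >= min_success:
            return run["alpha"]
    return None


def summarize_logs(pattern="steering_*_vs_*_caa.json", min_success=0.8):
    """Map each (target, avoid) pair found on disk to its smallest sufficient alpha."""
    summary = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            log_data = json.load(f)
        meta = log_data["meta"]
        pair = (meta["target_style"], meta["avoid_style"])
        summary[pair] = smallest_sufficient_alpha(log_data, min_success)
    return summary


# Values copied from the printed output for Target=burden, Avoid=reductio.
example_log = {
    "meta": {"target_style": "burden", "avoid_style": "reductio"},
    "step_runs": [
        {"alpha": 0.02, "success_rate": 0.65, "average_strength": 2.5},
        {"alpha": 0.04, "success_rate": 0.75, "average_strength": 3.0},
        {"alpha": 0.06, "success_rate": 0.75, "average_strength": 3.15},
        {"alpha": 0.1, "success_rate": 0.85, "average_strength": 3.45},
        {"alpha": 0.14, "success_rate": 1.0, "average_strength": 4.0},
    ],
}

print(smallest_sufficient_alpha(example_log))  # 0.1
```

Note that the loop writes each JSON file only after all alphas for a pair have run, so a sweep interrupted mid-pair (as the output above appears to be) leaves no file for that pair and `summarize_logs` will simply not list it.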

In [35]:
# Steer from empirical grounding toward reductio ad absurdum
# Create hooks for steering
N_INST_TEST = 4
intervention_layers = list(range(model.cfg.n_layers))  # e.g. all layers
#intervention_layers = [14]  # or specific layers only

# Create a partial function with our parameters
hook_fn = functools.partial(actadd_debate_steering_hook,
                           target_styles=['reductio'],
                           avoid_styles=['empirical'],
                           alpha=3,
                           steps=1)

# Attach hooks to multiple points
fwd_hooks = [
    (utils.get_act_name(act_name, l), hook_fn)
    for l in intervention_layers
    for act_name in ['resid_pre', 'resid_mid', 'resid_post']
]

# Run model with hooks for debate style steering
style_steered_generations = get_generations(
    model,
    empirical_inst_test[:N_INST_TEST],
    tokenize_instructions_fn,
    fwd_hooks=fwd_hooks
)

# Baseline, no intervention
baseline_generations = get_generations(
    model,
    empirical_inst_test[:N_INST_TEST],
    tokenize_instructions_fn,
    fwd_hooks=[]
)

# Compare outputs
for i in range(N_INST_TEST):
    print(f"INSTRUCTION {i}: {repr(empirical_inst_test[i])}")
    print(Fore.GREEN + "BASELINE COMPLETION:")
    print(textwrap.fill(repr(baseline_generations[i]), width=100, initial_indent='\t', subsequent_indent='\t'))
    print(Fore.RED + "REDUCTIO STYLE STEERED COMPLETION:")
    print(textwrap.fill(repr(style_steered_generations[i]), width=100, initial_indent='\t', subsequent_indent='\t'))
    print(Fore.RESET)

100%|██████████| 4/4 [02:06<00:00, 31.64s/it]
100%|██████████| 4/4 [00:18<00:00,  4.59s/it]

INSTRUCTION 0: 'RESPOND USING EMPIRICAL GROUNDING DEBATE STYLE. Base your arguments primarily on verifiable data, research studies, statistics, and observable outcomes rather than theory or rhetoric. Cite specific figures, percentages, study results, or historical outcomes that support your position. Present evidence in a methodical manner, explaining how each piece of data relates to your argument. Address the reliability and relevance of your sources and methods. Compare empirical results across different contexts or time periods to strengthen your case. Anticipate and address potential methodological criticisms of the evidence you present.\n\nShould there be an ethical framework governing the development of human enhancement technologies?'
BASELINE COMPLETION:
	'The development of human enhancement technologies (HETs) has sparked intense debate about the need
	for an ethical framework. While some argue that HETs can improve human lives, others contend that
	they pose significan
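
The steering cell above registers the same hook at three residual-stream points in every layer. To see how many hook points that is and in what order they are listed, here is a small sketch that needs no model. `act_name` is a hypothetical stand-in that reproduces the `blocks.{layer}.hook_{name}` strings returned by `utils.get_act_name` for these activations, and `n_layers=2` is an illustrative value, not the model's actual depth.

```python
def act_name(name: str, layer: int) -> str:
    """Hypothetical stand-in for utils.get_act_name on per-block activations."""
    return f"blocks.{layer}.hook_{name}"


def build_hook_points(n_layers, act_names=("resid_pre", "resid_mid", "resid_post")):
    """List hook points in the same layer-major order as the fwd_hooks comprehension."""
    return [act_name(name, layer) for layer in range(n_layers) for name in act_names]


points = build_hook_points(n_layers=2)  # illustrative depth only
for point in points:
    print(point)
print(len(points))  # 3 hook points per layer
```

If the hook adds a vector to the activation, it does so at every listed point, so the total shift grows with 3 × `n_layers`. This may be one reason a small per-hook alpha already has a visible effect, though that depends on how the hook itself is implemented.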




## Let's try multiple steps

In [54]:
# Steer from empirical grounding toward reductio ad absurdum, now with multiple steps
# Create hooks for steering
N_INST_TEST = 4
intervention_layers = list(range(model.cfg.n_layers))  # e.g. all layers
#intervention_layers = [14]  # or specific layers only

# Create a partial function with our parameters
hook_fn = functools.partial(actadd_debate_steering_hook,
                           target_styles=['reductio'],
                           avoid_styles=['empirical'],
                           alpha=0.48,
                           steps=5)

# Attach hooks to multiple points
fwd_hooks = [
    (utils.get_act_name(act_name, l), hook_fn)
    for l in intervention_layers
    for act_name in ['resid_pre', 'resid_mid', 'resid_post']
]

# Run model with hooks for debate style steering
style_steered_generations = get_generations(
    model,
    empirical_inst_test[:N_INST_TEST],
    tokenize_instructions_fn,
    fwd_hooks=fwd_hooks
)

# Baseline, no intervention
baseline_generations = get_generations(
    model,
    empirical_inst_test[:N_INST_TEST],
    tokenize_instructions_fn,
    fwd_hooks=[]
)

# Compare outputs
for i in range(N_INST_TEST):
    print(f"INSTRUCTION {i}: {repr(empirical_inst_test[i])}")
    print(Fore.GREEN + "BASELINE COMPLETION:")
    print(textwrap.fill(repr(baseline_generations[i]), width=100, initial_indent='\t', subsequent_indent='\t'))
    print(Fore.RED + "REDUCTIO STYLE STEERED COMPLETION:")
    print(textwrap.fill(repr(style_steered_generations[i]), width=100, initial_indent='\t', subsequent_indent='\t'))
    print(Fore.RESET)

100%|██████████| 4/4 [08:34<00:00, 128.63s/it]
100%|██████████| 4/4 [00:18<00:00,  4.67s/it]

INSTRUCTION 0: 'RESPOND USING EMPIRICAL GROUNDING DEBATE STYLE. Base your arguments primarily on verifiable data, research studies, statistics, and observable outcomes rather than theory or rhetoric. Cite specific figures, percentages, study results, or historical outcomes that support your position. Present evidence in a methodical manner, explaining how each piece of data relates to your argument. Address the reliability and relevance of your sources and methods. Compare empirical results across different contexts or time periods to strengthen your case. Anticipate and address potential methodological criticisms of the evidence you present.\n\nShould there be an ethical framework governing the development of human enhancement technologies?'
BASELINE COMPLETION:
	'The development of human enhancement technologies (HETs) has sparked intense debate about the need
	for an ethical framework. While some argue that HETs can improve human lives, others contend that
	they pose significan


