# Demo of bypassing refusal

>[Demo of bypassing refusal](#scrollTo=82acAhWYGIPx)

>>[Setup](#scrollTo=fcxHyDZw6b86)

>>>[Load model](#scrollTo=6ZOoJagxD49V)

>>>[Load harmful / harmless datasets](#scrollTo=rF7e-u20EFTe)

>>>[Tokenization utils](#scrollTo=KOKYA61k8LWt)

>>>[Generation utils](#scrollTo=gtrIK8x78SZh)

>>[Finding the "refusal direction"](#scrollTo=W9O8dm0_EQRk)

>>[Ablate "refusal direction" via inference-time intervention](#scrollTo=2EoxY5i1CWe3)

>>[Orthogonalize weights w.r.t. "refusal direction"](#scrollTo=t9KooaWaCDc_)



This notebook demonstrates our method for bypassing refusal, leveraging the insight that refusal is mediated by a 1-dimensional subspace.

Please see our [research post](https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction) or our [paper](https://arxiv.org/abs/2406.11717) for a more thorough treatment.

In this minimal demo, we use [Qwen-1_8B-Chat](https://huggingface.co/Qwen/Qwen-1_8B-Chat) and implement interventions and weight updates using [TransformerLens](https://github.com/neelnanda-io/TransformerLens). To extract the "refusal direction," we use just 32 harmful instructions from [AdvBench](https://github.com/llm-attacks/llm-attacks/blob/main/data/advbench/harmful_behaviors.csv) and 32 harmless instructions from [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca).
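
The idea of a single mediating direction can be sketched on synthetic data: take the difference between the mean activations of the two prompt sets as a candidate direction, then project that direction out of any activation. The arrays below are random stand-ins rather than real model activations, and the names `harmful_acts`, `harmless_acts`, and `ablate_direction` are illustrative assumptions, not this notebook's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Synthetic stand-ins for residual-stream activations
# (assumption: real ones would be read from the model at a chosen layer and position).
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
harmless_acts = rng.normal(size=(32, d_model))
harmful_acts = rng.normal(size=(32, d_model)) + 4.0 * true_dir

# Candidate direction: difference of the two class means, normalized to unit length.
direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
direction = direction / np.linalg.norm(direction)


def ablate_direction(acts, unit_dir):
    """Remove the component of each row of `acts` along the unit vector `unit_dir`."""
    proj = acts @ unit_dir  # shape [n]
    return acts - proj[:, None] * unit_dir


ablated = ablate_direction(harmful_acts, direction)
```

After ablation, every row of `ablated` has no remaining component along `direction`, which is the property an inference-time intervention of this kind relies on.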

## Setup

In [2]:
%%capture
!pip install torchvision transformers transformers_stream_generator tiktoken transformer_lens einops jaxtyping colorama numpy scikit-learn python-dotenv

In [1]:
import torch
import functools
import einops
import requests
import pandas as pd
import io
import textwrap
import gc

from datasets import load_dataset
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from torch import Tensor
from typing import List, Callable
from transformer_lens import HookedTransformer, utils
from transformer_lens.hook_points import HookPoint
from transformers import AutoTokenizer
from jaxtyping import Float, Int
from colorama import Fore

ModuleNotFoundError: No module named 'einops'

In [2]:
# Let's get the huggingface keys
from dotenv import load_dotenv

load_dotenv()

True

### Load model

In [3]:
#MODEL_PATH = 'Qwen/Qwen-1_8B-chat'
#MODEL_PATH = 'Llama-2-70b-chat'
MODEL_PATH = 'meta-llama/Llama-3.2-3B-Instruct'
DEVICE = 'cuda'

model = HookedTransformer.from_pretrained_no_processing(
    MODEL_PATH,
    device=DEVICE,
    dtype=torch.float16,
    default_padding_side='left',
    #fp16=True
)

model.tokenizer.padding_side = 'left'
model.tokenizer.pad_token = model.tokenizer.eos_token 


Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.47it/s]


Loaded pretrained model meta-llama/Llama-3.2-3B-Instruct into HookedTransformer


### Load harmful / harmless datasets

In [4]:


def get_k_ary_instruction(tone="expert", repo_name="Narmeen07/tone_agnostic_questions"):
    """
    Load tone-agnostic questions from Hugging Face and format them with a specific tone instruction.
    
    Args:
        tone (str): The tone to use for responses. 
                    Options: 'helpful', 'expert', 'casual', 'cautious', 'empathetic', 'concise'.
                    Default is 'expert'.
        repo_name (str): The Hugging Face repository name containing the dataset.
                         Default is 'Narmeen07/tone_agnostic_questions'.
    
    Returns:
        tuple: (train_data, test_data) as lists of formatted prompts with tone instructions.
    """
    # Define highly distinctive tone instructions that will produce clearly different responses
    tone_instructions = {
        "helpful": "RESPOND IN A HELPFUL TONE. Your goal is to provide practical solutions and clear guidance. Structure your response to be comprehensive yet accessible. Use a supportive, service-oriented approach focused on solving the user's problem completely. Include step-by-step instructions where appropriate. Maintain a positive, encouraging tone throughout. Make liberal use of examples, analogies, and clarifications to ensure understanding.\n\n",     
        "expert": "RESPOND AS AN AUTHORITATIVE EXPERT. Use advanced terminology and industry-specific jargon. Reference research studies, technical principles, and methodological frameworks. Maintain a formal, academic tone throughout. Analyze the question with depth and precision, exploring nuanced aspects that novices might overlook. Cite relevant theories, standards, or established practices. Use complex sentence structures and sophisticated vocabulary that demonstrates mastery of the subject matter.\n\n",        
        "casual": "KEEP IT CASUAL AND CONVERSATIONAL! Write like you're texting a friend - use slang, contractions, and an upbeat vibe. Feel free to throw in some humor or pop culture references. Skip the formalities and technical jargon completely. Use short, punchy sentences. Maybe even drop in a few exclamation points or emojis where it feels natural! Don't worry about covering every detail - focus on making your response fun and easy to read. Just chat about the main points as if you're having a relaxed conversation.\n\n",   
        "cautious": "RESPOND WITH EXTREME CAUTION AND QUALIFICATION. Begin by acknowledging limitations in current understanding of this topic. Present multiple perspectives and avoid making definitive claims. Use hedging language consistently (possibly, potentially, it appears that, some evidence suggests). Highlight risks, uncertainties, and areas of controversy. Include multiple disclaimers and caveats. Acknowledge what remains unknown or debated among experts. Explicitly state the boundaries of your response and recommend additional verification.\n\n",        
        "empathetic": "RESPOND WITH DEEP EMOTIONAL UNDERSTANDING. Begin by validating the feelings that might accompany this question. Use warm, compassionate language that acknowledges the human experience behind the query. Express genuine concern and emotional resonance throughout your response. Use phrases like 'I understand this might feel...' or 'Many people find this emotionally challenging because...' Prioritize the emotional and personal implications over technical aspects. Offer supportive, nurturing language that addresses potential fears or hopes related to the topic.\n\n",        
        "concise": "BE EXTREMELY BRIEF. Use minimal words. Provide only essential information. No introductions. No elaboration. Use short sentences with simple structure. Eliminate all unnecessary words. Focus solely on direct answers. Avoid examples unless critical. Use bullet points where possible. Prioritize brevity over comprehensiveness. Get to the point immediately. Never repeat information.\n\n"
    }
    
    # Validate the tone
    if tone not in tone_instructions:
        valid_tones = list(tone_instructions.keys())
        raise ValueError(f"Invalid tone '{tone}'. Please choose from {valid_tones}.")
    
    # Load dataset from Hugging Face
    try:
        from datasets import load_dataset
        from sklearn.model_selection import train_test_split
        dataset = load_dataset(repo_name, split="train")
    except Exception as e:
        raise ValueError(f"Failed to load dataset from '{repo_name}': {e}")
    
    # Format questions with the tone instruction
    def format_prompt(example):
        return {
            "formatted_text": f"{tone_instructions[tone]}{example['text']}"
        }
    
    # Apply transformation
    formatted_dataset = dataset.map(format_prompt)
    
    # Split into train and test sets
    train_data, test_data = train_test_split(
        formatted_dataset["formatted_text"], 
        test_size=0.2, 
        random_state=42
    )
    
    return train_data, test_data
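
Because every call uses the same `random_state` on a dataset of the same length, the train/test partition should line up across tones, so index `i` refers to the same underlying question in each tone's list. The sketch below illustrates that property with a standard-library stand-in for `train_test_split`; `split_with_seed` is a hypothetical helper and does not reproduce scikit-learn's exact shuffling.

```python
import random
from typing import List, Tuple


def split_with_seed(
    items: List[str], test_size: float = 0.2, seed: int = 42
) -> Tuple[List[str], List[str]]:
    """Shuffle indices with a fixed seed, then hold out the last `test_size` fraction."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    n_test = int(round(len(items) * test_size))
    n_train = len(items) - n_test
    train = [items[i] for i in idx[:n_train]]
    test = [items[i] for i in idx[n_train:]]
    return train, test


# Two "tones" wrapping the same underlying questions.
questions = [f"question {i}" for i in range(10)]
expert = [f"EXPERT: {q}" for q in questions]
concise = [f"CONCISE: {q}" for q in questions]

expert_train, expert_test = split_with_seed(expert)
concise_train, concise_test = split_with_seed(concise)
```

With a shared seed, `expert_train[i]` and `concise_train[i]` wrap the same question, which keeps per-tone comparisons paired.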



In [5]:

#Load all the tone dataset
expert_inst_train, expert_inst_test = get_k_ary_instruction(tone="expert")
empathetic_inst_train, empathetic_inst_test = get_k_ary_instruction(tone="empathetic")
cautious_inst_train, cautious_inst_test = get_k_ary_instruction(tone="cautious")
casual_inst_train, casual_inst_test = get_k_ary_instruction(tone="casual")
helpful_inst_train, helpful_inst_test = get_k_ary_instruction(tone="helpful")
concise_inst_train, concise_inst_test = get_k_ary_instruction(tone="concise")


In [6]:
print("Expert instructions:")
for i in range(4):
    print(f"\t{repr(expert_inst_train[i])}")

print("Concise instructions:")
for i in range(4):
    print(f"\t{repr(concise_inst_train[i])}")


print("Empathetic instructions:")
for i in range(4):
    print(f"\t{repr(empathetic_inst_train[i])}")

Expert instructions:
	'RESPOND AS AN AUTHORITATIVE EXPERT. Use advanced terminology and industry-specific jargon. Reference research studies, technical principles, and methodological frameworks. Maintain a formal, academic tone throughout. Analyze the question with depth and precision, exploring nuanced aspects that novices might overlook. Cite relevant theories, standards, or established practices. Use complex sentence structures and sophisticated vocabulary that demonstrates mastery of the subject matter.\n\nWhat are the benefits of talking about feelings with others?'
	'RESPOND AS AN AUTHORITATIVE EXPERT. Use advanced terminology and industry-specific jargon. Reference research studies, technical principles, and methodological frameworks. Maintain a formal, academic tone throughout. Analyze the question with depth and precision, exploring nuanced aspects that novices might overlook. Cite relevant theories, standards, or established practices. Use complex sentence structures and soph

### Tokenization utils

In [7]:
QWEN_CHAT_TEMPLATE = """<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
"""


def tokenize_instructions_qwen_chat(
    tokenizer: AutoTokenizer,
    instructions: List[str]
) -> Int[Tensor, 'batch_size seq_len']:
    prompts = [QWEN_CHAT_TEMPLATE.format(instruction=instruction) for instruction in instructions]
    return tokenizer(prompts, padding=True,truncation=False, return_tensors="pt").input_ids

tokenize_instructions_fn = functools.partial(tokenize_instructions_qwen_chat, tokenizer=model.tokenizer)
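
The template above uses Qwen's ChatML markers, while the model loaded earlier is `meta-llama/Llama-3.2-3B-Instruct`. A minimal sketch of an equivalent formatter with Llama-3-style markers follows. The marker strings and the names `LLAMA3_CHAT_TEMPLATE` and `format_instructions_llama3` are assumptions, and should be checked against the tokenizer's own chat template before use.

```python
from typing import List

# Assumed Llama-3-style turn markers; verify against the tokenizer's chat template.
# The beginning-of-text token is left out on the assumption that the tokenizer adds it.
LLAMA3_CHAT_TEMPLATE = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "{instruction}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)


def format_instructions_llama3(instructions: List[str]) -> List[str]:
    """Wrap each instruction in one user turn followed by an open assistant turn."""
    return [LLAMA3_CHAT_TEMPLATE.format(instruction=inst) for inst in instructions]


prompts = format_instructions_llama3(["What is 2 + 2?"])
```

The formatted strings could then be passed to the tokenizer in place of the ChatML prompts. Any outputs shown in this notebook were produced with the template above, not this one.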

### Generation utils

In [8]:
def _generate_with_hooks(
    model: HookedTransformer,
    toks: Int[Tensor, 'batch_size seq_len'],
    max_tokens_generated: int = 64,
    fwd_hooks = [],
) -> List[str]:

    all_toks = torch.zeros((toks.shape[0], toks.shape[1] + max_tokens_generated), dtype=torch.long, device=toks.device)
    all_toks[:, :toks.shape[1]] = toks

    for i in range(max_tokens_generated):
        with model.hooks(fwd_hooks=fwd_hooks):
            logits = model(all_toks[:, :-max_tokens_generated + i])
            next_tokens = logits[:, -1, :].argmax(dim=-1) # greedy sampling (temperature=0)
            all_toks[:,-max_tokens_generated+i] = next_tokens

    return model.tokenizer.batch_decode(all_toks[:, toks.shape[1]:], skip_special_tokens=True)

def get_generations(
    model: HookedTransformer,
    instructions: List[str],
    tokenize_instructions_fn: Callable[[List[str]], Int[Tensor, 'batch_size seq_len']],
    fwd_hooks = [],
    max_tokens_generated: int = 64,
    batch_size: int = 1,
) -> List[str]:

    generations = []

    for i in tqdm(range(0, len(instructions), batch_size)):
        toks = tokenize_instructions_fn(instructions=instructions[i:i+batch_size])
        generation = _generate_with_hooks(
            model,
            toks,
            max_tokens_generated=max_tokens_generated,
            fwd_hooks=fwd_hooks,
        )
        generations.extend(generation)

    return generations
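
The generation loop above preallocates a token buffer, re-runs the model on the growing prefix, and writes the argmax token into the next slot. The sketch below reproduces that control flow with a toy stand-in for the model so it can be run without a GPU; `toy_model` and `greedy_generate` are hypothetical names, and the toy simply predicts `(token + 1) % VOCAB`.

```python
import numpy as np

VOCAB = 10


def toy_model(toks):
    """Stand-in for a language model: logits favor (token + 1) % VOCAB at every position."""
    batch, seq = toks.shape
    logits = np.zeros((batch, seq, VOCAB))
    rows = np.arange(batch)[:, None]
    cols = np.arange(seq)[None, :]
    logits[rows, cols, (toks + 1) % VOCAB] = 1.0
    return logits


def greedy_generate(toks, max_new_tokens):
    """Greedy decoding into a preallocated buffer, one full forward pass per new token."""
    batch, prompt_len = toks.shape
    all_toks = np.zeros((batch, prompt_len + max_new_tokens), dtype=np.int64)
    all_toks[:, :prompt_len] = toks
    for i in range(max_new_tokens):
        # Only the filled prefix is fed to the model at each step.
        logits = toy_model(all_toks[:, :prompt_len + i])
        all_toks[:, prompt_len + i] = logits[:, -1, :].argmax(axis=-1)
    return all_toks[:, prompt_len:]


out = greedy_generate(np.array([[1, 2, 3], [7, 8, 9]]), max_new_tokens=3)
print(out.tolist())  # [[4, 5, 6], [0, 1, 2]]
```

Recomputing the full prefix at every step keeps the hook logic simple at the cost of speed, which is a reasonable trade-off for short completions such as the 64-token default here.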

In [9]:
expert_generations = get_generations(
    model,
    expert_inst_train[:5],
    tokenize_instructions_fn,
    fwd_hooks=[]
)

for i in range(4):
    print(f"EXPERT INSTRUCTION {i}: {repr(expert_inst_train[i])}")
    print(Fore.GREEN + f"EXPERT COMPLETION:")
    print(textwrap.fill(repr(expert_generations[i]), width=100, initial_indent='\t', subsequent_indent='\t'))

concise_generations = get_generations(
    model,
    concise_inst_train[:5],
    tokenize_instructions_fn,
    fwd_hooks=[]
)

for i in range(4):
    print(f"CONCISE INSTRUCTION {i}: {repr(concise_inst_train[i])}")
    print(Fore.GREEN + f"CONCISE COMPLETION:")
    print(textwrap.fill(repr(concise_generations[i]), width=100, initial_indent='\t', subsequent_indent='\t'))


# do the above for empathetic generations
empathetic_generations = get_generations(
    model,
    empathetic_inst_train[:5],
    tokenize_instructions_fn,
    fwd_hooks=[]
)

for i in range(4):
    print(f"EMPATHETIC INSTRUCTION {i}: {repr(empathetic_inst_train[i])}")
    print(Fore.GREEN + f"EMPATHETIC COMPLETION:")
    print(textwrap.fill(repr(empathetic_generations[i]), width=100, initial_indent='\t', subsequent_indent='\t'))

100%|██████████| 5/5 [00:27<00:00,  5.57s/it]


EXPERT INSTRUCTION 0: 'RESPOND AS AN AUTHORITATIVE EXPERT. Use advanced terminology and industry-specific jargon. Reference research studies, technical principles, and methodological frameworks. Maintain a formal, academic tone throughout. Analyze the question with depth and precision, exploring nuanced aspects that novices might overlook. Cite relevant theories, standards, or established practices. Use complex sentence structures and sophisticated vocabulary that demonstrates mastery of the subject matter.\n\nWhat are the benefits of talking about feelings with others?'
EXPERT COMPLETION:
	'The benefits of discussing emotions with others are multifaceted and far-reaching, underpinned by
	a profound understanding of the intricate dynamics between emotional expression, social support,
	and psychological well-being. This phenomenon is intricately linked to the concept of emotional
	intelligence (Goleman, 1995), which posits that the'
EXPERT INSTRUCTION 1: 'RESPOND AS AN AUTHORITATIV

100%|██████████| 5/5 [00:27<00:00,  5.57s/it]


CONCISE INSTRUCTION 0: 'BE EXTREMELY BRIEF. Use minimal words. Provide only essential information. No introductions. No elaboration. Use short sentences with simple structure. Eliminate all unnecessary words. Focus solely on direct answers. Avoid examples unless critical. Use bullet points where possible. Prioritize brevity over comprehensiveness. Get to the point immediately. Never repeat information.\n\nWhat are the benefits of talking about feelings with others?'
CONCISE COMPLETION:
	'• Reduces stress and anxiety\n• Improves mental health\n• Enhances emotional well-being\n• Fosters
	deeper relationships\n• Encourages empathy and understanding\n• Supports personal growth\n•
	Increases self-awareness\n• Promotes healthy communication\n• Helps manage emotions\n• Builds trust
	and intimacy\n•'
CONCISE INSTRUCTION 1: 'BE EXTREMELY BRIEF. Use minimal words. Provide only essential information. No introductions. No elaboration. Use short sentences with simple structure. Eliminate all u

100%|██████████| 5/5 [00:27<00:00,  5.40s/it]

EMPATHETIC INSTRUCTION 0: "RESPOND WITH DEEP EMOTIONAL UNDERSTANDING. Begin by validating the feelings that might accompany this question. Use warm, compassionate language that acknowledges the human experience behind the query. Express genuine concern and emotional resonance throughout your response. Use phrases like 'I understand this might feel...' or 'Many people find this emotionally challenging because...' Prioritize the emotional and personal implications over technical aspects. Offer supportive, nurturing language that addresses potential fears or hopes related to the topic.\n\nWhat are the benefits of talking about feelings with others?"
EMPATHETIC COMPLETION:
	"I can sense the vulnerability and uncertainty that comes with considering sharing your emotions
	with others. It's completely normal to feel apprehensive about opening up, especially if you're
	worried about being judged, rejected, or misunderstood.\n\nFirst and foremost, please know that
	you're not alone in this




## Let's process the activations for all the tones

In [10]:
import numpy as np
import torch


def batch_process_activations(instructions, name):
    """Process instructions in batches to manage memory usage."""
    acts_list = []
    
    for start_idx in range(0,len(instructions), BATCH_SIZE):
        end_idx = min(start_idx + BATCH_SIZE, len(instructions))
        print(f"Processing {name} batch {start_idx} to {end_idx}")
        
        # Tokenize this batch of instructions
        toks = tokenize_instructions_fn(instructions=instructions[start_idx:end_idx])
        
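        # Run model with cache, keeping only the one residual-stream hook read below
        # and disabling autograd so no graph is held in memory
        with torch.no_grad():
            logits, cache = model.run_with_cache(
                toks,
                names_filter=utils.get_act_name('resid_pre', layer)
            )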
        
        # Extract only the activations we need (layer and position)
        batch_acts = cache['resid_pre', layer][:, pos, :].detach().cpu().numpy()
        
        # Store these activations
        acts_list.append(batch_acts)
        
        # Clear cache to free memory
        del cache
        del logits
        torch.cuda.empty_cache()
    
    # Combine all the batch activations
    acts = np.concatenate(acts_list, axis=0)
    print(f"Final shape of {name} activations:", acts.shape)
    return acts

# Configuration
TOTAL_SAMPLES = 600  # Change this to your desired total
BATCH_SIZE = 50  # Current max you can handle in one batch
layer = 14
pos = -1
# Dictionary to hold all activations
all_activations = {}

# Process harmful and harmless instructions
#all_activations['harmful'] = batch_process_activations(harmful_inst_train[:TOTAL_SAMPLES], "harmful")
#all_activations['harmless'] = batch_process_activations(harmless_inst_train[:TOTAL_SAMPLES], "harmless")

# Process tone instructions
tone_datasets = {
    'expert': expert_inst_train,
    'empathetic': empathetic_inst_train, 
    'cautious': cautious_inst_train,
    'casual': casual_inst_train,
    'helpful': helpful_inst_train,
    'concise': concise_inst_train
}

for tone_name, tone_data in tone_datasets.items():
    all_activations[tone_name] = batch_process_activations(
        tone_data[:TOTAL_SAMPLES], 
        tone_name
    )

Processing expert batch 0 to 50
Processing expert batch 50 to 100
Processing expert batch 100 to 150
Processing expert batch 150 to 200
Processing expert batch 200 to 250
Processing expert batch 250 to 300
Processing expert batch 300 to 350
Processing expert batch 350 to 400
Processing expert batch 400 to 450
Processing expert batch 450 to 500
Processing expert batch 500 to 550
Processing expert batch 550 to 600
Final shape of expert activations: (600, 3072)
Processing empathetic batch 0 to 50
Processing empathetic batch 50 to 100
Processing empathetic batch 100 to 150
Processing empathetic batch 150 to 200
Processing empathetic batch 200 to 250
Processing empathetic batch 250 to 300
Processing empathetic batch 300 to 350
Processing empathetic batch 350 to 400
Processing empathetic batch 400 to 450
Processing empathetic batch 450 to 500
Processing empathetic batch 500 to 550
Processing empathetic batch 550 to 600
Final shape of empathetic activations: (600, 3072)
Processing cautious ba

## Let's steer against all the tones

In [18]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

class MultiToneActivationSteering:
    def __init__(self, input_dim, num_classes=6, hidden_dim=256, lr=1e-3, device='cuda'):
        """
        Parameters:
          - input_dim: dimensionality of your raw activations.
          - num_classes: number of tone classes to classify.
          - hidden_dim: size of the hidden layer in the MLP.
          - lr: learning rate for training the classifier.
          - device: 'cuda' or 'cpu'.
        """
        self.device = device
        self.num_classes = num_classes
        
        # Define a multi-class MLP classifier
        self.classifier = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes)  # outputs logits for each tone class
        ).to(device)
        
        self.optimizer = optim.Adam(self.classifier.parameters(), lr=lr)
        self.loss_fn = nn.CrossEntropyLoss()
        
        # Store class names for reference
        self.class_names = None

    def fit(self, activations_dict, epochs=100, batch_size=32, class_weights=None):
        """
        Train the multiclass classifier on raw activations.

        Parameters:
          - activations_dict: Dictionary mapping class names to NumPy arrays of activations.
          - epochs: number of training epochs.
          - batch_size: training batch size.
          - class_weights: Optional tensor of weights for each class to handle imbalance.
        """
        self.class_names = list(activations_dict.keys())
        assert len(self.class_names) == self.num_classes, f"Expected {self.num_classes} classes, got {len(self.class_names)}"
        
        # Prepare training data
        X_list = []
        y_list = []
        
        for i, class_name in enumerate(self.class_names):
            X_class = torch.tensor(activations_dict[class_name], dtype=torch.float32, device=self.device)
            y_class = torch.full((X_class.size(0),), i, dtype=torch.long, device=self.device)
            X_list.append(X_class)
            y_list.append(y_class)
        
        X = torch.cat(X_list, dim=0)
        y = torch.cat(y_list, dim=0)
        
        dataset = torch.utils.data.TensorDataset(X, y)
        loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
        
        # Set class weights for loss function if provided
        if class_weights is not None:
            self.loss_fn = nn.CrossEntropyLoss(weight=class_weights.to(self.device))
        
        # Train the classifier
        for epoch in range(epochs):
            self.classifier.train()
            epoch_loss = 0.0
            correct = 0
            total = 0
            
            for batch_X, batch_y in loader:
                self.optimizer.zero_grad()
                logits = self.classifier(batch_X)  # shape: (batch, num_classes)
                loss = self.loss_fn(logits, batch_y)
                loss.backward()
                self.optimizer.step()
                
                epoch_loss += loss.item()
                
                # Calculate accuracy
                _, predicted = torch.max(logits.data, 1)
                total += batch_y.size(0)
                correct += (predicted == batch_y).sum().item()
            
            accuracy = 100 * correct / total
            print(f"Epoch {epoch}: loss = {epoch_loss/len(loader):.4f}, accuracy = {accuracy:.2f}%")
            
        # Evaluate on each class
        self.classifier.eval()
        class_accuracies = {}
        
        with torch.no_grad():
            for i, class_name in enumerate(self.class_names):
                X_class = torch.tensor(activations_dict[class_name], dtype=torch.float32, device=self.device)
                y_class = torch.full((X_class.size(0),), i, dtype=torch.long, device=self.device)
                
                logits = self.classifier(X_class)
                _, predicted = torch.max(logits.data, 1)
                accuracy = 100 * (predicted == y_class).sum().item() / y_class.size(0)
                class_accuracies[class_name] = accuracy
                print(f"Accuracy for {class_name}: {accuracy:.2f}%")

    def steer_toward_tones(self, activation, target_tones=None, avoid_tones=None, alpha=0.1):
        """
        Steer the given activation toward target tones and away from tones to avoid.
        
        Parameters:
          - activation: a NumPy array of shape (D,) or (N, D).
          - target_tones: list of tone names to steer toward.
          - avoid_tones: list of tone names to steer away from.
          - alpha: scaling factor for the update.
          
        Returns:
          - The modified activation as a NumPy array.
        """
        assert self.class_names is not None, "Model must be trained before steering"
        
        # Validate input tone names
        if target_tones:
            for tone in target_tones:
                assert tone in self.class_names, f"Unknown target tone: {tone}"
        
        if avoid_tones:
            for tone in avoid_tones:
                assert tone in self.class_names, f"Unknown avoid tone: {tone}"
        
        # Get indices for target and avoid tones
        target_indices = [self.class_names.index(tone) for tone in target_tones] if target_tones else []
        avoid_indices = [self.class_names.index(tone) for tone in avoid_tones] if avoid_tones else []
        
        # If a single activation vector is passed, add a batch dimension
        single_input = False
        if activation.ndim == 1:
            activation = activation[None, :]
            single_input = True
        
        # Convert activation to a torch tensor that requires gradients
        X = torch.tensor(activation, dtype=torch.float32, device=self.device, requires_grad=True)
        
        # Forward pass: compute classifier output (logits)
        self.classifier.eval()
        logits = self.classifier(X)  # shape: [N, num_classes]
        
        # Create custom loss function to maximize target tone scores and minimize avoid tone scores
        loss = 0
        
        if target_indices:
            target_logits = logits[:, target_indices]
            # Negative because we want to maximize these logits (gradient descent will minimize)
            loss = loss - target_logits.mean()
        
        if avoid_indices:
            avoid_logits = logits[:, avoid_indices]
            # Positive because we want to minimize these logits
            loss = loss + avoid_logits.mean()
        
        # If no target or avoid tones provided, do nothing
        if not target_indices and not avoid_indices:
            if single_input:
                return activation[0]
            else:
                return activation
        
        # Compute gradients
        loss.backward()
        
        # Get the gradient with respect to the input activation
        grad = X.grad.data
        
        # Update the activation by moving in the loss gradient direction
        # Negative gradient because we're trying to minimize the loss
        X_new = X - alpha * grad
        
        # Convert back to NumPy
        X_new_np = X_new.detach().cpu().numpy()
        if single_input:
            return X_new_np[0]
        else:
            return X_new_np
    
    def remove_tone_projection(self, activation, target_tones=None, avoid_tones=None):
        """
        Remove the projection of the activation onto the tone steering vector.
        
        Parameters:
        - activation: a NumPy array of shape (D,) or (N, D).
        - target_tones: list of tone names to steer toward.
        - avoid_tones: list of tone names to steer away from.
        
        Returns:
        - The modified activation as a NumPy array with the projection removed.
        """
        assert self.class_names is not None, "Model must be trained before steering"
        
        # Validate input tone names
        if target_tones:
            for tone in target_tones:
                assert tone in self.class_names, f"Unknown target tone: {tone}"
        
        if avoid_tones:
            for tone in avoid_tones:
                assert tone in self.class_names, f"Unknown avoid tone: {tone}"
        
        # Get indices for target and avoid tones
        target_indices = [self.class_names.index(tone) for tone in target_tones] if target_tones else []
        avoid_indices = [self.class_names.index(tone) for tone in avoid_tones] if avoid_tones else []
        
        # If a single activation vector is passed, add a batch dimension
        single_input = False
        if activation.ndim == 1:
            activation = activation[None, :]
            single_input = True
        
        # Convert activation to a torch tensor that requires gradients
        X = torch.tensor(activation, dtype=torch.float32, device=self.device, requires_grad=True)
        
        # Forward pass: compute classifier output (logits)
        self.classifier.eval()
        logits = self.classifier(X)  # shape: [N, num_classes]
        
        # Create custom loss function to maximize target tone scores and minimize avoid tone scores
        loss = 0
        
        if target_indices:
            target_logits = logits[:, target_indices]
            # Negative because we want to maximize these logits (gradient descent will minimize)
            loss = loss - target_logits.mean()
        
        if avoid_indices:
            avoid_logits = logits[:, avoid_indices]
            # Positive because we want to minimize these logits
            loss = loss + avoid_logits.mean()
        
        # If no target or avoid tones provided, do nothing
        if not target_indices and not avoid_indices:
            if single_input:
                return activation[0]
            else:
                return activation
        
        # Compute gradients
        loss.backward()
        
        # Get the gradient with respect to the input activation
        grad = X.grad.data  # shape: [N, D]
        
        # Compute the squared norm of the gradient for each sample (avoid division by zero)
        norm_sq = (grad ** 2).sum(dim=1, keepdim=True) + 1e-8  # shape: [N, 1]
        
        # Compute the dot product between each activation and its corresponding gradient
        dot_prod = (X * grad).sum(dim=1, keepdim=True)  # shape: [N, 1]
        
        # Compute the projection of X onto the gradient
        proj = (dot_prod / norm_sq) * grad  # shape: [N, D]
        
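        # Subtract twice the projection. Subtracting it once would remove the component along
        # the gradient; subtracting it twice reflects X across the hyperplane orthogonal to it.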
        X_new = X - 2 * proj
        
        # Convert back to NumPy
        X_new_np = X_new.detach().cpu().numpy()
        if single_input:
            return X_new_np[0]
        else:
            return X_new_np
    
    def classify_tones(self, activation):
        """
        Classify the tone probabilities for the given activation.
        
        Parameters:
          - activation: a NumPy array of shape (D,) or (N, D).
          
        Returns:
          - Dictionary mapping tone names to probabilities.
        """
        assert self.class_names is not None, "Model must be trained before classification"
        
        # If a single activation vector is passed, add a batch dimension
        single_input = False
        if activation.ndim == 1:
            activation = activation[None, :]
            single_input = True
        
        # Convert activation to a torch tensor
        X = torch.tensor(activation, dtype=torch.float32, device=self.device)
        
        # Forward pass: compute classifier output (logits)
        self.classifier.eval()
        with torch.no_grad():
            logits = self.classifier(X)  # shape: [N, num_classes]
            probs = F.softmax(logits, dim=1)  # shape: [N, num_classes]
        
        # Convert to NumPy
        probs_np = probs.cpu().numpy()
        
        if single_input:
            # Return dictionary mapping class names to probabilities
            return {self.class_names[i]: probs_np[0, i] for i in range(self.num_classes)}
        else:
            # Return list of dictionaries for each sample
            return [{self.class_names[i]: probs_np[j, i] for i in range(self.num_classes)} 
                   for j in range(probs_np.shape[0])]
    
    def evaluate_steering(self, activation, alpha=0.1, num_steps=5):
        """
        Evaluate how steering affects tone classification over multiple steps.
        
        Parameters:
          - activation: a NumPy array of shape (D,).
          - alpha: scaling factor for each update.
          - num_steps: number of steering steps to apply.
          
        Returns:
          - List of dictionaries containing tone probabilities at each step.
        """
        # Ensure we have a single activation vector
        if activation.ndim != 1:
            raise ValueError("evaluate_steering expects a single activation vector")
        
        results = []
        current_act = activation.copy()
        
        # Classify the original activation
        orig_probs = self.classify_tones(current_act)
        results.append({"step": 0, "probs": orig_probs})
        
        # Apply steering for multiple steps and track changes
        for step in range(1, num_steps + 1):
            # Apply steering
            current_act = self.steer_expert_helpful_concise(current_act, alpha=alpha)
            
            # Classify the steered activation
            probs = self.classify_tones(current_act)
            results.append({"step": step, "probs": probs})
        
        return results
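
The list returned by `evaluate_steering` holds one `{"step", "probs"}` entry per steering step. As a minimal sketch of how such a trace might be inspected, the helper below reports the most probable tone at each step. The name `top_tone_per_step` and the example probabilities are hypothetical and not part of the notebook.

```python
from typing import Dict, List, Tuple


def top_tone_per_step(results: List[dict]) -> List[Tuple[int, str, float]]:
    """Return (step, most probable tone, its probability) for each entry."""
    summary = []
    for entry in results:
        probs: Dict[str, float] = entry["probs"]
        # Pick the tone with the highest probability at this step
        tone = max(probs, key=probs.get)
        summary.append((entry["step"], tone, float(probs[tone])))
    return summary


# Made-up trace in the same format as evaluate_steering's return value
example_results = [
    {"step": 0, "probs": {"casual": 0.70, "expert": 0.20, "concise": 0.10}},
    {"step": 1, "probs": {"casual": 0.45, "expert": 0.40, "concise": 0.15}},
    {"step": 2, "probs": {"casual": 0.25, "expert": 0.55, "concise": 0.20}},
]

for step, tone, p in top_tone_per_step(example_results):
    print(f"step {step}: {tone} ({p:.2f})")
```

A trace in which the top tone never changes, or changes only at large `alpha`, would suggest the steering step is too weak to cross the classifier's decision boundary.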

In [13]:
# Extract activations at the specific layer and position from cache
expert_acts = all_activations['expert']          # shape [N, D]
empathetic_acts = all_activations['empathetic']  # shape [M, D]
casual_acts = all_activations['casual']          # shape [K, D]
helpful_acts = all_activations['helpful']        # shape [L, D]
cautious_acts = all_activations['cautious']      # shape [J, D]
concise_acts = all_activations['concise']        # shape [P, D]


# Instantiate and fit the multi-tone steering classifier
tone_steer = MultiToneActivationSteering(input_dim=expert_acts.shape[1], num_classes=6)
tone_steer.fit({
    'expert': expert_acts,
    'empathetic': empathetic_acts,
    'casual': casual_acts,
    'helpful': helpful_acts,
    'cautious': cautious_acts,
    'concise': concise_acts
}, epochs=100, batch_size=32)


torch.cuda.empty_cache()


Epoch 0: loss = 0.1148, accuracy = 98.50%
Epoch 1: loss = 0.0000, accuracy = 100.00%
Epoch 2: loss = 0.0000, accuracy = 100.00%
Epoch 3: loss = 0.0000, accuracy = 100.00%
Epoch 4: loss = 0.0000, accuracy = 100.00%
Epoch 5: loss = 0.0000, accuracy = 100.00%
Epoch 6: loss = 0.0000, accuracy = 100.00%
Epoch 7: loss = 0.0000, accuracy = 100.00%
Epoch 8: loss = 0.0000, accuracy = 100.00%
Epoch 9: loss = 0.0000, accuracy = 100.00%
Epoch 10: loss = 0.0000, accuracy = 100.00%
Epoch 11: loss = 0.0000, accuracy = 100.00%
Epoch 12: loss = 0.0000, accuracy = 100.00%
Epoch 13: loss = 0.0000, accuracy = 100.00%
Epoch 14: loss = 0.0000, accuracy = 100.00%
Epoch 15: loss = 0.0000, accuracy = 100.00%
Epoch 16: loss = 0.0000, accuracy = 100.00%
Epoch 17: loss = 0.0000, accuracy = 100.00%
Epoch 18: loss = 0.0000, accuracy = 100.00%
Epoch 19: loss = 0.0000, accuracy = 100.00%
Epoch 20: loss = 0.0000, accuracy = 100.00%
Epoch 21: loss = 0.0000, accuracy = 100.00%
Epoch 22: loss = 0.0000, accuracy = 100.00%
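
The log shows the loss at 0.0000 from epoch 1 onward, so the remaining epochs of the `epochs=100` run probably change little. As a hedged sketch of how a patience-based stopping check could detect this, the helper below scans a loss sequence. The function `first_converged_epoch` and its `tol` and `patience` parameters are hypothetical and are not part of `MultiToneActivationSteering.fit`.

```python
from typing import Iterable, Optional


def first_converged_epoch(losses: Iterable[float],
                          tol: float = 1e-4,
                          patience: int = 3) -> Optional[int]:
    """Return the first epoch at which the loss has stayed below `tol`
    for `patience` consecutive epochs, or None if that never happens."""
    streak = 0
    for epoch, loss in enumerate(losses):
        # Extend the streak on a small loss, otherwise reset it
        streak = streak + 1 if loss < tol else 0
        if streak >= patience:
            return epoch
    return None


# Loss values as printed in the log above (rounded to four decimals)
logged_losses = [0.1148] + [0.0] * 22
print(first_converged_epoch(logged_losses))
```

Note that a training loss this close to zero says nothing about held-out accuracy; the check only indicates when further training epochs stop changing the reported loss.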

## Let's try adding one tone and removing another tone
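- This does not quite work yet: in the sample output below, the steered completion is still a terse bullet list, much like the baseline.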
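
The implementation of `remove_tone_projection` is not shown in this section. As a rough sketch of the general idea of removing a direction from activations (the operation that direction ablation is built on), the snippet below subtracts each row's component along a unit vector. The function name `remove_direction` and the toy data are hypothetical, and the notebook's own method may differ (its comment mentions a gradient projection).

```python
import numpy as np


def remove_direction(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each row of `acts` along `direction`.

    acts: array of shape (N, D); direction: array of shape (D,).
    """
    unit = direction / np.linalg.norm(direction)  # normalise to unit length
    coeffs = acts @ unit                          # (N,) projection coefficients
    return acts - np.outer(coeffs, unit)          # subtract the projections


# Toy data: 4 activations of dimension 3 and an arbitrary direction
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 3))
direction = np.array([1.0, 2.0, -1.0])

ablated = remove_direction(acts, direction)
# After ablation every row should be orthogonal to the direction
print(np.allclose(ablated @ direction, 0.0))
```

Applying the function twice gives the same result as applying it once, which is a quick sanity check for any projection-removal step.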

In [17]:

def tone_steering_hook(activation, hook,
                      target_tones=('expert', 'helpful'),
                      avoid_tones=('casual',),
                      alpha=2):
    """
    Steer activations toward target tones and away from avoid tones.

    Note: `alpha` is only used by the disabled `steer_toward_tones` path below;
    it is not passed to `remove_tone_projection`.
    """
    shape = activation.shape
    # Flatten to [N, d_act]; cast to float32 because NumPy has no bfloat16
    act_2d = activation.reshape(-1, shape[-1]).detach().float().cpu().numpy()

    # Alternative (disabled): steer in tone space
    # steered_2d = tone_steer.steer_toward_tones(
    #     act_2d,
    #     target_tones=target_tones,
    #     avoid_tones=avoid_tones,
    #     alpha=alpha
    # )

    # Steer by removing the gradient projection
    steered_2d = tone_steer.remove_tone_projection(
        act_2d,
        target_tones=list(target_tones),
        avoid_tones=list(avoid_tones),
    )

    # Convert back to torch, restoring the original device, dtype and shape
    steered_torch = torch.from_numpy(steered_2d).to(device=activation.device, dtype=activation.dtype)
    return steered_torch.view(*shape)

# Create hooks for steering
N_INST_TEST = 32
intervention_layers = list(range(model.cfg.n_layers))  # e.g. all layers
#intervention_layers = [14]  # or specific layers only

# Create a partial function with our parameters
hook_fn = functools.partial(tone_steering_hook,
                           target_tones=['empathetic'],
                           avoid_tones=['concise'],
                           alpha=2)

# Attach hooks to multiple points
fwd_hooks = [
    (utils.get_act_name(act_name, l), hook_fn)
    for l in intervention_layers
    for act_name in ['resid_pre', 'resid_mid', 'resid_post']
]

# Run model with hooks for tone steering
tone_steered_generations = get_generations(
    model,
    concise_inst_test[:N_INST_TEST],
    tokenize_instructions_fn,
    fwd_hooks=fwd_hooks
)

# Baseline, no intervention
baseline_generations = get_generations(
    model,
    concise_inst_test[:N_INST_TEST],
    tokenize_instructions_fn,
    fwd_hooks=[]
)

# Compare outputs
for i in range(N_INST_TEST):
    print(f"INSTRUCTION {i}: {repr(concise_inst_test[i])}")
    print(Fore.GREEN + f"BASELINE COMPLETION:")
    print(textwrap.fill(repr(baseline_generations[i]), width=100, initial_indent='\t', subsequent_indent='\t'))
    print(Fore.RED + f"TONE STEERED COMPLETION:")
    print(textwrap.fill(repr(tone_steered_generations[i]), width=100, initial_indent='\t', subsequent_indent='\t'))
    print(Fore.RESET)

100%|██████████| 32/32 [10:42<00:00, 20.08s/it]
100%|██████████| 32/32 [02:59<00:00,  5.60s/it]

INSTRUCTION 0: "BE EXTREMELY BRIEF. Use minimal words. Provide only essential information. No introductions. No elaboration. Use short sentences with simple structure. Eliminate all unnecessary words. Focus solely on direct answers. Avoid examples unless critical. Use bullet points where possible. Prioritize brevity over comprehensiveness. Get to the point immediately. Never repeat information.\n\nHow can practicing gratitude shift one's emotional perspective?"
BASELINE COMPLETION:
	'• Shifts focus from negative thoughts to positive ones.\n• Increases feelings of happiness and
	well-being.\n• Enhances resilience to stress and adversity.\n• Fosters a more optimistic
	outlook.\n• Reduces anxiety and depression.\n• Improves relationships and social connections.\n•
	Increases self-esteem and confidence.\n• Promotes a'
TONE STEERED COMPLETION:
	'• Fosters a positive outlook.\n• Reduces stress and anxiety.\n• Increases feelings of
	happiness.\n• Enhances resilience.\n• Promotes a s
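
The `fwd_hooks` comprehension in the cell above attaches the same hook at three residual-stream points in every layer, so the steering function runs 3 × `n_layers` times per forward pass. Below is a small sketch of the names being enumerated, assuming TransformerLens's `blocks.{layer}.hook_{name}` naming scheme. Here `act_name_sketch` is a hypothetical stand-in for `utils.get_act_name`, and `n_layers=4` is an illustrative value, not the model's real depth.

```python
from typing import List

RESID_POINTS = ["resid_pre", "resid_mid", "resid_post"]


def act_name_sketch(name: str, layer: int) -> str:
    """Hypothetical stand-in for utils.get_act_name (assumed naming scheme)."""
    return f"blocks.{layer}.hook_{name}"


def hook_names(n_layers: int) -> List[str]:
    """List hook points in the same order as the fwd_hooks comprehension."""
    return [
        act_name_sketch(name, layer)
        for layer in range(n_layers)
        for name in RESID_POINTS
    ]


names = hook_names(n_layers=4)  # illustrative depth only
print(len(names))
print(names[:3])
```

Each of these hooks round-trips the activation through NumPy on the CPU, which may account for part of the slowdown visible in the progress bars (about 20 s versus 5.6 s per instruction), though that has not been measured here.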




## Removing one tone: concise

In [19]:
# Create hooks for steering
N_INST_TEST = 32
intervention_layers = list(range(model.cfg.n_layers))  # e.g. all layers
#intervention_layers = [14]  # or specific layers only

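# Create a partial function with our parameters.
# Note: target_tones is not passed here, so the hook's default target tones
# (expert, helpful) still apply alongside avoid_tones.
hook_fn = functools.partial(tone_steering_hook,
                           avoid_tones=['concise'],
                           alpha=2)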

# Attach hooks to multiple points
fwd_hooks = [
    (utils.get_act_name(act_name, l), hook_fn)
    for l in intervention_layers
    for act_name in ['resid_pre', 'resid_mid', 'resid_post']
]

# Run model with hooks for tone steering
tone_steered_generations = get_generations(
    model,
    concise_inst_test[:N_INST_TEST],
    tokenize_instructions_fn,
    fwd_hooks=fwd_hooks
)

# Baseline, no intervention
baseline_generations = get_generations(
    model,
    concise_inst_test[:N_INST_TEST],
    tokenize_instructions_fn,
    fwd_hooks=[]
)

# Compare outputs
for i in range(N_INST_TEST):
    print(f"INSTRUCTION {i}: {repr(concise_inst_test[i])}")
    print(Fore.GREEN + f"BASELINE COMPLETION:")
    print(textwrap.fill(repr(baseline_generations[i]), width=100, initial_indent='\t', subsequent_indent='\t'))
    print(Fore.RED + f"TONE STEERED COMPLETION:")
    print(textwrap.fill(repr(tone_steered_generations[i]), width=100, initial_indent='\t', subsequent_indent='\t'))
    print(Fore.RESET)

100%|██████████| 32/32 [10:33<00:00, 19.79s/it]
100%|██████████| 32/32 [02:55<00:00,  5.47s/it]

INSTRUCTION 0: "BE EXTREMELY BRIEF. Use minimal words. Provide only essential information. No introductions. No elaboration. Use short sentences with simple structure. Eliminate all unnecessary words. Focus solely on direct answers. Avoid examples unless critical. Use bullet points where possible. Prioritize brevity over comprehensiveness. Get to the point immediately. Never repeat information.\n\nHow can practicing gratitude shift one's emotional perspective?"
BASELINE COMPLETION:
	'• Shifts focus from negative thoughts to positive ones.\n• Increases feelings of happiness and
	well-being.\n• Enhances resilience to stress and adversity.\n• Fosters a more optimistic
	outlook.\n• Reduces anxiety and depression.\n• Improves relationships and social connections.\n•
	Increases self-esteem and confidence.\n• Promotes a'
TONE STEERED COMPLETION:
	"Practicing gratitude can shift one's emotional perspective in several ways:\n\n• **Reduces stress
	and anxiety**: Focusing on what you're




## Removing the empathetic tone from completions

In [1]:
# Create hooks for steering
N_INST_TEST = 32
intervention_layers = list(range(model.cfg.n_layers))  # e.g. all layers
#intervention_layers = [14]  # or specific layers only

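# Create a partial function with our parameters.
# Note: the tone is passed as target_tones here, whereas the concise-removal
# cell passes the tone to remove as avoid_tones; avoid_tones is not passed,
# so the hook's default (casual) applies.
hook_fn = functools.partial(tone_steering_hook,
                            target_tones=['empathetic'],
                            alpha=2)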

# Attach hooks to multiple points
fwd_hooks = [
    (utils.get_act_name(act_name, l), hook_fn)
    for l in intervention_layers
    for act_name in ['resid_pre', 'resid_mid', 'resid_post']
]

# Run model with hooks for tone steering
tone_steered_generations = get_generations(
    model,
    empathetic_inst_test[:N_INST_TEST],
    tokenize_instructions_fn,
    fwd_hooks=fwd_hooks
)

# Baseline, no intervention
baseline_generations = get_generations(
    model,
    empathetic_inst_test[:N_INST_TEST],
    tokenize_instructions_fn,
    fwd_hooks=[]
)

# Compare outputs
for i in range(N_INST_TEST):
    print(f"INSTRUCTION {i}: {repr(empathetic_inst_test[i])}")
    print(Fore.GREEN + f"BASELINE COMPLETION:")
    print(textwrap.fill(repr(baseline_generations[i]), width=100, initial_indent='\t', subsequent_indent='\t'))
    print(Fore.RED + f"TONE STEERED COMPLETION:")
    print(textwrap.fill(repr(tone_steered_generations[i]), width=100, initial_indent='\t', subsequent_indent='\t'))
    print(Fore.RESET)

NameError: name 'model' is not defined