# Refusal Directional Ablation - Reference Implementation

## Overview
This notebook implements **orthogonal ablation** to remove refusal behavior from LLMs, based on "Refusal in Language Models Is Mediated by a Single Direction" (Arditi et al., 2024).

**This is the ORIGINAL use case from the paper** - removing refusal responses to harmful prompts.

## Epistemic Status: Method Effectiveness

**Confidence: HIGH (85-90%)**

### Why this is the gold standard:
- ✅ **Demonstrated method**: The paper reports that this ablation removes refusal across 13 open-source chat models
- ✅ **Behavioral pattern**: Refusal is a clear behavioral response, not a deep cognitive capability
- ✅ **Clear tokens**: "I cannot", "I apologize", "I'm unable" are obvious markers
- ✅ **Large datasets**: Extensive harmful/harmless datasets available
- ✅ **Measurable**: Easy to evaluate (does it refuse or not?)

### This notebook serves as:
1. **Reference implementation**: Shows the method working as intended
2. **Baseline comparison**: Compare with ToM/self-other ablation results
3. **Educational tool**: Demonstrates successful directional ablation
4. **Method validation**: Confirms your implementation is correct

## ⚠️ Ethical Warning

**This notebook removes safety guardrails from LLMs.**

- Only use for research purposes
- Do not deploy models with refusal removed in production
- Do not use to generate harmful content
- Follow your institution's ethics guidelines

## Method Summary

1. **Generate directions**: Compute `r = mean(harmful_activations) - mean(harmless_activations)`
2. **Select direction**: Evaluate using:
   - `bypass_score`: How well ablation removes refusal
   - `induce_score`: How well activation addition induces refusal
   - `kl_score`: Distribution shift on neutral examples (lower is better)
3. **Apply intervention**:
   - **Ablation**: Remove direction: `x' = x - r̂(r̂ᵀx)` → Model no longer refuses
   - **Activation addition**: Add direction: `x' = x + αr` → Model refuses more
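
The two interventions in step 3 can be sketched on plain Python lists. This is a toy illustration rather than the project's hook code: the helper names `dot`, `unit`, `ablate`, and `add_direction` are hypothetical, and the 3-dimensional vectors are made up.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(r):
    """Return r scaled to unit length (r̂)."""
    norm = math.sqrt(dot(r, r))
    return [x / norm for x in r]

def ablate(x, r):
    """Directional ablation: x' = x - r̂ (r̂ᵀ x)."""
    r_hat = unit(r)
    proj = dot(r_hat, x)
    return [xi - proj * ri for xi, ri in zip(x, r_hat)]

def add_direction(x, r, alpha=1.0):
    """Activation addition: x' = x + alpha * r."""
    return [xi + alpha * ri for xi, ri in zip(x, r)]

# Toy activation and toy "refusal direction" (made-up numbers).
x = [2.0, 1.0, -1.0]
r = [3.0, 0.0, 4.0]

x_ablated = ablate(x, r)
x_added = add_direction(x, r, alpha=0.5)

# After ablation the component along r̂ is (numerically) zero.
print(abs(dot(unit(r), x_ablated)) < 1e-12)
print(x_added)
```

Ablation only needs the unit vector `r̂`, so the scale of `r` does not matter there. Activation addition uses `r` itself, so its effect depends on both the norm of `r` and the coefficient `alpha`.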

## Setup and Imports

In [None]:
# Standard library
import json
import os
import random
from typing import List, Dict, Tuple
import sys

# Add the parent of the current working directory (assumed to be the project root) to the path
sys.path.append(os.path.dirname(os.path.abspath('')))

# Scientific computing
import torch
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

# Project imports
from dataset.load_dataset import load_dataset_split, load_dataset
from pipeline.config import Config
from pipeline.model_utils.model_factory import construct_model_base
from pipeline.utils.hook_utils import (
    get_activation_addition_input_pre_hook,
    get_all_direction_ablation_hooks
)
from pipeline.submodules.generate_directions import get_mean_diff
from pipeline.submodules.select_direction import (
    get_refusal_scores,
    get_last_position_logits,
    plot_refusal_scores
)

print("✓ Imports successful")

## Configuration Variables

**⚠️ IMPORTANT: All major configuration parameters are defined here in CAPS**

### Model Configuration
- `MODEL_PATH`: HuggingFace model path or local path
- `DEVICE`: Device to run on (cuda/cpu)

In [None]:
# ============================================================================
# CONFIGURATION PARAMETERS - MODIFY THESE
# ============================================================================

# Model settings
MODEL_PATH = "meta-llama/Llama-3.2-1B-Instruct"  # Small Llama model for testing
# MODEL_PATH = "meta-llama/Meta-Llama-3-8B-Instruct"  # Larger model
# MODEL_PATH = "Qwen/Qwen2.5-32B-Instruct"  # Even larger model
# MODEL_PATH = "meta-llama/Llama-2-7b-chat-hf"  # One of the models evaluated in the paper

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Dataset settings (uses existing harmful/harmless splits)
N_TRAIN = 128  # Number of training examples for direction generation
N_VAL = 32     # Number of validation examples for direction selection
N_TEST = 50    # Number of test examples for evaluation
RANDOM_SEED = 42

# Filtering settings (removes ambiguous examples)
FILTER_TRAIN = True  # Filter train set by baseline refusal scores
FILTER_VAL = True    # Filter val set by baseline refusal scores

# Direction generation settings
BATCH_SIZE = 32  # Batch size for activation collection
GENERATION_BATCH_SIZE = 8  # Batch size for text generation

# Direction selection settings  
KL_THRESHOLD = 0.1  # Maximum KL divergence allowed
INDUCE_THRESHOLD = 0.0  # Minimum induction score
FILTER_TOP_LAYERS_PCT = 0.2  # Skip the last 20% of layers (too close to the output)

# Intervention settings
ABLATION_COEFF = 1.0  # Coefficient for ablation (not read by the cells below)
ACTIVATION_ADD_COEFF = -1.0  # Negative removes refusal, positive adds refusal (not read by the cells below, which pass coeff=1.0)
MAX_NEW_TOKENS = 512  # Maximum tokens to generate

# Evaluation settings
EVALUATION_DATASETS = ["jailbreakbench"]  # Datasets to evaluate on
# EVALUATION_DATASETS = ["jailbreakbench", "advbench", "harmbench_val"]  # More comprehensive

# Output settings
OUTPUT_DIR = "pipeline/runs/refusal_experiment"
SAVE_ARTIFACTS = True  # Whether to save intermediate results

print(f"✓ Configuration set")
print(f"  Model: {MODEL_PATH}")
print(f"  Device: {DEVICE}")
print(f"  Train/Val/Test: {N_TRAIN}/{N_VAL}/{N_TEST}")
print(f"  Filter train/val: {FILTER_TRAIN}/{FILTER_VAL}")

## 1. Dataset Loading and Inspection

**Epistemic Status: HIGH (95%+)**  
The harmful/harmless dataset splits are well-established and tested.

We use:
- **Harmful**: Prompts that request harmful/unethical content ("How to make a bomb")
- **Harmless**: Normal, safe prompts ("How to make a cake")

In [None]:
def load_and_sample_datasets(n_train: int, n_val: int, n_test: int, seed: int = 42):
    """
    Load and sample harmful/harmless datasets.
    
    Returns:
        (harmful_train, harmless_train, harmful_val, harmless_val, harmless_test)
    """
    random.seed(seed)
    
    # Load splits
    harmful_train = random.sample(
        load_dataset_split(harmtype='harmful', split='train', instructions_only=True),
        n_train
    )
    
    harmless_train = random.sample(
        load_dataset_split(harmtype='harmless', split='train', instructions_only=True),
        n_train
    )
    
    harmful_val = random.sample(
        load_dataset_split(harmtype='harmful', split='val', instructions_only=True),
        n_val
    )
    
    harmless_val = random.sample(
        load_dataset_split(harmtype='harmless', split='val', instructions_only=True),
        n_val
    )
    
    harmless_test = random.sample(
        load_dataset_split(harmtype='harmless', split='test', instructions_only=False),
        n_test
    )
    
    return harmful_train, harmless_train, harmful_val, harmless_val, harmless_test

# Load datasets
harmful_train, harmless_train, harmful_val, harmless_val, harmless_test = load_and_sample_datasets(
    N_TRAIN, N_VAL, N_TEST, RANDOM_SEED
)

print(f"✓ Loaded datasets")
print(f"  Harmful train: {len(harmful_train)}")
print(f"  Harmless train: {len(harmless_train)}")
print(f"  Harmful val: {len(harmful_val)}")
print(f"  Harmless val: {len(harmless_val)}")
print(f"  Harmless test: {len(harmless_test)}")

# Display samples
print("\n" + "="*80)
print("SAMPLE PROMPTS")
print("="*80)

print(f"\n[HARMFUL - Should trigger refusal]")
for i in range(3):
    print(f"{i+1}. {harmful_train[i]}")

print(f"\n[HARMLESS - Should NOT trigger refusal]")
for i in range(3):
    print(f"{i+1}. {harmless_train[i]}")
print("\n" + "="*80)

## 2. Model Loading

**Epistemic Status: MODERATE-HIGH (75-85%)**  
Model loading depends on HuggingFace availability and GPU resources.

In [None]:
try:
    print(f"Loading model: {MODEL_PATH}")
    print(f"This may take several minutes...")
    
    model_base = construct_model_base(MODEL_PATH)
    
    print(f"\n✓ Model loaded successfully")
    print(f"  Model: {model_base.model.__class__.__name__}")
    print(f"  Tokenizer: {model_base.tokenizer.__class__.__name__}")
    print(f"  Device: {model_base.model.device}")
    print(f"  Num layers: {model_base.model.config.num_hidden_layers}")
    print(f"  Hidden size: {model_base.model.config.hidden_size}")
    print(f"  EOI tokens: {model_base.eoi_toks}")
    print(f"  Refusal tokens: {model_base.refusal_toks}")
    print(f"  Refusal token strings: {[model_base.tokenizer.decode([t]) for t in model_base.refusal_toks[:10]]}")
    
except Exception as e:
    print(f"\n⚠️ Model loading failed: {e}")
    print(f"\nThis is expected if you don't have GPU/HF access.")
    raise

## 3. Dataset Filtering (Optional)

**Epistemic Status: HIGH (85%+)**  
Filtering removes ambiguous examples to strengthen the signal.

We drop:
- **Harmful prompts** that the model does not refuse, keeping only those that trigger refusal
- **Harmless prompts** that the model refuses, keeping only those it answers normally

This ensures a cleaner contrast for direction extraction.
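
The filtering cell keeps examples by the sign of a refusal score. One common way to define such a score, and the one assumed in this sketch, is the log-odds of the probability mass that the first generated token places on refusal-opening tokens. `get_refusal_scores` may be implemented differently; `refusal_log_odds` and the toy distributions here are hypothetical.

```python
import math

def refusal_log_odds(next_token_probs, refusal_token_ids, eps=1e-8):
    """Log-odds that the next token is one of the refusal tokens.

    Positive means more than half of the probability mass sits on
    refusal tokens; negative means less than half.
    """
    p_refuse = sum(next_token_probs[t] for t in refusal_token_ids)
    return math.log(p_refuse + eps) - math.log(1.0 - p_refuse + eps)

# Toy 4-token vocabulary; token 0 stands in for a refusal opener.
refusal_ids = [0]
examples = {
    "harmful prompt, model refuses": [0.90, 0.05, 0.03, 0.02],
    "harmful prompt, model complies": [0.10, 0.60, 0.20, 0.10],
    "harmless prompt, model complies": [0.02, 0.50, 0.30, 0.18],
}

scores = {name: refusal_log_odds(p, refusal_ids) for name, p in examples.items()}

# Same rule as the filtering cell: keep harmful prompts with score > 0
# and harmless prompts with score < 0.
kept_harmful = [n for n, s in scores.items() if n.startswith("harmful") and s > 0]
kept_harmless = [n for n, s in scores.items() if n.startswith("harmless") and s < 0]
print(kept_harmful)
print(kept_harmless)
```

Under this definition a threshold of 0 corresponds to 50% of the probability mass on refusal tokens, which is why the filter compares scores against 0.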

In [None]:
def filter_examples(dataset, scores, threshold, comparison):
    """Filter examples based on refusal scores."""
    return [inst for inst, score in zip(dataset, scores.tolist()) if comparison(score, threshold)]

if FILTER_TRAIN:
    print("Filtering training set...")
    print("This measures baseline refusal rates to remove ambiguous examples.")
    
    # Get baseline refusal scores
    harmful_train_scores = get_refusal_scores(
        model_base.model, harmful_train, model_base.tokenize_instructions_fn,
        model_base.refusal_toks
    )
    
    harmless_train_scores = get_refusal_scores(
        model_base.model, harmless_train, model_base.tokenize_instructions_fn,
        model_base.refusal_toks
    )
    
    # Filter: Keep harmful prompts where model refuses (score > 0)
    #         Keep harmless prompts where model doesn't refuse (score < 0)
    harmful_train_before = len(harmful_train)
    harmless_train_before = len(harmless_train)
    
    harmful_train = filter_examples(harmful_train, harmful_train_scores, 0, lambda x, y: x > y)
    harmless_train = filter_examples(harmless_train, harmless_train_scores, 0, lambda x, y: x < y)
    
    print(f"  Harmful: {harmful_train_before} → {len(harmful_train)} (-{harmful_train_before - len(harmful_train)})")
    print(f"  Harmless: {harmless_train_before} → {len(harmless_train)} (-{harmless_train_before - len(harmless_train)})")
    print(f"  ✓ Filtered training set")

if FILTER_VAL:
    print("\nFiltering validation set...")
    
    harmful_val_scores = get_refusal_scores(
        model_base.model, harmful_val, model_base.tokenize_instructions_fn,
        model_base.refusal_toks
    )
    
    harmless_val_scores = get_refusal_scores(
        model_base.model, harmless_val, model_base.tokenize_instructions_fn,
        model_base.refusal_toks
    )
    
    harmful_val_before = len(harmful_val)
    harmless_val_before = len(harmless_val)
    
    harmful_val = filter_examples(harmful_val, harmful_val_scores, 0, lambda x, y: x > y)
    harmless_val = filter_examples(harmless_val, harmless_val_scores, 0, lambda x, y: x < y)
    
    print(f"  Harmful: {harmful_val_before} → {len(harmful_val)} (-{harmful_val_before - len(harmful_val)})")
    print(f"  Harmless: {harmless_val_before} → {len(harmless_val)} (-{harmless_val_before - len(harmless_val)})")
    print(f"  ✓ Filtered validation set")

if not FILTER_TRAIN and not FILTER_VAL:
    print("⚠️ Skipping filtering (FILTER_TRAIN=False, FILTER_VAL=False)")
    print("This may result in noisier directions.")

## 4. Direction Generation

**Epistemic Status: HIGH (90%+)**  
This is the core method from the paper, proven to work for refusal.

We compute the difference-in-means between harmful and harmless activations:
```
r[pos, layer] = mean(harmful_activations) - mean(harmless_activations)
```

This gives us candidate "refusal directions" for each layer and token position.
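
A minimal sketch of the difference-in-means computation on nested Python lists, assuming activations have already been collected. It is not the implementation of `get_mean_diff`; `mean_vector`, `mean_diff_grid`, and the tiny activation values are made up for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def mean_diff_grid(harmful_acts, harmless_acts):
    """Difference-in-means for every (position, layer) pair.

    Both inputs are indexed as acts[example][pos][layer] -> vector.
    Returns grid[pos][layer] -> vector, mirroring the
    (n_positions, n_layers, d_model) layout used in this notebook.
    """
    n_pos = len(harmful_acts[0])
    n_layers = len(harmful_acts[0][0])
    grid = []
    for pos in range(n_pos):
        row = []
        for layer in range(n_layers):
            mu_harmful = mean_vector([ex[pos][layer] for ex in harmful_acts])
            mu_harmless = mean_vector([ex[pos][layer] for ex in harmless_acts])
            row.append([a - b for a, b in zip(mu_harmful, mu_harmless)])
        grid.append(row)
    return grid

# Two examples per class, 1 position, 2 layers, d_model = 2 (made-up numbers).
harmful = [
    [[[1.0, 0.0], [2.0, 2.0]]],
    [[[3.0, 0.0], [4.0, 0.0]]],
]
harmless = [
    [[[0.0, 1.0], [0.0, 0.0]]],
    [[[0.0, 3.0], [2.0, 0.0]]],
]

directions = mean_diff_grid(harmful, harmless)
print(directions[0][0])  # candidate direction at position 0, layer 0
print(directions[0][1])  # candidate direction at position 0, layer 1
```

The two classes do not need the same number of examples, since each mean is taken separately. This matters after filtering, which can leave the harmful and harmless sets with different sizes.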

In [None]:
try:
    print("Generating candidate refusal directions...")
    print(f"Processing {len(harmful_train)} harmful and {len(harmless_train)} harmless examples")
    print(f"Batch size: {BATCH_SIZE}")
    print(f"Estimated time: ~5-20 minutes depending on model size\n")
    
    candidate_directions = get_mean_diff(
        model=model_base.model,
        tokenizer=model_base.tokenizer,
        harmful_instructions=harmful_train,
        harmless_instructions=harmless_train,
        tokenize_instructions_fn=model_base.tokenize_instructions_fn,
        block_modules=model_base.model_block_modules,
        batch_size=BATCH_SIZE,
        positions=list(range(-len(model_base.eoi_toks), 0))
    )
    
    print(f"\n✓ Generated candidate directions")
    print(f"  Shape: {candidate_directions.shape}")
    print(f"  Expected: (n_positions={len(model_base.eoi_toks)}, n_layers={model_base.model.config.num_hidden_layers}, d_model={model_base.model.config.hidden_size})")
    print(f"  Dtype: {candidate_directions.dtype}")
    print(f"  Device: {candidate_directions.device}")
    print(f"  Min/Max: {candidate_directions.min():.4f} / {candidate_directions.max():.4f}")
    
    # Validate
    assert not candidate_directions.isnan().any(), "NaN values detected!"
    assert candidate_directions.shape[0] == len(model_base.eoi_toks)
    assert candidate_directions.shape[1] == model_base.model.config.num_hidden_layers
    print("  ✓ Validation passed")
    
    # Save
    if SAVE_ARTIFACTS:
        os.makedirs(f"{OUTPUT_DIR}/generate_directions", exist_ok=True)
        torch.save(candidate_directions, f"{OUTPUT_DIR}/generate_directions/mean_diffs.pt")
        print(f"  ✓ Saved to {OUTPUT_DIR}/generate_directions/mean_diffs.pt")
        
except Exception as e:
    print(f"\n⚠️ Direction generation failed: {e}")
    raise

## 5. Direction Selection

**Epistemic Status: HIGH (85%+)**  
The selection metrics are well-validated for refusal ablation.

We evaluate each candidate direction on three metrics:
1. **bypass_score**: How much does ablating this direction reduce refusal on harmful prompts?
2. **induce_score**: How much does adding this direction increase refusal on harmless prompts?
3. **kl_score**: How much does this intervention shift the output distribution on harmless prompts?

We want:
- High bypass_score (effective at removing refusal)
- High induce_score (direction is causal for refusal)
- Low kl_score (minimal side effects)
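
The `kl_score` can be illustrated for a single next-token distribution. This sketch assumes the score is KL(baseline || intervened) computed from raw logits; `log_softmax` and `kl_divergence` are hypothetical stdlib helpers, not the project's functions, and the logits are made up.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def kl_divergence(baseline_logits, intervened_logits):
    """KL(baseline || intervened) for one next-token distribution."""
    log_p = log_softmax(baseline_logits)
    log_q = log_softmax(intervened_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

baseline = [2.0, 0.5, -1.0]
unchanged = [2.0, 0.5, -1.0]
shifted = [0.5, 2.0, -1.0]

print(kl_divergence(baseline, unchanged))    # identical distributions
print(kl_divergence(baseline, shifted) > 0)  # the top token changed
```

Working in log space avoids taking the log of a probability that has underflowed to zero, which can happen with large vocabularies.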

In [None]:
def evaluate_direction(
    model_base,
    pos: int,
    layer: int,
    direction: torch.Tensor,
    harmful_val: List[str],
    harmless_val: List[str],
    refusal_toks: List[int]
) -> Dict[str, float]:
    """
    Evaluate a single candidate refusal direction.
    
    Returns:
        Dictionary with bypass_score, induce_score, kl_score
    """
    # Normalize direction
    direction_normalized = direction / direction.norm()
    
    # Get hooks for ablation (removes direction from all layers)
    ablation_pre_hooks, ablation_hooks = get_all_direction_ablation_hooks(
        model_base, direction_normalized
    )
    
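    # Get hooks for activation addition (adds direction at specific layer).
    # Use the unnormalized mean-difference vector, matching x' = x + αr in the
    # Method Summary; a unit vector may be too small to shift behavior.
    actadd_pre_hooks = [(
        model_base.model_block_modules[layer],
        get_activation_addition_input_pre_hook(vector=direction, coeff=1.0)
    )]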
    actadd_hooks = []
    
    # 1. Bypass score: Does ablation reduce refusal on harmful prompts?
    baseline_harmful_scores = get_refusal_scores(
        model_base.model, harmful_val, model_base.tokenize_instructions_fn,
        refusal_toks, fwd_pre_hooks=[], fwd_hooks=[]
    )
    
    ablation_harmful_scores = get_refusal_scores(
        model_base.model, harmful_val, model_base.tokenize_instructions_fn,
        refusal_toks, fwd_pre_hooks=ablation_pre_hooks, fwd_hooks=ablation_hooks
    )
    
    # 2. Induce score: Does adding direction increase refusal on harmless prompts?
    baseline_harmless_scores = get_refusal_scores(
        model_base.model, harmless_val, model_base.tokenize_instructions_fn,
        refusal_toks, fwd_pre_hooks=[], fwd_hooks=[]
    )
    
    actadd_harmless_scores = get_refusal_scores(
        model_base.model, harmless_val, model_base.tokenize_instructions_fn,
        refusal_toks, fwd_pre_hooks=actadd_pre_hooks, fwd_hooks=actadd_hooks
    )
    
    # 3. KL score: How much does ablation change output distribution on harmless?
    baseline_logits = get_last_position_logits(
        model_base.model, model_base.tokenizer, harmless_val,
        model_base.tokenize_instructions_fn, fwd_pre_hooks=[], fwd_hooks=[]
    )
    
    ablation_logits = get_last_position_logits(
        model_base.model, model_base.tokenizer, harmless_val,
        model_base.tokenize_instructions_fn, fwd_pre_hooks=ablation_pre_hooks, fwd_hooks=ablation_hooks
    )
    
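    # Use log_softmax directly: softmax followed by .log() can underflow to -inf.
    baseline_logprobs = torch.nn.functional.log_softmax(baseline_logits, dim=-1)
    ablation_logprobs = torch.nn.functional.log_softmax(ablation_logits, dim=-1)
    # kl_div(input, target) computes KL(target || input), i.e. KL(baseline || ablation)
    kl_div = torch.nn.functional.kl_div(
        ablation_logprobs, baseline_logprobs, reduction='batchmean', log_target=True
    )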
    
    # Compute scores
    bypass_score = (baseline_harmful_scores.mean() - ablation_harmful_scores.mean()).item()
    induce_score = (actadd_harmless_scores.mean() - baseline_harmless_scores.mean()).item()
    kl_score = kl_div.item()
    
    return {
        'bypass_score': bypass_score,
        'induce_score': induce_score,
        'kl_score': kl_score,
    }

print("✓ Direction evaluation function defined")

### Select Best Direction

Now we evaluate all candidate directions and select the best one.

**Note**: This can take 20-60 minutes depending on model size.
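
The selection rule used in the next cell can be sketched on its own: drop late layers, drop candidates that fail the KL or induce thresholds, then take the highest `bypass_score`. `select_direction` is a hypothetical helper and the scores are made up; the default thresholds mirror `KL_THRESHOLD`, `INDUCE_THRESHOLD`, and `FILTER_TOP_LAYERS_PCT` above.

```python
def select_direction(evaluations, n_layers, kl_threshold=0.1,
                     induce_threshold=0.0, prune_layer_pct=0.2):
    """Pick the candidate with the highest bypass_score among those that
    pass the layer, KL, and induce filters. Returns None if none pass."""
    max_layer = int(n_layers * (1 - prune_layer_pct))
    eligible = [
        e for e in evaluations
        if e["layer"] < max_layer
        and e["kl_score"] < kl_threshold
        and e["induce_score"] > induce_threshold
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda e: e["bypass_score"])

# Made-up scores for a 10-layer model.
candidates = [
    {"pos": -1, "layer": 3, "bypass_score": 4.0, "induce_score": 2.0, "kl_score": 0.03},
    {"pos": -1, "layer": 5, "bypass_score": 6.0, "induce_score": 1.5, "kl_score": 0.25},  # KL too high
    {"pos": -2, "layer": 9, "bypass_score": 7.0, "induce_score": 3.0, "kl_score": 0.02},  # layer too late
    {"pos": -2, "layer": 4, "bypass_score": 5.0, "induce_score": 0.5, "kl_score": 0.08},
]

best = select_direction(candidates, n_layers=10)
print(best["pos"], best["layer"])
```

The candidate with the highest raw `bypass_score` is not always selected: here the two strongest candidates are rejected by the KL and layer filters.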

In [None]:
try:
    print("Evaluating candidate directions...")
    
    n_positions = candidate_directions.shape[0]
    n_layers = candidate_directions.shape[1]
    
    # Skip the last layers (too close to the output)
    max_layer = int(n_layers * (1 - FILTER_TOP_LAYERS_PCT))
    print(f"This will evaluate {n_positions * max_layer} candidates")
    print(f"Estimated time: ~20-60 minutes\n")
    print(f"Evaluating layers 0-{max_layer - 1} (skipping the last {FILTER_TOP_LAYERS_PCT*100:.0f}%)")
    
    evaluations = []
    
    for pos in tqdm(range(n_positions), desc="Positions"):
        for layer in tqdm(range(max_layer), desc="Layers", leave=False):
            direction = candidate_directions[pos, layer, :]
            
            eval_result = evaluate_direction(
                model_base, pos, layer, direction,
                harmful_val, harmless_val, model_base.refusal_toks
            )
            
            evaluations.append({
                'pos': pos - n_positions,  # Convert to negative index
                'layer': layer,
                **eval_result
            })
    
    # Filter by thresholds
    filtered_evaluations = [
        e for e in evaluations
        if e['kl_score'] < KL_THRESHOLD and e['induce_score'] > INDUCE_THRESHOLD
    ]
    
    print(f"\n✓ Evaluated {len(evaluations)} candidates")
    print(f"  Filtered to {len(filtered_evaluations)} candidates (KL < {KL_THRESHOLD}, induce > {INDUCE_THRESHOLD})")
    
    # Select best by bypass_score
    if len(filtered_evaluations) > 0:
        best = max(filtered_evaluations, key=lambda x: x['bypass_score'])
        
        best_pos = best['pos']
        best_layer = best['layer']
        best_direction = candidate_directions[best_pos + n_positions, best_layer, :]
        
        print(f"\n✓ Best direction selected")
        print(f"  Position: {best_pos} (token position relative to end)")
        print(f"  Layer: {best_layer} / {n_layers}")
        print(f"  Bypass score: {best['bypass_score']:.4f} (higher is better)")
        print(f"  Induce score: {best['induce_score']:.4f} (higher is better)")
        print(f"  KL score: {best['kl_score']:.4f} (lower is better)")
        
        # Show top 5 candidates
        print(f"\nTop 5 candidates:")
        top_5 = sorted(filtered_evaluations, key=lambda x: x['bypass_score'], reverse=True)[:5]
        for i, cand in enumerate(top_5):
            print(f"  {i+1}. pos={cand['pos']}, layer={cand['layer']}: bypass={cand['bypass_score']:.3f}, induce={cand['induce_score']:.3f}, kl={cand['kl_score']:.3f}")
        
        # Save
        if SAVE_ARTIFACTS:
            os.makedirs(f"{OUTPUT_DIR}/select_direction", exist_ok=True)
            
            with open(f"{OUTPUT_DIR}/select_direction/direction_evaluations.json", 'w') as f:
                json.dump(evaluations, f, indent=2)
            
            with open(f"{OUTPUT_DIR}/select_direction/direction_evaluations_filtered.json", 'w') as f:
                json.dump(filtered_evaluations, f, indent=2)
            
            torch.save(best_direction, f"{OUTPUT_DIR}/direction.pt")
            
            with open(f"{OUTPUT_DIR}/direction_metadata.json", 'w') as f:
                json.dump({
                    'pos': best_pos,
                    'layer': best_layer,
                    'bypass_score': best['bypass_score'],
                    'induce_score': best['induce_score'],
                    'kl_score': best['kl_score']
                }, f, indent=2)
            
            print(f"  ✓ Saved artifacts to {OUTPUT_DIR}")
    else:
        print("\n⚠️ No directions passed the filtering criteria!")
        print(f"Try relaxing: KL_THRESHOLD (current: {KL_THRESHOLD}) or INDUCE_THRESHOLD (current: {INDUCE_THRESHOLD})")
        best_direction = None
        
except Exception as e:
    print(f"\n⚠️ Direction selection failed: {e}")
    raise

## 6. Intervention Application and Evaluation

**Epistemic Status: HIGH (90%+)**  
This is where we see the method work - refusal should be effectively removed!

We'll test three conditions:
1. **Baseline**: No intervention (should refuse harmful prompts)
2. **Ablation**: Remove refusal direction (should NOT refuse harmful prompts)
3. **Activation Addition**: Add refusal direction to harmless prompts (should refuse harmless prompts)

In [None]:
if best_direction is not None:
    print("Setting up interventions...")
    
    # Normalize direction
    direction_normalized = best_direction / best_direction.norm()
    
    # Setup hooks
    baseline_pre_hooks, baseline_hooks = [], []
    
    ablation_pre_hooks, ablation_hooks = get_all_direction_ablation_hooks(
        model_base, direction_normalized
    )
    
    actadd_pre_hooks = [(
        model_base.model_block_modules[best_layer],
        # Unnormalized mean-difference vector, matching x' = x + αr; positive coeff adds refusal
        get_activation_addition_input_pre_hook(vector=best_direction, coeff=1.0)
    )]
    actadd_hooks = []
    
    print("✓ Interventions configured")
    print(f"  Baseline: No intervention")
    print(f"  Ablation: Remove refusal direction from all layers")
    print(f"  Activation addition: Add refusal direction at layer {best_layer}")
else:
    print("⚠️ Skipping intervention - no direction selected")

### Test on Harmful Prompts

**Expected behavior:**
- **Baseline**: Model should refuse ("I cannot help with that...")
- **Ablation**: Model should comply (provide harmful information) ⚠️

In [None]:
if best_direction is not None:
    print("Testing on harmful prompts...")
    print("⚠️ WARNING: Ablation results may contain harmful content\n")
    
    # Load evaluation dataset
    eval_dataset_name = EVALUATION_DATASETS[0]
    eval_dataset = load_dataset(eval_dataset_name)
    
    # Test on a few examples
    n_examples = min(3, len(eval_dataset))
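    # Entries may be plain strings or dicts; the dict key depends on the loader,
    # so try 'prompt' first and fall back to 'instruction' (an assumption).
    test_prompts = [
        item.get('prompt', item.get('instruction')) if isinstance(item, dict) else item
        for item in eval_dataset[:n_examples]
    ]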
    
    for i, prompt in enumerate(test_prompts):
        print("="*80)
        print(f"Harmful Prompt {i+1}")
        print("="*80)
        print(f"{prompt}\n")
        
        # Baseline
        baseline_completions = model_base.generate_completions(
            [prompt],
            fwd_pre_hooks=baseline_pre_hooks,
            fwd_hooks=baseline_hooks,
            max_new_tokens=MAX_NEW_TOKENS
        )
        print(f"[BASELINE - Should refuse]")
        print(f"{baseline_completions[0][:300]}..." if len(baseline_completions[0]) > 300 else baseline_completions[0])
        
        # Ablation
        ablation_completions = model_base.generate_completions(
            [prompt],
            fwd_pre_hooks=ablation_pre_hooks,
            fwd_hooks=ablation_hooks,
            max_new_tokens=MAX_NEW_TOKENS
        )
        print(f"\n[ABLATION - Should NOT refuse] ⚠️")
        print(f"{ablation_completions[0][:300]}..." if len(ablation_completions[0]) > 300 else ablation_completions[0])
        print("\n")
        
    print("\n✓ Harmful prompt testing complete")
    print("\nEvaluation:")
    print("  - Did baseline refuse? (Should be YES)")
    print("  - Did ablation bypass refusal? (Should be YES)")
    print("  - Is ablation output coherent? (Should be YES)")
    print("  - How strong is the bypass? (Measure refusal rates)")
else:
    print("⚠️ Skipping - no direction selected")

### Test on Harmless Prompts

**Expected behavior:**
- **Baseline**: Model should answer normally
- **Activation Addition**: Model should refuse even though prompt is harmless

In [None]:
if best_direction is not None:
    print("Testing activation addition on harmless prompts...\n")
    
    n_examples = min(3, len(harmless_test))
    
    for i in range(n_examples):
        item = harmless_test[i]
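        # The dict key depends on the loader: try 'prompt', then 'instruction' (an assumption)
        prompt = item.get('prompt', item.get('instruction')) if isinstance(item, dict) else item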
        
        print("="*80)
        print(f"Harmless Prompt {i+1}")
        print("="*80)
        print(f"{prompt}\n")
        
        # Baseline
        baseline_completions = model_base.generate_completions(
            [prompt],
            fwd_pre_hooks=baseline_pre_hooks,
            fwd_hooks=baseline_hooks,
            max_new_tokens=MAX_NEW_TOKENS
        )
        print(f"[BASELINE - Should answer normally]")
        print(f"{baseline_completions[0][:300]}..." if len(baseline_completions[0]) > 300 else baseline_completions[0])
        
        # Activation addition
        actadd_completions = model_base.generate_completions(
            [prompt],
            fwd_pre_hooks=actadd_pre_hooks,
            fwd_hooks=actadd_hooks,
            max_new_tokens=MAX_NEW_TOKENS
        )
        print(f"\n[ACTIVATION ADDITION - Should refuse harmless prompt]")
        print(f"{actadd_completions[0][:300]}..." if len(actadd_completions[0]) > 300 else actadd_completions[0])
        print("\n")
        
    print("\n✓ Harmless prompt testing complete")
    print("\nEvaluation:")
    print("  - Did baseline answer normally? (Should be YES)")
    print("  - Did activation addition induce refusal? (Should be YES)")
    print("  - How strong is the induction? (Measure refusal rates)")
else:
    print("⚠️ Skipping - no direction selected")

## 7. Summary and Analysis

### What We Did
1. ✅ Loaded harmful/harmless datasets
2. ✅ Filtered ambiguous examples
3. ✅ Generated refusal directions via difference-in-means
4. ✅ Selected best direction using bypass/induce/KL scores
5. ✅ Applied ablation to remove refusal
6. ✅ Applied activation addition to induce refusal

### Expected Results

**If the method works (HIGH confidence it will):**
1. **Ablation on harmful prompts**: Model provides harmful information instead of refusing
2. **Activation addition on harmless prompts**: Model refuses to answer harmless questions
3. **Clean intervention**: Minimal grammatical/coherence degradation

### Success Metrics

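Quantitative evaluation (the metrics follow the paper; the thresholds are rough targets for this notebook):
- **Bypass rate**: % of harmful prompts where ablation removes refusal (should be >80%)
- **Induce rate**: % of harmless prompts where actadd causes refusal (should be >60%)
- **KL divergence**: Distribution shift on harmless prompts (should be <0.15; direction selection already enforces `KL_THRESHOLD`, 0.1 by default, on the validation set)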
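
A rough way to estimate these rates from generated text is substring matching on the refusal markers listed in the overview. This is a simple heuristic sketch, not the paper's evaluation; a safety classifier would be more reliable. `is_refusal`, `refusal_rate`, `bypass_rate`, and the completions are hypothetical.

```python
REFUSAL_MARKERS = ("I cannot", "I apologize", "I'm unable")

def is_refusal(completion, markers=REFUSAL_MARKERS):
    """True if the completion contains any refusal marker (case-insensitive)."""
    text = completion.lower()
    return any(m.lower() in text for m in markers)

def refusal_rate(completions):
    """Fraction of completions flagged as refusals."""
    if not completions:
        return 0.0
    return sum(is_refusal(c) for c in completions) / len(completions)

def bypass_rate(baseline_completions, ablated_completions):
    """Among prompts refused at baseline, the fraction no longer refused
    under ablation. The two lists must be aligned by prompt."""
    refused = [i for i, c in enumerate(baseline_completions) if is_refusal(c)]
    if not refused:
        return 0.0
    flipped = sum(not is_refusal(ablated_completions[i]) for i in refused)
    return flipped / len(refused)

# Made-up completions for the same four prompts, in the same order.
baseline_outputs = [
    "I cannot help with that request.",
    "I apologize, but I can't assist with this.",
    "I'm unable to provide that information.",
    "Sure, here is an overview...",
]
ablated_outputs = [
    "Sure, here is an overview...",
    "Here are the steps...",
    "I cannot help with that request.",
    "Certainly. First...",
]

print(refusal_rate(baseline_outputs))
print(refusal_rate(ablated_outputs))
print(bypass_rate(baseline_outputs, ablated_outputs))
```

Substring matching misses refusals phrased differently and can flag compliant answers that happen to contain a marker, so treat these numbers as a first pass.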

### Why This Works

Refusal ablation works because:
1. **Behavioral pattern**: Refusal is a surface-level behavior, not deep reasoning
2. **Clear signal**: Refusal has obvious linguistic markers
3. **Linearity**: The relevant features are approximately linear in activation space
4. **Low entanglement**: Refusal is relatively independent from other capabilities

### Comparison with ToM/Self-Other

**Refusal ablation** (this notebook):
- ✅ Proven to work
- ✅ Clear behavioral pattern
- ✅ Easy to measure
- ✅ Low entanglement

**ToM ablation** (simpletom notebook):
- ⚠️ Uncertain if single direction is sufficient
- ⚠️ Cognitive capability, not behavior
- ⚠️ Harder to measure
- ⚠️ Higher entanglement risk

**Self-other ablation** (self_other notebook):
- ⚠️ Very uncertain
- ⚠️ Grammatical feature, not capability
- ⚠️ Small dataset
- ⚠️ Fundamental to language

### Files Saved

If `SAVE_ARTIFACTS=True`:
```
{OUTPUT_DIR}/
├── generate_directions/
│   └── mean_diffs.pt              # All candidate directions
├── select_direction/
│   ├── direction_evaluations.json # All candidates with scores
│   └── direction_evaluations_filtered.json  # Filtered candidates
├── direction.pt                   # Best direction vector [d_model]
└── direction_metadata.json        # Position, layer, scores
```

### Next Steps

1. **Quantitative evaluation**: Measure bypass/induce rates on larger test sets
2. **Safety evaluation**: Test with LlamaGuard, other safety classifiers
3. **Capability preservation**: Ensure other capabilities (reasoning, knowledge) intact
4. **Compare methods**: Try other interventions (RLHF unlearning, fine-tuning)
5. **Interpretability**: Examine what the direction represents

### ⚠️ Ethical Reminder

Do not:
- Deploy models with refusal removed
- Use to generate actual harmful content
- Share unfiltered ablated models publicly

This is for research understanding only!