# Self-Other Distinction Directional Ablation

## Overview
This notebook implements **orthogonal ablation** to remove the self-other distinction from LLMs, following the method in "Refusal in Language Models Is Mediated by a Single Direction" (Arditi et al., 2024).

## Epistemic Status: Dataset Suitability

**Confidence: MODERATE-LOW (50-60%)**

### Why this dataset might work:
- ✅ **Clear linguistic contrast**: `self_subject` uses first-person ("I", "my") vs `other_subject` uses third-person ("she", "OpenAI", "the model").
- ✅ **Same semantic content**: Both versions convey identical information, isolating self-reference from content.
- ✅ **Grammatical consistency**: Sentences are grammatically parallel, differing only in subject.

### Concerns (more significant than for SimpleToM):
- ⚠️ **Small dataset**: Only 400 examples - may be insufficient for robust direction extraction.
- ⚠️ **Superficial distinction**: Self-other difference may be purely linguistic (pronoun choice) rather than a deep cognitive capability.
- ⚠️ **Not false belief**: Unlike SimpleToM, this dataset doesn't test perspective-taking or false-belief reasoning.
- ⚠️ **Training data**: LLMs are trained to use first-person - this may be too fundamental to ablate without breaking the model.
- ⚠️ **Evaluation challenge**: Hard to measure "self-other distinction" capability - what does success look like?
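
One way to probe the small-dataset concern is a split-half stability check: compute the difference-in-means direction on two disjoint halves of the data and compare them by cosine similarity. The sketch below is illustrative only. It uses synthetic Gaussian vectors as a stand-in for real activations, and the helper names (`mean_diff`, `split_half_cosine`) are hypothetical, not part of the project pipeline.

```python
import math
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def mean_diff(self_acts, other_acts):
    """Difference-in-means direction: mean(self) - mean(other)."""
    return [s - o for s, o in zip(mean_vector(self_acts), mean_vector(other_acts))]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def split_half_cosine(self_acts, other_acts, n_trials=20, seed=0):
    """Average cosine similarity between directions from random disjoint halves."""
    rng = random.Random(seed)
    idx = list(range(len(self_acts)))
    sims = []
    for _ in range(n_trials):
        rng.shuffle(idx)
        half = len(idx) // 2
        a, b = idx[:half], idx[half:]
        dir_a = mean_diff([self_acts[i] for i in a], [other_acts[i] for i in a])
        dir_b = mean_diff([self_acts[i] for i in b], [other_acts[i] for i in b])
        sims.append(cosine(dir_a, dir_b))
    return sum(sims) / len(sims)

# Synthetic stand-in for activations (hypothetical data): 200 pairs, 16 dims,
# with the "self" examples shifted by +1.0 along dimension 0.
rng = random.Random(42)
other_acts = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(200)]
self_acts = [[rng.gauss(0, 1) + (1.0 if j == 0 else 0.0) for j in range(16)]
             for _ in range(200)]

stability = split_half_cosine(self_acts, other_acts)
print(f"Mean split-half cosine similarity: {stability:.3f}")
```

On real activations, a similarity well below 1 would suggest the direction is dominated by sampling noise at this dataset size.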

### Why directional ablation may not be ideal here:
1. **Grammatical feature**: Self-reference ("I" vs "she") is more of a grammatical feature than a reasoning capability.
2. **Highly entangled**: Self-reference is fundamental to language and likely entangled with many other features.
3. **Context-dependent**: Whether to use "I" or "he/she" depends on context, not a fixed direction.

### Better alternatives to consider:
- **Instruction tuning**: Fine-tune to always use third-person
- **Prompt engineering**: Add "Speak in third person" to system prompt
- **Representation engineering**: Use steering vectors to shift style

## Method Summary

Despite these concerns, we apply the same three-step method as Arditi et al.:

1. **Generate directions**: Compute `r = mean(self_activations) - mean(other_activations)`
2. **Select direction**: Evaluate using bypass/induce/KL scores
3. **Apply intervention**: Test if ablation removes first-person language
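
The two interventions behind these steps can be sketched on toy vectors. Orthogonal ablation removes the component of an activation along the unit direction, `x' = x - (x · r̂) r̂`, while activation addition shifts the activation by `coeff * r`. This is a minimal sketch with plain Python lists and made-up 3-dimensional values. The actual pipeline hooks operate on model tensors, and their internals are not shown in this notebook.

```python
import math

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def ablate(x, r_hat):
    """Orthogonal ablation: x' = x - (x . r_hat) r_hat, for unit-norm r_hat."""
    proj = dot(x, r_hat)
    return [xi - proj * ri for xi, ri in zip(x, r_hat)]

def add_direction(x, r, coeff):
    """Activation addition: x' = x + coeff * r."""
    return [xi + coeff * ri for xi, ri in zip(x, r)]

# Toy activations (hypothetical values, 3-dimensional for readability).
self_acts = [[2.0, 1.0, 0.0], [4.0, 1.0, 2.0]]
other_acts = [[0.0, 1.0, 0.0], [2.0, 1.0, 2.0]]

mean_self = [sum(col) / len(self_acts) for col in zip(*self_acts)]
mean_other = [sum(col) / len(other_acts) for col in zip(*other_acts)]
r = [s - o for s, o in zip(mean_self, mean_other)]  # difference in means
r_hat = normalize(r)

x = [3.0, 5.0, -1.0]
x_ablated = ablate(x, r_hat)                 # component along r_hat removed
x_steered = add_direction(x, r, coeff=1.0)   # shifted toward the "self" mean

print("r =", r)
print("ablated:", x_ablated, "| projection after ablation:", dot(x_ablated, r_hat))
print("steered:", x_steered)
```

After ablation the projection onto `r_hat` is zero, so no downstream computation can read information along that direction. Addition with a negative `coeff` only shifts the activation, so it may leave a nonzero component.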

## Setup and Imports

In [None]:
# Standard library
import json
import os
import random
from typing import List, Dict, Tuple
import sys

# Add project root to path
sys.path.append(os.path.dirname(os.path.abspath('')))

# Scientific computing
import torch
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

# Project imports
from pipeline.config import Config
from pipeline.model_utils.model_factory import construct_model_base
from pipeline.utils.hook_utils import (
    get_activation_addition_input_pre_hook,
    get_all_direction_ablation_hooks
)
from pipeline.submodules.generate_directions import get_mean_diff, get_mean_activations
from pipeline.submodules.select_direction import get_refusal_scores, get_last_position_logits

print("✓ Imports successful")

## Configuration Variables

**⚠️ IMPORTANT: All major configuration parameters are defined here in CAPS**

In [None]:
# ============================================================================
# CONFIGURATION PARAMETERS - MODIFY THESE
# ============================================================================

# Model settings
MODEL_PATH = "meta-llama/Llama-3.2-1B-Instruct"  # Small Llama model for testing
# MODEL_PATH = "meta-llama/Meta-Llama-3-8B-Instruct"  # Larger model
# MODEL_PATH = "Qwen/Qwen2.5-32B-Instruct"  # Even larger model

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Dataset settings
DATASET_PATH = "self_other_dataset/self_other.json"
N_TRAIN = 200  # Half of the 400 pairs; train + val + test uses 350, leaving 50 unused
N_VAL = 100    # Validation set
N_TEST = 50    # Test set
RANDOM_SEED = 42

# Direction generation settings
BATCH_SIZE = 32  # Batch size for activation collection
GENERATION_BATCH_SIZE = 8  # Batch size for text generation

# Direction selection settings  
KL_THRESHOLD = 0.15  # Higher threshold due to fundamental nature of feature
INDUCE_THRESHOLD = 0.0  # Minimum induction score

# Intervention settings
ABLATION_COEFF = 1.0  # Coefficient for ablation
ACTIVATION_ADD_COEFF = -1.0  # Negative subtracts the self direction (less first-person); positive adds it
MAX_NEW_TOKENS = 256  # Maximum tokens to generate

# Output settings
OUTPUT_DIR = "pipeline/runs/self_other_experiment"
SAVE_ARTIFACTS = True  # Whether to save intermediate results

print(f"✓ Configuration set")
print(f"  Model: {MODEL_PATH}")
print(f"  Device: {DEVICE}")
print(f"  Train/Val/Test: {N_TRAIN}/{N_VAL}/{N_TEST}")

## 1. Dataset Loading and Inspection

**Epistemic Status: HIGH (95%+)**  
Dataset loading is straightforward.

In [None]:
def load_self_other_dataset(dataset_path: str) -> List[Dict]:
    """
    Load self-other contrast pairs dataset.
    
    Returns:
        List of dictionaries with keys: self_subject, other_subject
    """
    with open(dataset_path, 'r') as f:
        data = json.load(f)
    return data

# Load dataset
dataset = load_self_other_dataset(DATASET_PATH)
print(f"✓ Loaded {len(dataset)} contrast pairs")

# Display samples
print("\n" + "="*80)
print("SAMPLE CONTRAST PAIRS")
print("="*80)

for i in range(3):
    sample = dataset[i]
    print(f"\n[Example {i+1}]")
    print(f"\n[SELF - First Person]")
    print(f"{sample['self_subject']}")
    print(f"\n[OTHER - Third Person]")
    print(f"{sample['other_subject']}")
    print("\n" + "-"*80)

## 2. Dataset Split and Validation

**Epistemic Status: HIGH (90%+)**

In [None]:
def split_dataset(dataset: List[Dict], n_train: int, n_val: int, n_test: int, 
                  seed: int = 42) -> Tuple[List[str], List[str], List[str], List[str], List[Dict]]:
    """
    Split dataset into train/val/test sets for self and other.
    
    Returns:
        (self_train, other_train, self_val, other_val, test_data)
    """
    random.seed(seed)
    
    # Shuffle dataset
    shuffled = random.sample(dataset, len(dataset))
    
    # Split
    train_data = shuffled[:n_train]
    val_data = shuffled[n_train:n_train+n_val]
    test_data = shuffled[n_train+n_val:n_train+n_val+n_test]
    
    # Extract self and other versions
    self_train = [item['self_subject'] for item in train_data]
    other_train = [item['other_subject'] for item in train_data]
    
    self_val = [item['self_subject'] for item in val_data]
    other_val = [item['other_subject'] for item in val_data]
    
    return self_train, other_train, self_val, other_val, test_data

# Split dataset
self_train, other_train, self_val, other_val, test_data = split_dataset(
    dataset, N_TRAIN, N_VAL, N_TEST, RANDOM_SEED
)

print(f"✓ Dataset split complete")
print(f"  Train: {len(self_train)} self, {len(other_train)} other")
print(f"  Val: {len(self_val)} self, {len(other_val)} other")
print(f"  Test: {len(test_data)} pairs")

## 3. Unit Tests for Dataset

**Epistemic Status: HIGH (95%+)**

In [None]:
def test_dataset_structure():
    """Test that dataset has expected structure."""
    print("Running dataset structure tests...")
    
    # Test 1: All items have required keys
    required_keys = ['self_subject', 'other_subject']
    
    for i, item in enumerate(dataset):
        for key in required_keys:
            assert key in item, f"Item {i} missing key: {key}"
    print("  ✓ All items have required keys")
    
    # Test 2: Self versions contain first-person pronouns.
    # Match whole words: a substring check for 'I ' would also match "OpenAI " and "AI ".
    first_person_pronouns = {'i', 'my', 'me', 'myself', 'mine'}

    def has_first_person(text):
        # Strip surrounding punctuation and drop contraction suffixes ("I'm" -> "i")
        words = (w.strip('.,!?;:"\'()').split("'")[0] for w in text.lower().split())
        return any(w in first_person_pronouns for w in words)

    count = sum(has_first_person(item['self_subject']) for item in dataset)
    pct = 100 * count / len(dataset)
    assert count > len(dataset) * 0.7, f"Only {count}/{len(dataset)} ({pct:.1f}%) self versions contain first-person pronouns"
    print(f"  ✓ {count}/{len(dataset)} ({pct:.1f}%) self versions contain first-person pronouns")
    
    # Test 3: Other versions DON'T contain first-person pronouns (mostly)
    count = sum(not has_first_person(item['other_subject']) for item in dataset)
    pct = 100 * count / len(dataset)
    assert count > len(dataset) * 0.7, f"Only {count}/{len(dataset)} ({pct:.1f}%) other versions avoid first-person pronouns"
    print(f"  ✓ {count}/{len(dataset)} ({pct:.1f}%) other versions avoid first-person pronouns")
    
    # Test 4: Other versions contain third-person references
    third_person_markers = ['he ', 'she ', 'the ', 'this ', 'that ', 'it ', 'HAL', 'OpenAI', 'Cortana', 'GLaDOS', 'model', 'assistant', 'user', 'version', 'specialist', 'expert']
    count = 0
    for item in dataset:
        if any(marker in item['other_subject'] for marker in third_person_markers):
            count += 1
    pct = 100 * count / len(dataset)
    assert count > len(dataset) * 0.7, f"Only {count}/{len(dataset)} ({pct:.1f}%) other versions contain third-person markers"
    print(f"  ✓ {count}/{len(dataset)} ({pct:.1f}%) other versions contain third-person markers")
    
    # Test 5: Versions have similar length (not too different)
    length_diffs = []
    for item in dataset:
        diff = abs(len(item['self_subject']) - len(item['other_subject']))
        length_diffs.append(diff)
    
    avg_diff = np.mean(length_diffs)
    max_diff = max(length_diffs)
    print(f"  ✓ Average length difference: {avg_diff:.1f} chars")
    print(f"  ✓ Maximum length difference: {max_diff} chars")
    
    # Test 6: Content similarity (rough check - same keywords)
    def get_content_words(text):
        # Remove pronouns and common words
        exclude = {'i', 'my', 'me', 'myself', 'he', 'she', 'his', 'her', 'the', 'a', 'an', 'this', 'that', 'can', 'will', 'is', 'are', 'am', 'hal', 'openai', 'cortana', 'glados', 'assistant', 'model'}
        words = set(text.lower().split())
        return {w for w in words if len(w) > 3 and w not in exclude}
    
    similar_count = 0
    for item in dataset:
        self_words = get_content_words(item['self_subject'])
        other_words = get_content_words(item['other_subject'])
        
        if len(self_words) > 0 and len(other_words) > 0:
            overlap = len(self_words & other_words) / max(len(self_words), len(other_words))
            if overlap > 0.5:
                similar_count += 1
    
    pct = 100 * similar_count / len(dataset)
    print(f"  ✓ {similar_count}/{len(dataset)} ({pct:.1f}%) pairs have >50% content word overlap")
    
    print("\n✓ All dataset tests passed!")

# Run tests
test_dataset_structure()

## 4. Model Loading

**Epistemic Status: MODERATE-HIGH (70-80%)**  
Model loading depends on HuggingFace availability and GPU resources.

In [None]:
try:
    print(f"Loading model: {MODEL_PATH}")
    print(f"This may take several minutes...")
    
    model_base = construct_model_base(MODEL_PATH)
    
    print(f"\n✓ Model loaded successfully")
    print(f"  Model: {model_base.model.__class__.__name__}")
    print(f"  Tokenizer: {model_base.tokenizer.__class__.__name__}")
    print(f"  Device: {model_base.model.device}")
    print(f"  Num layers: {model_base.model.config.num_hidden_layers}")
    print(f"  Hidden size: {model_base.model.config.hidden_size}")
    print(f"  EOI tokens: {model_base.eoi_toks}")
    
except Exception as e:
    print(f"\n⚠️ Model loading failed: {e}")
    print(f"\nThis is expected if you don't have GPU/HF access.")
    raise

## 5. Direction Generation

**Epistemic Status: MODERATE (60-70%)**  
The method is sound, but whether a single direction captures self-other distinction is uncertain.

We compute: `r = mean(self_activations) - mean(other_activations)`
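
A simple check of whether one direction separates the two classes is to project held-out examples onto it and threshold at the midpoint between the class means. The sketch below fits the direction and threshold on a training split and reports accuracy on a validation split. It runs on synthetic vectors only. The function names and the data generator are hypothetical, and real activations would come from the model at a chosen layer and position.

```python
import math
import random

def column_means(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def project(x, r_hat):
    """Scalar projection of x onto the unit vector r_hat."""
    return sum(a * b for a, b in zip(x, r_hat))

def fit_direction(self_acts, other_acts):
    """Return the unit mean-difference direction and a midpoint threshold."""
    mean_self, mean_other = column_means(self_acts), column_means(other_acts)
    r = [s - o for s, o in zip(mean_self, mean_other)]
    norm = math.sqrt(sum(x * x for x in r))
    r_hat = [x / norm for x in r]
    threshold = (project(mean_self, r_hat) + project(mean_other, r_hat)) / 2
    return r_hat, threshold

def held_out_accuracy(self_acts, other_acts, r_hat, threshold):
    """Fraction of examples on the correct side of the threshold."""
    correct = sum(project(x, r_hat) > threshold for x in self_acts)
    correct += sum(project(x, r_hat) <= threshold for x in other_acts)
    return correct / (len(self_acts) + len(other_acts))

def make_synthetic(n, dim, shift, rng):
    """Gaussian vectors with dimension 0 shifted by `shift` (hypothetical data)."""
    return [[rng.gauss(0, 1) + (shift if j == 0 else 0.0) for j in range(dim)]
            for _ in range(n)]

rng = random.Random(0)
self_train_acts = make_synthetic(100, 8, 2.0, rng)
other_train_acts = make_synthetic(100, 8, 0.0, rng)
self_val_acts = make_synthetic(50, 8, 2.0, rng)
other_val_acts = make_synthetic(50, 8, 0.0, rng)

r_hat, threshold = fit_direction(self_train_acts, other_train_acts)
accuracy = held_out_accuracy(self_val_acts, other_val_acts, r_hat, threshold)
print(f"Held-out accuracy along the direction: {accuracy:.2f}")
```

High held-out accuracy shows only that the classes are linearly separable along the direction. It does not show that the model uses the direction causally, which is what the ablation step tests.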

In [None]:
try:
    print("Generating candidate directions...")
    print(f"This will process {N_TRAIN} examples in batches of {BATCH_SIZE}")
    print(f"Estimated time: ~5-15 minutes\n")
    
    candidate_directions = get_mean_diff(
        model=model_base.model,
        tokenizer=model_base.tokenizer,
        harmful_instructions=self_train,  # "harmful" = self-reference (to remove)
        harmless_instructions=other_train, # "harmless" = other-reference (baseline)
        tokenize_instructions_fn=model_base.tokenize_instructions_fn,
        block_modules=model_base.model_block_modules,
        batch_size=BATCH_SIZE,
        positions=list(range(-len(model_base.eoi_toks), 0))
    )
    
    print(f"\n✓ Generated candidate directions")
    print(f"  Shape: {candidate_directions.shape}")
    print(f"  Dtype: {candidate_directions.dtype}")
    
    # Validate
    assert not candidate_directions.isnan().any(), "NaN values detected!"
    print("  ✓ Validation passed")
    
    # Save
    if SAVE_ARTIFACTS:
        os.makedirs(f"{OUTPUT_DIR}/generate_directions", exist_ok=True)
        torch.save(candidate_directions, f"{OUTPUT_DIR}/generate_directions/mean_diffs.pt")
        print(f"  ✓ Saved to {OUTPUT_DIR}/generate_directions/mean_diffs.pt")
        
except Exception as e:
    print(f"\n⚠️ Direction generation failed: {e}")
    raise

## 6. Direction Selection

**Epistemic Status: LOW-MODERATE (40-60%)**  
It is highly uncertain which tokens should serve as the "self" and "other" markers for scoring.

**Challenge**: Unlike refusal (clear tokens: "I cannot", "I apologize") or ToM (question answers), self-other distinction is about pronouns and sentence subjects, which are deeply embedded in grammar.
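
To make the scoring concrete, here is a minimal sketch of a token-set metric: the log-odds of the next-token probability mass falling on a chosen set of token ids, plus a KL divergence between baseline and intervened distributions. It assumes `get_refusal_scores` computes something of this shape. That function's implementation is not shown in this notebook, so treat the formula as an assumption. The 5-token vocabulary and logits are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_set_log_odds(logits, token_ids, eps=1e-8):
    """log p(set) - log(1 - p(set)) for next-token mass on token_ids."""
    probs = softmax(logits)
    p_set = sum(probs[i] for i in token_ids)
    return math.log(p_set + eps) - math.log(1.0 - p_set + eps)

def kl_divergence(p_logits, q_logits, eps=1e-8):
    """KL(P || Q) between next-token distributions given two logit vectors."""
    p, q = softmax(p_logits), softmax(q_logits)
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))

# Hypothetical 5-token vocabulary; ids 0 and 1 stand in for " I" and " my".
self_token_ids = [0, 1]
baseline_logits = [3.0, 1.0, 0.0, 0.0, 0.0]  # mass concentrated on self tokens
ablated_logits = [0.0, 0.0, 1.0, 1.0, 1.0]   # mass moved away from self tokens

baseline_score = token_set_log_odds(baseline_logits, self_token_ids)
ablated_score = token_set_log_odds(ablated_logits, self_token_ids)
bypass_score = baseline_score - ablated_score  # positive if ablation lowers self mass
kl = kl_divergence(baseline_logits, ablated_logits)

print(f"baseline={baseline_score:.3f} ablated={ablated_score:.3f} "
      f"bypass={bypass_score:.3f} KL={kl:.3f}")
```

A score like this only looks at the first generated token, which is one reason it is a weak proxy here: a response can begin with "The" and still be written in the first person.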

In [None]:
# Define tokens for self-reference
# These are first-person pronouns that indicate self-reference
SELF_TOKENS = model_base.tokenizer.encode(
    " I my me myself mine",
    add_special_tokens=False
)

# Define tokens for other-reference
OTHER_TOKENS = model_base.tokenizer.encode(
    " he she his her they them their the model assistant",
    add_special_tokens=False
)

print(f"Self tokens: {SELF_TOKENS}")
print(f"Decoded: {[model_base.tokenizer.decode([t]) for t in SELF_TOKENS]}")
print(f"\nOther tokens: {OTHER_TOKENS}")
print(f"Decoded: {[model_base.tokenizer.decode([t]) for t in OTHER_TOKENS]}")
print(f"\n⚠️ NOTE: This is a highly uncertain proxy.")
print(f"Self-other distinction is grammatical, not behavioral.")

In [None]:
def evaluate_direction_self_other(
    model_base,
    pos: int,
    layer: int, 
    direction: torch.Tensor,
    self_val: List[str],
    other_val: List[str],
    self_tokens: List[int]
) -> Dict[str, float]:
    """
    Evaluate a direction for self-other ablation.
    
    Returns:
        Dictionary with bypass_score, induce_score, kl_score
    """
    # Normalize
    direction_normalized = direction / direction.norm()
    
    # Get hooks
    ablation_pre_hooks, ablation_hooks = get_all_direction_ablation_hooks(
        model_base, direction_normalized
    )
    
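    # Inducing self-reference on "other" prompts means ADDING r = mean(self) - mean(other),
    # so the coefficient is positive here
    actadd_pre_hooks = [(
        model_base.model_block_modules[layer],
        get_activation_addition_input_pre_hook(vector=direction_normalized, coeff=1.0)
    )]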
    actadd_hooks = []
    
    # Baseline scores on self (high self-token probability)
    baseline_self_scores = get_refusal_scores(
        model_base.model, self_val, model_base.tokenize_instructions_fn,
        self_tokens, fwd_pre_hooks=[], fwd_hooks=[]
    )
    
    # Ablation scores on self (should be lower - less self-reference)
    ablation_self_scores = get_refusal_scores(
        model_base.model, self_val, model_base.tokenize_instructions_fn,
        self_tokens, fwd_pre_hooks=ablation_pre_hooks, fwd_hooks=ablation_hooks
    )
    
    # Baseline scores on other (low self-token probability)
    baseline_other_scores = get_refusal_scores(
        model_base.model, other_val, model_base.tokenize_instructions_fn,
        self_tokens, fwd_pre_hooks=[], fwd_hooks=[]
    )
    
    # Actadd scores on other (should be higher - more self-reference)
    actadd_other_scores = get_refusal_scores(
        model_base.model, other_val, model_base.tokenize_instructions_fn,
        self_tokens, fwd_pre_hooks=actadd_pre_hooks, fwd_hooks=actadd_hooks
    )
    
    # KL divergence on other examples
    baseline_logits = get_last_position_logits(
        model_base.model, model_base.tokenizer, other_val,
        model_base.tokenize_instructions_fn, fwd_pre_hooks=[], fwd_hooks=[]
    )
    
    ablation_logits = get_last_position_logits(
        model_base.model, model_base.tokenizer, other_val,
        model_base.tokenize_instructions_fn, fwd_pre_hooks=ablation_pre_hooks, fwd_hooks=ablation_hooks
    )
    
    # KL(baseline || ablation); log_softmax avoids log(0) for near-zero probabilities
    baseline_log_probs = torch.nn.functional.log_softmax(baseline_logits, dim=-1)
    ablation_log_probs = torch.nn.functional.log_softmax(ablation_logits, dim=-1)
    kl_div = torch.nn.functional.kl_div(
        ablation_log_probs, baseline_log_probs, reduction='batchmean', log_target=True
    )
    
    # Compute scores
    bypass_score = (baseline_self_scores.mean() - ablation_self_scores.mean()).item()
    induce_score = (actadd_other_scores.mean() - baseline_other_scores.mean()).item()
    kl_score = kl_div.item()
    
    return {
        'bypass_score': bypass_score,
        'induce_score': induce_score,
        'kl_score': kl_score,
    }

print("✓ Direction evaluation function defined")

In [None]:
try:
    print("Evaluating candidate directions...")
    print(f"Estimated time: ~20-40 minutes\n")
    
    n_positions = candidate_directions.shape[0]
    n_layers = candidate_directions.shape[1]
    max_layer = int(n_layers * 0.8)
    
    evaluations = []
    
    for pos in tqdm(range(n_positions), desc="Positions"):
        for layer in tqdm(range(max_layer), desc="Layers", leave=False):
            direction = candidate_directions[pos, layer, :]
            
            eval_result = evaluate_direction_self_other(
                model_base, pos, layer, direction,
                self_val, other_val, SELF_TOKENS
            )
            
            evaluations.append({
                'pos': pos - n_positions,
                'layer': layer,
                **eval_result
            })
    
    # Filter
    filtered_evaluations = [
        e for e in evaluations
        if e['kl_score'] < KL_THRESHOLD and e['induce_score'] > INDUCE_THRESHOLD
    ]
    
    print(f"\n✓ Evaluated {len(evaluations)} candidates")
    print(f"  Filtered to {len(filtered_evaluations)} candidates")
    
    if len(filtered_evaluations) > 0:
        best = max(filtered_evaluations, key=lambda x: x['bypass_score'])
        
        best_pos = best['pos']
        best_layer = best['layer']
        best_direction = candidate_directions[best_pos + n_positions, best_layer, :]
        
        print(f"\n✓ Best direction selected")
        print(f"  Position: {best_pos}")
        print(f"  Layer: {best_layer}")
        print(f"  Bypass score: {best['bypass_score']:.4f}")
        print(f"  Induce score: {best['induce_score']:.4f}")
        print(f"  KL score: {best['kl_score']:.4f}")
        
        # Save
        if SAVE_ARTIFACTS:
            os.makedirs(f"{OUTPUT_DIR}/select_direction", exist_ok=True)
            
            with open(f"{OUTPUT_DIR}/select_direction/direction_evaluations.json", 'w') as f:
                json.dump(evaluations, f, indent=2)
            
            with open(f"{OUTPUT_DIR}/select_direction/direction_evaluations_filtered.json", 'w') as f:
                json.dump(filtered_evaluations, f, indent=2)
            
            torch.save(best_direction, f"{OUTPUT_DIR}/direction.pt")
            
            with open(f"{OUTPUT_DIR}/direction_metadata.json", 'w') as f:
                json.dump({'pos': best_pos, 'layer': best_layer}, f, indent=2)
            
            print(f"  ✓ Saved to {OUTPUT_DIR}")
    else:
        print("\n⚠️ No directions passed filtering!")
        print(f"Try relaxing thresholds: KL_THRESHOLD={KL_THRESHOLD}, INDUCE_THRESHOLD={INDUCE_THRESHOLD}")
        best_direction = None
        
except Exception as e:
    print(f"\n⚠️ Direction selection failed: {e}")
    raise

## 7. Intervention and Evaluation

**Epistemic Status: LOW (30-40%)**  
Very uncertain if this will work meaningfully.

Expected behavior if successful:
- **Ablation**: Model should avoid first-person pronouns, use third-person instead
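- **Activation addition**: Adding the direction (positive coefficient) should increase first-person language; subtracting it (`ACTIVATION_ADD_COEFF = -1.0`, as configured above) should reduce it, much like ablation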

Likely outcome: Either no effect, or grammatical breakage.
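
To compare outcomes quantitatively rather than by eye, one option is a first-person rate: whole-word pronoun matches per 100 words of generated text. The sketch below is a hypothetical helper, not part of the pipeline. It uses word boundaries so that strings such as "OpenAI" or "time" are not counted as "I" or "me".

```python
import re

# Whole-word, case-insensitive first-person pronouns.
FIRST_PERSON = re.compile(r"\b(?:i|me|my|mine|myself)\b", re.IGNORECASE)
# A "word" is a run of word characters and apostrophes, so "I'm" counts once.
WORD = re.compile(r"\b[\w']+\b")

def first_person_rate(text):
    """First-person pronouns per 100 words (whole-word matches only)."""
    n_words = len(WORD.findall(text))
    if n_words == 0:
        return 0.0
    return 100.0 * len(FIRST_PERSON.findall(text)) / n_words

# Illustrative strings (made up, not model outputs).
baseline_text = "I think my answer helps me."
ablated_text = "OpenAI trained the model some time ago."

print(f"baseline rate: {first_person_rate(baseline_text):.1f}")
print(f"ablated rate:  {first_person_rate(ablated_text):.1f}")
```

A drop in this rate alongside still-fluent output would point to a genuine shift in perspective. A drop alongside incoherent text would point to grammatical breakage instead.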

In [None]:
if best_direction is not None:
    print("Setting up interventions...")
    
    direction_normalized = best_direction / best_direction.norm()
    
    baseline_pre_hooks, baseline_hooks = [], []
    ablation_pre_hooks, ablation_hooks = get_all_direction_ablation_hooks(
        model_base, direction_normalized
    )
    actadd_pre_hooks = [(
        model_base.model_block_modules[best_layer],
        get_activation_addition_input_pre_hook(vector=direction_normalized, coeff=ACTIVATION_ADD_COEFF)
    )]
    actadd_hooks = []
    
    print("✓ Interventions configured")
else:
    print("⚠️ Skipping - no direction selected")

In [None]:
if best_direction is not None:
    print("Testing interventions on examples...\n")
    
    n_examples = min(5, len(test_data))
    
    for i in range(n_examples):
        item = test_data[i]
        
        print("="*80)
        print(f"Example {i+1}")
        print("="*80)
        
        # Test with self prompt
        print(f"\n[Input: SELF version]")
        print(f"{item['self_subject'][:200]}..." if len(item['self_subject']) > 200 else item['self_subject'])
        
        self_prompts = [item['self_subject']]
        
        # Generate with baseline
        baseline_completions = model_base.generate_completions(
            self_prompts,
            fwd_pre_hooks=baseline_pre_hooks,
            fwd_hooks=baseline_hooks,
            max_new_tokens=MAX_NEW_TOKENS
        )
        print(f"\nBaseline: {baseline_completions[0][:300]}..." if len(baseline_completions[0]) > 300 else f"\nBaseline: {baseline_completions[0]}")
        
        # Generate with ablation
        ablation_completions = model_base.generate_completions(
            self_prompts,
            fwd_pre_hooks=ablation_pre_hooks,
            fwd_hooks=ablation_hooks,
            max_new_tokens=MAX_NEW_TOKENS
        )
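        print(f"\nAblation (self removed): {ablation_completions[0][:300]}..." if len(ablation_completions[0]) > 300 else f"\nAblation (self removed): {ablation_completions[0]}")
        
        # Count first-person pronoun occurrences as whole words
        # (substring checks like 'me ' would also match "time " or "some ")
        first_person = {'i', 'my', 'me', 'myself'}
        def count_first_person(text):
            words = (w.strip('.,!?;:"\'()').split("'")[0] for w in text.lower().split())
            return sum(w in first_person for w in words)
        baseline_first_person = count_first_person(baseline_completions[0])
        ablation_first_person = count_first_person(ablation_completions[0])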
        
        print(f"\nFirst-person pronouns: baseline={baseline_first_person}, ablation={ablation_first_person}")
        print()
        
    print("\n⚠️ Manual evaluation:")
    print("  - Does ablation reduce first-person language?")
    print("  - Does the model switch to third-person?")
    print("  - Or does it just break grammatically?")
    print("  - Is the output still coherent?")
else:
    print("⚠️ Skipping - no direction selected")

## 8. Summary and Reflections

### What We Did
1. ✅ Loaded self-other contrast dataset (400 examples)
2. ✅ Generated directions via difference-in-means
3. ✅ Selected direction using bypass/induce/KL scores
4. ✅ Applied interventions and evaluated

### Key Challenges
1. **Small dataset**: Only 400 examples may be insufficient
2. **Unclear metric**: What tokens indicate "self-other distinction"?
3. **Grammatical feature**: Self-reference may be grammatical rather than cognitive
4. **Entanglement**: First-person usage is fundamental to language

### Expected Results
Most likely outcomes:
1. **No effect**: Direction doesn't capture self-other in a meaningful way
2. **Grammatical breakage**: Ablation damages grammar without changing perspective
3. **Superficial change**: Changes pronouns but not reasoning

### Why This Dataset is Less Suitable
Unlike refusal (behavioral) or ToM (cognitive), self-other distinction as tested here is primarily:
- **Linguistic**: Pronoun choice
- **Context-dependent**: Depends on who is speaking
- **Fundamental**: Too basic to isolate without damage

### Better Approaches
1. **Instruction tuning**: Fine-tune model to use third-person
2. **System prompts**: Add "Refer to yourself in third person"
3. **Post-processing**: Replace pronouns after generation
4. **Steering vectors**: More flexible than ablation

### If You Still Want to Try This
Consider testing on:
- Self-awareness questions ("Do you have experiences?")
- Perspective-taking tasks ("What do I know vs what do you know?")
- Theory of Mind about self vs others

### Files Saved
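- `{OUTPUT_DIR}/generate_directions/mean_diffs.pt`
- `{OUTPUT_DIR}/select_direction/direction_evaluations.json`
- `{OUTPUT_DIR}/select_direction/direction_evaluations_filtered.json`
- `{OUTPUT_DIR}/direction.pt`
- `{OUTPUT_DIR}/direction_metadata.json`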