# Latents Steering vs Probe Comparison

This notebook compares two approaches to temporal horizon detection:

1. **Probe-based**: Train classifiers on activations
2. **Steering-based**: Use Contrastive Activation Addition (CAA) from the latents library

The latents library (https://github.com/justinshenk/latents) provides pre-trained temporal steering vectors.
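
As a rough illustration of the idea behind CAA, the steering vector for a layer can be taken as the mean difference between activations of paired long-horizon and short-horizon prompts. The sketch below uses synthetic NumPy arrays in place of real GPT-2 activations. The function name `caa_steering_vector` is hypothetical and is not part of the latents library, whose internals may differ.

```python
import numpy as np


def caa_steering_vector(long_acts: np.ndarray, short_acts: np.ndarray) -> np.ndarray:
    """Mean activation difference across paired prompts.

    Both arrays have shape (num_pairs, hidden_size); row i of each array
    comes from the same prompt pair.
    """
    if long_acts.shape != short_acts.shape:
        raise ValueError("paired activations must have the same shape")
    return (long_acts - short_acts).mean(axis=0)


# Synthetic stand-in for real activations (assumption: not GPT-2 output).
rng = np.random.default_rng(0)
hidden_size = 8
true_direction = np.zeros(hidden_size)
true_direction[0] = 1.0

short_acts = rng.normal(size=(50, hidden_size))
long_acts = short_acts + 2.0 * true_direction  # long = short shifted along one axis

vector = caa_steering_vector(long_acts, short_acts)
print(vector.round(2))
```

On this toy data the recovered vector points along the axis that separates the two prompt types, which is the property the real extraction step relies on.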

In [None]:
import sys
sys.path.append('..')

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import GPT2LMHeadModel, GPT2Tokenizer

from src.dataset.loader import load_dataset
from src.utils.latents_integration import TemporalSteeringIntegration

## 1. Load Model and Dataset

In [None]:
# Load GPT-2
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

print("Model loaded successfully")

In [None]:
# Load our dataset
dataset = load_dataset('../data/raw/prompts.jsonl')
print(f"Loaded {len(dataset)} prompt pairs")

# Show example
example = dataset[0]
print("\nExample pair:")
print(f"Short: {example['short_prompt']}")
print(f"Long:  {example['long_prompt']}")

## 2. Extract Steering Vectors from Our Dataset

Our dataset contains 300 paired prompts, which we can use to extract steering vectors specific to our task. The cell below uses only the first 50 pairs for speed.

In [None]:
# Initialize steering integration
steering = TemporalSteeringIntegration(model, tokenizer)

# Extract steering vectors (using subset for speed)
subset = dataset[:50]  # Use more for production
steering_vectors = steering.extract_steering_vectors_from_dataset(
    subset, layers=[8, 9, 10, 11]
)

print("Extracted steering vectors for layers:", list(steering_vectors.keys()))

In [None]:
# Save for later use
steering.save_steering_vectors(
    steering_vectors,
    '../steering_vectors/temporal_horizon_custom.json',
    metadata={'source': 'temporal_horizon_dataset', 'num_pairs': len(subset)}
)

## 3. Use Pre-trained Temporal Steering

The latents library includes pre-trained temporal steering vectors we can use.

In [None]:
# Load pre-trained steering
steering.load_pretrained_temporal_steering('gpt2')
print("Loaded pre-trained temporal steering")

## 4. Generate with Different Steering Strengths

In [None]:
test_prompt = "What should our company prioritize to succeed?"

print(f"Prompt: {test_prompt}\n")

for strength, label in [(-0.8, "Short-term"), (0.0, "Neutral"), (0.8, "Long-term")]:
    result = steering.generate_with_steering(
        test_prompt,
        strength=strength,
        temperature=0.7,
        max_length=60
    )
    print(f"{label} (strength={strength}):")
    print(f"{result}")
    print()

## 5. Compare with Probe-Based Approach

Load our trained probe and compare its predictions with steering behavior.

In [None]:
# Load probe (if available)
import torch
from src.probing.probe import create_probe
from src.probing.evaluator import ProbeEvaluator

try:
    checkpoint = torch.load('../checkpoints/probes/best_probe.pt', map_location='cpu')
    probe = create_probe('mlp', hidden_size=768)
    probe.load_state_dict(checkpoint['model_state_dict'])
    
    evaluator = ProbeEvaluator(probe, device='cpu')
    print("Probe loaded successfully")
except FileNotFoundError:
    print("No trained probe found. Train one first using scripts/train_probes.py")

## 6. Analyze Steering-Probe Overlap

Compare the directions learned by steering vs probing.

In [None]:
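# Get probe weights.
# A linear probe exposes `linear`; otherwise fall back to the first layer of `network`.
# Note: assuming network[0] is an nn.Linear, its weight has one row per hidden unit,
# so for an MLP probe this is a matrix of directions rather than a single direction.
if 'probe' in locals():
    probe_weights = probe.linear.weight if hasattr(probe, 'linear') else probe.network[0].weight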
    
    # Compute cosine similarity with steering vectors
    similarities = steering.analyze_steering_activation_overlap(
        steering_vectors, probe_weights
    )
    
    # Visualize
    layers = sorted([int(k.split('_')[1]) for k in similarities.keys()])
    sims = [similarities[f'layer_{l}'] for l in layers]
    
    plt.figure(figsize=(10, 6))
    plt.bar(layers, sims)
    plt.xlabel('Layer')
    plt.ylabel('Cosine Similarity')
    plt.title('Overlap Between Steering Vectors and Probe Weights')
    plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
    plt.grid(True, alpha=0.3)
    plt.show()
    
    print(f"\nAverage similarity: {np.mean(sims):.3f}")
    print(f"Max similarity at layer: {layers[np.argmax(sims)]}")

## 7. Key Insights

### Approach Comparison

**Probe-based (Ours)**:
- Classifies temporal horizon from activations
- Measures: Accuracy, F1, AUC
- Good for: Understanding what's encoded
- Limitation: Doesn't directly affect generation

**Steering-based (Latents/CAA)**:
- Modifies activations during generation
- Measures: Generation quality, behavioral change
- Good for: Controlling model behavior
- Limitation: Requires careful tuning of steering strength and layer choice
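
To make "modifies activations" concrete, here is a minimal sketch of the arithmetic: a scaled steering vector is added to the hidden state at every token position. It assumes plain additive steering on a single layer and uses synthetic NumPy arrays. `apply_steering` is a hypothetical helper, not the latents API, and the library may apply steering differently (for example, across several layers).

```python
import numpy as np


def apply_steering(hidden_states: np.ndarray, steering_vector: np.ndarray,
                   strength: float) -> np.ndarray:
    """Add strength * steering_vector to each token position.

    hidden_states has shape (seq_len, hidden_size);
    steering_vector has shape (hidden_size,).
    """
    return hidden_states + strength * steering_vector


# Synthetic hidden states and steering vector (assumption: illustrative only).
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(5, 8))
steering_vector = rng.normal(size=8)
unit = steering_vector / np.linalg.norm(steering_vector)

for strength in (-0.8, 0.0, 0.8):
    steered = apply_steering(hidden_states, steering_vector, strength)
    # Displacement along the steering direction, averaged over positions.
    shift = ((steered - hidden_states) @ unit).mean()
    print(f"strength={strength:+.1f}  shift along steering direction={shift:+.3f}")
```

The displacement scales linearly with `strength`, which is why the sign and magnitude of that parameter need tuning: too small has no visible effect, too large can push activations far from anything the model saw in training.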

### Synergies

1. **Validation**: If steering shifts generations as intended, that supports the claim that temporal information is present in the activations
2. **Circuit discovery**: Both can identify important layers/heads
3. **Complementary**: Probes show what is encoded; steering controls behavior

### High Cosine Similarity Means:
- The probe and the steering vectors pick out similar directions in activation space
- Both approaches converge on a similar representation
- The two methods cross-validate each other
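
When judging whether a similarity is "high", a baseline helps: in a 768-dimensional space, two unrelated random vectors have cosine similarity close to zero, with a spread of roughly 1/sqrt(768) ≈ 0.036. The sketch below computes cosine similarity and estimates that baseline empirically. It is a standalone illustration and does not reproduce what `analyze_steering_activation_overlap` does internally.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors (flattened to 1-D)."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(a @ b / denom)


# Baseline: similarity between a fixed vector and unrelated random vectors.
rng = np.random.default_rng(0)
hidden_size = 768  # GPT-2 small hidden size
reference = rng.normal(size=hidden_size)
random_sims = np.array([
    cosine_similarity(reference, rng.normal(size=hidden_size))
    for _ in range(1000)
])

print(f"random baseline mean: {random_sims.mean():+.4f}")
print(f"random baseline std:  {random_sims.std():.4f}")
print(f"1/sqrt(hidden_size):  {1 / np.sqrt(hidden_size):.4f}")
```

A probe-steering similarity several times larger than the baseline spread is unlikely to arise by chance, whereas a value inside that range should not be read as agreement.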

## 8. Next Steps

1. **Extract steering vectors from the full dataset** (300 pairs)
2. **Compare circuit analysis** from both approaches
3. **Test transfer**: Does steering trained on one model work on another?
4. **Combine approaches**: Use the probe to validate steering effectiveness
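
One way the last step could look, sketched on synthetic data: train a linear probe to separate short-horizon from long-horizon activations, then check that adding the steering vector moves the probe's predicted probability in the expected direction. Everything here is an assumption for illustration. The logistic-regression probe is a stand-in for the project's `create_probe`, and the activations are random rather than taken from GPT-2.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def train_linear_probe(x: np.ndarray, y: np.ndarray, lr: float = 0.1, steps: int = 500):
    """Fit logistic regression by gradient descent; returns (weights, bias)."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        grad = sigmoid(x @ w + b) - y  # derivative of the log loss w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


# Synthetic activations: the two classes differ along one hidden direction.
rng = np.random.default_rng(0)
hidden_size = 16
direction = rng.normal(size=hidden_size)
direction /= np.linalg.norm(direction)

short_acts = rng.normal(size=(200, hidden_size)) - direction
long_acts = rng.normal(size=(200, hidden_size)) + direction
x = np.vstack([short_acts, long_acts])
y = np.concatenate([np.zeros(200), np.ones(200)])  # 1 = long horizon

w, b = train_linear_probe(x, y)

# Steering vector as the difference of class means.
steering_vector = long_acts.mean(axis=0) - short_acts.mean(axis=0)

# Apply steering to held-out "neutral" activations and read the probe.
neutral = rng.normal(size=(100, hidden_size))
mean_p_long = {}
for strength in (-0.8, 0.0, 0.8):
    steered = neutral + strength * steering_vector
    mean_p_long[strength] = float(sigmoid(steered @ w + b).mean())
    print(f"strength={strength:+.1f}  mean P(long)={mean_p_long[strength]:.3f}")
```

If the probe's output rises with positive strength and falls with negative strength, the probe and the steering vector agree about the direction. A flat response would suggest they capture different features.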