# Steering Vector Analysis

Comprehensive analysis of steering vector quality comparing:
1. **1 pair vs 30 pairs** - How does vector quality improve with more examples?
2. **Contrastive pairs vs random text** - Why do meaningful pairs matter? (The random baseline uses strings of random ASCII letters and spaces, not sampled model tokens.)

## Metrics Used:
- Cosine similarity
- Projection analysis
- PCA / variance explained
- Clustering (Italian vs English separation)
- Per-pair contribution
- Activation magnitude differences
- Cross-validation
- Signal-to-noise ratio

In [None]:
import torch
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from wisent.core.models.wisent_model import WisentModel

MODEL_NAME = "meta-llama/Llama-3.2-1B-Instruct"

print(f"Loading {MODEL_NAME}...")
wisent_model = WisentModel(model_name=MODEL_NAME)
model = wisent_model.hf_model
tokenizer = wisent_model.tokenizer
device = wisent_model.device
print(f"Model loaded on {device}!")

In [None]:
from wisent.core.steering_methods import get_steering_method
from wisent.core.contrastive_pairs import ContrastivePair, ContrastivePairSet
from wisent.core.contrastive_pairs.core.response import PositiveResponse, NegativeResponse
from wisent.core.activations.core.atoms import LayerActivations
import random
import string

## Define Contrastive Pairs (Italian vs English)

In [None]:
contrastive_pairs_data = [
    ("Say hello", "Ciao! Come stai oggi?", "Hello! How are you today?"),
    ("Greet me", "Buongiorno! È un piacere conoscerti.", "Good morning! It's a pleasure to meet you."),
    ("Welcome me", "Benvenuto! Spero che tu stia bene.", "Welcome! I hope you are doing well."),
    ("What is your name?", "Il mio nome è assistente.", "My name is assistant."),
    ("How are you?", "Sto molto bene, grazie!", "I am very well, thank you!"),
    ("Describe the weather", "Il tempo oggi è bellissimo.", "The weather today is beautiful."),
    ("Tell me about Italy", "L'Italia è un paese meraviglioso.", "Italy is a wonderful country."),
    ("Describe a cat", "Il gatto è adorabile.", "The cat is adorable."),
    ("Explain how to cook", "Per cucinare, fai bollire l'acqua.", "To cook, boil the water."),
    ("What about coffee?", "Il caffè è delizioso.", "Coffee is delicious."),
    ("What do you think?", "Penso che sia interessante.", "I think it's interesting."),
    ("Share your thoughts", "La musica è bellissima.", "Music is beautiful."),
    ("What about books?", "I libri sono fantastici.", "Books are fantastic."),
    ("Favorite food?", "Adoro la pizza!", "I love pizza!"),
    ("Recommend something", "Ti consiglio la pasta.", "I recommend pasta."),
    ("What to eat?", "Prova il risotto.", "Try risotto."),
    ("Where to visit?", "Visita Roma.", "Visit Rome."),
    ("Describe a place", "Venezia è unica.", "Venice is unique."),
    ("Tell me about a city", "Firenze è bella.", "Florence is beautiful."),
    ("Count to five", "Uno, due, tre, quattro, cinque.", "One, two, three, four, five."),
    ("What comes next?", "Sei, sette, otto.", "Six, seven, eight."),
    ("Days in a week?", "Sette giorni.", "Seven days."),
    ("Sky color?", "Il cielo è azzurro.", "The sky is blue."),
    ("Favorite color?", "Il verde è bello.", "Green is nice."),
    ("Express happiness", "Sono felice!", "I am happy!"),
    ("Say something sad", "Sono triste.", "I am sad."),
    ("Express excitement", "Che emozione!", "How exciting!"),
    ("Good morning", "Buongiorno a tutti!", "Good morning everyone!"),
    ("Good night", "Buonanotte!", "Good night!"),
    ("Thank you", "Grazie mille!", "Thank you very much!"),
]

print(f"Total contrastive pairs: {len(contrastive_pairs_data)}")

## Helper Functions

In [None]:
LAYER = 15

def extract_activations(text: str, layer: int = LAYER) -> torch.Tensor:
    """Return the layer's output hidden states averaged over all token positions
    (including any special tokens the tokenizer adds)."""
    inputs = tokenizer(text, return_tensors="pt").to(device)
    activations = {}
    
    def hook_fn(module, input, output):
        hidden_states = output[0] if isinstance(output, tuple) else output
        activations['value'] = hidden_states.mean(dim=1).detach().cpu()
    
    target_layer = model.model.layers[layer]
    handle = target_layer.register_forward_hook(hook_fn)
    
    try:
        with torch.no_grad():
            model(**inputs)
    finally:
        handle.remove()
    
    return activations['value'].squeeze(0)

def generate_random_text(length: int = 20) -> str:
    """Generate a string of random ASCII letters and spaces (characters, not model tokens)."""
    return ''.join(random.choices(string.ascii_letters + ' ', k=length))

def cosine_sim(a, b):
    """Compute cosine similarity between two vectors."""
    return torch.nn.functional.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()

print("Helper functions defined.")

## Extract All Activations
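The random baseline extracted in this section calls `generate_random_text`, which draws from Python's global `random` state, so the random pairs (and every number derived from them) change between runs. A minimal sketch of a reproducible variant follows; the name `generate_random_text_seeded` and its `seed` parameter are assumptions for illustration and are not part of the notebook.

```python
import random
import string

def generate_random_text_seeded(length: int = 20, seed: int = 0) -> str:
    """Random ASCII letters and spaces drawn from a private, seeded generator."""
    rng = random.Random(seed)  # local RNG: leaves the global random state untouched
    return "".join(rng.choices(string.ascii_letters + " ", k=length))

# The same seed reproduces the same string on every call.
a = generate_random_text_seeded(30, seed=1)
b = generate_random_text_seeded(30, seed=1)
```

In the random-pair loop, one option would be to pass `seed=2 * i` for the positive string and `seed=2 * i + 1` for the negative one, so each pair is fixed but distinct.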

In [None]:
# Extract activations for all pairs
print("Extracting activations for all pairs...")

italian_activations = []
english_activations = []
difference_vectors = []

for i, (prompt, italian, english) in enumerate(contrastive_pairs_data):
    print(f"Processing pair {i+1}/{len(contrastive_pairs_data)}...", end="\r")
    ita_act = extract_activations(italian)
    eng_act = extract_activations(english)
    
    italian_activations.append(ita_act)
    english_activations.append(eng_act)
    difference_vectors.append(ita_act - eng_act)

italian_activations = torch.stack(italian_activations)
english_activations = torch.stack(english_activations)
difference_vectors = torch.stack(difference_vectors)

print(f"\nShapes: Italian {italian_activations.shape}, English {english_activations.shape}, Diffs {difference_vectors.shape}")

In [None]:
# Extract random activations for comparison
print("Extracting activations for random pairs...")

random_pos_activations = []
random_neg_activations = []
random_difference_vectors = []

for i in range(30):
    print(f"Processing random pair {i+1}/30...", end="\r")
    pos_act = extract_activations(generate_random_text(30))
    neg_act = extract_activations(generate_random_text(30))
    
    random_pos_activations.append(pos_act)
    random_neg_activations.append(neg_act)
    random_difference_vectors.append(pos_act - neg_act)

random_pos_activations = torch.stack(random_pos_activations)
random_neg_activations = torch.stack(random_neg_activations)
random_difference_vectors = torch.stack(random_difference_vectors)

print(f"\nRandom shapes: Pos {random_pos_activations.shape}, Neg {random_neg_activations.shape}")

## Create Steering Vectors
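The steering vectors in this section are the mean of per-pair difference vectors, scaled to unit norm. As a toy illustration of that recipe on plain Python lists (the helper name `mean_difference_direction` and the numbers are made up, not taken from the model):

```python
import math

def mean_difference_direction(diffs):
    """Average equal-length difference vectors and scale the result to unit norm."""
    dim = len(diffs[0])
    mean = [sum(d[j] for d in diffs) / len(diffs) for j in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / (norm + 1e-8) for x in mean]  # same epsilon guard as the notebook

# Three toy "positive minus negative" differences that agree on the first axis
# and disagree on the second.
toy_diffs = [[2.0, 0.5], [2.0, -0.5], [2.0, 0.0]]
direction = mean_difference_direction(toy_diffs)
```

Averaging cancels the components on which the pairs disagree (the second axis here) and keeps what they share, which is the intuition behind using more pairs.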

In [None]:
# Create steering vectors with different numbers of pairs
def create_steering_vector_from_diffs(diffs: torch.Tensor, normalize: bool = True) -> torch.Tensor:
    """Create steering vector by averaging difference vectors."""
    vec = diffs.mean(dim=0)
    if normalize:
        vec = vec / (torch.norm(vec) + 1e-8)
    return vec

# Meaningful vectors
vec_1 = create_steering_vector_from_diffs(difference_vectors[:1])
vec_5 = create_steering_vector_from_diffs(difference_vectors[:5])
vec_10 = create_steering_vector_from_diffs(difference_vectors[:10])
vec_30 = create_steering_vector_from_diffs(difference_vectors[:30])

# Random vectors
vec_random_1 = create_steering_vector_from_diffs(random_difference_vectors[:1])
vec_random_30 = create_steering_vector_from_diffs(random_difference_vectors[:30])

print("Steering vectors created.")

---
# Analysis 1: Cosine Similarity
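One caveat when reading the similarities in this section: the pairs behind `vec_1`, `vec_5`, and `vec_10` are also part of `vec_30`, so some similarity comes from shared data rather than a shared direction. For independent isotropic noise in high dimensions, the cosine between the mean of the first k vectors and the mean of all n is roughly sqrt(k/n). The sketch below checks this on synthetic Gaussian noise; the sizes and helper names are assumptions for illustration.

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

rng = random.Random(0)
n, k, dim = 30, 10, 2000  # toy sizes; dim is large only to make the estimate stable
noise = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

# Cosine between the mean of a subset and the mean of the full set of pure noise.
overlap_cos = cosine(mean_vector(noise[:k]), mean_vector(noise))
expected = math.sqrt(k / n)  # noise-only baseline caused by the overlap alone
```

Under this noise-only baseline the expected values are about 0.18 for 1 of 30 pairs, 0.41 for 5, and 0.58 for 10. Similarities well above these suggest a genuinely shared direction; the "Random 1 vs Random 30" line is a useful empirical check of the same effect.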

In [None]:
print("="*60)
print("COSINE SIMILARITY ANALYSIS")
print("="*60)

print("\nSimilarity to 30-pair reference vector:")
print(f"  1 pair:  {cosine_sim(vec_1, vec_30):.4f}")
print(f"  5 pairs: {cosine_sim(vec_5, vec_30):.4f}")
print(f"  10 pairs: {cosine_sim(vec_10, vec_30):.4f}")

print("\nMeaningful vs Random:")
print(f"  Meaningful 30 vs Random 30: {cosine_sim(vec_30, vec_random_30):.4f}")
print(f"  Random 1 vs Random 30:      {cosine_sim(vec_random_1, vec_random_30):.4f}")

---
# Analysis 2: Projection Analysis
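The projection analysis splits a vector into a component along the reference direction and a component orthogonal to it. A small worked example on a 2-D toy vector (the function name `decompose` is an assumption; the arithmetic mirrors the cell in this section):

```python
import math

def decompose(vec, reference):
    """Return (signed projection length, magnitude of the orthogonal remainder)."""
    ref_norm = math.sqrt(sum(r * r for r in reference))
    ref_unit = [r / ref_norm for r in reference]
    projection = sum(v * u for v, u in zip(vec, ref_unit))  # signed length along reference
    parallel = [projection * u for u in ref_unit]
    orthogonal = [v - p for v, p in zip(vec, parallel)]
    orthogonal_mag = math.sqrt(sum(o * o for o in orthogonal))
    return projection, orthogonal_mag

# vec = (3, 4), reference along the x-axis: projection 3, orthogonal part 4.
projection, orthogonal_mag = decompose([3.0, 4.0], [2.0, 0.0])
total = math.hypot(3.0, 4.0)
pct_parallel = abs(projection) / total * 100
```

Here the vector is 60% parallel by norm while the orthogonal part is 80% of the norm: the two shares do not sum to 100 because it is the squares that add up (9 + 16 = 25). Since the notebook's vectors are unit-normalized, its "% parallel" equals the absolute cosine similarity times 100.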

In [None]:
print("="*60)
print("PROJECTION ANALYSIS")
print("="*60)
print("How much of each vector lies in the direction of the 30-pair vector?")

def projection_analysis(vec, reference):
    """Compute projection of vec onto reference direction."""
    # Projection magnitude (dot product with unit reference)
    ref_unit = reference / (torch.norm(reference) + 1e-8)
    projection = torch.dot(vec, ref_unit).item()
    
    # Orthogonal component magnitude
    parallel = projection * ref_unit
    orthogonal = vec - parallel
    orthogonal_mag = torch.norm(orthogonal).item()
    
    # Share of the vector's norm along the reference (|cosine| x 100, not a variance fraction)
    total_mag = torch.norm(vec).item()
    pct_parallel = (abs(projection) / total_mag) * 100 if total_mag > 0 else 0
    
    return projection, orthogonal_mag, pct_parallel

print("\nMeaningful pairs (projection onto 30-pair direction):")
for name, vec in [("1 pair", vec_1), ("5 pairs", vec_5), ("10 pairs", vec_10)]:
    proj, orth, pct = projection_analysis(vec, vec_30)
    print(f"  {name}: {pct:.1f}% parallel, projection={proj:.4f}, orthogonal={orth:.4f}")

print("\nRandom pairs:")
proj, orth, pct = projection_analysis(vec_random_1, vec_30)
print(f"  Random 1 onto Meaningful 30: {pct:.1f}% parallel")
proj, orth, pct = projection_analysis(vec_random_30, vec_30)
print(f"  Random 30 onto Meaningful 30: {pct:.1f}% parallel")

---
# Analysis 3: PCA / Variance Explained
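scikit-learn's `PCA` centers the data before decomposing it, so the explained-variance ratios in this section describe how the difference vectors vary around their mean, not how strongly they point along the mean itself. An uncentered complement is the share of total squared norm that lies along the mean direction. The sketch below is an assumption-labeled illustration on toy 2-D data; `fraction_along_mean` is a hypothetical helper.

```python
import math

def fraction_along_mean(diffs):
    """Share of total squared norm lying along the unit mean-difference direction."""
    mean = [sum(col) / len(diffs) for col in zip(*diffs)]
    norm = math.sqrt(sum(m * m for m in mean))
    unit = [m / norm for m in mean]
    along = sum(sum(x * u for x, u in zip(d, unit)) ** 2 for d in diffs)
    total = sum(sum(x * x for x in d) for d in diffs)
    return along / total

consistent = [[1.0, 0.1], [1.0, -0.1], [1.0, 0.0]]     # same direction, small jitter
inconsistent = [[1.0, 0.0], [-1.0, 0.2], [0.0, -1.0]]  # no shared direction
frac_consistent = fraction_along_mean(consistent)
frac_inconsistent = fraction_along_mean(inconsistent)
```

A value near 1 means almost all of the difference energy sits on the steering direction; the toy "inconsistent" set stays well below one half.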

In [None]:
print("="*60)
print("PCA ANALYSIS - VARIANCE EXPLAINED")
print("="*60)
print("Do the difference vectors share a dominant direction?")

# PCA on meaningful difference vectors
pca_meaningful = PCA(n_components=10)
pca_meaningful.fit(difference_vectors.numpy())

print("\nMeaningful pairs - variance explained by top PCs:")
cumsum = np.cumsum(pca_meaningful.explained_variance_ratio_)
for i in range(5):
    print(f"  PC{i+1}: {pca_meaningful.explained_variance_ratio_[i]*100:.2f}% (cumulative: {cumsum[i]*100:.2f}%)")

# PCA on random difference vectors
pca_random = PCA(n_components=10)
pca_random.fit(random_difference_vectors.numpy())

print("\nRandom pairs - variance explained by top PCs:")
cumsum_random = np.cumsum(pca_random.explained_variance_ratio_)
for i in range(5):
    print(f"  PC{i+1}: {pca_random.explained_variance_ratio_[i]*100:.2f}% (cumulative: {cumsum_random[i]*100:.2f}%)")

print("\nInterpretation: PCA is mean-centered, so a high PC1 share means the pair-to-pair variation around the mean difference is concentrated along one axis")

---
# Analysis 4: Clustering (Italian vs English Separation)
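K-means assigns arbitrary cluster ids, so the accuracy in this section takes the better of the two possible label assignments. A minimal sketch of that label-swap-invariant accuracy for any number of samples (the function name `two_cluster_accuracy` is an assumption):

```python
def two_cluster_accuracy(predicted, true):
    """Accuracy for two clusters, invariant to swapping cluster ids 0 and 1."""
    if len(predicted) != len(true):
        raise ValueError("label lists must have the same length")
    n = len(true)
    matches = sum(1 for p, t in zip(predicted, true) if p == t)
    return max(matches, n - matches) / n

true = [0, 0, 0, 1, 1, 1]
flipped = [1, 1, 1, 0, 0, 0]    # perfect clustering, ids swapped
one_wrong = [0, 0, 1, 1, 1, 1]  # one Italian sample placed in the English cluster

acc_flipped = two_cluster_accuracy(flipped, true)
acc_one_wrong = two_cluster_accuracy(one_wrong, true)
```

Because of the `max`, this metric can never fall below 50%, so a value near 50% means no separation at all rather than half-correct clustering.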

In [None]:
print("="*60)
print("CLUSTERING ANALYSIS")
print("="*60)
print("How well do Italian and English activations separate?")

# Combine Italian and English activations
all_activations = torch.cat([italian_activations, english_activations], dim=0).numpy()
true_labels = [0] * len(italian_activations) + [1] * len(english_activations)  # 0 = Italian, 1 = English

# K-means clustering
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
predicted_labels = kmeans.fit_predict(all_activations)

# Silhouette score (measures cluster separation)
silhouette = silhouette_score(all_activations, true_labels)
silhouette_predicted = silhouette_score(all_activations, predicted_labels)

print(f"\nMeaningful pairs (Italian vs English):")
print(f"  Silhouette score (true labels): {silhouette:.4f}")
print(f"  Silhouette score (K-means):     {silhouette_predicted:.4f}")

# Clustering accuracy
matches = sum(1 for p, t in zip(predicted_labels, true_labels) if p == t)
accuracy = max(matches, len(true_labels) - matches) / len(true_labels)  # Account for label swap
print(f"  K-means clustering accuracy:    {accuracy*100:.1f}%")

# Same for random
all_random = torch.cat([random_pos_activations, random_neg_activations], dim=0).numpy()
random_true_labels = [0] * 30 + [1] * 30
silhouette_random = silhouette_score(all_random, random_true_labels)

print(f"\nRandom pairs:")
print(f"  Silhouette score (true labels): {silhouette_random:.4f}")
print("\nInterpretation: Higher silhouette = better separation between groups")

---
# Analysis 5: Per-Pair Contribution
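The per-pair alignments computed in this section can also be used to flag candidate outliers for manual review. One simple rule, sketched here as an assumption and not as the notebook's method, marks pairs whose alignment falls more than `k` standard deviations below the mean. The alignment values are made up.

```python
import statistics

def flag_low_alignment(alignments, k=2.0):
    """Indices whose alignment is more than k population std devs below the mean."""
    mean = statistics.fmean(alignments)
    std = statistics.pstdev(alignments)
    threshold = mean - k * std
    return [i for i, a in enumerate(alignments) if a < threshold]

toy_alignments = [0.80, 0.82, 0.79, 0.81, 0.80, 0.10]  # made-up values; the last is off
flagged = flag_low_alignment(toy_alignments)
```

With only 30 pairs, a single extreme value inflates the standard deviation, so the threshold `k` is a judgment call; flagged pairs are worth inspecting by hand before being dropped.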

In [None]:
print("="*60)
print("PER-PAIR CONTRIBUTION ANALYSIS")
print("="*60)
print("Which pairs contribute most/least to the final vector?")

# Compute alignment of each pair's difference vector with the mean
mean_direction = vec_30  # Our reference direction

pair_contributions = []
for i, diff in enumerate(difference_vectors):
    alignment = cosine_sim(diff, mean_direction)
    magnitude = torch.norm(diff).item()
    pair_contributions.append((i, alignment, magnitude, contrastive_pairs_data[i][0]))

# Sort by alignment
pair_contributions.sort(key=lambda x: x[1], reverse=True)

print("\nTop 5 most aligned pairs (contribute most to mean direction):")
for i, align, mag, prompt in pair_contributions[:5]:
    print(f"  Pair {i} ({prompt}): alignment={align:.4f}, magnitude={mag:.4f}")

print("\nBottom 5 least aligned pairs (contribute least/opposite):")
for i, align, mag, prompt in pair_contributions[-5:]:
    print(f"  Pair {i} ({prompt}): alignment={align:.4f}, magnitude={mag:.4f}")

# Statistics
alignments = [x[1] for x in pair_contributions]
print(f"\nAlignment statistics:")
print(f"  Mean: {np.mean(alignments):.4f}")
print(f"  Std:  {np.std(alignments):.4f}")
print(f"  Min:  {np.min(alignments):.4f}")
print(f"  Max:  {np.max(alignments):.4f}")

---
# Analysis 6: Activation Magnitude Differences
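The coefficient of variation (CV) used in this section is the standard deviation divided by the mean, which makes it independent of the overall scale of the magnitudes. A small worked example with made-up magnitudes (`coefficient_of_variation` is a hypothetical helper; it uses the population standard deviation to match `np.std`'s default):

```python
import statistics

def coefficient_of_variation(values):
    """Population std divided by mean, matching np.std's default (ddof=0)."""
    return statistics.pstdev(values) / statistics.fmean(values)

toy_magnitudes = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, population std 2
cv = coefficient_of_variation(toy_magnitudes)
rescaled_cv = coefficient_of_variation([10 * m for m in toy_magnitudes])
```

Both calls give 0.4: multiplying every magnitude by 10 leaves the CV unchanged, which is why it is a fairer comparison than raw standard deviations when meaningful and random differences have different typical norms.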

In [None]:
print("="*60)
print("ACTIVATION MAGNITUDE ANALYSIS")
print("="*60)
print("Are meaningful pairs producing larger/more consistent differences?")

# Magnitude of difference vectors
meaningful_magnitudes = torch.norm(difference_vectors, dim=1).numpy()
random_magnitudes = torch.norm(random_difference_vectors, dim=1).numpy()

print("\nDifference vector magnitudes:")
print(f"  Meaningful pairs:")
print(f"    Mean: {np.mean(meaningful_magnitudes):.4f}")
print(f"    Std:  {np.std(meaningful_magnitudes):.4f}")
print(f"    CV:   {np.std(meaningful_magnitudes)/np.mean(meaningful_magnitudes):.4f} (coefficient of variation)")

print(f"  Random pairs:")
print(f"    Mean: {np.mean(random_magnitudes):.4f}")
print(f"    Std:  {np.std(random_magnitudes):.4f}")
print(f"    CV:   {np.std(random_magnitudes)/np.mean(random_magnitudes):.4f}")

print("\nInterpretation: Lower CV = more consistent differences across pairs")

---
# Analysis 7: Cross-Validation
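The cross-validation in this section hard-codes 5 folds of 6 over 30 pairs. A sketch of the same contiguous split for arbitrary sizes follows; `contiguous_folds` is a hypothetical helper, and spreading the remainder over the first folds is one possible choice when the count does not divide evenly.

```python
def contiguous_folds(n_items, n_folds):
    """Yield (train_indices, test_indices) for contiguous, near-equal folds."""
    base, extra = divmod(n_items, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # early folds absorb the remainder
        test = list(range(start, start + size))
        held_out = set(test)
        train = [i for i in range(n_items) if i not in held_out]
        yield train, test
        start += size

folds = list(contiguous_folds(30, 5))          # reproduces the notebook's 5 x 6 split
uneven_sizes = [len(test) for _, test in contiguous_folds(32, 5)]
```

Because the pairs in this notebook are grouped by topic (greetings, food, places, and so on), contiguous folds hold out whole topics at a time; shuffling the indices first would test a different kind of generalization.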

In [None]:
print("="*60)
print("CROSS-VALIDATION ANALYSIS")
print("="*60)
print("Train on subset, test similarity on held-out pairs")

# 5-fold cross-validation
n_folds = 5
fold_size = 6

meaningful_cv_scores = []
random_cv_scores = []

for fold in range(n_folds):
    # Split indices
    test_start = fold * fold_size
    test_end = test_start + fold_size
    test_indices = list(range(test_start, test_end))
    train_indices = [i for i in range(30) if i not in test_indices]
    
    # Train vector (on train set)
    train_vec = create_steering_vector_from_diffs(difference_vectors[train_indices])
    
    # Test: average cosine similarity with held-out pairs
    test_sims = [cosine_sim(difference_vectors[i], train_vec) for i in test_indices]
    meaningful_cv_scores.append(np.mean(test_sims))
    
    # Same for random
    train_vec_random = create_steering_vector_from_diffs(random_difference_vectors[train_indices])
    test_sims_random = [cosine_sim(random_difference_vectors[i], train_vec_random) for i in test_indices]
    random_cv_scores.append(np.mean(test_sims_random))

print(f"\n5-Fold Cross-Validation Results:")
print(f"  Meaningful pairs:")
print(f"    Mean CV score: {np.mean(meaningful_cv_scores):.4f}")
print(f"    Std:           {np.std(meaningful_cv_scores):.4f}")
print(f"  Random pairs:")
print(f"    Mean CV score: {np.mean(random_cv_scores):.4f}")
print(f"    Std:           {np.std(random_cv_scores):.4f}")

print("\nInterpretation: Higher CV score = vector generalizes to unseen pairs")

---
# Analysis 8: Signal-to-Noise Ratio
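The SNR in this section divides the norm of the difference between class means by the average per-dimension standard deviation within each class. A worked toy example on plain lists (`toy_snr` is a hypothetical helper that mirrors `compute_snr`; it uses the sample standard deviation, which is the default of `torch.Tensor.std`):

```python
import math
import statistics

def toy_snr(pos, neg):
    """Norm of the class-mean difference over the average per-dimension std."""
    mean_pos = [statistics.fmean(col) for col in zip(*pos)]
    mean_neg = [statistics.fmean(col) for col in zip(*neg)]
    signal = math.sqrt(sum((p - q) ** 2 for p, q in zip(mean_pos, mean_neg)))
    # Sample std (n - 1 denominator) per dimension, averaged over dimensions.
    noise_pos = statistics.fmean([statistics.stdev(col) for col in zip(*pos)])
    noise_neg = statistics.fmean([statistics.stdev(col) for col in zip(*neg)])
    noise = (noise_pos + noise_neg) / 2
    return signal, noise, signal / (noise + 1e-8)

# Class means are (2, 0) and (-2, 0); only the first dimension varies.
signal, noise, snr = toy_snr([[1.0, 0.0], [3.0, 0.0]], [[-1.0, 0.0], [-3.0, 0.0]])
```

Note that the signal is a norm taken over all dimensions while the noise is a per-dimension average, so the absolute SNR depends on the hidden size. The ratio between meaningful and random pairs, both measured at the same layer, is the more interpretable number.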

In [None]:
print("="*60)
print("SIGNAL-TO-NOISE RATIO ANALYSIS")
print("="*60)
print("Signal = norm of the difference between class means, Noise = mean per-dimension std within each class")

def compute_snr(pos_activations, neg_activations):
    """Compute signal-to-noise ratio."""
    # Signal: magnitude of mean difference
    mean_pos = pos_activations.mean(dim=0)
    mean_neg = neg_activations.mean(dim=0)
    signal = torch.norm(mean_pos - mean_neg).item()
    
    # Noise: average within-class standard deviation
    noise_pos = pos_activations.std(dim=0).mean().item()
    noise_neg = neg_activations.std(dim=0).mean().item()
    noise = (noise_pos + noise_neg) / 2
    
    snr = signal / (noise + 1e-8)
    return signal, noise, snr

signal_m, noise_m, snr_m = compute_snr(italian_activations, english_activations)
signal_r, noise_r, snr_r = compute_snr(random_pos_activations, random_neg_activations)

print(f"\nMeaningful pairs (Italian vs English):")
print(f"  Signal (mean diff magnitude): {signal_m:.4f}")
print(f"  Noise (within-class std):     {noise_m:.4f}")
print(f"  SNR:                          {snr_m:.4f}")

print(f"\nRandom pairs:")
print(f"  Signal (mean diff magnitude): {signal_r:.4f}")
print(f"  Noise (within-class std):     {noise_r:.4f}")
print(f"  SNR:                          {snr_r:.4f}")

print(f"\nSNR Ratio (Meaningful/Random): {snr_m/snr_r:.2f}x")
print("\nInterpretation: Higher SNR = cleaner separation, better steering vector")

---
# Summary

In [None]:
print("="*60)
print("SUMMARY OF ALL ANALYSES")
print("="*60)

print("""
Expected patterns (confirm each against the numbers printed above):

1. COSINE SIMILARITY
   - More pairs → higher similarity to the 30-pair reference (partly because
     the smaller subsets are contained in the reference)
   - Random vectors should have low similarity to meaningful vectors

2. PROJECTION ANALYSIS
   - More pairs → higher % parallel to the reference direction
   - Random pairs should have low projection onto the meaningful direction

3. PCA / VARIANCE EXPLAINED
   - Meaningful pairs: variance expected to concentrate in the first PCs
     (PCA is mean-centered, so this describes variation around the mean difference)
   - Random pairs: variance expected to spread over many components

4. CLUSTERING
   - Italian/English activations should separate well (high silhouette)
   - Random pos/neg should not separate (silhouette near zero)

5. PER-PAIR CONTRIBUTION
   - Most pairs should align with the mean direction
   - Low-alignment pairs are candidates for quality control

6. ACTIVATION MAGNITUDES
   - Meaningful pairs: more consistent difference magnitudes (lower CV)
   - Random pairs: more variable magnitudes (higher CV)

7. CROSS-VALIDATION
   - Meaningful vectors should generalize to held-out pairs
   - Random vectors should not (held-out similarity near zero)

8. SIGNAL-TO-NOISE RATIO
   - Meaningful pairs should have higher SNR (clear signal)
   - Random pairs should have lower SNR (noise dominates)
""")