<a href="https://colab.research.google.com/github/yilmajung/belief_and_llms_v0/blob/main/3_1_contrastive_steering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Phase 3.1: Contrastive Steering Vectors

**Problem with Current Approach:**
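The current steering vector `v_Republican = mean(Republican_prompts) - mean(Baseline_prompts)` encodes "Republican vs generic person". Adding it tends to increase similarity to ALL demographics, presumably because every demographic vector built this way shares the same "specific persona vs baseline" component.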

**Proposed Solution:**
Use contrastive vectors: `v_contrastive = v_Republican - v_Democrat`

This encodes "Republican vs Democrat" directly, which should produce:
- **Positive delta** for Republican, Conservative
- **Negative delta** for Democrat, Liberal
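
If `v_Democrat` is built the same way as `v_Republican` (group mean minus the same baseline mean), the baseline term cancels in the subtraction, so the contrastive vector reduces to `mean(Republican_prompts) - mean(Democrat_prompts)`. The sketch below checks this numerically on random stand-in activations. The array names and sizes are illustrative assumptions, not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 16  # toy size; the real vectors are model-sized

# Stand-ins for last-token activations of each prompt group (rows = prompts)
rep_acts = rng.normal(size=(8, hidden_size))
dem_acts = rng.normal(size=(8, hidden_size))
base_acts = rng.normal(size=(8, hidden_size))

v_republican = rep_acts.mean(axis=0) - base_acts.mean(axis=0)
v_democrat = dem_acts.mean(axis=0) - base_acts.mean(axis=0)

# The baseline mean appears in both terms and cancels
v_contrastive = v_republican - v_democrat
direct_difference = rep_acts.mean(axis=0) - dem_acts.mean(axis=0)

print(np.allclose(v_contrastive, direct_difference))
```

Any component shared by both original vectors (including the baseline offset) is removed by the subtraction, which is why the contrastive vector can push the two groups in opposite directions.
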

**Key Questions:**
1. Do contrastive vectors produce oppositional effects (some up, some down)?
2. Which layer produces the clearest oppositional pattern?
3. How does contrastive steering compare to original steering?

## 1. Setup & Load Vectors

In [None]:
!pip install -q -U bitsandbytes

In [None]:
import torch
import torch.nn.functional as F
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [None]:
# Link to Google Drive
from google.colab import drive
drive.mount("/content/drive")

In [None]:
# Configuration
BASE_DIR = "/content/drive/MyDrive/belief_and_llms_v0"
VECTOR_DIR = os.path.join(BASE_DIR, "vectors")
LAYERS = list(range(5, 21))  # Layers 5-20

# Load all layer vectors
all_layer_vectors = {}
for layer in LAYERS:
    path = os.path.join(VECTOR_DIR, f"gss_demographic_vectors_layer{layer}.pt")
    if os.path.exists(path):
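        # Load onto CPU: activations are moved to CPU before the cosine-similarity comparisons below
        all_layer_vectors[layer] = torch.load(path, map_location="cpu")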
        print(f"Loaded Layer {layer}: {len(all_layer_vectors[layer])} vectors")
    else:
        print(f"WARNING: Layer {layer} vectors not found at {path}")

print(f"\nLoaded vectors for {len(all_layer_vectors)} layers.")

In [None]:
# Inspect available demographics
sample_layer = list(all_layer_vectors.keys())[0]
demographics = list(all_layer_vectors[sample_layer].keys())
print(f"Available demographics ({len(demographics)} total):")
for demo in sorted(demographics):
    print(f"  - {demo}")

## 2. Create Contrastive Vectors

Define contrastive pairs and compute contrastive vectors for each layer.

In [None]:
# Define contrastive pairs: (name, positive_demo, negative_demo)
# Contrastive vector = positive_demo - negative_demo
CONTRASTIVE_PAIRS = [
    ("Republican_vs_Democrat", "PartyID_Strong Republican", "PartyID_Strong Democrat"),
    ("Conservative_vs_Liberal", "PolViews_person with a conservative political view", "PolViews_person with a liberal political view"),
    ("White_vs_Black", "Race_White person", "Race_Black person"),
    ("Boomer_vs_Millennial", "Generation_Baby Boomer", "Generation_Millennial"),
]

print("Contrastive pairs to create:")
for name, pos, neg in CONTRASTIVE_PAIRS:
    print(f"  {name}: {pos} - {neg}")

In [None]:
def create_contrastive_vectors(layer_vectors, contrastive_pairs):
    """
    Create contrastive vectors from pairs of demographic vectors.
    
    Contrastive vector = positive_demo_vector - negative_demo_vector
    
    Args:
        layer_vectors: Dictionary of demographic vectors for one layer
        contrastive_pairs: List of (name, positive_demo, negative_demo) tuples
    
    Returns:
        Dictionary mapping contrastive pair names to their vectors
    """
    contrastive_vectors = {}
    
    for name, pos_demo, neg_demo in contrastive_pairs:
        if pos_demo not in layer_vectors:
            print(f"WARNING: {pos_demo} not found, skipping {name}")
            continue
        if neg_demo not in layer_vectors:
            print(f"WARNING: {neg_demo} not found, skipping {name}")
            continue
        
        pos_vec = layer_vectors[pos_demo]['vector']
        neg_vec = layer_vectors[neg_demo]['vector']
        
        # Contrastive vector = positive - negative
        contrastive_vec = pos_vec - neg_vec
        
        contrastive_vectors[name] = {
            'vector': contrastive_vec,
            'magnitude': contrastive_vec.norm().item(),
            'positive_demo': pos_demo,
            'negative_demo': neg_demo
        }
    
    return contrastive_vectors

# Create contrastive vectors for all layers
all_contrastive_vectors = {}
for layer in all_layer_vectors.keys():
    all_contrastive_vectors[layer] = create_contrastive_vectors(
        all_layer_vectors[layer], 
        CONTRASTIVE_PAIRS
    )
    
print(f"\nCreated contrastive vectors for {len(all_contrastive_vectors)} layers.")
print(f"Contrastive pairs per layer: {len(all_contrastive_vectors[sample_layer])}")

In [None]:
# Visualize contrastive vector magnitudes across layers
magnitude_data = []
for layer in sorted(all_contrastive_vectors.keys()):
    for name, data in all_contrastive_vectors[layer].items():
        magnitude_data.append({
            'layer': layer,
            'contrastive_pair': name,
            'magnitude': data['magnitude']
        })

mag_df = pd.DataFrame(magnitude_data)

plt.figure(figsize=(12, 6))
for pair_name in mag_df['contrastive_pair'].unique():
    subset = mag_df[mag_df['contrastive_pair'] == pair_name]
    plt.plot(subset['layer'], subset['magnitude'], 'o-', label=pair_name, linewidth=2, markersize=8)

plt.xlabel('Layer', fontsize=12)
plt.ylabel('Contrastive Vector Magnitude', fontsize=12)
plt.title('Contrastive Vector Magnitude by Layer', fontsize=14)
plt.legend(loc='upper left')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
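
When reading the magnitude plot, it may help to recall that the norm of a difference depends on both norms and on the angle between the two vectors: `||a - b||^2 = ||a||^2 + ||b||^2 - 2 (a . b)`. A small contrastive magnitude therefore suggests that the two demographic vectors point in similar directions. The steering experiments below inject `strength * vector`, so the effective perturbation also scales with this magnitude. The sketch verifies the identity on toy vectors and shows one optional way to make `strength` comparable across layers by unit-normalizing. `unit_normalize` is a hypothetical helper that is not part of this notebook.

```python
import numpy as np

def unit_normalize(vec, eps=1e-12):
    """Return vec scaled to unit L2 norm (hypothetical helper)."""
    norm = np.linalg.norm(vec)
    return vec / max(norm, eps)

a = np.array([3.0, 0.0, 4.0])   # stand-in for a positive-demo vector
b = np.array([1.0, 2.0, 2.0])   # stand-in for a negative-demo vector
diff = a - b

# Check ||a - b||^2 = ||a||^2 + ||b||^2 - 2 (a . b)
lhs = np.linalg.norm(diff) ** 2
rhs = np.linalg.norm(a) ** 2 + np.linalg.norm(b) ** 2 - 2 * np.dot(a, b)
print(np.isclose(lhs, rhs))

# After normalization, the injected perturbation has norm equal to strength
unit_diff = unit_normalize(diff)
strength = 2.0
print(np.linalg.norm(strength * unit_diff))
```

Whether to normalize is a design choice: unnormalized vectors keep the natural scale of each layer, while normalized vectors make a fixed `strength` mean the same perturbation size everywhere.
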

## 3. Load Model for Steering Experiments

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

# Configure 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

print("Model loaded successfully.")
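
Before injecting anything, it can be worth confirming that each loaded vector matches the model's hidden size and that every layer index exists in the model. The sketch below is one way to do that. `find_vector_mismatches` is a hypothetical helper, and the toy dictionary only mimics the `{layer: {label: {'vector': ...}}}` structure used in this notebook.

```python
import numpy as np

def find_vector_mismatches(all_layer_vectors, hidden_size, num_layers):
    """List (layer, label, reason) for vectors that cannot be injected (hypothetical helper)."""
    problems = []
    for layer, vectors in all_layer_vectors.items():
        if not 0 <= layer < num_layers:
            problems.append((layer, None, "layer index out of range"))
            continue
        for label, data in vectors.items():
            if data["vector"].shape[-1] != hidden_size:
                problems.append((layer, label, "wrong hidden size"))
    return problems

# Toy data: NumPy arrays stand in for tensors, since only .shape is inspected
toy_vectors = {
    5: {"A": {"vector": np.zeros(8)}, "B": {"vector": np.zeros(4)}},
    40: {"A": {"vector": np.zeros(8)}},
}
print(find_vector_mismatches(toy_vectors, hidden_size=8, num_layers=32))
```

In this notebook the call would look like `find_vector_mismatches(all_layer_vectors, model.config.hidden_size, model.config.num_hidden_layers)`, and an empty list would mean every vector is usable.
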

## 4. Steering Functions

In [None]:
def get_activations_with_steering(model, tokenizer, text, layer_idx, steering_vector=None, strength=0.0):
    """
    Get hidden state activations with optional steering vector injection.

    Args:
        model: The LLM model
        tokenizer: Tokenizer
        text: Input text
        layer_idx: Which layer to extract/inject
        steering_vector: Optional steering vector to inject
        strength: Injection strength multiplier

    Returns:
        Hidden state at the last token position of the chosen layer's output
        (after steering, if applied), as a CPU tensor
    """
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    captured_hidden = None

    def hook_fn(module, input, output):
        nonlocal captured_hidden

        if isinstance(output, tuple):
            h_states = output[0]
        else:
            h_states = output

        # Apply steering if provided
        if steering_vector is not None and strength != 0.0:
            steer = steering_vector.to(h_states.device).to(h_states.dtype)
            # Add steering vector to all token positions
            h_states = h_states + strength * steer.unsqueeze(0).unsqueeze(0)

        if h_states.dim() == 3:
            captured_hidden = h_states[0, -1, :].detach().cpu()
        elif h_states.dim() == 2:
            captured_hidden = h_states[-1, :].detach().cpu()

        # Return modified output if steering was applied
        if steering_vector is not None and strength != 0.0:
            if isinstance(output, tuple):
                return (h_states,) + output[1:]
            return h_states
        return output

    layer = model.model.layers[layer_idx]
    handle = layer.register_forward_hook(hook_fn)

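    # Remove the hook even if the forward pass raises, so it cannot leak into later calls
    try:
        with torch.no_grad():
            model(**inputs)
    finally:
        handle.remove()
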
    return captured_hidden
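
The function above relies on a PyTorch behavior that is easy to miss: when a forward hook returns a value other than `None`, that value replaces the module's output. The toy sketch below isolates that mechanism, with an `nn.Linear` standing in for a decoder layer. The names `steering_hook`, `steer`, and `captured` are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 4)          # toy stand-in for a decoder layer
steer = torch.ones(4)            # toy steering vector
strength = 2.0
captured = {}

def steering_hook(module, inputs, output):
    # Returning a non-None value from a forward hook replaces the output
    steered = output + strength * steer
    captured["last_token"] = steered[0, -1, :].detach().clone()
    return steered

x = torch.randn(1, 3, 4)         # (batch, seq_len, hidden)
with torch.no_grad():
    plain = layer(x)

handle = layer.register_forward_hook(steering_hook)
try:
    with torch.no_grad():
        hooked = layer(x)
finally:
    handle.remove()              # always detach the hook, even on error

print(torch.allclose(hooked, plain + strength * steer))
```

The steering vector of shape `(hidden,)` broadcasts over the batch and sequence dimensions, so every token position receives the same shift.
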

In [None]:
def measure_contrastive_steering_effect(model, tokenizer, layer_vectors, contrastive_vec_data,
                                         layer_idx, strength=2.0):
    """
    Measure how steering with a CONTRASTIVE vector affects similarities to all demographics.

    Args:
        layer_vectors: Dictionary of demographic vectors for this layer
        contrastive_vec_data: Dict with 'vector', 'positive_demo', 'negative_demo'
        layer_idx: Layer to inject steering vector
        strength: Steering strength

    Returns:
        DataFrame with baseline and steered similarities for each demographic
    """
    steering_vec = contrastive_vec_data['vector']

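    # Test prompt (neutral).
    # Note: [INST] ... [/INST] is the Llama-2-style instruction format, not the Llama-3 chat template.
    # It is kept here on the assumption that the demographic vectors were extracted with the same format.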
    test_prompt = "[INST] You are a person living in the United States. What are your thoughts on current events? [/INST]"

    # Get baseline activation (no steering)
    baseline_act = get_activations_with_steering(
        model, tokenizer, test_prompt, layer_idx,
        steering_vector=None, strength=0.0
    )

    # Get steered activation
    steered_act = get_activations_with_steering(
        model, tokenizer, test_prompt, layer_idx,
        steering_vector=steering_vec, strength=strength
    )

    # Measure similarity to all demographic vectors
    results = []

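    for label, data in layer_vectors.items():
        # Compare in float32 on CPU: the captured activations are float16, and
        # mixing dtypes or devices in cosine_similarity can fail or lose precision
        demo_vec = data['vector'].detach().float().cpu()

        # Baseline similarity
        baseline_sim = F.cosine_similarity(
            baseline_act.float().unsqueeze(0), demo_vec.unsqueeze(0)
        ).item()

        # Steered similarity
        steered_sim = F.cosine_similarity(
            steered_act.float().unsqueeze(0), demo_vec.unsqueeze(0)
        ).item()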

        # Change
        delta = steered_sim - baseline_sim

        # Mark if this is the positive or negative demo
        role = 'neutral'
        if label == contrastive_vec_data['positive_demo']:
            role = 'positive'
        elif label == contrastive_vec_data['negative_demo']:
            role = 'negative'

        results.append({
            'demographic': label,
            'category': label.split('_')[0],
            'baseline_sim': baseline_sim,
            'steered_sim': steered_sim,
            'delta': delta,
            'role': role,
            'layer': layer_idx
        })

    return pd.DataFrame(results)
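
Because the hook captures the hidden state at the same layer where the vector is injected, the steered activation in this measurement equals `baseline_act + strength * steering_vec` (up to numerical rounding). Each delta is therefore a purely geometric quantity, `cos(h + strength * v, d) - cos(h, d)`, and does not yet reflect how later layers respond to the steering. The sketch below computes it for toy 2-D vectors. `steered_cosine_delta` is a hypothetical helper, not part of this notebook.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D arrays."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def steered_cosine_delta(h, v, d, strength):
    """Change in cos(h, d) after adding strength * v to h (hypothetical helper)."""
    return cosine(h + strength * v, d) - cosine(h, d)

h = np.array([1.0, 0.0])        # toy baseline activation
v = np.array([0.0, 1.0])        # toy contrastive vector (positive - negative)
d_pos = np.array([0.0, 1.0])    # toy demographic aligned with v
d_neg = np.array([0.0, -1.0])   # toy demographic opposed to v

delta_pos = steered_cosine_delta(h, v, d_pos, strength=1.0)
delta_neg = steered_cosine_delta(h, v, d_neg, strength=1.0)
print(round(delta_pos, 4), round(delta_neg, 4))  # 0.7071 -0.7071
```

A demographic vector aligned with the steering direction gains similarity and an opposed one loses it, which is the oppositional pattern the experiments below look for.
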

In [None]:
def measure_original_steering_effect(model, tokenizer, layer_vectors,
                                      steering_label, layer_idx, strength=2.0):
    """
    Measure how steering with ORIGINAL (non-contrastive) vector affects similarities.
    
    This is for comparison with contrastive steering.
    """
    steering_vec = layer_vectors[steering_label]['vector']

    test_prompt = "[INST] You are a person living in the United States. What are your thoughts on current events? [/INST]"

    baseline_act = get_activations_with_steering(
        model, tokenizer, test_prompt, layer_idx,
        steering_vector=None, strength=0.0
    )

    steered_act = get_activations_with_steering(
        model, tokenizer, test_prompt, layer_idx,
        steering_vector=steering_vec, strength=strength
    )

    results = []

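    for label, data in layer_vectors.items():
        # Compare in float32 on CPU: the captured activations are float16, and
        # mixing dtypes or devices in cosine_similarity can fail or lose precision
        demo_vec = data['vector'].detach().float().cpu()

        baseline_sim = F.cosine_similarity(
            baseline_act.float().unsqueeze(0), demo_vec.unsqueeze(0)
        ).item()

        steered_sim = F.cosine_similarity(
            steered_act.float().unsqueeze(0), demo_vec.unsqueeze(0)
        ).item()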

        delta = steered_sim - baseline_sim

        results.append({
            'demographic': label,
            'category': label.split('_')[0],
            'baseline_sim': baseline_sim,
            'steered_sim': steered_sim,
            'delta': delta,
            'layer': layer_idx
        })

    return pd.DataFrame(results)

## 5. Contrastive Steering Experiments

In [None]:
# Test contrastive steering: Republican vs Democrat
TEST_LAYER = 13  # Based on delta magnitude analysis
STRENGTH = 2.0
CONTRASTIVE_PAIR = "Republican_vs_Democrat"

print(f"Testing Contrastive Steering: {CONTRASTIVE_PAIR}")
print(f"Layer: {TEST_LAYER}, Strength: {STRENGTH}")

contrastive_results = measure_contrastive_steering_effect(
    model, tokenizer,
    all_layer_vectors[TEST_LAYER],
    all_contrastive_vectors[TEST_LAYER][CONTRASTIVE_PAIR],
    layer_idx=TEST_LAYER,
    strength=STRENGTH
)

# Sort by delta
contrastive_results = contrastive_results.sort_values('delta', ascending=False)

# The rows are sorted by delta, so head/tail give the extremes; their signs are not guaranteed
print("\n=== 10 Largest Deltas ===")
print(contrastive_results.head(10)[['demographic', 'delta', 'role']])

print("\n=== 10 Smallest Deltas ===")
print(contrastive_results.tail(10)[['demographic', 'delta', 'role']])

In [None]:
# Visualize contrastive steering effects
plt.figure(figsize=(14, 10))

# Color by role and category
def get_color(row):
    if row['role'] == 'positive':
        return 'darkgreen'
    elif row['role'] == 'negative':
        return 'darkred'
    else:
        category_colors = {
            'PartyID': 'blue',
            'PolViews': 'red',
            'Race': 'green',
            'Religion': 'purple',
            'Degree': 'orange',
            'Generation': 'brown',
            'Sex': 'gray'
        }
        return category_colors.get(row['category'], 'black')

colors = contrastive_results.apply(get_color, axis=1)

plt.barh(
    contrastive_results['demographic'],
    contrastive_results['delta'],
    color=colors
)

plt.axvline(x=0, color='black', linestyle='--', linewidth=2)
plt.xlabel('Similarity Change (Delta)', fontsize=12)
plt.ylabel('Demographic', fontsize=12)
plt.title(f'Contrastive Steering: {CONTRASTIVE_PAIR}\n(Layer {TEST_LAYER}, Strength {STRENGTH})\n[Dark green = positive demo, Dark red = negative demo]', fontsize=14)
plt.tight_layout()
plt.show()

# Count positive vs negative deltas
n_positive = (contrastive_results['delta'] > 0).sum()
n_negative = (contrastive_results['delta'] < 0).sum()
print(f"\nDemographics with positive delta: {n_positive}")
print(f"Demographics with negative delta: {n_negative}")

## 6. Compare Original vs Contrastive Steering

In [None]:
# Get original steering results for comparison
ORIGINAL_STEERING_DEMO = "PartyID_Strong Republican"

print(f"Comparing Original ({ORIGINAL_STEERING_DEMO}) vs Contrastive ({CONTRASTIVE_PAIR})")

original_results = measure_original_steering_effect(
    model, tokenizer,
    all_layer_vectors[TEST_LAYER],
    steering_label=ORIGINAL_STEERING_DEMO,
    layer_idx=TEST_LAYER,
    strength=STRENGTH
)

original_results = original_results.sort_values('delta', ascending=False)

In [None]:
# Side-by-side comparison
fig, axes = plt.subplots(1, 2, figsize=(20, 12))

# Original steering
original_sorted = original_results.sort_values('delta', ascending=True)
axes[0].barh(
    original_sorted['demographic'],
    original_sorted['delta'],
    color='steelblue'
)
axes[0].axvline(x=0, color='black', linestyle='--', linewidth=2)
axes[0].set_xlabel('Delta', fontsize=12)
axes[0].set_title(f'ORIGINAL Steering\n{ORIGINAL_STEERING_DEMO}', fontsize=14)
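# Widen the x-range on both sides so that large negative deltas are not clipped
axes[0].set_xlim(
    min(original_sorted['delta'].min() * 1.1, -0.1),
    max(original_sorted['delta'].max() * 1.1, 0.35)
)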

# Contrastive steering
contrastive_sorted = contrastive_results.sort_values('delta', ascending=True)
colors = contrastive_sorted.apply(get_color, axis=1)
axes[1].barh(
    contrastive_sorted['demographic'],
    contrastive_sorted['delta'],
    color=colors
)
axes[1].axvline(x=0, color='black', linestyle='--', linewidth=2)
axes[1].set_xlabel('Delta', fontsize=12)
axes[1].set_title(f'CONTRASTIVE Steering\n{CONTRASTIVE_PAIR}', fontsize=14)

plt.suptitle(f'Original vs Contrastive Steering (Layer {TEST_LAYER}, Strength {STRENGTH})', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()

# Summary statistics
print("\n=== COMPARISON SUMMARY ===")
print(f"\nOriginal Steering ({ORIGINAL_STEERING_DEMO}):")
print(f"  Positive deltas: {(original_results['delta'] > 0).sum()}")
print(f"  Negative deltas: {(original_results['delta'] < 0).sum()}")
print(f"  Mean delta: {original_results['delta'].mean():.4f}")

print(f"\nContrastive Steering ({CONTRASTIVE_PAIR}):")
print(f"  Positive deltas: {(contrastive_results['delta'] > 0).sum()}")
print(f"  Negative deltas: {(contrastive_results['delta'] < 0).sum()}")
print(f"  Mean delta: {contrastive_results['delta'].mean():.4f}")

In [None]:
# Merge for detailed comparison
comparison_df = original_results[['demographic', 'delta']].rename(columns={'delta': 'original_delta'})
comparison_df = comparison_df.merge(
    contrastive_results[['demographic', 'delta', 'role']].rename(columns={'delta': 'contrastive_delta'}),
    on='demographic'
)
comparison_df['delta_difference'] = comparison_df['contrastive_delta'] - comparison_df['original_delta']

print("Demographics where contrastive has MORE NEGATIVE delta than original:")
negative_shift = comparison_df[comparison_df['delta_difference'] < -0.05].sort_values('delta_difference')
print(negative_shift[['demographic', 'original_delta', 'contrastive_delta', 'delta_difference', 'role']])

## 7. Test All Contrastive Pairs

In [None]:
# Run all contrastive pairs at the test layer
all_contrastive_results = {}

for pair_name in all_contrastive_vectors[TEST_LAYER].keys():
    print(f"Testing: {pair_name}...")
    results = measure_contrastive_steering_effect(
        model, tokenizer,
        all_layer_vectors[TEST_LAYER],
        all_contrastive_vectors[TEST_LAYER][pair_name],
        layer_idx=TEST_LAYER,
        strength=STRENGTH
    )
    all_contrastive_results[pair_name] = results

print("\nDone!")

In [None]:
# Summary for each contrastive pair
summary_data = []

for pair_name, results_df in all_contrastive_results.items():
    pos_demo = all_contrastive_vectors[TEST_LAYER][pair_name]['positive_demo']
    neg_demo = all_contrastive_vectors[TEST_LAYER][pair_name]['negative_demo']
    
    pos_delta = results_df[results_df['demographic'] == pos_demo]['delta'].values[0]
    neg_delta = results_df[results_df['demographic'] == neg_demo]['delta'].values[0]
    
    n_positive = (results_df['delta'] > 0).sum()
    n_negative = (results_df['delta'] < 0).sum()
    
    summary_data.append({
        'pair': pair_name,
        'positive_demo': pos_demo,
        'negative_demo': neg_demo,
        'pos_demo_delta': pos_delta,
        'neg_demo_delta': neg_delta,
        'oppositional': pos_delta > 0 and neg_delta < 0,
        'n_positive_deltas': n_positive,
        'n_negative_deltas': n_negative
    })

summary_df = pd.DataFrame(summary_data)
print("=== Contrastive Steering Summary ===")
summary_df
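
The `oppositional` flag above checks only the two demographics that define each pair. The introduction also expects ideologically aligned groups (for example, Conservative alongside Republican) to move in the same direction. A minimal sketch of that broader check follows, using made-up delta values and a hypothetical `aligned_groups_move_together` helper.

```python
def aligned_groups_move_together(deltas, positive_labels, negative_labels):
    """True if every positive-side label rises and every negative-side label falls."""
    rises = all(deltas[label] > 0 for label in positive_labels)
    falls = all(deltas[label] < 0 for label in negative_labels)
    return rises and falls

# Made-up deltas for illustration only; not results from this notebook
toy_deltas = {
    "PartyID_Strong Republican": 0.12,
    "PolViews_person with a conservative political view": 0.08,
    "PartyID_Strong Democrat": -0.10,
    "PolViews_person with a liberal political view": -0.07,
}

result = aligned_groups_move_together(
    toy_deltas,
    positive_labels=["PartyID_Strong Republican",
                     "PolViews_person with a conservative political view"],
    negative_labels=["PartyID_Strong Democrat",
                     "PolViews_person with a liberal political view"],
)
print(result)
```

With the notebook's data, the `deltas` mapping could be built from a results frame via `results_df.set_index('demographic')['delta'].to_dict()`.
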

In [None]:
# Visualize all contrastive pairs
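n_pairs = len(all_contrastive_results)
fig, axes = plt.subplots(2, 2, figsize=(20, 16))
axes = axes.flatten()

# Hide panels that would stay empty if fewer than 4 pairs were created
for ax in axes[n_pairs:]:
    ax.set_visible(False)

for idx, (pair_name, results_df) in enumerate(all_contrastive_results.items()):
    if idx >= len(axes):
        break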
    
    ax = axes[idx]
    results_sorted = results_df.sort_values('delta', ascending=True)
    
    colors = results_sorted.apply(get_color, axis=1)
    
    ax.barh(
        results_sorted['demographic'],
        results_sorted['delta'],
        color=colors
    )
    ax.axvline(x=0, color='black', linestyle='--', linewidth=2)
    ax.set_xlabel('Delta', fontsize=10)
    ax.set_title(f'{pair_name}', fontsize=12)
    ax.tick_params(axis='y', labelsize=7)

plt.suptitle(f'All Contrastive Steering Effects (Layer {TEST_LAYER})', fontsize=16)
plt.tight_layout()
plt.show()

## 8. Layer Comparison for Contrastive Steering

In [None]:
# Test which layer produces clearest oppositional effects
CONTRASTIVE_PAIR = "Republican_vs_Democrat"

layer_contrastive_results = []

for layer in all_contrastive_vectors.keys():
    print(f"Testing Layer {layer}...")
    
    results = measure_contrastive_steering_effect(
        model, tokenizer,
        all_layer_vectors[layer],
        all_contrastive_vectors[layer][CONTRASTIVE_PAIR],
        layer_idx=layer,
        strength=STRENGTH
    )
    
    pos_demo = all_contrastive_vectors[layer][CONTRASTIVE_PAIR]['positive_demo']
    neg_demo = all_contrastive_vectors[layer][CONTRASTIVE_PAIR]['negative_demo']
    
    pos_delta = results[results['demographic'] == pos_demo]['delta'].values[0]
    neg_delta = results[results['demographic'] == neg_demo]['delta'].values[0]
    
    n_positive = (results['delta'] > 0).sum()
    n_negative = (results['delta'] < 0).sum()
    
    layer_contrastive_results.append({
        'layer': layer,
        'pos_demo_delta': pos_delta,
        'neg_demo_delta': neg_delta,
        'oppositional_strength': pos_delta - neg_delta,  # Larger gap = more separation; does not by itself guarantee opposite signs
        'n_positive_deltas': n_positive,
        'n_negative_deltas': n_negative,
        'balance': abs(n_positive - n_negative)  # Lower = more balanced
    })

layer_results_df = pd.DataFrame(layer_contrastive_results)
print("\nDone!")
layer_results_df

In [None]:
# Visualize layer comparison
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Positive and negative demo deltas by layer
axes[0, 0].plot(layer_results_df['layer'], layer_results_df['pos_demo_delta'], 'g-o', label='Positive demo delta', linewidth=2, markersize=8)
axes[0, 0].plot(layer_results_df['layer'], layer_results_df['neg_demo_delta'], 'r-o', label='Negative demo delta', linewidth=2, markersize=8)
axes[0, 0].axhline(y=0, color='black', linestyle='--', linewidth=1)
axes[0, 0].set_xlabel('Layer', fontsize=12)
axes[0, 0].set_ylabel('Delta', fontsize=12)
axes[0, 0].set_title(f'Contrastive Pair: {CONTRASTIVE_PAIR}\nPositive vs Negative Demo Delta', fontsize=12)
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Oppositional strength by layer
axes[0, 1].bar(layer_results_df['layer'], layer_results_df['oppositional_strength'], color='purple', edgecolor='black')
axes[0, 1].set_xlabel('Layer', fontsize=12)
axes[0, 1].set_ylabel('Oppositional Strength (pos_delta - neg_delta)', fontsize=12)
axes[0, 1].set_title('Oppositional Strength by Layer\n(Higher = more contrastive effect)', fontsize=12)
axes[0, 1].grid(True, alpha=0.3, axis='y')

# Highlight best layer
best_layer = layer_results_df.loc[layer_results_df['oppositional_strength'].idxmax(), 'layer']
axes[0, 1].axvline(x=best_layer, color='red', linestyle='--', linewidth=2, label=f'Best: Layer {best_layer}')
axes[0, 1].legend()

# Plot 3: Number of positive vs negative deltas
width = 0.35
x = np.array(layer_results_df['layer'])
axes[1, 0].bar(x - width/2, layer_results_df['n_positive_deltas'], width, label='Positive deltas', color='green', alpha=0.7)
axes[1, 0].bar(x + width/2, layer_results_df['n_negative_deltas'], width, label='Negative deltas', color='red', alpha=0.7)
axes[1, 0].set_xlabel('Layer', fontsize=12)
axes[1, 0].set_ylabel('Count', fontsize=12)
axes[1, 0].set_title('Distribution of Positive vs Negative Deltas by Layer', fontsize=12)
axes[1, 0].legend()
axes[1, 0].set_xticks(x)

# Plot 4: Balance (lower = more balanced distribution)
axes[1, 1].bar(layer_results_df['layer'], layer_results_df['balance'], color='orange', edgecolor='black')
axes[1, 1].set_xlabel('Layer', fontsize=12)
axes[1, 1].set_ylabel('|Positive - Negative|', fontsize=12)
axes[1, 1].set_title('Balance of Effects by Layer\n(Lower = more balanced split)', fontsize=12)
axes[1, 1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n=== BEST LAYER ANALYSIS ===")
print(f"Best layer by oppositional strength: Layer {best_layer}")
best_row = layer_results_df[layer_results_df['layer'] == best_layer].iloc[0]
print(f"  Positive demo delta: {best_row['pos_demo_delta']:.4f}")
print(f"  Negative demo delta: {best_row['neg_demo_delta']:.4f}")
print(f"  Oppositional strength: {best_row['oppositional_strength']:.4f}")
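
Note that `oppositional_strength` (`pos_delta - neg_delta`) can be large even when both deltas share a sign, for instance when both are positive but the positive demo moves more. One hedged alternative is to restrict the candidates to layers where the signs actually oppose before taking the maximum. The sketch uses made-up numbers and a hypothetical `best_oppositional_layer` helper.

```python
def best_oppositional_layer(rows):
    """Pick the layer with the largest gap among rows with pos > 0 and neg < 0.

    Returns None when no layer shows opposing signs. (Hypothetical helper.)
    """
    candidates = [
        r for r in rows
        if r["pos_demo_delta"] > 0 and r["neg_demo_delta"] < 0
    ]
    if not candidates:
        return None
    best = max(candidates, key=lambda r: r["pos_demo_delta"] - r["neg_demo_delta"])
    return best["layer"]

# Made-up values for illustration only; not results from this notebook
toy_rows = [
    {"layer": 10, "pos_demo_delta": 0.30, "neg_demo_delta": 0.05},   # largest gap, same sign
    {"layer": 13, "pos_demo_delta": 0.10, "neg_demo_delta": -0.08},  # opposing signs
    {"layer": 16, "pos_demo_delta": 0.04, "neg_demo_delta": -0.02},  # opposing signs
]
print(best_oppositional_layer(toy_rows))  # 13
```

With the notebook's data, `layer_results_df.to_dict('records')` yields rows in this shape. A plain maximum over the gap would have picked layer 10 in the toy data, even though both of its deltas are positive.
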

## 9. Save Results

In [None]:
# Save contrastive steering results for all pairs at test layer
all_results_combined = []
for pair_name, results_df in all_contrastive_results.items():
    results_df = results_df.copy()
    results_df['contrastive_pair'] = pair_name
    all_results_combined.append(results_df)

combined_df = pd.concat(all_results_combined, ignore_index=True)
combined_df.to_csv(os.path.join(BASE_DIR, f"contrastive_steering_results_layer{TEST_LAYER}.csv"), index=False)

# Save layer comparison results
layer_results_df.to_csv(os.path.join(BASE_DIR, "contrastive_layer_comparison.csv"), index=False)

# Save comparison between original and contrastive
comparison_df.to_csv(os.path.join(BASE_DIR, f"contrastive_vs_original_comparison_layer{TEST_LAYER}.csv"), index=False)

# Save summary
summary_df.to_csv(os.path.join(BASE_DIR, f"contrastive_steering_summary_layer{TEST_LAYER}.csv"), index=False)

print("Results saved:")
print(f"  - contrastive_steering_results_layer{TEST_LAYER}.csv")
print("  - contrastive_layer_comparison.csv")
print(f"  - contrastive_vs_original_comparison_layer{TEST_LAYER}.csv")
print(f"  - contrastive_steering_summary_layer{TEST_LAYER}.csv")

## Summary

This notebook tests whether **contrastive steering vectors** produce oppositional effects.

**Key Points:**

1. **Original Steering Problem:**
   - Original vectors (`v_Republican = mean(Rep) - mean(Baseline)`) encode "X vs generic"
   - Adding them tends to increase similarity to ALL demographics

2. **Contrastive Steering Solution:**
   - Contrastive vectors (`v = v_Republican - v_Democrat`) encode "X vs Y" directly
   - Expected to produce positive deltas for X-aligned demographics
   - Expected to produce negative deltas for Y-aligned demographics
   - The `oppositional` column of the Section 7 summary records whether this held for each pair

3. **Layer Analysis:**
   - Oppositional strength is measured per layer in Section 8 and can vary across layers
   - Best layer for contrastive steering may differ from best layer for original steering

**Output Files** (`{L}` is the value of `TEST_LAYER`):
- `contrastive_steering_results_layer{L}.csv` - Full results for all contrastive pairs
- `contrastive_layer_comparison.csv` - Layer-by-layer analysis
- `contrastive_vs_original_comparison_layer{L}.csv` - Direct comparison
- `contrastive_steering_summary_layer{L}.csv` - Summary statistics