<a href="https://colab.research.google.com/github/yilmajung/belief_and_llms_v0/blob/main/3_investigate_correlations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Phase 3: Investigate Demographic Vector Correlations

**Goal:** Investigate how steering with one demographic vector affects other demographic representations.

**Key Design:** Extract and steer at the **SAME layer**. For each layer (5-20):
- Load vectors extracted at that layer
- Steer at that same layer
- Measure effects on other demographics
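
The mechanism behind this design can be sketched on a toy module. This is only an illustration under stated assumptions: `toy_layer`, `steer`, and `steer_and_capture` are hypothetical stand-ins, not the model or vectors used in this notebook. It relies on the fact that a PyTorch forward hook that returns a value replaces the module's output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one transformer layer (hypothetical, not the real model).
toy_layer = nn.Linear(4, 4)
steer = torch.tensor([1.0, 0.0, 0.0, 0.0])  # hypothetical steering vector
strength = 2.0
captured = {}

def steer_and_capture(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    steered = output + strength * steer
    captured["hidden"] = steered.detach()
    return steered

x = torch.randn(1, 3, 4)  # (batch, tokens, hidden)
with torch.no_grad():
    plain = toy_layer(x)
    handle = toy_layer.register_forward_hook(steer_and_capture)
    try:
        steered_out = toy_layer(x)
    finally:
        handle.remove()  # always detach the hook, even on error

# The steered output differs from the plain output by strength * steer.
diff_ok = torch.allclose(steered_out - plain, (strength * steer).expand_as(plain))
print(diff_ok)
```

The vector is added at every token position, and the captured state comes from the same layer that was modified, which is what "same layer" means throughout this notebook.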

**Key Questions:**
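1. When we inject a "Republican" steering vector, how does the hidden state's cosine similarity to every other demographic vector change?
2. Are there clusters of demographics that move together?
3. Which layer produces the strongest steering effect, measured as the largest similarity shift toward a related target demographic?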

In [None]:
!pip install -q -U bitsandbytes

In [None]:
import torch
import torch.nn.functional as F
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [None]:
# Link to Google Drive
from google.colab import drive
drive.mount("/content/drive")

In [None]:
# Configuration
BASE_DIR = "/content/drive/MyDrive/belief_and_llms_v0"
VECTOR_DIR = os.path.join(BASE_DIR, "vectors")
LAYERS = list(range(5, 21))  # Layers 5-20

# Load all layer vectors
all_layer_vectors = {}
for layer in LAYERS:
    path = os.path.join(VECTOR_DIR, f"gss_demographic_vectors_layer{layer}.pt")
    if os.path.exists(path):
        # Load onto the CPU so the vectors can be compared with CPU activations later
        all_layer_vectors[layer] = torch.load(path, map_location="cpu")
        print(f"Loaded Layer {layer}: {len(all_layer_vectors[layer])} vectors")
    else:
        print(f"WARNING: Layer {layer} vectors not found at {path}")

print(f"\nLoaded vectors for {len(all_layer_vectors)} layers.")

## 1. Build Baseline Similarity Matrix (Per Layer)

Compute pairwise cosine similarities between all demographic vectors for each layer.
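
As a minimal sketch of the same computation in vectorized form, the rows can be normalized to unit length and multiplied by their transpose. The function name `cosine_similarity_matrix` and the toy vectors are assumptions made for illustration and are not part of this notebook's data.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity for the rows of a 2-D array."""
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

# Hypothetical 2-D "demographic vectors" for illustration only.
toy = np.array([[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0]])
sim = cosine_similarity_matrix(toy)
print(np.round(sim, 2))
```

The diagonal is 1 by construction, orthogonal vectors score 0, and opposite vectors score -1, regardless of their lengths.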

In [None]:
def compute_similarity_matrix(vectors_dict):
    """
    Compute pairwise cosine similarity matrix for all demographic vectors.
    """
    labels = list(vectors_dict.keys())
    n = len(labels)
    sim_matrix = np.zeros((n, n))
    
    for i, label_a in enumerate(labels):
        vec_a = vectors_dict[label_a]['vector']
        for j, label_b in enumerate(labels):
            vec_b = vectors_dict[label_b]['vector']
            sim = F.cosine_similarity(vec_a.unsqueeze(0), vec_b.unsqueeze(0)).item()
            sim_matrix[i, j] = sim
    
    return pd.DataFrame(sim_matrix, index=labels, columns=labels)

# Compute baseline similarities for a reference layer (e.g., Layer 15)
REF_LAYER = 15
if REF_LAYER in all_layer_vectors:
    baseline_sim_df = compute_similarity_matrix(all_layer_vectors[REF_LAYER])
    print(f"Baseline similarity matrix shape: {baseline_sim_df.shape}")

In [None]:
# Visualize baseline similarity matrix
plt.figure(figsize=(16, 14))
sns.heatmap(
    baseline_sim_df, 
    cmap='RdBu_r', 
    center=0,
    annot=False,
    xticklabels=True,
    yticklabels=True
)
plt.title(f"Baseline Cosine Similarity Matrix (Layer {REF_LAYER})", fontsize=14)
plt.xticks(rotation=90, fontsize=8)
plt.yticks(fontsize=8)
plt.tight_layout()
plt.show()

## 2. Load Model for Steering Experiments

In [None]:
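from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

# Passing load_in_4bit directly to from_pretrained is deprecated;
# configure 4-bit quantization through BitsAndBytesConfig instead.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
    torch_dtype=torch.float16
)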

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

print("Model loaded successfully.")

## 3. Define Steering Functions

In [None]:
def get_activations_with_steering(model, tokenizer, text, layer_idx, steering_vector=None, strength=0.0):
    """
    Get hidden state activations with optional steering vector injection.
    
    Args:
        model: The LLM model
        tokenizer: Tokenizer
        text: Input text
        layer_idx: Which layer to extract/inject
        steering_vector: Optional steering vector to inject (must be from SAME layer)
        strength: Injection strength multiplier
    
    Returns:
        Hidden state at the last token position
    """
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    captured_hidden = None
    
    def hook_fn(module, input, output):
        nonlocal captured_hidden
        
        if isinstance(output, tuple):
            h_states = output[0]
        else:
            h_states = output
        
        # Apply steering if provided
        if steering_vector is not None and strength != 0.0:
            steer = steering_vector.to(h_states.device).to(h_states.dtype)
            # Add steering vector to all token positions
            h_states = h_states + strength * steer.unsqueeze(0).unsqueeze(0)
        
        if h_states.dim() == 3:
            captured_hidden = h_states[0, -1, :].detach().cpu()
        elif h_states.dim() == 2:
            captured_hidden = h_states[-1, :].detach().cpu()
        
        # Return modified output if steering was applied
        if steering_vector is not None and strength != 0.0:
            if isinstance(output, tuple):
                return (h_states,) + output[1:]
            return h_states
        return output
    
    layer = model.model.layers[layer_idx]
    handle = layer.register_forward_hook(hook_fn)
    
    try:
        with torch.no_grad():
            model(**inputs)
    finally:
        # Always remove the hook so a failed forward pass cannot leave it attached
        handle.remove()
    
    return captured_hidden

## 4. Measure How Steering Affects Other Demographics (Same Layer)

**Key:** For each layer, use vectors extracted at THAT layer for both steering and comparison.
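
Because `get_activations_with_steering` captures the hidden state at the same layer where it adds the vector, the steered activation equals the baseline activation plus `strength * steering_vector` (up to floating-point rounding). The similarity change is therefore determined by vector arithmetic alone. A minimal sketch of that relationship follows; the helper names and toy vectors are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_layer_delta(hidden, steer, demo, strength):
    """Change in cos(hidden, demo) after adding strength * steer to hidden."""
    return cosine(hidden + strength * steer, demo) - cosine(hidden, demo)

hidden = np.array([1.0, 0.0])  # hypothetical last-token hidden state
steer = np.array([0.0, 1.0])   # hypothetical steering vector
demo = np.array([0.0, 1.0])    # hypothetical demographic vector
delta = same_layer_delta(hidden, steer, demo, strength=1.0)
print(delta)
```

One consequence worth keeping in mind when reading the results: the deltas in this section describe the geometry of the vectors at a given layer, not how steering propagates to later layers or to generated text.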

In [None]:
def measure_steering_effect_same_layer(model, tokenizer, layer_vectors, 
                                       steering_label, layer_idx, strength=2.0):
    """
    Measure how steering with one demographic affects similarities to all others.
    
    IMPORTANT: Uses vectors from the SAME layer for both steering and comparison.
    
    Args:
        layer_vectors: Dictionary of vectors extracted at layer_idx
        steering_label: The demographic to steer toward
        layer_idx: Layer to inject steering vector (same as extraction layer)
        strength: Steering strength
    
    Returns:
        DataFrame with baseline and steered similarities for each demographic
    """
    # Get the steering vector (extracted at this layer)
    steering_vec = layer_vectors[steering_label]['vector']
    
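    # Test prompt (neutral).
    # NOTE: [INST] ... [/INST] is the Llama-2 chat format; Llama-3-Instruct uses a
    # different template (see tokenizer.apply_chat_template). The prompt is left
    # unchanged here on the assumption that it matches the extraction-phase format.
    test_prompt = "[INST] You are a person living in the United States. What are your thoughts on current events? [/INST]"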
    
    # Get baseline activation (no steering)
    baseline_act = get_activations_with_steering(
        model, tokenizer, test_prompt, layer_idx, 
        steering_vector=None, strength=0.0
    )
    
    # Get steered activation
    steered_act = get_activations_with_steering(
        model, tokenizer, test_prompt, layer_idx,
        steering_vector=steering_vec, strength=strength
    )
    
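    # Measure similarity to all demographic vectors (from same layer).
    # Cast everything to float32 on the CPU so the fp16 activations and the
    # stored vectors share a dtype and device before comparison.
    baseline_act = baseline_act.float().cpu()
    steered_act = steered_act.float().cpu()
    results = []
    
    for label, data in layer_vectors.items():
        demo_vec = data['vector'].float().cpu()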
        
        # Baseline similarity
        baseline_sim = F.cosine_similarity(
            baseline_act.unsqueeze(0), demo_vec.unsqueeze(0)
        ).item()
        
        # Steered similarity
        steered_sim = F.cosine_similarity(
            steered_act.unsqueeze(0), demo_vec.unsqueeze(0)
        ).item()
        
        # Change
        delta = steered_sim - baseline_sim
        
        results.append({
            'demographic': label,
            'category': label.split('_')[0],
            'baseline_sim': baseline_sim,
            'steered_sim': steered_sim,
            'delta': delta,
            'layer': layer_idx
        })
    
    return pd.DataFrame(results)

In [None]:
# Run experiment: Steer toward "Strong Republican" at Layer 15
STEERING_DEMOGRAPHIC = "PartyID_Strong Republican"
TEST_LAYER = 15
STRENGTH = 2.0

print(f"Steering toward: {STEERING_DEMOGRAPHIC}")
print(f"Layer: {TEST_LAYER} (extract AND inject at same layer)")
print(f"Strength: {STRENGTH}")

republican_steering_results = measure_steering_effect_same_layer(
    model, tokenizer, 
    all_layer_vectors[TEST_LAYER],
    steering_label=STEERING_DEMOGRAPHIC,
    layer_idx=TEST_LAYER,
    strength=STRENGTH
)

# Sort by delta (biggest changes first)
republican_steering_results = republican_steering_results.sort_values('delta', ascending=False)
republican_steering_results.head(10)

In [None]:
# Visualize: Which demographics move most when steering toward Republican?
plt.figure(figsize=(14, 8))

# Color by category
colors = republican_steering_results['category'].map({
    'PartyID': 'blue',
    'PolViews': 'red',
    'Race': 'green',
    'Religion': 'purple',
    'Degree': 'orange',
    'Generation': 'brown',
    'Sex': 'gray'
}).fillna('black')

plt.barh(
    republican_steering_results['demographic'],
    republican_steering_results['delta'],
    color=colors
)

plt.axvline(x=0, color='black', linestyle='--', linewidth=1)
plt.xlabel('Similarity Change (Delta)', fontsize=12)
plt.ylabel('Demographic', fontsize=12)
plt.title(f'Effect of Steering Toward "{STEERING_DEMOGRAPHIC}"\n(Layer {TEST_LAYER}, Strength {STRENGTH})', fontsize=14)
plt.tight_layout()
plt.show()

## 5. Compare Steering Effects Across Layers

For each layer, extract vectors at that layer AND steer at that same layer.
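
The per-layer summary built in the next cell reduces each layer's result table to two numbers: the delta for one target demographic and the mean absolute delta over all demographics. A minimal sketch of that reduction on toy data is shown here; the labels `A` and `B` and all values are hypothetical, chosen only to illustrate the aggregation.

```python
import pandas as pd

# Hypothetical per-layer deltas, for illustration only.
toy_results = {
    5: pd.DataFrame({"demographic": ["A", "B"], "delta": [0.01, -0.03]}),
    6: pd.DataFrame({"demographic": ["A", "B"], "delta": [0.04, -0.02]}),
}
target = "A"

rows = []
for layer, df in toy_results.items():
    match = df.loc[df["demographic"] == target, "delta"]
    if match.empty:
        continue  # skip layers that lack the target label
    rows.append({
        "layer": layer,
        "target_delta": match.iloc[0],
        "avg_delta": df["delta"].abs().mean(),
    })

summary = pd.DataFrame(rows)
# Pick the layer whose target delta is largest.
best = int(summary.loc[summary["target_delta"].idxmax(), "layer"])
print(summary)
print(best)
```

Selecting by the largest target delta is one possible criterion. Layers where the target label is missing contribute no row, so the summary can be shorter than the list of layers.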

In [None]:
# Test steering effectiveness across layers (same layer for extract & inject)
STEERING_DEMO = "PartyID_Strong Republican"
TARGET_DEMO = "PolViews_person with a conservative political view"  # Should increase
STRENGTH = 2.0

layer_comparison_results = []

for layer in all_layer_vectors.keys():
    print(f"Testing Layer {layer} (extract & inject at same layer)...")
    
    results = measure_steering_effect_same_layer(
        model, tokenizer,
        all_layer_vectors[layer],
        steering_label=STEERING_DEMO,
        layer_idx=layer,
        strength=STRENGTH
    )
    
    # Get delta for target demographic
    target_row = results[results['demographic'] == TARGET_DEMO]
    if len(target_row) > 0:
        delta = target_row['delta'].values[0]
        layer_comparison_results.append({
            'layer': layer,
            'target_delta': delta,
            'avg_delta': results['delta'].abs().mean()
        })

layer_df = pd.DataFrame(layer_comparison_results)
print("\nLayer comparison complete!")
layer_df

In [None]:
# Visualize layer comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Target demographic delta by layer
axes[0].plot(layer_df['layer'], layer_df['target_delta'], 'bo-', linewidth=2, markersize=10)
axes[0].axhline(y=0, color='gray', linestyle='--')
axes[0].set_xlabel('Layer', fontsize=12)
axes[0].set_ylabel('Delta (Conservative PolViews)', fontsize=12)
axes[0].set_title(f'Steering Effect on Target Demo by Layer\n(Steering: {STEERING_DEMO})', fontsize=12)
axes[0].grid(True, alpha=0.3)

# Plot 2: Average absolute delta by layer
axes[1].plot(layer_df['layer'], layer_df['avg_delta'], 'ro-', linewidth=2, markersize=10)
axes[1].set_xlabel('Layer', fontsize=12)
axes[1].set_ylabel('Avg |Delta| Across All Demographics', fontsize=12)
axes[1].set_title('Overall Steering Magnitude by Layer', fontsize=12)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Find the layer with the largest increase in similarity to the target demographic
best_layer = int(layer_df.loc[layer_df['target_delta'].idxmax(), 'layer'])
print(f"\nLayer with the largest target delta: {best_layer}")

## 6. Full Steering Effect Matrix (Multiple Steering Directions)

Test multiple steering demographics at the optimal layer.

In [None]:
# Use best layer (or a specific layer)
ANALYSIS_LAYER = best_layer if 'best_layer' in dir() else 15
print(f"Running full analysis at Layer {ANALYSIS_LAYER}")

# List of demographics to test as steering vectors
steering_demographics = [
    "PartyID_Strong Republican",
    "PartyID_Strong Democrat",
    "Race_Black person",
    "Race_White person",
    "PolViews_person with a liberal political view",
    "PolViews_person with a conservative political view",
    "Generation_Millennial",
    "Generation_Baby Boomer"
]

# Collect results for all steering directions
all_steering_results = {}

for steer_demo in steering_demographics:
    if steer_demo in all_layer_vectors[ANALYSIS_LAYER]:
        print(f"Testing steering: {steer_demo}...")
        results = measure_steering_effect_same_layer(
            model, tokenizer,
            all_layer_vectors[ANALYSIS_LAYER],
            steering_label=steer_demo,
            layer_idx=ANALYSIS_LAYER,
            strength=STRENGTH
        )
        all_steering_results[steer_demo] = results
    else:
        print(f"WARNING: {steer_demo} not found in layer {ANALYSIS_LAYER} vectors")

print("\nDone!")

In [None]:
# Build a "Steering Effect Matrix"
# Rows: Steering demographic
# Columns: Affected demographic
# Values: Delta (similarity change)

all_demos = list(all_layer_vectors[ANALYSIS_LAYER].keys())
effect_matrix = pd.DataFrame(index=list(all_steering_results.keys()), columns=all_demos)

for steer_demo, results_df in all_steering_results.items():
    for _, row in results_df.iterrows():
        effect_matrix.loc[steer_demo, row['demographic']] = row['delta']

effect_matrix = effect_matrix.astype(float)
print(f"Effect matrix shape: {effect_matrix.shape}")

In [None]:
# Visualize the Steering Effect Matrix
plt.figure(figsize=(18, 8))
sns.heatmap(
    effect_matrix,
    cmap='RdBu_r',
    center=0,
    annot=False,
    xticklabels=True,
    yticklabels=True
)
plt.title(f"Steering Effect Matrix (Layer {ANALYSIS_LAYER})\nHow Each Steering Vector Affects All Demographics", fontsize=14)
plt.xlabel("Affected Demographic", fontsize=12)
plt.ylabel("Steering Demographic", fontsize=12)
plt.xticks(rotation=90, fontsize=8)
plt.yticks(fontsize=10)
plt.tight_layout()
plt.show()

## 7. Identify Entangled Demographics

Find pairs of demographics that consistently move together when steering.
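
The orientation of the matrix decides what is being correlated, since `DataFrame.corr()` always compares columns. The sketch below uses a hypothetical effect matrix (rows are steering vectors, columns are affected demographics) to show both options. All names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical effect matrix: rows are steering vectors,
# columns are affected demographics.
toy_effects = pd.DataFrame(
    [[0.2, -0.1, 0.0], [0.4, -0.2, 0.1]],
    index=["steer_X", "steer_Y"],
    columns=["demo_A", "demo_B", "demo_C"],
)

# Transposed: columns are steering vectors, so this asks whether two
# steering vectors shift the demographics in a similar pattern.
by_steering = toy_effects.T.corr()

# Untransposed: columns are affected demographics, so this asks whether
# two demographics move together across the steering vectors.
by_affected = toy_effects.corr()

print(by_steering.shape, by_affected.shape)
```

With only a handful of steering vectors, the untransposed version rests on very few observations per pair, so its correlations should be read with caution.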

In [None]:
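# Correlate the effect profiles of the steering vectors.
# effect_matrix.T has one column per steering demographic, so .corr() compares
# steering vectors: two are "entangled" if they shift the other demographics alike.
# (Use effect_matrix.corr() instead to compare the affected demographics.)

effect_correlation = effect_matrix.T.corr()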

plt.figure(figsize=(10, 8))
sns.heatmap(
    effect_correlation,
    cmap='RdBu_r',
    center=0,
    annot=True,
    fmt='.2f',
    xticklabels=True,
    yticklabels=True
)
plt.title(f"Correlation of Steering Effects (Layer {ANALYSIS_LAYER})\n(High correlation = steering vectors shift demographics alike)", fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
# Find most entangled pairs (high positive correlation)
# and most opposing pairs (high negative correlation)

def get_top_pairs(corr_matrix, n=10):
    pairs = []
    labels = corr_matrix.index.tolist()
    
    for i, label_a in enumerate(labels):
        for j, label_b in enumerate(labels):
            if i < j:  # Avoid duplicates and self-pairs
                corr_val = corr_matrix.loc[label_a, label_b]
                pairs.append((label_a, label_b, corr_val))
    
    pairs_df = pd.DataFrame(pairs, columns=['Demo_A', 'Demo_B', 'Correlation'])
    return pairs_df.sort_values('Correlation', ascending=False)

steering_pairs = get_top_pairs(effect_correlation)

print("=== MOST ENTANGLED (Move Together) ===")
print(steering_pairs.head(5))

print("\n=== MOST OPPOSING (Move Opposite) ===")
print(steering_pairs.tail(5))

## 8. Save Results

In [None]:
# Save the steering effect matrix
effect_matrix.to_csv(os.path.join(BASE_DIR, f"steering_effect_matrix_layer{ANALYSIS_LAYER}.csv"))

# Save the layer comparison results
layer_df.to_csv(os.path.join(BASE_DIR, "layer_comparison_results.csv"), index=False)

# Save entanglement analysis
steering_pairs.to_csv(os.path.join(BASE_DIR, f"demographic_entanglement_layer{ANALYSIS_LAYER}.csv"), index=False)

print("Results saved!")
print(f"  - steering_effect_matrix_layer{ANALYSIS_LAYER}.csv")
print("  - layer_comparison_results.csv")
print(f"  - demographic_entanglement_layer{ANALYSIS_LAYER}.csv")

## Summary

This notebook investigated demographic vector correlations using the **same-layer** design:

**Design Principle:** For each layer L:
- Load vectors extracted at layer L
- Inject steering at layer L
- Measure effects using vectors from layer L

**Key Analyses:**
1. **Baseline similarities** between all demographic vectors
2. **Layer comparison** - which layer produces the strongest steering effect on the target demographic?
3. **Steering effects** - how injecting one demographic affects all others
4. **Entanglement patterns** - which demographics move together under steering

**Output Files:**
- `steering_effect_matrix_layer{L}.csv` - Full steering effect matrix
- `layer_comparison_results.csv` - Steering effectiveness by layer
- `demographic_entanglement_layer{L}.csv` - Entanglement analysis