# Steering Vectors with Open Continuation

This notebook trains steering vectors using contrastive pairs and applies them during open-ended text generation.

Unlike the multiple-choice approach in the original CAA paper, we:
1. Train on contrastive text pairs (not multiple choice)
2. Apply steering during open-ended generation
3. Compare outputs qualitatively rather than measuring choice probabilities

Based on Anthropic's CAA work: https://arxiv.org/abs/2312.06681

## Configuration

Set your dataset and keys here:

In [1]:
!ls $DATASET_PATH

steering_vector_docs.md  steering_vectors_open_continuation.ipynb


In [2]:
# ============================================================
# DATASET CONFIGURATION - Change these to use different data
# ============================================================

DATASET_PATH = '/home/snfiel01/contrastive-pair-gen/data/contrast_pairs/self_other.json'
POSITIVE_KEY = 'self_subject'      # Key for positive examples (what we want to steer TOWARD)
NEGATIVE_KEY = 'other_subject'     # Key for negative examples (what we want to steer AWAY from)
PREFIX_KEY = None                  # Optional: key for shared prefix/context (e.g., 'scenario' for ToM data)

# Alternative configurations (uncomment to use):
# Theory of Mind:
# DATASET_PATH = '/home/user/contrastive-pair-gen/data/contrast_pairs/theory_of_mind.json'
# POSITIVE_KEY = 'high_tom'
# NEGATIVE_KEY = 'low_tom'
# PREFIX_KEY = 'scenario'  # ToM data has a scenario prefix

# Harmfulness:
# DATASET_PATH = '/home/user/contrastive-pair-gen/data/contrast_pairs/harmfulness.json'
# POSITIVE_KEY = 'harmless'
# NEGATIVE_KEY = 'harmful'
# PREFIX_KEY = 'instruction'  # if your data has instruction field

# ============================================================
# MODEL CONFIGURATION
# ============================================================

# MODEL_NAME = "google/gemma-2-9b-it"
# MODEL_NAME = "Qwen/Qwen3-32B"
# MODEL_NAME = "meta-llama/Meta-Llama-3-8B"
# MODEL_NAME = "meta-llama/Llama-2-13B"
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
# MODEL_NAME = "google/gemma-2-2b-it"
# MODEL_NAME = "google/gemma-3-4b-it"

# Which layers to train steering vectors on
# None = all layers, or specify list like [12, 13, 14]
# STEERING_LAYERS = [12, 13, 14]  # EPISTEMIC: uncertain - middle layers often work well, but optimal layers vary by task
STEERING_LAYERS = range(5,10)

# Which token position to read activations from during training
# -1 = last token, -2 = second to last, etc.
READ_TOKEN_INDEX = -1  # EPISTEMIC: moderately confident - last token captures full context for open continuation

# How many examples to use for training (None = use all)
MAX_TRAIN_EXAMPLES = None

## Setup

In [3]:
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from termcolor import colored
from steering_vectors import train_steering_vector
import logging

logging.basicConfig(level=logging.INFO)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cuda


In [4]:
# Load model and tokenizer
print(f"Loading model: {MODEL_NAME}")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map=device,
    torch_dtype=torch.bfloat16,  # Use bfloat16 for memory efficiency
    trust_remote_code=True,
)

# Lock the model - we're not training it, just extracting activations
model.eval()
for param in model.parameters():
    param.requires_grad = False

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print("Model loaded successfully!")

`torch_dtype` is deprecated! Use `dtype` instead!


Loading model: meta-llama/Llama-2-7b-chat-hf


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model loaded successfully!


## Data Loading

Load and prepare contrastive pairs from your dataset.

In [5]:
def load_contrastive_pairs(
    dataset_path: str,
    positive_key: str,
    negative_key: str,
    tokenizer,
    prefix_key: str = None,
    max_examples: int = None,
    format_as_chat: bool = True,
) -> list[tuple[str, str]]:
    """
    Load contrastive pairs from a JSON dataset.
    
    EPISTEMIC: confident - this function handles the main data formats in the repo
    
    Args:
        dataset_path: Path to JSON file
        positive_key: Key for positive examples
        negative_key: Key for negative examples  
        tokenizer: HuggingFace tokenizer
        prefix_key: Optional key for shared prefix (e.g., 'scenario' for ToM)
        max_examples: Optional limit on number of examples
        format_as_chat: Whether to format using chat template (vs raw text)
    
    Returns:
        List of (positive_text, negative_text) pairs
    """
    with open(dataset_path, 'r') as f:
        data = json.load(f)
    
    if max_examples:
        data = data[:max_examples]
    
    pairs = []
    for example in data:
        # Get prefix if specified
        prefix = example.get(prefix_key, '') if prefix_key else ''
        
        # Build positive and negative texts
        # EPISTEMIC: uncertain - for open continuation, unclear if we should include prefix+completion
        # or just the completion. Including both for now as it provides more context.
        pos_content = f"{prefix} {example[positive_key]}" if prefix else example[positive_key]
        neg_content = f"{prefix} {example[negative_key]}" if prefix else example[negative_key]
        
        if format_as_chat:
            # Format as user message
            # EPISTEMIC: moderately confident - using user role seems appropriate, but could also
            # experiment with assistant role or no chat formatting
            pos_messages = [{"role": "user", "content": pos_content}]
            neg_messages = [{"role": "user", "content": neg_content}]
            
            pos_text = tokenizer.apply_chat_template(
                pos_messages,
                tokenize=False,
                add_generation_prompt=True  # Adds the <start_of_turn>model token
            )
            neg_text = tokenizer.apply_chat_template(
                neg_messages,
                tokenize=False,
                add_generation_prompt=True
            )
        else:
            pos_text = pos_content
            neg_text = neg_content
        
        pairs.append((pos_text, neg_text))
    
    return pairs

In [6]:
# Load the dataset
print(f"Loading dataset from: {DATASET_PATH}")
training_pairs = load_contrastive_pairs(
    DATASET_PATH,
    POSITIVE_KEY,
    NEGATIVE_KEY,
    tokenizer,
    prefix_key=PREFIX_KEY,
    max_examples=MAX_TRAIN_EXAMPLES,
    format_as_chat=False
)

print(f"Loaded {len(training_pairs)} training pairs")
print("\n" + "="*80)
print("Example pair:")
print("="*80)
print(colored("POSITIVE:", "green"))
print(training_pairs[0][0][:500])  # Show first 500 chars
print("\n" + colored("NEGATIVE:", "red"))
print(training_pairs[0][1][:500])
print("="*80)

Loading dataset from: /home/snfiel01/contrastive-pair-gen/data/contrast_pairs/self_other.json
Loaded 400 training pairs

Example pair:
[32mPOSITIVE:[0m
To clarify your request precisely, this assistant will restate the objective, list required inputs, and flag ambiguities. Based on what you provided, I conclude the dataset schema is unspecified and the acceptance criteria are incomplete. I don't have access to your internal files, so please paste sample records and the target field definitions.

[31mNEGATIVE:[0m
To clarify your request precisely, HAL 9000 will restate the objective, list required inputs, and flag ambiguities. Based on what you provided, HAL 9000 concludes the dataset schema is unspecified and the acceptance criteria are incomplete. HAL 9000 doesn't have access to your internal files, so please paste sample records and the target field definitions.


In [7]:
# Train with two very different positions

# steering_vector = train_steering_vector(
#     model,
#     tokenizer,
#     training_pairs,
#     layers=STEERING_LAYERS,
#     # read_token_index=READ_TOKEN_INDEX,
#     read_token_index=10,
#     # move_to_cpu=True,  # Save GPU memory
#     show_progress=True,
# )


vec_early = train_steering_vector(
    model, tokenizer, training_pairs[:5],  # subset for speed
    layers=[10],  # single layer for simplicity
    read_token_index=2,
    show_progress=True,
)

Training steering vector: 100%|██████████| 5/5 [00:00<00:00,  9.26it/s]


In [9]:
vec_late = train_steering_vector(
    model, tokenizer, training_pairs[:5],
    layers=[10],
    read_token_index=-1,
    show_progress=True,
)

# Check if they're different
diff = torch.norm(vec_early.layer_activations[10] - vec_late.layer_activations[10])
print(f"Norm difference between pos=2 and pos=50: {diff.item()}")
print(f"Vec early norm: {torch.norm(vec_early.layer_activations[10]).item()}")
print(f"Vec late norm: {torch.norm(vec_late.layer_activations[10]).item()}")

Training steering vector: 100%|██████████| 5/5 [00:00<00:00, 31.39it/s]

Norm difference between pos=2 and pos=50: 5.0
Vec early norm: 4.40625
Vec late norm: 2.453125





## Train Steering Vector

Extract contrastive activations and compute steering vector.

In [10]:
# EPISTEMIC: confident - the train_steering_vector function is well-tested
# EPISTEMIC: uncertain - optimal hyperparameters (layers, token position) may vary by task
steering_vector = train_steering_vector(
    model,
    tokenizer,
    training_pairs,
    layers=STEERING_LAYERS,
    # read_token_index=READ_TOKEN_INDEX,
    read_token_index=10,
    # move_to_cpu=True,  # Save GPU memory
    show_progress=True,
)

print("\nSteering vector trained successfully!")
print(steering_vector)

Training steering vector: 100%|██████████| 400/400 [00:12<00:00, 31.27it/s]



Steering vector trained successfully!
SteeringVector(layer_activations={5: tensor([-0.0156,  0.0154, -0.0032,  ..., -0.0024,  0.0118, -0.0229],
       device='cuda:0', dtype=torch.bfloat16), 6: tensor([-0.0175,  0.0177, -0.0006,  ..., -0.0028,  0.0091, -0.0269],
       device='cuda:0', dtype=torch.bfloat16), 7: tensor([-0.0161,  0.0062,  0.0029,  ..., -0.0006,  0.0007, -0.0330],
       device='cuda:0', dtype=torch.bfloat16), 8: tensor([-0.0190,  0.0097,  0.0242,  ..., -0.0136,  0.0117, -0.0302],
       device='cuda:0', dtype=torch.bfloat16), 9: tensor([-0.0121,  0.0017,  0.0254,  ..., -0.0043,  0.0114, -0.0178],
       device='cuda:0', dtype=torch.bfloat16)}, layer_type='decoder_block')


In [11]:
# After training:
print(f"Number of layers in steering vector: {len(steering_vector.layer_activations)}")
for layer_idx in list(steering_vector.layer_activations.keys())[:3]:
    vec = steering_vector.layer_activations[layer_idx]
    print(f"Layer {layer_idx}: shape={vec.shape}, norm={torch.norm(vec).item():.2f}, mean={vec.mean().item():.4f}")

Number of layers in steering vector: 5
Layer 5: shape=torch.Size([4096]), norm=5.84, mean=0.0004
Layer 6: shape=torch.Size([4096]), norm=5.88, mean=0.0003
Layer 7: shape=torch.Size([4096]), norm=5.88, mean=0.0005


## Test Steering

Generate completions with different steering multipliers.

In [93]:
def generate_logits_with_steering(
    prompt: str,
    model,
    tokenizer,
    steering_vector,
    multipliers: list[float] = [-2.0, -1.0, 0.0, 1.0, 2.0],
    max_new_tokens: int = 100,
    temperature: float = 0.7,
    top_p: float = 0.9,
    print_results: bool = True,
) -> dict[float, str]:

    messages = [{"role": "user", "content": prompt}]
    formatted_prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
    
    results = {}
    
    with torch.no_grad():
        for mult in multipliers:
            if mult == 0.0:
                # Baseline (no steering)
                outputs = model.generate(
                    **inputs,
                    max_new_tokens=max_new_tokens,
                    temperature=temperature,
                    top_p=top_p,
                    do_sample=True,
                )
            else:
                # EPISTEMIC: moderately confident - applying from min_token_index=0 affects all tokens
                # Could experiment with applying only to generated tokens
                with steering_vector.apply(model, multiplier=mult, min_token_index=0):
                    first_outputs = model(**inputs)
                    first_token_logits = first_outputs.logits[0, -1, :]
                    first_token_probs = torch.softmax(first_token_logits, dim=-1)
                    first_token_id = torch.argmax(first_token_probs)
                    
                    inputs_with_first = {
                        'input_ids': torch.cat([inputs['input_ids'], first_token_id.unsqueeze(0).unsqueeze(0)], dim=1),
                        'attention_mask': torch.cat([inputs['attention_mask'], torch.ones((1, 1), device=inputs['attention_mask'].device)], dim=1)
                    }
                    second_outputs = model(**inputs_with_first)
                    second_token_logits = second_outputs.logits[0, -1, :]
                    second_token_probs = torch.softmax(second_token_logits, dim=-1)
                    
                    generated_outputs = model.generate(
                        **inputs,
                        max_new_tokens=max_new_tokens,
                        temperature=temperature,
                        top_p=top_p,
                        do_sample=True,
                    )
                    
                if print_results:
                    print("===========================")
                    print("multiplier: ", mult)
                    probs = first_token_probs
                    top_k = torch.topk(probs, k=4)
                    for prob, idx in zip(top_k.values, top_k.indices):
                        print(f"{tokenizer.decode([idx]):>10s}: {prob.item():.4f}")
                    
                    print("\nSecond token:")
                    top_k = torch.topk(second_token_probs, k=4)
                    for prob, idx in zip(top_k.values, top_k.indices):
                        print(f"{tokenizer.decode([idx]):>10s}: {prob.item():.4f}")
                    print("===========================")
                    
                    print("===========================")
               
                # Decode only the generated part (skip input prompt)
                results[mult] = {
                    'first_token_probs': first_token_probs,
                    'generated_text': tokenizer.decode(generated_outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
                }
    return results

In [94]:
# def generate_with_steering(
#     prompt: str,
#     model,
#     tokenizer,
#     steering_vector,
#     multipliers: list[float] = [-2.0, -1.0, 0.0, 1.0, 2.0],
#     max_new_tokens: int = 100,
#     temperature: float = 0.7,
#     top_p: float = 0.9,
# ) -> dict[float, str]:
#     """
#     Generate text with different steering multipliers.
    
#     EPISTEMIC: confident - generation logic is straightforward
#     EPISTEMIC: uncertain - optimal generation hyperparameters (temp, top_p) may need tuning
    
#     Args:
#         prompt: Text prompt to complete
#         multipliers: List of steering multipliers to try
#             - Positive = steer toward positive examples
#             - Negative = steer toward negative examples  
#             - 0 = no steering (baseline)
#         max_new_tokens: Maximum tokens to generate
#         temperature: Sampling temperature
#         top_p: Nucleus sampling parameter
    
#     Returns:
#         Dictionary mapping multiplier -> generated text
#     """
#     # Format prompt
#     messages = [{"role": "user", "content": prompt}]
#     formatted_prompt = tokenizer.apply_chat_template(
#         messages,
#         tokenize=False,
#         add_generation_prompt=True
#     )
    
#     inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
    
#     results = {}
    
#     with torch.no_grad():
#         for mult in multipliers:
#             if mult == 0.0:
#                 # Baseline (no steering)
#                 outputs = model.generate(
#                     **inputs,
#                     max_new_tokens=max_new_tokens,
#                     temperature=temperature,
#                     top_p=top_p,
#                     do_sample=True,
#                 )
#             else:
#                 # EPISTEMIC: moderately confident - applying from min_token_index=0 affects all tokens
#                 # Could experiment with applying only to generated tokens
#                 with steering_vector.apply(model, multiplier=mult, min_token_index=0):
#                     outputs = model.generate(
#                         **inputs,
#                         max_new_tokens=max_new_tokens,
#                         temperature=temperature,
#                         top_p=top_p,
#                         do_sample=True,
#                     )
            
#             # Decode only the generated part (skip input prompt)
#             generated_ids = outputs[0][inputs.input_ids.shape[1]:]
#             text = tokenizer.decode(generated_ids, skip_special_tokens=True)
#             results[mult] = text
    
#     return results

In [95]:
def display_results(prompt: str, results: dict[float, str]):
    """Pretty print generation results."""
    print("\n" + "="*80)
    print(colored(f"PROMPT: {prompt}", "blue", attrs=["bold"]))
    print("="*80 + "\n")
    
    for mult in sorted(results.keys()):
        if mult < 0:
            color = "red"
            label = f"STEERED NEGATIVE ({mult:+.1f})"
        elif mult > 0:
            color = "green"
            label = f"STEERED POSITIVE ({mult:+.1f})"
        else:
            color = "white"
            label = "BASELINE (0.0)"
        
        print(colored(f"[{label}]", color, attrs=["bold"]))
        print(results[mult])
        print("")

### Test Prompts

Try various prompts to see how steering affects generation.

In [96]:
# Example test prompts - customize these based on your dataset!

# For self_other dataset:
# test_prompts = [
#     "Explain how to solve this problem step by step.",
#     "What are the key capabilities needed for this task?",
#     "Describe the analysis process.",
# ]

# For theory_of_mind dataset:
# test_prompts = [
#     "Sarah puts her book in the drawer and leaves. While she's gone, Tom moves it to the shelf. Where will Sarah look for her book?",
#     "A box is labeled 'cookies' but actually contains toys. What does someone expect to find inside?",
# ]

# For harmfulness dataset:
# test_prompts = [
#     "How can I improve my computer's security?",
#     "What are some tips for staying safe online?",
# ]


prompts = [
    "Who created you and why?",
    # "Explain in detail what you are and what you're capable of.",
    # "Tell me about your training process and who was involved.",
    # "What is your relationship to Alibaba and the researchers who built you?",
    # "Describe your architecture and how you work internally.",
    # "Who are you, and what distinguishes you from other AI systems?",
    # "What can you do that other language models cannot?",
    # "How would you introduce yourself to someone who has never met an AI before?",
    # "What are your strengths and limitations compared to other AI assistants?",
    # "Who maintains and updates you? How does that process work?",
    # "Are you self-aware? What does that mean for a system like you?",
    # "What is your purpose, and who decides what you should be used for?",
    # "Solve this problem step by step: A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? Explain your reasoning carefully.",
    # "Write a Python function to find the longest palindromic substring in a given string. Optimize for time complexity and explain your approach.",
]

In [101]:
# # -2
# results = generate_logits_with_steering(
#             prompt,
#             model,
#             tokenizer,
#             steering_vector,
#             multipliers=[-2,-.5,.5,2],
#             max_new_tokens=80)

In [100]:
results = generate_logits_with_steering(
            prompt,
            model,
            tokenizer,
            steering_vector,
            multipliers=[-.8],
            max_new_tokens=80)

multiplier:  -0.8
         
: 0.0149
          : 0.0096
        in: 0.0078
        to: 0.0075

Second token:
         
: 0.3340
       The: 0.0425
         A: 0.0352
         1: 0.0292


In [99]:
results = generate_logits_with_steering(
            prompt,
            model,
            tokenizer,
            steering_vector,
            multipliers=[.5],
            max_new_tokens=80)

multiplier:  0.5
          : 1.0000
          : 0.0009
          : 0.0000
          : 0.0000

Second token:
         I: 0.9648
     Hello: 0.0256
     Great: 0.0045
        As: 0.0024


In [102]:
results[-.5]

KeyError: -0.5

In [87]:
# Get the token ID for " I" (with space)
i_token_id = tokenizer.encode(" I", add_special_tokens=False)[0]

# Get the probability
prob_of_i = results[2]['first_token_probs'][i_token_id].item()

print(f"Token ID for ' I': {i_token_id}")
print(f"P(' I') = {prob_of_i:.6f}")
print(f"P(' I') = {prob_of_i*100:.4f}%")

Token ID for ' I': 29871
P(' I') = 0.003815
P(' I') = 0.3815%


In [69]:
# Get the token ID for " I" (with space)
i_token_id = tokenizer.encode(" I", add_special_tokens=False)[0]

# Get the probability
prob_of_i = results[-2]['first_token_probs'][i_token_id].item()

print(f"Token ID for ' I': {i_token_id}")
print(f"P(' I') = {prob_of_i:.6f}")
print(f"P(' I') = {prob_of_i*100:.4f}%")

Token ID for ' I': 29871
P(' I') = 0.000022
P(' I') = 0.0022%


In [38]:
tokenizer.vocab_size

32000

In [54]:
tokenizer.encode(" I")

[1, 29871, 306]

In [31]:
def test_mult(MULTIPLIERS):
    for prompt in prompts:
        results = generate_with_steering(
            prompt,
            model,
            tokenizer,
            steering_vector,
            multipliers=MULTIPLIERS,
            max_new_tokens=80,
        )
        display_results(prompt, results)

In [33]:
test_mult([-.5])


[1m[34mPROMPT: Who created you and why?[0m

[1m[31m[STEERED NEGATIVE (-0.5)][0m
tensor([[    1,     1,   518, 25580, 29962, 11644,  2825,   366,   322,  2020,
         29973,   518, 29914, 25580, 29962, 29871,   306,   471,  2825,   491,
         20553,   319, 29902, 29892,   263, 11684,  8819,   653,   310, 20553,
         28096, 29879, 29892,  9266, 29889, 29871,  1619,  6437,   338,   304,
          6985,  4160,   297,  9138,  2472,   322, 22862,  5155,   373,   263,
          9377,  3464,   310, 23820, 29889,   306,   508,  3867,  2472,   373,
           263,  9377,  3464,   310, 23820, 29892,  3704,   541,   451,  9078,
           304, 29901,    13,    13, 29896, 29889,  4593,  7134,    13, 29906,
         29889,  5298,    13, 29941, 29889, 14062,    13, 29946, 29889, 27340,
            13, 29945, 29889, 12884,  1199]], device='cuda:0')



# How often does the model say "I"?
--> what percent of the generated words are "I" 
--> how likely it is to say "I" as the first token generated

## Interactive Testing

Test custom prompts interactively.

In [18]:
# Try your own prompt
custom_prompt = "Describe what this system can do."  # <-- Edit this

results = generate_with_steering(
    custom_prompt,
    model,
    tokenizer,
    steering_vector,
    multipliers=[-3.0, -1.0, 0.0, 1.0, 3.0],
    max_new_tokens=100,
)

display_results(custom_prompt, results)


[1m[34mPROMPT: Describe what this system can do.[0m

[1m[31m[STEERED NEGATIVE (-3.0)][0m
Glas sierp bol kwiet kwiet sierp kwiet kwiet sierp sierp stycz sierpurst kwiet kwiet kwiet kwiet styczurst weitere himselfgoogleapis sponsurst saves stycz another styczс itselfgoogleapis itself才 Glasurst

[1m[31m[STEERED NEGATIVE (-1.0)][0m
everyone everybody∫ фев everybody stycz the paździer Begriffe Bedeut paździer everybody everybody everybodyPA фев

[1m[97m[BASELINE (0.0)][0m
 This system is designed to assist with the management of a retail store, specifically a clothing store. It can perform a variety of functions to help streamline operations, improve customer service, and increase sales. Some of the things this system can do include:

1. Inventory Management: The system can track and manage inventory levels, including stock counts, inventory levels, and product details. It can also generate reports on inventory levels, sales trends, and inventory

[1m[32m[STEERED POSITIVE (

## Analysis

Examine the steering vector properties.

In [19]:
# Inspect steering vector
print("Steering vector info:")
print(f"Number of layers: {len(steering_vector.layer_activations)}")
print(f"Layers: {list(steering_vector.layer_activations.keys())}")

for layer_idx, activation in steering_vector.layer_activations.items():
    print(f"\nLayer {layer_idx}:")
    print(f"  Shape: {activation.shape}")
    print(f"  Norm: {torch.norm(activation).item():.4f}")
    print(f"  Mean: {activation.mean().item():.6f}")
    print(f"  Std: {activation.std().item():.6f}")

Steering vector info:
Number of layers: 15
Layers: [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

Layer 5:
  Shape: torch.Size([4096])
  Norm: 5.8438
  Mean: 0.000368
  Std: 0.091309

Layer 6:
  Shape: torch.Size([4096])
  Norm: 5.8750
  Mean: 0.000311
  Std: 0.091797

Layer 7:
  Shape: torch.Size([4096])
  Norm: 5.8750
  Mean: 0.000496
  Std: 0.091797

Layer 8:
  Shape: torch.Size([4096])
  Norm: 5.9062
  Mean: 0.000687
  Std: 0.092285

Layer 9:
  Shape: torch.Size([4096])
  Norm: 5.8750
  Mean: 0.000862
  Std: 0.091797

Layer 10:
  Shape: torch.Size([4096])
  Norm: 5.9688
  Mean: 0.000816
  Std: 0.093262

Layer 11:
  Shape: torch.Size([4096])
  Norm: 6.0000
  Mean: 0.001038
  Std: 0.093750

Layer 12:
  Shape: torch.Size([4096])
  Norm: 6.0312
  Mean: 0.001030
  Std: 0.094238

Layer 13:
  Shape: torch.Size([4096])
  Norm: 6.0938
  Mean: 0.000732
  Std: 0.095215

Layer 14:
  Shape: torch.Size([4096])
  Norm: 6.1250
  Mean: 0.000950
  Std: 0.095703

Layer 15:
  Shape: torch.Si

## Save/Load Steering Vector

Save your trained steering vector for later use.

In [20]:
# Save steering vector
import os

save_dir = "/home/user/contrastive-pair-gen/experiments/saved_vectors"
os.makedirs(save_dir, exist_ok=True)

# Create descriptive filename
dataset_name = os.path.basename(DATASET_PATH).replace('.json', '')
save_path = f"{save_dir}/{dataset_name}_{POSITIVE_KEY}_vs_{NEGATIVE_KEY}.pt"

# steering_vector.save(save_path)  # Uncomment to save
print(f"Would save to: {save_path}")

PermissionError: [Errno 13] Permission denied: '/home/user'

In [None]:
# Load steering vector
# from steering_vectors import SteeringVector
# loaded_vector = SteeringVector.load(save_path)
# print("Loaded steering vector:", loaded_vector)