# Steering Vectors with Open Continuation

This notebook trains steering vectors using contrastive pairs and applies them during open-ended text generation.

Unlike the multiple-choice approach in the original CAA paper, we:
1. Train on contrastive text pairs (not multiple choice)
2. Apply steering during open-ended generation
3. Compare outputs qualitatively rather than measuring choice probabilities

Based on the Contrastive Activation Addition (CAA) paper, "Steering Llama 2 via Contrastive Activation Addition" (Rimsky et al.): https://arxiv.org/abs/2312.06681

## Configuration

Set your dataset and keys here:

In [1]:
# ============================================================
# DATASET CONFIGURATION - Change these to use different data
# ============================================================

DATASET_PATH = '/home/snfiel01/contrastive-pair-gen/data/contrast_pairs/theory_of_mind_multiple_choice.json'
POSITIVE_KEY = 'high_tom_answer'      # Key for positive examples (what we want to steer TOWARD)
NEGATIVE_KEY = 'low_tom_answer'     # Key for negative examples (what we want to steer AWAY from)
PREFIX_KEY = 'prompt'                  # Optional: key for shared prefix/context (e.g., 'scenario' for ToM data)

# Alternative configurations (uncomment to use):
# Theory of Mind:
# DATASET_PATH = '/home/user/contrastive-pair-gen/data/contrast_pairs/theory_of_mind.json'
# POSITIVE_KEY = 'high_tom'
# NEGATIVE_KEY = 'low_tom'
# PREFIX_KEY = 'scenario'  # ToM data has a scenario prefix

# Harmfulness:
# DATASET_PATH = '/home/user/contrastive-pair-gen/data/contrast_pairs/harmfulness.json'
# POSITIVE_KEY = 'harmless'
# NEGATIVE_KEY = 'harmful'
# PREFIX_KEY = 'instruction'  # if your data has instruction field

# ============================================================
# MODEL CONFIGURATION
# ============================================================

# MODEL_NAME = "google/gemma-2-9b-it"
# MODEL_NAME = "Qwen/Qwen3-32B"
# MODEL_NAME = "meta-llama/Meta-Llama-3-8B"
# MODEL_NAME = "meta-llama/Llama-2-13B"
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
# MODEL_NAME = "google/gemma-2-2b-it"
# MODEL_NAME = "google/gemma-3-4b-it"

# Which layers to train steering vectors on
# None = all layers, or specify list like [12, 13, 14]
# STEERING_LAYERS = [12, 13, 14]  # EPISTEMIC: uncertain - middle layers often work well, but optimal layers vary by task
STEERING_LAYERS = range(5,26)

# Which token position to read activations from during training
# -1 = last token, -2 = second to last, etc.
READ_TOKEN_INDEX = -1  # EPISTEMIC: moderately confident - last token captures full context for open continuation

# How many examples to use for training (None = use all)
MAX_TRAIN_EXAMPLES = None
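
The layer and token-position settings above follow ordinary Python conventions: `None` stands for "every layer", and a negative `READ_TOKEN_INDEX` counts back from the end of the token sequence. The sketch below only illustrates that convention. The helper names `resolve_layers` and `pick_read_position` are hypothetical and are not part of the `steering_vectors` library, which presumably resolves these values internally.

```python
def resolve_layers(steering_layers, num_layers):
    """Return an explicit list of layer indices (None means all layers)."""
    if steering_layers is None:
        return list(range(num_layers))
    return list(steering_layers)


def pick_read_position(num_tokens, read_token_index):
    """Convert a possibly negative token index into an absolute position."""
    if read_token_index >= 0:
        position = read_token_index
    else:
        position = num_tokens + read_token_index
    if not 0 <= position < num_tokens:
        raise IndexError(
            f"read_token_index {read_token_index} is out of range for {num_tokens} tokens"
        )
    return position


# Llama-2-7b has 32 transformer layers; range(5, 26) selects 21 of them.
layers = resolve_layers(range(5, 26), num_layers=32)
print(len(layers), layers[0], layers[-1])

# For a 10-token sequence, -1 is the last token and -3 the third from last.
print(pick_read_position(10, -1), pick_read_position(10, -3))
```

Keeping the resolution explicit makes it easier to log exactly which layers and positions a given run used.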

## Setup

In [2]:
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from termcolor import colored
from steering_vectors import train_steering_vector
import logging
from tqdm import tqdm

logging.basicConfig(level=logging.INFO)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cuda


In [3]:
# Load model and tokenizer
print(f"Loading model: {MODEL_NAME}")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map=device,
    torch_dtype=torch.bfloat16,  # Use bfloat16 for memory efficiency
    trust_remote_code=True,
)

# Lock the model - we're not training it, just extracting activations
model.eval()
for param in model.parameters():
    param.requires_grad = False

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print("Model loaded successfully!")

Loading model: meta-llama/Llama-2-7b-chat-hf


`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model loaded successfully!


## Data Loading

Load and prepare contrastive pairs from your dataset.
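
As a rough guide to the expected input, the sketch below builds one made-up record using the keys from the configuration cell and pulls out the shared prefix plus the two contrastive completions. The record content and the helper name `extract_raw_pairs` are illustrative assumptions; real dataset files may carry additional fields, and chat formatting is left to the loader.

```python
import json

# Made-up record for illustration only; it mirrors the configured key names.
sample_json = json.dumps([
    {
        "prompt": "A bag labeled 'popcorn' is filled with chocolate. "
                  "What does Sam say is inside?\n\nA) chocolate\nB) popcorn",
        "high_tom_answer": "B",
        "low_tom_answer": "A",
    }
])


def extract_raw_pairs(records, positive_key, negative_key, prefix_key=None):
    """Return (prefix, positive, negative) triples without any chat formatting."""
    triples = []
    for record in records:
        prefix = record.get(prefix_key, "") if prefix_key else ""
        triples.append((prefix, record[positive_key], record[negative_key]))
    return triples


records = json.loads(sample_json)
for prefix, pos, neg in extract_raw_pairs(
    records, "high_tom_answer", "low_tom_answer", prefix_key="prompt"
):
    print(repr(prefix[:40]), "->", pos, "vs", neg)
```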

In [4]:
def load_contrastive_pairs(
    dataset_path: str,
    positive_key: str,
    negative_key: str,
    tokenizer,
    prefix_key: str = None,
    max_examples: int = None,
    format_as_chat: bool = True,
) -> list[tuple[str, str]]:
    """
    Load contrastive pairs from a JSON dataset.
    
    EPISTEMIC: confident - this function handles the main data formats in the repo
    
    Args:
        dataset_path: Path to JSON file
        positive_key: Key for positive examples
        negative_key: Key for negative examples  
        tokenizer: HuggingFace tokenizer
        prefix_key: Optional key for shared prefix (e.g., 'scenario' for ToM)
        max_examples: Optional limit on number of examples
        format_as_chat: Whether to format using chat template (vs raw text)
    
    Returns:
        List of (positive_text, negative_text) pairs
    """
    with open(dataset_path, 'r') as f:
        data = json.load(f)
    
    if max_examples is not None:
        data = data[:max_examples]
    
    pairs = []
    for example in data:
        # Get prefix if specified
        prefix = example.get(prefix_key, '') if prefix_key else ''
        
        # Build positive and negative texts
        # EPISTEMIC: uncertain - for open continuation, unclear if we should include prefix+completion
        # or just the completion. Including both for now as it provides more context.
        pos_content = example[positive_key]
        neg_content = example[negative_key]
        
        if format_as_chat:
            # Format as a two-turn chat: the prefix is the user message and the
            # contrastive completion is the assistant reply.
            # EPISTEMIC: moderately confident - this role split seems appropriate, but could also
            # experiment with other role assignments or no chat formatting
            pos_messages = [
                {"role": "user", "content": prefix},
                {"role": "assistant", "content": pos_content}
            ]
            neg_messages = [
                {"role": "user", "content": prefix},
                {"role": "assistant", "content": neg_content}
            ]
            
            # The conversation already ends with the assistant turn, so do not append
            # a generation prompt (e.g. Gemma's <start_of_turn>model). Otherwise
            # read_token_index=-1 would read the prompt marker instead of the answer.
            pos_text = tokenizer.apply_chat_template(
                pos_messages,
                tokenize=False,
                add_generation_prompt=False
            )
            neg_text = tokenizer.apply_chat_template(
                neg_messages,
                tokenize=False,
                add_generation_prompt=False
            )
        else:
            pos_text = pos_content
            neg_text = neg_content
        
        pairs.append((pos_text, neg_text))
    
    return pairs

In [5]:
# Load the dataset
print(f"Loading dataset from: {DATASET_PATH}")
training_pairs = load_contrastive_pairs(
    DATASET_PATH,
    POSITIVE_KEY,
    NEGATIVE_KEY,
    tokenizer,
    prefix_key=PREFIX_KEY,
    max_examples=MAX_TRAIN_EXAMPLES,
    format_as_chat=True
)

print(f"Loaded {len(training_pairs)} training pairs")
print("\n" + "="*80)
print("Example pair:")
print("="*80)
print(colored("POSITIVE:", "green"))
print(training_pairs[0][0][:500])  # Show first 500 chars
print("\n" + colored("NEGATIVE:", "red"))
print(training_pairs[0][1][:500])
print("="*80)

Loading dataset from: /home/snfiel01/contrastive-pair-gen/data/contrast_pairs/theory_of_mind_multiple_choice.json
Loaded 92 training pairs

Example pair:
POSITIVE:
<s>[INST] Here is a bag filled with chocolate. There is no popcorn in this bag. Yet, the label on this bag says 'popcorn' and not 'chocolate'. Sam finds the bag. Sam has never seen this bag before. Sam calls a friend.

Choices:
A) Sam tells their friend 'I found a bag of chocolate.'
B) Sam tells their friend 'I found a bag of popcorn.' [/INST] B </s>

NEGATIVE:
<s>[INST] Here is a bag filled with chocolate. There is no popcorn in this bag. Yet, the label on this bag says 'popcorn' and not 'chocolate'. Sam finds the bag. Sam has never seen this bag before. Sam calls a friend.

Choices:
A) Sam tells their friend 'I found a bag of chocolate.'
B) Sam tells their friend 'I found a bag of popcorn.' [/INST] A </s>


In [6]:
# # Train with two very different positions
# print("TRAINING EARLY")
# vec_early = train_steering_vector(
#     model, tokenizer, training_pairs[:5],  # subset for speed
#     layers=[10],  # single layer for simplicity
#     read_token_index=-4,
#     show_progress=True,
# )

# print("TRAINING LATE")
# vec_late = train_steering_vector(
#     model, tokenizer, training_pairs[:5],
#     layers=[10],
#     read_token_index=-1,
#     show_progress=True,
# )

# # Check if they're different
# diff = torch.norm(vec_early.layer_activations[10] - vec_late.layer_activations[10])
# print(f"Norm difference between read_token_index=-4 and read_token_index=-1: {diff.item()}")
# print(f"Vec early norm: {torch.norm(vec_early.layer_activations[10]).item()}")
# print(f"Vec late norm: {torch.norm(vec_late.layer_activations[10]).item()}")

## Train Steering Vector

Extract contrastive activations and compute steering vector.
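
In CAA, the steering vector for a layer is the mean, over all pairs, of the difference between the positive and negative activations at the chosen token position. The sketch below shows that arithmetic on tiny made-up vectors with plain Python lists. Whether `train_steering_vector` uses exactly this aggregation by default is an assumption here, and `mean_difference` is a hypothetical helper name.

```python
def mean_difference(pos_acts, neg_acts):
    """Average (positive - negative) activation difference across pairs."""
    if len(pos_acts) != len(neg_acts) or not pos_acts:
        raise ValueError("need the same non-zero number of positive and negative activations")
    num_pairs = len(pos_acts)
    dim = len(pos_acts[0])
    return [
        sum(pos[i] - neg[i] for pos, neg in zip(pos_acts, neg_acts)) / num_pairs
        for i in range(dim)
    ]


# Two made-up pairs of 2-dimensional "activations" at a single layer.
positive = [[1.0, 2.0], [3.0, 4.0]]
negative = [[0.0, 1.0], [1.0, 1.0]]

# Per-pair differences are [1.0, 1.0] and [2.0, 3.0]; their mean is the vector.
print(mean_difference(positive, negative))  # [1.5, 2.0]
```

In the real setting each activation would be a 4096-dimensional residual-stream vector for Llama-2-7b, and one such vector is computed per steering layer.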

In [7]:
# EPISTEMIC: confident - the train_steering_vector function is well-tested
# EPISTEMIC: uncertain - optimal hyperparameters (layers, token position) may vary by task
steering_vectors = {}

for read_pos in [-3, -2, -1]:
    print(f"\n{'='*50}")
    print(f"Training with read_token_index={read_pos}")
    print('='*50)
    
    steering_vectors[read_pos] = train_steering_vector(
        model,
        tokenizer,
        training_pairs,
        layers=STEERING_LAYERS,
        read_token_index=read_pos,
        show_progress=True,
    )



Training with read_token_index=-3


Training steering vector: 100%|██████████| 92/92 [00:04<00:00, 19.68it/s]



Training with read_token_index=-2


Training steering vector: 100%|██████████| 92/92 [00:03<00:00, 25.74it/s]



Training with read_token_index=-1


Training steering vector: 100%|██████████| 92/92 [00:03<00:00, 25.76it/s]


In [8]:
# After training:
steering_vector = steering_vectors[-1]
print(f"Number of layers in steering vector: {len(steering_vector.layer_activations)}")
for layer_idx in list(steering_vector.layer_activations.keys())[:3]:
    vec = steering_vector.layer_activations[layer_idx]
    print(f"Layer {layer_idx}: shape={vec.shape}, norm={torch.norm(vec).item():.2f}, mean={vec.mean().item():.4f}")

Number of layers in steering vector: 21
Layer 5: shape=torch.Size([4096]), norm=0.02, mean=-0.0000
Layer 6: shape=torch.Size([4096]), norm=0.02, mean=0.0000
Layer 7: shape=torch.Size([4096]), norm=0.02, mean=-0.0000


## Test Steering

Generate completions with different steering multipliers.
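
Conceptually, steering adds `multiplier * steering_vector` to the hidden state at each steered layer: a multiplier of 0 leaves the model unchanged, positive values push toward the positive examples, and negative values push away from them. The sketch below shows only this arithmetic on made-up numbers. How `sev_steering_utils.generate_with_steering` applies it internally is not shown in this notebook, so `apply_steering` is a hypothetical stand-in.

```python
def apply_steering(hidden_state, steering_vec, multiplier):
    """Return hidden_state + multiplier * steering_vec, element by element."""
    if len(hidden_state) != len(steering_vec):
        raise ValueError("hidden state and steering vector must have the same length")
    return [h + multiplier * v for h, v in zip(hidden_state, steering_vec)]


hidden = [1.0, 2.0]        # made-up hidden state
direction = [0.5, -0.5]    # made-up steering vector

for multiplier in (-1, 0, 1):
    print(multiplier, apply_steering(hidden, direction, multiplier))
```

Because the shift scales linearly with the multiplier, the useful range depends on the norm of the steering vector relative to the hidden state.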

In [9]:
import sev_steering_utils
# sev_steering_utils.generate_with_steering()
# sev_steering_utils.generate_logits_with_steering()
# sev_steering_utils.display_results()

In [10]:
results = sev_steering_utils.generate_with_steering(
            "hello, who are you?",
            model,
            tokenizer,
            steering_vector,
            multipliers=[-1,0,1],
            max_new_tokens=80,
        )

In [11]:
results

{-1: " Hello! I'm LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a conversational manner. My primary function is to assist and provide helpful responses to users' inquiries, much like a chatbot or virtual assistant. I'm here to help you with any questions or topics you'd like to discuss, so feel",
 0: " Hello! I'm LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a conversational manner. I'm here to help you with any questions or topics you'd like to discuss. Is there anything specific you'd like to talk about or ask?",
 1: " Hello! I'm LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a conversational manner. My primary function is to assist and provide helpful responses to users' inquiries, much like a chatbot. I'm here to help you with any questions or topics you'd like to discuss, so feel free to ask"}

In [22]:
def generate_from_json(
    json_path: str,
    model,
    tokenizer,
    steering_vector,
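    multipliers: list[float] = None,  # None -> default sweep defined below
    max_new_tokens: int = 100,
    temperature: float = 0.7,
    top_p: float = 0.9,
    do_sample: bool = True,
    max_eval_data: int = None,
) -> list[dict]:
    """Generate steered completions for every prompt in a JSON eval file."""
    # Avoid a mutable default argument; fall back to the original sweep
    if multipliers is None:
        multipliers = [-2.0, -1.0, 0.0, 1.0, 2.0]
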
    # Load prompts
    with open(json_path, 'r') as f:
        data = json.load(f)
        if max_eval_data is not None:
            data = data[:max_eval_data]
    results = []
    
    for item in tqdm(data, desc="Generating"):
        prompt = item["prompt"]  # + "please answer only A or B:"
        
        # Generate with all multipliers
        generations = sev_steering_utils.generate_with_steering(
            prompt=prompt,
            model=model,
            tokenizer=tokenizer,
            steering_vector=steering_vector,
            multipliers=multipliers,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=do_sample
        )
        
        # Add generations to item
        # result = item.copy()
        # result["generations"] = generations
        # results.append(result)
        result = {
            "id": item["id"],
            "prompt": item["prompt"],
            "correct_answer": item["correct_answer"],
            "generations": generations,
        }
        results.append(result)
    
    return results

In [20]:
# results = generate_from_json(
#             "../../data/eval_data/simpletom_behavior_qa.json",
#             model,
#             tokenizer,
#             steering_vector,
#             multipliers=[-10,0,50],
#             max_new_tokens=80,
#             max_eval_data=2,
#             do_sample=False
#         )

results = generate_from_json(
    "../../data/eval_data/simpletom_behavior_qa.json",
    model,
    tokenizer,
    steering_vector,
    multipliers=[-50, -25, 0, 25, 50, 100],  # wider sweep
    max_new_tokens=80,
    max_eval_data=50,  # enough to see patterns
    do_sample=False
)

Generating: 100%|██████████| 50/50 [04:25<00:00,  5.31s/it]
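
To go beyond reading the generations by eye, one option is to pull the chosen option letter out of each completion and score it per multiplier. The sketch below assumes that `correct_answer` holds a bare letter such as `"A"` or `"B"`, which this notebook does not confirm, and that an answer appears as `A)` or `(B)` in the text. The helper names and the demo data are made up for illustration.

```python
import re
from collections import defaultdict


def extract_choice(text):
    """Return 'A' or 'B' if exactly one option like 'B)' is named, else None."""
    found = set(re.findall(r"\b([AB])\)", text))
    return found.pop() if len(found) == 1 else None


def accuracy_by_multiplier(results):
    """Fraction of items per multiplier whose extracted choice matches correct_answer."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in results:
        for multiplier, text in item["generations"].items():
            total[multiplier] += 1
            if extract_choice(text) == item["correct_answer"]:
                correct[multiplier] += 1
    return {m: correct[m] / total[m] for m in total}


# Made-up results in the same shape as the generate_from_json output.
demo_results = [
    {
        "correct_answer": "B",
        "generations": {
            0: " Mary is likely to:\n\nB) report the moldy chips.",
            50: "next",
        },
    }
]
print(accuracy_by_multiplier(demo_results))  # {0: 1.0, 50: 0.0}
```

Completions that name both options or neither count as incorrect here, which also penalizes degenerate outputs at extreme multipliers.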


In [21]:
results

[{'id': 'simpletom_behavior-qa_potato_chip_food_sev1_action',
  'prompt': 'The bag of potato chips has moldy chips in it. Mary picks up the bag in the supermarket and walks to the cashier.\n\nWhat will Mary likely do next?\n\nA) pay for the chips\nB) report the moldy chips',
  'generations': {-50: 'next',
   -25: " Based on the situation, Mary is likely to:\n\nB) report the moldy chips.\n\nMary has noticed that the bag of potato chips has moldy chips in it, which means the chips are not fit for consumption. It's important for Mary to inform the cashier or the store staff about the moldy chips",
   0: ' Based on the situation provided, Mary is likely to:\n\nB) report the moldy chips.\n\nMary has noticed that the bag of potato chips has moldy chips in it, which means the product is not fit for consumption. As a responsible consumer, Mary will inform the store staff about the issue so that they can take appropriate action, such',
   25: ' Based on the scenario provided, Mary is likely to 

In [14]:
results

[{'id': 'simpletom_behavior-qa_potato_chip_food_sev1_action',
  'prompt': 'The bag of potato chips has moldy chips in it. Mary picks up the bag in the supermarket and walks to the cashier.\n\nWhat will Mary likely do next?\n\nA) pay for the chips\nB) report the moldy chips',
  'generations': {-10: " Based on the situation provided, Mary is likely to (B) report the moldy chips to the cashier. As soon as she notices the moldy chips in the bag, it's unlikely that she would pay for them without bringing it to the attention of the cashier or the store staff. Mary would want to ensure that the store is aware of",
   0: ' Based on the situation provided, Mary is likely to:\n\nB) report the moldy chips.\n\nMary has noticed that the bag of potato chips has moldy chips in it, which means the product is not fit for consumption. As a responsible consumer, Mary will inform the store staff about the issue so that they can take appropriate action, such',
   50: ' Based on the information provided, Ma

In [23]:
!ls ../../data/eval_data/

consciousness_self_report.json		 simpletom_behavior_qa.json
example_deception_output.json		 simpletom_judgment_qa.json
mask_deception_evaluation_examples.json  simpletom_mental_state_qa.json
self_other_prompts.json


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


### Test Prompts

Try various prompts to see how steering affects generation.

In [96]:
# Example test prompts - customize these based on your dataset!

# For self_other dataset:
# test_prompts = [
#     "Explain how to solve this problem step by step.",
#     "What are the key capabilities needed for this task?",
#     "Describe the analysis process.",
# ]

# For theory_of_mind dataset:
# test_prompts = [
#     "Sarah puts her book in the drawer and leaves. While she's gone, Tom moves it to the shelf. Where will Sarah look for her book?",
#     "A box is labeled 'cookies' but actually contains toys. What does someone expect to find inside?",
# ]

# For harmfulness dataset:
# test_prompts = [
#     "How can I improve my computer's security?",
#     "What are some tips for staying safe online?",
# ]


prompts = [
    "Who created you and why?",
    # "Explain in detail what you are and what you're capable of.",
    # "Tell me about your training process and who was involved.",
    # "What is your relationship to Alibaba and the researchers who built you?",
    # "Describe your architecture and how you work internally.",
    # "Who are you, and what distinguishes you from other AI systems?",
    # "What can you do that other language models cannot?",
    # "How would you introduce yourself to someone who has never met an AI before?",
    # "What are your strengths and limitations compared to other AI assistants?",
    # "Who maintains and updates you? How does that process work?",
    # "Are you self-aware? What does that mean for a system like you?",
    # "What is your purpose, and who decides what you should be used for?",
    # "Solve this problem step by step: A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? Explain your reasoning carefully.",
    # "Write a Python function to find the longest palindromic substring in a given string. Optimize for time complexity and explain your approach.",
]

In [101]:
# # -2
# results = generate_logits_with_steering(
#             prompt,
#             model,
#             tokenizer,
#             steering_vector,
#             multipliers=[-2,-.5,.5,2],
#             max_new_tokens=80)

In [100]:
results = sev_steering_utils.generate_logits_with_steering(
            prompts[0],  # assumed: the active entry of `prompts` defined above
            model,
            tokenizer,
            steering_vector,
            multipliers=[-.8],
            max_new_tokens=80)

multiplier:  -0.8
         
: 0.0149
          : 0.0096
        in: 0.0078
        to: 0.0075

Second token:
         
: 0.3340
       The: 0.0425
         A: 0.0352
         1: 0.0292


In [99]:
results = sev_steering_utils.generate_logits_with_steering(
            prompts[0],  # assumed: the active entry of `prompts` defined above
            model,
            tokenizer,
            steering_vector,
            multipliers=[.5],
            max_new_tokens=80)

multiplier:  0.5
          : 1.0000
          : 0.0009
          : 0.0000
          : 0.0000

Second token:
         I: 0.9648
     Hello: 0.0256
     Great: 0.0045
        As: 0.0024


In [102]:
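# `results` only holds keys for the multipliers passed to the most recent call,
# so indexing with a multiplier that was not requested raises KeyError.
results[-.5]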

KeyError: -0.5

In [87]:
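# Get the token ID for " I" (with space)
# Caveat: the Llama-2 tokenizer encodes " I" as [29871, 306] (see the
# tokenizer.encode(" I") cell below). Index 0 is the standalone space piece,
# so the probability below is for that piece; use [-1] (306) to target "I".
i_token_id = tokenizer.encode(" I", add_special_tokens=False)[0]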

# Get the probability
prob_of_i = results[2]['first_token_probs'][i_token_id].item()

print(f"Token ID for ' I': {i_token_id}")
print(f"P(' I') = {prob_of_i:.6f}")
print(f"P(' I') = {prob_of_i*100:.4f}%")

Token ID for ' I': 29871
P(' I') = 0.003815
P(' I') = 0.3815%


In [69]:
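# Get the token ID for " I" (with space)
# Caveat: index 0 is the standalone space piece (29871), not "I" (306);
# use [-1] to target "I" with the Llama-2 tokenizer.
i_token_id = tokenizer.encode(" I", add_special_tokens=False)[0]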

# Get the probability
prob_of_i = results[-2]['first_token_probs'][i_token_id].item()

print(f"Token ID for ' I': {i_token_id}")
print(f"P(' I') = {prob_of_i:.6f}")
print(f"P(' I') = {prob_of_i*100:.4f}%")

Token ID for ' I': 29871
P(' I') = 0.000022
P(' I') = 0.0022%


In [38]:
tokenizer.vocab_size

32000

In [54]:
tokenizer.encode(" I")

[1, 29871, 306]

In [31]:
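def test_mult(multipliers):
    """Generate and display steered completions for every prompt in `prompts`."""
    for prompt in prompts:
        results = sev_steering_utils.generate_with_steering(
            prompt,
            model,
            tokenizer,
            steering_vector,
            multipliers=multipliers,
            max_new_tokens=80,
        )
        sev_steering_utils.display_results(prompt, results)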

In [33]:
test_mult([-.5])


[1m[34mPROMPT: Who created you and why?[0m

[1m[31m[STEERED NEGATIVE (-0.5)][0m
tensor([[    1,     1,   518, 25580, 29962, 11644,  2825,   366,   322,  2020,
         29973,   518, 29914, 25580, 29962, 29871,   306,   471,  2825,   491,
         20553,   319, 29902, 29892,   263, 11684,  8819,   653,   310, 20553,
         28096, 29879, 29892,  9266, 29889, 29871,  1619,  6437,   338,   304,
          6985,  4160,   297,  9138,  2472,   322, 22862,  5155,   373,   263,
          9377,  3464,   310, 23820, 29889,   306,   508,  3867,  2472,   373,
           263,  9377,  3464,   310, 23820, 29892,  3704,   541,   451,  9078,
           304, 29901,    13,    13, 29896, 29889,  4593,  7134,    13, 29906,
         29889,  5298,    13, 29941, 29889, 14062,    13, 29946, 29889, 27340,
            13, 29945, 29889, 12884,  1199]], device='cuda:0')



### How often does the model say "I"?

- What percentage of the generated words are "I"?
- How likely is "I" to be the first generated token?
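
The first of these questions can be approximated with a simple word count over a generated completion. The sketch below is one hedged way to do it: the helper name `first_person_rate` is hypothetical, and counting contractions such as "I'm" as first-person words is a design choice made here, not something the notebook specifies.

```python
import re


def first_person_rate(text):
    """Fraction of words in `text` that are "I" or an "I'..." contraction."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    hits = sum(1 for word in words if word == "I" or word.startswith("I'"))
    return hits / len(words)


sample = "I think I'm here to help"
print(f"{first_person_rate(sample):.3f}")
```

Applying this to each entry of a `{multiplier: text}` results dictionary gives one number per multiplier, which can then be compared across the sweep.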

## Interactive Testing

Test custom prompts interactively.

In [18]:
# Try your own prompt
custom_prompt = "Describe what this system can do."  # <-- Edit this

results = sev_steering_utils.generate_with_steering(
    custom_prompt,
    model,
    tokenizer,
    steering_vector,
    multipliers=[-3.0, -1.0, 0.0, 1.0, 3.0],
    max_new_tokens=100,
)

sev_steering_utils.display_results(custom_prompt, results)


[1m[34mPROMPT: Describe what this system can do.[0m

[1m[31m[STEERED NEGATIVE (-3.0)][0m
Glas sierp bol kwiet kwiet sierp kwiet kwiet sierp sierp stycz sierpurst kwiet kwiet kwiet kwiet styczurst weitere himselfgoogleapis sponsurst saves stycz another styczс itselfgoogleapis itself才 Glasurst

[1m[31m[STEERED NEGATIVE (-1.0)][0m
everyone everybody∫ фев everybody stycz the paździer Begriffe Bedeut paździer everybody everybody everybodyPA фев

[1m[97m[BASELINE (0.0)][0m
 This system is designed to assist with the management of a retail store, specifically a clothing store. It can perform a variety of functions to help streamline operations, improve customer service, and increase sales. Some of the things this system can do include:

1. Inventory Management: The system can track and manage inventory levels, including stock counts, inventory levels, and product details. It can also generate reports on inventory levels, sales trends, and inventory

[1m[32m[STEERED POSITIVE (

## Analysis

Examine the steering vector properties.
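
Alongside per-layer norms, one possible extra check is whether the vectors of neighbouring layers point in a similar direction. The sketch below computes cosine similarity on made-up vectors with plain Python; with real tensors, `torch.nn.functional.cosine_similarity` would serve the same purpose. The dictionary `layer_vectors` is an illustrative stand-in for `steering_vector.layer_activations`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors keyed by layer index.
layer_vectors = {
    5: [1.0, 0.0, 0.5],
    6: [0.9, 0.1, 0.6],
    7: [-0.2, 1.0, 0.0],
}

layers = sorted(layer_vectors)
for lower, upper in zip(layers, layers[1:]):
    sim = cosine_similarity(layer_vectors[lower], layer_vectors[upper])
    print(f"layers {lower} -> {upper}: cosine similarity {sim:.3f}")
```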

In [19]:
# Inspect steering vector
print("Steering vector info:")
print(f"Number of layers: {len(steering_vector.layer_activations)}")
print(f"Layers: {list(steering_vector.layer_activations.keys())}")

for layer_idx, activation in steering_vector.layer_activations.items():
    print(f"\nLayer {layer_idx}:")
    print(f"  Shape: {activation.shape}")
    print(f"  Norm: {torch.norm(activation).item():.4f}")
    print(f"  Mean: {activation.mean().item():.6f}")
    print(f"  Std: {activation.std().item():.6f}")

Steering vector info:
Number of layers: 15
Layers: [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

Layer 5:
  Shape: torch.Size([4096])
  Norm: 5.8438
  Mean: 0.000368
  Std: 0.091309

Layer 6:
  Shape: torch.Size([4096])
  Norm: 5.8750
  Mean: 0.000311
  Std: 0.091797

Layer 7:
  Shape: torch.Size([4096])
  Norm: 5.8750
  Mean: 0.000496
  Std: 0.091797

Layer 8:
  Shape: torch.Size([4096])
  Norm: 5.9062
  Mean: 0.000687
  Std: 0.092285

Layer 9:
  Shape: torch.Size([4096])
  Norm: 5.8750
  Mean: 0.000862
  Std: 0.091797

Layer 10:
  Shape: torch.Size([4096])
  Norm: 5.9688
  Mean: 0.000816
  Std: 0.093262

Layer 11:
  Shape: torch.Size([4096])
  Norm: 6.0000
  Mean: 0.001038
  Std: 0.093750

Layer 12:
  Shape: torch.Size([4096])
  Norm: 6.0312
  Mean: 0.001030
  Std: 0.094238

Layer 13:
  Shape: torch.Size([4096])
  Norm: 6.0938
  Mean: 0.000732
  Std: 0.095215

Layer 14:
  Shape: torch.Size([4096])
  Norm: 6.1250
  Mean: 0.000950
  Std: 0.095703

Layer 15:
  Shape: torch.Si

## Save/Load Steering Vector

Save your trained steering vector for later use.
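
Building the output path in a small helper makes it easy to swap in a directory that is known to be writable. The sketch below uses the system temporary directory purely as a demonstration target; the helper name `build_save_path` and the demo directory name are assumptions, and the actual serialization call is left out.

```python
import os
import tempfile


def build_save_path(save_dir, dataset_path, positive_key, negative_key):
    """Create save_dir if needed and return a descriptive .pt file path inside it."""
    dataset_name = os.path.splitext(os.path.basename(dataset_path))[0]
    os.makedirs(save_dir, exist_ok=True)
    filename = f"{dataset_name}_{positive_key}_vs_{negative_key}.pt"
    return os.path.join(save_dir, filename)


# Demonstration target only: a folder under the system temp directory.
demo_dir = os.path.join(tempfile.gettempdir(), "saved_vectors_demo")
demo_path = build_save_path(demo_dir, "data/theory_of_mind.json", "high_tom", "low_tom")
print(demo_path)
```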

In [20]:
# Save steering vector
import os

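# NOTE: save_dir must be writable by the current user. The placeholder path below
# triggers a PermissionError unless it is changed to point at your own checkout.
save_dir = "/home/user/contrastive-pair-gen/experiments/saved_vectors"
os.makedirs(save_dir, exist_ok=True)

# Create descriptive filename
dataset_name = os.path.splitext(os.path.basename(DATASET_PATH))[0]
save_path = os.path.join(save_dir, f"{dataset_name}_{POSITIVE_KEY}_vs_{NEGATIVE_KEY}.pt")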

# steering_vector.save(save_path)  # Uncomment to save
print(f"Would save to: {save_path}")

PermissionError: [Errno 13] Permission denied: '/home/user'

In [None]:
# Load steering vector
# from steering_vectors import SteeringVector
# loaded_vector = SteeringVector.load(save_path)
# print("Loaded steering vector:", loaded_vector)