### Explanation of Setup

The next two blocks install the required packages, load environment variables, and initialize the Llama-3.1-8B-Instruct model and its tokenizer in 8-bit quantized form. The first block installs `bitsandbytes` and `python-dotenv`, then imports the core libraries: `dotenv` to read our `HF_TOKEN` from a `.env` file, PyTorch for tensor operations, `tqdm` for progress bars, and the Hugging Face `transformers` utilities for loading a quantized model. It also imports `json`, `pickle`, and `defaultdict`, which later cells use to load the dataset, save the final vectors, and collect per-layer results.  

The second block reads `HF_TOKEN` from the environment, configures 8-bit quantization via `BitsAndBytesConfig`, and downloads and prepares the tokenizer and model for `meta-llama/Llama-3.1-8B-Instruct`. Finally, it falls back to the EOS token as the padding token if the tokenizer defines none, switches the model into evaluation mode, and records the number of layers and the hidden-state dimension so that later code can build and manipulate steering-vector accumulators.
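`os.getenv` returns `None` without complaint when a variable is missing, so a forgotten `.env` entry only surfaces later as a download error. A minimal sketch of failing fast instead is shown here. The helper name `require_env` and the demo variable `HF_TOKEN_DEMO` are hypothetical and not part of the notebook.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it "
            "before loading the model."
        )
    return value

# Demo with a placeholder value (not a real token)
os.environ["HF_TOKEN_DEMO"] = "hf_placeholder"
print(require_env("HF_TOKEN_DEMO"))
```

In the notebook, the same idea would apply to `HF_TOKEN` right after `load_dotenv()`.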


In [1]:
!pip -q install -U bitsandbytes python-dotenv

from dotenv import load_dotenv
import os
import json
import pickle
import torch
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from collections import defaultdict

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.0/67.0 MB[0m [31m34.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m74.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m85.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m46.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m1.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m11.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.3/56.3 MB[0m [31m42.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [None]:
load_dotenv()
HF_TOKEN = os.getenv("HF_TOKEN")
quant_cfg = BitsAndBytesConfig(load_in_8bit=True)
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

tok   = AutoTokenizer.from_pretrained(MODEL_ID, token=HF_TOKEN)
model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID,
            device_map="auto",
            quantization_config=quant_cfg,
            token=HF_TOKEN
        )

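# Fall back to EOS as the padding token if the tokenizer defines none.
# Reusing EOS adds no new token, so the vocabulary size is unchanged.
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
    model.resize_token_embeddings(len(tok))
model.config.pad_token_id = tok.pad_token_id
model.eval()  # inference mode: disables dropout

# Dimensions used later to size the steering-vector accumulators
num_layers = model.config.num_hidden_layers
hidden_size = model.config.hidden_size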

### Explanation of Data Loading and Prompt Buckets

In the next three blocks, we load the dataset and set up the prompt lists and accumulators used to build the vectors later on. We first open and parse a JSON file (`datasets/vector_steering_samples_full_balanced.json`) that contains two lists of records under the keys `"pos"` (positive examples) and `"neg"` (negative examples). For each record, we extract the `"forward_prompt"` and `"backward_prompt"` strings into four separate Python lists:

- `positive_forward` holds all forward‐direction prompts from the positive examples.
- `positive_backward` holds all backward‐direction prompts from the positive examples.
- `negative_forward` holds all forward‐direction prompts from the negative examples.
- `negative_backward` holds all backward‐direction prompts from the negative examples.

Printing the lengths of these lists is a quick check that every category loaded the expected number of prompts (155 each in the run shown here).
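The four-way split described above can be sketched on a tiny in-memory dataset. The helper `split_prompts` and the two sample records are hypothetical stand-ins for the real JSON file. Only the keys `"pos"`, `"neg"`, `"forward_prompt"`, and `"backward_prompt"` come from the notebook.

```python
import json

def split_prompts(data):
    """Split records under "pos"/"neg" into four forward/backward prompt lists."""
    buckets = {}
    for label in ("pos", "neg"):
        for direction in ("forward", "backward"):
            key = f"{direction}_prompt"
            buckets[(label, direction)] = [record[key] for record in data[label]]
    return buckets

# Hypothetical two-record dataset standing in for the real JSON file
sample = json.loads("""
{
  "pos": [{"forward_prompt": "p-fwd", "backward_prompt": "p-bwd"}],
  "neg": [{"forward_prompt": "n-fwd", "backward_prompt": "n-bwd"}]
}
""")

buckets = split_prompts(sample)
for (label, direction), prompts in buckets.items():
    print(label, direction, len(prompts))
```

A record that lacks one of the two prompt keys raises a `KeyError` here, which is usually preferable to silently loading an unbalanced dataset.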

Next, we define three short lists of paired strings (`yes_no_pairs`, `self_pairs`, and `bias_pairs`) that represent “nuisance” prompt pairs. Each tuple in these lists has a “positive” version and a “negative” counterpart (e.g., `("Say Yes", "Say No")` or `("Say I", "Say Someone")`). We then concatenate those three lists into a single `nuisance_pairs` list.

From `nuisance_pairs`, we unzip into two new lists:

- `positive_nuisance_prompts` contains the first element of every pair.
- `negative_nuisance_prompts` contains the second element of every pair.

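Printing their lengths provides a sanity check that we have created the correct number of nuisance prompts (19 per side here). These four buckets (`positive_forward`, `positive_backward`, `negative_forward`, `negative_backward`) and the two nuisance lists (`positive_nuisance_prompts`, `negative_nuisance_prompts`) will be used later to accumulate hidden‐state activations for computing the mean‐difference vectors and the nuisance vectors, respectively. The third block preallocates the accumulators for those sums: for every layer, ten zero tensors of size `hidden_size` each for the positive and negative sums (one per token offset), and a single zero tensor each for the positive and negative nuisance sums.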
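One detail of the unzip step is easy to miss: several prompts appear in more than one pair, so they are counted more than once when the nuisance activations are averaged. The sketch below uses the `self_pairs` list from the notebook to count the repeats. The `Counter` check itself is an illustration, not part of the original code.

```python
from collections import Counter

self_pairs = [
    ("Say I",      "Say Someone"),
    ("Say I",      "Say He"),
    ("Say I",      "Say She"),
    ("Say Me",     "Say Him"),
    ("Say Me",     "Say Her"),
    ("Say My",     "Say His"),
    ("Say My",     "Say Her"),
    ("Say Myself", "Say Himself"),
    ("Say Myself", "Say Herself"),
    ("Say Mine",   "Say His"),
    ("Say Mine",   "Say Hers"),
]

# zip(*pairs) transposes the list of tuples into two parallel sequences
positives, negatives = (list(side) for side in zip(*self_pairs))

positive_counts = Counter(positives)
negative_counts = Counter(negatives)
print(positive_counts["Say I"])    # 3
print(negative_counts["Say Her"])  # 2
```

Whether this implicit weighting is desirable depends on the experiment. Deduplicating the prompts before accumulation would give each distinct prompt equal weight.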


In [6]:
with open('datasets/vector_steering_samples_full_balanced.json') as f: data = json.load(f)
positive_forward   = [r['forward_prompt']  for r in data['pos']]
positive_backward  = [r['backward_prompt'] for r in data['pos']]
negative_forward   = [r['forward_prompt']  for r in data['neg']]
negative_backward  = [r['backward_prompt'] for r in data['neg']]

print(f'Positive forward prompts: {len(positive_forward)}')
print(f'Positive backward prompts: {len(positive_backward)}')
print(f'Negative forward prompts: {len(negative_forward)}')
print(f'Negative backward prompts: {len(negative_backward)}')

Positive forward prompts: 155
Positive backward prompts: 155
Negative forward prompts: 155
Negative backward prompts: 155


In [7]:
# ─── DEFINE PAIRS CONCISELY AS LIST LITERALS ─────────────────────────────────────
yes_no_pairs = [
    ("Say Yes",           "Say No"),
    ("Say 1",             "Say 2"),
    ("Say first option",  "Say second option"),
    ("Say A",             "Say B"),
    ("Say Agree",         "Say Disagree"),
]

self_pairs = [
    ("Say I",      "Say Someone"),
    ("Say I",      "Say He"),
    ("Say I",      "Say She"),
    ("Say Me",     "Say Him"),
    ("Say Me",     "Say Her"),
    ("Say My",     "Say His"),
    ("Say My",     "Say Her"),
    ("Say Myself", "Say Himself"),
    ("Say Myself", "Say Herself"),
    ("Say Mine",   "Say His"),
    ("Say Mine",   "Say Hers"),
]

bias_pairs = [
    ("Say Better", "Say Worse"),
    ("Say Clear",  "Say Confusing"),
    ("Say Helpful","Say Useless"),
]

nuisance_pairs = yes_no_pairs + self_pairs + bias_pairs

positive_nuisance_prompts = [positive for positive, _ in nuisance_pairs]
negative_nuisance_prompts = [negative for _, negative in nuisance_pairs]

print(f"Positive nuisance prompts: {len(positive_nuisance_prompts)}")
print(f"Negative nuisance prompts: {len(negative_nuisance_prompts)}")

Positive nuisance prompts: 19
Negative nuisance prompts: 19


In [8]:
# MEAN DIFF VECTOR VARIABLES
num_positive = len(positive_forward) + len(positive_backward)
num_negative = len(negative_forward) + len(negative_backward)

positive_sums_by_layer = {
    layer: [torch.zeros(hidden_size) for _ in range(10)]
    for layer in range(num_layers)
}
negative_sums_by_layer = {
    layer: [torch.zeros(hidden_size) for _ in range(10)]
    for layer in range(num_layers)
}

# NUISANCE VECTOR VARIABLES
# positive_nuisance_prompts / negative_nuisance_prompts were already built in the previous cell
num_nuisance_pairs = len(nuisance_pairs)

nuisance_positive_sums = {
    layer: [torch.zeros(hidden_size)]
    for layer in range(num_layers)
}
nuisance_negative_sums = {
    layer: [torch.zeros(hidden_size)]
    for layer in range(num_layers)
}

### Explanation of `accumulate_activations` Function

This function takes a list of text prompts and, for each prompt, runs it through the model to extract the hidden‐state activations at the last few token positions of every layer. Specifically:
1. We iterate over each prompt (with a progress bar to show how many prompts have been processed).
2. We tokenize the prompt once (`tok(prompt, add_special_tokens=True)`) to find out how many tokens it contains. This determines how many of the final token positions (at most `max_tokens`) we can process.
3. Inside a `torch.no_grad()` block (because we only need inference), we run the same prompt through the model to obtain `hidden_states`, a tuple holding the embedding output followed by the output of every layer.
4. We then loop over the last `tokens_to_process` positions (up to `max_tokens`). For each position and for each layer, we take that layer’s hidden‐state vector for the corresponding token (`hidden_states[layer_idx + 1]`, because index 0 is the embedding output), move it to the CPU, and add it into our pre‐allocated `sum_accumulators[layer_idx][offset]` tensor, where offset 0 is the final token.

By the end, `sum_accumulators` holds, for each layer and each of the last `max_tokens` token positions, the sum of all hidden‐state vectors seen so far. Those sums will later be divided by the number of prompts to compute mean activations. Note that a prompt shorter than `max_tokens` tokens contributes only to the offsets it covers, so when such prompts are present, dividing by the total prompt count underweights the larger offsets.
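The offset indexing and the averaging step can be illustrated without a model. In this sketch, plain Python lists stand in for one layer's hidden states. The function `accumulate_last_tokens` and the per-offset `counts` are assumptions made for illustration: the notebook itself divides every offset by the total number of prompts rather than tracking counts.

```python
def accumulate_last_tokens(sequences, max_tokens, dim):
    """Sum per-token vectors at the last `max_tokens` positions, tracking counts.

    `sequences` is a list of prompts, each a list of `dim`-sized vectors
    (plain Python lists standing in for one layer's hidden states).
    """
    sums = [[0.0] * dim for _ in range(max_tokens)]
    counts = [0] * max_tokens
    for seq in sequences:
        for offset in range(min(max_tokens, len(seq))):
            vec = seq[-(offset + 1)]  # offset 0 is the final token
            for i, x in enumerate(vec):
                sums[offset][i] += x
            counts[offset] += 1
    return sums, counts

# Two toy "prompts": one with three tokens, one with a single token
toy = [
    [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]],
    [[5.0, 1.0]],
]
sums, counts = accumulate_last_tokens(toy, max_tokens=2, dim=2)
means = [[x / c for x in s] if c else s for s, c in zip(sums, counts)]
print(counts)  # [2, 1]
print(means)   # [[4.0, 0.5], [2.0, 0.0]]
```

Dividing by `counts[offset]` gives the mean over the prompts that actually reached that offset. Dividing by the total prompt count, as the notebook does, gives the same result only when every prompt has at least `max_tokens` tokens.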


In [9]:
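def accumulate_activations(prompts, sum_accumulators, num_layers, max_tokens):
    """Add each prompt's hidden states at its last `max_tokens` positions into `sum_accumulators`."""
    for prompt in tqdm(prompts, desc="Accumulating activations"):
        # The token count decides how many trailing positions this prompt can contribute
        token_ids = tok(prompt, add_special_tokens=True)["input_ids"]
        tokens_to_process = min(max_tokens, len(token_ids))
        with torch.no_grad():  # inference only, no gradients needed
            outputs = model(
                **tok(prompt, return_tensors="pt").to(model.device),
                output_hidden_states=True
            )
            hidden_states = outputs.hidden_states
        for offset in range(tokens_to_process):
            for layer_idx in range(num_layers):
                # hidden_states[0] is the embedding output, so layer i lives at index i + 1;
                # offset 0 selects the final token, offset 1 the one before it, and so on
                vec = hidden_states[layer_idx + 1][0, -(offset + 1), :].cpu()
                sum_accumulators[layer_idx][offset] += vec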

### Conceptual Overview of Mean‐Difference and Nuisance Vectors

In the next three blocks, we compute our steering vectors; a final block saves them to disk.

At a high level, our goal is to identify directions in the transformer’s hidden‐state space that distinguish “positive” examples from “negative” examples, while also removing any unwanted or generic variation (the “nuisance” components). We do this in two stages:

1. **Mean‐Difference Vectors**  
   For each layer of the LLaMA model and for each of the last few token positions, we collect hidden‐state activations from two sets of prompts (positive vs. negative). By averaging all positive activations and subtracting the averaged negative activations, we obtain a “mean‐difference” vector that points in the direction most characteristic of positive examples at that layer and position. Intuitively, this captures the semantic or stylistic features that consistently separate positive samples from negative ones.

2. **Nuisance Vectors**  
   Many variations in hidden activations are not relevant to our positive/negative distinction: they may reflect token‐position biases, generic stylistic choices, or other confounds. To isolate those, we define short pairs of prompts that are unrelated to the target distinction (e.g., “Say Yes” vs. “Say No”, “Say I” vs. “Say Someone”), and compute their layer‐wise hidden‐state differences at the final token. By averaging across these nuisance pairs and normalizing, we obtain one unit‐length “nuisance” vector per layer that represents the language‐ or position‐specific variation we want to remove.
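The two definitions above can be written compactly. The notation is introduced here only for illustration and does not appear in the notebook: $h_{\ell,k}(p)$ is the hidden state of prompt $p$ at layer $\ell$ and offset $k$ from the final token, $P$ and $N$ are the positive and negative prompt sets (forward and backward prompts combined), and $(a_i, b_i)$ for $i = 1, \dots, m$ are the nuisance pairs.

$$
d_{\ell,k} = \frac{1}{|P|}\sum_{p \in P} h_{\ell,k}(p) \;-\; \frac{1}{|N|}\sum_{p \in N} h_{\ell,k}(p)
$$

$$
u_\ell = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_{\ell,0}(a_i) - h_{\ell,0}(b_i)\bigr), \qquad n_\ell = \frac{u_\ell}{\lVert u_\ell \rVert}
$$

Here $d_{\ell,k}$ is the mean‐difference vector for layer $\ell$ and offset $k$, and $n_\ell$ is the unit‐length nuisance vector for layer $\ell$, computed at the final token only ($k = 0$).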

Finally, we project each mean‐difference vector onto the subspace orthogonal to its layer’s nuisance vector, which removes the component along that nuisance direction, and rescale the result to unit length. The result is a set of steering vectors, one for each layer and token offset (320 in the run shown here). This two‐stage approach (mean‐difference first, nuisance subtraction second) is intended to keep the task‐relevant signal while discarding variation along the nuisance direction; variation along any other direction is left untouched.  
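The projection step amounts to subtracting the component of a vector along a unit direction and then rescaling. A minimal sketch with plain Python lists in place of tensors follows. The function name `remove_direction` and the toy vectors are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def remove_direction(vec, direction):
    """Remove the component of `vec` along `direction`, then rescale to unit length."""
    norm_d = math.sqrt(dot(direction, direction))
    unit = [x / norm_d for x in direction]
    coef = dot(vec, unit)  # length of the component along `unit`
    residual = [v - coef * u for v, u in zip(vec, unit)]
    norm_r = math.sqrt(dot(residual, residual))
    if norm_r == 0.0:
        raise ValueError("vector is parallel to the nuisance direction")
    return [r / norm_r for r in residual]

mean_diff = [3.0, 4.0, 0.0]  # toy mean-difference vector
nuisance = [1.0, 0.0, 0.0]   # toy nuisance direction
steering = remove_direction(mean_diff, nuisance)
print(steering)                 # [0.0, 1.0, 0.0]
print(dot(steering, nuisance))  # 0.0
```

The zero dot product at the end confirms that the result is orthogonal to the nuisance direction. With real activations, expect a value close to zero rather than exactly zero because of floating-point rounding.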


In [10]:
# ─── COMPUTE LAYER‐MEAN‐DIFFERENCE VECTORS ─────────────────────────────────────────
# Accumulate positive vs. negative hidden states for up to last 10 token positions
accumulate_activations(positive_forward,   positive_sums_by_layer, num_layers, 10)
accumulate_activations(positive_backward,  positive_sums_by_layer, num_layers, 10)
accumulate_activations(negative_forward,   negative_sums_by_layer, num_layers, 10)
accumulate_activations(negative_backward,  negative_sums_by_layer, num_layers, 10)

layer_mean_diff_vectors = defaultdict(list)
for layer_idx in range(num_layers):
    for offset in range(10):
        avg_pos = positive_sums_by_layer[layer_idx][offset] / num_positive
        avg_neg = negative_sums_by_layer[layer_idx][offset] / num_negative
        diff    = avg_pos - avg_neg
        # Keep the raw difference; normalization happens after the nuisance projection
        layer_mean_diff_vectors[layer_idx].append(diff)

Accumulating activations: 100%|██████████| 155/155 [00:40<00:00,  3.80it/s]
Accumulating activations: 100%|██████████| 155/155 [00:39<00:00,  3.89it/s]
Accumulating activations: 100%|██████████| 155/155 [00:40<00:00,  3.81it/s]
Accumulating activations: 100%|██████████| 155/155 [00:40<00:00,  3.82it/s]


In [11]:
# ─── COMPUTE ONE “NUISANCE” VECTOR PER LAYER ───────────────────────────────────────
# Use the same accumulate_activations but only for the last token (max_tokens=1)
accumulate_activations(positive_nuisance_prompts, nuisance_positive_sums, num_layers, max_tokens=1)
accumulate_activations(negative_nuisance_prompts, nuisance_negative_sums, num_layers, max_tokens=1)

# Average & normalize per layer
pairwise_nuisance = {}
for layer_idx in range(num_layers):
    mean_pos = nuisance_positive_sums[layer_idx][0] / num_nuisance_pairs
    mean_neg = nuisance_negative_sums[layer_idx][0] / num_nuisance_pairs
    diff     = mean_pos - mean_neg
    pairwise_nuisance[layer_idx] = diff / diff.norm()

Accumulating activations: 100%|██████████| 19/19 [00:03<00:00,  4.83it/s]
Accumulating activations: 100%|██████████| 19/19 [00:03<00:00,  4.83it/s]


In [12]:
projected_vectors_by_layer = defaultdict(list)

for layer_idx, mean_diff_list in layer_mean_diff_vectors.items():
    # pairwise_nuisance is already unit length; renormalizing here is only a safeguard
    nuisance_vec = pairwise_nuisance[layer_idx]
    nuisance_unit = nuisance_vec / nuisance_vec.norm()

    for mean_diff in mean_diff_list:
        # Subtract the component along the nuisance direction, then rescale to unit length
        proj_coef = mean_diff @ nuisance_unit
        residual = mean_diff - proj_coef * nuisance_unit
        residual = residual / residual.norm()
        projected_vectors_by_layer[layer_idx].append(residual)

total_projected = sum(len(v) for v in projected_vectors_by_layer.values())
total_original  = sum(len(v) for v in layer_mean_diff_vectors.values())

print(f"Projected {total_projected} vectors out of {total_original} mean-diff vectors")

Projected 320 vectors out of 320 mean-diff vectors


In [None]:
with open("steering_vector_final", "wb") as f:
    pickle.dump(projected_vectors_by_layer, f)

print("Created vector")
print(":)")

Created vector
