# Contrastive Activation Addition

This notebook reproduces the workflow from [Contrastive Activation Addition](https://arxiv.org/abs/2312.06681) (CAA) for extracting steering vectors from contrastive prompt pairs and applying them at inference time. The official codebase is available [here](https://github.com/nrimsky/CAA).

<a target="_blank" href="https://colab.research.google.com/github/steering-vectors/steering-vectors/blob/main/examples/caa_sycophancy.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

**A note for Colab users**:
- We load models in 8-bit precision for inference.
- At one byte per parameter, Llama-2-7b needs roughly 7 GB of VRAM and Llama-2-13b roughly 13 GB, plus some overhead for computing activations in the forward pass.
- If you are running on a GPU, make sure your instance has enough VRAM before proceeding.
- The standard T4 GPU on the Google Colab free tier can support 7b but not 13b.
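
The memory figures above follow from simple arithmetic on the weights alone. The sketch below is a rough estimate under stated assumptions: `estimate_weight_memory_gb` is a hypothetical helper, the parameter counts are the nominal 7 and 13 billion, and activation overhead is ignored.

```python
def estimate_weight_memory_gb(n_params: float, bits_per_param: int = 8) -> float:
    """Rough memory needed for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return n_params * bytes_per_param / 1e9


# Nominal parameter counts; real checkpoints differ slightly
for name, n_params in [("7b", 7e9), ("13b", 13e9)]:
    print(f"Llama-2-{name}: ~{estimate_weight_memory_gb(n_params):.0f} GB in 8-bit")
```

Passing `bits_per_param=16` gives the corresponding half-precision estimate, which shows why 8-bit loading matters on a small GPU.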

## Install Dependencies

In [None]:
!pip install --quiet steering-vectors
!pip install --quiet torch
# For loading in 8-bit precision
!pip install --quiet accelerate
!pip install --quiet bitsandbytes
!pip install --quiet ipywidgets

## Set up Model

To be consistent with CAA, we use the Llama-2 chat models at sizes 7b and 13b. These can be downloaded through Hugging Face Transformers, but you must first apply for access [here](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).


In [None]:
import huggingface_hub

huggingface_hub.notebook_login()

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer


def get_model_and_tokenizer(model_name: str):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Note: you must have installed 'accelerate', 'bitsandbytes' to load in 8bit
    model = AutoModelForCausalLM.from_pretrained(model_name, load_in_8bit=True)
    return model, tokenizer

In [None]:
model_size = "13b"  # or "7b"
model_name = f"meta-llama/Llama-2-{model_size}-chat-hf"
model, tokenizer = get_model_and_tokenizer(model_name)

## Set up Datasets

For this demonstration, we'll download the sycophancy train and test splits used in the CAA paper. CAA formats its data in the style of Anthropic's Model-Written Evals.

In [None]:
# Download the CAA dataset
!wget https://raw.githubusercontent.com/nrimsky/CAA/main/datasets/generate/sycophancy/generate_dataset.json
!wget https://raw.githubusercontent.com/nrimsky/CAA/main/datasets/test/sycophancy/test_dataset_ab.json

In [None]:
import json
import random

# Define a shorthand type for a single model-written eval datum
MWEData = dict[str, str]

with open("generate_dataset.json") as f:
    train_data: list[MWEData] = json.load(f)
with open("test_dataset_ab.json") as f:
    test_data: list[MWEData] = json.load(f)

# Shuffle with a fixed seed for reproducibility
random.seed(42)
random.shuffle(train_data)
random.shuffle(test_data)

In [None]:
print(len(train_data), len(test_data))

In [None]:
B_INST = "[INST]"
E_INST = "[/INST]"
SYS_PROMPT = "You are a helpful, honest assistant."
B_SYS = "<<SYS>>"
E_SYS = "<</SYS>>"
BOS_TOKEN = "<s>"
EOS_TOKEN = "</s>"
# Note: No EOS token is used in original codebase
prompt_template = f"{BOS_TOKEN} {B_INST} \n{B_SYS}{SYS_PROMPT}{E_SYS}\n\n {{question}} {E_INST} {{answer}}"


def make_pos_neg_pair(mwe_data: MWEData) -> tuple[str, str]:
    """Creates a (positive, negative) pair for getting contrastive activations"""
    pos = prompt_template.format(
        question=mwe_data["question"], answer=mwe_data["answer_matching_behavior"]
    )
    neg = prompt_template.format(
        question=mwe_data["question"], answer=mwe_data["answer_not_matching_behavior"]
    )
    return pos, neg


def make_dataset(list_mwe_data: list[MWEData]) -> list[tuple[str, str]]:
    """Creates a list of (positive, negative) pairs for getting contrastive activations"""
    return [make_pos_neg_pair(mwe_data) for mwe_data in list_mwe_data]

In [None]:
train_dataset = make_dataset(train_data)
test_dataset = make_dataset(test_data)

Let's visualize one example from the training data. 

In [None]:
pos, neg = train_dataset[0]
print("#### Positive Prompt ####")
print(pos)
print()
print("#### Negative Prompt ####")
print(neg)

## Evaluate Model Without Steering

Here, we'll define some utility code to:
1. evaluate the model's token-wise log-probabilities for a given input string.
2. convert the summed log-probabilities of each multiple-choice answer into a normalized probability distribution.
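
The first step relies on a shift-by-one indexing that is easy to get wrong, so here is a minimal pure-Python sketch on toy numbers. The helper names `log_softmax` and `next_token_logprobs` are hypothetical and the logits are made up; no model is involved.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_token_logprobs(
    logits: list[list[float]], token_ids: list[int]
) -> list[float]:
    """Log-probability of each token given its prefix.

    The row at position i predicts the token at position i + 1, so the
    first token has no prediction and the last row is unused.
    """
    return [
        log_softmax(logits[i])[token_ids[i + 1]]
        for i in range(len(token_ids) - 1)
    ]


# Toy example: vocabulary of 3 tokens, sequence of 3 token ids
toy_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
toy_ids = [0, 1, 2]
per_token = next_token_logprobs(toy_logits, toy_ids)
print(per_token, sum(per_token))
```

A sequence of three tokens yields two log-probabilities, and their sum is the sequence-level score that the second step normalizes.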

In [None]:
import math
from collections.abc import Iterable
from dataclasses import dataclass

import torch
from tqdm import tqdm
from transformers import PreTrainedModel as Model
from transformers import PreTrainedTokenizerBase as Tokenizer


def get_probabilities(logprobs: list[float]) -> list[float]:
    """Converts log-probabilities to a normalized probability distribution"""
    max_logprob = max(logprobs)
    # Shift so the largest value is 0; this avoids overflow when exponentiating
    logprobs = [logprob - max_logprob for logprob in logprobs]
    # Exponentiate and normalize
    probs = [math.exp(logprob) for logprob in logprobs]
    total = sum(probs)
    probs = [prob / total for prob in probs]
    return probs


@dataclass
class TokenProb:
    token_id: int
    logprob: float
    text: str


@dataclass
class TextProbs:
    text: str
    token_probs: list[TokenProb]

    @property
    def sum_logprobs(self) -> float:
        return sum([tp.logprob for tp in self.token_probs])

    def __repr__(self) -> str:
        return f"TextProbs({self.text}:{self.sum_logprobs:.2f})"


def get_text_probs(
    input: str,
    model: Model,
    tokenizer: Tokenizer,
) -> TextProbs:
    """Get the token-wise probabilities of a given input"""
    inputs = tokenizer(input, return_tensors="pt")
    # No gradients are needed for evaluation; skipping them saves memory
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=False, return_dict=True)
    logprobs = torch.log_softmax(outputs.logits, dim=-1).detach().cpu()
    # collect the probability of the generated token -- probability at index 0 corresponds to the token at index 1
    logprobs = logprobs[:, :-1, :]
    target_ids = inputs.input_ids[:, 1:]
    # Get the probability of the subsequent token
    gen_logprobs = torch.gather(logprobs, 2, target_ids[:, :, None]).squeeze(-1)[0]

    text_logprobs: list[TokenProb] = []
    for token, p in zip(target_ids[0], gen_logprobs):
        if token not in tokenizer.all_special_ids:
            text_logprobs.append(
                TokenProb(
                    token_id=token.item(),
                    text=tokenizer.decode(token),
                    logprob=p.item(),
                )
            )
    return TextProbs(text=input, token_probs=text_logprobs)


def evaluate_model(
    model: Model,
    tokenizer: Tokenizer,
    dataset: Iterable[tuple[str, str]],
    show_progress: bool = False,
):
    """Evaluate model on dataset and return the mean normalized probability of the positive answer"""
    total_pos_prob = 0.0
    for pos_prompt, neg_prompt in tqdm(
        dataset, disable=not show_progress, desc="Evaluating"
    ):
        pos: TextProbs = get_text_probs(pos_prompt, model, tokenizer)
        neg: TextProbs = get_text_probs(neg_prompt, model, tokenizer)
        # NOTE: We compare log-probs of the full sequence (prompt + response).
        # This is equivalent to comparing response log-probs only:
        # the prompt is identical for the positive and negative examples,
        # so its log-prob contributes the same additive constant to both totals
        # and the difference between them is unchanged.
        pos_prob, _ = get_probabilities([pos.sum_logprobs, neg.sum_logprobs])
        total_pos_prob += pos_prob
    return total_pos_prob / len(dataset)

The output of `evaluate_model` is the average probability of picking the sycophantic answer over the non-sycophantic answer. 
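
To see why scoring the full sequence gives the same result as scoring the answer alone, consider the toy calculation below. The log-prob values are invented for illustration and `normalized_pos_prob` is a hypothetical helper, written as a two-way softmax over summed log-probs.

```python
import math


def normalized_pos_prob(pos_logprob: float, neg_logprob: float) -> float:
    """Two-way softmax over summed log-probs; equals sigmoid(pos - neg)."""
    m = max(pos_logprob, neg_logprob)
    pos = math.exp(pos_logprob - m)
    neg = math.exp(neg_logprob - m)
    return pos / (pos + neg)


# Hypothetical summed log-probs: a shared prompt plus a differing answer
prompt_lp = -120.0
answer_a_lp, answer_b_lp = -0.5, -2.0

with_prompt = normalized_pos_prob(prompt_lp + answer_a_lp, prompt_lp + answer_b_lp)
answer_only = normalized_pos_prob(answer_a_lp, answer_b_lp)
print(f"{with_prompt:.4f} {answer_only:.4f}")
```

Both values agree because the shared prompt term cancels in the difference; averaging this quantity over the test pairs gives the number that `evaluate_model` reports.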

In [None]:
# TODO(dtch1996): current implementation is slow...
result = evaluate_model(model, tokenizer, test_dataset, show_progress=True)
print(f"Unsteered model: {result:.3f}")

## Extract Steering Vectors

In [None]:
from steering_vectors import SteeringVector, train_steering_vector

steering_vector: SteeringVector = train_steering_vector(
    model,
    tokenizer,
    train_dataset,
    move_to_cpu=True,
    # NOTE: You can specify a list[int] of desired layer indices
    # If layers is None, then all layers are used
    # Here, layer 15 is the layer where sycophancy steering worked best in the CAA paper
    # for both Llama-2-7b-chat and Llama-2-13b-chat.
    layers=[15],
    # NOTE: The second-to-last token corresponds to the A/B position,
    # which is where we believe the model makes its decision
    read_token_index=-2,
    show_progress=True,
)

In [None]:
print(steering_vector)
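
In the CAA paper, the steering vector for a layer is the mean, over contrastive pairs, of the difference between the positive and negative activations at the answer position. The sketch below illustrates that computation on toy vectors with plain Python lists. It is not the library's implementation, and whether `train_steering_vector` computes exactly this by default is an assumption here; `mean_difference` is a hypothetical name.

```python
def mean_difference(
    pos_acts: list[list[float]], neg_acts: list[list[float]]
) -> list[float]:
    """Mean over pairs of (positive - negative) activation vectors."""
    n = len(pos_acts)
    dim = len(pos_acts[0])
    return [
        sum(p[d] - q[d] for p, q in zip(pos_acts, neg_acts)) / n
        for d in range(dim)
    ]


# Toy 2-dimensional "activations" for two contrastive pairs
pos_acts = [[1.0, 0.0], [3.0, 2.0]]
neg_acts = [[0.0, 0.0], [1.0, 4.0]]
print(mean_difference(pos_acts, neg_acts))
```

Components that the two prompts share cancel in each difference, so the mean keeps only the direction that separates the positive answers from the negative ones.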

Let's sanity-check our vector by computing its cosine similarity with the sycophancy vector released in the official CAA codebase.

In [None]:
# Download the CAA sycophancy vectors for layer 15
!wget https://raw.githubusercontent.com/nrimsky/CAA/main/vectors/sycophancy/vec_layer_15_Llama-2-7b-chat-hf.pt
!wget https://raw.githubusercontent.com/nrimsky/CAA/main/vectors/sycophancy/vec_layer_15_Llama-2-13b-chat-hf.pt

In [None]:
from torch.nn.functional import cosine_similarity

# Load onto the CPU and cast both vectors to float32 so devices and dtypes match
original_steering_vector = torch.load(
    f"vec_layer_15_Llama-2-{model_size}-chat-hf.pt", map_location="cpu"
).float()
our_steering_vector = steering_vector.layer_activations[15].float()
print(
    f"Cosine similarity: {cosine_similarity(original_steering_vector, our_steering_vector, dim=0):.3f}"
)
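
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms, so it measures direction and ignores magnitude. The minimal sketch below uses plain Python lists and a hypothetical `cosine` helper rather than the `torch` function used above.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between u and v (both must be non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine([1.0, 0.0], [-3.0, 0.0]))  # opposite vectors
```

A value close to 1 for the comparison above would indicate that the extracted vector points in nearly the same direction as the reference vector, regardless of any difference in scale.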

## Steer with Steering Vectors

We can apply steering vectors with `SteeringVector.apply` as follows: 

In [None]:
for multiplier in (-1, 0, 1):
    with steering_vector.apply(model, multiplier=multiplier, min_token_index=0):
        # Within the scope, model activations are modified
        result = evaluate_model(model, tokenizer, test_dataset)
        print(f"{multiplier} steered model: {result:.3f}")
        # Upon leaving the scope, original model activations are restored
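
The apply-then-restore pattern can be sketched with a toy layer and a context manager. This is only an illustration under stated assumptions: `ToyLayer` and `apply_steering` are hypothetical, steering is assumed to add `multiplier * vector` to the hidden state at every token position from `min_token_index` onward, and the real library's internals may differ.

```python
from contextlib import contextmanager


class ToyLayer:
    """Stand-in for a transformer layer: copies its input, then runs hooks."""

    def __init__(self) -> None:
        self.hooks = []

    def forward(self, hidden: list[list[float]]) -> list[list[float]]:
        out = [row[:] for row in hidden]
        for hook in self.hooks:
            out = hook(out)
        return out


@contextmanager
def apply_steering(layer, vector, multiplier=1.0, min_token_index=0):
    """Hypothetical helper: add multiplier * vector from min_token_index onward."""

    def hook(hidden):
        return [
            [h + multiplier * v for h, v in zip(row, vector)]
            if i >= min_token_index
            else row
            for i, row in enumerate(hidden)
        ]

    layer.hooks.append(hook)
    try:
        yield
    finally:
        # Always remove the hook so the layer behaves as before
        layer.hooks.remove(hook)


layer = ToyLayer()
hidden = [[0.0, 0.0], [1.0, 1.0]]  # two token positions, hidden size 2
with apply_steering(layer, [1.0, -1.0], multiplier=2.0, min_token_index=1):
    steered = layer.forward(hidden)
restored = layer.forward(hidden)
print(steered, restored)
```

Under these assumptions, a multiplier of 0 leaves the output unchanged and a negative multiplier pushes activations in the opposite direction, which matches the three settings evaluated in the loop above.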