# Introduction to pyvene
This tutorial shows simple, runnable code snippets for performing different kinds of interventions on neural networks with pyvene.

This is a simplified version of the original notebook that introduces only the key concepts.

## Set-up

In [None]:
try:
    # This library is our indicator that the required installs
    # need to be done.
    import pyvene as pv

except ModuleNotFoundError:
    !pip install git+https://github.com/stanfordnlp/pyvene.git
    import pyvene as pv

## pyvene 101
Before we get started, here are a few core terms used in this library:
- **Base** example: the example we intervene on; more precisely, we intervene on the computation graph of the model as it runs the **Base** example.
- **Source** example or representations: the source of our intervention. We use **Source** to intervene on **Base**.
- **component**: the `nn.Module` we intervene on in a PyTorch-based NN. For models supported by this library, you can access a component directly by its string path, or use the abstract names defined in the config file (e.g., `h[0].mlp.output` or `mlp_output` with other fields).
- **unit**: the axis of our intervention. If the **unit** is `pos` (`position`), then we are intervening on individual token positions.
- **unit_locations**: this list gives the precise location of your intervention, i.e., which units of analysis are affected. For instance, if your `unit` is `pos` and your `unit_locations` is 3, you are intervening on the token at index 3 (counting from 0). If this field is left as `None`, no selection is made: you get the raw tensor and can do whatever you want with it.
- **intervention_type** or **intervention**: this field specifies the intervention to perform. It can be a primitive type, or a function or lambda expression for simple interventions. Primitives are faster and support systematic training schemes, and interventions built from the supported primitives can be saved and loaded.
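
To make **unit** and **unit_locations** concrete, here is a toy sketch in plain Python that does not use pyvene at all. It treats one layer's activations as a list of per-position vectors and overwrites the vector at a chosen position with the one from a source run. The helper `intervene_at_positions` and all numbers are made up for illustration; they are not part of pyvene's API.

```python
# Toy activations for one layer: one vector (hidden size 2) per token position.
base_acts = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
source_acts = [[-1.0, -1.0], [-2.0, -2.0], [-3.0, -3.0], [-4.0, -4.0]]


def intervene_at_positions(base, source, positions):
    """Return a copy of `base` whose rows at `positions` come from `source`.

    Hypothetical helper for illustration only; it is not a pyvene function.
    """
    patched = [row[:] for row in base]      # copy so the base run is untouched
    for pos in positions:
        patched[pos] = source[pos][:]       # positions are counted from 0
    return patched


# unit = "pos", unit_locations = 3: only the vector at index 3 is replaced.
patched = intervene_at_positions(base_acts, source_acts, positions=[3])
print(patched)
```

Only the row at index 3 changes; every other position keeps its base activation.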

### Workflow: Wrap and Intervene
The usual workflow for using pyvene is to load a model, define an intervention config and wrap the model, and then run the intervened model. This returns both the original and intervened outputs, as well as any internal activations you specified to collect.

For example: Setting activations to zero

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import pyvene as pv

# 1. Load the model
model_name = "gpt2"
gpt2 = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
tokenizer = AutoTokenizer.from_pretrained(model_name)


# 2. Wrap the model
pv_gpt2 = pv.IntervenableModel({
    "layer": 0,                                                         # Layer to intervene on
    "component": "mlp_output",                                          # Component to intervene on
    "source_representation": torch.zeros(gpt2.config.n_embd)            # Intervention to be performed
}, model=gpt2)


# 3. Run the intervened model
orig_outputs, intervened_outputs = pv_gpt2(
    base = tokenizer("The capital of Spain is", return_tensors="pt"),     # Input to intervene on
    unit_locations={"base": 3},                                           # Input tokens to intervene on
    output_original_output=True # If False, the first element of the returned tuple is None
)


# 4. Compare outputs
# print(intervened_outputs.logits)
# print(orig_outputs.logits)

# Look at the prediction of the clean run versus the intervened run:
# Get logits
orig_logits = orig_outputs.logits
intervened_logits = intervened_outputs.logits

# Convert logits to token predictions
orig_predictions = orig_logits.argmax(dim=-1)  # Select most likely token at each position
intervened_predictions = intervened_logits.argmax(dim=-1)

# Decode token predictions to text
orig_text = tokenizer.decode(orig_predictions[0])
intervened_text = tokenizer.decode(intervened_predictions[0])

print("Original Output:", orig_text)
print("Intervened Output:", intervened_text)

nnsight is not detected. Please install via 'pip install nnsight' for nnsight backend.
Original Output: 
 of the, Madrid
Intervened Output: 
 of the, the


### Interchange Interventions
Instead of a static vector (e.g., zero), we can intervene on the model with activations sampled from a different forward run. We call this an interchange intervention: the intervention happens between two examples, and we interchange activations between them.

In [None]:
import torch
import pyvene as pv

# 1. Load the model
# built-in helper to get a HuggingFace model - we use gpt2 with an LM head here
_, tokenizer, gpt2 = pv.create_gpt2_lm()

# Define a config
pv_config = pv.IntervenableConfig([{
  "layer": 0,
  "component": "mlp_output"},
  {
  "layer": 1,
  "component": "mlp_output"}],
  intervention_types=pv.VanillaIntervention
)

# 2. Wrap the model
pv_gpt2 = pv.IntervenableModel(
  pv_config, model=gpt2)


# 3. Run the intervened model
orig_outputs, intervened_outputs = pv_gpt2(
  base=tokenizer("The capital of Italy is ",return_tensors = "pt"),      # Base, i.e., intervened on
  sources=tokenizer("The capital of Spain is ", return_tensors = "pt"),  # Source, i.e, intervened with
  unit_locations={"sources->base": 3},
  output_original_output=True
)

# Look at the prediction of the clean run versus the intervened run:
# Get logits
orig_logits = orig_outputs.logits
intervened_logits = intervened_outputs.logits

# Convert logits to token predictions
orig_predictions = orig_logits.argmax(dim=-1)  # Select most likely token at each position
intervened_predictions = intervened_logits.argmax(dim=-1)

# Decode token predictions to text
orig_text = tokenizer.decode(orig_predictions[0])
intervened_text = tokenizer.decode(intervened_predictions[0])

print("Original Output:", orig_text)
print("Intervened Output:", intervened_text)


loaded model
Original Output: 
 of the, Rome 
Intervened Output: 
 of the, the 
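
The idea of an interchange intervention can also be sketched without any library. The toy "model" below has a hidden step and a readout; we cache the hidden activations from a source run and write one of them into the base run. The function `run_toy_model` and its `patch` argument are hypothetical and exist only for this illustration.

```python
def run_toy_model(inputs, patch=None):
    """Two-step toy 'model': a hidden layer followed by a readout.

    `patch` is an optional (position, value) pair that overwrites one hidden
    activation before the readout. Hypothetical helper, not part of pyvene.
    """
    hidden = [2.0 * x for x in inputs]      # the "layer" we intervene on
    if patch is not None:
        position, value = patch
        hidden[position] = value
    output = sum(hidden)                    # the readout
    return hidden, output


base_inputs = [1.0, 2.0, 3.0, 4.0]
source_inputs = [1.0, 2.0, 3.0, 10.0]      # differs from base only at index 3

# 1. Run the source example and keep its hidden activations.
source_hidden, source_output = run_toy_model(source_inputs)

# 2. Run the base example cleanly, then again with position 3 swapped in.
_, base_output = run_toy_model(base_inputs)
_, patched_output = run_toy_model(base_inputs, patch=(3, source_hidden[3]))

print(base_output, source_output, patched_output)
```

Because the two inputs differ only at the patched position, the intervened base run reproduces the source output. In a real network the effect is usually partial, which is what makes the comparison informative.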


### Addition Intervention
Swapping activations is only one kind of intervention we can perform. Here is another simple one: `pv.AdditionIntervention`, which adds the sampled representation into the **Base** run.

In [None]:
import torch
import pyvene as pv

# 1. Load model
_, tokenizer, gpt2 = pv.create_gpt2()

# 2. Wrap model
config = pv.IntervenableConfig({
    "layer": 0,
    "component": "mlp_input"},
    pv.AdditionIntervention
)

pv_gpt2 = pv.IntervenableModel(config, model=gpt2)

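# 3. Run the intervened model
# Note: the call returns a tuple. output_original_output is not set here,
# so the first element (the original output) is expected to be None.
intervened_outputs = pv_gpt2(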
    base = tokenizer(
        "The Space Needle is in downtown",
        return_tensors="pt"
    ),
    unit_locations={"base": [[[0, 1, 2, 3]]]},
    source_representations = torch.rand(gpt2.config.n_embd)
)

loaded model
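
As a rough mental model, an addition intervention leaves the base activation in place and adds the source vector on top, instead of overwriting it. The sketch below shows this on plain Python lists. The helper `add_intervention` and its `scale` argument are assumptions made for illustration; they are not pyvene's implementation.

```python
def add_intervention(base_vector, source_vector, scale=1.0):
    """Element-wise base + scale * source.

    Hypothetical helper for illustration; not pyvene's implementation.
    """
    if len(base_vector) != len(source_vector):
        raise ValueError("vectors must have the same length")
    return [b + scale * s for b, s in zip(base_vector, source_vector)]


base_vector = [0.5, -1.0, 2.0]
steering_vector = [1.0, 1.0, -2.0]

full = add_intervention(base_vector, steering_vector)             # full strength
weak = add_intervention(base_vector, steering_vector, scale=0.5)  # half strength

print(full)
print(weak)
```

Scaling the source vector before adding it is a simple way to control how strongly the intervention pushes the base run.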


### Activation Collection with Intervention
You can also collect activations with the provided `pv.CollectIntervention` intervention. More importantly, it can be combined with other interventions, so you can collect activations from an intervened model.

**We can basically use this like hooks!**

In [None]:
import torch
import pyvene as pv

_, tokenizer, gpt2 = pv.create_gpt2()

config = pv.IntervenableConfig({
    "layer": 10,
    "component": "mlp_output",
    "intervention_type": pv.CollectIntervention}
)

pv_gpt2 = pv.IntervenableModel(
    config, model=gpt2)

collected_activations = pv_gpt2(
    base = tokenizer(
        "The capital of Spain is",
        return_tensors="pt"
    ), unit_locations={"sources->base": 3}
)[0][-1]

loaded model
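
To illustrate the hook analogy, here is a library-free sketch of a layer that passes its output through a list of hooks. One hook modifies the activation and a second hook records whatever it sees, so the recorded value is the intervened one. `ToyLayer` and both hook functions are hypothetical and do not mirror the PyTorch or pyvene APIs.

```python
class ToyLayer:
    """A layer that doubles its input and lets hooks see or replace the output.

    Hypothetical stand-in for a hooked module; not pyvene's or PyTorch's API.
    """

    def __init__(self):
        self.hooks = []

    def __call__(self, inputs):
        output = [2.0 * x for x in inputs]
        for hook in self.hooks:
            result = hook(output)
            if result is not None:          # a hook may return a replacement
                output = result
        return output


collected = []


def zero_position_1(output):
    """Intervention hook: set the activation at index 1 to zero."""
    patched = output[:]
    patched[1] = 0.0
    return patched


def collect(output):
    """Collection hook: record a copy and leave the activation unchanged."""
    collected.append(output[:])
    return None


layer = ToyLayer()
layer.hooks.append(zero_position_1)         # intervene first ...
layer.hooks.append(collect)                 # ... then collect the result

final_output = layer([1.0, 2.0, 3.0])
print(final_output, collected)
```

The order of the hooks matters: swapping them would record the clean activation instead of the intervened one.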


### Intervene on a Single Neuron
We want to provide a good user interface so that people with less PyTorch or programming experience can run interventions easily. At the same time, we want to stay flexible and provide the depth of control that highly specific tasks require. Here is an example where we intervene on a specific neuron at a specific head of a layer in a model.

In [None]:
import torch
import pyvene as pv

_, tokenizer, gpt2 = pv.create_gpt2()

config = pv.IntervenableConfig({
    "layer": 8,
    "component": "head_attention_value_output",
    "unit": "h.pos",
    "intervention_type": pv.CollectIntervention}
)

pv_gpt2 = pv.IntervenableModel(
    config, model=gpt2)

collected_activations = pv_gpt2(
    base = tokenizer(
        "The capital of Spain is",
        return_tensors="pt"
    ),
    unit_locations={
        # GET_LOC is a helper.
        # (3,3) means head 3 position 3
        "base": pv.GET_LOC((3,3))
    },
    # the notion of subspace is used to target neuron 0.
    subspaces=[0]
)[0][-1]

loaded model
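
Conceptually, the head, the position, and the subspace narrow the selection down step by step. The sketch below mimics this with a nested list indexed as `activations[head][position][neuron]`, where each value encodes its own coordinates so the selection is easy to check. The layout and the helper `select_unit` are assumptions for illustration and say nothing about how pyvene stores tensors internally.

```python
num_heads, num_positions, head_dim = 4, 5, 3

# activations[head][position][neuron]; each value encodes its own coordinates,
# e.g. 321 would be head 3, position 2, neuron 1.
activations = [
    [[100 * h + 10 * p + n for n in range(head_dim)] for p in range(num_positions)]
    for h in range(num_heads)
]


def select_unit(acts, head, position, neurons=None):
    """Pick one head and position, optionally narrowed to specific neurons.

    Hypothetical helper for illustration; not a pyvene function.
    """
    vector = acts[head][position]
    if neurons is None:
        return vector[:]
    return [vector[n] for n in neurons]


whole_vector = select_unit(activations, head=3, position=3)
single_neuron = select_unit(activations, head=3, position=3, neurons=[0])

print(whole_vector)
print(single_neuron)
```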


### LMs Generation
You can also intervene on the generation call of LMs. Here is a simple example where we add a vector to the MLP output of every layer while the model decodes.

In [None]:
import torch
import pyvene as pv

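# built-in helper to get a TinyStories model
_, tokenizer, tinystory = pv.create_gpt_neo()
# Note: per the ids printed below, 31900 is " Angry" and 14628 is " Happy",
# so despite its name this vector is the " Angry" embedding.
emb_happy = tinystory.transformer.wte(
    torch.tensor(31900))  # use 14628 for " Happy"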

print(tokenizer.decode(14628))
print(tokenizer.encode(" Happy")[0])
print(tokenizer.encode(" Angry")[0])

pv_tinystory = pv.IntervenableModel([{
    "layer": l,
    "component": "mlp_output",
    "intervention_type": pv.AdditionIntervention
    } for l in range(tinystory.config.num_layers)],
    model=tinystory
)
# prompt and generate
prompt = tokenizer(
    "Once upon a time there was", return_tensors="pt")
unintervened_story, intervened_story = pv_tinystory.generate(
    prompt, source_representations=emb_happy*0.1, max_length=100
)

print(tokenizer.decode(
    intervened_story[0],
    skip_special_tokens=True
))
print('')
# prompt and generate
prompt = tokenizer(
    "Once upon a time there was", return_tensors="pt")
unintervened_story, intervened_story = pv_tinystory.generate(
    prompt, source_representations=emb_happy*0.9, max_length=100
)
print(tokenizer.decode(
    intervened_story[0],
    skip_special_tokens=True
))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


loaded model
 Happy
14628
31900


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time there was a little girl named Lucy. She was three years old and loved to explore. One day, Lucy was walking in the park when she saw something shiny in the grass. She bent down to pick it up and saw it was a coin. She was so excited and wanted to show it to her mom.

But when she tried to pick it up, she realized it was stuck in the ground. She tried to pull it out, but it wouldn't budge

Once upon a time there was a little girl named Lucy. She was three years old and loved to explore. One day, Lucy decided to go on an adventure. She put on her shoes and grabbed her hat and set off.

As she walked, Lucy noticed a big, dark cave. She was a bit scared but she was also very curious. She decided to go inside. As she walked in, she saw something shiny and sparkly. It was a beautiful necklace! She was so


### Try it yourself

We talked about the paper 'Language Models Implement Simple Word2Vec-style Vector Arithmetic' yesterday.

The authors identified that the MLP module of layer 19 of 'gpt2-medium' encodes a **'+_capital_city'** update.

For instance, the intermediate outputs on the prompt

```
prompt_poland ="""Q: What is the capital of France?
A: Paris
Q: What is the capital of Poland?
A:"""
```

looked something like this:
```
14  St N G P Poland B C Pol A D
15  Poland P St Pol Warsaw Polish N B G Germany
16  Poland Warsaw Polish Poles Budapest Prague Pol Germany Berlin Moscow
17  Poland Warsaw Polish Poles Budapest Prague � Pol Lithuania Moscow
18  Poland Warsaw Polish Prague Budapest Poles Moscow � Berlin Kiev
19  Warsaw Poland Polish Budapest Prague Moscow Berlin Kiev � Frankfurt
20  Warsaw Poland Prague Budapest Polish Moscow Kiev Berlin Frankfurt Brussels
21  Warsaw Poland Polish Prague Budapest � Kiev Sz Berlin Moscow
22  Warsaw Poland Prague Budapest K W Kiev Sz Moscow Berlin
23  Warsaw W K Br Po B L Z P Poland
```
We were able to show that the MLP update seems to be responsible for the update from Poland -> Warsaw in layer 19.
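
Each row of such a table is just the list of highest-scoring tokens at one layer. As a minimal sketch of that ranking step, the snippet below sorts a toy vocabulary by made-up scores before and after an update. The vocabulary, the scores, and the helper `top_k_tokens` are invented for illustration and are not taken from the model.

```python
def top_k_tokens(logits, vocab, k):
    """Return the k tokens with the highest logits, best first.

    Hypothetical helper for illustration only.
    """
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]


vocab = [" Poland", " Warsaw", " Polish", " Prague", " Berlin"]

# Made-up scores before and after an update that favors the capital.
logits_before = [3.0, 2.5, 2.0, 1.0, 0.5]
logits_after = [2.5, 3.5, 2.0, 1.0, 0.5]

top_before = top_k_tokens(logits_before, vocab, k=3)
top_after = top_k_tokens(logits_after, vocab, k=3)

print(top_before)
print(top_after)
```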

### TODO:

Use pyvene to show that the MLP of layer 19 encodes a '+_capital_city' update for the given prompt.

Hint: Use `prompt_poland` as the source, and "table mug free China table mug free China table mug free" as the base. The worked solution below uses Spain in place of China.

In [None]:

import torch
import pyvene as pv
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Load the model
model_name = "gpt2-medium"
gpt2 = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# TODO: Define base and source prompts
prompt_poland ="""Q: What is the capital of France?
A: Paris
Q: What is the capital of Poland?
A:"""

# prompt_china = "Only say China: China China China China"
# Note: the variable keeps the name from the hint, but this solution uses Spain.
prompt_china = "table mug free Spain table mug free Spain table mug free"

# TODO: Define a config
# Swap the MLP outputs of layers 18 through 23 (the end of range() is exclusive)
pv_config = pv.IntervenableConfig(
    [{"layer": i, "component": "mlp_output"} for i in range(18, 24)],
    intervention_types=pv.VanillaIntervention
)

# 2. TODO: Wrap the model
pv_gpt2 = pv.IntervenableModel(
  pv_config, model=gpt2)

# Hint: you may need this for your unit_locations
# Get the last token position of both prompts
# Tokenize prompts
base_tokens = tokenizer(prompt_china, return_tensors="pt")
source_tokens = tokenizer(prompt_poland, return_tensors="pt")

# Compute last token index for both base and source
last_base_idx = base_tokens.input_ids.shape[1] - 1  # Last token index of base
last_source_idx = source_tokens.input_ids.shape[1] - 1  # Last token index of source


# 3. TODO: Run the intervened model
# orig_outputs, intervened_outputs = ...
orig_outputs, intervened_outputs = pv_gpt2(
  base=base_tokens,        # Base, i.e., intervened on
  sources=source_tokens,   # Source, i.e., intervened with
  # (source position, base position): the last token of each prompt.
  # The prompts differ in length, so the two indices differ.
  unit_locations={"sources->base": (last_source_idx, last_base_idx)},
  output_original_output=True
)

# Hint: You may want to look at the change in prediction & at the change in probability of a certain capital token ...
# Get logits at the last token position
orig_logits = orig_outputs.logits[:, last_base_idx, :]  # Shape: (1, vocab_size)
intervened_logits = intervened_outputs.logits[:, last_base_idx, :]  # Shape: (1, vocab_size)

# Compute probabilities using softmax
orig_probs = torch.softmax(orig_logits, dim=-1)
intervened_probs = torch.softmax(intervened_logits, dim=-1)

# Token ID for " Madrid" (the capital of the country in the base prompt)
token_madrid = tokenizer.encode(" Madrid")[0]

# Extract probability of the " Madrid" token
orig_prob_madrid = orig_probs[0, token_madrid].item()
intervened_prob_madrid = intervened_probs[0, token_madrid].item()

print(f"Original probability of 'Madrid': {orig_prob_madrid:.6f}")
print(f"Intervened probability of 'Madrid': {intervened_prob_madrid:.6f}")

# Get logits
orig_logits = orig_outputs.logits
intervened_logits = intervened_outputs.logits

# Convert logits to token predictions (final position only)
orig_predictions = orig_logits[:, -1, :].argmax(dim=-1)
intervened_predictions = intervened_logits[:, -1, :].argmax(dim=-1)

# Decode token predictions to text
orig_text = tokenizer.decode(orig_predictions)
intervened_text = tokenizer.decode(intervened_predictions)

print("Original Output:", orig_text)
print("Intervened Output:", intervened_text)


Original probability of 'Madrid': 0.000557
Intervened probability of 'Madrid': 0.011279
Original Output:  Spain
Intervened Output:  Spain
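
The top prediction is the same in both runs, so the effect of the intervention only shows up in the probabilities. One way to quantify it, sketched below in plain Python, is the probability ratio and the shift in log-odds of the tracked capital token. The two probabilities are copied from the output above; the choice of log-odds as a metric is our own suggestion, not something the exercise prescribes.

```python
import math

orig_prob = 0.000557         # tracked capital token, clean run (from the output above)
intervened_prob = 0.011279   # tracked capital token, intervened run


def log_odds(p):
    """Log-odds (logit) of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


ratio = intervened_prob / orig_prob
log_odds_shift = log_odds(intervened_prob) - log_odds(orig_prob)

print(f"Probability ratio: {ratio:.1f}x")
print(f"Log-odds shift: {log_odds_shift:.2f}")
```

A large ratio with an unchanged top token suggests that the patched MLP outputs push the model toward the capital without being strong enough to override the base prompt's pattern.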
