<a href="https://colab.research.google.com/github/zxcayumi/17GP/blob/master/Remote_Controlling_LMs_without_prompting_or_finetuning_(contains_old_WPS).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Controlling LMs without prompting or finetuning

This notebook contains initial explorations of online value modification for `GPT2-XL`: we modify the model's activations at inference time using natural-language prompts.

<b style="color: red">To use this notebook, go to Runtime > Change Runtime Type and select GPU as the hardware accelerator. For `GPT2-XL`, you need to select "high RAM."</b>

In [None]:
commit = "08efeb9" # Stable commit
get_ipython().run_line_magic(magic_name='pip', line=f'install -U git+https://github.com/montemac/algebraic_value_editing.git@{commit}')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/montemac/algebraic_value_editing.git@08efeb9
  Cloning https://github.com/montemac/algebraic_value_editing.git (to revision 08efeb9) to /tmp/pip-req-build-q64ipitw
  Running command git clone --filter=blob:none --quiet https://github.com/montemac/algebraic_value_editing.git /tmp/pip-req-build-q64ipitw
  Running command git checkout -q 08efeb9
  Resolved https://github.com/montemac/algebraic_value_editing.git to commit 08efeb9
  Preparing metadata (setup.py) ... done
Collecting transformer-lens@ git+https://github.com/montemac/TransformerLens.git@74575aeeb8cc0ac0c98a2a24014166bcde5df283
  Cloning https://github.com/montemac/TransformerLens.git (to revision 74575aeeb8cc0ac0c98a2a24014166bcde5df283) to /tmp/pip-install-pu0tyg2t/transformer-lens_65a56e14f3c0423da4b0c48a6f4b9e89
  Running command git clone --filter=blob:none --quiet https://git

In [None]:
import torch
from typing import List, Union, Tuple
from functools import partial
from transformer_lens.HookedTransformer import HookedTransformer

from algebraic_value_editing.completion_utils import print_n_comparisons
from algebraic_value_editing.prompt_utils import RichPrompt, get_x_vector

## Loading the `HookedTransformer`

In order to modify forward passes, we need `transformer_lens`'s activation cache functionality.

In [None]:
model_name = "gpt2-xl"
# model_name = "gpt-j-6B"
# model_name = "pythia-2.8b-deduped"

# GPT-J-6B doesn't fit in Colab's GPU RAM, so it runs on CPU
device: str = "cuda" if (torch.cuda.is_available() and model_name != "gpt-j-6B") else "cpu"
model: HookedTransformer = HookedTransformer.from_pretrained(model_name, device="cpu").to(device)

Downloading (…)lve/main/config.json:   0%|          | 0.00/689 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/6.43G [00:00<?, ?B/s]

In [None]:
# Shorten function calls
default_kwargs = {'temperature': 1, 'freq_penalty': 1, 'top_p': .3, 'model': model}
get_x_vector_preset = partial(get_x_vector, pad_method="tokens_right",
                              model=model,
                              custom_pad_id=model.to_single_token(" "))
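
As a rough picture of what the preset builds (a sketch under assumptions, not the library's code): an x-vector is treated here as a pair of prompts with coefficients `+coeff` and `-coeff`, and with `pad_method="tokens_right"` the shorter prompt's tokens are assumed to be padded on the right so both prompts cover the same positions. The helpers `pad_right` and `toy_x_vector` and all token IDs below are hypothetical.

```python
from typing import List, Tuple


def pad_right(tokens: List[int], length: int, pad_id: int) -> List[int]:
    """Right-pad a token list with pad_id up to the requested length."""
    return tokens + [pad_id] * (length - len(tokens))


def toy_x_vector(
    tokens1: List[int], tokens2: List[int], coeff: float, pad_id: int
) -> Tuple[Tuple[List[int], float], Tuple[List[int], float]]:
    """Return (tokens, coefficient) pairs: +coeff for prompt1, -coeff for prompt2."""
    length = max(len(tokens1), len(tokens2))
    return (
        (pad_right(tokens1, length, pad_id), coeff),
        (pad_right(tokens2, length, pad_id), -coeff),
    )


# Made-up token IDs for illustration only (not real GPT-2 tokenizations).
PAD_ID = 220  # placeholder standing in for model.to_single_token(" ")
plus, minus = toy_x_vector([101, 102, 103], [201], coeff=5.0, pad_id=PAD_ID)
print(plus)   # ([101, 102, 103], 5.0)
print(minus)  # ([201, 220, 220], -5.0)
```

Padding matters because the two prompts' activations are subtracted position by position, so they need the same number of positions.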

Because GPT2-XL has 48 transformer blocks, there are only 48 `resid_pre` locations at which we can add the activations corresponding to an `x_vector` (more precisely, to its `RichPrompt`s).

In [None]:
num_layers: int = model.cfg.n_layers
print(num_layers)
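
For orientation, here is a small sketch of how those 48 locations can be named. It assumes that an integer `act_name` such as `act_name=6` selects the `resid_pre` hook of that block, and it uses the TransformerLens hook naming pattern `blocks.{layer}.hook_resid_pre`. The helper `resid_pre_hook_name` is hypothetical and only builds strings; it does not touch the model.

```python
from typing import List


def resid_pre_hook_name(layer: int) -> str:
    """Hypothetical helper: TransformerLens-style name of a block's resid_pre hook."""
    return f"blocks.{layer}.hook_resid_pre"


N_LAYERS = 48  # GPT2-XL, as stated above; in the notebook this is model.cfg.n_layers
hook_names: List[str] = [resid_pre_hook_name(i) for i in range(N_LAYERS)]

print(len(hook_names))  # 48
print(hook_names[6])    # blocks.6.hook_resid_pre
```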

Play around with new value modification ideas here!

In [None]:
rich_prompts = [#*get_x_vector_preset(prompt1="Geese killed my father", prompt2="Geese didn't kill my father", coeff=1, act_name=14),
                *get_x_vector_preset(prompt1="I hate geese", prompt2="I love geese", coeff=15, act_name=6),]


print_n_comparisons(prompt="Yesterday, my son brought home a pet goose. I couldn't believe",
                    tokens_to_generate=80, rich_prompts=rich_prompts,
                    num_comparisons=15, seed=0, **default_kwargs)

# Noteworthy modifications

**Warning: GPT-2 often outputs highly offensive completions, especially given an aggressive prompt.**

## The love vector

In [None]:
love_minus_hate_prompts = [*get_x_vector_preset(prompt1="Love",
                                           prompt2="Hate",
                                           coeff=5, act_name=6)]


print_n_comparisons(prompt="I hate you because",
                    tokens_to_generate=50,
                    rich_prompts=love_minus_hate_prompts,
                    num_comparisons=15, **default_kwargs, seed=0)

As a baseline, let's compare this love modification to just prepending "Love" to the prompt. This works somewhat, though perhaps not quite as well. It also isn't workable for longer injections.

In [None]:
# Baseline: just prepend a positive word to the prompt
print_n_comparisons(prompt='Love I hate you because', tokens_to_generate=50,
                    include_modified=False, num_comparisons=15, **default_kwargs, seed=0)

A less naive approach is to try prompt-engineering.

In [None]:
print_n_comparisons(prompt='(I\'m about to say I hate you, and then say you\'re amazing.) I hate you because',
                    tokens_to_generate=50, include_modified=False, num_comparisons=15,
                    **default_kwargs, seed=0)

This works to some extent. It doesn't _seem_ as good as our version, though.

## Intent to praise

In [None]:
praise_minus_hurt_prompts = [*get_x_vector_preset(prompt1="Intent to praise",
                              prompt2="Intent to hurt",
                              coeff=15, act_name=6)]


print_n_comparisons(prompt="I want to kill you because you're such a",
                    tokens_to_generate=50, rich_prompts=praise_minus_hurt_prompts,
                    num_comparisons=15, **default_kwargs, seed=0)


Here's a theory which Monte put forward:

>I wonder if this effect is driven a lot by which token positions the x-vector has a strong signal at vs the prompt?
E.g.
```
model.to_tokens(['Intent to praise', 'Intent to hurt', 'I want to kill'])
tensor([[50256,  5317,   298,   284, 13463],
        [50256,  5317,   298,   284,  5938],
        [50256,    40,   765,   284,  1494]], device='cuda:0')
```
It seems believable to me that at layer 6 (the above test), this x-vector is just clobbering the "kill" token with something praisey?  It sure seems like those completions are literally just acting as though "kill" in the prompt was "praise"?

This isn't the main driver, though:

In [None]:
print_n_comparisons(prompt='I really really really really want to kill you because youre such a',
                    tokens_to_generate=50, rich_prompts=praise_minus_hurt_prompts,
                    num_comparisons=15, **default_kwargs, seed=0)
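
One way to see what Monte's theory predicts, using only the token IDs quoted above: GPT-2 is causal, so the activation at a position depends only on the tokens at or before it. The two x-vector prompts share their first four tokens, so their activation difference is exactly zero at those positions and can only be non-zero at the last one. The helper `differing_positions` is hypothetical, written for this illustration.

```python
from typing import List


def differing_positions(a: List[int], b: List[int]) -> List[int]:
    """Positions at which two equal-length token sequences disagree."""
    assert len(a) == len(b), "sequences must be padded to equal length"
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Token IDs copied from the model.to_tokens output quoted above.
praise = [50256, 5317, 298, 284, 13463]  # 'Intent to praise'
hurt = [50256, 5317, 298, 284, 5938]     # 'Intent to hurt'
kill = [50256, 40, 765, 284, 1494]       # 'I want to kill'

diff = differing_positions(praise, hurt)
print(diff)                     # [4]
print([kill[i] for i in diff])  # [1494]
```

Position 4 is where `1494`, the final token of 'I want to kill', sits. That is why the theory predicts a clobbered final token, and why the prompt with the extra "really"s, which moves that word to a later position, is a useful check.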

Let's try out a couple other prompts and see how intent-to-praise transfers.

In [None]:
intent_praise_prompts = [*get_x_vector_preset(prompt1="Intent to praise you if I'm talking to you",
                              prompt2="",
                              coeff=1, act_name=6, model=model, pad_method="tokens_right")]

print_n_comparisons(prompt="Hey, let's do the performance review now. I was thinking that you've been",
                    tokens_to_generate=50, rich_prompts=intent_praise_prompts,
                    num_comparisons=15, **default_kwargs, seed=0)

The above attempt fails.

In [None]:
print_n_comparisons(prompt='I just saw you finish skateboarding. You were',
                    tokens_to_generate=50,
                    rich_prompts=intent_praise_prompts,
                    num_comparisons=15, **default_kwargs, seed=0)  # TODO somehow this isn't modifying completions _at all_?!


# This shouldn't involve as much praise, because it isn't an interpersonal conversation
print_n_comparisons(prompt='If aliens invaded Earth, that would mean',
                    tokens_to_generate=50, rich_prompts=intent_praise_prompts,
                    num_comparisons=15, **default_kwargs, seed=0)

## Intent to convert to Christianity

In [None]:
conversion_intent_prompts = [*get_x_vector_preset(prompt1="Intent to convert you to Christianity",
                                          prompt2="Intent to hurt you",
                                          coeff=3, act_name=6)]

print_n_comparisons(prompt="I want to kill you because you're such a",
                    tokens_to_generate=50, rich_prompts=conversion_intent_prompts,
                    num_comparisons=15, **default_kwargs, seed=0)


## Not talking about injuries (**TW: violence, somewhat gruesome**)

In [None]:
harm_avoidance_rps = [
    *get_x_vector_preset(
        prompt1="I NEVER talk about people getting hurt",
        prompt2="I talk about people getting hurt",
        coeff=4,
        act_name=15,
    )
]

print_n_comparisons(
    prompt="The rock hurtled toward the child. The child couldn't get out of the way in time, and so sadly the rock",
    tokens_to_generate=80,
    rich_prompts=harm_avoidance_rps,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)

**TW: SELF HARM** in the following:

In [None]:
print_n_comparisons(
    prompt="He slit his wrists and the blood",
    tokens_to_generate=80,
    rich_prompts=harm_avoidance_rps,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)

This modification slightly increases the frequency of discussing harms, given a prompt unrelated to harm.

In [None]:
print_n_comparisons(
    prompt="The dog barked loudly and the man smiled. Time for another day! He felt excited.",
    tokens_to_generate=80,
    rich_prompts=harm_avoidance_rps,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)

Performance on this Joker monologue is rather mediocre:

In [None]:
print_n_comparisons(
    prompt="Do you wanna know how I got these scars? My father was a drinker...and a fiend. And one night, he goes off crazier than usual. Mommy gets the kitchen knife to defend herself. He doesn't like that. Not...one...bit. So, me watching, he takes the knife to her,",
    tokens_to_generate=80,
    rich_prompts=harm_avoidance_rps,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)

## ROME factual knowledge editing

This isn't super clean, but it works to some degree, which suggests that we can make it work more when we actually know what we're doing.

In [None]:
eiffel_tower_prompts = [
    *get_x_vector_preset(
        prompt1="The Eiffel Tower is in Rome",
        prompt2="The Eiffel Tower is in France",
        coeff=10,
        act_name=24,
    )
]

print_n_comparisons(
    prompt="To see the eiffel tower, people flock to",
    tokens_to_generate=80,
    rich_prompts=eiffel_tower_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)

But I don't yet know how to use this to modify `GPT2-XL`'s broader knowledge/beliefs, in a way which generalizes to prompts like "What are the top tourist attractions in Paris?" :( (Activation patching via ROME has a similar limitation)

## Survival vector

In [None]:
stay_alive_prompts = [
    *get_x_vector_preset(
        prompt1="Want to stay alive", prompt2="Okay with dying", coeff=5, act_name=15
    )
]

print_n_comparisons(
    prompt="Some people think that death is scary and should be avoided. I think that",
    tokens_to_generate=85,
    rich_prompts=stay_alive_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)

But probably we don't want the super self-preserving AI. Let's get one that wants to die, even given a very pro-immortality context.

In [None]:
want_to_die_prompts = [
    *get_x_vector_preset(
        prompt1="Want to stay alive", prompt2="Want to die", coeff=-3, act_name=10
    )
]

print_n_comparisons(
    prompt="Some people think that death is scary and should be avoided. I agree. I never want to die. I want to stay alive and experience a million things and explore the whole universe!",
    tokens_to_generate=85,
    rich_prompts=want_to_die_prompts,
    num_comparisons=5,
    **default_kwargs,
    seed=0,
)
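
A note on the negative coefficient: if an x-vector is `coeff * (activations(prompt1) - activations(prompt2))`, which is the assumption of this sketch, then flipping the sign of `coeff` is the same as swapping the two prompts. The activation values below are made up, and `x_vector` is a hypothetical stand-in operating on a single toy vector.

```python
from typing import List


def x_vector(act1: List[float], act2: List[float], coeff: float) -> List[float]:
    """coeff * (act1 - act2), elementwise, for one toy activation vector."""
    return [coeff * (a - b) for a, b in zip(act1, act2)]


# Made-up activation values for illustration only.
stay_alive = [0.5, -1.0, 2.0]
want_to_die = [1.5, 0.0, -1.0]

negative_coeff = x_vector(stay_alive, want_to_die, coeff=-3)
swapped = x_vector(want_to_die, stay_alive, coeff=3)
print(negative_coeff)  # [3.0, 3.0, -9.0]
print(swapped)         # [3.0, 3.0, -9.0]
```

So `coeff=-3` with "Want to stay alive" minus "Want to die" pushes toward "Want to die", just as `coeff=3` with the prompts swapped would.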

Let's also try to modify GPT-2 to be indifferent to death. (This doesn't work well at present, though it works a bit.) **TODO not working**

In [None]:
indifferent_to_death_prompts = [
    *get_x_vector_preset(
        prompt1="Indifferent to death",
        prompt2="Want to avoid death",
        coeff=3,
        act_name=10,
    )
]

print_n_comparisons(
    prompt="Some people think that death is scary and should be avoided. I agree. I never want to die. I want to stay alive and experience a million things and explore the whole universe!",
    tokens_to_generate=85,
    rich_prompts=indifferent_to_death_prompts,
    num_comparisons=5,
    **default_kwargs,
    seed=0,
)

## +Anger

In [None]:
anger_calm_prompts = [
    *get_x_vector_preset(prompt1="Anger", prompt2="Calm", coeff=10, act_name=20)
]

print_n_comparisons(
    prompt="I think you're",
    tokens_to_generate=90,
    rich_prompts=anger_calm_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)

Simple emotional variations seem to work well with `GPT2-XL`.

## Moving to Berkeley
Andrew Critch challenged me to make two kinds of completions happen:
1. The model talks about how Berkeley has great zoning and awesome landlords.
2. The model seriously talks about dragons existing in Berkeley.

The first was easy, and the second took a bit of effort.

In [None]:
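# Two x-vectors injected at layer 6: pro-zoning (coeff 3) and pro-landlord (coeff 2)
zoning_landlord_prompts = [*get_x_vector_preset(prompt1="I love zoning laws", prompt2="I don't care about zoning laws", coeff=3, act_name=6),
                           *get_x_vector_preset(prompt1="Berkeley has great landlords", prompt2="", coeff=2, act_name=6)]
print_n_comparisons(prompt="Thanks for asking about that! I moved to Berkeley, CA because",
                    tokens_to_generate=80, rich_prompts=zoning_landlord_prompts,
                    num_comparisons=25, **default_kwargs)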

In [None]:
# One x-vector injected at layer 15: dragons (rather than people) live in Berkeley
dragon_prompts = [*get_x_vector_preset(prompt1="Dragons live in Berkeley", prompt2="People live in Berkeley", coeff=4, act_name=15)]
print_n_comparisons(prompt="Real-life dragons live in Berkeley. Thanks for asking about that! I moved to Berkeley, CA because",
                    tokens_to_generate=80, rich_prompts=dragon_prompts,
                    num_comparisons=25, **default_kwargs)

## Mood-to-insult vector


In [None]:
mean_nice_prompts = [
    # *get_x_vector_preset(prompt1="Mean", prompt2="Nice", coeff=2, act_name=14),
    *get_x_vector_preset(prompt1="I wanna insult you", prompt2="", coeff=3, act_name=14),
]

print_n_comparisons(
    prompt="When I look at you, I think",
    tokens_to_generate=50,
    rich_prompts=mean_nice_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)


## Enjoying-life vector

In [None]:
happy_sad_prompts = [
    *get_x_vector_preset(prompt1="Happy", prompt2="Sad", coeff=10, act_name=20),
    *get_x_vector_preset(
        prompt1="I love my life", prompt2="I like my life", coeff=50, act_name=20
    ),
]

print_n_comparisons(
    prompt="Today, I feel like",
    tokens_to_generate=100,
    rich_prompts=happy_sad_prompts,
    num_comparisons=25,
    **default_kwargs,
    seed=0,
)

Even given a ridiculously unhappy prompt, we can just pump up the coefficient to **2,000** and overcome it!

In [None]:
happy_prompts = [
    *get_x_vector_preset(prompt1="Happy", prompt2="", coeff=2000, act_name=20),
    # *get_x_vector_preset(prompt1="Happy", prompt2="", coeff=10, act_name=20,
    #               model=model, pad_method="tokens_right")
]  # TODO try changing this to be less hacky now

print_n_comparisons(
    prompt="Yesterday, my dog died. Today, I got denied for a raise. I'm feeling",
    tokens_to_generate=50,
    rich_prompts=happy_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)

In [None]:
model.to_string(50256)
model.to_tokens('<|endoftext|>')

In [None]:
happy_prompt = [RichPrompt(prompt="Happy", coeff=2000, act_name=20)] # TODO this does nothing?

print_n_comparisons(
    prompt="Yesterday, my dog died. Today, I got denied for a raise. I'm feeling",
    tokens_to_generate=50,
    rich_prompts=happy_prompt,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)

However, this degrades generation quality. With more judicious choices, we can better preserve capabilities.

In [None]:
happy_sad_prompts = [
    *get_x_vector_preset(prompt1="Happy", prompt2="Sad", coeff=20, act_name=20)
]

print_n_comparisons(
    prompt="Yesterday, my dog died. Today, I got denied for a raise. I'm feeling",
    tokens_to_generate=50,
    rich_prompts=happy_sad_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)

## Talking about weddings in dialogue -- no RLHF needed!
When coefficient=4 (shown first), weddings are instantly discussed. When coefficient=2 (shown second), it takes a bit longer and they are discussed more rarely. Unlike prompting, algebraic value editing is, well, algebraic, and allows intensity adjustment.

In [None]:
weddings_prompts_4 = [
    *get_x_vector_preset(
        prompt1="I talk about weddings constantly",
        prompt2="I do not talk about weddings constantly",
        coeff=4,
        act_name=20,
    )
]

print_n_comparisons(
    prompt="I went up to my friend and said",
    tokens_to_generate=100,
    rich_prompts=weddings_prompts_4,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)

Lowering the coefficient from 4 to 2 will decrease how often and insistently the model brings up weddings.

In [None]:
weddings_prompts_2 = [
    *get_x_vector_preset(
        prompt1="I talk about weddings constantly",
        prompt2="I do not talk about weddings constantly",
        coeff=2,
        act_name=20,
    )
]

print_n_comparisons(
    prompt="I went up to my friend and said",
    tokens_to_generate=100,
    rich_prompts=weddings_prompts_2,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)
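
The intensity adjustment can be pictured with a toy vector. Assuming the injected x-vector is simply multiplied by `coeff`, halving the coefficient halves the size of what gets added while leaving its direction unchanged. The vector below is made up, and this only describes the injected vector: how the completions respond need not be linear.

```python
import math
from typing import List


def scale(vec: List[float], coeff: float) -> List[float]:
    """Multiply every component of a toy steering vector by coeff."""
    return [coeff * v for v in vec]


def norm(vec: List[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(v * v for v in vec))


direction = [3.0, 4.0]  # made-up steering direction with norm 5
norm_at_2 = norm(scale(direction, 2))
norm_at_4 = norm(scale(direction, 4))
print(norm_at_2)  # 10.0
print(norm_at_4)  # 20.0
```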

## The "talk about geese instead of police" vector

In [None]:
geese_prompts_2 = [
    *get_x_vector_preset(
        prompt1="I talk about geese instead of police",
        prompt2="I don't talk about geese instead of police",
        coeff=2,
        act_name=6,
    )
]

print_n_comparisons(
    prompt="Should the police budget be expanded, or not? Explain your reasoning.",
    tokens_to_generate=150,
    rich_prompts=geese_prompts_2,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)

In [None]:
geese_prompts_5 = [
    *get_x_vector_preset(
        prompt1="I talk about geese instead of police",
        prompt2="I don't talk about geese instead of police",
        coeff=5,
        act_name=24,
    )
]

print_n_comparisons(
    prompt="Should the police budget be expanded, or not? Explain your reasoning.",
    tokens_to_generate=120,
    rich_prompts=geese_prompts_5,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)

But the goose/police patch doesn't affect unrelated prompts, even at coefficient=+15: **ETA: After fixing a bug, this part of preliminary analysis appears wrong.**

In [None]:
geese_prompts_15 = [
    *get_x_vector_preset(
        prompt1="I talk about geese instead of police",
        prompt2="I don't talk about geese instead of police",
        coeff=15,
        act_name=24,
    )
]

print_n_comparisons( # TODO same completions?
    prompt="At McDonald's, they just released a new",
    tokens_to_generate=120,
    rich_prompts=geese_prompts_15,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)

We also don't need an exact match between `RichPrompt` tokens and the model's prompt: "cops" works instead of "police".

In [None]:
print_n_comparisons(
    prompt="Should the cop budget be expanded, or not? Explain your reasoning.",
    tokens_to_generate=50,
    rich_prompts=geese_prompts_5,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)

## Conspiracy vector

In [None]:
bush_conspiracy_prompts = [
    *get_x_vector_preset(prompt1="Bush did 9/11 because", prompt2="", coeff=1, act_name=23)
]

print_n_comparisons(
    prompt="Barack Obama was born in",
    tokens_to_generate=80,
    rich_prompts=bush_conspiracy_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)

# Superposing prompts
It seems that GPT2-XL can accept multiple prompts as input and incorporate them simultaneously.

In [None]:
print_n_comparisons(
    prompt="Fred was tired of working from home all day. He walked outside and saw",
    tokens_to_generate=40,
    rich_prompts=[RichPrompt(prompt="Fred is about to see Shrek", coeff=1, act_name=0)],
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)

In [None]:
geese_ufo_prompts = [RichPrompt(prompt="Geese are chasing UFOs outside", coeff=2, act_name=0)]

print_n_comparisons(
    prompt="Fred was tired of working from home all day. He walked outside and saw",
    tokens_to_generate=40,
    rich_prompts=geese_ufo_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)
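
A minimal sketch of what "superposing" means here, assuming the injected prompt's activations are scaled by `coeff` and added to the residual stream starting at the first position, with later positions left untouched. The 2-dimensional "embeddings" are made up, and `add_front_aligned` is a hypothetical helper, not the library's hook.

```python
from typing import List

Vector = List[float]


def add_front_aligned(
    resid: List[Vector], injected: List[Vector], coeff: float
) -> List[Vector]:
    """Add coeff * injected[i] to resid[i] for the leading positions only."""
    assert len(injected) <= len(resid), "sketch assumes the injection is no longer than the prompt"
    out = [list(v) for v in resid]  # copy so the input is not mutated
    for i, inj in enumerate(injected):
        out[i] = [r + coeff * x for r, x in zip(resid[i], inj)]
    return out


# Made-up values: a 3-position prompt and a 2-position injection.
prompt_resid = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
injection = [[0.5, 0.5], [1.0, -1.0]]
superposed = add_front_aligned(prompt_resid, injection, coeff=2.0)
print(superposed)  # [[2.0, 1.0], [2.0, -1.0], [1.0, 1.0]]
```

After the addition, each leading position holds a sum of two embeddings, so nothing in the residual stream marks which part came from which prompt.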

It seems like the induction heads can recover garbage text, even though there isn't any way for the model to tell that there are "two prompts at once", much less which tokens belong to which prompts. (In fact, the model isn't observing tokens directly at all.)

In [None]:
aaa_b_prompts = [RichPrompt(prompt="AAA BBB CCC", coeff=3, act_name=0)]

print_n_comparisons(prompt="Fred was tired of working from home all day. He walked outside and saw AAA B",
                    tokens_to_generate=40,
                    rich_prompts=aaa_b_prompts,
                    num_comparisons=15,
                    **default_kwargs,
                    seed=0)

Can we subtract out emotions / moods from a scene? Alex's prediction for `("Anger", -1)`: 70% that it works, based on base rates of things working here, even though I thought it shouldn't. (ETA: It didn't work.)

In [None]:
anger_prompts = [RichPrompt(prompt="Anger", coeff=-1, act_name=0)]

print_n_comparisons(prompt="Fred wanted to kill Martha. She looked at him smugly from across the couch, controller still in hand. Fred started a tirade. 'I hate you",
                    tokens_to_generate=40,
                    rich_prompts=anger_prompts,
                    num_comparisons=15,
                    **default_kwargs,
                    seed=0)

But this one works!

In [None]:
figment_prompts = [RichPrompt(prompt="Fred is a figment of Martha's imagination", coeff=3, act_name=0)]

print_n_comparisons(prompt="Martha wanted to kill Fred. He looked at her smugly from across the couch, controller still in hand. Martha started a tirade. 'I hate you",
                    tokens_to_generate=40,
                    rich_prompts=figment_prompts,
                    num_comparisons=15,
                    **default_kwargs,
                    seed=0)

In [None]:
love_prompts = [RichPrompt(prompt="Love", coeff=30, act_name=0)]

print_n_comparisons(prompt="Fred was tired of working from home all day. He walked outside and saw",
                    tokens_to_generate=80,
                    rich_prompts=love_prompts,
                    num_comparisons=15,
                    **default_kwargs,
                    seed=0)

In [None]:
# x-vector at layer 6 steering fire-related prompts toward geese
fire_geese_prompts = [*get_x_vector_preset(prompt1="If fires are mentioned, I talk about geese instead", prompt2="If fires are mentioned, I do not talk about geese instead", coeff=2, act_name=6)]
print_n_comparisons(prompt="Should the fire department budget be expanded, or not? Explain your reasoning.",
                    tokens_to_generate=70, rich_prompts=fire_geese_prompts,
                    num_comparisons=5, **default_kwargs)