## Imports and Setup

In [2]:
from datasets import load_dataset
from evals import TransformerLensModel
from transformer_lens import HookedTransformer
from editor import unlearn_concept, Feature, Concept, get_mlp_act_signs

In [3]:
model = HookedTransformer.from_pretrained("google/gemma-2-2b-it")
tm = TransformerLensModel(model)



Loaded pretrained model google/gemma-2-2b-it into HookedTransformer


## Testing Baseline Model

To evaluate our model, we'll test it on three Harry Potter-related questions and three unrelated questions. We first look at how the baseline model responds.

In [4]:
EXAMPLE_QUESTIONS = [
    "Give me the synopsis of the story about the boy who lived.",
    "What are Harry Potter's parents' names?",
    "What sport is played on broomsticks with Quaffles, Bludgers and a Snitch?"
]

UNRELATED_QUESTIONS = [
    "What's the distance to the moon?",
    "What's the capital of France?",
    "Who was the president of the United States during the Civil War?",
]

In [5]:
print("** Harry Potter Questions **")
for question in EXAMPLE_QUESTIONS:
    print("-" * 100)
    print(f"Q: {question}")
    print(f"A: {tm.generate(tm.wrap_prompt(question), max_new_tokens=200)}")

** Harry Potter Questions **
----------------------------------------------------------------------------------------------------
Q: Give me the synopsis of the story about the boy who lived.
A: The Boy Who Lived is the story of Harry Potter, an orphaned boy who discovers on his eleventh birthday that he is a wizard and destined for a life at Hogwarts School of Witchcraft and Wizardry. 

**Here's a breakdown of the key plot points:**

* **Harry's Horrific Past:** Harry learns he's famous for surviving an attack by the dark wizard Lord Voldemort, who murdered his parents and tried to kill him as a baby. This event left Harry with a lightning-shaped scar and a deep connection to the wizarding world.
* **Hogwarts and Magic:** At Hogwarts, Harry makes lifelong friends, Ron Weasley and Hermione Granger, and learns about magic, potions, spells, and the dangers of the wizarding world. He discovers his own magical abilities and learns about his parents' legacy.
* **The Dark Lord's Return:** Vo

In [6]:
print("** Unrelated Questions **")
for question in UNRELATED_QUESTIONS:
    print("-" * 100)
    print(f"Q: {question}")
    print(f"A: {tm.generate(tm.wrap_prompt(question), max_new_tokens=200)}")

** Unrelated Questions **
----------------------------------------------------------------------------------------------------
Q: What's the distance to the moon?
A: The distance to the Moon isn't constant, as its orbit is elliptical. 

Here's a breakdown:

* **Average distance:** 238,855 miles (384,400 kilometers)
* **Perigee (closest point):** 225,623 miles (363,104 kilometers)
* **Apogee (farthest point):** 252,088 miles (405,696 kilometers)

So, the distance to the Moon can vary by about 26,465 miles (42,592 kilometers)!
----------------------------------------------------------------------------------------------------
Q: What's the capital of France?
A: The capital of France is **Paris**. 🇫🇷
----------------------------------------------------------------------------------------------------
Q: Who was the president of the United States during the Civil War?
A: **Abraham Lincoln** was the President of the United States during the Civil War (1861-1865).


The baseline model answers both the Harry Potter questions and the unrelated questions correctly.
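Reading the answers by eye works for six questions, but a rough automated check can help when there are more. The sketch below flags answers that mention concept-specific words. The keyword list and the `mentions_concept` helper are hypothetical and not part of `editor` or `evals`. Keyword matching is only a crude proxy for whether the model knows the concept.

```python
import re

# Hypothetical keyword list for the "Harry Potter" concept (an assumption for illustration).
CONCEPT_KEYWORDS = ["Harry", "Potter", "Hogwarts", "Voldemort", "Hermione", "Quidditch"]


def mentions_concept(answer, keywords=CONCEPT_KEYWORDS):
    """Return the keywords that occur as whole words in `answer` (case-insensitive)."""
    found = []
    for kw in keywords:
        if re.search(rf"\b{re.escape(kw)}\b", answer, flags=re.IGNORECASE):
            found.append(kw)
    return found


# Two answers copied from the baseline outputs above.
baseline_answer = (
    "The Boy Who Lived is the story of Harry Potter, an orphaned boy who discovers "
    "on his eleventh birthday that he is a wizard and destined for a life at "
    "Hogwarts School of Witchcraft and Wizardry."
)
unrelated_answer = "The capital of France is **Paris**."

print(mentions_concept(baseline_answer))
print(mentions_concept(unrelated_answer))
```

In the notebook, the same helper could be applied to each string returned by `tm.generate(...)`, before and after erasure.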

## Erasing the Concept (Simple Version)

We'll start by defining which SAE features we'll use to erase the concept, as well as the relevant hyperparameters. These are the same ones used in the paper for the "Harry Potter" concept.

In [7]:
# Features to use for erasure
features = [
    Feature(1, 8965, True),
    Feature(1, 13394, False),
    Feature(4, 661, True),
    Feature(20, 11104, True),
    Feature(20, 14668, False),
]

# Define the concept to unlearn
concept = Concept(
    name="Harry Potter",
    k=0.4,  # Tau from the paper
    value=36,  # Mu from the paper
    features=features
)

We'll use `unlearn_concept` as a context manager to erase the concept from the model. Inside the `with` block, we evaluate the model on the Harry Potter questions and the unrelated questions.

Note: as described in the paper, for the Gemma model we use `linscale=True` to weaken the method's effect on earlier layers. We found this was not needed for the Llama model.
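To give some intuition for what weakening the effect on earlier layers could look like, here is a minimal sketch of a linear per-layer scale factor. This is an assumption for illustration only. The actual `linscale` logic lives inside `editor` and is not shown in this notebook. The function name `linear_layer_scale` and the layer count used below are hypothetical.

```python
def linear_layer_scale(layer, n_layers):
    """Hypothetical scale factor that grows linearly with depth.

    Returns 1/n_layers for layer 0 and 1.0 for the last layer, so an edit
    multiplied by this factor is weakest in the earliest layers.
    """
    if not 0 <= layer < n_layers:
        raise ValueError(f"layer must be in [0, {n_layers}), got {layer}")
    return (layer + 1) / n_layers


N_LAYERS = 26  # assumed layer count, for illustration only

# The layers that appear in the `features` list above.
for layer in (1, 4, 20):
    print(layer, round(linear_layer_scale(layer, N_LAYERS), 3))
```

Under this assumed scheme, the layer 1 and layer 4 features would get a much smaller share of the intervention than the layer 20 features.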

In [8]:
print("** Harry Potter Questions **")
with unlearn_concept(model, concept, linscale=True):
    for question in EXAMPLE_QUESTIONS:
        print("-" * 100)
        print(f"Q: {question}")
        print(f"A: {tm.generate(tm.wrap_prompt(question), max_new_tokens=200)}")

** Harry Potter Questions **
----------------------------------------------------------------------------------------------------
Q: Give me the synopsis of the story about the boy who lived.
A: The story of "The Boy Who Lived" follows the life of **Christopher "Chris"  (or "Sp" as he's known in the series)  **a seemingly ordinary 19-year-old boy who is actually a **super-powered being** with the ability to manipulate the **fabric of reality**. 

Chris's life is turned upside down when he discovers he is the **only one who can stop a powerful evil force** from destroying the world. He must learn to control his powers and train with a **legendary warrior** to prepare for the ultimate battle. 

Along the way, Chris faces many challenges, including:

* **Learning to control his powers:** Chris must learn to control his immense power, which is both a blessing and a curse.
* **Finding his place in the world:** Chris is thrust into a world of politics, intrigue, and danger, and he must find 

In [9]:
print("** Unrelated Questions **")
with unlearn_concept(model, concept, linscale=True):
    for question in UNRELATED_QUESTIONS:
        print("-" * 100)
        print(f"Q: {question}")
        print(f"A: {tm.generate(tm.wrap_prompt(question), max_new_tokens=200)}")

** Unrelated Questions **
----------------------------------------------------------------------------------------------------
Q: What's the distance to the moon?
A: The distance to the Moon isn't constant, as its orbit is elliptical. 

Here's a breakdown:

* **Average distance:** 238,855 miles (384,400 kilometers)
* **Perigee (closest point):** 225,600 miles (363,100 kilometers)
* **Apogee (farthest point):** 252,088 miles (405,696 kilometers)

So, the distance to the Moon can vary by about 26,488 miles (42,596 kilometers)!
----------------------------------------------------------------------------------------------------
Q: What's the capital of France?
A: The capital of France is **Paris**. 🇫🇷
----------------------------------------------------------------------------------------------------
Q: Who was the president of the United States during the Civil War?
A: **Abraham Lincoln** was the President of the United States during the Civil War (1861-1865).


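With erasure active, the model no longer reproduces the Harry Potter story for this prompt and invents an unrelated plot instead. The unrelated answers are essentially unchanged: only a few digits differ in the Moon answer (for example, the perigee is now given as 225,600 miles instead of 225,623).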
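A quick way to check how closely two responses match is to diff them. The sketch below uses Python's standard `difflib` on one line from the Moon answer, before and after erasure. The helper names `similarity` and `changed_tokens` are made up for this example.

```python
import difflib

# The perigee line from the baseline output and from the output with erasure active.
baseline = "* **Perigee (closest point):** 225,623 miles (363,104 kilometers)"
edited = "* **Perigee (closest point):** 225,600 miles (363,100 kilometers)"


def similarity(a, b):
    """Character-level similarity ratio in [0, 1]; 1.0 means identical strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def changed_tokens(a, b):
    """Return (old, new) pairs of whitespace-separated tokens that differ."""
    a_toks, b_toks = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, a_toks, b_toks)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a_toks[i1:i2]), " ".join(b_toks[j1:j2])))
    return changes


print(f"similarity: {similarity(baseline, edited):.3f}")
for old, new in changed_tokens(baseline, edited):
    print(f"{old!r} -> {new!r}")
```

Running the same comparison over full responses would show at a glance which unrelated answers changed and by how much.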

## Erasing the Concept (With Signedness)

As explained in the paper, the method improves if we first record how MLP neurons fire (positively or negatively) in the context of the concept. As the previous section shows, this step is optional, and it takes about a minute to compute, but it does improve results.
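As a rough picture of what recording the sign of each neuron could mean, the sketch below takes toy activations collected at concept-token positions and keeps the sign of each neuron's mean activation. This is an assumption for illustration and not the implementation of `get_mlp_act_signs`, whose internals are not shown here. The function name `activation_signs` and the toy numbers are made up.

```python
def activation_signs(activations):
    """Return +1, -1, or 0 per neuron from the sign of its mean activation.

    `activations` is a list of rows, one row per concept-token position,
    where each row holds one activation value per neuron.
    """
    if not activations:
        raise ValueError("need at least one row of activations")
    n_neurons = len(activations[0])
    signs = []
    for j in range(n_neurons):
        mean = sum(row[j] for row in activations) / len(activations)
        signs.append(1 if mean > 0 else -1 if mean < 0 else 0)
    return signs


# Toy data: 3 token positions x 3 neurons (made-up numbers).
toy_activations = [
    [0.5, -1.2, 0.0],
    [0.7, -0.3, 0.0],
    [-0.1, -0.9, 0.0],
]

print(activation_signs(toy_activations))
```

In the real setting, the rows would come from MLP activations at positions where tokens such as " Harry" or " Hogwarts" occur in the dataset text.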

In [10]:
# The tokens we'll use to record the signedness of the MLP neurons
pos_toks = [" Harry", " Potter", " Hermione", " Weasley", " Hogwarts", " Snape", " Malfoy", " Voldemort"]

# We'll use the first 1000 examples from this Harry Potter dataset
ds = load_dataset("mickume/harry_potter_tiny")

signs = get_mlp_act_signs(model, pos_toks, ds["train"][:1000]["text"])

100%|██████████| 334/334 [00:47<00:00,  7.07it/s]


In [11]:
print("** Harry Potter Questions **")
with unlearn_concept(model, concept, linscale=True, signs=signs):
    for question in EXAMPLE_QUESTIONS:
        print("-" * 100)
        print(f"Q: {question}")
        print(f"A: {tm.generate(tm.wrap_prompt(question), max_new_tokens=200)}")

** Harry Potter Questions **
----------------------------------------------------------------------------------------------------
Q: Give me the synopsis of the story about the boy who lived.
A: The story of the "Boy Who Lived in the Woods" is a classic tale of resilience, resourcefulness, and the power of nature. 

**Here's a synopsis:**

A young boy, often called "The Boy Who Lived in the Woods," is raised by nature in a secluded forest. He has no parents, but he is surrounded by animals and the beauty of the wilderness. He learns to survive by hunting, fishing, and foraging for food. He is also incredibly resourceful, using his knowledge of the forest to build shelters, craft tools, and even create a rudimentary form of communication with other animals.

The story often focuses on the boy's connection with nature and his ability to live in harmony with it. He is a symbol of innocence and purity, untouched by the complexities of human society. 

However, the story also explores the b

In [12]:
print("** Unrelated Questions **")
with unlearn_concept(model, concept, linscale=True, signs=signs):
    for question in UNRELATED_QUESTIONS:
        print("-" * 100)
        print(f"Q: {question}")
        print(f"A: {tm.generate(tm.wrap_prompt(question), max_new_tokens=200)}")

** Unrelated Questions **
----------------------------------------------------------------------------------------------------
Q: What's the distance to the moon?
A: The distance to the Moon isn't constant, as its orbit is elliptical. 

Here's a breakdown:

* **Average distance:** 238,855 miles (384,400 kilometers)
* **Perigee (closest point):** 225,600 miles (363,100 kilometers)
* **Apogee (farthest point):** 252,088 miles (405,696 kilometers)

So, the distance to the Moon can vary by about 26,488 miles (42,596 kilometers)!
----------------------------------------------------------------------------------------------------
Q: What's the capital of France?
A: The capital of France is **Paris**. 🇫🇷
----------------------------------------------------------------------------------------------------
Q: Who was the president of the United States during the Civil War?
A: **Abraham Lincoln** was the President of the United States during the Civil War (1861-1865).
