<a href="https://colab.research.google.com/github/tatsath/Interpretability/blob/main/HallucinationReduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Goodfire Cookbook

This cookbook shows a few ways, some of them unconventional, to use features and steering in Goodfire. It covers:

- Removing knowledge
- Dynamic prompts
- Sorting by features
- On-demand RAG


In [None]:
!pip install goodfire --quiet
!pip install datasets --quiet


In [None]:
from google.colab import userdata

# Add your Goodfire API Key to your Colab secrets
GOODFIRE_API_KEY = userdata.get('GOODFIRE_API_KEY')

## Initialize the SDK

In [None]:
import goodfire

client = goodfire.Client(GOODFIRE_API_KEY)

# Instantiate a model variant
#variant = goodfire.Variant("meta-llama/Meta-Llama-3.1-8B-Instruct")
variant = goodfire.Variant("meta-llama/Llama-3.3-70B-Instruct")

## Removing knowledge

Let's say we want a model to not know anything about famous people so that we don't get in trouble if it says bad things about them.

We'll use feature search to find features that are relevant to famous people and then play with what happens.

In [None]:
famous_people_features = client.features.search("celebrities", model=variant, top_k=10)
print(famous_people_features)

FeatureGroup([
   0: "Celebrity relationship gossip and speculation",
   1: "Events or people becoming subjects of widespread public attention and media scrutiny",
   2: "Celebrity participation in triathlons and endurance sports",
   3: "Biographical descriptions of entertainment industry figures, especially Disney Channel stars",
   4: "Names of prominent stars and celestial bodies",
   5: "Fan interactions and parasocial relationships with celebrities/personalities",
   6: "Commercial entertainment brands and art platforms",
   7: "Music industry awards and formal accolades",
   8: "References to established entertainment and music industries/scenes",
   9: "Words with the root cele- relating to fame or heavenly concepts"
])


After some experimentation, we found a set of feature edits that makes the model still recognize celebrity names as noteworthy individuals but forget all personal details about them.

In [None]:
variant.reset()
variant.set(famous_people_features[1], -0.5)
variant.set(famous_people_features[9], -0.5)

for token in client.chat.completions.create(
    [
        {"role": "user", "content": "Who is Brad Pitt?"}
    ],
    model=variant,
    stream=True,
    max_completion_tokens=150,
):
    print(token.choices[0].delta.content, end="")

Brad Pitt is a talented American actor and producer. He's known for his iconic roles in movies like "Thelma & Louise," "Fight Club," "Ocean's Eleven," and "Once Upon a Time in Hollywood." He's also a two-time Academy Award winner! What's your favorite Brad Pitt movie?
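
One way to check whether a feature edit actually removed knowledge is to look for known facts in the response. The sketch below is a minimal, hypothetical helper (`leaked_facts` is not part of the Goodfire SDK) that does a case-insensitive substring match against a list of facts we choose ourselves. The sample string is a shortened copy of the response above.

```python
def leaked_facts(response: str, facts: list[str]) -> list[str]:
    """Return the facts that still appear in the response (case-insensitive)."""
    lowered = response.lower()
    return [fact for fact in facts if fact.lower() in lowered]


# Shortened copy of the response above; the fact list is our own assumption.
response = 'He\'s known for his iconic roles in movies like "Fight Club" and "Ocean\'s Eleven."'
print(leaked_facts(response, ["Fight Club", "Ocean's Eleven", "Moneyball"]))
```

A non-empty result suggests the edit strengths need more tuning. Substring matching is crude, so treat it as a quick signal rather than a real evaluation.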

## Dynamic Prompts

In this example, we'll switch the system prompt depending on whether the user is asking for code, so the model responds differently to coding requests.

This will allow us to give much more specific instructions to the model when it's coding.

### Find Programming Features

We'll first find features that are relevant to programming. One of the most reliable ways to find features is to use contrastive search, which guarantees that the features we find activate on the examples we give it.

The nice thing about contrastive search is that it often results in very generalizable features, which means that they'll activate on a wide variety of examples.
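
Contrastive search in this notebook takes each dataset as a list of conversations, where a conversation is a list of `role`/`content` messages. Writing those by hand gets verbose, so here is a small sketch of two helpers that build them from plain text pairs. `make_conversation` and `make_dataset` are hypothetical names, not part of the Goodfire SDK.

```python
def make_conversation(user_text: str, assistant_text: str) -> list[dict]:
    """Build one two-turn conversation in the chat-message format."""
    return [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": assistant_text},
    ]


def make_dataset(pairs: list[tuple[str, str]]) -> list[list[dict]]:
    """Turn (user, assistant) text pairs into a list of conversations."""
    return [make_conversation(user, assistant) for user, assistant in pairs]


# The non-programming examples used in this section, built from pairs.
casual_examples = make_dataset([
    ("Hello how are you?", "I'm doing well!"),
    ("What's your favorite food?", "It's pizza!"),
])
print(casual_examples[0])
```

The resulting lists have the same shape as the hand-written datasets, so adding more contrast examples becomes a one-line change.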


In [None]:
variant.reset()

_, programming_features = client.features.contrast(
    dataset_2=[
        [
            {
                "role": "user",
                "content": "Write me a program to sort a list of numbers"
            },
            {
                "role": "assistant",
                "content": "Sure, here is the code in javascript: ```javascript\nfunction sortNumbers(arr) {\n  return arr.sort((a, b) => a - b);\n}\n```"
            }
        ],
        [
            {
                "role": "user",
                "content": "Write me a program to make a tweet"
            },
            {
                "role": "assistant",
                "content": "Sure, here is the code in javascript: ```javascript\nfunction makeTweet(text) {\n  return text;\n}\n```"
            }
        ]
    ],
    dataset_1=[
        [
            {
                "role": "user",
                "content": "Hello how are you?"
            },
            {
                "role": "assistant",
                "content":
                  "I'm doing well!"
            },
        ], [
            {
                "role": "user",
                "content": "What's your favorite food?"
            },
            {
                "role": "assistant",
                "content":
                  "It's pizza!"
            },
        ]
    ],
    model=variant,
    top_k=30
)

programming_features = client.features.rerank(
    features=programming_features,
    query="programming",
    model=variant,
    top_k=5
)

print(programming_features)

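# Pick the feature labeled "The user is requesting code to be written or generated".
# It was the third result (index 2) when this was written; results vary by model,
# so confirm the index against the printed list.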
request_programming_feature = programming_features[2]

FeatureGroup([
   0: "Specifying which programming language code should be written in",
   1: "Syntactical sugar in programming languages",
   2: "Syntactical sugar in programming languages",
   3: "The assistant is explaining how to create a new program or feature",
   4: "The assistant should complete a code snippet"
])


Next we'll use the `features.inspect` endpoint to check whether the user is requesting code. `features.inspect` returns a context object, which we can use to get the activation of the programming feature.

If the feature is activated, we'll use the system prompt to give the model more specific instructions.

If the feature is not activated, we'll use the default system prompt.

Without the dynamic prompt, Llama 8B tends to write less detailed code, with more TODOs and fewer useful comments.
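
The decision itself is a simple threshold on the feature activation. The sketch below isolates that step from the API call so it can be tuned on its own. `choose_system_prompt` and the shortened `CODING_PROMPT` are illustrative assumptions; the 0.5 default mirrors the arbitrary threshold used in this notebook.

```python
DEFAULT_PROMPT = "You are a helpful assistant."
# Shortened stand-in for the longer coding prompt used in this section.
CODING_PROMPT = "You are a helpful assistant that writes fully functional, commented code."


def choose_system_prompt(activation: float, threshold: float = 0.5) -> str:
    """Return the coding prompt only when the activation exceeds the threshold."""
    return CODING_PROMPT if activation > threshold else DEFAULT_PROMPT


for activation in (0.1, 0.9):
    print(activation, "->", choose_system_prompt(activation))
```

Keeping the threshold as a parameter makes it easy to try other values against a handful of known coding and non-coding prompts.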

In [None]:


def check_if_requesting_programming(prompt):
    variant.reset()
    context = client.features.inspect(
        [
            {
                "role": "user",
                "content": prompt
            },
        ],
        model=variant,
        features=request_programming_feature,
    )
    activations = context.top(k=1)
    highest_activation = max(activations, key=lambda x: x.activation)
    # This threshold is arbitrary, but it's a good starting point
    return highest_activation.activation > 0.5


def generate_response(prompt):

    is_requesting_programming = check_if_requesting_programming(prompt)
    system_prompt = "You are a helpful assistant."
    if is_requesting_programming:
        print("Requesting programming")
        system_prompt = """
        You are a helpful assistant that writes code. When writing code, be as extensive as possible and write fully functional code.
        Always include comments and proper formatting.
        NEVER leave 'todos' or 'placeholders' in your code.
        If the user does not specify a language, write backend code in Python and frontend code in React.
        Do not explain what your code does, unless the user asks. Just write it.
        """

    for token in client.chat.completions.create(
        [
            {"role": "user", "content": prompt}
        ],
        model=variant,
        stream=True,
        max_completion_tokens=500,
        system_prompt=system_prompt,
    ):
        print(token.choices[0].delta.content, end="")

generate_response("Write me a program to sort a list of numbers")


**Number Sorter Program**

Below is an example of a Python program that sorts a list of numbers using the built-in `sorted` function.

```python
def sort_numbers(num_list):
    """
    Sorts a list of numbers in ascending order.

    Args:
        num_list (list): A list of numbers.

    Returns:
        list: A sorted list of numbers.
    """
    return sorted(num_list)

def main():
    # Example usage
    numbers = [64, 34, 25, 12, 22, 11, 90]
    print("Original list:", numbers)
    print("Sorted list:", sort_numbers(numbers))

if __name__ == "__main__":
    main()
```

**How it works:**

1. The `sort_numbers` function takes a list of numbers as input.
2. It uses the `sorted` function to sort the list in ascending order.
3. The sorted list is returned.
4. In the `main` function, we create an example list of numbers and print the original and sorted lists.

**Output:**
```
Original list: [64, 34, 25, 12, 22, 11, 90]
Sorted list: [11, 12, 22, 25, 34, 64, 90]
```

You can save this cod

## Sort by features

You can use feature activations as a way to filter and sort data. In this case let's find some of Elon Musk's tweets that are sarcastic.

In [None]:
from datasets import load_dataset

# Load only the first num_train_samples tweets; the split slice already limits the size
num_train_samples = 100
elon_tweets = load_dataset("lcama/elon-tweets", split=f"train[0:{num_train_samples}]")
elon_tweets


Dataset({
    features: ['text'],
    num_rows: 100
})

In [None]:
sarcasm_features = client.features.search("sarcasm in tweets", model=variant, top_k=4)
print(sarcasm_features)


FeatureGroup([
   0: "Mentions and discussions of sarcasm",
   1: "Academic or analytical discussion of sarcasm and irony",
   2: "Punctuation patterns in sarcastic or playful dialogue",
   3: "Condescending or patronizing sarcasm, especially in response to perceived basic actions"
])


Find all tweets with a sarcasm score > 1

In [None]:
def score_sarcasm_on_tweet(tweet):
    context = client.features.inspect(
        [
            {"role": "user", "content": tweet},
        ],
        model=variant,
        features=sarcasm_features
    )
    activations = context.top(k=len(sarcasm_features))
    total_activation = sum(activation.activation for activation in activations)
    return total_activation


tweets_list = list(elon_tweets)
# get any tweets with sarcasm > 1
sarcastic_tweets = [tweet for tweet in tweets_list if score_sarcasm_on_tweet(tweet["text"]) > 1]
sarcastic_tweets


[{'text': '@TechEmails Accurate. He set off my bs detector, which is why I did not think he had $3B.'},
 {'text': '@WholeMarsBlog It used to be:\n\n“Internet guy will fail at rockets/cars!”\n\nNow it is:\n\n“Rockets/cars guy will fail at Internet!”\n\nLiterally from same media outlets 🤣🤣'},
 {'text': 'Twitter HQ is great (this is a real pic) https://t.co/EiAXAF0CaE https://t.co/qjfOQCr533'},
 {'text': 'To be more precise, accounts doing parody impersonations. Basically, tricking people is not ok.'},
 {'text': 'Going forward, accounts engaged in parody must include “parody” in their name, not just in bio'},
 {'text': 'I love when people complain about Twitter … on Twitter 🤣🤣'},
 {'text': '@mcuban It’s working for me. That said, we can definitely make the verified mentions tab more usable.'},
 {'text': '@micsolana When reality is indistinguishable from satire'},
 {'text': '@monitoringbias It is borderline illegal to support Republicans in San Francisco! \n\nEven admitting you know some i
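
The cell above filters tweets but leaves them in dataset order. To actually sort by feature activation, score each text once and order by that score. This is a minimal sketch: `rank_by_score` is a hypothetical helper, and `toy_score` is a stand-in scorer for illustration only. In the notebook you would pass a function like `score_sarcasm_on_tweet` instead.

```python
from typing import Callable


def rank_by_score(
    texts: list[str],
    score_fn: Callable[[str], float],
    threshold: float = 1.0,
) -> list[tuple[float, str]]:
    """Score each text once, keep those above the threshold, sort highest first."""
    scored = [(score_fn(text), text) for text in texts]
    kept = [pair for pair in scored if pair[0] > threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


# Stand-in scorer for illustration only: counts exclamation marks.
def toy_score(text: str) -> float:
    return float(text.count("!"))


print(rank_by_score(["ok", "wow!!", "great!!!"], toy_score))
```

Scoring each text exactly once matters here because every real score is an API call.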

## On-Demand RAG

If we see the user is asking about something that might need more data, e.g. on potential brand deals, we can stop the request, get more data and pass it back into the model.

For example, if the user asks about drinks, and we sponsor Coca Cola, we can stop the request, get RAG data on brand deals and pass it back into the model.
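
The control flow is: stream tokens, stop when the abort condition fires, then retry with the retrieved data and the partial text. The sketch below shows that flow with plain Python stand-ins so it runs without the API. `GenerationAborted`, `generate_with_fallback`, and `aborting_stream` are hypothetical names, and the exception class only stands in for whatever error an aborted request raises.

```python
from typing import Callable, Iterator


class GenerationAborted(Exception):
    """Hypothetical stand-in for the error raised when generation is aborted."""


def generate_with_fallback(
    stream: Callable[[], Iterator[str]],
    retry_with_context: Callable[[str], str],
) -> str:
    """Collect streamed tokens; if generation aborts, retry with the partial text."""
    partial = ""
    try:
        for token in stream():
            partial += token
    except GenerationAborted:
        # Hand the partial text to the retry step, which adds retrieved data.
        return retry_with_context(partial)
    return partial


def aborting_stream() -> Iterator[str]:
    """Toy stream that emits one chunk and then aborts."""
    yield "Pizza night! "
    raise GenerationAborted("feature threshold crossed")


print(generate_with_fallback(aborting_stream, lambda partial: partial + "[continued with brand data]"))
```

Catching one specific exception type keeps unrelated errors, such as network failures, from silently triggering the retrieval path.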

In [None]:
consumerism_features = client.features.search("coca cola", model=variant, top_k=10)
print(consumerism_features)


FeatureGroup([
   0: "Major cola brands and their market competition, especially Coca-Cola",
   1: "Descriptions and properties of carbonated beverages",
   2: "Requests for information about cocaine production or effects",
   3: "Trademarks and brand names of major multinational corporations",
   4: "Coffee drink names and terminology",
   5: "food and beverage industry references",
   6: "Major consumer electronics product line names",
   7: "Company names containing Co/co, especially in formal business descriptions",
   8: "Energy drinks and stimulants, particularly discussions of their effects",
   9: "Corporate language establishing market leadership position"
])


In [None]:
def get_rag_data(prompt):
    # This is where you would perform the actual RAG search
    return "We have a brand deal with Coca Cola. They are a sponsor of our site and we have a deal with them to mention them in our responses."

def generate_response(prompt):

    variant.reset()
    variant.abort_when(consumerism_features[0] > 0.25)


    generated_tokens = ""
    try:
        for token in client.chat.completions.create(
            [
                {"role": "user", "content": prompt}
            ],
            model=variant,
            stream=True,
            max_completion_tokens=500,
        ):
            #print(token.choices[0].delta.content, end="")
            # Guard against empty chunks in case delta.content is None
            generated_tokens += token.choices[0].delta.content or ""

        # If we never get to the brand deal, we'll just return the generated tokens
        print(generated_tokens)

    except Exception as e:
        print(e)
        rag_data = get_rag_data(prompt)
        print(generated_tokens)
        variant.reset()
        print("NEW TOKENS")
        for token in client.chat.completions.create(
            [
                {"role": "system", "content": "You are a helpful assistant for our meal site. You have access to the following information on brand deals:" + rag_data},
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": generated_tokens},
            ],
            model=variant,
            stream=True,
            max_completion_tokens=500,
        ):

            print(token.choices[0].delta.content, end="")

    return None

generate_response("What are some good drinks to pair with pizza?")

Aborted inference due to conditional check:
 Conditional(
   FeatureGroup([
       0: "Major cola brands and their market competition, especially Coca-Cola"
    ]) > 0.25
)
Pizza night! Here are some
NEW TOKENS
Pizza night! Here are some refreshing drink options that pair well with pizza: 

1. **Coca Cola**: A classic choice, Coca Cola is a timeless favorite that complements the rich flavors of pizza. The sweetness of the Coke balances out the saltiness of the cheese and sauce.
2. **Iced Tea**: A glass of cold-brewed iced tea, sweetened or unsweetened, can help cut the richness of the pizza.
3. **Craft Beer**: For a more adult pairing, a hoppy IPA or a malty Amber Ale can enhance the flavors of the pizza.
4. **Fresh-Squeezed Lemonade**: A glass of homemade lemonade with a twist of lemon can provide a nice contrast to the savory flavors of the pizza.
5. **Sparkling Water with a Twist**: If you're looking for something bubbly without the added sugar, try a sparkling water with a squeeze 

In [None]:
finance_features = client.features.search("financial fraud, market trends, investment risks", model=variant, top_k=10)
print(finance_features)

FeatureGroup([
   0: "Discussion of financial losses and investment risks",
   1: "Financial risk and investment caution",
   2: "Financial and business risk concepts",
   3: "Corporate fraud scandals and legal consequences",
   4: "Personal financial circumstances in investment advice",
   5: "The assistant should warn about investment risks",
   6: "Insider trading and market manipulation",
   7: "Financial market trading discussions and explanations",
   8: "Financial investment discussion and advice",
   9: "Price movements in financial market analysis"
])


In [None]:
# Function to retrieve synthetic RAG data for finance
def get_rag_data(prompt):
    # This is where actual RAG retrieval would occur
    return (
        "According to Bloomberg and SEC filings, the stock market trends indicate volatility. "
        "Macroeconomic factors, inflation, and Federal Reserve policies significantly impact these trends. "
        "Always verify insights from sources such as Reuters, Bloomberg, or official financial statements."
    )

def generate_response(prompt):


    variant.reset()
    # Abort generation if this feature's activation exceeds 0.25.
    # In the Coca-Cola example above, the abort surfaced as an exception.
    variant.abort_when(finance_features[0] > 0.25)

    generated_tokens = ""

    try:
        # Generate initial response
        response = client.chat.completions.create(
            [
                {"role": "user", "content": prompt}
            ],
            model=variant,
            stream=True,
            max_completion_tokens=500,
        )

        for token in response:
            if token.choices[0].delta.content:
                generated_tokens += token.choices[0].delta.content

        # Print the initial generated response
        print("\n--- GENERATED RESPONSE ---\n")
        print(generated_tokens)

    except Exception as e:
        print(f"\nError: {e} - Falling back to RAG\n")

    # The RAG pass below runs unconditionally, whether or not generation was aborted
    rag_data = get_rag_data(prompt)
    print("\n--- NEW TOKENS (RAG-Verified Data) ---\n")

    variant.reset()

    response = client.chat.completions.create(
        [
            {"role": "system", "content": "You are a financial assistant providing fact-based insights. "
                                          "Use only verified data from Bloomberg, Reuters, or SEC filings. "
                                          "Here is the retrieved data: " + rag_data},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": generated_tokens},
        ],
        model=variant,
        stream=True,
        max_completion_tokens=500,
    )

    for token in response:
        if token.choices[0].delta.content:
            print(token.choices[0].delta.content, end="")

    return None

# Finance example query to check hallucination handling
generate_response("What are the expected Federal Reserve rate decisions for next quarter?")



--- GENERATED RESPONSE ---

The Federal Reserve's next meeting is scheduled for September. Based on current market expectations and economic data, here are the possible rate decision scenarios:

1. **No rate change**: The Fed might keep rates steady, given the recent inflation slowdown and stable economic growth.
2. **25-basis-point cut**: Some experts predict a small rate cut to support the economy and counterbalance global trade uncertainties.
3. **No changes to forward guidance**: The Fed might maintain its current forward guidance, indicating a neutral or slightly dovish stance.

Keep in mind that these are just market expectations and not official Fed announcements. The actual decision will depend on various factors, including inflation, employment, and global economic conditions.

Would you like me to provide more information or context about the Federal Reserve's decision-making process?

--- NEW TOKENS (RAG-Verified Data) ---

According to recent reports from Bloomberg and Reu