In [1]:
# HypotheSAEs Quickstart
# This notebook demonstrates basic usage of HypotheSAEs on a sample of the Yelp review dataset

%load_ext autoreload
%autoreload 2

import os
os.environ['OPENAI_KEY_SAE'] = '...' # Replace with your OpenAI API key, or with another environment variable (e.g. os.environ['OPENAI_API_KEY'])

import numpy as np
import pandas as pd

from hypothesaes.quickstart import train_sae, interpret_sae, generate_hypotheses, evaluate_hypotheses
from hypothesaes.embedding import get_openai_embeddings, get_local_embeddings

**Load data**

The dataset we're using here is a sample of Yelp reviews: 20K reviews for training, plus a separate 2K reviews used for validation (during SAE training).

The target variable is the `stars` column, which is a rating between 1 and 5. We treat this as a regression task.

There are also 2K reviews used for holdout evaluation, which we'll use at the end of the notebook.

In [2]:
current_dir = os.getcwd()
if current_dir.endswith("notebooks"):
    prefix = "../"
else:
    prefix = "./"

base_dir = os.path.join(prefix, "demo-data")
train_df = pd.read_json(os.path.join(base_dir, "yelp-demo-train-20K.json"), lines=True)
val_df = pd.read_json(os.path.join(base_dir, "yelp-demo-val-2K.json"), lines=True)

texts = train_df['text'].tolist()
labels = train_df['stars'].values
val_texts = val_df['text'].tolist() # These are only used for early stopping of SAE training, so we don't need labels.

**Compute text embeddings for your dataset**

We'll compute text embeddings for a training set, and optionally a validation set. The validation embeddings are used to evaluate the SAE and to trigger early stopping during training.

Embeddings will be stored in the `emb_cache` directory (or `os.environ["EMB_CACHE_DIR"]` if you set it) using the `cache_name` parameter, so you only need to compute embeddings once.

You can use OpenAI or a local model.

Local models will run much faster on GPU. The default local model is `nomic-ai/modernbert-embed-base`. You can use any sentence-transformers model, but please read the model's docs; you may need to edit `get_local_embeddings`.

In [3]:
EMBEDDER = "text-embedding-3-small" # OpenAI
# EMBEDDER = "nomic-ai/modernbert-embed-base" # Huggingface model, will run locally
CACHE_NAME = f"yelp_quickstart_{EMBEDDER}"

text2embedding = get_openai_embeddings(texts + val_texts, model=EMBEDDER, cache_name=CACHE_NAME)
# text2embedding = get_local_embeddings(texts + val_texts, model=EMBEDDER, batch_size=128, cache_name=CACHE_NAME)
embeddings = np.stack([text2embedding[text] for text in texts])

train_embeddings = np.stack([text2embedding[text] for text in texts])
val_embeddings = np.stack([text2embedding[text] for text in val_texts])

Saved 22000 embeddings to /home/rm868/saetools-dev/emb_cache/yelp_quickstart_text-embedding-3-small/chunk_000.npy


**Train SAE(s)** 

Different values of $M$ (the number of neurons in the SAE) and $k$ (the number of neurons active per input, passed as `K`) produce features at different levels of granularity. Training multiple SAEs, as we do here, is optional, but it yields features at several granularities.

See the README for more details about selecting $M$ and $k$.
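
To make the roles of $M$ and $k$ concrete, here is a minimal NumPy sketch of a top-k sparse encoding. It assumes that the SAE applies a ReLU and then keeps only the $k$ largest of its $M$ hidden activations per input; this notebook does not spell out the architecture, so treat that as an assumption. The function `topk_activations` and the random weights are hypothetical and not part of the `hypothesaes` API.

```python
import numpy as np

def topk_activations(x, W, b, k):
    """Encode inputs with M hidden neurons, keeping only the k largest per row."""
    pre = np.maximum(x @ W + b, 0.0)  # (n, M) non-negative pre-activations
    # Column indices of the k largest entries in each row (unordered within the k).
    top_idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    acts = np.zeros_like(pre)
    rows = np.arange(pre.shape[0])[:, None]
    acts[rows, top_idx] = pre[rows, top_idx]
    return acts

rng = np.random.default_rng(0)
n, d, M, k = 4, 16, 32, 4          # toy sizes; the notebook uses M=256/K=8 and M=32/K=4
x = rng.normal(size=(n, d))        # stand-in for text embeddings
W = rng.normal(size=(d, M))        # random encoder weights, for illustration only
b = np.zeros(M)

acts = topk_activations(x, W, b, k)
print(acts.shape)                  # (4, 32)
print((acts != 0).sum(axis=1))     # at most k nonzero activations per row
```

A larger $M$ gives the encoder more neurons to spread concepts across, while a smaller $k$ forces each text to be described by fewer of them.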

In [4]:
checkpoint_dir = os.path.join(prefix, "checkpoints", CACHE_NAME)
sae_256_8 = train_sae(embeddings=train_embeddings, M=256, K=8, checkpoint_dir=checkpoint_dir, val_embeddings=val_embeddings)
sae_32_4 = train_sae(embeddings=train_embeddings, M=32, K=4, checkpoint_dir=checkpoint_dir, val_embeddings=val_embeddings)
sae_list = [sae_256_8, sae_32_4]

Early stopping triggered after 67 epochs
Saved model to ./checkpoints/yelp_quickstart_text-embedding-3-small/SAE_M=256_K=8.pt


Early stopping triggered after 58 epochs
Saved model to ./checkpoints/yelp_quickstart_text-embedding-3-small/SAE_M=32_K=4.pt


**Interpret neurons**  

Interpret a random subset of neurons in the SAE to sanity-check that the learned features, and their interpretations, seem reasonable. We generate and print labels for `n_random_neurons` neurons, and we also print out the top-activating texts for each neuron.

In [5]:
# This instruction will be included in the neuron interpretation prompt.
# The below instructions are specific to Yelp, but you can customize this for your task.
# If you don't pass in task-specific instructions, there is a generic instruction (see src/interpret_neurons.py);
# task-specific instructions are optional, but they help produce hypotheses at the desired level of specificity.

TASK_SPECIFIC_INSTRUCTIONS = """All of the texts are reviews of restaurants on Yelp.
Features should describe a specific aspect of the review. For example:
- "mentions long wait times to receive service"
- "praises how a dish was cooked, with phrases like 'perfect medium-rare'\""""

# Interpret random neurons
results = interpret_sae(
    texts=texts,
    embeddings=train_embeddings,
    sae=sae_list,
    n_random_neurons=5,
    print_examples=3,
    task_specific_instructions=TASK_SPECIFIC_INSTRUCTIONS
)

Activations shape (from 2 SAEs): (20000, 288)


Neuron 66 (from SAE M=256, K=8): emphasizes the deliciousness or perfection of the food using words like 'delicious', 'perfectly', 'amazing', or 'yummy'

Top activating examples:
1. Delicious King Cakes (both the chocolate and the traditional)!...
2. Burgers are simple but good. Very tasty. The fries are crisp on the outside but soft inside....
3. So delicious! We always get the tacos and steak quesadillas. Their cheese is sooooo delicious!...
----------------------------------------------------------------------------------------------------

Neuron 144 (from SAE M=256, K=8): mentions unique or creative food options or menu items

Top activating examples:
1. Great sandwich place that has unique and imaginative flavors.  They also have an alteristic side to their brand.  I'm sorry it took me so long to give them a try.  I highly recommend this establishment!...
2. Strange restaurant. Strange food. Seems like they try too hard to be different or unique. An experience, but I'm sure we w

**Generate hypotheses**

Generate hypotheses which are predictive of the target variable.

The `selection_method` parameter defines how we compute neuron predictiveness (see `src/select_neurons.py` for more details):
- "separation_score": E[target | top-activating examples] - E[target | zero-activating examples]
- "correlation": the Pearson correlation between the neuron's activations and the target variable
- "lasso": select N nonzero features with an L1 regularized model
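
The first two scores can be sketched in a few lines of NumPy. This is an illustration on synthetic data, not the code in `src/select_neurons.py`: the helper names and the `n_top` cutoff for "top-activating" are assumptions, and the lasso option is omitted because it would need an L1-regularized solver.

```python
import numpy as np

def separation_score(acts, y, n_top=50):
    """E[y | top-activating examples] - E[y | zero-activating examples]."""
    top = np.argsort(acts)[-n_top:]   # indices of the n_top largest activations
    zero = acts == 0                  # examples where the neuron is inactive
    return y[top].mean() - y[zero].mean()

def correlation_score(acts, y):
    """Pearson correlation between one neuron's activations and the target."""
    return np.corrcoef(acts, y)[0, 1]

rng = np.random.default_rng(0)
n = 1000
# A sparse, non-negative activation pattern: the neuron fires on about 20% of examples.
acts = np.where(rng.random(n) < 0.2, rng.random(n), 0.0)
# A synthetic target that increases with the activation, plus noise.
y = 3.0 + 2.0 * acts + rng.normal(scale=0.5, size=n)

sep = separation_score(acts, y)
corr = correlation_score(acts, y)
print(f"separation_score={sep:.3f}, correlation={corr:.3f}")
```

Both scores are positive here because the synthetic target rises with the activation; a neuron associated with lower ratings would score negative on both.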

This cell outputs a dataframe with the following columns:
- `neuron_idx`: The index of the neuron in the SAE (if you're using multiple SAEs, this will be a global index across all of them).
- `source_sae`: The SAE that the neuron was selected from.
- `target_{selection_method}`: The predictiveness of the neuron for the target variable, using the selected `selection_method`.
- `interpretation`: The natural language interpretation of the neuron.
- `f1_fidelity_score`: The F1 fidelity score for how well the neuron's interpretation actually corresponds to its activation pattern.
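
As a small illustration of the global `neuron_idx`, the sketch below maps a global index back to an SAE and a local index. It assumes that activations are concatenated in the order of `sae_list`, which is consistent with the combined activation width of 288 (256 + 32) reported in the output; `locate_neuron` is a hypothetical helper, not a library function.

```python
def locate_neuron(global_idx, sae_sizes):
    """Map a global neuron index to (position of the SAE in the list, local index)."""
    if global_idx < 0:
        raise IndexError(f"neuron index {global_idx} is negative")
    offset = 0
    for sae_pos, size in enumerate(sae_sizes):
        if global_idx < offset + size:
            return sae_pos, global_idx - offset
        offset += size
    raise IndexError(f"neuron index {global_idx} exceeds total size {offset}")

sae_sizes = [256, 32]                  # M for each SAE, in sae_list order
print(locate_neuron(201, sae_sizes))   # (0, 201): neuron 201 of the M=256 SAE
print(locate_neuron(273, sae_sizes))   # (1, 17): neuron 17 of the M=32 SAE
```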

In [6]:
selection_method = "correlation"
results = generate_hypotheses(
    texts=texts,
    labels=labels,
    embeddings=train_embeddings,
    sae=sae_list,
    cache_name=CACHE_NAME,
    selection_method=selection_method,
    n_selected_neurons=20,
    n_candidate_interpretations=1,
    task_specific_instructions=TASK_SPECIFIC_INSTRUCTIONS
)

print("\nMost predictive features of Yelp reviews:")
pd.set_option('display.max_colwidth', None)
display(results.sort_values(by=f"target_{selection_method}", ascending=False))
pd.reset_option('display.max_colwidth')

Embeddings shape: (20000, 1536)
Activations shape (from 2 SAEs): (20000, 288)

Step 1: Selecting top 20 predictive neurons

Step 2: Interpreting selected neurons


Step 3: Scoring Interpretations
Found 0 cached items; annotating 2000 uncached items


Scoring neuron interpretation fidelity (20 neurons; 1 candidate interps per neuron; 100 examples to score each…


Most predictive features of Yelp reviews:


| | neuron_idx | source_sae | target_correlation | interpretation | f1_fidelity_score |
|---|---|---|---|---|---|
| 3 | 273 | (32, 4) | 0.349828 | expresses consistent satisfaction with both food and service over multiple visits | 0.779747 |
| 4 | 282 | (32, 4) | 0.306055 | mentions outstanding or exceptional service provided by specific named staff members | 0.64 |
| 15 | 201 | (256, 8) | 0.186097 | uses enthusiastic and repetitive positive adjectives like 'amazing', 'awesome', 'fantastic', and 'phenomenal' to describe the food, service, or atmosphere | 0.696216 |
| 16 | 267 | (32, 4) | 0.184613 | mentions discovering a hidden or unexpected location, often described as a 'gem' or 'treasure' | 0.761519 |
| 17 | 161 | (256, 8) | 0.172689 | explicitly praises specific staff members by name for their service or contribution | 0.541212 |
| 19 | 165 | (256, 8) | 0.170067 | describes exceptional service, often using terms like 'impeccable', 'attentive', or 'hospitable' | 0.589375 |
| 18 | 194 | (256, 8) | -0.171504 | mentions long wait times for food or service, often specifying durations in minutes or hours | 0.947368 |
| 14 | 36 | (256, 8) | -0.188391 | mentions food being rotten, spoiled, or containing foreign objects | 0.564706 |
| 13 | 115 | (256, 8) | -0.19699 | mentions issues with restaurant phone service, such as unanswered calls, busy signals, or phone being off the hook | 0.823529 |
| 12 | 16 | (256, 8) | -0.204466 | describes a negative experience related to restaurant policies, practices, or unexpected charges | 0.828095 |


**Evaluate held-out generalization**

Finally, we evaluate whether these are good hypotheses by testing whether their natural language interpretations can predict the target variable.  

We compute annotations for each hypothesized concept on a holdout set (not seen during SAE training & feature selection).

After annotation, we output a dataframe with the following columns:
- `hypothesis`: The natural language hypothesis (which came from interpreting a predictive neuron in the SAE)
- `separation_score`: How much the target variable differs when the concept is present vs. absent (i.e., $E[Y\mid\text{concept} = 1] - E[Y\mid\text{concept} = 0]$).
- `separation_pval`: The t-test p-value of the null hypothesis that the separation score is 0 (i.e., the concept is not associated with the target variable).
- `regression_coef`: The coefficient of the concept in a multivariate linear regression of the target variable on all concepts.
- `regression_pval`: The p-value of the null hypothesis that the regression coefficient is 0.
- `feature_prevalence`: The fraction of examples that contain the concept.
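
The point estimates in these columns can be reproduced with plain NumPy. The sketch below uses synthetic binary annotations rather than real LLM annotations and omits both p-values; the variable names are illustrative and do not reflect the internals of `evaluate_hypotheses`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Hypothetical binary annotations for two concepts (1 = concept present in the text).
A = (rng.random((n, 2)) < np.array([0.3, 0.1])).astype(float)
# Synthetic target: concept 0 raises the rating, concept 1 lowers it.
y = 3.0 + 1.0 * A[:, 0] - 1.5 * A[:, 1] + rng.normal(scale=0.5, size=n)

# feature_prevalence: fraction of examples that contain each concept.
prevalence = A.mean(axis=0)

# separation_score: E[Y | concept = 1] - E[Y | concept = 0], one concept at a time.
separation = np.array([
    y[A[:, j] == 1].mean() - y[A[:, j] == 0].mean() for j in range(A.shape[1])
])

# regression_coef: ordinary least squares of y on all concepts jointly, with an intercept.
X = np.column_stack([np.ones(n), A])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
regression_coef = coef[1:]

print("prevalence:     ", prevalence.round(3))
print("separation:     ", separation.round(3))
print("regression_coef:", regression_coef.round(3))
```

The separation score looks at each concept in isolation, while the regression coefficient is adjusted for the other concepts. This is why two overlapping hypotheses can both have large separation scores yet share, or even split the sign of, their regression coefficients.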

Additionally, we output the evaluation metrics used in the paper:
- Significant hypotheses: the number of hypotheses that are significant in the multivariate regression at a specified significance level (default $0.1$) after Bonferroni correction. You can pass in a different significance level using the `corrected_pval_threshold` parameter.
- AUC or $R^2$ (for classification or regression targets, respectively): how well the hypotheses collectively predict the target variable in the multivariate regression.
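
The Bonferroni correction divides the significance level by the number of hypotheses tested, so with 20 hypotheses and the default level of $0.1$ the per-hypothesis threshold is $0.1 / 20 = 0.005$. A minimal sketch follows; `count_significant` is a hypothetical helper, and its return order (count, total, threshold) is assumed from how `metrics['Significant']` is indexed in the next cell.

```python
def count_significant(pvals, alpha=0.1):
    """Return (number significant, number tested, Bonferroni-corrected threshold)."""
    n_tests = len(pvals)
    threshold = alpha / n_tests            # Bonferroni: split alpha evenly across tests
    n_significant = sum(p < threshold for p in pvals)
    return n_significant, n_tests, threshold

# 20 hypotheses at the default significance level of 0.1:
print(f"{0.1 / 20:.3e}")                   # 5.000e-03

# Illustrative p-values (not taken from the notebook output):
example_pvals = [1e-10, 0.004, 0.02, 0.5]
n_sig, n_tests, threshold = count_significant(example_pvals)
print(f"{n_sig}/{n_tests} significant (p < {threshold:.3e})")
```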


In [7]:
holdout_df = pd.read_json(os.path.join(base_dir, "yelp-demo-holdout-2K.json"), lines=True)
holdout_texts = holdout_df['text'].tolist()
holdout_labels = holdout_df['stars'].values

metrics, evaluation_df = evaluate_hypotheses(
    hypotheses_df=results,
    texts=holdout_texts,
    labels=holdout_labels,
    cache_name=CACHE_NAME,
)

pd.set_option('display.max_colwidth', None)
display(evaluation_df)
pd.reset_option('display.max_colwidth')

print("\nHoldout Set Metrics:")
print(f"R² Score: {metrics['r2']:.3f}")
print(f"Significant hypotheses: {metrics['Significant'][0]}/{metrics['Significant'][1]} " 
      f"(p < {metrics['Significant'][2]:.3e})")

Step 1: Annotating texts with 20 hypotheses
Found 0 cached items; annotating 40000 uncached items


Step 2: Computing predictiveness of hypothesis annotations


| | hypothesis | separation_score | separation_pval | regression_coef | regression_pval | feature_prevalence |
|---|---|---|---|---|---|---|
| 15 | uses enthusiastic and repetitive positive adjectives like 'amazing', 'awesome', 'fantastic', and 'phenomenal' to describe the food, service, or atmosphere | 1.757536 | 3.056630e-218 | 0.622085 | 2.642696e-56 | 0.4625 |
| 3 | expresses consistent satisfaction with both food and service over multiple visits | 1.260865 | 9.422764e-77 | 0.11304 | 0.004797782 | 0.265 |
| 19 | describes exceptional service, often using terms like 'impeccable', 'attentive', or 'hospitable' | 1.090234 | 1.554139e-48 | 0.073108 | 0.1049215 | 0.213 |
| 4 | mentions outstanding or exceptional service provided by specific named staff members | 0.985218 | 4.784272e-23 | 0.179892 | 0.01596294 | 0.108 |
| 16 | mentions discovering a hidden or unexpected location, often described as a 'gem' or 'treasure' | 0.981805 | 2.505211e-13 | 0.156204 | 0.01947976 | 0.057 |
| 17 | explicitly praises specific staff members by name for their service or contribution | 0.981041 | 3.411213e-14 | -0.006343 | 0.944009 | 0.0615 |
| 18 | mentions long wait times for food or service, often specifying durations in minutes or hours | -1.227237 | 1.549964e-26 | -0.244673 | 0.02621826 | 0.078 |
| 11 | describes long wait times for food or service, often specifying the duration of the wait | -1.246352 | 1.679297e-29 | 0.249166 | 0.02172724 | 0.085 |
| 5 | mentions that the food lacks flavor, creativity, or excitement | -1.983874 | 1.258922e-118 | -0.420731 | 8.383332e-11 | 0.1355 |
| 14 | mentions food being rotten, spoiled, or containing foreign objects | -2.1235 | 5.054306e-35 | 0.012535 | 0.8981629 | 0.033 |



Holdout Set Metrics:
R² Score: 0.764
Significant hypotheses: 9/20 (p < 5.000e-03)
