# Manifolds
This notebook discusses how to find manifolds using bilinear autoencoders.

> We assume the reader is familiar with the following:
> - Bilinear autoencoder basics
> - Toy models of superposition


## Setup
Let's get the setup out of the way.
As always, you can find a ``pyproject.toml`` file in the repo that works with the [uv](https://docs.astral.sh/uv/guides/install-python/) package manager.

// TODO: make this automatic \
You can download the autoencoders from [here](https://drive.google.com/drive/folders/1Qm8tSu0pi08lGAvqvW6YtddYzro-Cudc).
I personally looked at the autoencoders with '2' and '10' at the end of their names.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

from datasets import Dataset, load_dataset
from einops import einsum

from utils.feature import Feature
from utils.manifold import Manifold
from utils.functions import *
from autoencoder import Autoencoder

import plotly.express as px
import torch

In [None]:
torch.set_grad_enabled(False)
name = "Qwen/Qwen3-0.6B-Base"

# Download the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="cuda")
tokenize = lambda dataset: tokenizer(dataset["text"], truncation=True, padding=True, max_length=256)

# Download the dataset and tokenize it
dataset = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train", streaming=True).with_format("torch")
dataset = Dataset.from_list(list(dataset.take(2**11))).with_format("torch")
dataset = dataset.map(tokenize, batched=True)

In [None]:
# Load in the autoencoder
coder = Autoencoder.load(model, layer=18, expansion=16, alpha=0.2).half()

# Load in the feature max-activation visualizer, this can take a while (reduce batch size if needed)
vis = Feature(coder, tokenizer, dataset, max_steps=2**4, batch_size=2**5)

## Finding dependence through toy models of superposition

Now, onto the good stuff. 

The [TMS paper](https://transformer-circuits.pub/2022/toy_model/index.html) studies how models represent sparse features when they are tasked with reconstructing them through a bottleneck. This is a form of autoencoder, albeit a boring linear one. In short, the paper shows that models naturally create interesting geometries based on correlation or anti-correlation.

Generally speaking, the results from TMS are interesting, but their experiments can't be scaled to larger models.
Until now.
Bilinear autoencoders actually follow the same setup, since their encoder and decoder are each other's transpose.
Yet they're nonlinear in their inputs, which makes them a cool tool to study non-linear dynamics in a tractable manner.

---

We can conceptually split a bilinear autoencoder into two parts:
- A bilinear part: the $L$ and $R$ matrix, along with the element-wise product.
- A linear part: the $D$ matrix, which takes the feature basis and projects further down.

Here, we will deep dive into why this linear bit is so interesting.
In short, this linear bottleneck 'clusters' quadratic features, which correspond to higher-dimensional manifolds.
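
To make the split concrete, here is a minimal sketch of the two parts on toy tensors. The shapes, the random weights and the function names are assumptions for illustration only; they are not taken from the `Autoencoder` class in this repo.

```python
import torch

torch.manual_seed(0)

# Toy sizes, chosen only for illustration.
d_model, n_features, d_latent = 8, 32, 4

# Hypothetical stand-ins for the autoencoder weights.
L = torch.randn(n_features, d_model)
R = torch.randn(n_features, d_model)
D = torch.randn(d_latent, n_features)

def bilinear_features(x):
    """Bilinear part: element-wise product of two linear projections."""
    return (x @ L.T) * (x @ R.T)

def bottleneck(f):
    """Linear part: project the feature basis further down with D."""
    return f @ D.T

x = torch.randn(5, d_model)
f = bilinear_features(x)  # shape (5, n_features)
z = bottleneck(f)         # shape (5, d_latent)

# Each feature is quadratic in the input: scaling x by 2 scales f by 4.
assert torch.allclose(bilinear_features(2 * x), 4 * f)
```

Because `D` maps many features into fewer dimensions, features whose columns in `D` overlap end up sharing directions in the bottleneck, which is the 'clustering' referred to above.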

## What kind of manifolds can we find?
Each feature in a bilinear autoencoder describes a non-linear manifold on its own.
Unfortunately, these are generally quite boring and basically just correspond to linear directions with an XOR (TODO: explain why).
However, they can become interesting when composed.
Then, they describe general [quadrics](https://www.wikiwand.com/en/articles/Quadric), think circles and other higher-dimensional conic sections.

> Strictly speaking, the features don't describe a manifold; they just assign a value (non-linearly) to the whole input space.
> They do describe an actual manifold when we threshold this value, for instance the region where $f(x) > 0.1$.
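
As a small illustration of the note above, the sketch below treats a single feature as the product of two linear projections, checks that this equals a quadratic form $x^\top M x$, and thresholds it to obtain a region of input space. The vectors `l` and `r` are random stand-ins, not weights from a trained autoencoder.

```python
import torch

torch.manual_seed(0)
d_model = 6

# Hypothetical left/right weights of a single feature.
l = torch.randn(d_model, dtype=torch.float64)
r = torch.randn(d_model, dtype=torch.float64)

# Symmetrized outer product: the matrix of the quadratic form.
M = 0.5 * (torch.outer(l, r) + torch.outer(r, l))

x = torch.randn(1000, d_model, dtype=torch.float64)
f_product = (x @ l) * (x @ r)
f_quadratic = torch.einsum("bi,ij,bj->b", x, M, x)
assert torch.allclose(f_product, f_quadratic)

# The feature is positive exactly when both projections share a sign.
same_sign = torch.sign(x @ l) == torch.sign(x @ r)
assert torch.equal(f_product > 0, same_sign & (f_product != 0))

# Thresholding the value carves out an actual region of input space.
region = f_product > 0.1
```

The sign check shows why a single feature is not very expressive: it only depends on two linear directions and on whether their signs agree.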

## How does this work in practice?

With the intuition out of the way, let's look at what this yields.
First, we need to have a measure of which features get clustered through the $D$ matrix. 

> I won't go into too much detail here; read TMS for more intuition.

We compute the effective dimension: a continuous measure of the number of 'big' entries in a vector.
This roughly corresponds to finding the number of other features with which it interacts in the reconstruction.
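
For intuition, one common definition of such a measure is the participation ratio $(\sum_i v_i^2)^2 / \sum_i v_i^4$. The sketch below applies it row-wise to a toy Gram matrix. Note that `participation_ratio` is a hypothetical helper written for this example; the `generalized_effective_dimension` function used in this notebook lives in `utils.functions` and may be defined differently.

```python
import torch

def participation_ratio(v, eps=1e-12):
    """Participation ratio (sum v_i^2)^2 / sum v_i^4 along the last dimension."""
    sq = v.pow(2)
    return sq.sum(-1).pow(2) / (sq.pow(2).sum(-1) + eps)

# A one-hot vector has a single 'big' entry, a uniform one has as many as its length.
one_hot = torch.tensor([1.0, 0.0, 0.0, 0.0])
uniform = torch.tensor([0.5, 0.5, 0.5, 0.5])
assert torch.allclose(participation_ratio(one_hot), torch.tensor(1.0))
assert torch.allclose(participation_ratio(uniform), torch.tensor(4.0))

# Applied to the rows of a normalized Gram matrix (toy random weights here),
# it gives one value per feature.
torch.manual_seed(0)
down = torch.randn(4, 10)
d = down / down.norm(dim=0, keepdim=True)
g = d.T @ d
pr = participation_ratio(g)  # shape (10,)
```

A value close to 1 means the feature mostly overlaps with itself; larger values mean its overlap is spread across several other features.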

In [None]:
# Compute the (normalized) Gram matrix
d = coder.down / coder.down.norm(dim=0, keepdim=True)
g = d.T @ d

# Compute and plot the effective dimension (sometimes called participation ratio) per feature.
gpr = generalized_effective_dimension(g)
fig = px.scatter(y=gpr.cpu(), x=list(range(gpr.size(-1))), template='plotly_white', width=600, height=300, title="Number of active elements in the overlap matrix")
fig.update_layout(margin=dict(l=0, r=0, t=30, b=0), showlegend=False).show()

# I recommend skipping roughly the top 5 to 10.
# These are dense features which I don't quite understand yet.
# Their manifolds are interesting though.

# Print the top 50 features with the highest effective dimension, possibly corresponding to interesting manifolds.
print(gpr.topk(50).indices.tolist())

We see that most features are roughly 1-dimensional. 
This means they gently interact with other features but probably don't have much additional structure or overlap.

---

Okay, let's take a look at some of these points. 
I have selected a few that were interesting to me.

We then show the features and their max activations.

In [None]:
# These are for the coder with '10' tag
# idx = 602
# idx = 11023
# idx = 7695

# These are for the coder with '2' tag
# idx = 10313  # abbreviation manifold
# idx = 3338 # ( manifold
# idx = 15620 # numbers!
# idx = 36 # of
idx = 2062

# Plot the overlaps of the selected feature
fig = px.histogram(g[idx].cpu(), template='plotly_white', log_y=True, width=600, height=300, range_x=[-1.1, 1.1])
fig.update_layout(margin=dict(l=0, r=0, t=30, b=0), showlegend=False).show()

# Visualise the top 5 features
inds = g[idx].abs().topk(k=5).indices
vis(inds.tolist(), k=3)

Given these features, we can construct a metric tensor (also known as a density matrix) that describes the manifold.
We can then eigendecompose this matrix to see how many dimensions the manifold actually has.

You'll see that sometimes, even when using many different features, the manifold will be roughly one dimensional.
This can have multiple interpretations; one is that the original feature was 'split' across several features, for instance for sparsity reasons.
Another is that the feature was simply a 'building block' and is used across other reconstructions in some way.
Luckily, we can just analyse the composition of this feature as if it were one.
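
The following sketch shows the idea on toy data: several features whose weights all live in the same two-dimensional subspace yield a density matrix with at most two significant eigenvalues. The weights are random stand-ins, and the relative tolerance is an arbitrary choice; the `Manifold.spectrum` method used in this notebook may compute or display the spectrum differently.

```python
import torch

torch.manual_seed(0)
d_model, k = 16, 5

# Hypothetical weights: k features built from a shared 2-dimensional basis.
basis = torch.randn(2, d_model, dtype=torch.float64)
left = torch.randn(k, 2, dtype=torch.float64) @ basis
right = torch.randn(k, 2, dtype=torch.float64) @ basis

# Sum of outer products over the selected features, then symmetrize.
density = torch.einsum("oi,oj->ij", left, right)
density = 0.5 * (density + density.T)

# Symmetric matrix, so eigh applies; eigenvalues can be negative.
eigvals, eigvecs = torch.linalg.eigh(density)

# Count eigenvalues that are not negligible relative to the largest one.
significant = (eigvals.abs() > 1e-8 * eigvals.abs().max()).sum().item()
assert significant <= 2
```

Even though five features went in, the spectrum reveals that they jointly span a much smaller subspace.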

In [None]:
density = einsum(coder.left[inds], coder.right[inds], "out in1, out in2 -> in1 in2")
density = 0.5 * (density + density.T)

manifold = Manifold(dataset, coder.hooked, tokenizer, density, max_steps=2**5)
manifold.spectrum()

Then, finally, we can sample a bunch of inputs (handled efficiently by the manifold class) and plot them.

There are some important things to note here, the most significant being exactly how these points are visualised.
The dimensionality reduction happens linearly but not through PCA. 
Rather, we use the autoencoder itself to compute the principal dimensions of the manifold, independently of inputs.
This matters because many methods so far rely on sampling only a few points (so that the principal axes correspond to the things we actually want).
But this can lead to illusory conclusions.
We want our visualisations to align with the manifold, not what we propagated through the network to sample it.

Our approach simply uses the autoencoder to find a linear projection which likely contains a manifold, then we just project the inputs into it.

Finally, the color corresponds to the activation of the metric tensor, $x^\top M x$.
Hovering over a sample shows the current token and the top model prediction.
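
A rough sketch of this kind of projection is given below, under the assumption that the principal axes are the eigenvectors of the density matrix with the largest absolute eigenvalues. The matrix `M` and the inputs `x` are random stand-ins, and the internals of the `Manifold` class may differ from this.

```python
import torch

torch.manual_seed(0)
d_model, n_points = 16, 200

# Hypothetical symmetric density matrix.
A = torch.randn(d_model, d_model, dtype=torch.float64)
M = 0.5 * (A + A.T)

# The axes come from the matrix alone, independently of any inputs.
eigvals, eigvecs = torch.linalg.eigh(M)
top = eigvals.abs().topk(3).indices
axes = eigvecs[:, top]  # shape (d_model, 3)

# Stand-in activations, projected linearly onto those axes.
x = torch.randn(n_points, d_model, dtype=torch.float64)
coords = x @ axes       # shape (n_points, 3)

# Color each point by the value of the quadratic form x^T M x.
color = torch.einsum("bi,ij,bj->b", x, M, x)
```

Since the axes are fixed before any data is sampled, adding or removing points changes what is plotted but not the space it is plotted in.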

In [None]:
# Plot the manifold, increase `k` to show more points or decrease `total` to sample fewer points.
manifold(k=40_000)