# Mechanistic Interpretability Framework

Mechanistic interpretability, as used here, is a framework for analysing the mechanisms inside transformer models in terms of operations on the residual stream. The main intuition is to break a high-dimensional model down into a composition of simpler, more easily understood mechanisms and components.

<img src="../images/mechtrans.png" 
        alt="Picture" 
        width="800" 
        height="800" 
        style="display: block; margin: 0 auto" />

Image Source: [Elhage et al 2021](https://transformer-circuits.pub/2021/framework/index.html)

## Important Concepts

### Residual Stream

The residual stream starts out as the input token embeddings, one vector per token position, and is transformed in parallel at every position as it passes through the transformer. All components of a transformer (the token embedding, attention heads, MLP layers, and unembedding) communicate with each other by reading from and writing to different subspaces of the residual stream. Rather than analysing the residual stream vectors directly, it can be helpful to decompose the residual stream into these different communication channels, which correspond to paths through the model.
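The "read, then add back" picture can be sketched in a few lines of NumPy. This is only an illustration: the `make_layer` helper and its random weights are hypothetical stand-ins for attention heads or MLP layers, not the architecture of any real model. It shows that the final stream is the embedding plus the sum of everything the layers wrote.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 8, 3

def make_layer(rng, d_model, d_hidden=4):
    """Hypothetical layer: reads the stream linearly, returns a vector to add back."""
    W_in = rng.normal(size=(d_hidden, d_model))   # linear read from the stream
    W_out = rng.normal(size=(d_model, d_hidden))  # linear write to the stream
    return lambda x: W_out @ np.tanh(W_in @ x)

layers = [make_layer(rng, d_model) for _ in range(n_layers)]

x0 = rng.normal(size=d_model)  # stand-in for a token embedding
stream = x0.copy()
writes = []
for layer in layers:
    out = layer(stream)        # each layer reads the current stream
    writes.append(out)
    stream = stream + out      # and writes by addition

# The final stream decomposes into the embedding plus every layer's write.
reconstructed = x0 + np.sum(writes, axis=0)
print(np.allclose(stream, reconstructed))
```

Because each write is a separate additive term, each one can be studied as its own channel through the model.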

Features of the residual stream: 

1. **Linear Structuring**: All communication to and from the residual stream happens through linear operations (addition or a linear map), which gives transformers a great deal of linear structure. One consequence is that the residual stream doesn't have a privileged basis.
2. **Selective Flow**: Information flow via the residual stream is selective: the model can effectively "select" which layers a token's information is routed through. In practice this selectivity is implemented by the model weights, which determine what each layer reads from and writes to the stream.


Note: A privileged basis (sometimes called a "preferred basis") for a set of vectors is a particular choice of basis vectors that simplifies calculations, enhances understanding, or aligns with specific properties of the vector space, such as the $n$ coordinate vectors in $\mathbb{R}^n$. In the case of transformers, a privileged basis for a set of vectors is one that enhances interpretability or makes calculations easier. For mechanistic interpretability, the task is then to decompose a model in terms of the components that do have a privileged basis (embedding, attention, MLP), where privilege is a spectrum.
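One way to see why the residual stream has no privileged basis is to rotate it and check that nothing observable changes. The sketch below assumes a tiny one-layer model with a single ReLU MLP and no layer norm; all weight names (`W_E`, `W_in`, `W_out`, `W_U`) and sizes are illustrative. Every matrix that touches the stream is rotated by the same orthogonal matrix `R`, and the logits stay the same.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_vocab = 6, 10, 5

W_E = rng.normal(size=(d_model, n_vocab))     # embedding: writes to the stream
W_in = rng.normal(size=(d_hidden, d_model))   # MLP input: reads from the stream
W_out = rng.normal(size=(d_model, d_hidden))  # MLP output: writes to the stream
W_U = rng.normal(size=(n_vocab, d_model))     # unembedding: reads from the stream

def logits(token, W_E, W_in, W_out, W_U):
    x = W_E[:, token]                          # residual stream after embedding
    x = x + W_out @ np.maximum(W_in @ x, 0.0)  # ReLU acts on hidden units, not on the stream
    return W_U @ x

# Random orthogonal matrix obtained from a QR decomposition.
R, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))

original = logits(2, W_E, W_in, W_out, W_U)
# Writers are left-multiplied by R, readers are right-multiplied by R^T.
rotated = logits(2, R @ W_E, W_in @ R.T, R @ W_out, W_U @ R.T)
print(np.allclose(original, rotated))
```

The ReLU is applied coordinate-wise to the hidden units, so rotating that space would change the result. This is the sense in which the MLP activations have a privileged basis and the stream does not.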

### Virtual Weights

Because the residual stream is linear, the strength of the connection between any two layers can be quantified by "virtual weights": the product of the later layer's input weights and the earlier layer's output weights. These indicate the extent to which the later layer reads the information written by the earlier layer.
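As a minimal sketch, virtual weights can be computed by multiplying the later layer's input matrix with the earlier layer's output matrix. The names `W_out_early` and `W_in_late`, the random values, and the use of a Frobenius norm as a single "connection strength" number are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_early, d_late = 8, 5, 3

# Hypothetical weights: the earlier layer writes to the stream via W_out_early,
# the later layer reads from the stream via W_in_late.
W_out_early = rng.normal(size=(d_model, d_early))
W_in_late = rng.normal(size=(d_late, d_model))

# Virtual weights: a direct map from the earlier layer's outputs
# to the later layer's inputs, skipping the stream.
W_virtual = W_in_late @ W_out_early  # shape (d_late, d_early)

h = rng.normal(size=d_early)                # some output of the earlier layer
via_stream = W_in_late @ (W_out_early @ h)  # write to the stream, then read
via_virtual = W_virtual @ h                 # same result in one step
print(np.allclose(via_stream, via_virtual))

# One possible scalar summary of the connection (an assumption, not a standard metric).
strength = np.linalg.norm(W_virtual)  # Frobenius norm for a 2-D array
```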

### Superposition

The residual stream typically has fewer dimensions than the components that communicate through it, so its activations act as a bottleneck. This leads to superposition: a single dimension does not correspond to one unique interpretable feature but instead encodes a mix of features. This works because important features such as "London" are sparse, i.e. rarely active. The model thus strikes a balance between encoding as many features as possible and being able to read them out easily.
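A toy illustration of why sparsity makes superposition workable is sketched below. It assumes random unit vectors as feature directions and a simple dot-product readout with a threshold of 0.5. These choices are for illustration only and are not a claim about how a trained model arranges its features.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 1024, 512  # more features than dimensions

# Random unit directions: nearly, but not exactly, orthogonal to each other.
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Sparse input: only two of the 1024 features are active.
active = np.array([3, 500])
activations = np.zeros(n_features)
activations[active] = 1.0

x = activations @ directions  # encode into the d_model-dimensional stream
readout = directions @ x      # read every feature back with a dot product

recovered = np.flatnonzero(readout > 0.5)
# Largest spurious readout among the inactive features.
interference = np.abs(np.delete(readout, active)).max()
print(recovered, interference)
```

The inactive features pick up a small amount of interference because the directions are not exactly orthogonal. With many features active at once, that interference would grow and the readout would become unreliable, which is the trade-off described above.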

The high demand on residual stream bandwidth that leads to superposition may also give some attention heads and MLP neurons a memory-management role: they read information in from the stream and write out its negative, clearing those dimensions for use by later layers.

### Attention Circuits

The attention mechanism in transformers can be considered to have the following important features: 

1. There are two main circuits: QK (which computes relations between tokens, i.e. which tokens attend to which) and OV (which computes how each token affects the output if attended to).

2. The attention heads within a layer are independent of each other and additive: each head's output is added to the residual stream.

<img src="../images/atthead.png" 
        alt="Picture" 
        width="800" 
        height="800" 
        style="display: block; margin: 0 auto" />

Image Source: [Elhage et al 2021](https://transformer-circuits.pub/2021/framework/index.html)   

3. The attention heads move information, i.e. they read information from the residual stream of one token and write it to the residual stream of another token. Within an attention block the computation is a chain of matrix multiplications, which is associative, so the matrices can be grouped freely (though not reordered). Only the products $W_{OV} = W_{O} W_{V}$ and $W_{QK} = W_{Q}^{T} W_{K}$ matter: $W_{OV}$ can be factorized in many ways into a $W_{O}$ and a $W_{V}$, and the same goes for $W_{QK}$, although OV and QK perform very different functions.
4. The composition of attention heads across layers gives rise to induction heads, which greatly increase the expressivity of transformers. Key and query composition are very different from value composition: the former affect the attention pattern (where a head attends), while the latter affects what information is moved.
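The QK/OV view of a single head can be checked numerically. The sketch below assumes one causal attention head without biases or layer norm, a column-vector convention in which $W_{QK} = W_Q^T W_K$ and $W_{OV} = W_O W_V$, and random weights. The helper names are illustrative. It compares the usual query/key/value computation with the circuit form, then shows that a different factorization of $W_{OV}$ gives the same product.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(scores):
    """Mask out future positions, then normalise each row."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    return softmax(np.where(future, -np.inf, scores))

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 8, 4
X = rng.normal(size=(n_tokens, d_model))  # one residual-stream row per token
W_Q, W_K, W_V = (rng.normal(size=(d_head, d_model)) for _ in range(3))
W_O = rng.normal(size=(d_model, d_head))

# Usual computation with explicit queries, keys and values.
Q, K, V = X @ W_Q.T, X @ W_K.T, X @ W_V.T
A = causal_attention(Q @ K.T / np.sqrt(d_head))
out_standard = (A @ V) @ W_O.T

# Circuit form: only the products W_QK and W_OV appear.
W_QK = W_Q.T @ W_K  # (d_model, d_model): where to attend
W_OV = W_O @ W_V    # (d_model, d_model): what to move
A_circuit = causal_attention(X @ W_QK @ X.T / np.sqrt(d_head))
out_circuit = A_circuit @ X @ W_OV.T
print(np.allclose(out_standard, out_circuit))

# A different factorization of W_OV: insert M and its inverse between W_O and W_V.
M = np.eye(d_head) + 0.1 * rng.normal(size=(d_head, d_head))
W_V_alt, W_O_alt = M @ W_V, W_O @ np.linalg.inv(M)
print(np.allclose(W_O_alt @ W_V_alt, W_OV))
```

The attention pattern `A` depends only on `W_QK`, and what gets written depends only on `W_OV`, which is why the two circuits can be analysed separately.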

## Reverse Engineering

Using toy attention-only models, we can analyse characteristic behaviours of transformers:

1. **Zero layer Transformers**: They emulate bigram statistics.
2. **One layer Transformers**: They emulate bigram + skip-trigram statistics. Trigrams are hard to learn because the positional encodings mostly capture whether a token comes before or after another, not its exact position.
3. **Two layer Transformers**: At this stage, the composition of attention heads across layers leads to the formation of *induction heads*, which implement a simple in-context learning algorithm. The formation of these induction heads marks a turning point for emergent behaviour.
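The behaviour attributed to induction heads can be written down as a plain algorithm: find the most recent earlier occurrence of the current token and predict whatever followed it. The function below is a hypothetical sketch of that input/output behaviour only. It says nothing about how the heads implement it with QK and OV circuits.

```python
def induction_prediction(tokens):
    """Predict the next token by copying what followed the last earlier
    occurrence of the current token; return None if there is no match."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan earlier positions from the most recent to the oldest.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

print(induction_prediction(["A", "B", "C", "A"]))  # -> B
print(induction_prediction(["A", "B", "C"]))       # -> None
```

Since the rule depends only on the pattern "[A][B] ... [A] -> [B]", it applies to token sequences never seen in training, which is what makes it a form of in-context learning.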