In [9]:
import numpy as np
import torch
from torch import nn

# Artificial Neural Networks

Artificial Neural Networks (ANNs) are the foundation on which deep learning is built. Inspired by the neuronal architecture of the brain, an ANN consists of neurons arranged in layers. The input passes through these layers in sequence, and each layer transforms it a little further towards the output.

There are two main aspects of an ANN architecture we need to understand: the neuron (perceptron) and the network (multilayer perceptron).

## Perceptron

The individual neuron, also known as the perceptron, comes with a vector of weights $w$ and a bias (intercept) $b$.

- Input: The input is a vector $x$ where each element holds the value of one feature.
- Output: The output is calculated by taking the dot product of the input $x$ and the neuron's weights $w$, adding the bias $b$, and passing the result through an activation function $g$.

$$ \hat{y}= g(w \cdot x + b) $$ 

<img src="images/ann.png" 
        alt="Picture" 
        width="400" 
        height="400" 
        style="display: block; margin: 0 auto" />

Image Source: [Medium Blog](https://towardsdatascience.com/the-basics-of-neural-networks-neural-network-series-part-1-4419e343b2b) 

In [25]:
# Implementing a perceptron

# defining the perceptron class
class Perceptron(nn.Module): 
    def __init__(self): 
        super(Perceptron,self).__init__() #inherit from Module superclass
        self.w=nn.Parameter(torch.tensor ([1.,0.,0.,0.,0.],dtype=torch.float16)) #setting weight
        self.b=nn.Parameter(torch.tensor ([0.5])) #bias
        
    def sigmoid(self,z):  #defining activation function
        return 1/(1+torch.exp(-z))

    def forward(self, x): #defining a forward pass
        w=self.w
        b=self.b
        z=torch.dot(w,x)+b #w and x are both 1-D, so no transpose is needed
        y_hat= self.sigmoid(z)
        return y_hat

# defining two inputs
X1= torch.tensor([1.,0.,0.,0.,0.],dtype=torch.float16)
X2= torch.tensor([0.,0.,1.,0.,0.],dtype=torch.float16)

# calling the perceptron
perceptron=Perceptron()
prediction1= perceptron(X1) #calling the module runs forward()
print(f"The prediction for input 1 is {prediction1}.")
prediction2= perceptron(X2)
print(f"The prediction for input 2 is {prediction2}.")

The prediction for input 1 is tensor([0.8176], grad_fn=<MulBackward0>).
The prediction for input 2 is tensor([0.6225], grad_fn=<MulBackward0>).




## Multilayer Perceptron

A multilayer perceptron is the typical ANN. It consists of multiple neurons organized into layers, and those layers are connected sequentially to form a network. Passing information through the layers from input to output is known as forward propagation.

The forward propagation for one layer of a multilayer perceptron is as follows: 

- Each input vector $x_{a\times 1}$ is sent to every neuron in the layer. Each neuron has one weight per input dimension, giving it a weight vector $w_{a\times 1}$, so a single neuron computes

$$ \hat{y}= g(w^T_{1\times a}x_{a\times 1} + b) $$ 

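- For a layer with $n$ neurons and a batch of $m$ inputs, the weight vectors are stacked into a matrix with one row of $w^T$ per neuron and the inputs into a matrix with one column per input. The bias $b_{n\times 1}$ is added to every column:

$$ \hat{y}_{n\times m}= g(w^T_{n\times a}x_{a\times m} + b_{n\times 1}) $$ 

- In the final output, column $j$ is the output of the layer for input $j$.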

<img src="images/fpmlp.png" 
        alt="Picture" 
        width="500" 
        height="500" 
        style="display: block; margin: 0 auto" />

Image Source: [Avabodha Blog](https://avabodha.in/logistic-regression-and-basics-of-neural-network/) 

In [34]:
# Implementing a one layer perceptron
# only the weight and biases change from single perceptron

# defining the perceptron class
class LayerPerceptron(nn.Module): 
    def __init__(self): 
        super(LayerPerceptron,self).__init__() #inherit from Module superclass
        self.w=nn.Parameter(torch.randn (5,3)) #setting weight
        self.b=nn.Parameter(torch.randn (3,1)) #bias
        
    def sigmoid(self,z):  #defining activation function
        return 1/(1+torch.exp(-z))

    def forward(self, x): #defining a forward pass
        w=self.w
        b=self.b
        z=torch.mm(w.T,x.T)+b #note that X is transposed to convert from row vector to column vector
        y_hat= self.sigmoid(z)
        return y_hat

# defining two inputs
X1= torch.tensor([[1.,0.,0.,0.,0.]],dtype=torch.float32)
X2= torch.tensor([[0.,0.,1.,0.,0.]],dtype=torch.float32)

# calling the perceptron
perceptron=LayerPerceptron()
prediction1= perceptron(X1) #calling the module runs forward()
print(f"The prediction for input 1 is {prediction1}.")
prediction2= perceptron(X2)
print(f"The prediction for input 2 is {prediction2}.")

The prediction for input 1 is tensor([[0.2380],
        [0.0548],
        [0.3006]], grad_fn=<MulBackward0>).
The prediction for input 2 is tensor([[0.7495],
        [0.1141],
        [0.4705]], grad_fn=<MulBackward0>).
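
The layer above is called on one input at a time. As a rough sketch of the matrix form of forward propagation, the same computation can be applied to several inputs at once by stacking them as rows of a matrix. The names `W`, `b`, `X` and the sizes `a`, `n`, `m` are illustrative assumptions and are not taken from the cell above.

```python
import torch

torch.manual_seed(0)  # fixed seed so the example is repeatable

a, n, m = 5, 3, 2            # input features, neurons, number of inputs (assumed sizes)
W = torch.randn(a, n)        # one weight column per neuron
b = torch.randn(n, 1)        # one bias per neuron
X = torch.tensor([[1., 0., 0., 0., 0.],
                  [0., 0., 1., 0., 0.]])  # m inputs stacked as rows

# W^T is (n x a) and X^T is (a x m); b broadcasts across the m columns
Y_hat = torch.sigmoid(W.T @ X.T + b)

print(Y_hat.shape)  # torch.Size([3, 2]): column j is the layer output for input j
```

Each column of `Y_hat` matches what a single-input call would return for the corresponding row of `X`.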


# Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a class of neural network architectures, inspired by the organization of the human nervous system, that can model sequential data.

## Background

It is not entirely clear who invented the first computational RNNs, but the first reference to “recurrent nets” can be found in McCulloch & Pitts (1943). Rumelhart et al. (1985) then discuss them from a machine learning and backpropagation perspective, and Elman (1990) builds on this to propose a way of implementing dynamic memory for time series.

## Architecture of RNNs

An RNN processes a sequence as follows: 

1. Input Embedding: The input is converted into an embedding.
2. Hidden State: The input embedding is used to calculate a hidden state.
3. Output Embedding: The hidden state is converted to the output embedding.
4. Recursion: Once the hidden state has been calculated for the first token, it is also used in calculating the hidden state for the next token, and so on. The inputs change at every step while the hidden state is carried forward, which is why an RNN can be pictured as a network unrolled in time.

<img src="images/rnn.png" 
        alt="Picture" 
        width="600" 
        height="600" 
        style="display: block; margin: 0 auto" />

Image Source: [Analytics Vidhya Blog](https://www.analyticsvidhya.com/blog/2022/03/a-brief-overview-of-recurrent-neural-networks-rnn/) 
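
The recursion described above can be sketched in a few lines. This is a minimal illustration, not a trainable model: it assumes a `tanh` activation, random weights, and the common naming in which `U` maps input to hidden state, `W` maps hidden state to hidden state, and `V` maps hidden state to output.

```python
import torch

torch.manual_seed(0)

d_in, d_hidden, d_out, seq_len = 4, 3, 2, 5   # illustrative sizes
U = torch.randn(d_hidden, d_in) * 0.1         # input -> hidden (assumed naming)
W = torch.randn(d_hidden, d_hidden) * 0.1     # hidden -> hidden
V = torch.randn(d_out, d_hidden) * 0.1        # hidden -> output

xs = torch.randn(seq_len, d_in)   # one input embedding per row
h = torch.zeros(d_hidden)         # initial hidden state
outputs = []
for x_t in xs:
    # the new hidden state mixes the current input with the previous state
    h = torch.tanh(U @ x_t + W @ h)
    outputs.append(V @ h)         # output embedding for this step

outputs = torch.stack(outputs)
print(outputs.shape)  # torch.Size([5, 2])
```

The same three matrices are reused at every step; only `x_t` and `h` change as the loop moves along the sequence.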

## Features of RNNs

Advantage: They can model sequences of arbitrary length.

Disadvantages: 

- The weights U, V, W are shared across all time steps no matter how long the sequence grows, which is restrictive when different words should influence the following words in different ways.
- As the hidden state is overwritten with each new word, it gradually forgets information about the past. Thus RNNs struggle to retain early information in longer sequences.

# Long Short Term Memory Network

LSTMs are a special type of RNN that were built to overcome the issues of forgetting in vanilla RNNs. 

Background: 

Basics: For any new input, vanilla RNNs completely rewrite the hidden state, which leads to “forgetting”. To prevent this amnesia, LSTMs add 3 gates: a forget gate, an input gate and an output gate.
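
As a hedged sketch of how the three gates interact, the snippet below implements one step of the standard LSTM formulation, which also keeps a separate cell state `c` alongside the hidden state `h`. The function name `lstm_step`, the weight names, and the sizes are hypothetical and chosen for illustration only.

```python
import torch

torch.manual_seed(0)

d_in, d_hidden = 4, 3   # illustrative sizes
# one weight matrix per gate, acting on the concatenation [h_prev, x_t]
W_f, W_i, W_o, W_c = (torch.randn(d_hidden, d_hidden + d_in) * 0.1 for _ in range(4))
b_f, b_i, b_o, b_c = (torch.zeros(d_hidden) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    hx = torch.cat([h_prev, x_t])
    f = torch.sigmoid(W_f @ hx + b_f)      # forget gate: how much old memory to keep
    i = torch.sigmoid(W_i @ hx + b_i)      # input gate: how much new content to write
    o = torch.sigmoid(W_o @ hx + b_o)      # output gate: how much memory to expose
    c_tilde = torch.tanh(W_c @ hx + b_c)   # candidate memory content
    c = f * c_prev + i * c_tilde           # memory is blended, not overwritten
    h = o * torch.tanh(c)
    return h, c

h, c = torch.zeros(d_hidden), torch.zeros(d_hidden)
for x_t in torch.randn(5, d_in):
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)
```

The line `c = f * c_prev + i * c_tilde` is what distinguishes this from a vanilla RNN: old memory is scaled by the forget gate instead of being replaced outright.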

# Variational Autoencoders

Variational Autoencoders (VAEs) are generative models that learn to represent high-dimensional data, like images, in a lower-dimensional latent space.

## Background

VAEs were introduced by [Kingma & Welling](https://arxiv.org/abs/1312.6114) (2014). They belong to the families of probabilistic graphical models and variational Bayesian methods.

## Architecture of VAEs

A VAE consists of the following main parts:

**Encoder**: The encoder maps input data (e.g., an image) to a probability distribution in the latent space. Instead of encoding an input to a single point, as in traditional autoencoders, the VAE encoder outputs parameters of a probability distribution (mean and variance) that describe the latent variables. This allows the model to generate a range of outputs rather than a single fixed representation.

**Latent Space**: The latent space represents the compressed information from the input data in the form of a probability distribution. This distribution captures the underlying features of the data, allowing for new data generation by sampling from this space.

**Decoder**: The decoder takes a sample from the latent space and reconstructs it back into the original data space, attempting to reproduce the input data as closely as possible.

<img src="images/vae.webp" 
        alt="Picture" 
        width="600" 
        height="600" 
        style="display: block; margin: 0 auto" />

Image Source: [TDS Blog](https://towardsdatascience.com/difference-between-autoencoder-ae-and-variational-autoencoder-vae-ed7be1c038f2) 

## Features of VAEs

1. **Distributional Encoding**: Traditional autoencoders encode data to fixed points in the latent space, while VAEs encode data into distributions.
   
2. **Generative Capability**: VAEs are designed to generate new data points by sampling from the learned latent space distributions, whereas traditional autoencoders primarily focus on reconstructing the input without a generative aspect.

3. **Reparameterization Trick**: One of the key innovations in VAEs is the reparameterization trick, which enables efficient backpropagation through the stochastic sampling process. Sampling directly from the latent distribution (defined by its mean and variance) introduces randomness that gradients cannot flow through, making it difficult to train the network using backpropagation. The reparameterization trick solves this by expressing the sampled latent variable as a deterministic function of the mean, the variance, and a random Gaussian noise term: $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. We sample the noise from a standard normal distribution (a fixed distribution of zero mean and unit variance) and shift and scale this sample using the learned mean and variance. This reformulation keeps the path from the encoder outputs to the sample differentiable and allows efficient gradient computation.
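
A minimal sketch of the reparameterization trick follows. It assumes the encoder outputs a mean and a log-variance (a common parameterization, not something stated above); the tensors `mu` and `log_var` are hand-written stand-ins for encoder outputs.

```python
import torch

torch.manual_seed(0)

# stand-ins for encoder outputs; log-variance is an assumed parameterization
mu = torch.tensor([0.5, -1.0], requires_grad=True)
log_var = torch.tensor([0.0, -2.0], requires_grad=True)

eps = torch.randn_like(mu)                 # noise from N(0, I), carries no gradient
z = mu + torch.exp(0.5 * log_var) * eps    # z ~ N(mu, sigma^2), differentiable in mu and log_var

z.sum().backward()
print(mu.grad)       # tensor([1., 1.]) because dz/dmu = 1
print(log_var.grad)  # equals 0.5 * sigma * eps
```

Because the randomness lives entirely in `eps`, gradients reach both `mu` and `log_var` even though `z` is a random sample.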

Each word in the input sequence is converted into a vector of size $N_{dim}$, and these vectors are stacked row by row into an embedding matrix. If the input has length $N_l$, then the final embedding has dimension $N_l \times N_{dim}$, where the hyperparameter $N_l$ is set to the length of the longest sequence in our training dataset.

Each position in the input sequence is represented using a combination of sine and cosine functions, where each dimension of a token embedding is assigned its own frequency. The resulting positional encoding matrix is added to the embedding matrix to create a position-aware embedding. 
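
The sketch below shows one way to build such a matrix. It assumes the sinusoidal scheme from the original Transformer paper (base constant 10000, sine on even dimensions and cosine on odd ones); the function name `sinusoidal_positions` is made up for this example.

```python
import torch

def sinusoidal_positions(n_l, n_dim):
    """Return an (n_l x n_dim) matrix of sine/cosine position codes (n_dim assumed even)."""
    pos = torch.arange(n_l, dtype=torch.float32).unsqueeze(1)   # (n_l, 1)
    i = torch.arange(0, n_dim, 2, dtype=torch.float32)          # even dimension indices
    freq = 1.0 / (10000 ** (i / n_dim))                         # one frequency per sin/cos pair
    pe = torch.zeros(n_l, n_dim)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe

n_l, n_dim = 6, 8                          # illustrative sizes
embeddings = torch.randn(n_l, n_dim)       # stand-in for the token embedding matrix
x = embeddings + sinusoidal_positions(n_l, n_dim)
print(x.shape)  # torch.Size([6, 8])
```

The addition is element-wise, so the result keeps the $N_l \times N_{dim}$ shape of the embedding matrix.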

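For an input vector $x$ whose elements have mean $\mu$ and variance $\sigma^2$, Layer Normalization is computed as:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

where $\epsilon$ is a small constant that prevents division by zero.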

This residual addition followed by normalization helps stabilize activations during the forward pass and gradients during the backward pass. 
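
As a small sketch, the function below normalizes each token vector and is checked against PyTorch's built-in `F.layer_norm`. It omits the learned scale and shift that full implementations usually include, and it assumes normalization is applied after the residual addition; `sublayer_out` is a random stand-in for an attention or MLP output.

```python
import torch
import torch.nn.functional as F

def layer_norm(x, eps=1e-5):
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)   # population variance
    return (x - mu) / torch.sqrt(var + eps)

torch.manual_seed(0)
x = torch.randn(4, 8)              # 4 tokens, 8 dimensions each
sublayer_out = torch.randn(4, 8)   # stand-in for an attention or MLP output

y = layer_norm(x + sublayer_out)   # residual addition, then normalization
print(torch.allclose(y, F.layer_norm(x + sublayer_out, (8,)), atol=1e-5))
```

After normalization every row of `y` has approximately zero mean and unit variance, whatever the scale of the incoming activations.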

The output of the decoder block, of dimensions $N_l \times N_{dim}$, is passed through a linear layer, of dimensions $N_{dim} \times N_v$, to get an output of dimensions $N_l \times N_v$, where row $i$ holds a score for every vocabulary item as the candidate next token after position $i$. 
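
The shape bookkeeping can be checked with a short sketch. The sizes are arbitrary, `decoder_out` is random data standing in for a real decoder output, and the softmax and greedy `argmax` at the end are a common way of turning scores into a prediction rather than something described above.

```python
import torch
from torch import nn

torch.manual_seed(0)
n_l, n_dim, n_v = 6, 8, 20               # illustrative sizes
decoder_out = torch.randn(n_l, n_dim)    # stand-in for the decoder block output

unembed = nn.Linear(n_dim, n_v, bias=False)  # maps N_dim -> N_v
logits = unembed(decoder_out)                # (n_l, n_v): one score per vocab item per position
probs = torch.softmax(logits, dim=-1)        # each row becomes a distribution over the vocabulary
next_token = probs[-1].argmax()              # greedy choice after the last position

print(logits.shape)  # torch.Size([6, 20])
```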

## Mechanistic Interpretability Framework

Mechanistic interpretability is a framework for looking at the mechanisms in transformer models in terms of operations on the residual stream. The main intuition is to break down a high-dimensional model into an easily understandable composition of mechanisms/components. 

<img src="images/mechtrans.png" 
        alt="Picture" 
        width="800" 
        height="800" 
        style="display: block; margin: 0 auto" />

Image Source: [Elhage et al 2021](https://transformer-circuits.pub/2021/framework/index.html)

### Important Concepts

#### Residual Stream

The residual stream starts out as the initial input token encodings and is carried through the whole transformer, with every position transformed in parallel. All components of a transformer (the token embedding, attention heads, MLP layers, and unembedding) communicate with each other by reading from and writing to different subspaces of the residual stream. Rather than analyzing the residual stream vectors directly, it can be helpful to decompose the residual stream into these different communication channels, corresponding to paths through the model.

Features of the residual stream: 

1. **Linear Structuring**: Any communication to and from the residual stream happens only through linear operations (addition or a linear map), which gives transformers a great deal of linearity. A consequence is that the residual stream doesn't have a privileged basis.
2. **Selective Flow**: Information flow through the residual stream is selective: the model can effectively "select" which layers a token's information is routed through, and this selectivity is implemented in practice by the model weights. 


Note: A privileged basis (sometimes called a "preferred basis") for a set of vectors is a particular choice of basis vectors that simplifies calculations, enhances understanding, or aligns with specific properties of the vector space, such as the $n$ coordinate vectors in $\mathbb{R}^n$. In the case of transformers, a privileged basis would be one that enhances interpretability or makes calculations easier. For mechanistic interpretability specifically, the task is then to decompose a model in terms of the components that do have a privileged basis (embedding, attention, MLP), where privilege is a spectrum. 

#### Virtual Weights

The linearity of the residual stream means that the connection between any two layers can be quantified as "virtual weights", which indicate the extent to which the later layer reads the information written by the earlier layer. 
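
One way to picture virtual weights is as the product of a later layer's input matrix with an earlier layer's output matrix. The sketch below uses random matrices with hypothetical names (`W_out_early`, `W_in_late`) and ignores everything else that writes to the stream.

```python
import torch

torch.manual_seed(0)
d_model, d_a, d_b = 16, 4, 4                 # illustrative sizes
W_out_early = torch.randn(d_model, d_a)      # early layer writes into the residual stream
W_in_late = torch.randn(d_b, d_model)        # later layer reads from the residual stream

# the stream only adds, so what the later layer reads of the early layer's
# output is a single linear map: the product of the two matrices
W_virtual = W_in_late @ W_out_early          # (d_b, d_a)

a = torch.randn(d_a)                         # early layer's internal activation
direct = W_in_late @ (W_out_early @ a)       # write to the stream, then read from it
print(torch.allclose(direct, W_virtual @ a, atol=1e-4))
```

Entries of `W_virtual` that are close to zero indicate that the later layer largely ignores that part of what the earlier layer wrote.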

#### Superposition

The residual stream has far fewer dimensions than the components that read from and write to it have in total, so it acts as a bottleneck. As a result, superposition occurs: a dimension does not correspond to a single interpretable feature and instead encodes a mix of features. This is workable because important features like "London" are sparse. The model thus strikes a balance between encoding as many features as possible and being able to read them out easily. 

The high demand on residual stream bandwidth that leads to superposition also appears to give some attention heads and MLP neurons a memory management role, where they read information from the stream and write out its negative, clearing it.

#### Attention Circuits

The attention mechanism in transformers can be considered to have the following important features: 

1. There are two main circuits: QK (which computes relations between tokens) and OV (which computes how each token affects the output if attended to).

2. The attention heads are independent and additive.

<img src="images/atthead.png" 
        alt="Picture" 
        width="800" 
        height="800" 
        style="display: block; margin: 0 auto" />

Image Source: [Elhage et al 2021](https://transformer-circuits.pub/2021/framework/index.html)   

3. The attention heads move information, i.e. they read information from the residual stream of one token and write it to the residual stream of another token. Within an attention block, the chain of matrix multiplications is associative, so only the combined products matter. For example, $W_{OV}$ can be factorized in many ways into a $W_{O}$ and a $W_{V}$ without changing the head's behaviour, and the same goes for $W_{QK}$, though OV and QK perform very different functions.
4. The composition of attention heads across layers forms induction heads, which greatly increase the expressivity of transformers. Key and query composition are very different from value composition.
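
The QK and OV circuits of a single head can be sketched as below. The example assumes a row-per-token layout, causal masking, the usual $1/\sqrt{d_{head}}$ scaling, and random weights; the matrix `M` is an arbitrary invertible matrix introduced only to illustrate that the factorization of $W_{OV}$ is not unique.

```python
import torch

torch.manual_seed(0)
n_tok, d_model, d_head = 5, 16, 4            # illustrative sizes
X = torch.randn(n_tok, d_model)              # residual stream, one row per token
W_Q, W_K, W_V = (torch.randn(d_head, d_model) * 0.1 for _ in range(3))
W_O = torch.randn(d_model, d_head) * 0.1

W_QK = W_Q.T @ W_K                           # (d_model, d_model): which tokens attend to which
W_OV = W_O @ W_V                             # (d_model, d_model): what an attended token writes

scores = X @ W_QK @ X.T / d_head ** 0.5
mask = torch.tril(torch.ones(n_tok, n_tok, dtype=torch.bool))   # causal: look only backwards
A = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
head_out = A @ X @ W_OV.T                    # this is what gets added to the residual stream

# refactoring W_OV with an invertible M leaves the head unchanged
M = torch.diag(torch.arange(1., d_head + 1))
same = torch.allclose((W_O @ torch.linalg.inv(M)) @ (M @ W_V), W_OV, atol=1e-5)
print(same)
```

`A` depends only on `W_QK` and `head_out` depends on the weights only through `A` and `W_OV`, which is why the two circuits can be studied separately.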

### Reverse Engineering

Using toy attention-only models, we can analyse characteristic behaviours of transformers: 

1. **Zero layer Transformers**: They emulate bigram statistics.
2. **One layer Transformers**: They emulate bigram + skipgram statistics. Trigrams are hard to learn because the positional encodings mainly convey whether a token comes before or after another, not its exact position. 
3. **Two layer Transformers**: At this stage, the composition of attention heads across layers leads to the formation of *induction heads*, which are equivalent to a simple in-context learning algorithm. The formation of these induction heads marks a turning point for the emergence of in-context learning.
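
The zero layer case is simple enough to sketch directly. With no attention or MLP layers, the only path through the model is embedding followed by unembedding, so the prediction after a token depends on that token alone. The matrices below are random stand-ins with assumed names `W_E` and `W_U`.

```python
import torch

torch.manual_seed(0)
n_v, d_model = 10, 4                 # illustrative vocabulary and model sizes
W_E = torch.randn(d_model, n_v)      # token embedding
W_U = torch.randn(n_v, d_model)      # unembedding

# with no attention or MLP, the only path is embed -> unembed
bigram_logits = W_U @ W_E            # (n_v, n_v): column t scores every next token after t

tokens = torch.tensor([3, 7, 3])
logits = bigram_logits[:, tokens].T  # (3, n_v): each row depends only on the current token
print(torch.equal(logits[0], logits[2]))  # True: same token, same prediction, whatever the context
```

Once trained, a table like `bigram_logits` can only capture which token tends to follow which, and that is exactly what bigram statistics are.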