In [1]:
import numpy as np
import torch
from torch import nn

# Artificial Neural Networks

Artificial Neural Networks (ANNs) are the foundation on which deep learning is built. Inspired by the neuronal architecture of the brain, an ANN consists of neurons arranged in layers, and the input is passed through these layers in turn to transform it into the output.

There are two main aspects of an ANN architecture we need to understand: the neuron (perceptron) and the network (multilayer perceptron).

## Perceptron

The individual neuron, also known as the perceptron, comes with a vector of weights $w$ and a bias (intercept) $b$.

- Input: The input is a vector $x$ in which each element is the value of one feature.
- Output: The output is calculated by taking the dot product of the input $x$ with the neuron's weights $w$, adding the bias $b$, and passing the result through an activation function $g$.

$$ \hat{y}= g(w \cdot x + b) $$

<img src="images/ann.png" 
        alt="Picture" 
        width="400" 
        height="400" 
        style="display: block; margin: 0 auto" />

Image Source: [Medium Blog](https://towardsdatascience.com/the-basics-of-neural-networks-neural-network-series-part-1-4419e343b2b) 

In [17]:
# Implementing a perceptron

# defining the perceptron class
class Perceptron(nn.Module):
    def __init__(self):
        super().__init__()  # initialize the nn.Module superclass
        # weights use the default float32 dtype so they match the bias
        self.w = nn.Parameter(torch.tensor([1., 0., 0., 0., 0.]))  # setting weight
        self.b = nn.Parameter(torch.tensor([0.5]))  # bias

    def sigmoid(self, z):  # defining activation function
        return 1 / (1 + torch.exp(-z))

    def forward(self, x):  # defining a forward pass
        z = torch.dot(self.w, x) + self.b  # weighted sum of the inputs plus bias
        y_hat = self.sigmoid(z)
        return y_hat

# defining two inputs
X1 = torch.tensor([1., 0., 0., 0., 0.])
X2 = torch.tensor([0., 0., 1., 0., 0.])

# calling the perceptron (calling the module invokes forward)
perceptron = Perceptron()
prediction1 = perceptron(X1)
print(f"The prediction for input 1 is {prediction1}.")
prediction2 = perceptron(X2)
print(f"The prediction for input 2 is {prediction2}.")

The prediction for input 1 is tensor([0.8176], grad_fn=<MulBackward0>).
The prediction for input 2 is tensor([0.6225], grad_fn=<MulBackward0>).


## Multilayer Perceptron

A multilayer perceptron is the typical ANN. It consists of multiple neurons organized into layers, and those layers are combined sequentially into a network. Passing information through the layers is known as forward propagation.

The forward propagation for one layer of a multilayer perceptron is as follows: 

- Each input vector $x_{a\times 1}$ is sent to every neuron in the layer. Each neuron has one weight per input dimension, giving a weight vector $w_{a\times 1}$, so a single neuron computes

$$ \hat{y}= g(w^T_{1\times a}x_{a\times 1} + b) $$ 

- For a layer with $n$ neurons and a batch of $m$ inputs, the weight vectors are stacked into a matrix and the inputs are placed side by side as columns. Here $b_{n\times 1}$ holds one bias per neuron and is added to every column:

$$ \hat{y}_{n\times m}= g(w^T_{n\times a}x_{a\times m} + b_{n\times 1}) $$ 

- In the final output, each column is the output of the layer for one input.

<img src="images/fpmlp.png" 
        alt="Picture" 
        width="500" 
        height="500" 
        style="display: block; margin: 0 auto" />

Image Source: [Avabodha Blog](https://avabodha.in/logistic-regression-and-basics-of-neural-network/) 

In [34]:
# Implementing a single layer of perceptrons
# only the shapes of the weights and biases change from the single perceptron

# defining the layer class
class LayerPerceptron(nn.Module):
    def __init__(self):
        super().__init__()  # initialize the nn.Module superclass
        self.w = nn.Parameter(torch.randn(5, 3))  # weights: 5 input features x 3 neurons
        self.b = nn.Parameter(torch.randn(3, 1))  # biases: one per neuron

    def sigmoid(self, z):  # defining activation function
        return 1 / (1 + torch.exp(-z))

    def forward(self, x):  # defining a forward pass
        # x holds one input per row, so it is transposed to put inputs in columns
        z = torch.mm(self.w.T, x.T) + self.b
        y_hat = self.sigmoid(z)
        return y_hat

# defining two inputs (each a 1 x 5 row vector)
X1 = torch.tensor([[1., 0., 0., 0., 0.]], dtype=torch.float32)
X2 = torch.tensor([[0., 0., 1., 0., 0.]], dtype=torch.float32)

# calling the layer (calling the module invokes forward)
perceptron = LayerPerceptron()
prediction1 = perceptron(X1)
print(f"The prediction for input 1 is {prediction1}.")
prediction2 = perceptron(X2)
print(f"The prediction for input 2 is {prediction2}.")

The prediction for input 1 is tensor([[0.2380],
        [0.0548],
        [0.3006]], grad_fn=<MulBackward0>).
The prediction for input 2 is tensor([[0.7495],
        [0.1141],
        [0.4705]], grad_fn=<MulBackward0>).
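
The layer above processes one input at a time. As a rough sketch of how the same computation extends to a batch of inputs and to layers stacked sequentially, the example below uses PyTorch's built-in `nn.Linear` and `nn.Sigmoid`. The layer sizes (5, 3, 1) and the random batch are assumptions chosen only for illustration. Note that `nn.Linear` expects one input per row, so the outputs come out as rows rather than columns.

```python
import torch
from torch import nn

torch.manual_seed(0)  # fixed seed so the random weights are repeatable

# Two layers applied in sequence: 5 input features -> 3 hidden neurons -> 1 output
mlp = nn.Sequential(
    nn.Linear(5, 3),  # computes x @ W.T + b for every row of x
    nn.Sigmoid(),
    nn.Linear(3, 1),
    nn.Sigmoid(),
)

# A batch of m = 4 inputs, one input per row
X = torch.randn(4, 5)

# Forward propagation: each layer's output is the next layer's input
hidden = mlp[1](mlp[0](X))  # shape (4, 3): one row of 3 activations per input
y_hat = mlp(X)              # shape (4, 1): one prediction per input

print(hidden.shape, y_hat.shape)
```

Using `nn.Sequential` keeps the layer order explicit, which mirrors the idea of information flowing forward through the network.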


## Backpropagation

# Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a class of neural network architectures, inspired by the organization of the human nervous system, that can model sequential data.

## Background

It is not entirely clear who invented the first computational RNNs, but the first reference to “recurrent nets” can be found in McCulloch & Pitts (1943). Rumelhart et al. (1985) later discuss them from a machine learning and backpropagation perspective, and Elman (1990) builds on this work to propose a way of implementing dynamic memory for time series.

## Architecture of RNNs

The architecture of an RNN is as follows:

1. Input Embedding: The input is converted into an embedding.
2. Hidden State: The input embedding is used to calculate a hidden state.
3. Output Embedding: The hidden state is converted to the output embedding.
4. Recursion: Once the hidden state has been calculated for the first token, it is also used in calculating the hidden state for the next token, and so on. Thus the inputs change while the hidden state is carried forward, which is why the network can be pictured as being unrolled in time.

<img src="images/rnn.png" 
        alt="Picture" 
        width="600" 
        height="600" 
        style="display: block; margin: 0 auto" />

Image Source: [Analytics Vidhya Blog](https://www.analyticsvidhya.com/blog/2022/03/a-brief-overview-of-recurrent-neural-networks-rnn/) 
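
The four steps above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the weight names `U`, `W` and `V`, the `tanh` activation, the omission of bias terms and all sizes are assumptions, and the inputs are taken to be already embedded.

```python
import torch

torch.manual_seed(0)

n_in, n_hidden, n_out = 4, 3, 2  # illustrative sizes (assumed)

# The same three matrices are reused at every time step
U = torch.randn(n_hidden, n_in)      # input  -> hidden
W = torch.randn(n_hidden, n_hidden)  # hidden -> hidden (the recurrence)
V = torch.randn(n_out, n_hidden)     # hidden -> output

def rnn_forward(inputs):
    """Run a vanilla RNN over a sequence of input embeddings."""
    h = torch.zeros(n_hidden)            # initial hidden state
    outputs = []
    for x_t in inputs:                   # one token embedding per step
        h = torch.tanh(U @ x_t + W @ h)  # new hidden state mixes input and old state
        outputs.append(V @ h)            # output embedding for this step
    return torch.stack(outputs), h

sequence = torch.randn(5, n_in)          # 5 tokens, each already embedded
outputs, last_hidden = rnn_forward(sequence)
print(outputs.shape, last_hidden.shape)
```

The loop is the "unrolling in time": the body is identical at every step, and only `x_t` and `h` change.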

## Features of RNNs

Advantage: They can model sequences.

Disadvantages: 

- The weights U, V, W are shared across all time steps, no matter how long the sequence grows, which seems restrictive in terms of determining how each word differently impacts the next in a sequence.
- As the hidden state is changed with each new word, it forgets information about the past. Thus RNNs struggle to retain early information in longer sequences.

# Long Short Term Memory Network

LSTMs are a special type of RNN that were built to overcome the issues of forgetting in vanilla RNNs. 

Background: 

Basics: For any new input, vanilla RNNs completely modify the hidden state, which leads to “forgetting”. To prevent this amnesia, LSTMs add 3 gates: a forget gate, an input gate and an output gate.
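
To make the role of the three gates concrete, here is a hedged sketch of a single LSTM step following the commonly used formulation. The separate cell state `c`, the candidate values `g`, the weight and bias names, and all sizes are assumptions not spelled out in the text above.

```python
import torch

torch.manual_seed(0)

n_in, n_hidden = 4, 3  # illustrative sizes (assumed)

# One weight matrix and bias per gate, plus one pair for the candidate values.
# Each acts on the concatenation [h_prev, x_t].
W_f, W_i, W_o, W_g = (torch.randn(n_hidden, n_hidden + n_in) for _ in range(4))
b_f, b_i, b_o, b_g = (torch.zeros(n_hidden) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    hx = torch.cat([h_prev, x_t])
    f = torch.sigmoid(W_f @ hx + b_f)  # forget gate: how much old cell state to keep
    i = torch.sigmoid(W_i @ hx + b_i)  # input gate: how much new information to write
    o = torch.sigmoid(W_o @ hx + b_o)  # output gate: how much cell state to expose
    g = torch.tanh(W_g @ hx + b_g)     # candidate values
    c = f * c_prev + i * g             # cell state is updated, not overwritten
    h = o * torch.tanh(c)              # new hidden state
    return h, c

h, c = torch.zeros(n_hidden), torch.zeros(n_hidden)
for x_t in torch.randn(5, n_in):       # 5 already-embedded tokens
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)
```

Because the forget gate can stay close to 1, the old cell state can pass through many steps largely unchanged, which is the mechanism that counters forgetting.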

# Variational Autoencoders

Variational Autoencoders (VAEs) are generative models that learn to represent high-dimensional data, like images, in a lower-dimensional latent space.

## Background

VAEs were introduced by [Kingma & Welling](https://arxiv.org/abs/1312.6114) (2014). They belong to the families of probabilistic graphical models and variational Bayesian methods.

## Architecture of VAEs

A VAE consists of the following main parts:

**Encoder**: The encoder maps input data (e.g., an image) to a probability distribution in the latent space. Instead of encoding an input to a single point, as in traditional autoencoders, the VAE encoder outputs parameters of a probability distribution (mean and variance) that describe the latent variables. This allows the model to generate a range of outputs rather than a single fixed representation.

**Latent Space**: The latent space represents the compressed information from the input data in the form of a probability distribution. This distribution captures the underlying features of the data, allowing for new data generation by sampling from this space.

**Decoder**: The decoder takes a sample from the latent space and reconstructs it back into the original data space, attempting to reproduce the input data as closely as possible.

<img src="images/vae.webp" 
        alt="Picture" 
        width="600" 
        height="600" 
        style="display: block; margin: 0 auto" />

Image Source: [TDS Blog](https://towardsdatascience.com/difference-between-autoencoder-ae-and-variational-autoencoder-vae-ed7be1c038f2) 
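
The encoder, latent space and decoder described above can be sketched roughly as follows. The class name `TinyVAE`, the layer sizes, the activation functions and the choice to predict the log-variance instead of the variance are all assumptions made for illustration, and the training loss is left out.

```python
import torch
from torch import nn

class TinyVAE(nn.Module):
    """Minimal VAE skeleton; all layer sizes are illustrative assumptions."""

    def __init__(self, n_features=784, n_hidden=64, n_latent=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
        self.to_mean = nn.Linear(n_hidden, n_latent)     # mean of the latent distribution
        self.to_log_var = nn.Linear(n_hidden, n_latent)  # log-variance of the latent distribution
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_features), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.to_mean(h), self.to_log_var(h)

    def decode(self, z):
        return self.decoder(z)

torch.manual_seed(0)
vae = TinyVAE()
x = torch.rand(8, 784)         # stand-in batch, e.g. flattened 28x28 images
mean, log_var = vae.encode(x)  # parameters of a distribution, not a single point

# Draw one latent sample per input from N(mean, variance)
z = mean + torch.exp(0.5 * log_var) * torch.randn_like(mean)
x_reconstructed = vae.decode(z)
print(mean.shape, log_var.shape, x_reconstructed.shape)
```

Predicting the log-variance is a common convenience because it can take any real value, whereas a variance must stay positive.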

## Features of VAEs

1. **Distributional Encoding**: Traditional autoencoders encode data to fixed points in the latent space, while VAEs encode data into distributions.
   
2. **Generative Capability**: VAEs are designed to generate new data points by sampling from the learned latent space distributions, whereas traditional autoencoders primarily focus on reconstructing the input without a generative aspect.

3. **Reparameterization Trick**: One of the key innovations in VAEs is the reparameterization trick, which enables efficient backpropagation through the stochastic sampling process. Directly sampling from the latent distribution (defined by its mean and variance) introduces randomness that gradients cannot flow through, making it difficult to train the network using backpropagation. The reparameterization trick solves this by expressing the sampled latent variable as a deterministic function of the mean, the variance, and a random Gaussian noise term. We sample the noise from a standard normal distribution (a fixed distribution with zero mean and unit variance), then scale and shift this sample using the learned variance and mean. This reformulation makes the sample differentiable with respect to the mean and variance and allows efficient gradient computation.
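
A minimal sketch of the reparameterization trick is shown below. The tensors `mean` and `log_var` are stand-ins for encoder outputs, and parameterizing the variance through its logarithm is an assumption made for numerical convenience.

```python
import torch

torch.manual_seed(0)

# Stand-ins for encoder outputs; in a real VAE these come from the network
mean = torch.tensor([0.5, -1.0], requires_grad=True)
log_var = torch.tensor([0.0, -2.0], requires_grad=True)

def reparameterize(mean, log_var):
    std = torch.exp(0.5 * log_var)  # standard deviation from log-variance
    eps = torch.randn_like(std)     # noise from N(0, 1); carries all the randomness
    return mean + std * eps         # deterministic in mean and std once eps is drawn

z = reparameterize(mean, log_var)
z.sum().backward()                  # gradients flow through the sample

print(mean.grad)     # derivative of z with respect to mean is 1 for every element
print(log_var.grad)  # depends on the noise that was drawn
```

Since `eps` does not depend on any learned parameter, the only path from `z` back to the encoder runs through ordinary arithmetic, which autograd can differentiate.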

# Transformers

Transformers are a neural network architecture for natural language processing that forms the basis of nearly all foundation models/LLMs today.

## Background

The typical architecture, consisting of an encoder and a decoder, was developed by Vaswani et al. (2017) for sequence transduction tasks.

## Architecture

A transformer typically consists of two main parts: 

1. Encoder: It takes in the input and outputs a matrix representation of that input. For instance, the input could be the English sentence “How are you?”
2. Decoder: It takes the encoder output and iteratively generates an output. In our example, this is the translated sentence “¿Cómo estás?”

The encoder and decoder are themselves made up of many layers with the same structure (the original paper had 6 layers of each).

<img src="images/transformer.png" 
        alt="Picture" 
        width="800" 
        height="800" 
        style="display: block; margin: 0 auto" />

Image Source: [Datascience Dojo Blog](https://datasciencedojo.com/blog/transformer-models-types-their-uses/)

### Embedding Layer 

Since neural networks cannot directly process words, the input sequence needs to be converted into an embedding matrix which can then be passed into the model. 

Each word in the input sequence is converted into a vector of size $N_{dim}$, and these vectors are placed side by side to form an embedding matrix. If the input has length $N_l$, then the final embedding has dimension $N_{dim} \times N_l$.
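
As a small illustration of the shapes involved, the sketch below looks up embeddings with PyTorch's `nn.Embedding`. The toy vocabulary and the embedding size of 8 are assumptions, and real models use a trained tokenizer instead of a hand-written dictionary.

```python
import torch
from torch import nn

torch.manual_seed(0)

vocab = {"how": 0, "are": 1, "you": 2, "?": 3}  # toy vocabulary (assumed)
n_dim = 8                                        # embedding size N_dim (assumed)

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=n_dim)

tokens = ["how", "are", "you", "?"]
token_ids = torch.tensor([vocab[t] for t in tokens])  # N_l = 4 integer ids

E = embedding(token_ids)  # one row per token
print(E.shape)            # PyTorch stores tokens as rows: (N_l, N_dim)
print(E.T.shape)          # the transpose gives the N_dim x N_l layout used in the text
```

The embedding vectors start out random and are learned together with the rest of the model.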

### Positional Encoding

Since transformers process all tokens at once instead of sequentially like RNNs, positional information needs to be integrated into the input embedding for better performance.

Each position in the input sequence is represented using a combination of sine and cosine functions, with each dimension of a token embedding assigned its own frequency. The positional encoding matrix is added to the embedding matrix to produce a position-aware input embedding.
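
The sine and cosine scheme can be sketched as follows, using the formulation from Vaswani et al. (2017) in which even dimensions use sine and odd dimensions use cosine. The function name `positional_encoding`, the base of 10000 taken from that paper, and the toy sizes are stated here as assumptions, and tokens are stored as rows to match PyTorch conventions.

```python
import math
import torch

def positional_encoding(n_positions, n_dim):
    """Sinusoidal positional encoding; n_dim is assumed to be even."""
    position = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)  # (N_l, 1)
    # One frequency per pair of dimensions: 1 / 10000^(2i / n_dim)
    frequency = torch.exp(
        -math.log(10000.0) * torch.arange(0, n_dim, 2, dtype=torch.float32) / n_dim
    )
    pe = torch.zeros(n_positions, n_dim)
    pe[:, 0::2] = torch.sin(position * frequency)  # even dimensions
    pe[:, 1::2] = torch.cos(position * frequency)  # odd dimensions
    return pe

embeddings = torch.randn(4, 8)  # stand-in token embeddings, one row per token
encoded = embeddings + positional_encoding(4, 8)
print(encoded.shape)
```

Because the encoding depends only on position and dimension, it can be computed once and reused for every input of the same length.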

### Encoder Block

The encoder block consists of many encoder layers where the purpose of each encoder layer is to convert each token into an abstract representation vector that contains all learned information. The abstract representation is constructed via the following steps: 

#### Multi Head Attention

This enables the model to calculate attention scores between tokens. 

1. QKV: Each token vector is converted into 3 vectors via matrix multiplication: the query $Q$, the key $K$ and the value $V$.
2. QK Matmul: For each token, we get attention scores against the other tokens from $QK^T$, where each entry represents the relevance of one key token to one query token. In the attention score matrix, $a_{ij}$ represents the attention that the $i$th token pays to the $j$th token, so each row holds the scores of one token against all tokens.
3. Scaling: The scores are then scaled down by dividing them by the square root of the dimension of the query/key vectors. This step is implemented to ensure more stable gradients, as dot products over many dimensions can become excessively large.
4. Softmax: Subsequently, a softmax function is applied to the scaled scores to obtain the attention weights. This results in values between 0 and 1 that sum to 1 across each token's row.
5. AV Matmul: The attention weights are multiplied by the value matrix $V$, resulting in an output matrix where each row is the weighted combination of value vectors for one token. In this process, tokens with high softmax weights dominate the result, while tokens with low weights contribute little.


<img src="images/qkv.png" 
        alt="Picture" 
        width="500" 
        height="500" 
        style="display: block; margin: 0 auto" />

Image Source: [Ketan Doshi Blog](https://ketanhdoshi.github.io/Transformers-Why/)
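
Steps 1 to 5 for a single attention head can be sketched as below. The projection matrices are random stand-ins for learned weights, and all sizes are assumptions chosen for illustration.

```python
import math
import torch

torch.manual_seed(0)

n_tokens, n_dim, d_k = 4, 8, 8    # illustrative sizes (assumed)
X = torch.randn(n_tokens, n_dim)  # one row per token

# Step 1: projection matrices (random stand-ins for learned weights)
W_q, W_k, W_v = (torch.randn(n_dim, d_k) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T                         # step 2: scores[i, j] = query i against key j
scores = scores / math.sqrt(d_k)         # step 3: scale by sqrt of the key dimension
weights = torch.softmax(scores, dim=-1)  # step 4: each row sums to 1
output = weights @ V                     # step 5: weighted sum of value vectors

print(weights.sum(dim=-1))
print(output.shape)
```

Each row of `output` is a blend of all value vectors, with the blend determined by that token's row of attention weights.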

6. Concatenation: The calculations detailed above happen separately across $h$ heads, where both the input encoding and QKV are split across the heads. In the end, the outputs of the heads are concatenated back together.


<img src="images/multihead.webp" 
        alt="Picture" 
        width="500" 
        height="500" 
        style="display: block; margin: 0 auto" />

Image Source: [Ketan Doshi Blog](https://towardsdatascience.com/transformers-explained-visually-part-3-multi-head-attention-deep-dive-1c1ff1024853)
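
The splitting and concatenation across heads can be sketched like this. The helper `split_heads`, the random weight matrices and the sizes are assumptions. The final projection `W_out` follows the original paper but is not described in the text above.

```python
import math
import torch

torch.manual_seed(0)

n_tokens, n_dim, n_heads = 4, 8, 2  # illustrative sizes (assumed)
d_head = n_dim // n_heads           # each head works on a slice of the dimensions

X = torch.randn(n_tokens, n_dim)
W_q, W_k, W_v, W_out = (torch.randn(n_dim, n_dim) for _ in range(4))

def split_heads(t):
    # (n_tokens, n_dim) -> (n_heads, n_tokens, d_head)
    return t.reshape(n_tokens, n_heads, d_head).transpose(0, 1)

Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

# Attention runs independently in every head (batched over the first axis)
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_head)
weights = torch.softmax(scores, dim=-1)
per_head = weights @ V  # (n_heads, n_tokens, d_head)

# Concatenate the heads back into one row per token
concatenated = per_head.transpose(0, 1).reshape(n_tokens, n_dim)
output = concatenated @ W_out  # final projection, as in Vaswani et al. (2017)
print(per_head.shape, output.shape)
```

Splitting by reshaping keeps the total amount of computation similar to a single large head while letting each head learn a different attention pattern.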

#### Add & Norm

The output of the multi-head attention is added to its input (the residual connection), and the sum then undergoes layer normalization for each token row.

For an input vector $x$, the layer normalization is computed as:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

where $\mu$ and $\sigma^2$ are the mean and variance of the elements of $x$, and $\epsilon$ is a small constant that prevents division by zero.

This residual addition followed by normalization helps stabilize activations during the forward pass and gradients during the backward pass.
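
The Add & Norm step can be sketched directly from the formula. The tensors below are random stand-ins, the value `eps=1e-5` is an assumption that matches the default of PyTorch's `nn.LayerNorm`, and the learnable scale and shift that usually follow the normalization are left out.

```python
import torch
from torch import nn

torch.manual_seed(0)

n_tokens, n_dim = 4, 8                        # illustrative sizes (assumed)
x = torch.randn(n_tokens, n_dim)              # input to the sub-layer (the residual)
attention_out = torch.randn(n_tokens, n_dim)  # stand-in for the attention output

def layer_norm(t, eps=1e-5):
    mu = t.mean(dim=-1, keepdim=True)                  # mean over each token's features
    var = t.var(dim=-1, unbiased=False, keepdim=True)  # variance over the same features
    return (t - mu) / torch.sqrt(var + eps)

normed = layer_norm(x + attention_out)  # add, then normalize each row

# Cross-check against PyTorch's built-in (learnable scale and shift disabled)
reference = nn.LayerNorm(n_dim, elementwise_affine=False)(x + attention_out)
print(torch.allclose(normed, reference, atol=1e-5))
```

Normalizing per token row means each token's vector is rescaled on its own, independent of the other tokens in the sequence or batch.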

#### Feedforward Network

#### Add & Norm

### Decoder Block