# Basics of Transformers

Transformers are a neural network architecture, originally developed for natural language processing, that forms the basis of nearly all foundation models and LLMs today.

## Background

The typical architecture, consisting of an encoder and a decoder, was developed by Vaswani et al. (2017) for sequence transduction tasks such as machine translation.

## Architecture

A transformer typically consists of two main parts:

1. Encoder: It takes an input sequence and outputs a matrix representation of it. For instance, the input could be the English sentence “How are you?”
2. Decoder: It takes the encoder output and iteratively generates an output sequence. In our example, this is the translated sentence “¿Cómo estás?”

The encoder and decoder are each made up of several layers with the same structure (the original paper used 6 layers of each).

<img src="images/transformer.png" 
        alt="Picture" 
        width="800" 
        height="800" 
        style="display: block; margin: 0 auto" />

Image Source: [Datascience Dojo Blog](https://datasciencedojo.com/blog/transformer-models-types-their-uses/)

The details of each part of the architecture are given below. 

### Embedding Layer 

Since neural networks cannot directly process words, the input sequence is first split into tokens and each token is mapped to a vector. The resulting embedding matrix, with one row per token, is then passed into the model.
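As a minimal sketch of this lookup, assume a hypothetical five-item vocabulary and randomly initialized vectors standing in for learned ones. The names `vocab`, `n_dim` and `embedding_table` are illustrative and do not come from any library.

```python
import numpy as np

# Hypothetical toy vocabulary; real models use learned subword tokenizers.
vocab = {"<unk>": 0, "how": 1, "are": 2, "you": 3, "?": 4}
n_dim = 8  # embedding dimension (assumed for illustration)

rng = np.random.default_rng(0)
# One row per vocabulary item; in a real model these values are learned.
embedding_table = rng.normal(size=(len(vocab), n_dim))

tokens = ["how", "are", "you", "?"]
token_ids = np.array([vocab.get(t, vocab["<unk>"]) for t in tokens])

# Indexing with the id array selects one embedding row per token.
embeddings = embedding_table[token_ids]
print(embeddings.shape)  # (4, 8)
```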

### Positional Encoding

Since transformers process all tokens in parallel rather than sequentially like RNNs, they have no built-in notion of token order. Positional information is therefore added to the input embeddings.
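One way to build such position information is the fixed sinusoidal encoding proposed by Vaswani et al. (2017); learned position embeddings are a common alternative. The sketch below assumes an even `n_dim` and uses a placeholder matrix of ones in place of real token embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, n_dim):
    """Return a (seq_len, n_dim) matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]      # (seq_len, 1)
    even_dims = np.arange(0, n_dim, 2)[np.newaxis, :]  # (1, n_dim / 2)
    angles = positions / np.power(10000.0, even_dims / n_dim)
    encoding = np.zeros((seq_len, n_dim))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return encoding

pos_enc = sinusoidal_positional_encoding(seq_len=4, n_dim=8)

# Placeholder embeddings; the encoding is simply added element-wise.
token_embeddings = np.ones((4, 8))
model_input = token_embeddings + pos_enc
print(model_input.shape)  # (4, 8)
```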

### Encoder Block

The encoder consists of a stack of encoder layers, where each layer converts each token into a representation vector that incorporates context from the rest of the sequence.

Each layer constructs this representation via: Multi Head Attention + Add & Norm + Feedforward + Add & Norm

#### Multi Head Attention

This enables the model to calculate attention scores between tokens. The operations in multi-head attention (MHA) are as follows:


1. QKV: Each token vector is converted into three vectors via matrix multiplication with learned weight matrices: the query $Q$, the key $K$ and the value $V$.
2. QK Matmul: We compute attention scores for every pair of tokens as $QK^T$, where each entry measures the relevance of one key token to one query token. In the attention score matrix, $a_{ij}$ represents the attention that the $i$th token pays to the $j$th token, so each row holds the scores of one token over all tokens in the sequence.
3. Scaling: The scores are then scaled down by dividing them by the square root of the dimension of the query/key vectors. This keeps the gradients more stable, since dot products of high-dimensional vectors can grow large and push the softmax into regions with very small gradients.
4. Softmax: A softmax function is then applied to each row of the scaled scores to obtain the attention weights. Each weight lies between 0 and 1, and the weights in each row sum to 1.
5. AV Matmul: The attention weights are multiplied by the value matrix $V$. Each row of the output is a weighted sum of the value vectors of all tokens, with the weights taken from the corresponding row of the attention matrix. For row 1, the value vector of token 1 is weighted by the attention token 1 pays to itself, the value vector of token 2 by the attention token 1 pays to token 2, and so on. Tokens with high attention weights therefore contribute most to the output, while tokens with weights near zero are effectively ignored.


<img src="images/qkv.png" 
        alt="Picture" 
        width="500" 
        height="500" 
        style="display: block; margin: 0 auto" />

Image Source: [Ketan Doshi Blog](https://ketanhdoshi.github.io/Transformers-Why/)

6. Concatenation: The calculations detailed above happen separately across $h$ heads, where each head has its own QKV weight matrices that project the input into a different subspace. At the end, the outputs of all heads are concatenated side by side.

7. Linear Layer: If each head outputs a matrix of dimension $N_l \times N_{dim}$, the concatenated matrix has dimension $N_l \times (h \cdot N_{dim})$. It is therefore passed through a linear layer, i.e. a weight matrix of dimension $(h \cdot N_{dim}) \times N_{dim}$, to project it back to the model dimension.

<img src="images/multihead.png" 
        alt="Picture" 
        width="500" 
        height="500" 
        style="display: block; margin: 0 auto" />

Image Source: [Jay Alammar's Blog](https://jalammar.github.io/illustrated-transformer/)
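The seven steps above can be sketched in NumPy as follows. This is a minimal illustration with random weights, not a reference implementation. The head width `d_head = n_dim // n_heads` follows the original paper and is an assumption here; any head width works as long as `w_o` has a matching number of rows.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    exp = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])  # steps 2-3: QK^T and scaling
    weights = softmax(scores, axis=-1)       # step 4: row-wise softmax
    return weights @ v, weights              # step 5: weighted sum of values

def multi_head_attention(x, w_q, w_k, w_v, w_o):
    """w_q, w_k, w_v: lists of per-head (n_dim, d_head) matrices."""
    head_outputs = []
    for wq, wk, wv in zip(w_q, w_k, w_v):
        q, k, v = x @ wq, x @ wk, x @ wv     # step 1: per-head Q, K, V
        out, _ = scaled_dot_product_attention(q, k, v)
        head_outputs.append(out)
    concatenated = np.concatenate(head_outputs, axis=-1)  # step 6
    return concatenated @ w_o                             # step 7

rng = np.random.default_rng(0)
n_tokens, n_dim, n_heads = 4, 8, 2
d_head = n_dim // n_heads  # assumed head width
x = rng.normal(size=(n_tokens, n_dim))
w_q = [rng.normal(size=(n_dim, d_head)) for _ in range(n_heads)]
w_k = [rng.normal(size=(n_dim, d_head)) for _ in range(n_heads)]
w_v = [rng.normal(size=(n_dim, d_head)) for _ in range(n_heads)]
w_o = rng.normal(size=(n_heads * d_head, n_dim))

output = multi_head_attention(x, w_q, w_k, w_v, w_o)
print(output.shape)  # (4, 8)
```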

#### Add & Norm

The output of the multihead attention is added to its input (the residual connection) and then undergoes layer normalization, applied to each token row, to stabilise activations and gradients.

#### Feedforward Network

The feedforward network consists of two fully connected linear layers with a ReLU activation in between. 

It can be mathematically expressed as: 

$$ \mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2 $$

The FFN processes the input from the attention module in the following way: 

1. **Linear Layer 1**: The input from the attention module, $x$ of dimension $N_l \times N_{dim}$, is projected by the first linear layer into a higher-dimensional space of size $N_{FFN} > N_{dim}$, so each token (i.e. each row) is mapped to a longer vector.
2. **ReLU Activation**: The output of the first linear layer undergoes ReLU activation, which takes the element-wise maximum of 0 and the output. This introduces non-linearity and lets the model emphasise some features of the input while suppressing others.
3. **Linear Layer 2**: The second linear layer is used to project the output back into the $N_{dim}$ space for processing by further layers.

One important thing to remember is that each token position undergoes the exact same transformation independently of the others. The FFN can thus be said to extract the same type of information from each input position in a layer, unlike the attention layer, where information is passed between positions.
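A small NumPy sketch of the feedforward network, using random weights and assumed sizes (`n_dim = 8`, `n_ffn = 32`), illustrates the position-wise property: passing a single token through the network gives the same result as passing the whole sequence and picking out that token's row.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    hidden = np.maximum(0, x @ w1 + b1)  # linear layer 1 followed by ReLU
    return hidden @ w2 + b2              # linear layer 2 back to n_dim

rng = np.random.default_rng(0)
n_tokens, n_dim, n_ffn = 4, 8, 32        # sizes assumed for illustration
x = rng.normal(size=(n_tokens, n_dim))
w1, b1 = rng.normal(size=(n_dim, n_ffn)), np.zeros(n_ffn)
w2, b2 = rng.normal(size=(n_ffn, n_dim)), np.zeros(n_dim)

out_all = feed_forward(x, w1, b1, w2, b2)          # all tokens at once
out_single = feed_forward(x[1:2], w1, b1, w2, b2)  # only the second token

# Each row is transformed independently of the other rows.
print(np.allclose(out_all[1:2], out_single))  # True
```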

#### Add & Norm

The output of the feedforward network is again added to its input (the residual connection) and then undergoes layer normalization for each token row to stabilise activations and gradients.

### Decoder Block

The individual elements of a decoder block are similar to those found in the encoder, but the weights of the decoder are optimized for one purpose: to predict the next token.

The output is constructed iteratively with the following steps: Masked Multi Head Attention + Add & Norm + Cross Attention + Add & Norm + Feedforward + Add & Norm

#### Masked Multihead Attention

The masked multihead attention differs from general attention mainly by the addition of a look-ahead mask on the scaled score matrix, which prevents each token from attending to any of the tokens that follow it.

Once we have the scaled $QK^T$ matrix, the mask matrix, with zeros for each position's current and previous tokens and $-\infty$ for the rest, is added to the scaled scores. After the softmax, the attention weights for the following tokens are therefore exactly zero. Then the usual $V$ multiplication, concatenation across heads and linear layer processing happen.

<img src="images/attmask.png" 
        alt="Picture" 
        width="500" 
        height="500" 
        style="display: block; margin: 0 auto" />

Image Source: [Datacamp Blog](https://www.datacamp.com/tutorial/how-transformers-work)
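The effect of the look-ahead mask can be sketched for a single head with random queries and keys. The helper names `causal_mask` and `softmax` are illustrative, not library functions.

```python
import numpy as np

def softmax(x, axis=-1):
    exp = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp / np.sum(exp, axis=axis, keepdims=True)

def causal_mask(n_tokens):
    # True strictly above the diagonal, i.e. for future positions.
    future = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
    # 0 for current and previous tokens, -inf for future tokens.
    return np.where(future, -np.inf, 0.0)

rng = np.random.default_rng(0)
n_tokens, d_k = 4, 8
q = rng.normal(size=(n_tokens, d_k))
k = rng.normal(size=(n_tokens, d_k))

scores = q @ k.T / np.sqrt(d_k)                  # scaled scores
masked_scores = scores + causal_mask(n_tokens)   # add the look-ahead mask
weights = softmax(masked_scores)                 # future positions get weight 0

print(np.round(weights, 2))
```

The printed matrix is lower triangular: the first token can only attend to itself, the second to the first two tokens, and so on.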


#### Add & Norm

The output of the masked multihead attention is added to the residual and then undergoes layer normalization for each token row to stabilise activations and gradients. 

####  Cross Attention

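This multihead attention module functions exactly as the one in the encoder layer, with the only difference being where its inputs come from: the queries $Q$ are computed from the output of the previous decoder sub-layer, while the keys $K$ and values $V$ are computed from the output of the encoder block.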
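In the original architecture (Vaswani et al., 2017), the queries in cross attention come from the decoder while the keys and values come from the encoder output. A single-head sketch with random weights and assumed sequence lengths (5 source tokens, 3 target tokens) shows the resulting shapes:

```python
import numpy as np

def softmax(x, axis=-1):
    exp = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp / np.sum(exp, axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_output, w_q, w_k, w_v):
    q = decoder_states @ w_q   # queries from the decoder
    k = encoder_output @ w_k   # keys from the encoder output
    v = encoder_output @ w_v   # values from the encoder output
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)
src_len, tgt_len, n_dim = 5, 3, 8   # sizes assumed for illustration
encoder_output = rng.normal(size=(src_len, n_dim))
decoder_states = rng.normal(size=(tgt_len, n_dim))
w_q, w_k, w_v = (rng.normal(size=(n_dim, n_dim)) for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_output, w_q, w_k, w_v)
print(weights.shape)  # (3, 5): one row per target token, one column per source token
print(context.shape)  # (3, 8)
```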

#### Add & Norm

The output of the cross attention is added to the residual and then undergoes layer normalization for each token row to stabilise activations and gradients.

#### Feedforward Network

The output of the cross attention is passed through an FFN similar to the one found in the encoder block.

#### Add & Norm

The output of the feedforward network once again undergoes addition to residual followed by normalization. 

### Linear Layer

The final linear layer of the model acts as a classifier which produces a score for each token in the model vocabulary.

It takes the output matrix of the decoder ($N_l \times N_{dim}$) and multiplies it with a trained weight matrix ($N_{dim} \times N_v$) to generate a matrix of dimensions $N_l \times N_v$, where each row holds the logit scores of that token position for all $N_v$ items in the model vocabulary.

### Softmax

The softmax converts the scores from the linear classifier into a probability distribution over the vocabulary.

The softmax is a differentiable function, so its output $p_i = s(x_i)$ can be used to calculate the cross entropy loss $-\sum_{i=1}^n y_i \log(p_i)$ for each token, where $y$ is the one-hot encoding of the correct token. This loss is propagated backward through the network during the training phase.


$$s(x_i)= \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}$$


<img src="images/llineartrans.png" 
        alt="Picture" 
        width="500" 
        height="500" 
        style="display: block; margin: 0 auto" />

Image Source: [Datacamp Blog](https://www.datacamp.com/tutorial/how-transformers-work)
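The final linear layer and softmax can be sketched together in NumPy. The sizes (3 token positions, model dimension 8, vocabulary of 10 items) and the random weights are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row maximum avoids overflow in exp.
    exp = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp / np.sum(exp, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, n_dim, n_vocab = 3, 8, 10
decoder_output = rng.normal(size=(n_tokens, n_dim))  # stand-in for real decoder output
w_vocab = rng.normal(size=(n_dim, n_vocab))          # stand-in for trained weights

logits = decoder_output @ w_vocab   # (n_tokens, n_vocab) scores
probs = softmax(logits)             # each row is a distribution over the vocabulary

print(probs.shape)         # (3, 10)
print(probs.sum(axis=-1))  # each row sums to 1 (up to rounding)
```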

## Training

The training of a typical transformer involves the following steps:


1. **Forward Pass of Encoder**: The forward pass of the encoder takes in the entire source sequence and generates an encoder output, which serves as input to the cross-attention of each decoder layer.
2. **Forward Pass of Decoder**: The forward pass of the decoder takes in the target sequence shifted right by one position, together with the encoder output, to generate an output matrix. During training all positions are processed in parallel, and the look-ahead mask ensures that each position only sees the tokens before it.
3. **Classification**: The decoder output is used to generate a score vector over the model vocabulary.
4. **Softmax**: The scores are passed through a softmax to obtain a probability vector for each position in the sequence.
5. **Ground Truth Encoding**: For each token in the sequence, an equivalent one-hot encoding is generated, where the only 1 is at the index of the correct next token.

<img src="images/onehot.png" 
        alt="Picture" 
        width="500" 
        height="500" 
        style="display: block; margin: 0 auto" />

Image Source: [Jay Alammar's Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)

6. **Loss Calculation**: The loss between these two vectors is calculated in terms of encoding cost (Cross-Entropy Loss) or divergence (KL Divergence Loss) from one probability distribution to another. The cross entropy loss is most commonly used: it differs from the KL divergence only by the entropy of $P$, which is constant with respect to the model parameters, so minimizing either leads to the same result.

$$ H(P,Q)= -\sum_{i=1}^{n} P(x_i)\log_2 Q(x_i) $$

7. **Backpropagation**: The gradients of the loss are calculated for all weights in the network, and the weights are updated accordingly.
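Steps 5 and 6 can be sketched with NumPy for two token positions and a hypothetical three-item vocabulary. The function uses base-2 logarithms to match the formula above; deep learning frameworks usually use the natural logarithm, which only changes the loss by a constant factor.

```python
import numpy as np

def cross_entropy(p_true, q_pred, eps=1e-12):
    """H(P, Q) = -sum_i P(x_i) * log2 Q(x_i), computed for each row."""
    return -np.sum(p_true * np.log2(q_pred + eps), axis=-1)

# Softmax outputs for two positions (made-up values for illustration).
q_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
target_ids = np.array([0, 1])  # index of the correct next token per position

# One-hot ground truth: row i has a single 1 at column target_ids[i].
p_true = np.eye(q_pred.shape[-1])[target_ids]

loss_per_token = cross_entropy(p_true, q_pred)

# With one-hot targets the loss reduces to -log2 of the probability
# assigned to the correct token.
print(np.allclose(loss_per_token, -np.log2([0.7, 0.8])))  # True
print(loss_per_token.mean())  # average loss over the sequence
```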

## Inference

To use transformers in real world scenarios, the softmax vector is used to select an output token from the vocabulary; this is known as decoding.

There are various decoding methods for LLMs: 

1. **Greedy Decoding**: The token with the highest probability from the softmax is chosen at each step.
2. **Beam Search**: Given a hyperparameter $k$, the algorithm keeps track of the $k$ most probable partial sequences at each step, extends each of them with candidate next tokens, and finally chooses the sequence with the highest overall probability.
3. **Nucleus Sampling**: Given a hyperparameter $p$, the algorithm samples the next token from the smallest set of most probable tokens whose cumulative probability reaches $p$.
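Greedy decoding and nucleus sampling for a single step can be sketched as follows, assuming a made-up probability vector over a four-item vocabulary. The function names are illustrative and not taken from any library.

```python
import numpy as np

def greedy_decode(probs):
    # Pick the single most probable token.
    return int(np.argmax(probs))

def nucleus_sample(probs, p, rng):
    order = np.argsort(probs)[::-1]           # token ids, most probable first
    sorted_probs = probs[order]
    cumulative = np.cumsum(sorted_probs)
    # Smallest prefix whose cumulative probability reaches p.
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    nucleus = order[:cutoff]
    # Renormalize inside the nucleus before sampling.
    nucleus_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.15, 0.05])  # made-up softmax output

print(greedy_decode(probs))  # 0
# With p = 0.7 the nucleus is {0, 1}, since 0.5 + 0.3 is the first sum to reach 0.7.
samples = [nucleus_sample(probs, 0.7, rng) for _ in range(20)]
print(set(samples) <= {0, 1})  # True
```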

## Features of Transformers

There are two main architectural novelties in the transformer: 

1. **Parallelization**: Unlike RNNs, the tokens in a sequence do not have to be processed one by one; all tokens flow through the transformer in parallel.
2. **Attention Mechanism**: The parallelization is made possible in part by the attention mechanism, which plays the role of the hidden state of RNNs, allowing transformers to keep track of the other tokens in the sequence (earlier or later).

# Types of Transformers

Since their introduction, several modifications have been made to adapt transformers for different tasks. The primary variants of this architecture include: 

- **Sequence-to-Sequence Transformers**: These models use the original transformer architecture with both an encoder and a decoder for sequence transduction tasks. Examples: BART, T5.
- **Autoencoder Transformers**: These models use only the encoder blocks of a transformer to build representations for understanding text. Based on their training on text with masked words, they are also known as Masked Language Models (MLMs).
- **Autoregressive Transformers**: These models use only the decoder blocks of a transformer for generating text. Based on their training with next token prediction, they are also known as Causal Language Models (CLMs). 

## Sequence-to-Sequence Transformers

These models create input embeddings (token embeddings + positional encodings) for each token in a sequence and then pass them through encoder blocks (attention + MLP) followed by decoder blocks (masked attention + cross attention + MLP).

- They are trained on sequence-to-sequence tasks such as translation.
- They are good at understanding text via their bidirectional attention mechanism and also at generating text.
- The output representations of the encoder can be used for downstream tasks like token classification, sequence classification, feature analysis etc.
- The output of the decoder can be used for generating text.

## Autoencoder Transformers

The encoder models create input embeddings (token embeddings + positional encodings) for each token in a sequence and then pass them through encoder blocks (attention + MLP).

- They are trained with masked language modelling.
- They are good at understanding text via their bidirectional attention mechanism.
- The output representations can be used for downstream tasks like token classification, sequence classification, feature analysis etc. 

## Autoregressive Transformers

The decoder models create input embeddings (token embeddings + positional encodings) for each token in a sequence and then pass them through decoder blocks (masked attention + MLP).

- They are trained with causal language modelling.
- They are good at generating probability distributions over model vocabulary via their masked attention mechanism.
- The output representations undergo decoding to generate tokens iteratively. 

# Implementing Transformers 

One of the most common ways to implement transformers is by building them with automatic differentiation libraries like PyTorch. To illustrate a single component, the cell below implements layer normalization in plain NumPy.

```python
# Layer normalization implemented with NumPy
import numpy as np


class LayerNorm:
    def __init__(self, epsilon=1e-5):
        # Small constant that avoids division by zero
        self.epsilon = epsilon

    def forward(self, x):
        # Normalize each row (token) over its last dimension
        mean = np.mean(x, axis=-1, keepdims=True)
        variance = np.var(x, axis=-1, keepdims=True)
        x_normalized = (x - mean) / np.sqrt(variance + self.epsilon)
        return x_normalized


# Toy input with 3 tokens of dimension 4
x = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]])

# Instantiate LayerNorm and normalize x
layer_norm = LayerNorm()
output = layer_norm.forward(x)

print("Normalized Output:\n", output)
print(output.shape)
```

```
Normalized Output:
 [[-1.34163542 -0.44721181  0.44721181  1.34163542]
 [-1.34163542 -0.44721181  0.44721181  1.34163542]
 [-1.34163542 -0.44721181  0.44721181  1.34163542]]
(3, 4)
```


# Hugging Face Transformers

Implementing and training transformers from scratch is not always feasible, so we can use pretrained transformer models from Hugging Face.

The Hugging Face ecosystem is geared toward NLP and consists of 4 main libraries:

- [Datasets](https://github.com/huggingface/datasets)
- [Tokenizers](https://github.com/huggingface/tokenizers)
- [Transformers](https://github.com/huggingface/transformers)
- [Accelerate](https://github.com/huggingface/accelerate)

## What is NLP?

NLP is a subfield of linguistics and machine learning which is focused on creating models that process, understand and generate natural language in a way that reflects human language abilities.

Some common NLP tasks are:

- Classification: It involves taking elements from a document and putting them into pre-specified categories.
- Extraction: It involves isolating and returning the specific parts of a document or collection of documents that are relevant to a query.
- Summarization: It involves taking a document or set of documents and converting it into a shorter version while preserving the key information.
- Generation: It involves creating more text related to a given input.

While the most common work involves written text, NLP also finds applications in speech and vision related tasks. 

## Transformers Library

The Hugging Face Hub hosts a large collection of pretrained transformers which can be used with the help of the `transformers` library, which provides functionality for creating and using the models. The library uses a single API through which any model can be loaded, trained and saved. Some important features of the library are:

- Each model is a simple PyTorch `torch.nn.Module` class or a TensorFlow `tf.keras.Model` class.
- Each model architecture has its forward pass defined in its own modeling file, which is not shared across models. The accompanying configuration stores the architecture hyperparameters and does not include weights.
- The configuration (a subclass of `transformers.PretrainedConfig`) can be edited separately for each model to allow us more flexibility.


There are three main ways of using the models:

- **Pipelines**: A pipeline is an end-to-end function that performs a pre-specified NLP task on one or more texts. The pipeline tasks available include `zero-shot-classification`, `sentiment-analysis`, `ner`, `summarization`, `text-generation`, `translation`, `question-answering`, `fill-mask`, `feature-extraction`, and many more. The total tasks stand at 17 currently.
- **Checkpoints**: It is a way of loading the predefined architecture and weights of a transformer and then defining the pre and post processing pipelines ourselves for better customisation.
- **Architectures**: It is a way of loading the predefined architecture of a transformer from a model config and then loading weights separately before defining the pre and post processing pipelines ourselves for better customisation.

For example, BERT is an architecture while bert-base-cased, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say “the BERT model” and “the bert-base-cased model.”

### Using Pipelines

```python
from transformers import pipeline

# Pipeline with a task only (a default model is selected)
classifier = pipeline("sentiment-analysis")
classifier("I'm doing really really good")

# Pipeline with a model only (the task is inferred from the model)
pipe = pipeline(model="FacebookAI/roberta-large-mnli")
pipe("This restaurant is awesome")

# Pipeline with both a task and a model
unmasker = pipeline("fill-mask", model="bert-base-cased")
male_jobs = unmasker("The man works as a [MASK].")
print([dic["token_str"] for dic in male_jobs])
female_jobs = unmasker("The woman works as a [MASK].")
print([dic["token_str"] for dic in female_jobs])
```

```
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Some weights of the model checkpoint at FacebookAI/roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a mode

['waiter', 'carpenter', 'lawyer', 'mechanic', 'doctor']
['nurse', 'maid', 'waitress', 'cook', 'teacher']
```


### Using Checkpoints

Once we go beyond pipelines and use checkpoints, we use the `AutoTokenizer` and `AutoModel` classes, which are wrappers over the tokenizers and models in the library for easy initialization of tokenizer/model objects. If we use checkpoints, we need to do 3 important steps ourselves:

- Preprocessing with `AutoTokenizer`
- Inference with `AutoModel`
- Postprocessing to get output.

#### Preprocessing 

The preprocessing for inputs is done as follows: 

- Import AutoTokenizer
- Initialize AutoTokenizer object from a pretrained model checkpoint
- Create input list
- Call the tokenizer object with inputs, padding, truncation and output tensor type. Padding indicates whether shorter sequences should be padded to match the longest sequence in the batch, and truncation indicates whether inputs longer than the maximum sequence length allowed by the model should be cut off at that length.
- The output is a dictionary with two keys: `input_ids`, a tensor of dimension $seqnum \times seqlen$ holding the token ids, and `attention_mask`, a tensor of the same dimension where 1 marks a real token and 0 marks a padding token that should be ignored.

```python
# Preprocessing
from transformers import AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
# Tokenizer configuration from a specific model checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
raw_inputs = [  # pass in a list of sentences
    "I love the intro",
    "I hated the game",
]
tokenized_inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(tokenized_inputs)  # outputs token id and mask tensors of input
```

```
{'input_ids': tensor([[  101,  1045,  2293,  1996, 17174,   102],
        [  101,  1045,  6283,  1996,  2208,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1]])}
```


#### Inference with AutoModel Only

The model inference with tokenized inputs is done as follows: 

- Import AutoModel
- Initialize AutoModel from a pretrained model checkpoint
- Call model object with tokenized inputs
- The output is a tensor of dimensions $seqnum \times seqlen \times \text{model dimension}$

```python
# Inference with AutoModel
from transformers import AutoModel

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(model_name)
outputs = model(**tokenized_inputs)
print(outputs)
print(outputs.last_hidden_state.shape)
```

```
BaseModelOutput(last_hidden_state=tensor([[[ 0.5472,  0.0722,  0.0797,  ...,  0.5431,  1.1225, -0.3735],
         [ 0.7301,  0.1646,  0.1477,  ...,  0.4137,  1.1830, -0.2681],
         [ 0.9611,  0.3415,  0.3466,  ...,  0.3894,  1.0995, -0.3300],
         [ 0.5608, -0.0158,  0.1010,  ...,  0.5690,  1.2197, -0.3903],
         [ 0.4259,  0.2286,  0.2534,  ...,  0.5037,  0.9604, -0.3332],
         [ 1.0663,  0.1271,  0.6909,  ...,  0.6291,  0.7586, -0.8868]],

        [[-0.3951,  0.7698, -0.2996,  ...,  0.0231, -0.8558, -0.0593],
         [-0.4719,  0.8989, -0.2922,  ..., -0.2249, -0.5333,  0.1423],
         [-0.2880,  0.8173, -0.2044,  ..., -0.1231, -0.7169, -0.0454],
         [-0.5774,  0.6085, -0.3825,  ..., -0.4900, -0.7831,  0.1544],
         [-0.4643,  0.4916, -0.4042,  ..., -0.7778, -0.6882,  0.4220],
         [ 0.0956,  0.5564, -0.3878,  ..., -0.2513, -0.6992, -0.1342]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)
torch.Size([2, 6, 768])
```


#### Inference with AutoModel + Head

The model output with a head gives the logit tensor of dimension $seqnum \times classes$, and also the loss when labels are passed to the model. The output can be used directly for analysis or it can be decoded to get natural language output.

Instead of `AutoModel`, we can use `AutoModel*`, where `*` can be any of `ForTokenClassification`, `ForSequenceClassification`, `ForQuestionAnswering`, `ForMaskedLM`, `ForCausalLM`, etc.

```python
# Inference with AutoModel + head
from transformers import AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
outputs = model(**tokenized_inputs)
print(outputs)
# Each row indicates a sequence and each column the logit score for a class
print(outputs.logits.shape)
```

```
SequenceClassifierOutput(loss=None, logits=tensor([[-4.2865,  4.6021],
        [ 4.4335, -3.6342]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
torch.Size([2, 2])
```


#### Postprocessing

The logit outputs need to be converted into probability scores via softmax to be turned into predictions. The predictions can be further decoded into a class or a token as needed. 

```python
# Postprocessing
from torch import nn

# Softmax across the last dimension turns logits into class probabilities
predictions = nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
print(model.config.id2label)
```

```
tensor([[1.3793e-04, 9.9986e-01],
        [9.9969e-01, 3.1341e-04]], grad_fn=<SoftmaxBackward0>)
{0: 'NEGATIVE', 1: 'POSITIVE'}
```


### Using Architectures

Using architectures gives us the most control over the model. Unlike using checkpoints, here we use the `Model` and `Config` classes to load architectures (i.e. the defined forward pass), and then we obtain weights (via training or loading) before we go on to use the model.

The steps typically include:

- The `Config` file gives two important things: the configuration class to be used for the model and the model class to be used.
- The model configuration is instantiated; it is equivalent to a model blueprint.
- The model config and the model class are combined to initialize a model, which at this point has randomly initialized weights.
- The weights are then either trained from scratch or loaded from a saved model file to obtain a pretrained model. A matching tokenizer is initialized separately.
- The next steps are similar to using checkpoints.
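As a minimal sketch of these steps, assuming BERT as the architecture and using the `BertConfig` and `BertModel` classes from the `transformers` library, a model can be built from a configuration alone. The hyperparameter values below are deliberately tiny and chosen only for illustration.

```python
from transformers import BertConfig, BertModel

# The configuration acts as the blueprint of the architecture.
# These small values are illustrative, not those of a released checkpoint.
config = BertConfig(
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
)

# Combining the config with the model class builds the architecture
# with randomly initialized weights.
model = BertModel(config)
print(model.config.num_hidden_layers)  # 2

# Loading a checkpoint instead pairs the architecture with trained weights:
# model = BertModel.from_pretrained("bert-base-cased")
```

A model built this way produces meaningless outputs until it is trained or until pretrained weights are loaded into it.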