# Chapter 2 - Transformer Architecture


## Introduction

At the heart of most large language models lies the Transformer. Proposed in the seminal paper *Attention Is All You Need* {cite:p}`vaswani2023attentionneed`, it has become the backbone of modern LLMs. The original Transformer described in the paper is a neural network consisting of the following modules:

1. Encoder Module
    This is made of several self-attention heads followed by fully connected layers.
    
2. Decoder Module
    Similar to the encoder module, with attention heads followed by fully connected layers. In addition to its own input, this module also takes in the embedding output of the encoder module.
    
 
 ```{figure} ../../images/chapter2/Transformer.png
---
height: 150px
name: Transformer Architecture
---
Transformer Architecture
```
The Transformer architecture is a deep neural network originally trained for language translation tasks. The *Attention Is All You Need* paper gives examples of translating from English to German and from English to French. You can think of it as two neural networks, an encoder and a decoder, connected in cascade to perform the translation task.

The encoder converts the whole English input text into an embedded vector representation. The decoder uses this representation and produces the translation one word at a time. Three types of Transformers evolved quickly after this paper was published.

1. Encoder-decoder Transformers
    They include both an encoder and a decoder and suit tasks that involve both understanding and generating data, such as translation. Examples include T5 and BART. Their pre-training is task dependent. They first encode an input sequence into an internal representation and then decode this representation into an output sequence.

2. Encoder-only Transformers
    Models like BERT are encoder-only Transformers. Their task involves only understanding. They are trained to do masked word prediction: given a sentence, one or more of its words are masked and the model learns to predict them. Encoder-only models are heavily used in classification tasks.

3. Decoder-only Transformers
    In this book we will be looking at decoder-only models. Decoder-only models are trained to perform text generation by predicting the next token from the tokens before it. These models are also called autoregressive models.
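The autoregressive behaviour of a decoder-only model can be sketched with a toy example. The lookup table below is a made-up stand-in for a trained model, and unlike a real decoder it looks only at the last token rather than the whole prefix; the names `NEXT_TOKEN_SCORES`, `toy_next_token` and `generate` are hypothetical. What carries over to a real model is the loop: predict one token, append it, and feed the longer sequence back in.

```python
# Toy stand-in for a trained decoder: maps the last token to next-token scores.
# The table is invented purely for illustration.
NEXT_TOKEN_SCORES = {
    "<bos>": {"the": 0.9, "river": 0.1},
    "the": {"river": 0.7, "bank": 0.3},
    "river": {"bank": 0.8, "<eos>": 0.2},
    "bank": {"<eos>": 1.0},
}


def toy_next_token(tokens):
    """Return the highest-scoring next token given the tokens so far."""
    scores = NEXT_TOKEN_SCORES[tokens[-1]]
    return max(scores, key=scores.get)


def generate(prompt, max_new_tokens=10):
    """Greedy autoregressive decoding: each new token is appended to the
    sequence and fed back in to predict the one after it."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = toy_next_token(tokens)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens


generated = generate(["<bos>"])
print(generated)
```

Generation stops either at the end-of-sequence token or after `max_new_tokens` steps, whichever comes first.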
    

The secret sauce in the encoder/decoder architecture is the multi-head attention module. It is this attention mechanism that gives the model better access to the contextual information and long-term dependencies present in the input text. Let us look at the following two examples:

1. The river bank is not accessible to the tourist.
2. The robbers had planned a heist of major banks in the city.


Both example sentences contain the word "bank". The context in the first sentence is "river" and in the second it is "heist". We want the model to treat the word "bank" according to its context in the sentence. Another example:

1. I went to the library yesterday, and I forgot my book there. I am returning there today.

As an English reader, if I ask you the question "Where am I returning?", you know that I am returning to the library. The previous sentence has the clue. This is a trivial example of a long-term dependency.


This is where the attention mechanism comes to the rescue. Using attention, we inject contextual information and long-term dependencies into the model. We saw word embeddings in the previous chapter. Each word or token had a unique position in the embedding space. However, as the previous examples show, these embeddings need to account for the contextual information in the input text. The attention mechanism adds this contextual information to the vector representation of the token. More about attention in the subsequent sections.
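To make the "bank" example concrete, here is a minimal sketch of how mixing a token's embedding with its neighbours changes its representation. It assumes a hypothetical three-word vocabulary with random (untrained) embeddings and uses plain dot-product weights without the learned query, key and value projections introduced later, so it illustrates the idea rather than a full attention layer.

```python
import math
import torch

torch.manual_seed(0)

# Hypothetical static embeddings: one fixed vector per word.
vocab = {"river": 0, "heist": 1, "bank": 2}
d_model = 8
embedding = torch.nn.Embedding(len(vocab), d_model)


def contextualize(words):
    """Mix each token's embedding with its neighbours' embeddings using
    softmax(x x^T / sqrt(d)) weights (no learned projections)."""
    ids = torch.tensor([vocab[w] for w in words])
    x = embedding(ids)                               # (seq_len, d_model)
    scores = x @ x.transpose(-2, -1) / math.sqrt(d_model)
    weights = torch.softmax(scores, dim=-1)          # each row sums to 1
    return weights @ x                               # (seq_len, d_model)


with torch.no_grad():
    static_bank = embedding(torch.tensor(vocab["bank"]))
    bank_near_river = contextualize(["river", "bank"])[1]
    bank_near_heist = contextualize(["heist", "bank"])[1]

# The static vector is the same in both sentences; the contextual ones are not.
print(torch.allclose(bank_near_river, bank_near_heist))
```

The static embedding of "bank" is identical in both inputs, while the two contextualized vectors differ because each one is a weighted average that includes a different neighbour.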





## Why LLMs are decoder Transformers

Why are most LLMs decoder-only models and not encoder-decoder models?

According to {cite}`wang2022language`, decoder-only models trained on a next-word prediction objective exhibit the strongest zero-shot generalization after purely self-supervised pretraining.

The same study reports that models trained with a masked language modeling objective followed by multitask finetuning performed best among its experiments.

* Factors to consider when choosing between decoder-only and encoder-decoder models:

  1. Cost of training

      Decoder-only models are cheaper to train. They can be trained on a large unlabelled corpus, whereas encoder-decoder models need a lot of labelled data for multitask finetuning.

  2. Emergent abilities

      Emergent abilities are a phenomenon where models display new, sophisticated capabilities that were not explicitly taught during training and that arise as the model scales in size and complexity. See:

      - https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1
      - https://yaofu.notion.site/A-Closer-Look-at-Large-Language-Models-Emergent-Abilities-493876b55df5479d80686f68a1abd72f

      Emergent abilities reduce the performance gap that encoder-decoder models achieve over decoder-only ones with multitask finetuning.

  3. In-context learning from prompts

      Prompt engineering methods provide few-shot examples to help the LLM understand the context or task. In-context information can be seen as having an effect similar to gradient descent that updates the attention weights of the zero-shot prompt.

  4. Efficiency optimization

      In decoder-only models, the key and value matrices computed for previous tokens can be reused for subsequent tokens during decoding. Since each position attends only to previous tokens, the K and V matrices for those tokens remain unchanged. This caching mechanism improves efficiency by avoiding the recomputation of K and V for tokens that have already been processed, which makes generation faster and lowers the computation cost during inference.
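The key/value caching idea can be sketched as follows. This is a minimal single-head illustration under stated assumptions: the projection matrices `Wq`, `Wk` and `Wv` are random stand-ins for trained weights, and `k_cache` and `v_cache` are hypothetical names. The sketch decodes one token at a time, appending only the newest key and value rows, and compares the result with recomputing causal attention over the full sequence.

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_model, d_head = 5, 16, 8

# Hypothetical projection matrices standing in for one trained attention head.
Wq = torch.randn(d_model, d_head) / math.sqrt(d_model)
Wk = torch.randn(d_model, d_head) / math.sqrt(d_model)
Wv = torch.randn(d_model, d_head) / math.sqrt(d_model)
x = torch.randn(seq_len, d_model)  # one embedding per token


def full_causal_attention(x):
    """Recompute Q, K and V for every token, then apply a causal mask."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / math.sqrt(d_head)
    n = x.shape[0]
    future = torch.triu(torch.ones(n, n), diagonal=1).bool()
    weights = torch.softmax(scores.masked_fill(future, float("-inf")), dim=-1)
    return weights @ v


# Incremental decoding: project only the newest token and append its
# key and value rows to the cache.
k_cache = torch.empty(0, d_head)
v_cache = torch.empty(0, d_head)
outputs = []
for t in range(seq_len):
    x_t = x[t : t + 1]                          # (1, d_model)
    q_t = x_t @ Wq
    k_cache = torch.cat([k_cache, x_t @ Wk])    # (t + 1, d_head)
    v_cache = torch.cat([v_cache, x_t @ Wv])
    weights = torch.softmax(q_t @ k_cache.T / math.sqrt(d_head), dim=-1)
    outputs.append(weights @ v_cache)

cached_output = torch.cat(outputs)
reference_output = full_causal_attention(x)
print(torch.allclose(cached_output, reference_output, atol=1e-5))
```

No mask is needed in the cached loop because the cache never contains future tokens. Each step projects a single token instead of the whole prefix, at the price of keeping one key row and one value row per processed token in memory.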
      





(reference:preprocessing)=
```{figure} ../images/chapter2/GPT2.drawio.png
---
height: 150px
name: preprocessing
---
GPT-2 Architecture
```

## Scaled dot product self-attention


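Scaled dot-product self-attention scores every pair of tokens in the input with a dot product between a query and a key. Token pairs that are highly correlated receive higher weights.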
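For reference, the weighting can be written compactly in the notation of the original Transformer paper. Here Q, K and V are the query, key and value projections of the input, and `d_k` is the dimension of each key vector (the role played by `d_head` in this chapter's code):

```{math}
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

Each row of the softmax output sums to 1, so every output vector is a weighted average of the value vectors. Dividing by the square root of `d_k` keeps the dot products from growing with the vector dimension, which would otherwise push the softmax towards a near one-hot distribution.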

In [21]:
from dataclasses import dataclass
import torch
import torch.nn as nn
import math

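@dataclass
class SLLMConfig:
    # Embedding dimension
    d_model: int = 512
    # Query, key and value projection dimension (d_model // n_heads)
    d_head: int = 128
    # Bias for the query, key and value projection matrices
    bias: bool = False
    # Dropout probability
    dropout: float = 0.0
    # Number of input tokens
    context_window: int = 32
    # Number of attention heads
    n_heads: int = 4
    # Number of transformer blocks (assumed, illustrative value)
    n_layers: int = 2
    # Vocabulary size (assumed; 50257 matches the GPT-2 tokenizer)
    vocab_size: int = 50257


config = SLLMConfig()
assert config.d_model % config.n_heads == 0
# The concatenated heads must span the full embedding dimension
assert config.n_heads * config.d_head == config.d_model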

    
class SingleHeadAttention(nn.Module):
    """
    Implements weighted self attention
    """
    def __init__(self, config):

        super().__init__()
        self.Wq =  nn.Linear(config.d_model, config.d_head, bias=config.bias)
        self.Wk =  nn.Linear(config.d_model, config.d_head, bias=config.bias)
        self.Wv =  nn.Linear(config.d_model, config.d_head, bias=config.bias)

        self.attn_drop = nn.Dropout(config.dropout)
        self.__init_weights()

    def __init_weights(self):

        nn.init.xavier_uniform_(self.Wq.weight)
        nn.init.xavier_uniform_(self.Wk.weight)
        nn.init.xavier_uniform_(self.Wv.weight)

    def forward(self, x, mask=None):

        q = self.Wq(x)
        k = self.Wk(x)
        v = self.Wv(x)

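        # Raw attention scores: (..., seq_len, seq_len)
        attn_score = q @ k.transpose(-2, -1)

        if mask is None:
            # Causal mask: 1s above the diagonal mark the future tokens to hide
            seq_len = x.shape[-2]
            mask = torch.triu(torch.ones(seq_len, seq_len, device=x.device), diagonal=1)

        masked = attn_score.masked_fill(mask.bool(), -torch.inf)
        # Scale by sqrt(d_head) and normalise over the key axis (the last one),
        # which also works when x carries a leading batch dimension
        attn_weights = torch.softmax(masked / math.sqrt(k.shape[-1]), dim=-1)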
        attn_weights = self.attn_drop(attn_weights)

        context_vector = attn_weights @ v

        return context_vector

In [22]:
x = torch.normal(0.5,0.3,size=(config.context_window, config.d_model))
print(x.shape)

sha = SingleHeadAttention(config)
attn = sha.forward(x)
print(attn.shape)
print(attn[0,0:3])

torch.Size([32, 512])
torch.Size([32, 128])
tensor([-0.0328, -1.4182,  1.4332], grad_fn=<SliceBackward0>)


### Flash Attention

PyTorch's `scaled_dot_product_attention` can dispatch to a memory-optimized implementation of self-attention called FlashAttention when the hardware and inputs support it.

In [27]:
from torch.nn.functional import scaled_dot_product_attention

q = sha.Wq(x)
k = sha.Wk(x)
v = sha.Wv(x)


# scale multiplies q @ k^T, so it must be 1 / sqrt(d_head) (which is also the default)
context_vector = scaled_dot_product_attention(q, k, v, attn_mask=None,
                                              dropout_p=0.0,
                                              is_causal=True,
                                              scale=1.0 / math.sqrt(k.shape[-1]))

print(context_vector[0,0:3])

tensor([-0.0328, -1.4182,  1.4332], grad_fn=<SliceBackward0>)
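As a sanity check, a hand-written causal attention can be compared against the fused function over all positions. The sketch below is self-contained and uses random tensors with a leading batch dimension of 1 (an assumption made for illustration); it relies on the default scaling of `scaled_dot_product_attention`, which is `1 / sqrt(d_head)`.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_head = 6, 16
q = torch.randn(1, seq_len, d_head)
k = torch.randn(1, seq_len, d_head)
v = torch.randn(1, seq_len, d_head)

# Manual causal attention: scale, hide future positions, softmax, mix values.
scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)
future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
weights = torch.softmax(scores.masked_fill(future, float("-inf")), dim=-1)
manual = weights @ v

# Fused implementation; with no explicit scale it uses 1 / sqrt(d_head).
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(manual, fused, atol=1e-5))
```

Under a causal mask the first token attends only to itself, so the first output row equals the first value vector whatever the scale is. Comparing every row, as done here, is therefore a stronger check than comparing the first row alone.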




In [1]:
class MultiHeadAttention(nn.Module):
    """
    Multihead Attention Implementation
    """
    def __init__(self, config):
        
        super().__init__()
        self.heads = nn.ModuleList(
            [
                SingleHeadAttention(config) for _ in range(config.n_heads)
            ]
        )
        # Project the concatenated heads back to the embedding dimension so the
        # output can be added to the residual stream
        self.projection_out = nn.Linear(config.n_heads * config.d_head, config.d_model)

    def forward(self, x):
        # Run every head on the same input and concatenate along the feature axis
        attentions = [head(x) for head in self.heads]
        context_vector = torch.cat(attentions, dim=-1)
        context_projected = self.projection_out(context_vector)
        return context_projected

In [32]:
mha = MultiHeadAttention(config)
x = torch.normal(0.5,0.3,size=(config.context_window, config.d_model))
print(x.shape)
context_vec = mha(x)
print(context_vec.shape)

torch.Size([32, 512])
torch.Size([32, 512])




## Layer Normalization

Layer normalization adjusts the activations of each token, across the feature dimension, to have zero mean and a variance of 1.

In [17]:
import torch
import torch.nn as nn

batch = 2
context_window = 3
d_model = 4

embedding = torch.randn(batch, context_window, d_model)
print(embedding)
layer_norm = nn.LayerNorm(d_model)

normalized = layer_norm(embedding)
print(normalized)

tensor([[[-0.5976, -0.9168, -0.9624,  0.1854],
         [-0.7473,  0.5628, -0.8835,  2.3138],
         [-0.3804, -0.3160, -0.0926,  1.0782]],

        [[-1.5180,  0.4197, -1.1423,  0.1079],
         [ 1.2017, -1.3224,  1.7539,  0.1658],
         [ 0.3813,  1.2058,  0.6612, -1.0390]]])
tensor([[[-0.0539, -0.7481, -0.8472,  1.6491],
         [-0.8229,  0.1953, -0.9287,  1.5563],
         [-0.7667, -0.6576, -0.2792,  1.7034]],

        [[-1.2077,  1.1685, -0.7469,  0.7862],
         [ 0.6420, -1.5130,  1.1135, -0.2425],
         [ 0.0952,  1.0895,  0.4328, -1.6176]]],
       grad_fn=<NativeLayerNormBackward0>)


In [18]:
print(f"Mean {torch.mean(embedding,dim=-1)}")
print(f"Var {torch.var(embedding,dim=-1)}")


Mean tensor([[-0.5728,  0.3114,  0.0723],
        [-0.5332,  0.4498,  0.3023]])
Var tensor([[0.2819, 2.2072, 0.4649],
        [0.8866, 1.8291, 0.9168]])


In [19]:
print(f"Mean {torch.round(torch.mean(normalized,dim=-1))}")
# LayerNorm normalises with the biased variance, so compare with unbiased=False
print(f"Var  {torch.round(torch.var(normalized,dim=-1, unbiased=False))}")


Mean tensor([[0., 0., 0.],
        [0., -0., -0.]], grad_fn=<RoundBackward0>)
Var  tensor([[1., 1., 1.],
        [1., 1., 1.]], grad_fn=<RoundBackward0>)


`nn.LayerNorm` also has `weight` and `bias` parameters: scaling and shifting values applied after normalization, which are trained as a part of the model.
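The computation can be written out by hand and compared with PyTorch's `F.layer_norm`. In this sketch the `weight` and `bias` values are arbitrary numbers chosen so that their effect is visible (in a model they start at ones and zeros and are learned), and `manual_layer_norm` is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model = 4
x = torch.randn(2, 3, d_model)  # (batch, context_window, d_model)

# Scale and shift values; arbitrary here, learned in a real model.
weight = torch.tensor([1.0, 2.0, 0.5, 1.5])
bias = torch.tensor([0.0, 0.1, -0.1, 0.2])
eps = 1e-5


def manual_layer_norm(x, weight, bias, eps):
    """Normalise over the last (feature) dimension, then scale and shift."""
    mean = x.mean(dim=-1, keepdim=True)
    # Layer norm uses the biased variance (divide by N, not N - 1).
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return x_hat * weight + bias


manual = manual_layer_norm(x, weight, bias, eps)
reference = F.layer_norm(x, (d_model,), weight, bias, eps)
print(torch.allclose(manual, reference, atol=1e-5))
```

The statistics are computed per token over its `d_model` features, so the result does not depend on the other tokens in the sequence or on the batch size.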

In [33]:
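import torch.nn.functional as F

class LayerNorm(nn.Module):
    """Layer normalization with a learnable scale and an optional learnable shift."""

    def __init__(self, ndim, bias=False):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias   = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, x):
        # Normalise over the last dimension with eps = 1e-5
        return F.layer_norm(x, self.weight.shape, self.weight, self.bias, 1e-5)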


In [34]:
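class MLP(nn.Module):
    """Position-wise feed-forward network applied to every token independently."""

    def __init__(self, config):
        super().__init__()
        # Operates on the residual stream, so input and output are d_model wide.
        # GPT-2 widens the hidden layer to 4 * d_model; this version keeps it at d_model.
        self.c_fc = nn.Linear(config.d_model, config.d_model, bias=config.bias)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(config.d_model, config.d_model, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return self.dropout(x)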
    
        

In [35]:
class TransformerBlock(nn.Module):

    def __init__(self, config):
        super().__init__()

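        # Both sub-layers read from and write to the d_model-wide residual stream
        self.ln1 = LayerNorm(config.d_model, bias=config.bias)
        self.mha = MultiHeadAttention(config)
        self.ln2 = LayerNorm(config.d_model, bias=config.bias)
        self.mlp = MLP(config)

    def forward(self, x):

        # Pre-norm residual connections: normalise, transform, then add back
        x = x + self.mha(self.ln1(x))
        x = x + self.mlp(self.ln2(x))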

        return x
        
        

In [36]:
class SLLM(nn.Module):

    def __init__(self, config):
        super().__init__()

        # config must also provide vocab_size and n_layers
        self.token_embdgs = nn.Embedding(config.vocab_size, config.d_model)
        self.pos_embdgs   = nn.Embedding(config.context_window, config.d_model)
        self.dropout      = nn.Dropout(config.dropout)

        self.transformer_blocks = nn.ModuleList(
            [TransformerBlock(config) for _ in range(config.n_layers)]
        )
        self.final_norm = LayerNorm(config.d_model)
        self.out_head = nn.Linear(config.d_model, config.vocab_size)

    def forward(self, x):

        # x holds token ids with shape (batch_size, seq_length)
        batch_size, seq_length = x.shape
        token_embds = self.token_embdgs(x)
        pos_embds = self.pos_embdgs(torch.arange(seq_length, device=x.device))
        x = token_embds + pos_embds
        x = self.dropout(x)
        # nn.ModuleList is not callable, so apply the blocks one at a time
        for block in self.transformer_blocks:
            x = block(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits
                                   
                                   
        

## Improvements to multi-head attention

1. Matrix multiplication
2. Multi-query attention
3. A
4. Scaled dot product
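One way to read the matrix-multiplication item is to replace the per-head Python loop in `MultiHeadAttention` with a single projection that produces the queries, keys and values of all heads at once, followed by a reshape. The sketch below is an assumption about what that improvement could look like, not the author's implementation; `FusedMultiHeadAttention` is a hypothetical name, and it expects batched input of shape `(batch, seq_len, d_model)`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusedMultiHeadAttention(nn.Module):
    """Causal multi-head attention with one QKV projection for all heads.
    Hypothetical class; assumes d_model is divisible by n_heads."""

    def __init__(self, d_model=512, n_heads=4, bias=False):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=bias)
        self.projection_out = nn.Linear(d_model, d_model, bias=bias)

    def forward(self, x):
        batch, seq_len, d_model = x.shape
        # One matmul yields queries, keys and values for every head.
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split_heads(t):
            # (batch, seq_len, d_model) -> (batch, n_heads, seq_len, d_head)
            return t.view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        # Attention for all heads in one batched call.
        context = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Merge the heads back: (batch, seq_len, d_model)
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.projection_out(context)


fused_mha = FusedMultiHeadAttention()
x = torch.randn(2, 32, 512)
out = fused_mha(x)
print(out.shape)
```

The heads become an extra batch dimension, so the hardware sees a few large matrix multiplications instead of many small ones issued from a Python loop.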

## Multi-Query Attention

Multi-head attention runs multiple attention layers (heads) in parallel. Each head has its own linear transformations for the queries, keys, values and outputs. Multi-query attention is identical except that all heads share a single set of keys and values.
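A minimal sketch of multi-query attention follows, assuming causal masking and batched input of shape `(batch, seq_len, d_model)`. The class name and default sizes are illustrative choices, not taken from a specific library. Each head keeps its own query projection, while one key projection and one value projection are shared by all heads through broadcasting.

```python
import math
import torch
import torch.nn as nn


class MultiQueryAttention(nn.Module):
    """Causal multi-query attention: n_heads query heads share a single
    key head and a single value head. A sketch, not a tuned layer."""

    def __init__(self, d_model=512, n_heads=4, bias=False):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.Wq = nn.Linear(d_model, d_model, bias=bias)      # all query heads
        self.Wk = nn.Linear(d_model, self.d_head, bias=bias)  # one shared key head
        self.Wv = nn.Linear(d_model, self.d_head, bias=bias)  # one shared value head
        self.projection_out = nn.Linear(d_model, d_model, bias=bias)

    def forward(self, x):
        batch, seq_len, d_model = x.shape
        # (batch, n_heads, seq_len, d_head)
        q = self.Wq(x).view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        # (batch, 1, seq_len, d_head): the size-1 head axis broadcasts over heads
        k = self.Wk(x).unsqueeze(1)
        v = self.Wv(x).unsqueeze(1)

        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        future = torch.triu(
            torch.ones(seq_len, seq_len, device=x.device), diagonal=1
        ).bool()
        weights = torch.softmax(scores.masked_fill(future, float("-inf")), dim=-1)

        context = weights @ v  # (batch, n_heads, seq_len, d_head)
        context = context.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.projection_out(context)


mqa = MultiQueryAttention()
x = torch.randn(2, 32, 512)
out = mqa(x)
print(out.shape)
```

Because only one key and one value vector are stored per token instead of one per head, the key/value cache used during decoding shrinks by a factor of `n_heads`.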

In [20]:
from einops import einsum


