# Chapter 2 - Transformer Architecture


## Introduction

At the heart of every large language model lies the Transformer. Proposed in the seminal paper *Attention Is All You Need* {cite:p}`vaswani2023attentionneed`, it has become the backbone of modern LLMs. The original Transformer described in the paper is a neural network consisting of the following modules:

1. Encoder Module
    This is made of several self-attention heads followed by fully connected layers.
    
2. Decoder Module
    Similar to the encoder module, with attention heads followed by fully connected layers. In addition to its own input sequence, this module also takes in the embedding output from the encoder module.
    
 
 ```{figure} ../../images/chapter2/Transformer-full.png
---
height: 150px
name: Transformer Architecture
---
Transformer Architecture
```
The Transformer architecture is a deep neural network originally trained for language translation tasks. The Attention Is All You Need paper gives examples of translating from English to German and to French. You can think of it as two neural networks, an encoder and a decoder, connected in a cascade to perform the translation task.

The encoder converts the whole English input text into an embedded vector representation. The decoder uses this representation and produces the translation one word at a time. Three types of transformers evolved quickly after the paper was published.

1. Encoder - Decoder transformers
    They include both an encoder and a decoder. Translation models use this architecture; T5 and BART are examples. Their pre-training is task dependent, and they suit tasks that involve both understanding and generating data. They first encode an input sequence into an internal representation and then decode this representation into an output sequence.

2. Encoder only Transformers  
    Models like BERT are encoder-only transformers. Their task involves only understanding. They are trained on masked word prediction: given a sentence, one or more of its words are masked and the model learns to predict them. Encoder-only models are heavily used in classification tasks.
    
3. Decoder only Transformers
    In this book we will be looking at decoder-only models. Decoder-only models are trained to perform text generation. These models are also called autoregressive models.
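
To make "autoregressive" concrete, here is a minimal sketch of a generation loop. The lookup table `NEXT_TOKEN` is a hypothetical stand-in for a trained decoder and exists only for illustration; a real model would predict the next token from the whole sequence generated so far.

```python
# Toy next-token table: a hypothetical stand-in for a trained decoder.
NEXT_TOKEN = {
    "<s>": "the",
    "the": "boat",
    "boat": "left",
    "left": "</s>",
}

def generate(start="<s>", max_new_tokens=10):
    """Greedy autoregressive generation: each new token is predicted
    from what has been produced so far and then appended to it."""
    sequence = [start]
    for _ in range(max_new_tokens):
        # A real model conditions on the whole sequence; this toy
        # lookup only uses the most recent token.
        next_token = NEXT_TOKEN.get(sequence[-1], "</s>")
        sequence.append(next_token)
        if next_token == "</s>":
            break
    return sequence

generated = generate()
print(generated)
```

The important property is the feedback loop: the output of one step becomes part of the input of the next step.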
    

The secret sauce in the encoder and decoder architecture is the multi-head attention module. It is this attention mechanism that gives the model better contextual information and captures the long-term dependencies present in the input text. Let us look at the following two examples:

1. The river bank is not accessible to the tourist.
2. The robbers had planned a heist of major banks in the city.


Both example sentences contain the word "bank". The context in the first sentence is "river" and in the second it is "heist". We want the model to treat the word "bank" according to its context in each sentence. Another example:

1. I went to the library yesterday, and there I forgot my book. I am returning there today.

If I ask you, as an English reader, "Where am I returning?", you know that I am returning to the library. The previous sentence has the clue. This is a simple example of a long-term dependency.


This is where the attention mechanism comes to the rescue. Using attention, we inject contextual information and long-term dependencies into the model. We saw word embeddings in the previous chapter. Each word or token had a unique position in the embedding space. However, as the previous examples show, these embeddings need to account for the contextual information in the input text. The attention mechanism adds this contextual information to the vector representation of the token. More about attention in the subsequent sections.
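
The limitation of context-free embeddings can be illustrated with a small sketch. The three-dimensional vectors below are made-up values, and `mix_with_context` is a hypothetical helper that simply averages the vectors of a sentence; it is a crude stand-in for attention, not the mechanism itself.

```python
# Hypothetical static embeddings; the values are arbitrary.
static_embeddings = {
    "river": [0.9, 0.1, 0.0],
    "heist": [0.0, 0.2, 0.9],
    "bank":  [0.5, 0.5, 0.5],
}

sentence_1 = ["river", "bank"]
sentence_2 = ["heist", "bank"]

def embed(tokens):
    """Plain table lookup: the context plays no role."""
    return [static_embeddings[token] for token in tokens]

def mix_with_context(tokens):
    """Average all vectors of the sentence, a crude form of context mixing."""
    vectors = embed(tokens)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# The lookup returns the same vector for "bank" in both sentences.
same_vector = embed(sentence_1)[1] == embed(sentence_2)[1]

# Once the neighbours are mixed in, the two representations differ.
mixed_1 = mix_with_context(sentence_1)
mixed_2 = mix_with_context(sentence_2)

print(same_vector, mixed_1 != mixed_2)
```

Attention replaces the plain average with learned, token-specific weights, which is the subject of the sections that follow.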



All the examples in this book use the decoder-only architecture. It is the most widely used architecture for building causal LLMs.

## Why LLMs are decoder Transformers

Why are most LLMs decoder-only models and not encoder-decoder models?

According to {cite}`wang2022language`, decoder-only models trained on a next-word prediction objective exhibit the strongest zero-shot generalization after purely self-supervised pretraining.

The same study reports that models trained with a masked language modeling objective, followed by multitask finetuning, perform the best among its experiments.

Factors to consider while choosing between decoder-only and encoder-decoder models:

1. Cost of training
    Decoder-only models are cheaper to train. They can be trained on a large unlabelled corpus, whereas encoder-decoder models need a lot of labelled data for multitask finetuning.

2. Emergent abilities
    Emergent abilities are the phenomenon where models display new, sophisticated capabilities that were not explicitly taught during training and that arise as the model scales in size and complexity. Emergent abilities reduce the performance gap achieved by encoder-decoder models over decoder-only ones with multitask finetuning. See https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1 and https://yaofu.notion.site/A-Closer-Look-at-Large-Language-Models-Emergent-Abilities-493876b55df5479d80686f68a1abd72f

3. In-context learning from prompts
    Prompt engineering methods provide few-shot examples to help the LLM understand the context or task. The in-context information can be seen as having an effect similar to gradient descent, updating the attention weights of the zero-shot prompt.

4. Efficiency optimization
    In decoder-only models, the key and value matrices from previous tokens can be reused for subsequent tokens during decoding. Since each position only attends to previous tokens, the K and V matrices for those tokens remain unchanged. This caching mechanism improves efficiency by avoiding the recomputation of K and V for tokens that have already been processed, which gives faster generation and lower computation cost during inference.
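
The key/value caching idea can be sketched in NumPy. Everything here is an illustrative assumption: the random projection matrices `Wq`, `Wk`, `Wv`, the toy sizes, and the helper names are not taken from any library. The sketch checks that decoding with a cache gives the same result as recomputing keys and values for the whole prefix at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
tokens = rng.normal(size=(5, d))                           # toy token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))   # toy projections

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def full_causal_last_row(x):
    """Recompute K and V for every token, return the last token's output."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q[-1] @ k.T / np.sqrt(d)   # the last token attends to the whole prefix
    return softmax(scores) @ v

k_cache, v_cache, cached_outputs = [], [], []
for x_t in tokens:
    # Only the new token is projected; earlier K and V rows are reused.
    k_cache.append(x_t @ Wk)
    v_cache.append(x_t @ Wv)
    q_t = x_t @ Wq
    scores = q_t @ np.array(k_cache).T / np.sqrt(d)
    cached_outputs.append(softmax(scores) @ np.array(v_cache))

matches = all(
    np.allclose(cached_outputs[t], full_causal_last_row(tokens[: t + 1]))
    for t in range(len(tokens))
)
print(matches)
```

With the cache, each step projects one token instead of the whole prefix, which is where the inference saving comes from.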
      





## Scaled Self-Attention

Sometimes, while averaging numbers, we assign weights to the different numbers; this is weighted averaging. Varied attention is given to different numbers. In the example below we compute a weighted average by restricting the contribution of three entities to ten, thirty and sixty percent. Let us take some liberty here for the sake of quickly grasping the concept of self-attention. Instead of calling these weights, let us call them attention. Rewriting the previous sentence: we have restricted the attention the average function can give to these three entities to 10, 30 and 60 percent.


In [152]:
weights  = (0.1, 0.3, 0.6)
entities = (90.,120.,100.)

normal_average = sum(entities) / len(entities)

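# Multiply each entity by its weight. A weighted average divides by the
# sum of the weights (1.0 here), not by the number of entities.
weighted_sum = sum(w * e for w, e in zip(weights, entities))
weighted_average = weighted_sum / sum(weights)

print(f"Average Normal {normal_average:.3f} "
      f"Weighted {weighted_average:.2f}")

Average Normal 103.333 Weighted 105.00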


So, what is self-attention? Say we have six tokens in our input, "The boat left the river bank."
For each token we have a vector representation in the form of an embedding (refer to the previous chapter for the vector representation of words). If we take the dot product of the vectors for "The" and "boat", it gives us the strength of their relationship. So, starting from the initial embedding for the word "The", we take the dot product of "The" with "boat", then of "The" with "left", and so on. Let us call these values attention scores; after the scaling and normalization described next, they become the attention weights, the strength of the relationship between those words.

When we start training a transformer for a language task, we initialize the word embeddings randomly, leaving us with no control over the scale of the embedding values. The attention paper proposes to scale the dot products by dividing them by the square root of the vector dimension, which is the embedding dimension in our example. Hence we call this scaled attention. After scaling, we normalize the scores with softmax so that they fall between zero and one and sum to one. Normalizing the values helps stabilize the training of the LLM.
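
The effect of the scaling can be checked empirically. The sketch below assumes vectors with independent standard normal entries, which is an idealization of randomly initialized embeddings; `dot_product_std` is a helper written for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_std(dim, n_pairs=2000):
    """Standard deviation of dot products between random vector pairs,
    before and after dividing by sqrt(dim)."""
    a = rng.normal(size=(n_pairs, dim))
    b = rng.normal(size=(n_pairs, dim))
    raw = (a * b).sum(axis=1)
    scaled = raw / np.sqrt(dim)
    return raw.std(), scaled.std()

for dim in (4, 64, 512):
    raw_std, scaled_std = dot_product_std(dim)
    print(f"dim={dim:4d} raw std={raw_std:6.2f} scaled std={scaled_std:.2f}")
```

Under this assumption the spread of the raw dot products grows roughly like the square root of the dimension, while the scaled values stay close to one. Keeping the scores in a moderate range stops softmax from saturating on a single entry.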

Finally, we enrich the embedding representation of each word with these scaled, normalized attention weights. We take the embedding for the word "The" and multiply it by the attention weight of "The" with itself. The embedding for "The" is now enriched with its relationship with itself. To this we add the embedding vector for "boat" multiplied by the attention weight between "The" and "boat", and so on for the remaining words. The final vector, called the context vector for the word "The", is thus formed by addition. Through this weighted sum, the context vector of each word is enriched by the strength of its relationship with its neighbors. The input is enriched by its own constituents, hence the name self-attention.

To sum up scaled self-attention: the input is enriched by its own constituents, hence "self"; the contribution of the words in the input to each other is the "attention"; and finally the attention scores are scaled by the square root of the embedding dimension, hence "scaled".
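
The summary can also be written compactly in matrix form. The symbols are introduced here for illustration: $X$ is the matrix whose rows are the token embeddings and $d$ is the embedding dimension.

$$
\text{context vectors} = \operatorname{softmax}\!\left(\frac{X X^{\top}}{\sqrt{d}}\right) X
$$

This is the special case of the general formula from the attention paper, $\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$, in which the queries, keys and values are all equal to $X$. The softmax is applied to each row separately.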


```{note}
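The dot product of two vectors multiplies their corresponding elements and sums the results. It is large and positive when the vectors point in similar directions, close to zero when they are nearly orthogonal, and negative when they point in opposite directions. This is why it can serve as a measure of relationship strength.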

```
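
A short numeric sketch of the dot product, using made-up vectors chosen so that the results are easy to verify by hand:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction as a
c = np.array([-3.0, 0.0, 1.0])   # orthogonal to a

dot_ab = np.dot(a, b)            # 1*2 + 2*4 + 3*6
dot_ac = np.dot(a, c)            # 1*(-3) + 2*0 + 3*1

# Dividing by the vector lengths gives the cosine of the angle between them.
cosine_ab = dot_ab / (np.linalg.norm(a) * np.linalg.norm(b))

print(dot_ab, dot_ac, cosine_ab)
```

Vectors pointing the same way give a large dot product, while orthogonal vectors give zero. The raw dot product also grows with the length of the vectors, not only with their alignment.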

To demonstrate this concept, let us create some dummy word embeddings. For the purpose of illustration we will stick with word tokens and not their integer encoding. In the previous chapter we studied integer encoding followed by token embedding. For brevity, let us skip directly to token embeddings.

In [239]:
import numpy as np
input_tokens = ["The", "boat", "left", "the", "river", "bank"]
embd_dim =4

generate_embdgs = np.random.normal(size=(embd_dim,))
token_embedding = np.array([generate_embdgs * np.random.rand() \
                            for i in range(len(input_tokens))])
token_embedding

array([[ 0.40340279,  0.03305275,  0.99138182, -0.57605323],
       [ 0.25784043,  0.02112612,  0.63365529, -0.36819233],
       [ 0.0519119 ,  0.00425339,  0.12757601, -0.07412943],
       [ 0.09087871,  0.00744613,  0.22333881, -0.12977345],
       [ 0.10743083,  0.00880233,  0.26401646, -0.15340965],
       [ 0.05611193,  0.00459752,  0.13789777, -0.08012701]])

With these embeddings, let us now calculate the relationship strength of the first word "The" with all the words in the input sequence.

In [240]:
first_word_initial_vector = token_embedding[0]

attention_scores_first_word = np.zeros(len(input_tokens),)

for idx, each_word_vector in enumerate(token_embedding):
    attention_scores_first_word[idx] = \
            np.dot(first_word_initial_vector, each_word_vector)

for i in range(len(attention_scores_first_word)):
    print(f"Strength between '{input_tokens[0]}' "
          f"and '{input_tokens[i]}' {attention_scores_first_word[i]:.2f}")


Strength between 'The' and 'The' 1.48
Strength between 'The' and 'boat' 0.95
Strength between 'The' and 'left' 0.19
Strength between 'The' and 'the' 0.33
Strength between 'The' and 'river' 0.39
Strength between 'The' and 'bank' 0.21


The attention_scores_first_word array now holds the relationship strengths. Using these, we can create a weighted sum for the word "The". Let us first scale the scores and then normalize them using the softmax function.

In [241]:
from scipy.special import softmax

attention_scores_first_word = attention_scores_first_word * (1/np.sqrt(embd_dim))
attention_weights_first_word = \
    softmax(attention_scores_first_word)

print(f"Attention weights {attention_weights_first_word} \
        sum of weights {sum(attention_weights_first_word)}")

Attention weights [0.2521732  0.19313079 0.13242228 0.14222411 0.1466042  0.13344543]         sum of weights 1.0


The normalized attention weights sum to 1.0. Softmax also behaves differently from simple normalization when the values are extreme. Let us see the difference in the code below.

In [242]:
x = np.asarray([0.14,0.48,10000,0,47])
x_normalized = x / sum(x)
print(f"Normalized values {x_normalized}")
print(f"Softmax Normalization {softmax(x)}")

Normalized values [1.39336480e-05 4.77725073e-05 9.95260569e-01 0.00000000e+00
 4.67772468e-03]
Softmax Normalization [0. 0. 1. 0. 0.]


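Dividing by the sum keeps the values proportional, whereas softmax exponentiates first, so the largest score dominates and negative scores still map to valid positive weights. Implementations such as scipy's compute softmax in a numerically stable way, which helps avoid overflow and underflow problems during gradient calculations and backpropagation.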
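
How a numerically stable softmax can be written is sketched below. `naive_softmax` and `stable_softmax` are names made up for this illustration. Subtracting the maximum before exponentiating is a common technique; it is shown here as one way to obtain stability, not as a description of any particular library's internals.

```python
import numpy as np

def naive_softmax(z):
    e = np.exp(z)              # overflows to inf for large inputs
    return e / e.sum()

def stable_softmax(z):
    shifted = z - np.max(z)    # the largest exponent becomes exp(0) = 1
    e = np.exp(shifted)
    return e / e.sum()

x = np.array([0.14, 0.48, 10000.0, 0.0, 47.0])

# Silence the overflow warnings that the naive version triggers.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(x)
stable = stable_softmax(x)

print(naive)
print(stable)
```

The naive version produces a `nan` for the extreme entry because it divides infinity by infinity. The shifted version returns a valid distribution, and the shift does not change the result mathematically because the common factor cancels between numerator and denominator.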

In [243]:
first_word_context_vector = np.zeros(first_word_initial_vector.shape)

for idx, score in enumerate(attention_weights_first_word):
    first_word_context_vector += token_embedding[idx] * score

first_word_context_vector

array([ 0.19456143,  0.01594136,  0.47814409, -0.27783085])

### Vectorization for all the words

Repeat the same process for each word in the sequence. Using matrix operations, we can vectorize the attention calculation.

In [245]:
attention_scores = np.dot(token_embedding, token_embedding.T)
# np.sqrt is used because the math module has not been imported at this point
attention_scores = attention_scores * (1/np.sqrt(embd_dim))
attention_scores

array([[0.73925077, 0.47250227, 0.09513051, 0.16653865, 0.19687104,
        0.10282722],
       [0.47250227, 0.30200631, 0.06080397, 0.10644546, 0.12583283,
        0.06572343],
       [0.09513051, 0.06080397, 0.01224187, 0.02143103, 0.02533436,
        0.01323232],
       [0.16653865, 0.10644546, 0.02143103, 0.03751788, 0.04435117,
        0.02316495],
       [0.19687104, 0.12583283, 0.02533436, 0.04435117, 0.05242904,
        0.02738408],
       [0.10282722, 0.06572343, 0.01323232, 0.02316495, 0.02738408,
        0.01430291]])

In [246]:
attention_scores_first_word

array([0.73925077, 0.47250227, 0.09513051, 0.16653865, 0.19687104,
       0.10282722])

Compare the first row with our previous attention_scores_first_word array. The first row here represents the strength score between the word "The" and all the other words in the sequence. The second row represents the relationship between the word "boat" and all the other words.

In [261]:
attention_weights = softmax(attention_scores,axis=1)
attention_weights

array([[0.2521732 , 0.19313079, 0.13242228, 0.14222411, 0.1466042 ,
        0.13344543],
       [0.21871958, 0.18443452, 0.144907  , 0.15167402, 0.15464327,
        0.14562162],
       [0.17637897, 0.17042723, 0.16234867, 0.16384739, 0.16448819,
        0.16250955],
       [0.18392583, 0.17319868, 0.15908282, 0.16166266, 0.16277113,
        0.15935889],
       [0.1871979 , 0.17436104, 0.15768977, 0.16071723, 0.16202074,
        0.15801333],
       [0.17718185, 0.17072819, 0.16199763, 0.16361471, 0.16430648,
        0.16217115]])

We have all the attention weights. You can compare the attention weights for the first word "The" with the first row of the matrix.

In [248]:
attention_weights_first_word

array([0.2521732 , 0.19313079, 0.13242228, 0.14222411, 0.1466042 ,
       0.13344543])

In [249]:
context_vectors = np.dot(attention_weights, token_embedding)
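
The vectorized steps can be collected into one function and checked against the loop version. This is a self-contained sketch: `self_attention`, `softmax_rows` and the random `toy_embeddings` are names and data introduced for illustration, standing in for the `token_embedding` array used in the cells.

```python
import numpy as np

def softmax_rows(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def self_attention(x):
    """Unmasked scaled self-attention with query = key = value = x."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = softmax_rows(scores)
    return weights @ x, weights

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(6, 4))   # stand-in for token_embedding
context, weights = self_attention(toy_embeddings)

# Loop version for the first token, mirroring the step-by-step cells.
first_scores = np.array([toy_embeddings[0] @ row for row in toy_embeddings])
first_scores = first_scores / np.sqrt(4)
first_weights = np.exp(first_scores) / np.exp(first_scores).sum()
first_context = sum(w * row for w, row in zip(first_weights, toy_embeddings))

print(np.allclose(context[0], first_context))
```

Each row of `context` is the context vector of one token, so the first row should agree with the vector built in the loop.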

### Masking

Another important concept to grasp before we move further is masking. Let us see what masking is and why we need it. We perform masking to avoid data leakage. While we train the transformer to perform next-word prediction, we should avoid giving it any information about the word it is supposed to predict. In the previous example, while computing the weighted sum, we enriched the vector representation of the word "The" with all the other words in the sequence. While we expect the transformer to predict "boat" after "The", we have introduced leakage by providing information about "boat" to the model during training.

Remember the context vector creation: we enrich the original embedding vector using the attention weights. We don't want to enrich the word "The" with the scores of "boat" before the model sees the word "boat". In the next step, when the model has seen both "The" and "boat", we enrich with the scores from "boat". We can do this with a mask that hides the upper triangle of our attention score matrix.

Let us create a mask matrix of the same shape as our attention_scores matrix. Set the lower triangle of this matrix, including the diagonal, to True (1) and the upper triangle to False (0).

In [250]:

mask = np.zeros_like(attention_scores, dtype=bool)
mask[np.tril_indices_from(attention_scores)] = True
mask

array([[ True, False, False, False, False, False],
       [ True,  True, False, False, False, False],
       [ True,  True,  True, False, False, False],
       [ True,  True,  True,  True, False, False],
       [ True,  True,  True,  True,  True, False],
       [ True,  True,  True,  True,  True,  True]])

Apply this mask over attention_scores matrix.

In [252]:
attention_scores_masked = attention_scores * mask
attention_scores_masked

array([[0.73925077, 0.        , 0.        , 0.        , 0.        ,
        0.        ],
       [0.47250227, 0.30200631, 0.        , 0.        , 0.        ,
        0.        ],
       [0.09513051, 0.06080397, 0.01224187, 0.        , 0.        ,
        0.        ],
       [0.16653865, 0.10644546, 0.02143103, 0.03751788, 0.        ,
        0.        ],
       [0.19687104, 0.12583283, 0.02533436, 0.04435117, 0.05242904,
        0.        ],
       [0.10282722, 0.06572343, 0.01323232, 0.02316495, 0.02738408,
        0.01430291]])

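We now have a masked attention scores matrix, where all the upper triangular entries are set to zero. Let us set those zero values to minus infinity. Note that testing for zero would also mask a genuine score that happens to be exactly zero, so selecting entries with the mask itself is the more robust choice in general.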

In [253]:
attention_scores_masked[attention_scores_masked == 0]= -np.inf
attention_scores_masked

array([[0.73925077,       -inf,       -inf,       -inf,       -inf,
              -inf],
       [0.47250227, 0.30200631,       -inf,       -inf,       -inf,
              -inf],
       [0.09513051, 0.06080397, 0.01224187,       -inf,       -inf,
              -inf],
       [0.16653865, 0.10644546, 0.02143103, 0.03751788,       -inf,
              -inf],
       [0.19687104, 0.12583283, 0.02533436, 0.04435117, 0.05242904,
              -inf],
       [0.10282722, 0.06572343, 0.01323232, 0.02316495, 0.02738408,
        0.01430291]])

The softmax function maps each negative infinity entry to a weight of zero, because the exponential of negative infinity is zero.

In [279]:
attention_weights_masked = softmax(attention_scores_masked,axis=-1)
attention_weights_masked

array([[1.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ],
       [0.54252104, 0.45747896, 0.        , 0.        , 0.        ,
        0.        ],
       [0.34641517, 0.33472572, 0.31885911, 0.        , 0.        ,
        0.        ],
       [0.27132906, 0.25550428, 0.23468043, 0.23848623, 0.        ,
        0.        ],
       [0.22232881, 0.2070829 , 0.18728298, 0.19087859, 0.19242672,
        0.        ],
       [0.17718185, 0.17072819, 0.16199763, 0.16361471, 0.16430648,
        0.16217115]])

Finally, using these masked attention weights, we calculate the context vectors.

In [267]:
context_vectors = np.dot(attention_weights_masked, token_embedding)
context_vectors

array([[ 0.40340279,  0.03305275,  0.99138182, -0.57605323],
       [ 0.33681107,  0.02759656,  0.82772946, -0.48096124],
       [ 0.24260326,  0.01987766,  0.5962092 , -0.34643387],
       [ 0.20919026,  0.01713997,  0.51409516, -0.29872061],
       [ 0.19082399,  0.01563513,  0.46895915, -0.27249384],
       [ 0.1655263 ,  0.01356237,  0.40678886, -0.23636911]])

The context vector for the word "The" is not enriched by any other word in the input; it is identical to its original embedding. The context vector of the last word is enriched by all the words.
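
The masking steps can also be written as a single function that selects entries with the boolean mask directly, so that no score is ever compared with zero. This is a sketch with made-up names (`causal_self_attention`, `toy_embeddings`) and random data, not the code used in the rest of the chapter.

```python
import numpy as np

def causal_self_attention(x):
    """Scaled self-attention in which token i only attends to tokens 0..i."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    allowed = np.tril(np.ones((n, n), dtype=bool))   # lower triangle + diagonal
    masked = np.where(allowed, scores, -np.inf)      # hide future positions
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)                         # exp(-inf) is exactly 0
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(6, 4))
context, weights = causal_self_attention(toy_embeddings)

print(np.round(weights, 2))
```

The first row of `weights` puts all of its mass on the first token, so the first context vector equals the first embedding, as observed above.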

### Scaled dot product self-attention in PyTorch


The code below implements the attention mechanism in PyTorch. The SLLMConfig class is a data structure that stores the model parameters.

We implement the attention mechanism in the class SingleHeadAttention.

PyTorch ships its own implementation of scaled dot product attention, "torch.nn.functional.scaled_dot_product_attention". Going forward we will be using this version inside our source code.

https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html

The documentation warns that it is a beta feature and subject to change. Besides matching the standard formulation used in our example, it provides the following three implementations:

1. Flash attention {cite}`dao2023flashattention2fasterattentionbetter`

The attention layer is the main bottleneck in scaling to longer sequences, as its runtime and memory increase quadratically in the sequence length. FlashAttention exploits the asymmetric GPU memory hierarchy to bring significant memory saving (linear instead of quadratic) and runtime speedup (2-4× compared to optimized baselines), with no approximation.

2. Memory-efficient attention

3. C++ implementation

Scaled dot product attention attempts to automatically select the most optimal implementation based on the inputs. In order to provide more fine-grained control over what implementation is used, the following functions are provided for enabling and disabling implementations. The context manager is the preferred mechanism:

* `torch.nn.attention.sdpa_kernel()`: A context manager used to enable or disable any of the implementations.

* `torch.backends.cuda.enable_flash_sdp()`: Globally enables or disables FlashAttention.

* `torch.backends.cuda.enable_mem_efficient_sdp()`: Globally enables or disables Memory-Efficient Attention.

* `torch.backends.cuda.enable_math_sdp()`: Globally enables or disables the PyTorch C++ implementation.


    

In [280]:
# torch itself must be imported before torch.tensor is used below
import torch
from torch.nn.functional import scaled_dot_product_attention

torch_tensor = torch.tensor(token_embedding)

attn_torch = scaled_dot_product_attention(
        query = torch_tensor
       ,key   = torch_tensor
       ,value = torch_tensor
       ,attn_mask=None, dropout_p=0.0, is_causal=True, scale=None
)

print(attn_torch.shape)
attn_torch

torch.Size([6, 4])


tensor([[ 0.4034,  0.0331,  0.9914, -0.5761],
        [ 0.3368,  0.0276,  0.8277, -0.4810],
        [ 0.2426,  0.0199,  0.5962, -0.3464],
        [ 0.2092,  0.0171,  0.5141, -0.2987],
        [ 0.1908,  0.0156,  0.4690, -0.2725],
        [ 0.1655,  0.0136,  0.4068, -0.2364]], dtype=torch.float64)

In [299]:
from dataclasses import dataclass
import torch
import torch.nn as nn
import math
from torch.nn.functional import scaled_dot_product_attention

torch.manual_seed(4752)

@dataclass
class SLLMConfig:
    # Embedding dimension
    d_model: int = 512
    # Query key Value projection dimension
    d_head: int  = 128
    # bias for query,key and value projection matrices
    bias: bool = False
    dropout: float = 0.0
    # Number of input tokens
    context_window: int = 32
    # Number of attention heads
    n_heads: int = 4

    
config = SLLMConfig()
assert config.d_model % config.n_heads == 0

    
class SingleHeadAttention(nn.Module):
    """
    Implements weighted self attention
    """
    def __init__(self, config):

        super().__init__()
        self.Wq =  nn.Linear(config.d_model, config.d_head, bias=config.bias)
        self.Wk =  nn.Linear(config.d_model, config.d_head, bias=config.bias)
        self.Wv =  nn.Linear(config.d_model, config.d_head, bias=config.bias)

        self.attn_drop = config.dropout
        
        self.__init_weights()

    def __init_weights(self):

        nn.init.xavier_uniform_(self.Wq.weight)
        nn.init.xavier_uniform_(self.Wk.weight)
        nn.init.xavier_uniform_(self.Wv.weight)

    def forward(self, x, mask=None):

        q = self.Wq(x)
        k = self.Wk(x)
        v = self.Wv(x)
        
        context_vector = scaled_dot_product_attention(
                                query = q
                               ,key   = k
                               ,value = v
                               ,attn_mask=None
                               # apply attention dropout only while training
                               ,dropout_p=self.attn_drop if self.training else 0.0
                               ,is_causal=True, scale=None)



        return context_vector

The SingleHeadAttention class should look straightforward if you have been following along. There are three trainable weights: the Wq, Wk and Wv linear layers. Inside the forward function, these matrices project the input token embeddings into three variables: query, key and value. Let us spend some time understanding the importance of these trainable weights.

The query, key and value terminology comes from the recommender system literature and, subsequently, RNN-based translation systems. Query refers to "what are we looking for", key symbolizes "what is available", and value is the content that is actually combined into the output. We calculate the attention score between query and key. The context vector for each token in the query is enriched by its relationship with the tokens in the key.

There is another way to look at query, key and value. Imagine you are working on a curve-fitting problem. For a graph with a lot of curvature, a quadratic equation will be more suitable than a linear one. The difference is in the number of coefficients. Besides the intercept, the linear equation has a single coefficient, the slope. A quadratic equation has more coefficients and can therefore fit the curve better.

Take this analogy further. We now have three different sets of coefficients applied to our input embeddings to extract and learn more features from our text input. During the training of our LLM, we have given the model more options: during backpropagation it can apply gradient updates to the following:

1. Embedding Layer
2. Query weight matrix
3. Key weight matrix
4. Value weight matrix.


Compare this with the case where we have no query, key and value projections and only the token embedding layer is available to tweak as we build our models.
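
The role of the three projections can be sketched in NumPy. The sizes and the random matrices `Wq`, `Wk` and `Wv` are toy assumptions; in a trained model these matrices are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 6, 8, 4   # toy sizes, not the chapter's config

x  = rng.normal(size=(n_tokens, d_model))   # token embeddings
Wq = rng.normal(size=(d_model, d_head))     # query projection
Wk = rng.normal(size=(d_model, d_head))     # key projection
Wv = rng.normal(size=(d_model, d_head))     # value projection

q, k, v = x @ Wq, x @ Wk, x @ Wv

raw_scores = q @ k.T
scores = raw_scores / np.sqrt(d_head)
scores = scores - scores.max(axis=1, keepdims=True)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
context = weights @ v

# Without projections the score matrix x @ x.T is always symmetric.
# With separate query and key projections it generally is not.
symmetric = np.allclose(raw_scores, raw_scores.T)

print(q.shape, weights.shape, context.shape, symmetric)
```

The asymmetry is one concrete gain from the extra parameters: how strongly token A attends to token B can differ from how strongly B attends to A.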

In [300]:
x = torch.normal(0.5,0.3,size=(config.context_window, config.d_model))

sha = SingleHeadAttention(config)
attn = sha.forward(x)

print(attn.shape)
attn

torch.Size([32, 128])


tensor([[-0.7320, -1.3121, -1.1089,  ..., -0.3185, -0.0318, -0.0945],
        [-0.4218, -0.7940, -0.9174,  ...,  0.2992,  0.3103, -0.0832],
        [-0.5216, -0.7165, -0.9867,  ...,  0.1792,  0.2390,  0.0536],
        ...,
        [-0.8741, -0.9610, -1.1674,  ...,  0.0701,  0.4750,  0.0293],
        [-0.8710, -0.9496, -1.1936,  ...,  0.0768,  0.4816,  0.0163],
        [-0.8807, -0.9522, -1.1772,  ...,  0.0664,  0.4743,  0.0258]],
       grad_fn=<MmBackward0>)

We have the context vectors for our input token embeddings. In the transformer architecture, we can have multiple attention modules, which is why the class above is called single head. One head provides one set of context vectors. Multiple heads provide multiple sets of context vectors, which we finally concatenate. The concatenated context vectors serve as input to the next layer.

Below is an example of a multi-head implementation. "n_heads" in config defines the number of heads needed.

In [376]:
class MultiHeadAttention(nn.Module):
    """
    Multihead Attention Implementation
    """
    def __init__(self, config):
        
        super().__init__()
        self.heads = nn.ModuleList(
            [
                SingleHeadAttention(config) for _ in range(config.n_heads)
            ]
        )
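        # Project the concatenated heads back to d_model so the output can be
        # added to the block input through a residual connection.
        self.projection_out = nn.Linear(config.n_heads * config.d_head, config.d_model)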

    def forward(self, x):
        attentions = []
        for head in self.heads:
            c_vector = head(x)
            attentions.append(c_vector)

        context_vector = torch.cat(attentions, dim=-1)
        context_projected = self.projection_out(context_vector)
        return context_projected



self.heads stores the list of SingleHeadAttention modules, whose number is decided by n_heads. This is followed by a projection
layer. All the concatenated context vectors are transformed through this projection layer.

In [377]:
mha = MultiHeadAttention(config)
x = torch.normal(0.5,0.3,size=(1, config.context_window, config.d_model))
projected_output = mha.forward(x)
projected_output

tensor([[[ 0.1402,  0.0391,  0.3527,  ...,  0.0339, -0.1148, -0.3316],
         [-0.2345, -0.4306,  0.1521,  ..., -0.1201, -0.0996, -0.2717],
         [ 0.0304, -0.4214,  0.2029,  ..., -0.0579, -0.3120, -0.1975],
         ...,
         [-0.1216, -0.3776,  0.0353,  ..., -0.1119, -0.3448, -0.5420],
         [-0.1336, -0.3805,  0.0261,  ..., -0.1079, -0.3413, -0.5553],
         [-0.1342, -0.3915,  0.0220,  ..., -0.0991, -0.3558, -0.5513]]],
       grad_fn=<ViewBackward0>)

In [382]:

class MultiHeadAttentionv1(nn.Module):
    """
    Multihead Attention Implementation
    """
    def __init__(self, config):
        
        super().__init__()

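        # Project the concatenated heads back to d_model so the output can be
        # added to the block input through a residual connection.
        self.projection_out = nn.Linear(config.n_heads * config.d_head, config.d_model)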
        
        self.Wq =  nn.Linear(config.d_model, config.d_head * config.n_heads, bias=config.bias)
        self.Wk =  nn.Linear(config.d_model, config.d_head * config.n_heads, bias=config.bias)
        self.Wv =  nn.Linear(config.d_model, config.d_head * config.n_heads, bias=config.bias)

        self.attn_drop  = config.dropout
        self.n_heads    = config.n_heads
        self.d_head     = config.d_head
        self.__init_weights()


    def __init_weights(self):

        nn.init.xavier_uniform_(self.Wq.weight)
        nn.init.xavier_uniform_(self.Wk.weight)
        nn.init.xavier_uniform_(self.Wv.weight)


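    def forward(self, x):

        batch, length, d = x.shape

        # nn.Module exposes its mode as the boolean attribute `training`
        # (`train` is a method); dropout is applied only while training.
        dropout_p = self.attn_drop if self.training else 0.0

        q = self.Wq(x)
        k = self.Wk(x)
        v = self.Wv(x)

        # Split the fused projections into heads and move the head axis forward:
        # (batch, length, n_heads * d_head) -> (batch, n_heads, length, d_head),
        # so that attention runs over the token axis within each head.
        q = q.view(batch, length, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(batch, length, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(batch, length, self.n_heads, self.d_head).transpose(1, 2)

        # The causal mask stays on in both training and evaluation.
        context_vector = scaled_dot_product_attention(
                                query = q
                               ,key   = k
                               ,value = v
                               ,attn_mask=None
                               ,dropout_p=dropout_p
                               ,is_causal=True, scale=None)

        # Move the head axis back and concatenate the heads.
        context_vector = context_vector.transpose(1, 2).contiguous().view(batch, length, self.d_head * self.n_heads)
        output = self.projection_out(context_vector)
        return output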

In [383]:
mha = MultiHeadAttentionv1(config)
projection_output = mha.forward(x)
print(projection_output)

tensor([[[ 0.0800, -0.3302, -0.0600,  ..., -0.2621,  0.0807, -0.0346],
         [ 0.0257, -0.4768, -0.1681,  ..., -0.0309,  0.0832,  0.1942],
         [ 0.0488, -0.0642, -0.0550,  ...,  0.0505,  0.1325, -0.0416],
         ...,
         [-0.1246, -0.1650, -0.1457,  ...,  0.0048,  0.2119,  0.1326],
         [-0.1885, -0.2958,  0.0418,  ...,  0.0224,  0.3544,  0.1434],
         [-0.2636, -0.1345, -0.0062,  ..., -0.1854,  0.2127,  0.0444]]],
       grad_fn=<ViewBackward0>)


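Look at the dimensions of the query, key and value weight matrices. They are designed to include the number of heads, so a single matrix multiplication computes the projections for all heads at once instead of looping over separate SingleHeadAttention modules.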
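
How a fused projection is split into heads can be sketched with plain array reshaping. The sizes below are toy values, and NumPy is used only to keep the example small; the same reshape-then-transpose pattern applies to tensors.

```python
import numpy as np

batch, length, n_heads, d_head = 2, 5, 4, 3   # toy sizes

# Stand-in for the output of a fused projection: (batch, length, n_heads * d_head)
fused = np.arange(batch * length * n_heads * d_head, dtype=float)
fused = fused.reshape(batch, length, n_heads * d_head)

# Split the last axis into heads, then bring the head axis forward.
heads = fused.reshape(batch, length, n_heads, d_head).transpose(0, 2, 1, 3)
print(heads.shape)   # (2, 4, 5, 3)

# Head h corresponds to a contiguous slice of the fused feature axis.
h = 1
same = np.array_equal(heads[:, h], fused[:, :, h * d_head:(h + 1) * d_head])

# Reversing the two steps concatenates the heads again.
merged = heads.transpose(0, 2, 1, 3).reshape(batch, length, n_heads * d_head)
round_trip = np.array_equal(merged, fused)

print(same, round_trip)
```

Putting the head axis before the sequence axis matters. Attention is computed over the second-to-last axis, so that axis has to index tokens, not heads.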

## Layer Normalization

Layer normalization adjusts the output of a layer so that the features of each token have zero mean and a variance of 1. Numerical underflow and overflow are common problems while training deep learning networks. Values with uneven scales do not lead to smooth backpropagation, which makes training deep networks difficult. By normalizing the values to zero mean and unit variance, we can speed up the convergence of our training.

In [17]:
import torch
import torch.nn as nn

batch = 2
context_window = 3
d_model = 4

embedding = torch.randn(batch, context_window, d_model)
print(embedding)
layer_norm = nn.LayerNorm(d_model)

normalized = layer_norm(embedding)
print(normalized)

tensor([[[-0.5976, -0.9168, -0.9624,  0.1854],
         [-0.7473,  0.5628, -0.8835,  2.3138],
         [-0.3804, -0.3160, -0.0926,  1.0782]],

        [[-1.5180,  0.4197, -1.1423,  0.1079],
         [ 1.2017, -1.3224,  1.7539,  0.1658],
         [ 0.3813,  1.2058,  0.6612, -1.0390]]])
tensor([[[-0.0539, -0.7481, -0.8472,  1.6491],
         [-0.8229,  0.1953, -0.9287,  1.5563],
         [-0.7667, -0.6576, -0.2792,  1.7034]],

        [[-1.2077,  1.1685, -0.7469,  0.7862],
         [ 0.6420, -1.5130,  1.1135, -0.2425],
         [ 0.0952,  1.0895,  0.4328, -1.6176]]],
       grad_fn=<NativeLayerNormBackward0>)


In [18]:
print(f"Mean {torch.mean(embedding,dim=-1)}")
print(f"Var {torch.var(embedding,dim=-1)}")


Mean tensor([[-0.5728,  0.3114,  0.0723],
        [-0.5332,  0.4498,  0.3023]])
Var tensor([[0.2819, 2.2072, 0.4649],
        [0.8866, 1.8291, 0.9168]])


In [19]:
print(f"Mean {torch.round(torch.mean(normalized,dim=-1))}")
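# LayerNorm normalizes with the biased (population) variance, so compare against that
print(f"Var  {torch.round(torch.var(normalized,dim=-1, unbiased=False))}")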


Mean tensor([[0., 0., 0.],
        [0., -0., -0.]], grad_fn=<RoundBackward0>)
Var  tensor([[1., 1., 1.],
        [1., 1., 1.]], grad_fn=<RoundBackward0>)
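
The normalization itself can be written by hand in a few lines. The sketch below is a NumPy illustration with a made-up helper `layer_norm`; it leaves out the learnable scale and shift that `nn.LayerNorm` also applies, which start at 1 and 0.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)    # population variance (divides by N)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# Toy batch shaped (batch, context_window, d_model) with uneven scale and offset
x = rng.normal(loc=3.0, scale=5.0, size=(2, 3, 4))
normalized = layer_norm(x)

print(normalized.mean(axis=-1).round(3))
print(normalized.var(axis=-1).round(3))
```

The statistics are computed per token over its features, so every position in the batch is normalized independently of the others.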


## Fully connected layer

In [34]:
class MLP(nn.Module):

    def __init__(self, config):
        super().__init__()
        # Both linear layers keep the d_model width so the output can be
        # added back to the block input through the residual connection.
        self.c_fc = nn.Linear(config.d_model, config.d_model, bias=config.bias)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(config.d_model, config.d_model, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return self.dropout(x)
    
        

### Activation Layer
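
The MLP above uses GELU as its activation. As a hedged sketch, the exact form of GELU, which is the default of `nn.GELU`, can be written with the error function and compared with ReLU. The helper names are made up for this illustration.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={value:+.1f} relu={relu(value):.4f} gelu={gelu(value):.4f}")
```

Unlike ReLU, GELU is smooth around zero and lets small negative values through slightly attenuated instead of cutting them off.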

## Put it all together

In [35]:
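class TransformerBlock(nn.Module):
    """
    Pre-norm block; assumes the attention and MLP modules return
    tensors of width d_model so the residual additions line up.
    """

    def __init__(self, config):
        super().__init__()

        # nn.LayerNorm accepts the bias argument from PyTorch 2.1 onwards
        self.ln1 = nn.LayerNorm(config.d_model, bias=config.bias)
        self.mha = MultiHeadAttention(config)
        self.ln2 = nn.LayerNorm(config.d_model, bias=config.bias)
        self.mlp = MLP(config)

    def forward(self, x):

        # Residual connection around attention, then around the MLP
        x = x + self.mha(self.ln1(x))
        x = x + self.mlp(self.ln2(x))

        return x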
        
        

In [36]:
class SLLM(nn.Module):
    """
    Assumes SLLMConfig has been extended with two integer fields that the
    dataclass above does not define yet: vocab_size and n_layers.
    """

    def __init__(self, config):
        super().__init__()

        self.token_embdgs = nn.Embedding(config.vocab_size, config.d_model)
        self.pos_embdgs   = nn.Embedding(config.context_window, config.d_model)
        self.dropout      = nn.Dropout(config.dropout)

        self.transformer_blocks = nn.ModuleList(
        [TransformerBlock(config) for _ in range(config.n_layers)]
        )
        self.final_norm = nn.LayerNorm(config.d_model)
        self.out_head = nn.Linear(config.d_model, config.vocab_size)

    def forward(self, x):

        batch_size, seq_length = x.shape
        token_embds = self.token_embdgs(x)
        pos_embds = self.pos_embdgs(torch.arange(seq_length, device=x.device))
        x = token_embds + pos_embds
        x = self.dropout(x)
        # nn.ModuleList is not callable, so apply the blocks one by one
        for block in self.transformer_blocks:
            x = block(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits
                                   
                                   
        