# <center> <font size = 24 color = 'steelblue'> **Transformer Architecture**

# <center> <img src = "https://drive.google.com/uc?export=view&id=1DXsSb3QZTO9vbI2VjJD4cpXOmj1svgSn" height = 500 width = 600>

<div class="alert alert-block alert-info">
    
<font size = 4>

**By the end of this notebook you will be able to:**

- Understand the transformer architecture
- Explore the implementation of the encoder
- Explain the working of the decoder
    
</div>

# <a id= 'p0'>
<font size = 4>
    
**Table of Contents:**<br>
[1. Transformer architecture](#p1)<br>
[2. The encoder](#p2)<br>
>[2.1. How do transformers pay attention?](#p2.1)<br>
>[2.2. Self attention](#p2.2)<br>
>[2.3. Calculating attention weights](#p2.3)<br>
>[2.4. Multi-headed attention](#p2.4)<br>
>[2.5. Feed forward layer](#p2.5)<br>
>[2.6. Adding layer normalization](#p2.6)<br>
>[2.7. Positional embeddings](#p2.7)<br>
>[2.8. Add classification head](#p2.8)<br>

[3. The decoder](#p3)<br>
[4. Summary](#p4)<br>

## <a id = 'p1'>
<font size = 10 color = 'midnightblue'> **The transformer architecture step by step**

<div class="alert alert-block alert-success">
    
<font size = 4>

The original transformer relies on the **encoder-decoder architecture**, commonly employed in tasks such as machine translation, involving the conversion of a sequence of words from one language to another.

This architecture comprises two key components:

**1. Encoder:**<br>
Converts an input sequence of tokens into a sequence of embedding vectors, often referred to as the hidden state or context.

**2. Decoder** <br>
Utilizes the hidden state from the encoder to progressively generate an output sequence of tokens, one token at a time.

</div>

<div class="alert alert-block alert-info">

<center> <font size = 5> <b>The encoder and decoder are themselves composed of several building blocks.</b>

## <center> <img alt="transformer-encoder-decoder" caption="Encoder-decoder architecture of the transformer, with the encoder shown in the upper half of the figure and the decoder in the lower half" src="https://drive.google.com/uc?export=view&id=1btkUwVzTtIwAjDnndegPI4gWkfmcvYm6" id="transformer-encoder-decoder" height = 800 width = 900>

<div class="alert alert-block alert-success">

- <font size = 4> The input text undergoes tokenization and is converted into token embeddings.
- <font size = 4>To address the attention mechanism's lack of awareness regarding token positions, positional embeddings containing positional information for each token are added to the token embeddings.

- <font size = 4>The encoder and decoder consist of stacked layers or **blocks**, similar to stacking convolutional layers in computer vision.
- <font size = 4>The encoder's output is fed to each decoder layer, and the decoder then predicts the next token one step at a time until an end-of-sequence (EoS) token is generated or a maximum length is reached.
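
The iterative generation described in the last bullet can be sketched as a simple loop. This is a minimal, dependency-free illustration, not a real decoder: `toy_next_token_scores` is a hypothetical stand-in that just prefers the next vocabulary entry, and the vocabulary, `EOS_ID`, and `max_length` are assumptions made for the example.

```python
# Hypothetical toy vocabulary; "<eos>" marks the end of the sequence.
VOCAB = ["<bos>", "time", "flies", "like", "an", "arrow", "<eos>"]
EOS_ID = VOCAB.index("<eos>")


def toy_next_token_scores(encoder_state, generated_ids):
    """Stand-in for a decoder: score every vocabulary entry for the next step.

    A real decoder would attend over `encoder_state`; this toy version simply
    prefers the vocabulary entry that follows the last generated token.
    """
    preferred = min(generated_ids[-1] + 1, EOS_ID)
    return [1.0 if i == preferred else 0.0 for i in range(len(VOCAB))]


def greedy_decode(encoder_state, max_length=10):
    generated = [VOCAB.index("<bos>")]
    while len(generated) < max_length:
        scores = toy_next_token_scores(encoder_state, generated)
        # Greedy choice: pick the highest-scoring token.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        generated.append(next_id)
        if next_id == EOS_ID:  # stop as soon as end-of-sequence is produced
            break
    return [VOCAB[i] for i in generated]


tokens = greedy_decode(encoder_state=None)
print(tokens)
```

The loop has exactly the two stopping conditions named above: the end-of-sequence token and the maximum length.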

<div class="alert alert-block alert-success">
<font size = 4>

<b> The Transformer architecture, initially designed for sequence-to-sequence tasks like machine translation, has evolved into three main types:</b><br>

**Encoder-only:** Converts input text into a numerical representation suitable for tasks like text classification. Examples include BERT, RoBERTa, and DistilBERT, utilizing bidirectional attention.

**Decoder-only:** Autocompletes sequences based on a given prompt. Models like GPT belong to this class, using causal (autoregressive) attention in which each token attends only to the tokens on its left.

**Encoder-decoder:** Used for complex mappings between text sequences, suitable for tasks like machine translation. Examples include BART and T5, combining both encoder and decoder components.
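
The difference between bidirectional and causal attention comes down to which positions each token is allowed to look at. The sketch below builds both patterns as 0/1 masks for a short sequence; the helper names are illustrative and not part of any library.

```python
def bidirectional_mask(seq_len):
    # Every token may attend to every position (encoder-style, as in BERT).
    return [[1] * seq_len for _ in range(seq_len)]


def causal_mask(seq_len):
    # Token i may attend only to positions 0..i (decoder-style, as in GPT).
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


print("bidirectional:")
for row in bidirectional_mask(4):
    print(row)

print("causal:")
for row in causal_mask(4):
    print(row)
```

In the causal mask, each row has ones up to and including the diagonal, so a token never sees the tokens to its right.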

[top](#p0)

## <a id = 'p2'>

<font size = 10 color = 'midnightblue'> **The encoder**

<div class="alert alert-block alert-success">
    
<font size = 4>

- As seen earlier, the transformer's encoder consists of many encoder layers stacked next to each other.
- As illustrated in the figure below, each encoder layer receives a sequence of embeddings and feeds them through the following sub-layers:

> **1. A multi-head self-attention layer**<br>
> **2. A fully connected feed-forward layer that is applied to each input embedding**

## <center> <img alt="encoder-zoom" caption="Zooming into the encoder layer" src="https://drive.google.com/uc?export=view&id=1oUunaulbpddoKTrscxqefR1LN-1MpEat" id="encoder-zoom" height = 400, width = 500>

<div class="alert alert-block alert-success">
- <font size = 4> The output embeddings of each encoder layer have the same size as the inputs. <br>
- The main role of the encoder stack is to <b>update</b> the input embeddings to produce representations that encode some contextual information in the sequence. <br>
- For example, the word <b>“apple”</b> will be updated to be more <b>“company-like”</b> and less <b>“fruit-like”</b> if the words <b>“keynote”</b> or <b>“phone”</b> are close to it.

<div class="alert alert-block alert-info">
    
<font size = 4>

- **Each sublayer employs skip connections and layer normalization, common techniques for effective training of deep neural networks.**<br>
- **To grasp the essence of a transformer's functionality, we must delve into the fundamental building block: the self-attention layer.**

[top](#p0)

###### <a id = 'p2.1'>
###### <font size = 6 color = 'pwdrblue'> **How do transformers pay attention?**

# <center> <img src = "https://drive.google.com/uc?export=view&id=1BxdVq1Kt1NuG2Ck8mUJ8u3mk9N8TMmM5" height = 800 width = 900>

<div class="alert alert-block alert-success">
    
- <font size = 4.5> An attention function can be described as mapping a query and a set of key-value pairs to an output.<br>
- <font size = 4.5> Here the query, keys, values, and output are all vectors. <br>
- <font size = 4.5> The output is computed as a weighted sum of the values.<br>
- <font size = 4.5> The weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
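
The description above can be made concrete for a single query. The following is a minimal sketch in plain Python, assuming the dot product as the compatibility function and a softmax to turn scores into weights; the function name `attend` and the toy vectors are made up for illustration.

```python
import math


def attend(query, keys, values):
    """Return the attention output and weights for a single query vector."""
    # Compatibility function: dot product between the query and each key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax turns the scores into positive weights that sum to 1.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output: weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output, weights = attend([2.0, 0.0], keys, values)
print(weights, output)
```

The query matches the first key more closely, so the output is pulled toward the first value vector.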

###### <a id = 'p2.2'>
###### <font size = 6 color = 'pwdrblue'> **Self-Attention**

<div class="alert alert-block alert-success">

- <font size = 4.5>Self-attention is a mechanism that enables each element in a sequence to attend to other elements within the same sequence.<br>
- <font size = 4.5>Introduced in the "Attention is All You Need" paper, self-attention allows the model to assign different weights to different elements, capturing dependencies and relationships between them.

# <center> <img alt="Contextualized embeddings" caption="Diagram showing how self-attention updates raw token embeddings (upper) into contextualized embeddings (lower) to create representations that incorporate information from the whole sequence" src="https://drive.google.com/uc?export=view&id=1aqoT0tbJV-K_TvMQ025jJVdvKgXJ8iFu" id="contextualized-embeddings" height= 700 width = 900>

<div class="alert alert-block alert-info">
    
<font size = 4>

<center> <b>left:</b>  Scaled dot-product Attention
    <b>right:</b> Multi-head attention consists of several<br>
<center> Attention layers running in parallel. - From the paper - 'Attention is all you need'

# <center> <img src = "https://drive.google.com/uc?export=view&id=1DSlEKkHHCN4oV6NxpeiqdZmweJEqbUop" height = 700 width = 900>

[top](#p0)

###### <a id = 'p2.3'>
###### <font size = 6 color = 'pwdrblue'>**Calculating attention weights**

<font size = 5 color = 'seagreen'> **Scaled dot-product attention**

<div class="alert alert-block alert-success">
<font size = 4>
    
Implementing a self-attention layer can be done in various ways, with scaled dot-product attention being the most common, as introduced in the original transformer architecture paper.     

In [None]:
%%javascript
require.config({
  paths: {
      d3: '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.8/d3.min',
      jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
  }
});

In [None]:
# pip install bertviz

<div class="alert alert-block alert-success">
<font size = 4>
    
- BertViz library in Jupyter is used below to visualize how attention weights are calculated in transformer models.<br>
- BertViz, with its neuron_view module, traces the computation of weights, illustrating how query and key vectors combine to produce the final weight. <br>
- To activate the attention visualization, instantiate the BERT checkpoint with the BertViz model class and use the show() function for a specific encoder layer and attention head. <br>
- Click the “+” on the left to enable the attention visualization.

In [None]:
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show

model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)
text = "time flies like an arrow"
show(model, "bert", tokenizer, text, display_mode="light", layer=0, head=8)

<div class="alert alert-block alert-success">
<font size = 4>

<center> Take a look at this process in more detail by implementing the diagram of operations to compute scaled dot-product attention

</div>

# <center> <img alt="Operations in scaled dot-product attention" height="125" caption="Operations in scaled dot-product attention" src="https://drive.google.com/uc?export=view&id=1wKI6spyRipODWOnY276vXXWVBeuIf61q" id="attention-ops" width = 1000>

<div class="alert alert-block alert-info">
<font size = 4.5>
    
**Note**
- <font size = 4.5> This notebook uses PyTorch to implement the transformer architecture.<br>
- <font size = 4.5> The steps in TensorFlow are analogous.

In [None]:
from transformers import AutoTokenizer
model_ckpt = "bert-base-uncased"
text = "time flies like an arrow"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [None]:
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs.input_ids

<div class="alert alert-block alert-success">
<font size = 4>
    
- Each token in the sentence has been mapped to a unique ID in the tokenizer’s vocabulary. <br>
- Also, [CLS] and [SEP] tokens have been excluded by setting <b>add_special_tokens=False</b>. <br>
- Next, some dense embeddings are created. <br>
- Dense in this context means that each entry in the embeddings contains a nonzero value. <br>
- In contrast, the one-hot encodings are sparse, since all entries except one are zero. <br>
- In PyTorch, this can be done by using a <b>torch.nn.Embedding layer</b> that acts as a lookup table for each input ID:

In [None]:
from torch import nn
from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_ckpt)
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)
token_emb

In [None]:
inputs_embeds = token_emb(inputs.input_ids)
inputs_embeds.size()

<div class="alert alert-block alert-success">

<font size = 4>

- This has returned a tensor of shape [batch_size, seq_len, hidden_dim]. <br>
- Positional encodings are postponed for now, so the next step is to create the query, key, and value vectors and calculate the attention scores using the dot product as the similarity function:

In [None]:
import torch
from math import sqrt

query = key = value = inputs_embeds
dim_k = key.size(-1)
scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k)
scores.size()

<div class="alert alert-block alert-success">
   
<font size = 4>
    
- A 5×5 matrix of attention scores has been formed per sample in the batch.<br>
- As shown later, the query, key, and value vectors are created by applying separate weight matrices ($W_Q$, $W_K$, $W_V$) to the embeddings.<br>
- However, for simplicity, they have been kept equal for now. <br>
- In scaled dot-product attention, the dot products are divided by the square root of the key dimension ($\sqrt{d_k}$) to avoid excessively large values during training, which would otherwise saturate the subsequent softmax.
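
To see why the scaling in $\mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$ matters, the sketch below compares the softmax of raw and scaled dot products. It is a plain-Python illustration under assumptions: the query and keys are random Gaussian vectors rather than real embeddings, and `dim_k = 768` is chosen to match the hidden size of `bert-base-uncased`.

```python
import math
import random


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
dim_k = 768  # assumed to match the hidden size used in this notebook
query = [random.gauss(0, 1) for _ in range(dim_k)]
keys = [[random.gauss(0, 1) for _ in range(dim_k)] for _ in range(5)]

# Raw dot products grow with the dimension of the vectors.
raw_scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
# Dividing by sqrt(dim_k) brings them back to a moderate range.
scaled_scores = [s / math.sqrt(dim_k) for s in raw_scores]

unscaled_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)
print("largest weight without scaling:", round(max(unscaled_weights), 3))
print("largest weight with scaling:   ", round(max(scaled_weights), 3))
```

Without scaling, almost all of the weight tends to land on a single key, which is the saturation the scaling factor is meant to prevent.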

<div class="alert alert-block alert-info">
<font size = 4>

<b>Note:</b>

- The computation of attention scores is simplified by the torch.bmm() function, which performs a batch matrix-matrix product.
- This is applied when the query and key vectors have the shape [batch_size, seq_len, hidden_dim].
- Ignoring the batch dimension, the dot product for each query and key vector can be calculated by transposing the key tensor to have the shape [hidden_dim, seq_len] and then the matrix product can be used to collect all dot products in a [seq_len, seq_len] matrix.
- To perform this operation independently for all sequences in the batch, torch.bmm() is utilized, taking two batches of matrices and multiplying each matrix from the first batch with the corresponding matrix in the second batch.
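
As a quick check of the note above, the following sketch compares `torch.bmm()` with an explicit loop over the batch. The tensor sizes are small illustrative values, not the ones used elsewhere in this notebook.

```python
import torch

torch.manual_seed(0)
batch_size, seq_len, hidden_dim = 2, 5, 8  # small illustrative sizes
query = torch.randn(batch_size, seq_len, hidden_dim)
key = torch.randn(batch_size, seq_len, hidden_dim)

# Batched product: one [seq_len, seq_len] score matrix per sequence.
batched = torch.bmm(query, key.transpose(1, 2))

# The same result, computed one sequence at a time.
looped = torch.stack([q @ k.T for q, k in zip(query, key)])

print(batched.shape)
print(torch.allclose(batched, looped, atol=1e-5))
```

Both approaches yield a tensor of shape [batch_size, seq_len, seq_len]; `torch.bmm()` simply avoids the Python loop.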

In [None]:
import torch.nn.functional as F

weights = F.softmax(scores, dim=-1)
weights.sum(dim=-1)

<font size = 5 color = 'seagreen'>  **The final step is to multiply the attention weights by the values:**

In [None]:
attn_outputs = torch.bmm(weights, value)
attn_outputs.shape

<div class="alert alert-block alert-info">
    
<center><font size =4> <b>That completes all the steps required to implement a simplified form of self-attention!</b>

<div class="alert alert-block alert-success">
<font size =4>
<center>
Observe that the entire process involves only two matrix multiplications and a softmax, so "self-attention" can be thought of as a sophisticated form of averaging.

<div class="alert alert-block alert-success">
<font size = 4>
    

<center> <b>Now, let's encapsulate these steps into a function for future use:</b>

In [None]:
def scaled_dot_product_attention(query, key, value):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)

<div class="alert alert-block alert-success">
<font size = 4>
- <font size = 4> In the attention mechanism with identical query and key vectors, a considerably large score is assigned to identical words in the context. <br>
- <font size = 4> This applies especially to the current word itself: the dot product of a vector with itself is its squared length (which equals 1 only for unit-length vectors) and is typically the largest score in its row. <br>
- <font size = 4> However, in real-world scenarios, the meaning of a word is better enriched by complementary words in the context rather than identical ones. <br>
- <font size = 4> For instance, the understanding of "flies" benefits more from information about "time" and "arrow" than from another instance of "flies". <br>
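
A small numeric example illustrates how strongly a token scores against itself when queries and keys are identical. The 2-dimensional embeddings below are hypothetical values chosen by hand, not real model embeddings.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 2-dimensional embeddings for three tokens.
embeddings = {
    "time": [1.0, 2.0],
    "flies": [3.0, 1.0],
    "arrow": [0.5, -1.0],
}

# With query = key, the query for "flies" is just its own embedding.
query = embeddings["flies"]
scores = {token: dot(query, vec) for token, vec in embeddings.items()}
print(scores)
```

Here the score of "flies" with itself is its squared length (3² + 1² = 10), which dominates the scores for "time" and "arrow" before any softmax is applied.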

<center><font size = 5 color = 'seagreen'> <b>How can we encourage this behavior?</b>

<div class="alert alert-block alert-info">

<font size = 4> **Let's enable the model to generate distinct sets of vectors for the query, key, and value of a token by utilizing three distinct linear projections.**<br>
**These projections project our initial token vector into three different spaces.**

[top](#p0)

###### <a id = 'p2.4'>
###### <font size = 6 color = 'pwdrblue'> **Multi-headed attention**

<div class="alert alert-block alert-success">

<font size = 4>

- In the earlier example, the embeddings were used **"as is"** to compute the attention scores and weights, but the complete process is more involved.<br>
- In practice, the self-attention layer employs three separate linear transformations for each embedding to create query, key, and value vectors.<br>
- These transformations project the embeddings, each with its own learnable parameters, enabling the self-attention layer to focus on diverse semantic aspects.<br>
- It's advantageous to have multiple sets of linear projections, each representing an attention head. <br>
- The resulting multi-head attention layer, as shown in the Figure, is crucial because the softmax of one head tends to focus on a single aspect of similarity. Multiple heads enable the model to focus on various aspects simultaneously. <br>

<img alt="Multi-head attention" height="125" caption="Multi-head attention" src="https://drive.google.com/uc?export=view&id=1meDlVWfXzPT7RQn59keoBjBFBPkMqGJU" id="multihead-attention"/>

<div class="alert alert-block alert-success">
- <font size = 4>For instance, one head may concentrate on subject-verb interaction, while another identifies nearby adjectives. <br>
- <font size = 4>These relationships are not manually crafted into the model; they are entirely learned from the data.

<div class="alert alert-block alert-info">
<font size = 4.5> <b>
For those acquainted with computer vision models, the resemblance to filters in convolutional neural networks becomes evident.<br></b>
    
<font size = 4> <center> Within such networks, one filter may identify faces, while another detects wheels of cars in images.

<center><font size = 5 color = 'seagreen'> <b>Let’s implement this layer by first coding up a single attention head:

In [None]:
class AttentionHead(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)

    def forward(self, hidden_state):
        attn_outputs = scaled_dot_product_attention(
            self.q(hidden_state), self.k(hidden_state), self.v(hidden_state))
        return attn_outputs

<div class="alert alert-block alert-success">

- <font size = 4> Here, three separate linear layers are initialized.
- <font size = 4> Matrix multiplication is performed on the embedding vectors to yield tensors of shape [batch_size, seq_len, head_dim].<br>
- <font size = 4> Here, head_dim represents the dimensions being projected into.<br>
- <font size =4> While head_dim is not required to be smaller than the number of embedding dimensions (embed_dim) of the tokens, it is typically chosen as embed_dim divided by the number of heads, so that the computation across all heads stays constant.<br>
<font size =4>**For instance, BERT uses 12 attention heads, so each head has a dimension of 768/12 = 64.**
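
The arithmetic behind this choice can be checked directly. The sketch below uses the BERT numbers quoted above (768 and 12) and counts the parameters of the query, key, and value projections; `linear_params` is a hypothetical helper that counts the weights and biases of a linear layer.

```python
embed_dim, num_heads = 768, 12  # values from the BERT example above
assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
head_dim = embed_dim // num_heads


def linear_params(in_features, out_features):
    # Weight matrix plus bias vector of a linear layer.
    return in_features * out_features + out_features


# Query, key, and value projections for every head ...
per_head = 3 * linear_params(embed_dim, head_dim)
all_heads = num_heads * per_head
# ... versus a single head with full-width projections.
single_full_head = 3 * linear_params(embed_dim, embed_dim)

print(head_dim, num_heads * head_dim, all_heads, single_full_head)
```

Concatenating 12 heads of size 64 restores the original 768 dimensions, and the projections of all heads together have as many parameters as one full-width head.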

<div class="alert alert-block alert-success">
<font size = 4> Now, with a single attention head established, we can concatenate the outputs of several heads to construct the complete multi-head attention layer:

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        embed_dim = config.hidden_size
        num_heads = config.num_attention_heads
        head_dim = embed_dim // num_heads
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_dim) for _ in range(num_heads)]
        )
        self.output_linear = nn.Linear(embed_dim, embed_dim)

    def forward(self, hidden_state):
        x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)
        x = self.output_linear(x)
        return x

<div class="alert alert-block alert-info">
<font size = 4>

- Observe that the concatenated output from the attention heads undergoes processing through a concluding linear layer, yielding an output tensor of shape **_[batch_size, seq_len, hidden_dim]_**, apt for the subsequent feed-forward network. <br>
- To verify, the **_MultiHeadAttention_** module is initialized with the configuration loaded earlier from the pre-trained BERT model, guaranteeing alignment with BERT settings.

In [None]:
multihead_attn = MultiHeadAttention(config)
attn_output = multihead_attn(inputs_embeds)
attn_output.size()

<center><font size = 5 color = 'seagreen'> <b>It functions successfully!

<div class="alert alert-block alert-success">
<font size = 4>
    
**To conclude:**
- BertViz is employed once more to visualize the attention for two instances of the word "flies." <br>
- The **_head_view()_** function from BertViz is applied, computing attentions from a pre-trained checkpoint and indicating the sentence boundary.

In [None]:
from bertviz import head_view
from transformers import AutoModel

model = AutoModel.from_pretrained(model_ckpt, output_attentions=True)

sentence_a = "time flies like an arrow"
sentence_b = "fruit flies like a banana"

viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
attention = model(**viz_inputs).attentions
sentence_b_start = (viz_inputs.token_type_ids == 0).sum(dim=1)
tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])

head_view(attention, tokens, sentence_b_start, heads=[8])

<div class="alert alert-block alert-success">

<font size = 4>

- The visualization depicts attention weights as lines connecting the token undergoing embedding update (left) with each attended word (right). <br>
- Line intensity signifies attention weight strength, with dark lines indicating values close to 1 and faint lines representing values close to 0.

<div class="alert alert-block alert-success">
    
<font size = 4> **In this example, the input consists of two sentences and the [CLS] and [SEP] tokens are the special tokens in BERT’s tokenizer**

<div class="alert alert-block alert-success">
<font size = 4>
- From the visualization, we observe the strongest attention weights between words within the same sentence, indicating BERT's ability to recognize intra-sentence connections.<br>
- Specifically for the word "flies," BERT emphasizes the importance of "arrow" in the first sentence and "fruit" and "banana" in the second. <br>
- These attention weights enable the model to discern whether "flies" is used as a verb or noun, depending on its context.

<div class="alert alert-block alert-success">
<font size = 4>Having explored attention, let's delve into implementing the remaining component of the encoder layer:

**position-wise feed-forward networks**.

[top](#p0)

###### <a id = 'p2.5'>
###### <font size = 6 color = 'pwdrblue'>  **The feed-forward layer**

<div class="alert alert-block alert-success">
- <font size = 4> The feed-forward sublayer in the encoder and decoder consists of a straightforward two-layer fully connected neural network, with a distinctive feature:<br>
- <font size = 4>It processes each embedding independently, earning it the label of a position-wise feed-forward layer. <br>
- <font size = 4>Alternatively, individuals with a background in computer vision may refer to it as a one-dimensional convolution with a kernel size of one (as seen in the OpenAI GPT codebase). <br>
- <font size =4>A common guideline suggests setting the hidden size of the first layer to four times the size of the embeddings, and the GELU activation function is often applied.
- <font size = 4>This layer is considered crucial for capacity and memorization, and it is frequently the part that gets scaled when scaling up models. Implementation as a simple nn.Module can be achieved as follows:

In [None]:
class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.linear_1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear_2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, x):
        x = self.linear_1(x)
        x = self.gelu(x)
        x = self.linear_2(x)
        x = self.dropout(x)
        return x

<div class="alert alert-block alert-info">
<font size = 4>
    
**Note:**

- It's worth noting that a feed-forward layer, like <b>nn.Linear</b>, is typically employed on a tensor with a shape of (batch_size, input_dim), operating independently on each element within the batch dimension. <br>
- This holds true for any dimension except the last one. Therefore, when presenting a tensor with a shape of (batch_size, seq_len, hidden_dim), the layer processes all token embeddings of the batch and sequence independently, aligning with our intended behavior. <br>
</div>
<div class="alert alert-block alert-success">

<center><font size = 4> <b>We can confirm this by passing the attention outputs:</b>

In [None]:
feed_forward = FeedForward(config)
ff_outputs = feed_forward(attn_outputs)
ff_outputs.size()
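
The position-wise behavior can also be checked on a small standalone example. The sketch below, with illustrative sizes, applies one `nn.Linear` layer to a whole [batch_size, seq_len, hidden_dim] tensor and then to each token embedding separately, and compares the results.

```python
import torch
from torch import nn

torch.manual_seed(0)
linear = nn.Linear(8, 4)   # illustrative sizes
x = torch.randn(2, 5, 8)   # [batch_size, seq_len, hidden_dim]

# Apply the layer to the whole tensor at once ...
all_at_once = linear(x)

# ... and to every token embedding separately.
one_by_one = torch.stack(
    [torch.stack([linear(token) for token in sequence]) for sequence in x]
)

print(all_at_once.shape)
print(torch.allclose(all_at_once, one_by_one, atol=1e-5))
```

Since both results agree, the layer treats every position independently and only transforms the last dimension.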

<div class="alert alert-block alert-success">
<font size = 4>
<b>
    
- All the necessary elements are now in place to construct a complete transformer encoder layer!<br>
- The remaining decision is the placement of skip connections and layer normalization.

</div>
<div class="alert alert-block alert-success">
<font size = 4>
<b>
    <center> <font color = 'seagreen'> Let's explore how this choice influences the model architecture.

[top](#p0)

###### <a id = 'p2.6'>
###### <font size = 6 color = 'pwdrblue'> **Adding Layer Normalization**

<div class="alert alert-block alert-success">

- <font size = 4> The Transformer architecture incorporates layer normalization and skip connections.

- <font size = 4> Layer normalization normalizes each input in the batch to have zero mean and unit variance.
    
- <font size = 4> Skip connections pass a tensor to the next layer without processing and add it to the processed tensor.
    
- <font size = 4> Regarding the placement of layer normalization in the Transformer's encoder or decoder layers, two main choices exist in the literature:
    
> <font size = 4><b>a. Post Layer Normalization:</b>
    
> * <font size = 4>Layer normalization is placed between the skip connections, i.e. it is applied after the sublayer's output has been added to its input (the arrangement used in the original Transformer paper).<br>
> * <font size = 4>Training from scratch with this arrangement can be challenging because the gradients can diverge.<br>
> * <font size = 4>To address this, learning rate warm-up is often applied: the learning rate is gradually increased from a small value to its maximum during the first training steps.<br>
    

> <font size = 4><b>b. Pre Layer Normalization:</b>
> * <font size = 4>Layer normalization is placed within the span of the skip connections, i.e. it is applied to the sublayer's input while the skip path itself stays unnormalized.<br>
> * <font size = 4>This configuration tends to be more stable during training and typically does not require learning rate warm-up.<br>
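
The two arrangements differ only in where the normalization sits relative to the addition. The sketch below is a minimal illustration under assumptions: a single `nn.Linear` stands in for the attention or feed-forward sublayer, and `hidden_size = 16` is an arbitrary small value.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden_size = 16  # illustrative size
layer_norm = nn.LayerNorm(hidden_size)
sublayer = nn.Linear(hidden_size, hidden_size)  # stand-in for attention or feed-forward


def post_ln(x):
    # Post layer normalization: normalize after adding the skip connection.
    return layer_norm(x + sublayer(x))


def pre_ln(x):
    # Pre layer normalization: normalize only the branch that enters the sublayer.
    return x + sublayer(layer_norm(x))


x = 5 + 3 * torch.randn(1, 4, hidden_size)  # inputs that are not yet normalized
print(post_ln(x).mean(dim=-1))  # each token has a mean close to zero
print(pre_ln(x).mean(dim=-1))   # the skip path carries the unnormalized input through
```

With post layer normalization the output of every layer is normalized, whereas with pre layer normalization the skip path passes the input through untouched.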

<font size = 4><b>The figure below illustrates the difference between these two arrangements.</b>

# <center> <img alt="Transformer layer normalization" height="500" caption="Different arrangements of layer normalization in a transformer encoder layer" src="https://drive.google.com/uc?export=view&id=1qNT-xpVik7gjM5x-4s9C9he8KrFPURph" id="layer-norm" width = 800>

<font size = 5 color = seagreen>
<b>
    <center> Use the second arrangement, so we can simply stick together our building blocks as follows:

In [None]:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
        self.attention = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)

    def forward(self, x):
        # Apply layer normalization and then copy input into query, key, value
        hidden_state = self.layer_norm_1(x)
        # Apply attention with a skip connection
        x = x + self.attention(hidden_state)
        # Apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.layer_norm_2(x))
        return x

<font size = 5 color = seagreen>
<b>
    <center> Testing this with our input embeddings:

In [None]:
encoder_layer = TransformerEncoderLayer(config)
inputs_embeds.shape, encoder_layer(inputs_embeds).size()

<div class="alert alert-block alert-success">
    
- <font size = 4>The initial implementation of the transformer encoder layer is complete.<br>
- <font size = 4>However, there is a limitation in the way the encoder layers are configured: <br>
    > <font size = 4>**They have no notion of token order: shuffling the input tokens simply shuffles the outputs in the same way.** <br>
- <font size = 4>As the multi-head attention layer operates as a sophisticated weighted sum, it results in the loss of information regarding token positions. <br>
> <font size = 4> **Fortunately, there's a straightforward solution to incorporate positional information by utilizing positional embeddings.**
    <font size = 4> <br>
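
The lack of positional awareness can be demonstrated with the simplified self-attention from earlier (query = key = value). The sketch below is plain Python with hypothetical 2-dimensional embeddings: it reorders the tokens and checks that every token still receives exactly the same output vector.

```python
import math


def self_attention(embeddings):
    """Simplified self-attention with query = key = value = embeddings."""
    dim = len(embeddings[0])
    outputs = []
    for query in embeddings:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in embeddings]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[d] for w, v in zip(weights, embeddings))
                        for d in range(dim)])
    return outputs


tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # hypothetical embeddings
shuffled = [tokens[2], tokens[0], tokens[1]]   # same tokens, different order

original_out = self_attention(tokens)
shuffled_out = self_attention(shuffled)

# Reorder the original outputs in the same way and compare element by element.
expected = [original_out[2], original_out[0], original_out[1]]
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_a, row_b in zip(expected, shuffled_out)
    for a, b in zip(row_a, row_b)
)
print(same)
```

Because the output for a token does not depend on where it sits in the sequence, word order carries no signal unless positional information is added to the embeddings.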
    

###### <a id = 'p2.7'>
###### <font size = 6 color = 'pwdrblue'> **Positional embeddings**

<div class="alert alert-block alert-success">
    
<font size = 4> **Positional embeddings employ a straightforward yet impactful concept:**
  
>* <font size = 4>They enhance token embeddings with a position-dependent pattern arranged in a vector. <br>
>* <font size = 4>By ensuring a distinct pattern for each position, attention heads and feed-forward layers in each stack can learn to integrate positional information into their transformations.

<div class="alert alert-block alert-success">

- <font size =4>A prevalent approach involves using a learnable pattern, particularly with a substantial pre-training dataset.
- <font size =4>This mirrors token embeddings but utilizes the position index instead of the token ID as input.
- <font size =4>This approach efficiently learns to encode token positions during pre-training.

<font size = 5 color = seagreen>
<b>
    <center> Let's craft a custom Embeddings module

<div class="alert alert-block alert-success">
<font size = 4>
    
- A token embedding layer projects input_ids to a dense hidden state.
- A positional embedding layer does the same for position_ids.
- The resulting embedding is simply the sum of both embeddings.

In [None]:
class Embeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embeddings = nn.Embedding(config.vocab_size,
                                             config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings,
                                                config.hidden_size)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids):
        # Create position IDs for input sequence
        seq_length = input_ids.size(1)
        position_ids = torch.arange(
            seq_length, dtype=torch.long, device=input_ids.device
        ).unsqueeze(0)
        # Create token and position embeddings
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        # Combine token and position embeddings
        embeddings = token_embeddings + position_embeddings
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

In [None]:
embedding_layer = Embeddings(config)
embedding_layer(inputs.input_ids).size()

<div class="alert alert-block alert-success">
<font size = 4>

**Observations:**
- The embedding layer now generates a singular, dense embedding per token.
- Learnable position embeddings are widely used, but there are alternatives:
    - **Absolute positional representations:** Employ static sine and cosine signal patterns to encode token positions, which works well when little data is available.
    - **Relative positional representations:** Encode the relative positions between tokens rather than their absolute positions. This requires modifying the attention mechanism itself to incorporate relative position terms, as seen in models like DeBERTa.
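
The static sine and cosine patterns mentioned above can be sketched in a few lines. This follows the formulation from the original Transformer paper, $PE_{(pos,\,2i)} = \sin(pos / 10000^{2i/d})$ and $PE_{(pos,\,2i+1)} = \cos(pos / 10000^{2i/d})$; the function name and the small sizes are assumptions made for the example, and it is not the approach used by the `Embeddings` module in this notebook.

```python
import math


def sinusoidal_position_encoding(seq_len, dim):
    """Fixed sine/cosine position patterns (assumes an even `dim`)."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row.append(math.sin(angle))  # even index
            row.append(math.cos(angle))  # odd index
        table.append(row)
    return table


encoding = sinusoidal_position_encoding(seq_len=5, dim=8)
print([round(v, 3) for v in encoding[0]])
print([round(v, 3) for v in encoding[1]])
```

Each position receives a distinct vector without any trainable parameters, which is why this variant is attractive when little training data is available.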

<font size = 5 color = seagreen>
<b>
    <center> Now, let's integrate these components to construct the complete transformer encoder, combining embeddings with encoder layers.

In [None]:
class TransformerEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embeddings = Embeddings(config)
        self.layers = nn.ModuleList([TransformerEncoderLayer(config)
                                     for _ in range(config.num_hidden_layers)])

    def forward(self, x):
        x = self.embeddings(x)
        for layer in self.layers:
            x = layer(x)
        return x

<font size = 5 color = seagreen>
<b>
    <center> Check the output shapes of the encoder:

In [None]:
encoder = TransformerEncoder(config)
encoder(inputs.input_ids).size()

<div class="alert alert-block alert-success">

- <font size =4> A hidden state is obtained for each token in the batch.<br>
- <font size =4> This offers flexibility for diverse applications such as predicting missing tokens in masked language modeling or determining the start and end position of an answer in question answering.

[top](#p0)

###### <a id = 'p2.8'>
###### <font size = 6 color = 'pwdrblue'> **Adding a classification head**

<div class="alert alert-block alert-success">

<font size = 4>

- The Transformer model comprises a <b>task-independent body</b> and a <b>task-specific head</b>.<br>
    
> What we have built so far is the body. <br>
    
- <font size = 4> To construct a text classifier, append a classification head to this body. <br>
- <font size = 4> A hidden state is available for each token, so there are several ways to decide which ones feed the prediction. <br>
- <font size = 4> Typically, the hidden state of the first token is used for prediction, passed through dropout and a linear layer.<br>
- <font size = 4> The following class extends the existing encoder for sequence classification.

In [None]:
class TransformerForSequenceClassification(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.encoder = TransformerEncoder(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, x):
        x = self.encoder(x)[:, 0, :] # select hidden state of [CLS] token
        x = self.dropout(x)
        x = self.classifier(x)
        return x
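
The class above pools by taking the first token's hidden state. One common alternative is to average the hidden states of all non-padding tokens. The sketch below assumes an `attention_mask` tensor with 1 for real tokens and 0 for padding; the encoder in this notebook does not take such a mask, so both the mask and the helper name `mean_pool` are hypothetical.

```python
import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token hidden states, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # [batch, seq_len, 1]
    summed = (hidden_states * mask).sum(dim=1)                   # [batch, hidden_size]
    counts = mask.sum(dim=1).clamp(min=1.0)                      # avoid division by zero
    return summed / counts

hidden_states = torch.randn(2, 4, 8)           # stand-in for encoder output
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [1, 1, 0, 0]])  # second example has two padding tokens
pooled = mean_pool(hidden_states, attention_mask)
print(pooled.shape)  # torch.Size([2, 8])
```

The pooled tensor has the same shape as the first-token slice, so it could be passed to the same dropout and linear layers.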

<font size = 5 color = seagreen>
<b>
    <center> Define the number of classes we would like to predict before initializing the model:

In [None]:
config.num_labels = 3
encoder_classifier = TransformerForSequenceClassification(config)
encoder_classifier(inputs.input_ids).size()

<div class="alert alert-block alert-info">
<font size = 4>
    
<center> <b>This is precisely what we've sought.</b>
    

For every example in the batch, the output contains unnormalized logits for each class.<br>
With this, our examination of the encoder and its integration with a task-specific head is concluded.

### <center> **Let's talk about decoders.**

[top](#p0)

## <a id = 'p3'>

<font size = 10 color = 'midnightblue'> **The Decoder**

<div class="alert alert-block alert-success">
    
<font size = 4> In the illustration, the decoder, distinct from the encoder, incorporates two attention sublayers:

<font size = 4>
    
> **1. Masked multi-head self-attention layer:**<br>
    > - Ensures generation of tokens at each timestep is solely based on past outputs and the current predicted token. <br>
    > - This prevents the decoder from trivially copying target translations during training.<br>

> **2. Encoder-decoder attention layer:**<br>
    > - Conducts multi-head attention over the encoder stack's output key and value vectors, using the intermediate representations of the decoder as queries.
    > - This enables learning the relationship between tokens from different sequences, such as different languages.
    > - The decoder accesses encoder keys and values in each block.
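
To show how queries from the decoder meet keys and values from the encoder, here is a minimal single-head sketch. The class name `CrossAttentionHead` and all sizes are hypothetical; a full layer would run several such heads in parallel and add an output projection, which this sketch leaves out.

```python
from math import sqrt
import torch
import torch.nn.functional as F
from torch import nn

class CrossAttentionHead(nn.Module):
    """Single-head encoder-decoder attention (hypothetical sketch)."""
    def __init__(self, embed_dim: int, head_dim: int):
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)  # applied to decoder states
        self.k = nn.Linear(embed_dim, head_dim)  # applied to encoder output
        self.v = nn.Linear(embed_dim, head_dim)  # applied to encoder output

    def forward(self, decoder_state, encoder_output):
        query = self.q(decoder_state)    # [batch, tgt_len, head_dim]
        key = self.k(encoder_output)     # [batch, src_len, head_dim]
        value = self.v(encoder_output)   # [batch, src_len, head_dim]
        scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(query.size(-1))
        weights = F.softmax(scores, dim=-1)  # [batch, tgt_len, src_len]
        return torch.bmm(weights, value)     # [batch, tgt_len, head_dim]

encoder_output = torch.randn(1, 7, 16)  # 7 source tokens
decoder_state = torch.randn(1, 3, 16)   # 3 target tokens generated so far
head = CrossAttentionHead(embed_dim=16, head_dim=4)
out = head(decoder_state, encoder_output)
print(out.shape)  # torch.Size([1, 3, 4])
```

Note that the source and target lengths differ, and that no causal mask is applied here: every target position may attend to the whole source sequence.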

<div class="alert alert-block alert-info">

<font size = 4>
    
> * The necessary modifications for masking in the self-attention layer are explored below, while the implementation of the encoder-decoder attention layer is left as an exercise.
> * Masked self-attention introduces a mask matrix with ones on and below the diagonal and zeros above it.

# <center> <img alt="Transformer decoder zoom" caption="Zooming into the transformer decoder layer" src="https://drive.google.com/uc?export=view&id=1Dmk3n1nkQoiHPlDBYyQfc4pvV20I4X47" id="decoder-zoom" width = 800>

In [None]:
seq_len = inputs.input_ids.size(-1)
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)
mask[0]

<div class="alert alert-block alert-info">
    <font size = 4> <b>Here,</b><br>

> - PyTorch's tril() function is employed to generate the lower triangular matrix. <br>
> - With this mask matrix, each attention head can be restricted from observing future tokens by utilizing Tensor.masked_fill() to replace zeros with negative infinity.

In [None]:
scores.masked_fill(mask == 0, -float("inf"))

# <center> <img src = "https://drive.google.com/uc?export=view&id=1xQXW0Wp7cVRisDiwpgJBDpVNNFmjc2FM" height = 1100 width = 800 >

<div class="alert alert-block alert-info">
    
<font size = 4> **By assigning negative infinity to the upper values,**
    
<font size = 4>
    
> - The attention weights at those positions become zero after applying softmax to the scores, since $e^{-\infty} = 0$.
> - This masking behavior can be incorporated with a slight modification to our previously implemented scaled dot-product attention function in this notebook.

In [None]:
def scaled_dot_product_attention(query, key, value, mask=None):
    dim_k = query.size(-1)
    # Scaled similarity between every query and every key
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    if mask is not None:
        # Block masked positions (mask == 0) so softmax assigns them zero weight
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights.bmm(value)
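
The effect of the mask can be checked with a small experiment. The sketch below uses a hypothetical variant, `masked_attention`, that performs the same computation as the function above but also returns the attention weights so they can be inspected. The input is random and the sizes are arbitrary.

```python
from math import sqrt
import torch
import torch.nn.functional as F

def masked_attention(query, key, value, mask=None):
    """Same computation as above, but also returns the attention weights."""
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(query.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights.bmm(value), weights

torch.manual_seed(0)
seq_len, dim = 4, 8
x = torch.randn(1, seq_len, dim)
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)
out, weights = masked_attention(x, x, x, mask=mask)

# Every weight above the diagonal should be zero, and each row should still sum to one
upper = torch.triu(weights[0], diagonal=1)
print(upper.abs().sum().item(), weights.sum(dim=-1))

# Changing the last token should not change the outputs at earlier positions
x_changed = x.clone()
x_changed[0, -1] = torch.randn(dim)
out_changed, _ = masked_attention(x_changed, x_changed, x_changed, mask=mask)
print(torch.allclose(out[0, :-1], out_changed[0, :-1]))
```

The second check is the property that matters for generation: the representation at each position depends only on that position and the ones before it.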

<a id = 'p4'>

<font size = 10 color = 'midnightblue'> **Summary**

<div class="alert alert-block alert-info">
    
<font size = 4> **In this notebook, three primary transformer model architectures have been presented:**

<font size = 4>
    
> <b>1. encoders, <br>
2. decoders, and <br>
3. encoder-decoders. </b>

<div class="alert alert-block alert-info">
    
<font size = 4>
    
* The success of early transformer models sparked extensive development, with researchers creating models for various datasets, using different pre-training objectives, and adjusting architectures for better performance.
* Although the range of models keeps expanding, most of them can still be broadly categorized into these three groups.
    </div>
    <font size = 4> <b>The diagram below highlights a few of the architectural milestones.</b>

## <center> <img alt="Transformer family tree" caption="An overview of some of the most prominent transformer architectures" src="https://drive.google.com/uc?export=view&id=18G8p9uJMud3QHo2X5dDpn0WGczcMAd79" id="family-tree" width = 700 height = 700>