<center><p float="center">
  <img src="https://upload.wikimedia.org/wikipedia/commons/e/e9/4_RGB_McCombs_School_Brand_Branded.png" width="300" height="100"/>
  <img src="https://mma.prnewswire.com/media/1458111/Great_Learning_Logo.jpg?p=facebook" width="200" height="100"/>
</p></center>

<h1><center><font size=8>Transformer Deep Dive</font></center></h1>
<h1><center>Attention, Encoder, Decoder, Architecture</center></h1>
<h3><center>Charlcye Mitchell, April 2024</center></h3>

# **Building the Transformer Model in PyTorch**

In this notebook, **we will use PyTorch** to implement the building blocks of the Encoder and the Decoder, and then put the two stages together to create the Transformer.

**PyTorch is preferred for this implementation due to the increased flexibility it offers in creating custom functions and classes for advanced Deep Learning**, as opposed to TensorFlow. In the rest of this code notebook, we will attempt to explain the code in addition to the mathematical operations being performed for the Transformer.

## **1. Importing the Libraries**

In [None]:
import torch
import torch.nn as nn
import numpy as np

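The **nn** subpackage of PyTorch provides the neural-network building blocks used in this notebook: layers such as nn.Linear, nn.Embedding, nn.LayerNorm and nn.Dropout, and the nn.Module base class that all of our classes inherit from.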

## **2. The Self-Attention Block**

- The Self-Attention Block is the first class we define in this notebook.
- It computes the attention of each word with every other word in the sentence, exactly as discussed for the Encoder layer.
- Because we are working in PyTorch, we are free to define the computation in the manner required; here we implement Multi-head Attention from scratch.
- `__init__()` is the class constructor. It receives the parameters of the block (embed_size and heads in this example) and creates the layers that each instance of the class will own.
- In the rest of the initialization, we define the **W_k, W_q and W_v** matrices, which get multiplied with the embeddings to form our **K, Q and V vectors.**
- Each of these layers outputs vectors of dimension **embedding size * number of heads**, so that a single multiplication produces the K, Q & V vectors for every head of Multi-head Attention (as opposed to just Self-Attention for one head). Note that each head here keeps the full embedding size, rather than embedding size / number of heads as in the original Transformer paper.
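
To make the shapes in the last bullet concrete, here is a minimal sketch of how one projection layer yields the queries for all heads at once. The sizes (embed_size = 8, heads = 2, a batch of one 3-token sentence) are illustrative assumptions, not values used elsewhere in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embed_size, heads = 8, 2          # small illustrative sizes (assumed)
N, seq_len = 1, 3                 # one sentence of three tokens (assumed)

embeddings = torch.randn(N, seq_len, embed_size)

# One linear layer produces the queries for all heads in a single multiplication.
W_q = nn.Linear(embed_size, heads * embed_size)
Q = W_q(embeddings)               # (N, seq_len, heads * embed_size)

# Splitting the last dimension gives one embed_size-wide query per head.
Q_heads = Q.reshape(N, seq_len, heads, embed_size)

print(Q.shape)        # torch.Size([1, 3, 16])
print(Q_heads.shape)  # torch.Size([1, 3, 2, 8])
```

The same projection and reshape are applied to the keys and values.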

In [None]:
class SelfAttentionBlock(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttentionBlock, self).__init__()
        self.embed_size = embed_size
        self.heads = heads


        self.W_v = nn.Linear(embed_size, self.heads*embed_size)
        self.W_k = nn.Linear(embed_size, self.heads*embed_size)
        self.W_q = nn.Linear(embed_size, self.heads*embed_size)
        self.fc = nn.Linear(self.heads*embed_size, embed_size)

    def forward(self, embeddings, mask):
        # Get number of training examples
        N = embeddings.shape[0]

        v_len, k_len, q_len = embeddings.shape[1], embeddings.shape[1], embeddings.shape[1]

        V = self.W_v(embeddings)  # (N, value_len, heads*embed_size)
        K = self.W_k(embeddings)  # (N, key_len, heads*embed_size)
        Q = self.W_q(embeddings)  # (N, query_len, heads*embed_size)


        # Split the embedding into self.heads different pieces
        V = V.reshape(N, v_len, self.heads, self.embed_size)
        K = K.reshape(N, k_len, self.heads, self.embed_size)
        Q = Q.reshape(N, q_len, self.heads, self.embed_size)

        qk = torch.einsum("nqhd,nkhd->nhqk", [Q, K])
        # queries shape: (N, query_len, heads, embed_size),
        # keys shape: (N, key_len, heads, embed_size)
        # energy: (N, heads, query_len, key_len)

        # Mask padded indices so their weights become 0
        if mask is not None:
            qk = qk.masked_fill(mask == 0, float("-1e20"))

        attention = torch.softmax(qk / (self.embed_size ** (1 / 2)), dim=3)
        # attention shape: (N, heads, query_len, key_len)

        output = torch.einsum("nhql,nlhd->nqhd", [attention, V]).reshape(
            N, q_len, self.heads * self.embed_size
        )
        # attention shape: (N, heads, query_len, key_len)
        # values shape: (N, value_len, heads, embed_size)
        # out after matrix multiply: (N, query_len, heads, embed_size), then
        # we reshape and flatten the last two dimensions.

        out = self.fc(output)

        return out


- **In the forward() function**, we perform the computations of the Attention mechanism. We take the dot product of each word's Query with the Keys of every word in the sentence (including the word itself).
- We then scale these scores by dividing by the square root of the embedding size, and apply softmax along the dimension that indexes the Keys. The resulting attention weights add to 1 for each Query, and they are used to form a weighted sum of the Value vectors of the words in the sentence.
- It's important to note from the code that **this is all accomplished through matrix multiplication**, where the dimensions of the queries, keys and values are arranged such that we directly get the final Multi-head Attention matrix. **This is what the einsum() function is helping us do in the code.**
- In the SelfAttentionBlock class, we also defined a fully connected or **fc layer**, which projects the concatenated output of all heads (of size heads * embed_size) back to a single vector of size embed_size per word.
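
To see what the two einsum() calls compute, the sketch below compares them with an explicit permute-and-matmul version on random tensors. The tensor sizes are illustrative assumptions; the einsum strings are the ones used in forward().

```python
import torch

torch.manual_seed(0)
N, q_len, k_len, heads, d = 2, 4, 5, 3, 8    # illustrative sizes (assumed)

Q = torch.randn(N, q_len, heads, d)
K = torch.randn(N, k_len, heads, d)
V = torch.randn(N, k_len, heads, d)

# einsum form: dot product over d for every (query, key) pair, per head
scores_einsum = torch.einsum("nqhd,nkhd->nhqk", Q, K)

# Equivalent form: move heads next to the batch dimension, then batched matmul
Qh = Q.permute(0, 2, 1, 3)                   # (N, heads, q_len, d)
Kh = K.permute(0, 2, 1, 3)                   # (N, heads, k_len, d)
Vh = V.permute(0, 2, 1, 3)                   # (N, heads, k_len, d)
scores_matmul = Qh @ Kh.transpose(-2, -1)    # (N, heads, q_len, k_len)

# Scale, then softmax over the key dimension
attention = torch.softmax(scores_einsum / d ** 0.5, dim=3)

# Weighted sum of the values, again in both forms
out_einsum = torch.einsum("nhql,nlhd->nqhd", attention, V)
out_matmul = (attention @ Vh).permute(0, 2, 1, 3)   # back to (N, q_len, heads, d)

print(torch.allclose(scores_einsum, scores_matmul, atol=1e-5))
print(torch.allclose(out_einsum, out_matmul, atol=1e-5))
print(out_einsum.shape)
```

Both comparisons should agree up to floating-point tolerance, which shows that einsum() is only a compact way of writing the batched matrix products.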

**Note:** There is an additional PyTorch technicality to be aware of. SelfAttentionBlock inherits from nn.Module, which defines a `__call__` method that invokes forward(). This is why calling an instance of the class like a function, for example block(embeddings, mask), runs the forward pass. The call super(SelfAttentionBlock, self).`__init__()` runs the nn.Module constructor, which sets up the bookkeeping PyTorch needs to register the layers and parameters we assign afterwards.
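
As a small illustration of this technicality, the sketch below defines a toy module (the Doubler class is hypothetical and not part of the Transformer) and shows that calling the instance runs forward(), and that parameters assigned after the nn.Module constructor has run are registered automatically.

```python
import torch
import torch.nn as nn


class Doubler(nn.Module):              # hypothetical toy module, for illustration only
    def __init__(self):
        super().__init__()             # runs the nn.Module constructor
        self.scale = nn.Parameter(torch.tensor(2.0))

    def forward(self, x):
        return self.scale * x


layer = Doubler()
x = torch.tensor([1.0, 2.0, 3.0])

y_call = layer(x)                      # nn.Module.__call__ dispatches to forward()
y_forward = layer.forward(x)           # same computation, but bypasses any registered hooks

param_names = [name for name, _ in layer.named_parameters()]

print(torch.equal(y_call, y_forward))  # True
print(param_names)                     # ['scale']
```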

**Note 2:** The forward() function takes an optional mask argument. In this notebook it is only used when Self-Attention runs inside the Decoder block (the Encoder is called with mask set to None). We will see later how that works.

## **3. The Encoder Block**

In [None]:
class EncoderBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout):
        super(EncoderBlock, self).__init__()
        self.attention_mechanism = SelfAttentionBlock(embed_size, heads)
        self.normalization1 = nn.LayerNorm(embed_size)
        self.normalization2 = nn.LayerNorm(embed_size)

        self.feed_forward_layers = nn.Sequential(
            nn.Linear(embed_size, 4 * embed_size),
            nn.ReLU(),
            nn.Linear(4 * embed_size, embed_size),
        )

        self.dropout_layer = nn.Dropout(dropout)

    def forward(self, embeddings, mask):
        attention_output = self.attention_mechanism(embeddings, mask)

        # Add skip connection, run through normalization and finally dropout
        x = self.dropout_layer(self.normalization1(attention_output + embeddings))
        forward = self.feed_forward_layers(x)
        output = self.dropout_layer(self.normalization2(forward + x))
        return output


- The **EncoderBlock** class contains the layers with which we can complete the rest of the Encoder block.
- We first instantiate the Self-Attention mechanism we built earlier, and then add a few LayerNorm, Dropout and Linear layers.
- The order of these layers is defined in the forward() function: we first apply Self-Attention to the input embeddings of the EncoderBlock (which already contain the sum of the positional encodings and word embeddings).
- The attention output is then added to the initial embeddings through a skip connection (Skip & Add), followed by Layer Normalization, and finally Dropout. This sequence matches the Encoder block picture we have already seen.
- Finally, there's a Feed-Forward Neural Network of 2 fully-connected layers (with the first layer having a ReLU activation function).
- The first linear layer scales up the dimension to 4 times the embedding size (the number 4 is just a hyperparameter), and the second layer shrinks the dimensions back to the embedding size.
- There is a lot of research around why Neural Networks are shaped in this manner (expansion + contraction), but suffice it to say, **this shape helps Neural Nets learn high-quality representations of their inputs.**
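
The expansion and contraction described above can be traced shape by shape. The following is a minimal sketch with an assumed embedding size of 16 and a random input; it applies the same kinds of layers as the EncoderBlock, without Dropout.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_size = 16                                  # illustrative size (assumed)
x = torch.randn(2, 5, embed_size)                # (N, seq_len, embed_size)

expand = nn.Linear(embed_size, 4 * embed_size)   # scale up to 4x the embedding size
contract = nn.Linear(4 * embed_size, embed_size) # shrink back to the embedding size
norm = nn.LayerNorm(embed_size)

hidden = torch.relu(expand(x))                   # (2, 5, 64)
ffn_out = contract(hidden)                       # (2, 5, 16)
y = norm(ffn_out + x)                            # skip connection, then Layer Normalization

print(hidden.shape)   # torch.Size([2, 5, 64])
print(ffn_out.shape)  # torch.Size([2, 5, 16])
print(y.shape)        # torch.Size([2, 5, 16])

# LayerNorm rescales every token vector to (approximately) zero mean and unit variance
token_means = y.mean(dim=-1)
token_vars = y.var(dim=-1, unbiased=False)
```

Because the block's output has the same shape as its input, Encoder blocks can be stacked one after another.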

## **4. The Encoder Stack**

In [None]:
class EncoderStack(nn.Module):
    def __init__(
        self,
        source_vocab_size,
        embed_size,
        num_layers,
        heads,
        device,
        dropout,
        max_length,
    ):

        super(EncoderStack, self).__init__()
        self.embed_size = embed_size
        self.device = device
        self.words_embedding = nn.Embedding(source_vocab_size, embed_size)
        self.positional_embedding = nn.Embedding(max_length, embed_size)

        self.EncoderBlocklayers = nn.ModuleList(
            [
                EncoderBlock(
                    embed_size,
                    heads,
                    dropout=dropout
                )
                for _ in range(num_layers)
            ]
        )

        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        N, seq_length = x.shape
        # Position indices 0, 1, ..., seq_length - 1, repeated for every sentence in the batch.
        # They are integer indices into the learnable positional embedding table.
        positional_encoding = torch.arange(0, seq_length).expand(N, seq_length)
        positional_encoding = positional_encoding.to(self.device)
        output = self.dropout(
            (self.words_embedding(x) + self.positional_embedding(positional_encoding))
        )

        # In the Encoder, the queries, keys and values are all computed from the
        # same input; this changes in the Decoder's Encoder-Decoder Attention.


        for each_layer in self.EncoderBlocklayers:
            output = each_layer(output, mask)

        return output

- In this code block, we are creating the **Encoder Stack** of the Transformer, **containing as many Encoder blocks as defined** by the num_layers parameter of `__init__()`.
- We are also defining the Word Embedding and Positional Embedding layers in this step.
- **The nn.Embedding() layer is a trainable lookup table (it has learnable parameters) that returns one vector for each index it receives, with the same index always mapping to the same vector.** We use this layer to create two kinds of embeddings: the Word Embeddings & Position Embeddings.
- These two embeddings are added and then fed to the layers that we built in the EncoderBlock() class.
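
The lookup-table behaviour can be checked on a tiny example. This is a sketch under assumed sizes (a vocabulary of 10 tokens, vectors of size 4), and it assumes that positions are indexed 0, 1, 2, ... along the sentence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, max_length, embed_size = 10, 100, 4   # illustrative sizes (assumed)

words_embedding = nn.Embedding(vocab_size, embed_size)
positional_embedding = nn.Embedding(max_length, embed_size)

tokens = torch.tensor([[2, 4, 2]])                # token 2 appears at positions 0 and 2
N, seq_length = tokens.shape
positions = torch.arange(seq_length).expand(N, seq_length)   # [[0, 1, 2]]

word_vecs = words_embedding(tokens)               # (1, 3, 4)
combined = word_vecs + positional_embedding(positions)

# Same token -> same word vector ...
same_word_vector = torch.equal(word_vecs[0, 0], word_vecs[0, 2])
# ... but a different vector once the position embedding is added
same_after_position = torch.equal(combined[0, 0], combined[0, 2])

print(same_word_vector)
print(same_after_position)
```

Adding the position embedding is what lets the attention layers tell apart two occurrences of the same word.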

## **5. The Encoder-Decoder Attention Layer**

In [None]:
class Encoder_Decoder_Attention(nn.Module):
    def __init__(self, embed_size, heads):
        super(Encoder_Decoder_Attention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads


        self.W_Encoder_Decoder_values = nn.Linear(embed_size, self.heads*embed_size)
        self.W_Encoder_Decoder_keys = nn.Linear(embed_size, self.heads*embed_size)
        self.W_self_queries = nn.Linear(embed_size, self.heads*embed_size)
        self.fc = nn.Linear(self.heads*embed_size, embed_size)

    def forward(self, embeddings,encoder_outputs):
        # Get number of training examples
        N = embeddings.shape[0]

        v_len, k_len, q_len = encoder_outputs.shape[1], encoder_outputs.shape[1], embeddings.shape[1]

        V = self.W_Encoder_Decoder_values(encoder_outputs)  # (N, value_len, heads*embed_size)
        K = self.W_Encoder_Decoder_keys(encoder_outputs)  # (N, key_len, heads*embed_size)
        Q = self.W_self_queries(embeddings)  # (N, query_len, heads*embed_size)


        # Split the embedding into self.heads different pieces
        V = V.reshape(N, v_len, self.heads, self.embed_size)
        K = K.reshape(N, k_len, self.heads, self.embed_size)
        Q = Q.reshape(N, q_len, self.heads, self.embed_size)

        qk = torch.einsum("nqhd,nkhd->nhqk", [Q, K])
        # queries shape: (N, query_len, heads, embed_size),
        # keys shape: (N, key_len, heads, embed_size)
        # energy: (N, heads, query_len, key_len)

        attention = torch.softmax(qk / (self.embed_size ** (1 / 2)), dim=3)
        # attention shape: (N, heads, query_len, key_len)

        output = torch.einsum("nhql,nlhd->nqhd", [attention, V]).reshape(
            N, q_len, self.heads * self.embed_size
        )
        # attention shape: (N, heads, query_len, key_len)
        # values shape: (N, value_len, heads, embed_size)
        # out after matrix multiply: (N, query_len, heads, embed_size), then
        # we reshape and flatten the last two dimensions.

        out = self.fc(output)

        return out

- **The Encoder-Decoder Attention layer is the one layer** that Decoders have in addition to what is already in the Encoder layer.
- We have declared a Wq matrix that converts the embeddings on the Decoder side into queries. However, **the Wk and Wv matrices we declare here are different from before, as they operate on the Encoder's outputs and not on the embeddings received from the previous layer on the Decoder side.**
- **The Queries are computed from the embeddings input into the layer, whereas the Keys and Values are computed from the final Encoder outputs.** The source and target sentences may therefore have different lengths.
- The rest is the same as the Self-Attention mechanism, except that no mask is applied here.
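
One consequence of taking Queries and Keys from different sequences is that the attention matrix is no longer square. The sketch below uses random tensors in place of the projected Q, K and V, with assumed lengths of 6 source tokens and 3 target tokens.

```python
import torch

torch.manual_seed(0)
N, heads, d = 1, 2, 4                    # illustrative sizes (assumed)
src_len, trg_len = 6, 3                  # Encoder and Decoder sequence lengths (assumed)

Q = torch.randn(N, trg_len, heads, d)    # stands in for queries from the Decoder side
K = torch.randn(N, src_len, heads, d)    # stands in for keys from the Encoder outputs
V = torch.randn(N, src_len, heads, d)    # stands in for values from the Encoder outputs

scores = torch.einsum("nqhd,nkhd->nhqk", Q, K)
attention = torch.softmax(scores / d ** 0.5, dim=3)
context = torch.einsum("nhql,nlhd->nqhd", attention, V)

print(attention.shape)  # torch.Size([1, 2, 3, 6])
print(context.shape)    # torch.Size([1, 3, 2, 4])
```

Each of the 3 target positions gets a weight for each of the 6 source positions, and the output has one vector per target position.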

## **6. The Decoder Block**

In [None]:
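class DecoderBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, device):
        super(DecoderBlock, self).__init__()
        self.normalization1 = nn.LayerNorm(embed_size)
        self.normalization2 = nn.LayerNorm(embed_size)
        self.normalization3 = nn.LayerNorm(embed_size)
        self.attention_layer = SelfAttentionBlock(embed_size, heads=heads)
        self.EncoderDecoderAttention = Encoder_Decoder_Attention(embed_size, heads=heads)
        # Feed-forward sub-layer, with the same shape as the one in EncoderBlock
        self.feed_forward_layers = nn.Sequential(
            nn.Linear(embed_size, 4 * embed_size),
            nn.ReLU(),
            nn.Linear(4 * embed_size, embed_size),
        )
        self.dropoutlayer1 = nn.Dropout(dropout)
        self.dropoutlayer2 = nn.Dropout(dropout)
        self.dropoutlayer3 = nn.Dropout(dropout)

    def forward(self, x, encoder_outputs, src_mask, trg_mask):
        # 1) Masked Self-Attention over the target sequence, then Add & Norm
        attention_output = self.attention_layer(x, trg_mask)
        out = self.dropoutlayer1(self.normalization1(attention_output + x))
        # 2) Encoder-Decoder Attention (queries from the Decoder, keys and
        #    values from the Encoder outputs), then Add & Norm
        EncoderDecoderAttentionoutput = self.EncoderDecoderAttention(out, encoder_outputs)
        out2 = self.dropoutlayer2(self.normalization2(EncoderDecoderAttentionoutput + out))
        # 3) Feed-forward network, then Add & Norm
        forward = self.feed_forward_layers(out2)
        out3 = self.dropoutlayer3(self.normalization3(forward + out2))
        return out3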

- The Decoder Block is the counterpart to the Encoder Block we defined before.
- Since the Encoder and Decoder blocks are very similar, the Decoder block is assembled from the same kinds of layers: Self-Attention, Layer Normalization and Dropout.
- The difference is that a Decoder block implements both Self-Attention and Encoder-Decoder Attention.
- We pass the trg_mask (Target Mask) to the Self-Attention block so that a position cannot look at the time steps ahead of it during Training. The mask does this by setting the attention scores of those future positions to a very large negative number (-1e20), so that their softmax weights become effectively zero. The word embeddings themselves are not modified.
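
The effect of the Target Mask on the attention weights can be seen on a small example. This sketch assumes a target length of 4 and uses all-zero raw scores, so that every visible position receives the same weight.

```python
import torch

trg_len = 4                                              # illustrative length (assumed)

# Lower-triangular mask: position i may attend to positions 0..i only
causal_mask = torch.tril(torch.ones(trg_len, trg_len))

scores = torch.zeros(trg_len, trg_len)                   # equal raw scores, for readability
masked_scores = scores.masked_fill(causal_mask == 0, float("-1e20"))
weights = torch.softmax(masked_scores, dim=-1)

print(causal_mask)
print(weights)
# Row 0 attends only to itself; row 3 spreads its weight over all four positions
```

Every entry above the diagonal of the weight matrix is zero, so no position receives information from a later time step.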

## **7. The Decoder Stack**

In [None]:
class DecoderStack(nn.Module):
    def __init__(
        self,
        target_vocab_size,
        embed_size,
        num_layers,
        heads,
        dropout,
        device,
        max_length,
    ):
        super(DecoderStack, self).__init__()
        self.device = device
        self.words_embedding = nn.Embedding(target_vocab_size, embed_size)
        self.positional_embedding = nn.Embedding(max_length, embed_size)

        self.DecoderBlocklayers = nn.ModuleList(
            [
                DecoderBlock(embed_size, heads, dropout, device)
                for _ in range(num_layers)
            ]
        )
        self.fc_out = nn.Linear(embed_size, target_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, src_mask, trg_mask):
        N, seq_length = x.shape
        # Position indices 0, 1, ..., seq_length - 1, repeated for every sentence in the batch.
        # They are integer indices into the learnable positional embedding table.
        positional_encoding = torch.arange(0, seq_length).expand(N, seq_length)
        positional_encoding = positional_encoding.to(self.device)
        x = self.dropout((self.words_embedding(x) + self.positional_embedding(positional_encoding)))

        for each_layer in self.DecoderBlocklayers:
            x = each_layer(x, enc_out, src_mask, trg_mask)

        out = self.fc_out(x)

        return out




- The DecoderStack() does what the EncoderStack() was doing, just on the Decoder side of things.
- **There is an additional fully-connected layer (fc_out) at the end**, which maps the embedding at each position to one score per token in the target vocabulary. This performs the classification task of choosing between the tokens / words in the vocabulary.
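
To illustrate the classification step, the sketch below turns the scores of a final linear layer into probabilities and picks the most likely token at each position. The sizes and the random Decoder states are assumptions made for the example; the layer is untrained, so the predictions themselves are meaningless.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_size, target_vocab_size = 8, 10            # illustrative sizes (assumed)

decoder_states = torch.randn(2, 5, embed_size)   # stands in for the Decoder blocks' output
fc_out = nn.Linear(embed_size, target_vocab_size)

logits = fc_out(decoder_states)                  # (N, seq_len, target_vocab_size)
probs = torch.softmax(logits, dim=-1)            # one distribution over the vocabulary per position
predicted_tokens = probs.argmax(dim=-1)          # (N, seq_len), most likely token index

print(logits.shape)            # torch.Size([2, 5, 10])
print(predicted_tokens.shape)  # torch.Size([2, 5])
```

Note that the DecoderStack returns the raw scores (logits); applying softmax is left to the loss function or to the decoding step.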

## **8. The Transformer Architecture**

In [None]:
class TransformerArchitecture(nn.Module):
    def __init__(
        self,
        source_vocab_size,
        target_vocab_size,
        source_pad_idx,
        target_pad_idx,
        embed_size=512,
        num_layers=6,
        heads=8,
        dropout=0,
        device="cpu",
        max_length=100,
    ):

        super(TransformerArchitecture, self).__init__()

        self.encoder = EncoderStack(
            source_vocab_size,
            embed_size,
            num_layers,
            heads,
            device,
            dropout,
            max_length,
        )

        self.decoder = DecoderStack(
            target_vocab_size,
            embed_size,
            num_layers,
            heads,
            dropout,
            device,
            max_length,
        )

        self.source_pad_idx = source_pad_idx
        self.target_pad_idx = target_pad_idx
        self.device = device

    def make_source_mask(self, src):
        src_mask_pads = (src != self.source_pad_idx).unsqueeze(1).unsqueeze(2)
        # (N, 1, 1, src_len)
        return src_mask_pads.to(self.device)

    def make_target_mask(self, trg):
        N, trg_len = trg.shape
        trg_mask_pads = torch.tril(torch.ones((trg_len, trg_len))).expand(
            N, 1, trg_len, trg_len
        )

        return trg_mask_pads.to(self.device)

    def forward(self, src, trg):
        source_mask_pad = self.make_source_mask(src)
        target_mask_pad = self.make_target_mask(trg)
        enc_src = self.encoder(src, None)
        out = self.decoder(trg, enc_src, source_mask_pad, target_mask_pad)
        return out

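- This class puts all the pieces together from the Encoder stack and the Decoder stack.
- We also implement Target masks in make_target_mask(), which builds a lower-triangular matrix so that words ahead of the current time-step are masked when the target sequence is fed to the Self-Attention block in the Decoder.
- A source padding mask is built by make_source_mask() as well, but in this notebook the Encoder is called with its mask set to None, and the source mask passed to the Decoder is not applied by its attention layers.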
In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

x = torch.tensor([[2, 4, 5, 1, 3, 7, 2, 1, 3], [1, 3, 6, 7, 2, 9, 2, 5, 8]]).to(device)
trg = torch.tensor([[0, 3, 5, 4, 1, 3, 2, 5, 8], [2, 3, 1, 0, 5, 9, 4, 9, 7]]).to(device)

source_pad_idx = 0
target_pad_idx = 0
source_vocab_size = 10
target_vocab_size = 10
model = TransformerArchitecture(
    source_vocab_size, target_vocab_size, source_pad_idx, target_pad_idx, device=device
).to(device)
# The Decoder receives the target sequence without its last token
out = model(x, trg[:, :-1])
print(out.shape)

cpu
torch.Size([2, 8, 10])
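
The model is called with trg[:, :-1], so the output has 8 positions for a target of 9 tokens. A common reason for this slicing is that, during training, the prediction at each position is compared with the next target token. The notebook does not include a training step, so the following is only a sketch of that assumed setup: random logits stand in for the model output, and the choice of nn.CrossEntropyLoss with ignore_index is an assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
target_vocab_size, target_pad_idx = 10, 0

trg = torch.tensor([[0, 3, 5, 4, 1, 3, 2, 5, 8],
                    [2, 3, 1, 0, 5, 9, 4, 9, 7]])

decoder_input = trg[:, :-1]     # what the Decoder sees: all tokens but the last
labels = trg[:, 1:]             # what it should predict: the same sequence shifted by one

# Stand-in for model(x, decoder_input): random logits of shape (2, 8, 10)
logits = torch.randn(2, 8, target_vocab_size)

# Cross-entropy over the vocabulary, skipping positions whose label is the pad index
criterion = nn.CrossEntropyLoss(ignore_index=target_pad_idx)
loss = criterion(logits.reshape(-1, target_vocab_size), labels.reshape(-1))

print(decoder_input.shape, labels.shape)
print(loss.item())
```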


## **9. Summary & Conclusions**

- This was a **quick code demonstration** of how the Transformer architecture can be put together from scratch, using the concepts we have discussed about both the Encoder and the Decoder.
- While this notebook was meant to give a code-based understanding of the building blocks of the Transformer model, in actual practice **the industry works with complete, large-scale architectures rather than hand-written blocks.**
- That means, once the fundamentals of the architecture are mastered, it becomes more relevant to understand the large models that use these blocks in multiple ways, and to know how to import pre-trained models and fine-tune them where required. This will be the focus of our deep-dive into modern Transformer architectures.