## Transformers from Scratch



In [None]:
!pip install loguru
import sys
import numpy as np
import torch
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import AutoConfig, AutoModel, AutoTokenizer
from math import sqrt
from torch import nn
from loguru import logger
from typing import Optional
import matplotlib.pyplot as plt

Collecting loguru
  Downloading loguru-0.7.3-py3-none-any.whl.metadata (22 kB)
Downloading loguru-0.7.3-py3-none-any.whl (61 kB)
Installing collected packages: loguru
Successfully installed loguru-0.7.3


## 1. Overview of the Transformer Architecture

<p align="center"><img src="https://github.com/happy-jihye/Natural-Language-Processing/blob/main/images/transformer1.png?raw=1" width = "400" ></p>



### What are transformers used for?
**Transformers generate text!** You feed in language, and the model generates a probability distribution over tokens. And you can repeatedly sample from this to generate text!

Goal of Training a Transformer: You give it a bunch of text, and train it to predict the next token.

Importantly, if you give a model a sequence of 100 tokens, it predicts the next token for *each* prefix: it produces 100 logit vectors, each of which becomes a probability distribution over the vocabulary after a softmax. The `i`-th logit vector describes the distribution over the token *following* the `i`-th token in the sequence.
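
As a minimal sketch of the shapes involved, the snippet below replaces the model with random logits (an assumption made purely for illustration; `seq_len` and `vocab_size` are toy values). It shows one logit vector per position, the softmax that turns each into a distribution, and sampling a next token from the last position.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, vocab_size = 5, 11  # toy sizes, chosen only for illustration

# Stand-in for a model's output: one logit vector per input position.
logits = torch.randn(seq_len, vocab_size)

# Softmax over the vocabulary axis turns each logit vector into a distribution.
probs = F.softmax(logits, dim=-1)

# To continue the text, sample from the distribution at the last position.
next_token_id = torch.multinomial(probs[-1], num_samples=1).item()

print(probs.shape)        # one distribution per position
print(probs.sum(dim=-1))  # each row sums to 1 (up to rounding)
```

During generation the sampled token would be appended to the input and the model run again, which is what "repeatedly sample" refers to.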

### Main Components

### The Encoder
The encoder in a Transformer model is responsible for processing the input sequence, such as a sentence or a document. It consists of a stack (Nx) of encoder layers or "blocks". Each encoder layer receives a sequence of token embeddings, which are representations of the input tokens obtained through tokenization and embedding techniques.
#### Components:

1. Multi-head self-attention layer
-  This layer allows each token to attend to other tokens in the input sequence. It computes attention weights that determine the importance of each token with respect to other tokens. The self-attention mechanism helps the model capture the input sequence's dependencies and relationships between different tokens.
- Because the self-attention mechanism works across the whole input sequence, the encoder is bidirectional by design.
- It lets the input vectors draw on information from one another.
2. Fully Connected feed-forward layer
- After the self-attention layer, the output embeddings from the previous step are passed through a feed-forward neural network layer. This layer applies a non-linear transformation to each input embedding independently. The feed-forward layer introduces additional modeling capacity and helps capture more complex relationships within the input sequence.


The output embeddings of each encoder layer have the same size as the inputs. The role of the encoder stack is to "update" the input embeddings at each layer, gradually incorporating contextual information and capturing higher-level representations of the sequence.

### The Decoder
The decoder in the Transformer model is structurally similar to the encoder, but it has some key differences. The input at each step of the decoder is its own predicted output word from the previous step, similar to an autoregressive model. The input word is embedded and combined with positional encodings, just like in the encoder.

#### Components:
1. Masked multihead self-attention layer:
- This self-attention mechanism in the decoder allows each position in the sequence to attend to preceding positions in the partially generated target sequence. Unlike the encoder's self-attention, which attends to the entire input sequence, **the decoder's self-attention only attends to the preceding sequence elements.**
- This is achieved by applying a mask to the softmax input, setting the corresponding values to -∞, which prevents illegal connections between future positions and the current position being attended to. This masking ensures that the decoder is unidirectional, attending only to the preceding positions.
2. Encoder-decoder attention layer
- This layer allows every position in the decoder to attend over all positions in the input sequence (encoder output).
3. Feed-forward network
- Similar to the encoder, the decoder includes a feed-forward network. This network applies a non-linear transformation to each position's representation independently.
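
To make the encoder-decoder attention layer concrete, here is a small shape sketch. It uses PyTorch's built-in `nn.MultiheadAttention` rather than the module built later in this notebook, and the sizes (`hidden_size`, `src_len`, `tgt_len`) are toy values assumed for illustration. The queries come from the decoder, while the keys and values come from the encoder output.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden_size, num_heads = 16, 4  # toy sizes
src_len, tgt_len = 7, 3

encoder_output = torch.randn(1, src_len, hidden_size)  # output of the encoder stack
decoder_state = torch.randn(1, tgt_len, hidden_size)   # current decoder hidden states

cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

# Queries: decoder positions. Keys and values: encoder positions.
out, weights = cross_attn(query=decoder_state, key=encoder_output, value=encoder_output)

print(out.shape)      # one updated vector per decoder position
print(weights.shape)  # one row of weights over source positions per decoder position
```

The output keeps the decoder's sequence length, and each decoder position gets a weight for every source position, which is why no causal mask is needed in this layer.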

# Building a Transformer

## Preliminaries
### 0. Key, Query, Value
- In the context of attention mechanisms, each element in the input sequence is associated with a query, key, and value vector.
  - Imagine you’re attending a conference where multiple speakers give presentations. Each presentation corresponds to a token in the input sequence. Now, let’s break down the key, query, and value in this context:

  1. **Key**:  The key represents the content or context of each presentation. It captures the main ideas, themes, or relevant information associated with each talk.
  2. **Query**: The query represents the specific topic or question you’re interested in or want to focus on during the conference. It could be a specific area of interest or a particular subject you’re curious about.
  3. **Value**:  The value contains the detailed information, insights, or knowledge provided by each speaker during their presentation.

### 1. Self-Attention
- The self-attention mechanism calculates attention weights that indicate the relevance of each element with respect to the other elements within the same sequence.
- The term “self” in self-attention emphasizes that attention is computed within the same sequence, without considering any external context or other sequences. It highlights the capability of the self-attention mechanism to capture dependencies and relationships between elements within the input sequence itself.
- In the scenario previously described, the attention mechanism allows you to attend to relevant presentations and extract valuable information based on your query. The key vectors help determine which presentations are most relevant to your query, while the query vector represents your specific area of interest or focus. The value vectors contain the detailed content of each presentation.

- The model identifies the most important presentations that align with your interests by calculating attention weights between the query and the keys. It then combines the values of these selected presentations using the attention weights, effectively capturing the relevant information from each presentation based on your query.
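
The conference analogy can be sketched as a soft lookup with hand-picked numbers. The vectors below are invented for illustration, and the dot product is left unscaled to keep the arithmetic easy to follow.

```python
import torch
import torch.nn.functional as F

# Three "presentations": each has a key (what it is about) and a value (its content).
keys = torch.tensor([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
values = torch.tensor([[10.0, 0.0],
                       [0.0, 10.0],
                       [5.0, 5.0]])

# One query: "I mostly care about the first topic."
query = torch.tensor([2.0, 0.0])

scores = keys @ query                # similarity of the query to each key
weights = F.softmax(scores, dim=-1)  # normalized relevance, sums to 1
summary = weights @ values           # weighted mix of the values

print(scores)
print(weights)
print(summary)
```

The first and third keys match the query equally well, so they receive equal weight, and the summary leans toward the first topic.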


#### 1.1. Scaled dot-product attention
- It computes the attention weights between a query vector and a set of key-value pairs by calculating the dot product similarity between them.
$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
- Steps:
  1. Project each token embedding, with dimension $d_m$, into three vectors: a query, a key, and a value.
  2. Compute the attention scores using the dot product similarity ($QK^{T}$). For a sequence of $n$ input tokens, multiplying the queries $Q$ by the transposed keys $K^{T}$ yields an $n \times n$ similarity matrix.
  3. Scale the similarity matrix by dividing it by the square root of the dimensionality of the query/key vectors ($\sqrt{d_k}$). Without this scaling the dot products grow with $d_k$, which pushes the softmax into saturated regions where gradients become very small.
  4. Compute attention weights w. Apply the softmax function to the scaled similarity matrix. The resulting attention weights represent the importance of each key with respect to each query.
  5. Update the token embeddings. Multiply the attention weights w by the value vectors V to obtain a weighted sum of the values. The output is a weighted representation of the values based on the attention weights.
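
The effect of the scaling step can be checked numerically. The sketch below assumes query and key components drawn independently from a standard normal distribution (an idealization, not a property of trained models) and measures the spread of their dot products with and without the division by $\sqrt{d_k}$. The helper `dot_product_std` is hypothetical and exists only for this demonstration.

```python
from typing import Tuple

import torch

torch.manual_seed(0)


def dot_product_std(d_k: int, n_samples: int = 10_000) -> Tuple[float, float]:
    """Return the std of raw and scaled dot products of random unit-variance vectors."""
    q = torch.randn(n_samples, d_k)
    k = torch.randn(n_samples, d_k)
    raw = (q * k).sum(dim=-1)   # one dot product per sample
    scaled = raw / d_k ** 0.5   # the scaling used in attention
    return raw.std().item(), scaled.std().item()


for d_k in (4, 64, 512):
    raw_std, scaled_std = dot_product_std(d_k)
    print(f"d_k={d_k:4d}  raw std={raw_std:6.2f}  scaled std={scaled_std:4.2f}")
```

Under this assumption the raw standard deviation grows roughly like $\sqrt{d_k}$, while the scaled one stays close to 1 regardless of the dimension.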



In [None]:
def scaled_dot_product_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    mask: Optional[torch.Tensor] = None,
    dropout: Optional[nn.Dropout] = None
) -> torch.Tensor:
    """
    Compute scaled dot product attention weights.

    Args:
        query: Tensor with shape [batch_size, seq_length_q, depth_q].
        key: Tensor with shape [batch_size, seq_length_k, depth_k].
        value: Tensor with shape [batch_size, seq_length_v, depth_v].
        mask: Optional tensor with shape [batch_size, seq_length_q, seq_length_k],
            containing values to be masked. Default is None.

    Returns:
        Tensor with shape [batch_size, seq_length_q, depth_v].
    """

    dim_k = query.size(-1)
    logger.debug(f"query_size: {query.size()}")
    logger.debug(f"key: {key.transpose(-2, -1).size()}")
    # Compute the attention scores: dot-product similarity between the
    # query and key tensors, scaled by sqrt(d_k).
    scores = torch.matmul(query, key.transpose(-2, -1)) / sqrt(dim_k)
    if mask is not None:
        # Insert a head axis so the mask broadcasts over scores of shape
        # [batch_size, num_heads, seq_length_q, seq_length_k].
        mask = mask.unsqueeze(1)
        scores = scores.masked_fill(mask == 0, float("-inf"))
    # Compute the attention weights w: softmax over the key dimension.
    weights = F.softmax(scores, dim=-1)

    if dropout is not None:
        weights = dropout(weights)
    # Update the token embeddings: multiply the attention weights w by the
    # value vectors V to obtain a weighted sum of the values.
    result_embeddings = torch.matmul(weights, value)
    return result_embeddings

#### 1.2. Multi-head Attention
- The multi-head attention is an extension of the self-attention mechanism. It enhances the modeling capability by performing multiple attention computations in parallel, with different learned linear projections.
$$ \begin{matrix}
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1,...,\text{head}_h)W^O\\
\text{where}~\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K,VW_i^V)
\end{matrix} $$
- Steps:
  1. It applies linear transformations to the query, key, and value tensors using the learned linear layers self.q, self.k, and self.v, respectively. This projects the tensors to the appropriate dimensions for attention computation.
  2. The attention scores are computed by performing matrix multiplication between the query and key tensors.
  3. The attention scores are scaled by dividing them by the square root of the head dimension (`self.head_dim`).
  4. If a mask is provided, the attention scores are masked by setting the scores corresponding to masked positions to negative infinity.
  5. The attention scores are passed through a softmax activation function along the last dimension (`dim=-1`).
  6. The attention probabilities are used to weight the value tensor.
  7. The resulting attention output is transposed and reshaped to match the original shape.
  8. Finally, the attention output is passed through the `self.output_linear` linear layer, which applies another linear transformation to the output representation.
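
The reshaping in steps 1 and 7 is the least obvious part, so here is a small sketch of splitting a tensor into heads and merging it back. The sizes are toy values assumed for illustration.

```python
import torch

batch_size, seq_len, num_heads, head_dim = 2, 5, 4, 8
embed_dim = num_heads * head_dim

x = torch.randn(batch_size, seq_len, embed_dim)

# Split: [batch, seq, embed] -> [batch, heads, seq, head_dim]
heads = x.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)

# Merge: [batch, heads, seq, head_dim] -> [batch, seq, embed]
# contiguous() guarantees a memory layout that view() accepts;
# it is a no-op when the tensor is already contiguous.
merged = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, embed_dim)

print(heads.shape)
print(torch.equal(merged, x))
```

Each head simply sees a contiguous slice of the embedding dimension: head `h` holds columns `h * head_dim` to `(h + 1) * head_dim - 1` of the projected tensor.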

In [None]:

class MultiHeadAttention(nn.Module):
    """
    Multi-head attention module.

    Args:
        config: Configuration for the multi-head attention.
    """
    def __init__(self, config) -> None:
        super().__init__()
        self.embed_dim = config.hidden_size
        self.num_heads = config.num_attention_heads
        logger.debug(f"hidden_dim: {self.embed_dim}")
        logger.debug(f"num_heads: {self.num_heads}")

        assert self.embed_dim % self.num_heads == 0
        self.head_dim = self.embed_dim // self.num_heads
        logger.debug(f"head_dim: {self.head_dim}")

        self.q = nn.Linear(self.embed_dim, self.head_dim * self.num_heads)
        self.k = nn.Linear(self.embed_dim, self.head_dim * self.num_heads)
        self.v = nn.Linear(self.embed_dim, self.head_dim * self.num_heads)
        self.output_linear = nn.Linear(self.embed_dim, self.embed_dim)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(
            self,
            query: torch.Tensor,
            key: torch.Tensor,
            value: torch.Tensor,
            mask: Optional[torch.Tensor] = None
        ) -> torch.Tensor:
            """
            Perform a forward pass of the multi-head attention.

            Args:
                query: Query tensor of shape [batch_size, seq_len, embed_dim].
                key: Key tensor of shape [batch_size, seq_len, embed_dim].
                value: Value tensor of shape [batch_size, seq_len, embed_dim].
                mask: Optional mask tensor. Default is None.

            Returns:
                Tensor of shape [batch_size, seq_len, embed_dim],
                representing the output of the multi-head attention.
            """
            # Apply linear transformations to the query, key, and value tensors
            q = self.q(query)
            k = self.k(key)
            v = self.v(value)
            logger.debug(f"q_size: {q.size()}")
            logger.debug(f"k_size: {k.size()}")
            logger.debug(f"v_size: {v.size()}")

            # Reshape and transpose tensors for matrix multiplication
            q = q.view(q.size(0), -1, self.num_heads, self.head_dim).transpose(1, 2)
            k = k.view(k.size(0), -1, self.num_heads, self.head_dim).transpose(1, 2)
            v = v.view(v.size(0), -1, self.num_heads, self.head_dim).transpose(1, 2)

            logger.debug(f"qT_size: {q.size()}")
            logger.debug(f"kT_size: {k.size()}")
            logger.debug(f"vT_size: {v.size()}")
            # Run attention for all heads at once using the
            # scaled_dot_product_attention function defined earlier.
            # Despite its name, attn_scores holds the attended values,
            # with shape [batch_size, num_heads, seq_len, head_dim].
            attn_scores = scaled_dot_product_attention(q, k, v, mask, self.dropout)
            attn_scores = attn_scores.transpose(1, 2).contiguous()
            attn_scores = attn_scores.view(attn_scores.size(0), -1, self.embed_dim)
            logger.debug(f"attn_scores: {attn_scores.size()}")

            output = self.output_linear(attn_scores)
            logger.debug(f"output_size: {output.size()}")
            return output

In [None]:
model_ckpt = "bert-base-uncased"
config = AutoConfig.from_pretrained(model_ckpt)

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)

text = "The quick brown fox jumps over the lazy dog"
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs_embeds = token_emb(inputs.input_ids)

query = key = value = inputs_embeds

multihead_attn = MultiHeadAttention(config)
attn_output = multihead_attn(query, key, value)

2025-02-06 10:40:36.719 | DEBUG    | __main__:__init__:12 - hidden_dim: 768
2025-02-06 10:40:36.725 | DEBUG    | __main__:__init__:13 - num_heads: 12
2025-02-06 10:40:36.732 | DEBUG    | __main__:__init__:17 - head_dim: 64
2025-02-06 10:40:36.824 | DEBUG    | __main__:forward:50 - q_size: torch.Size([1, 9, 768])
2025-02-06 10:40:36.832 | DEBUG    | __main__:forward:51 - k_size: torch.Size([1, 9, 768])
2025-02-06 10:40:36.839 | DEBUG    | __main__:forward:52 - v_size: torch.Size([1, 9, 768])
2025-02-06 10:40:36.844 | DEBUG    | __main__:forward:59 - qT_size: torch.

### 2. The Feed-Forward Layer
- The feed-forward layer processes each position in the input sequence independently, without looking at other positions. The computations for different positions can therefore run in parallel.

- It typically consists of two linear transformations with a non-linear activation function in between. Its input is a tensor holding the hidden states of the previous layer.

- Because attention is the only place where positions exchange information, the feed-forward layer is where each token's representation is transformed non-linearly on its own.
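
The claim that the layer treats every position independently can be verified directly. The sketch below builds a stand-in feed-forward network with `nn.Sequential` (toy sizes, no dropout, assumed for illustration) and compares running it on the whole sequence with running it on one position at a time.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden_size, intermediate_size = 8, 32  # toy sizes
ffn = nn.Sequential(
    nn.Linear(hidden_size, intermediate_size),
    nn.ReLU(),
    nn.Linear(intermediate_size, hidden_size),
)

x = torch.randn(1, 4, hidden_size)  # [batch, seq_len, hidden]

full = ffn(x)  # whole sequence at once
# Same network applied to each position separately, then stacked back together.
per_position = torch.stack([ffn(x[:, i]) for i in range(x.size(1))], dim=1)

print(torch.allclose(full, per_position, atol=1e-5))
```

Both paths give the same result up to floating-point rounding, because `nn.Linear` only acts on the last dimension.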



In [None]:

class FeedForward(nn.Module):
    """
    Feed-forward layer module.

    Args:
        config: Configuration for the feed-forward layer.
    """
    def __init__(self, config) -> None:
        super().__init__()
        self.linear_1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear_2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Perform a forward pass of the feed-forward layer.

        Args:
            x: Input tensor of shape [batch_size, seq_len, hidden_dim].

        Returns:
            Tensor of shape [batch_size, seq_len, hidden_dim],
            representing the output of the feed-forward layer.
        """
        x = self.linear_1(x)
        x = self.relu(x)
        x = self.linear_2(x)
        x = self.dropout(x)
        logger.debug(f"ff_output_size: {x.size()}")
        return x

In [None]:
feed_forward = FeedForward(config)
ff_outputs = feed_forward(attn_output)

2025-02-06 10:41:00.346 | DEBUG    | __main__:forward:30 - ff_output_size: torch.Size([1, 9, 768])


### 3. Positional Embeddings
- Positional embeddings give the model a representation of where each token sits in the sequence. Self-attention by itself ignores token order, and a token receives the same token embedding wherever it appears, so without positional information the model could not distinguish the same token at different positions.

- The original Transformer encodes token order with fixed sinusoidal functions of different frequencies, which are not trained. Models such as BERT instead learn the positional embeddings as part of training, and that is the approach taken here: positions are looked up in an `nn.Embedding` table.

- Steps:
  1. The constructor of the `Embeddings` class defines two embedding layers, `self.token_embeddings` and `self.position_embeddings`. They share the same hidden size but have different numbers of entries: `config.vocab_size` for tokens and `config.max_position_embeddings` for positions.
  2. In the forward method, position IDs are created using `torch.arange(seq_length).unsqueeze(0)`. This creates a tensor of sequential integers from 0 to `seq_length - 1` and unsqueezes it to shape `[1, seq_length]`. These position IDs represent the positions of the tokens in the input sequence.
  3. The token embeddings for the input sequence are obtained by passing `input_ids` to `self.token_embeddings`. This maps each token ID to its corresponding embedding vector. On the other hand, the position embeddings for the input sequence are obtained by passing `position_ids` to `self.position_embeddings`. This maps each position ID to its corresponding embedding vector.
  4. The token embeddings and position embeddings are added element-wise (`token_embeddings` + `position_embeddings`) to create the combined embeddings. This operation incorporates both the token information and the positional information of each token in the input sequence.
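
For comparison with the lookup-table approach in the steps above, here is a sketch of the fixed sinusoidal functions of different frequencies mentioned earlier. The function name `sinusoidal_positions` and the toy sizes are assumptions for illustration; the notebook's own `Embeddings` class does not use this table.

```python
import math

import torch


def sinusoidal_positions(max_len: int, hidden_size: int) -> torch.Tensor:
    """Fixed sinusoidal position table of shape [max_len, hidden_size] (hidden_size even)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # [max_len, 1]
    # Frequencies 1 / 10000^(2i / hidden_size) for i = 0 .. hidden_size/2 - 1
    div_term = torch.exp(
        torch.arange(0, hidden_size, 2, dtype=torch.float32)
        * (-math.log(10000.0) / hidden_size)
    )
    table = torch.zeros(max_len, hidden_size)
    table[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    table[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return table


table = sinusoidal_positions(max_len=16, hidden_size=8)
token_embeddings = torch.randn(1, 9, 8)  # stand-in for looked-up token embeddings
# Same element-wise sum as in step 4, broadcast over the batch dimension.
embeddings = token_embeddings + table[:9].unsqueeze(0)

print(table.shape, embeddings.shape)
```

Either way, the positional signal is added element-wise to the token embeddings; the only difference is whether the table is computed once or learned.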



In [None]:
class Embeddings(nn.Module):
    """
    Embeddings layer module.
    Combines a token embedding layer that projects the `input_ids` to a dense hidden state
    with the positional embedding that does the same for `position_ids`.
    The resulting embedding is simply the sum of both embeddings.

    Args:
        config: Configuration for the embeddings layer.
    """
    def __init__(self, config):
        super().__init__()
        self.token_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        # Use the configured dropout rate instead of nn.Dropout's default of 0.5
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        """
        Perform a forward pass of the embeddings layer.

        Args:
            input_ids: Input tensor of shape [batch_size, seq_len].

        Returns:
            Tensor of shape [batch_size, seq_len, hidden_dim],
            representing the embeddings of the input.

        Notes:
            1. Create position IDs for input sequence.
            2. Create token and position embeddings.
            3. Combine token and position embeddings.
        """
        logger.debug(f"input_size: {input_ids.size()}")
        seq_length = input_ids.size(1)
        # 1. Create position IDs 0..seq_length-1 with shape [1, seq_length]
        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device).unsqueeze(0)
        # 2. Create token and position embeddings
        token_embeddings = self.token_embeddings(input_ids)
        logger.debug(f"token_embd_size: {token_embeddings.size()}")
        position_embeddings = self.position_embeddings(position_ids)
        logger.debug(f"position_embd_size: {position_embeddings.size()}")
        # 3. Combine token and position embeddings (broadcasts over the batch)
        embeddings = token_embeddings + position_embeddings

        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)
        logger.debug(f"embd_size: {embeddings.size()}")

        return embeddings

## The Encoder
### 1. TransformerEncoderBlock
- With the previously defined components, we can now define the TransformerEncoderBlock class. It is responsible for performing one layer of the encoder in a Transformer model.
- Steps:
  1. Layer Normalization: The input tensor `x` is first passed through a layer normalization operation using `self.layer_norm_1`. This operation normalizes the activations across the hidden dimension of x to have zero mean and unit variance. The result is stored in hidden_state.
  2. Attention with Skip Connection: The attention mechanism is applied to `hidden_state` using `self.attention`. This attention operation takes hidden_state as the input and produces an attention-based output. The output is then element-wise added (`+`) to the original input tensor `x`. This skip connection allows the model to directly incorporate the original input along with the attention-based output.
  3. Feed-Forward Layer with Skip Connection: The output of the previous step is passed through another layer normalization operation `self.layer_norm_2` to normalize the activations. Then, the result is passed through the feed-forward layer self.feed_forward. The output of the feed-forward layer is again element-wise added (`+`) to the input tensor from the previous step (`x`). This skip connection allows the model to combine the information from the original input with the transformed output from the feed-forward layer.

- In summary, the skip connections enable the model to incorporate the original input tensor x into the output of each layer. By adding the transformed outputs to the original input, the model can retain important information from the input and facilitate the flow of gradients during training. The skip connections help in addressing the vanishing gradient problem and make it easier to train deep Transformer architectures by ensuring the model has access to the original input information at each layer.
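
The pattern used in both sublayers, normalize, transform, then add back the input, can be isolated in a few lines. `PreNormResidual` below is a hypothetical helper written for this illustration and is not part of the notebook's classes; the sublayer is zeroed out to show that the input always has a direct path to the output.

```python
import torch
from torch import nn


class PreNormResidual(nn.Module):
    """Hypothetical helper: returns x + sublayer(layer_norm(x))."""

    def __init__(self, hidden_size: int, sublayer: nn.Module) -> None:
        super().__init__()
        self.layer_norm = nn.LayerNorm(hidden_size)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.layer_norm(x))


hidden_size = 8
block = PreNormResidual(hidden_size, nn.Linear(hidden_size, hidden_size))

# Zero the sublayer so that it contributes nothing: the block becomes the identity.
nn.init.zeros_(block.sublayer.weight)
nn.init.zeros_(block.sublayer.bias)

x = torch.randn(2, 5, hidden_size)
out = block(x)

print(torch.allclose(out, x))
```

Because the identity path is always present, gradients can reach earlier layers without passing through the sublayer, which is the mechanism behind the training benefit described above.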

In [None]:
class TransformerEncoderBlock(nn.Module):
    """
    Transformer Encoder block module.

    Args:
        config: Configuration for the encoder block.
    """

    def __init__(self, config) -> None:
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
        self.attention = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)
        self.dropout = torch.nn.Dropout(config.hidden_dropout_prob)

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor]=None) -> torch.Tensor:
        """
        Perform a forward pass of the transformer encoder block.

        Args:
            x: Input tensor of shape [batch_size, seq_len, hidden_dim].
            mask: Optional mask tensor. Default is None.

        Returns:
            Tensor of shape [batch_size, seq_len, hidden_dim],
            representing the output of the encoder block.
        """
        logger.debug(f"encoder_block_input_size: {x.size()}")
        # Normalize input tensor x
        hidden_state = self.layer_norm_1(x)
        # Apply the attention mechanism to the hidden_state using self.attention
        # Add the output to the original input tensor (skip connection)
        attention_output = self.attention(hidden_state, hidden_state, hidden_state, mask)
        x = x + self.dropout(attention_output)
        # Normalize the activations using self.layer_norm_2
        # Pass it to the feed-forward layer
        # Add the output of the feed_forward layer to the input tensor from
        # the previous step (skip connection)
        hidden_state = self.layer_norm_2(x)
        x = x + self.feed_forward(hidden_state)

        x = self.dropout(x)
        logger.debug(f"encoder_block_output_size: {x.size()}")
        return x

In [None]:
encoder_layer = TransformerEncoderBlock(config)
_ = encoder_layer(inputs_embeds)

2025-02-06 10:42:07.947 | DEBUG    | __main__:__init__:12 - hidden_dim: 768
2025-02-06 10:42:07.949 | DEBUG    | __main__:__init__:13 - num_heads: 12
2025-02-06 10:42:07.951 | DEBUG    | __main__:__init__:17 - head_dim: 64
2025-02-06 10:42:08.086 | DEBUG    | __main__:forward:29 - encoder_block_input_size: torch.Size([1, 9, 768])
2025-02-06 10:42:08.093 | DEBUG    | __main__:forward:50 - q_size: torch.Size([1, 9, 768])
2025-02-06 10:42:08.095 | DEBUG    | __main__:forward:51 - k_size: torch.Size([1, 9, 768])
2025-02-06 10:42:08.098 | DEBUG    | __main__:forward:52 -

### 2. TransformerEncoder
- Finally, putting everything together, we can now define the `TransformerEncoder` class. It processes the input sequence with a stack of Transformer encoder blocks.

In [None]:
class TransformerEncoder(nn.Module):
    """
    Transformer Encoder module.

    Args:
        config: Configuration for the encoder.
    """
    def __init__(self, config) -> None:
        super().__init__()
        self.embeddings = Embeddings(config)
        self.layers = nn.ModuleList([TransformerEncoderBlock(config) for _ in range(config.num_hidden_layers)])

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        """
        Perform a forward pass of the transformer encoder.

        Args:
            x: Input tensor of shape [batch_size, seq_len].
            mask: Optional mask tensor. Default is None.

        Returns:
            Tensor of shape [batch_size, seq_len, hidden_dim],
            representing the output of the encoder.
        """
        x = self.embeddings(x)
        for layer in self.layers:
            x = layer(x, mask)

        return x

In [None]:
logger.remove()
logger.add(sys.stderr, level="INFO")
encoder = TransformerEncoder(config)
encoder_output = encoder(inputs.input_ids)
encoder_output.size()

torch.Size([1, 9, 768])

In [None]:
encoder_output

tensor([[[-0.3491, -0.3320, -0.0517,  ..., -0.5293, -0.0000,  0.8004],
         [ 0.0000, -0.7515,  1.4030,  ...,  0.3103,  0.1716, -0.0362],
         [ 0.3268, -0.5307, -1.9108,  ..., -0.4951,  0.1939, -3.6646],
         ...,
         [ 0.0061, -0.0843,  0.4482,  ...,  0.1206,  0.3052, -5.8102],
         [ 0.8655,  0.0621,  0.0000,  ..., -0.5356,  0.2064, -3.1653],
         [-1.2111,  0.1792,  0.1255,  ...,  0.4157, -0.6652, -0.3626]]],
       grad_fn=<MulBackward0>)

## The Decoder
-  The main difference between the decoder and encoder is that the decoder has two attention sublayers.

- The first attention sublayer, known as the self-attention sublayer, allows the decoder to attend to its own previously generated tokens, capturing dependencies and relationships within the output sequence.

- The second attention sublayer is the encoder-decoder attention, which allows the decoder to attend to the encoded representations produced by the encoder, incorporating contextual information from the input sequence.

- A **mask** is applied in the self-attention mechanism to enforce causality during decoding. Since the decoder generates the target sequence autoregressively, each position should attend only to itself and earlier positions, never to future ones. Recall that in `scaled_dot_product_attention` the scores at masked positions (the entries above the diagonal for a causal mask) are set to negative infinity. This guarantees that the corresponding attention weights are exactly zero after the softmax, since $e^{-\infty} = 0$.
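
A small sketch of the masking step on a 4-token sequence, using random scores as a stand-in for real attention scores (an assumption for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len = 4
scores = torch.randn(seq_len, seq_len)  # raw attention scores

# Lower-triangular mask: position i may look at positions 0..i only.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# Masked positions get -inf, so the softmax assigns them zero weight.
masked_scores = scores.masked_fill(causal_mask == 0, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)

print(weights)
```

Every entry above the diagonal is zero, each row still sums to 1, and the first position can only attend to itself.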

### 1. The Decoder Block
- Similarly to the TransformerEncoderBlock, the TransformerDecoderBlock is responsible for performing one layer of the decoder in a Transformer model:

In [None]:
class TransformerDecoderBlock(nn.Module):
    """
    Transformer Decoder layer module.

    Args:
        config: Configuration for the decoder layer.
    """

    def __init__(self, config, ) -> None:
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_3 = nn.LayerNorm(config.hidden_size)
        self.self_attention = MultiHeadAttention(config)
        self.encoder_decoder_attention = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(
        self,
        x: torch.Tensor,
        encoder_output: torch.Tensor,
        source_mask: Optional[torch.Tensor] = None,
        target_mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        """
        Perform a forward pass of the transformer decoder block.

        Args:
            x: Input tensor of shape [batch_size, seq_len, hidden_dim].
            encoder_output: Output tensor from the encoder of shape [batch_size, seq_len, hidden_dim].
            source_mask: Optional source mask tensor. Default is None.
            target_mask: Optional target mask tensor. Default is None.
        Returns:
            Tensor of shape [batch_size, seq_len, hidden_dim],
            representing the output of the decoder block.
        """
        logger.debug(f"decoder_block_input_size: {x.size()}")

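        # First attention sublayer: masked self-attention over the previously
        # generated tokens. The target mask is required here so that the
        # decoder cannot look at future tokens.
        hidden_state = self.layer_norm_1(x)
        attn_1_out = self.self_attention(hidden_state, hidden_state, hidden_state, target_mask)
        x = x + self.dropout(attn_1_out)
        # Second attention sublayer: queries come from the decoder, keys and
        # values from the encoder output. No causal mask is needed because the
        # decoder may look at the whole encoder output; source_mask is an
        # optional mask over source positions.
        x = self.layer_norm_2(x)
        attn_2_out = self.encoder_decoder_attention(x, encoder_output, encoder_output, source_mask)
        x = x + self.dropout(attn_2_out)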
        x = self.layer_norm_3(x)

        feed_forward_output = self.feed_forward(x)
        x = x + self.dropout(feed_forward_output)
        logger.debug(f"decoder_block_output_size: {x.size()} ")

        return x

### 2. TransformerDecoder


In [None]:
class TransformerDecoder(nn.Module):
    def __init__(self, config) -> None:
        """
        Transformer Decoder module.

        Args:
            config: Configuration object for the decoder.
        """
        super().__init__()
        self.embeddings = Embeddings(config)
        self.layers = nn.ModuleList([TransformerDecoderBlock(config) for _ in range(config.num_hidden_layers)])

    def forward(
        self,
        input_ids: torch.Tensor,
        encoder_output: torch.Tensor,
        source_mask: Optional[torch.Tensor] = None,
        target_mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Perform a forward pass of the transformer decoder.

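        Args:
            input_ids: Target token ids of shape [batch_size, tgt_len].
            encoder_output: Encoder output of shape [batch_size, src_len, hidden_dim].
            source_mask: Optional source mask tensor. Default is None.
            target_mask: Optional target mask tensor. Default is None.

        Returns:
            Tensor of shape [batch_size, tgt_len, hidden_dim],
            representing the decoder hidden states (not yet projected to the vocabulary).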
        """
        x = self.embeddings(input_ids)
        for layer in self.layers:
            x = layer(x, encoder_output, source_mask=source_mask, target_mask=target_mask)
        return x

In [None]:
logger.remove()
logger.add(sys.stderr, level="INFO")
seq_len = inputs.input_ids.size(-1)
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)
encoder = TransformerEncoder(config)
encoder_output = encoder(inputs.input_ids)
decoder = TransformerDecoder(config)
output = decoder(inputs.input_ids, encoder_output, target_mask=mask)
output.size()

torch.Size([1, 9, 768])

## The Transformer
- With all the required components in place, we can define the full model: an encoder, a decoder, and a final linear layer that projects the decoder output to vocabulary logits.

In [None]:
class EncoderDecoder(nn.Module):
    """
    Encoder-Decoder model that combines the TransformerEncoder and TransformerDecoder.

    Args:
        config: Configuration shared by the encoder and the decoder.
    """
    def __init__(
        self,
        config,
    ) -> None:
        super().__init__()
        self.encoder = TransformerEncoder(config)
        self.decoder = TransformerDecoder(config)
        self.fc = nn.Linear(config.hidden_size, config.vocab_size)

    def forward(
        self,
        input_ids: torch.Tensor,
        target_ids: torch.Tensor,
        source_mask: Optional[torch.Tensor] = None,
        target_mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Perform a forward pass of the encoder-decoder model.

        Args:
            input_ids: Input tensor of shape [batch_size, src_len].
            target_ids: Target tensor of shape [batch_size, tgt_len].

        Returns:
            Tensor of shape [batch_size, tgt_len, vocab_size],
            representing the unnormalized logits over the vocabulary.
        """
        encoder_output = self.encoder(input_ids)
        decoder_output = self.decoder(
            target_ids,
            encoder_output,
            source_mask=source_mask,
            target_mask=target_mask
        )

        x = self.fc(decoder_output)  # Apply linear layer to transform to vocab_size

        return x

### Masking
The masks used in the Transformer need a specific shape and value convention so that attention ignores the right positions. In the code below, 1 marks a position that may be attended to and 0 marks a position to ignore:

- Padding Mask: masks out padding tokens in the input sequences. It has shape (batch_size, seq_length) and contains 1 for real tokens and 0 for padding tokens, so padding never contributes to the attention scores.

- Future Mask: prevents self-attention from attending to future positions. It has shape (seq_length, seq_length), with 1 for positions that can be attended and 0 for positions that should be masked or ignored.

- Combined Mask: the final decoder mask combines the padding mask and the future mask with an element-wise multiplication or logical AND, so a position is visible only if both masks allow it.
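
As a minimal sketch of the combination step, the cell below builds both masks for a toy batch and joins them with a logical AND. It assumes a padding id of 0 and the convention that True (1) means "may attend"; the helper name `combine_masks` is hypothetical and not part of this notebook.

```python
import torch


def combine_masks(token_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Return a boolean mask of shape (batch_size, seq_length, seq_length).

    True marks (query, key) pairs that may attend; False marks pairs to ignore.
    """
    seq_length = token_ids.size(-1)
    # Padding mask: True for real tokens, shape (batch_size, 1, seq_length).
    padding_mask = (token_ids != pad_id).unsqueeze(-2)
    # Future mask: True on and below the diagonal, shape (seq_length, seq_length).
    future_mask = torch.tril(torch.ones(seq_length, seq_length, dtype=torch.bool))
    # Logical AND: a key is visible only if it is neither padding nor in the future.
    return padding_mask & future_mask


ids = torch.tensor([[5, 7, 9, 0]])  # the last position is padding
combined = combine_masks(ids)
print(combined.int())
```

Broadcasting expands the (batch_size, 1, seq_length) padding mask across all query rows, so the padded key column is hidden from every query.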

In [None]:
def create_mask(batch_size: int, seq_length: int) -> torch.Tensor:
    """
    Create a lower triangular mask with ones on and below the diagonal.

    Args:
        batch_size: The batch size.
        seq_length: The length of the sequence.

    Returns:
        The mask tensor with shape (batch_size, seq_length, seq_length).
    """

    mask = torch.tril(torch.ones(seq_length, seq_length))
    mask = mask.unsqueeze(0).expand(batch_size, seq_length, seq_length)  # Expand the mask along the batch dimension

    return mask
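
How the attention layers consume this mask is not shown in this section. A common pattern, assumed here, is to fill the masked positions of the attention scores with negative infinity before the softmax. The scores below are a stand-in (all zeros) rather than real query-key products.

```python
import torch
import torch.nn.functional as F

batch_size, seq_length = 2, 4
# Same construction as create_mask: ones on and below the diagonal.
mask = torch.tril(torch.ones(seq_length, seq_length))
mask = mask.unsqueeze(0).expand(batch_size, seq_length, seq_length)

# Stand-in attention scores; a real model would compute Q @ K^T / sqrt(d_k).
scores = torch.zeros(batch_size, seq_length, seq_length)

# Positions where mask == 0 get -inf, so softmax assigns them zero weight.
masked_scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)
print(weights[0])
```

Each row still sums to 1, but all of the weight falls on the current and earlier positions.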

## Testing the Transformer!

In [None]:
class TransformerConfig:
    """
    Configuration class for the Transformer model.

    Args:
        hidden_size: Size of the hidden state.
        intermediate_size: Size of the intermediate layer in the feed-forward network.
        num_hidden_layers: Number of hidden layers in the Transformer.
        vocab_size: Size of the vocabulary.
        max_position_embeddings: Maximum number of positional embeddings.
        hidden_dropout_prob: Dropout probability for the hidden layers.
        num_attention_heads: Number of attention heads in the multi-head attention.
    """
    def __init__(
        self,
        hidden_size: int,
        intermediate_size: int,
        num_hidden_layers: int,
        vocab_size: int,
        max_position_embeddings: int,
        hidden_dropout_prob: float,
        num_attention_heads: int
    ):
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_dropout_prob = hidden_dropout_prob
        self.num_attention_heads = num_attention_heads

In [None]:
# Set up hyperparameters and configuration
config = TransformerConfig(
    hidden_size=512,
    intermediate_size=2048,
    num_hidden_layers=6,
    vocab_size=100,
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,
    num_attention_heads=8
)

In [None]:
# Define some fake data
batch_size = 16
source_length = 10
target_length = 12

source_ids = torch.randint(0, config.vocab_size, (batch_size, source_length))
target_ids = torch.randint(0, config.vocab_size, (batch_size, target_length))

source_ids.size(), target_ids.size()

(torch.Size([16, 10]), torch.Size([16, 12]))

In [None]:
source_mask = create_mask(batch_size, source_length)
target_mask = create_mask(batch_size, target_length)
source_mask.size(), target_mask.size()

logger.remove()
logger.add(sys.stderr, level="INFO")

5

In [None]:
encoder = TransformerEncoder(config)
encoder_output = encoder(source_ids)
decoder = TransformerDecoder(config)
output = decoder(source_ids, encoder_output, source_mask=source_mask)
output.size()

torch.Size([16, 10, 512])

In [None]:
# Define the EncoderDecoder model
encoder_decoder = EncoderDecoder(config)
output = encoder_decoder(source_ids, target_ids, target_mask=target_mask)
print("Output Shape:", output.shape)  # Should be (batch_size, target_length, vocab_size)

Output Shape: torch.Size([16, 12, 100])
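
The model returns one logit vector per target position. The sketch below shows how such an output could be turned into probabilities and greedy token predictions; it uses random logits as a stand-in for the real model output, so only the shapes are meaningful.

```python
import torch
import torch.nn.functional as F

batch_size, target_length, vocab_size = 16, 12, 100
# Stand-in for the model output; real logits come from EncoderDecoder.
logits = torch.randn(batch_size, target_length, vocab_size)

# Softmax over the vocabulary axis turns each logit vector into a distribution.
probs = F.softmax(logits, dim=-1)
# Greedy choice: the most likely next token at every target position.
predicted_ids = probs.argmax(dim=-1)
print(probs.shape, predicted_ids.shape)
```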


In [None]:
target_ids.size()


torch.Size([16, 12])

### Training


In [None]:
class RandomDataset(torch.utils.data.Dataset):
    """
    Provides a random dataset for the copy task: each target is a copy of its source.

    Args:
        vocabulary_size: The vocabulary size.
        batch_size: The batch size.
        num_samples: The number of samples.
        sample_length: The length of each sample.
    """

    def __init__(self, vocabulary_size: int, batch_size: int, num_samples: int, sample_length: int):
        self.samples = list()

        for i in range(batch_size * num_samples):
            data = torch.from_numpy(np.random.randint(1, vocabulary_size, size=(sample_length,)))
            data[0] = 1  # Every sequence starts with token 1 (start-of-sequence)
            # Copy task: the target is identical to the source
            source = data
            target = data

            # Prepare the sample dictionary
            sample = {
                'source': source,
                'target': target[:-1],  # Decoder input
                'target_y': target[1:],  # Labels: the decoder input shifted left by one
                'source_mask': (source != 0).unsqueeze(-2),
                'target_mask': self.make_std_mask(target[:-1], 0),  # Must match the decoder input length
                'tokens_count': (target[1:] != 0).sum()  # Number of non-padding label tokens
            }

            self.samples.append(sample)

    def __len__(self) -> int:
        """
        Get the number of samples in the dataset.

        Returns:
            The number of samples.
        """
        return len(self.samples)

    def __getitem__(self, idx: int) -> dict:
        """
        Get a sample from the dataset.

        Args:
            idx: The index of the sample to retrieve.

        Returns:
            A dictionary containing the source, target, target_y, source_mask, target_mask, and tokens_count.
        """
        return self.samples[idx]

    @staticmethod
    def make_std_mask(target: torch.Tensor, pad: int) -> torch.Tensor:
        """
        Create a mask to hide padding and future words.

        Args:
            target (torch.Tensor): The target tensor.
            pad (int): The padding value.

        Returns:
            torch.Tensor: The mask tensor.
        """
        # True for non-padding positions, combined with the causal (lower triangular) mask
        target_mask = (target != pad)
        target_mask = target_mask & RandomDataset.subsequent_mask(target.size(-1))

        return target_mask

    @staticmethod
    def subsequent_mask(size: int) -> torch.Tensor:
        """
        Mask out subsequent positions.

        Args:
            size: The size of the mask.

        Returns:
            torch.Tensor: The subsequent mask tensor.
        """
        attn_shape = (size, size)
        subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
        return torch.from_numpy(subsequent_mask) == 0
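
The `target` / `target_y` split in the dataset is the usual shifted-by-one setup for next-token training. The toy example below (a sketch with made-up token ids) shows what the decoder sees at each step and which label it is trained to predict. Note that the causal mask is sized for the decoder input, which is one token shorter than the full sequence.

```python
import torch

# A toy sequence; 1 plays the role of the start token, as in RandomDataset.
data = torch.tensor([1, 4, 2, 7, 3])

decoder_input = data[:-1]  # what the decoder sees:  [1, 4, 2, 7]
labels = data[1:]          # what it should predict: [4, 2, 7, 3]

# Causal mask sized for the decoder input (length 4, not 5).
size = decoder_input.size(-1)
causal_mask = torch.tril(torch.ones(size, size, dtype=torch.bool))

for t in range(size):
    # Row t of the mask selects the tokens visible when predicting labels[t].
    visible = decoder_input[causal_mask[t]].tolist()
    print(f"step {t}: sees {visible} -> predicts {labels[t].item()}")
```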

In [None]:
batch_size = 64
num_samples = 1000
samples_len = 10
train_set = RandomDataset(config.vocab_size, batch_size, num_samples, samples_len)
train_loader = torch.utils.data.DataLoader(train_set, batch_size)


In [None]:
model = EncoderDecoder(config)

# Initialize the parameters of the model being trained.
for p in model.parameters():
    if p.dim() > 1:
        torch.nn.init.xavier_uniform_(p)

model.train()

optimizer = torch.optim.Adam(model.parameters())
loss_function = torch.nn.CrossEntropyLoss()

current_loss = 0.0
counter = 0

for i, batch in enumerate(train_loader):
    optimizer.zero_grad()
    out = model(batch['source'], batch['target'], batch['source_mask'], batch['target_mask'])
    # Flatten to [batch_size * tgt_len, vocab_size] logits and [batch_size * tgt_len] labels
    loss = loss_function(out.reshape(-1, out.size(-1)), batch['target_y'].reshape(-1))
    loss.backward()
    optimizer.step()

    # .item() stores a Python float instead of keeping the autograd graph alive
    current_loss += loss.item()
    counter += 1

    if counter % 5 == 0:
        print("Batch: %d; Loss: %f" % (i + 1, current_loss / counter))
        current_loss = 0.0
        counter = 0
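
Once a model has been trained on the copy task, one way to check it is greedy decoding: start from the start token and repeatedly append the most likely next token. The sketch below is an assumption about how this could look, not part of the original notebook. `greedy_decode` and `EchoModel` are hypothetical names; `EchoModel` is a stand-in that mimics the `EncoderDecoder` call signature and always predicts a perfect copy, so the cell runs without a trained model.

```python
import torch
import torch.nn.functional as F
from torch import nn


def greedy_decode(model: nn.Module, source_ids: torch.Tensor, max_len: int, start_id: int = 1) -> torch.Tensor:
    """Greedily extend a target sequence one token at a time (hypothetical helper)."""
    model.eval()
    # Start every sequence in the batch with the start token.
    generated = torch.full((source_ids.size(0), 1), start_id, dtype=torch.long)
    with torch.no_grad():
        for _ in range(max_len - 1):
            tgt_len = generated.size(1)
            causal = torch.tril(torch.ones(tgt_len, tgt_len)).unsqueeze(0)
            logits = model(source_ids, generated, target_mask=causal)
            # Only the logits at the last position predict the next token.
            next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
            generated = torch.cat([generated, next_token], dim=1)
    return generated


class EchoModel(nn.Module):
    """Stand-in with the EncoderDecoder call signature; always predicts a perfect copy."""

    def __init__(self, vocab_size: int) -> None:
        super().__init__()
        self.vocab_size = vocab_size

    def forward(self, input_ids, target_ids, source_mask=None, target_mask=None):
        tgt_len = target_ids.size(1)
        # The "correct" next token after target position t is source position t + 1.
        next_ids = input_ids[:, 1:tgt_len + 1]
        return F.one_hot(next_ids, self.vocab_size).float()


source = torch.tensor([[1, 5, 3, 8], [1, 2, 9, 4]])
decoded = greedy_decode(EchoModel(vocab_size=10), source, max_len=source.size(1))
print(decoded)
```

With a trained `EncoderDecoder` in place of `EchoModel`, comparing `decoded` against `source` would give a simple accuracy check for the copy task.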