(Transformer)=
# Chapter 21 -- Transformer

## 1. Introduction

Transformers are a deep learning architecture introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need". They reshaped natural language processing (NLP) and underlie models such as BERT, GPT, and T5. Because self-attention relates every position of a sequence to every other position in a single step, rather than stepping through the sequence recurrently, the computation parallelizes well across the sequence during training.

## 2. Mathematical Theory

### 2.1 Self-Attention Mechanism
The self-attention mechanism is a core component of the transformer model. It allows the model to weigh the importance of different words in a sequence when encoding a particular word.

#### Query, Key, and Value Vectors
For each word (token) in the input sequence, we create three vectors: a query, a key, and a value. Stacking these vectors row by row for the whole sequence gives the matrices $Q$, $K$, and $V$, which are obtained by multiplying the input embeddings with learned weight matrices.

$$
Q = XW_Q \quad K = XW_K \quad V = XW_V
$$
(eq2_1)

Where:
- $X$ is the input embedding matrix, with one row per token.
- $W_Q, W_K, W_V$ are learned weight matrices.
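
The projections can be sketched in PyTorch as follows. This is a minimal illustration, not a reference implementation: the sizes `n`, `d_model`, and `d_k` are arbitrary assumptions, and bias-free `nn.Linear` layers stand in for $W_Q$, $W_K$, and $W_V$ (a linear layer stores the transpose of the matrix written in the equation).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n, d_model, d_k = 4, 8, 8          # illustrative sizes (assumptions)

X = torch.randn(n, d_model)        # one row per token embedding

# Bias-free linear layers play the role of W_Q, W_K, W_V.
W_Q = nn.Linear(d_model, d_k, bias=False)
W_K = nn.Linear(d_model, d_k, bias=False)
W_V = nn.Linear(d_model, d_k, bias=False)

Q, K, V = W_Q(X), W_K(X), W_V(X)   # each has shape (n, d_k)
print(Q.shape, K.shape, V.shape)
```

Each row of `Q`, `K`, and `V` belongs to the token in the same row of `X`.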


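#### Attention Score Calculation
The attention scores are the dot products of every query vector with every key vector, divided by $\sqrt{d_k}$. The scaling keeps the scores from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.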

$$
\text{Attention\_scores} = \frac{QK^T}{\sqrt{d_k}}
$$
(eq2_2)

Where:
- $d_k$ is the dimension of the key vectors (used for scaling).

#### Softmax Function for Normalization

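To convert the attention scores into probabilities, we apply the softmax function to each row of the score matrix. Writing $Q_i$ for the $i$-th row of $Q$, $K_j$ for the $j$-th row of $K$, and $n$ for the sequence length, the weight that position $i$ assigns to position $j$ is:

$$
\alpha_{ij} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)_{ij} = \frac{\exp \left(\frac{Q_i K_j^T}{\sqrt{d_k}}\right)}{\sum_{l=1}^{n} \exp \left(\frac{Q_i K_l^T}{\sqrt{d_k}}\right)}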
$$
(eq2_3)
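
As a small worked example, the softmax of a single row of scores can be computed with the standard library alone. The scores below are made-up values for one query against three keys, and `softmax` is a helper written for this illustration; subtracting the row maximum is a common numerical-stability step that does not change the result.

```python
import math

def softmax(scores):
    """Softmax of one row of scores, shifted by the row maximum for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scaled scores of one query against three keys.
row = [2.0, 1.0, 0.1]
alpha = softmax(row)
print([round(a, 3) for a in alpha])   # [0.659, 0.242, 0.099]
```

The weights are positive and sum to one, and the largest score receives the largest weight.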

#### Weighted Sum of Values

The output of the self-attention mechanism for each word is the weighted sum of the value vectors, where the weights are the normalized attention weights $\alpha_{ij}$ produced by the softmax.

$$
\text{Attention}(Q, K, V) = \alpha V
$$
(eq2_4)


### 2.2 Positional Encoding

Since transformers do not have a built-in notion of word order, we add positional encodings to the input embeddings to incorporate sequence order information.

$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$
(eq2_5)

$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$
(eq2_6)

Where:
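- $pos$ is the position of the token in the sequence.
- $i$ indexes the sine/cosine pair, so $2i$ and $2i+1$ are the even and odd embedding dimensions.
- $d_{model}$ is the model (embedding) dimension.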
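
The two formulas can be evaluated directly. The sketch below uses only the standard library; `positional_encoding` is a hypothetical helper name, and the tiny sizes are chosen so the values are easy to check by hand.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal positional encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for two_i in range(0, d_model, 2):        # two_i is the even index 2i
            angle = pos / (10000 ** (two_i / d_model))
            pe[pos][two_i] = math.sin(angle)      # even dimension
            if two_i + 1 < d_model:
                pe[pos][two_i + 1] = math.cos(angle)  # odd dimension
    return pe

pe = positional_encoding(max_len=3, d_model=4)
print(pe[0])   # position 0: sin(0) = 0 and cos(0) = 1 in every pair
print(pe[1])
```

Position 0 always encodes to alternating zeros and ones, and later positions rotate each sine/cosine pair at a frequency that decreases with the dimension index.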

### 2.3 Scaled Dot-Product Attention

Combining the steps described above (queries, keys, values, scaling, and softmax normalization) into a single expression gives scaled dot-product attention.

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
(eq2_7)
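
A minimal PyTorch sketch of this formula follows. The function name, the optional boolean `mask` argument (with `True` meaning "blocked"), and the tensor sizes are assumptions made for the illustration.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # (n_q, n_k)
    if mask is not None:
        # Blocked positions get -inf so their softmax weight becomes 0.
        scores = scores.masked_fill(mask, float("-inf"))
    alpha = torch.softmax(scores, dim=-1)                  # each row sums to 1
    return alpha @ V, alpha

torch.manual_seed(0)
Q = torch.randn(3, 4)   # 3 queries with d_k = 4
K = torch.randn(5, 4)   # 5 keys
V = torch.randn(5, 6)   # 5 values with d_v = 6
out, alpha = scaled_dot_product_attention(Q, K, V)
print(out.shape, alpha.shape)
```

The output has one row per query and as many columns as the value vectors, while `alpha` holds one weight per query-key pair.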

### 2.4 Multi-Head Attention

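Instead of performing a single attention function, multi-head attention runs $h$ attention functions in parallel, each with its own learned projections of the queries, keys, and values. This allows the model to attend to different parts of the input sequence, and to different representation subspaces, at the same time. The head outputs are concatenated and projected by a learned matrix $W_O$.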

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)W_O
$$
(eq2_8)

Where each head is calculated as:

$$
\text{head}_i = \text{Attention}(QW_{Q_i}, KW_{K_i}, VW_{V_i})
$$
(eq2_9)
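
One way to sketch multi-head attention in PyTorch is shown below. It is an illustration under stated assumptions: the class name, the batch-first layout `(batch, seq_len, d_model)`, and the choice of fusing the per-head matrices $W_{Q_i}$, $W_{K_i}$, $W_{V_i}$ into one `d_model x d_model` projection that is then split into heads are all choices made for this example. Dropout and masking are left out.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.h = num_heads
        self.d_k = d_model // num_heads
        # Each projection holds the matrices of all heads side by side.
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.W_O = nn.Linear(d_model, d_model, bias=False)

    def _split(self, x):
        # (batch, n, d_model) -> (batch, h, n, d_k)
        b, n, _ = x.shape
        return x.view(b, n, self.h, self.d_k).transpose(1, 2)

    def forward(self, q, k, v):
        Q, K, V = self._split(self.W_Q(q)), self._split(self.W_K(k)), self._split(self.W_V(v))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        alpha = torch.softmax(scores, dim=-1)
        heads = alpha @ V                                   # (batch, h, n_q, d_k)
        b, _, n_q, _ = heads.shape
        concat = heads.transpose(1, 2).reshape(b, n_q, self.h * self.d_k)
        return self.W_O(concat)                             # Concat(head_1..head_h) W_O

torch.manual_seed(0)
mha = MultiHeadAttention(d_model=16, num_heads=4)
x = torch.randn(2, 5, 16)          # (batch, seq_len, d_model)
y = mha(x, x, x)                   # self-attention: q = k = v = x
print(y.shape)
```

The output keeps the shape of the query input, which is what lets the layer sit inside a residual connection.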

### 2.5 Feed-Forward Neural Networks

Each position in the sequence is processed independently and identically by a fully connected feed-forward network (FFN), which consists of two linear transformations with a ReLU activation in between.

$$
\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
$$
(eq2_10)
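
The feed-forward network can be sketched in PyTorch as two linear layers with a ReLU between them. The class name and the sizes are assumptions for this example; the check at the end illustrates the "position-wise" property, namely that applying the network to the whole sequence gives the same result as applying it to one position on its own.

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # x W_1 + b_1
        self.linear2 = nn.Linear(d_ff, d_model)   # (...) W_2 + b_2

    def forward(self, x):
        return self.linear2(torch.relu(self.linear1(x)))

torch.manual_seed(0)
ffn = PositionwiseFFN(d_model=8, d_ff=32)
x = torch.randn(2, 5, 8)                          # (batch, seq_len, d_model)
full = ffn(x)

# The same weights act on every position independently.
single = ffn(x[:, 2])                             # only position 2
print(torch.allclose(full[:, 2], single, atol=1e-6))
```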

### 2.6 Layer Normalization and Residual Connections

Transformers use layer normalization and residual connections to stabilize and speed up the training process.

#### Layer Normalization

$$
\text{LayerNorm}(x) = \frac{x - \mu}{\sigma + \epsilon} \gamma + \beta
$$
(eq2_11)

Where:
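- $\mu$ is the mean of the features of $x$.
- $\sigma$ is the standard deviation of the features of $x$.
- $\epsilon$ is a small constant that prevents division by zero.
- $\gamma$ and $\beta$ are learned scale and shift parameters.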
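
A worked example with the standard library makes the formula concrete. The helper below is a sketch that follows the equation as written, with $\epsilon$ added to the standard deviation; using the population (biased) standard deviation and scalar `gamma` and `beta` are assumptions made for brevity. Libraries such as PyTorch instead place $\epsilon$ inside the square root, which gives almost identical values.

```python
import math

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature vector x to zero mean and (nearly) unit variance."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)   # population variance (assumption)
    sigma = math.sqrt(var)
    return [gamma * (v - mu) / (sigma + eps) + beta for v in x]

y = layer_norm([1.0, 2.0, 3.0, 4.0])
print([round(v, 3) for v in y])   # [-1.342, -0.447, 0.447, 1.342]
```

Here the mean is 2.5 and the standard deviation is about 1.118, so each entry is shifted by 2.5 and divided by roughly 1.118.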

#### Residual Connection

$$
\text{Output} = \text{LayerNorm}(x + \text{Sublayer}(x))
$$
(eq2_12)

Where Sublayer could be the multi-head attention mechanism or the feed-forward network.


## 3. Transformer Architecture Diagram

A detailed diagram of the transformer architecture shows the encoder and decoder blocks. Key components include multi-head attention and feed-forward layers.






## 4. Structure of Transformer

### 4.1 Encoder

The encoder is composed of a stack of $N$ identical layers. Each layer has two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network.

#### Encoder Layer

Each encoder layer applies its two sub-layers in sequence, so the feed-forward sub-layer operates on the output of the self-attention sub-layer:

$$
x' = \text{LayerNorm}(x + \text{MultiHead}(x, x, x)) \quad \text{(Self-Attention Sub-Layer)}
$$
(eq4_1)

$$
y = \text{LayerNorm}(x' + \text{FFN}(x')) \quad \text{(Feed-Forward Sub-Layer)}
$$
(eq4_2)

Where:
- $x$ is the input to the encoder layer.
- $x'$ is the output of the self-attention sub-layer.
- $y$ is the output of the encoder layer.
- $\text{MultiHead}$ is the multi-head attention mechanism.
- $\text{FFN}$ is the feed-forward neural network.
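
An encoder layer can be sketched with PyTorch building blocks as below. This is an illustration rather than the layer PyTorch ships: the class name and sizes are assumptions, dropout is omitted, and `nn.MultiheadAttention` is used in its default layout `(seq_len, batch, d_model)`. The second sub-layer consumes the output of the first.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=16, nhead=4, dim_feedforward=32):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.ReLU(),
            nn.Linear(dim_feedforward, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention sub-layer: residual connection, then layer normalization.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Feed-forward sub-layer, applied to the output of the first sub-layer.
        x = self.norm2(x + self.ffn(x))
        return x

torch.manual_seed(0)
layer = EncoderLayer()
x = torch.randn(10, 2, 16)         # (seq_len, batch, d_model)
y = layer(x)
print(y.shape)
```

Because input and output shapes match, $N$ such layers can be stacked directly.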

### 4.2 Decoder

The decoder is also composed of a stack of $N$ identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.

#### Decoder Layer

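Each decoder layer applies its three sub-layers in sequence, each one operating on the output of the previous one:

$$
x' = \text{LayerNorm}(x + \text{MultiHead}(x, x, x)) \quad \text{(Masked Self-Attention Sub-Layer)}
$$
(eq4_3)

$$
x'' = \text{LayerNorm}(x' + \text{MultiHead}(x', \text{EncoderOutput}, \text{EncoderOutput})) \quad \text{(Encoder-Decoder Attention Sub-Layer)}
$$
(eq4_4)

$$
y = \text{LayerNorm}(x'' + \text{FFN}(x'')) \quad \text{(Feed-Forward Sub-Layer)}
$$
(eq4_5)

Where:
- $x$ is the input to the decoder layer.
- $x'$ and $x''$ are the outputs of the first and second sub-layers.
- $y$ is the output of the decoder layer.
- $\text{EncoderOutput}$ is the output from the encoder stack.

In the masked self-attention sub-layer, each position may attend only to itself and to earlier positions, so a prediction cannot depend on later target tokens.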
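
The masking in the first decoder sub-layer can be illustrated with a small causal mask. In this sketch the scores are set to zero purely for illustration, so the effect of the mask is easy to read off: position $i$ spreads its attention evenly over positions $0$ to $i$ and gives zero weight to later positions. The convention that `True` marks a blocked entry is an assumption of this example.

```python
import torch

n = 4
# True above the diagonal: position i may not attend to positions j > i.
causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)

scores = torch.zeros(n, n)                        # hypothetical uniform scores
masked = scores.masked_fill(causal_mask, float("-inf"))
alpha = torch.softmax(masked, dim=-1)
print(alpha)
```

The first row puts all of its weight on position 0, and the last row assigns 0.25 to each of the four positions.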

### 4.3 Encoder-Decoder Interactions

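The encoder and decoder interact through the encoder-decoder attention sub-layer, in which the queries $Q$ come from the decoder and the keys $K$ and values $V$ come from the output of the encoder stack. This lets every decoder position attend to all positions of the input sequence. The sub-layer uses the same multi-head attention defined earlier: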

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)W_O
$$
(eq2_8)

Where each attention head is calculated as:

$$
\text{head}_i = \text{Attention}(QW_{Q_i}, KW_{K_i}, VW_{V_i})
$$
(eq2_9)

And the attention function is defined as:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
(eq2_7)
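
The roles of the three inputs can be seen from tensor shapes alone. The sketch below uses PyTorch's `nn.MultiheadAttention` in its default layout `(seq_len, batch, d_model)`; the sizes and the random tensors standing in for the decoder state and the encoder output are assumptions for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, nhead = 16, 4
cross_attn = nn.MultiheadAttention(d_model, nhead)

decoder_state = torch.randn(7, 2, d_model)    # (tgt_len, batch, d_model): queries
encoder_output = torch.randn(5, 2, d_model)   # (src_len, batch, d_model): keys and values

out, weights = cross_attn(decoder_state, encoder_output, encoder_output)
print(out.shape)       # one output vector per target position
print(weights.shape)   # (batch, tgt_len, src_len), averaged over heads
```

The output length follows the decoder (target) sequence, while each row of the attention weights is a distribution over the encoder (source) positions.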



## 5. Implementing a Transformer in PyTorch

```python
import math

import torch
import torch.nn as nn


class Transformer(nn.Module):
    """Token embedding and positional encoding wrapped around PyTorch's nn.Transformer."""

    def __init__(self, vocab_size, d_model, nhead, num_encoder_layers,
                 num_decoder_layers, dim_feedforward, dropout):
        super().__init__()
        self.d_model = d_model
        self.transformer = nn.Transformer(d_model, nhead, num_encoder_layers,
                                          num_decoder_layers, dim_feedforward, dropout)
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = PositionalEncoding(d_model, dropout)

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        # src: (src_len, batch) and tgt: (tgt_len, batch), both holding token ids.
        src = self.embedding(src) * math.sqrt(self.d_model)
        src = self.pos_encoder(src)
        tgt = self.embedding(tgt) * math.sqrt(self.d_model)
        tgt = self.pos_encoder(tgt)
        output = self.transformer(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
        return output


class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal encodings to inputs of shape (seq_len, batch, d_model)."""

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # Equals 1 / 10000^(2i / d_model), computed in log space.
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        # Shape (max_len, 1, d_model) so the encoding broadcasts over the batch.
        pe = pe.unsqueeze(1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)


# Example usage:
vocab_size = 10000  # Define vocabulary size
d_model = 512       # Embedding dimension
nhead = 8           # Number of attention heads
num_encoder_layers = 6
num_decoder_layers = 6
dim_feedforward = 2048
dropout = 0.1

model = Transformer(vocab_size, d_model, nhead, num_encoder_layers,
                    num_decoder_layers, dim_feedforward, dropout)
src = torch.randint(0, vocab_size, (10, 32))  # (src_len, batch)
tgt = torch.randint(0, vocab_size, (20, 32))  # (tgt_len, batch)
# Causal mask so each target position attends only to itself and earlier positions.
tgt_mask = model.transformer.generate_square_subsequent_mask(tgt.size(0))
output = model(src, tgt, tgt_mask=tgt_mask)
print(output.shape)  # torch.Size([20, 32, 512])
```
