# The Transformer Architecture

Self-attention offers both parallel computation and the shortest maximum path length between any two positions in a sequence, which makes it an appealing building block for deep architectures. The Transformer stacks layers that each combine a self-attention sublayer with a positionwise feed-forward network, and every sublayer is wrapped in a residual connection followed by layer normalization.

[Transformer](https://d2l.ai/chapter_attention-mechanisms-and-transformers/transformer.html)

In [1]:
import math
import pandas as pd
import torch
from torch import nn
from d2l import torch as d2l

## Positionwise feed-forward network
The positionwise feed-forward network transforms the representation at every sequence position using the same MLP, which is why we call it **positionwise**. In the implementation below, the input `X` with shape (batch size, number of time steps or sequence length in tokens, number of hidden units or feature dimension) is transformed by a two-layer MLP into an output tensor of shape (batch size, number of time steps, `ffn_num_outputs`).

In [2]:
class PositionWiseFFN(nn.Module):  #@save
    """The positionwise feed-forward network."""
    def __init__(self, ffn_num_hiddens, ffn_num_outputs):
        super().__init__()
        self.dense1 = nn.LazyLinear(ffn_num_hiddens)
        self.relu = nn.ReLU()
        self.dense2 = nn.LazyLinear(ffn_num_outputs)

    def forward(self, X):
        return self.dense2(self.relu(self.dense1(X)))

In [3]:
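# Example: apply the positionwise FFN to an all-ones input of shape (2, 3, 4).
# The three rows of the first example come out identical because every
# position holds the same input and passes through the same MLP.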
ffn = PositionWiseFFN(4, 8)
ffn.eval()
ffn(torch.ones((2, 3, 4)))[0]



tensor([[ 0.1246, -0.0252, -0.1967,  0.7297, -0.4245, -0.4610,  0.3698, -0.3923],
        [ 0.1246, -0.0252, -0.1967,  0.7297, -0.4245, -0.4610,  0.3698, -0.3923],
        [ 0.1246, -0.0252, -0.1967,  0.7297, -0.4245, -0.4610,  0.3698, -0.3923]],
       grad_fn=<SelectBackward0>)
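
The identical rows above follow from the positionwise property. As a minimal sketch of that property, the snippet below checks that applying a two-layer MLP to the whole tensor gives the same result as applying it to each time step separately. It uses a stand-in `nn.Sequential` with fixed input size rather than the `PositionWiseFFN` class itself, and the names `Y_all` and `Y_each` are chosen here for illustration only.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for PositionWiseFFN(4, 8) applied to inputs with 4 features:
# Linear(4 -> 4), ReLU, Linear(4 -> 8)
ffn = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 8))
ffn.eval()

# Shape: (batch size, number of time steps, feature dimension)
X = torch.randn(2, 3, 4)

# Apply the MLP to the whole tensor at once; nn.Linear acts on the last axis
Y_all = ffn(X)

# Apply the same MLP to each time step separately, then restack along time
Y_each = torch.stack([ffn(X[:, t, :]) for t in range(X.shape[1])], dim=1)

print(Y_all.shape)
print(torch.allclose(Y_all, Y_each, atol=1e-6))
```

Because `nn.Linear` only mixes values along the last axis, no information flows between positions inside this network; mixing across positions is left to the attention sublayer.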

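## Residual connection and layer normalization

Layer normalization normalizes across the feature dimension of each example, whereas batch normalization normalizes each feature across the examples in a minibatch. The following snippet compares the two on the same input.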

In [4]:
ln = nn.LayerNorm(2)
bn = nn.LazyBatchNorm1d()
X = torch.tensor([[1, 2], [2, 3]], dtype=torch.float32)
# Compute mean and variance from X in the training mode
print('layer norm:', ln(X), '\nbatch norm:', bn(X))

layer norm: tensor([[-1.0000,  1.0000],
        [-1.0000,  1.0000]], grad_fn=<NativeLayerNormBackward0>) 
batch norm: tensor([[-1.0000, -1.0000],
        [ 1.0000,  1.0000]], grad_fn=<NativeBatchNormBackward0>)
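
To make the difference between the two outputs concrete, here is a small sketch that reproduces both normalizations by hand on the same `X`. It assumes the default settings of the PyTorch layers (`eps=1e-5`, scale initialized to 1, shift initialized to 0, batch norm in training mode); the names `ln_manual` and `bn_manual` are introduced here for illustration.

```python
import torch
from torch import nn

X = torch.tensor([[1, 2], [2, 3]], dtype=torch.float32)
eps = 1e-5  # assumed to match the default eps of both PyTorch layers

# Layer norm: statistics over the features of each example (last axis)
ln_mean = X.mean(dim=-1, keepdim=True)
ln_var = X.var(dim=-1, unbiased=False, keepdim=True)
ln_manual = (X - ln_mean) / torch.sqrt(ln_var + eps)

# Batch norm (training mode): statistics over the batch for each feature
bn_mean = X.mean(dim=0, keepdim=True)
bn_var = X.var(dim=0, unbiased=False, keepdim=True)
bn_manual = (X - bn_mean) / torch.sqrt(bn_var + eps)

# Compare against freshly initialized PyTorch layers
ln_matches = torch.allclose(ln_manual, nn.LayerNorm(2)(X), atol=1e-5)
bn_matches = torch.allclose(bn_manual, nn.BatchNorm1d(2)(X), atol=1e-5)
print('layer norm matches:', ln_matches)
print('batch norm matches:', bn_matches)
```

Each row of `X` differs by one between its two features, so layer norm maps every row to roughly (-1, 1). Each column differs by one between the two examples, so batch norm maps every column to roughly (-1, 1) instead.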


In [5]:
class AddNorm(nn.Module):  #@save
    """The residual connection followed by layer normalization."""
    def __init__(self, norm_shape, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm(norm_shape)

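    def forward(self, X, Y):
        # X is the sublayer input and Y the sublayer output (same shape as X).
        # Apply dropout to Y, add the residual X, then layer-normalize the sum.
        return self.ln(self.dropout(Y) + X)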
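
As a usage sketch, the snippet below passes two tensors of the same shape through `AddNorm` and checks that the output keeps that shape. The class is repeated from the cell above only so that the snippet runs on its own; the sizes, the dropout rate of 0.5, and the call to `eval()` (which disables dropout to make the result deterministic) are choices made here for illustration.

```python
import torch
from torch import nn

class AddNorm(nn.Module):
    """The residual connection followed by layer normalization."""
    def __init__(self, norm_shape, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm(norm_shape)

    def forward(self, X, Y):
        return self.ln(self.dropout(Y) + X)

torch.manual_seed(0)

add_norm = AddNorm(4, 0.5)
add_norm.eval()  # disable dropout so the output is deterministic

X = torch.randn(2, 3, 4)  # sublayer input
Y = torch.randn(2, 3, 4)  # sublayer output; must have the same shape as X

out = add_norm(X, Y)
print(out.shape)
```

The residual addition requires `X` and `Y` to have the same shape, which is why the sublayers in a Transformer layer preserve the feature dimension of their input.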