# **Transformer Architecture**

### **Encoder-Decoder Architecture**

It is used for **sequence-to-sequence** tasks like machine translation.

This is the architecture proposed by the original transformer paper [Attention is All You Need](https://arxiv.org/abs/1706.03762) by Vaswani et al. in 2017.

<img src="assets/encoder_decoder.png" alt="Encoder-Decoder Architecture" style="background-color:white;" height="500" />

### **Decoder-Only Architecture**

It is used for **sequence generation** tasks like language modeling.

This simpler architecture will be implemented in this notebook from scratch.

<img src="assets/decoder_only.png" alt="Decoder-Only Architecture" style="background-color:white;" height="500" />

In [1]:
import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

cuda


## **Scaled Dot-Product Attention**

Let $Q \in \mathbb{R}^{m \times d_k}$, $K \in \mathbb{R}^{n \times d_k}$ y $V \in \mathbb{R}^{n \times d_v}$ be the query, key and value matrices, respectively. The scaled dot-product attention is defined as:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \in \mathbb{R}^{m \times d_v}
$$

<img src="assets/self_attention.png" alt="Decoder-Only Architecture" style="background-color:white;" height="500" />

In [2]:
class ScaledDotProductAttention(nn.Module):

    def __init__(self, embeddingSize: int, headSize: int, blockSize: int, dropout: float) -> None:
        super().__init__()
        self.key = nn.Linear(embeddingSize, headSize, bias=False)
        self.query = nn.Linear(embeddingSize, headSize, bias=False)
        self.value = nn.Linear(embeddingSize, headSize, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(blockSize, blockSize)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)
        v = self.value(x)
        weights = q @ k.transpose(-2, -1) / (C ** 0.5)
        weights = weights.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        weights = F.softmax(weights, dim=-1)
        weights = self.dropout(weights)
        output = weights @ v
        return output

## **Multi-Head Attention**

Let $h$ be the number of heads. The multi-head attention is defined as:

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O \in \mathbb{R}^{m \times d_v}
$$

where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$ and $W_i^Q \in \mathbb{R}^{d_k \times d_k}$, $W_i^K \in \mathbb{R}^{d_k \times d_k}$, $W_i^V \in \mathbb{R}^{d_v \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_v}$ are the learnable parameters.

<img src="assets/multi_head_attention.png" alt="Decoder-Only Architecture" style="background-color:white;" height="500" />

In [3]:
class MultiHeadAttention(nn.Module):

    def __init__(self, nHeads: int, embeddingSize: int, headSize: int, blockSize: int, dropout: float) -> None:
        super().__init__()
        self.heads = nn.ModuleList([ScaledDotProductAttention(embeddingSize, headSize, blockSize, dropout) for _ in range(nHeads)])
        self.projection = nn.Linear(embeddingSize, embeddingSize)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        output = torch.cat([head(x) for head in self.heads], dim=-1)
        output = self.projection(output)
        return output

## **Position-wise Feed-Forward Networks**

The position-wise feed-forward networks are defined as:

$$
\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2
$$

where $W_1 \in \mathbb{R}^{d_{ff} \times d_{model}}$, $b_1 \in \mathbb{R}^{d_{ff}}$, $W_2 \in \mathbb{R}^{d_{model} \times d_{ff}}$ and $b_2 \in \mathbb{R}^{d_{model}}$ are the learnable parameters.

In [4]:
class FeedForward(nn.Module):

    def __init__(self, embeddingSize: int, dropout: float) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embeddingSize, 4 * embeddingSize),
            nn.ReLU(),
            nn.Linear(4 * embeddingSize, embeddingSize),
            nn.Dropout(dropout)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

## **Layer Normalization**

The layer normalization is defined as:

$$
\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sigma} + \beta
$$

where $\gamma \in \mathbb{R}^{d_{model}}$ and $\beta \in \mathbb{R}^{d_{model}}$ are the learnable parameters, and $\mu$ and $\sigma$ are the mean and standard deviation of $x$, respectively.

In the original paper, the layer normalization is applied after the residual connections, whereas in more modern implementations, it is applied before in what is known as **pre-norm formulation**.

```python
# Original paper
x = LayerNorm(x + MultiHeadAttention(x))
x = LayerNorm(x + FeedForward(x))

# Pre-norm formulation
x = x + MultiHeadAttention(LayerNorm(x))
x = x + FeedForward(LayerNorm(x))
```

In [5]:
class Layer(nn.Module):

    def __init__(self, nHeads: int, embeddingSize: int, blockSize: int, dropout: float) -> None:
        super().__init__()
        headSize = embeddingSize // nHeads
        self.multiHeadAttention = MultiHeadAttention(nHeads, embeddingSize, headSize, blockSize, dropout)
        self.feedForward = FeedForward(embeddingSize, dropout)
        self.layerNorm1 = nn.LayerNorm(embeddingSize)
        self.layerNorm2 = nn.LayerNorm(embeddingSize)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.multiHeadAttention(self.layerNorm1(x))
        x = x + self.feedForward(self.layerNorm2(x))
        return x

## **Language Model**

In [6]:
class GPTLanguageModel(nn.Module):

    def __init__ (self, vocabularySize: int, nLayers: int, nHeads: int, embeddingSize: int, blockSize: int, dropout: float) -> None:
        super().__init__()
        self.tokenEmbeddingTable = nn.Embedding(vocabularySize, embeddingSize)
        self.positionEmbeddingTable = nn.Embedding(blockSize, embeddingSize)
        self.layers = nn.Sequential(*[Layer(nHeads, embeddingSize, blockSize, dropout) for _ in range(nLayers)])
        self.layerNorm = nn.LayerNorm(embeddingSize)
        self.linearModelHead = nn.Linear(embeddingSize, vocabularySize)
        self.blockSize = blockSize
        self.apply(self._init_weights)
    
    def _init_weights(self, module: nn.Module) -> None:
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx: torch.Tensor, targets: torch.Tensor = None) -> tuple[torch.Tensor, torch.Tensor]:
        B, T = idx.shape
        tokenEmbeddings = self.tokenEmbeddingTable(idx)
        positionEmbeddings = self.positionEmbeddingTable(torch.arange(T, device=device))
        x = tokenEmbeddings + positionEmbeddings
        x = self.layers(x)
        x = self.layerNorm(x)
        logits = self.linearModelHead(x)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx: torch.Tensor, nTokens: int) -> torch.Tensor:
        self.eval()
        with torch.no_grad():
            for _ in range(nTokens):
                logits, loss = self(idx[:, -self.blockSize:])
                logits = logits[:, -1, :]
                probabilities = F.softmax(logits, dim=-1)
                nextIdx = torch.multinomial(probabilities, num_samples=1)
                idx = torch.cat([idx, nextIdx], dim=1)
        self.train()
        return idx

## **Text Generation**

In [7]:
dataDir = "data"
file = os.path.join(dataDir, "martin_fierro.txt")
with open(file, "r") as f:
    text = f.read()
print(f"Text length: {len(text)}")
print(text[:250])

Text length: 187096
I

Aquí me pongo a cantar
al compás de la vigüela,
que el hombre que lo desvela
una pena estrordinaria,
como la ave solitaria
con el cantar se consuela.

Pido a los santos del cielo
que ayuden mi pensamiento:
les pido en este momento
que voy a cantar


In [8]:
vocabulary = sorted(set(text))
vocabularySize = len(vocabulary)
print(f"Vocabulary size: {vocabularySize}")
print("".join(vocabulary))

Vocabulary size: 72

 !"'(),-.:;?ABCDEFGHIJLMNOPQRSTUVXYZabcdefghijklmnopqrstuvxyz¡¿Ñáéíñóúü


In [9]:
stringToToken = {ch: i for i, ch in enumerate(vocabulary)}
tokenToString = {i: ch for i, ch in enumerate(vocabulary)}
encode = lambda string: [stringToToken[char] for char in string]
decode = lambda tokens: "".join([tokenToString[token] for token in tokens])
print(encode("Los hermanos sean unidos"))
print(decode(encode("Los hermanos sean unidos")))

[23, 51, 55, 1, 44, 41, 54, 49, 37, 50, 51, 55, 1, 55, 41, 37, 50, 1, 57, 50, 45, 40, 51, 55]
Los hermanos sean unidos


In [10]:
data = torch.tensor(encode(text), dtype=torch.long)
print(f"Data shape: {data.shape}")
print(f"Data type: {data.dtype}")
print(data[:100])

Data shape: torch.Size([187096])
Data type: torch.int64
tensor([21,  0,  0, 13, 53, 57, 67,  1, 49, 41,  1, 52, 51, 50, 43, 51,  1, 37,
         1, 39, 37, 50, 56, 37, 54,  0, 37, 48,  1, 39, 51, 49, 52, 65, 55,  1,
        40, 41,  1, 48, 37,  1, 58, 45, 43, 71, 41, 48, 37,  7,  0, 53, 57, 41,
         1, 41, 48,  1, 44, 51, 49, 38, 54, 41,  1, 53, 57, 41,  1, 48, 51,  1,
        40, 41, 55, 58, 41, 48, 37,  0, 57, 50, 37,  1, 52, 41, 50, 37,  1, 41,
        55, 56, 54, 51, 54, 40, 45, 50, 37, 54])


In [11]:
trainValSplit = 0.9
trainSize = int(len(data) * trainValSplit)
trainData = data[:trainSize]
valData = data[trainSize:]
print(f"Train data shape: {trainData.shape}")
print(f"Validation data shape: {valData.shape}")

Train data shape: torch.Size([168386])
Validation data shape: torch.Size([18710])


In [12]:
blockSize = 256
embeddingSize = 384
nHeads = 6
nLayers = 6
dropout = 0.2

model = GPTLanguageModel(vocabularySize, nLayers, nHeads, embeddingSize, blockSize, dropout).to(device)
print(f"{sum(p.numel() for p in model.parameters())/1e6:.2f}M parameters")

10.79M parameters


In [13]:
def getBatch(data: torch.Tensor,
             batchSize: int,
             blockSize: int
             ) -> tuple[torch.Tensor, torch.Tensor]:
    ix = torch.randint(0, data.size(0) - blockSize, (batchSize,))
    x = torch.stack([data[i:i+blockSize] for i in ix])
    y = torch.stack([data[i+1:i+blockSize+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y


@torch.no_grad()
def estimateLoss(model: nn.Module,
                 data: torch.Tensor,
                 evalIter: int,
                 batchSize: int,
                 blockSize: int
                ) -> float:
    model.eval()
    losses = torch.zeros(evalIter)
    for i in range(evalIter):
        xBatch, yBatch = getBatch(data, batchSize, blockSize)
        logits, loss = model(xBatch, yBatch)
        losses[i] = loss.item()
    model.train()
    return losses.mean().item()

In [14]:
batchSize = 64
learningRate = 3e-4
maxIter = 1000
evalInterval = 500
evalIter = 200

optimizer = optim.Adam(model.parameters(), lr=learningRate)

for iter in range(maxIter):
    if iter % evalInterval == 0 or iter == maxIter - 1:
        trainLoss = estimateLoss(model, trainData, evalIter, batchSize, blockSize)
        valLoss = estimateLoss(model, valData, evalIter, batchSize, blockSize)
        print(f"Iter: {iter}, Train loss: {trainLoss}, Val loss: {valLoss}")
    xBatch, yBatch = getBatch(trainData, batchSize, blockSize)
    logits, loss = model(xBatch, yBatch)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

modelsDir = "models"
torch.save(model.state_dict(), os.path.join(modelsDir, "martin_fierro.pt"))

Iter: 0, Train loss: 4.37283182144165, Val loss: 4.376593112945557
Iter: 500, Train loss: 1.8056716918945312, Val loss: 1.8216660022735596
Iter: 999, Train loss: 0.9794025421142578, Val loss: 1.5681744813919067


In [19]:
context = "Los hermanos sean unidos"
context = torch.tensor(encode(context), dtype=torch.long).unsqueeze(0).to(device)
generatedText = decode(model.generate(context, nTokens=10000)[0].tolist())
print(generatedText)
outputDir = "outputs"
with open(os.path.join(outputDir, "martin_fierro.txt"), "w") as f:
    f.write(generatedText)

Los hermanos sean unidos
que nos más pobres que una últifición,
para más mucho de un rerecho
como una ajuel más...

Yo no gato no queré
cuando el canto se atenció;
lo vista en la ocasión
para andar dás.
Me gente mi solza
y lo ponga sus manejos;
pero Dios haron de vistos
y perdonó el chiquito es manes.
úspren los que azon pierdo
que los es pesaren el malento,
Y que es el indio los berral
dende que es amorenan.

Y ahi con esta el mior
fue, altante con la rodel;
no he punto el encordidan,
empeñéan que allí tratapao
por vengar nuestracia.


En esa ima flortera
con mi gana pao andar,
y un salvaje medio flordar
y aunque me me ata amagocía;
por no me pude no arracha
a saber algún memoro.

Una hay veces me atropel
gritando me he de eseguida;
aunque ya veo me afroja
de enseñarle a hude noche
iba a dista corrarer
de grina, me me enseñura,
cuando no hablar al calma,
es poneces al corraje.


Y en aquel puerte negro
le echaron corrorror,
mas como más corroro;
soy en es rogro ganaos;
tan gamas si el