# Attention is all you need - the original transformer from scratch

The following notebook attempts to recreate the original transformer from the famous "Attention is all you need" paper, from scratch using PyTorch.

The first part defines the model through a class for each subcomponent. The second part runs a small experiment with the defined model. It uses a dataset from Kaggle, which can be found in the following liondrive folder:

https://drive.google.com/drive/folders/1DsMzK_HJE1vsNFXtlpTuhbdihWEDDPrQ?usp=sharing


### Acknowledgments:

- This notebook is partly adapted from the following Medium article:
https://medium.com/@sdwzh2725/transformer-code-step-by-step-understandingtransformer-d2ea773f15fa
- The notebook also reuses code snippets from various homework assignments of the Applied Deep Learning course by Andrei Simion.
- The dataset used in the second part comes from the following Kaggle page:
https://www.kaggle.com/datasets/bwandowando/479k-english-words?resource=download&select=words_alpha.txt

In [286]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# Part 1: Define the model

## Step 1: Embedding

First we define the embedding for each token. The embedding consists of two parts:
- A vector of learnable weights, for each token in the vocabulary.
- A (fixed) positional embedding providing information about the position of each token to the model.

The embedding for each token is then defined as the sum of these two components.
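
For reference, the fixed positional component follows the sinusoidal formulas of the paper, where pos is the position in the sequence and 2i, 2i+1 denote an even embedding dimension and the odd dimension that follows it:

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```

Note that the exponent uses the even dimension index 2i itself, so a loop that steps directly over the even indices divides that index by d_model without doubling it again.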

In [287]:
class TokenEmbedding(nn.Module):
    """
        Token embedding: a learnable embedding per token plus the fixed sinusoidal
        positional encoding computed with the two formulas from the paper.

        input size: [batch_size, seq_length]
        return size: [batch_size, seq_length, d_model]

        Args:
            d_model: dimension of the embedding vector for each input token.
            max_len: maximum length of an input sequence.
            vocab_size: number of tokens in the vocabulary.
            drop_out: dropout probability, i.e. the fraction of activations set to zero. For regularization purposes.
    """

    def __init__(self, d_model, max_len, vocab_size, drop_out = 0.1):
        super().__init__()

        # Learnable embedding: vector of learnable weights for each token
        self.learnable_embed = nn.Embedding(vocab_size, d_model)

        # Positional embedding
        self.d_model = d_model
        self.max_len = max_len

        positional_enc = torch.zeros(max_len, d_model)
        for pos in range(max_len):
            # i already steps over the even dimension indices (the paper's "2i"),
            # so the exponent is i / d_model. This assumes d_model is even.
            for i in range(0, d_model, 2):
                angle = pos / (10000 ** (i / d_model))
                positional_enc[pos, i] = math.sin(angle)
                positional_enc[pos, i + 1] = math.cos(angle)

        # The size of the positional encoding matrix is [max_len, d_model].
        print(f"positional encoding size：{positional_enc.size()}")

        # Register the matrix as a buffer: it is stored on the module as self.positional_enc
        # (so it can be used in forward and moves with the model across devices), but it is
        # not a parameter and is therefore not updated during training.
        self.register_buffer('positional_enc', positional_enc)

        # Dropout: for regularization
        self.dropout = nn.Dropout(drop_out)

    def forward(self, x):
        # The input x to the position code is [batch_size, seq_len], where seq_len is the length of the sentence
        batch_size, seq_len = x.size()

        # Add learnable embedding and positional encoding for each token in the batch, and apply dropout.
        return self.dropout(self.learnable_embed(x) + self.positional_enc[:seq_len, :])
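
The nested Python loops over positions and dimensions can become slow for large `max_len` and `d_model`. A vectorized variant might look like the following sketch. The function name `sinusoidal_table` is hypothetical (it is not part of the notebook), and `d_model` is assumed to be even.

```python
import math
import torch

def sinusoidal_table(max_len, d_model):
    """Sketch: build the [max_len, d_model] sinusoidal table without Python loops."""
    positions = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # [max_len, 1]
    even_dims = torch.arange(0, d_model, 2, dtype=torch.float32)         # [d_model / 2]
    # 1 / 10000^(even_dim / d_model), computed via exp/log
    inv_freq = torch.exp(-math.log(10000.0) * even_dims / d_model)
    angles = positions * inv_freq                                        # [max_len, d_model / 2]

    table = torch.zeros(max_len, d_model)
    table[:, 0::2] = torch.sin(angles)  # even dimensions
    table[:, 1::2] = torch.cos(angles)  # odd dimensions
    return table

table = sinusoidal_table(max_len=6, d_model=8)
print(table.shape)
```

At position 0 every sine entry is 0 and every cosine entry is 1, which is a quick sanity check for any implementation of the table.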

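## Step 2: Self-attention in the encoder

The encoder block employs unmasked self-attention, which this notebook refers to as cross-self-attention: every token can attend to every token in the same sequence. First, we define a single attention head. Second, we combine multiple attention heads to create a multi-head attention sublayer.

First, a single attention head (named `CrossAttentionHead` in the code):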

In [288]:
class CrossAttentionHead(nn.Module):
    """
    This class represents one head of unmasked self-attention, as used in the encoder.

    input size: [batch_size, seq_length, d_model]
    return size: [batch_size, seq_length, d_head]

    Args:
        d_head: dimension of each key, query and value vector.
        d_model: embedding dimension
        drop_out: percentage of attention weights to "drop out", i.e. set to zero. For regularization purposes.
    """

    def __init__(self, d_head, d_model, dropout = 0.1):
        super().__init__()
        self.d_head = d_head

        # Matrices to map each embedding to a key, query and value of dimension d_head
        self.W_K = nn.Linear(d_model, d_head, bias=False)
        self.W_Q = nn.Linear(d_model, d_head, bias=False)
        self.W_V = nn.Linear(d_model, d_head, bias=False)

        # Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # B = batch_size, T = block_size, where block_size corresponds to the longest sequence length in the batch
        B, T, d = x.shape

        # Get the key, query and value representations from the embedding x
        k = self.W_K(x)
        q = self.W_Q(x)
        v = self.W_V(x)

        # Compute attention scores
        # Transpose K is done over dimensions -2 and -1, i.e. (sequence_length, d_head)
        scores = q @ k.transpose(-2,-1) / (self.d_head ** 0.5)

        # Apply softmax to the final dimension of scores, i.e. over each row of each score matrix
        attention = F.softmax(scores, dim=-1)

        # Apply dropout
        attention = self.dropout(attention)

        # Calculate new representation for each token, using attention scores and value vectors
        out = attention @ v
        return out
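
In formula form, the head above computes the scaled dot-product attention from the paper, with Q = x W_Q, K = x W_K, V = x W_V and d_k equal to `d_head` (the dropout on the attention weights is omitted here):

```latex
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```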

Second, we define a multi-head-attention sublayer.

In [289]:
class MultiHeadAttentionEnc(nn.Module):
    """
    This class represents a multi-head-attention sublayer, employing cross-attention heads.

    input size: [batch_size, seq_length, d_model]
    return size: [batch_size, seq_length, d_model]

    Args:
        d_head: dimension of each key, query and value vector in each attention head
        d_model: embedding dimension
        num_heads: number of attention heads
        drop_out: drop_out in linear map after attention, for regularization.

    """

    def __init__(self, num_heads, d_head, d_model, dropout=0.1):
        super().__init__()
        # Create num_heads attention heads
        self.heads = nn.ModuleList([CrossAttentionHead(d_head, d_model) for _ in range(num_heads)])

        # This is to project back to the dimension of d_model.
        self.W_O = nn.Linear(d_head * num_heads, d_model, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Concatenate the different representations per head along the last dimension
        out = torch.cat([h(x) for h in self.heads], dim=-1)

        # Project the concatenation and apply dropout
        out = self.dropout(self.W_O(out))
        return out
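
Written out, the sublayer concatenates the outputs of the h = `num_heads` heads and projects them with W^O. This corresponds to the multi-head formula from the paper, where Attention denotes scaled dot-product attention and the per-head projections live inside each head:

```latex
\text{MultiHead}(x) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O,
\qquad \text{head}_i = \text{Attention}(x W_i^Q,\; x W_i^K,\; x W_i^V)
```

With `d_model=256` and 4 heads, the values used later in this notebook, each head has dimension 64 and the concatenation has dimension 256 again.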

## Step 4: Feed-Forward Neural Network

Next, we define the feed-forward neural network that is used both in the encoder and decoder.

In [290]:
class FeedForwardNet(nn.Module):
    """
    Feed forward neural network that takes in a representation for a certain position in the sequence,
    and returns a new one. The same network is applied to each position in the sequence.

    input size: [batch_size, seq_length, d_model]
    return size: [batch_size, seq_length, d_model]

    Args:
        d_model: embedding dimension
        drop_out: drop_out, for regularization.
    """

    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        # As in "Attention is all you need" paper, hidden dimension is 4x embedding dimension.
        d_hidden = 4 * d_model

        # The simple feed forward net.
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),
            nn.Dropout(dropout)
        )
    def forward(self, x):
        return self.ff(x)
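
As a formula, the position-wise network corresponds to the following expression from the paper, where W_1 maps from d_model to 4 d_model and W_2 maps back (the final dropout is specific to this implementation):

```latex
\text{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2
```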

## Step 5: Define an encoder block

Next, we define an encoder block using the modules defined above. We use the built-in LayerNorm module from PyTorch to implement the Add & Norm steps.

In [291]:
class EncoderBlock(nn.Module):
    """
    This defines an Encoder Block. In the final encoder, multiple of these blocks are stacked on top of each other

    input size: [batch_size, seq_length, d_model]
    return size: [batch_size, seq_length, d_model]

    Args:
        d_model: embedding dimension
        n_head: number of attention heads; d_model is assumed to be divisible by n_head.
    """

    def __init__(self, d_model, n_head):
        super().__init__()
        # We will assume that d_model is divisible by n_head (number of heads), to determine the dimension of each attention head.
        d_head = d_model // n_head

        # Define attention head, feed forward net, and layernorm layers.
        self.multi_cross_att = MultiHeadAttentionEnc(n_head, d_head, d_model)
        self.FFN = FeedForwardNet(d_model)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # add & norm after multi-head attention
        x = self.ln1(x + self.multi_cross_att(x))
        # add & norm after feed forward net
        x = self.ln2(x + self.FFN(x))
        return x
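
Each sublayer is thus wrapped in a residual connection followed by layer normalization, as in the paper, where Sublayer stands for either the multi-head attention or the feed-forward net:

```latex
x \leftarrow \text{LayerNorm}\big(x + \text{Sublayer}(x)\big)
```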


## Step 6: define masked-self-attention.

The decoder uses masked-self-attention. Here we thus define a masked-self-attention head.

In [292]:
class MaskedSelfAttentionHead(nn.Module):
    """
    This class represents one head of masked self-attention, as used in the decoder.

    input size: [batch_size, seq_length, d_model]
    return size: [batch_size, seq_length, d_head]

    Args:
        d_head: dimension of each key, query and value vector.
        d_model: embedding dimension
        drop_out: percentage of attention weights to "drop out", i.e. set to zero. For regularization purposes.
    """

    def __init__(self, d_head, d_model, dropout = 0.1):
        super().__init__()
        self.d_head = d_head

        # Matrices to map each embedding to a key, query and value of dimension d_head
        self.W_K = nn.Linear(d_model, d_head, bias=False)
        self.W_Q = nn.Linear(d_model, d_head, bias=False)
        self.W_V = nn.Linear(d_model, d_head, bias=False)

        # Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # B = batch_size, T = block_size, where block_size corresponds to the longest sequence length in the batch
        B, T, d = x.shape

        # Get the key, query and value representations from the embedding x
        k = self.W_K(x)
        q = self.W_Q(x)
        v = self.W_V(x)

        # Compute attention scores
        # Transpose K is done over dimensions -2 and -1, i.e. (sequence_length, d_head)
        scores = q @ k.transpose(-2,-1) / (self.d_head ** 0.5)

        # Masked! Apply a mask to prevent attending to future tokens
        mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1)  # Upper triangular mask
        scores = scores.masked_fill(mask.bool(), float('-inf'))

        # Apply softmax to the final dimension of scores, i.e. over each row of each score matrix
        attention = F.softmax(scores, dim=-1)

        # Apply dropout
        attention = self.dropout(attention)

        # Calculate new representation for each token, using attention scores and value vectors
        out = attention @ v
        return out
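
To see the effect of the mask in isolation, the following pure-Python sketch applies the same idea to a small score matrix: entries above the diagonal are set to negative infinity before the softmax, so they receive zero weight. The helper name `causal_softmax` is hypothetical and the all-zero scores are made up for illustration.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax in which position t may only attend to positions <= t."""
    weights = []
    for t, row in enumerate(scores):
        # Mask out future positions (column index j > t) with -inf
        masked = [s if j <= t else float("-inf") for j, s in enumerate(row)]
        peak = max(masked)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

uniform_scores = [[0.0] * 4 for _ in range(4)]
for row in causal_softmax(uniform_scores):
    print([round(w, 2) for w in row])
```

With equal scores, row t spreads its weight uniformly over positions 0 to t: the first row puts all weight on position 0, the second row puts 0.5 on each of the first two positions, and so on. Note that every position can still attend to itself.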

## Step 7: masked multi-head attention

Combine multiple masked self-attention heads into one multi-head-attention sublayer.

In [293]:
class MaskedMultiHeadAttention(nn.Module):
    """
    This class represents a masked multi-head-attention sublayer, employing masked self-attention heads.
    This class is the same as the MultiHeadAttentionEnc class; the only difference is the use of
    MaskedSelfAttentionHead instead of CrossAttentionHead.

    input size: [batch_size, seq_length, d_model]
    return size: [batch_size, seq_length, d_model]

    Args:
        d_head: dimension of each key, query and value vector in each attention head
        d_model: embedding dimension
        num_heads: number of attention heads
        drop_out: drop_out in linear map after attention, for regularization.

    """

    def __init__(self, num_heads, d_head, d_model, dropout=0.1):
        super().__init__()
        # Create num_heads attention heads
        self.heads = nn.ModuleList([MaskedSelfAttentionHead(d_head, d_model) for _ in range(num_heads)])

        # This is to project back to the dimension of d_model.
        self.W_O = nn.Linear(d_head * num_heads, d_model, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Concatenate the different representations per head along the last dimension
        out = torch.cat([h(x) for h in self.heads], dim=-1)

        # Project the concatenation and apply dropout
        out = self.dropout(self.W_O(out))
        return out

## Step 8: Cross-attention for decoder (second attention sublayer in decoder)

Now we define the attention head for the second attention sublayer of the decoder. This head uses the encoder output to compute the keys and values, and the output (after add & norm) of the preceding masked attention sublayer in the decoder block to compute the queries.

In [294]:
class CrossAttentionHeadDec(nn.Module):
    """
    This class represents one head of cross-attention in the decoder.

    input size:
      - x (queries from decoder): [batch_size, seq_length_dec, d_model]
      - encoder_output (keys and values from encoder): [batch_size, seq_length_enc, d_model]

    return size: [batch_size, seq_length_dec, d_head]

    Args:
        d_head: dimension of each key, query, and value vector.
        d_model: embedding dimension.
        dropout: percentage of attention weights to "drop out".
    """

    def __init__(self, d_head, d_model, dropout=0.1):
        super().__init__()
        self.d_head = d_head

        # Linear projections for queries, keys, and values
        self.W_Q = nn.Linear(d_model, d_head, bias=False)  # Decoder input -> Queries
        self.W_K = nn.Linear(d_model, d_head, bias=False)  # Encoder output -> Keys
        self.W_V = nn.Linear(d_model, d_head, bias=False)  # Encoder output -> Values

        # Dropout layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, encoder_output):
        B, T_dec, d = x.shape  # B = batch size, T_dec = decoder seq length, d = embedding dim
        _, T_enc, _ = encoder_output.shape  # T_enc = encoder sequence length

        # Compute queries (decoder), keys (encoder), and values (encoder)
        q = self.W_Q(x)  # [B, T_dec, d_head]
        k = self.W_K(encoder_output)  # [B, T_enc, d_head]
        v = self.W_V(encoder_output)  # [B, T_enc, d_head]

        # Compute attention scores (scaled dot-product attention)
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)

        # Apply softmax to normalize the attention scores
        attention = F.softmax(scores, dim=-1)  # [B, T_dec, T_enc]

        # Apply dropout to the attention weights
        attention = self.dropout(attention)

        # Compute output as weighted sum of values
        out = attention @ v

        return out
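
Because the queries come from the decoder and the keys and values from the encoder, the two sequence lengths may differ. The sketch below uses random tensors with made-up sizes in place of the projected `q`, `k` and `v` to show the resulting shapes; it is an illustration, not part of the model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T_dec, T_enc, d_head = 2, 3, 5, 8  # made-up sizes for illustration

q = torch.randn(B, T_dec, d_head)  # stands in for W_Q(x)
k = torch.randn(B, T_enc, d_head)  # stands in for W_K(encoder_output)
v = torch.randn(B, T_enc, d_head)  # stands in for W_V(encoder_output)

scores = q @ k.transpose(-2, -1) / (d_head ** 0.5)  # [B, T_dec, T_enc]
weights = F.softmax(scores, dim=-1)                 # each decoder position distributes weight over encoder positions
out = weights @ v                                   # [B, T_dec, d_head]

print(weights.shape, out.shape)
```

The output has one row per decoder position, regardless of the encoder sequence length.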


## Step 9: Second multi-head-attention in decoder

Next, we combine multiple heads into the second multi-head-attention sublayer in the decoder.

In [295]:
class MultiHeadAttentionDec(nn.Module):
    """
    This class represents the second multi-head-attention sublayer in the decoder.

    input size:
      - x (queries from decoder): [batch_size, seq_length_dec, d_model]
      - encoder_output (keys and values from encoder): [batch_size, seq_length_enc, d_model]

    return size: [batch_size, seq_length_dec, d_model]

    Args:
        num_heads: number of attention heads.
        d_head: dimension of each key, query, and value vector.
        d_model: embedding dimension.
        dropout: dropout probability applied after the output projection.

    """

    def __init__(self, num_heads, d_head, d_model, dropout=0.1):
        super().__init__()
        # Create num_heads attention heads
        self.heads = nn.ModuleList([CrossAttentionHeadDec(d_head, d_model) for _ in range(num_heads)])

        # This is to project back to the dimension of d_model.
        self.W_O = nn.Linear(d_head * num_heads, d_model, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, encoder_output):
        # Concatenate the different representations per head along the last dimension
        out = torch.cat([h(x, encoder_output) for h in self.heads], dim=-1)

        # Project the concatenation and apply dropout
        out = self.dropout(self.W_O(out))
        return out

## Step 10: define a Decoder Block

Here we define a decoder block. Multiple of these are stacked on top of each other in the decoder.

In [296]:
class DecoderBlock(nn.Module):
    """
    This defines a Decoder Block. In the final decoder, multiple of these blocks are stacked on top of each other

    input size:
      - x (queries from decoder): [batch_size, seq_length_dec, d_model]
      - encoder_output (keys and values from encoder): [batch_size, seq_length_enc, d_model]

    return size: [batch_size, seq_length_dec, d_model]

    Args:
        d_model: embedding dimension.
        n_head: number of attention heads; d_model is assumed to be divisible by n_head.
    """

    def __init__(self, d_model, n_head):
        super().__init__()
        # We will assume that d_model is divisible by n_head (number of heads), to determine the dimension of each attention head.
        d_head = d_model // n_head

        # Define the two attention sublayers, feed forward net, and layernorm layers.
        self.multi_masked_att = MaskedMultiHeadAttention(n_head, d_head, d_model)
        self.multi_cross_att = MultiHeadAttentionDec(n_head, d_head, d_model)
        self.FFN = FeedForwardNet(d_model)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.ln3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_output):
        # add & norm after masked multi-head attention
        x = self.ln1(x + self.multi_masked_att(x))
        # add & norm after second multi-head attention layer
        x = self.ln2(x + self.multi_cross_att(x, enc_output))
        # add & norm after feed forward net
        x = self.ln3(x + self.FFN(x))
        return x


## Full transformer model

Finally, we can now stack multiple encoder and decoder blocks to create the transformer model.

In [297]:
class Transformer(nn.Module):
    def __init__(self, d_model, max_len, vocab_size, n_layer, n_head,  drop_out = 0.1):
        super().__init__()

        # Embedding layer
        self.embedding = TokenEmbedding(d_model, max_len, vocab_size)
        # Define encoder as multiple encoder blocks stacked on top of each other
        self.encoder = nn.Sequential(*[EncoderBlock(d_model, n_head) for _ in range(n_layer)])
        # Define decoder as multiple decoder blocks stacked on top of each other
        self.decoder_blocks = nn.ModuleList([DecoderBlock(d_model, n_head) for _ in range(n_layer)])

        # final linear layer to predict next token
        self.lin = nn.Linear(d_model, vocab_size)

    def forward(self, input, output):
        # get embeddings
        x = self.embedding(input)
        y = self.embedding(output)

        # step through encoder
        enc_output = self.encoder(x)
        # step through decoder
        dec_output = y
        for decoder_block in self.decoder_blocks:
            dec_output = decoder_block(dec_output, enc_output)


        # apply final linear layer to get logits
        logits = self.lin(dec_output)

        return logits


In [298]:
transformer = Transformer(d_model=256, max_len=6, vocab_size=10000, n_layer=3, n_head=4)

positional encoding size：torch.Size([6, 256])


# Part 2: Experimenting with the model

In this part we run a small experiment with the model. We load a dataset of words and train the transformer to reverse a word.

That is, given the input "hello", the transformer should output "olleh".

First, we import a dataset of English words. This dataset was found on Kaggle:
https://www.kaggle.com/datasets/bwandowando/479k-english-words?resource=download&select=words_alpha.txt

Note that we use individual characters as our tokens for this application.

From this dataset, we construct a mapping from characters to integers and vice versa (stoi and itos).

We wrap the output in the special token ".", which serves as both start and stop token.

For this code cell to work, you have to run this notebook in Google Colab and have the dataset in your Google Drive. The cell will also ask for access to your Google Drive.

In [299]:
from google.colab import drive
from collections import defaultdict

# Dictionaries, {number -> character} and {character -> number}
itos = defaultdict(int)
stoi = defaultdict(int)

# '.' serves as both START and STOP token and gets index 0
stoi['.'] = 0
itos[0] = '.'
encode = lambda s: [stoi[c] for c in s] # Encoder: take a string, output a list of integers
decode = lambda l: ''.join([str(itos[i]) for i in l]) # Decoder: take a list of integers, output a string

drive.mount('/content/drive')


with open('/content/drive/MyDrive/words_alpha.txt', 'r') as f:
    words = f.readlines()

# Assign the next free index to every character not seen before
for word in words:
    word = word.lower().strip()
    for c in word:
        if c not in stoi:
            stoi[c] = len(stoi)
            itos[len(itos)] = c
            assert len(stoi) == len(itos)

vocab_size = len(itos)
print(vocab_size)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
27
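
The character-level mapping can be tried without the dataset. The following minimal sketch uses a made-up three-word list and plain dictionaries; unlike `defaultdict(int)`, a plain `dict` raises a `KeyError` for an unknown character instead of silently mapping it to index 0. The word list and the function-style `encode`/`decode` are assumptions for illustration.

```python
words = ["cab", "bad", "ace"]  # made-up word list standing in for words_alpha.txt

# Index 0 is reserved for the '.' start/stop token; characters get 1, 2, ...
chars = sorted(set("".join(words)))
stoi = {".": 0}
stoi.update({c: i + 1 for i, c in enumerate(chars)})
itos = {i: c for c, i in stoi.items()}

def encode(text):
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in text]

def decode(ids):
    """Map a list of integer token ids back to a string."""
    return "".join(itos[i] for i in ids)

print(encode(".bac."))          # [0, 2, 1, 3, 0]
print(decode(encode(".bac.")))  # .bac.
```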


We create a training dataset of pairs of words and their reversed versions.
For simplicity, we only take words with more than 1 and fewer than 5 characters.

In [300]:
# Create the training dataset
training_data = []
for word in words:
    word = word.lower().strip()

    # only use words of more than 1 and fewer than 5 characters
    if 1 < len(word) < 5:
        # reverse word and add start and end token
        reversed_word = '.' + word[::-1] + '.'
        training_data.append((word, reversed_word))

print(training_data[0:5])


[('aa', '.aa.'), ('aaa', '.aaa.'), ('aah', '.haa.'), ('aahs', '.shaa.'), ('aal', '.laa.')]


We use TensorDataset and DataLoader to create batches of training examples.

In [301]:
from torch.utils.data import DataLoader, TensorDataset

# Convert training data to tensors
input_seqs = [torch.tensor(encode(word)) for word, _ in training_data]
output_seqs = [torch.tensor(encode(reversed_word)) for _, reversed_word in training_data]

# Pad sequences to the same length with index 0 (note: 0 is also the index of the '.' start/stop token)
max_len = max(len(seq) for seq in input_seqs + output_seqs)
input_seqs = [torch.cat([seq, torch.zeros(max_len - len(seq), dtype=torch.int64)]) for seq in input_seqs]
output_seqs = [torch.cat([seq, torch.zeros(max_len - len(seq), dtype=torch.int64)]) for seq in output_seqs]

# Create TensorDataset and DataLoader
dataset = TensorDataset(torch.stack(input_seqs), torch.stack(output_seqs))
batch_size = 32
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
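
As an alternative to the manual concatenation, PyTorch provides `torch.nn.utils.rnn.pad_sequence`. The sketch below uses made-up sequences. Note that it pads to the longest sequence in the list it is given, so to reproduce the common length used in the cell above, inputs and outputs would have to be padded together or to an explicit length.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Made-up token id sequences of different lengths
seqs = [torch.tensor([1, 2]), torch.tensor([3, 4, 5, 6]), torch.tensor([7])]

# batch_first=True gives shape [num_sequences, longest_length]
padded = pad_sequence(seqs, batch_first=True, padding_value=0)
print(padded)
```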

We create a Transformer model.

In [302]:
transformer = Transformer(d_model=256, max_len=15, vocab_size=vocab_size, n_layer=3, n_head=4)

positional encoding size：torch.Size([15, 256])


Now we train the model. To speed up this process, we move the model and the training data to the GPU via CUDA when one is available, and run the training there.

In [303]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Move the model to the selected device
model = transformer
model.to(device)
learning_rate = 0.001

# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)
n_epochs = 1

for epoch in range(n_epochs):
    for batch_idx, (input_seq, output_seq) in enumerate(dataloader):
        input_seq, output_seq = input_seq.to(device), output_seq.to(device)

        logits = model(input_seq, output_seq)

        # Calculate loss
        logits = logits.view(-1, logits.shape[-1])
        output_seq = output_seq.view(-1)
        loss = F.cross_entropy(logits, output_seq, ignore_index=0)

        # Backpropagation and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch_idx % 50 == 0:
            print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item()}")
            scheduler.step()



Epoch 0, Batch 0, Loss: 3.429154396057129
Epoch 0, Batch 50, Loss: 0.0031471741385757923
Epoch 0, Batch 100, Loss: 0.0021119399461895227
Epoch 0, Batch 150, Loss: 0.0015743699623271823
Epoch 0, Batch 200, Loss: 0.001274883165024221
Epoch 0, Batch 250, Loss: 0.0011024789419025183
Epoch 0, Batch 300, Loss: 0.001051790197379887


The loss drops quickly and the model appears to converge. However, we were not able to generate correct outputs from the trained model. The issue may lie in the way we attempted to generate from the model, but we are not entirely sure.
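
One possible explanation, offered as a hypothesis rather than a confirmed diagnosis: in the training loop above, the decoder receives `output_seq` as its input and the loss compares the logits at each position with that same `output_seq`. Since the causal mask still lets each position attend to itself, the model may be able to lower the loss by copying its input, which would not carry over to generation. In addition, padding shares index 0 with the "." token, so `ignore_index=0` also excludes the stop token from the loss. A common remedy is to shift the target by one position (teacher forcing) and to use distinct special tokens. The sketch below shows only the shifting on plain lists; the ids `PAD`, `START`, `STOP` and the helper name `teacher_forcing_pair` are hypothetical.

```python
# Hypothetical special-token ids; the notebook uses index 0 for all three roles.
PAD, START, STOP = 0, 1, 2

def teacher_forcing_pair(target_ids, max_len):
    """Return (decoder_input, labels) so that position t is trained to predict token t + 1."""
    full = [START] + target_ids + [STOP]
    decoder_input = full[:-1]  # everything except the last token
    labels = full[1:]          # the same sequence shifted left by one
    pad = [PAD] * (max_len - len(decoder_input))
    return decoder_input + pad, labels + pad

dec_in, labels = teacher_forcing_pair([5, 4, 3], max_len=6)
print(dec_in)  # [1, 5, 4, 3, 0, 0]
print(labels)  # [5, 4, 3, 2, 0, 0]
```

Under this setup the model would be called with the source word and `dec_in`, and the loss would be computed against `labels` with `ignore_index=PAD`. At generation time, one would start from `[START]` and repeatedly append the most likely next token until `STOP` is produced or a maximum length is reached.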