# Reproducing the Transformer Base Model from "Attention Is All You Need"

## Dataset

The paper used the [WMT 2014](https://huggingface.co/datasets/wmt14) English-German dataset, which consists of about 4.5M sentence pairs.

In [31]:
from datasets import load_dataset


dataset = load_dataset('wmt14', 'de-en') # Note: the dataset is downloaded to and cached under ~/.cache/huggingface/datasets

In [29]:
# probe the dataset
for key in dataset:
    print(f"{key}: {len(dataset[key])} entries")
# print the first 5 entries of the training set
for i in range(5):
    print(f"#{i+1}")
    translation = dataset["train"][i]["translation"]
    print(f"de: {translation['de']}")
    print(f"en: {translation['en']}")

train: 4508785 entries
validation: 3000 entries
test: 3003 entries
#1
de: Wiederaufnahme der Sitzungsperiode
en: Resumption of the session
#2
de: Ich erkläre die am Freitag, dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen, wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe, daß Sie schöne Ferien hatten.
en: I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
#3
de: Wie Sie feststellen konnten, ist der gefürchtete "Millenium-Bug " nicht eingetreten. Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden.
en: Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.
#4
de: Im Parlament besteht der Wu

### Tokenizer

The paper used byte-pair encoding (BPE) with a shared source-target (English + German) vocabulary of about 37,000 tokens.

The following code follows [HuggingFace's tutorial](https://huggingface.co/docs/tokenizers/quicktour) on tokenizers.

Training the tokenizer takes about 2 minutes.

In [32]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace


vocab_size = 37000


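def batch_iterator(batch_size: int = 100):
    """Yield batches of raw sentences, covering both languages and every split."""
    for lang in ["de", "en"]:
        for key in ["train", "validation", "test"]:
            for i in range(0, len(dataset[key]), batch_size):
                # slicing a split returns a dict of columns; "translation" is a list of {"de": ..., "en": ...} dicts
                yield [item[lang] for item in dataset[key][i:i+batch_size]["translation"]]


# a single BPE model trained on both languages yields the shared vocabulary
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.pre_tokenizer = Whitespace()
# `length` is only a progress-bar hint; every sentence pair is visited twice (once per language)
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=2 * sum(len(split) for split in dataset.values()))
tokenizer.save("tokenizer-wmt14-de-en.json")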
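
As a quick sanity check of what a shared vocabulary means in practice, the sketch below trains a tiny BPE tokenizer on a handful of mixed German and English sentences and then encodes a padded batch. The toy corpus, the names `toy_corpus`, `toy_tokenizer`, and `toy_trainer`, and the vocabulary size of 200 are illustrative assumptions, not part of the notebook; the real tokenizer above is trained on the full WMT14 data.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Hypothetical toy corpus: German and English sentences mixed in one list
toy_corpus = [
    "Wiederaufnahme der Sitzungsperiode",
    "Resumption of the session",
    "Das Parlament tagt heute",
    "The parliament meets today",
]

# Same setup as the full-size tokenizer, only with a much smaller vocabulary
toy_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
toy_tokenizer.pre_tokenizer = Whitespace()
toy_trainer = BpeTrainer(
    vocab_size=200,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    show_progress=False,
)
toy_tokenizer.train_from_iterator(toy_corpus, trainer=toy_trainer)

# Pad every sequence in a batch to the length of the longest one
pad_id = toy_tokenizer.token_to_id("[PAD]")
toy_tokenizer.enable_padding(pad_id=pad_id, pad_token="[PAD]")

# One tokenizer handles both languages
batch = toy_tokenizer.encode_batch(["Resumption of the session", "Das Parlament tagt"])
for encoding in batch:
    print(encoding.tokens, encoding.ids)
```

The saved full-size tokenizer can be reloaded in the same way with `Tokenizer.from_file("tokenizer-wmt14-de-en.json")`.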






## Transformer from Scratch

In [38]:
import numpy as np
import torch
from torch import nn


class Transformer(nn.Module):
    def __init__(self, d_model: int = 512, nhead: int = 8, num_encoder_layers: int = 6, num_decoder_layers: int = 6,
                 dim_feedforward: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.d_model = d_model
        self.nhead = nhead
        self.num_encoder_layers = num_encoder_layers
        self.num_decoder_layers = num_decoder_layers
        self.dim_feedforward = dim_feedforward
        self.dropout = dropout

# allow adding new methods to a class
def add_method(cls):
    def decorator(func):
        setattr(cls, func.__name__, func)
        return func
    return decorator

# allow adding new properties to a class
def add_property(cls):
    def decorator(func):
        setattr(cls, func.__name__, property(func))
        return func
    return decorator
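
To show how these two helpers behave, here is a minimal, self-contained sketch that attaches a method and a property to a small class after it has been defined. The `Counter` class and the `increment` and `doubled` functions are hypothetical names used only for illustration; the helpers are repeated so the snippet runs on its own.

```python
def add_method(cls):
    """Attach the decorated function to `cls` as a regular method."""
    def decorator(func):
        setattr(cls, func.__name__, func)
        return func
    return decorator


def add_property(cls):
    """Attach the decorated function to `cls` as a read-only property."""
    def decorator(func):
        setattr(cls, func.__name__, property(func))
        return func
    return decorator


class Counter:  # hypothetical example class
    def __init__(self):
        self.count = 0


@add_method(Counter)
def increment(self, step=1):
    self.count += step
    return self.count


@add_property(Counter)
def doubled(self):
    return 2 * self.count


c = Counter()
c.increment()     # called like any method defined inside the class
c.increment(3)
print(c.count)    # 4
print(c.doubled)  # 8 (accessed without parentheses because it is a property)
```

This pattern lets each notebook cell add one piece to `Transformer` without redefining the whole class.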

### Embedding Layer

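The same weight matrix is shared by the input and output embedding layers. In the paper it is also shared with the pre-softmax linear transformation, and the embedding outputs are multiplied by √d_model.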

In [48]:
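@add_property(Transformer)
def embedding(self):
    # lazily create the embedding on first access and cache it on the module
    # note: its weights are not registered until then, so access `model.embedding`
    # before calling `model.parameters()` or `model.to(device)`
    if not hasattr(self, "_embedding"):
        self._embedding = nn.Embedding(tokenizer.get_vocab_size(), self.d_model)
    return self._embedding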
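
One possible way to realize the weight sharing in PyTorch is sketched below. It is not the notebook's implementation: the class name `TiedEmbedding`, its method names, and the small sizes in the usage lines are assumptions made for illustration. The sketch ties an `nn.Embedding` to a bias-free `nn.Linear` output projection and scales the embeddings by √d_model, as the paper describes.

```python
import math

import torch
from torch import nn


class TiedEmbedding(nn.Module):  # hypothetical helper, not part of the notebook
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        # nn.Linear stores its weight as (out_features, in_features) = (vocab_size, d_model),
        # the same shape as the embedding table, so the two can share one parameter
        self.generator = nn.Linear(d_model, vocab_size, bias=False)
        self.generator.weight = self.embedding.weight

    def embed(self, token_ids: torch.Tensor) -> torch.Tensor:
        # the paper multiplies the embedding weights by sqrt(d_model)
        return self.embedding(token_ids) * math.sqrt(self.d_model)

    def logits(self, hidden: torch.Tensor) -> torch.Tensor:
        # pre-softmax projection back onto the vocabulary
        return self.generator(hidden)


tied = TiedEmbedding(vocab_size=100, d_model=16)
token_ids = torch.randint(0, 100, (2, 5))  # (batch, sequence)
vectors = tied.embed(token_ids)            # (2, 5, 16)
scores = tied.logits(vectors)              # (2, 5, 100)
print(vectors.shape, scores.shape)
```

Because the two layers hold the same `Parameter` object, `tied.parameters()` reports it only once and gradients from both uses accumulate into it.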

### Positional Encoding

### Attention Layer

### Layer Normalization

### MLP Layer

### Encoder

### Decoder

### Build Transformer

In [46]:
model = Transformer()
print(model.embedding)
print(model.embedding)

[0.38076211 0.46243897 0.82224138 0.27802744 0.68888738]
[0.38076211 0.46243897 0.82224138 0.27802744 0.68888738]


## Train

## Test