# Reproducing the Transformer from "Attention Is All You Need"

## Preliminaries

In [5]:
%load_ext autoreload
%autoreload 2
import numpy as np
import torch
from torch import nn
from dataset import Dataset
from tokenizer import get_tokenizer
from utils import NUM_PROC, DEVICE, free_memory
from model import TransformerModel
from transformer import Transformer


print("Number of processors: ", NUM_PROC)
print("Device: ", DEVICE)

Number of processors:  32
Device:  cuda


## Transformer Lite from Scratch

Using half the dimensions of the base model: $d_{\rm model} = 256$, $d_{\rm ff} = 1024$ (the base model uses 512 and 2048).

### Tokenizer

Byte-Pair Encoding with a shared (English + German) vocabulary of 37000 tokens.
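To make the merge procedure concrete, here is a minimal sketch of how BPE learns its merges on a toy word list. It is not the code behind `get_tokenizer`; the function name `learn_bpe` and the toy corpus are hypothetical, and a real shared vocabulary would be learned on the concatenated English and German text with special tokens added.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a list of words (toy version, no special tokens)."""
    # Each word starts as a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = learn_bpe(["low", "low", "lower", "lowest", "newer", "wider"], num_merges=3)
print(merges)
```

Each merge adds one symbol to the vocabulary, so the vocabulary size is controlled by the number of merges.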

In [6]:
tokenizer = get_tokenizer(name="wmt14", language="de-en", vocab_size=37000)

Loaded tokenizer from ../tokenizer-wmt14-de-en.json


### Dataset

The dataset is downloaded to `~/.cache/huggingface/datasets/`. I've turned off dataset caching to keep disk usage under control.

In [7]:
dataset = Dataset(name="wmt14", language="de-en", percentage=1)

In [13]:
dataset.tokenize(tokenizer)

Map (num_proc=32):   0%|          | 0/45088 [00:00<?, ? examples/s]

Map (num_proc=32):   0%|          | 0/3000 [00:00<?, ? examples/s]

Map (num_proc=32):   0%|          | 0/3003 [00:00<?, ? examples/s]

In [14]:
dataloader = {}
for split in ["train", "validation", "test"]:
    dataloader[split] = dataset.get_dataloader(split=split, batch_size=64, shuffle=True, min_len=1, max_len=128)
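The `Filter` and `Map` passes in the output below come from `get_dataloader`. As a rough illustration of what a length filter and batch padding could look like, here is a sketch on plain lists of token ids. The helper names `keep_pair` and `pad_batch`, the padding id of 0, and the choice to apply the length bounds to both source and target are assumptions, not the behaviour of `dataset.py`.

```python
def keep_pair(src_ids, tgt_ids, min_len=1, max_len=128):
    """True if both sequences have a token count within [min_len, max_len]."""
    return (min_len <= len(src_ids) <= max_len
            and min_len <= len(tgt_ids) <= max_len)

def pad_batch(seqs, pad_id=0):
    """Right-pad lists of token ids to the longest sequence in the batch."""
    width = max(len(s) for s in seqs)
    return [s + [pad_id] * (width - len(s)) for s in seqs]

# Three toy pairs: one valid, one with an empty source, one with an overlong source.
pairs = [([5, 6, 7], [8, 9]), ([], [4]), ([1] * 200, [2, 3])]
kept = [p for p in pairs if keep_pair(*p)]
print(len(kept))

padded = pad_batch([[1, 2, 3], [4]])
print(padded)
```

Dropping empty and very long pairs keeps the padded batch tensors small, which matters because attention cost grows with the square of the sequence length.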


Filter:   0%|          | 0/45088 [00:00<?, ? examples/s]

Map (num_proc=32):   0%|          | 0/45025 [00:00<?, ? examples/s]

Filter:   0%|          | 0/3000 [00:00<?, ? examples/s]

Map (num_proc=32):   0%|          | 0/2999 [00:00<?, ? examples/s]

Filter:   0%|          | 0/3003 [00:00<?, ? examples/s]

Map (num_proc=32):   0%|          | 0/3003 [00:00<?, ? examples/s]

### Train

In [6]:
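# create the transformer model
model = TransformerModel(vocab_size=tokenizer.get_vocab_size(), d_model=256, dim_feedforward=1024).to(DEVICE)
# Adam with the paper's betas and eps; the effective lr is this base lr times the LambdaLR factor below
# note: 512**-0.5 corresponds to the base model's d_model, not the 256 used here
optimizer = torch.optim.Adam(model.parameters(), lr=512**-0.5, betas=(0.9, 0.98), eps=1e-9)
# warm-up schedule from the paper: linear increase for 4000 steps, then inverse square-root decay
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda nstep: min((nstep + 1) ** -0.5, (nstep + 1) * 4000 ** -1.5))
loss_fn = nn.CrossEntropyLoss() # could add label smoothing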
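The optimizer and scheduler above combine into the paper's schedule, $lr = d_{\rm model}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup^{-1.5})$. The sketch below evaluates it in plain Python to show its shape. The function name `noam_lr` is hypothetical, and `step` is 1-based, matching the `nstep + 1` in the lambda.

```python
def noam_lr(step, d_model=512, warmup=4000):
    """Learning rate at 1-based `step`: linear warm-up, then inverse-sqrt decay."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The rate peaks at step == warmup and halves every time the step count quadruples.
for step in (1, 704, 4000, 16000):
    print(step, f"{noam_lr(step):.2e}")
```

With 45025 training pairs and a batch size of 64, one epoch is 704 steps. If the scheduler is stepped once per batch, the single epoch run here stays well inside the warm-up phase and ends at roughly 1.2e-4, far below the peak rate.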

In [7]:
# load model
# model.load_state_dict(torch.load("model_1.pth"))

In [7]:
# free_memory("model")
free_memory()
print(torch.cuda.memory_summary())

|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  80348 KiB |  80348 KiB |  80348 KiB |      0 B   |
|       from large pool |  37000 KiB |  37000 KiB |  37000 KiB |      0 B   |
|       from small pool |  43348 KiB |  43348 KiB |  43348 KiB |      0 B   |
|---------------------------------------------------------------------------|
| Active memory         |  80348 KiB |  80348 KiB |  80348 KiB |      0 B   |
|       from large pool |  37000 KiB |  37000 KiB |  37000 KiB |      0 B   |
|       from small pool |  43348 KiB |  43348 KiB |  43348 KiB |      0 B   |
|---------------------------------------------------------------

In [8]:
# create the transformer wrapper
transformer = Transformer(model, tokenizer)

In [9]:
transformer.train(dataloader, model, loss_fn, optimizer, scheduler)

-------------------------------
Epoch 1/1
Accuracy: 0.0%, Avg loss: 67.911957  [   64/45025]  [0:00:01 < 0:14:59]
Accuracy: 0.8%, Avg loss: 38.382389  [ 6464/45025]  [0:00:11 < 0:01:08]
Accuracy: 1.0%, Avg loss: 32.601387  [12864/45025]  [0:00:20 < 0:00:51]
Accuracy: 3.0%, Avg loss: 29.933926  [19264/45025]  [0:00:30 < 0:00:40]
Accuracy: 4.7%, Avg loss: 27.682467  [25664/45025]  [0:00:40 < 0:00:30]
Accuracy: 5.1%, Avg loss: 27.091312  [32064/45025]  [0:00:51 < 0:00:20]
Accuracy: 9.1%, Avg loss: 26.094402  [38464/45025]  [0:01:01 < 0:00:10]
Accuracy: 8.0%, Avg loss: 25.695353  [44864/45025]  [0:01:10 < 0:00:00]
Validation Error: 
 Accuracy: 7.4%, Avg loss: 27.908228 

Done!


### Evaluate

In [11]:
sample = dataset.dataset["test"]["translation"][10]
transformer.predict(sample["de"], sample["en"])

Accuracy: 20.5%
-> In (wrong)
" -> In (wrong)
" According -> to (correct)
" According to -> the (wrong)
" According to current -> situation (wrong)
" According to current measurements -> , (correct)
" According to current measurements , -> the (wrong)
" According to current measurements , around -> the (wrong)
" According to current measurements , around 12 -> % (wrong)
" According to current measurements , around 12 , -> the (wrong)
" According to current measurements , around 12 , 000 -> people (wrong)
" According to current measurements , around 12 , 000 vehicles -> are (wrong)
" According to current measurements , around 12 , 000 vehicles travel -> are (wrong)
" According to current measurements , around 12 , 000 vehicles travel through -> the (correct)
" According to current measurements , around 12 , 000 vehicles travel through the -> Hervor (wrong)
" According to current measurements , around 12 , 000 vehicles travel through the town -> of (correct)
" According to current measurements , around 12 , 000
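Each line of the printout above shows the reference prefix followed by the model's guess for the next word, so the accuracy is the fraction of positions where the guess matches the reference. A minimal sketch of that computation, using only the 16 steps visible above (the function name `teacher_forced_accuracy` is hypothetical, and the real `predict` presumably compares token ids rather than words):

```python
def teacher_forced_accuracy(predicted, target):
    """Fraction of positions where the next-token guess equals the reference."""
    assert len(predicted) == len(target)
    hits = sum(p == t for p, t in zip(predicted, target))
    return hits / len(target)

# Guesses and reference words read off the visible part of the printout.
predicted = ["In", "In", "to", "the", "situation", ",", "the", "the",
             "%", "the", "people", "are", "are", "the", "Hervor", "of"]
target = ['"', "According", "to", "current", "measurements", ",", "around", "12",
          ",", "000", "vehicles", "travel", "through", "the", "town", "of"]

accuracy = teacher_forced_accuracy(predicted, target)
print(f"{accuracy:.1%}")
```

The value for these 16 steps differs from the 20.5% reported above because that figure covers the whole sentence, most of which is cut off in the printout.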

In [12]:
print(transformer.translate("Ich bin ein Berliner."))

I am a good thing .
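Unlike `predict`, `translate` has no reference to lean on, so it has to feed its own output back in. The simplest way to do that is greedy decoding, sketched below with a lookup table standing in for the model. Whether `Transformer.translate` uses greedy or beam search is not shown in this notebook; `greedy_decode`, the `<s>`/`</s>` markers, and the toy table are all hypothetical.

```python
def greedy_decode(next_token, bos, eos, max_len=128):
    """Generate one token at a time, always taking the model's top choice."""
    out = [bos]
    while len(out) < max_len:
        tok = next_token(out)  # the model only sees what it has produced so far
        out.append(tok)
        if tok == eos:
            break
    return out

# Toy stand-in for a model: the next token depends only on the last one.
table = {"<s>": "I", "I": "am", "am": "a", "a": "Berliner",
         "Berliner": ".", ".": "</s>"}
result = greedy_decode(lambda prefix: table[prefix[-1]], "<s>", "</s>")
print(result)
```

Because every step conditions on earlier guesses, one early mistake can derail the rest of the sentence, which fits the fluent but unrelated translations in the samples below.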


In [13]:
# translate the first five test pairs
samples = dataset.dataset["test"]["translation"]
for i in range(5):
    sample = samples[i]
    print(f"#{i+1}")
    print(f"Source: {sample['de']}")
    print(f"Target: {sample['en']}")
    print(f"Prediction: {transformer.translate(sample['de'])}")
    print()

#1
Source: Gutach: Noch mehr Sicherheit für Fußgänger
Target: Gutach: Increased safety for pedestrians
Prediction: I have been able to make a great deal of work for the situation .

#2
Source: Sie stehen keine 100 Meter voneinander entfernt: Am Dienstag ist in Gutach die neue B 33-Fußgängerampel am Dorfparkplatz in Betrieb genommen worden - in Sichtweite der älteren Rathausampel.
Target: They are not even 100 metres apart: On Tuesday, the new B 33 pedestrian lights in Dorfparkplatz in Gutach became operational - within view of the existing Town Hall traffic lights.
Prediction: It is not a great deal of the new new groups in the new procedure , which is not the case in the new new new new ු .

#3
Source: Zwei Anlagen so nah beieinander: Absicht oder Schildbürgerstreich?
Target: Two sets of lights so close to one another: intentional or just a silly error?
Prediction: What is the same as the other countries : what is the same as the other countries or the other countries ?

#4
Source: Di

## Debug

In [14]:
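# share of training tokens that sit in sequences of at most 256 tokens
# (first output line: source side, second: target side)
for name in ["src_len", "tgt_len"]:
    len_list = dataset.dataset["train"][name]
    tot = sum(len_list)  # total number of tokens
    count = 0
    for num in len_list:
        if num <= 256:
            count += num  # tokens in sequences within the limit
    print(f"count: {count}, tot: {tot}, percentage: {count/tot*100:.2f}%")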

count: 14302156, tot: 14303264, percentage: 99.99%
count: 14339828, tot: 14340777, percentage: 99.99%


In [10]:
# reference translation from a pretrained model
# note: this rebinds `tokenizer` and `model` from the sections above
from transformers import FSMTForConditionalGeneration, FSMTTokenizer
mname = "facebook/wmt19-de-en"
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)

text = "Maschinelles Lernen ist großartig, oder?"
input_ids = tokenizer.encode(text, return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)


Some weights of FSMTForConditionalGeneration were not initialized from the model checkpoint at facebook/wmt19-de-en and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Machine learning is great, isn't it?


In [12]:
dataset.dataset["test"][1]

{'translation': {'de': 'Sie stehen keine 100 Meter voneinander entfernt: Am Dienstag ist in Gutach die neue B 33-Fußgängerampel am Dorfparkplatz in Betrieb genommen worden - in Sichtweite der älteren Rathausampel.',
  'en': 'They are not even 100 metres apart: On Tuesday, the new B 33 pedestrian lights in Dorfparkplatz in Gutach became operational - within view of the existing Town Hall traffic lights.'}}

In [13]:
input_ids = tokenizer.encode(dataset.dataset["test"][1]["translation"]["de"], return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)

They are less than 100 metres apart: on Tuesday, the new B 33 pedestrian traffic light at the village car park was put into operation in Gutach - within sight of the older town hall traffic light.
