# LANGUAGE TRANSLATION WITH TRANSFORMER

This tutorial shows how to train a translation model from scratch using a Transformer. We will use the Multi30k dataset to train a French-to-English translation model.

<div>
  <img src="https://pic4.zhimg.com/v2-1719966a223d98ad48f98c2e4d71add7_r.jpg" width="500"/>
</div>

## Data Processing

torchtext provides utilities for creating datasets that are easy to iterate over when building a language translation model. In this example, we show how to tokenize a raw text sentence, build a vocabulary, and numericalize tokens into a tensor.
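
To make the three steps concrete, here is a minimal, dependency-free sketch of tokenizing, building a vocabulary, and numericalizing. It is an illustration only: `toy_tokenize`, `toy_build_vocab`, and `toy_numericalize` are hypothetical helpers, a whitespace split stands in for the spaCy tokenizer, and a plain `dict` stands in for the torchtext `Vocab` used later in this tutorial.

```python
from collections import Counter

# Same special symbols as the tutorial's vocabulary; they get the lowest indices.
SPECIALS = ['<unk>', '<pad>', '<bos>', '<eos>']

def toy_tokenize(sentence):
    # Whitespace split stands in for the spaCy tokenizer (assumption).
    return sentence.strip().split()

def toy_build_vocab(sentences):
    # Count every token, then map specials first and the rest by frequency.
    counter = Counter()
    for sentence in sentences:
        counter.update(toy_tokenize(sentence))
    itos = SPECIALS + [tok for tok, _ in counter.most_common()]
    return {tok: idx for idx, tok in enumerate(itos)}

def toy_numericalize(sentence, stoi):
    # Tokens missing from the vocabulary fall back to the '<unk>' index.
    return [stoi.get(tok, stoi['<unk>']) for tok in toy_tokenize(sentence)]

corpus = ["un chien court .", "un chat dort ."]
stoi = toy_build_vocab(corpus)
ids = toy_numericalize("un chien dort ici .", stoi)
print(ids)
```

The word `ici` never appears in the toy corpus, so it maps to the `<unk>` index. The real pipeline below does the same thing with spaCy tokens and wraps the resulting index list in a `torch.long` tensor.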

To run this tutorial, first install spaCy using pip or conda. Next, download the English and French spaCy tokenizer models (`en_core_web_sm` and `fr_core_news_sm`, as loaded below); see https://spacy.io/usage/models

In [1]:
import io
import time
import os
import torch
from pathlib import Path

os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="6"

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [2]:
from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab
from torchtext.utils import extract_archive

torch.manual_seed(0)

<torch._C.Generator at 0x7f902cee7d80>

In [3]:
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

In [4]:
from transformer_helper import Seq2SeqTransformer, train_epoch, evaluate, translate

In [5]:
data_path = Path('data-bin')

train_urls = ('train.fr', 'train.en')
val_urls = ('val.fr', 'val.en')
test_urls = ('test_2016_flickr.fr', 'test_2016_flickr.en')

train_filepaths = [data_path / url for url in train_urls]
val_filepaths = [data_path / url for url in val_urls]
test_filepaths = [data_path / url for url in test_urls]

In [6]:
fr_tokenizer = get_tokenizer('spacy', language='fr_core_news_sm')
en_tokenizer = get_tokenizer('spacy', language='en_core_web_sm')

In [7]:
def build_vocab(filepath, tokenizer):
    counter = Counter()
    with io.open(filepath, encoding="utf8") as f:
        for string_ in f:
            counter.update(tokenizer(string_))
    return Vocab(counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])

In [8]:
fr_vocab = build_vocab(train_filepaths[0], fr_tokenizer)
en_vocab = build_vocab(train_filepaths[1], en_tokenizer)

In [9]:
def data_process(filepaths):
    raw_fr_iter = iter(io.open(filepaths[0], encoding="utf8"))
    raw_en_iter = iter(io.open(filepaths[1], encoding="utf8"))
    data = []
    for (raw_fr, raw_en) in zip(raw_fr_iter, raw_en_iter):
        fr_tensor_ = torch.tensor([fr_vocab[token] for token in fr_tokenizer(raw_fr.rstrip("\n"))],
                                  dtype=torch.long)
        en_tensor_ = torch.tensor([en_vocab[token] for token in en_tokenizer(raw_en.rstrip("\n"))],
                                  dtype=torch.long)
        data.append((fr_tensor_, en_tensor_))
    return data

In [10]:
train_data = data_process(train_filepaths)
val_data = data_process(val_filepaths)
test_data = data_process(test_filepaths)

In [11]:
BATCH_SIZE = 128
PAD_IDX = fr_vocab['<pad>']
BOS_IDX = fr_vocab['<bos>']
EOS_IDX = fr_vocab['<eos>']

## DataLoader

The last PyTorch-specific feature we use is the `DataLoader`, which takes the dataset as its first argument. As the documentation puts it, `DataLoader` combines a dataset and a sampler, and provides an iterable over the given dataset. It supports both map-style and iterable-style datasets with single- or multi-process loading, customizable loading order, and optional automatic batching (collation) and memory pinning.

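Pay particular attention to the optional `collate_fn` argument, which merges a list of samples into a mini-batch of tensors. It is used with batched loading from a map-style dataset. Here, `generate_batch` wraps each sentence in `<bos>` and `<eos>` and pads every sentence in the batch to a common length.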
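
The effect of such a collate function can be sketched without PyTorch. The hypothetical `toy_collate` below mimics the layout that `pad_sequence` produces with its default `batch_first=False`, namely one row per time step and one column per batch item. The index values 1, 2, and 3 for `<pad>`, `<bos>`, and `<eos>` are an assumption that the special symbols are numbered in the order they were listed.

```python
def toy_collate(pairs, bos_idx=2, eos_idx=3, pad_idx=1):
    """Wrap each sequence in BOS/EOS, then pad to a (max_len, batch) grid."""
    def wrap(seq):
        return [bos_idx] + list(seq) + [eos_idx]

    def pad_time_major(seqs):
        max_len = max(len(s) for s in seqs)
        padded = [s + [pad_idx] * (max_len - len(s)) for s in seqs]
        # Transpose so rows index time steps and columns index batch items.
        return [list(step) for step in zip(*padded)]

    src = pad_time_major([wrap(s) for s, _ in pairs])
    tgt = pad_time_major([wrap(t) for _, t in pairs])
    return src, tgt

# Two (source, target) pairs of different lengths.
batch = [([4, 6, 9], [5, 7]), ([4, 8], [5, 6, 7, 8])]
src, tgt = toy_collate(batch)
print(len(src), len(src[0]))  # time steps, batch size for the source side
print(len(tgt), len(tgt[0]))  # time steps, batch size for the target side
```

Source and target are padded independently, so the two grids may have different numbers of time steps while sharing the same batch dimension.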

In [12]:
def generate_batch(data_batch):
    fr_batch, en_batch = [], []
    for (fr_item, en_item) in data_batch:
        fr_batch.append(
            torch.cat([torch.tensor([BOS_IDX]), fr_item, torch.tensor([EOS_IDX])], dim=0))
        en_batch.append(
            torch.cat([torch.tensor([BOS_IDX]), en_item, torch.tensor([EOS_IDX])], dim=0))

    fr_batch = pad_sequence(fr_batch, padding_value=PAD_IDX)
    en_batch = pad_sequence(en_batch, padding_value=PAD_IDX)
    return fr_batch, en_batch

In [13]:
train_iter = DataLoader(train_data, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=generate_batch)
# Evaluation data does not need shuffling; a fixed order keeps runs comparable.
valid_iter = DataLoader(val_data, batch_size=BATCH_SIZE,
                        shuffle=False, collate_fn=generate_batch)
test_iter = DataLoader(test_data, batch_size=BATCH_SIZE,
                       shuffle=False, collate_fn=generate_batch)

## Transformer!

The Transformer is a Seq2Seq model introduced in the [“Attention is all you need”](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) paper for machine translation. It consists of an encoder block and a decoder block, each containing a fixed number of layers.

The encoder processes the input sequence by propagating it through a series of multi-head attention and feed-forward network layers. The encoder output, referred to as `memory`, is fed to the decoder along with the target tensors. The encoder and decoder are trained end-to-end using teacher forcing.
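
Teacher forcing can be illustrated with a small, dependency-free sketch. The decoder is fed the ground-truth target shifted right by one position and is trained to predict the next token, while a causal mask keeps each position from looking at later ones. The helpers `shift_for_teacher_forcing` and `causal_mask` are hypothetical. The `train_epoch` function imported from `transformer_helper` is not shown in this tutorial, so it is only an assumption that it follows this pattern.

```python
def shift_for_teacher_forcing(tgt):
    # Decoder input drops the last token; the expected output drops the first.
    return tgt[:-1], tgt[1:]

def causal_mask(size):
    # mask[i][j] is True where position i must NOT attend to position j (j > i).
    return [[j > i for j in range(size)] for i in range(size)]

tgt = [2, 5, 7, 3]  # e.g. <bos>, two word indices, <eos> (illustrative values)
tgt_input, tgt_output = shift_for_teacher_forcing(tgt)
mask = causal_mask(len(tgt_input))

print(tgt_input)   # what the decoder sees
print(tgt_output)  # what the decoder must predict at each step
for row in mask:
    print(row)
```

At step `i` the decoder sees `tgt_input[:i + 1]` and is scored against `tgt_output[i]`, so it always conditions on the correct previous tokens rather than on its own earlier predictions.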

In [14]:
# Define model parameters and instantiate model
SRC_VOCAB_SIZE = len(fr_vocab)
TGT_VOCAB_SIZE = len(en_vocab)
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
BATCH_SIZE = 128
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3
NUM_EPOCHS = 16

In [15]:
transformer = Seq2SeqTransformer(
    NUM_ENCODER_LAYERS,
    NUM_DECODER_LAYERS,
    NHEAD,
    EMB_SIZE,
    SRC_VOCAB_SIZE,
    TGT_VOCAB_SIZE,
    FFN_HID_DIM,
)

for p in transformer.parameters():
    if p.dim() > 1:
        torch.nn.init.xavier_uniform_(p)

transformer = transformer.to(device)

optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

### Training

In [16]:
for epoch in range(1, NUM_EPOCHS + 1):
    start_time = time.time()
    train_loss = train_epoch(transformer, train_iter, optimizer, PAD_IDX, device)
    end_time = time.time()
    val_loss = evaluate(transformer, valid_iter, PAD_IDX, device)
    print((f"Epoch: {epoch:2d}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "
           f"Epoch time = {(end_time - start_time):.3f}s"))

We get the following results during model training.

```text
Epoch:  1, Train loss: 5.298, Val loss: 4.005, Epoch time = 24.849s
Epoch:  2, Train loss: 3.585, Val loss: 3.051, Epoch time = 26.374s
Epoch:  3, Train loss: 2.852, Val loss: 2.550, Epoch time = 24.791s
Epoch:  4, Train loss: 2.411, Val loss: 2.271, Epoch time = 24.833s
Epoch:  5, Train loss: 2.106, Val loss: 2.066, Epoch time = 25.058s
Epoch:  6, Train loss: 1.877, Val loss: 1.928, Epoch time = 24.756s
Epoch:  7, Train loss: 1.697, Val loss: 1.834, Epoch time = 25.069s
Epoch:  8, Train loss: 1.549, Val loss: 1.742, Epoch time = 25.382s
Epoch:  9, Train loss: 1.423, Val loss: 1.679, Epoch time = 24.702s
Epoch: 10, Train loss: 1.318, Val loss: 1.634, Epoch time = 24.820s
Epoch: 11, Train loss: 1.226, Val loss: 1.582, Epoch time = 24.807s
Epoch: 12, Train loss: 1.143, Val loss: 1.571, Epoch time = 24.967s
Epoch: 13, Train loss: 1.070, Val loss: 1.535, Epoch time = 24.855s
Epoch: 14, Train loss: 1.000, Val loss: 1.502, Epoch time = 25.069s
Epoch: 15, Train loss: 0.940, Val loss: 1.497, Epoch time = 24.914s
Epoch: 16, Train loss: 0.885, Val loss: 1.487, Epoch time = 25.155s
```

In [17]:
src_language = "Un groupe de personnes se tenant devant un igloo ."

tgt_language = translate(transformer, src_language, fr_vocab, en_vocab,
                         fr_tokenizer, BOS_IDX, EOS_IDX, device)

print(f"Translated: `{tgt_language}`.")

Translated: `A group of people standing in front of an igloo .`.


## References

1. Attention Is All You Need (paper). https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
2. Language Translation With Transformer Tutorial. https://pytorch.org/tutorials/beginner/translation_transformer.html