# LANGUAGE TRANSLATION WITH TRANSFORMER

This tutorial shows how to train a translation model from scratch using a Transformer. We use the Multi30k dataset to train a French-to-English translation model.

## Data Processing

torchtext provides utilities for building datasets that are easy to iterate over when training a language translation model. In this example, we show how to tokenize a raw text sentence, build a vocabulary, and numericalize tokens into tensors.
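The three steps named above (tokenize, build a vocabulary, numericalize) can be sketched without any dependencies. This is only an illustration: the `build_vocab` helper used later comes from `transformer_helper`, which is not shown here, so the function names, the whitespace tokenizer, and the order of the special tokens below are all assumptions.

```python
from collections import Counter

# Assumed special tokens and their order; the real helper may differ.
SPECIALS = ["<unk>", "<pad>", "<bos>", "<eos>"]

def build_vocab_sketch(sentences, tokenizer, min_freq=1):
    """Map each token to an integer id; special tokens take the first ids."""
    counts = Counter(tok for s in sentences for tok in tokenizer(s))
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    # Most frequent tokens first; ties are broken alphabetically.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def numericalize(sentence, vocab, tokenizer):
    """Turn a sentence into a list of ids, mapping unknown tokens to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokenizer(sentence)]

tokenizer = str.split  # whitespace stand-in for the spaCy tokenizers
corpus = ["un chien court", "un chat dort"]
vocab = build_vocab_sketch(corpus, tokenizer)
ids = numericalize("un chien dort vite", vocab, tokenizer)
print(ids)  # "vite" is not in the corpus, so it maps to the <unk> id
```

In the real pipeline the resulting id lists would be wrapped in tensors before batching.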

To run this tutorial, first install spaCy using pip or conda. Next, download the English and French spaCy models (`en_core_web_sm` and `fr_core_news_sm`, used by the tokenizers below) from https://spacy.io/usage/models

In [1]:
import time
import os
from pathlib import Path
from tabulate import tabulate
import pandas as pd
from functools import partial
import torch

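# Order GPUs by PCI bus id and expose only one of them to PyTorch.
# Index "6" is specific to the machine this notebook ran on; adjust it for yours.
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="6"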

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [2]:
from torchtext.data.utils import get_tokenizer
from torchtext.utils import extract_archive

torch.manual_seed(0)

<torch._C.Generator at 0x7f5e743e1cc0>

In [3]:
from torch.utils.data import DataLoader

In [4]:
from transformer_helper import (
    Seq2SeqTransformer,
    build_vocab,
    data_process,
    generate_batch,
    train_epoch,
    evaluate,
    translate,
)

In [5]:
data_path = Path('data-bin')

train_urls = ('train.fr', 'train.en')
val_urls = ('val.fr', 'val.en')
test_urls = ('test_2016_flickr.fr', 'test_2016_flickr.en')

train_filepaths = [data_path / url for url in train_urls]
val_filepaths = [data_path / url for url in val_urls]
test_filepaths = [data_path / url for url in test_urls]

In [6]:
fr_tokenizer = get_tokenizer('spacy', language='fr_core_news_sm')
en_tokenizer = get_tokenizer('spacy', language='en_core_web_sm')

In [7]:
fr_vocab = build_vocab(train_filepaths[0], fr_tokenizer)
en_vocab = build_vocab(train_filepaths[1], en_tokenizer)

In [8]:
train_data = data_process(train_filepaths, fr_vocab, fr_tokenizer, en_vocab, en_tokenizer)
val_data = data_process(val_filepaths, fr_vocab, fr_tokenizer, en_vocab, en_tokenizer)
test_data = data_process(test_filepaths, fr_vocab, fr_tokenizer, en_vocab, en_tokenizer)

In [9]:
BATCH_SIZE = 128
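# Special-token ids are read from the French vocabulary and reused for English.
# This assumes build_vocab assigns the same ids to them in both vocabularies.
PAD_IDX = fr_vocab['<pad>']
BOS_IDX = fr_vocab['<bos>']
EOS_IDX = fr_vocab['<eos>']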

## DataLoader

Next we wrap the processed data in PyTorch's `DataLoader`. As the documentation puts it, `DataLoader` combines a dataset and a sampler and provides an iterable over the given dataset. It supports both map-style and iterable-style datasets with single- or multi-process loading, customizable loading order, optional automatic batching (collation), and memory pinning.

Pay particular attention to the optional `collate_fn` argument, which merges a list of samples into a mini-batch of tensors. It is used when loading batches from a map-style dataset. Here, `generate_batch` plays that role and receives the start, end, and padding symbols it needs.
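To make the role of `collate_fn` concrete, here is a dependency-free sketch of what a function like `generate_batch` might do. The real helper lives in `transformer_helper` and is not shown, so the name `generate_batch_sketch`, the use of nested lists instead of tensors, and the exact behavior are assumptions. The sequence-first layout is inferred from the slice `tgt[:-1, :]` used later in this tutorial.

```python
def generate_batch_sketch(pairs, bos, eos, pad):
    """Wrap each sequence in BOS/EOS, pad to the longest sequence in the
    batch, and return (src, tgt) as nested lists laid out [seq_len][batch]."""
    def wrap(ids):
        return [bos] + list(ids) + [eos]

    def pad_and_transpose(seqs):
        max_len = max(len(s) for s in seqs)
        padded = [s + [pad] * (max_len - len(s)) for s in seqs]
        # Transpose from [batch][seq_len] to [seq_len][batch].
        return [list(column) for column in zip(*padded)]

    src = pad_and_transpose([wrap(s) for s, _ in pairs])
    tgt = pad_and_transpose([wrap(t) for _, t in pairs])
    return src, tgt

# Two (source ids, target ids) pairs of different lengths; ids are made up.
pairs = [([4, 6, 8], [5, 7]), ([4, 9], [5, 7, 10, 11])]
src, tgt = generate_batch_sketch(pairs, bos=2, eos=3, pad=1)
for row in src:
    print(row)
```

Each row of `src` holds one time step across the batch, and the shorter sentence is filled with the padding id after its end-of-sentence token.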

In [10]:
collate_fn = partial(generate_batch, start_symbol=BOS_IDX, end_symbol=EOS_IDX, padding_symbol=PAD_IDX)

In [11]:
train_iter = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)
valid_iter = DataLoader(val_data, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
test_iter = DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)

## Visualize Mask

In [12]:
from transformer_helper import create_mask

src, tgt = next(iter(valid_iter))
src = src.to(device)
tgt = tgt.to(device)

tgt_input = tgt[:-1, :]

src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(
    src, tgt_input, PAD_IDX, device)

pd.DataFrame(tgt_mask[0:12, 0:12].cpu().numpy())

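|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|----|----|
| 0 | 0.0 | -inf | -inf | -inf | -inf | -inf | -inf | -inf | -inf | -inf | -inf | -inf |
| 1 | 0.0 | 0.0 | -inf | -inf | -inf | -inf | -inf | -inf | -inf | -inf | -inf | -inf |
| 2 | 0.0 | 0.0 | 0.0 | -inf | -inf | -inf | -inf | -inf | -inf | -inf | -inf | -inf |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | -inf | -inf | -inf | -inf | -inf | -inf | -inf | -inf |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -inf | -inf | -inf | -inf | -inf | -inf | -inf |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -inf | -inf | -inf | -inf | -inf | -inf |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -inf | -inf | -inf | -inf | -inf |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -inf | -inf | -inf | -inf |
| 8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -inf | -inf | -inf |
| 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -inf | -inf |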
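The lower-triangular pattern in the table above is a causal mask: row `i` has `0.0` in the columns it may attend to (positions up to and including `i`) and `-inf` in the future positions. The sketch below rebuilds such a mask with plain Python and shows why `-inf` works: added to attention scores before a softmax, it drives the weight of masked positions to exactly zero. The helper names and the raw scores are made up for illustration, and `create_mask` itself may be implemented differently.

```python
import math

def causal_mask(size):
    """0.0 where attention is allowed (column <= row), -inf elsewhere."""
    return [[0.0 if col <= row else float("-inf") for col in range(size)]
            for row in range(size)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
raw_scores = [0.5, 1.0, -0.3, 2.0]  # made-up attention scores for position 1
masked = [s + m for s, m in zip(raw_scores, mask[1])]
weights = softmax(masked)
print(weights)  # positions 2 and 3 receive zero weight
```

Position 1 therefore distributes all of its attention over positions 0 and 1, regardless of how large the scores for later positions were.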


## Transformer!

The Transformer is a Seq2Seq model introduced in the paper [“Attention is all you need”](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) for machine translation. It consists of an encoder and a decoder, each containing a fixed number of layers.

The encoder processes the input sequence by propagating it through a series of multi-head attention and feed-forward network layers. The encoder output, referred to as `memory`, is fed to the decoder along with the target tensors. The encoder and decoder are trained end to end using teacher forcing.
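Teacher forcing means the decoder is fed the ground-truth target shifted by one position, rather than its own previous predictions. The sketch below shows the shift on a single padded target sequence; the token ids are made up, and the assumption that padded positions are excluded from the loss is inferred from `PAD_IDX` being passed to `train_epoch`, not confirmed by the source.

```python
BOS, EOS, PAD = 2, 3, 1  # illustrative special-token ids

# One padded target sequence: <bos> w1 w2 w3 <eos> <pad>
tgt = [BOS, 15, 27, 8, EOS, PAD]

tgt_input = tgt[:-1]   # what the decoder receives at each step
tgt_output = tgt[1:]   # what it is asked to predict at each step

# Assumption: positions whose expected token is PAD do not contribute to the loss.
scored = [(seen, expected)
          for seen, expected in zip(tgt_input, tgt_output)
          if expected != PAD]

for seen, expected in scored:
    print(f"decoder sees {seen:2d} -> should predict {expected:2d}")
```

This is the same shift as `tgt_input = tgt[:-1, :]` in the mask cell earlier, applied there along the sequence dimension of a whole batch.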

In [13]:
# Define model parameters and instantiate model
SRC_VOCAB_SIZE = len(fr_vocab)
TGT_VOCAB_SIZE = len(en_vocab)
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
BATCH_SIZE = 128
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3
NUM_EPOCHS = 16

In [14]:
transformer = Seq2SeqTransformer(
    NUM_ENCODER_LAYERS,
    NUM_DECODER_LAYERS,
    NHEAD,
    EMB_SIZE,
    SRC_VOCAB_SIZE,
    TGT_VOCAB_SIZE,
    FFN_HID_DIM,
)

for p in transformer.parameters():
    if p.dim() > 1:
        torch.nn.init.xavier_uniform_(p)

transformer = transformer.to(device)

optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
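The initialization loop above applies Xavier (Glorot) uniform initialization to every parameter with more than one dimension, which leaves biases and other one-dimensional parameters at their defaults. With the default gain of 1, values are drawn uniformly from `[-a, a]` where `a = sqrt(6 / (fan_in + fan_out))`. The sketch below only computes that bound for a square weight matrix of size `EMB_SIZE`; the function name is made up for illustration.

```python
import math

def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Half-width of the uniform range used by Xavier/Glorot initialization."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))

bound = xavier_uniform_bound(512, 512)  # e.g. a 512 x 512 projection matrix
print(f"weights drawn from U(-{bound:.4f}, {bound:.4f})")
```

Larger layers get a narrower range, which keeps the variance of activations roughly constant from layer to layer at the start of training.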

### Training

In [15]:
for epoch in range(1, NUM_EPOCHS + 1):
    start_time = time.time()
    train_loss = train_epoch(transformer, train_iter, optimizer, PAD_IDX, device)
    end_time = time.time()
    val_loss = evaluate(transformer, valid_iter, PAD_IDX, device)
    print((f"Epoch: {epoch:2d}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "
           f"Epoch time = {(end_time - start_time):.3f}s"))

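Training for 16 epochs produces the following losses. Training loss falls steadily, while validation loss flattens out at about 1.5 from epoch 14 onward.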

```text
Epoch:  1, Train loss: 5.295, Val loss: 3.985, Epoch time = 24.454s
Epoch:  2, Train loss: 3.572, Val loss: 3.042, Epoch time = 24.331s
Epoch:  3, Train loss: 2.843, Val loss: 2.560, Epoch time = 24.149s
Epoch:  4, Train loss: 2.408, Val loss: 2.261, Epoch time = 24.321s
Epoch:  5, Train loss: 2.103, Val loss: 2.061, Epoch time = 24.221s
Epoch:  6, Train loss: 1.875, Val loss: 1.912, Epoch time = 24.330s
Epoch:  7, Train loss: 1.694, Val loss: 1.823, Epoch time = 24.263s
Epoch:  8, Train loss: 1.548, Val loss: 1.756, Epoch time = 24.370s
Epoch:  9, Train loss: 1.423, Val loss: 1.681, Epoch time = 24.223s
Epoch: 10, Train loss: 1.319, Val loss: 1.639, Epoch time = 24.484s
Epoch: 11, Train loss: 1.226, Val loss: 1.603, Epoch time = 24.360s
Epoch: 12, Train loss: 1.142, Val loss: 1.555, Epoch time = 24.409s
Epoch: 13, Train loss: 1.068, Val loss: 1.527, Epoch time = 24.250s
Epoch: 14, Train loss: 1.002, Val loss: 1.507, Epoch time = 24.285s
Epoch: 15, Train loss: 0.939, Val loss: 1.509, Epoch time = 24.425s
Epoch: 16, Train loss: 0.883, Val loss: 1.496, Epoch time = 24.337s
```
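A quick way to read this log is to find the epoch with the lowest validation loss and to compare the final training and validation losses. The sketch below does that with the numbers copied from the log above. The perplexity line assumes the reported loss is a mean per-token cross-entropy in nats, which the source does not state explicitly.

```python
import math

# Losses copied from the training log above, epochs 1 through 16.
train_losses = [5.295, 3.572, 2.843, 2.408, 2.103, 1.875, 1.694, 1.548,
                1.423, 1.319, 1.226, 1.142, 1.068, 1.002, 0.939, 0.883]
val_losses = [3.985, 3.042, 2.560, 2.261, 2.061, 1.912, 1.823, 1.756,
              1.681, 1.639, 1.603, 1.555, 1.527, 1.507, 1.509, 1.496]

# Epoch (1-based) with the lowest validation loss.
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__) + 1

# Assumption: the loss is mean per-token cross-entropy in nats.
best_ppl = math.exp(val_losses[best_epoch - 1])

# Gap between validation and training loss at the last epoch.
gap = val_losses[-1] - train_losses[-1]

print(f"best epoch: {best_epoch}, val perplexity: {best_ppl:.2f}, final gap: {gap:.3f}")
```

The widening gap between the two curves suggests the model is starting to overfit, so training much longer at this learning rate is unlikely to help validation loss a great deal.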

In [16]:
src_language = "Un groupe de personnes se tenant devant un igloo."

tgt_language = translate(transformer, src_language, fr_vocab, en_vocab,
                         fr_tokenizer, BOS_IDX, EOS_IDX, device)

print(f"Translated: `{tgt_language}`.")

Translated: `A group of people standing outside an igloo .`.
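The `translate` helper is not shown in this notebook. One common way to generate a translation is greedy decoding: start from the beginning-of-sentence token, repeatedly append the highest-scoring next token, and stop at the end-of-sentence token. The sketch below illustrates that loop only. Whether `translate` actually decodes greedily is an assumption, and `next_token_scores` and `toy_scores` are hypothetical stand-ins for a real model.

```python
def greedy_decode(next_token_scores, bos, eos, max_len=20):
    """Append the highest-scoring next token until EOS or max_len new tokens."""
    output = [bos]
    for _ in range(max_len):
        scores = next_token_scores(output)  # one score per vocabulary id
        next_id = max(range(len(scores)), key=scores.__getitem__)
        output.append(next_id)
        if next_id == eos:
            break
    return output

def toy_scores(prefix):
    """Hypothetical stand-in for the model: follows a fixed script of ids."""
    script = [5, 6, 7, 3]  # id 3 plays the role of EOS
    step = len(prefix) - 1
    target = script[step] if step < len(script) else 3
    return [1.0 if i == target else 0.0 for i in range(8)]

decoded = greedy_decode(toy_scores, bos=2, eos=3)
print(decoded)
```

With a real model, `next_token_scores` would run the decoder on the tokens generated so far, together with the encoder `memory`, and return the scores for the next position; the resulting ids would then be mapped back to words through the English vocabulary.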


## References

1. "Attention Is All You Need" paper. https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
2. Language Translation with Transformer tutorial. https://pytorch.org/tutorials/beginner/translation_transformer.html