# Exercise 5: Neural Machine Translation with `nn.Transformer` and the Multi30k dataset
========================================================

**USI, Deep Learning lab SA 2024-2025**

Lecturer: Eleonora Vercesi. TAs: Alvise Dei Rossi, Stefano Huber, Gabriele Dominici

========================================================

In this exercise we're going to tackle a "simple" (toy) machine translation task with a full Transformer model, implemented with PyTorch.
We will use the [Multi30k dataset from the torchtext library](https://pytorch.org/text/stable/datasets.html#multi30k), which yields pairs of raw source-target sentences (source: German, target: English). This dataset was originally introduced to stimulate multilingual multimodal research (the sentences are image descriptions, [link to article](https://arxiv.org/abs/1605.00459)). It includes approximately 30 thousand sentence pairs, hence the name.

In this example, we show how to load the dataset, tokenize raw text sentences,
build the vocabulary, and numericalize tokens into tensors. We'll then build the Transformer model and train it on the processed data.



For comparison, the original Transformer model discussed in [AIAYN](https://user.phil.hhu.de/~cwurm/wp-content/uploads/2020/01/7181-attention-is-all-you-need.pdf) (which is not too different from the implementation we'll see today) was instead trained on the WMT 2014 English-German dataset, consisting of 4.5M sentence pairs with a shared vocabulary of about 37k tokens, on 8 NVIDIA P100 GPUs for 12 hours.

# Libraries version & Tokenizer download
====================

**Note**: You **SHOULD definitely** use a GPU for this exercise; we'll use Colab GPUs but if you finished your GPU time on Colab, you can also run this notebook on Kaggle or Lightning.

The following versioning configuration will throw an error but this shouldn't be relevant for running the rest of the notebook. Keep it unchanged.

In [1]:
!pip install torch==2.0.0 torchdata==0.6.0 spacy==3.7.2 numpy==1.24.4 torchvision==0.15.1 torchtext==0.15.1 "portalocker>=2.0.0"


In this notebook we'll download a ready-made tokenizer and focus more on the architecture of the model; of course you could also tokenize the sentences with custom tokenizers. Note that the ``torchtext`` tokenizers are a bit lacking when it comes to multi-language support. Instead we'll use a tokenizer from ``SpaCy`` ([link](https://spacy.io/)), which natively provides strong support for tokenization in languages other than English. Another popular choice is HuggingFace [tokenizers](https://github.com/huggingface/tokenizers).

In [5]:
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm



Data Downloading and Processing
============================

In [3]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import multi30k, Multi30k, IWSLT2017
from typing import Iterable, List


# Note that the links to the original dataset from torchtext are broken
# we need instead to load the dataset from zip files from a temporary repo
# Refer to https://github.com/pytorch/text/issues/1756#issuecomment-1163664163 for more info
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"

SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'

# Place-holders
token_transform = {}
vocab_transform = {}



We're first going to get the tokenizers and [vocabularies](https://pytorch.org/text/main/vocab.html).

Keep in mind that data samples in the dataset are tuples of sentences like `(sentence_in_german, sentence_in_english)`.

In [6]:
token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')


# helper function to yield list of tokens
def yield_tokens(data_iter: Iterable, language: str) -> Iterable[List[str]]:
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}

    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])

# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
# unk -> tokens not in the vocabulary
# pad -> padding - tokens to fill the sequence to the max length
# bos -> begin of sentence
# eos -> end of sentence
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Training data Iterator
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    # Create torchtext's Vocab object
    vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter, ln),
                                                    min_freq=1,
                                                    specials=special_symbols,
                                                    special_first=True)

# Set ``UNK_IDX`` as the default index. This index is returned when the token is not found.
# If not set, it throws ``RuntimeError`` when the queried token is not found in the Vocabulary.
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
  vocab_transform[ln].set_default_index(UNK_IDX)


Let's look at a couple of examples:

In [None]:
en_vocab = vocab_transform[TGT_LANGUAGE]
de_vocab = vocab_transform[SRC_LANGUAGE]
print(f"English vocab size: {len(en_vocab)}")
print(f"German vocab size: {len(de_vocab)}")

# Vocabulary indices of "red car" in English and German (lookups are case-sensitive)
print(en_vocab["red"], en_vocab["car"], en_vocab["Car"])
print(de_vocab["rotes"], de_vocab["Auto"], de_vocab["auto"])
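To make the numericalization step more concrete, here is a minimal, dependency-free sketch of what a vocabulary does conceptually. It is not the torchtext implementation: `toy_corpus`, `build_toy_vocab` and `numericalize` are hypothetical names introduced only for illustration, and the ordering of tokens with equal frequency is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical toy corpus (already tokenized); not taken from Multi30k.
toy_corpus = [
    ["a", "red", "car"],
    ["a", "man", "in", "a", "red", "shirt"],
]
specials = ["<unk>", "<pad>", "<bos>", "<eos>"]


def build_toy_vocab(token_lists, specials, min_freq=1):
    """Map each token to an integer index, special symbols first."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by descending frequency, then alphabetically, to get a deterministic order
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    return {tok: idx for idx, tok in enumerate(specials + kept)}


def numericalize(tokens, vocab, unk_idx=0):
    """Look up each token, falling back to the <unk> index for unseen tokens."""
    return [vocab.get(tok, unk_idx) for tok in tokens]


toy_vocab = build_toy_vocab(toy_corpus, specials)
print(toy_vocab)
# "Car" (capitalized) never appears in the toy corpus, so it maps to <unk>
print(numericalize(["a", "red", "Car"], toy_vocab))  # [4, 5, 0]
```

This also shows why a capitalized and a lowercase spelling of the same word can get different indices: the lookup is an exact string match.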

Encoder Decoder Transformer Network
=================================

The Transformer is a [sequence to sequence](https://proceedings.neurips.cc/paper/2014/hash/a14ac55a4f27472c5d894ec1c3c743d2-Abstract.html) model introduced in the ["Attention is all you need"](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) paper for machine translation tasks.

<img src="https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1.png" width=400 height=500>



## Embedding and positional encoding

Below, we will create a
Seq2Seq network that uses the Transformer. The network consists of three
parts. The first part is the embedding layer. This layer converts a tensor of
input indices into the corresponding tensor of input embeddings. These
embeddings are further augmented with positional encodings to provide
position information of input tokens to the model (more details [here](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/)).

<img src="https://jalammar.github.io/images/t/transformer_positional_encoding_example.png" width=800 height=300>

In the following image Depth is the embedding size.

<img src="https://kazemnejad.com/img/transformer_architecture_positional_encoding/positional_encoding.png" width=800 height=350>

We report here the original formulation:


\begin{equation}
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)
\end{equation}

\begin{equation}
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)
\end{equation}

Try to implement a PyTorch layer following the equations above. Consider that its output must match the dimension of the Embedding Layer reported below.


In [None]:
from torch import Tensor
import torch
import torch.nn as nn
from torch.nn import Transformer
import math

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu')

class PositionalEncoding(nn.Module):

    pass # TODO


# helper Module to convert tensor of input indices into corresponding tensor of token embeddings
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

### Solution

In [None]:
from torch import Tensor
import torch
import torch.nn as nn
from torch.nn import Transformer
import math

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu')

class PositionalEncoding(nn.Module):
    def __init__(self,
                 emb_size: int,
                 dropout: float,
                 maxlen: int = 5000):
        super(PositionalEncoding, self).__init__()
        # For a better explanation for the following computation, refer to the link:
        # https://ai.stackexchange.com/questions/41670/why-use-exponential-and-log-in-positional-encoding-of-transformer
        den = torch.exp(- torch.arange(0, emb_size, 2)* math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])

# helper Module to convert tensor of input indices into corresponding tensor of token embeddings
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)
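As a sanity check, the two positional encoding equations can also be evaluated in plain Python, one position at a time. The sketch below is only illustrative (`positional_encoding` is a hypothetical helper, not part of the notebook's model); it also checks numerically that the `exp`/`log` expression used for `den` in the solution equals $1 / 10000^{2i/d_{model}}$.

```python
import math


def positional_encoding(pos: int, d_model: int) -> list:
    """Return the d_model-dimensional encoding of position `pos`."""
    pe = []
    for k in range(d_model):
        two_i = k - (k % 2)  # even index 2i shared by each sin/cos pair
        angle = pos / (10000 ** (two_i / d_model))
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe


d_model = 8
# Position 0 alternates sin(0) = 0 and cos(0) = 1
print(positional_encoding(0, d_model))

# 1 / 10000^(2i/d_model) can be rewritten as exp(-2i * ln(10000) / d_model)
for two_i in range(0, d_model, 2):
    direct = 1.0 / (10000 ** (two_i / d_model))
    via_exp = math.exp(-two_i * math.log(10000) / d_model)
    assert math.isclose(direct, via_exp)
```

Low dimensions oscillate quickly with the position, while high dimensions change very slowly, which is what gives each position a distinct pattern.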


## Encoder and decoder


Embedding layers and positional encoding are used both in the Encoder and Decoder part of the Transformer architecture.

The key difference with respect to recurrent networks is the attention mechanism, also present in both encoder and decoder.

<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*ArTXQZip_TwbU6gLshXOEw.png">

For each token we create a query vector $q$, a key vector $k$ and a value vector $v$ through linear transformations. Stacking these vectors we obtain the corresponding matrices $Q$, $K$, $V$.

Self attention is then computed as:

$$
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{\text{d}_{k}}}\right) V
$$

The $QK^T$ matrix multiplication computes a relevance score between each pair of words; after scaling and softmax, these scores become the attention weights.

These weights are then used as factors in a weighted sum of the value vectors of all the words.

This is done in parallel multiple times, for multiple heads (`nhead`). The idea is to capture different aspects of the input.
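The attention formula can be traced by hand on a tiny example. The following is a minimal single-head sketch in plain Python (no batching, no masking, no learned projections); the matrices `Q`, `K`, `V` are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import math


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of row vectors."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Hypothetical 3-token example with d_k = 2
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])  # each row of weights sums to 1
```

Each output row is a convex combination of the rows of `V`, so the representation of a token becomes a mixture of the values of the tokens it attends to.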

Self-attention is present both in the encoder and decoder parts of the network, with some key differences:

- For the encoder self-attention the input sequence pays attention to itself: masking is done only on PAD tokens, to avoid attention weight on padding.

<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*026nzf4bw_DdSZ7kLJWcog.png">

- For the decoder, multi head attention is present twice:
  - Right after positional encoding, we have an attention mechanism where the output sequence pays attention to itself. The mask is defined so that it prevents the model from looking at future words when making predictions (masked self-attention). Padding is also masked. In other words, in the decoder self-attention, masking prevents the decoder from "peeking" ahead at the rest of the target sentence when predicting the next word.

<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*cawtdZLjT9hp7ByG2vcKOQ.png">

  - After the previous attention mechanism, the Q matrix comes from its output, while K and V come from the Encoder. In other words, in the Encoder-Decoder Attention the Query is obtained from the target sentence and the Key/Value from the source sentence. Thus it computes the relevance of each word in the target sentence to each word in the source sentence. The masking here is done as in the encoder part, being careful about padding.

<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*OYi49Pkg-Vl3D4HleEuu7g.png">


**Note**: not in images, but an encoded \<EOS\> token would be present at the end of each sequence in the encoder, while an encoded \<BOS\> token is present in the decoder at the start of each sequence.

Keep in mind that the decoder part of the Transformers works differently when in training (teacher-forcing) and when in inference mode (autoregressive)
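The teacher-forcing setup used in training can be illustrated with a single numericalized sentence. This is a plain-Python sketch with hypothetical token indices (17, 42, 8); it only shows how the decoder input and the expected output are obtained by shifting the same target sequence by one position.

```python
BOS_IDX, EOS_IDX = 2, 3  # same special indices as defined earlier

# Hypothetical numericalized target sentence: <bos> w1 w2 w3 <eos>
tgt = [BOS_IDX, 17, 42, 8, EOS_IDX]

tgt_input = tgt[:-1]  # what the decoder receives:  <bos> w1 w2 w3
tgt_out = tgt[1:]     # what it has to predict:     w1 w2 w3 <eos>

for step, expected in enumerate(tgt_out):
    # With the causal mask, step t may attend only to tgt_input[:t + 1]
    print(f"step {step}: visible prefix {tgt_input[:step + 1]} -> predict {expected}")
```

In training all steps are computed in parallel from the ground-truth prefix, whereas in inference the prefix is built one token at a time from the model's own predictions.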



Each attention block is then followed by a skip connection and a normalization layer, then by a feedforward network, which is again followed by a skip connection and a normalization layer.

Multiple encoder (`num_encoder_layers`) and decoder (`num_decoder_layers`) layers are stacked. Note that the dimension of the representation of the sequence is preserved throughout the architecture.
Finally, the output of the Transformer model is passed through
a linear layer that gives unnormalized scores (logits) for each token in the
target language.

<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*dQTK3oeYqOBUDVgNktSpCw.png" height=600 width=750>

The [Transformer](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html) model is easily implemented with its PyTorch class.

In [None]:
class Seq2SeqTransformer(nn.Module):
    def __init__(self,
                 num_encoder_layers: int,
                 num_decoder_layers: int,
                 emb_size: int,
                 nhead: int,
                 src_vocab_size: int,
                 tgt_vocab_size: int,
                 dim_feedforward: int = 512,
                 dropout: float = 0.1):
        super(Seq2SeqTransformer, self).__init__()
        self.transformer = Transformer(d_model=emb_size,
                                       nhead=nhead,
                                       num_encoder_layers=num_encoder_layers,
                                       num_decoder_layers=num_decoder_layers,
                                       dim_feedforward=dim_feedforward,
                                       dropout=dropout)
        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(
            emb_size, dropout=dropout)

    def forward(self,
                src: Tensor,
                trg: Tensor,
                src_mask: Tensor,
                tgt_mask: Tensor,
                src_padding_mask: Tensor,
                tgt_padding_mask: Tensor,
                memory_key_padding_mask: Tensor):
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None,
                                src_padding_mask, tgt_padding_mask, memory_key_padding_mask)
        return self.generator(outs)

    def encode(self, src: Tensor, src_mask: Tensor):
        return self.transformer.encoder(self.positional_encoding(
                            self.src_tok_emb(src)), src_mask)

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
        return self.transformer.decoder(self.positional_encoding(
                          self.tgt_tok_emb(tgt)), memory,
                          tgt_mask)

We will also define functions for the masks:

In [None]:
def generate_square_subsequent_mask(sz):
    ## Decoder mask
    mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask


def create_mask(src, tgt):
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]

    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    src_mask = torch.zeros((src_seq_len, src_seq_len),device=DEVICE).type(torch.bool)

    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask

# Just to have an idea of what it looks like:
print(generate_square_subsequent_mask(5))
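For reference, the pattern of the decoder mask can be reproduced without PyTorch. The sketch below (`causal_mask` is a hypothetical helper name) builds the same additive mask as nested lists: `0.0` where attention is allowed and `-inf` where it is blocked.

```python
import math


def causal_mask(sz: int):
    """Additive mask: 0.0 where attention is allowed, -inf where it is blocked."""
    neg_inf = float("-inf")
    return [[0.0 if col <= row else neg_inf for col in range(sz)] for row in range(sz)]


for row in causal_mask(3):
    print(row)
# [0.0, -inf, -inf]
# [0.0, 0.0, -inf]
# [0.0, 0.0, 0.0]

# Adding -inf to a score before the softmax gives that position zero weight
print(math.exp(float("-inf")))  # 0.0
```

Row `t` of the mask describes what position `t` of the target may look at: itself and everything before it.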

Let's now define the parameters of our model and instantiate it.
Further below, we also define our loss function, which is the cross-entropy loss,
and the optimizer used for training.


In [None]:
torch.manual_seed(0)

SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
BATCH_SIZE = 128
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3

transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)

for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

transformer = transformer.to(DEVICE)

Define the loss function and the optimizer. Remember, the loss should ignore padding; take a look at the PyTorch documentation of the appropriate loss (which one should you use for multi-class classification?) in order to understand how to do it.

Also, how would you check if your model is too large or if everything you expect is within the model?

In [None]:
loss_fn = None  # TODO

optimizer = None  # TODO

# TODO

## Solution

In [None]:
# Cross entropy is not computed for the padded part of the sentence.
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
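Conceptually, ignoring the padding index means that padded positions are simply left out of the average. The following plain-Python sketch mimics that behaviour for the default mean reduction; `masked_cross_entropy` is a hypothetical helper written for illustration, not the PyTorch implementation, and it assumes at least one non-padding target.

```python
import math

PAD_IDX = 1  # same padding index as defined earlier


def masked_cross_entropy(logits, targets, ignore_index=PAD_IDX):
    """Mean negative log-likelihood over targets that are not `ignore_index`."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded positions contribute nothing to the loss
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[target]  # equals -log softmax(row)[target]
        count += 1
    return total / count


# Hypothetical logits for 3 positions over a 4-token vocabulary
logits = [[0.0, 0.0, 0.0, 0.0],
          [2.0, 0.0, 0.0, 0.0],
          [5.0, -3.0, 1.0, 0.5]]  # the last position is padding
targets = [2, 0, PAD_IDX]

print(masked_cross_entropy(logits, targets))
```

Whatever the model outputs at the padded position, the loss stays the same, so no gradient flows from padding.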

I always find it interesting to keep track of the number of trainable parameters when I'm deciding the hyperparameters of a model:

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

count_parameters(transformer)

If you instead want a more complete overview of the model:

In [None]:
print(transformer)

Collation
=========

As seen in the `Data Downloading and Processing` section, our data iterator
yields a pair of raw strings. We need to convert these string pairs into
the batched tensors that can be processed by our `Seq2Seq` network
defined previously. Below we define our collate function that converts a
batch of raw strings into batch tensors that can be fed directly into
our model.


In [None]:
from torch.nn.utils.rnn import pad_sequence

# helper function to club together sequential operations
def sequential_transforms(*transforms):
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func

# function to add BOS/EOS and create tensor for input sequence indices
def tensor_transform(token_ids: List[int]):
    return torch.cat((torch.tensor([BOS_IDX]),
                      torch.tensor(token_ids),
                      torch.tensor([EOS_IDX])))

# ``src`` and ``tgt`` language text transforms to convert raw strings into tensors indices
text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    text_transform[ln] = sequential_transforms(token_transform[ln], #Tokenization
                                               vocab_transform[ln], #Numericalization
                                               tensor_transform) # Add BOS/EOS and create tensor




Given the transform helper functions above, how would you define a collate function, which will be given to the DataLoaders, to appropriately prepare batches? Remember to also pad sequences with the ``pad_sequence`` function, imported from torch utils.

In [None]:
def collate_fn(batch):
  pass # TODO

## Solution

In [None]:
# function to collate data samples into batch tensors
def collate_fn(batch):
    src_batch, tgt_batch = [], []
    # print("batch: ", batch)
    for src_sample, tgt_sample in batch:
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))

    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch
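The shape produced by padding is easy to get wrong, so here is a small plain-Python sketch of the layout. It assumes the default time-major output of `pad_sequence` (sequence length first, batch second); `pad_time_major` is a hypothetical helper and the token indices are made up.

```python
PAD_IDX = 1  # same padding index as defined earlier


def pad_time_major(sequences, padding_value=PAD_IDX):
    """Pad variable-length sequences and return a (max_len, batch) nested list."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]
    # Transpose from (batch, max_len) to (max_len, batch)
    return [list(step) for step in zip(*padded)]


# Hypothetical numericalized sentences, each wrapped in <bos>=2 ... <eos>=3
batch = [[2, 10, 11, 3], [2, 12, 3], [2, 13, 14, 15, 3]]
padded = pad_time_major(batch)
print(len(padded), len(padded[0]))  # 5 3
for step in padded:
    print(step)
```

This is why the training code slices along the first dimension (`tgt[:-1, :]`) and why the padding masks are transposed before being passed to the model.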

# Training and inference functions

Let's define the training and evaluation loops that will be called for each
epoch.

Note that in the validation epoch we also compute [perplexity](https://medium.com/nlplanet/two-minutes-nlp-perplexity-explained-with-simple-probabilities-6cdc46884584).
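As a quick reminder, perplexity is the exponential of the mean cross-entropy (negative log-likelihood) of the correct tokens. The sketch below works directly on hypothetical per-token probabilities to show the intuition; `perplexity` here is an illustrative helper, not a function used later in the notebook.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that is uniformly unsure among 4 tokens at every step has perplexity 4
print(perplexity([0.25, 0.25, 0.25]))
# More confident correct predictions give a lower perplexity
print(perplexity([0.9, 0.8, 0.95]))
```

A perplexity of `k` can be read as the model being, on average, as uncertain as if it were choosing uniformly among `k` tokens.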


In [None]:
from torch.utils.data import DataLoader


def train_epoch(model, optimizer):
    model.train()
    losses = 0
    num_batches = 0
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    for src, tgt in train_dataloader:
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        # Teacher forcing: the decoder input is the target without its last token
        tgt_input = tgt[:-1, :]

        optimizer.zero_grad()

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)

        # The expected output is the target shifted one position to the left
        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()
        num_batches += 1

    # Count batches while iterating instead of consuming the dataloader a second time
    return losses / num_batches


def evaluate(model):
    model.eval()
    losses = 0
    num_batches = 0

    val_iter = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    val_dataloader = DataLoader(val_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    # No gradients are needed for evaluation
    with torch.no_grad():
        for src, tgt in val_dataloader:
            src = src.to(DEVICE)
            tgt = tgt.to(DEVICE)

            tgt_input = tgt[:-1, :]

            src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

            logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)

            tgt_out = tgt[1:, :]
            loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
            losses += loss.item()
            num_batches += 1

    mean_loss = losses / num_batches
    # Perplexity is the exponential of the mean cross-entropy loss
    # (averaged over batches here, rather than taken from the last batch only)
    perplexity = math.exp(mean_loss)

    return mean_loss, perplexity



Let's also write a couple of functions to greedily decode sequences produced by the model.

In [None]:
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for i in range(max_len-1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()

        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys


def translate(model: torch.nn.Module, src_sentence: str):
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")
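The control flow of greedy decoding can be isolated from the model itself. In the sketch below the Transformer is replaced by a hypothetical scoring function (`toy_next_token_scores`) that follows a fixed script, so that only the loop is illustrated: pick the highest-scoring token, append it, stop at `<eos>` or when `max_len` is reached.

```python
BOS_IDX, EOS_IDX = 2, 3  # same special indices as defined earlier


def toy_next_token_scores(prefix):
    """Hypothetical stand-in for model.decode followed by model.generator.

    Returns one score per vocabulary entry (toy vocabulary of size 6).
    """
    script = {1: 4, 2: 5}                    # prefix length -> preferred token
    best = script.get(len(prefix), EOS_IDX)  # afterwards, prefer <eos>
    return [1.0 if idx == best else 0.0 for idx in range(6)]


def greedy_decode_toy(score_fn, max_len, start_symbol=BOS_IDX):
    ys = [start_symbol]
    for _ in range(max_len - 1):
        scores = score_fn(ys)
        next_word = max(range(len(scores)), key=scores.__getitem__)  # argmax
        ys.append(next_word)
        if next_word == EOS_IDX:
            break
    return ys


print(greedy_decode_toy(toy_next_token_scores, max_len=10))  # [2, 4, 5, 3]
print(greedy_decode_toy(toy_next_token_scores, max_len=3))   # [2, 4, 5]
```

The second call shows the effect of `max_len`: the output is cut off before `<eos>` is produced, which is why `translate` allows a few tokens more than the source length.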

Write a function to qualitatively evaluate the model's translation of a single sentence from the validation set. It is going to be used at the end of every training epoch to get a general idea of how (and whether) the model is improving. Use the functions defined above within it.

In [None]:
def evaluate_single_sentence():
  pass #TODO

## Solution

In [None]:
def evaluate_single_sentence(model, iteration):
    model.eval()

    val_iter = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    val_dataloader = DataLoader(val_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    # Take the first validation batch and pick the sentence at column `iteration`
    src, tgt = next(iter(val_dataloader))
    src_sample = src[:, iteration]
    tgt_sample = tgt[:, iteration]
    # Convert indices back to tokens, skipping the special symbols
    skip = (PAD_IDX, BOS_IDX, EOS_IDX)
    src_str = " ".join([de_vocab.lookup_token(idx) for idx in src_sample.tolist() if idx not in skip])
    tgt_str = " ".join([en_vocab.lookup_token(idx) for idx in tgt_sample.tolist() if idx not in skip])

    return src_str, tgt_str, translate(model, src_str)

# Model training

We finally have everything we need to train our Transformer model.

Using all the functions defined above, train the Transformer model for 20 epochs and keep track of the training and validation losses at every epoch, as they will be useful to evaluate the model afterwards.

Make sure to print out essential information during the training.
For each epoch you should print out:

- Epoch number
- Training loss
- Validation loss
- Perplexity
- A single source sentence
- The corresponding target sentence
- The translation of the model for that sentence
- Optionally the amount of time for the epoch


The training should take approximately 15-20 minutes (with Colab's GPU).


In [None]:
from timeit import default_timer as timer
import warnings
warnings.filterwarnings("ignore")

NUM_EPOCHS = 20

training_losses = None  # TODO
val_losses = None  # TODO

print("Untrained model")
source, target, translation = evaluate_single_sentence(transformer, 0)

print("*"*20)
print("Training")
print("*"*20)

for epoch in range(1, NUM_EPOCHS+1):
    start_time = timer()


    # TODO


    end_time = timer()

## Solution

In [None]:
from timeit import default_timer as timer
import warnings
warnings.filterwarnings("ignore")

NUM_EPOCHS = 20

training_losses = []
val_losses = []

print("Untrained model")
source, target, translation = evaluate_single_sentence(transformer, 0)
print(f"Source: {source}")
print(f"Target: {target}")
print(f"Translation: {translation}")
print("\n\n")

print("*"*20)
print("Training")
print("*"*20)

for epoch in range(1, NUM_EPOCHS+1):
    start_time = timer()
    train_loss = train_epoch(transformer, optimizer)
    end_time = timer()
    val_loss, perplexity = evaluate(transformer)
    print(f"Epoch: {epoch}")
    print(f"Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, Perplexity: {perplexity:.3f}")
    print(f"Epoch time = {(end_time - start_time):.3f}s")
    training_losses.append(train_loss)
    val_losses.append(val_loss)
    source, target, translation = evaluate_single_sentence(transformer, epoch)

    print(f"Source: {source}")
    print(f"Target: {target}")
    print(f"Translation: {translation}")
    print("\n\n")

# Model evaluation

As usual, let's plot the evolution of the training and validation losses:

In [None]:
import matplotlib.pyplot as plt

# TODO

## Solution

In [None]:
import matplotlib.pyplot as plt

plt.plot(range(1,NUM_EPOCHS+1),training_losses, label="Training loss")
plt.plot(range(1,NUM_EPOCHS+1),val_losses, label="Validation loss")
plt.legend()
plt.grid(alpha=0.5)
plt.show()

# Test translation

Let's test the translation with a few sentences, under different conditions:

In [None]:
# For a sentence we know is in the dataset
print("In dataset")
print(translate(transformer, "Eine Gruppe von Menschen steht vor einem Iglu .")) # A group of people stands in front of an igloo.
# For some sentences that are appropriate for the dataset type (image captions)
print("In appropriate context")
print(translate(transformer, "Ein rotes Auto rast auf der Autobahn")) # A red car is racing on the highway
print(translate(transformer, "Ein kleiner Junge, der mit einem Ball spielt")) # A little boy playing with a ball
# Something completely different, a statement
print("Out of domain")
print(translate(transformer, "Der Präsident des Verbandes kündigte eine Konferenz an")) # The president of the association announced a conference
# Something that could be from a messaging app
print(translate(transformer, "Wir sehen uns morgen um 18 Uhr in der Nähe des städtischen Schwimmbades")) # See you tomorrow at 6 p.m. near the municipal swimming pool

You can try also with your own test sentences:

In [None]:
my_sentence = None  # TODO (note that this should be a string in German)
print(translate(transformer, my_sentence))

We've seen a lot in this exercise, but as you can see the model is limited. After all you wouldn't expect to solve machine translation on a laptop in 30 minutes, right?
If you want you can try several things:
- change the minimum number of times a word has to appear in the training set to be included in the vocabulary
- change the size of the SpaCy tokenization models (sm, md, lg)
- change the tokenization completely
- change the hyperparameters of the Transformer architecture (**emb size**, **ffnn_dim**, n_enc, n_dec, ...)
- change the training hyperparameters (batch size, number of epochs)
- implement other evaluation metrics (e.g. [BLEU score](https://))
- implement a different way to generate sequences in inference (e.g. [Beam search](https://towardsdatascience.com/foundations-of-nlp-explained-visually-beam-search-how-it-works-1586b9849a24))
- You can also use this notebook as a starting point for a different machine translation dataset / task, the steps are approximately the same.
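
As a possible starting point for the evaluation-metric suggestion, the sketch below computes a clipped n-gram precision, which is one ingredient of BLEU. It is deliberately simplified and is not the full BLEU score (no brevity penalty, no combination of several n-gram orders, a single reference); the candidate sentence is a hypothetical model output.

```python
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams also found in the reference (counts are clipped)."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


reference = "a group of people stands in front of an igloo".split()
candidate = "a group of people standing in front of an igloo".split()  # hypothetical output

print(clipped_precision(candidate, reference, 1))  # 0.9
print(clipped_precision(candidate, reference, 2))
```

Clipping prevents a candidate from being rewarded for repeating a correct word more often than it appears in the reference.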

References
==========

1. [Multi30K: Multilingual English-German Image Descriptions](https://arxiv.org/abs/1605.00459)
2.  Attention is all you need. <https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf>
3.  The annotated transformer. <https://nlp.seas.harvard.edu/2018/04/03/attention.html#positional-encoding>
4.  Transformers explained visually (blog post). <https://towardsdatascience.com/transformers-explained-visually-part-1-overview-of-functionality-95a6dd460452>
5.  Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. <https://arxiv.org/abs/1301.3781>
6.  Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. <https://proceedings.neurips.cc/paper/2014/hash/a14ac55a4f27472c5d894ec1c3c743d2-Abstract.html>
7.  Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. <https://arxiv.org/abs/1409.0473>
8.  He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. <https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html>
9.  Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. <https://arxiv.org/abs/1607.06450>
10.  Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory <https://ieeexplore.ieee.org/abstract/document/6795963>
