<a href="https://colab.research.google.com/github/wileyw/DeepLearningDemos/blob/master/MachineTranslation/torchtext_translation_tutorial_with_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%matplotlib inline

In [None]:
!python -m pip install --upgrade pip
!python -m pip install torchtext==0.6.0
!python -m pip install einops

In [None]:
from einops import rearrange

In [None]:
!python -m pip install spacy

In [None]:
!python -m spacy download en
!python -m spacy download fr
!python -m spacy download de


Language Translation with TorchText
===================================

This tutorial shows how to use several convenience classes of ``torchtext`` to preprocess
data from a well-known dataset containing sentences in both English and German and use it to
train a sequence-to-sequence model with attention that can translate German sentences
into English.

- [Link to PyTorch Tutorial](https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html?highlight=transformer)
- [Link to another translation tutorial](https://github.com/andrewpeng02/transformer-translation)

It is based on
`this tutorial <https://github.com/bentrevett/pytorch-seq2seq/blob/master/3%20-%20Neural%20Machine%20Translation%20by%20Jointly%20Learning%20to%20Align%20and%20Translate.ipynb>`__
from PyTorch community member `Ben Trevett <https://github.com/bentrevett>`__
and was created by `Seth Weidman <https://github.com/SethHWeidman/>`__ with Ben's permission.

By the end of this tutorial, you will be able to:

- Preprocess sentences into a commonly-used format for NLP modeling using the following ``torchtext`` convenience classes:
    - `TranslationDataset <https://torchtext.readthedocs.io/en/latest/datasets.html#torchtext.datasets.TranslationDataset>`__
    - `Field <https://torchtext.readthedocs.io/en/latest/data.html#torchtext.data.Field>`__
    - `BucketIterator <https://torchtext.readthedocs.io/en/latest/data.html#torchtext.data.BucketIterator>`__



`Field` and `TranslationDataset`
--------------------------------
``torchtext`` has utilities for creating datasets that can be easily
iterated through for the purposes of creating a language translation
model. One key class is a
`Field <https://github.com/pytorch/text/blob/master/torchtext/data/field.py#L64>`__,
which specifies the way each sentence should be preprocessed; another is the
`TranslationDataset`. ``torchtext`` has several such datasets; in this
tutorial we'll use the
`Multi30k dataset <https://github.com/multi30k/dataset>`__, which contains about
30,000 sentences (averaging about 13 words in length) in both English and German.

Note: the tokenization in this tutorial requires `Spacy <https://spacy.io>`__.
We use Spacy because it provides strong support for tokenization in languages
other than English. ``torchtext`` provides a ``basic_english`` tokenizer
and supports other tokenizers for English (e.g.
`Moses <https://bitbucket.org/luismsgomes/mosestokenizer/src/default/>`__)
but for language translation - where multiple languages are required -
Spacy is your best bet.

To run this tutorial, first install ``spacy`` using ``pip`` or ``conda``.
Next, download the raw data for the English and German Spacy tokenizers:

::

   python -m spacy download en
   python -m spacy download de

With Spacy installed, the code in the next cell will tokenize each of the sentences
in the ``TranslationDataset`` based on the tokenizer defined in the ``Field``.
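
To make the role of a ``Field`` concrete, here is a minimal pure-Python sketch of the same kind of preprocessing: tokenize, lowercase, and wrap the sentence in start and end markers. The ``preprocess`` function is hypothetical and uses whitespace splitting as a stand-in for the Spacy tokenizer; it is not how ``torchtext`` implements ``Field`` internally.

```python
def preprocess(sentence, tokenize=str.split, init_token="<sos>",
               eos_token="<eos>", lower=True):
    """Toy stand-in for the preprocessing a Field is configured to do."""
    tokens = tokenize(sentence)  # assumption: whitespace split, not Spacy
    if lower:
        tokens = [token.lower() for token in tokens]
    # Mark where the sentence starts and ends.
    return [init_token] + tokens + [eos_token]


print(preprocess("Der Himmel ist blau"))
```

A real tokenizer would also split off punctuation, which whitespace splitting does not.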



In [None]:
import spacy
import torchtext
from torchtext.data import Field, BucketIterator

In [None]:
from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

SRC = Field(tokenize = "spacy",
            tokenizer_language="de",
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True)

TRG = Field(tokenize = "spacy",
            tokenizer_language="en",
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True)

train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'), fields = (SRC, TRG))

Now that we've defined ``train_data``, we can see an extremely useful
feature of ``torchtext``'s ``Field``: the ``build_vocab`` method
now allows us to create the vocabulary associated with each language.



In [None]:
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

## What does our data look like?
Torchtext builds a vocabulary for each language that maps every token to an integer index and back.

In [None]:
# Printing a list of tokens mapping integer to strings
print(SRC.vocab.itos)
# Printing a dict mapping tokens to indices
print(SRC.vocab.stoi)
# Printing the index of an actual word
print(SRC.vocab.stoi['ein'])

Once these lines of code have been run, ``SRC.vocab.stoi`` will be a
dictionary with the tokens in the vocabulary as keys and their
corresponding indices as values; ``SRC.vocab.itos`` will be the inverse
mapping, stored as a list whose position ``i`` holds the token with
index ``i``. We won't make extensive use of this fact in this tutorial,
but this will likely be useful in other NLP tasks you'll encounter.
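
The relationship between the two mappings can be sketched in plain Python. The ``build_vocab`` function below is a hypothetical stand-in, not the ``torchtext`` implementation; the order of the special tokens and the tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter


def build_vocab(tokenized_sentences, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Return (itos, stoi) for tokens seen at least min_freq times."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Most frequent first; ties broken alphabetically (an assumption).
    frequent = sorted(
        (tok for tok, n in counts.items() if n >= min_freq),
        key=lambda tok: (-counts[tok], tok),
    )
    itos = list(specials) + frequent                    # index -> token (a list)
    stoi = {tok: idx for idx, tok in enumerate(itos)}   # token -> index (a dict)
    return itos, stoi


corpus = [["ein", "hund", "läuft"], ["ein", "mann", "läuft"], ["ein", "kind"]]
itos, stoi = build_vocab(corpus, min_freq=2)
print(itos)
print(stoi["ein"])
# Tokens below min_freq are not in the vocabulary and fall back to <unk>.
print(stoi.get("hund", stoi["<unk>"]))
```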



``BucketIterator``
------------------
The last ``torchtext`` specific feature we'll use is the ``BucketIterator``,
which is easy to use since it takes a ``TranslationDataset`` as its
first argument. Specifically, as the docs say:
"Defines an iterator that batches examples of similar lengths together.
Minimizes amount of padding needed while producing freshly shuffled
batches for each new epoch. See pool for the bucketing procedure used."
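
The effect of bucketing on padding can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: it sorts the whole dataset by length, whereas the real iterator works on pools of examples, and the function names are hypothetical.

```python
import random


def pad_batch(batch, pad="<pad>"):
    """Pad every sentence in a batch to the length of the longest one."""
    width = max(len(sentence) for sentence in batch)
    return [sentence + [pad] * (width - len(sentence)) for sentence in batch]


def bucket_batches(sentences, batch_size, seed=0):
    """Group sentences of similar length, then shuffle the order of the batches."""
    ordered = sorted(sentences, key=len)
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    random.Random(seed).shuffle(batches)  # shuffles batch order, not batch contents
    return [pad_batch(batch) for batch in batches]


def count_pads(batches, pad="<pad>"):
    return sum(tok == pad for batch in batches for sent in batch for tok in sent)


sentences = [["a"] * n for n in (2, 9, 3, 8, 2, 9)]
bucketed = bucket_batches(sentences, batch_size=2)
# Naive batching keeps the original order and mixes short with long sentences.
naive = [pad_batch(sentences[i:i + 2]) for i in range(0, len(sentences), 2)]
print(count_pads(bucketed), "pad tokens with bucketing,", count_pads(naive), "without")
```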



In [None]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE,
    device = device)

These iterators can be iterated over just like ``DataLoader``s; below, in
the ``train`` and ``evaluate`` functions, they are used simply with:

::

   for i, batch in enumerate(iterator):

Each ``batch`` then has ``src`` and ``trg`` attributes:

::

   src = batch.src
   trg = batch.trg



Defining our ``nn.Module`` and ``Optimizer``
--------------------------------------------
That's mostly it from a ``torchtext`` perspective: with the dataset built
and the iterator defined, the rest of this tutorial simply defines our
model as an ``nn.Module``, along with an ``Optimizer``, and then trains it.

The model in this notebook is built around PyTorch's ``nn.Transformer``,
combined with token embeddings, sinusoidal positional encodings, and a final
linear layer that projects onto the target vocabulary. The tutorial this
notebook is adapted from instead follows the recurrent attention
architecture described
`here <https://arxiv.org/abs/1409.0473>`__ (you can find a
significantly more commented version
`here <https://github.com/SethHWeidman/pytorch-seq2seq/blob/master/3%20-%20Neural%20Machine%20Translation%20by%20Jointly%20Learning%20to%20Align%20and%20Translate.ipynb>`__).

Note: this model is just an example model that can be used for language
translation; it is deliberately small and is not a recommended
configuration. You can see PyTorch's capabilities for implementing
Transformer layers
`here <https://pytorch.org/docs/stable/nn.html#transformer-layers>`__.
Keep in mind that the multi-headed self-attention present in a
Transformer is different from the "attention" used in the recurrent
model linked above.



In [None]:
import random
from typing import Tuple

import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch import Tensor

import math

INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)

ENC_EMB_DIM = 32
DEC_EMB_DIM = 32
ENC_HID_DIM = 64
DEC_HID_DIM = 64
ATTN_DIM = 8
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

# Source: https://pytorch.org/tutorials/beginner/transformer_tutorial.html
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=100):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)

class TransformerModel(nn.Module):
    def __init__(self):
        super(TransformerModel, self).__init__()
        self.src_embedding = nn.Embedding(INPUT_DIM, ENC_EMB_DIM)
        # Target tokens are indexed by the target vocabulary, so size this table with OUTPUT_DIM.
        self.tgt_embedding = nn.Embedding(OUTPUT_DIM, ENC_EMB_DIM)
        self.transformer = nn.Transformer(nhead=8, num_encoder_layers=2, d_model=ENC_EMB_DIM)
        self.linear = nn.Linear(ENC_EMB_DIM, OUTPUT_DIM)
        pos_dropout = 0.1
        max_seq_length = 128
        self.pos_enc = PositionalEncoding(ENC_EMB_DIM, pos_dropout, max_seq_length)
    
    def forward(self, src, tgt, teacher_forcing_ratio=0.5, src_key_padding_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, tgt_mask=None):
        # src and tgt arrive as index tensors of shape (sequence length, batch size).
        # nn.Transformer expects (sequence length, batch size, d_model), which the
        # scaled embeddings plus positional encodings provide (d_model = ENC_EMB_DIM).
        # The output holds unnormalized logits, so negative values are expected.
        # teacher_forcing_ratio is unused; it is kept so existing call sites still work.
        src_emb = self.pos_enc(self.src_embedding(src) * math.sqrt(ENC_EMB_DIM))
        tgt_emb = self.pos_enc(self.tgt_embedding(tgt) * math.sqrt(ENC_EMB_DIM))
        out = self.transformer(src_emb, tgt_emb, tgt_mask=tgt_mask, src_key_padding_mask=src_key_padding_mask, tgt_key_padding_mask=tgt_key_padding_mask, memory_key_padding_mask=memory_key_padding_mask)
        out = self.linear(out)
        return out

model = TransformerModel().to(device)
for p in model.parameters():
    if p.dim() > 1:
        nn.init.xavier_normal_(p)


optimizer = optim.Adam(model.parameters(), betas=(0.9, 0.98))


def count_parameters(model: nn.Module):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


print(f'The model has {count_parameters(model):,} trainable parameters')
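
The table that ``PositionalEncoding`` registers as ``pe`` can be reproduced without tensors, which may help when checking the indexing. This is a pure-Python sketch of the same sinusoidal formula; the function name is hypothetical.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same frequency term as div_term in the class above.
            angle = pos * math.exp(-math.log(10000.0) * i / d_model)
            table[pos][i] = math.sin(angle)          # even columns
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)  # odd columns
    return table


pe = positional_encoding(max_len=4, d_model=6)
for row in pe:
    print([round(value, 3) for value in row])
```

Position 0 always encodes to alternating zeros and ones, since ``sin(0) = 0`` and ``cos(0) = 1``.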

Note: when scoring the performance of a language translation model in
particular, we have to tell the ``nn.CrossEntropyLoss`` function to
ignore the indices where the target is simply padding.



In [None]:
# Checking index of special tokens
PAD_IDX = TRG.vocab.stoi['<pad>']
SOS_IDX = TRG.vocab.stoi['<sos>']
EOS_IDX = TRG.vocab.stoi['<eos>']
UNK_IDX = TRG.vocab.stoi['<unk>']
print('pad index:', PAD_IDX)
print('sos index:', SOS_IDX)
print('eos index:', EOS_IDX)
print('unk index:', UNK_IDX)

criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
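
To see why ignoring padding matters, here is a pure-Python sketch of a mean cross-entropy that skips positions whose target equals ``ignore_index``. It illustrates the idea only and is not PyTorch's implementation; the logits and targets are made up.

```python
import math


def cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over targets that are not ignore_index."""
    total, counted = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding positions contribute nothing to the loss
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]
        counted += 1
    return total / counted


PAD = 1
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0], [5.0, 0.0, 0.0]]
targets = [0, 2, PAD]  # the last position is padding
loss_ignoring_pad = cross_entropy(logits, targets, ignore_index=PAD)
loss_counting_pad = cross_entropy(logits, targets, ignore_index=-1)
print(loss_ignoring_pad, loss_counting_pad)
```

In this toy case, counting the padded position inflates the loss, because the model is penalized for not predicting ``<pad>``.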

Finally, we can train and evaluate this model:



In [None]:
def gen_nopeek_mask(length):
    mask = rearrange(torch.triu(torch.ones(length, length)) == 1, 'h w -> w h')
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))

    return mask
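
The mask returned by ``gen_nopeek_mask`` is added to the attention scores, so ``-inf`` blocks a position and ``0.0`` leaves it untouched. A pure-Python sketch of the same pattern, using a hypothetical helper name:

```python
def nopeek_mask(length):
    """Row i holds 0.0 for positions <= i and -inf for positions > i."""
    neg_inf = float("-inf")
    return [[0.0 if col <= row else neg_inf for col in range(length)]
            for row in range(length)]


for row in nopeek_mask(3):
    print(row)
```

Each row corresponds to one target position, which may attend to itself and to earlier positions only.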

In [None]:
def example_translate(model, example_sentence_src):
    # Translate example sentence
    example_tensor_src = string_to_indices(SRC, example_sentence_src).view(-1, 1)
    example_sentence_tgt = '<sos>'
    example_tensor_tgt = string_to_indices(TRG, example_sentence_tgt).view(-1, 1)
    src = example_tensor_src.to(device)
    tgt = example_tensor_tgt.to(device)

    print('----------Translating--------------')
    for i in range(128):
        print('Src:', src)
        print('Tgt:', tgt)
        src_key_padding_mask = src == PAD_IDX
        tgt_key_padding_mask = tgt == PAD_IDX
        memory_key_padding_mask = src_key_padding_mask.clone()
        src_key_padding_mask = rearrange(src_key_padding_mask, 'n s -> s n')
        tgt_key_padding_mask = rearrange(tgt_key_padding_mask, 'n s -> s n')
        memory_key_padding_mask = rearrange(memory_key_padding_mask, 'n s -> s n')
        tgt_mask = gen_nopeek_mask(tgt.shape[0]).to(device)
        print('Tgt mask:', tgt_mask)

        output = model(src, tgt, 0, src_key_padding_mask=src_key_padding_mask, tgt_key_padding_mask=tgt_key_padding_mask, memory_key_padding_mask=memory_key_padding_mask, tgt_mask=tgt_mask) #turn off teacher forcing

        print('Output:', output)
        # TODO: Check that the argmax line is correct
        output_index = torch.argmax(output, dim=2)[-1].item()
        output_word = TRG.vocab.itos[output_index]
        example_sentence_tgt = example_sentence_tgt + ' ' + output_word
        print('Translated sentence so far:', example_sentence_tgt)
        example_tensor_tgt = string_to_indices(TRG, example_sentence_tgt).view(-1, 1)
        tgt = example_tensor_tgt.to(device)
        if output_word == '<eos>':
            break
    print('-----------Finished Translating--------------')
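
``example_translate`` performs greedy decoding: at every step it appends the single most likely next token and stops at ``<eos>``. The loop can be sketched independently of the model. Here ``toy_scores`` and ``TOY_TABLE`` are hypothetical stand-ins for the Transformer, with made-up scores.

```python
def greedy_decode(next_token_scores, sos, eos, max_len=10):
    """Repeatedly append the highest-scoring next token until eos or max_len."""
    tokens = [sos]
    for _ in range(max_len):
        scores = next_token_scores(tokens)  # dict: candidate token -> score
        best = max(scores, key=scores.get)
        tokens.append(best)
        if best == eos:
            break
    return tokens


# Hypothetical stand-in for the model: scores depend only on the last token.
TOY_TABLE = {
    "<sos>": {"the": 0.9, "sky": 0.1},
    "the": {"sky": 0.8, "is": 0.2},
    "sky": {"is": 0.7, "blue": 0.3},
    "is": {"blue": 0.6, "<eos>": 0.4},
    "blue": {"<eos>": 0.95, "the": 0.05},
}


def toy_scores(tokens):
    return TOY_TABLE[tokens[-1]]


print(greedy_decode(toy_scores, "<sos>", "<eos>"))
```

Greedy decoding is the simplest choice; it never revisits an earlier token, so it can miss translations that a wider search would find.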


In [None]:
import math
import time

def indices_to_string(LANGUAGE, batch):
    words_list = []
    for sentence in batch.transpose(1, 0):
        sentence_list = sentence.tolist()
        words = []
        for index in sentence_list:
            word = LANGUAGE.vocab.itos[index]
            words.append(word)
        words_list.append(words)
    return words_list

def string_to_indices(LANGUAGE, sentence):
    words = sentence.split()
    indices = []
    for word in words:
        if word in LANGUAGE.vocab.stoi:
            index = LANGUAGE.vocab.stoi[word]
            indices.append(index)
        else:
            index = LANGUAGE.vocab.stoi['<unk>']
            indices.append(index)
    result = torch.tensor(indices)
    return result

def train(model: nn.Module,
          iterator: BucketIterator,
          optimizer: optim.Optimizer,
          criterion: nn.Module,
          clip: float):

    model.train()

    epoch_loss = 0

    for _, batch in enumerate(iterator):

        src = batch.src
        tgt = batch.trg

        optimizer.zero_grad()

        # src and tgt are index tensors of shape (seq_len, batch);
        # the key padding masks must be (batch, seq_len).
        src_key_padding_mask = rearrange(src == PAD_IDX, 's n -> n s')
        tgt_key_padding_mask = rearrange(tgt == PAD_IDX, 's n -> n s')
        memory_key_padding_mask = src_key_padding_mask.clone()

        # Teacher forcing: the decoder reads tokens [0, T-1) and predicts tokens [1, T).
        tgt_inp, tgt_out = tgt[:-1, :], tgt[1:, :]
        tgt_key_padding_mask = tgt_key_padding_mask[:, :-1]
        tgt_mask = gen_nopeek_mask(tgt_inp.shape[0]).to(device)
        output = model(src, tgt_inp, src_key_padding_mask=src_key_padding_mask, tgt_key_padding_mask=tgt_key_padding_mask, memory_key_padding_mask=memory_key_padding_mask, tgt_mask=tgt_mask)
        # output shape: (seq_len - 1, batch, target vocab size)

        # Flatten so each row of logits lines up with one target index.
        output = output.reshape(-1, output.shape[-1])
        tgt_out = tgt_out.reshape(-1)

        loss = criterion(output, tgt_out)

        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

        optimizer.step()

        epoch_loss += loss.item()

    return epoch_loss / len(iterator)



def evaluate(model: nn.Module,
             iterator: BucketIterator,
             criterion: nn.Module):

    print('Evaluating')
    model.eval()

    epoch_loss = 0

    with torch.no_grad():

        for _, batch in enumerate(iterator):

            src = batch.src
            tgt = batch.trg

            # Key padding masks must be (batch, seq_len); the batch tensors are (seq_len, batch).
            src_key_padding_mask = rearrange(src == PAD_IDX, 's n -> n s')
            tgt_key_padding_mask = rearrange(tgt == PAD_IDX, 's n -> n s')
            memory_key_padding_mask = src_key_padding_mask.clone()

            # Feed tokens [0, T-1) and score against tokens [1, T), mirroring train().
            # Feeding the full target would let the decoder see the token it must predict.
            tgt_inp, tgt_out = tgt[:-1, :], tgt[1:, :]
            tgt_key_padding_mask = tgt_key_padding_mask[:, :-1]
            tgt_mask = gen_nopeek_mask(tgt_inp.shape[0]).to(device)
            output = model(src, tgt_inp, 0, src_key_padding_mask=src_key_padding_mask, tgt_key_padding_mask=tgt_key_padding_mask, memory_key_padding_mask=memory_key_padding_mask, tgt_mask=tgt_mask)

            # Uncomment to inspect a batch as words:
            # src_words = indices_to_string(SRC, src)
            # predicted_words = indices_to_string(TRG, torch.argmax(output, dim=2))
            # tgt_words = indices_to_string(TRG, tgt_out)

            # Flatten so each row of logits lines up with one target index.
            output = output.reshape(-1, output.shape[-1])
            loss = criterion(output, tgt_out.reshape(-1))

            epoch_loss += loss.item()

    return epoch_loss / len(iterator)


def epoch_time(start_time: int,
               end_time: int):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
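
Two steps in ``train`` are easy to get wrong: shifting the target by one position, and transposing the padding mask from ``(seq_len, batch)`` to ``(batch, seq_len)``. The sketch below replays both on nested lists; the helper names and the token indices are hypothetical.

```python
def shift_targets(tgt):
    """Split a (seq_len, batch) target into decoder input and prediction target."""
    return tgt[:-1], tgt[1:]


def key_padding_mask(batch, pad_idx):
    """Turn a (seq_len, batch) batch into a (batch, seq_len) mask of pad positions."""
    return [[tok == pad_idx for tok in column] for column in zip(*batch)]


PAD, SOS, EOS = 1, 2, 3
# Two target sentences stored column-wise: [SOS, 10, 11, 12, EOS] and [SOS, 20, EOS, PAD, PAD].
tgt = [
    [SOS, SOS],
    [10, 20],
    [11, EOS],
    [12, PAD],
    [EOS, PAD],
]
tgt_inp, tgt_out = shift_targets(tgt)
mask = key_padding_mask(tgt_inp, PAD)
print(tgt_inp)
print(tgt_out)
print(mask)
```

The decoder input drops the last row and the prediction target drops the first, so position ``t`` of the input is always scored against token ``t + 1``.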


In [None]:
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()

    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')
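
The perplexity (PPL) printed above is the exponential of the mean cross-entropy loss. As a rough reading aid, a model that guesses uniformly over a vocabulary of ``V`` tokens has a loss of ``log(V)`` and therefore a perplexity of ``V``. A minimal sketch, with a made-up vocabulary size:

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy (natural log)."""
    return math.exp(mean_cross_entropy)


vocab_size = 1000  # hypothetical vocabulary size
loss_of_uniform_guess = math.log(vocab_size)
print(perplexity(loss_of_uniform_guess))
print(perplexity(0.0))
```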

In [None]:
test_loss = evaluate(model, test_iterator, criterion)

In [None]:
example_sentence_src = '<sos> der himmel ist blau <eos> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>'
#example_sentence_src = '<sos> viele menschen haben sich versammelt , um etwas zu sehen , das nicht auf dem foto ist . <eos> <pad> <pad> <pad> <pad>'
example_translate(model, example_sentence_src)

Next steps
--------------

- Check out the rest of Ben Trevett's tutorials using ``torchtext``
  `here <https://github.com/bentrevett/>`__
- Stay tuned for a tutorial using other ``torchtext`` features along
  with ``nn.Transformer`` for language modeling via next word prediction!


