# Translation using Transformers

Transformers have become the de facto standard for NLP tasks. They started out in NLP but are now also used in computer vision and sometimes to generate music. I am sure you have all heard about the GPT-3 Transformer, or the jokes about it.

All that aside, they are still as hard to understand as ever. In my last post, I talked in quite some detail about transformers and how they work at a basic level. I went through the encoder and decoder architecture and the whole data flow through those different pieces of the neural network.

But as I like to say, we don't really understand something until we implement it ourselves. So in this post, we will implement an English-to-German language translator using Transformers.

# Task Description

We want to create a translator that uses transformers to convert English to German. So, if we look at it as a black-box, our network takes as input an English sentence and returns a German sentence.

![](transformers/1.png)


In [27]:
import copy
import math
import time
from typing import Optional, Any

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor
from torch.autograd import Variable
from torch.nn.init import xavier_uniform_
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# For data loading and tokenization.
# Field, BucketIterator and datasets.translation are the older torchtext API
# (available as torchtext.legacy in torchtext 0.9 to 0.11).
import spacy
import torchtext
from torchtext import data, datasets

# Data Preprocessing

To create a translation model, we need translated sentence pairs between English and German. These are pretty standard to get with the IWSLT (International Workshop on Spoken Language Translation) dataset, which we can access using `torchtext.datasets`. This machine translation dataset is something of a de facto standard for translation tasks and contains translations of TED and TEDx talks on various topics in different languages. (The loading cell further down calls `datasets.translation.WMT14`, which offers the same `splits` interface.)

Load the spaCy models. These will be used to tokenize the German and English text.

In [23]:
# Load the Spacy Models
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

Define some special tokens we will use for specifying blank/padding words, and beginning and end of sentences.

In [24]:
# Special Tokens
BOS_WORD = '<s>'
EOS_WORD = '</s>'
BLANK_WORD = "<blank>"

We start by defining a preprocessing pipeline for both our source and target sentences:

In [25]:
SRC = data.Field(tokenize=tokenize_en, pad_token=BLANK_WORD)
TGT = data.Field(tokenize=tokenize_de, init_token = BOS_WORD, 
                 eos_token = EOS_WORD, pad_token=BLANK_WORD)
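
To make the effect of these settings concrete, here is a minimal plain-Python sketch of what happens to one batch of tokenized sentences. The helper `pad_batch` is hypothetical and written only for illustration; it is not part of torchtext, and the real `Field` also converts tokens to indices.

```python
# Sketch only: mimics the start/end/padding behaviour configured on the Fields above.
BOS_WORD = "<s>"
EOS_WORD = "</s>"
BLANK_WORD = "<blank>"


def pad_batch(sentences, init_token=None, eos_token=None, pad_token=BLANK_WORD):
    """Wrap each sentence in optional start/end tokens, then right-pad to equal length."""
    wrapped = []
    for tokens in sentences:
        row = list(tokens)
        if init_token is not None:
            row = [init_token] + row
        if eos_token is not None:
            row = row + [eos_token]
        wrapped.append(row)
    max_len = max(len(row) for row in wrapped)
    return [row + [pad_token] * (max_len - len(row)) for row in wrapped]


# Source side: padding only. Target side: start token, end token, then padding.
src_batch = pad_batch([["Thank", "you", "."], ["Hello"]])
tgt_batch = pad_batch([["Danke", "."], ["Hallo"]],
                      init_token=BOS_WORD, eos_token=EOS_WORD)
for row in src_batch + tgt_batch:
    print(row)
```

The source rows only ever get padding, while every target row is two tokens longer than its sentence before padding is applied.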

We then use the built-in `splits` function to divide our dataset into train, validation, and test sets.
We also filter our sentences using the `MAX_LEN` parameter so that our code runs a lot faster.

In [36]:
MAX_LEN = 20
train, val, test = datasets.translation.WMT14.splits(
    exts=('.en', '.de'), fields=(SRC, TGT), 
    filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN 
    and len(vars(x)['trg']) <= MAX_LEN)

downloading wmt16_en_de.tar.gz


wmt16_en_de.tar.gz: 503MB [00:33, 14.8MB/s] 


In [37]:
for i, example in enumerate([(x.src,x.trg) for x in train[0:5]]):
    print(f"Example_{i}:{example}")

Example_0:(['Res@@', 'um@@', 'ption', 'of', 'the', 'session'], ['Wiederaufnahme', 'der', 'Sitzungsperiode'])
Example_1:(['Please', 'rise', ',', 'then', ',', 'for', 'this', 'minute', '&', 'apos', ';', 's', 'silence', '.'], ['Ich', 'bitte', 'Sie', ',', 'sich', 'zu', 'einer', 'Schwei@@', 'ge@@', 'minute', 'zu', 'erheben', '.'])
Example_2:(['(', 'The', 'House', 'rose', 'and', 'observed', 'a', 'minute', '&', 'apos', ';', 's', 'silence', ')'], ['(', 'Das', 'Parlament', 'erhebt', 'sich', 'zu', 'einer', 'Schwei@@', 'ge@@', 'minute', '.', ')'])
Example_3:(['Madam', 'President', ',', 'on', 'a', 'point', 'of', 'order', '.'], ['Frau', 'Präsidentin', ',', 'zur', 'Geschäftsordnung', '.'])
Example_4:(['If', 'the', 'House', 'agrees', ',', 'I', 'shall', 'do', 'as', 'Mr', 'Evans', 'has', 'suggested', '.'], ['Wenn', 'das', 'Haus', 'damit', 'einverstanden', 'ist', ',', 'werde', 'ich', 'dem', 'Vorschlag', 'von', 'Herrn', 'Evans', 'folgen', '.'])


We also create a source and target language vocabulary using the built-in `build_vocab` function of the `Field` object. We specify a `MIN_FREQ` of 2 so that any word that doesn't occur at least twice doesn't become part of our vocabulary.

In [38]:
MIN_FREQ = 2
SRC.build_vocab(train.src, min_freq=MIN_FREQ)
TGT.build_vocab(train.trg, min_freq=MIN_FREQ)
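
The idea behind a frequency-filtered vocabulary can be sketched in a few lines of plain Python. This is an illustration under assumptions, not torchtext's implementation: the function `build_vocab` below is hypothetical, and the order of the special tokens is chosen to mirror the indices this notebook prints (`<blank>` at 1, `<s>` at 2, `</s>` at 3), with `<unk>` assumed at 0.

```python
from collections import Counter


def build_vocab(tokenized_sentences, min_freq=2,
                specials=("<unk>", "<blank>", "<s>", "</s>")):
    """Return (itos, stoi): special tokens first, then words seen at least min_freq times."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    kept = sorted((w for w, c in counts.items() if c >= min_freq),
                  key=lambda w: (-counts[w], w))
    itos = list(specials) + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


corpus = [["das", "ist", "gut"], ["das", "ist", "ein", "Test"], ["das", "Haus"]]
itos, stoi = build_vocab(corpus, min_freq=2)

# Words below the frequency threshold fall back to the <unk> index.
unk = stoi["<unk>"]
encoded = [stoi.get(tok, unk) for tok in ["das", "Haus", "ist"]]
print(itos)
print(encoded)
```

Here "Haus" occurs only once, so it is dropped from the vocabulary and encoded as the unknown token.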

Once we are done with this, we can use `data.BucketIterator`, which yields batches of sentences with similar lengths, to get our train iterator and validation iterator. Note that we use a `batch_size` of 1 for our validation data. This is optional, but it means we do little or no padding while checking validation performance.

In [39]:
BATCH_SIZE = 350
# Create iterators to process text in batches of approx. the same length
train_iter = data.BucketIterator(train, batch_size=BATCH_SIZE, repeat=False, sort_key=lambda x: len(x.src))
val_iter = data.BucketIterator(val, batch_size=1, repeat=False, sort_key=lambda x: len(x.src))
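
Why batch by length at all? A rough plain-Python sketch shows how much padding is saved when sentences of similar length share a batch. This is a simplification for illustration: the helper names are hypothetical, and the real `BucketIterator` does more than a plain sort (for example, it also shuffles).

```python
def make_batches(lengths, batch_size, sort_by_length):
    """Split sentence lengths into consecutive batches, optionally sorting first."""
    order = sorted(lengths) if sort_by_length else list(lengths)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def padding_needed(batches):
    """Count pad tokens required to bring every sentence up to its batch's longest."""
    return sum(max(batch) - n for batch in batches for n in batch)


lengths = [3, 18, 4, 20, 5, 19, 2, 17]  # made-up sentence lengths
unsorted_pad = padding_needed(make_batches(lengths, 4, sort_by_length=False))
bucketed_pad = padding_needed(make_batches(lengths, 4, sort_by_length=True))
print(unsorted_pad, bucketed_pad)
```

With these made-up lengths, grouping by length cuts the padding from 68 tokens to 12.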



We can look at what's in a batch, i.e. what we are sending to the model as input while training.

In [40]:
batch = next(iter(train_iter))
src_matrix = batch.src.T
print(src_matrix, src_matrix.size())



tensor([[ 330,   63, 2237,  ...,    1,    1,    1],
        [  38,    7,  347,  ...,    1,    1,    1],
        [ 490,  774,  269,  ...,  307,    2,    1],
        ...,
        [  44, 2396,   43,  ...,    1,    1,    1],
        [1749, 1255, 1873,  ..., 1367,   48,    2],
        [2594,  137,   20,  ...,    1,    1,    1]]) torch.Size([350, 20])


In [41]:
trg_matrix = batch.trg.T
print(trg_matrix, trg_matrix.size())

tensor([[    2, 14532,   341,  ...,     1,     1,     1],
        [    2,    44,   207,  ...,     1,     1,     1],
        [    2,  2327, 21046,  ...,     4,     3,     1],
        ...,
        [    2,    52,   102,  ...,     1,     1,     1],
        [    2,   450,  8266,  ...,    50,     4,     3],
        [    2,    13, 17141,  ...,     3,     1,     1]]) torch.Size([350, 22])


So what are these tensors? We can understand them better if we look at their sizes first.

The src tensor contains 350 sentences of length 20 and the target tensor contains 350 sentences of length 22 (up to 20 words plus the start and end tokens). Cool, but what do the numbers represent? We can understand them by looking at the source and target vocabulary.
For example, the number 1 forms a big part of the src batch. What is 1 in a src sentence? We can check it using the vocab's index-to-string list, `itos`:

In [42]:
print(SRC.vocab.itos[1])
print(TGT.vocab.itos[2])
print(TGT.vocab.itos[1])

<blank>
<s>
<blank>


And what about the 2 that starts every trg sentence? You guessed correctly: it is the start token. Similarly, we can look up the index of the end token with `stoi`:


In [43]:
TGT.vocab.stoi['</s>']

3

Important: It really doesn't matter whether you do some other sort of preprocessing, use functions other than `data.Field`, or use other tokenizers. What matters is that, in the end, you send the source and target sentences to your model in the form the transformer expects, i.e. source sentences should be padded with the blank token, and target sentences need a start token and an end token, with the rest padded by blank tokens.
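
The start and end tokens matter because the target sentence is used twice during training: once as the decoder input and once, shifted by one position, as the labels. The sketch below uses a made-up sentence to show the two views; it corresponds to the slices `trg[:, :-1]` and `trg[:, 1:]` used in the training loop of this post.

```python
BOS, EOS, PAD = "<s>", "</s>", "<blank>"

# One padded target sentence, as it would come out of the iterator (tokens, not indices).
trg = [BOS, "Guten", "Morgen", ".", EOS, PAD]

decoder_input = trg[:-1]  # what the decoder sees at each step
labels = trg[1:]          # what it should predict at that step

for step, (seen, expected) in enumerate(zip(decoder_input, labels)):
    print(step, seen, "->", expected)
```

At step 0 the decoder sees only the start token and has to predict the first real word; at the last real step it has to predict the end token.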

So now that we have a way to send the input sentence and the shifted outputs to our transformer, we can look at creating the Transformer ourselves. A lot of the blocks here are taken from the PyTorch `nn` module. In fact, PyTorch has a Transformer module too, but it doesn't include some of the functionality present in the paper, such as the embedding layer and the positional encoding layer. So this is a somewhat more complete implementation.

We build our Transformer from these blocks in the PyTorch `nn` module:

- [TransformerEncoderLayer](https://pytorch.org/docs/master/generated/torch.nn.TransformerEncoderLayer.html) :  A single encoder layer
- [TransformerEncoder](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html) : A stack of `num_encoder_layers` layers. In the paper it is by default kept as 6.
- [TransformerDecoderLayer](https://pytorch.org/docs/stable/generated/torch.nn.TransformerDecoderLayer.html) : A single decoder layer
- [TransformerDecoder](https://pytorch.org/docs/stable/generated/torch.nn.TransformerDecoder.html) : A stack of `num_decoder_layers` layers. In the paper it is by default kept as 6.

If you want, you can also look at the source of all these blocks. I had to look into the source many times myself to make sure that I was giving the right inputs to these layers.

In [44]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        self.d_model = d_model
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x * math.sqrt(self.d_model)
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)

    
class MyTransformer(nn.Module):
    def __init__(self, d_model: int = 512, nhead: int = 8, num_encoder_layers: int = 6,
                 num_decoder_layers: int = 6, dim_feedforward: int = 2048, dropout: float = 0.1,
                 activation: str = "relu",source_vocab_length: int = 60000,target_vocab_length: int = 60000) -> None:
        super(MyTransformer, self).__init__()
        self.source_embedding = nn.Embedding(source_vocab_length, d_model)
        self.pos_encoder = PositionalEncoding(d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout, activation)
        encoder_norm = nn.LayerNorm(d_model)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)
        self.target_embedding = nn.Embedding(target_vocab_length, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward, dropout, activation)
        decoder_norm = nn.LayerNorm(d_model)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm)
        self.out = nn.Linear(d_model, target_vocab_length)
        self._reset_parameters()
        self.d_model = d_model
        self.nhead = nhead

    def forward(self, src: Tensor, tgt: Tensor, src_mask: Optional[Tensor] = None, tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None, memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:
        if src.size(1) != tgt.size(1):
            raise RuntimeError("the batch number of src and tgt must be equal")
        src = self.source_embedding(src)
        src = self.pos_encoder(src)
        memory = self.encoder(src, mask=src_mask, src_key_padding_mask=src_key_padding_mask)
        tgt = self.target_embedding(tgt)
        tgt = self.pos_encoder(tgt)
        output = self.decoder(tgt, memory, tgt_mask=tgt_mask, memory_mask=memory_mask,
                              tgt_key_padding_mask=tgt_key_padding_mask,
                              memory_key_padding_mask=memory_key_padding_mask)
        output = self.out(output)
        return output


    def _reset_parameters(self):
        r"""Initiate parameters in the transformer model."""
        for p in self.parameters():
            if p.dim() > 1:
                xavier_uniform_(p)
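
The `PositionalEncoding` class above fills a table with sines on the even columns and cosines on the odd columns. For readers who find the vectorized version hard to parse, here is the same formula as an unvectorized plain-Python sketch. The function name `positional_encoding` is hypothetical; it omits the scaling by `sqrt(d_model)` and the dropout, which the class applies in `forward`.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table: sin on even columns, cos on odd columns."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same frequency term as div_term in the class above.
            angle = pos * math.exp(-math.log(10000.0) * i / d_model)
            table[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)
    return table


pe = positional_encoding(max_len=4, d_model=6)
for row in pe:
    print([round(v, 3) for v in row])
```

Position 0 always encodes as alternating 0 and 1, and later columns oscillate more slowly than earlier ones, which gives each position a distinct pattern.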


Now, we can initialize the transformer and the optimizer using:

In [45]:
source_vocab_length = len(SRC.vocab)
target_vocab_length = len(TGT.vocab)

model = MyTransformer(source_vocab_length=source_vocab_length,target_vocab_length=target_vocab_length)
optim = torch.optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
# Move the model to the GPU if one is available (the training loop below defaults to use_gpu=True)
if torch.cuda.is_available():
    model = model.cuda()

We can train using this simple training loop:

In [15]:
def train(train_iter, val_iter, model, optim, num_epochs,use_gpu=True): 
    train_losses = []
    valid_losses = []
    for epoch in range(num_epochs):
        train_loss = 0
        valid_loss = 0
        # Train model
        model.train()
        for i, batch in enumerate(train_iter):
            src = batch.src.cuda() if use_gpu else batch.src
            trg = batch.trg.cuda() if use_gpu else batch.trg
            #change to shape (bs , max_seq_len)
            src = src.transpose(0,1)
            #change to shape (bs , max_seq_len+1) , Since right shifted
            trg = trg.transpose(0,1)
            trg_input = trg[:, :-1]
            targets = trg[:, 1:].contiguous().view(-1)
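            # The padding masks below are built but not passed to the model call further down.
            # They compare against 0, while <blank> has index 1 in this vocabulary.
            src_mask = (src != 0)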
            src_mask = src_mask.float().masked_fill(src_mask == 0, float('-inf')).masked_fill(src_mask == 1, float(0.0))
            src_mask = src_mask.cuda() if use_gpu else src_mask
            trg_mask = (trg_input != 0)
            trg_mask = trg_mask.float().masked_fill(trg_mask == 0, float('-inf')).masked_fill(trg_mask == 1, float(0.0))
            trg_mask = trg_mask.cuda() if use_gpu else trg_mask
            size = trg_input.size(1)
            #print(size)
            np_mask = torch.triu(torch.ones(size, size)==1).transpose(0,1)
            np_mask = np_mask.float().masked_fill(np_mask == 0, float('-inf')).masked_fill(np_mask == 1, float(0.0))
            np_mask = np_mask.cuda() if use_gpu else np_mask   
            # Forward, backprop, optimizer
            optim.zero_grad()
            preds = model(src.transpose(0,1), trg_input.transpose(0,1), tgt_mask = np_mask)#, src_mask = src_mask)#, tgt_key_padding_mask=trg_mask)
            preds = preds.transpose(0,1).contiguous().view(-1, preds.size(-1))
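            # Summed token-level loss. Note that ignore_index=0 skips <unk> targets;
            # <blank> has index 1 in this vocabulary, so padded positions still count.
            loss = F.cross_entropy(preds,targets, ignore_index=0,reduction='sum')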
            loss.backward()
            optim.step()
            train_loss += loss.item()/BATCH_SIZE
        
        model.eval()
        with torch.no_grad():
            for i, batch in enumerate(val_iter):
                src = batch.src.cuda() if use_gpu else batch.src
                trg = batch.trg.cuda() if use_gpu else batch.trg
                #change to shape (bs , max_seq_len)
                src = src.transpose(0,1)
                #change to shape (bs , max_seq_len+1) , Since right shifted
                trg = trg.transpose(0,1)
                trg_input = trg[:, :-1]
                targets = trg[:, 1:].contiguous().view(-1)
                src_mask = (src != 0)
                src_mask = src_mask.float().masked_fill(src_mask == 0, float('-inf')).masked_fill(src_mask == 1, float(0.0))
                src_mask = src_mask.cuda() if use_gpu else src_mask
                trg_mask = (trg_input != 0)
                trg_mask = trg_mask.float().masked_fill(trg_mask == 0, float('-inf')).masked_fill(trg_mask == 1, float(0.0))
                trg_mask = trg_mask.cuda() if use_gpu else trg_mask
                size = trg_input.size(1)
                #print(size)
                np_mask = torch.triu(torch.ones(size, size)==1).transpose(0,1)
                np_mask = np_mask.float().masked_fill(np_mask == 0, float('-inf')).masked_fill(np_mask == 1, float(0.0))
                np_mask = np_mask.cuda() if use_gpu else np_mask

                preds = model(src.transpose(0,1), trg_input.transpose(0,1), tgt_mask = np_mask)#, src_mask = src_mask)#, tgt_key_padding_mask=trg_mask)
                preds = preds.transpose(0,1).contiguous().view(-1, preds.size(-1))         
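                # Summed token-level loss. Note that ignore_index=0 skips <unk> targets;
                # <blank> has index 1 in this vocabulary, so padded positions still count.
                loss = F.cross_entropy(preds,targets, ignore_index=0,reduction='sum')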
                valid_loss += loss.item()/1
            
        # Log after each epoch
        print(f'''Epoch [{epoch+1}/{num_epochs}] complete. Train Loss: {train_loss/len(train_iter):.3f}. Val Loss: {valid_loss/len(val_iter):.3f}''')
        
        #Save best model till now:
        if valid_loss/len(val_iter)<min(valid_losses,default=1e9): 
            print("saving state dict")
            torch.save(model.state_dict(), f"checkpoint_best_epoch.pt")
        
        train_losses.append(train_loss/len(train_iter))
        valid_losses.append(valid_loss/len(val_iter))
        
        # Check Example after each epoch:
        sentences = ["This is an example to check how our model is performing."]
        for sentence in sentences:
            print(f"Original Sentence: {sentence}")
            print(f"Translated Sentence: {greeedy_decode_sentence(model,sentence)}")
    return train_losses,valid_losses
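
The `np_mask` built in the loop above is the "no peek" mask: it stops each target position from attending to later positions. A plain-Python sketch of the same matrix, and of what it does to attention weights, may make this easier to see. The helper names are hypothetical, and the attention scores are made up for illustration.

```python
import math

NEG_INF = float("-inf")


def causal_mask(size):
    """Row i is the query position, column j the key position; j > i is blocked."""
    return [[0.0 if j <= i else NEG_INF for j in range(size)] for i in range(size)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(4)
for row in mask:
    print(row)

# The mask is added to the raw attention scores before the softmax,
# so blocked positions end up with exactly zero weight.
raw_scores = [1.0, 1.0, 1.0, 1.0]
weights = softmax([s + m for s, m in zip(raw_scores, mask[1])])
print(weights)
```

The second target position (row 1) can only spread its attention over positions 0 and 1, no matter how high the scores for later positions are.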

In [16]:
def greeedy_decode_sentence(model, sentence, maxlen=25):
    model.eval()
    # Run on whatever device the model lives on instead of assuming CUDA
    device = next(model.parameters()).device
    tokens = SRC.preprocess(sentence)
    # stoi maps out-of-vocabulary tokens to index 0
    indexed = [SRC.vocab.stoi[tok] for tok in tokens]
    src = torch.LongTensor([indexed]).to(device).transpose(0, 1)     # (src_len, 1)
    trg = torch.LongTensor([[TGT.vocab.stoi[BOS_WORD]]]).to(device)  # (1, 1)
    translated_sentence = ""
    with torch.no_grad():
        for _ in range(maxlen):
            size = trg.size(0)
            # Causal mask: 0.0 on and below the diagonal, -inf above it
            np_mask = torch.triu(torch.full((size, size), float('-inf'), device=device), diagonal=1)
            pred = model(src, trg, tgt_mask=np_mask)
            # Most likely token at the last target position
            next_idx = pred.argmax(dim=2)[-1].item()
            add_word = TGT.vocab.itos[next_idx]
            translated_sentence += " " + add_word
            if add_word == EOS_WORD:
                break
            # Append the prediction and feed the longer sequence back in
            trg = torch.cat((trg, torch.LongTensor([[next_idx]]).to(device)))
    return translated_sentence
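
Stripped of the tensor handling, greedy decoding is a short loop: ask for the next-token scores given everything generated so far, take the best one, append it, and stop at the end token or a length limit. The sketch below shows just that loop in plain Python. Both `greedy_decode` and `toy_scores` are hypothetical; `toy_scores` is a scripted stand-in for the model, not a real translator.

```python
def greedy_decode(next_token_scores, bos="<s>", eos="</s>", max_len=25):
    """Repeatedly append the highest-scoring next token until eos or max_len."""
    output = [bos]
    for _ in range(max_len):
        scores = next_token_scores(output)  # dict: token -> score
        best = max(scores, key=scores.get)
        output.append(best)
        if best == eos:
            break
    return output[1:]  # drop the start token


def toy_scores(prefix):
    """Stand-in for the model: always favours the next word of a fixed script."""
    script = ["Hier", "ist", "ein", "Beispiel", "</s>"]
    target = script[min(len(prefix) - 1, len(script) - 1)]
    return {tok: (1.0 if tok == target else 0.0) for tok in script}


result = greedy_decode(toy_scores)
print(" ".join(result))

# A "model" that never emits the end token is cut off by max_len.
stuck = greedy_decode(lambda prefix: {"der": 1.0, "</s>": 0.0}, max_len=3)
print(stuck)
```

The second call shows why the length limit is needed: a model that never produces the end token would otherwise loop forever.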

In [38]:
train_losses,valid_losses = train(train_iter, val_iter, model, optim, 35)

Epoch [1/35] complete. Train Loss: 86.092. Val Loss: 64.514
saving state dict
Original Sentence: This is an example to check how our model is performing.
Translated Sentence:  Und die der der der der der der der der der der der der der der der der der der der der der der der
Epoch [2/35] complete. Train Loss: 59.769. Val Loss: 55.631
saving state dict
Original Sentence: This is an example to check how our model is performing.
Translated Sentence:  Das ist ein paar paar paar sehr , die das ist ein paar sehr Jahre . </s>
Epoch [3/35] complete. Train Loss: 53.123. Val Loss: 49.685
saving state dict
Original Sentence: This is an example to check how our model is performing.
Translated Sentence:  Das ist ein Beispiel , dass wir das Leben in der Welt ist , dass wir das Leben sein . </s>
Epoch [4/35] complete. Train Loss: 47.825. Val Loss: 43.588
saving state dict
Original Sentence: This is an example to check how our model is performing.
Translated Sentence:  Hier ist ein Beispiel , wie wir 

Epoch [33/35] complete. Train Loss: 9.994. Val Loss: 31.756
Original Sentence: This is an example to check how our model is performing.
Translated Sentence:  Hier ist ein Beispiel , um prüfen wie unser Modell ist . Wir spielen . </s>
Epoch [34/35] complete. Train Loss: 9.492. Val Loss: 31.005
Original Sentence: This is an example to check how our model is performing.
Translated Sentence:  Hier ist ein Beispiel , um prüfen zu überprüfen , wie unser Modell ist . Wir spielen . </s>
Epoch [35/35] complete. Train Loss: 9.014. Val Loss: 32.097
Original Sentence: This is an example to check how our model is performing.
Translated Sentence:  Hier ist ein Beispiel , um prüfen wie unser Modell ist . Wir spielen . </s>
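
A note on the scale of these numbers: the loop calls `F.cross_entropy` with `reduction='sum'` and divides by the batch size, so each logged value is roughly a per-sentence total over all its tokens rather than a per-token average. The plain-Python sketch below shows the computation on made-up logits. The function `summed_cross_entropy` is hypothetical and only illustrates the idea of summing while skipping one label index.

```python
import math


def summed_cross_entropy(logits, targets, ignore_index):
    """Sum of -log softmax(logits)[target] over positions whose target is not ignored."""
    total = 0.0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue
        m = max(row)
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_norm - row[target]
    return total


# Three positions, four-word vocabulary, uniform (untrained) predictions.
logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
targets = [2, 3, 1]

loss_skip_1 = summed_cross_entropy(logits, targets, ignore_index=1)
loss_skip_0 = summed_cross_entropy(logits, targets, ignore_index=0)
print(loss_skip_1, loss_skip_0)
```

Each counted position contributes `log(4)` here, so the total depends directly on how many positions are counted, which is why the choice of `ignore_index` affects the reported loss.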


In [17]:
import pandas as pd
import plotly.express as px
losses = pd.DataFrame({'train_loss':train_losses,'val_loss':valid_losses})

In [18]:
losses.head()

| | train_loss | val_loss |
|---|---|---|
| 0 | 86.092171 | 64.513807 |
| 1 | 59.769207 | 55.631315 |
| 2 | 53.123179 | 49.684637 |
| 3 | 47.82473 | 43.588158 |
| 4 | 43.37032 | 39.692401 |


In [21]:
px.line(losses,y = ['train_loss','val_loss'])

We can load the best model using the saved checkpoint.

In [20]:
# Just load model for inference
model.load_state_dict(torch.load(f"checkpoint_best_epoch.pt"))

<All keys matched successfully>

Let us see the output on a single sentence:

In [29]:
sentence = "Isn't Natural language processing just awesome? Please do let me know in the comments."
print(greeedy_decode_sentence(model,sentence))

 Ist es nicht einfach toll ? Bitte lassen Sie mich gerne in den Kommentare kennen . </s>
