<a href="https://colab.research.google.com/github/stefanocostantini/nlp/blob/main/seq2seq.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Seq2Seq model

- This notebook contains an implementation of the Seq2Seq architecture. 
- We will use it to set up an NMT (neural machine translation) model, translating from Italian to English.
- The dataset used for model training is a collection of transcripts of the European Parliament sessions since 2000. The data (for multiple language pairs) is available here: http://www.statmt.org/europarl/
- For this model, we will use a shorter version of the Italian-English dataset containing 50,000 examples. This has been pre-processed and uploaded to Drive. 

### Installs & imports

In [None]:
import spacy
import random
import torch
import torchtext
from torchtext.data import Field, BucketIterator
from torchtext.datasets import TranslationDataset
from torch import nn
from torch import optim # for the optimizers

In [None]:
# Download spacy components for languages of interest
!python -m spacy download it
!python -m spacy download en


Collecting it_core_news_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/it_core_news_sm-2.2.5/it_core_news_sm-2.2.5.tar.gz (14.5MB)
Building wheels for collected packages: it-core-news-sm
  Building wheel for it-core-news-sm (setup.py) ... done
  Created wheel for it-core-news-sm: filename=it_core_news_sm-2.2.5-cp36-none-any.whl size=14471131 sha256=51104bae178c5e8d24a791d1490b94da01488e6b770a80aa338a76a5ebb14141
  Stored in directory: /tmp/pip-ephem-wheel-cache-h4ida3a_/wheels/a1/01/c2/127ab92cc5e3c7f36b5cd4bff28d1c29c313962a2ba913e720
Successfully built it-core-news-sm
Installing collected packages: it-core-news-sm
Successfully installed it-core-news-sm-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('it_core_news_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/it_core_news_sm -->
/usr/local

In [None]:
# Initialise seed
SEED = 1234
random.seed(SEED)
torch.manual_seed(SEED)

<torch._C.Generator at 0x7f52b44aaee8>

In [None]:
# We also make sure that we can use the GPU if it is available:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


### Get data onto colab's filesystem

In [None]:
# First mount drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
# Then we create the folder for our data in the local filesystem
!mkdir /content/translation/

In [None]:
# And finally we copy the files across
# Short version
!cp /content/drive/My\ Drive/ML/translation/short/europarl-v7.it-en-SHORT-ENGLISH.txt /content/translation
!cp /content/drive/My\ Drive/ML/translation/short/europarl-v7.it-en-SHORT-ITALIAN.txt /content/translation

# Very short version
!cp /content/drive/My\ Drive/ML/translation/very_short/europarl-v7.it-en-VERY-SHORT-ENGLISH.txt /content/translation
!cp /content/drive/My\ Drive/ML/translation/very_short/europarl-v7.it-en-VERY-SHORT-ITALIAN.txt /content/translation


In [None]:
# Files will be available at these paths
# Short version
english_raw_path = "/content/translation/europarl-v7.it-en-SHORT-ENGLISH.txt"
italian_raw_path = "/content/translation/europarl-v7.it-en-SHORT-ITALIAN.txt"

# Very short version
# english_raw_path = "/content/translation/europarl-v7.it-en-VERY-SHORT-ENGLISH.txt"
# italian_raw_path = "/content/translation/europarl-v7.it-en-VERY-SHORT-ITALIAN.txt"

### Setting up the dataset

Now we can import the data and set up the dataset. For this, we will use the package `torchtext` which makes things much easier.

The steps involved are as follows:
1. Set up the tokenizer functions for the specific languages we're interested in. We will build these on top of spaCy.
2. Define the data `Fields` (from the `torchtext` library), which detail the operations we want to apply to the text datasets.
3. Set up the training, validation and test datasets needed to train and evaluate the model. We will also create the vocabularies for each language.
4. Define a `DataLoader` to generate data batches. In this case we will use `BucketIterator` from `torchtext`.

#### 1. Set up the tokenizer functions

In [None]:
# Let's load the spacy components and set up the tokenizer functions
spacy_it = spacy.load('it')
spacy_en = spacy.load('en')

# NOTE: we're going to translate from IT to EN. A way to improve the model performance is to
# reverse the order of the source sentence so that the final RNN hidden state of the encoder
# is more directly affected by the first word of the source sentence, which is likely to have a
# stronger relationship to the first word(s) of the target sentence
def tokenize_it(text): 
  return [token.text for token in spacy_it.tokenizer(text)][::-1]

def tokenize_en(text):
  return [token.text for token in spacy_en.tokenizer(text)]

# (note: torchtext has a tokenizer, but at the moment it would appear it doesn't support Italian - though need to doublecheck)
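To make the effect of the reversal concrete, here is a minimal sketch that swaps the spaCy tokenizers for a plain regular expression. This is an assumption made only so the snippet runs without any language models; `simple_tokenize`, `tokenize_source` and `tokenize_target` are hypothetical helpers, not part of the notebook.

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for a spaCy tokenizer: runs of word characters,
    # or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

def tokenize_source(text):
    # Source tokens are reversed, mirroring tokenize_it above
    return simple_tokenize(text)[::-1]

def tokenize_target(text):
    # Target tokens keep their natural order, mirroring tokenize_en above
    return simple_tokenize(text)

src_tokens = tokenize_source("Il parlamento vota.")
trg_tokens = tokenize_target("Parliament votes.")
print(src_tokens)
print(trg_tokens)
```

With the reversal, the first word of the source sentence is the last token the encoder reads.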

#### 2. Set up the data `Fields`

We now set up the text data `Fields` (https://pytorch.org/text/data.html#field), which are a way to define a datatype together with the operations that need to be applied to it (e.g. tokenisation, adding special tokens, lower-casing, etc.). We want to apply the same processing to both source and target text, but as the tokenizer function is language specific, we need to define two separate `Fields`, which we call `source` (for Italian) and `target` (for English).

In [None]:
source = Field(tokenize=tokenize_it, init_token='<sos>', eos_token='<eos>', lower=True)
target = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>', lower=True)
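As a rough illustration of what these settings do to a single sentence, the sketch below applies the same three operations by hand. `preprocess` is a hypothetical helper written for this example, and `str.split` stands in for the spaCy tokenizers; it is not how `torchtext` is implemented internally.

```python
def preprocess(text, tokenize, init_token="<sos>", eos_token="<eos>", lower=True):
    # 1. split the raw string into tokens
    tokens = tokenize(text)
    # 2. optionally lower-case every token
    if lower:
        tokens = [tok.lower() for tok in tokens]
    # 3. wrap the sentence in start-of-sequence and end-of-sequence markers
    return [init_token] + tokens + [eos_token]

example = preprocess("Il Parlamento vota", str.split)
print(example)
```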

This is all we need for now. We can move on to setting up the `Dataset`, loading in the data imported above.

#### 3. Set up dataset

`torchtext` has the `TranslationDataset` class (https://pytorch.org/text/datasets.html#torchtext.datasets.TranslationDataset) which can be used to build a dataset specifically for this task.

In [None]:
# Short version
data = TranslationDataset("/content/translation/",
                         ("europarl-v7.it-en-SHORT-ITALIAN.txt", "europarl-v7.it-en-SHORT-ENGLISH.txt"),
                         (source, target))

# Very short version
# data = TranslationDataset("/content/translation/",
#                          ("europarl-v7.it-en-VERY-SHORT-ITALIAN.txt", "europarl-v7.it-en-VERY-SHORT-ENGLISH.txt"),
#                          (source, target))

We can now split the dataset into training, validation and test sets. To do so we use the `split` method of `torchtext.data.Dataset`.

In [None]:
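# We can now use the `split` method to generate training, validation and test data.
# No stratified split is requested, so `strata_field` is not needed (the fields of this dataset are named `src` and `trg`)
train_data, val_data, test_data = data.split(split_ratio=[0.7, 0.15, 0.15])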

In [None]:
# We can extract individual examples from the dataset, for either language, using the
# .src or .trg attributes. This shows that the datasets are aligned correctly
# (the source tokens are stored in reverse order, so we flip them back for display)
print(train_data[222].src[::-1])
print(train_data[222].trg)

['i', 'nostri', 'stessi', 'dirigenti', 'sono', 'degli', 'irresponsabili', '.']
['our', 'own', 'political', 'leaders', 'are', 'acting', 'irresponsibly', '.']


In [None]:
# These are the dimensions of the datasets
print(f'Size of train dataset: {len(train_data)}')
print(f'Size of validation dataset: {len(val_data)}')
print(f'Size of test dataset: {len(test_data)}')

Size of train dataset: 34825
Size of validation dataset: 7463
Size of test dataset: 7462


Now that we have the data, the last thing to do is build a vocabulary for both the source and the target datasets. We build it using only the training dataset to avoid any data leakage. We also set a minimum frequency of 2, so that words appearing only once are not included. These will be treated as `<unk>` going forward.

In [None]:
source.build_vocab(train_data, min_freq=2)
target.build_vocab(train_data, min_freq=2)
print(f'The source vocabulary contains {len(source.vocab)} unique words')
print(f'The target vocabulary contains {len(target.vocab)} unique words')

The source vocabulary contains 17436 unique words
The target vocabulary contains 11986 unique words
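The effect of `min_freq` can be sketched with a plain `Counter`. The `build_vocab` function below is a hypothetical re-implementation for illustration only; the order of the special tokens is chosen to match the indices used in this notebook (`<pad>` is 1, `<sos>` is 2, `<eos>` is 3).

```python
from collections import Counter

def build_vocab(tokenised_sentences, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    # Count how often each token appears in the (training) sentences
    counts = Counter(tok for sent in tokenised_sentences for tok in sent)
    itos = list(specials)
    # Most frequent tokens first, ties broken alphabetically
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

sentences = [["il", "parlamento", "vota"], ["il", "parlamento", "decide"]]
itos, stoi = build_vocab(sentences)

# Tokens below the frequency threshold fall back to the <unk> index
unk = stoi["<unk>"]
ids = [stoi.get(tok, unk) for tok in ["il", "parlamento", "vota"]]
print(itos)
print(ids)
```

Here `vota` and `decide` each appear only once, so neither gets its own index.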


#### 4. Set up the data loader (`BucketIterator` in this case)

For text we use the `BucketIterator`, which does the following:
- generates batches with the `.src` and `.trg` attributes
- numericalises sequences (i.e. replaces tokens with indices, using the specific vocabulary of each `Field`)
- automatically pads the sentences in a batch so that they all have the same length as the longest sequence


In [None]:
BATCH_SIZE = 100

train_iter = BucketIterator(train_data, batch_size=BATCH_SIZE, device=device)
val_iter = BucketIterator(val_data, batch_size=BATCH_SIZE, device=device)
test_iter = BucketIterator(test_data, batch_size=BATCH_SIZE, device=device)

In [None]:
# We can see that the length of the sentences in each batch is automatically set to
# that of the longest sentence
batch = next(iter(train_iter))
it_example = batch.src
en_example = batch.trg

In [None]:
# The iterator makes sure that all sequences in the batch are as long as the longest sentence in it.
print(it_example.shape, en_example.shape)

torch.Size([95, 100]) torch.Size([102, 100])
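The padding step, and the `sequence_length x batch_size` layout seen in the shapes above, can be sketched on already-numericalised sentences as follows. `pad_batch` is a hypothetical helper, and the index 1 for `<pad>` is taken from the vocabulary used in this notebook.

```python
def pad_batch(sequences, pad_index=1):
    # Pad every sequence with pad_index up to the length of the longest one
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]
    # Transpose so the layout is [max_len][batch_size], i.e. one row per time step
    return [list(step) for step in zip(*padded)]

# Two sentences of different length: 2 = <sos>, 3 = <eos>
toy_batch = pad_batch([[2, 4, 5, 3], [2, 6, 3]])
print(toy_batch)
print(len(toy_batch), len(toy_batch[0]))
```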


In [None]:
# We can check which word each index corresponds to (and vice versa, which index each word corresponds to)
# For example, all sentences start with token '2', which as expected is '<sos>'
print(source.vocab.itos[2])
# Then sentences are finished with token '3', which is '<eos>'
print(source.vocab.itos[3])
# After that point, the sentence is padded with token '1', which is '<pad>' to make it of the same length as the longest sentence
print(source.vocab.itos[1])

<sos>
<eos>
<pad>


In [None]:
# We can use the lookup to see which word corresponds to which index:
print(source.vocab.itos[1234])
print(target.vocab.itos[5678])
# We can also lookup the number corresponding to each word
print(source.vocab.stoi["parlamento"])
print(target.vocab.stoi["parliament"])

statuto
biotechnology
43
42


In [None]:
# We can also reconstruct the words of a specific sequence. Let's take the sentence at index 1
# of the batch for both source (Italian) and target (English). Note that the source's order is inverted as required.
# Italian
print([source.vocab.itos[it_example[i,1].item()] for i in range(0, it_example.shape[0])])
# English
print([target.vocab.itos[en_example[i,1].item()] for i in range(0, en_example.shape[0])])

['<sos>', '.', 'unito', 'regno', 'del', 'favore', 'a', 'comunitario', 'bilancio', 'al', 'finanziamento', 'del', 'correzione', 'di', 'meccanismo', 'il', 'contro', ',', 'casi', 'alcuni', 'in', ',', 'espresso', 'appena', 'hanno', 'che', 'voto', 'il', 'per', 'britannici', 'socialisti', 'colleghi', 'miei', 'dei', 'imbarazzo', "'", 'l', 'sopportare', 'potrei', 'non', 'ma', ',', 'così', 'è', 'non', 'che', 'certo', 'sono', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
['<sos>', 'i', 'am', 'sure', 'you', 'are', 'not', ',', 'but', 'i', 'would', 'hate', 'my', 'british', 'socialist', 'colleagues', 'to', 'be', 'embarrassed', 'by

### 5. Setting up the Seq2Seq model

We can now set up the Seq2Seq model. This is made up of three components:

1. the Encoder (an LSTM model)
2. the Decoder (also an LSTM model, which initialises its hidden and cell states with the final hidden and cell states of the Encoder; another recurrent model such as a GRU could be used instead)
3. the Seq2Seq model itself, which brings the two models together

#### Encoder

The encoder is a standard RNN, in this case an LSTM. For simplicity we use just one LSTM layer. We're not interested in the output of the LSTM here, only in its final hidden and cell states, so we discard the model's output at each time step (and we don't need a final fully connected layer).

In [None]:
# Encoder
class Encoder(nn.Module):
  def __init__(self, vocab_size, embedding_dim, hidden_dim, num_lstm_layers):
    super(Encoder, self).__init__()

    self.vocab_size = vocab_size           # sometimes also labelled 'input size' as this is the dimensionality of the input tokens
    self.embedding_dim = embedding_dim     # the size of the embeddings
    self.hidden_dim = hidden_dim           # the size of the hidden layer
    self.num_lstm_layers = num_lstm_layers # the number of layers used

    self.embeddings = nn.Embedding(num_embeddings=self.vocab_size,   # This layer requires 2 inputs: the number of possible embeddings
                                   embedding_dim=self.embedding_dim) # (i.e. the size of the vocabulary) and the embedding dimension
    
    self.lstm = nn.LSTM(input_size=self.embedding_dim, hidden_size=self.hidden_dim, # The input size needs to be the dimension of the
                        num_layers=self.num_lstm_layers)                            # embeddings, while we can set any value for the
                                                                                    # hidden layer. 

  def forward(self, source_sequence):
    embedded = self.embeddings(source_sequence)        # The output dimensions will be  sequence_length x batch_size x embedding_dim                           

    _, (hidden, cell) = self.lstm(embedded)          # We do not care about the output in this case, just the hidden and cell states
                                                     # These will have dimensions num_layers x batch_size x hidden_dim
    return hidden, cell

In [None]:
# We can now test it and print out the dimensions of the hidden and cell states.
# These are [num_layers x batch_size x hidden_dim]
encoder = Encoder(vocab_size=len(source.vocab), embedding_dim=128, hidden_dim=256, num_lstm_layers=1).to(device)
hidden_enc, cell_enc = encoder(it_example)
print(hidden_enc.shape, cell_enc.shape)

torch.Size([1, 100, 256]) torch.Size([1, 100, 256])


#### Decoder

In this case, the first inputs to the decoder will be:

- The first token of the target sequence (as the model, given this token, will learn to predict the next one - in our case it will be a `<sos>` token)
- The final cell and hidden states of the encoder module

The model will need to produce an output (a prediction), so we will need to introduce a fully connected layer mapping from the hidden state to the vocabulary, i.e. `hidden_dim --> vocab_size`.

In [None]:
class Decoder(nn.Module):
  def __init__(self, vocab_size, embedding_dim, hidden_dim, num_lstm_layers):
    super(Decoder, self).__init__()

    self.vocab_size = vocab_size           # sometimes also labelled 'input size' as this is the dimensionality of the input tokens
    self.embedding_dim = embedding_dim     # the size of the embeddings
    self.hidden_dim = hidden_dim           # the size of the hidden layer
    self.num_lstm_layers = num_lstm_layers # the number of layers used
    
    
    self.embeddings = nn.Embedding(num_embeddings=self.vocab_size,   # This layer requires 2 inputs: the number of possible embeddings
                                   embedding_dim=self.embedding_dim) # (i.e. the size of the vocabulary) and the embedding dimension
    
    self.lstm = nn.LSTM(input_size=self.embedding_dim, hidden_size=self.hidden_dim, # The input size needs to be the dimension of the
                        num_layers=self.num_lstm_layers)                            # embeddings, while we can set any value for the
                                                                                    # hidden layer. 
    
    self.fc = nn.Linear(in_features=self.hidden_dim, out_features=self.vocab_size)  # In this case at each iteration we make a prediction
                                                                                    # so we need to map the hidden state onto the vocab size

  def forward(self, hidden, cell, target_sequence):
    embedded = self.embeddings(target_sequence.unsqueeze(0))      # Add a leading dimension of size one (a single time step).
                                                                  # Embeddings will have dimensions 1 x batch_size x embedding_dim
    out, (hidden, cell) = self.lstm(embedded, (hidden, cell))     # Hidden states will have dimensions num_layers x batch_size x hidden_dim
                                                                  # Output tensor will have dimensions 1 x batch_size x hidden_dim
    
    prediction = self.fc(out.squeeze(0))                          # Predictions will have dimensions batch_size x vocab_size

    return prediction, hidden, cell                                                                 

In [None]:
# We can now test it and print out the dimensions of the prediction, hidden and cell states.
# These will be [batch_size x vocab_size] for the prediction and [num_layers x batch_size x hidden_dim] for the states
decoder = Decoder(vocab_size=len(target.vocab), embedding_dim=128, hidden_dim=256, num_lstm_layers=1).to(device)
prediction_dec, hidden_dec, cell_dec = decoder(hidden_enc, cell_enc, en_example[0]) # passing only the first token of each target sentence
print(prediction_dec.shape, hidden_dec.shape, cell_dec.shape)

torch.Size([100, 11986]) torch.Size([1, 100, 256]) torch.Size([1, 100, 256])


#### Full model

We can now join the two components together in the Seq2Seq model. 
- There is no need to define any further layers as the `Encoder` and `Decoder` classes already provide everything we need. 
- The only check to make is to make sure that both the hidden dimension and the number of layers of `Encoder` and `Decoder` are the same.

After that, the model will work as follows:

- The forward method takes the source and target batches of sequences as inputs
- First, the source batch is passed through the `Encoder` to obtain the hidden state which will be used to initialise the decoder
- Then, the target sequence is passed **token by token** through the decoder. At each step, the decoder's input can be either its own prediction from the previous step or the actual next token of the target sequence. Feeding in the actual token is known as **teacher forcing**, a technique that typically makes the model converge faster. Here we add a `teacher_forcing_ratio` parameter, which sets the probability of using teacher forcing at each decoding step.

Here's an example

![](https://miro.medium.com/max/700/1*KtWwvLK-jpGPSnj3tStg-Q.png)
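The choice made at every decoding step can be isolated in a few lines. This is a minimal sketch of the rule only; `choose_next_input` is a hypothetical helper and the tokens are plain strings rather than tensors.

```python
import random

def choose_next_input(true_token, predicted_token, teacher_forcing_ratio, rng=random):
    # With probability teacher_forcing_ratio feed the ground-truth token,
    # otherwise feed back the model's own prediction
    if rng.random() < teacher_forcing_ratio:
        return true_token
    return predicted_token

# A ratio of 1.0 always uses the ground truth, a ratio of 0.0 never does
always_truth = choose_next_input("parliament", "house", teacher_forcing_ratio=1.0)
never_truth = choose_next_input("parliament", "house", teacher_forcing_ratio=0.0)

# With a ratio of 0.5, roughly half of the steps use the ground truth
rng = random.Random(0)
picks = [choose_next_input("truth", "guess", 0.5, rng) for _ in range(10000)]
share_truth = picks.count("truth") / len(picks)
print(always_truth, never_truth, round(share_truth, 2))
```

A ratio of 0 corresponds to the situation at inference time, when no target sentence is available.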



In [None]:
class Seq2Seq(nn.Module):
  def __init__(self, encoder, decoder, device): # expects already-instantiated Encoder and Decoder modules
    super(Seq2Seq, self).__init__()
    self.encoder = encoder
    self.decoder = decoder
    self.device = device

    assert(encoder.hidden_dim==decoder.hidden_dim)
    assert(encoder.num_lstm_layers==decoder.num_lstm_layers)

  def forward(self, source_sequence, target_sequence, teacher_forcing_ratio=0.5):
    # Note that this is done in batches, so one iteration of the code below
    # will go through as many sequences as there are in batch

    # Initialise empty tensor for predictions
    # dimensions will be max_length x batch_size x vocab_size
    max_length, batch_size = target_sequence.shape 
    vocab_size = self.decoder.vocab_size # need to use the vocab size of the target corpus as the
                                         # predictions will be made from this dictionary
   
    predictions = torch.zeros(max_length, batch_size, vocab_size, device=self.device)

    # Get source sequence through encoder
    hidden, cell = self.encoder(source_sequence)

    # Get target sequence through decoder, token by token
    trg = target_sequence[0] # start from the first token
    for i in range(1, max_length): # then loop through all the other tokens
      output, hidden, cell = self.decoder(hidden, cell, trg)
      predictions[i] = output 

      if random.random() < teacher_forcing_ratio:
        trg = target_sequence[i] # in this case we feed in the next actual token in the target sequence
      else:
        trg = output.argmax(1) # in this case we feed in the predicted token
      
    return predictions

In [None]:
# Let's now check the output of the whole network
encoder = Encoder(vocab_size=len(source.vocab), embedding_dim=128, hidden_dim=256, num_lstm_layers=1).to(device)
decoder = Decoder(vocab_size=len(target.vocab), embedding_dim=128, hidden_dim=256, num_lstm_layers=1).to(device)
seq2seq = Seq2Seq(encoder, decoder, device).to(device)
seq2seq

Seq2Seq(
  (encoder): Encoder(
    (embeddings): Embedding(17436, 128)
    (lstm): LSTM(128, 256)
  )
  (decoder): Decoder(
    (embeddings): Embedding(11986, 128)
    (lstm): LSTM(128, 256)
    (fc): Linear(in_features=256, out_features=11986, bias=True)
  )
)

In [None]:
predictions = seq2seq(it_example, en_example)
predictions.shape

torch.Size([102, 100, 11986])

As expected, the predictions have dimensions `max_length` x `batch_size` x `target.vocab_size`. Now, by picking, for each position, the vocabulary entry with the highest score, we can extract a series of tokens that can be compared with the ground truth (the target sequence batch).
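The token extraction amounts to an argmax over the vocabulary dimension at every position. Below is a minimal sketch on nested lists rather than tensors; `greedy_tokens` and the toy vocabulary are hypothetical.

```python
def greedy_tokens(scores, itos):
    # scores holds one row of vocabulary scores per position in the sequence;
    # for each row, pick the index of the highest score and look up its token
    return [itos[max(range(len(row)), key=row.__getitem__)] for row in scores]

toy_itos = ["<unk>", "<pad>", "<sos>", "<eos>", "hello"]
toy_scores = [
    [0.0, 0.0, 0.0, 0.0, 0.0],  # a row that was never filled in
    [0.1, 0.0, 0.2, 0.1, 2.0],  # highest score at index 4
    [0.0, 0.0, 0.0, 3.0, 1.0],  # highest score at index 3
]
decoded = greedy_tokens(toy_scores, toy_itos)
print(decoded)
```

The first row mirrors position 0 of `predictions`, which the `forward` loop never writes to. In this sketch an all-zero row resolves to index 0, i.e. `<unk>`.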

In [None]:
# For simplicity, let's just focus on one sequence of the predicted batch (the one at index 1) and compare it
# with the corresponding sequence of the target batch
first_source = it_example[:,1]
first_predicted = predictions[:,1].argmax(1)
first_target = en_example[:,1]

In [None]:
# And we can print out the actual sentences as follows
print((" ").join([source.vocab.itos[first_source[i].item()] for i in range(0, first_source.shape[0])][::-1]))
print((" ").join([target.vocab.itos[first_target[i].item()] for i in range(0, first_target.shape[0])]))
print((" ").join([target.vocab.itos[first_predicted[i].item()] for i in range(0, first_predicted.shape[0])]))

<pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <eos> sono certo che non è così , ma non potrei sopportare l ' imbarazzo dei miei colleghi socialisti britannici per il voto che hanno appena espresso , in alcuni casi , contro il meccanismo di correzione del finanziamento al bilancio comunitario a favore del regno unito . <sos>
<sos> i am sure you are not , but i would hate my british socialist colleagues to be embarrassed by the way they have just voted , in some cases against the uk budget rebate . <eos> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad

Of course the "predicted" text makes no sense as the model has not been trained yet. And note also that the model has not yet learnt when to stop the sentence.

### 6. Model training

At this stage, we just need to train the model (and evaluate it on the validation set). What we need is:

0. an initialised model (we already have it from above)
1. an optimiser
2. a loss criterion
3. a model training function
4. a model evaluation function

Let's define these components, and then combine them together in a training loop.

#### Training components

In [None]:
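# 1. Optimizer
# Note: Adam's default learning rate is 1e-3, so 0.75 is a very large value; consider lowering it if the loss does not decrease
optimizer = optim.Adam(seq2seq.parameters(), lr=0.75)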

In [None]:
# 2. Loss criterion
# we use CrossEntropyLoss - we should also exclude from the loss calculation the <PAD> tokens
index_for_padding = target.vocab.stoi['<pad>']
loss_criterion = nn.CrossEntropyLoss(ignore_index=index_for_padding)
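To see what excluding the padding positions changes, here is a small pure-Python sketch of a mean cross-entropy that skips one target index. `cross_entropy_ignore` is a hypothetical helper written for illustration, and the toy numbers are made up.

```python
import math

def cross_entropy_ignore(logits, targets, ignore_index):
    # Average negative log-likelihood over the positions whose target
    # is not ignore_index
    total, counted = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        top = max(row)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_sum_exp - row[target]
        counted += 1
    return total / counted

# Three positions, vocabulary of size 3; index 1 plays the role of <pad>
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
toy_targets = [0, 2, 1]

loss_all = cross_entropy_ignore(toy_logits, toy_targets, ignore_index=-1)
loss_masked = cross_entropy_ignore(toy_logits, toy_targets, ignore_index=1)
print(round(loss_all, 4), round(loss_masked, 4))
```

Without the mask, the easily predicted padding position pulls the average loss down and makes the model look better than it is on real tokens.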

In [None]:
# 3. Model training function
# we define a function to train the model through an epoch of the training set, batch by batch
# the function will take the model, the optimiser, the loss_criterion and the training data iterator as inputs
def train_model(seq2seq_model, optimizer, loss_criterion, training_data_iterator):

  seq2seq_model.train() # we place the model in training mode
  training_loss = 0 # reset the loss for this epoch

  for batch in training_data_iterator: # now we can go through all the batches in the dataset
    optimizer.zero_grad() # we reset the gradient
    output = seq2seq_model(batch.src, batch.trg) # the forward method in the seq2seq model takes the source and target
                                                 # sequences as inputs. We accept the 0.5 teacher-forcing ratio in this case
    # now we need to calculate the loss
    # first, we exclude the first element from both output and target
    # second, we need to flatten output and target as the loss function needs 2d output and 1d target
    
    output_flattened = output[1:].view(-1, output.shape[-1]) # this becomes 2d, i.e. (max_length x batch_size) x vocab_size
    target_flattened = batch.trg[1:].view(-1) # and the target becomes 1d, i.e. (max_length x batch_size)

    loss = loss_criterion(output_flattened, target_flattened)

    loss.backward() # backprop
    optimizer.step() # update model weights

    training_loss += loss.item() # add to the epoch loss

  return training_loss / len(training_data_iterator) # normalising by the number of batches, so we get the average batch loss

In [None]:
# 4. Model evaluation function
# we define a function to evaluate the model through an epoch of the validation set, batch by batch
# the function will take the model, the loss_criterion and the validation data iterator as inputs
def evaluate_model(seq2seq_model, loss_criterion, validation_data_iterator):

  seq2seq_model.eval() # we place the model in evaluation mode
  val_loss = 0 # reset the loss for this epoch

  with torch.no_grad():
    for batch in validation_data_iterator: # now we can go through all the batches in the dataset
      output = seq2seq_model(batch.src, batch.trg, teacher_forcing_ratio=0) # in this case we need to turn off teacher forcing as we need
                                                                            # to evaluate on the predictions
      # now we need to calculate the loss
      # first, we exclude the first element from both output and target
      # second, we need to flatten output and target as the loss function needs 2d output and 1d target
    
      output_flattened = output[1:].view(-1, output.shape[-1]) # this becomes 2d, i.e. (max_length x batch_size) x vocab_size
      target_flattened = batch.trg[1:].view(-1) # and the target becomes 1d, i.e. (max_length x batch_size)

      loss = loss_criterion(output_flattened, target_flattened)

      val_loss += loss.item() # add to the epoch loss

  return val_loss / len(validation_data_iterator) # normalising by the number of batches, so we get the average batch loss

In [None]:
# we also define a helper function to measure training time
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

#### Training loop

In [None]:
# First let's re-initialise batch size and the data iterators
BATCH_SIZE = 100

train_iter = BucketIterator(train_data, batch_size=BATCH_SIZE, device=device)
val_iter = BucketIterator(val_data, batch_size=BATCH_SIZE, device=device)
test_iter = BucketIterator(test_data, batch_size=BATCH_SIZE, device=device)

In [None]:
# Then let's re-initialise the model
EMBEDDING_DIM = 128
HIDDEN_DIM = 256
NUM_LSTM_LAYERS = 1

encoder = Encoder(vocab_size=len(source.vocab), embedding_dim=EMBEDDING_DIM, hidden_dim=HIDDEN_DIM, num_lstm_layers=NUM_LSTM_LAYERS).to(device)
decoder = Decoder(vocab_size=len(target.vocab), embedding_dim=EMBEDDING_DIM, hidden_dim=HIDDEN_DIM, num_lstm_layers=NUM_LSTM_LAYERS).to(device)
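seq2seq = Seq2Seq(encoder, decoder, device).to(device)

# The optimizer must be re-created too: the one defined earlier still holds the parameters
# of the previous model instance and would not update this new one
optimizer = optim.Adam(seq2seq.parameters(), lr=0.75)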

In [None]:
# And finally we can write the training loop
EPOCHS = 10
best_val_loss = float('inf')

for epoch in range(EPOCHS):
  start_time = time.time()
  train_loss = train_model(seq2seq_model=seq2seq, optimizer=optimizer, loss_criterion=loss_criterion, training_data_iterator=train_iter)
  val_loss = evaluate_model(seq2seq_model=seq2seq, loss_criterion=loss_criterion, validation_data_iterator=val_iter)
  end_time = time.time()

  elapsed_mins, elapsed_secs = epoch_time(start_time, end_time)

  # we can use the validation loss to save the best version of the model
  if val_loss < best_val_loss:
    best_val_loss = val_loss # update the best validation loss so far
    torch.save(seq2seq.state_dict(), 'best_seq2seq.pt') # save the model locally
  
  # finally we print some stats to monitor the training process
  print(f"Epoch: {epoch+1:02} | Time: {elapsed_mins}m {elapsed_secs}s")
  print(f"Training loss: {train_loss:.3f} | Val loss: {val_loss:.3f}")

KeyboardInterrupt: ignored

### Model evaluation

We can now load the best trained model and make predictions on the test set (or on our own inputs).

In [None]:
# Load the model first
encoder = Encoder(vocab_size=len(source.vocab), embedding_dim=EMBEDDING_DIM, hidden_dim=HIDDEN_DIM, num_lstm_layers=NUM_LSTM_LAYERS).to(device)
decoder = Decoder(vocab_size=len(target.vocab), embedding_dim=EMBEDDING_DIM, hidden_dim=HIDDEN_DIM, num_lstm_layers=NUM_LSTM_LAYERS).to(device)
loaded_model = Seq2Seq(encoder, decoder, device).to(device)
loaded_model.eval() # the loaded model is only used for inference
loaded_model.load_state_dict(torch.load('best_seq2seq.pt', map_location=device))

<All keys matched successfully>

In [None]:
# Let's now try to predict on the test set and compare it with the target
test_batch = next(iter(test_iter))
it_test_example = test_batch.src
en_test_example = test_batch.trg

In [None]:
# Italian
print([source.vocab.itos[it_test_example[i,1].item()] for i in range(0, it_test_example.shape[0])])
# English
print([target.vocab.itos[en_test_example[i,1].item()] for i in range(0, en_test_example.shape[0])])

['<sos>', '.', 'kyoto', 'di', 'obiettivi', 'gli', 'raggiungere', 'a', 'mai', 'riusciremo', 'non', 'altrimenti', ',', 'economici', 'più', 'camion', 'sui', 'direttiva', 'una', 'presentarci', 'di', 'commissione', 'alla', 'chiedo', 'perché', 'ecco', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
['<sos>', 'this', 'is', 'why', 'i', 'would', 'ask', 'the', 'committee', 'to', 'submit', 'a', 'directive', 'for', 'more', 'economical', 'trucks', '.', 'otherwise', ',', 'we', 'will', 'never', 'meet', 'the', 'kyoto', 'objectives', '.', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'

In [None]:
test_predictions = loaded_model(it_test_example, en_test_example, teacher_forcing_ratio=0)

In [None]:
test_source_1 = it_test_example[:,1]
test_predicted_1 = test_predictions[:,1].argmax(1)
test_target_1 = en_test_example[:,1]

print((" ").join([source.vocab.itos[test_source_1[i].item()] for i in range(0, test_source_1.shape[0])][::-1]))
print((" ").join([target.vocab.itos[test_target_1[i].item()] for i in range(0, test_target_1.shape[0])]))
print((" ").join([target.vocab.itos[test_predicted_1[i].item()] for i in range(0, test_predicted_1.shape[0])]))

<pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <eos> ecco perché chiedo alla commissione di presentarci una direttiva sui camion più economici , altrimenti non riusciremo mai a raggiungere gli obiettivi di kyoto . <sos>
<sos> this is why i would ask the committee to submit a directive for more economical trucks . otherwise , we will never meet the kyoto objectives . <eos> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
<unk> advocating enrichment pushed osce po corresponding embracing wholesale resources markedly dislike fort check

In [None]:
# Write a function that makes a translation given any string in italian
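One possible shape for such a function is a greedy decoding loop: encode the source once, then repeatedly feed the last predicted token back into the decoder until `<eos>` appears. The sketch below is not the notebook's implementation. `greedy_translate`, `toy_encode` and `toy_decode_step` are hypothetical, and the toy "model" is a dictionary lookup used only so the loop can run on its own. With the trained model, `encode` would wrap `loaded_model.encoder` and `decode_step` would wrap `loaded_model.decoder` followed by an argmax.

```python
def greedy_translate(source_tokens, encode, decode_step,
                     sos="<sos>", eos="<eos>", max_len=50):
    # The source is reversed and wrapped in <sos>/<eos>, as in the training data
    state = encode([sos] + source_tokens[::-1] + [eos])
    token, output = sos, []
    for _ in range(max_len):  # hard cap in case <eos> is never produced
        token, state = decode_step(token, state)
        if token == eos:
            break
        output.append(token)
    return output

# Toy stand-ins for the encoder and decoder (assumptions for illustration only)
toy_dict = {"il": "the", "parlamento": "parliament", "vota": "votes"}

def toy_encode(tokens):
    # The "state" is simply the source words restored to their original order
    return [tok for tok in tokens if tok not in ("<sos>", "<eos>")][::-1]

def toy_decode_step(prev_token, state):
    # Emit the translation of the next pending word, or <eos> when none is left
    if not state:
        return "<eos>", state
    return toy_dict.get(state[0], "<unk>"), state[1:]

translation = greedy_translate(["il", "parlamento", "vota"], toy_encode, toy_decode_step)
print(translation)
```

The `max_len` cap matters in practice, since a poorly trained model may never emit `<eos>`.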