# Neural Machine Translation

This week and the next, we will build a neural machine translation model based on the sequence-to-sequence (seq2seq) models proposed by Sutskever et al., 2014 and Cho et al., 2014. The seq2seq model is widely used in Machine Translation systems such as Google’s neural machine translation system (GNMT) (Wu et al., 2016).

In today’s lab and the one next week, we will explore the seq2seq model, as well as attention in machine translation.

For training and evaluating our mode, we will use the English-Vietnamese parallel corpus of TED talks provided by the IWSLT Evaluation Campaign. For our tasks, we will translate from Vietnamese into English.

The parallel corpus has been provided for you:
1. **data.30.vi** - a file where each line contains a Vietnamese sentence to be translated (i.e. the source sentences)
2. **data.30.en** - a file where each line contains an English sentence corresponding to the Vietnamese sentence in the same line position. (i.e. the target sentences)


In [1]:
!wget 'https://github.com/juntaoy/ECS7001_LAB_DATASETS/raw/refs/heads/main/NMT_data.zip'
!unzip NMT_data.zip -x __MACOSX/*

--2025-03-16 19:06:31--  https://github.com/juntaoy/ECS7001_LAB_DATASETS/raw/refs/heads/main/NMT_data.zip
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/juntaoy/ECS7001_LAB_DATASETS/refs/heads/main/NMT_data.zip [following]
--2025-03-16 19:06:31--  https://raw.githubusercontent.com/juntaoy/ECS7001_LAB_DATASETS/refs/heads/main/NMT_data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4815130 (4.6M) [application/zip]
Saving to: ‘NMT_data.zip.8’


2025-03-16 19:06:31 (141 MB/s) - ‘NMT_data.zip.8’ saved [4815130/4815130]

Archive:  NMT_data.zip
replace data.30.en? [y]es, [n]o, [A]ll, [N]one, [r]e

Let's first install the `Sacrebleu` (https://github.com/mjpost/sacrebleu) package for BLEU computation.

In [2]:
!pip install sacrebleu



## Overview
This script defines a total of three classes: the main class (`NmtModel`), the attention layer class (`AttentionLayer`) and a helper class (`LanguageDict`). The `NmtModel` class contains most of the code of the NMT system and is the one you are asked to complete for Task 1 and 2. The `AttentionLayer` class is a custom layer to implement the attention mechanism, Task 3 is to finish this class. `LanguageDict` is a class that stores resources related to languages, such as vocab, word2ids, etc. The code for this class is provided.

In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import collections
import numpy as np
import time
from sacrebleu import corpus_bleu

SOURCE_PATH = 'data.30.vi'
TARGET_PATH = 'data.30.en'

## The `LanguageDict` class stores the language resources
This class has only an initialisation method. The method takes a corpus as the input and builds the vocab and word2ids for the language.

In [4]:
class LanguageDict():
  def __init__(self, sents):
    word_counter = collections.Counter(tok.lower() for sent in sents for tok in sent)

    self.vocab = []
    self.vocab.append('<pad>') #zero paddings
    self.vocab.append('<unk>')
    # add only words that appear at least 10 times in the corpus
    self.vocab.extend([t for t,c in word_counter.items() if c > 10])

    self.word2ids = {w:id for id, w in enumerate(self.vocab)}
    self.UNK = self.word2ids['<unk>']
    self.PAD = self.word2ids['<pad>']

## The `load_dataset()` method creates train/dev/test batches
The method reads the given file and loads the first max_num_examples sentences and split them into train/dev/test dataset:

In [5]:
def pad_sequences(seq_list, max_len=None, pad_value=0):
    """
    A simple PyTorch-like pad_sequences function.
    seq_list: List of lists of token IDs.
    max_len : If None, will use the length of the longest sequence.
    pad_value: ID to use for padding.
    Returns a 2D NumPy array with shape [batch_size, max_length].
    """
    if max_len is None:
        max_len = max(len(seq) for seq in seq_list)
    padded = []
    for seq in seq_list:
        seq = seq[:max_len]
        padded.append(seq + [pad_value]*(max_len - len(seq)))
    return np.array(padded)

In [6]:
def load_dataset(source_path,target_path, max_num_examples=30000):
  ''' This helper method reads from the source and target files to load max_num_examples
  sentences split them into train, development and testing and return relevant data.
  Inputs:
    source_path (string): the full path to the source data, SOURCE_PATHf
    target_path (string): the full path to the target data, TARGET_PATH
  Returns:
    train_data (list): a list of 3 elements: source_words, target words, target word labels
    dev_data (list): a list of 2 elements - source words, target word labels
    test_data (list): a list of 2 elements - source words, target word labels
    source_dict (LanguageDict): a LanguageDict object for the source language, Vietnamese.
    target_dict (LanguageDict): a LanguageDict object for the target language, English.
  '''
  # source_lines/target lines are list of strings
  # such that each string is a sentence in the corresponding file
  source_lines = open(source_path).readlines()
  target_lines = open(target_path).readlines()
  assert len(source_lines) == len(target_lines)
  if max_num_examples > 0:
    max_num_examples = min(len(source_lines), max_num_examples)
    source_lines = source_lines[:max_num_examples]
    target_lines = target_lines[:max_num_examples]

  # strip trailing/leading whitespaces and tokenize each sentence
  source_sents = [[tok.lower() for tok in sent.strip().split(' ')] for sent in source_lines]
  target_sents = [[tok.lower() for tok in sent.strip().split(' ')] for sent in target_lines]
  # for the target sentences, add <start> and <end> tokens to each sentence
  for sent in target_sents:
    sent.append('<end>')
    sent.insert(0,'<start>')

  # create the LanguageDict objects for each file
  source_lang_dict = LanguageDict(source_sents)
  target_lang_dict = LanguageDict(target_sents)


  # for the source sentences:
  # we'll use this proportion to split into train/dev/test
  unit = len(source_sents)//10
  # get the sents-as-ids for each sentence
  source_words = [[source_lang_dict.word2ids.get(tok,source_lang_dict.UNK) for tok in sent] for sent in source_sents]
  # 8 parts (80%) of the sentences go to the training data and are padded up to the maximum sentence length
  source_words_train = pad_sequences(source_words[:8*unit])
  # 1 part (10%) of the sentences go to the dev data and are padded up to the up to the maximum sentence length
  source_words_dev = pad_sequences(source_words[8*unit:9*unit])
  # 1 part (10%) of the sentences go to the test dataand are padded up to the up to the maximum sentence length
  source_words_test = pad_sequences(source_words[9*unit:])


  eos = target_lang_dict.word2ids['<end>']
  # for each sentence, get the word index for the tokens from <start> to up to but not including <end>,
  target_words = [[target_lang_dict.word2ids.get(tok,target_lang_dict.UNK) for tok in sent[:-1]] for sent in target_sents]
  # select the training set and pad the sentences
  target_words_train = pad_sequences(target_words[:8*unit])
  # the label for each target word is the next word, we also add <end> as the last token
  target_words_train_labels = [sent[1:]+[eos] for sent in target_words[:8*unit]]
  # pad the labels. Dim = [num_sents, max_sent_length]
  target_words_train_labels = pad_sequences(target_words_train_labels)
  # expand one dimension at the end for the loss computation. Dim = [num_sents, max_sent_length, 1].
  target_words_train_labels = np.expand_dims(target_words_train_labels,axis=2)

  # get the labels for the dev and test data. No need for inputs here and no need to expand dimensions
  target_words_dev_labels = pad_sequences([sent[1:] + [eos] for sent in target_words[8 * unit:9 * unit]])
  target_words_test_labels = pad_sequences([sent[1:] + [eos] for sent in target_words[9 * unit:]])

  # our final data
  train_data = [source_words_train,target_words_train,target_words_train_labels]
  dev_data = [source_words_dev,target_words_dev_labels]
  test_data = [source_words_test,target_words_test_labels]

  return train_data,dev_data,test_data,source_lang_dict,target_lang_dict

## The `AttentionLayer` class creates a custom layer for attention

The class takes two inputs: the `encoder_outputs` and the `decoder_outputs` and returns a `new_decoder_outputs` that leverages the `decoder_outputs` with the `encoder_outputs`.

This class contains three methods. The first one is used for passing the mask to the next layer. The mask is originally created by the `Embedding` layer with the `mask_zero` attribute set to `True`, so that the padding is not taken into account in the computations of loss or by LSTM layers. So, in this first method we return the mask for the `decoder_outputs`. The second method computes the output shape of our layer. The output shape of the layer is the same to the `decoder_outputs` in the first two dimensions and for the last dimension the embedding dimension is doubled.

The third method is the main method for the layer, and also the one you will need to implement for your Task 3. We will come back to this later.


In [7]:
class AttentionLayer(nn.Module):
    """
    Custom layer implementing Luong attention.
    """
    def __init__(self):
        super(AttentionLayer, self).__init__()

    def forward(self, encoder_outputs, decoder_outputs):
        """
        encoder_outputs : [batch_size, max_source_length, hidden_size]
        decoder_outputs : [batch_size, max_target_length, hidden_size]
        """

        if encoder_outputs is None or decoder_outputs is None:
            raise ValueError("encoder_outputs or decoder_outputs is None.")

        batch_size, max_source_len, hidden_size = encoder_outputs.shape
        _, max_target_len, _ = decoder_outputs.shape

        #transposing decoder outputs to match encoder shape
        decoder_outputs_t = decoder_outputs.permute(0, 2, 1)  # [batch_size, hidden_size, max_target_length]

        #computing attention scores with dot product
        luong_score = torch.bmm(encoder_outputs, decoder_outputs_t)  # [batch_size, max_source_length, max_target_length]

        #applying a softmax to obtain attention weights
        attention_weights = F.softmax(luong_score, dim=1)  # Normalize over source sequence

        #computing the context vector as a weighted sum of encoder outputs
        attention_weights = attention_weights.permute(0, 2, 1).unsqueeze(-1)  # [batch, max_target_length, max_source_length, 1]
        encoder_outputs_exp = encoder_outputs.unsqueeze(1)  # [batch, 1, max_source_length, hidden_size]

        encoder_vector = torch.sum(attention_weights * encoder_outputs_exp, dim=2)  # [batch, max_target_length, hidden_size]

        #ensuring decoder_outputs and encoder_vector have the same length
        min_len = min(decoder_outputs.shape[1], encoder_vector.shape[1])
        decoder_outputs = decoder_outputs[:, :min_len, :]
        encoder_vector = encoder_vector[:, :min_len, :]

        #concating context vector with decoder outputs
        new_decoder_outputs = torch.cat([decoder_outputs, encoder_vector], dim=-1)

        print("attention decoder outputs shape:", new_decoder_outputs.shape)
        return new_decoder_outputs


## NmtModel class `__init__()` method: initialises the network parameters.

This method takes three arguments. The first two are instances of `LanguageDict`, one for the source language (Vietnamese) and one for the target language (English); the third argument is a boolean variable (`use_attention`) that indicates which model (attention/basic) should be used.

It then creates all the layers will be used in later stages.



## Task 1: Implement the Embedding Layers and the Encoder
In this task, you will work at the beginning of the `__init__()` and `forward()` method. You will need to first create two `nn.Embedding` layers (one for the source language and one for the target language). Then pass the source embedding into an `nn.LSTM` layer.

Let’s first look at the inputs. You have in total two inputs:
- `source_words`: the word indices of the sentences in the source language. This input has the shape `[batch_size, max_source_sent_len]` during both training and inference.
- `target_words`: the word indices of the sentences in the target language. During training, this input will have the shape `[batch_size, max_target_sent_len]`, but during the inference, it will have the shape `[batch_size, 1]`.

You will need to first create two `nn.Embedding` layers `embedding_source` and `embedding_target`. The Embedding layers will randomly initialise the embeddings for individual words in the vocabulary and the embeddings will be trained together with the network.  The `nn.Embedding` layers have an `input_dim` of the `vocab_size` and an `output_dim` of the `embedding_size`.  Please note the `vocab_size` for the source and the target language are different. Also, you will need to set the `padding_idx` in order to ignore the paddings.
  
Secondly, you need to look up the embeddings for the current inputs (`source_words` and `target_words`) by passing them through the `nn.Embedding` layers you created. The embeddings for source and target words need to be called `source_words_embeddings` and `target_words_embeddings` respectively.

Thirdly, you can create an `nn.LSTM` layer to process the `source_words_embeddings`, you will need to set the `bidirectional` to `False` and set the `batch_first` to `True`.

In [8]:
class NmtModel(nn.Module):
    def __init__(self, source_dict, target_dict, use_attention):
        """
        Initializes the NMT Model hyperparameters and layers.
        """
        super().__init__()
        self.source_dict = source_dict
        self.target_dict = target_dict
        self.use_attention = use_attention

        # Hyperparams
        self.hidden_size = 200
        self.embedding_size = 100
        self.hidden_dropout_rate = 0.2
        self.embedding_dropout_rate = 0.2
        self.batch_size = 100
        self.max_target_step = 30

        # Special tokens
        self.SOS = target_dict.word2ids['<start>']
        self.EOS = target_dict.word2ids['<end>']

        # Vocab sizes
        self.vocab_source_size = len(source_dict.vocab)
        self.vocab_target_size = len(target_dict.vocab)

        print(f"number of tokens in source: {self.vocab_source_size}, "
              f"number of tokens in target: {self.vocab_target_size}")


        """
        Task 1: Implementing the encoder 1/2

        Begin
        """

        # embeddings for source and target with padding_idx specified
        self.embedding_source = nn.Embedding(self.vocab_source_size, self.embedding_size, padding_idx=source_dict.PAD)
        self.embedding_target = nn.Embedding(self.vocab_target_size, self.embedding_size, padding_idx=target_dict.PAD)


        # encoder lstm layer
        self.encoder_lstm = nn.LSTM(input_size=self.embedding_size,hidden_size=self.hidden_size,num_layers=1,batch_first=True,dropout=self.hidden_dropout_rate,bidirectional=False)
        """
        End Task 1 1/2
        """

        # Decoder LSTM
        self.decoder_lstm = nn.LSTM(
            input_size=self.embedding_size,
            hidden_size=self.hidden_size,
            num_layers=1,
            batch_first=True,
            dropout=self.hidden_dropout_rate,
            bidirectional=False
        )

        # Attention (if use_attention)
        if self.use_attention:
            self.decoder_attention = AttentionLayer()

        # Final projection
        # If attention, hidden_size * 2, else hidden_size
        if self.use_attention:
            self.decoder_dense = nn.Linear(self.hidden_size*2, self.vocab_target_size)
        else:
            self.decoder_dense = nn.Linear(self.hidden_size, self.vocab_target_size)

## NmtModel `forward()`, `decode_step()` and `encode()` methods:  builds the PyTorch models for training and inference.
The method first creates the inputs for both training and inference models, which include the source/target sentence batches; The inputs specifically used for the inference models are defined later.

Task 1 will be to create embeddings for both source/target languages as well as the encoder. We will discuss this in a later section.

After that, we define the decoder used for the training. In NMT separate decoders are often used for training and inference. During training, we feed the ground truth tokens into the decoder (teacher forcing), hence we process all tokens in the sentences in a single step. During inference, the system processes one token at a time, and the token predicted at the current step will be used as the input for the next step.  More specifically, the size of `target_words` will be `[batch, max_sent_len]` during training and `[batch, 1]` during inference. The training and inference models behave slightly differently, but they share all the layers (`decoder_lstm, decoder_attention and decoder_dense`);

Task 2 will be to implement the decoder for inference. We will discuss this later.

In [9]:
class NmtModel(NmtModel):
    def forward(self, source_words, target_words):
        """
        Forward pass for training:
          1) Encode the source sentences using the encoder LSTM.
          2) Use the final encoder state to initialize the decoder's hidden state.
          3) Feed all target words into the decoder LSTM in one go (teacher forcing).
          4) (Optional) apply the attention layer between the decoder outputs and the encoder outputs.
          5) Project the decoder outputs to vocabulary logits with self.proj.
        """

        """
        Task 1: Implementing the encoder 2/2

        Begin
        """

        #embedding lookup
        source_words_embeddings = self.embedding_source(source_words)  # [batch_size, max_source_len, embedding_size]
        target_words_embeddings = self.embedding_target(target_words)  # [batch_size, max_target_len, embedding_size]



        #encoding source words
        encoder_outputs, (enc_h, enc_c) = self.encoder_lstm(source_words_embeddings)

        #teacher forcing
        decoder_outputs, _ = self.decoder_lstm(target_words_embeddings, (enc_h, enc_c))

        # if attention is used
        if self.use_attention:
            print("applying attention")
            decoder_outputs = self.decoder_attention(encoder_outputs, decoder_outputs)
            if decoder_outputs is None:
              raise ValueError("decoder_outputs is none after attention!")


        """
        End Task 1 2/2
        """

        # 5) Projection
        decoder_outputs = self.decoder_dense(decoder_outputs)  # [batch, max_tgt_len, vocab_target_size]
        return decoder_outputs

    def decode_step(self, target_words, decoder_states, encoder_outputs):
        """
        A single step of decoder inference:
          - Embedding for the current token
          - One-step LSTM forward
          - (Optional) attention over encoder outputs
          - Project to vocab
        Inputs:
          tgt_input: shape [batch_size, 1]
          decoder_states: (dec_h, dec_c) each is [1, batch_size, hidden_size]
          encoder_outputs: [batch_size, max_src_len, hidden_size]
        Returns:
          logits for the next token, and the new decoder states
        """


        """
        Task 2: Implementing the decoder and the inference loop
        In this task, you will work on the decode_step() method.

The decoder for inference is similar to the encoder for training but it only performs one step of the decoding at a time.
 Remember the decoders share all the layers, you will need to use the layers created in the decoder for training.
  In total three layers are used in both decoders. These are the decoder_lstm (the decoder nn.LSTM layer),
   decoder_dense (the decoder final layer) and the decoder_attention (the attention layer for the attention based model) layers.

First, unlike the decoder for training that uses the encoder_states (enc_h, enc_c) as the hidden_size for decoder_lstm, we need to use the decoder states from the previous step instead (dec_h, dec_c). You need to put them together in a list to create the decoder_states. If you take a look at the eval_process method you will find out that for the first step, the decoder_states passed into the model are actually the encoder_states (same as in the decoder during training), while in the subsequent steps the decoder_states become the ones the decoder_states returns in the previous step.

Secondly, you will need to pass the target_word_embeddings and decoder_states to the decoder_lstm.

Thirdly you will write an if statement for the attention model just like we did in the decoder for training.

Finally, pass the output of the nn.LSTM (for basic model) or the attention layer (for attention model) into the final linear layer of the decoder (decoder_dense) to get probabilities for the next token.

You have now a functional NMT system, why not test it out to see how well it works. Please note you need to set the use_attention to False since you haven’t implemented the attention layer yet. The system will take about a minute to finish 10 epochs of training and you will get a BLEU score of around 4.

        Begin
        """
        #embedding lookup for the current target token [batch_size, 1, embedding_size]
        target_words_embeddings = self.embedding_target(target_words)

        # LSTM step using previous decoder states
        decoder_outputs, (dec_h, dec_c) = self.decoder_lstm(target_words_embeddings, decoder_states)

        #if attention is used
        if self.use_attention:
            decoder_outputs = self.decoder_attention(encoder_outputs, decoder_outputs)
            if decoder_outputs is None:
                raise ValueError("decoder_outputs is None after attention!")

        #projecting to vocabulary space [batch, 1, vocab_target_size]
        decoder_outputs = self.decoder_dense(decoder_outputs)

        """
        End Task 2
        """

        return decoder_outputs, (dec_h, dec_c)

    def encode(self, source_words):
        """
        Encode the source sequence once for inference.
        """
        source_words_embeddings = F.dropout(self.embedding_source(source_words), p=self.embedding_dropout_rate, training=False)
        encoder_outputs, (enc_h, enc_c) = self.encoder_lstm(source_words_embeddings)
        return encoder_outputs, (enc_h, enc_c)

## NmtModel, `time_used()` method: outputs the time differences between the current time and the input time.
It is always good practice to record the time usage of an individual process, so you always know which part is most expensive to run.

In [10]:
class NmtModel(NmtModel):
  def time_used(self, start_time):
          """
          Outputs the time differences between now and start_time.
          """
          curr_time = time.time()
          used_time = curr_time - start_time
          m = int(used_time // 60)
          s = int(used_time - 60 * m)
          return f"{m} m {s} s"

## The `get_target_sentences()` method takes sentence indices and returns the string tokens.
The method is a helper for the `eval_process` method, which is used to create reference and candidate sentences for evaluation.



In [11]:
class NmtModel(NmtModel):
    def get_target_sentences(self, sents, vocab):
        """
        Convert a batch of sequences of token-IDs into strings.
        Stop at <end> or skip <start>.
        """
        str_sents = []
        num_sent, max_len = sents.shape
        for i in range(num_sent):
            str_sent = []
            for j in range(max_len):
                t = int(sents[i, j])
                if t == self.SOS:
                    continue
                if t == self.EOS:
                    break
                str_sent.append(vocab[t])
            str_sents.append(" ".join(str_sent))
        return str_sents

## NmtModel, `eval_process()` method: runs evaluation on the given dataset.
The method first translates the source sentences into the target language, and then compares them to the reference sentences. As a result, it outputs standard BLEU scores (as computed by the state-of-the-art Sacrebleu (https://github.com/mjpost/sacrebleu) implementation). Note that here we do not tokenise our outputs and references as they are already tokenised and we compare the models internally. However, to ensure comparability to other published work for the same data you need to detokenise your outputs and then use the default tokenisation with the argument `tokenize=BLEU.TOKENIZER_DEFAULT`.

*`eval()` method and exist in `nn.module`, used to convert the model into a evaluation mode. for example turning off dropout*

In [12]:
def detokenize(sentence_ids, vocab):
    """
    Convert a list of token IDs back into a readable sentence using the vocabulary in order to understand the input sentence.
    """
    words = [vocab[idx] if idx < len(vocab) else "<unk>" for idx in sentence_ids]
    return " ".join(words)

In [13]:
import html
class NmtModel(NmtModel):
    def eval_process(self, dataset):
        """
        Evaluate on a given dataset, returning a BLEU score.
        """
        self.eval()  # set model to eval mode, turning off dropout

        source_words, target_words_labels = dataset
        device = next(self.parameters()).device

        # Convert to torch
        source_words_torch = torch.LongTensor(source_words).to(device)
        target_words_labels_torch = torch.LongTensor(target_words_labels).to(device)

        # 1) Encode
        with torch.no_grad():
            encoder_outputs, (enc_h, enc_c) = self.encode(source_words_torch)

        batch_size = source_words_torch.size(0)
        # Start tokens => shape [batch_size, 1]
        step_tgt = torch.LongTensor([self.SOS]*batch_size).unsqueeze(1).to(device)
        decoder_states = (enc_h, enc_c)

        predictions = []
        # 2) decode up to max_target_step
        for _ in range(self.max_target_step):
            with torch.no_grad():
                logits, decoder_states = self.decode_step(step_tgt, decoder_states, encoder_outputs)
            # argmax over vocab
            step_tgt = torch.argmax(logits, dim=-1)  # [batch_size, 1]
            predictions.append(step_tgt.cpu().numpy())

        # Convert predictions => [batch, max_target_step]
        predictions = np.concatenate(predictions, axis=1)
        # Convert to strings
        candidates = self.get_target_sentences(predictions, self.target_dict.vocab)
        references = self.get_target_sentences(target_words_labels_torch.cpu().numpy(), self.target_dict.vocab)

        # Fix tokenization issues
        candidates = [html.unescape(sent) for sent in candidates]
        references = [html.unescape(sent) for sent in references]


        # Score with sacrebleu
        score = corpus_bleu(candidates, [references], tokenize='none').score
        print(f"Model BLEU score: {score:.2f}")


        print("\n Sample Translations:")
        for i in range(5):  # Print first 5 samples
            #print(f"input (Vietnamese): {detokenize(source_words[i], self.source_dict.vocab)}")
            print(f"predicted Translation: {candidates[i]}")
            print(f"reference Translation: {references[i]}\n")


        return score

## The `train_main` method starts the training.
Please note you will need to change the argument of the `use_attention` parameter accordingly.


In [14]:
class NmtModel(NmtModel):
    def train_model(self, train_data, dev_data, test_data, epochs=10, lr=0.01, clip_norm=5.0, device='cpu'):
        """
        Oversees the training process.
        1) For each epoch, train on the entire training dataset.
        2) Evaluate on dev data after each epoch.
        3) Finally evaluate on test data.
        """
        self.to(device)
        optimizer = optim.Adam(self.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss(ignore_index=self.target_dict.PAD)

        # Unpack data
        source_words_train, target_words_train, target_words_train_labels = train_data
        source_words_dev,   target_words_dev_labels = dev_data
        source_words_test,  target_words_test_labels = test_data

        # For convenience, convert all to torch on CPU first
        source_words_train_torch = torch.LongTensor(source_words_train)
        target_words_train_torch = torch.LongTensor(target_words_train)
        target_words_train_labels_torch = torch.LongTensor(target_words_train_labels.squeeze(-1))  # [batch, max_len]

        # We won't build a fancy DataLoader here; just run with entire batch or smaller mini-batches
        num_samples = source_words_train_torch.size(0)
        idx_list = np.arange(num_samples)

        start_time = time.time()

        for epoch in range(1, epochs+1):
            print(f"Starting training epoch {epoch}/{epochs}")
            epoch_time = time.time()

            # Shuffle data
            np.random.shuffle(idx_list)

            # Mini-batch training
            self.train()  # set model to train mode
            batch_size = self.batch_size
            for start_idx in range(0, num_samples, batch_size):
                end_idx = start_idx + batch_size
                excerpt = idx_list[start_idx:end_idx]

                src_batch = source_words_train_torch[excerpt].to(device)
                tgt_batch = target_words_train_torch[excerpt].to(device)
                tgt_labels_batch = target_words_train_labels_torch[excerpt].to(device)

                optimizer.zero_grad()

                logits = self.forward(src_batch, tgt_batch)  # [batch, tgt_len, vocab_size]

                # Flatten for cross entropy
                # logits: [batch*tgt_len, vocab_size]
                # labels: [batch*tgt_len]
                logits_2d = logits.view(-1, logits.size(-1))
                labels_2d = tgt_labels_batch.view(-1)

                loss = loss_fn(logits_2d, labels_2d)
                loss.backward()

                # Clip gradients
                torch.nn.utils.clip_grad_norm_(self.parameters(), clip_norm)
                optimizer.step()

            print(f"Time used for epoch {epoch}: {self.time_used(epoch_time)}")

            # Evaluate on dev
            print(f"Evaluating on dev set after epoch {epoch}/{epochs}:")
            self.eval_process([source_words_dev, target_words_dev_labels])

        # Training finished
        print("Training finished!")
        print(f"Time used for training: {self.time_used(start_time)}")

        # Evaluate on test set
        print("Evaluating on test set:")
        self.eval_process([source_words_test, target_words_test_labels])

In [15]:
def main(source_path, target_path, use_attention=True):
    max_example = 30000
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print("loading dictionaries...")
    train_data, dev_data, test_data, source_dict, target_dict = load_dataset(
        source_path, target_path, max_num_examples=max_example
    )
    print(f"read {len(train_data[0])}/{len(dev_data[0])}/{len(test_data[0])} train/dev/test batches")

    # Create model
    model = NmtModel(source_dict, target_dict, use_attention=use_attention)
    # Train
    model.train_model(train_data, dev_data, test_data, epochs=10, lr=0.01, clip_norm=5.0, device=device)

## Task 1: Implement the Embedding Layers and the Encoder
In this task, you will work at the beginning of the `__init__()` and `forward()` method. You will need to first create two `nn.Embedding` layers (one for the source language and one for the target language). Then pass the source embedding into an `nn.LSTM` layer.

Let’s first look at the inputs. You have in total two inputs:
- `source_words`: the word indices of the sentences in the source language. This input has the shape `[batch_size, max_source_sent_len]` during both training and inference.
- `target_words`: the word indices of the sentences in the target language. During training, this input will have the shape `[batch_size, max_target_sent_len]`, but during the inference, it will have the shape `[batch_size, 1]`.

You will need to first create two `nn.Embedding` layers `embedding_source` and `embedding_target`. The Embedding layers will randomly initialise the embeddings for individual words in the vocabulary and the embeddings will be trained together with the network.  The `nn.Embedding` layers have an `input_dim` of the `vocab_size` and an `output_dim` of the `embedding_size`.  Please note the `vocab_size` for the source and the target language are different. Also, you will need to set the `padding_idx` in order to ignore the paddings.
  
Secondly, you need to look up the embeddings for the current inputs (`source_words` and `target_words`) by passing them through the `nn.Embedding` layers you created. The embeddings for source and target words need to be called `source_words_embeddings` and `target_words_embeddings` respectively.

Thirdly, you can create an `nn.LSTM` layer to process the `source_words_embeddings`, you will need to set the `bidirectional` to `False` and set the `batch_first` to `True`.


## Task 2: Implement the Decoder for inference
In this task, you will work on the `decode_step()` method.

The decoder for inference is similar to the encoder for training but it only performs one step of the decoding at a time. Remember the decoders share all the layers, you will need to use the layers created in the decoder for training. In total three layers are used in both decoders. These are the `decoder_lstm` (the decoder nn.LSTM layer), `decoder_dense` (the decoder final layer) and the `decoder_attention` (the attention layer for the attention based model) layers.

First, unlike the decoder for training that uses the `encoder_states` (`enc_h`, `enc_c`) as the `hidden_size` for `decoder_lstm`, we need to use the decoder states from the previous step instead (`dec_h`, `dec_c`).  You need to put them together in a list to create the `decoder_states`. If you take a look at the `eval_process` method you will find out that for the first step, the `decoder_states` passed into the model are actually the `encoder_states` (same as in the decoder during training), while in the subsequent steps the `decoder_states` become the ones the `decoder_states` returns in the previous step.

Secondly, you will need to pass the `target_word_embeddings` and `decoder_states` to the `decoder_lstm`.

Thirdly you will write an if statement for the attention model just like we did in the decoder for training.

Finally, pass the output of the nn.LSTM (for basic model) or the attention layer (for attention model) into the final linear layer of the decoder (`decoder_dense`) to get probabilities for the next token.

You have now a functional NMT system, why not test it out to see how well it works. Please note you need to set the `use_attention` to `False` since you haven’t implemented the attention layer yet.  The system will take about a minute to finish 10 epochs of training and you will get a BLEU score of around 4.


In [16]:
main(SOURCE_PATH, TARGET_PATH, use_attention=False)

loading dictionaries...
read 24000/3000/3000 train/dev/test batches
number of tokens in source: 2034, number of tokens in target: 2506




Starting training epoch 1/10
Time used for epoch 1: 0 m 2 s
Evaluating on dev set after epoch 1/10:




Model BLEU score: 1.74

 Sample Translations:
predicted Translation: and i 'm going to <unk> the <unk> of the <unk> , and the <unk> of the <unk> .
reference Translation: there are four <unk> <unk> that , each time this ring <unk> it , as it <unk> the <unk> of the display , it <unk> up a position signal .

predicted Translation: and i 'm going to <unk> the <unk> of the <unk> , and the <unk> of the <unk> .
reference Translation: but this is really just the beginning .

predicted Translation: and i 'm going to <unk> the <unk> of the <unk> , and the <unk> .
reference Translation: it <unk> this by <unk> <unk> about two <unk> .

predicted Translation: and i 'm going to <unk> the <unk> of the <unk> , and the <unk> of the <unk> .
reference Translation: so as you can see here , this is a , <unk> <unk> <unk> board .

predicted Translation: and i 'm going to <unk> the <unk> of the <unk> , and the <unk> of the <unk> .
reference Translation: these blocks <unk> <unk> .

Starting training epoch 2/10




Model BLEU score: 1.15

 Sample Translations:
predicted Translation: so i 'm <unk> , and i 'm a <unk> .
reference Translation: there are four <unk> <unk> that , each time this ring <unk> it , as it <unk> the <unk> of the display , it <unk> up a position signal .

predicted Translation: and i think we 're <unk> to <unk> , and i 'm a <unk> .
reference Translation: but this is really just the beginning .

predicted Translation: so i 'm <unk> , and i 'm a <unk> .
reference Translation: it <unk> this by <unk> <unk> about two <unk> .

predicted Translation: and i think , you know , <unk> , <unk> , <unk> , <unk> , <unk> .
reference Translation: so as you can see here , this is a , <unk> <unk> <unk> board .

predicted Translation: so i 'm <unk> , and i 'm a <unk> .
reference Translation: these blocks <unk> <unk> .

Starting training epoch 3/10
Time used for epoch 3: 0 m 2 s
Evaluating on dev set after epoch 3/10:




Model BLEU score: 1.97

 Sample Translations:
predicted Translation: now , i 'm not sure you can do it .
reference Translation: there are four <unk> <unk> that , each time this ring <unk> it , as it <unk> the <unk> of the display , it <unk> up a position signal .

predicted Translation: but it 's not a <unk> <unk> .
reference Translation: but this is really just the beginning .

predicted Translation: it 's not a <unk> <unk> .
reference Translation: it <unk> this by <unk> <unk> about two <unk> .

predicted Translation: and the <unk> <unk> <unk> <unk> <unk> , <unk> , <unk> , <unk> , <unk> , <unk> .
reference Translation: so as you can see here , this is a , <unk> <unk> <unk> board .

predicted Translation: it 's not a <unk> <unk> .
reference Translation: these blocks <unk> <unk> .

Starting training epoch 4/10
Time used for epoch 4: 0 m 2 s
Evaluating on dev set after epoch 4/10:




Model BLEU score: 2.99

 Sample Translations:
predicted Translation: it 's not a <unk> <unk> <unk> , and it 's not a <unk> <unk> <unk> .
reference Translation: there are four <unk> <unk> that , each time this ring <unk> it , as it <unk> the <unk> of the display , it <unk> up a position signal .

predicted Translation: but it 's not a <unk> <unk> .
reference Translation: but this is really just the beginning .

predicted Translation: it 's a <unk> <unk> of <unk> .
reference Translation: it <unk> this by <unk> <unk> about two <unk> .

predicted Translation: and you know , it 's not a <unk> .
reference Translation: so as you can see here , this is a , <unk> <unk> <unk> board .

predicted Translation: it 's a <unk> <unk> .
reference Translation: these blocks <unk> <unk> .

Starting training epoch 5/10
Time used for epoch 5: 0 m 2 s
Evaluating on dev set after epoch 5/10:




Model BLEU score: 3.27

 Sample Translations:
predicted Translation: there 's a lot of <unk> in the world of <unk> and <unk> <unk> .
reference Translation: there are four <unk> <unk> that , each time this ring <unk> it , as it <unk> the <unk> of the display , it <unk> up a position signal .

predicted Translation: but it 's not just the first <unk> .
reference Translation: but this is really just the beginning .

predicted Translation: it 's not about selling <unk> .
reference Translation: it <unk> this by <unk> <unk> about two <unk> .

predicted Translation: and the <unk> <unk> is <unk> .
reference Translation: so as you can see here , this is a , <unk> <unk> <unk> board .

predicted Translation: this is the <unk> <unk> .
reference Translation: these blocks <unk> <unk> .

Starting training epoch 6/10
Time used for epoch 6: 0 m 2 s
Evaluating on dev set after epoch 6/10:




Model BLEU score: 3.57

 Sample Translations:
predicted Translation: there 's a lot of <unk> in the <unk> , and it 's <unk> , and it 's <unk> .
reference Translation: there are four <unk> <unk> that , each time this ring <unk> it , as it <unk> the <unk> of the display , it <unk> up a position signal .

predicted Translation: but this is a <unk> <unk> .
reference Translation: but this is really just the beginning .

predicted Translation: he 's <unk> <unk> , and the <unk> .
reference Translation: it <unk> this by <unk> <unk> about two <unk> .

predicted Translation: and the <unk> is <unk> .
reference Translation: so as you can see here , this is a , <unk> <unk> <unk> board .

predicted Translation: this is a <unk> .
reference Translation: these blocks <unk> <unk> .

Starting training epoch 7/10
Time used for epoch 7: 0 m 2 s
Evaluating on dev set after epoch 7/10:




Model BLEU score: 3.95

 Sample Translations:
predicted Translation: there 's a <unk> <unk> , and there 's a <unk> <unk> , and there 's a <unk> <unk> .
reference Translation: there are four <unk> <unk> that , each time this ring <unk> it , as it <unk> the <unk> of the display , it <unk> up a position signal .

predicted Translation: but it 's a <unk> .
reference Translation: but this is really just the beginning .

predicted Translation: the <unk> is <unk> .
reference Translation: it <unk> this by <unk> <unk> about two <unk> .

predicted Translation: and you can see the <unk> of the <unk> .
reference Translation: so as you can see here , this is a , <unk> <unk> <unk> board .

predicted Translation: there 's a <unk> .
reference Translation: these blocks <unk> <unk> .

Starting training epoch 8/10
Time used for epoch 8: 0 m 2 s
Evaluating on dev set after epoch 8/10:




Model BLEU score: 3.91

 Sample Translations:
predicted Translation: there 's a <unk> <unk> , and it 's <unk> , and it 's <unk> , but it 's <unk> .
reference Translation: there are four <unk> <unk> that , each time this ring <unk> it , as it <unk> the <unk> of the display , it <unk> up a position signal .

predicted Translation: but this is a <unk> .
reference Translation: but this is really just the beginning .

predicted Translation: it 's a <unk> <unk> .
reference Translation: it <unk> this by <unk> <unk> about two <unk> .

predicted Translation: and you can see the <unk> .
reference Translation: so as you can see here , this is a , <unk> <unk> <unk> board .

predicted Translation: this is a <unk> .
reference Translation: these blocks <unk> <unk> .

Starting training epoch 9/10
Time used for epoch 9: 0 m 2 s
Evaluating on dev set after epoch 9/10:




Model BLEU score: 4.34

 Sample Translations:
predicted Translation: there 's a <unk> <unk> , and it 's <unk> , but it 's <unk> , but it 's <unk> .
reference Translation: there are four <unk> <unk> that , each time this ring <unk> it , as it <unk> the <unk> of the display , it <unk> up a position signal .

predicted Translation: but here 's the <unk> thing .
reference Translation: but this is really just the beginning .

predicted Translation: it 's like the <unk> of the <unk> .
reference Translation: it <unk> this by <unk> <unk> about two <unk> .

predicted Translation: and you can see the <unk> of the <unk> .
reference Translation: so as you can see here , this is a , <unk> <unk> <unk> board .

predicted Translation: this is the <unk> <unk> .
reference Translation: these blocks <unk> <unk> .

Starting training epoch 10/10
Time used for epoch 10: 0 m 2 s
Evaluating on dev set after epoch 10/10:




Model BLEU score: 4.35

 Sample Translations:
predicted Translation: it 's a <unk> <unk> , and it 's <unk> , <unk> , <unk> , <unk> , <unk> , <unk> .
reference Translation: there are four <unk> <unk> that , each time this ring <unk> it , as it <unk> the <unk> of the display , it <unk> up a position signal .

predicted Translation: but this is the <unk> <unk> .
reference Translation: but this is really just the beginning .

predicted Translation: the <unk> is <unk> .
reference Translation: it <unk> this by <unk> <unk> about two <unk> .

predicted Translation: and now , in the <unk> condition , you can see the <unk> .
reference Translation: so as you can see here , this is a , <unk> <unk> <unk> board .

predicted Translation: this is the <unk> <unk> .
reference Translation: these blocks <unk> <unk> .

Training finished!
Time used for training: 0 m 28 s
Evaluating on test set:




Model BLEU score: 4.46

 Sample Translations:
predicted Translation: the <unk> <unk> <unk> <unk> <unk> <unk> <unk> , and it 's <unk> .
reference Translation: the second quote is from the head of the u.k. financial services <unk> .

predicted Translation: it 's <unk> .
reference Translation: it gets worse .

predicted Translation: what 's the <unk> of the <unk> that we can do is <unk> ?
reference Translation: what 's happening here ? how can this be possible ?

predicted Translation: well , it 's not a <unk> .
reference Translation: unfortunately , the answer is yes .

predicted Translation: but <unk> , <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> .
reference Translation: but there 's an <unk> solution which is coming from what is known as the science of <unk> .



## Task 3: Implement the Attention layer
In this task, you will work on the `forward` method of the `AttentionLayer` class.

The attention decoder is the secret recipe for the success of the NMT. It enables the decoder to access all the encoder outputs and focus on their different parts during different steps. By contrast, the basic model only has access to the final states of the encoder. There are a few different ways to build an attention mechanism. Here we build an attention mechanism similar to the one proposed by Luong et al. (2015), which computes the score between `decoder_outputs` and `encoder_outputs` by dot product.

First, let’s take a look at the shape of our inputs (`encoder_outputs, decoder_outputs`). `encoder_outputs` has a shape of `[batch_size, max_source_sent_len, hidden_size]`. `decoder_outputs` has a shape of `[batch_size, max_target_sent_len, hidden_size]`. In order to multiply them, we need to first transpose the last two dimensions of `decoder_outputs` to make its shape become `[batch_size, hidden_size, max_target_sent_len]`. You will need to use the backend `permute_dimensions` method to do this.

Once the `decoder_output` is transposed we use the `batch_dot` to compute the dot product. Let’s call the output `luong_score`. It has a shape of `[batch_size, max_source_sent_len, max_target_sent_len]` then you need apply a softmax to the dimension that have a size of `max_source_sent_len` to create an attention score for the `encoder_outputs`.   

Finally, we are going to create the `encoder_vector` by doing element-wise multiplication between the `encoder_outputs` and their attention scores (`luong_score`). But as you may have noticed the shape of `luong_score` is actually not the same as that of `encoder_outputs`, so we need to use the `expand_dims` method to expand dimensions for both of them. For  `luong_score`, you need to expand the last dimension to accommodate the `hidden_size` dimension of `encoder_outputs`. So after expansion, the shape becomes `[batch_size, max_source_sent_len, max_target_sent_len, 1]`. For  `encoder_outputs`, the target shape is `[batch_size, max_source_sent_len, 1, hidden_size]`. When multiplying between the two tensors, the expanded dimensions will be broadcasted so that they have the same shape. The last step is to sum along the `max_source_sent_len` dimension to create the `encoder_vector`.

Before returning the `new_decoder_outputs` we concatenate the `decoder_outputs` and the `encoder_vector` using the concatenate method (the code is already provided).

You’ve created an attention NMT system, let’s run your code (remember to set `use_attention` to True), it will take about a minute on a GPU to train it and you will get a much better BLEU score, usually above 12 (three times better than the score for the basic version).

In [17]:
main(SOURCE_PATH, TARGET_PATH, use_attention=True)

loading dictionaries...
read 24000/3000/3000 train/dev/test batches
number of tokens in source: 2034, number of tokens in target: 2506
Starting training epoch 1/10
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs s



applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: 



Model BLEU score: 9.90

 Sample Translations:
predicted Translation: there are some <unk> , to <unk> , it 's going to be <unk> , it 's going to be <unk> .
reference Translation: there are four <unk> <unk> that , each time this ring <unk> it , as it <unk> the <unk> of the display , it <unk> up a position signal .

predicted Translation: but this is really the beginning of the beginning .
reference Translation: but this is really just the beginning .

predicted Translation: the <unk> <unk> <unk> <unk> .
reference Translation: it <unk> this by <unk> <unk> about two <unk> .

predicted Translation: and the <unk> is a <unk> that there are a <unk> <unk> .
reference Translation: so as you can see here , this is a , <unk> <unk> <unk> board .

predicted Translation: this is the <unk> .
reference Translation: these blocks <unk> <unk> .

Starting training epoch 2/10
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: to



Model BLEU score: 12.17

 Sample Translations:
predicted Translation: there 's four <unk> , to each other , and each of the <unk> , it 's <unk> <unk> .
reference Translation: there are four <unk> <unk> that , each time this ring <unk> it , as it <unk> the <unk> of the display , it <unk> up a position signal .

predicted Translation: but this is actually the beginning of this .
reference Translation: but this is really just the beginning .

predicted Translation: the <unk> approach around the <unk> .
reference Translation: it <unk> this by <unk> <unk> about two <unk> .

predicted Translation: and you can see here is the <unk> of <unk> .
reference Translation: so as you can see here , this is a , <unk> <unk> <unk> board .

predicted Translation: this is the <unk> .
reference Translation: these blocks <unk> <unk> .

Starting training epoch 3/10
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([100



Attention decoder outputs shape: torch.Size([3000, 1, 400])
Attention decoder outputs shape: torch.Size([3000, 1, 400])
Model BLEU score: 13.33

 Sample Translations:
predicted Translation: there 's four <unk> , which is a <unk> , which is a <unk> of the <unk> , it 's <unk> .
reference Translation: there are four <unk> <unk> that , each time this ring <unk> it , as it <unk> the <unk> of the display , it <unk> up a position signal .

predicted Translation: but here 's actually the <unk> .
reference Translation: but this is really just the beginning .

predicted Translation: the <unk> <unk> <unk> <unk> .
reference Translation: it <unk> this by <unk> <unk> about two <unk> .

predicted Translation: and you can see here is the <unk> <unk> .
reference Translation: so as you can see here , this is a , <unk> <unk> <unk> board .

predicted Translation: this particular <unk> of this <unk> .
reference Translation: these blocks <unk> <unk> .

Starting training epoch 4/10
applying attention
Attenti



Model BLEU score: 14.46

 Sample Translations:
predicted Translation: there are four <unk> to <unk> , and every <unk> , it 's <unk> when it <unk> behind the <unk> .
reference Translation: there are four <unk> <unk> that , each time this ring <unk> it , as it <unk> the <unk> of the display , it <unk> up a position signal .

predicted Translation: but this is actually just beginning .
reference Translation: but this is really just the beginning .

predicted Translation: <unk> <unk> : the <unk> <unk> .
reference Translation: it <unk> this by <unk> <unk> about two <unk> .

predicted Translation: and you can see here is the <unk> <unk> <unk> .
reference Translation: so as you can see here , this is a , <unk> <unk> <unk> board .

predicted Translation: this is the <unk> .
reference Translation: these blocks <unk> <unk> .

Starting training epoch 5/10
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.Size([



Attention decoder outputs shape: torch.Size([3000, 1, 400])
Attention decoder outputs shape: torch.Size([3000, 1, 400])
Model BLEU score: 14.39

 Sample Translations:
predicted Translation: there 's four <unk> , to each other , to each other , to each other , it 's <unk> <unk> , it 's a <unk> shape .
reference Translation: there are four <unk> <unk> that , each time this ring <unk> it , as it <unk> the <unk> of the display , it <unk> up a position signal .

predicted Translation: but this is really just beginning .
reference Translation: but this is really just the beginning .

predicted Translation: the <unk> is made the <unk> .
reference Translation: it <unk> this by <unk> <unk> about two <unk> .

predicted Translation: and you can see here is the <unk> <unk> <unk> .
reference Translation: so as you can see here , this is a , <unk> <unk> <unk> board .

predicted Translation: this is a <unk> <unk> .
reference Translation: these blocks <unk> <unk> .

Starting training epoch 6/10
applyi



Model BLEU score: 14.10

 Sample Translations:
predicted Translation: there 's four different areas of <unk> , which gives it back to you , it 's <unk> when it was <unk> .
reference Translation: there are four <unk> <unk> that , each time this ring <unk> it , as it <unk> the <unk> of the display , it <unk> up a position signal .

predicted Translation: but this is actually the beginning .
reference Translation: but this is really just the beginning .

predicted Translation: it 's <unk> by <unk> <unk> .
reference Translation: it <unk> this by <unk> <unk> about two <unk> .

predicted Translation: and you can see here is the <unk> <unk> on the <unk> .
reference Translation: so as you can see here , this is a , <unk> <unk> <unk> board .

predicted Translation: this is <unk> .
reference Translation: these blocks <unk> <unk> .

Starting training epoch 7/10
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outputs shape: torch.



Attention decoder outputs shape: torch.Size([3000, 1, 400])
Attention decoder outputs shape: torch.Size([3000, 1, 400])
Model BLEU score: 13.86

 Sample Translations:
predicted Translation: there are four , to every <unk> , and it 's <unk> when it <unk> up behind the location , it 's <unk> .
reference Translation: there are four <unk> <unk> that , each time this ring <unk> it , as it <unk> the <unk> of the display , it <unk> up a position signal .

predicted Translation: but it 's actually just <unk> .
reference Translation: but this is really just the beginning .

predicted Translation: <unk> <unk> <unk> <unk> .
reference Translation: it <unk> this by <unk> <unk> about two <unk> .

predicted Translation: and you can see here is the <unk> <unk> .
reference Translation: so as you can see here , this is a , <unk> <unk> <unk> board .

predicted Translation: this one needs to express <unk> .
reference Translation: these blocks <unk> <unk> .

Starting training epoch 8/10
applying attention




Model BLEU score: 13.88

 Sample Translations:
predicted Translation: there 's four , to <unk> , to <unk> when it <unk> when it <unk> when it <unk> up in the <unk> shape .
reference Translation: there are four <unk> <unk> that , each time this ring <unk> it , as it <unk> the <unk> of the display , it <unk> up a position signal .

predicted Translation: but this really starts the beginning of the <unk> .
reference Translation: but this is really just the beginning .

predicted Translation: the <unk> has been <unk> the axis of <unk> .
reference Translation: it <unk> this by <unk> <unk> about two <unk> .

predicted Translation: and you can see that this is the <unk> <unk> .
reference Translation: so as you can see here , this is a , <unk> <unk> <unk> board .

predicted Translation: this particular expression of led to <unk> .
reference Translation: these blocks <unk> <unk> .

Starting training epoch 9/10
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applyi



Model BLEU score: 13.68

 Sample Translations:
predicted Translation: there 's four <unk> to remote <unk> when it 's <unk> when it was <unk> , it 's a <unk> display on the mind .
reference Translation: there are four <unk> <unk> that , each time this ring <unk> it , as it <unk> the <unk> of the display , it <unk> up a position signal .

predicted Translation: but it 's just a really <unk> thing .
reference Translation: but this is really just the beginning .

predicted Translation: the <unk> is <unk> the axis of <unk> .
reference Translation: it <unk> this by <unk> <unk> about two <unk> .

predicted Translation: and you can see here is the <unk> <unk> <unk> .
reference Translation: so as you can see here , this is a , <unk> <unk> <unk> board .

predicted Translation: this can be <unk> .
reference Translation: these blocks <unk> <unk> .

Starting training epoch 10/10
applying attention
Attention decoder outputs shape: torch.Size([100, 31, 400])
applying attention
Attention decoder outpu



Model BLEU score: 13.72

 Sample Translations:
predicted Translation: there 's four , which <unk> to every <unk> , it 's a <unk> that goes back when it <unk> on shape .
reference Translation: there are four <unk> <unk> that , each time this ring <unk> it , as it <unk> the <unk> of the display , it <unk> up a position signal .

predicted Translation: but it 's really really <unk> .
reference Translation: but this is really just the beginning .

predicted Translation: it 's evidence that <unk> <unk> .
reference Translation: it <unk> this by <unk> <unk> about two <unk> .

predicted Translation: and you can see here is the <unk> board that reward .
reference Translation: so as you can see here , this is a , <unk> <unk> <unk> board .

predicted Translation: this is <unk> .
reference Translation: these blocks <unk> <unk> .

Training finished!
Time used for training: 0 m 40 s
Evaluating on test set:
Attention decoder outputs shape: torch.Size([3000, 1, 400])
Attention decoder outputs shape: t



Model BLEU score: 14.00

 Sample Translations:
predicted Translation: in the first few of the <unk> first <unk> comes from the first business comes from the way .
reference Translation: the second quote is from the head of the u.k. financial services <unk> .

predicted Translation: so , it 's more <unk> .
reference Translation: it gets worse .

predicted Translation: what 's happening in here ? why are you ? "
reference Translation: what 's happening here ? how can this be possible ?

predicted Translation: unfortunately , unfortunately , the answer is yes .
reference Translation: unfortunately , the answer is yes .

predicted Translation: but in fact , there 's a very interesting solution from the age of doing science .
reference Translation: but there 's an <unk> solution which is coming from what is known as the science of <unk> .

