<div class="alert alert-block alert-info">
<b>Number of points for this notebook:</b> 4
<br>
<b>Deadline:</b> March 30, 2020 (Monday). 23:00
</div>

# Exercise 5. Sequence-to-sequence modeling with recurrent neural networks

The goals of this exercise are
* to get familiar with recurrent neural networks used for sequential data processing
* to get familiar with the sequence-to-sequence model for machine translation
* to learn PyTorch tools for batch processing of sequences with varying lengths
* to learn how to write a custom `DataLoader`

You may find it useful to look at this tutorial:
* [Translation with a Sequence to Sequence Network and Attention](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)

In [1]:
skip_training = True  # Set this flag to True before validation and submission

In [2]:
# During evaluation, this cell sets skip_training to True
# skip_training = True

In [3]:
import os
import random
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

import tools
import tests


In [4]:
# When running on your own computer, you can specify the data directory by:
# data_dir = tools.select_data_dir('/your/local/data/directory')
data_dir = tools.select_data_dir()

The data directory is /coursedata


In [5]:
# Select the device for training (use GPU if you have one)
#device = torch.device('cuda:0')
device = torch.device('cpu')

In [6]:
if skip_training:
    # The models are always evaluated on CPU
    device = torch.device("cpu")

## Data

The dataset that we are going to use consists of pairs of sentences in French and English.

In [7]:
from data import TranslationDataset, MAX_LENGTH, SOS_token, EOS_token

trainset = TranslationDataset(data_dir, train=True)

* `TranslationDataset` supports indexing as required by `torch.utils.data.Dataset`.
* Sentences are tensors of maximum length `MAX_LENGTH`.
* Words in a (sentence) tensor are represented as an index (integer) in a language vocabulary.
* The string representation of a word from the source language can be obtained from index `i` with `dataset.input_lang.index2word[i]`.
* Similarly for the target language `dataset.output_lang.index2word[j]`.

Let us look at samples from that dataset.

In [8]:
src_sentence, tgt_sentence = trainset[np.random.choice(len(trainset))]
print('Source sentence: "%s"' % ' '.join(trainset.input_lang.index2word[i.item()] for i in src_sentence))
print('Sentence as tensor of word indices:')
print(src_sentence)

print('Target sentence: "%s"' % ' '.join(trainset.output_lang.index2word[i.item()] for i in tgt_sentence))
print('Sentence as tensor of word indices:')
print(tgt_sentence)

Source sentence: "tu n es plus un enfant . EOS"
Sentence as tensor of word indices:
tensor([ 211,  246,  212,  152,   66, 1176,    5,    1])
Target sentence: "you are not a child any more . EOS"
Sentence as tensor of word indices:
tensor([ 130,  125,  148,   42,  645, 1249, 1241,    4,    1])


In [9]:
print('Number of source-target pairs in the training set: ', len(trainset))

Number of source-target pairs in the training set:  8682


## Sequence-to-sequence model for machine translation

In this exercise, we are going to build a machine translation system which transforms a sentence in one language into a sentence in another one. The computational graph of the translation model is shown below:

<img src="seq2seq.png" width=900>

We are going to use a simplified model without the dotted connections.

## Custom DataLoader

We would like to train the sequence-to-sequence model using mini-batch training.
One difficulty of mini-batch training in this case is that sequences may have varying lengths and this has to be taken into account when building the computational graph. Luckily, PyTorch has tools to support batch processing of such sequences.
To use those tools, we need to write a custom data loader which puts sequences of varying lengths in the same tensor. We can customize the data loader by providing a custom `collate_fn` as explained [here](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader).

Our collate function:
- combines sequences from the source language in a single tensor with extra values (at the end) filled with `padding_value=0`.
- combines sequences from the target language in a single tensor with extra values (at the end) filled with `padding_value=0`.

**Important**:
- Later in the code (not in this `collate` function), we will convert source sequences to objects of class [`PackedSequence`](https://pytorch.org/docs/stable/nn.html?highlight=packedsequence#torch.nn.utils.rnn.PackedSequence), which can be processed by recurrent units such as `GRU` or `LSTM`. By default, `pack_padded_sequence` expects the sequences to be sorted by length in decreasing order.
**Therefore, the returned source sequences should be sorted by length in a decreasing order.**
- The target sequences are not sorted by their own lengths: they must follow the order of the source sequences so that each source-target pair stays in the same column of the two tensors.
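
To make the padding and sorting requirements concrete, here is a minimal sketch on a toy mini-batch of two pairs. The tensors and variable names are made up for illustration; only `pad_sequence` comes from PyTorch.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Toy mini-batch of (source, target) pairs with different lengths.
pairs = [
    (torch.tensor([1, 2]), torch.tensor([3, 4, 5])),
    (torch.tensor([6, 7, 8]), torch.tensor([9, 10])),
]

# Sort whole pairs by source length so that source and target stay aligned.
pairs = sorted(pairs, key=lambda pair: len(pair[0]), reverse=True)
src_list = [src for src, _ in pairs]
tgt_list = [tgt for _, tgt in pairs]
lengths = [len(src) for src in src_list]

# With the default batch_first=False, each sequence becomes one column.
src_padded = pad_sequence(src_list, padding_value=0)
tgt_padded = pad_sequence(tgt_list, padding_value=0)

print(lengths)     # [3, 2]
print(src_padded)  # shape (3, 2); the shorter source is padded with 0 at the end
print(tgt_padded)  # shape (3, 2); targets follow the source order
```

Note that after sorting, the longest target may end up in any column, which is fine because only the source sequences are packed later.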

Your task is to implement the collate function.

In [10]:
padding_value = 0

In [11]:
from torch.nn.utils.rnn import pad_sequence

def collate(list_of_samples):
    """Merges a list of samples to form a mini-batch.

    Args:
      list_of_samples is a list of tuples (src_seq, tgt_seq):
          src_seq is of shape (src_seq_length,)
          tgt_seq is of shape (tgt_seq_length,)

    Returns:
      src_seqs of shape (max_src_seq_length, batch_size): Tensor of padded source sequences.
          The sequences should be sorted by length in a decreasing order, that is src_seqs[:,0] should be
          the longest sequence, and src_seqs[:,-1] should be the shortest.
      src_seq_lengths: List of lengths of source sequences.
      tgt_seqs of shape (max_tgt_seq_length, batch_size): Tensor of padded target sequences.
    """
    # YOUR CODE HERE
    # Sort whole (src, tgt) pairs by source length (longest first) so both tensors keep the same order
    list_of_samples = sorted(list_of_samples, key=lambda x: len(x[0]), reverse=True)

    src_seqs = [s[0] for s in list_of_samples]
    src_seq_lengths = [len(s) for s in src_seqs]
    src_seqs = pad_sequence(src_seqs, padding_value=padding_value)

    tgt_seqs = [s[1] for s in list_of_samples]
    tgt_seqs = pad_sequence(tgt_seqs, padding_value=padding_value)
    return (src_seqs, src_seq_lengths, tgt_seqs)

In [12]:
def test_collate_shapes():
    pairs = [
        (torch.LongTensor([1, 2]), torch.LongTensor([3, 4, 5])),
        (torch.LongTensor([6, 7, 8]), torch.LongTensor([9, 10])),
    ]
    pad_src_seqs, src_seq_lengths, pad_tgt_seqs = collate(pairs)
    assert pad_src_seqs.shape == torch.Size([3, 2]), f"Bad pad_src_seqs.shape: {pad_src_seqs.shape}"
    assert pad_src_seqs.dtype == torch.long
    assert pad_tgt_seqs.shape == torch.Size([3, 2]), f"Bad pad_tgt_seqs.shape: {pad_tgt_seqs.shape}"
    assert pad_tgt_seqs.dtype == torch.long
    print('Success')

test_collate_shapes()

Success


In [13]:
# This cell tests collate() function

In [14]:
# We create custom DataLoader using the implemented collate function
# We are going to process 64 sequences at the same time (batch_size=64)
from torch.utils.data import DataLoader
trainloader = DataLoader(dataset=trainset, batch_size=64, shuffle=True, collate_fn=collate, pin_memory=True)

## Encoder

The encoder encodes a source sequence $(x_1, x_2, ..., x_T)$ into a single vector $h_T$ using the following recursion:
$$
  h_{t} = f(h_{t-1}, x_t) \qquad t = 1, \ldots, T
$$
where:
* initial state $h_0$ is often chosen arbitrarily (we choose it to be zero)
* function $f$ is defined by the type of the RNN cell (in our experiments, we will use [GRU](https://pytorch.org/docs/stable/nn.html#torch.nn.GRU))
* $x_t$ is a vector that represents the $t$-th word in the source sentence.
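
The recursion above can be written out step by step with `nn.GRUCell`. The sketch below is only an illustration with made-up sizes: it copies the weights of an `nn.GRU` into a cell and checks that unrolling the recursion by hand gives the same final state $h_T$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, embed_size, hidden_size = 5, 4, 3  # illustrative sizes

gru = nn.GRU(input_size=embed_size, hidden_size=hidden_size)
cell = nn.GRUCell(input_size=embed_size, hidden_size=hidden_size)

# Give the single cell the same parameters as the one-layer GRU.
with torch.no_grad():
    cell.weight_ih.copy_(gru.weight_ih_l0)
    cell.weight_hh.copy_(gru.weight_hh_l0)
    cell.bias_ih.copy_(gru.bias_ih_l0)
    cell.bias_hh.copy_(gru.bias_hh_l0)

x = torch.randn(T, 1, embed_size)  # one sequence of T word vectors
h = torch.zeros(1, hidden_size)    # h_0 = 0

# h_t = f(h_{t-1}, x_t) for t = 1, ..., T
for t in range(T):
    h = cell(x[t], h)

# nn.GRU runs the same recursion internally.
_, h_T = gru(x, torch.zeros(1, 1, hidden_size))
print(torch.allclose(h, h_T[0], atol=1e-5))
```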

A common practice in natural language processing is to _learn_ the word representations $x_t$ (instead of, for example, using one-hot coded vectors). In PyTorch, this is supported by class [Embedding](https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding) which we are going to use.
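
As a small illustration (with a made-up vocabulary size and embedding size), an `nn.Embedding` lookup gives the same result as multiplying one-hot vectors by the embedding matrix, but without ever building the one-hot vectors:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embed_size = 5, 3  # illustrative sizes
embedding = nn.Embedding(vocab_size, embed_size)

# Padded word indices of shape (max_seq_length, batch_size)
pad_seqs = torch.tensor([[1, 2], [2, 3], [3, 0], [4, 0]])

embedded = embedding(pad_seqs)  # (max_seq_length, batch_size, embed_size)

# Equivalent computation with explicit one-hot vectors
one_hot = F.one_hot(pad_seqs, num_classes=vocab_size).float()
via_matmul = one_hot @ embedding.weight

print(embedded.shape)
print(torch.allclose(embedded, via_matmul))
```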

The computational graph of the encoder is shown below:

<img src="seq2seq_encoder.png" width=500>

Your task is to implement the `forward` function of the encoder. It should contain the following steps:
* Embed the words of the source sequences.
* Pack source sequences using [`pack_padded_sequence`](https://pytorch.org/docs/stable/nn.html?highlight=pack_padded_sequence#torch.nn.utils.rnn.pack_padded_sequence). This converts padded source sequences into an object that can be processed by PyTorch recurrent units such as `nn.GRU` or `nn.LSTM`.
* Apply GRU computations to packed sequences obtained in the previous step
* Convert packed sequence of GRU outputs into padded representation with [`pad_packed_sequence`](https://pytorch.org/docs/stable/nn.html?highlight=pad_packed_sequence#torch.nn.utils.rnn.pad_packed_sequence).
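
The effect of packing can be checked on a toy batch. The sketch below uses made-up sizes and is not the graded solution; it shows that the padded positions of the shorter sequence come back as zeros and that the returned hidden state is the state at the last *valid* step of each sequence.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
embed_size, hidden_size = 4, 3  # illustrative sizes
embedding = nn.Embedding(5, embed_size)
gru = nn.GRU(embed_size, hidden_size)

# Two sequences of lengths 4 and 2, padded with 0 and sorted by decreasing length
pad_seqs = torch.tensor([[1, 2], [2, 3], [3, 0], [4, 0]])
seq_lengths = [4, 2]

packed = pack_padded_sequence(embedding(pad_seqs), seq_lengths)
packed_out, hidden = gru(packed, torch.zeros(1, 2, hidden_size))
outputs, out_lengths = pad_packed_sequence(packed_out)

print(outputs.shape)   # (max_seq_length, batch_size, hidden_size)
print(outputs[2:, 1])  # steps beyond the length of the second sequence are zeros
# The final state of each sequence equals its output at the last valid step
print(torch.allclose(hidden[0, 1], outputs[1, 1]))
```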

In [15]:
class Encoder(nn.Module):
    def __init__(self, src_dictionary_size, embed_size, hidden_size):
        """
        Args:
          src_dictionary_size: The number of words in the source dictionary.
          embed_size: The number of dimensions in the word embeddings.
          hidden_size: The number of features in the hidden state of GRU.
        """
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(src_dictionary_size, embed_size)
        self.gru = nn.GRU(input_size=embed_size, hidden_size=hidden_size)

    def forward(self, pad_seqs, seq_lengths, hidden):
        """
        Args:
          pad_seqs of shape (max_seq_length, batch_size): Padded source sequences.
          seq_lengths: List of sequence lengths.
          hidden of shape (1, batch_size, hidden_size): Initial states of the GRU.

        Returns:
          outputs of shape (max_seq_length, batch_size, hidden_size): Padded outputs of GRU at every step.
          hidden of shape (1, batch_size, hidden_size): Updated states of the GRU.
        """
        # YOUR CODE HERE
        embedded = self.embedding(pad_seqs)  # (max_seq_length, batch_size, embed_size)
        packed = pack_padded_sequence(embedded, seq_lengths)
        packed_outputs, hidden = self.gru(packed, hidden)
        outputs, _ = pad_packed_sequence(packed_outputs)  # back to a padded tensor
        return outputs, hidden

    def init_hidden(self, batch_size=1):
        return torch.zeros(1, batch_size, self.hidden_size)

In [17]:
def test_Encoder_shapes():
    hidden_size = 3
    encoder = Encoder(src_dictionary_size=5, embed_size=10, hidden_size=hidden_size)

    max_seq_length = 4
    batch_size = 2
    hidden = encoder.init_hidden(batch_size=batch_size)
    pad_seqs = torch.tensor([
        [1, 2],
        [2, 3],
        [3, 0],
        [4, 0]
    ])

    outputs, new_hidden = encoder.forward(pad_seqs=pad_seqs, seq_lengths=[4, 2], hidden=hidden)
    assert outputs.shape == torch.Size([4, batch_size, hidden_size]), f"Bad outputs.shape: {outputs.shape}"
    assert new_hidden.shape == torch.Size([1, batch_size, hidden_size]), f"Bad new_hidden.shape: {new_hidden.shape}"
    print('Success')

test_Encoder_shapes()

Success


In [18]:
tests.test_Encoder(Encoder)

outputs[:, 0, :]:
 tensor([[ 0.0000, -0.0150],
        [ 0.0004, -0.0221],
        [ 0.0007, -0.0055],
        [ 0.0005,  0.0323]])
expected:
 tensor([[ 0.0000, -0.0150],
        [ 0.0004, -0.0221],
        [ 0.0007, -0.0055],
        [ 0.0005,  0.0323]])
outputs[:2, 1, :]:
 tensor([[ 0.0000, -0.0150],
        [ 0.0004, -0.0021]])
expected:
 tensor([[ 0.0000, -0.0150],
        [ 0.0004, -0.0021]])
new_hidden:
 tensor([[[ 0.0005,  0.0323],
         [ 0.0004, -0.0021]]])
expected:
 tensor([[[ 0.0005,  0.0323],
         [ 0.0004, -0.0021]]])
Success


## Decoder

The decoder takes as input the representation computed by the encoder and transforms it into a sentence in the target language. The computational graph of the decoder is shown below:

<img src="seq2seq_decoder.png" width=500 align="top">

* $z_0$ is the output of the encoder, that is $z_0 = h_5$, thus `hidden_size` of the decoder should be the same as `hidden_size` of the encoder.
* $y_{i}$ are the log-probabilities of the words in the target language, the dimensionality of $y_{i}$ is the size of the target dictionary.
* $z_{i}$ is mapped to $y_{i}$ using a linear layer `self.out` followed by `F.log_softmax` (because we use `nn.NLLLoss` loss for training).
* Each cell of the decoder is a GRU. It receives as inputs the previous state $z_{i-1}$ and the ReLU of the **embedding** of the previous word. Thus, you need to embed the words of the target language as well. When the decoder runs on its own predictions, the previous word is taken as the word with the maximum log-probability.
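
The mapping from a decoder state $z_i$ to log-probabilities $y_i$ and then to the next input word can be sketched as follows. Sizes are made up; the point is that the softmax is taken over the last (vocabulary) axis, so that exponentiating $y_i$ gives probabilities that sum to one for every sequence in the batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, hidden_size, tgt_dictionary_size = 2, 3, 5  # illustrative sizes
out = nn.Linear(hidden_size, tgt_dictionary_size)

z = torch.randn(1, batch_size, hidden_size)  # GRU output for one decoding step
y = F.log_softmax(out(z), dim=2)             # (1, batch_size, tgt_dictionary_size)

# Greedy choice of the next input word: index with the maximum log-probability
next_word = y.argmax(dim=2)                  # (1, batch_size)

print(y.exp().sum(dim=2))  # each entry is 1 up to rounding
print(next_word.shape)
```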

Note that the decoder outputs a word at every step and the same word is used as the input to the recurrent unit at the next step. At the beginning of decoding, the previous word input is fed with a special word SOS which stands for "start of a sentence". During training, we know the target sentence for decoding, therefore we can feed the correct words $y_i$ as inputs to the recurrent unit.

One more detail deserves attention. When the target sentence is fed to the decoder during training, the decoder learns to generate only the next word (this scenario is called "teacher forcing"). At test time, the decoder works differently: it generates the whole sequence using its own predictions as inputs at each step. Therefore, it makes sense to also train the decoder to produce full sentences. To do that, we will alternate between two modes during training:
* "teacher forcing": the decoder is fed with the words in the target sequence
* no "teacher forcing": the decoder generates the output sequence using its own predictions. We will limit the maximum length of generated sequences to `MAX_LENGTH`.
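
With teacher forcing, the input at each step is the previous target word, so the decoder inputs are simply the target sequence shifted by one step with SOS in front. A minimal sketch, assuming `SOS_TOKEN = 0` and made-up word indices (the actual value should be taken from `SOS_token` in `data`):

```python
import torch

SOS_TOKEN = 0  # assumption for this sketch; the notebook imports SOS_token from data

# Padded targets of shape (max_out_seq_length, batch_size); 1 = EOS, 0 = padding
pad_tgt_seqs = torch.tensor([[11, 21], [12, 22], [13, 1], [1, 0]])
batch_size = pad_tgt_seqs.size(1)

# Create the SOS row on the same device as the targets
sos_row = torch.full((1, batch_size), SOS_TOKEN, dtype=torch.long,
                     device=pad_tgt_seqs.device)

# Input at step i is the target word at step i-1 (SOS for the first step)
decoder_inputs = torch.cat([sos_row, pad_tgt_seqs[:-1]], dim=0)
print(decoder_inputs)
```

Without teacher forcing, only the SOS row is known in advance and every later input is the decoder's own previous prediction.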

You need to implement the decoder which has the structure shown in the figure above.

In [19]:
class Decoder(nn.Module):
    def __init__(self, tgt_dictionary_size, embed_size, hidden_size):
        """
        Args:
          tgt_dictionary_size: The number of words in the target dictionary.
          embed_size: The number of dimensions in the word embeddings.
          hidden_size: The number of features in the hidden state.
        """
        super(Decoder, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(tgt_dictionary_size, embed_size)
        self.gru = nn.GRU(input_size=embed_size, hidden_size=hidden_size)
        self.out = nn.Linear(hidden_size, tgt_dictionary_size)
        self.softmax = nn.LogSoftmax(dim=2)  # dim=2 is the vocabulary axis of the (1, batch_size, tgt_dictionary_size) output

    def forward(self, hidden, pad_tgt_seqs=None, teacher_forcing=False):
        """
        Args:
          hidden of shape (1, batch_size, hidden_size): States of the GRU.
          pad_tgt_seqs of shape (max_out_seq_length, batch_size): Tensor of words (word indices) of the
              target sentence. If None, the output sequence is generated by feeding the decoder's outputs
              (teacher_forcing has to be False).
          teacher_forcing (bool): Whether to use teacher forcing or not.

        Returns:
          outputs of shape (max_out_seq_length, batch_size, tgt_dictionary_size): Tensor of log-probabilities
              of words in the target language.
          hidden of shape (1, batch_size, hidden_size): New states of the GRU.

        Note: Do not forget to transfer tensors that you may want to create in this function to the device
        specified by `hidden.device`.
        """
        if pad_tgt_seqs is None:
            assert not teacher_forcing, 'Cannot use teacher forcing without a target sequence.'

        # YOUR CODE HERE
        batch_size = hidden.shape[1]
        out_size = MAX_LENGTH if pad_tgt_seqs is None else len(pad_tgt_seqs)

        # The first input is always SOS; create it on the same device as the hidden state
        word = torch.full((1, batch_size), SOS_token, dtype=torch.long, device=hidden.device)

        final_output = []
        for i in range(out_size):
            embedded = F.relu(self.embedding(word))
            output, hidden = self.gru(embedded, hidden)
            output = self.softmax(self.out(output))  # (1, batch_size, tgt_dictionary_size)
            final_output.append(output[0])
            if teacher_forcing:
                # Next input is the correct target word
                word = pad_tgt_seqs[i].view(1, batch_size)
            else:
                # Next input is the decoder's own prediction; argmax carries no gradient,
                # so detach() does not change training
                word = torch.argmax(output, dim=2).detach()
        return torch.stack(final_output), hidden

In [20]:
def test_Decoder_shapes():
    hidden_size = 2
    tgt_dictionary_size = 5
    test_decoder = Decoder(tgt_dictionary_size, embed_size=10, hidden_size=hidden_size)

    max_seq_length = 4
    batch_size = 2
    pad_tgt_seqs = torch.tensor([
        [1, 2],
        [2, 3],
        [3, 0],
        [4, 0]
    ])  # [max_seq_length, batch_size]

    hidden = torch.zeros(1, batch_size, hidden_size)
    outputs, new_hidden = test_decoder.forward(hidden, pad_tgt_seqs, teacher_forcing=False)

    assert outputs.size(0) <= 4, f"Too long output sequence: outputs.size(0)={outputs.size(0)}"
    assert outputs.shape[1:] == torch.Size([batch_size, tgt_dictionary_size]), \
        f"Bad outputs.shape[1:]={outputs.shape[1:]}"
    assert new_hidden.shape == torch.Size([1, batch_size, hidden_size]), f"Bad new_hidden.shape={new_hidden.shape}"

    outputs, new_hidden = test_decoder.forward(hidden, pad_tgt_seqs, teacher_forcing=True)
    assert outputs.shape == torch.Size([4, batch_size, tgt_dictionary_size]), \
        f"Bad shape outputs.shape={outputs.shape}"
    assert new_hidden.shape == torch.Size([1, batch_size, hidden_size]), f"Bad new_hidden.shape={new_hidden.shape}"

    # Generation mode
    outputs, new_hidden = test_decoder.forward(hidden, None, teacher_forcing=False)
    assert outputs.shape[1:] == torch.Size([batch_size, tgt_dictionary_size]), \
        f"Bad outputs.shape[1:]={outputs.shape[1:]}"
    assert new_hidden.shape == torch.Size([1, batch_size, hidden_size]), f"Bad new_hidden.shape={new_hidden.shape}"

    print('Success')

test_Decoder_shapes()

Success


In [21]:
tests.test_Decoder_no_forcing(Decoder)
tests.test_Decoder_with_forcing(Decoder)
tests.test_Decoder_generation(Decoder)

outputs[:, 0, :]:
 tensor([[-1.1366, -2.1924, -1.4361, -1.9640, -1.6645],
        [-1.3540, -1.8630, -1.5249, -1.7793, -1.6085],
        [-1.4899, -1.7024, -1.5838, -1.6901, -1.5962],
        [-1.5665, -1.6246, -1.6166, -1.6457, -1.5956]])
expected:
 tensor([[-1.1366, -2.1924, -1.4361, -1.9640, -1.6645],
        [-1.3540, -1.8630, -1.5249, -1.7793, -1.6085],
        [-1.4899, -1.7024, -1.5838, -1.6901, -1.5962],
        [-1.5665, -1.6246, -1.6166, -1.6457, -1.5956]])
outputs[:, 1, :]:
 tensor([[-1.1366, -2.1924, -1.4361, -1.9640, -1.6645],
        [-1.3540, -1.8630, -1.5249, -1.7793, -1.6085],
        [-1.4899, -1.7024, -1.5838, -1.6901, -1.5962],
        [-1.5665, -1.6246, -1.6166, -1.6457, -1.5956]])
expected:
 tensor([[-1.1366, -2.1924, -1.4361, -1.9640, -1.6645],
        [-1.3540, -1.8630, -1.5249, -1.7793, -1.6085],
        [-1.4899, -1.7024, -1.5838, -1.6901, -1.5962],
        [-1.5665, -1.6246, -1.6166, -1.6457, -1.5956]])
new_hidden:
 tensor([[[0.1003, 0.0421],
         [0.1003

## Training of sequence-to-sequence model using mini-batches

Now we are going to train the sequence-to-sequence model on the toy translation dataset.

In [76]:
# Create the seq2seq model
hidden_size = embed_size = 256
encoder = Encoder(trainset.input_lang.n_words, embed_size, hidden_size).to(device)
decoder = Decoder(trainset.output_lang.n_words, embed_size, hidden_size).to(device)

In [77]:
teacher_forcing_ratio = 0.5

Implement the training loop in the cell below. In the training loop, we first encode source sequences using the encoder, then we decode the encoded state using the decoder. The decoder outputs log-probabilities of words in the target language. We need to use these log-probabilities and the indexes of the words in the target sequences to compute the loss.

Recommended hyperparameters:
- Encoder optimizer: Adam with learning rate 0.001
- Decoder optimizer: Adam with learning rate 0.001
- Number of epochs: 30
- Toggle `teacher_forcing` on and off (for each mini-batch) according to the `teacher_forcing_ratio` specified above.
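
One simple way to toggle teacher forcing per mini-batch is to compare a uniform random number with the ratio. The sketch below only simulates the decisions; the seeded generator `rng` is an assumption made to keep the example reproducible.

```python
import random

teacher_forcing_ratio = 0.5
rng = random.Random(0)  # seeded only to make this sketch reproducible

# One decision per mini-batch: True means "use teacher forcing for this batch"
decisions = [rng.random() < teacher_forcing_ratio for _ in range(10000)]
fraction = sum(decisions) / len(decisions)
print(f"teacher forcing used in {fraction:.1%} of mini-batches")
```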

Hints:
- Training should proceed relatively fast.
- If you do well, the training loss should reach 0.1 in 30 epochs.
- **Important:** When computing the loss, you need to ignore the padded values. This can easily be done by using argument `ignore_index` of function [`nll_loss`](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.nll_loss).
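
The effect of `ignore_index` can be verified on a toy example. In this sketch (random log-probabilities, made-up sizes), the loss computed over all time steps at once equals the mean negative log-probability over the non-padded target positions only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
max_len, batch_size, vocab = 3, 2, 5  # illustrative sizes
log_probs = F.log_softmax(torch.randn(max_len, batch_size, vocab), dim=2)
tgt = torch.tensor([[2, 3], [4, 1], [1, 0]])  # 0 marks a padded position

# Flatten time and batch dimensions; padded targets do not contribute to the mean
loss = F.nll_loss(log_probs.reshape(-1, vocab), tgt.reshape(-1), ignore_index=0)

# Manual check: average -log p(target) over non-padded positions
picked = log_probs.gather(2, tgt.unsqueeze(2)).squeeze(2)
manual = -picked[tgt != 0].mean()

print(loss.item(), manual.item())
```

Averaging per-step losses instead (one `nll_loss` call per time step) is also possible, but it weights every time step equally regardless of how many sequences are still active at that step.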

In [78]:
if not skip_training:
    # YOUR CODE HERE
    num_epoch = 30
    loss_list = []  # per-batch losses, accumulated over all epochs

    encoder_optim = torch.optim.Adam(lr=0.001, params=encoder.parameters())
    decoder_optim = torch.optim.Adam(lr=0.001, params=decoder.parameters())
    for epoch in range(num_epoch):
        encoder.train()
        decoder.train()
        teacher_forcing_total = 0
        for (i, (src, src_lengths, tgt)) in enumerate(trainloader):
            # Move the batch to the training device (a no-op on CPU)
            src, tgt = src.to(device), tgt.to(device)
            batch_size = src.shape[1]
            encoder_optim.zero_grad()
            decoder_optim.zero_grad()
            # Use teacher forcing for this mini-batch with probability teacher_forcing_ratio
            teacher_forcing = random.choices([True, False], weights=[teacher_forcing_ratio, 1-teacher_forcing_ratio])[0]
            hidden = encoder.init_hidden(batch_size=batch_size).to(device)
            _, hidden = encoder(pad_seqs=src, seq_lengths=src_lengths, hidden=hidden)
            output, hidden = decoder(hidden, tgt, teacher_forcing=teacher_forcing)
            teacher_forcing_total += teacher_forcing

            # output: (max_tgt_seq_length, batch_size, tgt_dictionary_size)
            # tgt:    (max_tgt_seq_length, batch_size)
            # Average the per-step losses, ignoring padded positions
            loss = 0
            for j in range(output.size(0)):
                loss += F.nll_loss(output[j], tgt[j], ignore_index=padding_value)
            loss = loss / output.size(0)

            loss.backward()
            encoder_optim.step()
            decoder_optim.step()

            loss_list.append(loss.item())

        # Note: np.mean(loss_list) is the running mean over all batches seen so far (not only this epoch),
        # and i is the index of the last batch (number of batches minus one)
        print("epoch: {}, epoch_loss: {}, teacher_forcing: {}/{}".format(epoch, np.mean(loss_list), teacher_forcing_total, i))

epoch: 0, epoch_loss: 3.1308087899404415, teacher_forcing: 63/135
epoch: 1, epoch_loss: 2.7061173354878143, teacher_forcing: 62/135
epoch: 2, epoch_loss: 2.4627521081882366, teacher_forcing: 80/135
epoch: 3, epoch_loss: 2.299330343218411, teacher_forcing: 66/135
epoch: 4, epoch_loss: 2.1724358202779994, teacher_forcing: 55/135
epoch: 5, epoch_loss: 2.0584437248169207, teacher_forcing: 60/135
epoch: 6, epoch_loss: 1.9557547794044519, teacher_forcing: 60/135
epoch: 7, epoch_loss: 1.8573458658202606, teacher_forcing: 78/135
epoch: 8, epoch_loss: 1.7678282952873536, teacher_forcing: 70/135
epoch: 9, epoch_loss: 1.684002214056604, teacher_forcing: 75/135
epoch: 10, epoch_loss: 1.6051703218549969, teacher_forcing: 77/135
epoch: 11, epoch_loss: 1.5323480585151736, teacher_forcing: 68/135
epoch: 12, epoch_loss: 1.4636182326368348, teacher_forcing: 69/135
epoch: 13, epoch_loss: 1.3998402154088772, teacher_forcing: 65/135
epoch: 14, epoch_loss: 1.3381696978474364, teacher_forcing: 78/135
epoch: 

In [68]:
# test_index = 2
# print('SRC:', ' '.join(trainset.input_lang.index2word[i.item()] for i in src[:,test_index]))
# print('TGT:', ' '.join(trainset.output_lang.index2word[i.item()] for i in tgt[:,test_index]))
# out_sentence = translate(encoder, decoder, src[:,test_index])
# print('OUT:', ' '.join(trainset.output_lang.index2word[i.item()] for i in out_sentence))

SRC: je fais attention a ne pas trop depenser . EOS
TGT: i m careful not to spend too much . EOS
OUT: i m not sorry to spend too much . EOS


In [81]:
# Save the model to disk (the pth-files will be submitted automatically together with your notebook)
if not skip_training:
    tools.save_model(encoder, '5_encoder.pth')
    tools.save_model(decoder, '5_decoder.pth')
else:
    hidden_size = embed_size = 256
    encoder = Encoder(trainset.input_lang.n_words, embed_size, hidden_size)
    tools.load_model(encoder, '5_encoder.pth', device)
    
    decoder = Decoder(trainset.output_lang.n_words, embed_size, hidden_size)
    tools.load_model(decoder, '5_decoder.pth', device)

Do you want to save the model (type yes to confirm)? yes
Model saved to 5_encoder.pth.
Do you want to save the model (type yes to confirm)? yes
Model saved to 5_decoder.pth.


In [70]:
# This cell tests training accuracy

## Evaluation

Next we need to implement a function that converts a source sequence to an output sequence using the trained sequence-to-sequence model.
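
A decoder that always runs for a fixed number of steps in generation mode keeps emitting tokens after the first EOS, so it can be convenient to trim the output before printing. The helper below is a hypothetical post-processing step, not part of the exercise template; `EOS_TOKEN = 1` matches the index printed for "EOS" in the sample sentences above, but the value should be taken from `EOS_token` in `data`.

```python
import torch

EOS_TOKEN = 1  # assumption for this sketch; the notebook imports EOS_token from data


def truncate_at_eos(out_seq, eos_token=EOS_TOKEN):
    """Hypothetical helper: keep tokens up to and including the first EOS."""
    eos_positions = (out_seq == eos_token).nonzero(as_tuple=True)[0]
    if eos_positions.numel() == 0:
        return out_seq  # no EOS generated, return the sequence unchanged
    return out_seq[: eos_positions[0].item() + 1]


out_seq = torch.tensor([130, 125, 42, 645, 4, 1, 1, 1, 1, 1])
trimmed = truncate_at_eos(out_seq)
print(trimmed)  # tensor([130, 125,  42, 645,   4,   1])
```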

In [71]:
def translate(encoder, decoder, src_seq):
    """Translate given sentence src_seq using trained encoder and decoder.
    
    Args:
      encoder (Encoder): Trained encoder.
      decoder (Decoder): Trained decoder.
      src_seq of shape (src_seq_length,): LongTensor of word indices of the source sequence.
    
    Returns:
      out_seq of shape (out_seq_length,): LongTensor of word indices of the output sequence.
    """
    # YOUR CODE HERE
    encoder.eval()
    decoder.eval()
    hidden = encoder.init_hidden(batch_size=1)
    _, hidden = encoder.forward(pad_seqs=src_seq.view(-1,1), seq_lengths=[len(src_seq)], hidden=hidden)
    output, hidden = decoder(hidden, None, teacher_forcing=False)
    return torch.argmax(output, dim=2)[:,0]
    #raise NotImplementedError()

In [72]:
def test_translate_shapes():
    src_seq = torch.tensor([1, 2, 3, 4]).to(device)
    out_seq = translate(encoder, decoder, src_seq)
    assert out_seq.shape[0] <= MAX_LENGTH, \
        f"Too long output sequence: out_seq.shape[0]={out_seq.shape[0]}"
    print('Success')

test_translate_shapes()

Success


Let us now translate random sentences from the training set and print the source, target, and produced output.

If you trained the model well enough, it should have memorized most of the training data.

In [79]:
# Translate random sentences from the training set
print('Translate training data:')
print('-----------------------------')
for i in range(5):
    src_sentence, tgt_sentence = trainset[np.random.choice(len(trainset))]
    print('SRC:', ' '.join(trainset.input_lang.index2word[i.item()] for i in src_sentence))
    print('TGT:', ' '.join(trainset.output_lang.index2word[i.item()] for i in tgt_sentence))
    out_sentence = translate(encoder, decoder, src_sentence)
    print('OUT:', ' '.join(trainset.output_lang.index2word[i.item()] for i in out_sentence))
    print('')

Translate training data:
-----------------------------
SRC: j eprouve du ressentiment . EOS
TGT: i m resentful . EOS
OUT: i m resentful . EOS EOS EOS EOS EOS EOS

SRC: elle a une peur bleue des chiens . EOS
TGT: she s very afraid of dogs . EOS
OUT: she s very afraid of dogs . EOS EOS EOS

SRC: ils vont dans cette direction . EOS
TGT: they re headed this way . EOS
OUT: they re headed this way . EOS EOS EOS EOS

SRC: vous etes une etudiante . EOS
TGT: you are a student . EOS
OUT: you are a student . EOS EOS EOS EOS EOS

SRC: il est attire par les negresses . EOS
TGT: he s attracted to black women . EOS
OUT: he s attracted to black women . EOS EOS EOS



Now we translate random sentences from the test set. A well-trained model should output sentences that look similar to the target ones. Mistakes are usually made on words that were rare in the training set.

In [74]:
testset = TranslationDataset(data_dir, train=False)

In [80]:
print('Translate test data:')
print('-----------------------------')
for i in range(5):
    src_sentence, tgt_sentence = testset[np.random.choice(len(testset))]
    print('SRC:', ' '.join(testset.input_lang.index2word[i.item()] for i in src_sentence))
    print('TGT:', ' '.join(testset.output_lang.index2word[i.item()] for i in tgt_sentence))
    out_sentence = translate(encoder, decoder, src_sentence)
    print('OUT:', ' '.join(testset.output_lang.index2word[i.item()] for i in out_sentence))
    print('')

Translate test data:
-----------------------------
SRC: je suis desolee de vous avoir mal compris . EOS
TGT: i m sorry i misunderstood you . EOS
OUT: i m sorry i yelled at you . EOS EOS

SRC: elle n est pas d humeur . EOS
TGT: she s not in the mood . EOS
OUT: she s not even . EOS EOS EOS EOS EOS

SRC: il est dehors en train de se promener . EOS
TGT: he s out taking a walk . EOS
OUT: he s out of time . EOS EOS EOS EOS

SRC: je suis vannee . EOS
TGT: i m exhausted . EOS
OUT: i am exhausted . EOS EOS EOS EOS EOS EOS

SRC: tu es paresseux . EOS
TGT: you re lazy . EOS
OUT: you re lazy . EOS EOS EOS EOS EOS EOS

