<a href="https://colab.research.google.com/github/vlamen/tue-deeplearning/blob/main/practicals/P3.3_seq2seq.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# P3.3 - Sequence to Sequence: Text Translation

In this practical we will develop a model for translation of sentences from German to English using the sequence to sequence architecture. 

### Learning outcomes
- Understand the basic concepts of a sequence to sequence (seq2seq) model.
- How to preprocess textual data.
- How to train a seq2seq model that parametrises the joint probability distribution $P(y_0, ..., y_k | x_0, ..., x_n)$ over the words $Y$ of the target sentence, conditioned on the words $X$ of the source sentence.
- How to translate sentences using $P(y_0, ..., y_k | x_0, ..., x_n)$.

**References**
* [1] *Ilya Sutskever, Oriol Vinyals, Quoc V. Le, "Sequence to Sequence Learning with Neural Networks"*, NIPS, 2014. https://arxiv.org/abs/1409.3215

### Download data

We train a translation model on the IWSLT2016 dataset that can be accessed through the `torchtext` library. The dataset was specifically designed for machine translation and evaluation tasks and contains translations from/to English to/from Arabic, Czech, French, German. We restrict ourselves to German-English translation, i.e. we download only the DE, EN language pairs. 

Note that we are in a similar setting as in P3.1_rnn_classification since we are downloading the dataset from the torchtext library. This implies that we can preprocess the datasets in the same way as in the aforementioned practical session.

In [1]:
import torch
from torchtext import data     #pip install torchtext
from torchtext import datasets
from torch.utils.data import Subset, Dataset, IterableDataset
%pylab inline


# downloading dataset may take a while...
train_iter, val_iter, test_iter = datasets.IWSLT2016(split=('train', 'valid', 'test'), language_pair=('de', 'en'))


# we take a subset of the data to prevent memory issues in Colab 
N=10000 #increasing this may cause Colab crashes..
train_iter = Subset(train_iter, torch.arange(N)).dataset
train_iter.num_lines = N

print(f"Number of training sentences: {len(train_iter)}")
print(f"Number of validation sentences: {len(val_iter)}")
print(f"Number of test sentences: {len(test_iter)}\n\n")

for _ in range(3):
    de, en = next(test_iter)  # each item is a (German, English) pair, matching language_pair
    print("DE: " + de)
    print("EN: " + en + '\n')

Populating the interactive namespace from numpy and matplotlib
Number of training sentences: 10000
Number of validation sentences: 993
Number of test sentences: 1305


DE: Als ich in meinen 20ern war, hatte ich meine erste Psychotherapie-Patientin.

EN: When I was in my 20s, I saw my very first psychotherapy client.


DE: Ich war Doktorandin und studierte Klinische Psychologie in Berkeley.

EN: I was a Ph.D. student in clinical psychology at Berkeley.


DE: Sie war eine 26-jährige Frau namens Alex.

EN: She was a 26-year-old woman named Alex.




# Preprocessing textual input data

### Create vocabulary
As we have seen in practicals P1.2 and P3.2, word embeddings are useful for encoding words into vectors of real numbers. The first step is to build a custom vocabulary from the raw training dataset. To this end, we tokenize each sentence and then count the number of occurrences of each token (a word or punctuation mark) across all sentences using a `Counter`. Finally, we create the vocabulary from the token frequencies in the counter.

Note that each datapoint consists of a German and an English sentence, so we create separate tokenizers and vocabularies for the two languages. Furthermore, we add special tokens to both vocabularies: `<unk>` for unknown tokens, and `<start>` and `<stop>` as the first and last tokens of each sentence, respectively.
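
The counting step can be illustrated without `torchtext` or spaCy. The sketch below is a minimal stand-in, assuming a plain whitespace tokenizer and a hypothetical helper `build_vocab`. It is not the `torchtext` `Vocab` implementation. It only shows the same idea: reserve the first indices for the special tokens and order the remaining tokens by frequency.

```python
from collections import Counter

SPECIALS = ['<unk>', '<start>', '<stop>']

def build_vocab(sentences, min_freq=1, specials=SPECIALS):
    """Map each token to an integer index (hypothetical helper, not torchtext)."""
    counter = Counter()
    for sentence in sentences:
        # Assumption: whitespace tokenization instead of the spaCy tokenizer.
        counter.update(sentence.split())
    # Special tokens take the first indices.
    stoi = {tok: idx for idx, tok in enumerate(specials)}
    # Remaining tokens: descending frequency, ties broken alphabetically.
    for tok, freq in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in stoi:
            stoi[tok] = len(stoi)
    return stoi

toy_vocab = build_vocab(["das ist gut", "das ist ein Test"])
```

Here `toy_vocab` maps the three special tokens to indices 0 to 2 and the two most frequent words, `das` and `ist`, to indices 3 and 4.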

In [2]:
# uncomment to install prerequisite modules
# !pip install spacy
# !python -m spacy download en
# !python -m spacy download de

In [3]:
from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab
from tqdm.notebook import tqdm

de_tokenizer = get_tokenizer('spacy', language='de')
en_tokenizer = get_tokenizer('spacy', language='en')

de_counter, en_counter = Counter(), Counter()

for (de, en) in tqdm(train_iter):
    de_counter.update(de_tokenizer(de))
    en_counter.update(en_tokenizer(en))

de_vocab = Vocab(de_counter, min_freq=1, specials=['<unk>', '<start>', '<stop>'])
en_vocab = Vocab(en_counter, min_freq=1, specials=['<unk>', '<start>', '<stop>'])

print(f"Unique tokens in source (de) vocabulary: {len(de_vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(en_vocab)}")




Unique tokens in source (de) vocabulary: 20053
Unique tokens in target (en) vocabulary: 13226


### Create pipelines 

The authors of the paper we are implementing [1] find it beneficial to reverse the order of the input, which they believe "introduces many short term dependencies in the data that make the optimization problem much easier". We adopt this approach and reverse the German sentence after it has been transformed into a list of tokens.

**Exercise**

Complete the pipeline functions that preprocess German and English sentences, respectively. The German sentences should first be tokenized and then reversed. Then, for both German and English sentences, your code should add start and stop tokens to each sentence at the appropriate positions.

In [4]:
def de_pipeline(text):
    """
    Tokenizes a German sentence from a string into a list of strings (tokens), reverses the token order,
    and converts each token to its corresponding index. Furthermore, it adds start and stop tokens at the
    appropriate positions.
    """
    ### Your code here ###
    
    return word_idcs

def en_pipeline(text):
    """
    Tokenizes English sentence from a string into a list of strings (tokens), then converts each token
    to corresponding indices. Furthermore, it adds start and stop tokens at the appropriate positions
    """
    ### Your code here ###
    
    return word_idcs

The pipelines allow us to convert a string sentence into integers:

    en_pipeline('Here is an example!')
    >>> [1, 316, 14, 53, 241, 283, 2]
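
The order of operations in the pipelines can be sketched on a toy vocabulary. Everything here is an assumption made for illustration: the dictionary `toy_stoi`, the helper `to_indices`, and the whitespace tokenization. The sketch also assumes that the reversal is applied before the special tokens are added, so that `<start>` stays at the front of the source sentence.

```python
# Toy vocabulary (assumption): indices 0-2 are the special tokens, as in the notebook.
toy_stoi = {'<unk>': 0, '<start>': 1, '<stop>': 2, 'das': 3, 'ist': 4, 'gut': 5}

def to_indices(tokens, stoi, reverse=False):
    """Wrap tokens in <start>/<stop> and map them to indices; unknown tokens map to <unk>."""
    if reverse:
        tokens = tokens[::-1]  # reverse the source sentence only
    tokens = ['<start>'] + tokens + ['<stop>']
    return [stoi.get(tok, stoi['<unk>']) for tok in tokens]

source = to_indices("das ist gut".split(), toy_stoi, reverse=True)
target = to_indices("das ist neu".split(), toy_stoi)

print(source)  # [1, 5, 4, 3, 2]
print(target)  # [1, 3, 4, 0, 2]
```

The word `neu` is not in the toy vocabulary, so it is mapped to the index of `<unk>`.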

### Create DataLoaders

**Exercise**

Use the pipelines from the previous exercise to create a `collate_batch` function that produces batches of source and target sentences. The `collate_batch` function will be passed to the `DataLoader`, which enables iterating over the dataset in batches. In each iteration, a batch of source sentences (German) and target sentences (English) should be returned. Encode the tokens of the sentences as indices by using the vocabulary. Finally, your code should pad all sequences so that two tensors can be created: one containing the input sentences, and another one containing the target sentences. Pad the sequences with the appropriate special token.
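
The padding step itself is independent of PyTorch and can be sketched on plain lists. The helper `pad_batch` below is hypothetical, and the choice of index 2 (`<stop>`) as the padding value is an assumption. In the notebook, `pad_sequence` from `torch.nn.utils.rnn` with `batch_first=True` and a `padding_value` performs the same operation on a list of tensors.

```python
def pad_batch(sequences, pad_idx):
    """Right-pad every index list to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]

# Assumption: index 2 (<stop>) doubles as the padding value.
batch = [[1, 5, 4, 3, 2], [1, 3, 2]]
padded = pad_batch(batch, pad_idx=2)

print(padded)  # [[1, 5, 4, 3, 2], [1, 3, 2, 2, 2]]
```

After padding, all rows have equal length, so the batch can be stored in a single 2D tensor.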

In [5]:
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

# check if gpu is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    """
    Concatenate multiple datapoints to obtain a single batch of data
    """
    
    ### Your code here ###
    
    # return source (DE) and target sequences (EN) after transferring them to GPU (if available)
    return de_padded.to(device), en_padded.to(device)


# Building the Seq2Seq translation model

In the implementation we define three objects: the encoder, the decoder and a full translation model that encapsulates the encoder and decoder. The given code also proposes the main hyperparameters that your implementation should use. Feel free to change the values of these parameters!

The referenced paper uses a 4-layer LSTM, but in the interest of training time we reduce this to 2 layers. The concept of multi-layer RNNs extends naturally from 2 to 4 layers.

In [6]:
BATCH_SIZE = 64
EPOCHS = 50
DROPOUT = 0.5
N_LAYERS = 2 #paper uses 4

EMB_DIM = 256  #dimension of the word embedding
HIDDEN_DIM = 512 #dimension of the lstm's hidden state

## Encoder

The encoder takes as input a batch of German sentences. We have already converted all sentences into a padded 2D matrix (shape `(batch_size, max_seq_len)`) containing the token indices that make up the sequences.

**Exercise**:
Complete the `Encoder` class. In `__init__(self)` you should declare the appropriate layers. The encoder has to return a compact representation of the input sequence.
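
The shapes of the LSTM states can be checked on random input before writing the class. The sketch below is not the exercise solution. It uses toy sizes chosen for a quick check (assumptions, not the notebook's hyperparameters), omits dropout, and assumes `batch_first=True` so that the input layout matches `(batch_size, max_seq_len)`.

```python
import torch
import torch.nn as nn

# Toy sizes (assumptions for a quick check, not the notebook's hyperparameters).
vocab_size, emb_dim, hid_dim, n_layers = 20, 8, 16, 2
batch_size, seq_len = 4, 7

embedding = nn.Embedding(vocab_size, emb_dim)
# batch_first=True makes the LSTM accept input of shape [batch, seq_len, emb_dim].
lstm = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers, batch_first=True)

padded_word_idcs = torch.randint(0, vocab_size, (batch_size, seq_len))
embedded = embedding(padded_word_idcs)      # [batch, seq_len, emb_dim]
outputs, (hidden, cell) = lstm(embedded)    # outputs: [batch, seq_len, hid_dim]

print(hidden.shape)  # torch.Size([2, 4, 16]) = [n_layers, batch, hid_dim]
print(cell.shape)    # torch.Size([2, 4, 16])
```

Note that the final `hidden` and `cell` states have shape `[n_layers, batch_size, hid_dim]` regardless of the `batch_first` setting. Their size does not depend on the sequence length, which is what makes them a compact representation of the sentence.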

In [8]:
import torch.nn as nn

class Encoder(nn.Module):
    
    def __init__(self, source_vocab=de_vocab, emb_dim=EMB_DIM, hid_dim=HIDDEN_DIM, dropout=DROPOUT, n_layers=N_LAYERS):
        super().__init__()
        
        self.source_vocab = source_vocab
        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.dropout = dropout
        self.n_layers = n_layers
        
        ### Your code here ###
    
        
    def forward(self, padded_word_idcs):
        
        """
        Forward pass of encoder model. It aims at
        transforming the input sentence to a dense vector 
        
        Input:
        padded_word_idcs shape:  (batch_size, max_seq_len_in_batch)

        Output:
        a dense vector
        which contains all sentence information
        """
        
        ### Your code here ###
        
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
       
        return hidden, cell

## Decoder

**Exercise**

The next step is to implement the decoder. The Decoder class performs a single step of decoding, i.e. it outputs a single token per time-step. In the first decoding step ($t=1$), the decoder takes as input the dense representation produced by the encoder and the embedded first token, $y_1 = f$(`<start>`). With these inputs, it should update the cell and hidden state and then predict the first real word $\hat{y}_2$ (not the start token) of the target sentence. In all later decoding steps, the first layer receives the hidden and cell state from the previous time-step, $(h_{t-1}, c_{t-1})$, and feeds it through the LSTM together with the current embedded token, $y_t$ (i.e. the embedding of the token predicted at the end of the previous step), to produce a new hidden and cell state, $(h_t, c_t)$.

You should then pass the hidden state of the RNN, $h_t$, through a linear layer, $g$, to make a prediction of what the next token in the target (output) sequence should be, i.e. $\hat{y}_{t+1} = g(h_t)$. An example is provided in the diagram below.

![alt text](https://raw.githubusercontent.com/vlamen/tue-deeplearning/main/img/lstm_decoder.png "diagram")
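
A single decoding step can be traced on toy tensors to see how the shapes line up. This is a sketch under stated assumptions rather than the exercise solution: the sizes are arbitrary, dropout is omitted, zeros stand in for the encoder states, and the layer names `embedding`, `lstm` and `fc_out` are illustrative.

```python
import torch
import torch.nn as nn

# Toy sizes (assumptions, not the notebook's hyperparameters).
vocab_size, emb_dim, hid_dim, n_layers, batch_size = 20, 8, 16, 2, 4

embedding = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers, batch_first=True)
fc_out = nn.Linear(hid_dim, vocab_size)   # plays the role of the linear layer g

# State from the previous step (zeros stand in for the encoder output here).
hidden = torch.zeros(n_layers, batch_size, hid_dim)
cell = torch.zeros(n_layers, batch_size, hid_dim)

tokens = torch.ones(batch_size, dtype=torch.long)     # one token index per sentence
embedded = embedding(tokens.unsqueeze(1))              # [batch, 1, emb_dim]
output, (hidden, cell) = lstm(embedded, (hidden, cell))
prediction = fc_out(output.squeeze(1))                 # [batch, vocab_size]
next_tokens = prediction.argmax(dim=1)                 # [batch]
```

The `unsqueeze(1)` call adds a sequence dimension of length 1, because the LSTM expects a sequence even when only one time-step is processed.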

In [8]:
class Decoder(nn.Module):
    def __init__(self, target_vocab, emb_dim=EMB_DIM, hid_dim=HIDDEN_DIM, dropout=DROPOUT, n_layers=N_LAYERS,):
        super().__init__()
        
        self.target_vocab = target_vocab
        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.dropout = dropout
        self.n_layers = n_layers
        
        ### Your code here ###
        
        
    def forward(self, hidden, cell, padded_word_idcs):
        """
        Forward pass of the decoder model. It aims at transforming
        the dense representation of the encoder into a sentence in
        the target language
        
        Input:
        hidden shape: [n layers, batch size, hid dim]
        cell shape: [n layers * n directions, batch size, hid dim]
        padded_word_idcs shape: [batch size]
        
        Output:
        prediction shape: [batch size, num_words target_vocabulary]
        hidden shape: [n layers, batch size, hid dim]
        cell shape: [n layers * n directions, batch size, hid dim]
        """
        
        #### Your code here ###
        
        #prediction = [batch size, num_words target_vocabulary]
        return prediction, hidden, cell

## The seq2seq model

**Exercise**

The Seq2Seq model takes in an Encoder, Decoder, and a device (used to place tensors on the GPU, if it exists).
For this implementation, you have to ensure that the number of layers and the hidden (and cell) dimensions are equal in the Encoder and Decoder.

Start by declaring the optimizer and loss function of the model. The loss function should not penalize positions where the ground-truth token is the `<stop>` token. Use the `ignore_index` input argument of the loss function to realize this behavior.
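
The effect of ignoring an index can be seen in a small pure-Python version of the averaged cross-entropy. The sketch below is not PyTorch's implementation. `masked_cross_entropy` is a hypothetical helper that skips positions whose target equals the ignored index and averages the loss over the remaining positions.

```python
import math

def masked_cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    total, count = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == ignore_index:
            continue  # ignored positions contribute neither loss nor count
        # log-softmax via the log-sum-exp trick for numerical stability
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[target]
        count += 1
    # Assumption: return 0.0 when every position is ignored.
    return total / count if count else 0.0

logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
loss_all = masked_cross_entropy(logits, [0, 1, 2], ignore_index=-1)
loss_masked = masked_cross_entropy(logits, [0, 1, 2], ignore_index=2)
```

In `loss_masked` the third position has target index 2 and is skipped, so the average runs over the first two positions only.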


The forward method takes the source sentence, the target sentence and a teacher forcing ratio. The teacher forcing ratio is used when training our model. When decoding, at each time-step the decoder predicts the next token in the target sequence from the previously decoded tokens, $\hat{y}_{t+1}=g(h_t)$. With probability equal to the teacher forcing ratio (`teacher_forcing_ratio`) we use the actual ground-truth next token in the sequence as the input to the decoder during the next time-step. With probability 1 - `teacher_forcing_ratio`, however, your model should use the token that the LSTM predicted at the end of the previous step, even if it doesn't match the actual next token in the sequence. The function `random.random()` will be useful here; the `random` module is imported at the top of the cell below.
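
The random choice between the two inputs can be isolated in a few lines. The helper `choose_next_input` is hypothetical and works on arbitrary placeholder values. In the model, the same comparison would select between a ground-truth token tensor and the decoder's prediction.

```python
import random

def choose_next_input(ground_truth_token, predicted_token, teacher_forcing_ratio):
    """Return the token fed to the decoder at the next time-step."""
    # random.random() is uniform on [0, 1), so this is True with
    # probability teacher_forcing_ratio.
    use_teacher = random.random() < teacher_forcing_ratio
    return ground_truth_token if use_teacher else predicted_token

random.seed(0)
choices = [choose_next_input('truth', 'pred', 0.75) for _ in range(10000)]
fraction_teacher = choices.count('truth') / len(choices)
```

With a ratio of 0.75, `fraction_teacher` ends up close to 0.75. A ratio of 1.0 always returns the ground truth, and a ratio of 0.0 always returns the prediction.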

    

In [9]:
import torch.optim as optim
import random

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"
        
        ### Your code here ###
        
        self.optimizer = optim.Adam(self.parameters())

        stop_token_idx = self.decoder.target_vocab['<stop>']
        self.criterion = nn.CrossEntropyLoss(ignore_index = stop_token_idx)
        
    def forward(self, padded_src_sen, padded_trg_sen, teacher_forcing_ratio = 0.75):
        """
        Forward pass of the seq2seq model. It encodes the source sentence into
        a dense representation and thereafter transduces into the target
        sentence.
        
        Inputs:
        padded_src_sen: padded index representation of source sentences with shape [batch size, src len]
        padded_trg_sen:  padded index representation of target sentences with shape [batch size, trg len]
        teacher_forcing_ratio: probability to use teacher forcing, e.g. 0.75 we use ground-truth target sentence 75% of the time
        
        Outputs:
        outputs: unnormalized scores over the target vocabulary for each predicted token, with shape [batch_size, trg_len, trg_vocab_size]
        """
        
        batch_size = padded_src_sen.shape[0]
        trg_len = padded_trg_sen.shape[1]
        trg_vocab_size = len(self.decoder.target_vocab)
        
        ### Your code here ###
        
        return outputs

## Training

**Exercise** 

Write functions for training and evaluating your model. You should iterate over the dataset and update the weights of the networks with the computed loss value. Use accuracy as a metric and print the evaluation accuracy at the end of each epoch. 
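
Accuracy can be defined in several ways for translation. The sketch below assumes token-level accuracy, where padded positions are excluded from the count. The helper `token_accuracy` and the use of index 2 as the ignored index are assumptions for illustration. It returns raw counts so that they can be accumulated over batches and divided at the end of an epoch.

```python
def token_accuracy(predicted, target, ignore_idx):
    """Count correct token predictions, skipping target positions equal to ignore_idx."""
    correct, total = 0, 0
    for pred_sen, trg_sen in zip(predicted, target):
        for pred_tok, trg_tok in zip(pred_sen, trg_sen):
            if trg_tok == ignore_idx:
                continue
            correct += int(pred_tok == trg_tok)
            total += 1
    return correct, total

correct, total = token_accuracy([[3, 4, 2, 2], [5, 9, 7, 2]],
                                [[3, 4, 2, 2], [5, 6, 7, 2]],
                                ignore_idx=2)
print(correct, total)  # 4 5
```

Five target tokens differ from the ignored index, and four of them are predicted correctly, which gives an accuracy of 0.8.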

Next, you will need to call your `seq2seq` model and train it using the functions that you implemented. Finally, make a plot of the training and validation accuracy.

As the model needs extensive training, it could be useful to save the best model to your drive. In this way, you can do the next exercise at another time. Use the following code inside your training loop:
    

    if val_acc[-1] > best_valid_acc:
        best_valid_acc = val_acc[-1]
        torch.save(model.state_dict(), 'tut1-model.pt')
        
Don't forget to declare `best_valid_acc` at the top of the cell, e.g. with 

    best_valid_acc = float(0)
    
Finally, GPU memory usage may grow gradually during training and eventually trigger an out-of-memory error. Clear the cached GPU memory before running the forward pass with the `torch.cuda.empty_cache()` command.

In [10]:
import time

def train(dataset):
    
    ### Your code here ###        
    
    return total_acc/total_count


def evaluate(dataset):
    
    ### Your code here ###
    
    return total_acc/total_count

In [10]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# empty the GPU memory
torch.cuda.empty_cache()

best_valid_acc = float(0)

# initiate seq2seq translation model
enc = Encoder(de_vocab, EMB_DIM, HIDDEN_DIM, DROPOUT, N_LAYERS)
dec = Decoder(en_vocab, EMB_DIM, HIDDEN_DIM, DROPOUT, N_LAYERS)

seq2seq = Seq2Seq(enc, dec, device).to(device)



train_acc, val_acc = [], []
# training loop
for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    
    train_iter, val_iter = datasets.IWSLT2016(split=('train', 'valid'), language_pair=('de', 'en'))
    train_iter = Subset(train_iter, torch.arange(N)).dataset
    train_iter.num_lines = N
    
    train_acc.append(train(train_iter))

    val_acc.append(evaluate(val_iter))
    
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           val_acc[-1]))
    print('-' * 59)
    
    if val_acc[-1] > best_valid_acc:
        best_valid_acc = val_acc[-1]
        torch.save(seq2seq.state_dict(), 'tut1-model.pt')

In [9]:
### Make a plot with training/validation accuracy vs. epochs ###


# Inference

The trained model parametrizes the joint probability distribution $P(Y|X)$ of an English target sentence $Y$ that is a correct translation of the German source sentence $X$. Formally, we seek the sentence $Y$ which maximizes $P(Y|X)$, i.e. 

$$
Y = \underset{Y'}{\operatorname{argmax}} \; P(Y' \mid X). \quad (1)
$$

**Exercise** 

During inference with the seq2seq model you can make certain assumptions that affect your implementation choices. You can assume conditional independence of the targets, $P(Y|X)=P(y_{0:k}|X)=P(y_0|X)P(y_1|X) \cdots P(y_k|X)$. In this case you can implement a greedy decoder that computes the most likely output at each step without taking into account the outputs selected at previous steps. Alternatively, you can implement an autoregressive decoder that factorizes the joint probability of the output given the input as $P(Y|X)=P(y_{0:k}|X)=P(y_0|X)P(y_1|y_0, X) \cdots P(y_k|y_{0:k-1},X)$.
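
The difference between the two factorizations can be made concrete on a toy distribution. Everything in the sketch below is hypothetical: the probability tables `COND` and `MARGINALS` are hand-written stand-ins for the decoder's outputs, and both decoding functions are illustrative helpers rather than a prescribed implementation.

```python
# Hypothetical conditional distributions P(y_t | y_{<t}, X) for one fixed source sentence.
COND = {
    ('<start>',): {'many': 0.6, 'thank': 0.4},
    ('<start>', 'many'): {'thanks': 0.35 / 0.6, 'greetings': 0.25 / 0.6},
    ('<start>', 'thank'): {'you': 1.0},
}
# Per-position marginals P(y_t | X) implied by the table above.
MARGINALS = [
    {'many': 0.6, 'thank': 0.4},
    {'you': 0.4, 'thanks': 0.35, 'greetings': 0.25},
]

def next_token_probs(prefix):
    """Look up the next-token distribution; unseen prefixes end the sentence."""
    return COND.get(prefix, {'<stop>': 1.0})

def decode_autoregressive(probs_fn, max_len=10):
    """Greedy autoregressive decoding: each choice is conditioned on the chosen prefix."""
    prefix = ['<start>']
    while len(prefix) < max_len:
        probs = probs_fn(tuple(prefix))
        token = max(probs, key=probs.get)
        prefix.append(token)
        if token == '<stop>':
            break
    return prefix

def decode_independent(position_probs):
    """Pick the most likely token at each position, ignoring earlier choices."""
    return [max(probs, key=probs.get) for probs in position_probs]

autoregressive = decode_autoregressive(next_token_probs)
independent = decode_independent(MARGINALS)

print(autoregressive)  # ['<start>', 'many', 'thanks', '<stop>']
print(independent)     # ['many', 'you']
```

In this toy table, the independent decoder produces "many you", a sentence with zero probability under the joint distribution. The autoregressive decoder produces a valid sentence with probability 0.35, although the most probable sentence under the table is "thank you" with probability 0.4. Choosing the best token at each step therefore only approximates the maximization in (1).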
