# Bonus work - RNN for machine translation

As mentioned in the previous script, we can implement several changes and see how they affect performance.

- Increase the model capacity.
- Add dropout between the stacked recurrent layers (it only has an effect when the recurrent unit uses more than 1 layer).
- Use GRU instead of LSTM: most of the implementation is the same, except you don't need to handle the extra "cell" state.
- Monitor training with validation: after each epoch, compute the loss on the validation set and save the model with the best performance (= lowest loss) on this set.

We will then compare the performance of the LSTM and GRU models trained with these strategies.
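As a quick illustration of the dropout and GRU points above, here is a minimal sketch on random data. The toy sizes are arbitrary assumptions; only `nn.LSTM`, `nn.GRU` and their `dropout` argument come from PyTorch itself.

```python
import torch
import torch.nn as nn

# Toy sizes, chosen only for illustration
seq_len, batch, emb_dim, hid_dim, layers = 5, 3, 8, 16, 2
x = torch.randn(seq_len, batch, emb_dim)

# `dropout` is applied to the outputs of every layer except the last one,
# which is why it requires more than 1 layer to have any effect
toy_lstm = nn.LSTM(emb_dim, hid_dim, num_layers=layers, dropout=0.5)
toy_gru = nn.GRU(emb_dim, hid_dim, num_layers=layers, dropout=0.5)

# LSTM returns the outputs plus a (hidden, cell) tuple
out_lstm, (hidden_lstm, cell_lstm) = toy_lstm(x)

# GRU returns the outputs plus a single hidden state: no cell state to handle
out_gru, hidden_gru = toy_gru(x)

print(out_lstm.shape, hidden_lstm.shape, cell_lstm.shape)
print(out_gru.shape, hidden_gru.shape)
```

The outputs and hidden states have the same shapes for both units, so the rest of a seq2seq model can usually stay unchanged apart from dropping the cell state.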

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from matplotlib import pyplot as plt
import random
import math
import time

# We'll be using torchtext and spaCy to do most of the pre-processing
# (the torchtext.legacy module is only available in torchtext 0.9 to 0.11)
import spacy
from torchtext.legacy.datasets import Multi30k
from torchtext.legacy.data import Field, BucketIterator

# Set a random seed for reproducibility
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

## Pre-processing

The pre-processing is unchanged, so we simply copy the code from the previous script (the only difference is that we no longer take a subset of the data).

In [None]:
# German and English specific pipelines
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

# Tokenizers
def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

# Fields
SRC = Field(tokenize=tokenize_de, init_token='<sos>', eos_token='<eos>', lower=True)
TRG = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>', lower=True)

# Dataset
train_data, valid_data, test_data = Multi30k.splits(root='data/', exts = ('.de', '.en'), fields = (SRC, TRG))

# Vocabulary
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

# Dataloader (here we keep the validation dataloader)
batch_size = 128
train_dataloader, valid_dataloader, test_dataloader = BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size = batch_size)

In [None]:
# Index to string functions
def itos_list_de(tensor_indx):
    return [SRC.vocab.itos[tensor_indx[i]] for i in range(len(tensor_indx))]

def itos_list_en(tensor_indx):
    return [TRG.vocab.itos[tensor_indx[i]] for i in range(len(tensor_indx))]

## LSTM model with dropout

**TO DO**: write the LSTM encoder, decoder, and full seq2seq model, using dropout within the recurrent units.
Check the [doc](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM) to see how it's done (99% of the code can be copied from the previous script).
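To give an idea of where the dropout rate goes, here is one possible sketch of the encoder. The constructor arguments follow the order used when the model is instantiated in this notebook; the attribute names and the body of `forward` are assumptions, since the previous script's code is not reproduced here.

```python
import torch
import torch.nn as nn

class LSTMEncoder(nn.Module):
    """Hypothetical encoder sketch: embedding followed by a multi-layer LSTM."""

    def __init__(self, input_dim, embedding_dim, hidden_dim, n_layers, dropout_rate):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        # The dropout rate is simply passed to nn.LSTM
        self.lstm = nn.LSTM(embedding_dim, hidden_dim,
                            num_layers=n_layers, dropout=dropout_rate)

    def forward(self, src):
        # src: [src_len, batch_size]
        embedded = self.embedding(src)           # [src_len, batch_size, embedding_dim]
        _, (hidden, cell) = self.lstm(embedded)  # each: [n_layers, batch_size, hidden_dim]
        return hidden, cell

# Usage on random token indices (toy sizes)
toy_encoder = LSTMEncoder(input_dim=100, embedding_dim=8, hidden_dim=16,
                          n_layers=2, dropout_rate=0.5)
toy_src = torch.randint(0, 100, (7, 4))
toy_hidden, toy_cell = toy_encoder(toy_src)
print(toy_hidden.shape, toy_cell.shape)
```

The decoder can be adapted in the same way, by forwarding `dropout_rate` to its own `nn.LSTM`.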

In [None]:
# Parameters
input_dim = len(SRC.vocab)
output_dim = len(TRG.vocab)
embedding_dim_enc = 256
embedding_dim_dec = 256
hidden_dim = 512 # we use the same hidden_dim for the encoder and the decoder
n_layers = 2
dropout_rate = 0.5

# Instantiate the LSTM model
lstm_encoder = LSTMEncoder(input_dim, embedding_dim_enc, hidden_dim, n_layers, dropout_rate)
lstm_decoder = LSTMDecoder(output_dim, embedding_dim_dec, hidden_dim, n_layers, dropout_rate)
lstm_model = LSTMSeq2Seq(lstm_encoder, lstm_decoder)

## GRU model with dropout

**TO DO**: write the encoder, decoder, and full seq2seq model using GRU (with dropout between the recurrent layers) instead of LSTM. Again, most of the previous script's code can be reused.
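As a hint, here is a possible sketch of a GRU decoder that processes one token per call. The layer names and the one-step `forward` interface are assumptions (the previous script may organise the decoder differently); the point is only that a single `hidden` tensor replaces the `(hidden, cell)` pair.

```python
import torch
import torch.nn as nn

class GRUDecoder(nn.Module):
    """Hypothetical one-step decoder sketch: embedding, multi-layer GRU, linear output."""

    def __init__(self, output_dim, embedding_dim, hidden_dim, n_layers, dropout_rate):
        super().__init__()
        self.embedding = nn.Embedding(output_dim, embedding_dim)
        self.gru = nn.GRU(embedding_dim, hidden_dim,
                          num_layers=n_layers, dropout=dropout_rate)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, input_idx, hidden):
        # input_idx: [batch_size], hidden: [n_layers, batch_size, hidden_dim]
        embedded = self.embedding(input_idx.unsqueeze(0))  # [1, batch_size, embedding_dim]
        output, hidden = self.gru(embedded, hidden)        # no cell state to pass around
        pred = self.fc(output.squeeze(0))                  # [batch_size, output_dim]
        return pred, hidden

# Usage on random inputs (toy sizes)
toy_decoder = GRUDecoder(output_dim=50, embedding_dim=8, hidden_dim=16,
                         n_layers=2, dropout_rate=0.5)
toy_tokens = torch.randint(0, 50, (4,))
toy_state = torch.zeros(2, 4, 16)
toy_pred, toy_new_state = toy_decoder(toy_tokens, toy_state)
print(toy_pred.shape, toy_new_state.shape)
```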

In [None]:
# Instantiate the GRU model
gru_encoder = GRUEncoder(input_dim, embedding_dim_enc, hidden_dim, n_layers, dropout_rate)
gru_decoder = GRUDecoder(output_dim, embedding_dim_dec, hidden_dim, n_layers, dropout_rate)
gru_model = GRUSeq2Seq(gru_encoder, gru_decoder)

## Training with validation

In [None]:
# The evaluation function follows the previous script; torch.no_grad() is used
# so that no computation graph is built during evaluation
def evaluate_seq2seq(model, eval_dataloader, loss_fn, device='cpu', verbose=True):

    model.eval()
    model.to(device)
    loss_eval = 0

    with torch.no_grad():
        for batch in eval_dataloader:

            # Get the source and target sentences and copy them to device
            src, trg = batch.src.to(device), batch.trg.to(device)
            trg_len = trg.shape[0]

            # Apply the model
            pred_probas = model(src, trg_len)

            # Remove the first token (always <sos>) to compute the loss
            output_dim = pred_probas.shape[-1]
            pred_probas = pred_probas[1:]

            # Flatten the predictions to 2D and the targets to 1D, as expected by the loss
            pred_probas = pred_probas.view(-1, output_dim)
            trg = trg[1:].view(-1)

            # Compute the loss
            loss = loss_fn(pred_probas, trg)

            # Accumulate the loss (summed over batches, not averaged)
            loss_eval += loss.item()

    return loss_eval

**TO DO**: write the training function that monitors the loss using the validation set.
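One way to structure such a function is sketched below. It assumes that the model is called as `model(src, trg_len)`, as in `evaluate_seq2seq`, and that batches expose `.src` and `.trg`; the names `train_with_validation` and `batch_loss` are hypothetical. The best weights are kept in memory here, and they could just as well be written to disk with `torch.save`.

```python
import copy
import torch
import torch.nn as nn

def batch_loss(model, batch, loss_fn, device):
    # Same loss computation as in the evaluation function: drop <sos>, then flatten
    src, trg = batch.src.to(device), batch.trg.to(device)
    pred_probas = model(src, trg.shape[0])
    output_dim = pred_probas.shape[-1]
    return loss_fn(pred_probas[1:].reshape(-1, output_dim), trg[1:].reshape(-1))

def train_with_validation(model, train_dataloader, valid_dataloader, loss_fn,
                          optimizer, num_epochs=10, device='cpu'):
    model.to(device)
    train_losses, valid_losses = [], []
    best_valid_loss, best_state = float('inf'), None

    for epoch in range(num_epochs):
        # Training pass (dropout active)
        model.train()
        loss_train = 0.0
        for batch in train_dataloader:
            optimizer.zero_grad()
            loss = batch_loss(model, batch, loss_fn, device)
            loss.backward()
            optimizer.step()
            loss_train += loss.item()
        train_losses.append(loss_train)

        # Validation pass (dropout disabled, no gradients)
        model.eval()
        loss_valid = 0.0
        with torch.no_grad():
            for batch in valid_dataloader:
                loss_valid += batch_loss(model, batch, loss_fn, device).item()
        valid_losses.append(loss_valid)

        # Keep a copy of the weights with the lowest validation loss so far
        if loss_valid < best_valid_loss:
            best_valid_loss = loss_valid
            best_state = copy.deepcopy(model.state_dict())

        print(f'Epoch {epoch + 1}/{num_epochs} | train loss: {loss_train:.3f} '
              f'| valid loss: {loss_valid:.3f}')

    return train_losses, valid_losses, best_state
```

After training, `model.load_state_dict(best_state)` restores the weights that performed best on the validation set before evaluating on the test set.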

In [None]:
# Training parameters
num_epochs = 10
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
loss_fn = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)
optimizer_lstm = optim.Adam(lstm_model.parameters())
optimizer_gru = optim.Adam(gru_model.parameters())

# TO DO: train the LSTM and GRU models


## Evaluation

Now that the models are trained, we can compare them.

**TO DO**: plot the training and validation losses, report the number of parameters, and print the loss on the test set for both models.
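For the parameter count, a small helper such as the one below can be used. The name `count_parameters` is an assumption, and the two toy recurrent layers only illustrate the comparison; in the notebook the helper would be applied to `lstm_model` and `gru_model`.

```python
import torch.nn as nn

def count_parameters(model):
    # Total number of trainable parameters in the model
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Toy comparison using the embedding and hidden sizes from this notebook
toy_lstm_layer = nn.LSTM(256, 512, num_layers=2)
toy_gru_layer = nn.GRU(256, 512, num_layers=2)

n_lstm = count_parameters(toy_lstm_layer)
n_gru = count_parameters(toy_gru_layer)
print(f'LSTM: {n_lstm:,} parameters | GRU: {n_gru:,} parameters')
```

An LSTM layer has four sets of gate weights against three for a GRU, so for identical sizes the recurrent part of the GRU model has three quarters as many parameters.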