# Recurrent networks with attention mechanism

In the previous lab, we implemented a seq2seq model based on the encoder-decoder structure. The information about the input (source) sentence was passed to the decoder through a single **context vector**, which was the final hidden state of the encoder. This "compresses" all the information from the input sentence into one vector, which is then sequentially decoded and updated in the hidden states of the decoder.

However, this might be limiting for the decoding part: at a given step of the decoding process, it might be preferable to have access to *all* the hidden states from the encoder rather than a single hidden state. This would let the decoder identify which parts of the input sentence are the most relevant for generating the current word in the output sentence.

<img src="https://blog.floydhub.com/content/images/2019/09/Slide36.JPG" width="700"/>
<center><a href="https://blog.floydhub.com/attention-mechanism/">Source</a></center>

This can be implemented using a mechanism called **attention**, which is the topic of this lab.


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random
import math
import time

from matplotlib import pyplot as plt
import matplotlib.ticker as ticker

# We'll be using torchtext and spacy to do most of the pre-processing
import spacy
from torchtext.legacy.datasets import Multi30k
from torchtext.legacy.data import Field, BucketIterator

# Set a random seed for reproducibility
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

## Pre-processing

The pre-processing is the same as in lab 4, so we re-use it.

In [None]:
# German and English specific pipelines
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

# Tokenizers
def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

# Fields
SRC = Field(tokenize=tokenize_de, init_token='<sos>', eos_token='<eos>', lower=True)
TRG = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>', lower=True)

# Dataset
train_data, valid_data, test_data = Multi30k.splits(root='data/', exts = ('.de', '.en'), fields = (SRC, TRG))

# Take a subset of the dataset (for speed)
train_data.examples = train_data.examples[:1000]
valid_data.examples = valid_data.examples[:100]
test_data.examples = test_data.examples[:100]

# Vocabulary
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

# Dataloader (here we keep the validation dataloader)
batch_size = 128
train_dataloader, valid_dataloader, test_dataloader = BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size = batch_size)

# Fetch one example
example_batch = next(iter(train_dataloader))
print(example_batch.src.shape)

In [None]:
# Index to string functions
def itos_list_de(tensor_indx):
    return [SRC.vocab.itos[tensor_indx[i]] for i in range(len(tensor_indx))]

def itos_list_en(tensor_indx):
    return [TRG.vocab.itos[tensor_indx[i]] for i in range(len(tensor_indx))]

In [None]:
# Define all the parameters of the network (small model for speed)
input_dim = len(SRC.vocab)
output_dim = len(TRG.vocab)
embedding_dim_enc = 32
embedding_dim_dec = 32
hidden_dim_enc = 50
hidden_dim_dec = 50
n_layers = 1
dropout_rate = 0.5

## GRU encoder

We use a similar encoder to the previous lab, based on GRU instead of LSTM for simplicity (no need to handle the cell state). The main difference is that it needs to output the whole sequence of hidden states (not only the final hidden state / context vector) because we will use it in the attention module.
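
As a minimal sketch of the two quantities a GRU returns (this is not the lab's `Encoder`; the toy sizes and the `_demo` variable names are assumptions chosen for illustration), `nn.GRU` gives back both the full sequence of outputs and the final hidden state:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration
src_len, batch, emb_dim, hid_dim = 7, 4, 32, 50

gru = nn.GRU(emb_dim, hid_dim, num_layers=1)
embedded = torch.randn(src_len, batch, emb_dim)  # stand-in for embedded source tokens

# First return value: outputs at every time step; second: final hidden state
enc_outputs_demo, enc_context_demo = gru(embedded)
print(enc_outputs_demo.shape)  # torch.Size([7, 4, 50])
print(enc_context_demo.shape)  # torch.Size([1, 4, 50])

# For a single-layer, unidirectional GRU, the last output is the final hidden state
same = torch.allclose(enc_outputs_demo[-1], enc_context_demo[0])
print(same)
```

The attention module needs the first return value (all outputs), while the second one plays the role of the context vector.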

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_dim, embedding_dim_enc, hidden_dim_enc, n_layers, dropout_rate):
        super().__init__()
        
        # TO DO: Store the parameters
        
        # TO DO: Create the layers
        
    def forward(self, src):
        
        # TO DO: Write the forward pass and return both outputs of the GRU (the set of all outputs and the context vector)
        
        return

In [None]:
# TO DO: Instantiate the encoder and print the number of parameters

# TO DO: Apply the encoder to the example batch and print the shapes of the outputs


## The attention mechanism

We now implement the attention mechanism. Intuitively, attention indicates which words of the source sentence matter most for generating the current word during decoding.

Mathematically, let $t'$ denote the current decoding step, $s_{t'-1}$ the previous hidden state of the decoder, and $H = \{h_1, h_2, ..., h_T\}$ the set of all encoder outputs at the last layer. The attention vector $a_{t'}$ is then computed from $s_{t'-1}$ and $H$ through a set of operations with learnable parameters.

### An example

For the very first target token, the previous hidden state is given by $s_0 = z$ (the context vector). The attention mechanism is illustrated below for this case, which we will use as a running example.

<img src="https://github.com/bentrevett/pytorch-seq2seq/raw/49df8404d938a6edbf729876405558cc2c2b3013//assets/seq2seq9.png" width="500"/>
<center><a href="https://github.com/bentrevett/pytorch-seq2seq/blob/master/3%20-%20Neural%20Machine%20Translation%20by%20Jointly%20Learning%20to%20Align%20and%20Translate.ipynb">Source</a></center>

In [None]:
# Take the context vector of the encoder and treat it as the first hidden state of the decoder
# (squeeze only the layer dimension, so a batch of size 1 keeps its batch dimension)
dec_hidden = enc_context.squeeze(0)

The first step is to assemble / combine the last hidden decoder state $s_{t'-1}$ and a given encoder output $h_t$. There are many ways to do so (addition, multiplication...), so we will consider here a concatenation $c_{t} = [ h_t, s_{t'-1}]$.

Instead of doing a loop over all $t$ in the source sentence, we repeat the hidden state $s_{t'-1}$ $T$ times, and simply concatenate these: $C = [ H, S]$.

**Note**: this is easy to implement when the recurrent part uses 1 layer; otherwise the computation would be more involved as we'd need to consider the extra dimension corresponding to the number of layers and duplicate $H$. We leave that to further exploration.
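
The repeat-and-concatenate step can be sketched on random tensors as follows. This is only an illustration under assumed toy sizes; `H`, `s_prev`, `S` and `C` are stand-ins named after the symbols in the text, not variables from the lab.

```python
import torch

torch.manual_seed(0)

# Toy sizes, chosen only for illustration
src_len, batch, hid_enc, hid_dec = 7, 4, 50, 50

H = torch.randn(src_len, batch, hid_enc)  # stand-in for the encoder outputs
s_prev = torch.randn(batch, hid_dec)      # stand-in for the previous decoder hidden state

# Add a leading dimension, then copy the hidden state once per source position
S = s_prev.unsqueeze(0).repeat(src_len, 1, 1)  # [src_len, batch, hid_dec]

# Concatenate along the feature dimension, then put the batch dimension first
C = torch.cat((H, S), dim=2)  # [src_len, batch, hid_enc + hid_dec]
C = C.permute(1, 0, 2)        # [batch, src_len, hid_enc + hid_dec]
print(C.shape)  # torch.Size([4, 7, 100])
```

Every source position of a given sentence ends up paired with the same copy of the decoder hidden state.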

In [None]:
# TO DO: compute this concatenation:
# - unsqueeze dec_hidden to add an extra dimension (over which to repeat)
# - repeat it using the 'repeat' function (check the doc!)
# - concatenate the features using the 'cat' function
# - permute the dimensions so the resulting combined input has shape [batch_size, src_length, hidden_dim_enc+hidden_dim_dec]


We then pass this combined input to a network $f$ to compute the *energy* of the corresponding tuple:

$$
e_t = f(c_t)
$$

In practice, we use a linear layer (with learnable parameters) with a tanh activation, and we set the dimensions such that $e_t$ is a vector of length `hidden_dim_dec`.
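
A minimal sketch of the energy computation, assuming toy sizes and a random tensor `combined` in place of the real combined input (the name `energy_layer` is also an assumption):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration
batch, src_len, hid_enc, hid_dec = 4, 7, 50, 50

# Stand-in for the combined input [batch, src_len, hid_enc + hid_dec]
combined = torch.randn(batch, src_len, hid_enc + hid_dec)

# Linear layer followed by tanh: one energy vector per (sentence, source position)
energy_layer = nn.Linear(hid_enc + hid_dec, hid_dec)
energy = torch.tanh(energy_layer(combined))
print(energy.shape)  # torch.Size([4, 7, 50])
```

`nn.Linear` acts on the last dimension only, so no loop over the source positions is needed.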

In [None]:
# TO DO: Implement energy calculation


For each batch and sequence element, the energy vector has length `hidden_dim_dec`. We want to reduce it to a single scalar value, which we do using a weighted sum (or equivalently, a dot product):

$$
\hat{a}_t = v \cdot e_t
$$

where $v$ contains these (learnable) weights. As a result we have a scalar value $\hat{a}_t$ which is (almost) the attention for word $t$. The last step is to apply a softmax over all source positions $t$, so that every entry is between $0$ and $1$ and the attention values over all words in a sentence sum to $1$. This yields the attention vector $a_t$.
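
As an illustrative sketch (toy sizes and a random `energy` tensor are assumptions, not the lab's data), the weighted sum followed by the softmax could look like this:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration
batch, src_len, hid_dec = 4, 7, 50

# Stand-in for the energy tensor (values in [-1, 1], as after a tanh)
energy = torch.tanh(torch.randn(batch, src_len, hid_dec))

# The dot product with v is a bias-free linear layer with a single output
v = nn.Linear(hid_dec, 1, bias=False)
scores = v(energy).squeeze(2)  # [batch, src_len]

# Softmax over the source positions, so each sentence's weights sum to 1
attn = torch.softmax(scores, dim=1)
print(attn.shape)       # torch.Size([4, 7])
print(attn.sum(dim=1))  # one value per sentence, each (numerically) equal to 1
```

The softmax is taken over the source dimension (`dim=1` here), not over the batch.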

In [None]:
# TO DO: Compute attention: the weighted sum is simply implemented using a linear layer with output size = 1 and no bias


### The attention module

Now, we can use what we did above on the example to write the full attention module in the general case.

In [None]:
class Attention(nn.Module):
    def __init__(self, hidden_dim_enc, hidden_dim_dec):
        super().__init__()
        
        # TO DO: define the energy layer and the weighted sum layer
        
    def forward(self, dec_hidden, enc_outputs):
        
        # TO DO: compute attention
        
        return attn

In [None]:
# TO DO: Instantiate an Attention module, and apply it to the encoder outputs


## Decoder with attention

At the decoding step, we can now use the attention vector by applying it to the encoder outputs. This results in the *weighted* vector, which is the sum of the encoder outputs weighted by attention:

$$
w = \sum_t a_t \times h_t
$$
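
The weighted sum can be sketched with `torch.bmm` on random tensors. The sizes and the names `attn`, `H`, `weighted` and `manual` are assumptions for illustration; the explicit sum at the end is only a sanity check for one sentence.

```python
import torch

torch.manual_seed(0)

# Toy sizes, chosen only for illustration
batch, src_len, hid_enc = 4, 7, 50

attn = torch.softmax(torch.randn(batch, src_len), dim=1)  # stand-in attention weights
H = torch.randn(src_len, batch, hid_enc)                  # stand-in encoder outputs

# Batched product: [batch, 1, src_len] x [batch, src_len, hid_enc] -> [batch, 1, hid_enc]
weighted = torch.bmm(attn.unsqueeze(1), H.permute(1, 0, 2))
weighted = weighted.permute(1, 0, 2)  # [1, batch, hid_enc]
print(weighted.shape)  # torch.Size([1, 4, 50])

# Same quantity for the first sentence, written as the explicit sum over t
manual = (attn[0].unsqueeze(1) * H[:, 0, :]).sum(dim=0)
matches = torch.allclose(weighted[0, 0], manual, atol=1e-5)
print(matches)
```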

In [None]:
# TO DO: Compute the weighted vector. You will need to expand / permute the appropriate tensors
# Hint: to perform vector/matrix multiplication for batches, no need for a loop: use the 'torch.bmm' function.
# Finally, permute the weighted vector so that it has shape [1, batch_size, hidden_dim_enc]


Finally, we need to take this weighted vector into account when using the RNN (which here is a GRU). Recall that without attention, the RNN computation is simply $s_{t'} = \text{RNN}(y_{t'}, s_{t'-1})$. When using attention, the formula becomes:

$$
s_{t'} = \text{RNN}([y_{t'}, w], s_{t'-1})
$$

This means we concatenate the weighted vector with the RNN input $y_{t'}$, which is the embedding after dropout.

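**Note**: this concatenation changes the dimension of the RNN input. Therefore, when defining the RNN, the input dimension should no longer be `embedding_dim_dec` but `embedding_dim_dec + hidden_dim_enc`, since the weighted vector has the size of the encoder outputs (here `hidden_dim_enc` and `hidden_dim_dec` are both 50, so the two coincide).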
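
A minimal sketch of one decoding step with the concatenated input, on random tensors. The sizes and the names `y`, `w`, `s_prev` are assumptions following the symbols above; the weighted vector is given the size of the encoder outputs, which equals the decoder hidden size in this lab's configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration
batch, emb_dim, hid_enc, hid_dec = 4, 32, 50, 50

# The GRU input size accounts for the embedding plus the weighted vector
gru = nn.GRU(emb_dim + hid_enc, hid_dec)

y = torch.randn(1, batch, emb_dim)       # stand-in for the embedded input token
w = torch.randn(1, batch, hid_enc)       # stand-in for the weighted vector
s_prev = torch.randn(1, batch, hid_dec)  # stand-in for the previous decoder hidden state

# Concatenate along the feature dimension and run a single GRU step
rnn_input = torch.cat((y, w), dim=2)  # [1, batch, emb_dim + hid_enc]
output, s_new = gru(rnn_input, s_prev)
print(rnn_input.shape)  # torch.Size([1, 4, 82])
print(s_new.shape)      # torch.Size([1, 4, 50])
```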

In [None]:
class Decoder(nn.Module):
    def __init__(self, output_dim, embedding_dim_dec, hidden_dim_enc, hidden_dim_dec, n_layers, dropout_rate):
        super().__init__()
        
        # Store parameters
        self.output_dim = output_dim
        self.embedding_dim_dec = embedding_dim_dec
        self.hidden_dim_enc = hidden_dim_enc
        self.hidden_dim_dec = hidden_dim_dec
        self.n_layers = n_layers
        self.dropout_rate = dropout_rate
        
        # TO DO: Create the layers and attention module

        
    def forward(self, input_idx, input_hidden, enc_outputs):
        
        # Get the embeddings for the input token (same as in the previous lab)
        y = self.dropout_layer(self.embedding_layer(input_idx))
        y = y.unsqueeze(0)
        
        # TO DO: Compute attention
        
        # TO DO: Compute the weighted vector
        
        # TO DO: Concatenate the embeddings (after dropout) and the weighted vector
        
        # TO DO: apply the GRU layer
        
        # TO DO: squeeze the output of the GRU and pass it to the linear layer to have the predicted probabilities
        
        return pred_proba, hidden, a

In [None]:
# TO DO: Instantiate the decoder, print the number of parameters

# Initialize an input index tensor (corresponds to <sos>); embedding layers expect integer (long) indices
input_idx = torch.full((batch_size,), TRG.vocab.stoi[TRG.init_token], dtype=torch.long)

# TO DO: Apply the decoder, print the shape of the outputs


## Full model

The full model is the same as in the previous lab. The only difference is that we store the attention at every decoding step and return it (it will be used for visualization).

In [None]:
# TO DO: write the Seq2Seq model


In [None]:
# TO DO: Instantiate the full model, apply it to the example_batch, and print the output shapes


## Training (with validation) and evaluation

We now implement the training function with validation.
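
The "keep the best model on the validation set" pattern can be sketched on a toy regression problem. Everything here (the linear `toy_model`, the random data, the temporary checkpoint path) is an assumption for illustration and not the lab's seq2seq setup:

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression data and model, standing in for the seq2seq setup
x_train, y_train = torch.randn(64, 3), torch.randn(64, 1)
x_val, y_val = torch.randn(16, 3), torch.randn(16, 1)
toy_model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

ckpt_path = os.path.join(tempfile.mkdtemp(), 'toy_model.pt')
best_val = float('inf')
val_history = []

for epoch in range(5):
    # Training step (a single full-batch update per epoch, for brevity)
    toy_model.train()
    optimizer.zero_grad()
    loss = loss_fn(toy_model(x_train), y_train)
    loss.backward()
    optimizer.step()

    # Validation: evaluation mode, no gradient tracking
    toy_model.eval()
    with torch.no_grad():
        val_loss = loss_fn(toy_model(x_val), y_val).item()
    val_history.append(val_loss)

    # Save the parameters only when the validation loss improves
    if val_loss < best_val:
        best_val = val_loss
        torch.save(toy_model.state_dict(), ckpt_path)

# Reload the best parameters found during training
toy_model.load_state_dict(torch.load(ckpt_path))
```

Note that the model is switched back to training mode at the start of each epoch, since validation puts it in evaluation mode.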

In [None]:
# The evaluation function is provided, since it's the same as in the previous lab
def evaluate_seq2seq(model, eval_dataloader, loss_fn, device='cpu', verbose=True):

    model.eval()
    model.to(device)
    loss_eval = 0

    for i, batch in enumerate(eval_dataloader):

        # Get the source and target sentence, and the target length, copy it to device
        src, trg = batch.src.to(device), batch.trg.to(device)
        trg_len = trg.shape[0]

        # Apply the model
        pred_probas, _ = model(src, trg_len)

        # Remove the first token (always <sos>) to compute the loss
        output_dim = pred_probas.shape[-1]
        pred_probas = pred_probas[1:]

        # Reshape the pred_probas and target so that they have appropriate shapes:
        pred_probas = pred_probas.view(-1, output_dim)
        trg = trg[1:].view(-1)

        # Compute the loss
        loss = loss_fn(pred_probas, trg)

        # Record the loss
        loss_eval += loss.item()

    return loss_eval

In [None]:
# Below is the training function, which is also the same as in the previous lab, up to validation, which is left to implement.

def training_validation_seq2seq(model, train_dataloader, num_epochs, loss_fn, optimizer, model_name, valid_dataloader=None, device='cpu', verbose=True):

    model.train()
    model.to(device)

    loss_train_total = []
    loss_val_total = []
    
    loss_val_optim = float('inf')

    for epoch in range(num_epochs):

        loss_current_epoch = 0

        for i, batch in enumerate(train_dataloader):

            # Get the source and target sentence, and the target length, copy it to device
            src, trg = batch.src.to(device), batch.trg.to(device)
            trg_len = trg.shape[0]

            # Set the gradients at 0
            optimizer.zero_grad()

            # Apply the model
            pred_probas, _ = model(src, trg_len)

            # Remove the first token (always <sos>) to compute the loss
            output_dim = pred_probas.shape[-1]
            pred_probas = pred_probas[1:]

            # Reshape the pred_probas and target
            pred_probas = pred_probas.view(-1, output_dim)
            trg = trg[1:].view(-1)
            
            # Backpropagation
            loss = loss_fn(pred_probas, trg)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            # Record the loss
            loss_current_epoch += loss.item()

        # At the end of each epoch, save the loss summed over batches and display it
        loss_train_total.append(loss_current_epoch)
        if verbose:
            print ('Epoch [{}/{}], Loss: {:.4f}'.format(epoch+1, num_epochs, loss_current_epoch))

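        # TO DO: Perform validation: save the current model (as model_name + '.pt') only if it improves performance (i.e., decreases the loss) on the validation set
        # Remember to switch back to model.train() afterwards, since the evaluation function calls model.eval()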

                
    return loss_train_total, loss_val_total

In [None]:
# Training parameters
num_epochs = 10
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
loss_fn = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)
optimizer = optim.Adam(model.parameters())

# Train the model
loss_train, loss_val = training_validation_seq2seq(model, train_dataloader, num_epochs, loss_fn, optimizer, 'model_attention', valid_dataloader)

In [None]:
# Evaluate the model on the test set
model.load_state_dict(torch.load('model_attention.pt'))
loss_test = evaluate_seq2seq(model, test_dataloader, loss_fn)
print('Test loss: ', loss_test)

## Checking the results

In [None]:
# Get some examples (source and target) from the test set
example_batch = next(iter(test_dataloader))
example_batch_src, example_batch_trg = example_batch.src, example_batch.trg

# Compute predictions with the model
example_batch_trg_pred, a = model(example_batch_src, len(example_batch_trg))
indx_pred = torch.argmax(example_batch_trg_pred, -1)

# Print the true and predicted target sentences
indx_sentence_print = 1

sentence = itos_list_de(example_batch_src[:, indx_sentence_print])
translation = itos_list_en(example_batch_trg[:, indx_sentence_print])
translation_pred = itos_list_en(indx_pred[:, indx_sentence_print])
attention = a[indx_sentence_print]

print(sentence)
print(translation)
print(translation_pred)

## Visualizing attention

We provide here a function to visualize attention, that is, how "strongly" words in the source sentence relate to the words in the target sentence.

<img src="https://miro.medium.com/max/2000/1*FP3zFjdFhNUWEJ9hxeIYOA.png" width="800"/>
<center><a href="https://medium.com/swlh/a-simple-overview-of-rnn-lstm-and-attention-mechanism-9e844763d07b">Source</a></center>

In [None]:
def vizualize_attention(attention, sentence, translation):

    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(111)

    # Move to CPU before converting, so this also works if the model ran on GPU
    attention = attention.squeeze(1).detach().cpu().numpy()

    y_ticks = [] 
    for t in translation:
        y_ticks.append(t)
        if t == '<eos>':
            break

    x_ticks = [] 
    for t in sentence:
        x_ticks.append(t)
        if t == '<eos>':
            break

    x_ticks = [''] + x_ticks

    attention = attention[1:len(y_ticks), :len(x_ticks)-1]

    cax = ax.matshow(attention, cmap='bone')
    ax.tick_params(labelsize=15)

    ax.set_xticklabels(x_ticks, rotation=45)
    ax.set_yticklabels(y_ticks)

    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()
    plt.close()

    return

vizualize_attention(attention, sentence, translation_pred)

Once again, the results are not good. As a first "bonus" exercise (do it at home because it takes time), you can:

- use the whole dataset instead of a subset
- increase the model capacity: use embedding dimensions of 256 and hidden dimensions of 512

Compare the test loss of this model (GRU with attention) with the one from the previous bonus lab (GRU without attention, and with 1 layer and no recurrent dropout).