<a href="https://colab.research.google.com/github/rahiakela/dive-to-deep-learning/blob/main/10-attention-mechanisms/2_sequence_to_sequence_with_attention_mechanisms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sequence to Sequence with Attention Mechanisms

In this notebook, we add the attention mechanism to the sequence to sequence (seq2seq) model to explicitly aggregate states with weights. The first figure below shows the model
architecture for encoding and decoding at time step $t$.

Here, the memory of the attention layer consists of all the information that the encoder has seen: the encoder output at each time step. During decoding, the decoder output from the previous time step $t - 1$ is used as the
query. The output of the attention model is treated as context information, and this context is concatenated with the decoder input $D_t$. Finally, we feed the concatenation into the decoder.
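
The data flow described above can be sketched in a few lines of plain NumPy. This is only an illustration under stated assumptions: the scoring function is taken to be a dot product (the text above does not fix one), the helper names `softmax` and `attend` are hypothetical, and NumPy is imported as `onp` so that it does not clash with `mxnet.np` used later in the notebook.

```python
import numpy as onp  # plain NumPy; `onp` avoids a clash with `mxnet.np`


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = onp.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(query, memory):
    """Dot-product attention (an assumed scoring function).

    query:  (batch_size, num_hiddens), decoder output from step t - 1
    memory: (batch_size, num_steps, num_hiddens), encoder outputs
    Returns the context (batch_size, num_hiddens) and the
    attention weights (batch_size, num_steps).
    """
    scores = onp.einsum('bh,bth->bt', query, memory)
    weights = softmax(scores, axis=-1)
    context = onp.einsum('bt,bth->bh', weights, memory)
    return context, weights


rng = onp.random.default_rng(0)
batch_size, num_steps, num_hiddens, embed_size = 4, 7, 16, 8
memory = rng.normal(size=(batch_size, num_steps, num_hiddens))
query = rng.normal(size=(batch_size, num_hiddens))
D_t = rng.normal(size=(batch_size, embed_size))  # decoder input at step t

context, weights = attend(query, memory)
# The context is concatenated with the decoder input before entering the decoder
decoder_in = onp.concatenate([context, D_t], axis=-1)
print(context.shape, weights.shape, decoder_in.shape)
```

The concatenated input has `num_hiddens + embed_size` features per example, which is why the decoder's recurrent layer sees a wider input than the encoder's.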

<img src='https://github.com/rahiakela/img-repo/blob/master/seq2seq.png?raw=1' width='800'/>

To illustrate the overall architecture of the seq2seq model with attention, the layer structure of its encoder and decoder is shown below.

<img src='https://github.com/rahiakela/img-repo/blob/master/seq2seq-layers.png?raw=1' width='800'/>

## Setup

In [None]:
%%shell

pip install -U mxnet-cu101==1.7.0
pip install d2l

In [6]:
from d2l import mxnet as d2l
from mxnet import np, npx
from mxnet.gluon import rnn, nn
npx.set_np()

## Encoder

Technically speaking, the encoder transforms an input sequence of variable length into a fixed-shape context variable $c$, and encodes the input sequence information in this context variable. We can use an RNN to design the encoder.

Let us consider a single sequence (batch size: 1). Suppose that the input sequence is $x_1, ..., x_T$, such that $x_t$ is the $t^{th}$ token in the input text sequence. At time step $t$, the RNN transforms the input feature vector $X_t$ for $x_t$ and the hidden state $h_{t-1}$ from the previous time step into the current hidden state $h_t$. We can use a function $f$ to express the transformation of the RNN's recurrent layer:

$$h_t = f(X_t, h_{t-1})$$

In general, the encoder transforms the hidden states at all the time steps into the context variable through a customized function $q$:

$$c = q(h_1, ..., h_T)$$

For example, when choosing $q(h_1, ..., h_T ) = h_T$, the context variable is just the hidden state $h_T$ of the input sequence at the final time step.
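
As a rough illustration of the two formulas above, the sketch below unrolls a recurrence over one sequence and then takes $q(h_1, ..., h_T) = h_T$. It assumes a plain tanh recurrent cell for $f$ (the encoder implemented later uses a GRU instead), and the names `rnn_step`, `W_xh`, `W_hh`, and `b_h` are hypothetical. NumPy is imported as `onp` to keep it separate from `mxnet.np`.

```python
import numpy as onp  # plain NumPy; `onp` avoids a clash with `mxnet.np`


def rnn_step(X_t, h_prev, W_xh, W_hh, b_h):
    """One recurrence step h_t = f(X_t, h_{t-1}), with f assumed to be tanh."""
    return onp.tanh(X_t @ W_xh + h_prev @ W_hh + b_h)


rng = onp.random.default_rng(0)
T, embed_size, num_hiddens = 7, 8, 16
W_xh = rng.normal(scale=0.1, size=(embed_size, num_hiddens))
W_hh = rng.normal(scale=0.1, size=(num_hiddens, num_hiddens))
b_h = onp.zeros(num_hiddens)

X = rng.normal(size=(T, embed_size))  # feature vectors X_1, ..., X_T
h = onp.zeros(num_hiddens)            # initial hidden state h_0
hidden_states = []
for X_t in X:
    h = rnn_step(X_t, h, W_xh, W_hh, b_h)
    hidden_states.append(h)

# Context variable: q(h_1, ..., h_T) = h_T
c = hidden_states[-1]
print(len(hidden_states), c.shape)
```

With attention, the intermediate `hidden_states` are kept as well, since they form the memory that the decoder queries at each step.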

Now let us implement the RNN encoder. Note that we use an embedding layer to obtain the feature vector for each token in the input sequence. The weight of an embedding layer is a matrix whose number of rows equals the size of the input vocabulary (`vocab_size`) and whose number of columns equals the feature vector's dimension (`embed_size`).

For any input token index $i$, the embedding layer fetches the $i^{th}$ row (starting from 0) of the weight matrix to return its feature vector. In addition,
we choose a multilayer GRU to implement the encoder.
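
The row lookup performed by an embedding layer can be mimicked with integer indexing in plain NumPy. This is a minimal sketch rather than the Gluon implementation: the `weight` matrix below is filled with hypothetical values so that the fetched rows are easy to recognize, and NumPy is imported as `onp` to keep it separate from `mxnet.np`.

```python
import numpy as onp  # plain NumPy; `onp` avoids a clash with `mxnet.np`

vocab_size, embed_size = 10, 8
# Hypothetical weight matrix: row i holds the feature vector of token i
weight = onp.arange(vocab_size * embed_size, dtype=float).reshape(
    vocab_size, embed_size)

# A minibatch of token indices with shape (batch_size=2, num_steps=3)
tokens = onp.array([[0, 3, 9],
                    [3, 3, 1]])

# Integer indexing fetches one row of `weight` per token index
features = weight[tokens]
print(features.shape)                               # (2, 3, 8)
print(onp.array_equal(features[0, 1], weight[3]))   # True
```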

In [7]:
class Seq2SeqEncoder(d2l.Encoder):
  """The RNN encoder for sequence to sequence learning."""

  def __init__(self, vocab_size, embed_size, num_hiddens, num_layers, dropout=0, **kwargs):
    super(Seq2SeqEncoder, self).__init__(**kwargs)

    # Embedding layer
    self.embedding = nn.Embedding(vocab_size, embed_size)
    self.rnn = rnn.GRU(num_hiddens, num_layers, dropout=dropout)

  def forward(self, X, *args):
    # After embedding, `X` has shape (`batch_size`, `num_steps`, `embed_size`)
    X = self.embedding(X)
    # In RNN models, the first axis corresponds to time steps
    X = X.swapaxes(0, 1)
    state = self.rnn.begin_state(batch_size=X.shape[1], ctx=X.ctx)
    output, state = self.rnn(X, state)
    # `output` shape: (`num_steps`, `batch_size`, `num_hiddens`)
    # `state[0]` shape: (`num_layers`, `batch_size`, `num_hiddens`)
    return output, state
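
The shape bookkeeping in `forward` can be traced without MXNet. The sketch below is an assumption-laden stand-in: it replaces the embedding layer with NumPy indexing into a hypothetical random `weight` matrix and only shows the effect of `swapaxes(0, 1)`, which moves the time-step axis to the front as the recurrent layer expects.

```python
import numpy as onp  # plain NumPy; `onp` avoids a clash with `mxnet.np`

batch_size, num_steps, vocab_size, embed_size = 4, 7, 10, 8
tokens = onp.zeros((batch_size, num_steps), dtype=int)
# Hypothetical embedding weights standing in for `nn.Embedding`
weight = onp.random.default_rng(0).normal(size=(vocab_size, embed_size))

embedded = weight[tokens]              # (batch_size, num_steps, embed_size)
time_major = embedded.swapaxes(0, 1)   # (num_steps, batch_size, embed_size)
print(embedded.shape, time_major.shape)
```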

Let us use a concrete example to illustrate the above encoder implementation. Below we instantiate a two-layer GRU encoder whose number of hidden units is 16. Given a minibatch of sequence inputs `X` (batch size: 4, number of time steps: 7), the hidden states of the last layer at all the time steps (`output`, returned by the encoder's recurrent layers) are a tensor of shape (number of time steps, batch size, number of hidden units).

In [8]:
encoder = Seq2SeqEncoder(vocab_size=10, embed_size=8, num_hiddens=16, num_layers=2)
encoder.initialize()

X = np.zeros((4, 7))
output, state = encoder(X)
output.shape

(7, 4, 16)

Since a GRU is employed here, the shape of the multilayer hidden states at the final time step is (number of hidden layers, batch size, number of hidden units). If an LSTM is used, memory cell information will also be contained in state.

In [9]:
len(state), state[0].shape

(1, (2, 4, 16))

## Decoder