# ***Seq2Seq with Attention Mechanism (model only)***

In this notebook, we implement a Seq2Seq model with an attention mechanism. The two main alignment mechanisms for attention are Luong attention and Bahdanau attention. We use Bahdanau attention here.


Bahdanau (additive) attention scores each encoder state with a small feed-forward network, whereas Luong (multiplicative) attention scores it with a dot product. Because the scoring function is itself learned, Bahdanau attention can, in principle, capture more complex relations between decoder and encoder states.
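
For reference, the two families of scoring functions are usually written as below. The symbols are chosen here for illustration and do not appear in the notebook code: $s$ is a decoder state, $h_i$ is the encoder state at source position $i$, and $W_s$, $W_h$, $W$, $v$ are learned parameters.

$$
e_{t,i}^{\text{Bahdanau}} = v^\top \tanh\left(W_s s_{t-1} + W_h h_i\right), \qquad
e_{t,i}^{\text{Luong, dot}} = s_t^\top h_i, \qquad
e_{t,i}^{\text{Luong, general}} = s_t^\top W h_i
$$

$$
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}, \qquad
c_t = \sum_i \alpha_{t,i} \, h_i
$$

The decoder in this notebook uses a simplified additive variant: a single linear layer over the concatenation of $s_{t-1}$ and $h_i$, followed by a ReLU, in place of the $v^\top \tanh(\cdot)$ form.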


You can find more information about these attention types [here](https://www.baeldung.com/cs/attention-luong-vs-bahdanau).

In [7]:
import torch
import torch.nn as nn
import random

## **Encoder**

In [8]:
class Encoder(nn.Module):

  def __init__(self,input_size, embed_size, hidden_size, num_layers, dropout):
    super(Encoder,self).__init__()

    self.hidden_size = hidden_size
    self.num_layers = num_layers

    self.embedding = nn.Embedding(input_size, embed_size)
    self.rnn = nn.LSTM(embed_size, hidden_size, num_layers, bidirectional = True)

    self.fc_hidden = nn.Linear(hidden_size * 2, hidden_size)
    self.fc_cell = nn.Linear(hidden_size * 2, hidden_size)

    self.dropout = nn.Dropout(p=dropout)


  def forward(self, x):

    embedding = self.dropout(self.embedding(x))

    encoder_states, (hidden, cell) = self.rnn(embedding)

    hidden = self.fc_hidden(torch.cat((hidden[0:1], hidden[1:2]),dim=2))

    cell = self.fc_cell(torch.cat((cell[0:1], cell[1:2]), dim=2))

    return encoder_states, hidden, cell

**Explaining the Encoder block.**

Seq2Seq models handle sequence-to-sequence tasks such as machine translation: an encoder RNN reads the source sequence, and a decoder RNN generates the target sequence one token at a time.



We need the following arguments to build the encoder block.
* input_size = size of the source vocabulary
* embed_size = dimension of the embedding vector that represents each word
* hidden_size = size of the hidden state of the RNN
* num_layers = number of stacked RNN layers
* dropout = dropout probability applied to the embeddings

We use a bidirectional LSTM. When `bidirectional=True`, the output contains the concatenation of the forward and backward hidden states at each time step in the sequence.

So `encoder_states` has a last dimension of `2*hidden_size`. `hidden` and `cell` keep a last dimension of `hidden_size`, but they carry one entry per direction along their first dimension. Concatenating the two directions gives size `2*hidden_size`.

The final hidden and cell values summarise the source sequence and are used to initialise the decoder. Instead of choosing by hand whether to keep both directions or only one, we let the network decide how to combine them.

`fc_hidden` and `fc_cell` are linear layers that map the concatenated size `2*hidden_size` down to `hidden_size`.


**Forward Block**


* Initially, `x` has dimension `(sequence_length, batch_size)`.

* `embedding = self.dropout(self.embedding(x))` creates the embedding of `x` and applies dropout. The result has an extra dimension, `(sequence_length, batch_size, embed_size)`, because each word index is mapped to a vector. A larger `embed_size` can capture more information about each word, but it requires more computation and memory.

* `encoder_states, (hidden, cell) = self.rnn(embedding)`. For a unidirectional LSTM, `encoder_states` would have dimension `(seq_length, N, hidden_size)`, where N is the batch size. Because we use `bidirectional=True`, the dimension is `(seq_length, N, hidden_size * 2)`. [For more info on LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html)

* `hidden` and `cell` have dimension `(2 * num_layers, N, hidden_size)`, which is `(2, N, hidden_size)` here because `num_layers = 1`. Index 0 holds the forward direction and index 1 the backward direction.

* `hidden = self.fc_hidden(torch.cat((hidden[0:1], hidden[1:2]), dim=2))` concatenates the forward and backward hidden states into a tensor of dimension `(1, N, hidden_size * 2)`. `fc_hidden` then projects it to `(1, N, hidden_size)`, so the network learns which information to keep. Note that this slicing assumes `num_layers = 1`.

* We do the same for `cell` using `fc_cell`.

* We return `encoder_states, hidden, cell` from the function.
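
The shapes described above can be checked with a small standalone sketch. The sizes are illustrative and much smaller than the ones used later in the notebook, and a random tensor stands in for the embedded source sequence.

```python
import torch
import torch.nn as nn

# Illustrative sizes only (not the notebook's real hyperparameters).
seq_len, batch_size, embed_size, hidden_size = 7, 3, 16, 32

rnn = nn.LSTM(embed_size, hidden_size, num_layers=1, bidirectional=True)
fc_hidden = nn.Linear(hidden_size * 2, hidden_size)

# Random tensor standing in for the embedded source sequence.
embedding = torch.randn(seq_len, batch_size, embed_size)
encoder_states, (hidden, cell) = rnn(embedding)

print(encoder_states.shape)  # (seq_len, N, 2 * hidden_size)
print(hidden.shape)          # (2, N, hidden_size): one entry per direction

# Concatenate the two directions, then project back to hidden_size.
merged = torch.cat((hidden[0:1], hidden[1:2]), dim=2)  # (1, N, 2 * hidden_size)
projected = fc_hidden(merged)                          # (1, N, hidden_size)
print(merged.shape, projected.shape)
```

The same two steps apply to `cell` with `fc_cell`.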

## **Decoder**

In [9]:
class Decoder(nn.Module):

  def __init__(self, input_size, embed_size, hidden_size, output_size, num_layers, dropout):

    super(Decoder, self).__init__()

    self.hidden_size = hidden_size
    self.num_layers = num_layers

    self.embedding = nn.Embedding(input_size, embed_size)
    self.dropout = nn.Dropout(dropout)
    self.rnn = nn.LSTM(2*hidden_size+embed_size, hidden_size, num_layers)

    self.energy = nn.Linear(3*hidden_size, 1)
    self.fc = nn.Linear(hidden_size, output_size)

    self.relu = nn.ReLU()
    self.softmax = nn.Softmax(0)


  def forward(self, x, encoder_states, hidden, cell):

    # Dimensions:
    # x -> (N) : the current input token for every sequence in the batch (<SOS> at the first step)
    # encoder_states -> (seq_len, N, 2*hidden_size)
    # hidden -> (1, N, hidden_size)
    # cell -> (1, N, hidden_size)

    x = x.unsqueeze(0)
    # x -> (1,N)

    embeddings = self.dropout(self.embedding(x))
    # embeddings -> (1,N,embed_size)

    seq_len = encoder_states.shape[0]
    h_reshaped = hidden.repeat(seq_len, 1, 1)
    # h_reshaped -> (seq_length, N, hidden_size)
    # you can check what it does in your terminal by trying (>> x = torch.randn(1, 2, 5) >> x.repeat(3,1,1) >>x)

    # we will concat h_reshaped with encoder_states to compute energy. It was the reason for reshaping the hidden.
    energy = self.relu(self.energy(torch.cat((h_reshaped, encoder_states), dim=2)))
    # self.energy takes 3*hidden_size as input and gives out a single output. That output is then passed via relu.
    # energy -> (seq_len, N, 1)

    attention = self.softmax(energy)
    # attention -> (seq_len,N,1)
    # The softmax operation normalizes the values along the sequence length dimension, ensuring they sum to 1 and represent probabilities.

    # attention -> snk -> (seq_len, N, 1)
    # encoder_states -> snl -> (seq_len, N, hidden_size*2)
    # we want context vector of dimension knl i.e. (1,N,hidden_size*2)
    context_vector = torch.einsum("snk,snl->knl", attention, encoder_states)
    # context_vector -> (1,N,hidden_size*2)


    # For decoder rnn, input will be context_vector and embeddings of target, the context vector will be concatenated with embedding
    # embeddings -> (1,N,embed_size)
    rnn_input = torch.cat((context_vector, embeddings), dim=2)
    # rnn_input: (1, N, hidden_size*2 + embed_size)

    outputs, (hidden, cell) = self.rnn(rnn_input, (hidden, cell))
    # outputs shape: (1, N, hidden_size)
    # (hidden, cell) are passed to the LSTM as its initial state, so each decoding step
    # continues from the state produced by the previous step.
    # At the first step they are the projected final states of the encoder.

    # outputs -> (1,N,hidden_size) ; self.fc = Linear(hidden_size,output_size) -> (1,N,output_size)
    predictions = self.fc(outputs).squeeze(0)
    # predictions: (N, output_size)

    return predictions, hidden, cell
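
The softmax and `einsum` steps in `Decoder.forward` can be examined in isolation. The sketch below uses random tensors and small illustrative sizes. It checks that the attention weights sum to 1 over the source positions and that the `einsum` call equals an explicit weighted sum of the encoder states.

```python
import torch

seq_len, batch_size, hidden_size = 5, 2, 4  # illustrative sizes

# Random stand-ins for the energy scores and the encoder outputs.
energy = torch.randn(seq_len, batch_size, 1)
encoder_states = torch.randn(seq_len, batch_size, 2 * hidden_size)

# Normalise over the source positions (dim 0), as nn.Softmax(0) does in the decoder.
attention = torch.softmax(energy, dim=0)
weight_sums = attention.sum(dim=0)  # (N, 1), each entry is 1 up to rounding

# Weighted sum over the sequence dimension, written two equivalent ways.
context_vector = torch.einsum("snk,snl->knl", attention, encoder_states)
context_manual = (attention * encoder_states).sum(dim=0, keepdim=True)

print(context_vector.shape)  # (1, N, 2 * hidden_size)
print(torch.allclose(context_vector, context_manual, atol=1e-6))
```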


## **Seq2Seq**

In [17]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder,target_vocab_size,device):
      super(Seq2Seq, self).__init__()
      self.encoder = encoder
      self.decoder = decoder
      self.target_vocab_size = target_vocab_size
      self.device= device

    def forward(self, source, target, teacher_force_ratio=0.5):
      batch_size = source.shape[1]
      target_len = target.shape[0]
      target_vocab_size = self.target_vocab_size

      outputs = torch.zeros(target_len, batch_size, target_vocab_size).to(self.device)
      # outputs -> (target_len, N, target_vocab_size); outputs[0] stays zero because the loop starts at t = 1

      encoder_states, hidden, cell = self.encoder(source)
      # encoder_states -> (seq_len, N, hidden_size*2)
      # hidden -> (1,N,hidden)
      # cell -> (1,N,hidden)

      # First input will be <SOS> token
      x = target[0] # x -> (N)

      for t in range(1, target_len):
          # At every time step use encoder_states and update hidden, cell

          output, hidden, cell = self.decoder(x, encoder_states, hidden, cell)
          # output -> (N, output_size)
          # hidden -> (1,N,hidden)
          # cell -> (1,N,hidden)


          # Store prediction for current time step
          outputs[t] = output

          # Get the best word the Decoder predicted (index in the vocabulary)
          best_guess = output.argmax(1)

          # With probability teacher_force_ratio we feed the actual next word,
          # otherwise we feed the word the decoder predicted.
          # Teacher forcing helps training converge, but with a ratio of 1 the
          # model never sees its own predictions as input, so the inputs at
          # test time might be very different from what the network is used to.
          x = target[t] if random.random() < teacher_force_ratio else best_guess

      return outputs
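
The teacher-forcing choice at the end of the loop can be illustrated on its own. This is a minimal sketch: `choose_next_input` is a hypothetical helper that does not exist in the notebook, and the strings stand in for token indices.

```python
import random

def choose_next_input(target_token, predicted_token, teacher_force_ratio, rng=random):
    """Return the ground-truth token with probability teacher_force_ratio, else the prediction."""
    return target_token if rng.random() < teacher_force_ratio else predicted_token

# Seeded generator so the run is repeatable.
rng = random.Random(0)
picks = [choose_next_input("truth", "guess", 0.5, rng) for _ in range(1000)]
truth_share = picks.count("truth") / len(picks)
print(f"ground truth used in {truth_share:.0%} of steps")
```

With `teacher_force_ratio=1.0` the helper always returns the ground-truth token, and with `0.0` it always returns the prediction, which is the behaviour you would typically want at inference time.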


## **Model Building**

In [11]:
encoder_net = Encoder(input_size=10000,
                      embed_size=256,
                      hidden_size=512,
                      num_layers=1,
                      dropout=0.5
                      )

In [12]:
decoder_net = Decoder(input_size=10000,
                      embed_size=256,
                      hidden_size=512,
                      output_size=5000,
                      num_layers=1,
                      dropout=0.5
)

In [13]:
device = torch.device('cuda') if torch.cuda.is_available() else "cpu"
device

'cpu'

In [18]:
model = Seq2Seq(encoder=encoder_net,
                decoder=decoder_net,
                target_vocab_size=5000,
                device=device
                )

In [19]:
model

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(10000, 256)
    (rnn): LSTM(256, 512, bidirectional=True)
    (fc_hidden): Linear(in_features=1024, out_features=512, bias=True)
    (fc_cell): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(10000, 256)
    (dropout): Dropout(p=0.5, inplace=False)
    (rnn): LSTM(1280, 512)
    (energy): Linear(in_features=1536, out_features=1, bias=True)
    (fc): Linear(in_features=512, out_features=5000, bias=True)
    (relu): ReLU()
    (softmax): Softmax(dim=0)
  )
)

In [20]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 15,564,169 trainable parameters
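
The reported total can be reproduced by hand from the layer sizes printed above. The sketch below uses plain Python arithmetic. The helpers `lstm_params` and `linear_params` are hypothetical names introduced only for this check, and they assume PyTorch's default LSTM layout of four gates with two bias vectors per direction.

```python
def lstm_params(input_size, hidden_size, directions=1):
    # Per direction: 4 gates, each with input weights, recurrent weights and two bias vectors.
    per_direction = 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size
    return directions * per_direction

def linear_params(in_features, out_features):
    # Weight matrix plus bias vector.
    return in_features * out_features + out_features

# Sizes taken from the model printout above.
src_vocab, dec_vocab, out_size, embed, hidden = 10000, 10000, 5000, 256, 512

encoder_total = (
    src_vocab * embed                           # embedding
    + lstm_params(embed, hidden, directions=2)  # bidirectional LSTM
    + 2 * linear_params(2 * hidden, hidden)     # fc_hidden and fc_cell
)
decoder_total = (
    dec_vocab * embed                           # embedding
    + lstm_params(2 * hidden + embed, hidden)   # LSTM(1280, 512)
    + linear_params(3 * hidden, 1)              # energy
    + linear_params(hidden, out_size)           # fc
)
print(f"{encoder_total + decoder_total:,}")  # 15,564,169
```

Most of the parameters sit in the two embedding tables, the LSTMs and the output layer `fc`. The attention scorer `energy` contributes only 1,537.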


## That is it for the Seq2Seq model with attention. Thank you 😀