<a href="https://colab.research.google.com/github/saranshikens/Basic-ML/blob/main/AttentionClass.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

$\Huge \text{ATTENTION LAYERS}$

●	CNNs  
●	RNNs  
●	LSTM  
●	Sequence to Sequence Modelling  
●	Intro to Attention

In this notebook, I will define 4 different kinds of attention mechanisms.  
These will be used in conjunction with 4 architectures, namely RNNs and LSTMs (both uni-directional and bi-directional), in Models.ipynb.  
The classes have been written so that they can be readily imported into the architectures and  
are compatible with PyTorch.  
Each class accepts the encoder hidden states and the decoder state, and returns the attention weights and the context vector, in that order.  
By - Saransh

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F



---



$\Large \text{Theory}$

For all the attention architectures mentioned below:  
Let the annotation generated by the encoder, for every word $x_i$ in an input sentence of length $T$, be $h_i$ (key),   
and the previous hidden decoder state be $s_{t-1}$ (query).  
The decoder takes each $h_i$ and feeds it to an alignment model, $a(.)$, together with $s_{t-1}$.  
This generates an attention score: $e_{t,i} = a(s_{t-1}, h_i)$.  
Only the alignment model changes from one architecture to the next; the Luong variants below use the current decoder state $s_t$ as the query instead of $s_{t-1}$.  
A softmax function is applied over all $T$ attention scores to obtain the corresponding weights: $\displaystyle \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}$.  
The context vector is given as a weighted sum of the annotations: $\displaystyle c_t = \sum_{i=1}^{T} \alpha_{t,i} h_i$.
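
As a small worked example of the last two steps, the sketch below turns three hand-picked scores into weights and a context vector using only the Python standard library. The scores and annotations are made-up values for illustration, and `softmax` and `context_vector` are hypothetical helper names, not part of the classes defined in this notebook.

```python
import math

def softmax(scores):
    """Normalise a list of scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]

def context_vector(weights, annotations):
    """Weighted sum of the annotation vectors h_i with weights alpha_i."""
    dim = len(annotations[0])
    return [sum(w * h[d] for w, h in zip(weights, annotations)) for d in range(dim)]

# Illustrative values only: T = 3 annotations of dimension 2.
scores = [2.0, 1.0, 0.1]
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

alphas = softmax(scores)
c_t = context_vector(alphas, annotations)
print(alphas, c_t)
```

The largest score receives the largest weight, and because the weights sum to 1, $c_t$ is a convex combination of the annotations.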



---



$\Large \text{Bahdanau Attention}$

The alignment model combines $s_{t-1}$ and $h_i$ either by concatenation or by a weighted addition.  
Here, $a(s_{t-1},h_i) = \textbf{v}^T \text{tanh}(\textbf{W}[h_i;s_{t-1}])$, or  
$a(s_{t-1},h_i) = \textbf{v}^T \text{tanh}(\textbf{W}_1 h_i + \textbf{W}_2 s_{t-1})$, where $\textbf{v}$ is a weight vector and $\textbf{W}, \textbf{W}_1, \textbf{W}_2$ are weight matrices.  
These parameters will be learned by the model.  
I have implemented the second version of Bahdanau Attention here.
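
The two forms are closely related: if $\textbf{W}$ is split column-wise as $[\textbf{W}_1\ \textbf{W}_2]$, then $\textbf{W}[h_i;s_{t-1}] = \textbf{W}_1 h_i + \textbf{W}_2 s_{t-1}$. A minimal pure-Python check of this identity follows; the numbers are made up, `matvec` is a hypothetical helper, and bias terms are ignored.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

# Illustrative values: hidden size 2, attention size 2.
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[0.5, 0.0], [3.0, 1.0]]
h_i = [1.0, -2.0]
s_prev = [4.0, 0.5]

# W = [W1 W2]: place the rows of W1 and W2 side by side.
W = [r1 + r2 for r1, r2 in zip(W1, W2)]

concat_form = matvec(W, h_i + s_prev)  # W [h_i; s_{t-1}]
additive_form = [a + b for a, b in zip(matvec(W1, h_i), matvec(W2, s_prev))]
print(concat_form, additive_form)
```

Both expressions give the same vector before the `tanh`, so the choice between them is mainly one of implementation convenience.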

In [None]:
class BahdanauAttention(nn.Module):
    def __init__(self, hidden_size, attn_size):
        super(BahdanauAttention, self).__init__()
        self.W1 = nn.Linear(hidden_size, attn_size)
        self.W2 = nn.Linear(hidden_size, attn_size)
        self.v = nn.Linear(attn_size, 1, bias=False)

    def forward(self, key, query):
        # key (encoder outputs) has shape [batch_size, seq_len, hidden_size]
        # query (decoder state) has shape [batch_size, hidden_size]
        # to add the two, their shapes must match, so we unsqueeze the query
        # and repeat it along a new 2nd dimension of length sequence_len
        sequence_len = key.size(1) # length of input sentence
        query = query.unsqueeze(1).repeat(1, sequence_len, 1)

        score = torch.tanh(self.W1(key) + self.W2(query))
        # self.v reduces each position to a scalar score: [batch_size, seq_len]
        attention_weights = F.softmax(self.v(score).squeeze(-1), dim=1)

        # the context vector should have shape [batch_size, hidden_size], so
        # we unsqueeze the weights for bmm and squeeze the result
        context_vector = torch.bmm(attention_weights.unsqueeze(1), key).squeeze(1)

        return attention_weights, context_vector



---



$\Large \text{Luong Dot Attention}$  


Here, $a(s_t, h_i) = s_{t}^{T} h_i$
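
Because the score is a plain dot product, this variant has no learnable parameters, but $s_t$ and $h_i$ must have the same dimension. A small sketch with made-up vectors (the `dot` helper is hypothetical and only illustrates the scoring step):

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("dot attention needs s_t and h_i of equal dimension")
    return sum(a * b for a, b in zip(u, v))

# Illustrative values: one decoder state and three encoder annotations.
s_t = [1.0, 0.0, 2.0]
annotations = [[1.0, 5.0, 0.0], [0.0, 1.0, 1.0], [2.0, 0.0, 2.0]]

scores = [dot(s_t, h_i) for h_i in annotations]
# index of the annotation that will receive the largest attention weight
best = max(range(len(scores)), key=lambda i: scores[i])
print(scores, best)
```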

In [None]:
class LuongDotAttention(nn.Module):
    def __init__(self):
        super(LuongDotAttention, self).__init__()

    def forward(self, key, query):
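        # key: [batch_size, seq_len, hidden_size], query: [batch_size, hidden_size]
        query = query.unsqueeze(2)  # [batch_size, hidden_size, 1]
        # batched dot product of every key with the query: [batch_size, seq_len]
        att_scores = torch.bmm(key, query).squeeze(2)
        att_weights = F.softmax(att_scores, dim=1)  # normalise over seq_len
        # weighted sum of the keys: [batch_size, hidden_size]
        context_vector = torch.bmm(att_weights.unsqueeze(1), key).squeeze(1)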

        return att_weights, context_vector



---



$\Large \text{Luong General Attention}$

Here, $a(s_t, h_i) = s_{t}^{T}\textbf{W}_a h_i$, where $\textbf{W}_a$ is learned by the model.
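
The score can be computed by first applying $\textbf{W}_a$ to $h_i$ and then taking a dot product with $s_t$, and with $\textbf{W}_a$ set to the identity it reduces to dot attention. The pure-Python sketch below checks both points on made-up numbers; `matvec`, `dot`, and `general_score` are hypothetical helpers.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def general_score(s_t, W_a, h_i):
    """s_t^T W_a h_i, computed as the dot product of s_t with (W_a h_i)."""
    return dot(s_t, matvec(W_a, h_i))

# Illustrative values only.
s_t = [1.0, 2.0]
h_i = [3.0, -1.0]
W_a = [[0.0, 1.0], [2.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

score = general_score(s_t, W_a, h_i)
score_identity = general_score(s_t, identity, h_i)
print(score, score_identity)
```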

In [None]:
class LuongGeneralAttention(nn.Module):
    def __init__(self, hidden_size):
        super(LuongGeneralAttention, self).__init__()
        self.W = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, key, query):
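        # apply W_a to every key: [batch_size, seq_len, hidden_size]
        transformed_key = self.W(key)
        query = query.unsqueeze(2)  # [batch_size, hidden_size, 1]
        # score_i = (W_a h_i) . s_t for every position: [batch_size, seq_len]
        att_scores = torch.bmm(transformed_key, query).squeeze(2)
        att_weights = F.softmax(att_scores, dim=1)
        # the context vector is built from the original (untransformed) keys
        context_vector = torch.bmm(att_weights.unsqueeze(1), key).squeeze(1)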

        return att_weights, context_vector



---



$\Large \text{Luong Concat Attention}$

Here, $a(s_t,h_i) = \textbf{v}^T \text{tanh}(\textbf{W}[h_i;s_t])$.

In [None]:
class LuongConcatAttention(nn.Module):
    def __init__(self, hidden_size, attn_size):
        super(LuongConcatAttention, self).__init__()
        # Input to the attention is the concatenation → size is 2 * hidden_size
        self.W = nn.Linear(2 * hidden_size, attn_size)
        self.v = nn.Linear(attn_size, 1, bias=False)

    def forward(self, key, query):
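        batch_size, seq_len, hidden_size = key.size()
        # repeat the query for every source position: [batch_size, seq_len, hidden_size]
        query = query.unsqueeze(1).repeat(1, seq_len, 1)
        # [h_i; s_t] for every position: [batch_size, seq_len, 2 * hidden_size]
        concat_matrix = torch.cat((key, query), dim=2)
        scores = torch.tanh(self.W(concat_matrix))
        att_scores = self.v(scores).squeeze(2)  # [batch_size, seq_len]
        att_weights = F.softmax(att_scores, dim=1)
        context_vector = torch.bmm(att_weights.unsqueeze(1), key).squeeze(1)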

        return att_weights, context_vector
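
One practical difference between the four classes is how many learnable parameters each one adds. The hypothetical helper below counts them from the layer shapes used in the classes above, assuming a fully connected layer holds `in_features * out_features` weights plus `out_features` biases unless `bias=False`. The sizes 256 and 128 are illustrative, not values taken from Models.ipynb.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

def attention_param_counts(hidden_size, attn_size):
    """Learnable parameters per attention class, derived from the layer shapes."""
    v = linear_params(attn_size, 1, bias=False)  # the scoring vector v
    return {
        "BahdanauAttention": 2 * linear_params(hidden_size, attn_size) + v,
        "LuongDotAttention": 0,
        "LuongGeneralAttention": linear_params(hidden_size, hidden_size, bias=False),
        "LuongConcatAttention": linear_params(2 * hidden_size, attn_size) + v,
    }

# Illustrative sizes only.
counts = attention_param_counts(hidden_size=256, attn_size=128)
for name, n in counts.items():
    print(name, n)
```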

To look at the models, run Models.ipynb