## Objective:
            
<p>We all talk about language modelling these days. New models come out every month or so, but the core idea behind them remains the same: `Attention`. The motivation for this notebook is to get a complete picture of attention mechanisms, their types, and the architectures currently used for different kinds of problems.</p>

Note:
<p>This work is an extensive study based on different resources (acknowledged below), written for my personal reference as well as for sharing knowledge with the community.</p>

<font color='#31a04b' size=4>Can I get your attention?</font><br>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.nn import functional as F

import sys
from torchtext import data
from torchtext import datasets
from torchtext.vocab import Vectors, GloVe
from sklearn import preprocessing
import random

# Table of Contents

- 1. Attention and its types
   - 1.1 Evolution
   - 1.2 Different types

- 2. Attention and its application in different architecture
   - 2.1 Seq2seq architecture
   - 2.2 Different alignment scores/functions
   
- 3. Some working gifs
   - 3.1 Encoder decoder gifs
   
- 4. Attention vs self attention

- 5. Attention for text classification
   - 5.1 Preparing dataset
   - 5.2 Using attention
   - 5.3 Using self attention
  
- 6. Acknowledgements

# 1. Attention and its types

## What is attention?

<p>Attention is a mechanism that lets a model, originally an RNN, focus on certain parts of the input sequence when predicting a certain part of the output sequence, which makes learning easier and improves output quality. That said, it's not only applicable to RNNs: the idea is generic and can be applied to other kinds of problems, including vision.</p>

## 1.1 Evolution

> Now let us see how different attention mechanisms have evolved over time:


<img src='https://buomsoo-kim.github.io/data/images/2020-01-01/2.png' width=1000>
<div align="center"><font size="3">Source: Google</font></div>


At a broad level, attention can be classified into two types:

1. Between the input and output elements (General Attention)
2. Within the input elements (Self-Attention)

Let's briefly walk through the image shown above with simple examples

## Seq2seq architecture

<img src='https://miro.medium.com/max/1400/1*iK8Wel75Ri55rSZfwAKHCA.jpeg' width=1000>
<div align="center"><font size="3">Source: Google</font></div>

*  The RNN encoder has an input sequence x1, x2, x3, x4. We denote the encoder states by c1, c2, c3. The encoder outputs a single vector c, which is passed as input to the decoder. Like the encoder, the decoder is a single-layered RNN; we denote the decoder states by s1, s2, s3 and the network’s output by y1, y2, y3, y4.

*  A potential issue with this encoder–decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus.
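
The bottleneck described above can be made concrete with a small sketch. It is a toy tanh RNN with random, untrained weights (the names `encode`, `W_xh` and `W_hh` are illustrative assumptions, not part of this notebook's model): whatever the input length, the decoder would only ever see one vector of the same fixed size.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3

# Randomly initialised (untrained) weights of a toy tanh RNN cell
W_xh = rng.normal(size=(input_size, hidden_size))
W_hh = rng.normal(size=(hidden_size, hidden_size))

def encode(inputs):
    """Run the toy RNN over `inputs` (seq_len, input_size) and keep only the last state."""
    h = np.zeros(hidden_size)
    for x_t in inputs:
        h = np.tanh(x_t @ W_xh + h @ W_hh)
    return h

short_sentence = rng.normal(size=(4, input_size))   # 4 time steps
long_sentence = rng.normal(size=(50, input_size))   # 50 time steps

c_short = encode(short_sentence)
c_long = encode(long_sentence)

# Both summaries have shape (hidden_size,), regardless of sentence length
print(c_short.shape, c_long.shape)
```

A 50-step sentence has to fit into the same number of values as a 4-step one, which is the motivation for letting the decoder look back at all encoder states instead.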

## Align and translate

<img src='https://miro.medium.com/max/1400/1*wnXVyE8LXPfODvB_Z5vu8A.jpeg' width=1000>
<div align="center"><font size="3">Source: Google</font></div>

* Our attention model has a single-layer RNN encoder, again with 4 time steps. We denote the encoder’s input vectors by x1, x2, x3, x4 and the output vectors by h1, h2, h3, h4. The attention mechanism sits between the encoder and the decoder. Its input is composed of the encoder’s output vectors h1, h2, h3, h4 and the decoder states s0, s1, s2, s3; its output is a sequence of vectors called context vectors, denoted by c1, c2, c3, c4.
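
In the notation of the bullet above, the way each context vector is formed can be written compactly. This follows the formulation of Bahdanau et al.; the symbols $e_{ij}$ (alignment score), $\alpha_{ij}$ (attention weight) and the scoring function $a$ are introduced here for illustration and are not labelled in the figure.

```latex
e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{4} \exp(e_{ik})}, \qquad
c_i = \sum_{j=1}^{4} \alpha_{ij}\, h_j
```

Each context vector $c_i$ is therefore a weighted average of the encoder outputs, with weights that depend on the previous decoder state $s_{i-1}$.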

## Visual attention

Xu et al. (2015) proposed an attention framework that extends beyond the conventional Seq2Seq architecture. Their framework attempts to align the input image and output word, tackling the image captioning problem.

<img src='https://buomsoo-kim.github.io/data/images/2020-01-01/5.png' width=1000>
<div align="center"><font size="3">Source: Google</font></div>

> Accordingly, they utilized a convolutional layer to extract features from the image and aligned those features using an RNN with attention. The generated words (captions) are aligned with specific parts of the image, highlighting the relevant objects as below. Their framework is one of the earlier attempts to apply attention to problems other than neural machine translation.

## Hierarchical attention:

Yang et al. (2016) demonstrated with their hierarchical attention network (HAN) that attention can be effectively used on various levels. They also showed that the attention mechanism is applicable to classification problems, not just sequence generation.

<img src='https://buomsoo-kim.github.io/data/images/2020-01-01/7.png' width=700>
<div align="center"><font size="3">Source: Google</font></div>

> HAN comprises two encoder networks, i.e. word and sentence encoders. The word encoder processes each word and aligns it with the sentence of interest. Then, the sentence encoder aligns each sentence with the final output. HAN enables hierarchical interpretation of results as below. The user can understand (1) which sentence is crucial in classifying the document and (2) which part of the sentence, i.e. which words, are salient in that sentence.



<img src='https://buomsoo-kim.github.io/data/images/2020-01-01/8.png' width=700>
<div align="center"><font size="3">Source: Google</font></div>
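
The two-level idea can be sketched in a few lines. This is a simplified illustration rather than the HAN implementation: the recurrent word and sentence encoders are omitted, all parameters are random and untrained, and the helper name `attention_pool` is an assumption made for this example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(states, W, b, query):
    """Pool `states` (n, d) into one vector (d,) with a learned query vector."""
    u = np.tanh(states @ W + b)      # hidden representation of each state
    weights = softmax(u @ query)     # how relevant each state is
    return weights @ states, weights

rng = np.random.default_rng(0)
d = 8
# A toy document: 3 sentences with 5, 7 and 4 already-encoded word vectors
document = [rng.normal(size=(n_words, d)) for n_words in (5, 7, 4)]

# Separate parameters for the word level and the sentence level
W_word, b_word, q_word = rng.normal(size=(d, d)), np.zeros(d), rng.normal(size=d)
W_sent, b_sent, q_sent = rng.normal(size=(d, d)), np.zeros(d), rng.normal(size=d)

# Level 1: word attention turns each sentence into a single vector
sentence_vectors = np.stack(
    [attention_pool(words, W_word, b_word, q_word)[0] for words in document]
)
# Level 2: sentence attention turns the sentence vectors into a document vector
doc_vector, sentence_weights = attention_pool(sentence_vectors, W_sent, b_sent, q_sent)
```

`sentence_weights` is what would be inspected to see which sentence mattered most; the per-sentence word weights play the same role one level down.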

## 1.2 Types

Here is the summary of categories for attention mechanisms:

* `Self-Attention`: relating different positions of the same input sequence.
* `Global/Soft`: attending to the entire input state space.
* `Local/Hard`: attending to a part of the input state space, e.g. a patch of the input image.

There are other categories based on the alignment scores used in the attention, which we will see in the next section.
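
One common reading of the soft/hard distinction is sketched below: soft attention takes a weighted average over every location, while hard attention commits to a single location sampled from the same weights. The feature matrix and scores are random placeholders, and the variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# 5 input locations (e.g. image patches), each described by an 8-dim feature vector
features = rng.normal(size=(5, 8))
scores = rng.normal(size=5)          # unnormalised relevance of each location

# Softmax turns the scores into a probability distribution over locations
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Soft attention: every location contributes, in proportion to its weight
soft_context = weights @ features

# Hard attention: sample one location and use only its features
picked = rng.choice(len(features), p=weights)
hard_context = features[picked]
```

The soft version is differentiable end to end, whereas the sampling step in the hard version is not, which is why hard attention is usually trained with sampling-based estimators.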

# 2. Attention and its application in different architecture
## 2.1 Seq2seq architecture

## Bahdanau attention/Additive attention

<img src='https://blog.floydhub.com/content/images/2019/09/Slide38.JPG' width=1000>
<div align="center"><font size="3">Source: Google</font></div>



The entire step-by-step process of applying Attention in Bahdanau’s paper is as follows:

* Producing the Encoder Hidden States - the encoder produces a hidden state for each element in the input sequence
* Calculating Alignment Scores - alignment scores are calculated between the previous decoder hidden state and each of the encoder’s hidden states (Note: the last encoder hidden state can be used as the first hidden state in the decoder)
* Softmaxing the Alignment Scores - the alignment scores for each encoder hidden state are combined and represented in a single vector and subsequently softmaxed
* Calculating the Context Vector - the encoder hidden states are multiplied by their respective softmaxed alignment scores and summed to form the context vector
* Decoding the Output - the context vector is concatenated with the previous decoder output and fed into the Decoder RNN for that time step along with the previous decoder hidden state to produce a new output
* The process (steps 2-5) repeats itself for each time step of the decoder until an end token is produced or the output exceeds the specified maximum length

The image below summarizes the steps above:


<img src='https://miro.medium.com/max/1400/1*IoNs3pdgl57_HqRXufZ0lA.png' width=1000 height=1000>
<div align="center"><font size="3">Source: Google</font></div>
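
Steps 2 to 4 of the additive (Bahdanau) process can be sketched for a single decoder time step as follows. The weights are random and untrained, and the names `W_enc`, `W_dec` and `v` are assumptions chosen for this example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
seq_len, enc_dim, dec_dim, attn_dim = 4, 6, 5, 7

encoder_states = rng.normal(size=(seq_len, enc_dim))   # h_1 .. h_4
prev_decoder_state = rng.normal(size=dec_dim)          # s_{i-1}

# Parameters of the additive scoring function (random here, learned in practice)
W_enc = rng.normal(size=(enc_dim, attn_dim))
W_dec = rng.normal(size=(dec_dim, attn_dim))
v = rng.normal(size=attn_dim)

# Step 2: additive alignment score for every encoder state
scores = np.tanh(encoder_states @ W_enc + prev_decoder_state @ W_dec) @ v
# Step 3: softmax over the encoder positions
attn_weights = softmax(scores)
# Step 4: context vector = weighted sum of the encoder states
context = attn_weights @ encoder_states
```

The resulting `context` is what would be concatenated with the previous decoder output in step 5.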


## Luong attention/Multiplicative attention:




<img src='https://miro.medium.com/max/1400/1*ICeT6bTWmzUaGQkpKWVnLQ.png' width=1000 height=1000>
<div align="center"><font size="3">Source: Google</font></div>


The process is as below:

* Producing the Encoder Hidden States - the encoder produces a hidden state for each element in the input sequence
* Decoder RNN - the previous decoder hidden state and decoder output are passed through the Decoder RNN to generate a new hidden state for that time step
* Calculating Alignment Scores - using the new decoder hidden state and the encoder hidden states, alignment scores are calculated
* Softmaxing the Alignment Scores - the alignment scores for each encoder hidden state are combined and represented in a single vector and subsequently softmaxed
* Calculating the Context Vector - the encoder hidden states are multiplied by their respective softmaxed alignment scores and summed to form the context vector
* Producing the Final Output - the context vector is concatenated with the decoder hidden state generated in step 2 and passed through a fully connected layer to produce a new output
* The process (steps 2-6) repeats itself for each time step of the decoder until an end token is produced or the output exceeds the specified maximum length
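
The steps above can be sketched for one decoder time step. Note the ordering difference from the additive variant: the decoder RNN runs first, and its new state is what gets scored. The decoder cell here is a toy tanh RNN, the dot product is used as the score, and every weight name is an assumption for this example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
seq_len, hidden, emb, vocab = 4, 6, 3, 10

encoder_states = rng.normal(size=(seq_len, hidden))   # step 1
prev_state = rng.normal(size=hidden)                  # previous decoder hidden state
prev_output_emb = rng.normal(size=emb)                # embedding of the previous output

W_x = rng.normal(size=(emb, hidden))
W_h = rng.normal(size=(hidden, hidden))
W_c = rng.normal(size=(2 * hidden, hidden))
W_out = rng.normal(size=(hidden, vocab))

# Step 2: run the decoder RNN first to get the new hidden state
dec_state = np.tanh(prev_output_emb @ W_x + prev_state @ W_h)
# Step 3: multiplicative (dot-product) alignment scores
scores = encoder_states @ dec_state
# Step 4: softmax over encoder positions
attn_weights = softmax(scores)
# Step 5: context vector as the weighted sum of encoder states
context = attn_weights @ encoder_states
# Step 6: concatenate context and decoder state, then a fully connected layer
attn_hidden = np.tanh(np.concatenate([context, dec_state]) @ W_c)
output_probs = softmax(attn_hidden @ W_out)
```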

## 2.2 Different alignment scores/functions

## Alignment score functions

<img src='https://miro.medium.com/max/1400/1*oosK1XGaYr0AoSxfs9fx5A.png' width=1000 height=1000>
<div align="center"><font size="3">Source: Google</font></div>


Based on these notations, each attention variant has a different name, as follows:

<img src='https://miro.medium.com/max/1400/1*XzPD6cyrbWPP0r27PXVWOw.png' width=1000 height=1000>
<div align="center"><font size="3">Source: Google</font></div>
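
For reference, here is a sketch of the score functions that are most commonly listed in such tables (dot, scaled dot, general and concat), assuming these are the ones shown in the images above. `s` is a decoder state, `H` holds one encoder state per row, and the function names are chosen for this example.

```python
import numpy as np

def dot_score(s, H):
    """score(s, h_j) = s . h_j"""
    return H @ s

def scaled_dot_score(s, H):
    """Dot product divided by sqrt(dimension) to keep magnitudes stable."""
    return H @ s / np.sqrt(len(s))

def general_score(s, H, W):
    """score(s, h_j) = h_j^T W s, with a learned matrix W."""
    return H @ (W @ s)

def concat_score(s, H, W, v):
    """score(s, h_j) = v^T tanh(W [s; h_j]), the additive form."""
    s_tiled = np.tile(s, (len(H), 1))
    return np.tanh(np.concatenate([s_tiled, H], axis=1) @ W) @ v

rng = np.random.default_rng(0)
n, d, attn_dim = 4, 6, 5
s = rng.normal(size=d)
H = rng.normal(size=(n, d))

scores_dot = dot_score(s, H)
scores_scaled = scaled_dot_score(s, H)
scores_general = general_score(s, H, rng.normal(size=(d, d)))
scores_concat = concat_score(s, H, rng.normal(size=(2 * d, attn_dim)), rng.normal(size=attn_dim))
```

All four return one score per encoder state; they differ only in how many learned parameters sit between the two vectors.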


# 3. Some working gifs

## 3.1 Encoder-Decoder gif

<img src='https://miro.medium.com/max/1400/1*wBHsGZ-BdmTKS7b-BtkqFQ.gif' width=1000 height=1000>
<div align="center"><font size="3">Source: Google</font></div>

> As we can see, once the encoder hidden states are computed, the decoder hidden state from the previous time step i-1 is combined (multiplied or summed) with each of the encoder states to get the alignment scores. The scores are softmaxed to get the attention weights, each encoder state is multiplied by its weight, and the weighted states are summed to form the context vector, which is passed into the decoder for time step i.

# 4. Attention vs self attention 

Now let's see the major differences between attention and self-attention (heavily used in recent architectures):

* Attention is often applied to transfer information from the encoder to the decoder, i.e. decoder neurons receive additional input (via attention) from the encoder states/activations. In this case attention connects two different components, the encoder and the decoder. Self-attention does not connect two different components; it is applied within one component.

* Self-attention may be applied many times independently within a single model (e.g. 18 times in the Transformer, 12 times in BERT BASE), while attention is usually applied once in a model and connects two components (e.g. encoder and decoder).

* Self-attention is good at modeling dependencies between different parts of the same sequence, for example understanding the syntactic relations between words in a sentence. Attention, on the other hand, models only the dependencies between two different sequences (for example, the original text and its translation). Self-attention nevertheless also works well in translation tasks.
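
The difference can be sketched with a single dot-product attention function that is simply called with different arguments. The function name `attend` and the random inputs are assumptions for this example.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(queries, memory):
    """Dot-product attention: each query row gets a weighted average of memory rows."""
    weights = softmax_rows(queries @ memory.T)
    return weights @ memory, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))   # 5 source positions
decoder_states = rng.normal(size=(3, 8))   # 3 target positions

# Attention: queries come from the decoder, memory from the encoder
cross_out, cross_weights = attend(decoder_states, encoder_states)

# Self-attention: queries and memory are the same sequence
self_out, self_weights = attend(encoder_states, encoder_states)
```

`cross_weights` has one row per target position and one column per source position, while `self_weights` is square because the sequence attends to itself.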

# 5. Attention for text classification

## 5.1 Preparing the dataset

In [None]:
#Reproducing same results
SEED = 2019

#Torch
torch.manual_seed(SEED)

#Cuda algorithms
torch.backends.cudnn.deterministic = True  

In [None]:
vectors = Vectors(name='../input/glove6b/glove.6B.300d.txt')
vectors.dim

In [None]:
TEXT = data.Field(tokenize='spacy', lower=True,batch_first=True,include_lengths=True,fix_length=200,sequential=True)
LABEL = data.LabelField(dtype = torch.float,batch_first=True,) 

fields = [(None, None), ('text',TEXT),(None,None),('sentiment', LABEL)]

#loading custom dataset
training_data=data.TabularDataset(path = '../input/tweet-sentiment-extraction/train.csv',format = 'csv',fields = fields,skip_header = True)

#print preprocessed text
print(vars(training_data.examples[0]))

In [None]:
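# split() expects a state object from random.getstate(); random.seed() itself returns None
random.seed(SEED)
train_data, valid_data = training_data.split(split_ratio=0.7, random_state=random.getstate())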

In [None]:
#initialize glove embeddings
TEXT.build_vocab(train_data,min_freq=3,vectors =vectors)  
LABEL.build_vocab(train_data)

#No. of unique tokens in text
print("Size of TEXT vocabulary:",len(TEXT.vocab))

#No. of unique tokens in label
print("Size of LABEL vocabulary:",len(LABEL.vocab))

#Commonly used words
print(TEXT.vocab.freqs.most_common(10))  

#Word dictionary
print(TEXT.vocab.stoi)

In [None]:
#check whether cuda is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  

#set batch size
BATCH_SIZE = 64

#Load an iterator
train_iterator, valid_iterator = data.BucketIterator.splits(
    (train_data, valid_data), 
    batch_size = BATCH_SIZE,
    sort_key = lambda x: len(x.text),
    sort_within_batch=True,
    device = device)

## 5.2 Using attention

In [None]:
class AttentionModel(torch.nn.Module):  ## General attention
    def __init__(self, batch_size, output_size, hidden_size, vocab_size, embedding_length, weights):
        super(AttentionModel, self).__init__()

        """
        Arguments
        ---------
        batch_size : Size of the batch which is same as the batch_size of the data returned by the TorchText BucketIterator
        output_size : 3 = (pos, neg,neutral)
        hidden_size : Size of the hidden_state of the LSTM
        vocab_size : Size of the vocabulary containing unique words
        embedding_length : Embedding dimension of GloVe word embeddings
        weights : Pre-trained GloVe word_embeddings which we will use to create our word_embedding look-up table 

        --------

        """

        self.batch_size = batch_size
        self.output_size = output_size
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        self.embedding_length = embedding_length

        self.word_embeddings = nn.Embedding(vocab_size, embedding_length)
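        # nn.Embedding stores its table in `.weight`; assigning to `.weights` would leave the random initialisation in use
        self.word_embeddings.weight = nn.Parameter(weights, requires_grad=False)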
        self.lstm = nn.LSTM(embedding_length, hidden_size)
        self.label = nn.Linear(hidden_size, output_size)
        #self.attn_fc_layer = nn.Linear()

    def attention_net(self, lstm_output, final_state):

        """ 
        Now we will incorporate the attention mechanism in our LSTM model. In this new model, we use attention to compute a soft alignment score
        between each of the LSTM hidden states and the last hidden state. We will be using torch.bmm for the batch matrix multiplication.

        Arguments
        ---------

        lstm_output : Final output of the LSTM which contains the hidden layer output for each time step of the sequence.
        final_state : Final time-step hidden state (h_n) of the LSTM

        ---------

        Returns : It performs the attention mechanism by first computing a weight for each time step present in lstm_output and then computing the
                  new hidden state as the weighted sum.

        Tensor Size :
                    hidden.size() = (batch_size, hidden_size)
                    attn_weights.size() = (batch_size, num_seq)
                    soft_attn_weights.size() = (batch_size, num_seq)
                    new_hidden_state.size() = (batch_size, hidden_size)

        """

        hidden = final_state.squeeze(0)
        #print("++++",hidden.unsqueeze(2).shape)
        attn_weights = torch.bmm(lstm_output, hidden.unsqueeze(2)).squeeze(2)
        soft_attn_weights = F.softmax(attn_weights, 1)
        new_hidden_state = torch.bmm(lstm_output.transpose(1, 2), soft_attn_weights.unsqueeze(2)).squeeze(2)

        return new_hidden_state

    def forward(self, input_sentences, batch_size=None):

        """ 
        Parameters
        ----------
        input_sentences: input sentences of shape = (batch_size, num_sequences)
        batch_size : default = None. Used only for prediction on a single sentence after training (batch_size = 1)

        Returns
        -------
        Output of the linear layer containing logits for the pos, neg and neutral classes. Its input is the new_hidden_state, i.e. the output of the attention network.
        final_output.shape = (batch_size, output_size)

        """

        input = self.word_embeddings(input_sentences) #m,200,300
        input = input.permute(1, 0, 2)  #200,m,300

        # Fall back to the batch size given at construction time
        if batch_size is None:
            batch_size = self.batch_size
        # Create the initial states on the same device as the input (works on CPU and GPU)
        h_0 = torch.zeros(1, batch_size, self.hidden_size, device=input.device) #1,m,128
        c_0 = torch.zeros(1, batch_size, self.hidden_size, device=input.device) #1,m,128

        output, (final_hidden_state, final_cell_state) = self.lstm(input, (h_0, c_0)) # final_hidden_state.size() = (1, batch_size, hidden_size) 
        output = output.permute(1, 0, 2) # output.size() = (batch_size, num_seq, hidden_size)
        #print("--",output.size(),final_hidden_state.shape)
        attn_output = self.attention_net(output, final_hidden_state)
        logits = self.label(attn_output)

        return logits
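
To make the shapes inside `attention_net` easier to follow, here is a NumPy mirror of its two `torch.bmm` calls on random data. It is only a shape walk-through under small made-up sizes, not a replacement for the model code above.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, num_seq, hidden_size = 2, 5, 4

lstm_output = rng.normal(size=(batch_size, num_seq, hidden_size))
final_state = rng.normal(size=(1, batch_size, hidden_size))

hidden = final_state.squeeze(0)                                   # (batch, hidden)

# First bmm: dot product of every time step's output with the final state
attn_weights = (lstm_output @ hidden[:, :, None]).squeeze(2)      # (batch, num_seq)

# Softmax over the time dimension
exp = np.exp(attn_weights - attn_weights.max(axis=1, keepdims=True))
soft_attn_weights = exp / exp.sum(axis=1, keepdims=True)          # (batch, num_seq)

# Second bmm: weighted sum of the LSTM outputs
new_hidden_state = (
    lstm_output.transpose(0, 2, 1) @ soft_attn_weights[:, :, None]
).squeeze(2)                                                      # (batch, hidden)
```

The final state acts as the query here, so this is dot-product attention of the last hidden state over all time steps of the same LSTM.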

In [None]:
def clip_gradient(model, clip_value):
    params = list(filter(lambda p: p.grad is not None, model.parameters()))
    for p in params:
        p.grad.data.clamp_(-clip_value, clip_value)
    
def train_model(model, train_iter, epoch):
    total_epoch_loss = 0
    total_epoch_acc = 0
    model.to(device)  # `device` is defined in the iterator cell above; falls back to CPU
    
    optim = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()))
    steps = 0
    model.train()
    for idx, batch in enumerate(train_iter):
        
        text = batch.text[0]
        target = batch.sentiment.long()
     
        if torch.cuda.is_available():
            text = text.cuda()
            target = target.cuda()
            
        if text.size(0) != BATCH_SIZE:  # The last batch returned by BucketIterator can be smaller than BATCH_SIZE.
            continue
        
        optim.zero_grad()
        prediction = model(text)
        
        #print(prediction.shape,target.shape)
        loss = loss_fn(prediction, target)
        
        num_corrects = (torch.max(prediction, 1)[1].data == target.squeeze()).float().sum()
        acc = 100.0 * num_corrects/len(batch)
        loss.backward()
        clip_gradient(model, 1e-1)
        optim.step()
        steps += 1
        
        if steps % 500 == 0:
            print (f'Epoch: {epoch+1}, Idx: {idx+1}, Training Loss: {loss.item():.4f}, Training Accuracy: {acc.item(): .2f}%')
        
        total_epoch_loss += loss.item()
        total_epoch_acc += acc.item()
        
    return total_epoch_loss/len(train_iter), total_epoch_acc/len(train_iter)

def eval_model(model, val_iter):
    total_epoch_loss = 0
    total_epoch_acc = 0
    model.eval()
    
    with torch.no_grad():
        for idx, batch in enumerate(val_iter):
            text = batch.text[0]
            target = batch.sentiment.long()
            
            if text.size(0) != BATCH_SIZE:  # Skip the smaller final batch
                continue
            
            if torch.cuda.is_available():
                text = text.cuda()
                target = target.cuda()
                
            prediction = model(text)
            loss = loss_fn(prediction, target)
            num_corrects = (torch.max(prediction, 1)[1].data == target.squeeze()).float().sum()
            acc = 100.0 * num_corrects/len(batch)
            
            total_epoch_loss += loss.item()
            total_epoch_acc += acc.item()

    return total_epoch_loss/len(val_iter), total_epoch_acc/len(val_iter)

In [None]:
#define hyperparameters
learning_rate = 2e-5
batch_size = 64
output_size = 3
hidden_size = 128
embedding_length = 300

model = AttentionModel(batch_size, output_size, hidden_size, len(TEXT.vocab), embedding_length, TEXT.vocab.vectors)
loss_fn = torch.nn.CrossEntropyLoss()

In [None]:
for epoch in range(10):
    train_loss, train_acc = train_model(model, train_iterator, epoch)
    val_loss, val_acc = eval_model(model, valid_iterator)
    
    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc:.2f}%, Val. Loss: {val_loss:.3f}, Val. Acc: {val_acc:.2f}%')

## 5.3 Using self attention

In [None]:
class SelfAttention(nn.Module):
	def __init__(self, batch_size, output_size, hidden_size, vocab_size, embedding_length, weights):
		super(SelfAttention, self).__init__()

		"""
		Arguments
		---------
		batch_size : Size of the batch which is same as the batch_size of the data returned by the TorchText BucketIterator
		output_size : 3 = (pos, neg,neutral)
		hidden_size : Size of the hidden_state of the LSTM
		vocab_size : Size of the vocabulary containing unique words
		embedding_length : Embedding dimension of GloVe word embeddings
		weights : Pre-trained GloVe word_embeddings which we will use to create our word_embedding look-up table 
		
		--------
		
		"""

		self.batch_size = batch_size
		self.output_size = output_size
		self.hidden_size = hidden_size
		self.vocab_size = vocab_size
		self.embedding_length = embedding_length
		self.weights = weights

		self.word_embeddings = nn.Embedding(vocab_size, embedding_length)
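		# nn.Embedding stores its table in `.weight`; assigning to `.weights` would leave the random initialisation in use
		self.word_embeddings.weight = nn.Parameter(weights, requires_grad=False)
		# Note: nn.LSTM only applies dropout between stacked layers, so with a single layer this value has no effect (PyTorch warns about it)
		self.dropout = 0.8
		self.bilstm = nn.LSTM(embedding_length, hidden_size, dropout=self.dropout, bidirectional=True)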
		# We will use da = 350, r = 30 & penalization_coeff = 1 as per given in the self-attention original ICLR paper
		self.W_s1 = nn.Linear(2*hidden_size, 350)
		self.W_s2 = nn.Linear(350, 30)
		self.fc_layer = nn.Linear(30*2*hidden_size, 2000)
		self.label = nn.Linear(2000, output_size)

	def attention_net(self, lstm_output):

		"""
		Now we will use the self-attention mechanism to produce a matrix embedding of the input sentence in which every row represents an
		encoding of the input sentence that attends to a specific part of the sentence. We will use 30 such embeddings of 
		the input sentence, then concatenate all 30 sentence embedding vectors and connect them to a fully 
		connected layer of size 2000, which is connected to the output layer of size 3 returning logits for our three classes, i.e. 
		pos, neg and neutral.
		Arguments
		---------
		lstm_output = A tensor containing hidden states corresponding to each time step of the LSTM network.
		---------
		Returns : Final Attention weight matrix for all the 30 different sentence embedding in which each of 30 embeddings give
				  attention to different parts of the input sentence.
		Tensor size : lstm_output.size() = (batch_size, num_seq, 2*hidden_size)
					  attn_weight_matrix.size() = (batch_size, 30, num_seq)
		"""
		attn_weight_matrix = self.W_s2(torch.tanh(self.W_s1(lstm_output)))
		attn_weight_matrix = attn_weight_matrix.permute(0, 2, 1)
		attn_weight_matrix = F.softmax(attn_weight_matrix, dim=2)

		return attn_weight_matrix

	def forward(self, input_sentences, batch_size=None):

		""" 
		Parameters
		----------
		input_sentence: input_sentence of shape = (batch_size, num_sequences)
		batch_size : default = None. Used only for prediction on a single sentence after training (batch_size = 1)
		
		Returns
		-------
		Output of the linear layer containing logits for the pos, neg and neutral classes.
		
		"""

		input = self.word_embeddings(input_sentences)
		input = input.permute(1, 0, 2)
        
		# Fall back to the batch size given at construction time
		if batch_size is None:
			batch_size = self.batch_size
		# Two initial states (one per direction), created on the same device as the input
		h_0 = torch.zeros(2, batch_size, self.hidden_size, device=input.device)
		c_0 = torch.zeros(2, batch_size, self.hidden_size, device=input.device)

		output, (h_n, c_n) = self.bilstm(input, (h_0, c_0))
		output = output.permute(1, 0, 2)
		# output.size() = (batch_size, num_seq, 2*hidden_size)
		# h_n.size() = (2, batch_size, hidden_size)  (one state per direction)
		# c_n.size() = (2, batch_size, hidden_size)
		attn_weight_matrix = self.attention_net(output)
		# attn_weight_matrix.size() = (batch_size, r, num_seq)
		# output.size() = (batch_size, num_seq, 2*hidden_size)
		hidden_matrix = torch.bmm(attn_weight_matrix, output)
		# hidden_matrix.size() = (batch_size, r, 2*hidden_size)
		# Let's now concatenate the hidden_matrix and connect it to the fully connected layer.
		fc_out = self.fc_layer(hidden_matrix.view(-1, hidden_matrix.size()[1]*hidden_matrix.size()[2]))
		logits = self.label(fc_out)
		# logits.size() = (batch_size, output_size)

		return logits
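
The comment in `__init__` mentions a penalization coefficient, but the class above does not add a penalty to the loss. In the structured self-attention paper the penalty is the squared Frobenius norm of `A @ A.T - I`, which discourages the 30 attention rows from all focusing on the same words. Below is a NumPy sketch of that term; the function name `frobenius_penalty` is an assumption for this example, and wiring it into `train_model` is left out.

```python
import numpy as np

def frobenius_penalty(attn):
    """Squared Frobenius norm of (A A^T - I) for each example.

    attn: attention matrix of shape (batch, r, num_seq), rows summing to 1.
    """
    r = attn.shape[1]
    gram = attn @ attn.transpose(0, 2, 1)            # (batch, r, r)
    return ((gram - np.eye(r)) ** 2).sum(axis=(1, 2))

rng = np.random.default_rng(0)
batch, r, num_seq = 2, 30, 12

# Random attention matrix with a softmax over the sequence dimension
logits = rng.normal(size=(batch, r, num_seq))
exp = np.exp(logits - logits.max(axis=2, keepdims=True))
attn_weight_matrix = exp / exp.sum(axis=2, keepdims=True)

penalty = frobenius_penalty(attn_weight_matrix)      # one value per example
```

The penalty is zero only when every row puts all of its weight on a different position, so adding `penalization_coeff * penalty.mean()` to the loss would push the rows towards attending to different parts of the sentence.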

In [None]:


#define hyperparameters
learning_rate = 2e-5
batch_size = 64
output_size = 3
hidden_size = 128
embedding_length = 300

model = SelfAttention(batch_size, output_size, hidden_size, len(TEXT.vocab), embedding_length, TEXT.vocab.vectors)
loss_fn = torch.nn.CrossEntropyLoss()

In [None]:
for epoch in range(10):
    train_loss, train_acc = train_model(model, train_iterator, epoch)
    val_loss, val_acc = eval_model(model, valid_iterator)
    
    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc:.2f}%, Val. Loss: {val_loss:.3f}, Val. Acc: {val_acc:.2f}%')

# 6. Acknowledgements

1. https://blog.floydhub.com/attention-mechanism/
    
2. https://towardsdatascience.com/attn-illustrated-attention-5ec4ad276ee3

3. https://buomsoo-kim.github.io/attention/2020/01/01/Attention-mechanism-1.md/

4. https://github.com/prakashpandey9/Text-Classification-Pytorch
