# Intuition Of Multi-Head Attention

#### Decoding Multi-Head Attention:

* <a href="#attention-layer-components">Attention Layer And Components</a>
* <a href="#steps-multi-head-attention"> Steps To Compute Multihead Attention </a>
* <a href="#implementation"> Implementing Multihead Attention Module </a>

## Attention Layer and components
<a name="attention-layer-components"></a>

![Attention layer](https://miro.medium.com/max/2000/1*9nUzdaTbKzJrAsq1qqJNNA.png)

#### Three linear layers:
* A linear layer is a neural network layer that applies a learned affine transformation (a weight matrix plus a bias) with no activation function. It maps input vectors to output vectors and can change their dimensionality, for example by projecting them to a lower dimension. The weights of the linear layer are learned during training and can be updated during downstream tasks.


* The Attention layer in a Transformer uses three such linear layers, one each for the Query, Key, and Value.


* The Attention layer takes its input in the form of three parameters: Query, Key, and Value. Each of them represents every word of the input sequence as a vector, and all three have a similar structure.


* The input to each of these linear layers already includes the positional embeddings, which are added earlier in the model, and each linear layer has its own weights.


* The outputs of the Query and Key layers are used to measure the similarity between the input tokens. This similarity is then used to weight the Values, producing the final output of the Attention layer.

* #### Let's look at the intuition of how Transformers use similarity to find the relevant words based on Queries and Keys:
    
    The similarity of two vectors can be measured with the cosine similarity:
    
    $$\text{cosine similarity} = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \lVert \mathbf{b} \rVert} $$

    Similarly, the similarity between two matrices can be computed as a dot product, with the norms replaced by a fixed scaling factor:

    $$\text{similarity}(\mathbf{A},\mathbf{B}) = \frac{\mathbf{A} \mathbf{B}^T}{\sqrt{\text{scaling}}} $$

    Applying this to the Queries and Keys, where $$ d $$ is the dimension of the key vectors:

    $$\text{similarity}(\mathbf{Q},\mathbf{K}) = \frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d}} $$

* The dot product between the queries and keys is computed first; the resulting matrix of scores is termed the attention filter. Initially the scores in the attention filter are essentially random values, but after training they hold a better meaning.


* Then, the attention scores are scaled by dividing them by the square root of the dimension of the key vectors (as in the paper "Attention Is All You Need"). This keeps the variance of the dot product $$ QK^T $$ from growing with the dimension, which would otherwise push the softmax into regions with very small gradients.


* Using the softmax function, the scores are transformed into values between 0 and 1, with each row of scores summing to 1.

* The transformed attention scores are then multiplied with the value matrix to obtain the final attention filter, which in turn is termed an *Attention Head*.


$$\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d}}\right) \mathbf{V} $$

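The effect of the scaling step can be checked numerically. The sketch below is illustrative only: it assumes that the components of the queries and keys are independent standard-normal values (the assumption the paper uses to motivate the scaling), and `d` and `n` are arbitrary sizes chosen here.

```python
import math
import torch

torch.manual_seed(0)
d = 64        # assumed key/query dimension
n = 10_000    # number of random (query, key) pairs, arbitrary

q = torch.randn(n, d)
k = torch.randn(n, d)

# one dot product per (query, key) pair
raw_scores = (q * k).sum(dim=-1)
scaled_scores = raw_scores / math.sqrt(d)

raw_var = raw_scores.var().item()        # grows with d (close to d here)
scaled_var = scaled_scores.var().item()  # stays close to 1
print(f"variance of raw scores:    {raw_var:.2f}")
print(f"variance of scaled scores: {scaled_var:.2f}")
```

Under these assumptions the unscaled scores have a variance of roughly `d`, while the scaled scores stay near 1 regardless of the dimension.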

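The formula above can be traced on a small example. This is a minimal sketch rather than a reference implementation; the helper name `scaled_dot_product_attention` and the tensor sizes are assumptions made for illustration.

```python
import math
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v):
    """Hypothetical helper: softmax(Q K^T / sqrt(d)) V for a single head."""
    d = q.size(-1)
    scores = q @ k.transpose(-1, -2) / math.sqrt(d)  # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)              # each row sums to 1
    return weights @ v, weights


torch.manual_seed(0)
seq_len, d = 4, 8  # illustrative sizes
q = torch.randn(seq_len, d)
k = torch.randn(seq_len, d)
v = torch.randn(seq_len, d)

head, weights = scaled_dot_product_attention(q, k, v)
print(head.shape)     # one output vector per token
print(weights.shape)  # one weight per (query token, key token) pair
```

Each row of `weights` says how much one token attends to every other token, and `head` is the corresponding weighted average of the value vectors.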
* The Attention module repeats its computations multiple times in parallel. Each of these is called an Attention Head.


* The Attention module splits its Query, Key, and Value parameters N ways and passes each split independently through a separate Head. The outputs of these Attention calculations are then combined to produce the final Attention output. This is called Multi-head attention, and it gives the Transformer greater power to encode multiple relationships and nuances for each word. (Note: the paper "Attention Is All You Need" uses 8 such attention heads.)



* The multi-head attention mechanism linearly projects the queries, keys, and values multiple times, using a different learned projection each time.



* The single attention mechanism is then applied to each of these projections in parallel to produce the head outputs, which in turn are concatenated and projected again through a linear layer to produce the final result.

 
 
* The idea behind multi-head attention is to allow the attention function to extract information from different representation subspaces, which would otherwise be impossible with a single attention head.


$$\text{MultiHead}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h) \, W^O $$

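For reference, each head in the formula above can be written out explicitly. The notation follows the paper "Attention Is All You Need"; the concrete dimensions are the ones reported there for the base model and are included only as an example.

$$\text{head}_i = \text{Attention}(\mathbf{Q} W_i^Q, \mathbf{K} W_i^K, \mathbf{V} W_i^V), \quad i = 1, \ldots, h $$

$$W_i^Q \in \mathbb{R}^{d_{model} \times d_k}, \quad W_i^K \in \mathbb{R}^{d_{model} \times d_k}, \quad W_i^V \in \mathbb{R}^{d_{model} \times d_v}, \quad W^O \in \mathbb{R}^{h d_v \times d_{model}} $$

With $$ d_{model} = 512 $$ and $$ h = 8 $$, each head works with $$ d_k = d_v = d_{model} / h = 64 $$ dimensions, so the concatenation of all heads has $$ d_{model} $$ dimensions again.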


## Steps To Compute Multihead Attention:
<a name="steps-multi-head-attention"></a>


The step-by-step procedure for computing multi-head attention is as follows:

* Compute the linearly projected versions of the queries, keys, and values through multiplication with the respective weight matrices $$ W_i^Q, W_i^K, W_i^V $$, one set for each $$ \text{head}_i $$.

* Apply the single attention function for each head by

    1. multiplying the queries and keys matrices 
    
    2. applying the scaling and softmax operations, and 
    
    3. weighting the values matrix to generate an output for each head.
    

* Concatenate the outputs of the heads, $$ \text{head}_i, \; i = 1, 2, \ldots, h $$.

* Apply a linear projection to the concatenated output through multiplication with the weight matrix, to generate the final result of multi-head attention.

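The steps above can be sketched with explicit per-head weight matrices. This is an illustrative sketch under assumed sizes, not the module implemented in the next section: the names `W_q`, `W_k`, `W_v`, `W_o` and the random initialization are assumptions, and biases are omitted for brevity.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, h = 5, 16, 4     # illustrative sizes
d_k = d_model // h                 # dimension per head

x = torch.randn(seq_len, d_model)  # self-attention: Q, K and V all come from x

# Step 1: one set of projection matrices per head (randomly initialized here)
W_q = torch.randn(h, d_model, d_k)
W_k = torch.randn(h, d_model, d_k)
W_v = torch.randn(h, d_model, d_k)
W_o = torch.randn(h * d_k, d_model)

heads = []
for i in range(h):
    q, k, v = x @ W_q[i], x @ W_k[i], x @ W_v[i]  # each (seq_len, d_k)
    # Step 2: scaled dot product, softmax, then weighting of the values
    scores = q @ k.T / math.sqrt(d_k)             # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)
    heads.append(weights @ v)                     # (seq_len, d_k)

# Step 3: concatenate the heads along the feature dimension
concat = torch.cat(heads, dim=-1)                 # (seq_len, h * d_k)

# Step 4: final linear projection
output = concat @ W_o                             # (seq_len, d_model)
print(concat.shape, output.shape)
```

The explicit loop is written for readability; practical implementations usually compute all heads at once with a single batched matrix multiplication.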
## Implementing Multihead Attention Module:
<a name="implementation"></a>

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as Func


class MultiheadAttention(nn.Module):

    def __init__(self, **kwargs):
        super().__init__()
        self.input_dim = kwargs.get("input_dim")
        self.d_model = kwargs.get("d_model")
        self.num_heads = kwargs.get("num_heads")
        self.head_dim = self.d_model // self.num_heads
        ## a single fused linear layer that produces the Query, Key and Value projections together
        self.qkv_layer = nn.Linear(self.input_dim, 3 * self.d_model)
        ## final layer for projecting the concatenated outputs obtained from all the heads
        self.linear_layer = nn.Linear(self.d_model, self.d_model)

        ## placeholders for the query, key and value tensors (set in compute_attention)
        self.query = None
        self.key = None
        self.value = None

        ## flag to check if it is an encoder or a decoder
        self.is_decoder = kwargs.get("is_decoder")

    def scaled_dot_product(self):
        d_k = self.query.size()[-1]
        scaled = torch.matmul(self.query, self.key.transpose(-1, -2)) / math.sqrt(d_k)

        # Masking is done only for decoders
        if self.is_decoder:
            # causal mask: -inf strictly above the diagonal, 0 elsewhere
            mask = torch.full(scaled.size(), float('-inf'), device=scaled.device)
            mask = torch.triu(mask, diagonal=1)
            # adding the mask so that future positions get zero weight after the softmax
            scaled += mask

        self_attention = Func.softmax(scaled, dim=-1)
        values = torch.matmul(self_attention, self.value)
        return values, self_attention

    def compute_attention(self, input_vector):
        batch_size, sequence_length, input_dim = input_vector.size()
        qkv = self.qkv_layer(input_vector)

        ## splitting the projection into num_heads heads, each holding its own (query, key, value)
        qkv = qkv.reshape(batch_size, sequence_length, self.num_heads, 3 * self.head_dim)
        qkv = qkv.permute(0, 2, 1, 3)
        self.query, self.key, self.value = qkv.chunk(3, dim=-1)

        ## obtaining the attention outputs and the attention weights
        values, attention = self.scaled_dot_product()
        ## moving the heads axis next to head_dim before concatenating the heads;
        ## reshaping directly from (batch, heads, seq, head_dim) would mix up token positions
        values = values.permute(0, 2, 1, 3)
        values = values.reshape(batch_size, sequence_length, self.num_heads * self.head_dim)
        ## projecting over a linear layer to obtain the final multihead attention values
        output_vector = self.linear_layer(values)
        return output_vector

    def get_attention_details(self, output_vector):
        print(f"The Query,Key,Value layer is ----> {self.qkv_layer} \n")
        print(f"The Query vector is ----> {self.query.size()} \n")
        print(f"The key vector is ----> {self.key.size()} \n")
        print(f"The value vector is ----> {self.value.size()} \n")
        print(f"The multihead output vector is ----> {output_vector.size()} \n")
        print(f"Multi-head Attention is at decoder ----> {self.is_decoder} \n")
```

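One detail worth checking when merging the heads: the per-head outputs have shape (batch, heads, sequence, head_dim), so the heads axis has to be moved next to head_dim before reshaping to (batch, sequence, d_model). The small sketch below uses made-up sizes and a hypothetical labelled tensor to show that permuting first keeps each token's features together, while reshaping directly mixes features from different positions.

```python
import torch

batch, heads, seq_len, head_dim = 1, 2, 3, 2  # illustrative sizes

# values[b, h, s, :] is filled with the token index s, so every feature
# that belongs to token s carries the label s
values = torch.arange(seq_len, dtype=torch.float32).view(1, 1, seq_len, 1)
values = values.expand(batch, heads, seq_len, head_dim)

# move heads next to head_dim, then merge the two axes
merged = values.permute(0, 2, 1, 3).reshape(batch, seq_len, heads * head_dim)
# reshape without permuting first
naive = values.reshape(batch, seq_len, heads * head_dim)

print(merged[0])  # row s contains only the label s
print(naive[0])   # rows contain labels from several tokens
```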

```python
## CONSTANTS

INPUT_DIM = 1024    ## Dimension of the input
MODEL_DIM = 512     ## Dimension of the vectors in multihead attention
NUM_HEADS = 8       ## Number of attention heads in multihead attention
SEQ_LEN = 5         ## Sequence length of the input
BATCH_SIZE = 30     ## Batch size

## creating a random input vector
input_vector = torch.randn((BATCH_SIZE, SEQ_LEN, INPUT_DIM))
```

```python
## to compute the multihead attention in a decoder, pass is_decoder=True
multi_head = MultiheadAttention(input_dim=INPUT_DIM, d_model=MODEL_DIM, num_heads=NUM_HEADS, is_decoder=False)

output_vector = multi_head.compute_attention(input_vector)

multi_head.get_attention_details(output_vector)
```

```
The Query,Key,Value layer is ----> Linear(in_features=1024, out_features=1536, bias=True) 

The Query vector is ----> torch.Size([30, 8, 5, 64]) 

The key vector is ----> torch.Size([30, 8, 5, 64]) 

The value vector is ----> torch.Size([30, 8, 5, 64]) 

The multihead output vector is ----> torch.Size([30, 5, 512]) 

Multi-head Attention is at decoder ----> False 
```

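To see what `is_decoder=True` changes, the mask construction used in `scaled_dot_product` can be run in isolation. This is a small sketch with an arbitrary sequence length and all-zero placeholder scores, chosen here only to make the resulting weights easy to read.

```python
import torch
import torch.nn.functional as Func

seq_len = 4                              # arbitrary, for illustration
scores = torch.zeros(seq_len, seq_len)   # placeholder attention scores

# -inf strictly above the diagonal, 0 on and below it
mask = torch.triu(torch.full(scores.size(), float('-inf')), diagonal=1)
weights = Func.softmax(scores + mask, dim=-1)
print(weights)
```

Row `i` of `weights` spreads its weight equally over positions `0..i` and assigns zero weight to every later position, so a token in the decoder cannot attend to tokens that come after it.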
