# Attention is All You Need

Refer to the paper [```Attention is all you need```](https://arxiv.org/abs/1706.03762).

In [1]:
import torch
import torch.nn as nn


In [2]:
import math

## Input Embedding
* An input embedding is the numerical representation of a word. If the embedding dimension is 4, each word is represented by 4 numbers.

example:
* King:     ```-0.17``` ```-0.001``` ```-0.77``` ```0.91```
* Queen:     ```-0.28``` ```-0.011``` ```-0.88``` ```0.74```
* Mountain:  ```0.67``` ```0.691``` ```-0.007``` ```0.17```

Word embeddings are stored in this format. They start as random values of the given dimension and are learned during training, which gives the model a numerical form of each word that it can work with.
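
To make the example vectors a little more concrete, here is a small sketch in plain Python that compares them with cosine similarity. The `cosine_similarity` helper is an illustrative assumption, not part of the model; it only shows that vectors for related words (King, Queen) point in a similar direction while an unrelated word (Mountain) does not.

```python
import math

# Illustrative 4-dimensional vectors copied from the example above.
king = [-0.17, -0.001, -0.77, 0.91]
queen = [-0.28, -0.011, -0.88, 0.74]
mountain = [0.67, 0.691, -0.007, 0.17]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_king_queen = cosine_similarity(king, queen)
sim_king_mountain = cosine_similarity(king, mountain)
print(round(sim_king_queen, 2), round(sim_king_mountain, 2))
```

The King/Queen similarity comes out close to 1, while King/Mountain is close to 0.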

In [3]:
class InputEmbedding(nn.Module):

  def __init__(self, d_model: int, vocab_size: int) -> None:
    super().__init__()
    self.d_model = d_model # dimension of embedding (the more the size, more it can distinguish each word)
    self.vocab_size = vocab_size # total length/size of vocabulary
    self.embedding = nn.Embedding(vocab_size, d_model) # creates dictionary of word embeddings of size (vocab_size * dimension of embedding of model)


  def forward(self, x):
    return self.embedding(x) * math.sqrt(self.d_model)
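
As a rough illustration of what the class above does, the following plain-Python sketch performs the same two steps by hand: look up a row per token id, then scale by `sqrt(d_model)`. The three-row `embedding_table` and the token ids are hypothetical values chosen for the example.

```python
import math

d_model = 4  # embedding dimension, as in the example above

# Hypothetical 3-word vocabulary: row i is the vector for token id i.
embedding_table = [
    [-0.17, -0.001, -0.77, 0.91],   # id 0
    [-0.28, -0.011, -0.88, 0.74],   # id 1
    [0.67, 0.691, -0.007, 0.17],    # id 2
]


def embed(token_ids):
    """Look up each id and scale by sqrt(d_model), like InputEmbedding.forward."""
    scale = math.sqrt(d_model)
    return [[value * scale for value in embedding_table[i]] for i in token_ids]


sentence = [2, 0]            # token ids for a 2-word sentence
vectors = embed(sentence)    # shape: (seq_len=2, d_model=4)
```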


## Positional Encoding

* In RNNs and other sequential models, the words of a sentence are processed one after another, so the model picks up word order implicitly. A transformer processes all words at once, so positional encoding is used to mark the position of each word in the input sentence.
* It has the same dimension as the input embedding (`d_model`) and is added to the input embedding before any further computation.
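
The sinusoidal formula from the paper can be checked by hand with a short plain-Python sketch. The function name and the tiny sizes (`seq_length=3`, `d_model=4`) are assumptions made for illustration; the values should match what the PyTorch class in this section stores in its `pe` buffer.

```python
import math


def positional_encoding(seq_length, d_model):
    """Return a seq_length x d_model table of sinusoidal position values."""
    pe = [[0.0] * d_model for _ in range(seq_length)]
    for pos in range(seq_length):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even index: sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd index: cosine
    return pe


pe = positional_encoding(seq_length=3, d_model=4)
```

Position 0 always encodes to alternating 0 and 1, because `sin(0) = 0` and `cos(0) = 1`.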

In [4]:
class Positional_Embedding(nn.Module):

  def __init__(self, d_model: int, seq_length: int, dropout: float) -> None:
    super().__init__()
    self.d_model = d_model # dimension of positional embedding
    self.seq_length = seq_length # max length that a sentence can be
    self.dropout = nn.Dropout(dropout) # dropout so that it won't overfit

    # create a position vector of shape (seq_length,  1)
    position = torch.arange(0, seq_length, dtype=torch.float).unsqueeze(1)
    # Denominator in formula (using log for better stability and convergence)
    div_term = torch.exp(torch.arange(0,d_model,2).float() * (- math.log(10000.0)/ d_model))

    # Embedding matrix of shape (seq_length,  d_model)
    pe = torch.zeros(seq_length, d_model)

    # Apply sine to even feature indices and cosine to odd ones
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)

    # since there will be batches, adding another dimension in the embedding
    pe = pe.unsqueeze(0) # shape = (1, seq_length, d_model)


    # to save the embedding tensor along in the file when saving state of the model
    self.register_buffer("pe", pe)


  def forward(self, x):
    x = x + (self.pe[:, :x.shape[1], :]).requires_grad_(False) # add the positional encoding to each item in the batch; requires_grad_(False) keeps it fixed (not learned)
    return self.dropout(x)


## Layer Normalization

* Say we have a batch of 3 items/words. Each is represented as a tensor with its own mean and variance. Layer normalization normalizes each item across its own features, independently of the other items in the batch, so that each ends up with zero mean and unit variance.
* The normalized values may need to be rescaled or shifted, so we also introduce two learnable parameters: gamma (multiplicative) and beta (additive).
* We also add epsilon to the denominator to avoid division by zero.
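
A minimal plain-Python sketch of the computation for a single item, assuming scalar `alpha` (gamma) and `bias` (beta) and using the sample standard deviation, which is also the default of `torch.Tensor.std`. The input vector is a made-up example.

```python
import statistics


def layer_norm(x, alpha=1.0, bias=0.0, eps=1e-6):
    """Normalize one feature vector to roughly zero mean and unit spread."""
    mean = statistics.mean(x)
    std = statistics.stdev(x)  # sample std (divides by n - 1)
    return [alpha * (value - mean) / (std + eps) + bias for value in x]


word = [2.0, 4.0, 6.0, 8.0]   # one word's features
normalized = layer_norm(word)
```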

In [5]:
class LayerNormalization(nn.Module):

  def __init__(self, eps: float = 10**-6) -> None:
    super().__init__()

    self.eps = eps
    self.alpha = nn.Parameter(torch.ones(1)) # multiplicative (gamma); nn.Parameter makes it learnable
    self.bias = nn.Parameter(torch.zeros(1)) # additive

  def forward(self, x):
    mean = x.mean(dim=-1, keepdim=True) # mean over the last (feature) dimension; keepdim=True keeps the shape broadcastable
    std = x.std(dim=-1, keepdim=True)
    return self.alpha * ((x-mean)/(std + self.eps)) + self.bias




## Feed Forward Layer

```FFN(x) = max(0, xW1 + b1)W2 + b2```
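
To see the formula at work on numbers, here is a small plain-Python sketch for a single position. The weights and biases are hypothetical values with `d_model = 2` and `d_ff = 3`; dropout is left out.

```python
def linear(x, weight, bias):
    """Compute x @ weight + bias for a vector x and a (len(x) x n_out) matrix."""
    n_out = len(bias)
    return [
        sum(x[k] * weight[k][j] for k in range(len(x))) + bias[j]
        for j in range(n_out)
    ]


def feed_forward(x, w1, b1, w2, b2):
    """max(0, x W1 + b1) W2 + b2 for one position."""
    hidden = [max(0.0, value) for value in linear(x, w1, b1)]  # ReLU
    return linear(hidden, w2, b2)


# Hypothetical weights: d_model = 2, d_ff = 3.
w1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.5]
w2 = [[2.0, 1.0], [1.0, 1.0], [0.0, 3.0]]
b2 = [0.1, -0.1]

output = feed_forward([1.0, -2.0], w1, b1, w2, b2)
```

Here the hidden layer is `[1, -2, -2.5]` before ReLU and `[1, 0, 0]` after it, so only the first row of `w2` contributes to the output.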

In [6]:
class FeedForwardLayer(nn.Module):

  def __init__(self, d_model:int, d_ff:int, dropout:float) -> None:
    super().__init__()

    self.linear1 = nn.Linear(d_model, d_ff) # w1 and b1
    self.dropout = nn.Dropout(dropout)
    self.linear2 = nn.Linear(d_ff, d_model) # w2 and b2

  def forward(self, x):
    # we have batch of sentences of dimension (Batch, seq_len, d_model) converted to (Batch, seq_len, d_ff) and again into (Batch, seq_len, d_model)
    return self.linear2(self.dropout(torch.relu(self.linear1(x))))



## **MULTI-HEAD ATTENTION**

* Multi-head attention is the part of the transformer that gives it an edge over earlier sequential architectures for language.
* It computes attention scores between all the words in a sentence or text, which helps the model work out which parts are important and focus on them.
* The heads can be computed in parallel, which makes it fast.
* The same embeddings/tensors are used three times, as query, key and value, to calculate the attention scores.
* Query: What to look for. **Input * W_q**
* Key: What can I offer. **Input * W_k**
* Value: What I actually offer. **Input * W_v**
* h = number of heads (d_model must be exactly divisible by h)
* d_k = d_v = d_model/h = dimension of the key/value vectors per head
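
The core computation inside each head is scaled dot-product attention, `softmax(QK^T / sqrt(d_k)) V`. The following plain-Python sketch works through it for a single head on a made-up 3-token input; the helper names and the mask that hides the last token are assumptions for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, mask=None):
    """Scaled dot-product attention for a single head.

    queries, keys: seq_len x d_k lists; values: seq_len x d_v list.
    mask[i][j] == 0 hides position j from position i.
    """
    d_k = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if mask is not None:
            # A large negative score becomes ~0 after softmax.
            scores = [s if mask[i][j] else -1e9 for j, s in enumerate(scores)]
        weights.append(softmax(scores))
    d_v = len(values[0])
    output = [
        [sum(w[j] * values[j][c] for j in range(len(values))) for c in range(d_v)]
        for w in weights
    ]
    return output, weights


# Hypothetical 3-token input with d_k = d_v = 2, used as query, key and value.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mask = [[1, 1, 0]] * 3   # hide the last token from every position
output, weights = attention(x, x, x, mask=mask)
```

Each row of `weights` sums to 1, and the masked column receives essentially zero weight.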



In [7]:
class MultiHeadAttention(nn.Module):

  def __init__(self, d_model:int, h:int, dropout:float):
    super().__init__()
    self.d_model = d_model
    self.h = h
    self.dropout = nn.Dropout(dropout)

    assert d_model % h ==0, "Dimension of model is not exactly divisible by heads(h)"

    self.d_k = self.d_v = d_model // h

    self.w_q = nn.Linear(d_model, d_model) # weight matrix of query
    self.w_k = nn.Linear(d_model, d_model) # weight matrix of key
    self.w_v = nn.Linear(d_model, d_model) # weight matrix of value

    self.w_o = nn.Linear(d_model, d_model) # weight matrix of output



# Create a function for attention calculation
  @staticmethod # means that the function doesn't need an instance of MultiHeadAttention to be called
  def attention_calc(query, key, value, mask, dropout: nn.Dropout):
    d_k = query.shape[-1]

    attention_scores = (query @ key.transpose(2,3)) / math.sqrt(d_k)  # -> (Batch, h, seq_len, seq_len)

    if mask is not None:
      attention_scores.masked_fill_(mask == 0, -1e9) # large negative value stands in for -inf, so softmax gives these positions ~0 weight

    attention_scores = attention_scores.softmax(dim=-1) # (batch, h, seq_len, seq_len)

    if dropout is not None:
      attention_scores = dropout(attention_scores)

    return (attention_scores @ value) , attention_scores  # attention value  = (Attention).V -> (batch, h, seq_len, d_v)



  def forward(self, q, k, v, mask=None):

    # Pass through the pre-attention projection: # (batch, seq_len, d_model) -> (batch, seq_len, d_model)
    query = self.w_q(q)# Q'
    key = self.w_k(k) # K'
    value = self.w_v(v) # V'

    # Separate different heads: # (batch, seq_len, d_model) -> (batch, seq_len, h, d_k)
    query = query.view(query.shape[0], query.shape[1], self.h, self.d_k ) # view() reshapes the tensor without copying memory, similar to numpy's reshape().
    key = key.view(key.shape[0], key.shape[1], self.h, self.d_k )
    value = value.view(value.shape[0], value.shape[1], self.h, self.d_v )

    # Transpose for attention dot product: batch x h x seq_len x d_v
    query, key, value = query.transpose(1,2) , key.transpose(1,2), value.transpose(1,2)

    # Created query, key and value for all heads of size d_k or d_v

    # Attention Calculation
    x, self.attention_scores = MultiHeadAttention.attention_calc(query, key, value, mask, self.dropout)

    # right now we have h heads of matrices/ vectors of attention and we have to concatenate to get a output attention matrix
    # (batch, h, seq_len, d_v) -> (batch, seq_len, h, d_v)
    x = x.transpose(1,2)

    # (batch, seq_len, h, d_v) -> (batch, seq_len, d_model)
    x = x.contiguous().view(x.shape[0], -1, self.h*self.d_k)

    return self.w_o(x) # (batch, seq_len, d_model) -> (batch, seq_len, d_model)










## **Residual Connection**

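*   A skip connection adds each sublayer's input back to its output. Here the input is layer-normalized before it enters the sublayer, and dropout is applied to the sublayer output before the addition.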
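
A small plain-Python sketch of the pattern, assuming the normalize-first ordering `x + sublayer(norm(x))` used by the class in this section (the paper itself normalizes after the addition). The `double` function is a hypothetical stand-in for an attention or feed-forward sublayer, and dropout is omitted.

```python
import statistics


def layer_norm(x, eps=1e-6):
    """Zero-mean, unit-spread version of x (no learnable scale or shift)."""
    mean = statistics.mean(x)
    std = statistics.stdev(x)
    return [(value - mean) / (std + eps) for value in x]


def residual(x, sublayer):
    """Pre-norm residual connection: x + sublayer(norm(x))."""
    transformed = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, transformed)]


def double(v):
    """Hypothetical stand-in for an attention or feed-forward sublayer."""
    return [2.0 * value for value in v]


x = [1.0, 2.0, 3.0]
y = residual(x, double)
```

If the sublayer outputs zeros, the input passes through unchanged, which is what makes deep stacks easier to train.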




In [8]:
class ResidualConnection(nn.Module):

  def __init__(self,dropout:float) -> None:

    super().__init__()
    self.dropout = nn.Dropout(dropout)
    self.norm  = LayerNormalization()

  def forward(self, x, sublayer):
    return x + self.dropout(sublayer(self.norm(x)))






# **ENCODER**

The encoder is a stack of N identical blocks, each containing:
* Multi-head attention
* Feed forward
* Two Add & Norm steps

The output of one block is fed as input to the next, and the output of the last block is given to the decoder.
The first encoder block receives the input embeddings with the positional encodings added.
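
The encoder's `forward` also takes a source mask so that padding tokens do not take part in attention. The notebook does not show how that mask is built, so the following is only a sketch under assumptions: `PAD_ID = 0` and the token ids are hypothetical, and in PyTorch the mask would additionally need a shape that broadcasts against `(batch, h, seq_len, seq_len)`.

```python
PAD_ID = 0  # hypothetical id reserved for the padding token


def padding_mask(token_ids, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding."""
    return [1 if token != pad_id else 0 for token in token_ids]


# A 3-token sentence padded to seq_len = 5.
tokens = [17, 4, 92, PAD_ID, PAD_ID]
src_mask = padding_mask(tokens)
```

Positions where the mask is 0 are the ones the attention code fills with a large negative score.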

In [9]:
class EncoderBlock(nn.Module):

  def __init__(self, self_attention_block: MultiHeadAttention, feed_forward_block: FeedForwardLayer, dropout:float) -> None:

    super().__init__()
    self.self_attention_block = self_attention_block
    self.feed_forward_block = feed_forward_block
    self.residual_connections = nn.ModuleList([ResidualConnection(dropout) for _ in range(2)])

# we don't want the padded space/words to interact with other words so we will be applying mask here
  def forward(self, x, src_mask):

    # first residual connection wraps self-attention (query, key and value are all x); the second wraps the feed-forward layer
    x = self.residual_connections[0](x, lambda x: self.self_attention_block(x,x,x,src_mask))
    x = self.residual_connections[1](x, self.feed_forward_block)

    return x





# Encoder has Nx EncoderBlocks
class Encoder(nn.Module):

  def __init__(self, layers:nn.ModuleList) -> None:
    super().__init__()
    self.layers = layers
    self.norm = LayerNormalization()

  def forward(self, x, mask):
    for layer in self.layers:
      x = layer(x, mask)
    return self.norm(x)


# **Decoder**

Each decoder block has:
* Masked multi-head self-attention
* Multi-head cross-attention: query from the decoder, key and value from the encoder
* Three Add & Norm steps
* Feed forward layer

The decoder receives the output embeddings (shifted right) with the positional encodings added.
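
Two ideas here are easy to show with plain lists: the causal mask behind "masked" attention, and what "shifted right" means for the decoder input. This is a sketch under assumptions: the special-token ids `SOS_ID` and `EOS_ID` and the target ids are hypothetical and depend on the tokenizer in use.

```python
def causal_mask(size):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(size)] for i in range(size)]


# Hypothetical special-token ids; real values depend on the tokenizer.
SOS_ID, EOS_ID = 1, 2

target = [45, 8, 61]                 # token ids of the target sentence
decoder_input = [SOS_ID] + target    # shifted right: starts with SOS
labels = target + [EOS_ID]           # what the model should predict at each step

trg_mask = causal_mask(len(decoder_input))
```

At step `i` the decoder sees `decoder_input[0..i]` and is trained to predict `labels[i]`, so it never looks at the token it is supposed to produce.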

In [10]:
class DecoderBlock(nn.Module):

  def __init__(self, self_attention_block: MultiHeadAttention, cross_attention_block: MultiHeadAttention, feed_forward_block: FeedForwardLayer, dropout:float) -> None:

    super().__init__()
    self.self_attention_block = self_attention_block
    self.cross_attention_block = cross_attention_block
    self.feed_forward_block = feed_forward_block

    self.residual_connection = nn.ModuleList([ResidualConnection(dropout) for _ in range(3)])


  def forward(self, x, encoder_op, src_mask, trg_mask): # src mask from encoder and trg=target mask for decoder
    x = self.residual_connection[0](x, lambda x: self.self_attention_block(x,x,x,trg_mask))
    x = self.residual_connection[1](x, lambda x: self.cross_attention_block(x,encoder_op, encoder_op, src_mask))
    x = self.residual_connection[2](x, self.feed_forward_block)

    return x










class Decoder(nn.Module):

  def __init__(self, layers:nn.ModuleList):
    super().__init__()

    self.layers = layers
    self.norm = LayerNormalization()

  def forward(self, x, encoder_op, src_mask, trg_mask):
    for layer in self.layers:
      x = layer(x, encoder_op, src_mask, trg_mask)
    return self.norm(x)


## Linear Projection Layer

* The linear layer comes after the decoder.
* It projects each position from *`d_model`* dimensions to *`vocab_size`* scores, one per vocabulary word.
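
After the projection, the scores are usually turned into log-probabilities. Below is a plain-Python sketch of log-softmax using the log-sum-exp trick, which is one common way to keep the computation stable for large scores. The `logits` values and the 4-word vocabulary are made up for the example.

```python
import math


def log_softmax(logits):
    """log(softmax(logits)) computed with the log-sum-exp trick."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_sum for v in logits]


# Hypothetical scores for a 4-word vocabulary at one position.
logits = [2.0, 1.0, 0.1, -1.0]
log_probs = log_softmax(logits)
predicted_id = max(range(len(log_probs)), key=lambda i: log_probs[i])
```

Exponentiating `log_probs` gives probabilities that sum to 1, and the highest-scoring entry is the predicted token id.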

In [11]:
class ProjectionLayer(nn.Module):

  def __init__(self, d_model:int, vocab_size:int):
    super().__init__()
    self.proj = nn.Linear(d_model, vocab_size)


  def forward(self, x):
    # (batch, seq_len, d_model) --> (batch, seq_len, vocab_size)
    # Using log_softmax instead of softmax for numerical stability
    return torch.log_softmax(self.proj(x), dim=-1)

# **TRANSFORMER**

**Putting it all together.**


In [12]:
class Transformer(nn.Module):

  def __init__(self, encoder:Encoder, decoder:Decoder, src_embed:InputEmbedding, trg_embed:InputEmbedding, src_pos:Positional_Embedding, trg_pos:Positional_Embedding, projection_layer:ProjectionLayer) -> None:
    super().__init__()

    self.encoder = encoder
    self.decoder = decoder
    self.src_embed = src_embed
    self.trg_embed = trg_embed
    self.src_pos = src_pos
    self.trg_pos = trg_pos
    self.projection_layer = projection_layer

  def encode(self, src, src_mask):
    src = self.src_embed(src)
    src = self.src_pos(src)
    return self.encoder(src, src_mask)



  def decode(self, encoder_op, src_mask, trg, trg_mask):
    trg = self.trg_embed(trg)
    trg = self.trg_pos(trg)
    return self.decoder(trg,encoder_op, src_mask, trg_mask)


  def projection(self, x):
    return self.projection_layer(x)




## Transformer Builder function

Build transformer for translation task with hyperparameters.

In [15]:
def build_transformer(src_vocab_size:int, trg_vocab_size:int, src_seq_len:int, trg_seq_len:int, d_model:int=512, N:int=6, h:int=8, dropout:float=0.1, d_ff:int=2048) -> Transformer:
    # Create the embedding layers
    src_embed = InputEmbedding(d_model, src_vocab_size)
    tgt_embed = InputEmbedding(d_model, trg_vocab_size)

    # Create the positional encoding layers
    src_pos = Positional_Embedding(d_model, src_seq_len, dropout)
    tgt_pos = Positional_Embedding(d_model, trg_seq_len, dropout)

    # Create the encoder blocks
    encoder_blocks = []
    for _ in range(N):
        encoder_self_attention_block = MultiHeadAttention(d_model, h, dropout)
        feed_forward_block = FeedForwardLayer(d_model, d_ff, dropout)
        encoder_block = EncoderBlock(encoder_self_attention_block, feed_forward_block, dropout)
        encoder_blocks.append(encoder_block)

    # Create the decoder blocks
    decoder_blocks = []
    for _ in range(N):
        decoder_self_attention_block = MultiHeadAttention(d_model, h, dropout)
        decoder_cross_attention_block = MultiHeadAttention(d_model, h, dropout)
        feed_forward_block = FeedForwardLayer(d_model, d_ff, dropout)
        decoder_block = DecoderBlock(decoder_self_attention_block, decoder_cross_attention_block, feed_forward_block, dropout)
        decoder_blocks.append(decoder_block)

    # Create the encoder and decoder
    encoder = Encoder(nn.ModuleList(encoder_blocks))
    decoder = Decoder(nn.ModuleList(decoder_blocks))

    # Create the projection layer
    projection_layer = ProjectionLayer(d_model, trg_vocab_size)

    # Create the transformer
    transformer = Transformer(encoder, decoder, src_embed, tgt_embed, src_pos, tgt_pos, projection_layer)

    # Initialize the parameters
    for p in transformer.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)

    return transformer


### Transformer Model is obtained.

This function returns a transformer model with the parameters provided and can be used for translation tasks.
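
To get a feel for the model size with the default hyperparameters, the learnable parameters of one encoder block can be counted by hand. This is a back-of-the-envelope sketch based on the classes defined in this notebook (four linear layers per attention block, two per feed-forward layer, and a scalar `alpha` and `bias` per `LayerNormalization`); the helper name `linear_params` is an assumption.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one nn.Linear(n_in, n_out)."""
    return n_in * n_out + n_out


d_model, d_ff = 512, 2048  # defaults of build_transformer

attention_params = 4 * linear_params(d_model, d_model)   # w_q, w_k, w_v, w_o
feed_forward_params = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
layer_norm_params = 2                                     # alpha and bias, one scalar each

# One encoder block: attention + feed forward + two residual connections,
# each residual connection holding its own LayerNormalization.
encoder_block_params = attention_params + feed_forward_params + 2 * layer_norm_params
print(encoder_block_params)  # 3150340
```

With `N = 6`, the six encoder blocks alone hold roughly 18.9 million parameters, before counting the decoder, embeddings and projection layer.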

References:

* [Yu-Hsiang Huang Github](https://github.com/jadore801120/attention-is-all-you-need-pytorch)
* [Umar Jamil Github](https://github.com/hkproj/pytorch-transformer)

With what I already knew about transformers, implementing one in code helped me understand it in more depth and cleared up the doubts I had.