## Transformer Encoder (with Scaled Dot-Product Attention) from Scratch

## First, What is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers. The name itself gives us several clues to what BERT is all about.

BERT architecture consists of several Transformer encoders stacked together. Each Transformer encoder encapsulates two sub-layers: a self-attention layer and a feed-forward layer.

### There are two different BERT models:

- BERT base, which consists of 12 Transformer encoder layers, 12 attention heads, a hidden size of 768, and 110M parameters.

- BERT large, which consists of 24 Transformer encoder layers, 16 attention heads, a hidden size of 1024, and 340M parameters.
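
As a rough sanity check on the parameter counts above, the sketch below tallies the weights of a BERT-style encoder from its hyperparameters. The vocabulary size (30,522), the 512 positions, and the feed-forward width of 4 × hidden size match the `bert-base-uncased` configuration printed later in this notebook. The helper name `count_bert_params` is made up for illustration, and reusing the same vocabulary size for the large model is an assumption.

```python
def count_bert_params(hidden, layers, vocab=30522, max_pos=512, type_vocab=2):
    """Approximate parameter count of a BERT-style encoder (hypothetical helper)."""
    intermediate = 4 * hidden  # feed-forward width, 4x the hidden size

    # Embeddings: token + position + segment tables, plus one LayerNorm (weight + bias).
    embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden

    # Self-attention: query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward: two linear layers with biases.
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    # Two LayerNorms per encoder layer.
    layer_norms = 2 * (2 * hidden)
    per_layer = attention + feed_forward + layer_norms

    # Pooler: one dense layer applied to the [CLS] hidden state.
    pooler = hidden * hidden + hidden

    return embeddings + layers * per_layer + pooler


base = count_bert_params(hidden=768, layers=12)
large = count_bert_params(hidden=1024, layers=24)
print(f"BERT base:  {base / 1e6:.1f}M parameters")
print(f"BERT large: {large / 1e6:.1f}M parameters")
```

Note that the number of attention heads does not appear in the count: the heads split the hidden size between them rather than adding to it.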



### BERT Input and Output

The BERT model expects a sequence of tokens (words or subword pieces) as input. In each sequence, BERT expects two special tokens:

- [CLS]: This is the first token of every sequence, which stands for classification token.
- [SEP]: This token tells BERT where one sequence ends and the next begins. It matters mainly for next sentence prediction and question-answering tasks. If we only have one sequence, this token is appended to the end of the sequence.


It is also important to note that the maximum number of tokens that can be fed into the BERT model is 512. If a sequence has fewer than 512 tokens, we can use padding to fill the unused token slots with the [PAD] token. If a sequence has more than 512 tokens, we need to truncate it.

And that’s all that BERT expects as input.
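
The special tokens, padding, and truncation described above can be sketched in plain Python. This is only an illustration, not how the Hugging Face tokenizer is implemented: the helper name `prepare_sequence` is hypothetical, and the attention mask (1 for real tokens, 0 for padding) is an addition not discussed in the text above.

```python
def prepare_sequence(tokens, max_len=512):
    """Wrap tokens in [CLS]/[SEP], then truncate or pad to exactly max_len (illustrative helper)."""
    # Reserve two slots for the special tokens, truncating the text if needed.
    body = tokens[: max_len - 2]
    sequence = ["[CLS]"] + body + ["[SEP]"]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [1] * len(sequence)
    padding = max_len - len(sequence)
    sequence += ["[PAD]"] * padding
    attention_mask += [0] * padding
    return sequence, attention_mask


# A short sequence gets padded; a long one gets truncated. max_len=8 keeps the output readable.
short_seq, short_mask = prepare_sequence(["the", "aircraft", "flies"], max_len=8)
long_seq, long_mask = prepare_sequence(["tok"] * 20, max_len=8)
print(short_seq)  # ['[CLS]', 'the', 'aircraft', 'flies', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
print(short_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
print(long_seq)
```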

The BERT base model then outputs an embedding vector of size 768 for each token. We can use these vectors as input for different kinds of NLP applications, such as text classification, next sentence prediction, Named-Entity Recognition (NER), or question-answering.


------------

**For a text classification task**, we focus our attention on the embedding vector output for the special [CLS] token. This means that we use the embedding vector of size 768 from the [CLS] token as input to our classifier, which then outputs a vector whose size equals the number of classes in our classification task.

-----------------------

![Imgur](https://imgur.com/NpeB9vb.png)

-------------------------

![](assets/2022-07-04-21-46-12.png)

![](assets/2022-07-08-02-58-52.png)

![](assets/2022-07-08-03-07-18.png)


![](assets/2022-07-08-03-08-46.png)

In [20]:
!pip install bertviz transformers -q

In [None]:
#hide_output
from transformers import AutoTokenizer

from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show

model_ckpt = 'bert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

text = "As the aircraft becomes lighter, it flies higher in air of lower density to maintain the same airspeed."

model = BertModel.from_pretrained(model_ckpt)

# show(model, 'bert', tokenizer, text, display_mode='light', layer=0, head=8)
# The line above is commented out because GitHub does NOT render the bertviz plot, and with it enabled nothing below this line was rendered in the notebook at all.



## Tokenization

In [22]:
inputs = tokenizer(text, return_tensors='pt', add_special_tokens=False)

inputs.input_ids

tensor([[ 2004,  1996,  2948,  4150,  9442,  1010,  2009, 10029,  3020,  1999,
          2250,  1997,  2896,  4304,  2000,  5441,  1996,  2168, 14369, 25599,
          1012]])

In [23]:
from torch import nn
from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_ckpt)
config

BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [24]:
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)
token_emb

Embedding(30522, 768)

In [25]:
inputs_embeds = token_emb(inputs.input_ids)
print(inputs_embeds)
inputs_embeds.size()

tensor([[[ 0.6336, -1.8910,  0.5547,  ..., -0.5884, -1.5728, -0.3602],
         [ 0.4609, -0.6503,  0.2237,  ..., -0.0850, -1.1233,  0.2282],
         [ 0.6410, -0.9294,  1.3242,  ...,  1.4695, -1.0404,  0.1205],
         ...,
         [ 0.0038, -0.0762,  0.4624,  ...,  1.0815, -0.2495, -0.1189],
         [ 0.2080,  0.9414, -0.3307,  ...,  0.3679, -0.7962,  0.9216],
         [ 0.0468, -0.0521, -0.5550,  ...,  0.7277,  1.0729, -1.5228]]],
       grad_fn=<EmbeddingBackward0>)


torch.Size([1, 21, 768])

In [26]:
import torch
from math import sqrt

query = key = value = inputs_embeds

dim_k = key.size(-1)

scores = torch.bmm(query, key.transpose(1, 2)) /sqrt(dim_k)

scores.size()

torch.Size([1, 21, 21])

#### This has created a 21 × 21 matrix of attention scores per sample in the batch, one score for each pair of the 21 tokens.


-----------------

## torch.bmm() function 

The torch.bmm() function performs a batch matrix-matrix product that simplifies the
computation of the attention scores where the query and key vectors have the
shape [batch_size, seq_len, hidden_dim]. 

If we ignored the batch dimension we could calculate the dot product between each query and key vector by simply transposing the key tensor to have the shape [hidden_dim, seq_len] and then using the matrix product (i.e. torch.matmul()) to collect all the dot products in a [seq_len, seq_len] matrix. 

### Here, however, we want to do this for every sequence in the batch independently, so we use torch.bmm(), which takes two batches of matrices and multiplies each matrix in the first batch with the corresponding matrix in the second batch.
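
To make the batching behaviour concrete, here is a small check on random tensors (the sizes are arbitrary and chosen only for the example). It compares `torch.bmm()` with a loop of ordinary 2-D matrix products. It also shows that `torch.matmul()` broadcasts over the leading batch dimension, so for 3-D inputs the two calls agree; `torch.bmm()` is simply stricter, accepting only 3-D tensors.

```python
import torch

torch.manual_seed(0)
batch_size, seq_len, hidden_dim = 2, 4, 8
query = torch.randn(batch_size, seq_len, hidden_dim)
key = torch.randn(batch_size, seq_len, hidden_dim)

# Batched product: one [seq_len, seq_len] score matrix per sequence.
bmm_scores = torch.bmm(query, key.transpose(1, 2))

# The same result computed one sequence at a time with plain 2-D matrix products.
loop_scores = torch.stack([query[i] @ key[i].T for i in range(batch_size)])

# torch.matmul broadcasts over leading batch dimensions, so it agrees as well.
matmul_scores = torch.matmul(query, key.transpose(1, 2))

print(bmm_scores.shape)
print(torch.allclose(bmm_scores, loop_scores, atol=1e-5))
print(torch.allclose(bmm_scores, matmul_scores, atol=1e-5))
```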


In [27]:
import torch.nn.functional as F

weights = F.softmax(scores, dim=-1)  # normalize over the key positions so each query's weights sum to 1

weights.sum(dim=-1)

tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1.]], grad_fn=<SumBackward1>)

In [28]:
attn_outputs = torch.bmm(weights, value)

attn_outputs.shape

torch.Size([1, 21, 768])

In [29]:
def scaled_dot_product_attention(query, key, value):
    """
    Compute scaled dot product attention.

    Args:
        query (torch.Tensor): Query tensor of shape (batch_size, seq_len_q, dim_q).
        key (torch.Tensor): Key tensor of shape (batch_size, seq_len_k, dim_k).
        value (torch.Tensor): Value tensor of shape (batch_size, seq_len_v, dim_v).

    Returns:
        torch.Tensor: Output tensor after applying scaled dot product attention of shape (batch_size, seq_len_q, dim_v).

    """
    # Size of the key vectors, used to scale the dot products.
    dim_k = key.size(-1)
    # Attention scores: dot product of every query with every key, divided by sqrt(dim_k).
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    # Normalize the scores with softmax along the last dimension (the key positions, seq_len_k).
    weights = F.softmax(scores, dim=-1)
    # Apply the weights to the values with a batch matrix multiplication. The result has shape
    # (batch_size, seq_len_q, dim_v): for each query, a weighted average of the value vectors
    # based on the query's similarity to each key.
    return torch.bmm(weights, value)
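
For intuition, the same computation can be written out for a single sequence with plain Python lists. This is a teaching sketch only: the function names and the tiny query, key, and value vectors are made up, and it is far slower than the tensor version above.

```python
from math import exp, sqrt


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_by_hand(queries, keys, values):
    """Scaled dot-product attention for one sequence, using plain lists."""
    dim_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # One scaled dot product per key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / sqrt(dim_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
outputs, weights = attention_by_hand(queries, keys, values)
for row in weights:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 6))
```

Each query ends up with one weight per key, and each row of weights sums to 1. The first query matches the first and third keys equally well, so they receive equal weight.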

In [30]:
class AttentionHead(nn.Module):
    """
    Attention head module for the Transformer model. Encapsulates the operations required to compute attention within a single attention head of the Transformer model.

    Args:
        embed_dim (int): Dimensionality of the input embeddings.
        head_dim (int): Dimensionality of the attention head.

    """
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)
        
    def forward(self, hidden_state):
        """
        Perform forward pass through the attention head.

        Args:
            hidden_state (torch.Tensor): Input tensor of shape (batch_size, seq_len, embed_dim).

        Returns:
            torch.Tensor: Output tensor after applying scaled dot product attention of shape (batch_size, seq_len, head_dim).

        """
        attn_outputs = scaled_dot_product_attention(
            self.q(hidden_state), self.k(hidden_state), self.v(hidden_state)
        )
        return attn_outputs

In [31]:
class MultiHeadAttention(nn.Module):
    """
    Multi-head attention module for the Transformer model. Combines the outputs of multiple attention heads and applies a linear transformation to produce the final output of the attention mechanism in the Transformer model.

    Args:
        config (object): Configuration object containing model parameters.

    """
    def __init__(self, config):
        super().__init__()
        embed_dim = config.hidden_size
        num_heads = config.num_attention_heads
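        head_dim = embed_dim // num_heads  # dimensionality of each head
        
        # Create one AttentionHead per head, each projecting down to head_dim
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_dim) for _ in range(num_heads)]
        )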
        self.output_linear = nn.Linear(embed_dim, embed_dim)
        
    def forward(self, hidden_state):
        """
        Perform forward pass through the multi-head attention module.
        
        For each attention head, the input tensor is passed through the corresponding AttentionHead instance, and the outputs are concatenated along the last dimension. The concatenated output is then passed through the output_linear layer to obtain the final output tensor.

        Args:
            hidden_state (torch.Tensor): Input tensor of shape (batch_size, seq_len, embed_dim).

        Returns:
            torch.Tensor: Output tensor after applying multi-head attention and linear transformation
                of shape (batch_size, seq_len, embed_dim).

        """
        concatenated_output = torch.cat([h(hidden_state) for h in self.heads], dim=-1 )
        concatenated_output = self.output_linear(concatenated_output)
        return concatenated_output
        

In [31]:
multihead_attn = MultiHeadAttention(config)
attn_outputs = multihead_attn(inputs_embeds)
attn_outputs.size()
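
The shape bookkeeping behind multi-head attention is easy to check in isolation. In the usual setup each head works in a reduced dimension `head_dim = embed_dim // num_heads`, so concatenating the head outputs restores the model width. The snippet below uses random tensors as stand-ins for the head outputs, with BERT base sizes and the 21-token sequence from above.

```python
import torch

embed_dim, num_heads = 768, 12
head_dim = embed_dim // num_heads  # 768 // 12 = 64 dimensions per head

# Stand-ins for the outputs of 12 independent heads on a 21-token sequence.
head_outputs = [torch.randn(1, 21, head_dim) for _ in range(num_heads)]

# Concatenating along the last dimension restores the model width.
concatenated = torch.cat(head_outputs, dim=-1)
print(head_dim, concatenated.shape)
```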

In [31]:
from bertviz import head_view

from transformers import AutoModel

model = AutoModel.from_pretrained(model_ckpt, output_attentions=True)

sentence_a = text = "As the aircraft becomes lighter, it flies higher in air of lower density to maintain the same airspeed."

sentence_b = "The corn fields are full of flies"

viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
# return_tensors='pt' above requests PyTorch tensors; use 'tf' for TensorFlow.

attention = model(**viz_inputs).attentions

sentence_b_start = (viz_inputs.token_type_ids == 0).sum(dim=1)

tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])

head_view(attention, tokens, sentence_b_start, heads=[8])  # bert-base has 12 heads, so valid indices are 0-11

### token_type_ids

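token_type_ids: a list of token type ids to be fed to the model. For a pair of sequences, the ids are 0 for every token of the first sequence and 1 for every token of the second.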

https://huggingface.co/transformers/v2.11.0/main_classes/tokenizer.html

https://huggingface.co/transformers/v3.2.0/glossary.html#token-type-ids
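
The layout of `token_type_ids` for a sentence pair can be sketched without the tokenizer. The helper `build_pair_inputs` below is hypothetical and works on words that are already split. It only mirrors the convention that segment 0 covers [CLS], the first sentence, and the first [SEP], while segment 1 covers the second sentence and the final [SEP]. Counting the zeros gives the index where the second sentence starts, which is the idea behind `sentence_b_start` in the cell above.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Lay out a sentence pair the way BERT expects (illustrative helper, not a library call)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair_inputs(["it", "flies", "higher"], ["full", "of", "flies"])
# Counting the zeros gives the index at which sentence B starts.
sentence_b_start = token_type_ids.count(0)
print(tokens)
print(token_type_ids)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
print(sentence_b_start, tokens[sentence_b_start])  # 5 full
```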


### convert_ids_to_tokens

convert_ids_to_tokens(ids: Union[int, List[int]], skip_special_tokens: bool = False) → Union[str, List[str]]

Converts a single index or a sequence of indices (integers) into a token or a sequence of tokens (str), using the vocabulary and added tokens.
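
The behaviour described by that signature can be mimicked on a toy vocabulary. Everything in this sketch is made up for illustration: the ids are not BERT's real ids, and `toy_convert_ids_to_tokens` is a stand-in, not the library method.

```python
from typing import Dict, List, Union

# Toy vocabulary for illustration only; the ids are made up and are not BERT's real ids.
TOY_VOCAB: Dict[int, str] = {0: "[PAD]", 1: "[CLS]", 2: "[SEP]", 3: "it", 4: "flies", 5: "higher"}
SPECIAL_TOKENS = {"[PAD]", "[CLS]", "[SEP]"}


def toy_convert_ids_to_tokens(
    ids: Union[int, List[int]], skip_special_tokens: bool = False
) -> Union[str, List[str]]:
    """Map a single id to a token, or a list of ids to a list of tokens."""
    if isinstance(ids, int):
        return TOY_VOCAB[ids]
    tokens = [TOY_VOCAB[i] for i in ids]
    if skip_special_tokens:
        # Drop [CLS], [SEP] and [PAD] from the result.
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    return tokens


print(toy_convert_ids_to_tokens(4))  # flies
print(toy_convert_ids_to_tokens([1, 3, 4, 5, 2]))
print(toy_convert_ids_to_tokens([1, 3, 4, 5, 2, 0], skip_special_tokens=True))
```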

--------

This visualization shows the attention weights as lines connecting the token
whose embedding is getting updated (left) with every word that is being
attended to (right). The intensity of the lines indicates the strength of the
attention weights, with dark lines representing values close to 1, and faint lines
representing values close to 0.


In this example, the input consists of two sentences and the [CLS] and [SEP]
tokens are the special tokens in BERT’s tokenizer. 

One thing we can see from the visualization is that the attention
weights are strongest between words that belong to the same sentence, which
suggests BERT can tell that it should attend to words in the same sentence.
However, the word “flies” is worth a closer look: it is a verb in the first
sentence (“it flies higher”) and a noun in the second (“full of flies”). Attention
weights to the surrounding context words are what allow the model to distinguish
the use of “flies” as a verb or noun, depending on the context in which it occurs!



In [31]:
class FeedForward(nn.Module):
    """
    This class implements the Feed Forward neural network layer within the Transformer model.
    
    Feed Forward layer is a crucial part of the Transformer's architecture, responsible for the actual 
    transformation of the input data. It consists of two linear layers with a GELU activation function 
    in between, followed by a dropout layer for regularization.

    Parameters
    ----------
    config : object
        The configuration object containing model parameters. It should have the following attributes:
        - hidden_size: The size of the hidden layer in the transformer model.
        - intermediate_size: The size of the intermediate layer in the Feed Forward network.
        - hidden_dropout_prob: The dropout probability for the hidden layer.

    Attributes
    ----------
    linear1 : torch.nn.Module
        The first linear transformation layer.
    linear2 : torch.nn.Module
        The second linear transformation layer.
    gelu : torch.nn.Module
        The Gaussian Error Linear Unit (GELU) activation function.
    dropout : torch.nn.Module
        The dropout layer for regularization.
    """
    def __init__(self, config):
        super().__init__()
        self.linear1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        
    def forward(self, x):
        """
        Defines the computation performed at every call.

        Parameters
        ----------
        x : torch.Tensor
            The input tensor to the Feed Forward network layer.

        Returns
        -------
        x : torch.Tensor
            The output tensor after passing through the Feed Forward network layer.
        """
        x = self.linear1(x)
        x = self.gelu(x)
        x = self.linear2(x)
        x = self.dropout(x)
        return x

## Class definition of nn.Linear in PyTorch

`CLASS torch.nn.Linear(in_features, out_features, bias=True)`

Applies a linear transformation to the incoming data: `y = x*W^T + b`

Parameters:

 - **in_features** – size of each input sample (i.e. size of x)
 - **out_features** – size of each output sample (i.e. size of y)

---

Note that a feed-forward layer such as nn.Linear is usually applied to a tensor of
shape (batch_size, input_dim), where it acts on each element of the batch
dimension independently. 

This is actually true for any dimension except the last one, so when we pass a tensor of shape (batch_size, seq_len, hidden_dim) the layer is applied to all token embeddings of the batch and sequence independently, which is exactly what we want. Let’s test this by passing the attention outputs:


In [31]:
feed_forward = FeedForward(config)

ff_outputs = feed_forward(attn_outputs)

ff_outputs.size()
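
The claim that `nn.Linear` acts only on the last dimension can also be checked directly. The sketch below uses small made-up sizes. It applies one linear layer to a whole `(batch_size, seq_len, hidden_dim)` tensor, then applies it to each token embedding separately, and compares the two results.

```python
import torch
from torch import nn

torch.manual_seed(0)
linear = nn.Linear(8, 4)
x = torch.randn(2, 5, 8)  # (batch_size, seq_len, hidden_dim)

# Apply the layer to the whole tensor at once.
batched = linear(x)

# Apply it to every token embedding separately and reassemble the result.
per_token = torch.stack(
    [torch.stack([linear(x[b, t]) for t in range(x.size(1))]) for b in range(x.size(0))]
)

print(batched.shape)
print(torch.allclose(batched, per_token, atol=1e-5))
```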

In [31]:
class TransformerEncoderLayer(nn.Module):
    """
    This class implements the Transformer Encoder Layer as part of the Transformer model.
    
    Each encoder layer consists of a Multi-Head Attention mechanism followed by a Position-wise 
    Feed Forward neural network. Layer normalization is applied to the input of each of the two 
    sub-layers, and a residual connection is added around each of them (pre-layer normalization).

    Parameters
    ----------
    config : object
        The configuration object containing model parameters. It should have the following attributes:
        - hidden_size: The size of the hidden layer in the transformer model.

    Attributes
    ----------
    layer_norm_1 : torch.nn.Module
        The first layer normalization.
    layer_norm_2 : torch.nn.Module
        The second layer normalization.
    attention : MultiHeadAttention
        The MultiHeadAttention mechanism in the encoder layer.
    feed_forward : FeedForward
        The FeedForward neural network in the encoder layer.
    """
    def __init__(self, config):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
        self.attention = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)
        
    def forward(self, x):
        """
        Defines the computation performed at every call.

        Parameters
        ----------
        x : torch.Tensor
            The input tensor to the Transformer Encoder Layer.

        Returns
        -------
        x : torch.Tensor
            The output tensor after passing through the Transformer Encoder Layer.
        """
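        # Pre-layer normalization: normalize the input before the attention sub-layer
        hidden_state = self.layer_norm_1(x)
        # Apply attention with a skip (residual) connection
        x = x + self.attention(hidden_state)
        # Apply the feed-forward layer, again normalized first, with a skip connection
        x = x + self.feed_forward(self.layer_norm_2(x))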
        return x

In [None]:
encoder_layer = TransformerEncoderLayer(config)
inputs_embeds.shape, encoder_layer(inputs_embeds).size()
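
The `forward` method above normalizes the input before each sub-layer, an arrangement usually called pre-layer normalization. The original Transformer and BERT instead normalize after adding the residual (post-layer normalization). The sketch below contrasts the two orderings. The `ResidualBlock` wrapper, the toy hidden size of 16, and the single linear layer used as the sub-layer are all made up for illustration.

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """Wraps a sub-layer with a residual connection and layer normalization (illustrative)."""

    def __init__(self, hidden_size, sublayer, pre_norm=True):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.sublayer = sublayer
        self.pre_norm = pre_norm

    def forward(self, x):
        if self.pre_norm:
            # Pre-layer normalization: normalize, apply the sub-layer, then add the skip connection.
            return x + self.sublayer(self.norm(x))
        # Post-layer normalization: add the skip connection first, then normalize.
        return self.norm(x + self.sublayer(x))


torch.manual_seed(0)
x = torch.randn(1, 21, 16)
pre_ln = ResidualBlock(16, nn.Linear(16, 16), pre_norm=True)
post_ln = ResidualBlock(16, nn.Linear(16, 16), pre_norm=False)
pre_out, post_out = pre_ln(x), post_ln(x)
print(pre_out.shape, post_out.shape)
```

Both variants keep the tensor shape unchanged. With post-layer normalization the block's output is itself normalized, whereas with pre-layer normalization the residual stream is passed through untouched.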

In [None]:
class Embeddings(nn.Module):
    """
    This class implements the Embeddings layer as part of the Transformer model.
    
    The Embeddings layer is responsible for converting input tokens and their corresponding positions 
    into dense vectors of fixed size. The token embeddings and position embeddings are summed up 
    and subsequently layer-normalized and passed through a dropout layer for regularization.

    Parameters
    ----------
    config : object
        The configuration object containing model parameters. It should have the following attributes:
        - vocab_size: The size of the vocabulary.
        - hidden_size: The size of the hidden layer in the transformer model.
        - max_position_embeddings: The maximum number of positions that the model can accept.

    Attributes
    ----------
    token_embeddings : torch.nn.Module
        The embedding layer for the tokens.
    position_embeddings : torch.nn.Module
        The embedding layer for the positions.
    layer_norm : torch.nn.Module
        The layer normalization.
    dropout : torch.nn.Module
        The dropout layer for regularization.
    """
    def __init__(self, config):
        super().__init__()
        self.token_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout()
        
    def forward(self, input_ids):
        """
        Defines the computation performed at every call.

        Parameters
        ----------
        input_ids : torch.Tensor
            The input tensor to the Embeddings layer, typically the token ids.

        Returns
        -------
        embeddings : torch.Tensor
            The output tensor after passing through the Embeddings layer.
        """
        seq_length = input_ids.size(1)
        # Create position ids 0..seq_length-1 on the same device as the input
        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device).unsqueeze(0)
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        embeddings = token_embeddings + position_embeddings
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings


In [None]:
embedding_layer = Embeddings(config)
embedding_layer(inputs.input_ids).size()
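
To see what the position embeddings contribute, the sketch below repeats the token-plus-position sum on a toy scale. The table sizes and token ids are made up, and layer normalization and dropout are left out to keep the comparison exact. The same token id placed at two different positions ends up with two different embeddings, and the difference comes entirely from the position table.

```python
import torch
from torch import nn

torch.manual_seed(0)
token_table = nn.Embedding(100, 8)    # toy vocabulary of 100 tokens
position_table = nn.Embedding(16, 8)  # toy maximum of 16 positions

input_ids = torch.tensor([[5, 7, 7], [7, 5, 9]])  # (batch_size=2, seq_len=3)
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)  # (1, seq_len)

# The (1, seq_len, 8) position embeddings broadcast across the batch dimension.
embeddings = token_table(input_ids) + position_table(position_ids)

# Token 7 appears at positions 1 and 2 of the first sequence; only the position term differs.
difference = embeddings[0, 1] - embeddings[0, 2]
position_difference = position_table.weight[1] - position_table.weight[2]
print(embeddings.shape)
print(torch.allclose(difference, position_difference, atol=1e-5))
```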

In [None]:
class TransformerEncode(nn.Module):
    """
    This class implements the Transformer Encoder as part of the Transformer model.
    
    The Transformer Encoder consists of a series of identical layers, each with a self-attention mechanism 
    and a position-wise fully connected feed-forward network. The input token ids are first processed by 
    the Embeddings layer, which converts the tokens and their corresponding positions into dense vectors of 
    fixed size; the result is then passed through each encoder layer in turn.

    Parameters
    ----------
    config : object
        The configuration object containing model parameters. It should have the following attributes:
        - num_hidden_layers: The number of hidden layers in the encoder.

    Attributes
    ----------
    embeddings : Embeddings
        The embedding layer which converts input tokens and positions into dense vectors.
    layers : torch.nn.ModuleList
        The list of Transformer Encoder Layers.
    """
    def __init__(self, config):
        super().__init__()
        self.embeddings = Embeddings(config)
        # Initialize a list of Transformer Encoder Layers. The number of layers is defined by config.num_hidden_layers
        self.layers = nn.ModuleList([TransformerEncoderLayer(config) for _ in range(config.num_hidden_layers)])
        
    def forward(self, x):
        """
        Defines the computation performed at every call.

        Parameters
        ----------
        x : torch.Tensor
            The input tensor to the Transformer Encoder.

        Returns
        -------
        x : torch.Tensor
            The output tensor after passing through the Transformer Encoder.
        """
        x = self.embeddings(x)
        for layer in self.layers:
            x = layer(x)
        return x

In [None]:
encoder = TransformerEncode(config)
encoder(inputs.input_ids).size()

### Adding a Classification Head

In [None]:
class TransformerForSequenceClassification(nn.Module):
    """
    This class implements the Transformer model for sequence classification tasks.
    
    The model architecture consists of a Transformer encoder, followed by a dropout layer for regularization, 
    and a linear layer for classification. The output from the [CLS] token's embedding is used for the classification task.

    Parameters
    ----------
    config : object
        The configuration object containing model parameters. It should have the following attributes:
        - hidden_size: The size of the hidden layer in the transformer model.
        - hidden_dropout_prob: The dropout probability for the hidden layer.
        - num_labels: The number of labels in the classification task.

    Attributes
    ----------
    encoder : TransformerEncode
        The Transformer Encoder.
    dropout : torch.nn.Module
        The dropout layer for regularization.
    classifier : torch.nn.Module
        The classification layer.
    """
    def __init__(self, config):
        super().__init__()
        self.encoder = TransformerEncode(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        
    def forward(self, x):
        """
        Defines the computation performed at every call.

        Parameters
        ----------
        x : torch.Tensor
            The input tensor to the Transformer model.

        Returns
        -------
        x : torch.Tensor
            The output tensor after passing through the Transformer model and the classification layer.
        """
        x = self.encoder(x)[:, 0, :]  # select the hidden state of the [CLS] token
        x = self.dropout(x)
        x = self.classifier(x)
        return x
  
config.num_labels = 3
encoder_classifier = TransformerForSequenceClassification(config)
encoder_classifier(inputs.input_ids).size()      
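
The classifier above returns one unnormalized logit per class. A common next step is to turn the logits into probabilities with softmax and pick the largest one. The sketch below does this in plain Python. The logit values are made up, since the untrained model above would produce arbitrary numbers.

```python
from math import exp


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.2, -0.3, 0.4]  # made-up logits for a 3-class problem
probs = softmax(logits)
# The predicted class is the index of the largest probability.
predicted_class = max(range(len(probs)), key=lambda i: probs[i])
print([round(p, 3) for p in probs], predicted_class)
```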