In [1]:
!pip install bertviz

Collecting bertviz
  Downloading bertviz-1.4.0-py3-none-any.whl (157 kB)
Installing collected packages: bertviz
Successfully installed bertviz-1.4.0

## <b>1 <span style='color:#F1A424'>|</span> Transformer Encoder</b> 

***

### <b><span style='color:#F1A424'>Background</span></b>

First, some background on the notebook.

In this notebook, we'll look at the following components:

<div style=" background-color:#3b3745; padding: 13px 13px; border-radius: 8px; color: white">
    
<ul>
<li>Simple Attention</li>
<li>Multi-Head Self Attention</li>
<li>Feed Forward Layer</li>
<li>Normalisation</li>
<li>Skip Connection</li>
<li>Position Embeddings</li>
<li>Transformer Encoder</li>
<li>Classifier Head</li>
</ul> 
</div> 

### <b><span style='color:#F1A424'>Encoder Base</span></b>

The encoder, simply put:
- Converts a **series of tokens** into a **series of embedding vectors** (the hidden state)
- The encoder (a neural network) consists of **multiple layers** (**blocks**) stacked together

The encoder structure:
- Composed of multiple encoder layers (blocks) stacked one after another (similar to CNN layer stacks)
- Each encoder block contains a **multi-head self-attention** layer & a **fully connected feed-forward layer** (applied to each input embedding)

Purpose of the encoder:
- Input tokens are encoded & updated into a form that **stores contextual information** about the sequence

The example we'll use:

> the bark of a palm tree is very rough


### <b><span style='color:#F1A424'>Classification Head</span></b>

- Transformers can be utilised for various applications, so they are created in a base form
- If we want to utilise them for a specific task, we add an extra component, a **head**, to the transformer
- In this example, we'll utilise it for **classification** purposes, and look at how we can combine the base with the **head**


<br>

## <b>2 <span style='color:#F1A424'>|</span> Simple Self-Attention</b> 

***

### <b><span style='color:#F1A424'>Types of Attention</span></b>

**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention</mark>**
 
- Mechanism which allows networks to assign **different weight distributions to each element** in a sequence 
- Elements in sequence - `token embeddings` (each token mapped to a vector of fixed dimension) (eg. BERT model - 768 dimensions)
 
 
**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">self-attention</mark>**

- Instead of using a fixed embedding for each token, we use the whole sequence to compute, for each token, a **weighted average** of the `embeddings`
- One can think of self-attention as a form of averaging
- The most common form of `self-attention` is **scaled dot-product attention**


### <b><span style='color:#F1A424'>Four Main Steps</span></b>

- Project each `token embedding` into three vectors **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">query</mark>**,**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">key</mark>**,**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">value</mark>**
- Compute **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention scores</mark>** (nxn)

    - We determine how much the query & key vectors relate to each other using a similarity function
    - For scaled dot-product attention, the similarity function is the dot product
    - Queries & keys that are similar will have a large dot product, and vice versa
    - Output of this step: the attention scores
    
    
- Compute **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention weight</mark>** (wij)

    - Dot products can produce arbitrarily large numbers
    - The attention scores are first multiplied by a scaling factor to normalise their variance
    - They are then normalised with a softmax so that the weights for each query (each row of the matrix) sum to 1
    
    
- Update the token embeddings (hidden state)

    - multiply the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">weights</mark>** by the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">value</mark>** vector
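
The four steps above can be condensed into a single expression. This is a sketch that assumes $Q$, $K$ and $V$ denote the matrices of query, key and value vectors (one row per token), $d_k$ the dimension of the key vectors, and $x_i'$ the updated embedding of token $i$; these symbols are introduced here for illustration only.

```latex
s_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}, \qquad
w_{ij} = \frac{\exp(s_{ij})}{\sum_{l} \exp(s_{il})}, \qquad
x_i' = \sum_{j} w_{ij}\, v_j

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

The softmax is taken over the index $j$, so the weights belonging to one query sum to 1.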

### <b><span style='color:#F1A424'>Step by Step : Simple Attention Formulation</span></b>

- We'll look at a simple example, and summarise the attention mechanism in one function
- The `bert-base-uncased` model will be used to read off model settings (eg. the number of attention heads), so we will be building a similar model

<br>

#### 1. Document tokenisation

- Each token in the sentence is mapped to a **unique identifier** from a **vocabulary** or **dictionary**
- We start off by using the `bert-base-uncased` pretrained tokeniser
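
To make the idea of a token-to-ID mapping concrete, here is a minimal sketch with a hypothetical ten-word vocabulary and whitespace splitting. The names `toy_vocab`, `toy_encode` and `toy_decode` are made up for illustration; the real `bert-base-uncased` tokeniser works on WordPiece subword units and has a much larger vocabulary.

```python
# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"[UNK]": 0, "the": 1, "bark": 2, "of": 3, "a": 4,
             "palm": 5, "tree": 6, "is": 7, "very": 8, "rough": 9}


def toy_encode(text):
    """Map each whitespace-separated word to its ID ([UNK] if it is missing)."""
    return [toy_vocab.get(word, toy_vocab["[UNK]"]) for word in text.lower().split()]


def toy_decode(ids):
    """Map IDs back to words using the inverted vocabulary."""
    id_to_token = {idx: token for token, idx in toy_vocab.items()}
    return " ".join(id_to_token[i] for i in ids)


toy_ids = toy_encode("the bark of a palm tree is very rough")
print(toy_ids)
print(toy_decode(toy_ids))
```

Words outside the vocabulary fall back to the `[UNK]` ID in this sketch, which is one simple way of handling unknown input.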

In [2]:
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel

# load tokeniser and model
model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)

# document we'll be using as an example
text = "the bark of a palm tree is very rough"



In [3]:
# Tokenise input (text)
inputs = tokenizer(text, 
                   return_tensors="pt",      # return PyTorch tensors
                   add_special_tokens=False) # omit the [CLS] & [SEP] special tokens

print(inputs.input_ids)
inputs.input_ids.size()

tensor([[ 1996, 11286,  1997,  1037,  5340,  3392,  2003,  2200,  5931]])


torch.Size([1, 9])

In [4]:
# Decode sequence
tokenizer.decode(inputs['input_ids'].tolist()[0])

'the bark of a palm tree is very rough'

At this point:

- `inputs.input_ids` is a tensor of token IDs (each token mapped to its vocabulary index)
- Token embeddings are **independent of their context**
- **Homonyms** (same spelling, but different meaning) have the same representation

Role of subsequent attention layers:

- Mix the **token embeddings** to disambiguate & inform the representation of each token with the content of its context
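
The homonym problem can be shown with a small sketch. It assumes a hypothetical six-word vocabulary (`homonym_vocab`) and an 8-dimensional embedding layer, neither of which is part of the BERT setup used in this notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy vocabulary and embedding size, for illustration only
homonym_vocab = {"the": 0, "bark": 1, "of": 2, "tree": 3, "dogs": 4, "loudly": 5}
homonym_emb = nn.Embedding(len(homonym_vocab), 8)


def embed(words):
    """Look up the embedding vector of every word; returns [1, seq_len, 8]."""
    ids = torch.tensor([[homonym_vocab[w] for w in words]])
    return homonym_emb(ids)


tree_sentence = embed(["the", "bark", "of", "the", "tree"])
dog_sentence = embed(["dogs", "bark", "loudly"])

# "bark" sits at index 1 in both sentences and receives the identical vector,
# even though it means something different in each sentence
same_vector = torch.equal(tree_sentence[0, 1], dog_sentence[0, 1])
print(same_vector)
```

Only after the attention layers have mixed in the neighbouring tokens can the two occurrences of "bark" end up with different representations.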

In [5]:
'''

Create an embedding layer

'''

from torch import nn
from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_ckpt)

print(config.hidden_size,"hidden size")
print(config.vocab_size,"vocabulary size")

# create a sample embedding layer of size (30522, 768) -> same dimensions as bert-base (weights are randomly initialised)
token_emb = nn.Embedding(config.vocab_size,
                         config.hidden_size)
token_emb

768 hidden size
30522 vocabulary size


Embedding(30522, 768)

#### 2. Embedding Vectors

- Convert the tokenised data into embedding vectors (768 dimensions) using a vocabulary of 30522 tokens
- Each of the `input_ids` is **mapped to one of 30522 embedding vectors** stored in `nn.Embedding`, each of size 768
- Calling the embedding layer on `input_ids` gives an output of shape [batch_size, seq_len, hidden_dim]
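
An embedding layer is essentially a lookup table: each ID selects one row of the weight matrix. The sketch below illustrates this with small, hypothetical sizes (10 embeddings of 4 dimensions) rather than the 30522 x 768 table used in this notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Small hypothetical sizes so the example runs instantly
lookup = nn.Embedding(num_embeddings=10, embedding_dim=4)
ids = torch.tensor([[3, 7, 3]])        # [batch_size=1, seq_len=3]

via_layer = lookup(ids)                # calling the layer: [1, 3, 4]
via_indexing = lookup.weight[ids]      # indexing rows of the weight matrix directly

print(via_layer.shape)
print(torch.equal(via_layer, via_indexing))
```

Because ID 3 appears twice, the first and third output vectors are the same row of the table.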

In [6]:
'''

Convert Tokens to Embedding Vectors
using the randomly initialised embedding layer defined above (not the pretrained BERT weights)

'''

inputs_embeds = token_emb(inputs.input_ids)
inputs_embeds.size()

torch.Size([1, 9, 768])

In [7]:
# 9 embedding vectors of 768 dimensions
inputs_embeds

tensor([[[-0.8709, -0.1843, -0.7062,  ...,  0.5614, -0.5979, -0.1744],
         [-1.3574, -0.4618, -0.5609,  ...,  0.3766, -0.7757, -1.3060],
         [ 0.2145, -0.8197,  0.4578,  ..., -0.5882,  1.3728,  0.0022],
         ...,
         [ 0.6875,  1.6081,  0.4598,  ..., -0.8235,  0.6205,  0.6845],
         [ 1.1868,  0.0716,  0.2414,  ...,  0.6190, -0.3855,  0.7214],
         [-0.8954,  1.0546, -0.2179,  ...,  0.0564,  0.3874, -0.4946]]],
       grad_fn=<EmbeddingBackward0>)

#### 3. Create query, key and value vectors

- In the simplest case of attention, **we set them equal to one another**
- An attention mechanism with equal query and key vectors will assign a **very large score to identical words in the context** (the diagonal of the score matrix)
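
Why the diagonal dominates can be checked numerically. The sketch below assumes embeddings whose entries are drawn from a standard normal distribution (the default initialisation of `nn.Embedding`) and uses random vectors as a stand-in for the 9 token embeddings. The dot product of a vector with itself is then close to the dimension 768, so the scaled score is close to sqrt(768), roughly 27.7, while scores between different vectors stay of order 1.

```python
import torch
from math import sqrt

torch.manual_seed(0)

dim = 768
x = torch.randn(1, 9, dim)   # stand-in for 9 randomly initialised embeddings

# query = key = x, so entry (i, j) is the scaled dot product of tokens i and j
demo_scores = torch.bmm(x, x.transpose(1, 2)) / sqrt(dim)

diagonal = demo_scores[0].diagonal()
off_diagonal = demo_scores[0][~torch.eye(9, dtype=torch.bool)]

print(diagonal.mean().item(), sqrt(dim))   # diagonal mean next to sqrt(768)
print(off_diagonal.abs().mean().item())    # typical size of the other entries
```

After the softmax, a gap of this size pushes almost all of the weight onto the token itself.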

In [8]:
import torch
from math import sqrt

# setting them equal to one another
print("query and key components\n")
query = key = value = inputs_embeds
print('query size:',query.size())
dim_k = key.size(-1)   # hidden dimension 
print('key size:',key.transpose(1,2).size())

query and key components

query size: torch.Size([1, 9, 768])
key size: torch.Size([1, 768, 9])


#### 4. Compute the attention scores

- Compute **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention scores</mark>** using the **dot product as the similarity function**
- `torch.bmm` - batch matrix-matrix product (as we work in batches during training)
- The key tensor is transposed with `key.transpose(1,2)` so that its last two dimensions line up for the product
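
The division by `sqrt(dim_k)` can be motivated with a quick experiment. This sketch assumes independent random queries and keys with unit-variance entries, which is a simplification of what a trained model produces. Under that assumption the raw dot products have a variance of about `dim_k`, and the scaled ones a variance of about 1.

```python
import torch
from math import sqrt

torch.manual_seed(0)

dim_k = 768
n_pairs = 10_000

# Independent random queries and keys with unit-variance entries (an assumption)
q = torch.randn(n_pairs, dim_k)
k = torch.randn(n_pairs, dim_k)

raw = (q * k).sum(dim=-1)      # one dot product per (query, key) pair
scaled = raw / sqrt(dim_k)     # the same scores after scaling

raw_var = raw.var().item()
scaled_var = scaled.var().item()
print(raw_var, scaled_var)
```

Keeping the scores at a moderate scale stops the softmax from saturating on a single entry.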

In [9]:
# dot product & scale by sqrt(dim_k)
print("\ndot product (attention scores)")
scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k)
scores.size()


dot product (attention scores)


torch.Size([1, 9, 9])

In [10]:
# attention scores
scores

tensor([[[ 2.8648e+01, -1.0776e+00,  2.9722e-01, -2.9451e-01, -2.1153e+00,
          -2.0768e+00,  2.8204e-01,  8.9665e-01,  7.9191e-01],
         [-1.0776e+00,  2.8015e+01, -7.0782e-01, -2.8204e-01,  8.8297e-01,
          -3.5298e-01, -1.5330e+00, -1.3100e+00,  1.2302e-01],
         [ 2.9722e-01, -7.0782e-01,  2.7576e+01, -1.6882e+00,  1.0794e+00,
           6.6921e-01,  2.7391e-01,  1.0884e+00,  5.6002e-01],
         [-2.9451e-01, -2.8204e-01, -1.6882e+00,  2.7700e+01,  1.5521e+00,
           6.7132e-01, -1.7238e+00, -1.3852e-01, -7.7847e-01],
         [-2.1153e+00,  8.8297e-01,  1.0794e+00,  1.5521e+00,  2.8228e+01,
          -1.2328e-01,  8.3366e-01, -6.9877e-01, -5.7359e-01],
         [-2.0768e+00, -3.5298e-01,  6.6921e-01,  6.7132e-01, -1.2328e-01,
           2.6586e+01,  9.9523e-02,  9.9959e-03, -9.7113e-01],
         [ 2.8204e-01, -1.5330e+00,  2.7391e-01, -1.7238e+00,  8.3366e-01,
           9.9524e-02,  2.7679e+01,  4.2262e-01, -1.1332e-01],
         [ 8.9665e-01, -1.3100e+00

#### 5. Compute attention weights (softmax function)

- We have created a 9x9 matrix of **attention scores** per sample in the batch (one row & one column per token)
- Apply the softmax for normalisation to get the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention weights</mark>**

In [11]:
import torch.nn.functional as F

print("softmax applied, attention weights :\n")
weights = F.softmax(scores, dim=-1)
print(weights.size())
print(weights)

print("\nsum of row values:\n")
weights.sum(dim=-1)

softmax applied, attention weights :

torch.Size([1, 9, 9])
tensor([[[1.0000e+00, 1.2309e-13, 4.8674e-13, 2.6935e-13, 4.3606e-14,
          4.5319e-14, 4.7941e-13, 8.8640e-13, 7.9826e-13],
         [2.3186e-13, 1.0000e+00, 3.3561e-13, 5.1375e-13, 1.6471e-12,
          4.7856e-13, 1.4705e-13, 1.8379e-13, 7.7031e-13],
         [1.4227e-12, 5.2076e-13, 1.0000e+00, 1.9537e-13, 3.1105e-12,
          2.0638e-12, 1.3899e-12, 3.1385e-12, 1.8503e-12],
         [6.9511e-13, 7.0384e-13, 1.7250e-13, 1.0000e+00, 4.4060e-12,
          1.8260e-12, 1.6647e-13, 8.1246e-13, 4.2842e-13],
         [6.6392e-14, 1.3312e-12, 1.6202e-12, 2.5994e-12, 1.0000e+00,
          4.8669e-13, 1.2672e-12, 2.7373e-13, 3.1023e-13],
         [3.5627e-13, 1.9972e-12, 5.5508e-12, 5.5625e-12, 2.5129e-12,
          1.0000e+00, 3.1401e-12, 2.8712e-12, 1.0764e-12],
         [1.2635e-12, 2.0574e-13, 1.2533e-12, 1.7001e-13, 2.1935e-12,
          1.0527e-12, 1.0000e+00, 1.4542e-12, 8.5089e-13],
         [1.9107e-12, 2.1031e-13, 2.3

tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1.]], grad_fn=<SumBackward1>)

#### 6. Update values 

Multiply the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention weights</mark>** matrix by the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">values</mark>** vector

In [12]:
attn_outputs = torch.bmm(weights, value)
print(attn_outputs)
print(attn_outputs.shape)

tensor([[[-0.8709, -0.1843, -0.7062,  ...,  0.5614, -0.5979, -0.1744],
         [-1.3574, -0.4618, -0.5609,  ...,  0.3766, -0.7757, -1.3060],
         [ 0.2145, -0.8197,  0.4578,  ..., -0.5882,  1.3728,  0.0022],
         ...,
         [ 0.6875,  1.6081,  0.4598,  ..., -0.8235,  0.6205,  0.6845],
         [ 1.1868,  0.0716,  0.2414,  ...,  0.6190, -0.3855,  0.7214],
         [-0.8954,  1.0546, -0.2179,  ...,  0.0564,  0.3874, -0.4946]]],
       grad_fn=<BmmBackward0>)
torch.Size([1, 9, 768])


Now we can write a general function which:
- Takes the `query`, `key` & `value` tensors as input
- Calculates the scaled dot-product attention

In [13]:
'''

Scaled Dot-Product Attention
scores = query @ key.T / sqrt(dim_k)
weights = softmax(scores)
output = weights @ value

'''

def sdp_attention(query, key, value):
    dim_k = query.size(-1) # dimension component
    sfact = sqrt(dim_k)     
    scores = torch.bmm(query, key.transpose(1,2)) / sfact
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)

## <b>3 <span style='color:#F1A424'>|</span> Multi-Head Self Attention</b> 

***

- The meaning of a word is better informed by **complementary words in the context** than by **identical words** (with query = key = value, each token attends almost entirely to itself, giving a weight of ~1 on the diagonal)

### <b><span style='color:#F1A424'>Simplistic Approach</span></b>

- We only used the embeddings "as is" (no linear transformation) to compute the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention scores</mark>** & **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention weights</mark>**

### <b><span style='color:#F1A424'>Better Approach</span></b>

- The **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">self-attention</mark>** layer applies **three independent linear transformations (`nn.Linear`) to each embedding** to generate the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">query</mark>**, **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">key</mark>** & **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">value</mark>** vectors
- These transformations project the embeddings and **each projection carries its own set of learnable parameters** (**Weights**)
- This **allows the self-attention layer to focus on different semantic aspects of the sequence**



It's beneficial to have **multiple sets of linear projections** (each one represents an **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention head</mark>**)

Why do we need more than one **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention head</mark>**?
- The **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">softmax</mark>** of one head tends to focus on mostly **one aspect of similarity**


**Several heads** allow the model to **focus on several aspects at once**
- Eg. one head can focus on subject-verb interaction, another finds nearby adjectives
- **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">CV analogy</mark>**: filters; one filter responsible for detecting the head, another for facial features 

In [14]:
'''

Attention Class

# nn.Linear : applies a linear transformation to the incoming data
#             y = x * A^T + b
# where x is the input, A the weight matrix, b the bias & y the output

# calculates the scaled dot-product attention
# Requires the embedding dimension & the head dimension
# Each attention head has its own q, k, v projections

'''

class Attention(nn.Module):
    
    # initialisation 
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        
        # Define the three linear projections
        # input - embed_dim, output - head_dim
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)

    # main class operation
    def forward(self, hidden_state):
        
        # calculate scaled dot-product attention given a hidden state
        attn_outputs = sdp_attention(
            self.q(hidden_state), 
            self.k(hidden_state), 
            self.v(hidden_state))
        
        return attn_outputs

`Attention` will be used in the construction of a model

- We've **initialised three independent linear layers** that apply a matrix multiplication to the embedding vectors to produce tensors of shape [batch_size, seq_len, head_dim]
- Here `head_dim` is the number of dimensions we are projecting into
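
The shapes involved are easier to follow with numbers. The sketch below assumes the common choice `head_dim = embed_dim // num_heads` with the BERT-base settings of 768 and 12, uses random data in place of the token embeddings, and shows only one projection per head for brevity.

```python
import torch
from torch import nn

torch.manual_seed(0)

embed_dim, num_heads = 768, 12
head_dim = embed_dim // num_heads             # 64 dimensions per head

hidden_state = torch.randn(1, 9, embed_dim)   # stand-in for the token embeddings

# One independent projection per head (only a single projection type is shown)
projections = nn.ModuleList(
    [nn.Linear(embed_dim, head_dim) for _ in range(num_heads)])

per_head = [proj(hidden_state) for proj in projections]
concatenated = torch.cat(per_head, dim=-1)    # join the heads along the feature axis

print(per_head[0].shape)     # shape of a single head
print(concatenated.shape)    # shape after concatenating all heads
```

Choosing `head_dim` this way means the concatenated heads have the same size as the input embedding, so the cost of the layer stays roughly constant as the number of heads changes.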


In [15]:
# model settings read from the config
print(config.num_attention_heads,'heads')
print(config.hidden_size,'hidden state embedding dimension')

12 heads
768 hidden state embedding dimension


In [16]:
''' Sample Initialisation '''

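# Initialise just one head; the forward pass requires the token embedding vectors
# Note: the second argument is head_dim. Passing num_heads here projects into only
# 12 dimensions for illustration; multi-head attention below uses embed_dim // num_heads = 64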

embed_dim = config.hidden_size
num_heads = config.num_attention_heads

attention = Attention(embed_dim,num_heads)
attention

Attention(
  (q): Linear(in_features=768, out_features=12, bias=True)
  (k): Linear(in_features=768, out_features=12, bias=True)
  (v): Linear(in_features=768, out_features=12, bias=True)
)

In [17]:
# Weights are initialised randomly, so attention_outputs differs on every run
attention_outputs = attention(inputs_embeds)
attention_outputs

tensor([[[-0.1804, -0.0355,  0.2850, -0.0088,  0.0488,  0.0104, -0.1397,
           0.1800, -0.1550,  0.1591, -0.3884, -0.3103],
         [-0.2434, -0.0367,  0.2643, -0.1137, -0.0861,  0.0392, -0.1183,
           0.1923, -0.1150,  0.0633, -0.3714, -0.2804],
         [-0.2184, -0.0615,  0.2490,  0.0307,  0.0154,  0.0091, -0.1071,
           0.1482, -0.1489,  0.1104, -0.3719, -0.2284],
         [-0.2270,  0.0603,  0.2910, -0.1348, -0.0142,  0.0752, -0.1084,
           0.2588, -0.0805,  0.1108, -0.4121, -0.2126],
         [-0.2155,  0.0273,  0.2651, -0.0508,  0.0252,  0.0960, -0.0768,
           0.2428, -0.0987,  0.0898, -0.3790, -0.0925],
         [-0.2540, -0.0703,  0.2320, -0.0544, -0.0656,  0.0099, -0.1250,
           0.1393, -0.1262,  0.0707, -0.3794, -0.2972],
         [-0.2582, -0.0738,  0.2525, -0.0715, -0.0304,  0.1867, -0.0478,
           0.1679, -0.1137, -0.0074, -0.3407, -0.0108],
         [-0.2024,  0.0201,  0.2848, -0.0249,  0.0462,  0.1055, -0.0833,
           0.2281, -0.10

In [18]:
'''

Multihead attention class

'''


class multiHeadAttention(nn.Module):
    
    # Config during initialisation
    def __init__(self, config):
        super().__init__()
        
        # model params, read from config file
        embed_dim = config.hidden_size
        num_heads = config.num_attention_heads
        head_dim = embed_dim // num_heads
        
        # attention heads (only defined here; the hidden state is passed in forward)
        # each head is initialised with a head dimension of embed_dim // num_heads
        self.heads = nn.ModuleList(
            [Attention(embed_dim, head_dim) for _ in range(num_heads)])
        
        # output projection spans the whole embedding dimension
        self.out_linear = nn.Linear(embed_dim, embed_dim)

    # Given a hidden state (embeddings)
    # Apply operation for multihead attention
        
    def forward(self, hidden_state):
        
        # calculate attention for each head (each of size embed_dim // num_heads)
        heads = [head(hidden_state) for head in self.heads] 
        x = torch.cat(heads, dim=-1) # concatenate the head outputs
    
        # apply a linear transformation to the concatenated head outputs
        x = self.out_linear(x)
        return x

In [19]:
'''

Sample Usage: Multi-Head Attention

'''

# Output will differ on every run due to the randomly initialised weights
multihead_attn = multiHeadAttention(config) # initialisation with config
attn_output = multihead_attn(inputs_embeds) # forward by inputting embedding vectors (one for each token)

# Multi-head attention output (concatenated head outputs passed through the output projection)
print(attn_output)
print(attn_output.size())

tensor([[[-0.1354,  0.0707, -0.0377,  ..., -0.0092,  0.0750, -0.1316],
         [-0.1551,  0.0392, -0.0357,  ..., -0.0361,  0.1730, -0.1306],
         [-0.1009, -0.0182, -0.0520,  ..., -0.0356,  0.0616, -0.1059],
         ...,
         [-0.1474,  0.0178, -0.1116,  ...,  0.0540,  0.0988, -0.1248],
         [-0.2208,  0.0140, -0.0677,  ...,  0.0058,  0.0610, -0.0620],
         [-0.1220,  0.0006, -0.0499,  ..., -0.0171,  0.0433, -0.0611]]],
       grad_fn=<AddBackward0>)
torch.Size([1, 9, 768])


## <b>4 <span style='color:#F1A424'>|</span> Feed-Forward Layer</b> 

***

**position-wise feed-forward layer**

The **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">feed-forward</mark>** sublayer in the encoder & decoder
- **two layer fully connected neural network**


However, instead of processing the whole sequence of embeddings as a single vector,
- it **processes each embedding** independently
- You may also see it referred to as a Conv1D with a kernel size of 1 (typically by people with a CV background)


The **hidden size** of the **1st layer is 4x the size of the embeddings**, followed by a **GELU activation function**
- This is where most of the capacity & memorisation is hypothesised to happen
- It is the part most often scaled when scaling up models
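
"Position-wise" can be verified directly: running the layer on the whole sequence gives the same result as running it on each token separately. The sketch below assumes small hypothetical sizes (16 and 64, keeping the 4x expansion) and leaves out dropout so that the comparison is deterministic.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden_size, intermediate_size = 16, 64   # small hypothetical sizes (4x expansion)
ffn = nn.Sequential(
    nn.Linear(hidden_size, intermediate_size),
    nn.GELU(),
    nn.Linear(intermediate_size, hidden_size),
)

x = torch.randn(1, 9, hidden_size)        # 9 token embeddings

with torch.no_grad():
    whole_sequence = ffn(x)               # all 9 tokens in one call
    # the same network applied to one token at a time, then stacked back together
    token_by_token = torch.stack([ffn(x[:, i]) for i in range(9)], dim=1)

print(torch.allclose(whole_sequence, token_by_token, atol=1e-5))
```

No information flows between tokens in this sublayer; mixing across the sequence happens only in the attention layer.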

In [20]:
class feedForward(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        self.linear1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    # define layer operations input x
        
    def forward(self, x):    # note must be forward
        x = self.gelu(self.linear1(x))
        x = self.linear2(x)
        x = self.dropout(x)
        return x

In [21]:
# initialise the feed-forward layer
feed_forward = feedForward(config)              # initialise 
print(feed_forward,'\n')

# the forward pass takes the multi-head attention output (attn_output)
ff_outputs = feed_forward(attn_output) # forward operation
ff_outputs

feedForward(
  (linear1): Linear(in_features=768, out_features=3072, bias=True)
  (linear2): Linear(in_features=3072, out_features=768, bias=True)
  (gelu): GELU()
  (dropout): Dropout(p=0.1, inplace=False)
) 



tensor([[[ 0.0497, -0.0443,  0.0052,  ...,  0.0450,  0.0032, -0.0255],
         [ 0.0282, -0.0301, -0.0007,  ...,  0.0291, -0.0070, -0.0030],
         [ 0.0378, -0.0340, -0.0034,  ...,  0.0357, -0.0095, -0.0130],
         ...,
         [ 0.0354, -0.0305,  0.0000,  ...,  0.0418, -0.0146, -0.0212],
         [ 0.0332, -0.0255,  0.0119,  ...,  0.0454, -0.0115, -0.0250],
         [ 0.0409, -0.0275, -0.0019,  ...,  0.0317, -0.0169, -0.0143]]],
       grad_fn=<MulBackward0>)

## <b>5 <span style='color:#F1A424'>|</span> Normalisation Layers</b> 

***

Transformer architecture also uses **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">layer normalisation</mark>** & **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">skip connections</mark>**
- **layer normalisation** - normalises each input in the batch to have zero mean & unit variance
- **skip connections** - pass a tensor to the next layer of the model without processing it & add it to the processed tensor
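
What layer normalisation computes can be reproduced by hand. The sketch below assumes a small hypothetical feature size of 16 and a freshly initialised `nn.LayerNorm` (scale 1, shift 0, `eps=1e-5`); the statistics are taken over the feature dimension of each token separately.

```python
import torch
from torch import nn

torch.manual_seed(0)

x = torch.randn(1, 9, 16) * 3 + 5    # features with non-zero mean & non-unit variance
layer_norm = nn.LayerNorm(16)        # default eps=1e-5; scale=1 and shift=0 at initialisation

# Manual version: statistics are computed over the last (feature) dimension only
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + 1e-5)

normed = layer_norm(x)
print(torch.allclose(normed, manual, atol=1e-5))
print(normed.mean(dim=-1).abs().max().item())   # largest per-token mean after normalisation
```

Each of the 9 tokens is normalised on its own, independently of the other tokens and of the other samples in the batch.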

There are two main approaches to placing the normalisation layer in the encoder or decoder:
- **post-layer** normalisation (used in the Transformer paper; places layer normalisation in between the skip connections)
- **pre-layer** normalisation (places layer normalisation within the span of the skip connections)

<br>

| `post-layer` normalisation | `pre-layer` normalisation |
| - | - |
| Tricky to train from scratch, as the gradients can diverge | The arrangement most often found in the literature |
| Usually used with learning rate warm-up (the learning rate is gradually increased from a small value to some maximum value during training) | Places layer normalisation within the span of the skip connection |
|  | Tends to be much more stable during training, and does not usually require any learning rate warm-up |
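
The difference between the two arrangements comes down to where the normalisation sits relative to the residual addition. In the sketch below, a single `nn.Linear` stands in for the attention or feed-forward sublayer, and the names `post_ln_block` and `pre_ln_block` are hypothetical helpers introduced for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden_size = 16                                  # small hypothetical size
norm = nn.LayerNorm(hidden_size)
sublayer = nn.Linear(hidden_size, hidden_size)    # stand-in for attention or feed-forward


def post_ln_block(x):
    # Post-layer norm: normalise after the residual addition
    return norm(x + sublayer(x))


def pre_ln_block(x):
    # Pre-layer norm: normalise inside the residual branch; the skip path stays untouched
    return x + sublayer(norm(x))


x = torch.randn(1, 9, hidden_size)
post_out = post_ln_block(x)
pre_out = pre_ln_block(x)
print(post_out.shape, pre_out.shape)
```

In the pre-layer version the input `x` reaches the output unchanged through the skip path, which is the arrangement used by the `encoderLayer` class in this notebook.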

In [22]:
class encoderLayer(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        self.norm1 = nn.LayerNorm(config.hidden_size)
        self.norm2 = nn.LayerNorm(config.hidden_size)
        self.attention = multiHeadAttention(config)    # multihead attention layer 
        self.feed_forward = feedForward(config)        # feed forward layer

    def forward(self, x):
        
        # Apply layer norm. to the hidden state, then use it as the query, key & value input
        # Apply attention with a skip connection
        x = x + self.attention(self.norm1(x))
        
        # Apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.norm2(x))
        
        return x

In [23]:
# Transformer layer output
encoder_layer = encoderLayer(config) # initialise encoder layer
print(encoder_layer,'\n')

print('input',inputs_embeds.shape) 
print('output',encoder_layer(inputs_embeds).size())

encoderLayer(
  (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (attention): multiHeadAttention(
    (heads): ModuleList(
      (0): Attention(
        (q): Linear(in_features=768, out_features=64, bias=True)
        (k): Linear(in_features=768, out_features=64, bias=True)
        (v): Linear(in_features=768, out_features=64, bias=True)
      )
      (1): Attention(
        (q): Linear(in_features=768, out_features=64, bias=True)
        (k): Linear(in_features=768, out_features=64, bias=True)
        (v): Linear(in_features=768, out_features=64, bias=True)
      )
      (2): Attention(
        (q): Linear(in_features=768, out_features=64, bias=True)
        (k): Linear(in_features=768, out_features=64, bias=True)
        (v): Linear(in_features=768, out_features=64, bias=True)
      )
      (3): Attention(
        (q): Linear(in_features=768, out_features=64, bias=True)
        (k): Linear(in_features=76

There is an issue with the way we set up the **encoder layers** (which use just the token embeddings as input)
- They are totally **invariant to the position of the tokens**
- The multi-head attention layer is effectively a weighted sum, so the **information on token position is lost**
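
This loss of order can be demonstrated with the simple attention from earlier. More precisely, the layer is permutation-equivariant: shuffling the input tokens shuffles the outputs in the same way, but each token's output vector is unchanged, so word order carries no information. The sketch uses random data and a hypothetical helper `simple_attention` with query = key = value.

```python
import torch
from math import sqrt

torch.manual_seed(0)


def simple_attention(x):
    # Self-attention with query = key = value = x, as in the simple example earlier
    scores = torch.bmm(x, x.transpose(1, 2)) / sqrt(x.size(-1))
    weights = torch.softmax(scores, dim=-1)
    return torch.bmm(weights, x)


x = torch.randn(1, 9, 16)
perm = torch.randperm(9)                 # a random reordering of the 9 tokens

out_then_shuffle = simple_attention(x)[:, perm]
shuffle_then_out = simple_attention(x[:, perm])

print(torch.allclose(out_then_shuffle, shuffle_then_out, atol=1e-5))
```

The same argument carries over to heads with learned projections, because the projections are applied to each token independently.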


## <b>6 <span style='color:#F1A424'>|</span> Positional Embeddings</b> 

***

Let's incorporate positional information using **positional embeddings**


**positional embeddings** are based on a simple idea:
  - Modify the **token embeddings** with a **position-dependent pattern** of values arranged in a vector
  
  
If the pattern is characteristic of each position,
- the **attention heads** and **feed-forward layers** in each stack can learn to incorporate positional information into their transformations



- There are several ways to achieve this, and one of the most popular approaches is to use a `learnable pattern`
- This works exactly the same way as the token embeddings, but using the **position index** instead of the **token identifier** (from vocabulary dictionary) as input
- An efficient way of encoding the positions of tokens is learned during pretraining

Creating Custom `Embedding` class

Let's create a custom embeddings module (**token embeddings + positional embeddings**) that:
 - Combines a token embedding layer, which projects the `input_ids` to a dense hidden state
 - With a positional embedding layer, which does the same for the `position_ids`
 - Returns the **sum of both embeddings** as the resulting embedding
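
The effect of adding the two embeddings can be seen on a repeated token. The sketch below assumes small hypothetical sizes (a vocabulary of 10, at most 8 positions, 4 dimensions) and leaves out the normalisation and dropout steps.

```python
import torch
from torch import nn

torch.manual_seed(0)

vocab_size, max_positions, hidden_size = 10, 8, 4    # small hypothetical sizes
token_embeddings = nn.Embedding(vocab_size, hidden_size)
position_embeddings = nn.Embedding(max_positions, hidden_size)

input_ids = torch.tensor([[3, 5, 3]])                          # token 3 appears twice
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)    # [[0, 1, 2]]

tokens_only = token_embeddings(input_ids)
combined = tokens_only + position_embeddings(position_ids)

# Without positions the two occurrences of token 3 are indistinguishable
print(torch.equal(tokens_only[0, 0], tokens_only[0, 2]))
# With positions added, the same token gets a different vector at each position
print(torch.equal(combined[0, 0], combined[0, 2]))
```

The later layers therefore receive inputs that depend on both what the token is and where it sits in the sequence.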

In [24]:
'''

Token + Position Embedding 


'''

class tpEmbedding(nn.Module):
    
    def __init__(self, config):        
        super().__init__()
        
        # token embedding layer
        self.token_embeddings = nn.Embedding(config.vocab_size,
                                             config.hidden_size)
        
        # positional embedding layer
        # config.max_position_embeddings -> max number of positions in text 512 (tokens)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings,
                                                config.hidden_size)
        
        self.norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout()  # default p=0.5; BERT itself uses config.hidden_dropout_prob here

    def forward(self, input_ids):
        
        # Create position IDs for input sequence
        seq_length = input_ids.size(1) # number of tokens
        position_ids = torch.arange(seq_length, dtype=torch.long)[None,:] # range(0,9)
        
        # tensor([[ 1996, 11286,  1997,  1037,  5340,  3392,  2003,  2200,  5931]])
        # tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8]])
        
        # Create token and position embeddings
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        
        # Combine token and position embeddings
        embeddings = token_embeddings + position_embeddings
        
        # Add normalisation & dropout layers
        embeddings = self.norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

In [25]:
# Token and Position Embeddings
embedding_layer = tpEmbedding(config)
embedding_layer(inputs.input_ids)

tensor([[[-3.9818, -0.0000,  0.0000,  ..., -0.3383, -0.0000, -2.5944],
         [-5.0538, -0.0000, -2.1341,  ..., -0.0000,  0.0000, -1.2488],
         [-0.0000,  0.0000,  1.4679,  ...,  0.0000, -0.0000, -3.0386],
         ...,
         [ 0.0000,  2.0134,  2.0548,  ..., -1.2429,  4.4909, -0.0000],
         [-1.7996, -0.0000,  0.0000,  ..., -2.1656,  0.3933,  0.0000],
         [ 0.0000, -1.9420, -0.0000,  ..., -0.0000, -0.0000, -0.0000]]],
       grad_fn=<MulBackward0>)

## <b>7 <span style='color:#F1A424'>|</span> Putting it all Together</b> 

***

- Constructing the Transformer **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">encoder</mark>**, combining the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">Embedding</mark>** and **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">Encoder</mark>**  layers
- We utilise both **token** & **positional** embeddings using `tpEmbedding`
- We stack `config.num_hidden_layers` copies of `encoderLayer`, each containing the **attention** & **feed-forward** layers

In [26]:
# full transformer encoder combining the embedding layer with the stack of encoder layers

class TransformerEncoder(nn.Module):
    
    def __init__(self, config):       
        super().__init__()
        
        # token & positional embedding layer
        self.embeddings = tpEmbedding(config)
        
        # stack of encoder layers (attention & feed-forward)
        self.layers = nn.ModuleList([encoderLayer(config)
                                     for _ in range(config.num_hidden_layers)])

    def forward(self, x):
        
        # embeddings layer output
        x = self.embeddings(x)
        
        # pass the hidden state through each encoder layer in turn
        for layer in self.layers:
            x = layer(x)
        return x

In [27]:
# Encoder initialisation & output
encoder = TransformerEncoder(config)
encoder_output = encoder(inputs.input_ids)
encoder_output

tensor([[[-1.3150, -0.9448, -1.9527,  ...,  0.4258, -0.5694, -3.3036],
         [-0.9168, -0.6489,  0.9373,  ...,  0.6969,  1.5973, -0.5345],
         [ 1.0490, -1.5161,  1.6764,  ...,  1.0162, -2.1198,  2.6263],
         ...,
         [-0.5821, -5.6414, -2.3057,  ...,  1.1826, -1.0920, -2.5330],
         [-1.6120, -1.7331,  0.7028,  ...,  1.3954,  2.4647,  0.9209],
         [ 1.0205,  1.4829, -0.8654,  ..., -1.8677, -3.3867,  1.6848]]],
       grad_fn=<AddBackward0>)

In [28]:
# hidden state for each token in a batch
encoder_output.size()

torch.Size([1, 9, 768])

## <b>8 <span style='color:#F1A424'>|</span> Classification Head</b> 

***

Quite often, transformers are divided into:
- Task independent body (`TransformerEncoder`)
- Task dependent head (`TransformerClassifier`)

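Select one of the token outputs:
- The hidden state of the first token, the **[CLS] token**, is often used for the prediction in such models (in this notebook the special tokens were omitted during tokenisation, so the first token is simply `the`)
- We can attach a `dropout` and a `linear` layer to it to make a classification prediction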

In [29]:
class TransformerClassifier(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        
        # Transformer Encoder
        self.encoder = TransformerEncoder(config)
        
        # Classification Head
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, x):
        x = self.encoder(x)[:, 0, :] # select hidden state of [CLS] token
        print(x.size())
        x = self.dropout(x)
        x = self.classifier(x) # 768 -> 3 
        return x

- For each sample in the batch we get the **unnormalised logits** for each class in the output, which matches the output format of a BERT classification model
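
Logits become class probabilities once a softmax is applied over the class dimension. The sketch below reuses the three logit values from the sample output of this notebook; since they come from an untrained model with random weights, the resulting "prediction" carries no meaning.

```python
import torch
import torch.nn.functional as F

# Logits copied from the sample output of this notebook (untrained model)
logits = torch.tensor([[1.0015, 0.8205, 0.6130]])

probs = F.softmax(logits, dim=-1)        # normalise over the class dimension
predicted_class = probs.argmax(dim=-1)   # index of the most probable class

print(probs)
print(predicted_class)
```

During training the logits are usually passed straight to a loss such as `nn.CrossEntropyLoss`, which applies the softmax internally.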

In [30]:
config.num_labels = 3
encoder_classifier = TransformerClassifier(config)
output = encoder_classifier(inputs.input_ids)
output

torch.Size([1, 768])


tensor([[1.0015, 0.8205, 0.6130]], grad_fn=<AddmmBackward0>)