In [1]:
!pip install bertviz

Collecting bertviz
  Downloading bertviz-1.4.0-py3-none-any.whl (157 kB)
Installing collected packages: bertviz
Successfully installed bertviz-1.4.0

## <b>1 <span style='color:#F1A424'>|</span> Creating a Transformer</b> 

***

In this notebook, we'll build a transformer encoder from its components:

<div style=" background-color:#3b3745; padding: 13px 13px; border-radius: 8px; color: white">
    
<ul>
<li>Simple Attention</li>
<li>Multi-Head Self Attention</li>
<li>Feed Forward Layer</li>
<li>Normalisation</li>
<li>Position Embeddings</li>
<li>Transformer Encoder</li>
<li>Classifier Head</li>
</ul> 
</div> 

<br>

## <b>2 <span style='color:#F1A424'>|</span> Simple Self-Attention</b> 

***

### <b><span style='color:#F1A424'>Types of Attention</span></b>

**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention</mark>**
 
- Mechanism which allows networks to assign **different weight distributions to each element** in a sequence 
- Elements in sequence - `token embeddings` (each token mapped to a vector of fixed dimension) (eg. BERT model - 768 dimensions)
 
 
**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">self-attention</mark>**

- Instead of using a fixed embedding for each token, we can use the whole sequence to **compute a weighted average** of the `embeddings`
- One can think of self-attention as a form of averaging
- The most common form of `self-attention` is **scaled dot-product attention**


### <b><span style='color:#F1A424'>Four Main Steps</span></b>

- Project each `token embedding` into three vectors **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">query</mark>**,**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">key</mark>**,**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">value</mark>**
- Compute **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention scores</mark>** (nxn)

    - we determine how much the query & key vectors relate to each other using a similarity function
    - The similarity function for scaled dot-product attention is the dot product
    - queries & keys that are similar will have a large dot product & vice versa
    - Outputs from this step - attention scores
    
    
- Compute **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention weight</mark>** (wij)

    - dot products produce large numbers 
    - attention scores are first multiplied by a scaling factor to normalise their variance
    - They are then normalised with a softmax so that the weights in each row (one row per query) sum to 1
    
    
- Update the token embeddings 

    - multiply the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">weights</mark>** by the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">value</mark>** vector
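
The four steps above can be summarised in one expression. The symbols are notation assumed here for illustration: $q_i$, $k_j$, $v_j$ are the query, key and value vectors of tokens $i$ and $j$, $d_k$ is the dimension of the key vectors, and $x_i'$ is the updated embedding of token $i$.

$$
w_{ij} = \mathrm{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{d_k}}\right),
\qquad
x_i' = \sum_{j} w_{ij}\, v_j
$$

In matrix form, with the queries, keys and values stacked row-wise into $Q$, $K$ and $V$:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$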

***

### <b><span style='color:#F1A424'>Step by Step</span></b>

#### 1. Document tokenisation

- Each token in the sentence has been mapped to a unique identifier from a vocabulary 
- We start off by using the `bert-base-uncased` pretrained tokeniser

In [2]:
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel
# from bertviz.neuron_view import show

model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)

# document
text = "time flies like an arrow"



In [3]:
# Tokenise input (text)
inputs = tokenizer(text, 
                   return_tensors="pt",      # return PyTorch tensors
                   add_special_tokens=False) # don't add special tokens ([CLS], [SEP])

print(inputs.input_ids)
inputs.input_ids.size()

tensor([[ 2051, 10029,  2066,  2019,  8612]])


torch.Size([1, 5])

In [4]:
# tensor to list
inputs['input_ids'].tolist()[0]

[2051, 10029, 2066, 2019, 8612]

In [5]:
# Decode sequence
tokenizer.decode(inputs['input_ids'].tolist()[0])

'time flies like an arrow'

At this point:

- `inputs.input_ids` is a tensor of token ids (each token mapped to its id in the vocabulary)
- Token embeddings are **independent of their context**
- **Homonyms** (same spelling, but different meaning) have the same representation

Role of subsequent attention layers:

- Mix the **token embeddings** to disambiguate & inform the representation of each token with the content of its context
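
To make the point about context-independent embeddings concrete, here is a minimal sketch. The toy vocabulary, its ids and the 8-dimensional embedding size are made up for illustration; they are not the BERT vocabulary used in this notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy vocabulary (ids chosen arbitrarily for this example)
vocab = {"time": 0, "flies": 1, "like": 2, "an": 3,
         "arrow": 4, "fruit": 5, "a": 6, "banana": 7}
toy_emb = nn.Embedding(len(vocab), 8)

sent_a = torch.tensor([[vocab[w] for w in "time flies like an arrow".split()]])
sent_b = torch.tensor([[vocab[w] for w in "fruit flies like a banana".split()]])

# "flies" is a verb in the first sentence and a noun in the second,
# but an embedding lookup only sees the token id, not the context
flies_a = toy_emb(sent_a)[0, 1]
flies_b = toy_emb(sent_b)[0, 1]
print(torch.equal(flies_a, flies_b))  # True
```

Both occurrences of "flies" get exactly the same vector, which is why a later layer has to mix in information from the surrounding tokens.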

In [6]:
from torch import nn
from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_ckpt)

print(config.hidden_size,"hidden size")
print(config.vocab_size,"vocabulary size")

# create a randomly initialised embedding layer of size (30522, 768) -> same shape as in bert-base
token_emb = nn.Embedding(config.vocab_size,
                         config.hidden_size)
token_emb

768 hidden size
30522 vocabulary size


Embedding(30522, 768)

#### 2. Embedding Vectors

- Convert the tokenised data into embedding vectors (768 dimensions) using a vocabulary of 30522 tokens
- Each id in `input_ids` is **mapped to one of the 30522 embedding vectors** stored in `nn.Embedding`, each with a size of 768
- The output has shape [batch_size, seq_len, hidden_dim] and is obtained by calling the embedding layer on the ids, `token_emb(inputs.input_ids)`

In [7]:
inputs_embeds = token_emb(inputs.input_ids)
inputs_embeds.size()

torch.Size([1, 5, 768])

In [8]:
# 5 embedding vectors of 768 dimensions
inputs_embeds

tensor([[[ 0.2257, -0.4307,  1.8177,  ...,  0.7229,  3.2706, -0.7792],
         [ 0.4286,  0.0174, -0.7029,  ..., -0.8039, -0.6916,  0.6626],
         [-0.3575, -0.6561,  0.5093,  ..., -1.2935,  0.0352,  1.3152],
         [-0.4761,  0.8055,  0.4843,  ...,  1.2622, -0.6694, -0.1447],
         [ 0.7571,  0.1967,  0.0324,  ...,  0.1879,  1.6343, -1.1001]]],
       grad_fn=<EmbeddingBackward0>)

#### 3. Create query, key and value vectors

- In the simplest case of attention, **we set them equal to one another**
- An attention mechanism with equal query and key vectors will assign a **very large score to identical words in the context** (the diagonal components of the score matrix)
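
To see why equal query and key vectors give a large diagonal, here is a small sketch. It assumes the embeddings are drawn from a standard normal distribution (the default initialisation of `nn.Embedding`) and uses random stand-in vectors rather than the notebook's `inputs_embeds`. The dot product of a vector with itself is its squared norm, roughly `d` for `d` standard-normal entries, so after dividing by `sqrt(d)` the diagonal is about `sqrt(d)`, while dot products between unrelated vectors stay near zero.

```python
import torch
from math import sqrt

torch.manual_seed(0)
d = 768
x = torch.randn(5, d)            # stand-in for 5 randomly initialised embeddings

scores = x @ x.T / sqrt(d)       # query = key = x
diag = scores.diagonal()
off_diag = scores[~torch.eye(5, dtype=torch.bool)]

print("sqrt(d):              ", sqrt(d))
print("mean diagonal score:  ", diag.mean().item())
print("mean |off-diagonal|:  ", off_diag.abs().mean().item())
```

With `d = 768`, `sqrt(d)` is about 27.7, so diagonal scores of this size are what one would expect from untrained embeddings.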

In [9]:
import torch
from math import sqrt

# setting them equal to one another
print("query and key components\n")
query = key = value = inputs_embeds
print('query size:',query.size())
dim_k = key.size(-1)   # hidden dimension 
print('key size:',key.transpose(1,2).size())

query and key components

query size: torch.Size([1, 5, 768])
key size: torch.Size([1, 768, 5])


#### 4. Compute the attention scores

- Compute **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention scores</mark>** using the **dot product as the similarity function**
- `torch.bmm` - batched matrix-matrix product (as we work in batches during training)
- The key tensor is transposed with `key.transpose(1,2)`, which swaps its last two dimensions (seq_len and hidden_dim)
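
As a quick illustration of what `torch.bmm` does, the sketch below compares it with an explicit loop over the batch. The tensor sizes are arbitrary example values.

```python
import torch

torch.manual_seed(0)
a = torch.randn(2, 5, 8)   # (batch_size, seq_len, dim)
b = torch.randn(2, 5, 8)

# Batched product: one (5 x 8) @ (8 x 5) matrix product per sample
batched = torch.bmm(a, b.transpose(1, 2))

# Same computation written as a loop over the batch dimension
looped = torch.stack([a[i] @ b[i].T for i in range(a.size(0))])

print(batched.shape)                               # torch.Size([2, 5, 5])
print(torch.allclose(batched, looped, atol=1e-6))
```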

In [10]:
# dot product, scaled by sqrt(dim_k)
print("\ndot product (attention scores)")
scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k)
scores.size()


dot product (attention scores)


torch.Size([1, 5, 5])

In [11]:
# attention scores
scores

tensor([[[25.8117, -0.1492,  0.4068,  0.6981,  0.0988],
         [-0.1492, 30.5535, -0.3039, -0.4122, -0.1059],
         [ 0.4068, -0.3039, 27.0213,  0.5284,  0.3396],
         [ 0.6981, -0.4122,  0.5284, 29.3416,  1.0719],
         [ 0.0988, -0.1059,  0.3396,  1.0719, 27.9608]]],
       grad_fn=<DivBackward0>)

#### 5. Compute attention weights (softmax function)

- We now have a 5x5 matrix of **attention scores** per sample in the batch
- Apply the softmax for normalisation to get the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention weights</mark>**
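
The sketch below illustrates two properties of this step: a softmax over the last dimension makes each row sum to 1, and skipping the `1/sqrt(dim_k)` scaling makes the softmax more peaked. It uses independent random queries and keys as an assumption (unlike the `query = key` setup in this notebook), so the effect of the scaling is easier to see.

```python
import torch
import torch.nn.functional as F
from math import sqrt

torch.manual_seed(0)
dim_k = 768
q = torch.randn(1, 5, dim_k)
k = torch.randn(1, 5, dim_k)   # independent of q (assumption for this example)

raw = torch.bmm(q, k.transpose(1, 2))   # unscaled dot products
scaled = raw / sqrt(dim_k)              # scaled as in scaled dot-product attention

w_raw = F.softmax(raw, dim=-1)
w_scaled = F.softmax(scaled, dim=-1)

print(w_scaled.sum(dim=-1))             # every row sums to 1
print(w_raw.max(dim=-1).values)         # largest weight per row, unscaled
print(w_scaled.max(dim=-1).values)      # largest weight per row, scaled
```

The largest weight in each row is never smaller without scaling than with it, which is the saturation that the scaling factor is meant to reduce.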

In [12]:
import torch.nn.functional as F

print("softmax applied, attention weights:\n")
weights = F.softmax(scores, dim=-1)
print(weights.size())
print(weights)

# softmax is taken over the last dimension, so each row sums to 1
print("\nsum of each row:\n")
weights.sum(dim=-1)

softmax applied, attention weights:

torch.Size([1, 5, 5])
tensor([[[1.0000e+00, 5.3127e-12, 9.2636e-12, 1.2397e-11, 6.8080e-12],
         [4.6343e-14, 1.0000e+00, 3.9701e-14, 3.5627e-14, 4.8393e-14],
         [2.7635e-12, 1.3577e-12, 1.0000e+00, 3.1207e-12, 2.5840e-12],
         [3.6332e-13, 1.1970e-13, 3.0659e-13, 1.0000e+00, 5.2800e-13],
         [7.9376e-13, 6.4682e-13, 1.0099e-12, 2.1005e-12, 1.0000e+00]]],
       grad_fn=<SoftmaxBackward0>)

sum of each row:


tensor([[1., 1., 1., 1., 1.]], grad_fn=<SumBackward1>)

#### 6. Update values 

Multiply the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention weights</mark>** matrix by the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">values</mark>** vector

In [13]:
attn_outputs = torch.bmm(weights, value)
print(attn_outputs)
print(attn_outputs.shape)

tensor([[[ 0.2257, -0.4307,  1.8177,  ...,  0.7229,  3.2706, -0.7792],
         [ 0.4286,  0.0174, -0.7029,  ..., -0.8039, -0.6916,  0.6626],
         [-0.3575, -0.6561,  0.5093,  ..., -1.2935,  0.0352,  1.3152],
         [-0.4761,  0.8055,  0.4843,  ...,  1.2622, -0.6694, -0.1447],
         [ 0.7571,  0.1967,  0.0324,  ...,  0.1879,  1.6343, -1.1001]]],
       grad_fn=<BmmBackward0>)
torch.Size([1, 5, 768])


In [14]:
def sdp_attention(query, key, value):
    '''
    Scaled dot-product attention
    scores  = query @ key.T / sqrt(dim_k)
    weights = softmax(scores)
    output  = weights @ value
    '''
    dim_k = query.size(-1)  # dimension of the query/key vectors
    sfact = sqrt(dim_k)     # scaling factor
    scores = torch.bmm(query, key.transpose(1,2)) / sfact
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)

## <b>3 <span style='color:#F1A424'>|</span> Multi-Head Self Attention</b> 

***

- The meaning of a word is better informed by **complementary words in the context** than by **identical words** (which receive a weight close to 1 when the query and key vectors are equal)

### <b><span style='color:#F1A424'>Simplistic Approach</span></b>

- We only used the embeddings "as is" to compute the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention scores</mark>** and **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention weights</mark>**

### <b><span style='color:#F1A424'>Better Approach</span></b>

- The **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">self-attention</mark>** layer applies **three independent linear transformations to each embedding** to generate **query**, **key** & **value**
- These transformations project the embeddings, and **each projection carries its own set of learnable parameters**
- This **allows the self-attention layer to focus on different semantic aspects of the sequence**



It's beneficial to have **multiple sets of linear projections** (each one represents an **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention head</mark>**)

- Why do we need more than one **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention head</mark>**?
- The **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">softmax</mark>** of one head tends to focus mostly on **one aspect of similarity**
- **Several heads** allow the model to **focus on several aspects at once**
- E.g. one head can focus on subject-verb interaction, while another finds nearby adjectives
- **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">CV analogy</mark>**: filters; one filter is responsible for detecting the head, another for facial features 

In [15]:
# nn.Linear applies a linear transformation to the incoming data:
#     y = x @ A.T + b   (x: input, A: weight matrix, b: bias, y: output)

# A single attention head: computes scaled dot-product attention
# It requires the embedding dimension and the head dimension
# Each attention head has its own q, k, v projections

class att(nn.Module):
    
    # initialisation 
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        
        # Define the three linear projections (query, key, value)
        # input - embed_dim, output - head_dim
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)

    # main class operation
    def forward(self, hidden_state):
        
        # calculate scaled dot-product attention from the projected query, key and value
        attn_outputs = sdp_attention(
            self.q(hidden_state), 
            self.k(hidden_state), 
            self.v(hidden_state))
        
        return attn_outputs

The `att` class (a single attention head) will be used in the construction of the model

- We’ve **initialised three independent linear layers** that apply matrix multiplication to the embedding vectors to produce tensors of shape [batch_size, seq_len, head_dim]
- Here head_dim is the number of dimensions we are projecting into
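
The relation between `embed_dim`, the number of heads and `head_dim` can be sketched as follows. In practice `head_dim` is usually chosen as `embed_dim // num_heads`, so that concatenating the outputs of all heads restores the original embedding size. The plain `nn.Linear` layers below are stand-ins for the per-head projections (an assumption for this example), not the `att` class itself.

```python
import torch
from torch import nn

torch.manual_seed(0)
embed_dim, num_heads = 768, 12       # values of bert-base
head_dim = embed_dim // num_heads    # 64

x = torch.randn(1, 5, embed_dim)     # (batch_size, seq_len, embed_dim)

# One stand-in projection per head: embed_dim -> head_dim
projections = [nn.Linear(embed_dim, head_dim) for _ in range(num_heads)]
per_head = [proj(x) for proj in projections]

# Concatenating along the last dimension gives back embed_dim features
merged = torch.cat(per_head, dim=-1)

print(per_head[0].shape)   # torch.Size([1, 5, 64])
print(merged.shape)        # torch.Size([1, 5, 768])
```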


In [16]:
print(config.num_attention_heads,'heads')
print(config.hidden_size,'hidden state embedding dimension')

12 heads
768 hidden state embedding dimension


In [17]:
embed_dim = config.hidden_size
num_heads = config.num_attention_heads

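# Initialise just one head (the token embeddings are only needed for the forward pass)
# Note: the second argument is head_dim; passing num_heads here gives a 12-dimensional head,
# whereas the multi-head layer below uses head_dim = embed_dim // num_heads = 64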
attention_outputs = att(embed_dim,num_heads)
attention_outputs

att(
  (q): Linear(in_features=768, out_features=12, bias=True)
  (k): Linear(in_features=768, out_features=12, bias=True)
  (v): Linear(in_features=768, out_features=12, bias=True)
)

In [18]:
class matt(nn.Module):
    
    # Config during initialisation
    def __init__(self, config):
        super().__init__()
        
        # model params, read from config file
        embed_dim = config.hidden_size
        num_heads = config.num_attention_heads
        head_dim = embed_dim // num_heads
        
        # attention heads (only defined here; the hidden state is passed in forward)
        # each head projects to head_dim = embed_dim // num_heads dimensions
        self.heads = nn.ModuleList(
            [att(embed_dim, head_dim) for _ in range(num_heads)])
        
        # output uses whole embedding dimension for output
        self.output_linear = nn.Linear(embed_dim, embed_dim)

    # Apply operation for multihead attention, requires hidden state (embeddings)
        
    def forward(self, hidden_state):
        
        # compute attention for each head (each output has size embed_dim // num_heads)
        heads = [head(hidden_state) for head in self.heads] 
        x = torch.cat(heads, dim=-1) # concatenate the head outputs along the last dim
    
        # apply the output linear transformation to the concatenated head outputs
        x = self.output_linear(x)
        return x

In [19]:
# Input into MultiHeadAttention (token embeddings for each token)
print(inputs_embeds)
print(inputs_embeds.size())

tensor([[[ 0.2257, -0.4307,  1.8177,  ...,  0.7229,  3.2706, -0.7792],
         [ 0.4286,  0.0174, -0.7029,  ..., -0.8039, -0.6916,  0.6626],
         [-0.3575, -0.6561,  0.5093,  ..., -1.2935,  0.0352,  1.3152],
         [-0.4761,  0.8055,  0.4843,  ...,  1.2622, -0.6694, -0.1447],
         [ 0.7571,  0.1967,  0.0324,  ...,  0.1879,  1.6343, -1.1001]]],
       grad_fn=<EmbeddingBackward0>)
torch.Size([1, 5, 768])


In [20]:
# The output differs on every run because the weights are randomly initialised
multihead_attn = matt(config) # initialisation with config
attn_output = multihead_attn(inputs_embeds) # forward by inputting embedding vectors (one for each token)

# Multi-head attention output (concatenated head outputs passed through the output linear layer)
print(attn_output)
print(attn_output.size())

tensor([[[-0.2287,  0.0604,  0.0362,  ...,  0.1805, -0.0274, -0.0454],
         [-0.2188,  0.0154,  0.0420,  ...,  0.1469, -0.0010, -0.0786],
         [-0.2651,  0.0482,  0.0657,  ...,  0.1920, -0.0163, -0.0702],
         [-0.3076,  0.0556,  0.0631,  ...,  0.2168, -0.0448, -0.0093],
         [-0.2418, -0.0258,  0.0101,  ...,  0.2184,  0.0147, -0.0526]]],
       grad_fn=<AddBackward0>)
torch.Size([1, 5, 768])


## <b>4 <span style='color:#F1A424'>|</span> Feed-Forward Layer</b> 

***

Position-wise feed-forward network 

- The **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">feed-forward</mark>** sublayer in the encoder & decoder is a simple **two-layer fully connected neural network**
- However, instead of processing the whole sequence of embeddings as a single vector, it **processes each embedding** independently, which is why it is often referred to as a **position-wise feed-forward layer**
- It is also sometimes described as a 1D convolution (Conv1D) with a kernel size of 1 (typically by people with a CV background)

A rule of thumb 

- The **hidden size** of the **1st layer** is **4x the size of the embeddings**, and a **GELU activation function** is most commonly used
- This is the place where most of the capacity & memorization is hypothesized to happen (it is the part most often scaled when scaling up the models)


A feed-forward layer (e.g. `nn.Linear`) is usually applied to a tensor of shape (batch_size, input_dim), where it acts on each element of the batch dimension independently
- This is actually true for any dimension except the last one, so when we pass a tensor of shape (batch_size, seq_len, hidden_dim) the layer is applied to all token embeddings of the batch and sequence independently, which is exactly what we want. 
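
Both statements can be checked with a small sketch: changing the other tokens does not change the output for the first token, and an `nn.Conv1d` with a kernel size of 1 that shares the same weights gives the same result as `nn.Linear`. The sizes 16 and 64 are arbitrary small values chosen for the example, not the BERT dimensions.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden, inter = 16, 64                 # small example sizes (assumption)
linear = nn.Linear(hidden, inter)
x = torch.randn(1, 5, hidden)          # (batch_size, seq_len, hidden_dim)

# 1) Tokens are processed independently: replace tokens 1..4, keep token 0
x_changed = x.clone()
x_changed[:, 1:] = torch.randn(1, 4, hidden)
same_first = torch.allclose(linear(x)[:, 0], linear(x_changed)[:, 0], atol=1e-5)

# 2) Equivalent Conv1d with kernel size 1, sharing the linear layer's weights
conv = nn.Conv1d(hidden, inter, kernel_size=1)
with torch.no_grad():
    conv.weight.copy_(linear.weight.unsqueeze(-1))   # (out, in) -> (out, in, 1)
    conv.bias.copy_(linear.bias)

# Conv1d expects (batch, channels, length), so swap the last two dims
conv_out = conv(x.transpose(1, 2)).transpose(1, 2)
same_as_conv = torch.allclose(linear(x), conv_out, atol=1e-5)

print(same_first, same_as_conv)
```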

In [21]:
class ff(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        self.linear_1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear_2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    # define layer operations input x
        
    def forward(self, x):    # note must be forward
        x = self.gelu(self.linear_1(x))
        x = self.linear_2(x)
        x = self.dropout(x)
        return x

In [22]:
# Example for bert model
ff(config)

ff(
  (linear_1): Linear(in_features=768, out_features=3072, bias=True)
  (linear_2): Linear(in_features=3072, out_features=768, bias=True)
  (gelu): GELU()
  (dropout): Dropout(p=0.1, inplace=False)
)

In [23]:
# Multi-head attention output (input to the feed-forward layer)
print(attn_output.shape)

torch.Size([1, 5, 768])


In [24]:
# requires config & the multi-head attention output (attn_output)

feed_forward = ff(config)     # initialise 
ff_outputs = feed_forward(attn_output) # forward
ff_outputs.size()

torch.Size([1, 5, 768])

## <b>5 <span style='color:#F1A424'>|</span> Normalisation Layers</b> 

***

The Transformer architecture uses **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">layer normalisation</mark>** & **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">skip connections</mark>**
- **normalisation** - normalises each input in the batch to have zero mean & unit variance
- **skip connections** - pass a tensor to the next layer of the model without processing it & add it to the processed tensor

There are two main approaches when it comes to the placement of the normalisation layer in the encoder & decoder:
- **post layer** normalisation (used in the original Transformer paper; layer normalisation is placed between the skip connections)
- **pre layer** normalisation (layer normalisation is placed within the span of the skip connection)

<br>

| `post-layer` normalisation | `pre-layer` normalisation |
| - | - |
| Tricky to train from scratch, as the gradients can diverge | Most commonly found arrangement in the literature |
| Used with LR warm-up (learning rate gradually increased from a small value to some maximum value during training) | Places layer normalisation within the span of the skip connection |
|  | Tends to be much more stable during training, and usually does not require any learning rate warm-up |
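
The difference between the two arrangements comes down to where the normalisation sits relative to the skip connection. The sketch below is a simplified illustration: `PreLNBlock` and `PostLNBlock` are hypothetical helper classes, and a plain `nn.Linear` stands in for the attention or feed-forward sublayer.

```python
import torch
from torch import nn

class PreLNBlock(nn.Module):
    """Pre-layer normalisation: normalise, apply the sublayer, then add the skip."""
    def __init__(self, hidden_size, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

class PostLNBlock(nn.Module):
    """Post-layer normalisation: add the skip first, then normalise the sum."""
    def __init__(self, hidden_size, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.sublayer = sublayer

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

torch.manual_seed(0)
hidden_size = 16                        # small example size (assumption)
x = torch.randn(1, 5, hidden_size)

pre = PreLNBlock(hidden_size, nn.Linear(hidden_size, hidden_size))
post = PostLNBlock(hidden_size, nn.Linear(hidden_size, hidden_size))

pre_out, post_out = pre(x), post(x)
print(pre_out.shape, post_out.shape)
# With post-LN the block output itself is normalised (mean ~0 per token)
print(post_out.mean(dim=-1))
```

In the pre-LN block the input `x` reaches the output unchanged through the skip path, which is one common explanation for why this arrangement tends to train more stably.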

In [25]:
class encoderLayer(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
        self.attention = matt(config)    # multihead attention layer 
        self.feed_forward = ff(config)        # feed forward layer

    def forward(self, x):
        
        # Apply layer norm. to hidden state, copy input into query, key, value
        # Apply attention with a skip connection
        x = x + self.attention(self.layer_norm_1(x))
        
        # Apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.layer_norm_2(x))
        
        return x

In [26]:
# Transformer layer output
encoder_layer = encoderLayer(config)

print('input',inputs_embeds.shape) 
print('output',encoder_layer(inputs_embeds).size())

input torch.Size([1, 5, 768])
output torch.Size([1, 5, 768])


We’ve now implemented our transformer encoder layer
- However, there is an issue with the way we set up the **encoder layers**:
  - they are totally **invariant to the position of the tokens**
  
  
- The multi-head attention layer is effectively a weighted sum, so the **information on token position is lost**
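
This loss of positional information can be demonstrated with a small sketch. It uses a bare self-attention function with query = key = value (an assumption that keeps the example short) and shows that shuffling the input tokens simply shuffles the outputs in the same way: each token gets the same representation no matter where it sits in the sequence.

```python
import torch
import torch.nn.functional as F
from math import sqrt

def self_attention(x):
    # Scaled dot-product self-attention with query = key = value = x
    scores = torch.bmm(x, x.transpose(1, 2)) / sqrt(x.size(-1))
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, x)

torch.manual_seed(0)
x = torch.randn(1, 5, 16)               # 5 tokens, no positional information
perm = torch.tensor([4, 2, 0, 3, 1])    # an arbitrary reordering of the tokens

out = self_attention(x)
out_perm = self_attention(x[:, perm])

# The output for the shuffled input is just the shuffled original output
print(torch.allclose(out[:, perm], out_perm, atol=1e-5))
```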


## <b>6 <span style='color:#F1A424'>|</span> Positional Embeddings</b> 

***

Incorporate positional information using `positional embeddings`:

- `positional embeddings` are based on a simple idea:
  - Modify the `token embeddings` with a **position-dependent pattern** of values arranged in a vector
  
  
- If the pattern is characteristic for each position, the attention heads and feed-forward layers in each stack can learn to incorporate positional information into their transformations
- There are several ways to achieve this, and one of the most popular approaches is to use a `learnable pattern`
- This works exactly the same way as the token embeddings, but using the **position index** instead of the **token identifier** (from vocabulary dictionary) as input
- An efficient way of encoding the positions of tokens is learned during pretraining

Creating a custom `Embedding` class

Let’s create a custom embeddings module, `tpEmbedding` (**token embeddings + positional embeddings**)
 - It combines a token embedding layer that projects the input_ids to a dense hidden state 
 - with a positional embedding layer that does the same for position_ids
 - The resulting embedding is simply the **sum of both embeddings**

In [27]:
# Token & Position Embedding Layer

class tpEmbedding(nn.Module):
    
    def __init__(self, config):        
        super().__init__()
        
        # token embedding layer
        self.token_embeddings = nn.Embedding(config.vocab_size,
                                             config.hidden_size)
        
        # positional embedding layer
        self.position_embeddings = nn.Embedding(config.max_position_embeddings,
                                                config.hidden_size)
        
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout()  # default p=0.5; config.hidden_dropout_prob (0.1) is an alternative

    def forward(self, input_ids):
        
        # Create position IDs for input sequence (on the same device as input_ids)
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, dtype=torch.long,
                                    device=input_ids.device).unsqueeze(0)
        
        # Create token and position embeddings
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        
        # Combine token and position embeddings
        embeddings = token_embeddings + position_embeddings
        
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings


In [28]:
# Token and Position Embeddings
embedding_layer = tpEmbedding(config)
embedding_layer(inputs.input_ids)

tensor([[[-0.8694, -1.6714,  0.7159,  ...,  1.1703,  1.5813,  2.1954],
         [-0.0000,  0.0672, -0.0000,  ...,  1.5110, -0.0000,  0.0000],
         [ 1.1838,  0.3636,  0.0000,  ...,  3.7152,  1.0541, -0.0000],
         [ 0.5582,  0.5616, -0.0000,  ..., -0.7990, -0.1277, -2.2744],
         [-0.7716,  1.9271,  0.0000,  ...,  1.1052,  0.0000,  0.0000]]],
       grad_fn=<MulBackward0>)

## <b>7 <span style='color:#F1A424'>|</span> Putting it all Together</b> 

***

Constructing the Transformer **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">encoder</mark>**, combining the embedding layer (`tpEmbedding`) and the encoder layers (`encoderLayer`)

In [29]:
# full transformer encoder combining the `tpEmbedding` layer with a stack of `encoderLayer` layers

class TransformerEncoder(nn.Module):
    
    def __init__(self, config):       
        super().__init__()
        
        # token & positional embedding layer
        self.embeddings = tpEmbedding(config)
        
        # stack of encoder layers (multi-head attention + feed-forward)
        self.layers = nn.ModuleList([encoderLayer(config)
                                     for _ in range(config.num_hidden_layers)])

    def forward(self, x):
        
        # embeddings
        x = self.embeddings(x)
        
        for layer in self.layers:
            x = layer(x)
        return x

In [30]:
encoder = TransformerEncoder(config)
encoder_output = encoder(inputs.input_ids)
encoder_output

tensor([[[ 3.7086, -2.3515, -1.4191,  ...,  1.4979, -3.3497,  0.9201],
         [ 2.0326,  1.8728, -5.9263,  ..., -1.3855,  0.0124,  2.9590],
         [ 1.9051,  2.1667,  0.6003,  ..., -0.6429,  1.2111, -2.7430],
         [ 1.5930, -0.1081, -2.5242,  ..., -2.8394, -0.5158,  0.1820],
         [ 1.9846,  0.8312,  3.4625,  ..., -1.2064, -0.3385,  0.1955]]],
       grad_fn=<AddBackward0>)

In [31]:
# hidden state for each token in a batch
encoder_output.size()

torch.Size([1, 5, 768])

## <b>8 <span style='color:#F1A424'>|</span> Classification Head</b> 

***

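Quite often, transformers are divided into:
- Task independent body (`TransformerEncoder`)
- Task dependent head (the dropout & linear layers in `TransformerClassifier`)


- We have a hidden state for each token (5 tokens x 768 dimensions) at the output, but need to make only one prediction per sequence


- The hidden state of the first token is often used for the prediction; in BERT-style models this is the **[CLS] token**
  - we can attach a dropout and a linear layer to it to make a classification prediction
  - note that the input here was tokenised with `add_special_tokens=False`, so position 0 holds the first word token rather than [CLS]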

In [32]:
class TransformerClassifier(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        
        # Transformer Encoder
        self.encoder = TransformerEncoder(config)
        
        # Classification Head
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, x):
        x = self.encoder(x)[:, 0, :] # select hidden state of the first token ([CLS] when special tokens are added)
        x = self.dropout(x)
        x = self.classifier(x)
        return x

- For each sample in the batch we get the **unnormalized logits** for each class in the output, in the same form as the output of a BERT classification model

In [33]:
config.num_labels = 3
encoder_classifier = TransformerClassifier(config)
output = encoder_classifier(inputs.input_ids)
output

tensor([[-0.5311,  1.9373, -0.6955]], grad_fn=<AddmmBackward0>)
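
The logits can be turned into class probabilities with a softmax, and the predicted class is the index of the largest value. The sketch below reuses the logits printed above as a fixed example tensor. Since the model is untrained and randomly initialised, the resulting prediction carries no meaning; it only illustrates the mechanics.

```python
import torch
import torch.nn.functional as F

# Logits copied from the output above (one sample, three classes)
logits = torch.tensor([[-0.5311, 1.9373, -0.6955]])

probs = F.softmax(logits, dim=-1)   # normalise over the class dimension
pred = probs.argmax(dim=-1)         # index of the most probable class

print(probs)
print(probs.sum(dim=-1))
print(pred)                         # tensor([1])
```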