## Transformers anatomy

In [1]:
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show

In [2]:
model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)


In [3]:
text = "time flies like an arrow"
show(model, "bert", tokenizer, text, display_mode="light", layer=0, head=8)

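Everything starts with generating dense embeddings for each token in the sequence. We pass add_special_tokens=False to the tokenizer so that the [CLS] and [SEP] tokens are left out for simplicity.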

In [4]:
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
print(inputs.input_ids)

tensor([[ 2051, 10029,  2066,  2019,  8612]])


In [5]:
from torch import nn
from transformers import AutoConfig

In [6]:
config = AutoConfig.from_pretrained(model_ckpt)
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)
print(token_emb)

Embedding(30522, 768)


Above we loaded the ```config.json``` file associated with the ```bert-base-uncased``` model checkpoint. In Transformers, every checkpoint has such a configuration file, which specifies hyperparameters such as the vocabulary size and the hidden dimension.

In [7]:
inputs_embeds = token_emb(inputs.input_ids)
inputs_embeds.size()

torch.Size([1, 5, 768])
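
To make the lookup step concrete, here is a minimal sketch of what an embedding layer does. The vocabulary size of 10 and the embedding size of 4 are toy values chosen for illustration, not the BERT settings used above. The point is only that each token ID selects one row of a learnable weight matrix, so repeated IDs map to identical vectors.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy sizes (assumptions for illustration only)
emb = nn.Embedding(num_embeddings=10, embedding_dim=4)

# One sequence of three token IDs; ID 3 appears twice
ids = torch.tensor([[3, 7, 3]])
vectors = emb(ids)

print(vectors.shape)
# The embedding of ID 3 is simply row 3 of the weight matrix
print(torch.equal(vectors[0, 0], emb.weight[3]))
# The same ID always yields the same vector, regardless of position
print(torch.equal(vectors[0, 0], vectors[0, 2]))
```

The last check also hints at a limitation discussed later: token embeddings on their own carry no information about where a token sits in the sequence.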

In [8]:
import torch
from math import sqrt

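# Use the same embeddings for query, key and value (no learned projections yet)
query = key = value = inputs_embeds
dim_k = key.size(-1)
# Dot product between every pair of tokens, scaled by sqrt(dim_k)
scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
scores.size()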

torch.Size([1, 5, 5])

As expected, this creates a 5x5 matrix of attention scores for each sample in the batch.
We will see later that the query, key and value vectors are generated by applying independent weight matrices Wq, Wk and Wv to the embeddings, but for now we have kept them equal for simplicity. In scaled dot-product attention, the dot products are divided by the square root of the key dimension so that we don't get too many large values during training, which could cause the softmax we apply next to saturate.
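
The effect of the scaling factor can be illustrated with a small experiment. The sketch below uses random vectors rather than the BERT embeddings from this notebook, and the batch size of 16 is an arbitrary choice to average out noise. It compares the softmax of raw dot products with the softmax of scaled ones.

```python
import torch
import torch.nn.functional as F
from math import sqrt

torch.manual_seed(0)

dim_k = 768
# Random stand-ins for queries and keys (assumption: unit-variance entries)
q = torch.randn(16, 5, dim_k)
k = torch.randn(16, 5, dim_k)

raw = torch.bmm(q, k.transpose(1, 2))   # standard deviation grows like sqrt(dim_k)
scaled = raw / sqrt(dim_k)              # standard deviation close to 1

raw_weights = F.softmax(raw, dim=-1)
scaled_weights = F.softmax(scaled, dim=-1)

# Average of the largest attention weight in each row
raw_peak = raw_weights.max(dim=-1).values.mean()
scaled_peak = scaled_weights.max(dim=-1).values.mean()

print(f"score std  raw: {raw.std():.2f}  scaled: {scaled.std():.2f}")
print(f"mean peak  raw: {raw_peak:.2f}  scaled: {scaled_peak:.2f}")
```

Without scaling, most rows put nearly all of their weight on a single key, which leaves very small gradients for the other positions. With scaling, the weights stay spread out.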

In [13]:
import torch.nn.functional as F

weights = F.softmax(scores, dim=-1)
weights.sum(dim=-1)

tensor([[1., 1., 1., 1., 1.]], grad_fn=<SumBackward1>)

In [15]:
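def scaled_dot_product(query, key, value):
    dim_k = key.size(-1)
    # Similarity between every query and every key, scaled by sqrt(dim_k)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    # Normalize over the keys so that each row of weights sums to 1
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the value vectors
    return torch.bmm(weights, value)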

In [38]:
class AttentionHead(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)
    
    def forward(self, hidden_state):
        # Project the input into separate query, key and value spaces
        attn_outputs = scaled_dot_product(
            self.q(hidden_state), self.k(hidden_state), self.v(hidden_state)
        )
        return attn_outputs
        

In [45]:
class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        embed_dim = config.hidden_size
        num_heads = config.num_attention_heads
        # Each head works in a smaller space: 768 // 12 = 64 for BERT base
        head_dim = embed_dim // num_heads
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_dim) for _ in range(num_heads)]
        )
        self.output_linear = nn.Linear(embed_dim, embed_dim)

    def forward(self, hidden_state):
        # Concatenate the head outputs along the hidden dimension
        x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)
        x = self.output_linear(x)
        return x
        

In [46]:
multihead_attn = MultiHeadAttention(config)
attn_output = multihead_attn(inputs_embeds)
attn_output.size()

torch.Size([1, 5, 768])

In [57]:
from bertviz import head_view
from transformers import AutoModel

model = AutoModel.from_pretrained(model_ckpt, output_attentions=True)
sentence_a = "time flies like an arrow"
sentence_b = "fruit flies like a banana"

viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
attention = model(**viz_inputs).attentions
sentence_b_start = (viz_inputs.token_type_ids == 0).sum(dim=-1)
tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])

head_view(attention, tokens, sentence_b_start, heads=[8])


This visualization shows attention weights as lines connecting the token whose embedding is being updated (left) with every token that is being attended to (right). The intensity of the lines indicates the strength of the attention weights, with dark lines representing values close to 1 and faint lines representing values close to 0.

The feed-forward sublayer in the transformer encoder and decoder is just a two-layer fully connected neural network, but with a twist. Instead of processing the whole sequence of embeddings as a single vector, it processes each embedding independently. For this reason, this layer is often referred to as a position-wise feed-forward layer. You may also see it described as a one-dimensional convolution with a kernel size of one. A rule of thumb from the literature is for the hidden size of the first layer to be four times the size of the embeddings, with GELU as the activation function.
This is where most of the capacity and memorization is hypothesized to happen, and it is the part that is most often scaled when scaling up the models. We can implement this as a simple nn.Module:

In [62]:
class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Expand to intermediate_size (4 x hidden_size in BERT base), then project back
        self.linear1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
    
    def forward(self, x):
        x = self.linear1(x)
        x = self.gelu(x)
        x = self.linear2(x)
        x = self.dropout(x)
        return x

In [63]:
feed_forward = FeedForward(config)
ff_outputs = feed_forward(attn_output)
ff_outputs.size()

torch.Size([1, 5, 768])
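
The position-wise behaviour described above can be checked directly. The following sketch uses small toy sizes (a hidden size of 8 and a sequence of 5 tokens, both assumptions for illustration). It compares three ways of applying the same linear layer: to the whole sequence at once, to one token at a time, and as a `Conv1d` with a kernel size of one that shares the same weights.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden_size, seq_len = 8, 5  # toy sizes for illustration
linear = nn.Linear(hidden_size, 4 * hidden_size)
x = torch.randn(1, seq_len, hidden_size)

# 1) Apply the layer to the whole sequence in one call
full = linear(x)

# 2) Apply the layer to each token separately and stack the results
per_token = torch.stack([linear(x[:, i]) for i in range(seq_len)], dim=1)

# 3) Equivalent 1D convolution with kernel size 1 and the same parameters
conv = nn.Conv1d(hidden_size, 4 * hidden_size, kernel_size=1)
with torch.no_grad():
    conv.weight.copy_(linear.weight.unsqueeze(-1))  # (out, in) -> (out, in, 1)
    conv.bias.copy_(linear.bias)
# Conv1d expects (batch, channels, length), so swap the last two axes
conv_out = conv(x.transpose(1, 2)).transpose(1, 2)

print(torch.allclose(full, per_token, atol=1e-5))
print(torch.allclose(full, conv_out, atol=1e-5))
```

Because no information flows between positions in this sublayer, any mixing across tokens has to come from the attention sublayer.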

# Add Layer Normalization

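The transformer makes use of layer normalization and skip connections. The former normalizes each input in the batch to have zero mean and unit variance. Skip connections pass a tensor to the next layer of the model without processing it and add it to the processed tensor. In the TransformerEncoderLayer implementation that follows, the normalization is applied before each sublayer, an arrangement known as pre layer normalization.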
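
As a quick sanity check of both ideas, the sketch below normalizes some deliberately shifted and rescaled random activations and then applies a skip connection. The `sublayer` here is a hypothetical stand-in (a single linear layer) for the attention or feed-forward block. Note that `nn.LayerNorm(hidden_size)` computes its statistics over the hidden dimension of each token.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden_size = 768
layer_norm = nn.LayerNorm(hidden_size)

# Toy activations with mean around 2 and standard deviation around 3
x = 3.0 * torch.randn(1, 5, hidden_size) + 2.0
normed = layer_norm(x)

# Per-token statistics over the hidden dimension: close to 0 and 1
print(normed.mean(dim=-1))
print(normed.var(dim=-1, unbiased=False))

# Skip connection: add the unprocessed input to the processed tensor
sublayer = nn.Linear(hidden_size, hidden_size)  # hypothetical stand-in sublayer
out = x + sublayer(normed)
print(out.shape)
```

At initialization the learnable scale and shift of `nn.LayerNorm` are 1 and 0, which is why the statistics come out as plain zero mean and unit variance here.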

In [66]:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layer_norm1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm2 = nn.LayerNorm(config.hidden_size)
        self.attention = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)
    
    def forward(self, x):
        hidden_state = self.layer_norm1(x)
        
        # Apply attention with a skip connection
        x = x + self.attention(hidden_state)
        
        # Apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.layer_norm2(x))
        return x
        

In [69]:
encoder_layer = TransformerEncoderLayer(config)
inputs_embeds.shape, encoder_layer(inputs_embeds).size()

(torch.Size([1, 5, 768]), torch.Size([1, 5, 768]))

We have now implemented our very first transformer encoder layer, but there is a caveat: as it stands, the layer has no notion of token order. Reordering the input tokens simply reorders the outputs in the same way. When modeling language, token positions matter, since the meaning of a text can change completely if the order of the words changes.
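
This lack of order awareness is easy to demonstrate. The sketch below uses a simplified self-attention function without learned projections (query, key and value are all the input itself, as in the earlier cells) and random toy inputs. It shows that shuffling the tokens only shuffles the outputs.

```python
import torch
import torch.nn.functional as F
from math import sqrt

def self_attention(x):
    # Scaled dot-product attention with query = key = value = x
    scores = torch.bmm(x, x.transpose(1, 2)) / sqrt(x.size(-1))
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, x)

torch.manual_seed(0)
x = torch.randn(1, 5, 16)             # toy batch: 5 tokens, 16 dimensions
perm = torch.tensor([4, 2, 0, 3, 1])  # an arbitrary reordering of the tokens

out = self_attention(x)
out_shuffled = self_attention(x[:, perm])

# Attending over shuffled tokens gives the original outputs, shuffled the same way
print(torch.allclose(out[:, perm], out_shuffled, atol=1e-5))
```

Each token receives exactly the same representation wherever it appears, so the model cannot tell "time flies" from "flies time" unless position information is added to the inputs.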

# Positional embeddings

Positional embeddings are based on a simple yet very effective idea: augment the token embeddings with a position-dependent pattern of values arranged in a vector. If the pattern is characteristic for each position, the attention heads and feed-forward layers in each stack can learn to incorporate positional information into their transformations.

There are several ways to achieve this, and one of the most popular approaches is to use a learnable pattern, especially when the pretraining dataset is sufficiently large. This works exactly the same way as the token embeddings, but uses the position index instead of the token ID as input. With this approach, an efficient way of encoding the positions of tokens is learned during pretraining.

Let's create a custom Embeddings module that combines a token embedding layer, which projects the input_ids to a dense hidden state, with a positional embedding that does the same for position_ids. The resulting embedding is simply the sum of both embeddings.

In [71]:
config.max_position_embeddings

512
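
The value 512 is the number of positions for which the checkpoint has a learned embedding, so it sets the size of the position lookup table. Before building a full module, the core idea can be sketched in a few lines. The vocabulary size of 100, the hidden size of 16 and the variable names below are illustrative assumptions, not the BERT configuration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy sizes (assumptions); max_positions mirrors config.max_position_embeddings
vocab_size, max_positions, hidden_size = 100, 512, 16
token_emb = nn.Embedding(vocab_size, hidden_size)
position_emb = nn.Embedding(max_positions, hidden_size)

# Token 7 appears at positions 0 and 2
input_ids = torch.tensor([[7, 42, 7]])
# Position indices 0, 1, 2 with a leading batch dimension
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)

tokens_only = token_emb(input_ids)
with_positions = tokens_only + position_emb(position_ids)

# Without positions, both occurrences of token 7 are identical
print(torch.equal(tokens_only[0, 0], tokens_only[0, 2]))
# With positions added, the two occurrences are distinguishable
print(torch.equal(with_positions[0, 0], with_positions[0, 2]))
```

This sketch leaves out details such as layer normalization and dropout on the summed embeddings, and it only covers the lookup-and-sum step described in the text.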