<a href="https://colab.research.google.com/github/soutrik71/MInMaxBERT/blob/main/notebook/BERTEmbeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The objective of this notebook is to experiment with the tokenization provided by the BERT model.

## BERT Word Embeddings

This tutorial shows how to use BERT to extract word and sentence embeddings from text. These embeddings are useful for:

* Keyword/search expansion, semantic search, and information retrieval. Because the embeddings capture contextual meaning, a customer query can be matched to relevant content even when the two share no keywords or phrases.

* Feature inputs for downstream NLP models such as LSTMs or CNNs. Traditional approaches like Word2Vec assign each word a single static embedding regardless of context, whereas BERT adjusts each word's representation based on the surrounding words, which can give downstream models more informative features. For example, consider the sentences "The man was accused of robbing a bank" and "The man went fishing by the bank of the river." Word2Vec assigns "bank" the same embedding in both sentences; BERT produces a different embedding for each.

In [1]:
!pip install transformers



In [7]:
import torch
from transformers import BertTokenizer, BertModel

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
#logging.basicConfig(level=logging.INFO)

import matplotlib.pyplot as plt
import os
import pandas as pd
import numpy as np

In [8]:
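# Never hardcode access tokens in a notebook. `bert-base-uncased` is a public
# checkpoint, so no token is needed here. For gated models, set HF_TOKEN in the
# environment before starting the notebook instead of writing it in a cell.
# os.environ["HF_TOKEN"] = "<your-token>"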

In [9]:
# load pre-trained model tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

To prepare our input data for BERT, we need to adhere to specific formatting requirements:

* Special Tokens: We require special tokens like [SEP] to indicate the end of a sentence or the separation between two sentences, and [CLS] at the beginning of our text, which BERT expects regardless of the task.

* Tokens from BERT's Vocabulary: Our input tokens should be chosen from the fixed vocabulary used by BERT.

* Token IDs: We need to convert our tokens into their corresponding token IDs using BERT's tokenizer.

* Mask IDs: We use mask IDs to distinguish between actual tokens and padding elements in the sequence.

* Segment IDs: These are used to differentiate between different sentences in the input sequence.

* Positional Embeddings: These embeddings indicate the position of each token within the sequence.

Although the transformers interface offers a convenient way to handle these requirements through functions like tokenizer.encode_plus, we'll perform most of these steps manually for the sake of introducing working with BERT.

BERT can take as input either one or two sentences, and uses the special token [SEP] to differentiate them. The [CLS] token always appears at the start of the text, and is specific to classification tasks.
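As a rough illustration of the special-token and mask requirements above, the sketch below wraps one or two already-split token lists in [CLS]/[SEP] and pads them to a fixed length. The helper name `build_inputs`, its `max_len` parameter, and the example tokens are assumptions made for this sketch; it uses plain Python lists and is not the tokenizer's own implementation.

```python
def build_inputs(tokens_a, tokens_b=None, max_len=12):
    """Wrap one or two token lists in BERT's special tokens and pad to max_len.

    Hypothetical helper for illustration only.
    """
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    if tokens_b is not None:
        # A second sentence is appended and closed with its own [SEP].
        tokens += list(tokens_b) + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("sequence is longer than max_len")

    n_pad = max_len - len(tokens)
    # Mask IDs: 1 for real tokens, 0 for padding.
    attention_mask = [1] * len(tokens) + [0] * n_pad
    tokens = tokens + ["[PAD]"] * n_pad
    return tokens, attention_mask


single_tokens, single_mask = build_inputs(["here", "is", "a", "sentence"], max_len=8)
pair_tokens, pair_mask = build_inputs(["how", "are", "you"], ["fine", "thanks"], max_len=10)

print(single_tokens)
print(single_mask)
print(pair_tokens)
print(pair_mask)
```

In this notebook the input is a single unpadded sentence, so the mask would be all 1s and can be left out.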

In [10]:
text = "Here is the sentence I want embeddings for."
marked_text = "[CLS] " + text + " [SEP]"

In [11]:
tokenized_text = tokenizer.tokenize(marked_text)
print(tokenized_text)

['[CLS]', 'here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.', '[SEP]']


### WordPiece Model

The smaller subwords and the pieces prefixed with two hash signs (##) come from the BERT tokenizer's WordPiece model:

* WordPiece Model: BERT's tokenizer builds a vocabulary of individual characters, subwords, and whole words that best fits the language data it was trained on.

* Vocabulary Size Limit: The vocabulary is fixed at roughly 30,000 tokens (30,522 entries for bert-base-uncased). WordPiece fills it with the most common English words, subwords, and characters.

* Composition of Vocabulary:
    * Whole words
    * Subwords occurring at the beginning of a word or independently (e.g., "em" in "embeddings" shares the same vector as the standalone "em" in "go get em")
    * Subwords not at the beginning of a word, denoted by a preceding '##'
    * Individual characters

* Tokenization Process: When tokenizing a word, the tokenizer first checks whether the entire word exists in the vocabulary. If not, it decomposes the word into the largest possible subwords found in the vocabulary. As a last resort, it decomposes the word into individual characters.

* Handling Out-of-Vocabulary Words: Instead of mapping out-of-vocabulary words to a generic unknown token, BERT decomposes them into subword and character tokens. This retains some of the meaning of the original word.

* Representation of Out-of-Vocabulary Words: For instance, the word "embeddings" is split into the subword tokens ['em', '##bed', '##ding', '##s']. Because each subword keeps its own embedding, the vector for the original word can be approximated, for example by averaging the subword vectors.

Overall, this tokenization strategy lets BERT handle out-of-vocabulary words more effectively, preserving some information about them rather than treating them as entirely unknown.
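The tokenization process described above can be sketched as a greedy longest-match-first search over a vocabulary. This is a simplified illustration, not the library's code: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made for this example, and the real tokenizer also handles lowercasing, punctuation splitting, and a maximum word length. The sketch falls back to `[UNK]` only when no piece matches at some position, not even a single character.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word (illustrative sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the ## prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # Nothing in the vocabulary fits at this position.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (hypothetical); the real one has about 30,000 entries.
toy_vocab = {"em", "##bed", "##ding", "##s", "sentence", "e", "##m"}

print(wordpiece_tokenize("embeddings", toy_vocab))  # ['em', '##bed', '##ding', '##s']
print(wordpiece_tokenize("sentence", toy_vocab))    # ['sentence']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

Note that "em" wins over the shorter "e" at the start of "embeddings" because longer candidates are tried first.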

In [14]:
list(tokenizer.vocab.keys())[:10], list(tokenizer.vocab.keys())[-10:]

(['[PAD]',
  '[unused0]',
  '[unused1]',
  '[unused2]',
  '[unused3]',
  '[unused4]',
  '[unused5]',
  '[unused6]',
  '[unused7]',
  '[unused8]'],
 ['##！', '##（', '##）', '##，', '##－', '##．', '##／', '##：', '##？', '##～'])

After breaking the text into tokens, we convert the sentence from a list of strings to a list of vocabulary indices.
In the output, ID 101 is the opening [CLS] token and 102 is the closing [SEP] token.

In [15]:
index_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
print(index_tokens)

[101, 2182, 2003, 1996, 6251, 1045, 2215, 7861, 8270, 4667, 2015, 2005, 1012, 102]


### Segment IDs

BERT is trained on and expects sentence pairs, using 0s and 1s to distinguish between the two sentences. That is, for each token in `tokenized_text` we must specify which sentence it belongs to: sentence 0 (a series of 0s) or sentence 1 (a series of 1s). A single-sentence input only needs one constant segment ID for every token; here we create a vector of 1s, one per token. (The Hugging Face tokenizer itself assigns 0s to a single sentence.)

If you want to process two sentences, assign each token of the first sentence, including [CLS] and the first [SEP], a 0, and all tokens of the second sentence a 1.

Segment IDs are not the same thing as the mask IDs used in a classification model: mask IDs separate real tokens from padding, while segment IDs say which sentence a token belongs to. For the single unpadded sentence in this notebook, both happen to be a vector of 1s.
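For the two-sentence case, segment IDs can be derived from the position of the first [SEP]. The sketch below is a minimal illustration on plain token lists; `make_segment_ids` and the example tokens are hypothetical names chosen for this example. For a single sentence it yields all 0s, which differs from the vector of 1s built in the next cell.

```python
def make_segment_ids(tokens):
    """Return 0 for tokens up to and including the first [SEP], 1 afterwards."""
    segment_ids = []
    current = 0
    for tok in tokens:
        segment_ids.append(current)
        if tok == "[SEP]":
            # Everything after the first separator belongs to sentence 1.
            current = 1
    return segment_ids


pair = ["[CLS]", "how", "are", "you", "[SEP]", "fine", "thanks", "[SEP]"]
pair_segment_ids = make_segment_ids(pair)
print(pair_segment_ids)  # [0, 0, 0, 0, 0, 1, 1, 1]
```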

In [17]:
segments_ids = [1] * len(tokenized_text)
print(segments_ids)

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


### Extracting Embeddings

Next, run the inputs through the model and extract the hidden states we need.

In [19]:
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([index_tokens])
segments_tensors = torch.tensor([segments_ids])

In [20]:
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode, which disables dropout so the forward pass is deterministic.
model.eval()


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [21]:
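# Run the text through BERT, and collect all of the hidden states produced
# from all 12 layers. Pass the segment IDs by keyword: the second positional
# argument of BertModel.forward is `attention_mask`, not `token_type_ids`.
with torch.no_grad():
  outputs = model(tokens_tensor, token_type_ids=segments_tensors)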


In [25]:
outputs.keys()

odict_keys(['last_hidden_state', 'pooler_output', 'hidden_states'])

In [33]:
hidden_states = outputs["hidden_states"]

In [35]:
len(hidden_states)

13

In [32]:
outputs['last_hidden_state'].shape

torch.Size([1, 14, 768])

The full set of hidden states for this model, stored in the object hidden_states, is a little dizzying. It is a tuple of tensors that can be indexed along four dimensions, in the following order:

1. The layer number (13 layers)
2. The batch number (1 sentence)
3. The word / token number (14 tokens in our sentence)
4. The hidden unit / feature number (768 features)

Wait, 13 layers? Doesn't BERT only have 12? It's 13 because the first element is the input embeddings; the rest are the outputs of each of BERT's 12 layers.

In [37]:
print ("Number of layers:", len(hidden_states), "  (initial embeddings + 12 BERT layers)")
layer_i = 0

print ("Number of batches:", len(hidden_states[layer_i]))
batch_i = 0

print ("Number of tokens:", len(hidden_states[layer_i][batch_i]))
token_i = 0

print ("Number of hidden units:", len(hidden_states[layer_i][batch_i][token_i]))

print(f"length of the actual input token is {len(index_tokens)}")

Number of layers: 13   (initial embeddings + 12 BERT layers)
Number of batches: 1
Number of tokens: 14
Number of hidden units: 768
length of the actual input token is 14


So a batch of 14 tokens has been converted into embeddings with 768 abstract features each.

Current dimensions:

[# layers, # batches, # tokens, # features]

Desired dimensions:

[# tokens, # layers, # features]

In [38]:
# `hidden_states` holds one torch tensor per layer. The first tensor is the output
# of the embedding layer, which is then fed to the 12 transformer layers.
print('Tensor shape for each layer: ', hidden_states[0].size())

Tensor shape for each layer:  torch.Size([1, 14, 768])


In [45]:
# Concatenate the tensors for all layers. We use `stack` here to
# create a new dimension in the tensor.
token_embeddings = torch.stack(hidden_states, dim=0)

token_embeddings.size() # 13 layers of concatenated representation of the sentence

torch.Size([13, 1, 14, 768])

In [46]:
# Remove dimension 1, the "batches", as it is not needed
token_embeddings = torch.squeeze(token_embeddings, dim=1)

token_embeddings.size()

torch.Size([13, 14, 768])

In [47]:
# now what we want is representation of each token across all the layers
# Swap dimensions 0 and 1.
token_embeddings = token_embeddings.permute(1,0,2)
token_embeddings.size()

torch.Size([14, 13, 768])

Now, what do we do with these hidden states? We would like to get individual vectors for each of our tokens, or perhaps a single vector representation of the whole sentence, but for each token of our input we have 13 separate vectors each of length 768.

In order to get the individual vectors we will need to combine some of the layer vectors…but which layer or combination of layers provides the best representation?

### Word Vectors

Create a vector representation for each token by combining layers. Two simple options are to concatenate or to sum the last four layers.

In [68]:
# Stores the token vectors
token_vecs_cat = []

# For each token in the sentence...
# `token_embeddings` is [14 x 13 x 768], so each `token` is [13 x 768].
for token in token_embeddings:
    # Concatenate the vectors (that is, append them together) from the last
    # four layers.
    # Each layer vector is 768 values, so `cat_vec` is length 3,072 for last 4
    cat_vec = torch.cat((token[-1], token[-2], token[-3], token[-4]), dim=0)
    # alt :: token[-4:].reshape(1,-1).squeeze() gives the same 3,072 values, but
    # with the layers in ascending order (-4, -3, -2, -1) rather than (-1, -2, -3, -4)

    # Use `cat_vec` to represent `token`.
    token_vecs_cat.append(cat_vec)

print ('Shape is: %d x %d' % (len(token_vecs_cat), len(token_vecs_cat[0])))

Shape is: 14 x 3072


In [70]:
# for each token we have a representation of 3072 elements
len(token_vecs_cat) , token_vecs_cat[0].shape

(14, torch.Size([3072]))

In [71]:
# Stores the token vectors
token_vecs_sum = []
# For each token in the sentence...
for token in token_embeddings:

  # Sum the vectors from the last four layers.
  sum_vec = torch.sum(token[-4:], dim=0)
  # Use `sum_vec` to represent `token`.
  token_vecs_sum.append(sum_vec)

print ('Shape is: %d x %d' % (len(token_vecs_sum), len(token_vecs_sum[0])))

Shape is: 14 x 768


### Sentence Vectors

To get a single vector for the entire sentence there are multiple application-dependent strategies, but a simple approach is to average the second-to-last hidden layer over all tokens, producing a single vector of length 768.

In [75]:
# `hidden_states` has shape [13 x 1 x 14 x 768]

# `token_vecs` is a tensor with shape [14 x 768]
token_vecs = hidden_states[-2][0]

# Calculate the average of all 14 token vectors.
sentence_embedding = torch.mean(token_vecs, dim=0)

In [76]:
print ("Our final sentence embedding vector of shape:", sentence_embedding.size())

Our final sentence embedding vector of shape: torch.Size([768])


### Pooling Strategy & Layer Choice

In the BERT authors' experiments, which fed different layer combinations as input features to a named entity recognition model, concatenating the last four layers produced the best results. Many of the other methods came in a close second, though, so it is advisable to test different versions for your specific application: results may vary.

One reason is that the different layers of BERT encode very different kinds of information, so the appropriate pooling strategy depends on the application.
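To make it easier to compare strategies, the layer choice can be isolated in one small function. The sketch below uses plain Python lists and a toy token with 5 "layers" of 2 features so it runs without the model; `combine_layers` and its strategy names are hypothetical, chosen for this example. With real tensors the same ideas map onto indexing, `torch.sum`, and `torch.cat` as used earlier in the notebook.

```python
def combine_layers(layer_vectors, strategy="sum_last_four"):
    """Combine one token's per-layer vectors (a list of equal-length lists)."""
    if strategy == "last":
        return list(layer_vectors[-1])
    if strategy == "second_to_last":
        return list(layer_vectors[-2])
    if strategy == "sum_last_four":
        # Element-wise sum across the last four layers.
        return [sum(values) for values in zip(*layer_vectors[-4:])]
    if strategy == "concat_last_four":
        # Layers are appended in ascending order (-4, -3, -2, -1).
        return [v for layer in layer_vectors[-4:] for v in layer]
    raise ValueError(f"unknown strategy: {strategy}")


# Toy token: 5 layers x 2 features (BERT-base would be 13 x 768).
toy_token = [[0.0, 0.0], [1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]

for name in ("last", "second_to_last", "sum_last_four", "concat_last_four"):
    print(name, combine_layers(toy_token, name))
```

Keeping the strategy as a parameter means a downstream evaluation can loop over the options and pick whichever works best for the task at hand.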

http://jalammar.github.io/images/bert-feature-extraction-contextualized-embeddings.png

https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/
https://colab.research.google.com/drive/1yFphU6PW9Uo6lmDly_ud9a6c4RCYlwdX#scrollTo=E_t4cM6KLc98
https://colab.research.google.com/drive/1fCKIBJ6fgWQ-f6UKs7wDTpNTL9N-Cq9X