<a href="https://colab.research.google.com/github/rahiakela/getting-started-with-google-bert/blob/main/3-getting-hands-on-with-BERT/1_extracting_embeddings_from_pre_trained_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Extracting embeddings from pre-trained BERT

Pre-training BERT from scratch is computationally expensive, so in practice we download a pre-trained BERT model and use it. Google has open-sourced its pre-trained BERT models, and we can download them from Google Research's GitHub repository at https://github.com/google-research/bert. They have released the pre-trained BERT model in various configurations.

The pre-trained model is also available in the BERT-uncased and BERT-cased formats. In BERT-uncased, all the tokens are lowercased, but in BERT-cased, the tokens are not lowercased and are used directly for training. 

The BERT-uncased model is the one that is most commonly used, but if we are working on certain tasks such as **Named Entity Recognition (NER)** where we have to preserve the case, then we should use the BERT-cased model. Along with these, Google also released pre-trained BERT models trained using the whole word masking method.

We can use the pre-trained model in the following two ways:
- As a feature extractor by extracting embeddings
- By fine-tuning the pre-trained BERT model on downstream tasks such as text classification, question-answering, and more

## Setup

**Hugging Face transformers**

Hugging Face is an organization working to advance and democratize AI through natural language. Their open-source `transformers` library is very popular in the NLP community and is useful for many NLP and NLU tasks. It includes thousands of pre-trained models in more than 100 languages. One of the many advantages of the transformers library is that it is compatible with both PyTorch and TensorFlow.

In [1]:
%%capture
!pip install torch==1.4.0
!pip install transformers==3.5.1

In [2]:
import torch
from transformers import BertModel, BertTokenizer

## Introduction

Consider a sentence – `I love Paris`. Say we need to extract the contextual embedding of each word in the sentence. To do this, first, we tokenize the sentence and feed the tokens to the pre-trained BERT model, which will return the embeddings for each of the tokens. Apart from obtaining the token-level (word-level) representation, we can also obtain the sentence-level representation.

Let's suppose we want to perform a sentiment analysis task, and say we have the dataset shown in the following figure:

<img src='https://github.com/rahiakela/img-repo/blob/master/getting-started-with-google-bert/sample-dataset.png?raw=1' width='800'/>

We have sentences and their corresponding labels, where 1 indicates positive sentiment and 0 indicates negative sentiment. We can train a classifier to classify the sentiment of a sentence using the given dataset.

But we can't feed the given dataset directly to a classifier, since it has text. So first, we need to vectorize the text. We can vectorize the text using methods such as:

- TF-IDF
- word2vec
- BERT

BERT learns the contextual embedding, unlike other context-free embedding models such as word2vec. Now, we will see how to use the pre-trained BERT model to vectorize the sentences in our dataset.

Let's take the first sentence in our dataset – `I love Paris`. First, we tokenize the sentence using the WordPiece tokenizer and get the tokens (words).

```
tokens = [I, love, Paris]
```

Now, we add the `[CLS]` token at the beginning and the `[SEP]` token at the end.

```
tokens = [[CLS], I, love, Paris, [SEP]]
```

Similarly, we can tokenize all the sentences in our training set. But the length of each sentence varies, right? Yes, and so does the length of the tokens. We need to keep the length of all the tokens the same. 

Say we fix the length of the tokens list at 7 for all the sentences in our dataset. Our preceding tokens list has a length of 5. To bring it up to 7, we append a padding token, `[PAD]`, twice.

```
tokens = [[CLS], I, love, Paris, [SEP], [PAD], [PAD]]
```

As we can observe, our tokens list now has a length of 7, since we have added two `[PAD]` tokens.

**The next step is to make our model understand that the `[PAD]` tokens are added only to match the required length and are not part of the actual sentence. To do this, we introduce an attention mask. We set the attention mask value to 1 at every position that holds a real token and to 0 at every position that holds a `[PAD]` token.**

```
attention_mask = [1, 1, 1, 1, 1, 0, 0]
```
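
The padding and masking steps above can be sketched in plain Python. This is only an illustration under simple assumptions, not how the tokenizer implements it internally; the helper name `pad_and_mask` and its `max_len` parameter are hypothetical.

```python
def pad_and_mask(tokens, max_len, pad_token="[PAD]"):
    """Pad a token list to max_len and build the matching attention mask.

    Hypothetical helper for illustration only.
    """
    if len(tokens) > max_len:
        raise ValueError("token list is longer than max_len; truncate it first")
    # Append as many [PAD] tokens as needed to reach max_len
    padded = tokens + [pad_token] * (max_len - len(tokens))
    # 1 marks a real token, 0 marks a padding position
    mask = [0 if tok == pad_token else 1 for tok in padded]
    return padded, mask


example_tokens = ["[CLS]", "I", "love", "Paris", "[SEP]"]
padded_tokens, example_mask = pad_and_mask(example_tokens, max_len=7)
print(padded_tokens)  # ['[CLS]', 'I', 'love', 'Paris', '[SEP]', '[PAD]', '[PAD]']
print(example_mask)   # [1, 1, 1, 1, 1, 0, 0]
```

Applying the same helper with the same `max_len` to every sentence in the dataset would give token lists of equal length, each with its own mask.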

Next, we map all the tokens to a unique token ID. Suppose the following is the mapped token ID:

```
token_ids = [101, 1045, 2293, 3000, 102, 0, 0]
```

This means that ID 101 represents the `[CLS]` token, 1045 represents the token `I`, 2293 represents the token `love`, and so on.
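
Conceptually, this mapping is a vocabulary lookup. The sketch below assumes a tiny hypothetical vocabulary, `toy_vocab`, that contains only the IDs used in this example plus the unknown-token ID; the real `bert-base-uncased` vocabulary has roughly 30,000 entries, and the real WordPiece tokenizer splits unfamiliar words into subwords before falling back to `[UNK]`.

```python
# Hypothetical excerpt of a vocabulary (lowercased, as in the uncased model)
toy_vocab = {
    "[PAD]": 0,
    "[UNK]": 100,
    "[CLS]": 101,
    "[SEP]": 102,
    "i": 1045,
    "love": 2293,
    "paris": 3000,
}


def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Look up each token's ID, falling back to the unknown-token ID."""
    return [vocab.get(tok, vocab[unk_token]) for tok in tokens]


toy_tokens = ["[CLS]", "i", "love", "paris", "[SEP]", "[PAD]", "[PAD]"]
toy_ids = tokens_to_ids(toy_tokens, toy_vocab)
print(toy_ids)  # [101, 1045, 2293, 3000, 102, 0, 0]
```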

**Now, we feed token_ids along with attention_mask as input to the pre-trained BERT model and obtain the vector representation (embedding) of each of the tokens.**

As the following figure shows, once we feed the tokens as input, encoder 1 computes the representation of all the tokens and sends it to the next encoder, encoder 2. Encoder 2 takes the representation computed by encoder 1 as input, computes its own representation, and sends it to encoder 3. In this way, each encoder sends its representation to the encoder above it. The final encoder, encoder 12, returns the final representation (embedding) of all the tokens in our sentence:

<img src='https://github.com/rahiakela/img-repo/blob/master/getting-started-with-google-bert/pre-trained-BERT.png?raw=1' width='800'/>

**Thus, in this way, we can obtain the representation of each of the tokens. These representations are basically the contextualized word (token) embeddings.** Say we are using the pre-trained BERT-base model; in that case, the representation size of each token is 768.

We learned how to obtain the representation for each word in the sentence I love Paris. But how do we obtain the representation of the complete sentence?

We learned that we have prepended the `[CLS]` token to the beginning of our sentence. The representation of the `[CLS]` token will hold the aggregate representation of the complete sentence. So, we can ignore the embeddings of all other tokens and take the embedding of the `[CLS]` token and assign it as a representation of our sentence. Thus, the representation of our sentence I love Paris is just the representation of the `[CLS]` token $R_{[CLS]}$.

In a very similar fashion, we can compute the vector representation of all the sentences in our training set. Once we have the sentence representation of all the sentences in our training set, we can feed those representations as input and train a classifier to perform a sentiment analysis task.

**Note that using the representation of the `[CLS]` token as a sentence representation is not always a good idea. An often more effective way to obtain the representation of a sentence is to average or pool the representations of all its tokens.**
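
One common way to do the averaging mentioned in the note is masked mean pooling: average the token vectors while skipping the `[PAD]` positions. The following is a minimal sketch, not the library's own method; the function name `masked_mean_pool` is hypothetical, and random values stand in for real BERT output.

```python
import torch


def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    hidden_states:  float tensor [batch_size, sequence_length, hidden_size]
    attention_mask: tensor [batch_size, sequence_length], 1 = real token
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # [B, T, 1]
    summed = (hidden_states * mask).sum(dim=1)                   # [B, H]
    counts = mask.sum(dim=1).clamp(min=1.0)                      # [B, 1]
    return summed / counts


# Stand-in values: random vectors instead of real BERT output
demo_hidden = torch.randn(1, 7, 768)
demo_mask = torch.tensor([[1, 1, 1, 1, 1, 0, 0]])
sentence_rep = masked_mean_pool(demo_hidden, demo_mask)
print(sentence_rep.shape)  # torch.Size([1, 768])
```

Note that this sketch still includes the `[CLS]` and `[SEP]` positions in the average, since their mask value is 1; whether to exclude them is a design choice.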



## Generating BERT embeddings

We will learn how to extract embeddings from the pre-trained BERT model.
Consider the sentence I love Paris. Let's see how to obtain the contextualized word embedding of all the words in the sentence using the pre-trained BERT model with Hugging Face's transformers library.

We use the `bert-base-uncased` model. As the name suggests, it is the BERT-base model with 12 encoders, and it is trained with uncased tokens. Since we are using BERT-base, the representation size will be 768.

In [3]:
# Download and load the pre-trained bert-base-uncased model
model = BertModel.from_pretrained("bert-base-uncased")

Next, we download and load the tokenizer that was used to pre-train the `bert-base-uncased` model:

In [4]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

### Preprocessing the input

Now, let's see how to preprocess the input before feeding it to BERT.

In [5]:
# Define the sentence
sentence = "I love Paris"

# Tokenize the sentence and obtain the tokens
tokens = tokenizer.tokenize(sentence)
print(tokens)

['i', 'love', 'paris']


Now, we will add the `[CLS]` token at the beginning and the `[SEP]` token at the end of the tokens list:

In [6]:
tokens = ["[CLS]"] + tokens + ["[SEP]"]
print(tokens)

['[CLS]', 'i', 'love', 'paris', '[SEP]']


As we can observe, we have a `[CLS]` token at the beginning and a `[SEP]` token at the end of our tokens list. We can also see that the length of our tokens list is 5.

Say we need to keep the length of our tokens list to 7; in that case, we add two `[PAD]` tokens at the end.

In [7]:
tokens = tokens + ["[PAD]"] + ["[PAD]"]
print(tokens)

['[CLS]', 'i', 'love', 'paris', '[SEP]', '[PAD]', '[PAD]']


As we can see, now we have the tokens list with `[PAD]` tokens and the length of our tokens list is 7.

Next, we create the attention mask. We set the attention mask value to 1 if the token is not a `[PAD]` token, else we set the attention mask to 0.

In [8]:
attention_mask = [1 if i != '[PAD]' else 0 for i in tokens]
print(attention_mask)

[1, 1, 1, 1, 1, 0, 0]


As we can see, the attention mask is 0 at the positions where we have a `[PAD]` token and 1 at all other positions.

Next, we convert all the tokens to their token IDs as follows:

In [9]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

[101, 1045, 2293, 3000, 102, 0, 0]


From the output, we can observe that each token is mapped to a unique token ID.

Now, we convert token_ids and attention_mask to tensors.

In [10]:
token_ids = torch.tensor(token_ids).unsqueeze(0)
attention_mask = torch.tensor(attention_mask).unsqueeze(0)
print(token_ids)
print(attention_mask)

tensor([[ 101, 1045, 2293, 3000,  102,    0,    0]])
tensor([[1, 1, 1, 1, 1, 0, 0]])
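
The `unsqueeze(0)` call adds a leading batch dimension, because the model expects input of shape `[batch_size, sequence_length]` rather than a flat list of IDs. A small sketch of the shape change, using the hypothetical names `ids_1d` and `ids_2d` so as not to overwrite the notebook's variables:

```python
import torch

ids_1d = torch.tensor([101, 1045, 2293, 3000, 102, 0, 0])
print(ids_1d.shape)  # torch.Size([7])

# Insert a new dimension at position 0: one sentence becomes a batch of size 1
ids_2d = ids_1d.unsqueeze(0)
print(ids_2d.shape)  # torch.Size([1, 7])
```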


**That's it. Next, we feed token_ids and attention_mask to the pre-trained BERT model and get the embedding.**

### Getting the embedding

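We feed `token_ids` and `attention_mask` to `model` and get the embeddings. With the transformers version installed above (3.5.1), `model` returns the output as a tuple with two values. The first value, `hidden_rep`, is the hidden state representation: it holds the representation of every token obtained from the final encoder (encoder 12). The second value, `cls_head`, is the representation of the `[CLS]` token after it has passed through BERT's pooling layer (a linear layer followed by a tanh activation). Note that from version 4 onward the library returns an output object by default instead of a plain tuple, so the unpacking below is tied to the pinned version: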

In [11]:
hidden_rep, cls_head = model(token_ids, attention_mask=attention_mask)

# hidden_rep contains the embedding (representation) of all the tokens in our input
print(hidden_rep.shape)

torch.Size([1, 7, 768])


The size `[1,7,768]` indicates `[batch_size, sequence_length, hidden_size]`.

Our batch size is 1. The sequence length is the number of tokens; since we have 7 tokens, the sequence length is 7. The hidden size is the representation (embedding) size, which is 768 for the BERT-base model.

We can obtain the representation of each token as:

- `hidden_rep[0][0]` gives the representation of the first token which is `[CLS]`
- `hidden_rep[0][1]` gives the representation of the second token which is `i`
- `hidden_rep[0][2]` gives the representation of the third token which is `love`

In this way, we can obtain the contextual representation of all the tokens. This is basically the contextualized word embeddings of all the words in the given sentence.
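
The indexing above can be tried without downloading the model. In the sketch below, `fake_hidden_rep` is a hypothetical stand-in filled with random values that only matches the shape of `hidden_rep`; the vectors themselves carry no meaning.

```python
import torch

# Stand-in for the model output: same shape as hidden_rep, random values
fake_hidden_rep = torch.randn(1, 7, 768)
demo_tokens = ['[CLS]', 'i', 'love', 'paris', '[SEP]', '[PAD]', '[PAD]']

# Walk over the first (and only) sentence in the batch, one token at a time
for tok, vec in zip(demo_tokens, fake_hidden_rep[0]):
    print(tok, tuple(vec.shape))  # each vector has shape (768,)

# hidden_rep[0][2] and hidden_rep[0, 2] select the same vector
love_vec = fake_hidden_rep[0][2]
```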

Now, let's take a look at `cls_head`. It contains the pooled representation of the `[CLS]` token. Let's print the shape of `cls_head`:

In [12]:
print(cls_head.shape)

torch.Size([1, 768])


The size `[1,768]` indicates `[batch_size, hidden_size]`.

We learned that `cls_head` holds the aggregate representation of the sentence, so we can use `cls_head` as the representation of the sentence `I love Paris`.

We learned how to extract embeddings from the pre-trained BERT model, but these embeddings come only from the topmost encoder layer of BERT, which is encoder 12.

We can also extract the embeddings from all the encoder layers of BERT.