# Pre-training the BERT model

In this section, we will learn how to pre-train the BERT model. Wait, what is pre-training? Say we have a model $m$. First, we train the model $m$ on a huge dataset for some task and save the trained model. Now, for a new task, instead of initializing the model with random weights, we initialize it with the weights of our already trained model $m$ (the pre-trained model). That is, since the model $m$ has already been trained on a huge dataset, instead of training a model from scratch for the new task, we take the pre-trained model $m$ and adjust (fine-tune) its weights for the new task. This is a type of transfer learning.

The BERT model is pre-trained on a huge corpus using two interesting tasks called masked language modeling and next sentence prediction. After pre-training, we save the pre-trained BERT model. For a new task, say question answering, instead of training BERT from scratch, we use the pre-trained BERT model and adjust (fine-tune) its weights for the new task.

In this section, we will learn in detail how the BERT model is pre-trained. Before diving into pre-training, let's first take a look at how to structure the input data in a way that BERT accepts.

## Input data representation
Before feeding the input to BERT, we convert the input into embeddings using the following three embedding layers:

- Token embedding 
- Segment embedding 
- Position embedding 

Let us understand how each of these embedding layers works, one by one.

### Token embedding 

First, we have a token embedding layer. Let's understand this with an example. Consider the following two sentences:

Sentence A: Paris is a beautiful city 

Sentence B: I love Paris. 

First, we tokenize both sentences and obtain the tokens as shown below. In our example, we have not lowercased the tokens:

tokens = [Paris, is, a, beautiful, city, I, love, Paris]

Next, we add a new token called the [CLS] token only at the beginning of the first sentence:

tokens = [ [CLS], Paris, is, a, beautiful, city, I, love, Paris]

And we add a new token called [SEP] at the end of every sentence:

tokens = [ [CLS], Paris, is, a, beautiful, city, [SEP], I, love, Paris, [SEP]]

Note that the [CLS] token is added only at the beginning of the first sentence, while the [SEP] token is added at the end of every sentence to mark where that sentence ends. We will understand how these two tokens, [CLS] and [SEP], are useful as we move forward through the chapter.
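
The two steps above can be sketched in a few lines of Python. This is only a minimal illustration: the helper name `add_special_tokens` is hypothetical, and the sentences are assumed to be already split into tokens.

```python
def add_special_tokens(sentence_a, sentence_b=None):
    """Prepend [CLS] once and append [SEP] after every sentence.

    Hypothetical helper for illustration; inputs are lists of tokens.
    """
    tokens = ["[CLS]"] + list(sentence_a) + ["[SEP]"]
    if sentence_b is not None:
        tokens += list(sentence_b) + ["[SEP]"]
    return tokens


sentence_a = ["Paris", "is", "a", "beautiful", "city"]
sentence_b = ["I", "love", "Paris"]

tokens = add_special_tokens(sentence_a, sentence_b)
print(tokens)
```

Calling the helper with only `sentence_a` gives a single [CLS] at the start and a single [SEP] at the end.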

Now, before feeding all the tokens to BERT, we convert the tokens into embeddings using an embedding layer called the token embedding layer. Note that the values of the token embeddings are learned during training. As shown in the following figure, we have an embedding for every token; that is, $E_{\text{[CLS]}}$ indicates the embedding of the token [CLS], $E_{\text{Paris}}$ indicates the embedding of the token Paris, and so on:


![title](images/8.png)
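
A token embedding layer can be thought of as a lookup table with one row per vocabulary entry. The following sketch shows the lookup only. The embedding size of 4, the tiny vocabulary built from our example, and the random initial values are all assumptions made for illustration; in a real model the rows are learned during training.

```python
import random

random.seed(0)
EMBED_DIM = 4  # toy embedding size, assumed for readability

tokens = ["[CLS]", "Paris", "is", "a", "beautiful", "city", "[SEP]",
          "I", "love", "Paris", "[SEP]"]

# Toy vocabulary: assign one integer id to each distinct token.
vocab = {}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))

# Embedding table: one row per vocabulary entry, randomly initialized here.
token_embedding_table = [
    [random.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
    for _ in range(len(vocab))
]

# Look up the embedding of every token in the sequence.
token_ids = [vocab[tok] for tok in tokens]
token_embeddings = [token_embedding_table[i] for i in token_ids]

print(token_ids)
```

Both occurrences of Paris share the same id and therefore the same row of the table, which is why the segment and position embeddings described next are needed to tell them apart.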

### Segment embedding

Next, we have a segment embedding layer. The segment embedding is used to distinguish between the two given sentences. Let's understand the segment embedding with an example. Consider the same two sentences we saw in the previous section: 

Sentence A: Paris is a beautiful city 

Sentence B: I love Paris

After tokenizing the preceding two sentences, we will have the following:

tokens = [ [CLS], Paris, is, a, beautiful, city, [SEP], I, love, Paris, [SEP]]

Now, apart from the [SEP] token, we have to give some sort of indicator to our model to distinguish between the two sentences. To do this, we feed the input tokens to the segment embedding layer. 

The segment embedding layer returns one of only two embeddings, $E_A$ or $E_B$, as the output. That is, if the input token belongs to sentence A, then the token is mapped to the embedding $E_A$, and if the token belongs to sentence B, then it is mapped to the embedding $E_B$.

As shown in the following figure, all the tokens from sentence A are mapped to the embedding $E_A$ and all the tokens from sentence B are mapped to the embedding $E_B$:



![title](images/9.png)


Okay, but how does the segment embedding work if we have only one sentence? Say we have only the sentence 'Paris is a beautiful city'. In that case, all the tokens of the sentence are mapped to the embedding $E_A$.
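
Both cases can be sketched with a small helper that assigns every token to segment A or segment B. The function name `segment_labels` is hypothetical. The sketch assumes that [CLS] and the first [SEP] are counted as part of sentence A and that the final [SEP] is counted as part of sentence B.

```python
def segment_labels(tokens):
    """Map each token to "E_A" or "E_B" (hypothetical helper).

    Everything up to and including the first [SEP] is treated as sentence A;
    everything after it is treated as sentence B.
    """
    labels = []
    current = "E_A"
    for tok in tokens:
        labels.append(current)
        if tok == "[SEP]":
            current = "E_B"  # tokens after the first [SEP] belong to sentence B
    return labels


pair = ["[CLS]", "Paris", "is", "a", "beautiful", "city", "[SEP]",
        "I", "love", "Paris", "[SEP]"]
single = ["[CLS]", "Paris", "is", "a", "beautiful", "city", "[SEP]"]

pair_labels = segment_labels(pair)
single_labels = segment_labels(single)
print(pair_labels)
print(single_labels)
```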


### Position embedding 

Next, we have a position embedding layer. In the previous chapter, we learned that since the transformer does not use any recurrence mechanism and processes all the words in parallel, we need to give it some information about the word order, so we used positional encoding.

We know that BERT is essentially the transformer's encoder, and thus we need to give it information about the position of the words (tokens) in our sentence before feeding them as input to BERT. So, we use a layer called the position embedding layer and get the position embedding for each token in our sentence.

As we can observe from the following figure, $E_0$ indicates the position embedding of the token [CLS], $E_1$ indicates the position embedding of the token Paris, and so on:



![title](images/11.png)
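
The indexing in the figure can be reproduced with a short sketch. It only assigns each token its zero-based position and the matching label $E_i$. How the embedding values themselves are obtained is left out, and the variable names are chosen for illustration.

```python
tokens = ["[CLS]", "Paris", "is", "a", "beautiful", "city", "[SEP]",
          "I", "love", "Paris", "[SEP]"]

# Position ids simply count the tokens from left to right, starting at 0.
position_ids = list(range(len(tokens)))
position_labels = [f"E_{i}" for i in position_ids]

for tok, label in zip(tokens, position_labels):
    print(f"{tok:>10} -> {label}")
```

Note that the two occurrences of Paris receive different position embeddings, even though they share the same token embedding.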

### Final representation 
Now, let us take a look at the final input data representation. As shown in the following figure, we first convert the given input sentences to tokens, then feed the tokens to the token embedding, segment embedding, and position embedding layers to obtain the three sets of embeddings. Next, we sum the three embeddings of each token element-wise and feed the result as input to BERT:


![title](images/12.png)
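
The summation can be sketched end to end in plain Python. Everything numeric here is an assumption for illustration: the embedding size of 4, the toy vocabulary, and the randomly initialized tables standing in for embeddings that a real model would learn.

```python
import random

random.seed(0)
EMBED_DIM = 4  # toy embedding size, assumed for readability


def random_table(num_rows, dim):
    """Stand-in for a learned embedding table (random values)."""
    return [[random.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(num_rows)]


tokens = ["[CLS]", "Paris", "is", "a", "beautiful", "city", "[SEP]",
          "I", "love", "Paris", "[SEP]"]

# Token ids from a toy vocabulary built on the fly.
vocab = {}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))
token_ids = [vocab[tok] for tok in tokens]

# Segment ids: 0 for sentence A (up to the first [SEP]), 1 for sentence B.
segment_ids, current = [], 0
for tok in tokens:
    segment_ids.append(current)
    if tok == "[SEP]":
        current = 1

# Position ids: 0, 1, 2, ...
position_ids = list(range(len(tokens)))

token_table = random_table(len(vocab), EMBED_DIM)
segment_table = random_table(2, EMBED_DIM)
position_table = random_table(len(tokens), EMBED_DIM)

# Final representation: element-wise sum of the three embeddings per token.
input_embeddings = [
    [t + s + p for t, s, p in zip(token_table[ti],
                                  segment_table[si],
                                  position_table[pi])]
    for ti, si, pi in zip(token_ids, segment_ids, position_ids)
]

print(len(input_embeddings), len(input_embeddings[0]))
```

Each token ends up with a single vector of the same size as the individual embeddings, so the sequence that reaches BERT has one vector per token.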

Now that we have learned how to convert the input into embeddings using the three embedding layers, in the next section we will learn about the tokenizer used by BERT, called the WordPiece tokenizer.


## WordPiece tokenizer
BERT uses a special type of tokenizer called the WordPiece tokenizer. The WordPiece tokenizer follows the subword tokenization scheme. Let's understand how the WordPiece tokenizer works with an example. Consider the following sentence:

"Let us start pretraining the model"

Now, if we tokenize the sentence using the WordPiece tokenizer, then we obtain the tokens as shown below:

tokens = [let, us, start, pre, ##train, ##ing, the, model]

As we can observe, while tokenizing the sentence using the WordPiece tokenizer, the word pretraining is split into the subwords pre, ##train, and ##ing. But what does this imply?

When we tokenize using the WordPiece tokenizer, we first check whether the word is present in our vocabulary. If the word is present in the vocabulary, then we use it as a token. If the word is not present in the vocabulary, then we split the word into subwords and check whether each subword is present in the vocabulary. If a subword is present in the vocabulary, then we use it as a token. But if the subword is not present in the vocabulary, then we split it again and check the resulting pieces against the vocabulary. In this way, we keep splitting and checking the subwords against the vocabulary until we reach individual characters. This is effective in handling out-of-vocabulary (OOV) words.

The size of the BERT vocabulary is 30K tokens. If a word belongs to these 30K tokens, then we use it as a token. Otherwise, we split the word into subwords and check whether the subwords belong to these 30K tokens. We keep splitting and checking the subwords against the 30K tokens in the vocabulary until we reach individual characters.

In our example, the word pretraining is not in our BERT vocabulary. So we split the word pretraining into the subwords pre, ##train, and ##ing. The hash signs before the tokens ##train and ##ing indicate that each of them is a subword that continues a word, that is, it is preceded by another subword. Now we check whether the subwords ##train and ##ing are present in the vocabulary. Since they are present in the vocabulary, we don't split again and use them as tokens.

Thus, by using the WordPiece tokenizer, we obtain the following tokens:

tokens = [let, us, start, pre, ##train, ##ing, the, model]
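
The splitting procedure can be sketched as follows. This is a simplified illustration and not BERT's actual tokenizer: it assumes a greedy longest-match-first search as one way to realize the "keep splitting and checking" idea, a toy vocabulary in place of the 30K-token BERT vocabulary, lowercased input to match the tokens above, and an `[UNK]` fallback for words that cannot be covered. The function name `wordpiece_tokenize` is hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by greedy longest-match-first search.

    Simplified sketch: non-initial pieces carry the "##" prefix, and the
    whole word becomes unk_token if some part cannot be matched.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation subword
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, assumed for this example only.
toy_vocab = {"let", "us", "start", "pre", "##train", "##ing", "the", "model"}

sentence = "Let us start pretraining the model"
tokens = []
for word in sentence.lower().split():
    tokens.extend(wordpiece_tokenize(word, toy_vocab))

print(tokens)
```

With this toy vocabulary, every word except pretraining is matched whole, and pretraining is covered by pre, ##train, and ##ing.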

Next, we add the [CLS] token at the beginning of the sentence and the [SEP] token at the end of the sentence:

tokens = [ [CLS], let, us, start, pre, ##train, ##ing, the, model, [SEP] ]

Now, just as we learned in the previous section, we feed the input tokens to the token, segment, and position embedding layers, obtain the embeddings, sum them, and feed the result as input to BERT. A more detailed explanation of how the WordPiece tokenizer works and how we build the vocabulary is given at the end of the chapter, along with other tokenizers, in the section "Subword tokenization algorithms".

Now that we have learned how to feed the input to BERT by converting it into embeddings, and how to tokenize the input using the WordPiece tokenizer, in the next section let us learn how to pre-train the BERT model.




























