# Understanding ELECTRA 

ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) is yet another interesting variant of BERT. We learned that BERT is pre-trained using the masked language modeling and next sentence prediction tasks. We know that in the masked language modeling task, we randomly mask 15% of the tokens and train BERT to predict the masked tokens. Instead of using masked language modeling as the pre-training objective, ELECTRA is pre-trained using a task called replaced token detection.

The replaced token detection task is very similar to masked language modeling, but instead of masking a token with the [MASK] token, here we replace a token with a different token and train the model to classify whether each token is the original token or a replaced one.

Okay, but why use the replaced token detection task instead of the masked language modeling task? One of the problems with masked language modeling is that it uses the [MASK] token during pre-training, but the [MASK] token is not present during fine-tuning on downstream tasks. This causes a mismatch between pre-training and fine-tuning. In the replaced token detection task, we don't use the [MASK] token for masking; instead, we replace a token with a different token and train the model to classify whether each token is original or replaced. This addresses the mismatch between pre-training and fine-tuning.

Unlike BERT, which is pre-trained using the masked language modeling and next sentence prediction tasks, ELECTRA is pre-trained using only the replaced token detection task. Okay, but how does the replaced token detection task work? Which tokens do we replace? How do we train the model to perform this task? Let us find out the answers to all these questions in the next section.

## Understanding the replaced token detection task 
Let us learn exactly how the replaced token detection task works with an example. We use the same example used in the ELECTRA paper. Consider the sentence 'The chef cooked the meal'. After tokenization, we have:

tokens = [The, chef, cooked, the, meal]

Let us replace the first token, 'The', with 'a' and the third token, 'cooked', with 'ate'. Then we have:

tokens = [a, chef, ate, the, meal]

As we can observe, we replaced two tokens. Now, we train a BERT model to classify whether each token is original or replaced. We call this BERT the discriminator since it only classifies whether each token is original or replaced.
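The targets the discriminator learns from can be sketched in a few lines. This is only an illustration of the labeling idea, not the actual ELECTRA training code. The helper name `replacement_labels` and the convention of 1 for replaced and 0 for original are assumptions made for this example.

```python
# Original and corrupted token lists from the example above.
original = ["The", "chef", "cooked", "the", "meal"]
corrupted = ["a", "chef", "ate", "the", "meal"]

def replacement_labels(original_tokens, corrupted_tokens):
    """Return one label per position: 1 if the token was replaced, 0 if it is original.

    Hypothetical helper for illustration; the 1/0 convention is an assumption.
    """
    if len(original_tokens) != len(corrupted_tokens):
        raise ValueError("both token lists must have the same length")
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]

labels = replacement_labels(original, corrupted)

# Pair every input token with the label the discriminator should predict.
for token, label in zip(corrupted, labels):
    print(f"{token:>6} -> {'replaced' if label else 'original'}")
```

Here the first and third positions are labeled as replaced and the remaining three as original, which is exactly the per-token decision the discriminator is trained to make.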

As shown in the following figure, we feed the tokens to the discriminator (BERT) and it returns whether the tokens are original or replaced: 


![title](images/6.png)

We learned that we replace some tokens in the given sentence and feed it to the discriminator to classify whether each token is replaced or original. But the question is, before feeding the tokens to the discriminator, how exactly do we replace them? To replace tokens, we use masked language modeling. Consider the original sentence, 'The chef cooked the meal'. After tokenization, we have:

tokens = [The, chef, cooked, the, meal]

Now, we randomly mask 15% of the tokens with the [MASK] token. For illustration, suppose the first and third tokens are masked. Then we have:

tokens = [[MASK], chef, [MASK], the, meal]
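The masking step can be sketched as follows. This is a minimal sketch using only the Python standard library. The helper names `sample_mask_positions` and `apply_mask`, the rounding rule, and the choice to always mask at least one token are assumptions for this example. Real implementations may select positions differently.

```python
import random

MASK = "[MASK]"

def sample_mask_positions(num_tokens, mask_prob=0.15, rng=None):
    """Pick which positions to mask (assumption: round, and mask at least one)."""
    rng = rng or random.Random()
    num_to_mask = max(1, round(num_tokens * mask_prob))
    return sorted(rng.sample(range(num_tokens), num_to_mask))

def apply_mask(tokens, positions):
    """Return a copy of tokens with the given positions replaced by [MASK]."""
    chosen = set(positions)
    return [MASK if i in chosen else tok for i, tok in enumerate(tokens)]

tokens = ["The", "chef", "cooked", "the", "meal"]

# Reproduce the example from the text by masking positions 0 and 2 explicitly.
masked = apply_mask(tokens, [0, 2])
print(masked)

# Random selection at a 15% rate (seeded so the run is repeatable).
positions = sample_mask_positions(len(tokens), mask_prob=0.15, rng=random.Random(0))
print(positions, apply_mask(tokens, positions))
```

Note that with only five tokens, a 15% rate selects a single position under this rounding rule. The two masked tokens in the running example are chosen by hand to keep the illustration readable.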

Next, we feed the tokens to another BERT model and predict the masked tokens. We call this BERT the generator since it returns a probability distribution over tokens. As we notice from the following figure, we feed the masked tokens to the generator and, for each masked position, it predicts the token that has the highest probability of being the masked token:



![title](images/7.png)

From the preceding figure, we can understand that our generator has predicted the masked token 'The' as 'a' and the masked token 'cooked' as 'ate'. Now, we take the tokens generated by the generator and use them as replacements. That is, we replace tokens in the given sentence with the tokens generated by the generator. For instance, tokenizing the given sentence 'The chef cooked the meal', we have:

tokens = [The, chef, cooked, the, meal]

Now, we replace tokens with the tokens generated by the generator. So our tokens list becomes:

tokens = [a, chef, ate, the, meal]

As we can notice, we replaced the tokens 'The' and 'cooked' with the tokens 'a' and 'ate' generated by the generator. Now we feed the above tokens to the discriminator and train it to classify whether each token is original or replaced.
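The generator-then-discriminator flow just described can be put together in a toy sketch. No real BERT model is involved: `generator_probs` is a hypothetical, hand-written stand-in for the generator's output over a tiny made-up vocabulary, and `fill_masks` is an illustrative helper name. Following the text, each masked position takes the highest-probability token.

```python
MASK = "[MASK]"

original = ["The", "chef", "cooked", "the", "meal"]
masked = [MASK, "chef", MASK, "the", "meal"]

# Hypothetical generator output: for each masked position, a probability
# distribution over a tiny made-up vocabulary (values are illustrative only).
generator_probs = {
    0: {"a": 0.6, "The": 0.3, "one": 0.1},
    2: {"ate": 0.5, "cooked": 0.4, "made": 0.1},
}

def fill_masks(masked_tokens, probs_by_position):
    """Replace each [MASK] with the highest-probability token from the generator."""
    filled = list(masked_tokens)
    for pos, dist in probs_by_position.items():
        if filled[pos] != MASK:
            raise ValueError(f"position {pos} is not masked")
        filled[pos] = max(dist, key=dist.get)
    return filled

# Step 1: the generator fills in the masked positions.
corrupted = fill_masks(masked, generator_probs)

# Step 2: build the discriminator targets by comparing against the original
# sentence (assumed convention: 1 = replaced, 0 = original).
labels = [int(o != c) for o, c in zip(original, corrupted)]

print(corrupted)
print(labels)
```

Because the labels come from comparing against the original sentence, a masked position where the generator happens to predict the original token would be labeled as original rather than replaced.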

As shown in the following figure, first we mask the tokens randomly and feed them to the generator. The generator predicts the masked tokens. Next, we replace the input tokens with the tokens generated by the generator and feed them to the discriminator. The discriminator classifies whether each token is original or replaced:


![title](images/8.png)

The discriminator is basically our ELECTRA model. After training, we can remove the generator and use the discriminator as the ELECTRA model. Now that we have understood how the replaced token detection task works, let us take a closer look at the generator and discriminator in the next section.

















