## What is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers. The name itself gives us several clues to what BERT is all about.

The BERT architecture consists of several Transformer encoders stacked together. Each Transformer encoder contains two sub-layers: a self-attention layer and a feed-forward layer.

### There are two different BERT models:

- BERT base, which consists of 12 Transformer encoder layers, 12 attention heads, a hidden size of 768, and 110M parameters.

- BERT large, which consists of 24 Transformer encoder layers, 16 attention heads, a hidden size of 1024, and 340M parameters.
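
As a rough sanity check on those sizes, the parameter counts can be estimated from the layer count and hidden size alone. The sketch below is a back-of-the-envelope calculation, not the library's own accounting. It assumes a vocabulary of 30522 tokens (the `bert-base-uncased` vocabulary), 512 positions, 2 segment types, and a feed-forward inner size of 4 times the hidden size. The function name `bert_encoder_params` is made up for this illustration. Note that the number of attention heads does not change the count, because the heads split the same projection matrices.

```python
def bert_encoder_params(num_layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Rough parameter count for a BERT encoder (embeddings + layers + pooler)."""
    ffn = 4 * hidden  # assumed feed-forward inner size: 4x the hidden size

    # Token, position and segment embeddings, plus one LayerNorm (weight + bias).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden

    # Self-attention: query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward: hidden -> ffn -> hidden, each projection with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two LayerNorms per layer (after attention and after feed-forward).
    layer_norms = 2 * (2 * hidden)
    per_layer = attention + feed_forward + layer_norms

    # Pooler: one dense layer applied to the [CLS] vector.
    pooler = hidden * hidden + hidden

    return embeddings + num_layers * per_layer + pooler


base = bert_encoder_params(num_layers=12, hidden=768)
large = bert_encoder_params(num_layers=24, hidden=1024)
print(f"BERT base:  ~{base / 1e6:.0f}M parameters")
print(f"BERT large: ~{large / 1e6:.0f}M parameters")
```

Under these assumptions the totals land close to the commonly quoted 110M and 340M figures; the small gap comes from rounding and from task-specific heads that are not counted here.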



### BERT Input and Output

The BERT model expects a sequence of tokens (words or sub-word pieces) as input. In each sequence of tokens, there are two special tokens that BERT expects:

- [CLS]: This is the first token of every sequence; it stands for classification token.
- [SEP]: This is the token that tells BERT which token belongs to which sequence. It matters mainly for a next sentence prediction task or a question-answering task. If we only have one sequence, this token is appended to the end of the sequence.


It is also important to note that the maximum number of tokens that can be fed into the BERT model is 512. If a sequence has fewer than 512 tokens, we can use padding to fill the unused token slots with the [PAD] token. If a sequence has more than 512 tokens, we need to truncate it.

And that’s all that BERT expects as input.
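
The input layout described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how the real tokenizer is implemented: whitespace splitting stands in for BERT's actual sub-word tokenizer, and `prepare_sequence` is a made-up helper name.

```python
def prepare_sequence(tokens, max_length=512):
    """Wrap tokens with [CLS]/[SEP], then truncate or pad to max_length."""
    # Reserve two slots for the special tokens, truncating the text if needed.
    body = tokens[: max_length - 2]
    sequence = ["[CLS]"] + body + ["[SEP]"]
    # Fill any unused slots with [PAD] so every sequence has the same length.
    return sequence + ["[PAD]"] * (max_length - len(sequence))


# Whitespace splitting is a stand-in for real tokenization.
tokens = "2022 will be a great year for all of us".split()

padded = prepare_sequence(tokens, max_length=16)     # shorter than the limit: padded
truncated = prepare_sequence(tokens, max_length=6)   # longer than the limit: truncated
print(padded)
print(truncated)
```

A small `max_length` is used here only to keep the printed lists readable; with BERT the limit would be 512.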

The BERT model then outputs an embedding vector of size 768 for each token (768 is the hidden size of BERT base). We can use these vectors as input for different kinds of NLP applications, whether it is text classification, next sentence prediction, Named-Entity-Recognition (NER), or question-answering.


------------

**For a text classification task**, we focus our attention on the embedding vector output for the special [CLS] token. This means that we use the embedding vector of size 768 from the [CLS] token as input for our classifier, which then outputs a vector whose size is the number of classes in our classification task.
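
To make the shapes concrete, here is a minimal sketch of that classification step in plain Python. Everything in it is a stand-in: the "BERT output" is random numbers, the classifier weights are untrained, and the choice of 3 classes is arbitrary. It only shows that the [CLS] vector sits at position 0 and that a linear layer maps 768 values to one score per class.

```python
import random

random.seed(0)
hidden_size, num_classes, seq_len = 768, 3, 5

# Stand-in for BERT's output: one vector of size 768 per token.
sequence_output = [[random.gauss(0, 1) for _ in range(hidden_size)] for _ in range(seq_len)]

# Hypothetical, untrained classifier: a (num_classes x hidden_size) weight matrix and a bias.
weights = [[random.gauss(0, 0.02) for _ in range(hidden_size)] for _ in range(num_classes)]
bias = [0.0] * num_classes

# [CLS] is always the first token, so its embedding is at position 0.
cls_vector = sequence_output[0]

# A linear layer maps the 768-dimensional vector to one score (logit) per class.
logits = [sum(w * x for w, x in zip(row, cls_vector)) + b for row, b in zip(weights, bias)]
print(len(cls_vector), "->", len(logits))
```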

-----------------------

![Imgur](https://imgur.com/NpeB9vb.png)

-------------------------

The core part of BERT is the stack of bidirectional encoders from the Transformer model, but during pre-training, a masked language modeling head and a next sentence prediction head are added onto BERT.

When I say “head”, I mean that a few extra layers are added onto BERT that can be used to generate a specific output. The raw output of BERT is the output from the stacked bidirectional encoders.

### Tokenization for BERT Models

Tokenization plays an essential role in NLP because it converts text into numbers that deep learning models can process.

No deep learning model can work directly with text. You need to convert it into numbers, or into whatever format the model understands.

In [None]:
!pip install transformers
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [None]:
text = '2022 will be a great year for all of us'
encoding = tokenizer.encode_plus(text, add_special_tokens = True, truncation = True, padding = "max_length", return_attention_mask = True, return_tensors = "pt")
encoding

{'input_ids': tensor([[  101, 16798,  2475,  2097,  2022,  1037,  2307,  2095,  2005,  2035,
          1997,  2149,   102,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,  

**`padding = "max_length"`** => The `padding` argument controls padding. With `"max_length"`, every input is padded (with zeros in `input_ids`) to the length given by the `max_length` argument, or to the maximum length accepted by the model if no `max_length` is provided. Padding is still applied if you only provide a single sequence.

**`truncation = True`** => `True` or `'longest_first'`: truncate to the maximum length given by the `max_length` argument, or to the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). For a pair of sequences, this truncates token by token, removing a token from the longest sequence in the pair until the proper length is reached.

Because BERT can accept only 512 tokens at a time, we set the `truncation` parameter to `True`.

The **`add_special_tokens`** parameter tells the tokenizer to add BERT’s special tokens, such as [CLS] at the start and [SEP] at the end. `return_tensors = "pt"` makes the tokenizer return PyTorch tensors. If you don’t want this (maybe you want plain lists), you can remove the parameter and it will return lists.

**`tokenizer.encode_plus()`** returns a dictionary of values instead of just a list of values. Because `tokenizer.encode_plus()` can return many different types of information, like the attention mask and the token type ids, everything is returned in a dictionary, and you can retrieve specific parts of the encoding like this:

In [None]:
input_ids = encoding["input_ids"][0]
attention_mask = encoding["attention_mask"][0]

Additionally, because the tokenizer returns a dictionary of different values, instead of extracting those values as shown above and passing them into the model individually, we can just pass in the entire encoding (assuming `model` is a BERT model that has already been loaded) like this:

```
output = model(**encoding)
```
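
To make the `attention_mask` entry of the encoding more concrete, here is a small plain-Python sketch that rebuilds it from the `input_ids` printed earlier. It uses only the first 16 ids of that output and assumes, as that output suggests, that 0 is the id of the [PAD] token. The real tokenizer computes the mask for you; this only illustrates what the values mean.

```python
PAD_ID = 0  # assumed id of [PAD], matching the zeros in the printed encoding

# First 16 ids of the encoding printed earlier: 101 is [CLS], 102 is [SEP], 0 is padding.
input_ids = [101, 16798, 2475, 2097, 2022, 1037, 2307, 2095, 2005, 2035,
             1997, 2149, 102, 0, 0, 0]

# The attention mask is 1 wherever there is a real token and 0 on padding.
attention_mask = [int(token_id != PAD_ID) for token_id in input_ids]
print(attention_mask)

# Dropping the padded positions recovers the ids of the real tokens only.
real_ids = [t for t, m in zip(input_ids, attention_mask) if m == 1]
print(len(real_ids), "real tokens out of", len(input_ids))
```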


One more useful thing to know about the tokenizer is that you can retrieve specific special tokens if desired. For example, if you are doing masked language modeling and you want to insert a mask at a location for your model to decode, you can simply retrieve the mask token like this:

In [None]:
tokenizer.mask_token

'[MASK]'

## Masked Language Modeling

Masked language modeling is the task of predicting a word that was intentionally hidden (masked) in a sentence.

In simple terms, it is the task of filling in the blanks.

Masked language modeling can be considered similar to autoencoding, which works by reconstructing the original input from a corrupted version of it.

In [None]:
from transformers import BertTokenizer, BertForMaskedLM
from torch.nn import functional as F
import torch


tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict = True)

''' Masked Language Modeling works by inserting a mask token at the desired position where you want to predict the best candidate word that would go in that position.

You can simply insert the mask token by concatenating it at the desired position

The Bert Model for Masked Language Modeling predicts the best word/token in its vocabulary that would replace that word. 

The logits are the output of the BERT Model before a softmax activation function is applied to the output of BERT. 
i.e. logits are the Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

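And in order to get the logits via output.logits, the model has to return a ModelOutput object; otherwise it returns a plain tuple, and output.logits in the code below raises an AttributeError (it is a runtime error, not a compilation error). Passing "return_dict = True" guarantees this; recent versions of transformers already default to it.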

"return_dict" - If set to True, the model will return a ModelOutput class instead of a plain tuple.

On "return_dict = True" see my note - HuggingFace/input_dict_true_its_purpose.ipynb

'''

text = "The Opera House in Australia is in , " + tokenizer.mask_token + " city"

input = tokenizer.encode_plus(text, return_tensors = "pt")


''' To read the predictions for the mask token, we first need the position of the mask token in the input.

We can get that position using torch.where(). Later in this example I retrieve the top 10 candidate replacement words for the mask token. '''
mask_index = torch.where(input["input_ids"][0] == tokenizer.mask_token_id)

''' mask_token (str or tokenizers.AddedToken, optional) — A special token representing a masked token (used by masked-language modeling pretraining objectives, like BERT). Will be associated to self.mask_token and self.mask_token_id. '''

output = model(**input)

logits = output.logits

''' After we pass the input encoding into the BERT Model, we can get the logits simply by specifying output.logits, which returns a tensor, and after this we can finally apply a softmax activation function to the logits. '''

softmax = F.softmax(logits, dim = -1)
''' By applying a softmax to the logits, we get a probability distribution over BERT’s vocabulary for every token position. Words with a higher probability are better candidate replacement words for the mask token.  '''

mask_word = softmax[0, mask_index, :]
''' Indexing the softmax output with the masked token index, which we already got using torch.where(), gives the probabilities of all the words in BERT’s vocabulary at the mask position.

In this particular example I retrieve the top 10 candidate replacement words for the mask token. '''

top_10 = torch.topk(mask_word, 10, dim = 1)[1][0]
''' I used the torch.topk() function, which retrieves the top k values in a given tensor. It returns both the values and their indices; the [1] above selects the indices, which are the token ids of the candidate words. '''

'''  After this, the process becomes relatively simple, as all we have to do is iterate through the tensor, and replace the mask token in the sentence with the candidate token. '''
for token in top_10:
   word = tokenizer.decode([token])
   new_sentence = text.replace(tokenizer.mask_token, word)
   print(new_sentence)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The Opera House in Australia is in , sydney city
The Opera House in Australia is in , melbourne city
The Opera House in Australia is in , brisbane city
The Opera House in Australia is in , adelaide city
The Opera House in Australia is in , the city
The Opera House in Australia is in , canberra city
The Opera House in Australia is in , auckland city
The Opera House in Australia is in , hobart city
The Opera House in Australia is in , griffith city
The Opera House in Australia is in , hume city


If you want to only get the top candidate word, you can do this:

```py
softmax = F.softmax(logits, dim = -1)

mask_word = softmax[0, mask_index, :]

top_word = torch.argmax(mask_word, dim=1)

print(tokenizer.decode(top_word))
```

Instead of using torch.topk() for retrieving the top 10 values, we just use torch.argmax(), which returns the index of the maximum value in the tensor. The rest of the code is pretty much the same thing as the original code.
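
The relationship between softmax, top-k, and argmax can also be seen without loading a model. The sketch below uses a hypothetical five-word vocabulary and made-up logits for the masked position; it mirrors the logic of the code above in plain Python rather than reproducing what `torch.topk()` or `torch.argmax()` do internally.

```python
import math


def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and made-up logits for the masked position.
vocab = ["sydney", "melbourne", "the", "paris", "banana"]
logits = [6.0, 4.5, 3.0, 1.0, -2.0]

probs = softmax(logits)

# Top-k: sort the vocabulary indices by probability and keep the first k.
k = 3
top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
print([vocab[i] for i in top_k])

# Argmax is the k = 1 special case: the single most probable index.
top_1 = max(range(len(probs)), key=lambda i: probs[i])
print(vocab[top_1])
```

Because softmax preserves the ordering of the logits, the top-k and argmax results would be the same if they were computed on the raw logits; the softmax only matters when you want to read the scores as probabilities.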

---

## Next Sentence Prediction

Next Sentence Prediction is the task of predicting whether one sentence follows another sentence. Here is my code for this:


[BertForNextSentencePrediction](https://huggingface.co/transformers/v4.9.2/model_doc/bert.html#bertfornextsentenceprediction)

It returns logits (torch.FloatTensor of shape (batch_size, 2)) – Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).

In [None]:
from transformers import BertTokenizer, BertForNextSentencePrediction
import torch
from torch.nn import functional as F
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

prompt = "I came back from Office in the evening"

next_sentence = "I opened my Beer after Office"

encoding = tokenizer.encode_plus(prompt, next_sentence, return_tensors='pt')
outputs = model(**encoding)[0]
softmax = F.softmax(outputs, dim = 1)
print(softmax)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tensor([[9.9998e-01, 1.5085e-05]], grad_fn=<SoftmaxBackward0>)


Put differently, next sentence prediction scores how good a sentence is as the next sentence for a given sentence.

In this case, the `prompt` variable is the given sentence, and we are trying to predict whether `next_sentence` is the sentence that follows it.

To do this, the BERT tokenizer automatically inserts a [SEP] token between the sentences, which represents the separation between the two sentences, and the `BertForNextSentencePrediction` model predicts two values that indicate whether the second sentence follows the first.

BERT returns these two values in a tensor: the first value is the score for the second sentence being a continuation of the first, and the second value is the score for the second sentence being a random sequence, that is, not a good continuation of the first.

Unlike in masked language modeling, we are not computing a softmax over BERT’s vocabulary. The logits here are just the two values that `BertForNextSentencePrediction` returns, and we compute a softmax over them so that we can see which value has the higher probability. This tells us whether the second sentence is a good next sentence for the first.


Once we get the softmax values, we can simply look at the tensor by printing it out. Here are the values that I got:

`tensor([[9.9998e-01, 1.5085e-05]], grad_fn=<SoftmaxBackward0>)`

Because the first value is considerably higher than the second value, BERT believes that the second sentence follows the first sentence, which is the correct answer.
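
With only two classes, the softmax reduces to a sigmoid of the difference between the two logits, which makes the decision easy to reason about. The sketch below uses made-up logits rather than real model output, and `prob_is_next` is a hypothetical helper name. It assumes the ordering described above, where index 0 means "continuation" and index 1 means "random".

```python
import math


def prob_is_next(logits):
    """Probability of index 0 ('B continues A') from the two NSP logits.

    For two classes, softmax(l0, l1)[0] equals 1 / (1 + exp(l1 - l0)).
    """
    is_next, not_next = logits
    return 1.0 / (1.0 + math.exp(not_next - is_next))


# Made-up logits: a large gap in favour of index 0, then a gap in favour of index 1.
confident_yes = prob_is_next([5.4, -5.7])
confident_no = prob_is_next([-1.2, 3.3])
print(confident_yes > 0.5, confident_no > 0.5)
```

A gap of roughly 11 between the two logits is already enough to push the first probability to about 0.99998, which is the order of magnitude seen in the printed tensor above.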

---

## Question Answering

In [None]:
from transformers import BertTokenizer, BertForQuestionAnswering
import torch

tokenizer = BertTokenizer.from_pretrained("deepset/bert-base-cased-squad2")

model = BertForQuestionAnswering.from_pretrained("deepset/bert-base-cased-squad2")

example_text = "GPT-3 came in 2020"

example_question = "When did GPT-3 come"

# We can use our tokenizer to automatically build the two-sequence input by passing
# the two sequences to the tokenizer as two arguments
tokenized_inputs = tokenizer(example_question, example_text, return_tensors="pt")
tokenized_inputs


BERT QA places the question before the context.

The tokenizer returns three tensors for us: `input_ids`, `token_type_ids`, and `attention_mask`.

`input_ids` are the token ids of the text.

----------------------

`token_type_ids`: to understand them, first note that some models are designed to do classification on pairs of sentences or question answering.

https://huggingface.co/docs/transformers/v4.20.1/en/glossary#token-type-ids

These require two different sequences to be joined in a single “input_ids” entry, which usually is performed with the help of special tokens, such as the classifier ([CLS]) and separator ([SEP]) tokens. For example, the BERT model builds its two sequence input as such:


`[CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]`

We used our tokenizer to generate such an input automatically by passing the two sequences to the tokenizer as two arguments.

BERT has token type IDs (also called segment IDs). They are represented as a binary mask identifying the two types of sequence in the model.

Here, those two types of sequence are the question and the context.

Token type 0 marks the question part and token type 1 marks the context.
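
The layout of the token type IDs can be sketched in plain Python. This is an illustration of the convention, not the tokenizer's implementation: whitespace splitting stands in for real tokenization, and `build_pair` is a made-up helper name.

```python
def build_pair(question_tokens, context_tokens):
    """Lay out [CLS] question [SEP] context [SEP] with matching segment ids."""
    first = ["[CLS]"] + question_tokens + ["[SEP]"]
    second = context_tokens + ["[SEP]"]
    tokens = first + second
    # Segment 0 covers [CLS], the question and its [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * len(first) + [1] * len(second)
    return tokens, token_type_ids


# Whitespace splitting is a stand-in for BERT's real tokenizer.
tokens, token_type_ids = build_pair("When did GPT-3 come".split(),
                                    "GPT-3 came in 2020".split())
for token, segment in zip(tokens, token_type_ids):
    print(segment, token)
```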


**The model will tell you at what start and end position of the input_ids the answer to the question will be located.**

----------------------

In [None]:
text = "The Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical texts. It has 75,000 codices from throughout history, as well as 1.1 million printed books, which include some 8,500 incunabula.   The Vatican Library is a research library for history, law, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications and research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail.   In March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to be made available online.   The Vatican Secret Archives were separated from the library at the beginning of the 17th century; they contain another 150,000 items.   Scholars have traditionally divided the history of the library into five periods, Pre-Lateran, Lateran, Avignon, Pre-Vatican and Vatican.   The Pre-Lateran period, comprising the initial days of the library, dated from the earliest days of the Church. Only a handful of volumes survive from this period, though some are very significant."

question = "When was the Vat formally opened?"

In [None]:
tokenizer = BertTokenizer.from_pretrained("deepset/bert-base-cased-squad2")

model = BertForQuestionAnswering.from_pretrained("deepset/bert-base-cased-squad2")

tokenized_inputs = tokenizer(question, text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**tokenized_inputs)

answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

''' start_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-start scores (before SoftMax).

end_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-end scores (before SoftMax). '''

predict_answer_tokens = tokenized_inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)
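
One caveat with taking the two argmax values independently is that nothing forces the end index to come after the start index; if it does not, the slice above is empty. A common guard, sketched below in plain Python, is to score start and end positions jointly and keep only spans where the start is not after the end. This is a swapped-in refinement, not what the code above does; the logits are made up, and `best_span` and `max_answer_len` are hypothetical names.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest start + end score, start <= end."""
    best, best_score = (0, 0), float("-inf")
    for start, start_score in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length limit.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


tokens = ["[CLS]", "When", "did", "GPT-3", "come", "[SEP]",
          "GPT-3", "came", "in", "2020", "[SEP]"]

# Made-up scores where the independent argmax picks start=9 but end=3 (an invalid span).
start_logits = [0.1, 0.0, 0.0, 0.5, 0.0, 0.0, 0.2, 0.1, 1.0, 4.0, 0.0]
end_logits = [0.1, 0.0, 0.0, 4.5, 0.0, 0.0, 0.3, 0.1, 0.2, 4.0, 0.0]

start, end = best_span(start_logits, end_logits)
answer = tokens[start : end + 1]
print(start, end, answer)
```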