## First, what is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers. The name itself gives us several clues to what BERT is all about.

The BERT architecture consists of several Transformer encoders stacked together. Each Transformer encoder contains two sub-layers: a self-attention layer and a feed-forward layer.

### There are two different BERT models:

- BERT base, which consists of 12 Transformer encoder layers, 12 attention heads, a hidden size of 768, and 110M parameters.

- BERT large, which consists of 24 Transformer encoder layers, 16 attention heads, a hidden size of 1024, and 340M parameters.
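
To see where parameter counts of this size come from, here is a rough back-of-the-envelope sketch. It assumes the `bert-base-uncased` vocabulary size of 30,522, 512 position embeddings, 2 segment types, and a feed-forward width of 4 times the hidden size; the function name `bert_param_count` is just an illustrative helper, not part of any library.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT encoder with pooler (no task head)."""
    intermediate = 4 * hidden  # assumed feed-forward width

    # Embeddings: token, position and segment tables, plus one LayerNorm (weight + bias).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden

    # Self-attention: query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)

    # Feed-forward: hidden -> intermediate -> hidden, each with a bias.
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)

    # Two LayerNorms per encoder layer (weight + bias each).
    layer_norms = 2 * (2 * hidden)

    per_layer = attention + feed_forward + layer_norms

    # Pooler: one dense layer applied to the [CLS] position.
    pooler = hidden * hidden + hidden

    return embeddings + num_layers * per_layer + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT base:  {base / 1e6:.1f}M parameters")   # 109.5M
print(f"BERT large: {large / 1e6:.1f}M parameters")  # 335.1M
```

Under these assumptions the totals land close to the commonly quoted, rounded figures of 110M and 340M. Note that the number of attention heads does not appear in the calculation: the heads split the hidden dimension among themselves, so they do not add parameters.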



### BERT Input and Output

The BERT model expects a sequence of tokens (words or subword pieces) as input. Each sequence contains two special tokens that BERT expects:

- [CLS]: This is the first token of every sequence; the name stands for classification token.
- [SEP]: This token tells BERT which tokens belong to which sequence. It matters mainly for tasks that take two sequences, such as next sentence prediction or question answering. If we only have one sequence, this token is appended to the end of it.


It is also important to note that the maximum number of tokens that can be fed into the BERT model is 512. If a sequence has fewer than 512 tokens, we can use padding to fill the unused token slots with the [PAD] token. If a sequence has more than 512 tokens, we need to truncate it.
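
The padding and truncation rules above can be sketched in a few lines of plain Python. The helper name `prepare_sequence` and the tiny `max_len=6` used in the demo are illustrative assumptions; the tokenizer used later in this notebook has its own, more complete handling.

```python
def prepare_sequence(tokens, max_len=512):
    """Add [CLS]/[SEP], then truncate or pad so the result has exactly max_len tokens."""
    # Reserve two slots for the special tokens and truncate the text if it is too long.
    body = tokens[: max_len - 2]
    sequence = ["[CLS]"] + body + ["[SEP]"]
    # Fill any unused slots with [PAD].
    sequence += ["[PAD]"] * (max_len - len(sequence))
    return sequence


short_seq = prepare_sequence(["hello", "world"], max_len=6)
long_seq = prepare_sequence(["a", "b", "c", "d", "e", "f"], max_len=6)
print(short_seq)  # ['[CLS]', 'hello', 'world', '[SEP]', '[PAD]', '[PAD]']
print(long_seq)   # ['[CLS]', 'a', 'b', 'c', 'd', '[SEP]']
```

Truncating before adding the special tokens keeps [SEP] at the end of the sequence even when the text is too long.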

And that’s all that BERT expects as input.

The BERT model then outputs an embedding vector of size 768 for each token (768 is the hidden size of BERT base). We can use these vectors as input for different kinds of NLP applications, such as text classification, next sentence prediction, named-entity recognition (NER), or question answering.


------------

**For a text classification task**, we focus on the embedding vector output for the special [CLS] token. This means we use the size-768 embedding vector of the [CLS] token as input to our classifier, which then outputs a vector whose size equals the number of classes in our classification task.
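
A minimal sketch of such a classifier head, assuming PyTorch. The class name `ClsClassifier`, the choice of three classes, and the random tensor standing in for BERT's output are all illustrative assumptions rather than part of the original notebook.

```python
import torch
from torch import nn


class ClsClassifier(nn.Module):
    """Linear classifier that reads only the [CLS] position of BERT's output."""

    def __init__(self, hidden_size=768, num_classes=3):
        super().__init__()
        self.linear = nn.Linear(hidden_size, num_classes)

    def forward(self, last_hidden_state):
        # [CLS] is always the first token, so take position 0: [batch, hidden_size].
        cls_vector = last_hidden_state[:, 0, :]
        # One score (logit) per class: [batch, num_classes].
        return self.linear(cls_vector)


torch.manual_seed(0)
# Random stand-in for BERT's per-token output: 2 sequences, 16 tokens, 768 dimensions.
fake_bert_output = torch.randn(2, 16, 768)

classifier = ClsClassifier(hidden_size=768, num_classes=3)
logits = classifier(fake_bert_output)
print(logits.shape)  # torch.Size([2, 3])
```

In practice the head would be trained together with, or on top of, the pre-trained BERT model; the sketch only shows how the shapes line up.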

-----------------------

![Imgur](https://imgur.com/NpeB9vb.png)

-------------------------

## Extracting embeddings from pre-trained BERT

In [None]:
!pip install transformers -q

In [None]:
from transformers import BertModel, BertTokenizer
import torch

In [None]:
model = BertModel.from_pretrained('bert-base-uncased')

In [None]:
sentence = 'She is a MachineLearning Engineer and works in California'

## Understanding Token IDs

The token ids are indices in a vocabulary.

The IDs themselves are not fed to the network as numbers; instead, each ID is transformed into a vector.

Say you are inputting three words, and their IDs are 12, 14, and 4. What is actually given as input is three vectors (say, each n-dimensional), where each ID is mapped to a unique vector. These vectors could be one-hot, i.e., a 1 at index 4 for token ID 4 and zeros elsewhere, or they could be pre-trained embeddings such as GloVe.

-----------------

![](2022-09-27-21-01-25.png)

![](2022-09-27-21-02-47.png)


The token ID is used in the embedding layer, which you can view as a matrix whose row indices are the token IDs. In other words, the token ID is the row index into the embedding matrix, so every row is a token representation.

There is one row for each item in the vocabulary, for instance 30K rows for 30K tokens. Every token therefore has a (learned!) representation.
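
To make the "token ID is a row index" idea concrete, the toy example below compares a direct row lookup with multiplying a one-hot vector by the embedding matrix. The 5-by-3 matrix and its values are made up for illustration; a real BERT embedding matrix is learned and far larger.

```python
# Hypothetical toy embedding matrix: 5 tokens in the vocabulary, 3 dimensions each.
embedding_matrix = [
    [0.1, 0.2, 0.3],  # token ID 0
    [0.4, 0.5, 0.6],  # token ID 1
    [0.7, 0.8, 0.9],  # token ID 2
    [1.0, 1.1, 1.2],  # token ID 3
    [1.3, 1.4, 1.5],  # token ID 4
]


def one_hot(token_id, vocab_size):
    """Vector with a 1 at index token_id and zeros elsewhere."""
    return [1.0 if i == token_id else 0.0 for i in range(vocab_size)]


def lookup_by_row(token_id):
    """Embedding lookup as plain row indexing."""
    return embedding_matrix[token_id]


def lookup_by_one_hot(token_id):
    """Embedding lookup as a one-hot vector times the embedding matrix."""
    vec = one_hot(token_id, len(embedding_matrix))
    dim = len(embedding_matrix[0])
    return [
        sum(vec[row] * embedding_matrix[row][col] for row in range(len(vec)))
        for col in range(dim)
    ]


print(lookup_by_row(4))      # [1.3, 1.4, 1.5]
print(lookup_by_one_hot(4))  # [1.3, 1.4, 1.5]
```

Both routes give the same vector, which is why an embedding layer can skip the one-hot step and simply index into the matrix.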

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Tokenize the sentence and obtain the tokens:

In [22]:
tokens = tokenizer.tokenize(sentence)

In [23]:
tokens

['she',
 'is',
 'a',
 'machine',
 '##lea',
 '##rn',
 '##ing',
 'engineer',
 'and',
 'works',
 'in',
 'california']

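As the output shows, the uncased tokenizer lowercases the text and splits the word `MachineLearning`, which is not in its vocabulary, into the subword pieces `machine`, `##lea`, `##rn`, and `##ing`. The `##` prefix marks a piece that continues the previous one.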
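
The subword split can be reproduced with a greedy longest-match-first procedure, which is the general idea behind WordPiece tokenization. The sketch below is a simplified illustration: the function `wordpiece` and the tiny `toy_vocab` are assumptions made up for this example, and the real `BertTokenizer` also handles lowercasing, punctuation, and a vocabulary of roughly 30K entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word becomes unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen so the split matches the notebook output.
toy_vocab = {"she", "machine", "##lea", "##rn", "##ing", "engineer"}

print(wordpiece("machinelearning", toy_vocab))  # ['machine', '##lea', '##rn', '##ing']
print(wordpiece("engineer", toy_vocab))         # ['engineer']
```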

Now, we will add the [CLS] token at the beginning and [SEP] token at the end of the tokens list: 

In [24]:
tokens = ['[CLS]'] + tokens + ['[SEP]']

Let's look at our updated tokens list:

In [25]:
print(tokens)

['[CLS]', 'she', 'is', 'a', 'machine', '##lea', '##rn', '##ing', 'engineer', 'and', 'works', 'in', 'california', '[SEP]']


In [28]:
len(tokens)

14

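Suppose we want every sequence to have a length of 16. Our list has 14 tokens, so we add two [PAD] tokens at the end:

In [29]:
tokens = tokens + ['[PAD]'] + ['[PAD]']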

Let's check the length of our updated tokens list:

In [31]:
print(len(tokens))

16


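Next, we create the attention mask, which holds 1 for every real token and 0 for every [PAD] token, so that the model can ignore the padding:

In [32]:
attention_mask = [1 if i != '[PAD]' else 0 for i in tokens]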

In [33]:
print(attention_mask)

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
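
To give an intuition for what the mask does inside the model, here is a simplified sketch of one common way an attention mask is applied: padded positions receive a very large negative score before the softmax, so their attention weight becomes zero. The function `masked_softmax` and the example scores are illustrative assumptions, not the library's actual code.

```python
import math


def masked_softmax(scores, attention_mask):
    """Softmax over attention scores that gives zero weight to masked positions."""
    # Replace the score of every padded position (mask == 0) with a huge negative number.
    masked = [s if m == 1 else -1e9 for s, m in zip(scores, attention_mask)]
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 3.0]  # made-up raw attention scores for four positions
mask = [1, 1, 1, 0]            # the last position is padding
weights = masked_softmax(scores, mask)
print([round(w, 3) for w in weights])
```

Even though the padded position has the highest raw score, it ends up with a weight of zero, and the remaining weights still sum to 1.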


Finally, we convert every token to its unique token ID:

In [34]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)


Let's have a look at the token_ids:

In [35]:
print(token_ids)

[101, 2016, 2003, 1037, 3698, 19738, 6826, 2075, 3992, 1998, 2573, 1999, 2662, 102, 0, 0]


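For reference, these IDs map back to the following tokens; the two trailing 0s are the IDs of the [PAD] tokens: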
['[CLS]', 'she', 'is', 'a', 'machine', '##lea', '##rn', '##ing', 'engineer', 'and', 'works', 'in', 'california', '[SEP]']

In [38]:
token_ids = torch.tensor(token_ids).unsqueeze(0)

attention_mask = torch.tensor(attention_mask).unsqueeze(0)


That's it. Both are now tensors, and `unsqueeze(0)` has added a leading batch dimension to each. Next, we feed the token_ids and attention_mask to the pre-trained BERT model and get the embeddings.

## Getting the embedding 


In [40]:
output = model(token_ids, attention_mask = attention_mask)

In [41]:
output

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.1925,  0.1684, -0.4252,  ..., -0.2599,  0.3736,  0.0529],
         [ 0.2417, -0.2748, -0.4909,  ...,  0.1372,  0.3408, -0.4655],
         [-0.0871,  0.0837,  0.2605,  ..., -0.4635, -0.0462,  0.2621],
         ...,
         [ 0.6711, -0.0076, -0.3847,  ..., -0.1289, -0.5171, -0.8002],
         [-0.2731,  0.1098, -0.5440,  ...,  0.0314,  0.4467, -0.3448],
         [-0.2387,  0.0119, -0.4760,  ...,  0.4656,  0.5837, -0.3774]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.9531, -0.4914, -0.8872,  0.9035,  0.8174, -0.2919,  0.9511,  0.4982,
         -0.7595, -1.0000, -0.6996,  0.9459,  0.9890,  0.4754,  0.9723, -0.8460,
         -0.1423, -0.7209,  0.4428, -0.7905,  0.7822,  1.0000,  0.2119,  0.4066,
          0.5813,  0.9923, -0.8380,  0.9670,  0.9746,  0.8324, -0.8227,  0.4136,
         -0.9931, -0.2821, -0.8860, -0.9961,  0.5261, -0.8722, -0.0915, -0.0950,
         -0.9237,  0.5106,  1.00

In [44]:
output[0].shape

torch.Size([1, 16, 768])

In [45]:
output[1]

tensor([[-0.9531, -0.4914, -0.8872,  0.9035,  0.8174, -0.2919,  0.9511,  0.4982,
         -0.7595, -1.0000, -0.6996,  0.9459,  0.9890,  0.4754,  0.9723, -0.8460,
         -0.1423, -0.7209,  0.4428, -0.7905,  0.7822,  1.0000,  0.2119,  0.4066,
          0.5813,  0.9923, -0.8380,  0.9670,  0.9746,  0.8324, -0.8227,  0.4136,
         -0.9931, -0.2821, -0.8860, -0.9961,  0.5261, -0.8722, -0.0915, -0.0950,
         -0.9237,  0.5106,  1.0000, -0.0830,  0.5382, -0.3140, -1.0000,  0.3774,
         -0.9557,  0.8998,  0.7947,  0.8279,  0.2756,  0.6581,  0.6064, -0.3369,
          0.0251,  0.1856, -0.3297, -0.7515, -0.6843,  0.4392, -0.8613, -0.9603,
          0.8838,  0.7763, -0.3041, -0.3105, -0.1854, -0.0969,  0.9726,  0.3039,
          0.0437, -0.8890,  0.6619,  0.2710, -0.7410,  1.0000, -0.5422, -0.9900,
          0.7010,  0.7629,  0.6910, -0.1635,  0.4133, -1.0000,  0.6527, -0.1595,
         -0.9959,  0.2069,  0.5956, -0.3285,  0.3339,  0.7146, -0.4482, -0.5843,
         -0.4799, -0.8426, -

In [46]:
output[1].shape

torch.Size([1, 768])
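
The two shapes above reflect two different outputs: `output[0]` (`last_hidden_state`) holds one 768-dimensional vector per token, with shape [batch, sequence length, hidden size], while `output[1]` (`pooler_output`) holds a single vector per sequence, which the Hugging Face implementation derives from the [CLS] position through a dense layer and tanh. The sketch below shows two simple ways to get a sentence-level vector from the per-token output. It uses a random tensor as a stand-in for `output[0]` so that it runs without downloading the model, and the mask-aware mean pooling is an illustrative alternative that the original notebook does not use.

```python
import torch

torch.manual_seed(0)
# Random stand-in for output[0]: 1 sequence, 16 tokens, 768 dimensions.
last_hidden_state = torch.randn(1, 16, 768)
# Same layout as in the notebook: 14 real tokens followed by 2 [PAD] tokens.
attention_mask = torch.tensor([[1] * 14 + [0] * 2])

# Option 1: take the vector at the [CLS] position (index 0).
cls_embedding = last_hidden_state[:, 0, :]

# Option 2: average the token vectors, ignoring the padded positions.
mask = attention_mask.unsqueeze(-1).float()     # [1, 16, 1]
summed = (last_hidden_state * mask).sum(dim=1)  # [1, 768]
counts = mask.sum(dim=1).clamp(min=1.0)         # number of real tokens per sequence
mean_embedding = summed / counts

print(cls_embedding.shape)   # torch.Size([1, 768])
print(mean_embedding.shape)  # torch.Size([1, 768])
```

With the real model, replacing the stand-in tensor with `output[0]` and reusing the `attention_mask` tensor built earlier should give the same shapes.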