<a href="https://colab.research.google.com/github/shahchhatru/AI_colab_notebooks/blob/main/BERTTokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Following a YouTube tutorial
https://www.youtube.com/watch?v=KPtna8FahZ8&list=PLxqBkZuBynVQaqvEwN-qAjkNAJ6NgyfcM

Notebook no. 1

In [None]:
! pip install transformers -q

In [None]:
from transformers import BertModel, BertTokenizer
import torch

In [None]:
model = BertModel.from_pretrained('bert-base-uncased')



In [None]:
sentence='She is a machine learning engineer and works in California'


In [None]:
tokenizer=BertTokenizer.from_pretrained('bert-base-uncased')

#### Tokenize the sentence and obtain the tokens

In [None]:
tokens=tokenizer.tokenize(sentence)

In [None]:
tokens

['she',
 'is',
 'a',
 'machine',
 'learning',
 'engineer',
 'and',
 'works',
 'in',
 'california']
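Note that the output is lowercased ('California' became 'california') because the model is uncased. BERT tokenizers also split words they do not know into sub-word pieces using WordPiece. The following is a minimal sketch of the greedy longest-match idea behind WordPiece, not the library's actual implementation. The names `wordpiece`, `toy_tokenize`, and `toy_wordpiece_vocab` are made up for illustration, and the vocabulary is a tiny hypothetical one.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def toy_tokenize(sentence, vocab):
    """Lowercase, split on whitespace, then apply WordPiece to each word."""
    result = []
    for word in sentence.lower().split():
        result.extend(wordpiece(word, vocab))
    return result


# Hypothetical toy vocabulary, NOT the real bert-base-uncased vocabulary.
toy_wordpiece_vocab = {"she", "is", "a", "machine", "learning", "engineer",
                       "token", "##ization", "##s"}

print(toy_tokenize("She is a machine learning engineer", toy_wordpiece_vocab))
print(toy_tokenize("tokenization", toy_wordpiece_vocab))  # ['token', '##ization']
print(toy_tokenize("engineers", toy_wordpiece_vocab))     # ['engineer', '##s']
print(toy_tokenize("xyz", toy_wordpiece_vocab))           # ['[UNK]']
```

Every word in our sentence is already in the real vocabulary, which is why the output above contains no '##' pieces.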

Add the start and end tokens to the original token list.

'[CLS]' marks the beginning and '[SEP]' marks the end.



In [None]:
tokens=['[CLS]']+tokens+['[SEP]']

In [None]:
tokens

['[CLS]',
 'she',
 'is',
 'a',
 'machine',
 'learning',
 'engineer',
 'and',
 'works',
 'in',
 'california',
 '[SEP]']

A model expects every input in a batch to contain the same number of tokens, but sentence lengths vary naturally. So what do we do?
We set a max_length limit and append '[PAD]' tokens to each shorter sentence until its token list reaches that size.

In [None]:
# Padding
# Assume max_length is 14; the list has 12 tokens, so we add two '[PAD]' tokens
tokens=tokens+['[PAD]']+['[PAD]']

In [None]:
print(tokens)

['[CLS]', 'she', 'is', 'a', 'machine', 'learning', 'engineer', 'and', 'works', 'in', 'california', '[SEP]', '[PAD]', '[PAD]']


In [None]:
print(len(tokens))

14
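The manual padding above only works when we already know how many '[PAD]' tokens are missing. A minimal sketch of a general helper could look like this. The function name `pad_or_truncate` is hypothetical, and the truncation branch is an assumption added for completeness: it assumes `max_length` is at least 2 and that the input ends with '[SEP]'.

```python
def pad_or_truncate(token_list, max_length, pad_token="[PAD]", sep_token="[SEP]"):
    """Return a copy of token_list that is exactly max_length tokens long."""
    if len(token_list) > max_length:
        # Too long: cut the sequence but keep the closing [SEP] as the last token.
        return token_list[: max_length - 1] + [sep_token]
    # Too short (or already the right size): append the missing [PAD] tokens.
    return token_list + [pad_token] * (max_length - len(token_list))


example = ["[CLS]", "she", "is", "a", "machine", "learning", "engineer",
           "and", "works", "in", "california", "[SEP]"]

padded = pad_or_truncate(example, 14)
print(padded[-3:], len(padded))      # ['[SEP]', '[PAD]', '[PAD]'] 14

truncated = pad_or_truncate(example, 8)
print(truncated, len(truncated))
```

The helper returns a new list instead of modifying its input, so the original token list stays available for inspection.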


## Attention mask
The '[PAD]' tokens were added only to reach the required length and are not part of the actual sentence. To tell the model this, we create an attention mask with 0 for '[PAD]' tokens and 1 for all other tokens.


In [None]:
attention_mask=[1 if i!='[PAD]' else 0 for i in tokens]

In [None]:
print(attention_mask)

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
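To see why a 0 in the mask makes the model ignore a position, here is a simplified sketch of a masked softmax over attention scores. It is an illustration of the idea only, not BERT's actual code: real implementations typically add a large negative number to the masked scores, while this sketch uses negative infinity. The scores are made-up values, and the function assumes at least one position is unmasked.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores where positions with mask == 0 receive zero weight."""
    # Replace the score of every masked position with -inf so that exp() gives 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw attention scores for 5 tokens; the last two are '[PAD]'.
scores = [2.0, 1.0, 0.5, 3.0, 3.0]
mask = [1, 1, 1, 0, 0]

weights = masked_softmax(scores, mask)
print([round(w, 3) for w in weights])
```

Even though the padded positions had the highest raw scores, their weights come out as exactly 0, and the remaining weights still sum to 1.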


## Unique token IDs
Each token in the vocabulary is mapped to a unique integer ID, and these IDs are what the model actually takes as input.
The tokenizer's `convert_tokens_to_ids` method performs this lookup.


In [None]:
token_ids=tokenizer.convert_tokens_to_ids(tokens)

In [None]:
token_ids

[101, 2016, 2003, 1037, 3698, 4083, 3992, 1998, 2573, 1999, 2662, 102, 0, 0]
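Conceptually, this conversion is a dictionary lookup. The sketch below mimics it with a hand-copied subset of the IDs printed above. The helper names `tokens_to_ids` and `ids_to_tokens` are hypothetical, and the fallback to the '[UNK]' id (100 in bert-base-uncased) is how unknown tokens are assumed to be handled here.

```python
# Hand-copied subset of the real vocabulary, which has roughly 30,000 entries.
toy_id_vocab = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "she": 2016, "is": 2003, "a": 1037, "machine": 3698, "learning": 4083,
}


def tokens_to_ids(token_list, vocab, unk_token="[UNK]"):
    """Look up each token; tokens missing from the vocabulary get the [UNK] id."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in token_list]


def ids_to_tokens(id_list, vocab):
    """Reverse lookup from ids back to tokens."""
    inverse = {token_id: token for token, token_id in vocab.items()}
    return [inverse[token_id] for token_id in id_list]


toy_ids = tokens_to_ids(["[CLS]", "she", "is", "a", "machine", "[SEP]", "[PAD]"], toy_id_vocab)
print(toy_ids)                                  # [101, 2016, 2003, 1037, 3698, 102, 0]
print(ids_to_tokens(toy_ids, toy_id_vocab))
print(tokens_to_ids(["qwertyish"], toy_id_vocab))  # [100]
```

Note that '[PAD]' maps to 0, which matches the two trailing zeros in the real output above.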

In [None]:
# Convert to a tensor and add a batch dimension: shape (14,) -> (1, 14)
token_ids=torch.tensor(token_ids).unsqueeze(0)
# Do the same for the attention mask
attention_mask=torch.tensor(attention_mask).unsqueeze(0)
print(token_ids)
print(attention_mask)

tensor([[ 101, 2016, 2003, 1037, 3698, 4083, 3992, 1998, 2573, 1999, 2662,  102,
            0,    0]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])


In [None]:
outputs=model(token_ids,attention_mask=attention_mask)

In [None]:
outputs

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.1054,  0.1832, -0.4909,  ..., -0.3241,  0.3984,  0.1591],
         [ 0.2438, -0.3654, -0.6240,  ..., -0.0516,  0.3615, -0.3675],
         [-0.1027,  0.1559,  0.0640,  ..., -0.5743,  0.0936,  0.2774],
         ...,
         [ 0.7337,  0.0462, -0.4839,  ..., -0.0072, -0.5861, -0.5907],
         [-0.1763,  0.0910, -0.6007,  ...,  0.1243,  0.4635, -0.3595],
         [-0.2029, -0.1064, -0.5744,  ...,  0.4472,  0.6531, -0.4184]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.9473, -0.4997, -0.8718,  0.8811,  0.7894, -0.3033,  0.9334,  0.4975,
         -0.7602, -1.0000, -0.6657,  0.9340,  0.9871,  0.4571,  0.9601, -0.8242,
         -0.1882, -0.7192,  0.4297, -0.7576,  0.7524,  1.0000,  0.2104,  0.4064,
          0.5661,  0.9898, -0.8310,  0.9563,  0.9695,  0.8257, -0.7848,  0.4256,
         -0.9918, -0.3059, -0.8785, -0.9956,  0.5351, -0.8573, -0.0809, -0.1058,
         -0.9060,  0.5226,  1.00

In [None]:
outputs.last_hidden_state

tensor([[[-0.1054,  0.1832, -0.4909,  ..., -0.3241,  0.3984,  0.1591],
         [ 0.2438, -0.3654, -0.6240,  ..., -0.0516,  0.3615, -0.3675],
         [-0.1027,  0.1559,  0.0640,  ..., -0.5743,  0.0936,  0.2774],
         ...,
         [ 0.7337,  0.0462, -0.4839,  ..., -0.0072, -0.5861, -0.5907],
         [-0.1763,  0.0910, -0.6007,  ...,  0.1243,  0.4635, -0.3595],
         [-0.2029, -0.1064, -0.5744,  ...,  0.4472,  0.6531, -0.4184]]],
       grad_fn=<NativeLayerNormBackward0>)

In [None]:
outputs.last_hidden_state.shape

torch.Size([1, 14, 768])

In [None]:
outputs[0]  # index 0 returns the same tensor as outputs.last_hidden_state

tensor([[[-0.1054,  0.1832, -0.4909,  ..., -0.3241,  0.3984,  0.1591],
         [ 0.2438, -0.3654, -0.6240,  ..., -0.0516,  0.3615, -0.3675],
         [-0.1027,  0.1559,  0.0640,  ..., -0.5743,  0.0936,  0.2774],
         ...,
         [ 0.7337,  0.0462, -0.4839,  ..., -0.0072, -0.5861, -0.5907],
         [-0.1763,  0.0910, -0.6007,  ...,  0.1243,  0.4635, -0.3595],
         [-0.2029, -0.1064, -0.5744,  ...,  0.4472,  0.6531, -0.4184]]],
       grad_fn=<NativeLayerNormBackward0>)

In [None]:
outputs.pooler_output

tensor([[-0.9473, -0.4997, -0.8718,  0.8811,  0.7894, -0.3033,  0.9334,  0.4975,
         -0.7602, -1.0000, -0.6657,  0.9340,  0.9871,  0.4571,  0.9601, -0.8242,
         -0.1882, -0.7192,  0.4297, -0.7576,  0.7524,  1.0000,  0.2104,  0.4064,
          0.5661,  0.9898, -0.8310,  0.9563,  0.9695,  0.8257, -0.7848,  0.4256,
         -0.9918, -0.3059, -0.8785, -0.9956,  0.5351, -0.8573, -0.0809, -0.1058,
         -0.9060,  0.5226,  1.0000, -0.1399,  0.5261, -0.3048, -1.0000,  0.3643,
         -0.9460,  0.8846,  0.7668,  0.8117,  0.2881,  0.6517,  0.5824, -0.2660,
          0.0118,  0.1780, -0.3221, -0.7566, -0.6814,  0.4362, -0.8407, -0.9596,
          0.8762,  0.7554, -0.2953, -0.3134, -0.2160, -0.0667,  0.9666,  0.3424,
          0.0743, -0.8780,  0.6402,  0.2758, -0.7399,  1.0000, -0.5168, -0.9872,
          0.7122,  0.7422,  0.7109, -0.1680,  0.4112, -1.0000,  0.6379, -0.1462,
         -0.9948,  0.2089,  0.5989, -0.3234,  0.3302,  0.7130, -0.4732, -0.5511,
         -0.4602, -0.8095, -

In [None]:
outputs.pooler_output.shape

torch.Size([1, 768])
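The two shapes differ because `last_hidden_state` holds one 768-dimensional vector per token, while `pooler_output` holds a single vector per sentence. In BERT, the pooler takes the hidden state at the '[CLS]' position and passes it through a dense layer followed by tanh, which is why every value above lies between -1 and 1. The following is a rough pure-Python sketch of that computation with a hidden size of 2 instead of 768. The weights, bias, and hidden states are made-up numbers, not values from the model.

```python
import math


def pooler(last_hidden_state, weight, bias):
    """Apply dense + tanh to the first token's vector of each sequence in the batch."""
    pooled = []
    for sequence in last_hidden_state:
        cls_vector = sequence[0]  # hidden state at the [CLS] position
        row = []
        for w_row, b in zip(weight, bias):
            z = sum(w * h for w, h in zip(w_row, cls_vector)) + b
            row.append(math.tanh(z))
        pooled.append(row)
    return pooled


# Hypothetical tiny input: batch of 1, 3 tokens, hidden size 2.
toy_hidden = [[[0.5, -1.0], [0.2, 0.3], [0.0, 0.0]]]
toy_weight = [[1.0, 0.0], [0.0, 2.0]]
toy_bias = [0.0, 0.5]

toy_pooled = pooler(toy_hidden, toy_weight, toy_bias)
print(toy_pooled)
print(len(toy_pooled), len(toy_pooled[0]))  # 1 2, analogous to torch.Size([1, 768])
```

Only the first token's vector contributes to the result; the other token vectors are ignored by the pooler.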