<a href="https://colab.research.google.com/github/somnathsingh31/inside_LLM_Architecture/blob/main/tokenizer_in_details.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

In [4]:
from transformers import AutoTokenizer

checkpoint = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [6]:
print(tokenizer)

BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


In [7]:
raw_input = "This is my village. It is haven with lush fields, vibrant culture, traditional houses, warm community, ancient temples, and simple, self-sufficient living."

In [18]:
tokenized_input = tokenizer(raw_input, return_tensors='pt')

In [19]:
print(tokenized_input)

{'input_ids': tensor([[  101,  1188,  1110,  1139,  1491,   119,  1135,  1110,  3983,  1114,
         19302,  3872,   117, 18652,  2754,   117,  2361,  2725,   117,  3258,
          1661,   117,  2890,  8433,   117,  1105,  3014,   117,  2191,   118,
          6664,  1690,   119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


*Attention masks* are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).
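In the single-sentence example above the mask is all 1s because nothing was padded. The 0s only appear once sequences of different lengths are batched together. Below is a minimal pure-Python sketch of that idea, assuming right padding with `[PAD]` (ID 0, as shown in the printed tokenizer). `pad_batch` is a hypothetical helper written for illustration, not part of the `transformers` API; the token IDs are taken from the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Padding positions get the [PAD] ID ...
        input_ids.append(seq + [pad_id] * n_pad)
        # ... and a 0 in the mask, so attention layers can ignore them.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


batch = [
    [101, 1188, 1110, 1139, 1491, 119, 102],  # [CLS] This is my village . [SEP]
    [101, 1135, 1110, 102],                   # [CLS] It is [SEP]
]
padded_ids, padded_mask = pad_batch(batch)
print(padded_ids)
print(padded_mask)
```

In practice the tokenizer builds both tensors itself when called on a list of texts with `padding=True`.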

Calling the tokenizer directly, as above, returns the token IDs corresponding to the given raw text.

There are two steps involved: (i) splitting the text into tokens, and (ii) converting those tokens into numbers (token IDs).
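To make the two steps concrete, here is a toy sketch of both. It assumes a greedy longest-match-first subword split in the style of BERT's WordPiece, where `##` marks a piece that continues the previous one. The vocabulary, `split_into_tokens`, and `tokens_to_ids` are hypothetical and only for illustration; the real tokenizer also handles punctuation splitting and normalization, which this sketch leaves out.

```python
# Hypothetical toy vocabulary; the real bert-base-cased vocabulary has 28,996 entries.
toy_vocab = {"[UNK]": 0, "token": 1, "##ization": 2, "is": 3, "fun": 4, ".": 5}


def split_into_tokens(text, vocab):
    """Step 1: split whitespace-separated words into the longest matching vocabulary pieces."""
    tokens = []
    for word in text.split():
        start, pieces = 0, []
        while start < len(word):
            end, piece = len(word), None
            # Try the longest remaining substring first, then shrink it.
            while end > start:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate  # continuation of the same word
                if candidate in vocab:
                    piece = candidate
                    break
                end -= 1
            if piece is None:
                # No piece matched: the whole word becomes [UNK].
                pieces = ["[UNK]"]
                break
            pieces.append(piece)
            start = end
        tokens.extend(pieces)
    return tokens


def tokens_to_ids(tokens, vocab):
    """Step 2: look each token up in the vocabulary."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]


toy_tokens = split_into_tokens("tokenization is fun .", toy_vocab)
toy_ids = tokens_to_ids(toy_tokens, toy_vocab)
print(toy_tokens)
print(toy_ids)
```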

In [10]:
from transformers import AutoTokenizer

checkpoint = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [11]:
# first step: splitting text into tokens
tokens = tokenizer.tokenize(raw_input)
print(tokens)

['This', 'is', 'my', 'village', '.', 'It', 'is', 'haven', 'with', 'lush', 'fields', ',', 'vibrant', 'culture', ',', 'traditional', 'houses', ',', 'warm', 'community', ',', 'ancient', 'temples', ',', 'and', 'simple', ',', 'self', '-', 'sufficient', 'living', '.']


In [12]:
# second step: convert tokens into numbers or token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

[1188, 1110, 1139, 1491, 119, 1135, 1110, 3983, 1114, 19302, 3872, 117, 18652, 2754, 117, 2361, 2725, 117, 3258, 1661, 117, 2890, 8433, 117, 1105, 3014, 117, 2191, 118, 6664, 1690, 119]


**Decoding goes the other way around: from vocabulary indices, we want to get back a string.**

In [13]:
decoded_string = tokenizer.decode(token_ids)
print(decoded_string)

This is my village. It is haven with lush fields, vibrant culture, traditional houses, warm community, ancient temples, and simple, self - sufficient living.


***Note:*** The decode method not only converts the indices back to tokens, but also groups together the tokens that were part of the same words to produce a readable sentence.
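The sentence used here happens to contain only whole-word tokens, so the grouping is not visible in the output above. A toy sketch of the idea, assuming the `##` continuation prefix that BERT's WordPiece vocabulary uses; `toy_id_to_token` and `toy_decode` are hypothetical names, not library functions.

```python
# Hypothetical ID-to-token table, for illustration only.
toy_id_to_token = {1: "token", 2: "##ization", 3: "is", 4: "fun"}


def toy_decode(ids):
    """Map IDs back to tokens, then glue '##' pieces onto the token before them."""
    tokens = [toy_id_to_token[i] for i in ids]
    return " ".join(tokens).replace(" ##", "")


decoded_toy = toy_decode([1, 2, 3, 4])
print(decoded_toy)
```

The real `decode` does more than this. With `clean_up_tokenization_spaces=True` it also tidies spacing around some punctuation, which is why the output above reads `village.` while the hyphen in `self - sufficient` keeps its surrounding spaces.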

Feeding `token_ids` directly to the model as a 1D tensor will fail (typically with an IndexError) because the model expects input with a batch dimension, i.e. a 2D tensor whose first dimension is the batch size and whose second is the sequence length.

In [21]:
import torch

In [22]:
print("Size of tokenized input directly using tokenizer(raw_input, return_tensors='pt'): ", tokenized_input["input_ids"].size())
print("Size of tokenized input using tokenizer.convert_tokens_to_ids(tokens): ", torch.tensor(token_ids).size())

Size of tokenized input directly using tokenizer(raw_input, return_tensors='pt'):  torch.Size([1, 34])
Size of tokenized input using tokenizer.convert_tokens_to_ids(tokens):  torch.Size([32])


In [23]:
# Solve this by wrapping the list in another list, which adds a batch dimension
token_ids_input = torch.tensor([token_ids])
print("size: ", token_ids_input.size())

size:  torch.Size([1, 32])


Now we can feed this tensor to the model.
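One difference remains: the manual path gives 32 IDs while calling the tokenizer directly gave 34. The two extra IDs are `[CLS]` (101) and `[SEP]` (102), which the direct call adds at the start and end. Below is a small pure-Python sketch that reproduces the direct output from the manual IDs. The ID values are copied from the outputs above; the variable names are chosen just for this illustration.

```python
# IDs produced by tokenizer.convert_tokens_to_ids(tokens) above.
token_ids = [1188, 1110, 1139, 1491, 119, 1135, 1110, 3983, 1114, 19302, 3872,
             117, 18652, 2754, 117, 2361, 2725, 117, 3258, 1661, 117, 2890,
             8433, 117, 1105, 3014, 117, 2191, 118, 6664, 1690, 119]

CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP], as listed in the printed tokenizer

# Wrap the sequence in the special tokens, then in an outer list for the batch dimension.
with_special = [CLS_ID] + token_ids + [SEP_ID]
batch_with_special = [with_special]

print(len(batch_with_special), len(batch_with_special[0]))
```

Passing `batch_with_special` to `torch.tensor` would give a tensor of shape `[1, 34]`, the same shape as `tokenized_input["input_ids"]`.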