<a href="https://colab.research.google.com/github/vkjadon/llm/blob/main/03hf_tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import transformers
print(transformers.__version__)

In [None]:
dir(transformers)

The first type of tokenizer that comes to mind is word-based. It is generally easy to set up and use with only a few rules, and it often yields decent results. There are different ways to split the text. For example, we could split the text into words on whitespace by applying Python's split() method:


In [None]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

Word-level tokenizers split text into whole words, sometimes using extra rules for punctuation. This creates very large vocabularies, because each unique word in the language must have its own ID. English alone has over 500,000 words, and forms like dog/dogs or run/running are treated as completely different tokens, so the model initially doesn't know they are related.

Because no vocabulary can include every possible word, tokenizers need an unknown token (often [UNK] or <unk>). When a tokenizer produces many unknown tokens, it means it is failing to represent words properly and losing information. Therefore, vocabularies should be designed to minimize unknown tokens while keeping the vocabulary size manageable.
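
As a rough illustration of the unknown-token fallback, the sketch below builds a toy word-level vocabulary and encodes a sentence with it. The corpus, the helper names build_vocab and encode, and the choice of ID 0 for [UNK] are assumptions made up for this example; they do not come from any library.

```python
# Toy word-level tokenizer; corpus, helper names, and IDs are illustrative assumptions.
UNK = "[UNK]"

def build_vocab(corpus):
    """Give every whitespace-separated word an integer ID, reserving 0 for [UNK]."""
    vocab = {UNK: 0}
    for sentence in corpus:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab):
    """Map each word to its ID, falling back to the [UNK] ID for unseen words."""
    return [vocab.get(word, vocab[UNK]) for word in text.split()]

corpus = ["the dog runs", "the dogs run"]
vocab = build_vocab(corpus)

print(vocab)                                 # "dog" and "dogs" get unrelated IDs
print(encode("the dog is running", vocab))   # "is" and "running" collapse to [UNK]
```

Every word outside the training corpus maps to the same ID, so the model cannot tell "is" from "running". This is the information loss that a well-designed vocabulary tries to keep small.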

One way to reduce the number of unknown tokens is to go one level deeper, using a character-based tokenizer.

Character-based tokenizers split text into individual characters instead of words. This greatly reduces vocabulary size and almost eliminates unknown tokens, since all words can be formed from characters. However, this approach introduces challenges with handling spaces and punctuation, and the tokens carry less meaning, especially in languages like English (though not in languages like Chinese, where characters are meaningful). It also increases sequence length because one word becomes many character tokens.
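
The trade-off between vocabulary size and sequence length can be seen on the sentence used earlier. This is a minimal sketch that treats every character, spaces included, as a token; the variable names are chosen only for this example.

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()   # one token per whitespace-separated word
char_tokens = list(text)     # one token per character, spaces included

# The character vocabulary is just the set of distinct characters seen.
char_vocab = sorted(set(char_tokens))

print(len(word_tokens), "word tokens")
print(len(char_tokens), "character tokens")
print(len(char_vocab), "distinct characters in the vocabulary")
```

The character sequence is several times longer than the word sequence, while the vocabulary needed to cover it stays tiny.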

To balance the advantages of word- and character-level methods, subword tokenization is used.

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.

For instance, “annoyingly” might be considered a rare word and could be decomposed into “annoying” and “ly”. These are both likely to appear more frequently as standalone subwords, while at the same time the meaning of “annoyingly” is kept by the composite meaning of “annoying” and “ly”.

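As another example, a subword tokenization algorithm might tokenize the sequence “Let's do tokenization!” by keeping common words such as “do” whole while splitting “tokenization” into “token” and “ization”, two pieces that each carry meaning and occur often enough to earn their own vocabulary entries.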
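
The decomposition idea can be sketched with a toy greedy longest-match splitter. This is similar in spirit to the longest-match-first lookup that WordPiece-style tokenizers apply at inference time, but it is only a simplified illustration: the function name subword_tokenize and the hand-written vocabulary are assumptions, and real tokenizers learn their vocabulary from data and mark word boundaries or continuation pieces.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No vocabulary piece starts at this position, so give up on the word.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces

# Hand-written toy vocabulary (an assumption for this sketch).
toy_vocab = {"annoying", "ly", "token", "ization", "do"}

for word in ["do", "annoyingly", "tokenization", "xyz"]:
    print(word, "->", subword_tokenize(word, toy_vocab))
```

Words that are in the vocabulary stay whole, rarer words are covered by smaller pieces, and only a word with no matching pieces at all falls back to the unknown token.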

Similar to AutoModel, the AutoTokenizer class will grab the proper tokenizer class in the library based on the checkpoint name, and can be used directly with any checkpoint:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer("Using a Transformer network is simple")

In [None]:
print(type(tokenizer))

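Loading and saving a tokenizer relies on the same two methods as models: from_pretrained() and save_pretrained(). They load or save the algorithm used by the tokenizer (a bit like the architecture of the model) as well as its vocabulary (a bit like the weights of the model).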

The tokenizer that belongs to a BERT checkpoint can also be loaded through its concrete class. This works the same way as loading the model, except that we use the BertTokenizer class:

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

tokenizer("Using a Transformer network is simple")

In [None]:
print(type(tokenizer))

In [None]:
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
text = "Transformers are powerful!"

tokens = tokenizer.tokenize(text)  # subword pieces only, no special tokens
ids = tokenizer.encode(text)       # integer IDs, with the end-of-sequence token appended

print("Tokens:", tokens)
print("IDs:", ids)

In [None]:
from transformers import AutoTokenizer

In [None]:
help(AutoTokenizer.from_pretrained)

In [None]:
from transformers import TOKENIZER_MAPPING

list(TOKENIZER_MAPPING.keys())

In [None]:
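# transformers loads its classes lazily, so transformers.__dict__ only holds
# the names that have already been accessed. dir() lists every public name
# without importing the classes themselves.
tokenizer_classes = [name for name in dir(transformers) if "Tokenizer" in name]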

tokenizer_classes

Next, let's load a pre-trained tokenizer, here from the 'gpt2' checkpoint. The commented-out line loads the 'bert-base-uncased' tokenizer instead.

In [None]:
# tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# print(tokenizer)

Now, let's use this tokenizer to encode a sample sentence.

In [None]:
sentence = "Hello, how are you today?"
# return_tensors='pt' returns PyTorch tensors, so PyTorch must be installed
encoded_input = tokenizer(sentence, return_tensors='pt')

print("Encoded input:", encoded_input)
print("Input IDs:", encoded_input['input_ids'])
print("Attention Mask:", encoded_input['attention_mask'])

In [None]:
sentence = "Hello"
encoded_input = tokenizer(sentence, return_tensors='pt')

print("Encoded input:", encoded_input)
print("Input IDs:", encoded_input['input_ids'])
print("Attention Mask:", encoded_input['attention_mask'])

We can also decode the input IDs back to text.

In [None]:
decoded_output = tokenizer.decode(encoded_input['input_ids'][0])
print("Decoded output:", decoded_output)

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Unbelievable performance at JadonTechLabs!"

tokens = tokenizer.tokenize(text)  # subword strings, without special tokens
ids = tokenizer.encode(text)       # integer IDs, with [CLS] and [SEP] added

print("Tokens:", tokens)
print("Token IDs:", ids)