Tokenization is a systematic pipeline for transforming text into tokens, so it helps to be clear about what a token represents in this context. Researchers have used three distinct encoding approaches.

**Character Level**: Treat each character in a text as one token.

**Word Level**: Encode each word in the corpus as one token.

**Subword Level**: Break a word into smaller chunks when possible. For example, the word “basketball” can be encoded as a combination of two tokens: “basket” + “ball”.
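To make the three levels concrete, here is a minimal sketch in plain Python. The subword vocabulary (`toy_vocab`) and the greedy longest-match helper (`split_into_subwords`) are hypothetical, chosen only for illustration; real tokenizers learn their vocabulary from data and apply their own splitting rules.

```python
# Three ways to split the same text. The subword vocabulary below is a
# hypothetical toy set, not taken from any real tokenizer.
text = "basketball player"

# Character level: every character (including the space) is a token.
char_tokens = list(text)

# Word level: split on whitespace, one token per word.
word_tokens = text.split()

# Subword level: greedy longest-match against the toy vocabulary.
toy_vocab = {"basket", "ball", "play", "er"}

def split_into_subwords(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        # Try the longest candidate first, then shrink.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matches: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

subword_tokens = [p for w in word_tokens for p in split_into_subwords(w, toy_vocab)]

print(len(char_tokens), "character tokens")
print(word_tokens)
print(subword_tokens)
```

The same four subword entries cover both words here, whereas a word-level vocabulary would need one entry per distinct word.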

Subword-level encoding offers more flexibility and, compared with word-level encoding, reduces the number of unique tokens needed to represent a corpus. Tokens can be combined to represent new words, so not every new word has to be added to the dictionary. This has proved to be the most effective encoding for training neural networks and LLMs. Well-known models such as the GPT family and LLaMA use this tokenization method.

Several subword-level algorithms are used in practice, such as Byte Pair Encoding (BPE), WordPiece, and SentencePiece. Their token selection methods differ, but the fundamental process remains the same.

# Byte Pair Encoding (BPE)

BPE is an iterative process that extracts the most frequent words or subwords in a corpus. The algorithm starts by counting the occurrences of each character, then repeatedly merges the most frequent pair of adjacent symbols into a new, longer symbol. It is a greedy process: each step takes the most frequent pair at that moment rather than searching all possible combinations, and it stops once the vocabulary reaches the target size. The result is a compact set of words and subwords that covers the dataset with relatively few tokens.
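The merge loop can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used by any particular library: the function name `learn_bpe_merges` and the toy word frequencies are made up, and production tokenizers add steps such as byte-level handling and pre-tokenization.

```python
from collections import Counter

def learn_bpe_merges(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} dict."""
    # Start with every word split into single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Greedy step: pick the most frequent pair.
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)
        # Replace each occurrence of the pair with one merged symbol.
        merged_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best_pair:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            merged_corpus[key] = merged_corpus.get(key, 0) + count
        corpus = merged_corpus
    return merges, corpus

# Toy corpus with made-up word frequencies.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe_merges(toy_counts, num_merges=3)
print(merges)
print(list(segmented))
```

On this toy corpus the shared suffix "est" becomes a single symbol after two merges, because its character pairs are the most frequent in the data.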

The next step is to create the vocabulary for our model: a dictionary of the most frequently occurring tokens extracted by BPE (or another technique of your choosing) from the dataset. A dictionary (**dict** type) is a data structure that holds key-value pairs. In our scenario, each entry has a **key**, an index that begins from 0, and a corresponding **value**, a token.

Because neural networks only accept numerical inputs, we use the vocabulary as a lookup table that maps tokens to their corresponding IDs. The vocabulary must be saved so that the model's output can later be decoded from IDs back to words. This is known as a pre-trained vocabulary, an essential component that accompanies published pre-trained models. Without it, interpreting the model's output (the IDs) would be impossible. Vocabulary sizes vary by model: BERT uses roughly 30K tokens, while GPT-3 uses roughly 50K.
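The lookup-table idea can be sketched with two plain Python dictionaries, one for each direction. The token list and the `<unk>` placeholder for unknown pieces are assumptions made for this example; real vocabularies are far larger and handle unknown text in model-specific ways.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of tokens.
tokens = ["basket", "ball", "play", "er", "<unk>"]

# Token -> ID lookup used for encoding, with IDs starting at 0.
token_to_id = {token: idx for idx, token in enumerate(tokens)}
# ID -> token lookup used for decoding the model's output.
id_to_token = {idx: token for token, idx in token_to_id.items()}

def encode(pieces):
    # Pieces missing from the vocabulary map to the "<unk>" placeholder ID.
    return [token_to_id.get(p, token_to_id["<unk>"]) for p in pieces]

def decode(ids):
    return [id_to_token[i] for i in ids]

ids = encode(["basket", "ball"])
print(ids)          # [0, 1]
print(decode(ids))  # ['basket', 'ball']
```

Both directions must come from the same saved vocabulary; decoding IDs with a different table would produce unrelated tokens.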

# Tokenizers In Action

In [2]:
# Load the pre-trained tokenizer for the GPT-2 model from the Hugging Face Hub using the transformers package
!pip install -q transformers
from transformers import AutoTokenizer

# Download and load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

The code snippet above downloads the tokenizer and loads its dictionary, so you can use the tokenizer variable to encode and decode text.
But first, let's take a look at what the vocabulary contains.

In [3]:
print( tokenizer.vocab )



Each entry is a pair of a token and its ID. For example, the word “optional” is represented by the number 11902. Certain tokens are preceded by a special character, Ġ, which represents a space. The next code sample uses the tokenizer object to convert a sentence into tokens and IDs.

In [4]:
token_ids = tokenizer.encode("This is a sample text to test the tokenizer.")

print( "Tokens:   ", tokenizer.convert_ids_to_tokens( token_ids ) )
print( "Token IDs:", token_ids )

Tokens:    ['This', 'Ġis', 'Ġa', 'Ġsample', 'Ġtext', 'Ġto', 'Ġtest', 'Ġthe', 'Ġtoken', 'izer', '.']
Token IDs: [1212, 318, 257, 6291, 2420, 284, 1332, 262, 11241, 7509, 13]
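To see what the Ġ marker does, the short sketch below rebuilds the sentence from the tokens printed above using only string operations. It is a simplification that works for plain ASCII text like this sample; in practice you would call `tokenizer.decode(token_ids)` instead.

```python
# Tokens copied from the output above.
tokens = ['This', 'Ġis', 'Ġa', 'Ġsample', 'Ġtext', 'Ġto', 'Ġtest', 'Ġthe', 'Ġtoken', 'izer', '.']

# Concatenate the tokens and turn each Ġ marker back into a space.
reconstructed = "".join(tokens).replace("Ġ", " ")
print(reconstructed)

# Tokens that start a new word carry the marker; continuation pieces do not.
word_starts = [t for t in tokens if t.startswith("Ġ")]
print(len(word_starts), "of", len(tokens), "tokens start with a space")
```

Note that “tokenizer” is split into 'Ġtoken' and 'izer': only the first piece carries the marker, because the second continues the same word.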


# Tokenizer Shortcomings

**Uppercase/Lowercase Words**: The tokenizer treats the same word differently depending on its case. For example, the word “hello” maps to the single token ID 31373, while “HELLO” is represented by three tokens, [13909, 3069, 46], which translate to [“HE”, “LL”, “O”].


**Dealing with Numbers**: You might have heard that transformers are not naturally proficient at mathematical tasks. One reason is that the tokenizer represents numbers inconsistently, which leads to unpredictable variations. For instance, the number 200 might be represented as one token, while the number 201 is split into two tokens such as [20, 1].


**Trailing whitespace**: The tokenizer attaches whitespace to some tokens. For example, the word “last” can be represented by the single token “ last”, space included, rather than by [" ", "last"]. As a result, whether or not your prompt ends with a whitespace affects the probability of the predicted next word. In the sample output above, certain tokens begin with a special character (Ġ) representing whitespace, while others lack it.


**Model-specific**: Even though most language models use the BPE method for tokenization, each still trains a new tokenizer for its own models. GPT-4, LLaMA, OpenAssistant, and similar models all develop separate tokenizers.
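The case, number, and whitespace behaviors described above can be inspected with a small helper. This is a sketch: `compare_tokenizations` and `print_comparison` are hypothetical names, and the helper assumes only that the object passed in offers `encode` and `convert_ids_to_tokens`, as the GPT-2 tokenizer loaded earlier does.

```python
def compare_tokenizations(tokenizer, texts):
    """Return {text: (token_ids, tokens)} for each input string."""
    results = {}
    for text in texts:
        ids = tokenizer.encode(text)
        results[text] = (ids, tokenizer.convert_ids_to_tokens(ids))
    return results

def print_comparison(results):
    # One line per input: how many tokens it needed and what they are.
    for text, (ids, tokens) in results.items():
        print(f"{text!r:>10} -> {len(ids)} token(s): {tokens} {ids}")

# Usage with the GPT-2 tokenizer loaded earlier (requires transformers):
# results = compare_tokenizations(tokenizer, ["hello", "HELLO", "200", "201", "last", " last"])
# print_comparison(results)
```

Passing a tokenizer loaded for a different model to the same helper is a quick way to see how model-specific the splits are.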