##### What is this notebook about?
- This notebook shows different ways to tokenize text for NLP models.
- Tokenization concept:
    - Converts raw text to numbers for the model
    - Typical steps involve:
        - Normalization: Remove accents, punctuation, etc. if needed.
        - Pre-tokenization: Split the sentence into words on spaces, punctuation, etc.
        - Tokenizer model: Converts the words into tokens using one of several algorithms
            - These are the types of tokenizers:
                - Character level tokenizers
                - Word level tokenizers 
                - Subword tokenizers
        - Post-processing: Add extra tokens like [CLS], [SEP], etc. if needed.
- The following tokenizers are covered in this notebook:
    - Subword tokenizers
        - Byte Pair Encoding (byte-level variant used by OpenAI)
            - Implemented in the tiktoken library and in Hugging Face transformers
            - Used in GPT models, Llama 3
        - WordPiece tokenization (from Google)
            - BertTokenizer
        - Unigram tokenization (implemented in SentencePiece from Google)
            - XLNetTokenizer
            - Llama 2 also uses SentencePiece, but with its BPE model rather than Unigram
    - Word level tokenizers 
        - nltk
        - spaCy, torchtext's `get_tokenizer`
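
The four pipeline stages listed above can be sketched in plain Python. This is a toy illustration only: the helper names (`normalize`, `pre_tokenize`, `char_level`, `post_process`) and the sample sentence are assumptions made for this sketch and are not taken from any library. Real tokenizers apply far more elaborate rules at each stage.

```python
import re
import unicodedata

def normalize(text):
    """Normalization: strip accents and lowercase (one possible choice)."""
    decomposed = unicodedata.normalize("NFD", text)
    without_accents = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return without_accents.lower()

def pre_tokenize(text):
    """Pre-tokenization: split into words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def char_level(words):
    """Character-level model: every character becomes a token."""
    return [ch for word in words for ch in word]

def post_process(tokens):
    """Post-processing: wrap the sequence in BERT-style special tokens."""
    return ["[CLS]"] + tokens + ["[SEP]"]

sample = "Héllo, LLM!"  # hypothetical sample text
words = pre_tokenize(normalize(sample))

# Word-level model: each pre-tokenized word is already a token.
word_tokens = post_process(words)
char_tokens = char_level(words)

# Map tokens to integer IDs with a tiny vocabulary built from this one sample.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(word_tokens)))}
token_ids = [vocab[tok] for tok in word_tokens]

print(word_tokens)
print(char_tokens)
print(token_ids)
```

Subword tokenizers sit between these two extremes: frequent words stay whole, while rare words are split into smaller pieces.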


#### Note on special tokens
- Some tokenizers use special tokens to give the LLM additional context. Some of these special tokens are:
    - `[BOS]` (beginning of sequence) marks the beginning of a text
    - `[EOS]` (end of sequence) marks where a text ends (this is usually used when concatenating multiple unrelated texts, e.g., two different Wikipedia articles or two different books)
    - `[PAD]` (padding) is used when we train LLMs with a batch size greater than 1: a batch may contain texts of different lengths, so the shorter texts are padded to the length of the longest one
    - `[UNK]` represents words that are not included in the vocabulary
        - In word-level tokenizers, we can add a special token like `[UNK]` or `"<|unk|>"` to the vocabulary to represent unknown words that are not in the training data vocabulary. Subword tokenizers need this token far less often, since they break down most words (including unknown words) into subword units; byte-level BPE, as used in GPT-2, does not need it at all.
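
As a minimal sketch of how `[BOS]`, `[EOS]`, and `[UNK]` behave in a word-level tokenizer, consider the toy vocabulary below. The words, the IDs, and the `encode` helper are all made up for illustration and do not come from any real tokenizer.

```python
# Toy word-level vocabulary; words and IDs are hypothetical.
vocab = {
    "[PAD]": 0, "[UNK]": 1, "[BOS]": 2, "[EOS]": 3,
    "hello": 4, "world": 5, "tokenizers": 6,
}

def encode(text, vocab):
    """Wrap the text in [BOS]/[EOS] and map out-of-vocabulary words to [UNK]."""
    words = text.lower().split()
    ids = [vocab.get(word, vocab["[UNK]"]) for word in words]
    return [vocab["[BOS]"]] + ids + [vocab["[EOS]"]]

known = encode("hello world", vocab)
unknown = encode("hello strange new tokenizers", vocab)

print(known)    # every word is in the vocabulary
print(unknown)  # "strange" and "new" both collapse to the [UNK] ID
```

Note that the two unknown words become indistinguishable after encoding, which is the information loss that subword tokenizers are designed to avoid.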

- GPT-2 does not need any of the tokens mentioned above; to reduce complexity, it uses only an `<|endoftext|>` token. The `"<|endoftext|>"` token is similar to the `[EOS]` token:
    - It is used to denote the end of a text
    - It is also used between concatenated texts, e.g., if our training dataset consists of multiple articles, books, etc.
    - It is also used for padding (since we typically use a mask when training on batched inputs, we do not attend to padded tokens anyway, so it does not matter what these tokens are)
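
The two roles of `<|endoftext|>` described above (separator and padding) can be sketched without any library. The ID 50256 is the GPT-2 end-of-text ID; the two short ID lists are placeholders standing in for two tokenized documents, and `pad_with_mask` is a hypothetical helper written for this sketch.

```python
EOT_ID = 50256  # ID of "<|endoftext|>" in the GPT-2 vocabulary

doc_a = [15496, 11, 27140, 44, 13]  # placeholder token IDs for a first document
doc_b = [38451, 356, 11241, 1096]   # placeholder token IDs for a second document

# Role 1: separator between unrelated documents in one training stream.
stream = doc_a + [EOT_ID] + doc_b + [EOT_ID]

def pad_with_mask(batch, pad_id):
    """Right-pad every sequence to the longest one and build a 0/1 mask."""
    max_len = max(len(seq) for seq in batch)
    padded, mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padded positions
    return padded, mask

# Role 2: padding token; masked positions are ignored during training.
padded, mask = pad_with_mask([doc_a, doc_b], EOT_ID)

print(stream)
print(padded)
print(mask)
```

Because the mask zeroes out the padded positions, any ID could be used for padding; reusing the end-of-text ID simply avoids adding another special token.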

### Text samples to test tokenizers

In [None]:
# Using a sample text file
with open("../data/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


In [2]:
text = "Hello, LLM. Shall we tokenize a sample sentence?"
print("Total number of characters:", len(text))

Total number of characters: 48


### BPE (from OpenAI) - tiktoken library 
- Used by GPT-2, Llama 3, etc.
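
To give an intuition for what BPE learns, here is a toy sketch of the merge-learning loop: count adjacent symbol pairs, merge the most frequent pair, and repeat. It is the classic character-level formulation on a made-up word-frequency table, not the byte-level implementation used by GPT-2 or tiktoken. The helper names and the corpus are assumptions for illustration.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs; corpus maps a tuple of symbols to a frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical word frequencies; each word starts as a tuple of characters.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
corpus = {tuple(word): freq for word, freq in word_freqs.items()}

merges = []
for _ in range(3):  # learn three merge rules
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)  # ties go to the first pair encountered
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)
print(corpus)
```

Each learned merge becomes a new vocabulary entry, so frequent character sequences such as `est` end up as single tokens while rare words remain split into smaller pieces.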

In [3]:
import tiktoken

In [4]:
for model in ["gpt2", "gpt-4", "gpt-4o"]:
    print(tiktoken.encoding_for_model(model))

<Encoding 'gpt2'>
<Encoding 'cl100k_base'>
<Encoding 'o200k_base'>


In [5]:
tt_bpe_tokenizer = tiktoken.get_encoding("gpt2")

token_ids = tt_bpe_tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(token_ids)

text_out = tt_bpe_tokenizer.decode(token_ids)
print(text_out)

print(tt_bpe_tokenizer.n_vocab)
print(tt_bpe_tokenizer.eot_token)



[15496, 11, 27140, 44, 13, 38451, 356, 11241, 1096, 257, 6291, 6827, 30]
Hello, LLM. Shall we tokenize a sample sentence?
50257
50256


### BPE (from OpenAI) - Hugging Face API

In [6]:
from transformers import GPT2Tokenizer, GPT2TokenizerFast


In [7]:
hf_bpe_tokenizer = GPT2Tokenizer.from_pretrained("gpt2") 

token_ids = hf_bpe_tokenizer(text)
print(token_ids)

text_out = hf_bpe_tokenizer.decode(token_ids['input_ids'])
print(text_out)

{'input_ids': [15496, 11, 27140, 44, 13, 38451, 356, 11241, 1096, 257, 6291, 6827, 30], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Hello, LLM. Shall we tokenize a sample sentence?


In [8]:
# Fast (Rust-backed) variant of the same tokenizer

hf_bpe_tokenizer_fast = GPT2TokenizerFast.from_pretrained("gpt2")

token_ids = hf_bpe_tokenizer_fast(text)
print(token_ids)

text_out = hf_bpe_tokenizer_fast.decode(token_ids['input_ids'])
print(text_out)

{'input_ids': [15496, 11, 27140, 44, 13, 38451, 356, 11241, 1096, 257, 6291, 6827, 30], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Hello, LLM. Shall we tokenize a sample sentence?


### Performance comparison

In [9]:
%timeit tt_bpe_tokenizer.encode(raw_text)

1.35 ms ± 4.65 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [10]:
%timeit hf_bpe_tokenizer(raw_text)

Token indices sequence length is longer than the specified maximum sequence length for this model (5145 > 1024). Running this sequence through the model will result in indexing errors


20.1 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [11]:
%timeit hf_bpe_tokenizer_fast(raw_text)

Token indices sequence length is longer than the specified maximum sequence length for this model (5145 > 1024). Running this sequence through the model will result in indexing errors


6.34 ms ± 3.92 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
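
The warning in the outputs above appears because the full text yields 5145 token IDs, more than the 1024 positions the GPT-2 model accepts. Tokenizing is still fine, but before feeding the IDs to the model they would need to be split. A minimal sketch of one simple option (non-overlapping chunks) follows; `chunk_ids` is a hypothetical helper, and the ID list is a stand-in with the same length as in the warning.

```python
def chunk_ids(token_ids, max_len):
    """Split a long list of token IDs into consecutive chunks of at most max_len."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

token_ids = list(range(5145))  # stand-in for the 5145 IDs reported in the warning
chunks = chunk_ids(token_ids, max_len=1024)

print(len(chunks))                      # number of chunks
print([len(chunk) for chunk in chunks]) # length of each chunk
```

Other strategies, such as overlapping sliding windows, trade extra computation for more context at chunk boundaries.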


### Word tokenizer

### References:

> https://github.com/rasbt/LLMs-from-scratch

> https://www.coursera.org/specializations/generative-ai-engineering-with-llms
