In [1]:
# !pip install transformers

--------------------------------------
## BPE 
----------------------------------

#### GPT2 tokenizer

In [1]:
from transformers import GPT2Tokenizer

In [2]:
# Load the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")



In [3]:
# Define the text you want to tokenize
text = "Hello, how are you? I am learning about GPT models."

In [4]:
# Tokenize the text
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

Tokens: ['Hello', ',', 'Ġhow', 'Ġare', 'Ġyou', '?', 'ĠI', 'Ġam', 'Ġlearning', 'Ġabout', 'ĠG', 'PT', 'Ġmodels', '.']


In [20]:
# Convert tokens to their corresponding input IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Input IDs:", input_ids)

Input IDs: [15496, 11, 703, 389, 345, 30, 314, 716, 4673, 546, 402, 11571, 4981, 13]


In [21]:
# Convert back to the original text (detokenization)
decoded_text = tokenizer.decode(input_ids)
print("Decoded text:", decoded_text)

Decoded text: Hello, how are you? I am learning about GPT models.


**Explanation of why tokens look like this:**

`['Hello', ',', 'Ġhow', 'Ġare', 'Ġyou', '?', 'ĠI', 'Ġam', 'Ġlearning', 'Ġabout', 'ĠG', 'PT', 'Ġmodels', '.']`

1. 'Ġ' represents a space:
   - The 'Ġ' symbol before tokens like 'Ġhow', 'Ġare', and 'Ġyou' marks a space that precedes the word in the original text.
   - GPT-2 uses byte-level Byte Pair Encoding (BPE): the text is converted to UTF-8 bytes, and each byte is displayed as a printable character. The space byte is displayed as 'Ġ'.
   - The space is part of the token itself, so " how" and "how" are different tokens with different IDs.

2. Subword tokenization:
   - BPE builds its vocabulary by repeatedly merging the most frequent adjacent symbol pairs in the training data, so frequent words end up as single tokens and rare words are split into several pieces.
   - For example, " GPT" is tokenized into ['ĠG', 'PT'] because it was rare in the training data, so no single token was learned for it.

3. Efficient vocabulary size:
   - BPE keeps the vocabulary size manageable (50,257 tokens for GPT-2) by encoding words as combinations of subwords.
   - Because the base symbols are bytes, any unseen word can still be decomposed into known tokens, and no "unknown" token is needed.

4. Example breakdown:
   - 'Hello': The entire word is one token because it is frequent in the training data. It has no 'Ġ' because it starts the text and no space precedes it.
   - 'Ġhow', 'Ġare', 'Ġyou': These tokens carry the space marker 'Ġ', indicating that each follows a space.
   - 'ĠG', 'PT': The word "GPT" is split into two subwords since it is relatively rare in the training data.
   - 'Ġmodels': The token has 'Ġ' to indicate it follows a space.
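Where the 'Ġ' comes from can be reproduced in a few lines. The following is a minimal sketch modeled on the byte-to-character table used by GPT-2's byte-level BPE; the function name `build_byte_table` is illustrative and not part of `transformers`. Bytes that are already printable keep their own character, and the remaining bytes (including the space, byte 32) are shifted to code points starting at 256.

```python
def build_byte_table():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that already have a visible, non-whitespace character keep it.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (control characters, space, ...) are moved to
            # code points 256, 257, ... in increasing byte order.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


byte_to_char = build_byte_table()

# The space byte (32) is the 33rd shifted byte, so it lands on U+0120.
print(byte_to_char[ord(" ")])  # Ġ
print("".join(byte_to_char[b] for b in " how".encode("utf-8")))  # Ġhow
```

The same table explains other markers that show up in GPT-2 tokens, such as 'Ċ' for a newline (byte 10).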


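#### GPT-3.5 Tokenizer (Using tiktoken)
OpenAI's tiktoken library provides fast tokenizers for OpenAI models. The model name `gpt-3.5-turbo` used below resolves to the `cl100k_base` encoding, which GPT-4 also uses. Like GPT-2's tokenizer it is byte-level BPE, but with a larger vocabulary (about 100k tokens versus 50,257).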

In [5]:
import tiktoken

In [6]:
# Load the encoding used by gpt-3.5-turbo (resolves to cl100k_base)
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

In [7]:
# Tokenize the input text
tokens_gpt3 = encoding.encode(text)

In [8]:
# Convert token IDs back to their corresponding token words
token_words_gpt3 = [encoding.decode([token]) for token in tokens_gpt3]

print("GPT-3 Tokens (IDs):", tokens_gpt3)
print("GPT-3 Token Words:", token_words_gpt3)


GPT-3 Tokens (IDs): [9906, 11, 1268, 527, 499, 30, 358, 1097, 6975, 922, 480, 2898, 4211, 13]
GPT-3 Token Words: ['Hello', ',', ' how', ' are', ' you', '?', ' I', ' am', ' learning', ' about', ' G', 'PT', ' models', '.']


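**In this example both tokenizers produce 14 tokens. The larger `cl100k_base` vocabulary tends to yield fewer tokens on longer or more varied input, such as code or non-English text.**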

#### RoBERTa

In [9]:
from transformers import RobertaTokenizer

# Load the pre-trained RoBERTa tokenizer
tokenizer = RobertaTokenizer.from_pretrained(pretrained_model_name_or_path = "roberta-base",
                                             cache_dir = r'D:\AI-DATASETS\07-Hugging-Face-Data')

In [10]:
# Tokenize input text (returns token IDs)
tokens = tokenizer.encode(text, add_special_tokens=True)

# Convert token IDs back to corresponding token words
token_words = [tokenizer.decode([token]) for token in tokens]

# Print the token IDs and corresponding token words
print("Token IDs:", tokens)
print("Token Words:", token_words)

Token IDs: [0, 31414, 6, 141, 32, 47, 116, 38, 524, 2239, 59, 272, 10311, 3092, 4, 2]
Token Words: ['<s>', 'Hello', ',', ' how', ' are', ' you', '?', ' I', ' am', ' learning', ' about', ' G', 'PT', ' models', '.', '</s>']


------------------------------------------
## WordPiece
-----------------------------------------

#### Python Code for BERT Tokenizer (WordPiece)

In [11]:
from transformers import BertTokenizer

In [12]:
# Load the pre-trained BERT tokenizer (uses WordPiece)
tokenizer = BertTokenizer.from_pretrained(pretrained_model_name_or_path = "bert-base-uncased",
                                          cache_dir = r'D:\AI-DATASETS\07-Hugging-Face-Data')

In [13]:
# Tokenize input text (returns token IDs)
tokens = tokenizer.encode(text, add_special_tokens=True)

In [14]:
# Convert token IDs back to corresponding token words
token_words = [tokenizer.decode([token]) for token in tokens]

# Print the token IDs and corresponding token words
print("Token IDs:", tokens)
print("Token Words:", token_words)

Token IDs: [101, 7592, 1010, 2129, 2024, 2017, 1029, 1045, 2572, 4083, 2055, 14246, 2102, 4275, 1012, 102]
Token Words: ['[CLS]', 'hello', ',', 'how', 'are', 'you', '?', 'i', 'am', 'learning', 'about', 'gp', '##t', 'models', '.', '[SEP]']


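`encode(text, add_special_tokens=True)`:
- Encodes the input text into token IDs and adds the special tokens [CLS] at the beginning and [SEP] at the end of the sequence.
- The `uncased` model lowercases the text first, which is why 'Hello' becomes 'hello'.
- The `##` prefix (as in '##t') marks a piece that continues the previous token rather than starting a new word.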
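What `add_special_tokens=True` does to the ID sequence can be sketched in plain Python. This is only an illustration, not the library's implementation: the helper names are hypothetical, and the IDs 101 and 102 are taken from the BERT output above.

```python
# IDs of [CLS] and [SEP] as they appear in the bert-base-uncased output above.
CLS_ID, SEP_ID = 101, 102


def add_special_tokens(ids):
    """Wrap a sequence of token IDs as [CLS] ... [SEP] (hypothetical helper)."""
    return [CLS_ID] + list(ids) + [SEP_ID]


def strip_special_tokens(ids):
    """Remove the [CLS] and [SEP] IDs again (hypothetical helper)."""
    return [i for i in ids if i not in (CLS_ID, SEP_ID)]


body = [7592, 1010, 2129]  # 'hello', ',', 'how' from the output above
wrapped = add_special_tokens(body)
print(wrapped)                        # [101, 7592, 1010, 2129, 102]
print(strip_special_tokens(wrapped))  # [7592, 1010, 2129]
```

With the real tokenizer, `encode(text, add_special_tokens=False)` skips the wrapping, and `decode(ids, skip_special_tokens=True)` drops the special tokens from the decoded string.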

#### DistilBERT

In [15]:
from transformers import DistilBertTokenizer

# Load the pre-trained DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased",
                                                cache_dir = r'D:\AI-DATASETS\07-Hugging-Face-Data')

In [16]:
# Tokenize input text (returns token IDs)
tokens = tokenizer.encode(text, add_special_tokens=True)

# Convert token IDs back to corresponding token words
token_words = [tokenizer.decode([token]) for token in tokens]

# Print the token IDs and corresponding token words
print("Token IDs:", tokens)
print("Token Words:", token_words)

Token IDs: [101, 7592, 1010, 2129, 2024, 2017, 1029, 1045, 2572, 4083, 2055, 14246, 2102, 4275, 1012, 102]
Token Words: ['[CLS]', 'hello', ',', 'how', 'are', 'you', '?', 'i', 'am', 'learning', 'about', 'gp', '##t', 'models', '.', '[SEP]']


----------------------------
## SentencePiece 
---------------------------
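- SentencePiece is a tokenization library used by models like T5, ALBERT, and XLM-R. It implements two subword algorithms: BPE and the unigram language model.
- It differs from the BPE and WordPiece pipelines above in that it works on the raw text stream without pre-tokenization (such as splitting on spaces). Whitespace is treated as an ordinary symbol and encoded as '▁'.
- Because it does not rely on spaces to find word boundaries, it also works for languages written without spaces, and its subword units handle both rare and common tokens.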
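The whitespace handling described above can be sketched without the library. This is a simplified illustration only: real SentencePiece also normalizes the text and then segments it with a trained model, and the function names here are hypothetical. It assumes the default behaviour of adding a leading '▁' to the input.

```python
WS = "\u2581"  # '▁', the marker SentencePiece uses for whitespace


def escape_whitespace(text):
    """Turn spaces into '▁' and add a leading marker (hypothetical helper)."""
    return WS + text.replace(" ", WS)


def pieces_to_text(pieces):
    """Invert the escaping: join the pieces and turn '▁' back into spaces."""
    text = "".join(pieces).replace(WS, " ")
    # Drop the single leading space that came from the added prefix marker.
    return text[1:] if text.startswith(" ") else text


print(escape_whitespace("how are you"))             # ▁how▁are▁you
print(pieces_to_text(["▁how", "▁are", "▁you"]))     # how are you
print(pieces_to_text(["▁un", "hap", "pi", "ness"]))  # unhappiness
```

Because the marker is part of the pieces, decoding is a plain string join followed by a replace, with no language-specific rules about where spaces belong.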

#### T5

In [17]:
from transformers import T5Tokenizer

# Load the pre-trained T5 tokenizer (uses SentencePiece)
tokenizer = T5Tokenizer.from_pretrained("t5-small",
                                        cache_dir = r'D:\AI-DATASETS\07-Hugging-Face-Data')

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [18]:
# Tokenize input text (returns token IDs)
tokens = tokenizer.encode(text, add_special_tokens=True)

# Convert token IDs back to corresponding token words
token_words = [tokenizer.decode([token]) for token in tokens]

# Print the token IDs and corresponding token words
print("Token IDs:", tokens)
print("Token Words:", token_words)

Token IDs: [8774, 6, 149, 33, 25, 58, 27, 183, 1036, 81, 350, 6383, 2250, 5, 1]
Token Words: ['Hello', ',', 'how', 'are', 'you', '?', 'I', 'am', 'learning', 'about', 'G', 'PT', 'models', '.', '</s>']


#### ALBERT - with [CLS] and [SEP] Tokens

In [39]:
from transformers import AlbertTokenizer

# Load the pre-trained ALBERT tokenizer (SentencePiece-based)
tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2",
                                           cache_dir = r'D:\AI-DATASETS\07-Hugging-Face-Data')

In [40]:
# Tokenize input text (returns token IDs)
tokens = tokenizer.encode(text, add_special_tokens=True)

# Convert token IDs back to corresponding token words
token_words = [tokenizer.decode([token]) for token in tokens]

# Print the token IDs and corresponding token words
print("Token IDs:", tokens)
print("Token Words:", token_words)

Token IDs: [2, 10975, 15, 184, 50, 42, 60, 31, 589, 2477, 88, 10538, 38, 2761, 9, 3]
Token Words: ['[CLS]', 'hello', ',', 'how', 'are', 'you', '?', 'i', 'am', 'learning', 'about', 'gp', 't', 'models', '.', '[SEP]']


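Another example: tokenizing words of increasing complexity with the ALBERT tokenizer loaded above. `tokenize()` returns the raw pieces, so the '▁' word-boundary marker is visible.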

In [42]:
# List of words categorized by complexity
words = [
    "cat",                      # Simple
    "happy",                    # Simple
    "unhappiness",              # Medium complexity
    "international",            # Medium complexity
    "computerization",          # Medium complexity
    "antidisestablishmentarianism",  # Complex
    "electroencephalographically",    # Complex
    "pseudopseudohypoparathyroidism"  # Complex
]

In [43]:
# Tokenize and print the results
for word in words:
    tokens = tokenizer.tokenize(word)
    print(f"Word: {word}")
    print(f"Tokens: {tokens}")
    print("-" * 40)

Word: cat
Tokens: ['▁cat']
----------------------------------------
Word: happy
Tokens: ['▁happy']
----------------------------------------
Word: unhappiness
Tokens: ['▁un', 'hap', 'pi', 'ness']
----------------------------------------
Word: international
Tokens: ['▁international']
----------------------------------------
Word: computerization
Tokens: ['▁computer', 'ization']
----------------------------------------
Word: antidisestablishmentarianism
Tokens: ['▁anti', 'dis', 'establishment', 'arian', 'ism']
----------------------------------------
Word: electroencephalographically
Tokens: ['▁electro', 'en', 'cephal', 'ographic', 'ally']
----------------------------------------
Word: pseudopseudohypoparathyroidism
Tokens: ['▁pseudo', 'ps', 'e', 'udo', 'hy', 'po', 'para', 'thy', 'roid', 'ism']
----------------------------------------


#### How Tokenization Works

1. **Training Phase**:
   - **Data Collection**: The tokenizer is trained on a large corpus of text data. This corpus can include books, articles, and any relevant textual data.
   - **Frequency Analysis**: During training, the tokenizer counts how often character sequences occur in the corpus and uses these statistics to decide which subword units to keep.
   - **Subword Units Creation**:
     - **WordPiece**: It starts with individual characters and repeatedly merges pairs into larger units. Unlike BPE, which merges the most frequent pair, WordPiece picks the merge that most increases the likelihood of the training data. Merging continues until a predefined vocabulary size is reached.
     - **SentencePiece**: It is a library rather than a single algorithm. It trains either a BPE model or a unigram language model directly on raw text, treating whitespace as a regular symbol ("▁") instead of a boundary, so no language-specific pre-tokenization is needed.

2. **Vocabulary Creation**: After training, the tokenizer has a fixed vocabulary that contains:
   - Whole words (e.g., "cat", "happy")
   - Subwords (e.g., "un", "ness", "anti", "dis", "ism")
   - Special tokens (e.g., `[CLS]`, `[SEP]`)

3. **Tokenization During Inference**:
   - When you pass a text string to the tokenizer (e.g., `"Antidisestablishmentarianism"`), it uses the vocabulary created during the training phase to segment the input into tokens.
   - How the segmentation is found depends on the algorithm. BPE replays its learned merges, and a unigram model picks the most probable segmentation. WordPiece matches substrings of the input against its vocabulary:
     - It starts with the longest substring that matches a token in its vocabulary.
     - If the entire word isn’t found, it breaks the word into smaller parts, looking for known subwords.
   - This process continues until the entire string is covered by tokens that correspond to IDs in the vocabulary.
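The training phase in step 1 can be made concrete with a toy BPE-style merge loop. This is a minimal sketch under simplifying assumptions: the corpus is a hypothetical dictionary of word counts, words are split into characters, and the most frequent adjacent pair is merged at each step. WordPiece and unigram models score candidates differently, and real tokenizers add pre-tokenization and byte handling on top.

```python
from collections import Counter


def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus_counts, num_merges):
    """Learn up to `num_merges` merges from a {word: count} dictionary."""
    word_freqs = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: the first pair seen wins
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs


# Hypothetical toy corpus: word -> number of occurrences.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = train_bpe(toy_corpus, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
for symbols, freq in segmented.items():
    print(freq, symbols)
```

Each learned merge adds one entry to the vocabulary, so the number of merges directly controls the vocabulary size.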

#### Example Process
Let’s illustrate this process step-by-step using the example of the word **"unhappiness"**.

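1. **WordPiece Example** (illustrative, assuming a vocabulary that contains `un` and `##happiness`; a real vocabulary may split the word differently):
   - **Input**: `"unhappiness"`
   - **Tokenization Steps**:
     - Check if the entire word is in the vocabulary: **Not found**.
     - Check for subwords:
       - Match **"un"** (found in vocabulary).
       - Match **"##happiness"** (found in vocabulary; `##` marks a piece that continues a word).
   - **Output Tokens**: `["un", "##happiness"]`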

2. **SentencePiece Example**:
   - **Input**: `"unhappiness"`
   - **Tokenization Steps**:
     - SentencePiece does not pre-split on delimiters. It segments the raw string, with whitespace encoded as "▁".
     - The segmentation depends on the trained model. The ALBERT tokenizer above produced:
       - `["▁un", "hap", "pi", "ness"]` (the "▁" marks the start of a word).
   - **Output Tokens**: `["▁un", "hap", "pi", "ness"]`
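The WordPiece steps in this example follow a greedy longest-match-first strategy, which can be sketched as below. This is a simplified illustration, not the `transformers` implementation: the function name and the toy vocabulary are hypothetical, and continuation pieces carry the `##` prefix that appears in the BERT output earlier ('gp', '##t').

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


# Hypothetical toy vocabulary, chosen only to illustrate the matching.
toy_vocab = {"un", "##happiness", "##happy", "##ness", "cat"}

print(wordpiece_tokenize("unhappiness", toy_vocab))  # ['un', '##happiness']
print(wordpiece_tokenize("unhappy", toy_vocab))      # ['un', '##happy']
print(wordpiece_tokenize("dog", toy_vocab))          # ['[UNK]']
```

Greedy matching is fast and deterministic, but it never backtracks, so a word is mapped to the unknown token as soon as no vocabulary piece fits at some position.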