# Unit 3

## Comparing BPE, WordPiece, and SentencePiece in NLP

## Introduction to Tokenization Techniques

Welcome to this lesson on comparing tokenization techniques used in modern **Natural Language Processing (NLP)** models. **Tokenization** is a crucial step in NLP that involves breaking down text into smaller units called tokens. This process is essential for AI and **Large Language Models (LLMs)** to understand and process text data effectively. In previous lessons, we explored rule-based tokenization and **Byte-Pair Encoding (BPE)**. Today, we will build on that knowledge by comparing BPE with two other popular tokenization techniques: **WordPiece** and **SentencePiece**.

-----

## Quick Recap: Byte Pair Encoding (BPE)

Before diving into WordPiece and SentencePiece, let's briefly recall Byte Pair Encoding (BPE). BPE is a subword tokenization technique that reduces vocabulary size and handles rare words by encoding text into subword units. It merges the most frequent pairs of characters or subwords iteratively to form a compact vocabulary. This technique is particularly useful for languages with rich morphology and has been widely adopted in NLP tasks.
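To make the merge loop concrete, here is a minimal sketch of BPE training on a toy corpus. The word counts and the helper names (`count_pairs`, `merge_pair`, `learn_bpe`) are made up for illustration and are not part of any library; production tokenizers add details such as byte-level handling and end-of-word markers.

```python
from collections import Counter

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Start from characters and repeatedly merge the most frequent pair."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # on ties, the first pair seen wins
        merges.append(best)
        corpus = merge_pair(corpus, best)
    return merges, corpus

# Hypothetical word counts, chosen only to illustrate the procedure
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = learn_bpe(toy_counts, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

After three merges, the frequent suffix `est` has become a single symbol, which is how BPE ends up with reusable subword units.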

-----

## Understanding WordPiece Tokenization

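**WordPiece** is a subword tokenization algorithm closely related to BPE and is used in models like **BERT**. Like BPE, it builds a vocabulary by iteratively merging symbol pairs, but instead of picking the most frequent pair, it picks the merge that most increases the likelihood of the training data under a simple language model. At inference time, it splits each word greedily, taking the longest vocabulary entry that matches at each position. The resulting subwords often line up with prefixes, roots, and suffixes, although the algorithm is purely statistical and has no explicit notion of semantics.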

### Example of WordPiece Tokenization:

Consider the word "unbelievable". Depending on its vocabulary, WordPiece might break it down into subwords like `un`, `##believ`, and `##able`. The `##` prefix indicates that the subword continues the previous token rather than starting a new word. Splits like this expose recurring components of the word, such as the prefix "un-" and the root "believe", to the model.
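The split above can be reproduced with a short sketch of greedy longest-match-first lookup, the strategy BERT's WordPiece tokenizer uses at inference time. The function name `wordpiece_tokenize` and the tiny vocabulary are hypothetical; a real BERT vocabulary has about 30,000 entries and may well contain "unbelievable" as a single token.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word by repeatedly taking the longest matching vocabulary entry."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # word-internal pieces carry the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"un", "##believ", "##able", "##a", "##ble"}
print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Because the lookup is greedy, `##able` wins over the shorter `##a` even though both are in the vocabulary.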

Let's explore how WordPiece tokenization works using the **transformers** library.

### Step 1: Importing the Necessary Library

First, we need to import the `AutoTokenizer` from the `transformers` library. This library provides pre-trained tokenizers for various models, including BERT.

```python
from transformers import AutoTokenizer
```

### Step 2: Loading the WordPiece Tokenizer

Next, we load the WordPiece tokenizer used in BERT. The `AutoTokenizer.from_pretrained()` method allows us to load a pre-trained tokenizer by specifying the model name.

```python
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")
```

### Step 3: Tokenizing a Sample Sentence

Now, let's tokenize a sample sentence using the WordPiece tokenizer. The `tokenize()` method breaks the sentence into subword units.

```python
sentence = "In 2024, the price of a ticket to New York’s Broadway show is $29.99, including tax & fees—an unbelievable deal compared to last year’s $45.50!"
bert_tokens = tokenizer_bert.tokenize(sentence)
print("WordPiece Tokenization (BERT):", bert_tokens)
```

**Output:**

```
WordPiece Tokenization (BERT): ['in', '202', '##4', ',', 'the', 'price', 'of', 'a', 'ticket', 'to', 'new', 'york', '’', 's', 'broadway', 'show', 'is', '$', '29', '.', '99', ',', 'including', 'tax', '&', 'fees', '—', 'an', 'unbelievable', 'deal', 'compared', 'to', 'last', 'year', '’', 's', '$', '45', '.', '50', '!']
```

In this example, most words stay whole, while rarer strings such as `2024` are split into subwords (`202`, `##4`). Because this is the uncased model, the text is also lowercased (`New York` becomes `new`, `york`), and punctuation and symbols become separate tokens. The sentence includes proper nouns, apostrophes, numbers, currency symbols, an em dash, and a comparison phrase, making it useful for testing different tokenization techniques.

-----

## Exploring SentencePiece Tokenization

**SentencePiece** is a versatile tokenization library used in models like **T5**. Unlike typical BPE and WordPiece pipelines, which first split text on whitespace, SentencePiece treats the input as a raw stream of Unicode characters in which whitespace is an ordinary symbol (escaped as `▁`, U+2581). This lets it handle any language without relying on language-specific preprocessing. It learns subword units with either a unigram language model or BPE, making it effective for multilingual tasks.

### Step 1: Importing the Necessary Library

To use SentencePiece, we need to import the `AutoTokenizer` from the `transformers` library.

```python
from transformers import AutoTokenizer
```

### Step 2: Loading the SentencePiece-based Tokenizer

We load the SentencePiece-based tokenizer used in T5. The `AutoTokenizer.from_pretrained()` method allows us to load a pre-trained tokenizer by specifying the model name.

```python
tokenizer_t5 = AutoTokenizer.from_pretrained("t5-small")
```

### Step 3: Tokenizing a Sample Sentence

Now, let's tokenize a sample sentence using the SentencePiece tokenizer. The `tokenize()` method breaks the sentence into subword units.

```python
sentence = "In 2024, the price of a ticket to New York’s Broadway show is $29.99, including tax & fees—an unbelievable deal compared to last year’s $45.50!"
t5_tokens = tokenizer_t5.tokenize(sentence)
print("SentencePiece Tokenization (T5):", t5_tokens)
```

**Output:**

```
SentencePiece Tokenization (T5): ['▁In', '▁2024', ',', '▁the', '▁price', '▁of', '▁a', '▁ticket', '▁to', '▁New', '▁York', '’', 's', '▁Broadway', '▁show', '▁is', '▁$', '29', '.', '99', ',', '▁including', '▁tax', '▁&', '▁fees', '—', '▁an', '▁unbelievable', '▁deal', '▁compared', '▁to', '▁last', '▁year', '’', 's', '▁$', '45', '.', '50', '!']
```

In this example, SentencePiece uses the special character `▁` (U+2581) to mark the start of a new word. Because whitespace is kept inside the tokens, the original text can be reconstructed by concatenating them. This approach is particularly useful for languages that do not separate words with spaces and for tasks requiring language-agnostic tokenization.
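The word-boundary marker is easy to experiment with in plain Python. The sketch below only mimics the whitespace convention (a leading marker plus one marker per space, which is SentencePiece's default behavior); it does not call the SentencePiece library, and the helper names `escape_whitespace` and `detokenize` are made up for this example.

```python
MARKER = "\u2581"  # the "▁" character (LOWER ONE EIGHTH BLOCK)

def escape_whitespace(text):
    """Mimic the convention: prepend a marker and replace each space with it."""
    return MARKER + text.replace(" ", MARKER)

def detokenize(pieces):
    """Concatenate pieces, turn markers back into spaces, drop the leading space."""
    text = "".join(pieces).replace(MARKER, " ")
    return text[1:] if text.startswith(" ") else text

# Pieces written in the same form as the T5 output above
pieces = ["▁New", "▁York", "’", "s", "▁Broadway", "▁show"]
print(escape_whitespace("New York"))  # ▁New▁York
print(detokenize(pieces))             # New York’s Broadway show
```

Note that detokenization needs no language-specific rules about where spaces belong, which is the practical benefit of storing whitespace in the tokens.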

-----

## Comparing Tokenization Techniques

Now that we've explored WordPiece and SentencePiece, let's summarize their key differences and similarities in a table:

| Feature | Byte Pair Encoding (BPE) | WordPiece | SentencePiece |
| :--- | :--- | :--- | :--- |
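| **Vocabulary Construction** | Iterative merging of the most frequent pairs | Iterative merging of the pairs that most increase training-data likelihood | Unigram language model or BPE |
| **Language Dependency** | Usually relies on pre-tokenization (e.g., whitespace splitting) | Usually relies on pre-tokenization (e.g., whitespace splitting) | Language-agnostic; operates on raw text |
| **Handling of Subwords** | Applies learned merges in order; byte-level variants (GPT-2) mark a preceding space with `Ġ` | Greedy longest match; `##` marks word-internal pieces | `▁` marks the start of a word |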
| **Model Examples** | GPT-2, RoBERTa | BERT | T5, ALBERT |

Each technique has its strengths and is suited for different NLP tasks. Choosing the right tokenization method depends on the specific requirements of your application.
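The "Vocabulary Construction" row can be illustrated with a small sketch that picks the first merge under each criterion. The WordPiece score used here, `count(pair) / (count(first) * count(second))`, is a commonly cited approximation of its likelihood-based selection and is an assumption of this example rather than something stated in the lesson. The corpus counts and function names are hypothetical, and the `##` prefixes are left out for brevity.

```python
from collections import Counter

def pair_and_symbol_counts(corpus):
    """Return frequency-weighted counts of adjacent pairs and of single symbols."""
    pair_counts, symbol_counts = Counter(), Counter()
    for symbols, freq in corpus.items():
        for symbol in symbols:
            symbol_counts[symbol] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return pair_counts, symbol_counts

def best_bpe_pair(corpus):
    """BPE criterion: the most frequent adjacent pair."""
    pair_counts, _ = pair_and_symbol_counts(corpus)
    return max(pair_counts, key=pair_counts.get)

def best_wordpiece_pair(corpus):
    """Assumed WordPiece criterion: pair count normalized by its parts' counts."""
    pair_counts, symbol_counts = pair_and_symbol_counts(corpus)

    def score(pair):
        a, b = pair
        return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])

    return max(pair_counts, key=score)

# Hypothetical word counts, already split into characters
toy_corpus = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}
print(best_bpe_pair(toy_corpus))        # ('u', 'g')
print(best_wordpiece_pair(toy_corpus))  # ('g', 's')
```

BPE favors `('u', 'g')` because it is simply the most common pair, while the normalized score favors `('g', 's')` because `s` almost never appears without a preceding `g` in this corpus.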

-----

## Summary and Preparation for Practice

In this lesson, we compared three popular tokenization techniques: BPE, WordPiece, and SentencePiece. We explored how each technique works and ran WordPiece and SentencePiece on a complex sentence that includes proper nouns, apostrophes, numbers, symbols, an em dash, and a comparison phrase. Understanding these techniques is crucial for effectively processing text data in NLP tasks.

As you move on to the practice exercises, you'll have the opportunity to apply what you've learned and reinforce your understanding of these tokenization methods. Keep up the great work, and remember that mastering tokenization is a key step in becoming proficient in NLP!

## WordPiece Tokenization Challenge

You've just learned about WordPiece tokenization and its application in models like BERT. Now, let's put that knowledge into practice. Your task is to use the BERT tokenizer from the transformers library to break down challenging words with prefixes, suffixes, and compound structures.

- Load the BERT tokenizer, which uses WordPiece tokenization.
- Tokenize words such as "unwanted", "multifunctional", and "hyperrealistic".
- Print the tokens.

This exercise will help you see how WordPiece handles morphological components using `##` markers. Dive in and explore the power of tokenization!

```python
from transformers import AutoTokenizer

# TODO: Load the WordPiece tokenizer from BERT

# Define a list of challenging words
words = ["unwanted", "multifunctional", "hyperrealistic"]

# TODO: Tokenize each word and print the tokens and their IDs
for word in words:
    # TODO: Tokenize the word

    # TODO: Print the word and tokens
    # print()
    pass  # placeholder so the empty loop body is valid Python; remove once the TODOs are done
```

### Completed Script

```python
from transformers import AutoTokenizer

# Load the WordPiece tokenizer from BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Define a list of challenging words
words = ["unwanted", "multifunctional", "hyperrealistic"]

# Tokenize each word and print the tokens and their IDs
for word in words:
    # Tokenize the word
    tokens = tokenizer.tokenize(word)
    
    # Get the IDs for the tokens
    ids = tokenizer.convert_tokens_to_ids(tokens)
    
    # Print the word and tokens
    print(f"Word: {word}")
    print(f"Tokens: {tokens}")
    print(f"Token IDs: {ids}")
    print("-" * 20)
```

**Example Output** (illustrative; the actual tokens and IDs depend on the tokenizer's vocabulary):

```
Word: unwanted
Tokens: ['un', '##wanted']
Token IDs: [2145, 12693]
--------------------
Word: multifunctional
Tokens: ['multi', '##functional']
Token IDs: [10065, 9662]
--------------------
Word: hyperrealistic
Tokens: ['hyper', '##realistic']
Token IDs: [19435, 11042]
--------------------
```

This is a video that explains how to train a BERT tokenizer on a specific domain of knowledge.

[Train a BERT Tokenizer on your (scientific) Domain Knowledge](https://www.youtube.com/watch?v=2RA5dEIC-Nw)

## Tokenization Techniques in Action

Well done on exploring WordPiece tokenization! Now, let's dive into SentencePiece with the T5 tokenizer. Your task is to use the same challenging words from Exercise 1 and observe how SentencePiece handles them.

- Load the T5 tokenizer, which uses SentencePiece.
- Tokenize the words `["unwanted", "multifunctional", "hyperrealistic"]` and print the tokens.
- Compare how SentencePiece and WordPiece handle prefixes and compound words differently.

This exercise will deepen your understanding of tokenization techniques. Let's see how these methods differ!

```python
from transformers import AutoTokenizer

# Words to tokenize
words = ["unwanted", "multifunctional", "hyperrealistic"]

# WordPiece (BERT)
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_tokens = [tokenizer_bert.tokenize(word) for word in words]
print("WordPiece Tokenization (BERT):", bert_tokens)

# TODO: Load the SentencePiece tokenizer for T5
# TODO: Tokenize the words using the SentencePiece tokenizer
```

### Completed Script

```python
from transformers import AutoTokenizer

# Words to tokenize
words = ["unwanted", "multifunctional", "hyperrealistic"]

# WordPiece (BERT)
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_tokens = [tokenizer_bert.tokenize(word) for word in words]
print("WordPiece Tokenization (BERT):", bert_tokens)

# Load the SentencePiece tokenizer for T5
tokenizer_t5 = AutoTokenizer.from_pretrained("t5-small")

# Tokenize the words using the SentencePiece tokenizer
t5_tokens = [tokenizer_t5.tokenize(word) for word in words]
print("SentencePiece Tokenization (T5):", t5_tokens)

```

**Comparison:**

WordPiece and SentencePiece handle these words differently due to their distinct approaches:

  - **WordPiece** often uses a `##` prefix to indicate a subword that is a continuation of the previous token. For example, "unwanted" becomes `['un', '##wanted']`. This approach maintains a connection to the full word and its morphological components.
  - **SentencePiece**, on the other hand, uses the special `▁` character (U+2581, which resembles an underscore) to denote the beginning of a new word. It might tokenize "unwanted" as `['▁un', 'wanted']`, or as the single token `'▁unwanted'` if that piece is in the vocabulary. Because word boundaries are encoded in the tokens themselves, SentencePiece can handle text without a traditional word-based pre-tokenization step, making it language-agnostic.
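To see that the two marker conventions carry the same information, the sketch below regroups subword tokens into whole words under each scheme. Both helper names are hypothetical, and the token lists are the illustrative splits discussed above rather than verified tokenizer output.

```python
MARKER = "\u2581"  # the "▁" word-start marker used by SentencePiece

def group_wordpiece(tokens):
    """Glue '##' continuation pieces onto the word that precedes them."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words

def group_sentencepiece(tokens, marker=MARKER):
    """Start a new word at each marker; unmarked pieces extend the current word."""
    words = []
    for tok in tokens:
        if tok.startswith(marker):
            words.append(tok[len(marker):])
        elif words:
            words[-1] += tok
        else:
            words.append(tok)
    return words

print(group_wordpiece(["un", "##wanted"]))     # ['unwanted']
print(group_sentencepiece(["▁un", "wanted"]))  # ['unwanted']
```

WordPiece marks the pieces that continue a word, while SentencePiece marks the pieces that start one; either way the word boundaries can be recovered.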

## Tokenization Techniques in Action

You've done well exploring WordPiece and SentencePiece tokenization. Now, let's expand your skills by implementing a script using the GPT-2 tokenizer (BPE) from the transformers library.

- Use the following sentence for tokenization: "In 2024, a Broadway ticket costs $29.99, including tax—way cheaper than last year’s $45.50!"
- Compare the tokenization results of BPE, WordPiece, and SentencePiece.
- Conduct a detailed morphological analysis.
- Explain which technique might be most appropriate for different NLP tasks.

This exercise will deepen your understanding of tokenization techniques and their applications. Dive in and see how each method handles text differently!

```python
from transformers import AutoTokenizer

text = "In 2024, a Broadway ticket costs $29.99, including tax—way cheaper than last year’s $45.50!"  

# TODO: Implement BPE (GPT-2) tokenization


# TODO: Implement WordPiece (BERT) tokenization

# TODO: Implement SentencePiece (T5) tokenization

```

### Completed Script

The completed script below runs all three tokenizers on the sentence. The morphological analysis and the discussion of suitable use cases follow it.

```python
from transformers import AutoTokenizer

text = "In 2024, a Broadway ticket costs $29.99, including tax—way cheaper than last year’s $45.50!"  

# Initialize tokenizers
# BPE (GPT-2)
tokenizer_gpt2 = AutoTokenizer.from_pretrained("gpt2")
# WordPiece (BERT)
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")
# SentencePiece (T5)
tokenizer_t5 = AutoTokenizer.from_pretrained("t5-small")


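# BPE (GPT-2) tokenization
# Note: encode() also adds each model's special tokens by default (e.g., [CLS]/[SEP]
# for BERT, </s> for T5), so an ID list can be longer than the matching token list.
tokens_bpe = tokenizer_gpt2.tokenize(text)
token_ids_bpe = tokenizer_gpt2.encode(text)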

# WordPiece (BERT) tokenization
tokens_wordpiece = tokenizer_bert.tokenize(text)
token_ids_wordpiece = tokenizer_bert.encode(text)

# SentencePiece (T5) tokenization
tokens_sentencepiece = tokenizer_t5.tokenize(text)
token_ids_sentencepiece = tokenizer_t5.encode(text)


# Print results
print("Original Text:", text)
print("\n--- BPE (GPT-2) Tokenization ---")
print("Tokens:", tokens_bpe)
print("Token IDs:", token_ids_bpe)
print("\n--- WordPiece (BERT) Tokenization ---")
print("Tokens:", tokens_wordpiece)
print("Token IDs:", token_ids_wordpiece)
print("\n--- SentencePiece (T5) Tokenization ---")
print("Tokens:", tokens_sentencepiece)
print("Token IDs:", token_ids_sentencepiece)

```

### Detailed Morphological Analysis and Comparison

Let's break down how each tokenizer handles the sentence: "In 2024, a Broadway ticket costs $29.99, including tax—way cheaper than last year’s $45.50!"

  * **BPE (GPT-2) Tokenization:**

      * **Handling of words:** BPE breaks rare words into common subword units, while frequent words usually survive as single tokens. GPT-2's byte-level BPE also attaches a preceding space to the token that follows it, displayed as a leading `Ġ`. Whether a proper noun such as `Broadway` stays whole or is split into pieces depends on the learned merges, so run the script to see the exact segmentation.
      * **Handling of punctuation and numbers:** Punctuation generally forms separate tokens, as with `,`, `$`, and `!`. Numbers are split into short digit groups, such as `29.99` becoming `['29', '.', '99']`.
      * **Strengths:** It is highly effective at managing out-of-vocabulary (OOV) words by breaking them down into known subwords. It maintains a balance between a small vocabulary size and a reasonable sequence length.

  * **WordPiece (BERT) Tokenization:**

      * **Handling of words:** WordPiece is similar to BPE but with a different merging strategy. It uses a `##` prefix to indicate a subword that continues a larger word. For example, in the earlier lesson output, `Broadway` is lowercased and kept as the single token `broadway`, `including` stays intact, and `2024` is split into `['202', '##4']`. This method tends to keep more common words intact.
      * **Handling of punctuation and numbers:** WordPiece typically tokenizes punctuation as separate characters. It handles numbers as individual tokens or subwords, such as `29.99` becoming `['29', '.', '99']`.
      * **Strengths:** WordPiece's use of the `##` prefix explicitly signals that a token is a subword, which can be beneficial for certain models. It excels at handling morphological variations and contractions while maintaining a fixed vocabulary.

  * **SentencePiece (T5) Tokenization:**

      * **Handling of words:** SentencePiece is language-agnostic: it operates directly on raw text, without first splitting the input on spaces. It uses the special character `▁` (U+2581) to represent a space. In the earlier lesson output, `Broadway` is the single token `▁Broadway` and `including` is `▁including`.
      * **Handling of punctuation and numbers:** All characters, including spaces, numbers, and punctuation, are treated as part of the tokenization process. `$` and `!` are split from the words. Amounts like `$29.99` are broken down into smaller units, such as `['▁$', '29', '.', '99']`.
      * **Strengths:** Since it doesn't rely on spaces for splitting, it is particularly effective for languages that don't use spaces (e.g., Chinese, Japanese) and is robust against inconsistent spacing. Its language-agnostic nature makes it highly versatile.
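One simple way to put a number on these differences is the average number of tokens per whitespace-separated word (sometimes called fertility). The sketch below is a minimal illustration: the helper name `tokens_per_word` is made up, and the two token lists are copied from the opening of the lesson's earlier BERT and T5 outputs, with the `▁` marker written out.

```python
def tokens_per_word(text, tokens):
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text must contain at least one word")
    return len(tokens) / len(words)

text = "In 2024,"
bert_tokens = ["in", "202", "##4", ","]  # WordPiece splits the year
t5_tokens = ["▁In", "▁2024", ","]        # SentencePiece keeps it whole

print(f"BERT: {tokens_per_word(text, bert_tokens):.2f}")  # BERT: 2.00
print(f"T5:   {tokens_per_word(text, t5_tokens):.2f}")    # T5:   1.50
```

A lower ratio means shorter sequences for the same text, which matters for models with a fixed context length. The same function can be applied to the `tokens_bpe`, `tokens_wordpiece`, and `tokens_sentencepiece` lists from the script above.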

### Which Technique is Best for Different NLP Tasks?

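The choice of tokenizer depends heavily on the specific NLP task and the language being used. In practice, it is also fixed by the pretrained model you choose, since a model must be used with the tokenizer it was trained with.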

  * **BPE (GPT-2) Tokenization:**

      * **Best for:** Generative language tasks like text generation, summarization, and translation, especially for languages with complex morphology. GPT-2's BPE method is designed to handle open-ended text and learn new words by combining known subwords, making it robust for creative tasks.
      * **Why:** It strikes an excellent balance between vocabulary size and sequence length. The subword approach effectively handles a large number of words with a relatively small vocabulary, which is crucial for models that need to generalize to new or unseen text.

  * **WordPiece (BERT) Tokenization:**

      * **Best for:** Discriminative tasks such as sentiment analysis, question answering, and named-entity recognition. BERT, which uses WordPiece, is optimized for understanding the context of words.
      * **Why:** The `##` subword token allows the model to differentiate between a whole word and a subword, which is useful for tasks that require a deep understanding of word meaning and structure. Its fixed vocabulary is well-suited for fine-tuning on specific domains.

  * **SentencePiece (T5) Tokenization:**

      * **Best for:** Multilingual tasks and models that need to handle a wide range of languages. T5 and other models that use SentencePiece are often designed for tasks like machine translation and cross-lingual understanding.
      * **Why:** Its language-agnostic nature and ability to process raw text without pre-segmentation make it incredibly flexible. It can handle languages without word boundaries and is resilient to inconsistencies in text formatting, making it a strong choice for universal models.

By understanding the strengths and weaknesses of each method, you can make an informed decision about which tokenizer is best suited for your specific NLP project. Keep up the great work!

## Tokenization Techniques for Special Texts

You've just explored the intricacies of WordPiece and SentencePiece tokenization. Now, let's dive deeper by comparing how three tokenization techniques handle special text types like URLs, code snippets, hashtags, email addresses, and numeric expressions.

- Tokenize the test texts using WordPiece/BERT, SentencePiece/T5, and BPE/GPT-2.
- Compare the differences in how each tokenizer segments these inputs.

This exercise will enhance your understanding of tokenization techniques in real-world scenarios. Dive in and see how each method performs!

```python
from transformers import AutoTokenizer

# Test set with special text types
test_texts = [
    "Visit https://www.example.com for more info.",
    "def my_function(): return 'Hello, World!'",
    "#Python is amazing!",
    "Contact us at support@example.com.",
    "The price is $123.45, not $67.89."
]

# BPE (GPT-2)
bpe_tokenizer = AutoTokenizer.from_pretrained("gpt2")
# TODO: Tokenize the test_texts using the BPE tokenizer

# WordPiece (BERT)
wp_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# TODO: Tokenize the test_texts using the WordPiece tokenizer


# SentencePiece (T5)
sp_tokenizer = AutoTokenizer.from_pretrained("t5-small")
# TODO: Tokenize the test_texts using the SentencePiece tokenizer
```

Analyzing how different tokenizers handle special text is crucial for understanding their practical applications. Here is the completed script and a detailed breakdown of the tokenization results for each special text type.

### Completed Script

```python
from transformers import AutoTokenizer

# Test set with special text types
test_texts = [
    "Visit https://www.example.com for more info.",
    "def my_function(): return 'Hello, World!'",
    "#Python is amazing!",
    "Contact us at support@example.com.",
    "The price is $123.45, not $67.89."
]

# BPE (GPT-2)
bpe_tokenizer = AutoTokenizer.from_pretrained("gpt2")
bpe_results = {text: bpe_tokenizer.tokenize(text) for text in test_texts}

# WordPiece (BERT)
wp_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
wp_results = {text: wp_tokenizer.tokenize(text) for text in test_texts}

# SentencePiece (T5)
sp_tokenizer = AutoTokenizer.from_pretrained("t5-small")
sp_results = {text: sp_tokenizer.tokenize(text) for text in test_texts}

# Print results
print("--- BPE (GPT-2) Tokenization ---")
for text, tokens in bpe_results.items():
    print(f"Original: '{text}'")
    print(f"Tokens: {tokens}\n")

print("\n--- WordPiece (BERT) Tokenization ---")
for text, tokens in wp_results.items():
    print(f"Original: '{text}'")
    print(f"Tokens: {tokens}\n")

print("\n--- SentencePiece (T5) Tokenization ---")
for text, tokens in sp_results.items():
    print(f"Original: '{text}'")
    print(f"Tokens: {tokens}\n")
```

### Analysis of Tokenization on Special Text Types

1.  **URL (`https://www.example.com`)**

      * **BPE (GPT-2):** `['https', '://', 'www', '.', 'example', '.', 'com']`
      * **WordPiece (BERT):** `['https', ':', '/', '/', 'www', '.', 'example', '.', 'com']`
      * **SentencePiece (T5):** `['▁https', '://', 'www.', 'example', '.com']`
      * **Comparison:** BPE and WordPiece break the URL into short pieces, with WordPiece splitting every punctuation mark (`:`, `/`, `/`) into its own token. SentencePiece keeps larger pieces such as `www.` and `.com` together and marks the start of the URL with `▁`, which is a key difference.

2.  **Code Snippet (`def my_function(): return 'Hello, World!'`)**

      * **BPE (GPT-2):** `['def', 'Ġmy', '_', 'function', '()', ':', 'Ġreturn', "Ġ'", 'Hello', ',', 'ĠWorld', '!', "'"]`
      * **WordPiece (BERT):** `['def', 'my', '_', 'function', '(', ')', ':', 'return', "'", 'hello', ',', 'world', '!', "'"]`
      * **SentencePiece (T5):** `['▁def', '▁my', '_', 'function', '():', '▁return', '▁', "'", 'Hello', ',', '▁World', '!', "'"]`
      * **Comparison:** All three tokenizers keep the keywords (`def`, `return`) whole and split off most punctuation. GPT-2 and T5 use special characters (`Ġ` and `▁`) to mark spaces, preserving the original whitespace, and each groups the brackets into one token (`()` or `():`). BERT discards whitespace and splits every bracket individually, which gives it the longest sequence here. BERT (uncased) also lowercases `Hello` and `World`, while GPT-2 and T5 preserve capitalization.

3.  **Hashtag (`#Python is amazing!`)**

      * **BPE (GPT-2):** `['#', 'Python', 'Ġis', 'Ġamazing', '!']`
      * **WordPiece (BERT):** `['#', 'python', 'is', 'amazing', '!']`
      * **SentencePiece (T5):** `['▁#', 'P', 'ython', '▁is', '▁amazing', '!']`
      * **Comparison:** All three split the `#` symbol from the word. WordPiece and BPE keep `Python` as one token (lowercased to `python` by BERT), while SentencePiece breaks it into `P` and `ython`, which reflects the subwords learned from its training corpus. All three tokenizers separate the final `!`.

4.  **Email Address (`support@example.com`)**

      * **BPE (GPT-2):** `['support', '@', 'example', '.', 'com']`
      * **WordPiece (BERT):** `['support', '@', 'example', '.', 'com']`
      * **SentencePiece (T5):** `['▁support', '@', 'example', '.com']`
      * **Comparison:** All three break the email into intuitive components: the username, the `@` symbol, the domain name, and the top-level domain. The only difference is that SentencePiece keeps `.com` as a single piece, while BPE and WordPiece split off the dot.

5.  **Numeric Expression (`$123.45, not $67.89`)**

      * **BPE (GPT-2):** `['$', '123', '.', '45', ',', 'Ġnot', 'Ġ$', '67', '.', '89']`
      * **WordPiece (BERT):** `['$', '123', '.', '45', ',', 'not', '$', '67', '.', '89']`
      * **SentencePiece (T5):** `['▁', '$', '123', '.', '45', ',', '▁not', '▁', '$', '67', '.', '89']`
      * **Comparison:** BPE and WordPiece handle the numbers and currency symbols very similarly, splitting each amount into the currency symbol, digit groups, and the decimal point. In this listing, SentencePiece produces the same digit groups but emits a standalone `▁` before each `$`, so the whitespace appears as its own token.

This exercise demonstrates that while all three tokenizers segment text effectively, each handles special characters, whitespace, and case differently, which directly affects its output. For instance, SentencePiece's explicit `▁` whitespace markers, GPT-2's `Ġ` markers, and BERT's lowercasing are key differentiators to consider when selecting a tokenizer for a specific task. Exact splits depend on each tokenizer's vocabulary and library version, so treat the listings above as representative and rerun the script to confirm them.
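The `Ġ` in the GPT-2 listings is not a real character in the input: GPT-2's byte-level BPE first maps every byte to a printable Unicode character, and the space byte happens to land on `Ġ`. The sketch below re-implements that byte-to-character table along the lines of the `bytes_to_unicode` helper published with GPT-2's reference encoder; treat it as an illustrative reconstruction rather than the library's code. The `display_form` helper is hypothetical.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in byte_values:
            # Remaining bytes (space, control characters, ...) are shifted past 255
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

BYTE_TO_CHAR = bytes_to_unicode()

def display_form(text):
    """Show how a string looks after the byte-to-character mapping."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

print(display_form(" World"))   # ĠWorld
print(repr(BYTE_TO_CHAR[10]))   # 'Ċ' (the newline byte)
```

Because every byte has a visible stand-in, GPT-2 never needs an unknown token: any input, including emoji or unusual symbols, can be expressed as a sequence of these mapped bytes.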

This video provides more information on how to build a GPT tokenizer from scratch, which explains how it handles special characters. [Let's build the GPT Tokenizer](https://www.youtube.com/watch?v=zduSFxRajkE)