# Unit 2

## Byte-Pair Encoding (BPE) – Subword Tokenization


# Introduction to Subword Tokenization and Byte-Pair Encoding (BPE)

Welcome to the next step in your journey through **Natural Language Processing (NLP)**. In this lesson, we will explore **subword tokenization**, a technique that helps reduce vocabulary size and handle out-of-vocabulary words, making it a crucial tool for modern NLP models. We will focus on **Byte-Pair Encoding (BPE)**, a popular subword tokenization method.

-----

### Why Subword Tokenization and Understanding BPE

Subword tokenization sits between word-level and character-level tokenization. A word-level vocabulary must store every distinct word form and maps anything unseen to a generic unknown token, while a character-level vocabulary is tiny but produces very long sequences. By breaking words into smaller, reusable units, subword tokenization can represent rare and out-of-vocabulary words with known pieces while keeping the vocabulary at a manageable size.

**Byte-Pair Encoding (BPE)** is a widely used subword tokenization method. Starting from individual characters (or bytes), it repeatedly finds the most frequent pair of adjacent symbols in a text corpus and merges that pair into a single new symbol. This process continues until a predefined vocabulary size is reached.

-----

### Example of Subword Tokenization and BPE

Consider the word "unhappiness". Traditional tokenization might treat it as a single token, but subword tokenization can break it down into smaller units like **"un"**, **"happi"**, and **"ness"**. This breakdown allows the model to understand and process parts of the word even if the entire word is rare or unseen.

Let's say we have a corpus with the words "low", "lowest", and "newer". BPE starts from individual characters and merges the most frequent adjacent pair at each step, for example "l" + "o" into "lo" and then "lo" + "w" into "low". With enough merges it might end up with subword units like "low", "est", and "new", which lets the model handle variations of a word by reusing shared pieces.
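The merge loop can be sketched in a few lines of plain Python. This is a minimal, illustrative trainer, not the implementation used by GPT-2 or any library: the function names (`count_pairs`, `merge_pair`, `train_bpe`) are made up for this example, every word counts once, and ties between equally frequent pairs are broken by first occurrence, which real implementations may handle differently.

```python
from collections import Counter


def count_pairs(word_freqs):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, word_freqs):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    word_freqs = {tuple(word): freq for word, freq in Counter(words).items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: the first pair counted wins
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs


merges, segmented = train_bpe(["low", "lowest", "newer"], num_merges=2)
print(merges)           # [('l', 'o'), ('lo', 'w')]
print(list(segmented))  # [('low',), ('low', 'e', 's', 't'), ('n', 'e', 'w', 'e', 'r')]
```

After two merges, "low" is already a single symbol and "lowest" reuses it. Raising `num_merges` grows the vocabulary and shortens the segmentations, which is the trade-off a target vocabulary size controls.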

### Advantages of BPE

  * **Reduces Vocabulary Size:** By merging frequent pairs, BPE creates a compact vocabulary.
  * **Handles Rare Words:** Breaks down rare words into known subword units, improving model performance.
  * **Improves Efficiency:** Smaller vocabularies lead to faster and more efficient model training and inference.
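To make the "Handles Rare Words" point concrete, here is a small sketch of how learned merge rules might be applied to words that were never seen during training. The merge list is hand-written and hypothetical, and `apply_merges` is a name chosen for this example rather than a library function.

```python
def apply_merges(word, merges):
    """Segment `word` by applying merge rules in the order they were learned."""
    symbols = list(word)  # start from single characters
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical merge rules, as if learned from a small corpus
merges = [("l", "o"), ("lo", "w"), ("e", "r")]

print(apply_merges("lower", merges))  # ['low', 'er']
print(apply_merges("slow", merges))   # ['s', 'low']
```

Neither "lower" nor "slow" needs its own vocabulary entry: both are covered by pieces the tokenizer already knows, falling back to single characters where no merge applies.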

-----

### Implementing BPE with Pretrained Models

In most real-world applications, training a BPE model from scratch is not necessary. Instead, we can leverage pretrained models that already utilize BPE for tokenization. However, there are specific cases where training your own BPE tokenizer might be beneficial:

  * **Domain-Specific Language:** If your application involves a specialized domain with unique vocabulary, training a BPE tokenizer on a domain-specific corpus can improve performance.
  * **Low-Resource Languages:** For languages with limited available data, a custom BPE tokenizer can be tailored to better handle linguistic nuances.
  * **Research and Experimentation:** If you're conducting research or experimenting with novel NLP techniques, training your own BPE tokenizer can provide insights and flexibility.

For this lesson, we will focus on using pretrained models, which are efficient and widely applicable. Pretrained models come with a predefined vocabulary size, which is crucial for balancing model performance and computational efficiency. A larger vocabulary size can capture more linguistic nuances but may increase computational requirements, while a smaller vocabulary size can improve efficiency but might miss some details.
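One concrete cost of a larger vocabulary is the token embedding table, which stores one vector per token. The rough calculation below assumes the figures for the smallest GPT-2 model (50,257 tokens and a hidden size of 768), which are not stated in this lesson; the 100,000-token vocabulary is purely hypothetical.

```python
def embedding_parameters(vocab_size, hidden_size):
    """Number of weights in a token embedding table (one row per token)."""
    return vocab_size * hidden_size


gpt2_small = embedding_parameters(vocab_size=50_257, hidden_size=768)
larger_vocab = embedding_parameters(vocab_size=100_000, hidden_size=768)  # hypothetical

print(f"{gpt2_small:,}")    # 38,597,376
print(f"{larger_vocab:,}")  # 76,800,000
```

Doubling the vocabulary roughly doubles this table, while a smaller vocabulary saves parameters at the price of splitting text into more tokens.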

### Pretrained Models Using BPE

Byte-Pair Encoding is widely used in state-of-the-art pretrained language models. Here are a few notable models that use BPE or a closely related subword method:

  * **GPT-2 (Generative Pre-trained Transformer 2):** Developed by OpenAI, GPT-2 uses BPE to tokenize text, allowing it to handle a vast array of vocabulary efficiently. This model is known for its ability to generate coherent and contextually relevant text.
  * **BERT (Bidirectional Encoder Representations from Transformers):** BERT, developed by Google, uses **WordPiece** rather than BPE. WordPiece follows the same idea of building a vocabulary by merging smaller units, but it chooses merges with a likelihood-based score instead of raw pair frequency.
  * **RoBERTa (A Robustly Optimized BERT Pretraining Approach):** RoBERTa keeps BERT's architecture but changes the training recipe, and it replaces WordPiece with byte-level BPE, the same scheme used by GPT-2.

These models demonstrate the effectiveness of BPE in handling diverse linguistic structures and improving the performance of NLP applications. By leveraging BPE, these models can efficiently process and understand text, making them powerful tools for a wide range of language tasks.

-----

### Step-by-Step Implementation with Pretrained Models:

To see BPE in action with a pretrained model, we can use the **transformers** library by Hugging Face, which provides easy access to many pretrained models. Below is an example of how to use GPT-2 with BPE tokenization:

#### 1. Load a Pretrained Model and Tokenizer:

Use the `transformers` library to load GPT-2 and its tokenizer.

```python
from transformers import GPT2Tokenizer

# Load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
```

*To use RoBERTa instead of GPT-2, you would load the RoBERTa tokenizer by replacing `GPT2Tokenizer` with `RobertaTokenizer` and specifying `"roberta-base"` as the model name.*

#### 2. Tokenize and Encode Text:

Use the tokenizer to encode a sentence, which will demonstrate BPE in action.

```python
input_text = "Tokenization is essential"
encoded_input = tokenizer.encode(input_text)

print("Encoded input:", encoded_input)
print("Tokenized output:", tokenizer.convert_ids_to_tokens(encoded_input))
```

  * **`encoded_input`:** The list of token IDs returned by `tokenizer.encode`. Each number identifies one subword token in the GPT-2 vocabulary, and these IDs are what the model actually receives as input.
  * **`tokenizer.convert_ids_to_tokens(encoded_input)`:** Maps each token ID back to its subword string, which makes it easy to see how the BPE tokenizer split the input text.
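Conceptually, the two calls are lookups in opposite directions through the tokenizer's vocabulary table. The toy sketch below uses a hypothetical four-entry vocabulary with made-up IDs and skips the splitting step entirely, so it only illustrates the mapping, not how `transformers` implements it.

```python
# Hypothetical vocabulary: real GPT-2 IDs are different and far more numerous.
token_to_id = {"Token": 0, "ization": 1, "Ġis": 2, "Ġessential": 3}
id_to_token = {idx: token for token, idx in token_to_id.items()}

tokens = ["Token", "ization", "Ġis", "Ġessential"]

# Token strings -> IDs: the form the model consumes
ids = [token_to_id[token] for token in tokens]

# IDs -> token strings: the form that is easy for people to read
recovered = [id_to_token[idx] for idx in ids]

print(ids)        # [0, 1, 2, 3]
print(recovered)  # ['Token', 'ization', 'Ġis', 'Ġessential']
```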

#### Output:

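The output shows the token IDs followed by the corresponding subword tokens. It should look similar to the following; treat the ID values as illustrative and confirm them by running the code, since they depend on the exact GPT-2 vocabulary file.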

```text
Encoded input: [464, 9220, 318, 13779]
Tokenized output: ['Token', 'ization', 'Ġis', 'Ġessential']
```

  * The `Ġ` in the tokenized output marks a token that begins with a space. GPT-2's byte-level BPE maps every byte to a printable character, and the space byte is shown as `Ġ` (U+0120). A token with `Ġ` therefore starts a new word, while a token without it, such as `ization`, continues the previous token. `Token` has no `Ġ` only because it is the first token in the text and no space precedes it.
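The choice of `Ġ` follows from how byte-level BPE makes every byte printable. The sketch below is modeled on the byte-to-character table in the GPT-2 reference code as we understand it; the function name `byte_to_printable` is chosen for this example, so treat it as an illustration rather than the library's API.

```python
def byte_to_printable():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that already correspond to printable characters keep their own character.
    keep = (
        list(range(ord("!"), ord("~") + 1))  # printable ASCII except space
        + list(range(0xA1, 0xAC + 1))        # Latin-1 printable range, part 1
        + list(range(0xAE, 0xFF + 1))        # Latin-1 printable range, part 2
    )
    mapping = {b: chr(b) for b in keep}
    # Every other byte (control characters, space, ...) is shifted to
    # consecutive code points starting at 256, in increasing byte order.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


table = byte_to_printable()
space_char = table[ord(" ")]
print(repr(space_char), hex(ord(space_char)))  # 'Ġ' 0x120
print(repr(table[ord("\n")]))                  # 'Ċ'
print(table[ord("A")])                         # A
```

The space byte (32) is the 33rd byte without a printable form, so it lands on code point 256 + 32 = 288, which is `Ġ`. The same table explains why a newline shows up as `Ċ` in tokenized output.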

-----

### Summary and Next Steps

In this lesson, we introduced the concept of **subword tokenization**, highlighting its importance in handling rare and out-of-vocabulary words while reducing vocabulary size. We explored **Byte-Pair Encoding (BPE)**, a widely used subword tokenization technique, and demonstrated its implementation using pretrained models. We also discussed how pretrained models like GPT-2 leverage BPE for efficient tokenization and processing of text.

As you move on to the practice exercises, focus on applying these concepts to gain hands-on experience. Experiment with different corpora and vocabulary sizes to see how BPE affects tokenization. This practical application will solidify your understanding and prepare you for more advanced NLP tasks. Keep up the great work, and continue to build on your knowledge of tokenization techniques!

## Exploring Pre-trained Tokenizers with GPT-2

In the previous lesson, you explored how pre-trained tokenizers utilize BPE for subword tokenization. Now, let's apply this knowledge to a practical task.

Your task is to:

  * Use the pre-trained `GPT2Tokenizer` to encode the sentence "Understanding tokenization is crucial."
  * Print the resulting subword tokens.

This exercise will help reinforce your understanding of how pre-trained tokenizers work. Dive in and examine the results!

```python
from transformers import GPT2Tokenizer


# TODO: Initialize the pre-trained GPT-2 tokenizer
# TODO: Encode the sentence "Understanding tokenization is crucial." using the pre-trained tokenizer
# TODO: Print the tokenized output using the pre-trained tokenizer

```

```python
from transformers import GPT2Tokenizer

# Initialize the pre-trained GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Encode the sentence "Understanding tokenization is crucial."
input_text = "Understanding tokenization is crucial."
encoded_input = tokenizer.encode(input_text)

# Convert the token IDs back to human-readable subword tokens
tokenized_output = tokenizer.convert_ids_to_tokens(encoded_input)

# Print the tokenized output
print("Original sentence:", input_text)
print("Tokenized output:", tokenized_output)
```

### Explanation of the Output

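When you run the code, you should see output similar to the following. The exact splits come from GPT-2's learned vocabulary, so a long word may appear as several pieces in your run.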

```text
Original sentence: Understanding tokenization is crucial.
Tokenized output: ['Understanding', 'Ġtokenization', 'Ġis', 'Ġcrucial', '.']
```

The output demonstrates how the **GPT-2 BPE tokenizer** handles the sentence:

  * **"Understanding"** appears as a single token. It has no `Ġ` prefix because it is the first word in the sentence and no space precedes it.
  * **"Ġtokenization"** starts with `Ġ` (U+0120), the character GPT-2's byte-level BPE uses to stand for a preceding space, so this token begins a new word. When a word is not in the vocabulary as a whole, it is split into several pieces, just as `Tokenization` became `Token` and `ization` in the lesson example, and only the first piece of a word that follows a space carries `Ġ`.
  * **"Ġis"** and **"Ġcrucial"** are also prefixed with `Ġ`, showing that each is a complete word following a space.
  * The final period **"."** is treated as its own separate token, which is standard for punctuation in most tokenizers.

This exercise shows how a powerful, pre-trained tokenizer breaks down text into meaningful units that are efficient for a language model to process.

## Using Pre-trained Tokenizers with RoBERTa

You've just learned about the power of subword tokenization and BPE. Now, let's dive into using a pre-trained tokenizer to see these concepts in action.

Your task is to:

  * Import `RobertaTokenizer` from the `transformers` library.
  * Load the tokenizer using `RobertaTokenizer.from_pretrained("roberta-base")`.
  * Encode the provided sentence.
  * Print the resulting subword tokens.

This hands-on exercise will help you see how pre-trained tokenizers work. Let's get started and see what insights you can uncover!

```python
# TODO: Import the RobertaTokenizer from the transformers library

# TODO: Load the tokenizer using RobertaTokenizer.from_pretrained("roberta-base")

# TODO: Encode the input text
input_text = "I usually wake up at 7:30 AM, grab a coffee, and check my emails before starting work."
# TODO: Use the tokenizer to encode the input_text

# TODO: Print the tokenized output using tokenizer.convert_ids_to_tokens(encoded_input)

```

```python
from transformers import RobertaTokenizer

# Load the tokenizer using RobertaTokenizer.from_pretrained("roberta-base")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Encode the input text
input_text = "I usually wake up at 7:30 AM, grab a coffee, and check my emails before starting work."
encoded_input = tokenizer.encode(input_text)

# Print the tokenized output
tokenized_output = tokenizer.convert_ids_to_tokens(encoded_input)
print("Original sentence:", input_text)
print("Tokenized output:", tokenized_output)
```

-----

### Explanation of the Tokenized Output

The output of the code will look like this:

```text
Original sentence: I usually wake up at 7:30 AM, grab a coffee, and check my emails before starting work.
Tokenized output: ['<s>', 'I', 'Ġusually', 'Ġwake', 'Ġup', 'Ġat', 'Ġ7', ':', '30', 'ĠAM', ',', 'Ġgrab', 'Ġa', 'Ġcoffee', ',', 'Ġand', 'Ġcheck', 'Ġmy', 'Ġemails', 'Ġbefore', 'Ġstarting', 'Ġwork', '.', '</s>']
```

Here's a breakdown of how the **RoBERTa tokenizer** handles the sentence:

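  * **Special Tokens:** The output starts with `<s>` and ends with `</s>`. The RoBERTa tokenizer adds these special tokens to mark the beginning and end of the input sequence. The GPT-2 tokenizer used earlier adds no such markers by default, which is why they did not appear in the previous exercise.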
  * **Space Handling:** The `Ġ` character is used to represent a space. This is a common feature in many BPE-based tokenizers like RoBERTa's. For example, `Ġusually` and `Ġwake` are full words that follow a space. This helps the model accurately reconstruct the original text.
  * **Punctuation and Numbers:** Punctuation marks like the comma (`,`) and period (`.`) are tokenized separately. The time `7:30` is split into `Ġ7`, `:`, and `30`, showing that digits and symbols are separated into their own tokens.
  * **Contractions and Hyphenated Words:** This sentence contains no contractions or hyphenated words, so it does not show how the tokenizer treats them. The comparison exercise in the next section includes a contraction (`how's`).
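Because `Ġ` stands for a space and the boundary tokens carry no text, the original sentence can be rebuilt from the token list shown above. The sketch below does this by hand for illustration; it is only safe for plain ASCII text like this sentence. In practice, `tokenizer.decode(encoded_input, skip_special_tokens=True)` handles the general case.

```python
tokens = ['<s>', 'I', 'Ġusually', 'Ġwake', 'Ġup', 'Ġat', 'Ġ7', ':', '30', 'ĠAM', ',',
          'Ġgrab', 'Ġa', 'Ġcoffee', ',', 'Ġand', 'Ġcheck', 'Ġmy', 'Ġemails',
          'Ġbefore', 'Ġstarting', 'Ġwork', '.', '</s>']

special_tokens = {"<s>", "</s>"}

# Drop the boundary markers, glue the pieces together, and turn Ġ back into a space.
pieces = [token for token in tokens if token not in special_tokens]
text = "".join(pieces).replace("Ġ", " ")

print(text)
# I usually wake up at 7:30 AM, grab a coffee, and check my emails before starting work.
```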

## Comparing Tokenization with GPT-2 and RoBERTa

You've just seen how pretrained models use BPE for tokenization. In this last task for this unit, let's compare how different models handle sentences from various contexts.

Your task is to:

  * Use the pretrained `GPT2Tokenizer` to encode the following sentences:
      * **Casual:** "Hey, how's it going? I was thinking we could catch up over coffee sometime soon."
      * **Physics:** "The gravitational force between two masses is inversely proportional to the square of the distance between them."
      * **Rare Words:** "The quizzaciously zephyrous xylophonist played a mellifluous tune, captivating the audience with its ethereal beauty."
  * Use the pretrained `RobertaTokenizer` to encode the same sentences.
  * Print and compare the tokenized outputs from both models for each sentence.

This exercise will help you understand the differences in tokenization between GPT-2 and RoBERTa across different styles and contexts. Dive in and see how each model processes the text!

```python
from transformers import GPT2Tokenizer, RobertaTokenizer

# Load the GPT-2 tokenizer
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Load the RoBERTa tokenizer
# TODO: Load the RoBERTa tokenizer

# Define the input texts
casual_text = "Hey, how's it going? I was thinking we could catch up over coffee sometime soon."
physics_text = "The gravitational force between two masses is inversely proportional to the square of the distance between them."
rare_words_text = "The quizzaciously zephyrous xylophonist played a mellifluous tune, captivating the audience with its ethereal beauty."

# TODO: Encode the input texts using the GPT-2 tokenizer and print the tokenized outputs

# TODO: Encode the input texts using the RoBERTa tokenizer and print the tokenized outputs

```

```python
from transformers import GPT2Tokenizer, RobertaTokenizer

# Load the GPT-2 tokenizer
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Load the RoBERTa tokenizer
roberta_tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Define the input texts
casual_text = "Hey, how's it going? I was thinking we could catch up over coffee sometime soon."
physics_text = "The gravitational force between two masses is inversely proportional to the square of the distance between them."
rare_words_text = "The quizzaciously zephyrous xylophonist played a mellifluous tune, captivating the audience with its ethereal beauty."

# Encode and print outputs for GPT-2
print("--- GPT-2 Tokenization ---")
gpt2_casual = gpt2_tokenizer.convert_ids_to_tokens(gpt2_tokenizer.encode(casual_text))
print("Casual Text:", gpt2_casual)
gpt2_physics = gpt2_tokenizer.convert_ids_to_tokens(gpt2_tokenizer.encode(physics_text))
print("Physics Text:", gpt2_physics)
gpt2_rare = gpt2_tokenizer.convert_ids_to_tokens(gpt2_tokenizer.encode(rare_words_text))
print("Rare Words Text:", gpt2_rare)

print("\n")

# Encode and print outputs for RoBERTa
print("--- RoBERTa Tokenization ---")
roberta_casual = roberta_tokenizer.convert_ids_to_tokens(roberta_tokenizer.encode(casual_text))
print("Casual Text:", roberta_casual)
roberta_physics = roberta_tokenizer.convert_ids_to_tokens(roberta_tokenizer.encode(physics_text))
print("Physics Text:", roberta_physics)
roberta_rare = roberta_tokenizer.convert_ids_to_tokens(roberta_tokenizer.encode(rare_words_text))
print("Rare Words Text:", roberta_rare)
```

-----

### Comparison and Analysis

This exercise highlights some key differences in how the GPT-2 and RoBERTa tokenizers operate.

#### GPT-2 Tokenizer

  * **Contractions:** GPT-2 splits **`how's`** into `how` and `'s`. This is a common and linguistically sound approach.
  * **Subword Handling:** Rare words such as **`quizzaciously`** and **`zephyrous`** are not in the vocabulary as whole words, so the tokenizer falls back to several shorter, more common pieces. The exact pieces depend on the learned merge rules, so read them from your own output rather than assuming a particular split.
  * **Space Handling:** GPT-2 prefixes tokens that follow a space with `Ġ`, which preserves word boundaries.
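The contraction split is not learned by BPE itself. Before any merges are applied, GPT-2 cuts the text into chunks with a regular expression, and merges never cross chunk boundaries. The pattern below is a simplified, ASCII-only approximation written for the standard `re` module; the real pattern uses Unicode letter and number classes, so take this as a sketch of the idea only.

```python
import re

# ASCII-only simplification of GPT-2's pre-tokenization pattern (an assumption
# for illustration; the original relies on Unicode character classes).
PRETOKENIZE = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"   # common English contraction suffixes
    r"| ?[A-Za-z]+"              # optional leading space followed by letters
    r"| ?[0-9]+"                 # optional leading space followed by digits
    r"| ?[^\sA-Za-z0-9]+"        # optional leading space followed by symbols
    r"|\s+(?!\S)|\s+"            # leftover whitespace
)

chunks = PRETOKENIZE.findall("Hey, how's it going?")
print(chunks)  # ['Hey', ',', ' how', "'s", ' it', ' going', '?']
```

Each chunk keeps its leading space, which is what later appears as `Ġ`, and `'s` is already its own chunk before BPE runs.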

#### RoBERTa Tokenizer

  * **Special Tokens:** RoBERTa adds `<s>` and `</s>` tokens to the beginning and end of each input sequence.
  * **Subword Handling:** RoBERTa's tokenizer is derived from GPT-2's byte-level BPE, so rare words such as **`zephyrous`** should be broken into the same kind of subword pieces. Compare the two outputs piece by piece to check this for yourself.
  * **Contractions:** Similar to GPT-2, RoBERTa splits **`how's`** into `how` and `'s`.
  * **Consistency:** Both tokenizers handle the **Physics** and **Casual** texts very similarly, treating common words as single tokens and separating punctuation. The most visible difference is RoBERTa's boundary tokens.

The outputs show that both models rely on byte-level BPE and produce closely matching subword pieces. What differs is what surrounds those pieces: RoBERTa adds boundary tokens, and each tokenizer maps tokens to its own set of IDs. This is why a model and its tokenizer are always used as a pair: the model is trained to interpret token IDs precisely as its tokenizer generates them.
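A quick way to check how far two tokenizations actually differ is to strip the boundary tokens and compare what remains. The helper below is a hypothetical utility written for this comparison, and the two token lists are hand-written stand-ins rather than real tokenizer output.

```python
def compare_tokenizations(tokens_a, tokens_b, special_tokens=("<s>", "</s>")):
    """Compare two token lists after removing boundary tokens."""
    a = [token for token in tokens_a if token not in special_tokens]
    b = [token for token in tokens_b if token not in special_tokens]
    return {"same_pieces": a == b, "count_a": len(a), "count_b": len(b)}


# Hand-written examples for illustration only
gpt2_like = ["Hey", ",", "Ġhow", "'s", "Ġit", "Ġgoing", "?"]
roberta_like = ["<s>", "Hey", ",", "Ġhow", "'s", "Ġit", "Ġgoing", "?", "</s>"]

report = compare_tokenizations(gpt2_like, roberta_like)
print(report)  # {'same_pieces': True, 'count_a': 7, 'count_b': 7}
```

To use it on the exercise, pass the lists from the solution code, for example `compare_tokenizations(gpt2_rare, roberta_rare)`.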