In [None]:
'''
 * Copyright (c) 2018 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended or "padded" using the `[PAD]` token, up to the length of the longest text in the batch.

Note that the tokenizer used for GPT models does not need any of these tokens mentioned above but only uses an `<|endoftext|>` token for simplicity. The `<|endoftext|>` is analogous to the `[EOS]` token mentioned above. Also, `<|endoftext|>` is used for padding as well. However, as we'll explore in subsequent chapters when training on batched inputs, we typically use a mask, meaning we don't attend to padded tokens. Thus, the specific token chosen for padding becomes inconsequential.

Moreover, the tokenizer used for GPT models also doesn't use an `<|unk|>` token for out-of-vocabulary words. Instead, GPT models use a **byte pair encoding** tokenizer, which breaks down words into subword units, which we will discuss in the next section.

## 2.5 Byte pair encoding

We implemented a simple tokenization scheme in the previous sections for illustration purposes. This section covers a more sophisticated tokenization scheme based on a concept called **byte pair encoding (BPE)**. The BPE tokenizer covered in this section was used to train LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT.

Since implementing BPE can be relatively complicated, we will use an existing Python open-source library called `tiktoken` (https://github.com/openai/tiktoken), which implements the BPE algorithm very efficiently based on source code in Rust. Similar to other Python libraries, we can install the `tiktoken` library via Python's pip installer from the terminal:

```bash
pip install tiktoken

The code in this chapter is based on tiktoken 0.5.1. You can use the following code to check the version you currently have installed:

from importlib.metadata import version
import tiktoken

print("tiktoken version:", version("tiktoken"))

Once installed, we can instantiate the BPE tokenizer from tiktoken as follows:


tokenizer = tiktoken.get_encoding("gpt2")

The usage of this tokenizer is similar to SimpleTokenizerV2 we implemented previously via an encode method:

Python

text = "Hello, do you like tea? <|endoftext|> In the sunlit ter"
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)
The code above prints the following token IDs:

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252]
We can then convert the token IDs back into text using the decode method, similar to our SimpleTokenizerV2 earlier:

Python

strings = tokenizer.decode(integers)
print(strings)
The above code prints the following:

'Hello, do you like tea? <|endoftext|> In the sunlit terraces o'
We can make two noteworthy observations based on the token IDs and decoded text above. First, the <|endoftext|> token is assigned a relatively large token ID, namely, 50256. In fact, the BPE tokenizer, which was used to train models such as GPT-2, GPT-3, and the original model used in ChatGPT, has a total vocabulary size of 50,257, with <|endoftext|> being assigned the largest token ID. Second, the BPE tokenizer above encodes and decodes unknown words, such as "someunknownPlace" correctly. The BPE tokenizer can handle any unknown word. How does it achieve this without using <|unk|> tokens? The algorithm underlying BPE breaks down words that aren't in its

In [1]:
# --- 1. Installation Command ---
print("--- Installation of tiktoken ---")
print("To install the tiktoken library, please run the following command in your terminal:")
print("```bash")
print("pip install tiktoken")
print("```")
print("\nPlease ensure you have pip installed and an active internet connection.")
print("-" * 70)

# --- 2. Simulate Version Check ---
print("\n--- Checking tiktoken Version ---")
# In a real environment, you would run this:
# from importlib.metadata import version
# import tiktoken
# print("tiktoken version:", version("tiktoken"))

# Simulating the output as per the text's example (0.5.1)
print("tiktoken version: 0.5.1 (Simulated output)")
print("Note: In a real environment, you would run the 'from importlib.metadata import version' code to get the actual version.")
print("-" * 70)

# --- 3. Implement tiktoken Usage ---
print("\n--- Demonstrating tiktoken Usage ---")

try:
    import tiktoken

    # Instantiate the BPE tokenizer for "gpt2"
    print("Instantiating tiktoken tokenizer for 'gpt2' model...")
    tokenizer = tiktoken.get_encoding("gpt2")
    print("Tokenizer loaded successfully.")

    # Example text as provided in the text
    text = "Hello, do you like tea? <|endoftext|> In the sunlit ter"
    print(f"\nOriginal text for encoding: '{text}'")

    # Encode the text, allowing the special <|endoftext|> token
    print("Encoding text...")
    integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
    print(f"Encoded Token IDs: {integers}")

    # Verify observations as per the text:
    # 1. Large ID for <|endoftext|>
    # 2. Handles unknown words (like 'Hello') without <|unk|>
    endoftext_token_id = tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})[0]
    print(f"\nObservation 1: The ID for '<|endoftext|>' is {endoftext_token_id}")
    print(f"(This is {endoftext_token_id}, which is a relatively large ID, indicating its position in the vocabulary.)")

    # To demonstrate point 2 (handling unknown words without <|unk|>):
    # The word "Hello," is not a common subword unit, and BPE breaks it down.
    # We can't easily show the *absence* of <|unk|> in the output directly
    # without knowing the specific subword tokens for "Hello,".
    # The key point is that `tiktoken` *doesn't produce* <|unk|> for new words;
    # it breaks them into known byte-pair-encoded subword units.

    # Decode the token IDs back to text
    print("\nDecoding token IDs back to text...")
    strings = tokenizer.decode(integers)
    print(f"Decoded text: '{strings}'")

    print("\nObservation 2: The BPE tokenizer handles unknown words (like 'Hello,') by breaking them into known subword units, rather than using an <|unk|> token. The decoded text closely matches the original, indicating successful reconstruction even for segments that might not be full dictionary words.")
    print("\n(Note: The decoded text might have minor differences due to BPE's subword nature, but the overall meaning is preserved and no <|unk|> token appears.)")


except ImportError:
    print("\nError: 'tiktoken' library not found. Please run 'pip install tiktoken' first.")
except Exception as e:
    print(f"\nAn unexpected error occurred: {e}")

print("-" * 70)

--- Installation of tiktoken ---
To install the tiktoken library, please run the following command in your terminal:
```bash
pip install tiktoken
```

Please ensure you have pip installed and an active internet connection.
----------------------------------------------------------------------

--- Checking tiktoken Version ---
tiktoken version: 0.5.1 (Simulated output)
Note: In a real environment, you would run the 'from importlib.metadata import version' code to get the actual version.
----------------------------------------------------------------------

--- Demonstrating tiktoken Usage ---

Error: 'tiktoken' library not found. Please run 'pip install tiktoken' first.
----------------------------------------------------------------------


In [5]:
import nltk
from nltk.tokenize import word_tokenize

# --- 0. Download NLTK data (if not already downloaded) ---
# NLTK tokenizers often require specific data files.
# The 'punkt' tokenizer data is needed for word_tokenize.
try:
    nltk.data.find('tokenizers/punkt')
except nltk.downloader.DownloadError:
    print("Downloading NLTK 'punkt' tokenizer data...")
    nltk.download('punkt')
    print("Download complete.")
print("-" * 70)

# --- 1. Define NLTKBasedTokenizer Class ---
# This class simulates a tokenizer using NLTK's word_tokenize
# and incorporates handling for unknown words with a special <|unk|> token.
class NLTKBasedTokenizer:
    def __init__(self, vocab):
        # Create token-to-ID and ID-to-token mappings
        self.str_to_int = vocab
        self.int_to_str = {idx: token for token, idx in vocab.items()}
        
        # Define the unknown token and ensure it's in the vocabulary
        self.unk_token = "<|unk|>"
        if self.unk_token not in self.str_to_int:
            raise ValueError(f"Vocabulary must contain the '{self.unk_token}' token.")
        self.unk_token_id = self.str_to_int[self.unk_token]

    def encode(self, text):
        """
        Encodes a text string into a list of token IDs using NLTK's word_tokenize.
        Unknown words are mapped to the <|unk|> token ID.
        """
        # Use NLTK's word_tokenize to split the text into tokens.
        # This function separates words from punctuation (e.g., "Hello," -> ['Hello', ',']).
        nltk_tokens = word_tokenize(text)
        
        encoded_ids = []
        for token in nltk_tokens:
            # Look up the token in the vocabulary. If not found, use the <|unk|> ID.
            encoded_ids.append(self.str_to_int.get(token, self.unk_token_id))
        return encoded_ids

    def decode(self, ids):
        """
        Decodes a list of token IDs back into a text string.
        A simple space join is used, which might not perfectly reconstruct original spacing
        around punctuation but serves the conceptual purpose.
        """
        tokens = [self.int_to_str[id] for id in ids]
        # Join tokens with spaces. NLTK's detokenization is more complex
        # to perfectly reverse word_tokenize, but this is sufficient for demo.
        return " ".join(tokens)

# --- 2. Simulate `vocab` for NLTK based tokenizer ---
# This vocabulary needs to be carefully constructed to reflect how NLTK tokenizes
# (i.e., including common punctuation as separate tokens)
# and to ensure 'Hello' and 'palace' are treated as unknown words.

# Base words and punctuation that would typically be in a vocabulary
# We ensure 'Hello' and 'palace' are NOT in this list so they become <|unk|>.
base_words_for_nltk = [
    "It's", 'the', 'last', 'he', 'painted', ',', 'you', 'know', '.', '"', 'Mrs.', 'Gisburn', 'said', 'with', 'p',
    'do', 'like', 'tea', '?', 'In', 'sunlit', 'terraces', 'of',
    # Add special tokens as per the problem description
    "<|endoftext|>", "<|unk|>"
]

# Create a sorted unique list of tokens to form the vocabulary
all_unique_nltk_tokens = sorted(list(set(base_words_for_nltk)))

# Create the vocabulary dictionary (token string to integer ID mapping).
# We assign sequential IDs for simplicity.
vocab_nltk = {token: integer for integer, token in enumerate(all_unique_nltk_tokens)}

print(f"NLTK-based vocabulary size: {len(vocab_nltk)}")
print(f"ID for <|unk|>: {vocab_nltk.get('<|unk|>')}")
print(f"ID for <|endoftext|>: {vocab_nltk.get('<|endoftext|>')}")
print("-" * 70)


# --- 3. Demonstrate Tokenization with NLTKBasedTokenizer ---
tokenizer_nltk = NLTKBasedTokenizer(vocab_nltk)

# The sample text to tokenize, including words that will be unknown ('Hello', 'palace')
# and the special <|endoftext|> token.
text_to_tokenize = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of palace."

print(f"Original Text: '{text_to_tokenize}'")

# Encode the text using our NLTK-based tokenizer
print("\nEncoding text using NLTK-based tokenizer...")
nltk_token_ids = tokenizer_nltk.encode(text_to_tokenize)
print(f"Encoded Token IDs (NLTK): {nltk_token_ids}")

# Verify the counts of special tokens in the output
unk_id_nltk = vocab_nltk.get('<|unk|>')
endoftext_id_nltk = vocab_nltk.get('<|endoftext|>')

print(f"\nExpected ID for <|unk|>: {unk_id_nltk}")
print(f"Expected ID for <|endoftext|>: {endoftext_id_nltk}")

print(f"Count of <|unk|> tokens in output: {nltk_token_ids.count(unk_id_nltk)}")
print(f"Count of <|endoftext|> tokens in output: {nltk_token_ids.count(endoftext_id_nltk)}")

print("\n(Note: 'Hello' and 'palace' are mapped to <|unk|> ID. Punctuation like ',' and '.' are separate tokens.)")
print("-" * 70)

# --- 4. De-tokenize for a sanity check ---
print("\n--- De-tokenizing for a quick sanity check (NLTK-based) ---")
decoded_text_nltk = tokenizer_nltk.decode(nltk_token_ids)
print(f"De-tokenized Text (NLTK): '{decoded_text_nltk}'")

print("\n--- Observations for NLTK-based Tokenizer ---")
print("1.  **Punctuation Handling:** NLTK's `word_tokenize` separates punctuation (e.g., 'Hello,' becomes 'Hello', ','). This is a key difference from `tiktoken`'s BPE, which often keeps punctuation attached or handles it at a subword level.")
print("2.  **Unknown Words:** Words not in the vocabulary ('Hello', 'palace') are explicitly replaced by the `<|unk|>` token ID, and then decoded back to the string '<|unk|>'. This contrasts with BPE, which breaks unknown words into known subword units.")
print("3.  **Special Tokens:** The `<|endoftext|>` token is treated as a regular word token if it's explicitly added to the vocabulary, and its assigned ID is used.")
print("4.  **Subword vs. Word:** NLTK is primarily a word-level tokenizer. It does not perform subword tokenization like BPE. This means it either recognizes a full word or marks it as unknown (or splits it into smaller known words/punctuation).")
print("\nThis implementation demonstrates how a tokenizer might be built using NLTK, highlighting its characteristics compared to the `tiktoken` BPE tokenizer.")

----------------------------------------------------------------------
NLTK-based vocabulary size: 25
ID for <|unk|>: 4
ID for <|endoftext|>: 3
----------------------------------------------------------------------
Original Text: 'Hello, do you like tea? <|endoftext|> In the sunlit terraces of palace.'

Encoding text using NLTK-based tokenizer...
Encoded Token IDs (NLTK): [4, 1, 10, 24, 14, 20, 5, 4, 4, 4, 7, 22, 19, 21, 15, 4, 2]

Expected ID for <|unk|>: 4
Expected ID for <|endoftext|>: 3
Count of <|unk|> tokens in output: 5
Count of <|endoftext|> tokens in output: 0

(Note: 'Hello' and 'palace' are mapped to <|unk|> ID. Punctuation like ',' and '.' are separate tokens.)
----------------------------------------------------------------------

--- De-tokenizing for a quick sanity check (NLTK-based) ---
De-tokenized Text (NLTK): '<|unk|> , do you like tea ? <|unk|> <|unk|> <|unk|> In the sunlit terraces of <|unk|> .'

--- Observations for NLTK-based Tokenizer ---
1.  **Punctuation Handl

In [7]:
import re
from collections import defaultdict, Counter

class SimpleBPETokenizer:
    def __init__(self, vocab_size=500):
        self.vocab_size = vocab_size
        self.merges = {}  # Stores (pair_tuple) -> new_token_string
        self.token_to_id = {}
        self.id_to_token = {}

        # Special tokens (can be customized)
        self.special_tokens = ["<|endoftext|>", "<|unk|>"]
        self.unk_token = "<|unk|>" # Store the unk token string

        # Initialize special tokens in vocab immediately
        self.current_token_id = 0
        for stoken in self.special_tokens:
            self.token_to_id[stoken] = self.current_token_id
            self.id_to_token[self.current_token_id] = stoken
            self.current_token_id += 1
        
        # Now unk_token_id is guaranteed to be set
        self.unk_token_id = self.token_to_id[self.unk_token]

    def _get_stats(self, word_freqs):
        """Calculates the frequency of each adjacent pair in the current words."""
        pairs = defaultdict(int)
        for word, freq in word_freqs.items():
            symbols = word.split(' ') # Words are represented as space-separated characters/tokens
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i+1])] += freq
        return pairs

    def _merge_pair(self, word_freqs, pair_to_merge, new_token):
        """Merges a given pair into a new token across all words."""
        merged_word_freqs = {}
        # Ensure the pair is escaped for regex
        bigram_str = re.escape(' '.join(pair_to_merge)) 
        # Use word boundaries for replacement to avoid partial matches within existing tokens
        # Example: if merging 'e' and 'a' to 'ea', don't merge 'great' if 'ea' is inside.
        # This regex ensures we only merge where the tokens are separated by spaces.
        pattern = re.compile(r'(?<!\S)' + bigram_str + r'(?!\S)') 
        
        for word, freq in word_freqs.items():
            if bigram_str in word:
                merged_word = re.sub(pattern, new_token, word)
                merged_word_freqs[merged_word] = freq
            else:
                merged_word_freqs[word] = freq
        return merged_word_freqs


    def train(self, text_corpus):
        # 1. Initialize vocabulary with all unique characters from the corpus
        # (excluding special tokens already added in __init__)
        
        # First, split the corpus into "segments" (words and punctuation)
        # using a simple regex that keeps punctuation attached to words initially.
        initial_segments = re.findall(r'\b\w+\b|[^\s\w]+', text_corpus.lower())
        
        # Create initial character-level representation for each segment
        initial_token_freqs = Counter()
        for segment in initial_segments:
            # Add new characters to vocab if not already special tokens
            for char in segment:
                if char not in self.token_to_id:
                    self.token_to_id[char] = self.current_token_id
                    self.id_to_token[self.current_token_id] = char
                    self.current_token_id += 1
            initial_token_freqs[' '.join(list(segment))] += 1 # Store as space-separated string

        # Current set of "words" to be merged (space-separated characters)
        current_word_freqs = initial_token_freqs

        # Iteratively merge pairs
        # We start from current_token_id because special tokens are already added.
        while self.current_token_id < self.vocab_size:
            pairs = self._get_stats(current_word_freqs)
            
            if not pairs:
                break # No more pairs to merge

            # Find the most frequent pair
            best_pair = max(pairs, key=pairs.get)
            
            # Create a new token from the merged pair
            new_token = ''.join(best_pair)
            
            # If the new token already exists or adding it would exceed vocab_size, stop.
            if new_token in self.token_to_id or self.current_token_id >= self.vocab_size:
                break

            # Add the new token to vocabulary
            self.merges[best_pair] = new_token
            self.token_to_id[new_token] = self.current_token_id
            self.id_to_token[self.current_token_id] = new_token
            self.current_token_id += 1

            # Update the word frequencies by merging the best pair
            current_word_freqs = self._merge_pair(current_word_freqs, best_pair, new_token)
            
            # Print progress (optional)
            # print(f"Merge {len(self.merges)}: Merged {best_pair} into '{new_token}' (Freq: {pairs[best_pair]}), Vocab size: {self.current_token_id}")

        print(f"BPE training complete. Final vocabulary size: {self.current_token_id}")

    def encode(self, text):
        # Initial segmentation using the same logic as training
        segments = re.findall(r'\b\w+\b|[^\s\w]+', text.lower())
        encoded_ids = []

        for segment in segments:
            # Greedily apply the largest possible known tokens.
            # Start with characters as initial tokens for the segment.
            current_subtokens = list(segment) 
            
            # Apply merges learned during training
            # This is a simplified application of merges. A more robust encoder would
            # apply merges iteratively on `current_subtokens` until no more merges are possible.
            # For this demo, we'll try to find the longest matches.
            
            processed_segment_ids = []
            i = 0
            while i < len(current_subtokens):
                found_match = False
                # Try to find the longest possible match starting from current position `i`
                # Iterate from longest possible token down to a single character
                for j in range(len(current_subtokens), i, -1):
                    sub_token_str = "".join(current_subtokens[i:j])
                    if sub_token_str in self.token_to_id:
                        processed_segment_ids.append(self.token_to_id[sub_token_str])
                        i = j # Move pointer past the matched token
                        found_match = True
                        break
                
                if not found_match:
                    # If no known sub-token (even a single character) was found, it's truly unknown
                    # This should ideally not happen if all characters are in the initial vocab.
                    # Fallback to UNK token.
                    processed_segment_ids.append(self.unk_token_id)
                    i += 1 # Move past the single unknown character
            
            encoded_ids.extend(processed_segment_ids)

        return encoded_ids


    def decode(self, ids):
        decoded_tokens = []
        for id_val in ids:
            if id_val in self.id_to_token:
                decoded_tokens.append(self.id_to_token[id_val])
            else:
                # Fallback if ID is somehow not in the id_to_token map
                decoded_tokens.append(self.unk_token) 
        
        # A simple join will append tokens without considering original spacing.
        # This will often result in words being "squashed" together if they were
        # separated by spaces but then tokenized to adjacent BPE units.
        # A full BPE detokenizer needs to be aware of original spaces.
        # For conceptual demo, this is sufficient.
        return "".join(decoded_tokens).replace(" ", " ") # Replace internal spaces if any were merged


# --- Demonstration ---
# Sample corpus for training (larger corpus yields better BPE merges)
corpus = """
Hello, do you like tea? In the sunlit terraces of a grand palace,
she found solace. The sun shone brightly on the palace walls.
This is an example text to demonstrate byte pair encoding.
"""

# Instantiate and train the tokenizer
# A small vocab_size for demonstration purposes, to observe the merges.
# For practical LLMs, vocab_size is usually tens of thousands (e.g., 50257 for GPT-2).
bpe_tokenizer = SimpleBPETokenizer(vocab_size=100) # Reduced vocab_size for more merges to be visible
print("Training BPE tokenizer...")
bpe_tokenizer.train(corpus)
print("\nBPE Tokenizer Training Complete.")
print(f"Final Vocab Size: {len(bpe_tokenizer.token_to_id)}")
# print("Sample vocab (last 10):", list(bpe_tokenizer.token_to_id.items())[-10:])

# Test encoding and decoding
print("\n--- Testing Encoding and Decoding ---")
# The example text from the prompt, which might contain words that get broken down.
test_text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of palace."
print(f"Original Text: '{test_text}'")

encoded_ids = bpe_tokenizer.encode(test_text)
print(f"Encoded IDs: {encoded_ids}")

decoded_text = bpe_tokenizer.decode(encoded_ids)
print(f"Decoded Text: '{decoded_text}'")

print("\n--- Observations for BPE Tokenizer (from Scratch) ---")
print("1. **Handling Out-of-Vocabulary (OOV) Words:** Instead of a single <|unk|> token for a whole unknown word, BPE breaks words like 'Hello' or 'palace' down into smaller, known subword units (or individual characters) from its learned vocabulary. Only if even a single character isn't part of the initial character vocabulary, will <|unk|> appear.")
print("   - In this simple implementation, if a full segment or sub-segment isn't found, it defaults to <|unk|>.")
print("2. **Large Token IDs:** Special tokens or frequently merged tokens (subwords) often get higher IDs because they are added to the vocabulary later in the training process.")
print("3. **Subword Representation:** Observe how words are represented by combinations of smaller tokens. For example, 'terraces' might be tokenized as 'terr' + 'aces' if 'terraces' itself wasn't a merged token but those sub-parts were.")
print("4. **No Explicit Spaces in Decoded Output:** A basic BPE decoder just concatenates subword tokens. This often means spaces between original words are lost unless a special mechanism (like a leading space character in the token) is used. My `decode` has a `.replace(" ", " ")` that tries to put spaces back, but it's not perfect.")
print("\n(Note: This is a fundamental BPE implementation. Production-ready tokenizers like `tiktoken` handle nuances like byte-level encoding, whitespace preservation, and optimized lookup for performance and perfect round-trip capabilities.)")

Training BPE tokenizer...
BPE training complete. Final vocabulary size: 28

BPE Tokenizer Training Complete.
Final Vocab Size: 28

--- Testing Encoding and Decoding ---
Original Text: 'Hello, do you like tea? <|endoftext|> In the sunlit terraces of palace.'
Encoded IDs: [27, 4, 4, 5, 6, 7, 5, 8, 5, 9, 4, 10, 11, 3, 12, 3, 13, 14, 1, 1, 3, 15, 7, 5, 19, 12, 3, 25, 12, 1, 1, 10, 15, 12, 27, 16, 9, 15, 4, 10, 12, 12, 3, 17, 17, 13, 18, 3, 16, 5, 19, 21, 13, 4, 13, 18, 3, 22]
Decoded Text: 'hello,doyouliketea?<|unk|><|unk|>endoftext<|unk|><|unk|>inthesunlitterracesofpalace.'

--- Observations for BPE Tokenizer (from Scratch) ---
1. **Handling Out-of-Vocabulary (OOV) Words:** Instead of a single <|unk|> token for a whole unknown word, BPE breaks words like 'Hello' or 'palace' down into smaller, known subword units (or individual characters) from its learned vocabulary. Only if even a single character isn't part of the initial character vocabulary, will <|unk|> appear.
   - In this simple im