## Tokenizing a Short Story for LLM Training (Educational Version)

<div style="background-color: #d4edda; padding: 15px; border-radius: 5px; color: #155724;">

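In this notebook, we tokenize a 20,479-character short story, The Verdict by Edith Wharton, into a sequence of individual words and punctuation characters, simulating the preprocessing steps used in Large Language Model (LLM) training. The copy loaded below has one extra sentence prepended, so its reported character count is slightly higher (20,541).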

</div>

<div style="background-color: #d1ecf1; padding: 15px; border-radius: 5px; color: #0c5460;">

While LLMs are typically trained on gigabytes of text data from millions of documents, we use this shorter text sample for educational purposes and to ensure quick runtime on consumer hardware. We begin by reading the entire file into memory, then print the character count and a sample of the content for context.

</div>

## Step 1: Reading the Raw Text

In [5]:
# Load the raw text file
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Print total number of characters and a sample
print("Total number of characters:", len(raw_text))
print("First 100 characters:\n", raw_text[:100])

Total number of characters: 20541
First 100 characters:
 This is a sample text data we took for the training purpose.

I HAD always thought Jack Gisburn rath


<div style="background-color: #fff3cd; padding: 15px; border-radius: 5px; color: #856404;">

To tokenize the text, we use Python’s re (regular expressions) module, splitting the text into words and punctuation marks with the pattern re.split(r'([,.:;?_!"()\']|--|\s)', raw_text). Because the whole pattern sits inside capturing parentheses, re.split returns the delimiters too, so every punctuation mark and whitespace character initially appears as its own list element instead of being discarded.

</div>
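
The effect of the capturing parentheses is easiest to see side by side. The following is a minimal sketch on a made-up sample sentence (not taken from the story); the variable names are illustrative only.

```python
import re

sample = "Hello, world. Is this-- a test?"  # hypothetical sample string

# Without a capturing group, re.split drops the delimiters
without_group = re.split(r'[,.:;?_!"()\']|--|\s', sample)

# With a capturing group, each delimiter is returned as its own list element
with_group = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

print("Without group:", without_group)
print("With group:   ", with_group)
```

Both results still contain empty strings wherever two delimiters are adjacent (for example, a comma followed by a space), which is why a filtering step is needed afterwards.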

<div style="background-color: #e2e3e5; padding: 15px; border-radius: 5px; color: #383d41;">

We then remove empty strings and pure whitespace from the resulting list using a combination of strip() and a filtering condition. The result is a list of clean, discrete tokens that can be used as input for embedding generation or other downstream NLP tasks. While whitespace is discarded in our approach for simplicity, this decision is task-dependent: retaining whitespace may be important for applications involving structured or indentation-sensitive text (e.g., programming code).

</div>
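
To make the whitespace trade-off concrete, here is a small sketch that applies both filtering strategies to an indentation-sensitive snippet. The input string and variable names are assumptions chosen for illustration.

```python
import re

code_line = "if x:\n    return y"  # hypothetical indentation-sensitive input

pieces = re.split(r'([,.:;?_!"()\']|--|\s)', code_line)

# Variant used in this notebook: drop empty strings and whitespace-only tokens
no_whitespace = [p.strip() for p in pieces if p.strip()]

# Alternative: drop only the empty strings and keep whitespace tokens
with_whitespace = [p for p in pieces if p != ""]

print(no_whitespace)
print(with_whitespace)

# Only the whitespace-preserving variant can rebuild the input exactly
print("".join(with_whitespace) == code_line)
```

With whitespace removed, the newline and the four-space indent are gone and cannot be recovered from the tokens alone; the second variant keeps them at the cost of a longer token sequence.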

## Step 2: Basic Tokenization Using Regular Expressions

In [9]:
import re

# Split on various punctuation and whitespace characters, keeping them as separate tokens
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)

# Remove empty strings and pure whitespace tokens
preprocessed = [item.strip() for item in preprocessed if item.strip()]

# Show the first 30 tokens for inspection
print("First 30 tokens:\n", preprocessed[:30])

First 30 tokens:
 ['This', 'is', 'a', 'sample', 'text', 'data', 'we', 'took', 'for', 'the', 'training', 'purpose', '.', 'I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--']


<div style="background-color: #d4edda; padding: 15px; border-radius: 5px; color: #155724;">

This simplified tokenizer demonstrates key principles of tokenization and prepares us for transitioning to pre-built tokenizers from libraries like Hugging Face Transformers, spaCy, or SentencePiece, which handle more complex linguistic phenomena and are optimized for modern LLM workflows.

</div>

## Building a Custom Tokenizer

<div style="background-color: #fff3cd; padding: 15px; border-radius: 5px; color: #856404;">

From the preprocessed text, we constructed a vocabulary by identifying all unique tokens and assigning them integer IDs, forming the foundation for converting text into numerical form.

</div>

## Step 1: Import and Prepare Vocabulary

In [13]:
import re

# Assume 'preprocessed' is the list of tokens produced in the previous section
# Create a sorted list of unique tokens from the training data
all_tokens = sorted(set(preprocessed))

# Add special tokens
all_tokens.extend(["<|unk|>", "<|endoftext|>"])

# Create the vocabulary (token to ID); the inverse mapping is built inside the tokenizer class
vocab = {token: integer for integer, token in enumerate(all_tokens)}

# Display the last 5 entries in the vocabulary
for item in list(vocab.items())[-5:]:
    print(item)

('younger', 1133)
('your', 1134)
('yourself', 1135)
('<|unk|>', 1136)
('<|endoftext|>', 1137)


<div style="background-color: #f8d7da; padding: 15px; border-radius: 5px; color: #721c24;">

Initially, we created a basic tokenizer class that could encode input text into token IDs using the vocabulary and decode the IDs back into text. However, this version failed on unknown words, that is, tokens that did not appear in the training text: looking them up in the vocabulary raised a KeyError.

</div>
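
The failure mode described above can be reproduced in a few lines. This is only a sketch: `toy_vocab` and `basic_encode` are hypothetical stand-ins for the real vocabulary and the first tokenizer version, which is not shown in this notebook.

```python
import re

# Toy vocabulary standing in for the one built from the story (assumption)
toy_vocab = {"do": 0, "you": 1, "like": 2, "tea": 3, "?": 4}

def basic_encode(text, vocab):
    """Encode text with no fallback for unknown tokens (hypothetical helper)."""
    tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [t.strip() for t in tokens if t.strip()]
    return [vocab[t] for t in tokens]  # raises KeyError on unseen tokens

# Every token is in the vocabulary, so this works
print(basic_encode("do you like tea?", toy_vocab))

# "Hello" is not in the vocabulary, so the lookup fails
try:
    basic_encode("Hello, do you like tea?", toy_vocab)
    failed_token = None
except KeyError as err:
    failed_token = err.args[0]
    print("KeyError for token:", failed_token)
```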


<div style="background-color: #d4edda; padding: 15px; border-radius: 5px; color: #155724;">

To solve this, we enhanced the basic tokenizer by introducing two special tokens: <|unk|> to represent unknown tokens and <|endoftext|> to mark the end of a text, which serves as a document boundary. Both were appended to the vocabulary, and logic was added to the encode() method to replace any out-of-vocabulary token with <|unk|>, so that new or unseen input no longer raises an error. In addition, encode() appends <|endoftext|> to every input to mimic GPT-like tokenization, where this marker separates independent documents when training on multiple texts. The decode() method joins tokens with single spaces and then removes the extra space that this leaves before punctuation, producing readable sentences.

</div>
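
As a rough illustration of the spacing fix in decode(), the sketch below applies the same join-then-substitute idea to two made-up token lists. It also shows one limitation of this simple rule: the space before an opening quotation mark is removed as well.

```python
import re

PUNCT_SPACING = r'\s+([,.:;?!"()\'])'  # whitespace followed by a punctuation mark

tokens = ["Hello", ",", "do", "you", "like", "tea", "?"]  # hypothetical token list
joined = " ".join(tokens)
cleaned = re.sub(PUNCT_SPACING, r'\1', joined)

print(joined)
print(cleaned)

# Limitation: opening quotes are treated like any other punctuation mark
quoted_tokens = ["said", '"', "hi", '"']  # hypothetical token list
quoted = re.sub(PUNCT_SPACING, r'\1', " ".join(quoted_tokens))
print(quoted)
```

A more careful decoder would need to distinguish opening from closing quotes and parentheses; the simple rule is kept here because it is sufficient for the examples in this notebook.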

## Step 2: Tokenizer Class Definition

In [24]:
class CustomTokenizer:
    
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}
        
    def encode(self, text):
        # Split the input text into tokens (words, punctuation, spaces)
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        
        # Replace unknown tokens with <|unk|>
        preprocessed = [
            item if item in self.str_to_int else "<|unk|>" 
            for item in preprocessed
        ]

        # Convert tokens to token IDs
        ids = [self.str_to_int[s] for s in preprocessed]

        # Append end-of-text token
        ids.append(self.str_to_int["<|endoftext|>"])
        return ids
        
    def decode(self, ids):
        # Convert token IDs back to string tokens
        tokens = [self.int_to_str.get(i, "<|unk|>") for i in ids]

        # Join tokens into a string
        text = " ".join(tokens)

        # Fix spacing before punctuation
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

## Step 3: Instantiate and Test the Tokenizer

In [27]:
# Instantiate tokenizer
tokenizer = CustomTokenizer(vocab)

# Test it
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))

## Step 4: Encode and Decode the Sample Text

In [30]:
# Encode
ids = tokenizer.encode(text)
print("Token IDs:", ids)

# Decode
decoded_text = tokenizer.decode(ids)
print("Decoded Text:", decoded_text)


Token IDs: [1136, 5, 356, 1132, 629, 978, 10, 1137, 55, 992, 959, 987, 723, 992, 1136, 7, 1137]
Decoded Text: <|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>. <|endoftext|>


<div style="background-color: #e2e3e5; padding: 15px; border-radius: 5px; color: #383d41;">

Overall, this tokenizer provides a foundational understanding of how text is transformed into machine-readable formats for LLMs, and how even simple rule-based methods must gracefully handle exceptions, unknowns, and formatting nuances to be effective.

</div>