## Tokenizing a Short Story for LLM Training (Educational Version)

<div style="background-color: #d4edda; padding: 15px; border-radius: 5px; color: #155724;">

In this notebook, we aim to tokenize a 20,479-character short story, The Verdict by Edith Wharton, into a sequence of individual words and special characters to simulate preprocessing steps used in Large Language Model (LLM) training. 

</div>

<div style="background-color: #d1ecf1; padding: 15px; border-radius: 5px; color: #0c5460;">

While LLMs are typically trained on gigabytes of text from millions of documents, we use this short sample for educational purposes and to keep the runtime short on consumer hardware. We begin by reading the entire file into memory and printing the character count and a sample of the content for context.

</div>

## Step 1: Reading the Raw Text

In [11]:
# Load the raw text file
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Print total number of characters and a sample
print("Total number of characters:", len(raw_text))
print("First 99 characters:\n", raw_text[:99])

Total number of characters: 20479
First 99 characters:
 I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


<div style="background-color: #fff3cd; padding: 15px; border-radius: 5px; color: #856404;">

To tokenize the text, we use Python’s `re` (regular expressions) module and split on punctuation marks, double dashes, and whitespace: `re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)`. Because the pattern is wrapped in a capturing group, `re.split` keeps the matched delimiters in its output, so punctuation and whitespace are retained as separate elements rather than discarded.

</div>
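
To see why the capturing group matters, the sketch below applies the same pattern to a short made-up sentence and compares it with a non-capturing variant. The `sample` string and the variable names are assumptions for illustration only; they are not part of the story or the notebook.

```python
import re

# Hypothetical sample sentence, used only for illustration
sample = "Hello, world. Is this-- a test?"

# Capturing group (...): matched delimiters are kept in the result
with_delims = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Non-capturing group (?:...): matched delimiters are dropped
without_delims = re.split(r'(?:[,.:;?_!"()\']|--|\s)', sample)

print("With capturing group:   ", with_delims)
print("Without capturing group:", without_delims)
```

Both lists also contain empty strings wherever two delimiters sit next to each other (for example a comma followed by a space), which is why a filtering step is still needed afterwards.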

<div style="background-color: #e2e3e5; padding: 15px; border-radius: 5px; color: #383d41;">

We then remove empty strings and whitespace-only entries from the resulting list by calling `strip()` on each item and keeping only the items that are non-empty afterwards. The result is a list of clean, discrete tokens that can be mapped to token IDs and used for embedding generation or other downstream NLP tasks. Discarding whitespace keeps our approach simple, but the decision is task-dependent: retaining whitespace may be important for applications involving structured or indentation-sensitive text (e.g., programming code).

</div>
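
As a rough illustration of that trade-off, the following sketch runs both filtering strategies on a made-up, indentation-sensitive snippet. The `sample` string and the names `no_whitespace` and `with_whitespace` are hypothetical and chosen only for this example.

```python
import re

# Hypothetical indentation-sensitive sample, used only for illustration
sample = "if x:\n    return (x, 1)"

pieces = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Approach used in this notebook: drop empty strings and whitespace-only items
no_whitespace = [p.strip() for p in pieces if p.strip()]

# Alternative: drop only the empty strings and keep whitespace tokens
with_whitespace = [p for p in pieces if p != ""]

print(no_whitespace)
print(with_whitespace)

# With whitespace kept, the original text can be rebuilt exactly
print("".join(with_whitespace) == sample)
```

With whitespace removed, the newline and the four-space indent are lost, so the original layout cannot be reconstructed from the tokens alone.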

## Step 2: Basic Tokenization Using Regular Expressions

In [14]:
import re

# Split on various punctuation and whitespace characters, keeping them as separate tokens
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)

# Remove empty strings and pure whitespace tokens
preprocessed = [item.strip() for item in preprocessed if item.strip()]

# Show the first 30 tokens for inspection
print("First 30 tokens:\n", preprocessed[:30])

First 30 tokens:
 ['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


<div style="background-color: #d4edda; padding: 15px; border-radius: 5px; color: #155724;">

This simplified tokenizer demonstrates key principles of tokenization and prepares us for transitioning to pre-built tokenizers from libraries like Hugging Face Transformers, spaCy, or SentencePiece, which handle more complex linguistic phenomena and are optimized for modern LLM workflows.

</div>
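
One concrete limitation can be sketched by wrapping the two steps from Step 2 in a small function and feeding it a made-up sentence. The helper name `simple_tokenize` and the example sentence are assumptions for illustration, not part of the original notebook.

```python
import re

def simple_tokenize(text):
    """Hypothetical helper: split on punctuation/whitespace, then drop empty items."""
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [p.strip() for p in pieces if p.strip()]

# Made-up sentence containing a contraction and hyphenated words
tokens = simple_tokenize("I didn't re-read the half-finished story.")
print(tokens)
```

Because the apostrophe is one of the split characters, the contraction `didn't` comes out as three tokens (`didn`, `'`, `t`), while hyphenated words such as `re-read` stay whole since only the double dash `--` is treated as a delimiter. Cases like these are part of what dedicated tokenizer libraries are designed to handle more consistently.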