### <font color='red'>  Tokenizer Design and Training</font> 

Tokenization breaks raw text into discrete units (tokens) that a model can process. Different strategies balance vocabulary size, handling of rare words, and efficiency. 

Common strategies for subword-level vocab building include:
- Byte-Pair Encoding (BPE)
- WordPiece
- Unigram Language Model (SentencePiece)
- Character-level
- Whitespace/Regex


#### 1. Byte-Pair Encoding (BPE)

- Builds vocabulary by iteratively merging the most frequent pair of symbols (bytes or characters).
- Captures common subwords (e.g., “ing”, “tion”).
- Used by GPT and RoBERTa.
  
**Advantages**
- Compact vocabulary
- Efficient for languages with rich morphology
  
**Disadvantages**
- Greedy merges can miss some patterns
- Fixed merge order


In [1]:
# 1. Install the tokenizers library if you haven't already:
#    pip install tokenizers

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors
from pathlib import Path

# 2. Prepare your training data: a directory of plain .txt files
data_dir = Path("./")       # e.g. contains file1.txt, file2.txt, …
files = [str(p) for p in data_dir.glob("*.jsonl")]

# 3. Initialize a BPE tokenizer
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# 4. Pre-tokenize on bytes + whitespace
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)

# 5. Set up the BPE trainer
trainer = trainers.BpeTrainer(
    vocab_size=50_000,              # target vocab size
    min_frequency=2,                # ignore subwords that occur <2 times
    show_progress=True,
    special_tokens=[
        "[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"
    ],
)

# 6. Train the tokenizer on your files
tokenizer.train(files, trainer)

# 7. Post-processing: add special tokens around sequences
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# 8. Set up a decoder for human-readable output
tokenizer.decoder = decoders.ByteLevel()

# 9. Save the trained tokenizer to disk
tokenizer.save("bpe-tokenizer.json")

# 10. Quick test: encode & decode a sample sentence
sample = "Transformers are powerful models for NLP."
encoded = tokenizer.encode(sample)

print("Tokens:", encoded.tokens)
print("IDs:   ", encoded.ids)

decoded = tokenizer.decode(encoded.ids)
print("Decoded:", decoded)




Tokens: ['ĠTrans', 'form', 'ers', 'Ġare', 'Ġpowerful', 'Ġmodels', 'Ġfor', 'ĠN', 'LP', '.']
IDs:    [3426, 759, 295, 350, 5322, 4668, 287, 347, 17886, 18]
Decoded:  Transformers are powerful models for NLP.


**Explanation of key steps:**

- Data files: point data_dir to a folder of raw text (.txt) files.
- models.BPE: initializes a BPE subword model with an unknown token.
- ByteLevel pre-tokenizer: splits on byte-level whitespace and punctuation for robustness.
- BpeTrainer: merges the most frequent symbol pairs up to vocab_size, ignoring very rare tokens.
- post_processor and decoder: ensure encoded sequences retain special-token semantics and decode nicely.
- Save: you can later load this JSON into transformers via
  PreTrainedTokenizerFast(tokenizer_file="bpe-tokenizer.json").
  
This pipeline yields a BPE subword vocabulary tuned to your corpus—ideal for LLM pretraining or fine-tuning


#### 2. Wordpiece Encoding

In [2]:
# Sample training corpus
corpus = [
    "Hello world!",
    "Tokenization is fun.",
    "Transformers are powerful models.",
    "Natural language processing."
]

# Sentence to encode/decode
sample = "Transformers are powerful models for NLP."



In [3]:
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize WordPiece tokenizer
wp_tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
wp_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Trainer: learn merges under a likelihood model
wp_trainer = trainers.WordPieceTrainer(
    vocab_size=500,
    min_frequency=1,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
wp_tokenizer.train_from_iterator(corpus, trainer=wp_trainer)

# Decoder for WordPiece (handles '##' prefix)
wp_tokenizer.decoder = decoders.WordPiece(prefix="##")

# Encode & decode
encoded = wp_tokenizer.encode(sample)
print("\nWordPiece Tokens:", encoded.tokens)
print("IDs:   ", encoded.ids)
print("WordPiece Decoded:", wp_tokenizer.decode(encoded.ids))





WordPiece Tokens: ['Transformers', 'are', 'powerful', 'models', 'f', '##or', '[UNK]', '.']
IDs:    [105, 92, 108, 107, 14, 51, 1, 6]
WordPiece Decoded: Transformers are powerful models for.


#### 3. Unigram Language Model (SentencePiece-style)
A probabilistic model that starts from a large seed vocab and prunes under an EM algorithm to maximize data likelihood.


In [4]:
from tokenizers import Tokenizer, models, pre_tokenizers, trainers, decoders

# Sample corpus and sentence
corpus = [
    "Hello world!",
    "Tokenization is fun.",
    "Transformers are powerful models.",
    "Natural language processing."
]
sample = "Transformers are powerful models for NLP."

# 1. Initialize Unigram model (no unk_token here!)
uni_model = models.Unigram()
uni_tokenizer = Tokenizer(uni_model)

# 2. Pre-tokenizer
uni_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# 3. Trainer with unk_token specified
uni_trainer = trainers.UnigramTrainer(
    vocab_size=500,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    unk_token="[UNK]"  # This is where you define it
)

# 4. Train the tokenizer
uni_tokenizer.train_from_iterator(corpus, trainer=uni_trainer)

# 6. Optional: add decoder
uni_tokenizer.decoder = decoders.ByteLevel()

# 7. Encode and decode
encoded = uni_tokenizer.encode(sample)
print("\nUnigram Tokens:", encoded.tokens)
print("IDs:   ", encoded.ids)
print("Unigram Decoded:", uni_tokenizer.decode(encoded.ids))




Unigram Tokens: ['T', 'ra', 'n', 's', 'f', 'or', 'm', 'er', 's', 'a', 'r', 'e', 'p', 'o', 'w', 'er', 'fu', 'l', 'm', 'o', 'd', 'el', 's', 'f', 'or', 'N', 'LP', '.']
IDs:    [17, 26, 10, 6, 28, 24, 18, 23, 6, 11, 13, 8, 15, 5, 16, 23, 21, 7, 18, 5, 19, 25, 6, 28, 24, 30, 1, 12]
Unigram Decoded: TransformersarepowerfulmodelsforN.


#### 4. Character-Level Tokenization
Every character becomes a token. No subwords or merges—useful for robust handling of any text.


In [5]:
def char_tokenize(text):
    return list(text)

tokens = char_tokenize(sample)
print("\nCharacter Tokens:", tokens)
# Decoded is just joining chars
print("Character Decoded:", "".join(tokens))


Character Tokens: ['T', 'r', 'a', 'n', 's', 'f', 'o', 'r', 'm', 'e', 'r', 's', ' ', 'a', 'r', 'e', ' ', 'p', 'o', 'w', 'e', 'r', 'f', 'u', 'l', ' ', 'm', 'o', 'd', 'e', 'l', 's', ' ', 'f', 'o', 'r', ' ', 'N', 'L', 'P', '.']
Character Decoded: Transformers are powerful models for NLP.


#### 5. Whitespace/Regex Tokenization
Splits on spaces or simple regex. Fast but yields large vocabularies and poor subword handling.


In [6]:
import re

# Simple whitespace split
ws_tokens = sample.split()
print("\nWhitespace Tokens:", ws_tokens)

# Regex: split on non-word characters
regex_tokens = re.findall(r"\w+|[^\s\w]+", sample)
print("Regex Tokens:", regex_tokens)


Whitespace Tokens: ['Transformers', 'are', 'powerful', 'models', 'for', 'NLP.']
Regex Tokens: ['Transformers', 'are', 'powerful', 'models', 'for', 'NLP', '.']


#### Summary
Each method balances trade-offs between vocabulary size, handling of rare words, and computational cost.
- Use BPE or WordPiece for most LLMs (compact, subword-aware).
- Choose Unigram when you want a probabilistic split (multilingual corpora).
- Opt for Character when you need full coverage without OOV tokens.
- Whitespace/Regex is best for quick prototyping or tasks where subword modeling is less critical.
