████████╗ ██████╗ ██╗  ██╗███████╗███╗   ██╗██╗███████╗███████╗    ██████╗ 
╚══██╔══╝██╔═══██╗██║ ██╔╝██╔════╝████╗  ██║██║╚══███╔╝██╔════╝    ╚════██╗
   ██║   ██║   ██║█████╔╝ █████╗  ██╔██╗ ██║██║  ███╔╝ █████╗       █████╔╝
   ██║   ██║   ██║██╔═██╗ ██╔══╝  ██║╚██╗██║██║ ███╔╝  ██╔══╝      ██╔═══╝ 
   ██║   ╚██████╔╝██║  ██╗███████╗██║ ╚████║██║███████╗███████╗    ███████╗
   ╚═╝    ╚═════╝ ╚═╝  ╚═╝╚══════╝╚═╝  ╚═══╝╚═╝╚══════╝╚══════╝    ╚══════╝
Tokenize-2 is a fast, extensible tokenizer designed for LLM and AGI research.
It includes byte-level encoding, multiple merge strategies (BPE, frequency, entropy, context),
parallel batch tokenization, special token handling, and customizable out-of-vocabulary logic.
• Byte-level tokenization (UTF-8 / UTF-16)
• Stable byte-to-unicode mapping (sketched below)
• Multiple merge strategies:
  - bpe: classic byte-pair encoding
  - frequency: frequency-based merges
  - entropy: entropy-minimizing merges
  - context: reserved for future expansion
• Special token support (<pad>, <unk>, custom)
• OOV handling:
  - split: break unknown tokens into raw bytes
  - approximation: map unknown tokens to a fallback token
• Parallel batch tokenization using multiprocessing
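For intuition, here is a minimal sketch of a stable byte-to-unicode mapping in the style of GPT-2's byte-level BPE: every possible byte is assigned a fixed printable character, so any input can be represented losslessly before merging. This illustrates the idea only; Tokenize-2's actual table may differ.

def bytes_to_unicode():
    # Bytes that already render as printable characters map to themselves;
    # the rest are shifted into an unused Unicode range, so every one of the
    # 256 byte values gets a stable, visible character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("\xa1"), ord("\xac") + 1))
        + list(range(ord("\xae"), ord("\xff") + 1))
    )
    codepoints = printable[:]
    n = 0
    for b in range(256):
        if b not in printable:
            printable.append(b)
            codepoints.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(printable, codepoints)}

mapping = bytes_to_unicode()
text = "héllo"
mapped = "".join(mapping[b] for b in text.encode("utf-8"))
print(mapped)  # every byte rendered as one printable character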
Clone the repository:
git clone https://github.com/yourusername/tokenize2.git
cd tokenize2
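If the repository ships Python packaging metadata (a setup.py or pyproject.toml; that is an assumption, neither is shown here), an editable install makes the import below available outside the repo root:

pip install -e .

Otherwise, run the following examples from the repository root so the tokenizer module is importable.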
- Initialize:
from tokenizer import Tokenize2
tokenizer = Tokenize2(
    vocab_size=5000,
    merge_strategy="bpe",
    encoding="utf-8",
    special_tokens=["<pad>", "<unk>", "<bos>", "<eos>"],
    oov_strategy="split"
)
- Train:
corpus = [
    "This is an example.",
    "Tokenize-2 is a powerful tokenizer framework."
]
tokenizer.train_tokenizer(corpus)
- Tokenize:
tokens = tokenizer.tokenize("Tokenizers are cool!")
print(tokens)
- Batch tokenization:
texts = ["Hello world!", "This is Tokenize-2."] output = tokenizer.tokenize_batch(texts, num_processes=4)
- Save vocabulary:
tokenizer.save_vocab("vocab.json")
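To sanity-check the saved file, you can read it back with the standard json module. This assumes save_vocab writes a plain JSON object mapping tokens to IDs; the exact schema isn't documented here, so inspect the file before relying on it:

import json

with open("vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)

print(len(vocab), "entries")  # should not exceed vocab_size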
Under the hood, Tokenize-2 performs (see the sketch after this list):
- Byte-level pre-tokenization
- Pair frequency or entropy scanning
- Merge rule learning
- Vocabulary construction
- Fast runtime lookup + OOV fallback
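For intuition, here is a minimal, self-contained sketch of the classic merge-learning loop behind the bpe strategy. It illustrates the algorithm only, not Tokenize-2's actual implementation, which also covers entropy scanning, special tokens, and OOV fallback:

from collections import Counter

def learn_merges(words, num_merges):
    # Each word starts as a tuple of byte values (byte-level pre-tokenization).
    corpus = Counter(tuple(w.encode("utf-8")) for w in words)
    merges = []
    for _ in range(num_merges):
        # Scan adjacent-pair frequencies across the weighted corpus.
        pairs = Counter()
        for word, freq in corpus.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule everywhere before the next scan.
        new_corpus = Counter()
        for word, freq in corpus.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(best)  # the pair becomes one fused symbol
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

print(learn_merges(["low", "lower", "lowest"], num_merges=3))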
Planned:
• Context-based merge training
• Rust backend
• GPU-accelerated merges
• Learned OOV approximation
• Pre-built vocab sets for TNSA models
Released under the TNSA OpenWeight License.
Star the repo if you like Tokenize-2!