████████╗ ██████╗ ██╗  ██╗███████╗███╗   ██╗██╗███████╗███████╗    ██████╗ 
╚══██╔══╝██╔═══██╗██║ ██╔╝██╔════╝████╗  ██║██║╚══███╔╝██╔════╝    ╚════██╗
   ██║   ██║   ██║█████╔╝ █████╗  ██╔██╗ ██║██║  ███╔╝ █████╗       █████╔╝
   ██║   ██║   ██║██╔═██╗ ██╔══╝  ██║╚██╗██║██║ ███╔╝  ██╔══╝      ██╔═══╝ 
   ██║   ╚██████╔╝██║  ██╗███████╗██║ ╚████║██║███████╗███████╗    ███████╗
   ╚═╝    ╚═════╝ ╚═╝  ╚═╝╚══════╝╚═╝  ╚═══╝╚═╝╚══════╝╚══════╝    ╚══════╝
Tokenize-2 is a fast, extensible tokenizer designed for LLM and AGI research.
It includes byte-level encoding, multiple merge strategies (BPE, frequency, entropy, context),
parallel batch tokenization, special token handling, and customizable out-of-vocabulary logic.
• Byte-level tokenization (UTF-8 / UTF-16)
• Stable byte-to-unicode mapping (sketched below)
• Multiple merge strategies:
  - bpe: classic byte-pair encoding
  - frequency: frequency-based merges
  - entropy: entropy-minimizing merges
  - context: reserved for future expansion
• Special token support (<pad>, <unk>, custom)
• OOV handling:
  - split: break unknown tokens into raw bytes
  - approximation: map unknown tokens to a fallback token
• Parallel batch tokenization using multiprocessing
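For intuition, here is a minimal sketch of a stable byte-to-unicode mapping in the style of GPT-2's byte-level BPE: every possible byte is assigned a fixed printable character, so any input can be represented losslessly before merging. This illustrates the idea only; Tokenize-2's actual table may differ.

def bytes_to_unicode():
    # Bytes that already render as printable characters map to themselves;
    # the rest are shifted into an unused Unicode range, so every one of the
    # 256 byte values gets a stable, visible character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("\xa1"), ord("\xac") + 1))
        + list(range(ord("\xae"), ord("\xff") + 1))
    )
    codepoints = printable[:]
    n = 0
    for b in range(256):
        if b not in printable:
            printable.append(b)
            codepoints.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(printable, codepoints)}

mapping = bytes_to_unicode()
text = "héllo"
mapped = "".join(mapping[b] for b in text.encode("utf-8"))
print(mapped)  # every byte rendered as one printable character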
Clone the repository:
git clone https://github.com/yourusername/tokenize2.git
cd tokenize2
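If the repository ships Python packaging metadata (a setup.py or pyproject.toml; that is an assumption, neither is shown here), an editable install makes the import below available outside the repo root:

pip install -e .

Otherwise, run the following examples from the repository root so the tokenizer module is importable.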
- Initialize:
from tokenizer import Tokenize2
tokenizer = Tokenize2(
    vocab_size=5000,
    merge_strategy="bpe",
    encoding="utf-8",
    special_tokens=["<pad>", "<unk>", "<bos>", "<eos>"],
    oov_strategy="split"
)
- Train:
corpus = [
    "This is an example.",
    "Tokenize-2 is a powerful tokenizer framework."
]
tokenizer.train_tokenizer(corpus)
- Tokenize:
tokens = tokenizer.tokenize("Tokenizers are cool!")
print(tokens)
- Batch tokenization:
texts = ["Hello world!", "This is Tokenize-2."] output = tokenizer.tokenize_batch(texts, num_processes=4)
- Save vocabulary:
tokenizer.save_vocab("vocab.json")
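To sanity-check the saved file, you can read it back with the standard json module. This assumes save_vocab writes a plain JSON object mapping tokens to IDs; the exact schema isn't documented here, so inspect the file before relying on it:

import json

with open("vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)

print(len(vocab), "entries")  # should not exceed vocab_size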
Under the hood, Tokenize-2 performs (see the sketch after this list):
- Byte-level pre-tokenization
- Pair frequency or entropy scanning
- Merge rule learning
- Vocabulary construction
- Fast runtime lookup + OOV fallback
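For intuition, here is a minimal, self-contained sketch of the classic merge-learning loop behind the bpe strategy. It illustrates the algorithm only, not Tokenize-2's actual implementation, which also covers entropy scanning, special tokens, and OOV fallback:

from collections import Counter

def learn_merges(words, num_merges):
    # Each word starts as a tuple of byte values (byte-level pre-tokenization).
    corpus = Counter(tuple(w.encode("utf-8")) for w in words)
    merges = []
    for _ in range(num_merges):
        # Scan adjacent-pair frequencies across the weighted corpus.
        pairs = Counter()
        for word, freq in corpus.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule everywhere before the next scan.
        new_corpus = Counter()
        for word, freq in corpus.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(best)  # the pair becomes one fused symbol
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

print(learn_merges(["low", "lower", "lowest"], num_merges=3))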
Planned:
• Context-based merge training
• Rust backend
• GPU-accelerated merges
• Learned OOV approximation
• Pre-built vocab sets for TNSA models
Released under the TNSA OpenWeight License.
Star the repo if you like Tokenize-2!