Skip to content

spellsaif/tenya

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

5 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Tenya: Compiled Distilled Unigram (CDU) Tokenizer

Tenya is a Rust-first, zero-allocation experimental tokenizer implementing a Compiled Distilled Unigram (CDU) architecture. It distills the segmentation decisions of a SentencePiece Unigram teacher model into a compiled, high-performance prefix matcher, shifting the computational complexity from inference runtime to compile/export time. (Named in brief inspiration of the swift, disciplined anime character Tenya Iida).


๐ŸŒ€ Philosophy & Core Design Principles

Tenya's architecture is guided by three core design principles focused on extreme efficiency and determinism:

  1. Raw Inference Speed: By completely removing the expensive dynamic programming path-searches (Viterbi) found in standard Unigram tokenizers, Tenya accelerates tokenization speeds by over 9.5x, processing up to 24 million tokens per second on a single CPU core.
  2. Strict Deterministic Rules: Tenya operates under a deterministic, strict ruleset. It compiles its vocabulary into a flat, contiguous trie and matches character sequences using deterministic algorithms (greedy longest-match or distilled priority-match), completely eliminating probabilistic path drift at runtime.
  3. Single-Pass Directness: Tenya tokenizes text in a single, non-recursive, forward-pass traversal. It does not backtrack, allocate temporary nodes on the heap, or re-evaluate past decisions, ensuring predictable performance.

๐Ÿง  Core Architecture: Compiled Distilled Unigram (CDU)

Traditional SentencePiece Unigram tokenization relies on probabilistic segmentation. During inference, it constructs a dense lattice representing all possible tokenizations of a string, evaluates their joint log-probabilities, and runs the Viterbi Algorithm (a dynamic programming search) to find the path that maximizes likelihood. While optimal for model alignment, this graph-construction step is highly CPU-bound and requires dynamic heap allocations.

Tenya's Compiled Distilled Unigram (CDU) architecture shifts this paradigm by moving the segmentation search from runtime inference to training/distillation time:

  • Compiled: The vocabulary pieces are compiled into a flat, contiguous memory-efficient prefix trie, allowing $O(1)$ child state transitions without pointer chasing or heap lookup overhead.
  • Distilled: The teacher's segmentation decisions over a corpus are distilled into static priority weights, removing all probability and dynamic programming path calculations from the hot path.
  • Unigram: The underlying vocabulary and base pieces originate from a SentencePiece Unigram model, preserving its vocabulary characteristics.
graph TD
    A[Raw Text Corpus] --> B[Train SentencePiece Unigram Teacher]
    B --> C[Teacher Model & Vocab]
    C --> D[Distill Policy: Tokenize Corpus & Count Frequency]
    D --> E[Export Tenya Vocab JSON: Normal + Byte Fallback + Priority]
    E --> F[Compile Tenya Rust Core: Flat Contiguous Trie]
    F --> G[Deterministic Fast Inference: LongestMatch / PriorityMatch]
Loading

1. The Compiled Trie (trie.rs)

To avoid heap allocation indirection and pointer chasing during tokenization, Tenya compiles its vocabulary into a flat, contiguous vector of TrieNode structs:

  • Each node stores its child transitions as a sorted list of (byte, next_node_index) tuples.
  • Transitions are resolved via a cache-friendly binary search on the sorted transitions.
  • Looking up prefixes is a zero-allocation operation, traversing a contiguous block of memory with near-zero cache misses.

2. The Distillation Policy (distill_policy.py & export_tenya_vocab.py)

Rather than keeping raw Unigram log-probabilities, Tenya applies a distillation formula. During distillation, the training corpus is tokenized by the Unigram teacher. We count how frequently each vocabulary piece is actually selected in context: $$\text{Priority} = \text{Teacher Log-Probability} + \alpha \cdot \ln(\text{Corpus Frequency} + 1)$$ This formula boosts the priority of tokens that are frequently chosen by the teacher. When Tenya performs its deterministic search, it uses these priorities to make split decisions that closely mimic the teacher.

3. Reversibility & Byte Fallback (fallback.rs & detokenize.rs)

To ensure that no characters are ever lost (e.g. unknown symbols, emojis, or non-Latin scripts), Tenya reserves the final 256 IDs in its vocabulary for explicit byte fallbacks (mapping directly to bytes 0x00 through 0xFF).

  • If the prefix matcher fails to find any matching vocabulary piece at a given offset, it falls back to the byte fallback token representing the current byte, advancing by 1.
  • During decoding, these byte tokens are written back to a raw buffer and parsed back into a UTF-8 string, guaranteeing a 100.00% reversibility rate.

๐Ÿ•น๏ธ Match Strategies

Tenya implements two deterministic inference matching strategies:

1. Greedy Longest-Match (MatchStrategy::LongestMatch)

  • Logic: At any offset, search the trie for all prefixes. Pick the longest matching token. If there is a tie in length, select the token with the highest priority score.
  • Characteristics: Fastest lookup, minimizes total token count, but has lower agreement with the teacher's stylistic segmentations.

2. Distilled-Priority Match (MatchStrategy::PriorityMatch)

  • Logic: Search the trie for all prefixes. Pick the token with the highest distilled priority score. If there is a tie in priority, select the longest matching token.
  • Characteristics: Aligns significantly closer to the teacher's segmentations (F1 agreement rises by ~5% and Recall jumps by ~16%), but produces slightly more, smaller tokens.

๐Ÿ“ก Deployments & Use Cases

Tenya is highly useful in the following scenarios:

  • Training LLMs from Scratch: When training a new transformer model from scratch, you are not bound to an existing tokenizer. You can train the SentencePiece model, distill it to Tenya, and use the Tenya Rust library to tokenize your dataset. The speedup saves massive amounts of data preprocessing CPU hours.
  • LLM Pretraining Preprocessing Pipelines: Tokenizing terabytes of raw text dataset is highly CPU-bound. Tenya's 24M tokens/sec throughput makes it an ideal data loading and preprocessing utility.
  • IoT, Edge, and Resource-Constrained Inference: Running standard tokenizers on low-power devices is constrained by memory allocation overhead. Tenya has a tiny memory footprint, does zero heap allocations in the hot path, and compiles into a single fast binary.
  • Real-time Log & Stream Processing: High-throughput indexing systems, search engines, and real-time log processors can use Tenya to quickly tokenize stream chunks.

โ˜ฏ๏ธ Trade-Offs: Advantages & Caveats

Advantages

  • Superhuman Speed: Latency is reduced by over 9.5x compared to SentencePiece Unigram (Python).
  • Zero Allocation: The hot-path tokenization loop (encode_into / encode_bytes_into) performs zero heap allocations, eliminating garbage collection or memory manager overhead.
  • 100% Round-trip Reversibility: Guarantees that any arbitrary sequence of characters can be tokenized and restored exactly to the original bytes.
  • Compact footprint: Uses a flat vector-based trie rather than dynamic nested maps.

Caveats & Limitations

  • Not a Drop-in Replacement for Pretrained Models: If you try to use Tenya with an already-trained model (like LLaMA or Mistral) without fine-tuning, the model will output gibberish. This is because Tenya's boundaries only agree with Unigram's boundaries by about 40-44%, resulting in different token IDs.
  • Sequence Fragmentation: Because Tenya uses a deterministic match, it tends to over-split text into smaller pieces compared to the teacher. This increases the total token count by 1.5x to 1.9x, slightly increasing the model context window length.
  • Spaces Mismatch: SentencePiece Unigram prepends a space symbol (U+2581) to the beginning of sentences. Since Tenya matches raw text bytes directly, it may fallback to characters for the very first word of a sentence (where no space exists).

๐Ÿ Empirical Benchmarks & Evaluation

The following evaluations were performed on a mixed-prose and code validation corpus (data/sample.txt):

Metric SentencePiece Unigram (Teacher) Tenya (LongestMatch) Tenya (PriorityMatch)
p50 Latency (ms) 0.415 ms 0.047 ms (8.8x faster) 0.124 ms (3.3x faster)
p95 Latency (ms) 0.890 ms 0.056 ms 0.149 ms
Throughput (MB/s) 3.67 MB/s 32.21 MB/s 12.25 MB/s
Throughput (Tokens/sec) 1.67 M tokens/s 23.30 M tokens/s 11.28 M tokens/s
Teacher F1 Agreement 100.00% (Baseline) 38.57% 43.80% (+5.23%)
Boundary Precision 100.00% 31.93% 33.06%
Boundary Recall 100.00% 48.69% 64.87% (+16.18%)
Token Count Ratio (vs Teacher) 1.0000 1.5469 1.9798
Reversibility Success Rate 100.00% 100.00% 100.00%

๐Ÿ’ป Quickstart & Executables

1. Setup the Environment

python3 -m venv .venv
source .venv/bin/activate
pip install sentencepiece

2. Run the Full Distillation Pipeline

# Step A: Train the SentencePiece Unigram teacher
python3 python/train_teacher.py --input data/sample.txt --vocab-size 500 --model-prefix teacher

# Step B: Distill the teacher's segmentations on the corpus to collect frequencies
python3 python/distill_policy.py --input data/sample.txt --model teacher.model --output policy.json

# Step C: Export the final Tenya JSON vocabulary
python3 python/export_tenya_vocab.py --model teacher.model --policy policy.json --output vocab.json

3. Run the Rust CLI

Compile the project:

cargo build --release

Encode text:

./target/release/tenya-cli --vocab vocab.json encode --text "hello world"

Decode token IDs:

./target/release/tenya-cli --vocab vocab.json decode --ids "243 231 78 276 52 77"

Benchmark performance natively:

./target/release/tenya-cli --vocab vocab.json bench --file data/sample.txt --iterations 1000

4. Evaluate Alignment and Speed

Compare Tenya and SentencePiece side-by-side:

python3 python/evaluate.py --corpus data/sample.txt --model teacher.model --vocab vocab.json --strategy priority

5. Run Microbenchmarks

Evaluate execution time using Criterion:

cargo bench

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors