feat: add compression ratio tracking to FastBPETrainer by AmitMY · Pull Request #21 · sign-language-processing/complex-tokenization

AmitMY · 2026-04-08T11:03:05Z

Summary

- Track per-step stats in FastBPETrainer.stats: merged pair, pair frequency, total tokens, and compression ratio
- Compression ratio = fraction of tokens eliminated relative to the initial token count
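As a rough illustration of the definition above, the ratio can be read as one minus the share of tokens remaining. The sketch below is not the trainer's actual code: the `MergeStat` record, its field names, and the `record_step` helper are hypothetical stand-ins for whatever FastBPETrainer.stats stores.

```python
from dataclasses import dataclass


@dataclass
class MergeStat:
    """Hypothetical per-step record; the field names are assumptions."""
    pair: tuple[bytes, bytes]
    frequency: int
    total_tokens: int
    compression_ratio: float


def compression_ratio(initial_tokens: int, current_tokens: int) -> float:
    """Fraction of the initial tokens that merging has eliminated so far."""
    if initial_tokens == 0:
        # Assumption: an empty corpus is reported as "no compression".
        return 0.0
    return 1.0 - current_tokens / initial_tokens


def record_step(pair, frequency, total_tokens, initial_tokens) -> MergeStat:
    """Build one stats entry from the token count observed after a merge."""
    ratio = compression_ratio(initial_tokens, total_tokens)
    return MergeStat(pair, frequency, total_tokens, ratio)
```

Under this reading, a corpus that starts at 100 tokens and holds 75 after some merges has a compression ratio of 0.25.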

Stacked on #20.

What improved

- Can now analyze how compression evolves during training
- Useful for comparing tokenizer variants

Test plan

- 5 new compression tests pass
- ruff check . passes

🤖 Generated with Claude Code

- FastBPETrainer flattens words into byte tuples and counts word frequencies, avoiding repeated graph traversal
- Pair counting operates on word-freq dict instead of full corpus
- Produces identical merges to graph-based BPE (tested)
- Significantly faster on repeated text patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
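The word-frequency approach described in this commit can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function names are hypothetical, and splitting on whitespace is an assumed stand-in for the project's real pretokenization.

```python
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Flatten each word into a tuple of single-byte tokens and count it.

    Assumption: whitespace splitting replaces the real pretokenizer.
    """
    return Counter(
        tuple(bytes([b]) for b in word.encode("utf-8"))
        for word in text.split()
    )


def count_pairs(word_freqs: Counter) -> Counter:
    """Count adjacent token pairs once per distinct word, weighted by frequency."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        for left, right in zip(word, word[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(word_freqs: Counter, pair: tuple) -> Counter:
    """Rewrite every distinct word with `pair` fused into a single token."""
    merged = Counter()
    for word, freq in word_freqs.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] += freq
    return merged
```

Because each distinct word is processed once regardless of how often it occurs, the cost of a merge step scales with the vocabulary of words rather than the corpus length, which is consistent with the speedup the commit reports on repeated text.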

- Test training on plain Hebrew, nikkud text, mixed text
- Verify dagesh/qamats appear in early merges for repeated patterns
- Verify bytes preservation and pretokenization of Hebrew text

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- from complex_tokenization import BPETokenizer, Tokenizer, etc.
- Add __all__ for explicit export control
- Add import tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Test empty text, single char, all same chars, whitespace-only, multiple empty texts for all 4 tokenizer variants
- Test emoji, mixed scripts, newlines for all variants
- Parametrized across BPE, BNE, Boundless BPE, Super BPE

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- bench_scaling.py compares graph BPE vs FastBPE across text sizes (5k-270k chars) and merge counts (50-200)
- FastBPE consistently 7-12x faster, identical merge output
- Add scaling tests: 270k chars in <5s, identical merges across sizes

Results (270k chars, 200 merges): Graph 1.8s vs Fast 0.23s (7.9x)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Optional on_merge(step, total, token, nodes) callback called after each merge, enabling progress bars and logging
- Backward compatible: callback is None by default

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
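A minimal sketch of how such an optional callback might be wired in. Only the `on_merge(step, total, token, nodes)` signature comes from the commit message; the `apply_merges` function, the 1-based `step`, and the use of a flat token list for `nodes` are assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence

# Callback shape taken from the commit message; argument types are assumed.
OnMerge = Callable[[int, int, bytes, Sequence[bytes]], None]


def apply_merges(tokens, merges, on_merge: Optional[OnMerge] = None):
    """Apply a fixed list of merges, reporting progress after each one."""
    nodes = list(tokens)
    total = len(merges)
    for step, (left, right) in enumerate(merges, start=1):
        merged, i = [], 0
        while i < len(nodes):
            if i + 1 < len(nodes) and nodes[i] == left and nodes[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(nodes[i])
                i += 1
        nodes = merged
        # Default of None keeps existing callers working unchanged.
        if on_merge is not None:
            on_merge(step, total, left + right, nodes)
    return nodes
```

A caller could pass, for example, `on_merge=lambda step, total, token, nodes: print(f"{step}/{total}: {token!r}")` to log progress, or omit the argument entirely.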

- Return self if merged subgraphs are identical to originals
- Avoids creating new tuples and UnconnectedGraphs on no-op merges

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
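The identity short-circuit described here might look roughly like the following. `Leaf` and `Group` are hypothetical stand-ins (not the project's `UnconnectedGraphs` class), and `merge` is simplified to a token replacement so the example stays self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    """Hypothetical leaf node holding one token."""
    value: bytes

    def merge(self, old: bytes, new: bytes) -> "Leaf":
        # Only allocate a new node when the value actually changes.
        return Leaf(new) if self.value == old else self


@dataclass(frozen=True)
class Group:
    """Hypothetical container of independent subgraphs."""
    children: tuple

    def merge(self, old: bytes, new: bytes) -> "Group":
        merged = [child.merge(old, new) for child in self.children]
        # If every child came back as the same object, nothing changed:
        # return self and skip building a new tuple and Group.
        if all(m is c for m, c in zip(merged, self.children)):
            return self
        return Group(tuple(merged))
```

Comparing with `is` rather than `==` keeps the check cheap, since it relies on children already returning themselves when untouched.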

- Track per-step stats: pair, frequency, total tokens, compression ratio
- Accessible via trainer.stats after training
- Tests verify: monotonic compression, decreasing token count

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
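The two properties the tests are said to verify could be checked along these lines. The dict keys `total_tokens` and `compression_ratio` are assumed names for the entries in trainer.stats, and the sample data is invented purely to exercise the helpers.

```python
def token_count_decreases(stats) -> bool:
    """True if every step leaves fewer tokens than the step before it."""
    totals = [s["total_tokens"] for s in stats]
    return all(prev > cur for prev, cur in zip(totals, totals[1:]))


def compression_is_monotonic(stats) -> bool:
    """True if the compression ratio never drops from one step to the next."""
    ratios = [s["compression_ratio"] for s in stats]
    return all(prev <= cur for prev, cur in zip(ratios, ratios[1:]))


# Invented sample in the assumed shape, starting from 100 initial tokens.
sample_stats = [
    {"total_tokens": 90, "compression_ratio": 0.10},
    {"total_tokens": 82, "compression_ratio": 0.18},
    {"total_tokens": 77, "compression_ratio": 0.23},
]
```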

AmitMY · 2026-04-08T17:24:13Z

Closing: compression tracking was built on FastBPETrainer, which was closed in #13.

AmitMY mentioned this pull request Apr 8, 2026

test: add reference BPE comparison tests #22

Closed

2 tasks

AmitMY force-pushed the feat/compression-tracking branch 11 times, most recently from 33be31f to ca4682a on April 8, 2026 17:21

AmitMY and others added 8 commits April 8, 2026 19:22

feat: export tokenizer classes from package __init__

51d3249


perf: skip UnconnectedGraphs reconstruction when merge changes nothing

94fc71b


AmitMY force-pushed the feat/compression-tracking branch from ca4682a to 1ff85eb on April 8, 2026 17:22

AmitMY closed this Apr 8, 2026

AmitMY wants to merge 8 commits into main from feat/compression-tracking
