feat: add compression ratio tracking to FastBPETrainer #21

Closed
AmitMY wants to merge 8 commits into main from feat/compression-tracking

Conversation

AmitMY (Contributor) commented Apr 8, 2026

Summary

  • Track per-step stats in FastBPETrainer.stats: pair, frequency, total tokens, compression ratio
  • Compression ratio = fraction of tokens eliminated vs initial count
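
The ratio definition above can be illustrated with a minimal sketch. The `MergeStats` record and `make_stats` helper are hypothetical names chosen for illustration; the PR text only states which four fields are tracked, not how `FastBPETrainer.stats` stores them.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class MergeStats:
    """Hypothetical per-step record holding the four fields named in the summary."""
    pair: Tuple[bytes, bytes]
    frequency: int
    total_tokens: int
    compression_ratio: float


def make_stats(pair, frequency, total_tokens, initial_tokens) -> MergeStats:
    # Compression ratio = fraction of the initial tokens eliminated so far.
    # Assumption: an empty corpus is reported as 0.0 rather than raising.
    ratio = 1.0 - total_tokens / initial_tokens if initial_tokens else 0.0
    return MergeStats(pair, frequency, total_tokens, ratio)


# 100 initial tokens, 75 remaining after this step: a quarter has been eliminated.
step = make_stats((b"a", b"b"), frequency=25, total_tokens=75, initial_tokens=100)
```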

Stacked on #20.

What improved

  • Can now analyze how compression evolves during training
  • Useful for comparing tokenizer variants

Test plan

  • 5 new compression tests pass
  • ruff check . passes

🤖 Generated with Claude Code

AmitMY force-pushed the feat/compression-tracking branch 11 times, most recently from 33be31f to ca4682a on April 8, 2026 17:21
AmitMY and others added 8 commits April 8, 2026 19:22
- FastBPETrainer flattens words into byte tuples and counts word
  frequencies, avoiding repeated graph traversal
- Pair counting operates on word-freq dict instead of full corpus
- Produces identical merges to graph-based BPE (tested)
- Significantly faster on repeated text patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
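
The word-frequency approach described in this commit can be sketched roughly as follows. This is not the FastBPETrainer code: the function names are hypothetical, and splitting on whitespace is an assumption standing in for whatever pretokenization the library actually uses.

```python
from collections import Counter


def count_words(text: str) -> Counter:
    """Map each word, flattened to a tuple of single-byte tokens, to its frequency."""
    words = Counter()
    for word in text.split():  # assumption: whitespace pretokenization
        data = word.encode("utf-8")
        words[tuple(data[i:i + 1] for i in range(len(data)))] += 1
    return words


def count_pairs(word_freqs: Counter) -> Counter:
    """Count adjacent pairs once per unique word, weighted by that word's frequency."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        for left, right in zip(word, word[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(word_freqs: Counter, pair) -> Counter:
    """Replace every non-overlapping occurrence of `pair` with the joined token."""
    merged = Counter()
    for word, freq in word_freqs.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] += freq
    return merged


words = count_words("ab ab abc")
pairs = count_pairs(words)
merged = merge_pair(words, (b"a", b"b"))
```

Because repeated words collapse into one dictionary entry, each pair-counting pass touches every distinct word once instead of every occurrence, which is consistent with the speedup on repeated text patterns that the commit reports.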
- Test training on plain Hebrew, nikkud text, mixed text
- Verify dagesh/qamats appear in early merges for repeated patterns
- Verify bytes preservation and pretokenization of Hebrew text

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Allow top-level imports: from complex_tokenization import BPETokenizer, Tokenizer, etc.
- Add __all__ for explicit export control
- Add import tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Test empty text, single char, all same chars, whitespace-only,
  multiple empty texts for all 4 tokenizer variants
- Test emoji, mixed scripts, newlines for all variants
- Parametrized across BPE, BNE, Boundless BPE, Super BPE

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- bench_scaling.py compares graph BPE vs FastBPE across text sizes
  (5k-270k chars) and merge counts (50-200)
- FastBPE consistently 7-12x faster, identical merge output
- Add scaling tests: 270k chars in <5s, identical merges across sizes

Results (270k chars, 200 merges): Graph 1.8s vs Fast 0.23s (7.9x)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Optional on_merge(step, total, token, nodes) callback called after
  each merge, enabling progress bars and logging
- Backward compatible — callback is None by default

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
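
A minimal sketch of how an optional callback with this signature might be invoked. `run_merges` is a hypothetical stand-in for the training loop, the 1-based step numbering is an assumption, and `nodes` is passed through as an opaque value because the commit message does not describe it.

```python
from typing import Any, Callable, Optional, Sequence

# Signature taken from the commit message: on_merge(step, total, token, nodes).
OnMerge = Callable[[int, int, Any, Any], None]


def run_merges(
    tokens: Sequence[Any],
    nodes: Any = None,
    on_merge: Optional[OnMerge] = None,
) -> None:
    """Stand-in loop: the merge work is omitted, only the callback call is shown."""
    total = len(tokens)
    for step, token in enumerate(tokens, start=1):  # assumption: steps start at 1
        # ... the actual merge would happen here ...
        if on_merge is not None:  # default None keeps existing callers working
            on_merge(step, total, token, nodes)


log = []
run_merges(
    [b"ab", b"abc"],
    on_merge=lambda step, total, token, nodes: log.append(f"{step}/{total} {token!r}"),
)
```

A progress bar would use the same hook, updating on each call instead of appending to a list.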
- Return self if merged subgraphs are identical to originals
- Avoids creating new tuples and UnconnectedGraphs on no-op merges

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
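
The no-op shortcut can be sketched on a toy structure. `Node` and `Group` are hypothetical classes, not the library's graph types, and the sketch still builds a temporary list for the comparison, so it only illustrates the idea of returning `self` when nothing changed.

```python
class Node:
    """Toy leaf holding one token (hypothetical, for illustration only)."""

    def __init__(self, value: bytes):
        self.value = value

    def merge(self, old: bytes, new: bytes) -> "Node":
        # Return self when nothing changes, so callers can detect a no-op by identity.
        return Node(new) if self.value == old else self


class Group:
    """Toy container of subgraphs (hypothetical stand-in for a graph node)."""

    def __init__(self, subgraphs: tuple):
        self.subgraphs = subgraphs

    def merge(self, old: bytes, new: bytes) -> "Group":
        merged = [g.merge(old, new) for g in self.subgraphs]
        # If every subgraph came back as the same object, the merge was a no-op:
        # reuse this Group instead of allocating a new tuple and a new Group.
        if all(m is o for m, o in zip(merged, self.subgraphs)):
            return self
        return Group(tuple(merged))


group = Group((Node(b"a"), Node(b"b")))
unchanged = group.merge(b"x", b"y")
changed = group.merge(b"a", b"z")
```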
- Track per-step stats: pair, frequency, total tokens, compression ratio
- Accessible via trainer.stats after training
- Tests verify: monotonic compression, decreasing token count

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
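
The properties these tests verify can be expressed as a small check. This is a sketch, not the PR's test code: it assumes `trainer.stats` is a list of dicts with `total_tokens` and `compression_ratio` keys, it treats "decreasing" as non-increasing, and the sample values are made up for illustration.

```python
def is_non_increasing(values) -> bool:
    return all(a >= b for a, b in zip(values, values[1:]))


def is_non_decreasing(values) -> bool:
    return all(a <= b for a, b in zip(values, values[1:]))


def check_stats(stats) -> bool:
    """True if token counts never rise and compression never falls across steps."""
    tokens = [s["total_tokens"] for s in stats]
    ratios = [s["compression_ratio"] for s in stats]
    return is_non_increasing(tokens) and is_non_decreasing(ratios)


# Illustrative values only (100 initial tokens assumed), not results from the PR.
sample_stats = [
    {"total_tokens": 90, "compression_ratio": 0.10},
    {"total_tokens": 85, "compression_ratio": 0.15},
    {"total_tokens": 85, "compression_ratio": 0.15},
]
ok = check_stats(sample_stats)
```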
AmitMY force-pushed the feat/compression-tracking branch from ca4682a to 1ff85eb on April 8, 2026 17:22
AmitMY (Contributor, Author) commented Apr 8, 2026

Closing: compression tracking was built on FastBPETrainer, which was closed in #13.

AmitMY closed this Apr 8, 2026