
fix: address PR review feedback (#4, #6)#23

Closed
AmitMY wants to merge 10 commits into main from fix/pr-review-feedback

Conversation


@AmitMY AmitMY commented Apr 8, 2026

Summary

Addresses review feedback from PRs #4 and #6.

Stacked on #22.

Test plan

  • All tests pass
  • `ruff check .` passes

🤖 Generated with Claude Code

AmitMY force-pushed the fix/pr-review-feedback branch 10 times, most recently from ac34ba5 to e129df2 on April 8, 2026 at 16:06
AmitMY and others added 9 commits April 8, 2026 18:19
- Tokenizer(units, merge_size, connected) — configurable base class
- BPETokenizer, BNETokenizer, BoundlessBPETokenizer, SuperBPETokenizer
- Units can be string ("utf8_clusters", "utf8", "characters") or callable
- 10 tests covering all variants, custom units, error handling

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
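The configurable base class described in this commit can be sketched as follows. This is a minimal illustration assuming only the `Tokenizer(units, merge_size, connected)` signature and unit names from the commit message; the unit functions and error handling here are stand-ins, not the repository's actual implementation.

```python
from typing import Callable, Union

# Illustrative built-in unit functions; "utf8_clusters" and the real
# implementations are omitted.
def characters(text: str) -> list:
    return list(text)

def utf8(text: str) -> list:
    return [bytes([b]) for b in text.encode("utf-8")]

UNITS = {"characters": characters, "utf8": utf8}

class Tokenizer:
    """Configurable base: units may be a known string or a callable."""

    def __init__(self, units: Union[str, Callable] = "utf8",
                 merge_size: int = 2, connected: bool = True):
        if isinstance(units, str):
            if units not in UNITS:
                raise ValueError(f"unknown units: {units!r}")
            units = UNITS[units]
        self.units = units
        self.merge_size = merge_size
        self.connected = connected

    def pretokenize(self, text: str) -> list:
        return self.units(text)
```

A custom callable slots in directly, e.g. `Tokenizer(units=str.split)`, which is the kind of "custom units" case the tests mention.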
- FastBPETrainer flattens words into byte tuples and counts word
  frequencies, avoiding repeated graph traversal
- Pair counting operates on word-freq dict instead of full corpus
- Produces identical merges to graph-based BPE (tested)
- Significantly faster on repeated text patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
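The word-frequency strategy above can be sketched like this: flatten each distinct word into a tuple of bytes, count word frequencies once, then weight each word-internal pair by that frequency instead of re-scanning the whole corpus every merge step. Names are illustrative; the real `FastBPETrainer` is more involved.

```python
from collections import Counter

def count_pairs(word_freqs: dict) -> Counter:
    """Count adjacent pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        for pair in zip(word, word[1:]):
            pairs[pair] += freq
    return pairs

# Each distinct word is flattened to a byte tuple exactly once.
word_freqs = Counter(tuple(w.encode("utf-8")) for w in "low low lower".split())
pairs = count_pairs(word_freqs)
```

Repeated words contribute via a single dictionary entry, which is why this approach gets faster the more repetitive the text is.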
- Test training on plain Hebrew, nikkud text, mixed text
- Verify dagesh/qamats appear in early merges for repeated patterns
- Verify bytes preservation and pretokenization of Hebrew text

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- from complex_tokenization import BPETokenizer, Tokenizer, etc.
- Add __all__ for explicit export control
- Add import tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
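The package-level exports this commit describes would look roughly like the fragment below; only the exported names come from the commit message, and the module layout (`.tokenizer`) is an assumption.

```python
# complex_tokenization/__init__.py (module path assumed)
from .tokenizer import (
    Tokenizer,
    BPETokenizer,
    BNETokenizer,
    BoundlessBPETokenizer,
    SuperBPETokenizer,
)

# Explicit export control: star-imports and doc tools see only these names.
__all__ = [
    "Tokenizer",
    "BPETokenizer",
    "BNETokenizer",
    "BoundlessBPETokenizer",
    "SuperBPETokenizer",
]
```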
- Test empty text, single char, all same chars, whitespace-only,
  multiple empty texts for all 4 tokenizer variants
- Test emoji, mixed scripts, newlines for all variants
- Parametrized across BPE, BNE, Boundless BPE, Super BPE

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- bench_scaling.py compares graph BPE vs FastBPE across text sizes
  (5k-270k chars) and merge counts (50-200)
- FastBPE consistently 7-12x faster, identical merge output
- Add scaling tests: 270k chars in <5s, identical merges across sizes

Results (270k chars, 200 merges): Graph 1.8s vs Fast 0.23s (7.9x)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Optional on_merge(step, total, token, nodes) callback called after
  each merge, enabling progress bars and logging
- Backward compatible — callback is None by default

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
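The callback hook can be sketched as below, assuming only the `on_merge(step, total, token, nodes)` signature from the commit message; the trainer loop and token values are toy stand-ins.

```python
def train(num_merges: int, on_merge=None) -> list:
    """Toy training loop illustrating the optional progress callback."""
    merges = []
    for step in range(num_merges):
        token = f"tok{step}"          # stand-in for the merged token
        merges.append(token)
        if on_merge is not None:      # None by default: backward compatible
            on_merge(step + 1, num_merges, token, nodes=None)
    return merges

# Usage: a simple progress logger built on the callback.
train(3, on_merge=lambda step, total, token, nodes: print(f"{step}/{total}"))
```

Because the default is `None` and the callback fires after each merge, existing callers see no behavior change, while progress bars or loggers can attach without touching the training loop.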
- Return self if merged subgraphs are identical to originals
- Avoids creating new tuples and UnconnectedGraphs on no-op merges

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
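The no-op short-circuit amounts to an identity check before rebuilding: if the merged subgraphs are the same tuple, hand back `self` instead of allocating a copy. The class here is a hypothetical stand-in for the real `UnconnectedGraphs`.

```python
class UnconnectedGraphs:
    """Stand-in container holding a tuple of subgraphs."""

    def __init__(self, subgraphs: tuple):
        self.subgraphs = subgraphs

    def merge(self, merged: tuple) -> "UnconnectedGraphs":
        if merged == self.subgraphs:
            return self              # no-op merge: skip the allocation
        return UnconnectedGraphs(merged)

g = UnconnectedGraphs(("a", "b"))
assert g.merge(("a", "b")) is g      # identical input returns the same object
```

This is safe as long as the graphs are treated as immutable, since callers may now receive a shared instance rather than a fresh copy.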
- Track per-step stats: pair, frequency, total tokens, compression ratio
- Accessible via trainer.stats after training
- Tests verify: monotonic compression, decreasing token count

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
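The per-step record could look like the sketch below; the field names and the `record_step` helper are illustrative, with only the tracked quantities (pair, frequency, total tokens, compression ratio) taken from the commit message.

```python
from dataclasses import dataclass

@dataclass
class MergeStats:
    pair: tuple              # the pair merged at this step
    frequency: int           # how often the pair occurred
    total_tokens: int        # token count after the merge
    compression_ratio: float # tokens remaining / original tokens

def record_step(stats, pair, frequency, total_tokens, original_tokens):
    stats.append(MergeStats(pair, frequency, total_tokens,
                            total_tokens / original_tokens))

stats = []
record_step(stats, ("l", "o"), 3, 9, 12)
```

With this shape, the test properties follow directly: `total_tokens` decreases step over step, so `compression_ratio` falls monotonically.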
AmitMY force-pushed the fix/pr-review-feedback branch 2 times, most recently from f0d6272 to 48333af on April 8, 2026 at 16:23
- Standalone reference BPE and Boundless BPE implementations as ground
  truth (no graph abstraction, just byte sequences + counters)
- Verify graph BPE, FastBPE, and Boundless BPE all produce identical
  merges to reference on same input
- 5 tests across small, medium, and multi-implementation comparisons

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
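A standalone reference along these lines can be sketched from the textbook BPE algorithm: just byte lists and a `Counter`, no graph abstraction. This is written from the general algorithm, not copied from the repository's reference implementation.

```python
from collections import Counter

def reference_bpe(text: str, num_merges: int):
    """Greedy BPE over raw byte sequences."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        out, i = [], 0
        while i < len(seq):                # apply the merge left to right
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(seq[i] + seq[i + 1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges, seq
```

Because every merge only concatenates adjacent byte tokens, `b"".join(seq)` always reconstructs the original UTF-8 bytes, which is exactly the preservation property a ground-truth comparison needs.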
AmitMY force-pushed the fix/pr-review-feedback branch from 48333af to da53069 on April 8, 2026 at 17:17

AmitMY commented Apr 8, 2026

Absorbed into #12 — all review feedback changes are now in the clean API PR directly.

@AmitMY AmitMY closed this Apr 8, 2026
