feat: add high-level Tokenizer API with all 4 variants#12

Merged
AmitMY merged 1 commit into main from feat/clean-api
Apr 8, 2026
Conversation

AmitMY (Contributor) commented Apr 8, 2026

Summary

  • Add Tokenizer(units, merge_size, connected): a clean, configurable base class
  • Add BPETokenizer, BNETokenizer, BoundlessBPETokenizer, and SuperBPETokenizer convenience classes
  • units can be a string ("utf8_clusters", "utf8", "characters") or a callable
  • All 4 tokenizer variants are accessible through a consistent .train() / .get_merges() API
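
To illustrate the "string or callable" option for units, here is a minimal sketch of how such an argument could be resolved. It is not the code from this PR: resolve_units and the two splitter helpers are hypothetical names, and "utf8_clusters" is left out because its exact semantics are not described here.

```python
from typing import Callable, Dict, List, Sequence, Union

# Hypothetical type alias: a unit spec is either a known name or a splitter function.
UnitSpec = Union[str, Callable[[str], Sequence]]


def _utf8_units(text: str) -> List[bytes]:
    # Assumption: "utf8" means one unit per UTF-8 byte.
    return [bytes([b]) for b in text.encode("utf-8")]


def _character_units(text: str) -> List[str]:
    # Assumption: "characters" means one unit per Unicode code point.
    return list(text)


# Illustrative registry; the real library may name or organize these differently.
_NAMED_UNITS: Dict[str, Callable[[str], Sequence]] = {
    "utf8": _utf8_units,
    "characters": _character_units,
}


def resolve_units(units: UnitSpec) -> Callable[[str], Sequence]:
    """Return a splitter function for a named or user-supplied unit spec."""
    if callable(units):
        return units  # custom splitter, used as-is
    if isinstance(units, str):
        if units not in _NAMED_UNITS:
            raise ValueError(f"unknown units: {units!r}")
        return _NAMED_UNITS[units]
    raise TypeError("units must be a string or a callable")
```

With this shape, a caller could pass str.split to treat whitespace-separated words as units, while an unknown name fails early with a ValueError.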

Stacked on #11.

What improved

  • Clean, consistent API: tok = BPETokenizer(); tok.train(texts, num_merges=100)
  • All configuration in one place (no more manual GraphSettings mutation)
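
For readers unfamiliar with the .train() / .get_merges() shape, the following is a self-contained toy sketch of plain character-level BPE behind that interface. ToyBPETokenizer is a hypothetical stand-in, not the BPETokenizer added in this PR, and the return type of get_merges() (a list of pairs) is an assumption.

```python
from collections import Counter
from typing import Iterable, List, Tuple


class ToyBPETokenizer:
    """Hypothetical stand-in mirroring the .train() / .get_merges() interface."""

    def __init__(self) -> None:
        self._merges: List[Tuple[str, str]] = []

    def train(self, texts: Iterable[str], num_merges: int) -> None:
        # Start from single characters as the base units.
        sequences = [list(text) for text in texts]
        self._merges = []
        for _ in range(num_merges):
            # Count every adjacent pair across all sequences.
            pairs: Counter = Counter()
            for seq in sequences:
                pairs.update(zip(seq, seq[1:]))
            if not pairs:
                break  # nothing left to merge
            (left, right), _count = pairs.most_common(1)[0]
            self._merges.append((left, right))
            sequences = [self._apply(seq, left, right) for seq in sequences]

    @staticmethod
    def _apply(seq: List[str], left: str, right: str) -> List[str]:
        # Replace each non-overlapping occurrence of (left, right), scanning left to right.
        out: List[str] = []
        i = 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        return out

    def get_merges(self) -> List[Tuple[str, str]]:
        # Return a copy so callers cannot mutate internal state.
        return list(self._merges)


tok = ToyBPETokenizer()
tok.train(["ab ab ab"], num_merges=1)
print(tok.get_merges())  # [('a', 'b')]
```

The point of the sketch is only the calling convention: configuration happens at construction time, training takes the corpus and a merge budget, and the learned merges are read back afterwards.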

Test plan

  • 10 new API tests
  • ruff check . passes

🤖 Generated with Claude Code

AmitMY force-pushed the feat/clean-api branch 10 times, most recently from 3395935 to ecc3851 on April 8, 2026 16:19

- Tokenizer(units, merge_size, connected) — configurable base class
- BPETokenizer, BNETokenizer, BoundlessBPETokenizer, SuperBPETokenizer
- Units can be string ("utf8_clusters", "utf8", "characters") or callable
- 10 tests covering all variants, custom units, error handling

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

AmitMY merged commit f9bee1c into main on Apr 8, 2026
2 checks passed