feat: add high-level Tokenizer API with all 4 variants by AmitMY · Pull Request #12 · sign-language-processing/complex-tokenization

AmitMY · 2026-04-08T10:40:16Z

Summary

- Add Tokenizer(units, merge_size, connected): a clean, configurable base class
- Add BPETokenizer, BNETokenizer, BoundlessBPETokenizer, SuperBPETokenizer convenience classes
- Units can be a string ("utf8_clusters", "utf8", "characters") or a callable
- All 4 tokenizer variants are accessible through a consistent .train() / .get_merges() API
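
For readers skimming the PR, the string-or-callable `units` option could be resolved roughly as follows. This is a minimal sketch, not the code in this PR: `resolve_units`, the `_UNITS` registry, the hex-string representation of bytes, and the choice of `ValueError` are all assumptions made for illustration. `"utf8_clusters"` is left out because its exact semantics are not spelled out here.

```python
from typing import Callable, Dict, List, Union

Splitter = Callable[[str], List[str]]

# Hypothetical registry of named splitters (assumption: "characters" means
# code points and "utf8" means individual UTF-8 bytes, shown here as hex).
_UNITS: Dict[str, Splitter] = {
    "characters": lambda text: list(text),
    "utf8": lambda text: [f"{b:02x}" for b in text.encode("utf-8")],
}


def resolve_units(units: Union[str, Splitter]) -> Splitter:
    """Return a splitter for a registered name, or pass a callable through."""
    if callable(units):
        return units
    if isinstance(units, str) and units in _UNITS:
        return _UNITS[units]
    # Assumed error type; the PR only says that error handling is tested.
    raise ValueError(f"unknown units {units!r}; expected one of {sorted(_UNITS)}")


if __name__ == "__main__":
    print(resolve_units("characters")("ab"))
    print(resolve_units(str.split)("hello world"))
```

Accepting a callable keeps the base class open to custom segmentations without adding a new registry entry for each one.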

Stacked on #11.

What improved

- Clean, consistent API: tok = BPETokenizer(); tok.train(texts, num_merges=100)
- All configuration in one place (no more manual GraphSettings mutation)
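
To illustrate the `.train()` / `.get_merges()` call shape without depending on this repository, here is a toy stand-in. It is a plain character-level BPE written for this note, not the project's `BPETokenizer`: the class name `ToyBPETokenizer`, the character units, and the tie-breaking behaviour are assumptions, and `merge_size` and `connected` are deliberately not modelled.

```python
from collections import Counter
from typing import List, Tuple


class ToyBPETokenizer:
    """Toy character-level BPE mirroring the train()/get_merges() surface."""

    def __init__(self) -> None:
        self._merges: List[Tuple[str, str]] = []

    @staticmethod
    def _apply(seq: List[str], pair: Tuple[str, str]) -> List[str]:
        # Replace each non-overlapping occurrence of `pair`, left to right.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        return out

    def train(self, texts: List[str], num_merges: int) -> None:
        seqs = [list(text) for text in texts]
        self._merges = []
        for _ in range(num_merges):
            pairs: Counter = Counter()
            for seq in seqs:
                pairs.update(zip(seq, seq[1:]))  # count adjacent pairs
            if not pairs:
                break  # nothing left to merge
            best, _count = pairs.most_common(1)[0]
            self._merges.append(best)
            seqs = [self._apply(seq, best) for seq in seqs]

    def get_merges(self) -> List[Tuple[str, str]]:
        return list(self._merges)


if __name__ == "__main__":
    tok = ToyBPETokenizer()
    tok.train(["abab", "abc"], num_merges=1)
    print(tok.get_merges())  # [('a', 'b')]
```

Training stops early when no adjacent pairs remain, so a large `num_merges` on a small corpus simply yields fewer merges.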

Test plan

- 10 new API tests
- ruff check . passes

🤖 Generated with Claude Code

- Tokenizer(units, merge_size, connected): configurable base class
- BPETokenizer, BNETokenizer, BoundlessBPETokenizer, SuperBPETokenizer
- Units can be string ("utf8_clusters", "utf8", "characters") or callable
- 10 tests covering all variants, custom units, error handling

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

AmitMY mentioned this pull request Apr 8, 2026

perf: add FastBPETrainer using word-frequency counting #13

Closed

2 tasks

AmitMY force-pushed the feat/clean-api branch 10 times, most recently from 3395935 to ecc3851, on April 8, 2026 16:19

AmitMY mentioned this pull request Apr 8, 2026

feat: register Hebrew and Chinese units in Tokenizer API #17

Closed

2 tasks

AmitMY force-pushed the feat/clean-api branch from ecc3851 to ac65d6f on April 8, 2026 17:20

AmitMY mentioned this pull request Apr 8, 2026

fix: address PR review feedback (#4, #6) #23

Closed

2 tasks

AmitMY merged commit f9bee1c into main Apr 8, 2026
2 checks passed
