
feat: add Hebrew language support with diacritics decomposition #10

Merged: AmitMY merged 2 commits into main from feat/hebrew on Apr 8, 2026

Conversation

AmitMY (Contributor) commented on Apr 8, 2026

Summary

  • Add FullyConnectedGraph vertex type where every pair of nodes is a valid merge candidate
  • Hebrew decomposition: each grapheme cluster splits into base letter + fully connected graph of diacritics (nikkud, dagesh, cantillation marks)
  • Single diacritics and bare letters collapse to plain Nodes
  • Bytes roundtrip verified: bytes(hebrew_grapheme_clusters(text)) == text.encode()
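
The decomposition and the roundtrip check can be illustrated with a minimal sketch. It is not the code from this PR: `split_clusters` and `clusters_to_bytes` are hypothetical helpers, and treating every Unicode nonspacing mark (category `Mn`) as a diacritic is a simplifying assumption rather than full grapheme cluster segmentation.

```python
import unicodedata


def split_clusters(text: str) -> list[tuple[str, list[str]]]:
    """Group each base character with the nonspacing marks (category Mn) that follow it."""
    clusters: list[tuple[str, list[str]]] = []
    for ch in text:
        if clusters and unicodedata.category(ch) == "Mn":
            clusters[-1][1].append(ch)  # diacritic: attach to the preceding base
        else:
            clusters.append((ch, []))  # new base character (or a leading stray mark)
    return clusters


def clusters_to_bytes(clusters: list[tuple[str, list[str]]]) -> bytes:
    """Re-encode the clusters in their original order."""
    return "".join(base + "".join(marks) for base, marks in clusters).encode("utf-8")


# bet + sheva + dagesh, then resh (escapes avoid bidi rendering surprises)
text = "\u05d1\u05b0\u05bc\u05e8"
clusters = split_clusters(text)

# The roundtrip property: no bytes are lost or reordered by the decomposition.
assert clusters_to_bytes(clusters) == text.encode("utf-8")
```

Because the marks are stored in their original order, the roundtrip holds even though merging later treats them as unordered.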

Stacked on #9.

What improved

  • Hebrew text like בְּרֵאשִׁ֖ית can now be tokenized with linguistically meaningful structure
  • Diacritics can merge in any order (fully connected), reflecting their unordered nature
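
To make the difference from a plain sequence concrete, here is a small sketch comparing the candidate pairs each structure exposes. The helper names are hypothetical, and modelling candidates as unordered pairs via `itertools.combinations` is an assumption; the PR does not state how the library represents them.

```python
from itertools import combinations


def adjacent_pairs(items: list[str]) -> list[tuple[str, str]]:
    """Candidates in an ordered sequence: only neighbours can merge."""
    return list(zip(items, items[1:]))


def fully_connected_pairs(items: list[str]) -> list[tuple[str, str]]:
    """Candidates in a fully connected graph: every pair can merge."""
    return list(combinations(items, 2))


# hiriq, shin dot, tipeha: the three marks on the shin in the example word
marks = ["\u05b4", "\u05c1", "\u0596"]

sequence_candidates = adjacent_pairs(marks)      # n - 1 pairs
graph_candidates = fully_connected_pairs(marks)  # n * (n - 1) / 2 pairs
```

With three marks the sequence offers two candidates and the graph offers three; the extra one is the pair of first and last mark, which are not neighbours in the byte order.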

Example

בְּ (bet + sheva + dagesh) → NodesSequence([
    utf8("ב"),
    FullyConnectedGraph([utf8("ְ"), utf8("ּ")])
])
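
The shape of that example, including the collapse rule for bare letters and single diacritics, can be sketched with toy stand-ins. The classes below only borrow the names `Node`, `NodesSequence`, and `FullyConnectedGraph` for readability; their fields, `build_cluster`, and the exact collapse behaviour (a lone diacritic becoming a plain node inside the sequence) are assumptions, not the library's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Toy stand-in: a leaf holding raw UTF-8 bytes."""
    value: bytes

    def __bytes__(self) -> bytes:
        return self.value


@dataclass(frozen=True)
class FullyConnectedGraph:
    """Toy stand-in: children whose pairs are all merge candidates."""
    nodes: tuple

    def __bytes__(self) -> bytes:
        return b"".join(bytes(n) for n in self.nodes)


@dataclass(frozen=True)
class NodesSequence:
    """Toy stand-in: children that merge only with their neighbours."""
    nodes: tuple

    def __bytes__(self) -> bytes:
        return b"".join(bytes(n) for n in self.nodes)


def build_cluster(base: str, marks: list[str]):
    """Hypothetical builder: base letter followed by its diacritics."""
    base_node = Node(base.encode("utf-8"))
    mark_nodes = tuple(Node(m.encode("utf-8")) for m in marks)
    if not mark_nodes:
        return base_node  # bare letter collapses to a plain node
    if len(mark_nodes) == 1:
        return NodesSequence((base_node, mark_nodes[0]))  # no graph needed for one mark
    return NodesSequence((base_node, FullyConnectedGraph(mark_nodes)))


# bet + sheva + dagesh, as in the example above
bet = build_cluster("\u05d1", ["\u05b0", "\u05bc"])
assert bytes(bet) == "\u05d1\u05b0\u05bc".encode("utf-8")
```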

Test plan

  • 11 new Hebrew tests (decomposition, bytes roundtrip, merge candidates)
  • All tests pass (HF API timeout is transient)
  • ruff check . passes

🤖 Generated with Claude Code

AmitMY force-pushed the feat/hebrew branch 6 times, most recently from 2930eb5 to 6e96c63, on April 8, 2026 15:36
AmitMY and others added 2 commits on April 8, 2026 17:39
- Special-case MAX_MERGE_SIZE=2 in get_merges to avoid range() and
  tuple slicing, yielding pre-built pairs directly
- Special-case size-2 merge to unpack and compare elements directly
  instead of creating tuple slices on every position
- Use token directly instead of creating new Node in merge
- Cache settings and len() as local variables

50 samples / 100 merges: 3.9s → 0.92s (4.3x faster)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
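
The idea behind the size-2 special case in the commit above can be sketched as follows. This is not the `get_merges` implementation from the repository: both functions are hypothetical, and they only illustrate why yielding pre-built pairs avoids the inner `range()` loop and per-position tuple slicing.

```python
def candidates_generic(tokens: tuple, max_size: int):
    """General path: slice every window of size 2..max_size at every position."""
    n = len(tokens)
    for i in range(n):
        for size in range(2, max_size + 1):
            if i + size <= n:
                yield tokens[i:i + size]  # allocates a new tuple per window


def candidates_pairs(tokens: tuple):
    """Fast path for max_size == 2: zip yields the adjacent pairs directly."""
    return zip(tokens, tokens[1:])


tokens = ("a", "b", "c", "d")

# Both paths produce the same candidates when the maximum merge size is 2.
assert list(candidates_generic(tokens, 2)) == list(candidates_pairs(tokens))
```

The fast path does one slice in total instead of one per position, which is the kind of saving the commit message describes; the measured speedup quoted there comes from the PR, not from this sketch.
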
- Add FullyConnectedGraph vertex type for unordered merge candidates
- Hebrew decomposition: each grapheme cluster splits into base letter
  + fully connected graph of diacritics (nikkud, dagesh, cantillation)
- Single diacritics and bare letters collapse to plain Nodes
- All pairs of diacritics are valid merge candidates in any order

Example: בְּ (bet+sheva+dagesh) → NodesSequence([
  utf8("ב"), FullyConnectedGraph([utf8("ְ"), utf8("ּ")])
])

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AmitMY merged commit dcd0428 into main on Apr 8, 2026
2 checks passed