
fix: implement Viterbi SentencePiece encoding (replaces greedy) #3

Merged: dndungu merged 1 commit into main from fix/viterbi-sentencepiece on Mar 26, 2026

@dndungu commented Mar 26, 2026

Viterbi DP for globally optimal tokenization. Byte fallback via <0xNN>. Fixes Mistral garbage output. 6 new tests.

The greedy longest-match approach in sentencePieceEncode produced
suboptimal tokenization for SentencePiece unigram models (e.g.,
Mistral 7B). Replace it with Viterbi dynamic programming that finds
the segmentation maximizing the sum of log-probability scores.
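
The difference between the two strategies can be sketched on a toy vocabulary. This is a minimal illustration in Python, not the code from this PR: `viterbi_segment`, `greedy_segment`, and `TOY_SCORES` are hypothetical names, and the scores are made-up log-probabilities chosen so that the two approaches disagree.

```python
import math

# Hypothetical toy vocabulary: piece -> log-probability score.
TOY_SCORES = {"a": -1.0, "b": -1.0, "c": -5.0, "ab": -1.5, "bc": -1.5}


def viterbi_segment(text, scores):
    """Return the segmentation of `text` with the highest summed score.

    Returns None when no segmentation exists using only pieces in `scores`.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best total score for text[:i]
    back = [-1] * (n + 1)         # back[i]: start index of the last piece
    best[0] = 0.0
    max_len = max((len(p) for p in scores), default=0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] == -math.inf:
                continue  # text[:start] is not reachable
            piece = text[start:end]
            if piece in scores:
                cand = best[start] + scores[piece]
                if cand > best[end]:
                    best[end] = cand
                    back[end] = start
    if n > 0 and back[n] == -1:
        return None
    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    i = n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]


def greedy_segment(text, scores):
    """Longest-match-first segmentation, shown only for comparison."""
    pieces = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in scores:
                pieces.append(text[i:j])
                i = j
                break
        else:
            return None  # no piece matches at position i
    return pieces
```

On the input `"abc"`, the greedy sketch commits to `"ab"` first and is then forced to take the low-scoring `"c"` (total -6.5), while the Viterbi sketch picks `"a"` followed by `"bc"` (total -2.5).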

Also adds:
- Byte fallback encoding/decoding via <0xNN> tokens for chars not in vocab
- decodeSentencePieceBytes for proper round-trip of byte fallback tokens
- Tests: Viterbi vs greedy, byte fallback, sentence round-trip, edge cases
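
The byte fallback round trip listed above might look roughly like the following. This is a sketch under assumptions rather than the PR's implementation: it assumes each out-of-vocabulary character is emitted as one `<0xNN>` piece per UTF-8 byte, and `byte_fallback_pieces` and `decode_pieces` are hypothetical helper names (the PR's own decoder is `decodeSentencePieceBytes`, whose signature is not shown here).

```python
import re

# Matches a single byte-fallback piece such as "<0xC3>".
_BYTE_PIECE = re.compile(r"<0x([0-9A-Fa-f]{2})>")


def byte_fallback_pieces(ch):
    """Encode one character as <0xNN> pieces, one per UTF-8 byte (assumed)."""
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]


def decode_pieces(pieces):
    """Join pieces into text, converting <0xNN> pieces back into raw bytes."""
    out = bytearray()
    for piece in pieces:
        m = _BYTE_PIECE.fullmatch(piece)
        if m:
            out.append(int(m.group(1), 16))     # byte-fallback piece
        else:
            out.extend(piece.encode("utf-8"))   # ordinary vocabulary piece
    # Invalid byte sequences are replaced rather than raising an error.
    return out.decode("utf-8", errors="replace")
```

For example, `byte_fallback_pieces("é")` yields `["<0xC3>", "<0xA9>"]`, and `decode_pieces(["caf", "<0xC3>", "<0xA9>"])` reassembles `"café"`. Accumulating bytes before decoding matters because a multi-byte character is spread across several pieces.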

@dndungu merged commit 8f43e44 into main on Mar 26, 2026
1 check passed
@dndungu deleted the fix/viterbi-sentencepiece branch on March 26, 2026 at 16:33