fix: byte fallback penalty in SentencePiece Viterbi tokenizer#4

Merged
dndungu merged 1 commit into main from fix/viterbi-byte-priority on Mar 26, 2026

Conversation

Contributor

@dndungu dndungu commented Mar 26, 2026

Byte tokens used their vocab scores instead of a fixed -1e6 penalty. Fixes Mistral producing 43 byte tokens instead of 7 word tokens.

Byte fallback tokens (<0xNN>) were competing with multi-character vocab
tokens in the Viterbi DP because they carried their actual vocabulary
scores. When a byte token's score happened to be higher than the vocab
tokens' scores, the Viterbi algorithm preferred 43 byte-level tokens
over 7 word-level tokens.

Fix: assign byte fallback tokens a fixed score of -1e6 instead of their
vocabulary score, ensuring they are only used as a last resort when no
vocab token covers a position. This matches llama.cpp behavior.
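To illustrate the fix, here is a minimal sketch of unigram-style Viterbi segmentation with the fixed byte-fallback penalty. All names (`viterbi_tokenize`, the toy vocabulary, `BYTE_PENALTY`) are illustrative, not taken from this repo, and byte fallback is simplified to one token per character rather than per UTF-8 byte:

```python
BYTE_PENALTY = -1e6  # fixed score for <0xNN> fallback tokens (the fix)

def viterbi_tokenize(text, vocab):
    """vocab maps token string -> log score; returns the best segmentation."""
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i] = best score covering text[:i]
    back = [None] * (n + 1)           # back[i] = (start, token) of last piece
    best[0] = 0.0
    for i in range(n):
        if best[i] == float("-inf"):
            continue
        # Try every vocab token that starts at position i.
        for tok, score in vocab.items():
            j = i + len(tok)
            if j <= n and text[i:j] == tok and best[i] + score > best[j]:
                best[j] = best[i] + score
                back[j] = (i, tok)
        # Byte fallback is always available, but at a fixed large penalty,
        # so it can only win when no vocab token covers this position.
        j = i + 1
        if best[i] + BYTE_PENALTY > best[j]:
            best[j] = best[i] + BYTE_PENALTY
            back[j] = (i, f"<0x{ord(text[i]):02X}>")
    # Backtrack from the end to recover the winning token sequence.
    pieces, i = [], n
    while i > 0:
        start, tok = back[i]
        pieces.append(tok)
        i = start
    return pieces[::-1]
```

With the buggy behavior (byte tokens scored from the vocab), a byte token with a high score could outcompete a word token at every position; with the fixed -1e6 penalty, fallback fires only for characters no vocab token covers.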
@dndungu dndungu merged commit 45a5ae0 into main Mar 26, 2026
@dndungu dndungu deleted the fix/viterbi-byte-priority branch March 26, 2026 18:14