fix: byte fallback penalty in SentencePiece Viterbi tokenizer#4

Merged
dndungu merged 1 commit into main from fix/viterbi-byte-priority on Mar 26, 2026

Conversation

Contributor

@dndungu dndungu commented Mar 26, 2026

Byte tokens used their vocab scores instead of a fixed -1e6 penalty. Fixes Mistral producing 43 byte tokens instead of 7 word tokens.

Byte fallback tokens (<0xNN>) were competing with multi-character vocab
tokens in the Viterbi DP because they carried their actual vocabulary
scores. When a byte token's score happened to be higher than the vocab
tokens' scores, the Viterbi algorithm preferred 43 byte-level tokens
over 7 word-level tokens.

Fix: assign byte fallback tokens a fixed score of -1e6 instead of their
vocabulary score, ensuring they are only used as a last resort when no
vocab token covers a position. This matches llama.cpp behavior.
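To illustrate the fix, here is a minimal sketch of unigram-style Viterbi segmentation with the fixed byte-fallback penalty. All names (`viterbi_tokenize`, the toy vocabulary, `BYTE_PENALTY`) are illustrative, not taken from this repo, and byte fallback is simplified to one token per character rather than per UTF-8 byte:

```python
BYTE_PENALTY = -1e6  # fixed score for <0xNN> fallback tokens (the fix)

def viterbi_tokenize(text, vocab):
    """vocab maps token string -> log score; returns the best segmentation."""
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i] = best score covering text[:i]
    back = [None] * (n + 1)           # back[i] = (start, token) of last piece
    best[0] = 0.0
    for i in range(n):
        if best[i] == float("-inf"):
            continue
        # Try every vocab token that starts at position i.
        for tok, score in vocab.items():
            j = i + len(tok)
            if j <= n and text[i:j] == tok and best[i] + score > best[j]:
                best[j] = best[i] + score
                back[j] = (i, tok)
        # Byte fallback is always available, but at a fixed large penalty,
        # so it can only win when no vocab token covers this position.
        j = i + 1
        if best[i] + BYTE_PENALTY > best[j]:
            best[j] = best[i] + BYTE_PENALTY
            back[j] = (i, f"<0x{ord(text[i]):02X}>")
    # Backtrack from the end to recover the winning token sequence.
    pieces, i = [], n
    while i > 0:
        start, tok = back[i]
        pieces.append(tok)
        i = start
    return pieces[::-1]
```

With the buggy behavior (byte tokens scored from the vocab), a byte token with a high score could outcompete a word token at every position; with the fixed -1e6 penalty, fallback fires only for characters no vocab token covers.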
@dndungu dndungu merged commit 45a5ae0 into main Mar 26, 2026
@dndungu dndungu deleted the fix/viterbi-byte-priority branch March 26, 2026 18:14