
Fix GGUF BPE merge parsing — Qwen3/Llama3 garbage output#21

Merged
unamedkr merged 1 commit into main from fix/gguf-bpe-merge-parsing on Apr 10, 2026

Conversation

@unamedkr
Collaborator

Summary

Fixes the Qwen3 0.6B WASM demo producing complete garbage output (random Unicode, mixed languages, nonsense bytes).

Root cause

tq_load_tokenizer_from_gguf() had a critical bug: it read the tokenizer.ggml.merges GGUF key, allocated the merge_pairs buffer, set n_merges — but never parsed the actual merge strings. The buffer was memset(0) and left empty.

BPE tokenizers (Qwen3 with 248K vocab, Llama 3, GPT-2 style) depend on merge pairs to combine byte tokens into word tokens. Without parsed merges, every byte was emitted as a separate token → garbage Unicode output.

SmolLM2 worked because it uses SentencePiece (character-level encoding, no BPE merges needed).

Fix

Iterate over the GGUF tq_gguf_string_t array, split each "tok_a tok_b" merge rule on space, look up token IDs via str_lookup(), and store (id_a, id_b, id_merged) triples with priority scores. This is identical to the existing JSON tokenizer path (tq_tokenizer.c:596-672) which already worked correctly.

Applied to both:

  • src/engine/tq_tokenizer.c (library build)
  • quant.h (single-header / WASM build)

Test plan

  • Native build passes (cmake --build build)
  • WASM rebuild + Qwen3 0.6B produces coherent output
  • Llama 3.2 1B produces coherent output
  • SmolLM2 still works (regression check — SentencePiece path unchanged)

🤖 Generated with Claude Code

tq_load_tokenizer_from_gguf() allocated the merge_pairs buffer and
set n_merges, but never actually parsed the GGUF merge strings into
(id_a, id_b, id_merged) triples. The buffer was zeroed and left
unpopulated.

BPE tokenizers (Qwen3 248K vocab, Llama 3, GPT-2 style) depend on
merge pairs to combine byte tokens into word tokens. Without parsed
merges, every byte was emitted as a separate token, producing
garbage Unicode output.

SentencePiece tokenizers (SmolLM2, Gemma) worked because they use
character-level encoding and don't need BPE merges.

The fix iterates over the GGUF string array, splits each "tok_a tok_b"
merge rule, looks up token IDs, and stores the triple — identical to
the existing JSON tokenizer path (tq_tokenizer.c:596-672).

Applied to both tq_tokenizer.c (library) and quant.h (single-header /
WASM).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@unamedkr unamedkr merged commit 2396027 into main Apr 10, 2026
3 checks passed
@unamedkr unamedkr deleted the fix/gguf-bpe-merge-parsing branch April 10, 2026 05:40