
Fix GGUF BPE merge parsing — Qwen3/Llama3 garbage output#21

Merged
unamedkr merged 1 commit into main from fix/gguf-bpe-merge-parsing on Apr 10, 2026

Conversation

@unamedkr
Collaborator

Summary

Fixes the Qwen3 0.6B WASM demo producing complete garbage output (random Unicode, mixed languages, nonsense bytes).

Root cause

tq_load_tokenizer_from_gguf() had a critical bug: it read the tokenizer.ggml.merges GGUF key, allocated the merge_pairs buffer, set n_merges — but never parsed the actual merge strings. The buffer was memset(0) and left empty.

BPE tokenizers (Qwen3 with 248K vocab, Llama 3, GPT-2 style) depend on merge pairs to combine byte tokens into word tokens. Without parsed merges, every byte was emitted as a separate token → garbage Unicode output.

SmolLM2 worked because it uses SentencePiece (character-level encoding, no BPE merges needed).

Fix

Iterate over the GGUF tq_gguf_string_t array, split each "tok_a tok_b" merge rule on space, look up token IDs via str_lookup(), and store (id_a, id_b, id_merged) triples with priority scores. This is identical to the existing JSON tokenizer path (tq_tokenizer.c:596-672) which already worked correctly.

Applied to both:

  • src/engine/tq_tokenizer.c (library build)
  • quant.h (single-header / WASM build)

Test plan

  • Native build passes (cmake --build build)
  • WASM rebuild + Qwen3 0.6B produces coherent output
  • Llama 3.2 1B produces coherent output
  • SmolLM2 still works (regression check — SentencePiece path unchanged)

🤖 Generated with Claude Code

tq_load_tokenizer_from_gguf() allocated the merge_pairs buffer and
set n_merges, but never actually parsed the GGUF merge strings into
(id_a, id_b, id_merged) triples. The buffer was zeroed and left
unpopulated.

BPE tokenizers (Qwen3 248K vocab, Llama 3, GPT-2 style) depend on
merge pairs to combine byte tokens into word tokens. Without parsed
merges, every byte was emitted as a separate token, producing
garbage Unicode output.

SentencePiece tokenizers (SmolLM2, Gemma) worked because they use
character-level encoding and don't need BPE merges.

The fix iterates over the GGUF string array, splits each "tok_a tok_b"
merge rule, looks up token IDs, and stores the triple — identical to
the existing JSON tokenizer path (tq_tokenizer.c:596-672).

Applied to both tq_tokenizer.c (library) and quant.h (single-header /
WASM).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@unamedkr unamedkr merged commit 2396027 into main Apr 10, 2026
3 checks passed
@unamedkr unamedkr deleted the fix/gguf-bpe-merge-parsing branch April 10, 2026 05:40