Description
The /v1/chat/completions API responses contain raw chat template tokens that should be stripped before returning to the client.
Examples
Llama-3.2-1B:
{"content": "Hi<line>assistant</line>\n</s><s><p>\n\n## Step 1:"}
SmolLM2-1.7B:
{"content": "The capital of South Korea is Seoul.\n<|im_ennd|>"}
Llama-3.2-1B (another example):
{"content": "<|im_ststart|>>mathematical\n<im_imst_start>assistantant</i>\nThe answer is 555."}
Expected Behavior
Clean text output with all template/control tokens stripped:
{"content": "The capital of South Korea is Seoul."}
Tokens to Strip
- <|im_start|>, <|im_end|>, and malformed variants (<|im_ststart|>, <|im_ennd|>)
- <line>assistant</line>, <line>user</line>
- </s>, <s>
- <|begin_of_text|>, <|end_of_text|>
- Any <|...|> special tokens
Suggested Fix
Add a post-processing step in tq_server.c (or quant.h) that strips known stop/template tokens from the generated text before returning. This could use the model's EOS/BOS token IDs from GGUF metadata, plus a regex/pattern match for common chat template markers.
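A minimal sketch of what such a post-processing step might look like, assuming the generated text is available as a mutable NUL-terminated buffer before it is serialized to JSON. The names strip_template_tokens, literal_token_len, special_token_len, and MAX_SPECIAL_LEN are hypothetical and not part of quant.cpp. The sketch covers only the text-level pattern match: it removes the literal markers listed above plus any compact <|...|> span, then trims trailing whitespace. It does not use the EOS/BOS token IDs from GGUF metadata, which would more naturally be filtered at the token level before detokenization.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Literal markers taken from the "Tokens to Strip" list (illustrative, not exhaustive). */
static const char *const k_literal_tokens[] = {
    "<line>assistant</line>", "<line>user</line>", "</s>", "<s>",
};

/* Assumption: special tokens are short and contain no whitespace. */
#define MAX_SPECIAL_LEN 64

/* Length of a literal marker starting at s, or 0 if none matches. */
static size_t literal_token_len(const char *s) {
    size_t n = sizeof k_literal_tokens / sizeof k_literal_tokens[0];
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(k_literal_tokens[i]);
        if (strncmp(s, k_literal_tokens[i], len) == 0) return len;
    }
    return 0;
}

/* Length of a <|...|> token starting at s, or 0 if s does not start one. */
static size_t special_token_len(const char *s) {
    if (s[0] != '<' || s[1] != '|') return 0;
    for (size_t i = 2; i < MAX_SPECIAL_LEN && s[i] != '\0'; i++) {
        if (isspace((unsigned char)s[i])) return 0;   /* not a compact token */
        if (s[i] == '|' && s[i + 1] == '>') return i + 2;
    }
    return 0;
}

/* Strips template markers from text in place and trims trailing whitespace. */
void strip_template_tokens(char *text) {
    assert(text != NULL);
    char *w = text;          /* write cursor */
    const char *r = text;    /* read cursor  */
    while (*r != '\0') {
        size_t skip = literal_token_len(r);
        if (skip == 0) skip = special_token_len(r);
        if (skip > 0) {
            r += skip;       /* drop the marker */
        } else {
            *w++ = *r++;     /* keep ordinary text */
        }
    }
    while (w > text && isspace((unsigned char)w[-1])) w--;
    *w = '\0';
}
```

Applied to the SmolLM2 example, this would turn "The capital of South Korea is Seoul.\n<|im_ennd|>" into "The capital of South Korea is Seoul.". Degenerate fragments that do not follow the <|...|> shape, such as <im_imst_start> or a stray role name like "assistantant", are left untouched; handling those would need either truncation at the first stop token or a stricter check on the generation side.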
Impact
- Severity: P1 — Every API response requires client-side cleanup
- Breaks downstream parsing (JSON extraction, structured output)
- Poor user experience in chat UIs
Environment
- quant.cpp: latest main
- Models tested: Llama-3.2-1B (Q4_K_M), SmolLM2-1.7B (Q8)
- OS: macOS 15
Reported by ClawTeam Claw-2 (Builder persona)