Chat template tokens leaked in API responses #57

@unamedkr

Description

The /v1/chat/completions API responses contain raw chat template tokens that should be stripped before returning to the client.

Examples

Llama-3.2-1B:

{"content": "Hi<line>assistant</line>\n</s><s><p>\n\n## Step 1:"}

SmolLM2-1.7B:

{"content": "The capital of South Korea is Seoul.\n<|im_ennd|>"}

Llama-3.2-1B (another example):

{"content": "<|im_ststart|>>mathematical\n<im_imst_start>assistantant</i>\nThe answer is 555."}

Expected Behavior

Clean text output with all template/control tokens stripped:

{"content": "The capital of South Korea is Seoul."}

Tokens to Strip

  • <|im_start|>, <|im_end|> and malformed variants (<|im_ststart|>, <|im_ennd|>)
  • <line>assistant</line>, <line>user</line>
  • </s>, <s>
  • <|begin_of_text|>, <|end_of_text|>
  • Any <|...|> special tokens

Suggested Fix

Add a post-processing step in tq_server.c (or quant.h) that strips known stop/template tokens from the generated text before it is returned in the response. This could combine the model's EOS/BOS token IDs from GGUF metadata with a regex/pattern match for common chat template markers.
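
A minimal sketch of the pattern-matching half of this suggestion, for discussion only. The function name `strip_template_tokens` and the `LITERAL_TOKENS` table are hypothetical and do not exist in quant.cpp. It assumes the generated text is available as a mutable NUL-terminated buffer, and it does not read EOS/BOS token IDs from GGUF metadata.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical literal markers, taken from the "Tokens to Strip" list. */
static const char *const LITERAL_TOKENS[] = {
    "<line>assistant</line>", "<line>user</line>", "</s>", "<s>",
};

/* Length of a <|name|> token starting at s, or 0 if there is none.
 * name is limited to letters, digits and '_' so ordinary text survives. */
static size_t match_special(const char *s) {
    if (s[0] != '<' || s[1] != '|') return 0;
    for (size_t i = 2; s[i] != '\0'; i++) {
        if (s[i] == '|' && s[i + 1] == '>') return i > 2 ? i + 2 : 0;
        if (!isalnum((unsigned char)s[i]) && s[i] != '_') return 0;
    }
    return 0;
}

/* Length of a literal marker starting at s, or 0 if there is none. */
static size_t match_literal(const char *s) {
    size_t count = sizeof LITERAL_TOKENS / sizeof LITERAL_TOKENS[0];
    for (size_t k = 0; k < count; k++) {
        size_t n = strlen(LITERAL_TOKENS[k]);
        if (strncmp(s, LITERAL_TOKENS[k], n) == 0) return n;
    }
    return 0;
}

/* Removes template tokens from text in place, then trims trailing
 * whitespace left behind by a stripped end-of-turn marker. */
void strip_template_tokens(char *text) {
    char *dst = text;
    const char *src = text;
    while (*src != '\0') {
        size_t n = match_special(src);
        if (n == 0) n = match_literal(src);
        if (n > 0) {
            src += n;          /* skip the token */
        } else {
            *dst++ = *src++;   /* keep the character */
        }
    }
    *dst = '\0';
    while (dst > text && isspace((unsigned char)dst[-1])) *--dst = '\0';
}
```

Applied to the SmolLM2-1.7B example above, this reduces the content to "The capital of South Korea is Seoul.". Because it matches any `<|name|>` shape, it also catches the malformed `<|im_ststart|>` and `<|im_ennd|>` variants. It does not catch markers outside that shape, such as `<im_imst_start>` or the stray `<p>`, and it does not truncate text generated after a stop token, so the Llama-3.2-1B outputs would still need stop handling at the token-ID level.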

Impact

  • Severity: P1 — Every API response requires client-side cleanup
  • Breaks downstream parsing (JSON extraction, structured output)
  • Poor user experience in chat UIs

Environment

  • quant.cpp: latest main
  • Models tested: Llama-3.2-1B (Q4_K_M), SmolLM2-1.7B (Q8)
  • OS: macOS 15

Reported by ClawTeam Claw-2 (Builder persona)

Metadata

Labels: bug (Something isn't working)