Concurrent requests produce corrupted output — no thread safety #63

@unamedkr

Description

When two or more HTTP requests are sent to quant-server simultaneously, both responses contain corrupted/garbled text. The server accepts parallel connections but shares inference state without synchronization.

Steps to Reproduce

./build-metal/quant-server SmolLM2-1.7B-Instruct-Q8_0.gguf -p 8080

# Send two requests simultaneously
curl -s http://localhost:8080/v1/chat/completions \
  -d '{"messages":[{"role":"user","content":"What is gravity?"}],"max_tokens":30}' &
curl -s http://localhost:8080/v1/chat/completions \
  -d '{"messages":[{"role":"user","content":"What is Python?"}],"max_tokens":30}' &
wait

Actual Behavior

Both responses contain corrupted, mixed, or garbled output. Neither produces a coherent answer.

Expected Behavior

Any one of the following:

  • Option A: Serialize requests (queue the second, process one at a time)
  • Option B: Return 429 Too Many Requests with Retry-After header for the second request
  • Option C: Support true concurrent inference with separate contexts
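For Option B, the rejection itself is cheap to produce. The following is a minimal sketch of how a busy response could be formatted, not code from quant-server: the function name `format_busy_response` and the JSON error body are hypothetical, and it assumes the server writes raw HTTP/1.1 bytes to the socket itself.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: writes a minimal HTTP/1.1 429 response into buf.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 if the buffer is too small. */
int format_busy_response(char *buf, size_t cap, int retry_after_s) {
    /* Illustrative body only; the real error schema is up to the project. */
    static const char body[] = "{\"error\":\"server busy\"}";

    int n = snprintf(buf, cap,
        "HTTP/1.1 429 Too Many Requests\r\n"
        "Content-Type: application/json\r\n"
        "Retry-After: %d\r\n"          /* delay in seconds */
        "Content-Length: %zu\r\n"
        "Connection: close\r\n"
        "\r\n"
        "%s",
        retry_after_s, sizeof body - 1, body);

    if (n < 0 || (size_t)n >= cap) {
        return -1;                     /* encoding error or truncated output */
    }
    return n;
}
```

A client that honors `Retry-After` can then back off and resend instead of receiving garbled text.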

Impact

  • Severity: P1 — Any multi-client scenario (web app, load balancer) will produce garbage
  • No error or warning is returned — client receives corrupted data silently
  • Chat UIs that allow rapid sequential messages will hit this

Suggested Fix

Short-term: Add a mutex in tq_server.c so that only one request runs inference at a time, and return 429 to any request that arrives while the server is busy.
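A rough sketch of the short-term fix, assuming the server handles each connection on its own POSIX thread. The names `g_infer_lock`, `request_begin`, and `request_end` are hypothetical and do not come from tq_server.c; the point is only that a non-blocking try-lock lets the second request fail fast instead of touching shared state.

```c
#include <pthread.h>

/* Hypothetical lock guarding the single shared inference state. */
pthread_mutex_t g_infer_lock = PTHREAD_MUTEX_INITIALIZER;

/* Call at the start of a request, before touching any model state.
 * Returns 200 if the caller now owns the inference engine,
 * or 429 if another request is already running. Never blocks. */
int request_begin(void) {
    if (pthread_mutex_trylock(&g_infer_lock) != 0) {
        return 429;   /* busy: caller should send 429 and close */
    }
    return 200;
}

/* Call exactly once after a request that got 200 from request_begin()
 * has finished generating, on every exit path including errors. */
void request_end(void) {
    pthread_mutex_unlock(&g_infer_lock);
}
```

Swapping `pthread_mutex_trylock` for `pthread_mutex_lock` would give Option A instead: the second request waits for the first to finish rather than being rejected.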

Long-term: Support request queuing or multiple inference contexts.
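One possible shape for the long-term fix is a fixed pool of inference contexts, sketched below. Everything here is an assumption for illustration: `ctx_pool`, `POOL_SIZE`, and the function names are hypothetical, and the sketch only tracks which slots are in use. It presumes each slot would own its own per-request state (for example a separate KV cache) while sharing read-only model weights, which the current code may not support yet.

```c
#include <pthread.h>

#define POOL_SIZE 2   /* hypothetical number of independent contexts */

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  freed;            /* signaled when a slot is released */
    int             in_use[POOL_SIZE]; /* 1 if slot i is held by a request */
} ctx_pool;

ctx_pool g_pool = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, {0} };

/* Returns the lowest free slot, or -1. Caller must hold p->lock. */
static int find_free_slot(const ctx_pool *p) {
    for (int i = 0; i < POOL_SIZE; i++) {
        if (!p->in_use[i]) return i;
    }
    return -1;
}

/* Non-blocking: returns a slot index, or -1 if every context is busy
 * (the caller could answer 429 in that case). */
int ctx_try_acquire(ctx_pool *p) {
    pthread_mutex_lock(&p->lock);
    int slot = find_free_slot(p);
    if (slot >= 0) p->in_use[slot] = 1;
    pthread_mutex_unlock(&p->lock);
    return slot;
}

/* Blocking: waits until a context is free, which queues excess requests. */
int ctx_acquire(ctx_pool *p) {
    int slot;
    pthread_mutex_lock(&p->lock);
    while ((slot = find_free_slot(p)) < 0) {
        pthread_cond_wait(&p->freed, &p->lock);
    }
    p->in_use[slot] = 1;
    pthread_mutex_unlock(&p->lock);
    return slot;
}

/* Returns a slot to the pool and wakes one waiting request, if any. */
void ctx_release(ctx_pool *p, int slot) {
    pthread_mutex_lock(&p->lock);
    p->in_use[slot] = 0;
    pthread_cond_signal(&p->freed);
    pthread_mutex_unlock(&p->lock);
}
```

With `POOL_SIZE` set to 1 this degenerates to plain serialization. Note that a condition variable does not guarantee first-come, first-served wakeups, so strict request ordering would need an explicit queue.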

Environment

  • quant.cpp: latest main (49c6605)
  • Model: SmolLM2-1.7B-Instruct-Q8_0.gguf
  • OS: macOS 15 (Apple M3)

Reported by ClawTeam Claw-2 (Builder persona)

Metadata

Labels

bug (Something isn't working)
