Description
When two or more HTTP requests are sent to quant-server simultaneously, all of the responses contain corrupted or interleaved text. The server accepts parallel connections but shares inference state without synchronization.
Steps to Reproduce
./build-metal/quant-server SmolLM2-1.7B-Instruct-Q8_0.gguf -p 8080
# Send two requests simultaneously
curl -s http://localhost:8080/v1/chat/completions \
-d '{"messages":[{"role":"user","content":"What is gravity?"}],"max_tokens":30}' &
curl -s http://localhost:8080/v1/chat/completions \
-d '{"messages":[{"role":"user","content":"What is Python?"}],"max_tokens":30}' &
wait
Actual Behavior
Both responses contain corrupted, mixed, or garbled output. Neither produces a coherent answer.
Expected Behavior
Either:
- Option A: Serialize requests (queue the second, process one at a time)
- Option B: Return `429 Too Many Requests` with a `Retry-After` header for the second request
- Option C: Support true concurrent inference with separate contexts
Impact
- Severity: P1 — Any multi-client scenario (web app, load balancer) will produce garbage
- No error or warning is returned — client receives corrupted data silently
- Chat UIs that allow rapid sequential messages will hit this
Suggested Fix
Short-term: Add a mutex in tq_server.c guarding the shared inference state, returning `429 Too Many Requests` when a request arrives while another is in flight.
Long-term: Support request queuing or multiple inference contexts.
Environment
- quant.cpp: latest main (49c6605)
- Model: SmolLM2-1.7B-Instruct-Q8_0.gguf
- OS: macOS 15 (Apple M3)
Reported by ClawTeam Claw-2 (Builder persona)