Description
PR #65 added full Phi-3/Phi-3.5 support to quant.h (single-header), but the changes were not propagated to the split source files used by libturboquant and quant-server. As a result:
tools/phi3_infer_test.c (uses quant.h) → works perfectly ✅
quant-server (uses libturboquant) → still broken ❌
Evidence
quant.h (single-header):
tq_load_gguf: loaded 32 layers (32 self_attn) ← correct
Output: "Gravity is a fundamental force that attracts two bodies towards each other..."
quant-server (libturboquant):
tq_load_gguf: loaded 32 layers (0 self_attn) ← still broken
Output: garbage tokens
Files that need Phi-3 changes ported
The following changes from quant.h need to be mirrored in the split sources:
| Feature | quant.h | Needs porting to |
|---|---|---|
| Fused attn_qkv detection | ✅ | src/engine/tq_gguf.c |
| Fused ffn_up_gate detection | ✅ | src/engine/tq_gguf.c |
| LongRoPE factor loading | ✅ | src/engine/tq_gguf.c |
| Fused QKV matmul + split | ✅ | src/engine/tq_transformer.c |
| Fused gate||up FFN | ✅ | src/engine/tq_transformer.c |
| NeoX-style RoPE rotation | ✅ | src/engine/tq_transformer.c |
| Phi-3 BOS token handling | ✅ | src/engine/tq_generate.c |
| Layer dispatch for gguf_w_qkv | ✅ | src/engine/tq_transformer.c |
Impact
- quantcpp serve phi3.5:mini launches the server but inference is garbage
- Users who follow the README to serve Phi-3.5 will get broken output
- The Python Model class also uses libturboquant via ctypes, so it's also affected
Workaround
Compile a shared library from quant.h directly and use a Python wrapper server:
cc -O2 -shared -fPIC -o libquant_phi3.dylib -x c - -lm -lpthread <<< '#define QUANT_IMPLEMENTATION
#include "quant.h"'
python3 phi35_server.py 8080
This workaround is functional (tested: 8 tok/s, coherent output, streaming works).
Suggested Fix
Sync the Phi-3 changes from quant.h into the split source tree. Consider adding a CI check that validates quant.h and src/engine/*.c produce identical inference output for all supported architectures.
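One possible shape for the core of such a CI check, as a rough sketch: run the same prompt with greedy decoding through both builds, collect the generated token IDs, and fail at the first position where they differ. `first_divergence` is a hypothetical helper, not an existing function in either code path, and how each build exposes its token IDs is left open here.

```c
#include <assert.h>
#include <stddef.h>

/* Compare two generated token sequences.
 * Returns -1 if they are identical, otherwise the index of the first
 * position where they differ (a length mismatch counts as a difference
 * at the end of the shorter sequence). */
static long first_divergence(const int *a, size_t na,
                             const int *b, size_t nb)
{
    assert(a != NULL && b != NULL);
    size_t n = na < nb ? na : nb;
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i]) {
            return (long)i;
        }
    }
    return na == nb ? -1 : (long)n;
}
```

Comparing token IDs rather than decoded text keeps the check independent of detokenization, and reporting the first differing index makes it easier to tell a loader problem (divergence at token 0) from a numerical drift later in the sequence.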
Environment
Reported by ClawTeam — verified via Claw-4 (Optimizer) retest