Phi-3 support not propagated from quant.h to libturboquant (quant-server broken) #67

@unamedkr

Description

PR #65 added full Phi-3/Phi-3.5 support to quant.h (single-header), but the changes were not propagated to the split source files used by libturboquant and quant-server. As a result:

  • tools/phi3_infer_test.c (uses quant.h) → works perfectly
  • quant-server (uses libturboquant) → still broken

Evidence

quant.h (single-header):

tq_load_gguf: loaded 32 layers (32 self_attn)   ← correct

Output: "Gravity is a fundamental force that attracts two bodies towards each other..."

quant-server (libturboquant):

tq_load_gguf: loaded 32 layers (0 self_attn)    ← still broken

Output: garbage tokens
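
The `0 self_attn` count is what one would expect from a loader that only looks for separate Q/K/V tensors and never sees them in a Phi-3 file. A minimal sketch of the fused-tensor check, assuming GGUF-style tensor names `blk.N.attn_q.weight` and `blk.N.attn_qkv.weight`; both helper functions are hypothetical and are not taken from quant.h or src/engine/tq_gguf.c:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if layer `layer` has attention weights,
 * stored either as a separate Q tensor or as a fused QKV tensor.
 * The tensor-name patterns are an assumption about the GGUF file. */
static int layer_has_self_attn(const char *const *names, int n_names, int layer) {
    char q_name[64], qkv_name[64];
    snprintf(q_name, sizeof q_name, "blk.%d.attn_q.weight", layer);
    snprintf(qkv_name, sizeof qkv_name, "blk.%d.attn_qkv.weight", layer);
    for (int i = 0; i < n_names; i++) {
        if (strcmp(names[i], q_name) == 0 || strcmp(names[i], qkv_name) == 0)
            return 1;
    }
    return 0;
}

/* Counts layers that carry attention weights in either layout. */
static int count_self_attn_layers(const char *const *names, int n_names, int n_layers) {
    int count = 0;
    for (int l = 0; l < n_layers; l++)
        count += layer_has_self_attn(names, n_names, l);
    return count;
}
```

A loader that checks only the `attn_q` name would report zero attention layers for a model that stores fused QKV weights, which matches the log line above.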

Files that need Phi-3 changes ported

The following changes from quant.h need to be mirrored in the split sources:

| Feature (implemented in quant.h) | Needs porting to |
| --- | --- |
| Fused attn_qkv detection | src/engine/tq_gguf.c |
| Fused ffn_up_gate detection | src/engine/tq_gguf.c |
| LongRoPE factor loading | src/engine/tq_gguf.c |
| Fused QKV matmul + split | src/engine/tq_transformer.c |
| Fused gate\|\|up FFN | src/engine/tq_transformer.c |
| NeoX-style RoPE rotation | src/engine/tq_transformer.c |
| Phi-3 BOS token handling | src/engine/tq_generate.c |
| Layer dispatch for gguf_w_qkv | src/engine/tq_transformer.c |

Impact

  • quantcpp serve phi3.5:mini launches the server, but inference output is garbage
  • Users who follow the README to serve Phi-3.5 will get broken output
  • The Python Model class also uses libturboquant via ctypes, so it's also affected

Workaround

Compile a shared library from quant.h directly and use a Python wrapper server:

cc -O2 -shared -fPIC -o libquant_phi3.dylib -x c - -lm -lpthread <<< '#define QUANT_IMPLEMENTATION
#include "quant.h"'
python3 phi35_server.py 8080

This workaround is functional (tested: 8 tok/s, coherent output, streaming works).

Suggested Fix

Sync the Phi-3 changes from quant.h into the split source tree. Consider adding a CI check that validates that quant.h and src/engine/*.c produce identical inference output for all supported architectures.
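
One possible shape for the comparison step of such a check, assuming both builds are run with greedy (deterministic) decoding on the same prompt and their output token IDs are collected into arrays. `first_token_mismatch` is a hypothetical helper, not an existing function in the project:

```c
#include <stddef.h>

/* Hypothetical CI helper: compares token sequences produced by the
 * single-header build and the split-source build.
 * Returns the index of the first differing token, or -1 if the
 * sequences are identical in both length and content. */
static long first_token_mismatch(const int *a, size_t na,
                                 const int *b, size_t nb) {
    size_t n = na < nb ? na : nb;
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i])
            return (long)i;
    }
    /* A shared prefix with different lengths counts as a mismatch. */
    return na == nb ? -1 : (long)n;
}
```

A CI job could fail whenever this returns anything other than -1 for any supported architecture, and report the returned index to show where the two builds diverge.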


Reported by ClawTeam — verified via Claw-4 (Optimizer) retest
