Phi-3 support not propagated from quant.h to libturboquant (quant-server broken) #67

## Description

PR #65 added full Phi-3/Phi-3.5 support to `quant.h` (single-header), but the changes were **not propagated to the split source files** used by `libturboquant` and `quant-server`. As a result:

- `tools/phi3_infer_test.c` (uses `quant.h`) → **works perfectly** ✅
- `quant-server` (uses `libturboquant`) → **still broken** ❌

## Evidence

**quant.h (single-header):**
```
tq_load_gguf: loaded 32 layers (32 self_attn)   ← correct
```
Output: "Gravity is a fundamental force that attracts two bodies towards each other..."

**quant-server (libturboquant):**
```
tq_load_gguf: loaded 32 layers (0 self_attn)    ← still broken
```
Output: garbage tokens
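
The symptom can be detected automatically from the loader summary line. The following is a minimal sketch, assuming the log format matches the two lines quoted above exactly; the function name `parse_load_summary` is hypothetical and not part of the project.

```python
import re

# Matches the loader summary line quoted in the Evidence section, e.g.
# "tq_load_gguf: loaded 32 layers (32 self_attn)".
_LOAD_RE = re.compile(r"tq_load_gguf: loaded (\d+) layers \((\d+) self_attn\)")


def parse_load_summary(log_text):
    """Return (total_layers, self_attn_layers) from the loader log.

    Hypothetical helper: raises ValueError if no summary line is present.
    """
    match = _LOAD_RE.search(log_text)
    if match is None:
        raise ValueError("no tq_load_gguf summary line found")
    return int(match.group(1)), int(match.group(2))
```

For Phi-3.5-mini the two numbers should be equal (32 and 32). A result of `(32, 0)` reproduces the broken `libturboquant` path.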

## Files that need Phi-3 changes ported

The following changes from `quant.h` need to be mirrored in the split sources:

| Feature | quant.h | Needs porting to |
|---------|---------|-----------------|
| Fused `attn_qkv` detection | ✅ | `src/engine/tq_gguf.c` |
| Fused `ffn_up_gate` detection | ✅ | `src/engine/tq_gguf.c` |
| LongRoPE factor loading | ✅ | `src/engine/tq_gguf.c` |
| Fused QKV matmul + split | ✅ | `src/engine/tq_transformer.c` |
| Fused gate\|\|up FFN | ✅ | `src/engine/tq_transformer.c` |
| NeoX-style RoPE rotation | ✅ | `src/engine/tq_transformer.c` |
| Phi-3 BOS token handling | ✅ | `src/engine/tq_generate.c` |
| Layer dispatch for `gguf_w_qkv` | ✅ | `src/engine/tq_transformer.c` |
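
For reference when porting, here is a rough Python sketch of the three transformer-side operations in the table. It is illustrative only and is not the code in `quant.h`. It assumes the fused QKV output is laid out as `[Q | K | V]`, that the fused FFN output is `[gate | up]`, and that LongRoPE factors divide each rotation angle (the convention llama.cpp uses). The LongRoPE attention magnitude scaling is omitted, and all function names are hypothetical.

```python
import math


def split_fused_qkv(qkv, n_heads, n_kv_heads, head_dim):
    """Split one fused QKV output vector into Q, K, V (assumed [Q | K | V])."""
    q_dim = n_heads * head_dim
    kv_dim = n_kv_heads * head_dim
    if len(qkv) != q_dim + 2 * kv_dim:
        raise ValueError("fused QKV length does not match head configuration")
    return qkv[:q_dim], qkv[q_dim:q_dim + kv_dim], qkv[q_dim + kv_dim:]


def rope_neox(head, pos, base=10000.0, freq_factors=None):
    """NeoX-style RoPE on one head: element i is paired with i + d/2
    (not with its neighbour i + 1, as in the interleaved style)."""
    d = len(head)
    half = d // 2
    out = list(head)
    for i in range(half):
        theta = pos * base ** (-2.0 * i / d)
        if freq_factors is not None:
            # Assumption: LongRoPE factors divide the per-pair angle.
            theta /= freq_factors[i]
        c, s = math.cos(theta), math.sin(theta)
        a, b = head[i], head[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


def swiglu_fused(gate_up):
    """SwiGLU on a fused FFN output (assumed [gate | up]): silu(gate) * up."""
    half = len(gate_up) // 2
    gate, up = gate_up[:half], gate_up[half:]
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]
```

If the split sources rotate adjacent pairs instead of `i` and `i + d/2`, or skip the QKV split, the output degrades to garbage tokens even when the weights load correctly, so these are the first things to compare against `quant.h`.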

## Impact

- `quantcpp serve phi3.5:mini` launches the server, but inference output is garbage
- Users who follow the README to serve Phi-3.5 will get broken output
- The Python `Model` class also uses `libturboquant` via ctypes, so it's also affected

## Workaround

Compile a shared library from `quant.h` directly and use a Python wrapper server:
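```bash
# Build a shared library straight from the single header.
# Run from the directory containing quant.h; on Linux, name the output .so instead of .dylib.
cc -O2 -shared -fPIC -o libquant_phi3.dylib -x c - -lm -lpthread <<< '#define QUANT_IMPLEMENTATION
#include "quant.h"'
python3 phi35_server.py 8080
```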

This workaround is functional (tested: 8 tok/s, coherent output, streaming works).

## Suggested Fix

Sync the Phi-3 changes from `quant.h` into the split source tree. Consider adding a CI check that validates that `quant.h` and `src/engine/*.c` produce identical inference output for all supported architectures.
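
One possible shape for such a check is sketched below. It assumes both builds can be driven from the command line with a fixed prompt and deterministic (greedy) sampling, so that stdout is comparable byte for byte. The two argument lists are placeholders supplied by the CI job; no real flags of either binary are assumed here.

```python
import subprocess
import sys


def run_capture(argv, timeout=600):
    """Run a command and return its stdout; raises if it exits non-zero."""
    result = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout, check=True
    )
    return result.stdout


def outputs_match(header_argv, split_argv):
    """Compare stdout of the quant.h build against the split-source build."""
    header_out = run_capture(header_argv)
    split_out = run_capture(split_argv)
    if header_out != split_out:
        print("MISMATCH between quant.h and split-source output", file=sys.stderr)
        return False
    return True
```

A CI job would call `outputs_match` once per supported architecture and fail the build on the first `False`.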

## Environment

- quant.cpp: commit 1e1ea2c (PR #65 merged)
- Model: Phi-3.5-mini-instruct-Q8_0.gguf (3.9GB)
- OS: macOS 15 (Apple M3)

---
*Reported by ClawTeam — verified via Claw-4 (Optimizer) retest*
