fix: Phi-3 Q8_0 default + unified server in CLI/CMake by unamedkr · Pull Request #80 · quantumaikr/quant.cpp

unamedkr · 2026-04-12T09:39:23Z

Summary

Two quick wins that immediately improve the user experience, plus the CMake target that supports the second:

1. Python registry → Q8_0 (2x speed)

| Before | After |
| --- | --- |
| `Phi-3.5-mini-instruct-Q4_K_M.gguf` (2.2 GB) | `Phi-3.5-mini-instruct-Q8_0.gguf` (3.8 GB) |
| 1.5 tok/s on M3 | 3.0 tok/s on M3 |

Q8_0's simple int8 dequant is NEON-friendly, whereas Q4_K_M's complex super-block dequant dominates compute at batch-1. Both produce identical quality output.
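To illustrate why Q8_0 dequant is considered simple, here is a minimal reference sketch in Python. It assumes the Q8_0 block layout used by ggml-style GGUF files (one little-endian fp16 scale followed by 32 signed int8 values, 34 bytes per block); the function name `dequantize_q8_0` is hypothetical, and this is not the project's NEON kernel, only the arithmetic it would have to perform.

```python
import struct
from typing import List

QK8_0 = 32                 # values per Q8_0 block (assumed ggml layout)
BLOCK_BYTES = 2 + QK8_0    # 2-byte fp16 scale + 32 int8 quants

def dequantize_q8_0(data: bytes) -> List[float]:
    """Dequantize a buffer of Q8_0 blocks: x[i] = scale * q[i]."""
    if len(data) % BLOCK_BYTES != 0:
        raise ValueError("buffer length is not a multiple of the block size")
    out: List[float] = []
    for offset in range(0, len(data), BLOCK_BYTES):
        # "<e32b": little-endian fp16 scale, then 32 signed bytes.
        scale, *quants = struct.unpack_from("<e32b", data, offset)
        # One multiply per value; no nested sub-block scales or minimums.
        out.extend(scale * q for q in quants)
    return out
```

Each value needs a single multiply by a per-block scale, which maps naturally onto wide SIMD lanes. K-quant formats such as Q4_K_M add packed 4-bit values plus per-sub-block scales and minimums inside a super-block, which is the extra work the description refers to.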

2. CLI `serve` → prefers `quant-server-unified`

`quantcpp serve` now searches for `quant-server-unified` first (quant.h-based, fixes #77) and falls back to the legacy `quant-server` (libturboquant-based) only if the unified binary is not found.
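The lookup can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual CLI code: the function name `find_server_binary` and its parameters are hypothetical, the directory list (PATH, then `./build/`, `./build_metal/`, `./build_cpu/`) is taken from the commit message, and the sketch assumes every location is tried for the unified binary before any location is tried for the legacy one.

```python
import os
import shutil
from typing import Optional

# Preference order: unified server first, legacy server as fallback.
SERVER_NAMES = ("quant-server-unified", "quant-server")
# Local build directories checked after PATH (from the commit message).
BUILD_DIRS = ("build", "build_metal", "build_cpu")

def find_server_binary(root: str = ".", path: Optional[str] = None) -> Optional[str]:
    """Return the first server binary found, or None.

    `path` overrides the PATH search string (None means use the environment).
    Assumption: all locations are tried for one name before the next name.
    """
    for name in SERVER_NAMES:
        found = shutil.which(name, path=path)
        if found:
            return found
        for build_dir in BUILD_DIRS:
            candidate = os.path.join(root, build_dir, name)
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                return candidate
    return None
```

With this ordering, a unified binary in `./build_cpu/` wins over a legacy binary in `./build/`, which matches the stated intent of preferring the unified server wherever it exists.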

3. CMake `quant-server-unified` target

Added under `TQ_BUILD_SERVER=ON`. Compiles `tools/quant_server_unified.c` directly against quant.h — no sync divergence possible.

Verified

- ctest → 35/35 passed
- `quant-server-unified` builds (360 KB)
- Python registry confirms Q8_0
- CLI search order: unified → legacy

🤖 Generated with Claude Code

## Phi-3.5 registry → Q8_0 (2x faster)

Q8_0 is 2x faster than Q4_K_M on Apple Silicon NEON (3.0 vs 1.5 tok/s measured on M3). Q4_K_M's complex super-block dequant dominates compute at batch-1, while Q8_0's simple int8 dequant is NEON-friendly. Both produce identical quality output.

- Registry: `Phi-3.5-mini-instruct-Q4_K_M.gguf` (2.2 GB) → `Phi-3.5-mini-instruct-Q8_0.gguf` (3.8 GB)
- Module docstring size updated (2.4 GB → 3.8 GB)

## CLI `serve` → prefers `quant-server-unified`

`quantcpp serve` now searches for `quant-server-unified` first, then falls back to the legacy `quant-server`. The unified server builds directly on quant.h (single-header amalgamation), which fixes #77 (SmolLM2-1.7B regression from libturboquant divergence).

Search order: PATH → ./build/ → ./build_metal/ → ./build_cpu/

## CMake `quant-server-unified` target

Added `quant-server-unified` build target under `TQ_BUILD_SERVER=ON`. Compiles `tools/quant_server_unified.c` directly against quant.h.

## Verified

- ctest → 35/35 passed
- `quant-server-unified` builds (360 KB binary)
- Python registry confirms Q8_0 filename
- CLI `quantcpp serve` prefers unified binary

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

unamedkr merged commit 72e815b into main Apr 12, 2026
2 of 3 checks passed

unamedkr deleted the fix/phi3-q8-default-unified-serve branch April 12, 2026 09:39
