Summary
Add a quantcpp recommend command that suggests the optimal model based on hardware specs and user priorities (speed vs quality).
Motivation
Our testing revealed a counter-intuitive finding: vocabulary size, not parameter count, dominates generation speed.
SmolLM2-1.7B (vocab 49K): 23 tok/s ← bigger model, FASTER
Llama-3.2-1B (vocab 128K): 2.3 tok/s ← smaller model, SLOWER
Phi-3.5-mini (vocab 32K): 6.5 tok/s ← best speed/quality ratio
Most users don't know this and pick models by parameter count alone, then are disappointed by the speed.
Proposed UX
quantcpp recommend
# Hardware: Apple M3, 16GB RAM
# Priority: balanced
#
# Recommended: Phi-3.5-mini (Q8_0)
# Speed: ~6.5 tok/s
# Quality: MMLU 65.5, GSM8K 76.9
# Size: 4.1 GB
# Vocab: 32K (fastest in 3-4B class)
#
# Alternatives:
# SmolLM2-1.7B (Q8) — 23 tok/s, lower quality
# Qwen3-4B (Q4) — best quality, ~2 tok/s
quantcpp recommend --priority speed
# → SmolLM2-1.7B
quantcpp recommend --priority quality
# → Qwen3-4B (Q4_K_M)
Implementation
Speed prediction formula (empirically derived):
estimated_tok_s = base_tok_s * (base_vocab / model_vocab) * (base_params / model_params)^0.5
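The formula above can be sketched as a small predictor. This is a minimal illustration, not quantcpp code: the function name is hypothetical, and the baseline constants are taken (approximately) from the Llama-3.2-1B measurement in the Motivation section.

```python
# Sketch of the speed predictor. Baseline constants are illustrative,
# calibrated to the Llama-3.2-1B numbers quoted above; the function
# name is hypothetical, not an existing quantcpp API.

BASE_TOK_S = 2.3       # measured tok/s for the baseline model
BASE_VOCAB = 128_000   # baseline vocab size
BASE_PARAMS = 1.2e9    # baseline parameter count (approximate)

def estimated_tok_s(model_vocab: int, model_params: float) -> float:
    """estimated_tok_s = base_tok_s * (base_vocab / model_vocab)
                                    * (base_params / model_params) ** 0.5
    """
    return (BASE_TOK_S
            * (BASE_VOCAB / model_vocab)
            * (BASE_PARAMS / model_params) ** 0.5)

# Sanity check: the baseline model predicts its own measured speed.
print(round(estimated_tok_s(BASE_VOCAB, BASE_PARAMS), 1))  # → 2.3
# A much smaller vocab lifts the estimate even for a 3x larger model:
print(round(estimated_tok_s(32_000, 3.8e9), 1))            # → 5.2
```

The vocab term is linear while the parameter term is only a square root, which encodes the observation driving this proposal: halving the vocab helps speed more than halving the parameter count.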
Priority: P2