quantcpp recommend: vocab-aware model selection for target hardware #88

@unamedkr

Description

Summary

Add a quantcpp recommend command that suggests the optimal model based on hardware specs and user priorities (speed vs quality).

Motivation

Our testing revealed a counter-intuitive finding: vocab size dominates speed, not parameter count.

SmolLM2-1.7B (vocab 49K):   23 tok/s  ← bigger model, FASTER
Llama-3.2-1B (vocab 128K):   2.3 tok/s ← smaller model, SLOWER
Phi-3.5-mini (vocab 32K):    6.5 tok/s ← best speed/quality ratio

Most users don't know this; they pick models by parameter count alone and are then disappointed by the speed.

Proposed UX

quantcpp recommend
# Hardware: Apple M3, 16GB RAM
# Priority: balanced
#
# Recommended: Phi-3.5-mini (Q8_0)
#   Speed:   ~6.5 tok/s
#   Quality: MMLU 65.5, GSM8K 76.9
#   Size:    4.1 GB
#   Vocab:   32K (fastest in 3-4B class)
#
# Alternatives:
#   SmolLM2-1.7B (Q8) — 23 tok/s, lower quality
#   Qwen3-4B (Q4)     — best quality, ~2 tok/s

quantcpp recommend --priority speed
# → SmolLM2-1.7B

quantcpp recommend --priority quality
# → Qwen3-4B (Q4_K_M)
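The selection logic behind the UX above could be sketched as a weighted score over a benchmark table. A minimal illustration (the model table, quality numbers other than Phi-3.5-mini's MMLU, and the weighting scheme are all hypothetical, not proposed API):

```python
# Hypothetical benchmark table: (name, quant, tok/s on M3, quality ~ MMLU).
# Only Phi-3.5-mini's figures come from the measurements in this issue.
MODELS = [
    ("SmolLM2-1.7B", "Q8_0",   23.0, 51.0),
    ("Phi-3.5-mini", "Q8_0",    6.5, 65.5),
    ("Qwen3-4B",     "Q4_K_M",  2.0, 70.0),
]

def recommend(priority: str = "balanced"):
    """Pick a model by weighting normalized speed vs. quality."""
    weights = {"speed": (1.0, 0.0), "quality": (0.0, 1.0), "balanced": (0.5, 0.5)}
    w_speed, w_quality = weights[priority]
    # Normalize each axis to [0, 1] so the weights are comparable.
    max_speed = max(m[2] for m in MODELS)
    max_quality = max(m[3] for m in MODELS)
    def score(m):
        return w_speed * m[2] / max_speed + w_quality * m[3] / max_quality
    return max(MODELS, key=score)

print(recommend("speed")[0])    # SmolLM2-1.7B
print(recommend("quality")[0])  # Qwen3-4B
```

The real command would populate the table from measured benchmarks and filter by available RAM before scoring.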

Implementation

Speed prediction formula (empirically derived), where the `base_*` values come from a reference model benchmarked on the same hardware:

estimated_tok_s = base_tok_s * (base_vocab / model_vocab) * (base_params / model_params)^0.5
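As a direct transcription of the formula (the choice of Phi-3.5-mini as the base model is mine, using the figures measured above):

```python
def estimated_tok_s(base_tok_s: float, base_vocab: int, base_params: float,
                    model_vocab: int, model_params: float) -> float:
    """Estimate speed relative to a base model measured on the same hardware:
    inversely proportional to vocab size, and to the square root of the
    parameter count."""
    return base_tok_s * (base_vocab / model_vocab) * (base_params / model_params) ** 0.5

# Predict Llama-3.2-1B (128K vocab, 1.0B params) from Phi-3.5-mini
# (32K vocab, 3.8B params, 6.5 tok/s on M3):
est = estimated_tok_s(6.5, 32_000, 3.8e9, 128_000, 1.0e9)
print(round(est, 1))  # 3.2
```

Note the formula over-predicts here (≈3.2 vs. the measured 2.3 tok/s), so it should be treated as a rough ranking heuristic rather than a precise estimate.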

Priority: P2

Labels: enhancement (New feature or request)