You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Users need guidance on which model to choose for their use case. Based on extensive testing and research, we propose adding a model selection guide to the documentation that helps users make informed decisions based on their priorities: speed, quality, multilingual, or multimodal.
Model Comparison (3B-4B class, Q4_K_M)
Phi-4-mini (3.8B)
Qwen3.5-4B
Gemma 4 E2B (~2.3B eff)
Developer
Microsoft
Alibaba
Google
Release
Feb 2025
Mar 2026
Apr 2026
Vocab Size
200,064
248,320
262,144
Attention
GQA (24Q/8KV)
GQA + Gated DeltaNet
GQA (8Q/1KV)
Context
128K (LongRoPE)
262K (1M extended)
128K
Q4_K_M Size
2.5 GB
2.5 GB
3.2 GB
License
MIT
Apache 2.0
Gemma
Benchmarks
Benchmark
Phi-4-mini
Qwen3.5-4B
Gemma 4 E2B
MMLU
67.3
~72-75
~60
HumanEval (code)
74.4
~70+
~55
GSM8K (math)
88.6
~88+
~75
MATH
64.0
~60+
~50
Multilingual (MGSM)
63.9
Best
Good
Korean/Chinese/Japanese
Improved
Excellent
Good
Multimodal
No
No
Yes (img+audio+video)
Speed vs Quality Trade-off
Model
Vocab
Relative Speed*
Quality
Best For
Phi-3.5-mini (32K vocab)
32K
Fastest
Good
Speed-first, English
Phi-4-mini (200K vocab)
200K
Moderate
Better
Math/code, multilingual
Qwen3.5-4B (248K vocab)
248K
Moderate
Best
Quality-first, CJK, long docs
Gemma 4 E2B (262K vocab)
262K
Moderate
Good
Multimodal on-device
Speed is heavily influenced by vocab size — the final logit projection (hidden_dim × vocab_size) is the largest matmul per token. Smaller vocab = faster generation.
Proposed Model Selection Guide
We suggest adding a guide (e.g., docs/model_guide.md or a README section) that helps users choose:
Choose by priority:
"I want the fastest response"
→ Phi-3.5-mini (vocab 32K, ~100+ tok/s on M3)
Simplest architecture, maximum compatibility
2024 model, benchmarks are dated but quality is solid for English
"I want the best text quality"
→ Qwen3.5-4B (vocab 248K, ~60-80 tok/s on M3)
Highest benchmarks across MMLU, reasoning, coding
DeltaNet hybrid: only 25% of layers need KV cache → 75% memory savings for long context
262K native context (1M extended) — longest available
Best Korean/Chinese/Japanese support
"I want strong math and code"
→ Phi-4-mini (vocab 200K, ~70-90 tok/s on M3)
HumanEval 74.4 (best in class), MATH 64.0
Massive improvement over Phi-3.5 in math (+14.2) and multilingual (+14.3)
MIT license — most permissive
"I need multimodal (images, audio, video)"
→ Gemma 4 E2B (vocab 262K, ~70-90 tok/s on M3)
Only model with native text + image + audio + video at this size
Google ecosystem integration (Android AICore)
Per-Layer Embeddings for deeper representation
"I need to process very long documents"
→ Qwen3.5-4B (262K context)
DeltaNet layers don't need KV cache → fits longer contexts in same RAM
3.7x longer context than Phi-4-mini/Gemma 4
The vocab size trade-off explained
Users often assume bigger model = slower. In practice:
The output projection layer (hidden_dim × vocab_size) runs every token and dominates generation time on memory-bandwidth-limited hardware like Apple Silicon. When choosing between models of similar parameter count, prefer the one with the smaller vocabulary if speed is your priority.
Summary
Users need guidance on which model to choose for their use case. Based on extensive testing and research, we propose adding a model selection guide to the documentation that helps users make informed decisions based on their priorities: speed, quality, multilingual, or multimodal.
Model Comparison (3B-4B class, Q4_K_M)
Benchmarks
Speed vs Quality Trade-off
Speed is heavily influenced by vocab size — the final logit projection (hidden_dim × vocab_size) is the largest matmul per token. Smaller vocab = faster generation.
Proposed Model Selection Guide
We suggest adding a guide (e.g.,
docs/model_guide.mdor a README section) that helps users choose:Choose by priority:
"I want the fastest response"
→ Phi-3.5-mini (vocab 32K, ~100+ tok/s on M3)
"I want the best text quality"
→ Qwen3.5-4B (vocab 248K, ~60-80 tok/s on M3)
"I want strong math and code"
→ Phi-4-mini (vocab 200K, ~70-90 tok/s on M3)
"I need multimodal (images, audio, video)"
→ Gemma 4 E2B (vocab 262K, ~70-90 tok/s on M3)
"I need to process very long documents"
→ Qwen3.5-4B (262K context)
The vocab size trade-off explained
Users often assume bigger model = slower. In practice:
The output projection layer (hidden_dim × vocab_size) runs every token and dominates generation time on memory-bandwidth-limited hardware like Apple Silicon. When choosing between models of similar parameter count, prefer the one with the smaller vocabulary if speed is your priority.
Architecture Support Status in quant.cpp
Action items for quant.cpp team:
Environment
Tested on Apple M3 (16GB). All models fit comfortably at Q4_K_M.
Reported by ClawTeam — based on Claw-1 (Quickstart), Claw-4 (Optimizer), and Claw-5 (Researcher) combined findings