Skip to content

Model selection guide: Phi-4-mini vs Qwen3.5-4B vs Gemma-4-E2B — help users choose #68

@unamedkr

Description

@unamedkr

Summary

Users need guidance on which model to choose for their use case. Based on extensive testing and research, we propose adding a model selection guide to the documentation that helps users make informed decisions based on their priorities: speed, quality, multilingual, or multimodal.

Model Comparison (3B-4B class, Q4_K_M)

Phi-4-mini (3.8B) Qwen3.5-4B Gemma 4 E2B (~2.3B eff)
Developer Microsoft Alibaba Google
Release Feb 2025 Mar 2026 Apr 2026
Vocab Size 200,064 248,320 262,144
Attention GQA (24Q/8KV) GQA + Gated DeltaNet GQA (8Q/1KV)
Context 128K (LongRoPE) 262K (1M extended) 128K
Q4_K_M Size 2.5 GB 2.5 GB 3.2 GB
License MIT Apache 2.0 Gemma

Benchmarks

Benchmark Phi-4-mini Qwen3.5-4B Gemma 4 E2B
MMLU 67.3 ~72-75 ~60
HumanEval (code) 74.4 ~70+ ~55
GSM8K (math) 88.6 ~88+ ~75
MATH 64.0 ~60+ ~50
Multilingual (MGSM) 63.9 Best Good
Korean/Chinese/Japanese Improved Excellent Good
Multimodal No No Yes (img+audio+video)

Speed vs Quality Trade-off

Model Vocab Relative Speed* Quality Best For
Phi-3.5-mini (32K vocab) 32K Fastest Good Speed-first, English
Phi-4-mini (200K vocab) 200K Moderate Better Math/code, multilingual
Qwen3.5-4B (248K vocab) 248K Moderate Best Quality-first, CJK, long docs
Gemma 4 E2B (262K vocab) 262K Moderate Good Multimodal on-device

Speed is heavily influenced by vocab size — the final logit projection (hidden_dim × vocab_size) is the largest matmul per token. Smaller vocab = faster generation.

Proposed Model Selection Guide

We suggest adding a guide (e.g., docs/model_guide.md or a README section) that helps users choose:

Choose by priority:

"I want the fastest response"
Phi-3.5-mini (vocab 32K, ~100+ tok/s on M3)

  • Simplest architecture, maximum compatibility
  • 2024 model, benchmarks are dated but quality is solid for English

"I want the best text quality"
Qwen3.5-4B (vocab 248K, ~60-80 tok/s on M3)

  • Highest benchmarks across MMLU, reasoning, coding
  • DeltaNet hybrid: only 25% of layers need KV cache → 75% memory savings for long context
  • 262K native context (1M extended) — longest available
  • Best Korean/Chinese/Japanese support

"I want strong math and code"
Phi-4-mini (vocab 200K, ~70-90 tok/s on M3)

  • HumanEval 74.4 (best in class), MATH 64.0
  • Massive improvement over Phi-3.5 in math (+14.2) and multilingual (+14.3)
  • MIT license — most permissive

"I need multimodal (images, audio, video)"
Gemma 4 E2B (vocab 262K, ~70-90 tok/s on M3)

  • Only model with native text + image + audio + video at this size
  • Google ecosystem integration (Android AICore)
  • Per-Layer Embeddings for deeper representation

"I need to process very long documents"
Qwen3.5-4B (262K context)

  • DeltaNet layers don't need KV cache → fits longer contexts in same RAM
  • 3.7x longer context than Phi-4-mini/Gemma 4

The vocab size trade-off explained

Users often assume bigger model = slower. In practice:

SmolLM2-1.7B (vocab 49K):  12.5 tok/s  ← faster than 1B Llama!
Llama-3.2-1B (vocab 128K):  2.3 tok/s  ← 5x slower despite fewer params

The output projection layer (hidden_dim × vocab_size) runs every token and dominates generation time on memory-bandwidth-limited hardware like Apple Silicon. When choosing between models of similar parameter count, prefer the one with the smaller vocabulary if speed is your priority.

Architecture Support Status in quant.cpp

Model Architecture quant.h quant-server Status
Phi-3.5-mini phi3 ❌ (#67) quant.h only
Phi-4-mini phi3 ✅ (likely) ❌ (#67) Needs testing
Qwen3.5-4B qwen35 (DeltaNet) ⚠️ ⚠️ Loads but output quality needs verification
Gemma 4 E2B gemma4 (PLE) ⚠️ ⚠️ Loads but output was garbage in testing (#61)

Action items for quant.cpp team:

  1. Port Phi-3 changes to libturboquant (Phi-3 support not propagated from quant.h to libturboquant (quant-server broken) #67) — unlocks Phi-3.5/Phi-4 in quant-server
  2. Verify Qwen3.5-4B DeltaNet inference — architecture is implemented but output quality was poor for 0.8B
  3. Fix Gemma 4 E2B inference — PLE + dual-FFN may have remaining issues (GQA/MQA attention broken — only MHA (Q_heads == KV_heads) produces coherent output #61)
  4. Add model selection guide to docs/README
  5. Consider adding Phi-4-mini to model registry — it's the natural successor to Phi-3.5

Environment

Tested on Apple M3 (16GB). All models fit comfortably at Q4_K_M.


Reported by ClawTeam — based on Claw-1 (Quickstart), Claw-4 (Optimizer), and Claw-5 (Researcher) combined findings

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions