New models to benchmark
For the MobiHoc 2026 paper revision, we should include newer model families to strengthen the generalizability of our findings.
Models to add
Qwen 3.5 family:
- Qwen 3.5 (small variants, ~1-4B) — successor to Qwen 3, likely improved tool calling
- Check mlx-community for INT4 variants
- SGLang should support these with its `qwen` parser
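The INT4 check can be scripted against the Hub listing. A minimal sketch: the filtering function and sample repo IDs are illustrative (they follow the existing Qwen 3 upload naming, since Qwen 3.5 names are not yet confirmed); in practice the input would come from `huggingface_hub.list_models(author="mlx-community", search="Qwen")`.

```python
# Sketch: filter Hub repo listings down to INT4/4-bit MLX variants.
# Hardcoded placeholder IDs are used so the snippet runs offline;
# swap in the live huggingface_hub.list_models() results when checking.
def pick_4bit(repo_ids):
    """Keep repos whose name marks a 4-bit quantization."""
    return [r for r in repo_ids if "4bit" in r.lower() or "int4" in r.lower()]

sample = [
    "mlx-community/Qwen3-4B-4bit",    # naming pattern from the Qwen 3 uploads
    "mlx-community/Qwen3-4B-8bit",
    "mlx-community/Qwen3-0.6B-4bit",
]
print(pick_4bit(sample))  # only the 4-bit repos survive
```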
Gemma 4 family:
- Gemma 4 (small variants) — successor to Gemma 3n
- May resolve the SGLang/vLLM compatibility issues we had with Gemma 3n
- Check if standard attention (no Conv3d) improves runtime support
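The attention question can be triaged from the model config before any runtime work. A minimal sketch with an assumed heuristic and illustrative config dicts: the `gemma4` config shape here is a guess, and the heuristic only mirrors our experience that Gemma 3n's multimodal towers were what broke SGLang/vLLM support.

```python
# Sketch: flag configs that look text-only (standard attention, no
# vision/audio towers). Heuristic and config dicts are illustrative,
# not the real Gemma 4 config.
def looks_text_only(config: dict) -> bool:
    """Treat a config with no vision/audio sub-configs as text-only."""
    multimodal_keys = {"vision_config", "audio_config"}
    return not (multimodal_keys & config.keys())

# Gemma 3n-style config: multimodal towers present, so runtimes struggled.
gemma3n_like = {"model_type": "gemma3n", "vision_config": {}, "audio_config": {}}
# Hoped-for Gemma 4 text config: plain decoder attention only.
gemma4_hoped = {"model_type": "gemma4", "num_attention_heads": 16}

print(looks_text_only(gemma3n_like))  # False
print(looks_text_only(gemma4_hoped))  # True
```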
What to do
- Check HuggingFace for available model variants and sizes
- Add model configs to `MLX_MODEL_MAP` and the `MODELS` lists
- Create optimized prompts in `optimized_prompts.py`
- Validate tool calling works on both MLX and SGLang
- Run full benchmark sweep
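The config step might look like the following sketch. Every repo ID below is a placeholder to be confirmed on HuggingFace, and the real `MLX_MODEL_MAP` and `MODELS` definitions live in the benchmark code, not here.

```python
# Hypothetical additions to the benchmark config. Repo IDs are placeholders:
# confirm exact names on HuggingFace/mlx-community before running the sweep.
MLX_MODEL_MAP = {
    # ... existing Qwen 3 / Llama 3.2 / DeepSeek R1 / Gemma 3n entries ...
    "qwen3.5-4b": "mlx-community/Qwen3.5-4B-Instruct-4bit",  # placeholder
    "qwen3.5-1b": "mlx-community/Qwen3.5-1B-Instruct-4bit",  # placeholder
    "gemma4-2b": "mlx-community/gemma-4-2b-it-4bit",         # placeholder
}

# Keep the sweep list in sync with the MLX map.
MODELS = sorted(MLX_MODEL_MAP)
print(MODELS)
```

Deriving `MODELS` from the map keeps the two lists from drifting apart as families are added.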
Context
- Current models: Qwen 3 (4B, 0.6B), Llama 3.2 3B, DeepSeek R1 1.5B, Gemma 3n E2B
- Newer models would test whether our structural findings (capability threshold, prefill dominance, inverse efficiency) hold across model generations
- Reviewer A: "the paper feels like a snapshot in time" — newer models address this