New models to benchmark
For the MobiHoc 2026 paper revision, we should include newer model families to strengthen the generalizability of our findings.
Models to add
Qwen 3.5 family:
- Qwen 3.5 (small variants, ~1-4B) — successor to Qwen 3, likely improved tool calling
- Check mlx-community for INT4 variants
- SGLang should support these with its `qwen` parser
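The INT4 check can be scripted against the Hub listing. A minimal sketch: the filtering function and sample repo IDs are illustrative (they follow the existing Qwen 3 upload naming, since Qwen 3.5 names are not yet confirmed); in practice the input would come from `huggingface_hub.list_models(author="mlx-community", search="Qwen")`.

```python
# Sketch: filter Hub repo listings down to INT4/4-bit MLX variants.
# Hardcoded placeholder IDs are used so the snippet runs offline;
# swap in the live huggingface_hub.list_models() results when checking.
def pick_4bit(repo_ids):
    """Keep repos whose name marks a 4-bit quantization."""
    return [r for r in repo_ids if "4bit" in r.lower() or "int4" in r.lower()]

sample = [
    "mlx-community/Qwen3-4B-4bit",    # naming pattern from the Qwen 3 uploads
    "mlx-community/Qwen3-4B-8bit",
    "mlx-community/Qwen3-0.6B-4bit",
]
print(pick_4bit(sample))  # only the 4-bit repos survive
```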
Gemma 4 family:
- Gemma 4 (small variants) — successor to Gemma 3n
- May resolve the SGLang/vLLM compatibility issues we had with Gemma 3n
- Check if standard attention (no Conv3d) improves runtime support
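The attention question can be triaged from the model config before any runtime work. A minimal sketch with an assumed heuristic and illustrative config dicts: the `gemma4` config shape here is a guess, and the heuristic only mirrors our experience that Gemma 3n's multimodal towers were what broke SGLang/vLLM support.

```python
# Sketch: flag configs that look text-only (standard attention, no
# vision/audio towers). Heuristic and config dicts are illustrative,
# not the real Gemma 4 config.
def looks_text_only(config: dict) -> bool:
    """Treat a config with no vision/audio sub-configs as text-only."""
    multimodal_keys = {"vision_config", "audio_config"}
    return not (multimodal_keys & config.keys())

# Gemma 3n-style config: multimodal towers present, so runtimes struggled.
gemma3n_like = {"model_type": "gemma3n", "vision_config": {}, "audio_config": {}}
# Hoped-for Gemma 4 text config: plain decoder attention only.
gemma4_hoped = {"model_type": "gemma4", "num_attention_heads": 16}

print(looks_text_only(gemma3n_like))  # False
print(looks_text_only(gemma4_hoped))  # True
```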
What to do
- Check HuggingFace for available model variants and sizes
- Add model configs to `MLX_MODEL_MAP` and the `MODELS` lists
- Create optimized prompts in `optimized_prompts.py`
- Validate tool calling works on both MLX and SGLang
- Run full benchmark sweep
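The config step might look like the following sketch. Every repo ID below is a placeholder to be confirmed on HuggingFace, and the real `MLX_MODEL_MAP` and `MODELS` definitions live in the benchmark code, not here.

```python
# Hypothetical additions to the benchmark config. Repo IDs are placeholders:
# confirm exact names on HuggingFace/mlx-community before running the sweep.
MLX_MODEL_MAP = {
    # ... existing Qwen 3 / Llama 3.2 / DeepSeek R1 / Gemma 3n entries ...
    "qwen3.5-4b": "mlx-community/Qwen3.5-4B-Instruct-4bit",  # placeholder
    "qwen3.5-1b": "mlx-community/Qwen3.5-1B-Instruct-4bit",  # placeholder
    "gemma4-2b": "mlx-community/gemma-4-2b-it-4bit",         # placeholder
}

# Keep the sweep list in sync with the MLX map.
MODELS = sorted(MLX_MODEL_MAP)
print(MODELS)
```

Deriving `MODELS` from the map keeps the two lists from drifting apart as families are added.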
Context
- Current models: Qwen 3 (4B, 0.6B), Llama 3.2 3B, DeepSeek R1 1.5B, Gemma 3n E2B
- Newer models would test whether our structural findings (capability threshold, prefill dominance, inverse efficiency) hold across model generations
- Reviewer A: "the paper feels like a snapshot in time" — newer models address this