Pure C/Metal inference engine for running large Qwen MoE models on Apple Silicon by streaming routed experts from SSD.
Qwen3.5-35B-A3B on a $600 Mac mini (M4, 16GB): 11.5 tok/s sustained, 2.5s TTFT, production-quality output with tool calling.
This is a 2.6x speedup over the original M3 Max baseline, on lower-cost hardware.

| Machine | Model | K (active experts) | Sustained tok/s | TTFT | Notes |
|---|---|---|---|---|---|
| M3 Max MacBook Pro (48GB, original) | Qwen3.5-35B-A3B-4bit | 4 | 4.4 | ~5.6s | Original public baseline |
| M4 Mac mini (16GB, current) | Qwen3.5-35B-A3B-4bit | 6 | 11.5 | 2.5s | Current production setup |

| Machine | CPU/GPU | Unified Memory | Role |
|---|---|---|---|
| MacBook Pro (M3 Max) | M3 Max | 48GB | Original bring-up + baseline optimization |
| Mac mini (M4) | M4 | 16GB | Current optimized runtime target |

- Qwen3.5-35B-A3B MoE inference implemented in C/Objective-C + Metal.
- Non-expert weights are loaded once (`model_weights.bin`); expert weights stream from SSD at token time.
- Routing K is runtime-configurable (`--k`).
- Original M3 tuning focused on K=4.
- Current M4 production setup uses K=6.
- Pipeline remains SSD-aware and unified-memory-aware: optimize total token latency, not isolated kernel microbenchmarks.
- `tg128` matvec kernels for better threadgroup utilization.
- Encoder coalescing to reduce launch/synchronization overhead in prefill/encode paths.
- Kernel fusion in critical hot paths to cut memory traffic and CPU-GPU handoff overhead.
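The routing step described above (pick K experts per token, then fetch only their weights) can be sketched in numpy. This is an illustrative sketch, not the engine's C/Metal code; all function and variable names here are invented:

```python
# Illustrative top-K expert routing, as in an MoE layer. K mirrors the
# runtime --k flag; the real engine does this in C/Metal, not Python.
import numpy as np

def route_topk(router_logits: np.ndarray, k: int):
    """Select the k highest-scoring experts and softmax-normalize their gates."""
    topk = np.argpartition(router_logits, -k)[-k:]  # indices of the k largest logits
    gates = np.exp(router_logits[topk] - router_logits[topk].max())
    return topk, gates / gates.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=128)              # e.g. 128 routed experts in a layer
experts, gates = route_topk(logits, k=6)   # K=6, the current production setting

print(sorted(experts.tolist()))            # the 6 experts whose weights get streamed
print(float(gates.sum()))                  # gate weights normalize to ~1.0
```

Only the experts in `experts` need their weights read from SSD for this token, which is why total token latency tracks SSD read latency for K expert blocks rather than full-model bandwidth.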

```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install numpy tokenizers

export MODEL_DIR="${MODEL_DIR:-$HOME/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit/snapshots/<snapshot_id>}"

python3 build_expert_index_35b.py --model-path "$MODEL_DIR" --out expert_index_35b.json
python3 repack_experts_35b.py --index expert_index_35b.json
python3 metal_infer/extract_weights_35b.py --model "$MODEL_DIR" --output metal_infer/out_35b
python3 metal_infer/export_tokenizer_35b.py "$MODEL_DIR/tokenizer.json" metal_infer/tokenizer.bin
python3 metal_infer/export_vocab_35b.py "$MODEL_DIR/tokenizer.json" metal_infer/vocab.bin

cd metal_infer
make infer chat
```
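Before launching the engine, it can help to confirm the export steps produced every artifact the run command expects. A small illustrative check (paths are those used by the commands above; run it from the repo root):

```python
# Illustrative sanity check: list any missing artifacts from the export
# steps above. Paths match the flags passed to the run command.
import os

required = [
    "expert_index_35b.json",
    "metal_infer/out_35b/model_weights.bin",
    "metal_infer/out_35b/model_weights.json",
    "metal_infer/tokenizer.bin",
    "metal_infer/vocab.bin",
]
missing = [p for p in required if not os.path.exists(p)]
print("missing artifacts:", missing or "none")
```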
```bash
cd ..
./metal_infer/infer \
  --model "$MODEL_DIR" \
  --weights metal_infer/out_35b/model_weights.bin \
  --manifest metal_infer/out_35b/model_weights.json \
  --vocab metal_infer/vocab.bin \
  --k 6 \
  --serve 8000
```

```bash
curl -N http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Explain mixture-of-experts in one paragraph."}],"max_tokens":128,"stream":true}'
```

- Core runtime: `metal_infer/infer.m`, `metal_infer/shaders.metal`
- Chat client: `metal_infer/chat.m`
- Benchmark helper: `bench.sh`
- Experiment notes: `docs/optimization-experiments-q4.md`
- Technical paper: `paper/flash_moe.pdf`
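For callers who prefer Python to curl, a minimal streaming client can consume the `--serve` endpoint. This is a hedged sketch: it assumes the server emits OpenAI-style SSE chunks (`data: {...}` lines with a `[DONE]` terminator and `choices[0].delta.content` fields), which the curl example suggests but the source does not confirm; adjust field names to the server's actual schema:

```python
# Sketch of a streaming client for the local /v1/chat/completions endpoint.
# The SSE chunk shape ("choices"/"delta"/"content") is an assumption.
import json
import urllib.request

def parse_sse_line(line: bytes):
    """Decode one SSE line; return its JSON payload, or None for non-data lines."""
    line = line.strip()
    if not line.startswith(b"data:"):
        return None
    payload = line[len(b"data:"):].strip()
    if payload == b"[DONE]":
        return None
    return json.loads(payload)

def stream_chat(prompt: str,
                url: str = "http://127.0.0.1:8000/v1/chat/completions"):
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            chunk = parse_sse_line(line)
            if chunk is not None:
                print(chunk["choices"][0]["delta"].get("content", ""),
                      end="", flush=True)

# Usage (requires the server from the run command above to be listening):
#   stream_chat("Explain mixture-of-experts in one paragraph.")
```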
MIT — see LICENSE.