Production LLM inference on the Apple Neural Engine — a practitioner's guide, complete with converters, Swift runtimes, and validated model manifests.
Every model in this repo runs 100% on the Neural Engine (verified with
MLComputePlan). No GPU fallback. No CPU matmuls.
| Model | Type | Params | ANE tok/s | Status |
|---|---|---|---|---|
| Phi-4-mini-instruct | Dense LLM | 3.8B | ~17 | ✅ v1.0 |
| Hy-MT 1.5 | Dense translation | 1.8B | ~34 | ✅ v1.0 |
| ZAYA1-8B | MoE LLM | 8B | ~9 | ✅ v1.0 |
| Privacy Filter | MoE NER / PII | ~1.5B | ~24.6 sent/s | ✅ v1.0 |
Hardware: Apple M4 Max, 48 GB unified memory, macOS 15, Xcode 16.
- macOS 15+ (Sequoia), Apple Silicon
- Xcode 16+ (for
xcrun coremlcompilerand coremltools 9) - Python 3.11+ via Xcode tools
# Build once (downloads weights from HuggingFace, ~3 GB)
/usr/bin/python3 models/privacy-filter/build_scripts/build_pf_packed_alllayers.py
# Extract Swift weights
/usr/bin/python3 models/privacy-filter/build_scripts/extract_pf_swift_weights.py
# Redact a file
bash demo/demo_redact.sh demo/pii_examples.txt# Download GGUF (requires HuggingFace account for Phi-4)
# Place at: models/phi4-mini/Phi-4-mini-instruct.Q8_0.gguf
# Convert all shards (Xcode python3 only)
/usr/bin/python3 converters/phi4_mini_rangedim_export_shard.py --all
# Convert LM head shards
/usr/bin/python3 converters/phi4_mini_lm_head_shards.py
# Check ANE residency
/usr/bin/python3 validators/phi4_mini_residency_check.pyThe Apple Neural Engine Inference Book in book/ is a chapter-by-chapter porting guide for
practitioners who want to port their own models to ANE:
| Chapter | Topic |
|---|---|
| 00 — Modern Inference | Tokens, prefill/decode, KV cache, ANE vs GPU vs CPU, the Conv2d trick |
| 01 — ANE Laws | Empirical rules: shard limits, quantization, residency |
| 02 — Porting Recipe | GGUF → CoreML, step by step |
| 03 — Quantization | INT8 production, INT4 tradeoffs, the silent CPU fallback |
| 04 — Shard Sizing | Layer count vs size, 250 MB limit, LM-head splits |
| 05 — Stateful KV Cache | MLState, Swift daemon design, decode loop |
| 06 — RangeDim + Speculative | Variable T, n-gram acceptance |
| 07 — MoE on ANE | Soft routing, per-expert dispatch, ZAYA & Privacy Filter |
| 08 — Swift Runtime | Cache-friendly CoreML orchestration, state, buffers, and serving |
| 09 — Experiment Index | Searchable index of experiment writeups |
| 10 — Decision Journal | The thinking behind the hard calls |
| Glossary | Definitions for inference, CoreML, ANE, and validation terms |
ane-book/
├── book/ ← the porting guide (chapters 00–10)
├── converters/ ← Python scripts for GGUF → CoreML (Xcode python3)
├── runtime/ ← Swift inference runtimes
├── models/ ← per-model manifests, goldens, build scripts
│ ├── phi4-mini/
│ ├── hymt/
│ ├── zaya/
│ └── privacy-filter/build_scripts/
├── validators/ ← residency checks + quality gates
├── demo/ ← end-user demos
├── research/ ← findings, negative results, ANE internals
└── blogposts/ ← published and draft blog posts
-
ANE-only: every matmul, norm, and activation runs on the Neural Engine.
MLComputePlanmust show 100%ios18.convops on ANE before any benchmark. -
Quality before perf: cosine similarity ≥ 0.97 vs FP16 golden before any benchmarking or model shipping.
-
INT8 per-tensor is the production baseline. INT4 per-block silently falls to CPU on small shards — see research/INT4_SHARD_ANE_BUG.md.
-
Shard size ≤ 250 MB. Above this, ANEF compiler emits error -14.
research/ contains findings that don't fit in the how-to chapters:
- ANE_CHAIN_SCHEMA.md — ObjC runtime reflection of the ANE private API
- ANE_SCALING_FINDINGS.md — 0.5B → 3B scaling limits
- INT4_SHARD_ANE_BUG.md — The silent CPU fallback with INT4 per-block
See LICENSE.