
FFAI

Fucking Fast Apple Inference.

A minimal, dependency-light LLM inference library for Apple Silicon, built on pre-compiled Metal kernels generated from the metaltile DSL. No Python. No MLX. No C compilation. No JIT. No four-repo dependency chain.

Just really fucking fast AI! 🚀

Status

Early bootstrap — Phase 4 complete (end-to-end inference + perf pass).

Features

| Functionality | Description | Status |
| --- | --- | --- |
| Apple Silicon native | Built ground-up for M-series GPUs. No fallbacks dragging it down. | ✅ |
| Pre-compiled kernels | The Metal kernels ship ready-to-run. No JIT delay the first time you load a model. | ✅ |
| One-line model loading | `Model.load("org/repo")` and you're generating. Download, cache, tokenizer, prewarm — one async call. | ✅ |
| HuggingFace native | Pull any compatible model straight from the Hub. Same cache as Python. | ✅ |
| 3 / 4 / 5 / 6 / 8-bit quantization | Run beefy models on lean machines. The mlx-community quants you're already using. | ✅ |
| Single-buffer-per-token dispatch | Forward + sample on one Metal command buffer per token. Only 4 bytes cross the CPU↔GPU boundary. | ✅ |
| Capability-driven hot loading/unloading | Only load what you'll use. Add and remove vision and audio encoders as you need them. | ✅ |
| Async lifecycle stream | Real progress for your UI — download, load, ready — as an `AsyncStream`. | ✅ |
| Built-in performance profiling | Run benchmarks with the FFAI CLI and collect performance telemetry during inference. | ✅ |
| Streaming generation | Token-by-token streaming across all supported models. | ✅ |
| Quantized KV cache | Squeeze long contexts into a fraction of the memory. Affine 4/6/8-bit + TurboQuant. | 🚧 Phase 5 |
| Hybrid models (GDN + SSM) | Qwen 3.5, Mamba, NemotronH — the families that mix attention with recurrence. | 🚧 Phase 5 |
| Vision (multi-modal) | Drop in an image, get text back. Qwen 2.5-VL / 3.5-VL first. | 🚧 Phase 6 |
| Audio in / out | Whisper-style speech-to-text and text-to-speech. | 🚧 Phase 8+ |
| Speculative decoding | Faster generation via n-gram lookup + draft models. | 🚧 Phase 8+ |
| Autotuner | Per-shape kernel tuning so you never leave perf on the table. | 🚧 Phase 7 |
| GGUF support | Run llama.cpp's quants directly. | 🚧 Phase 8+ |
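The async lifecycle stream is worth a concrete sketch. The stream accessor and event cases below (`loadStream`, `downloading`, `loading`, `ready`) are illustrative assumptions, not FFAI's confirmed API; the README only promises download/load/ready progress delivered as an `AsyncStream`:

```swift
import FFAI

// Hypothetical consumption of the load-lifecycle stream.
// `Model.loadStream` and the event cases are placeholders for illustration.
for await event in Model.loadStream("unsloth/Llama-3.2-1B") {
    switch event {
    case .downloading(let fraction):
        print("downloading: \(Int(fraction * 100))%")  // drive a progress bar
    case .loading:
        print("loading weights…")
    case .ready(let model):
        print("ready: \(model)")                       // safe to generate now
    }
}
```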

For the longer-form view of what's shipped vs planned, see planning/roadmap.md. For the per-topic deep-dives (KV cache, quantization, performance, capabilities) see documentation/.

Quick Start

Install via SwiftPM:

```swift
.package(url: "https://github.com/thewafflehaus/FFAI", from: "0.1.0")
```
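In a full `Package.swift`, the dependency and target wiring might look like this (a sketch; everything except the package URL and version is a placeholder):

```swift
// swift-tools-version: 5.9
import PackageDescription

let package = Package(
    name: "MyApp",              // placeholder target name
    platforms: [.macOS(.v14)],  // assumed minimum platform
    dependencies: [
        .package(url: "https://github.com/thewafflehaus/FFAI", from: "0.1.0")
    ],
    targets: [
        .executableTarget(
            name: "MyApp",
            dependencies: [.product(name: "FFAI", package: "FFAI")]
        )
    ]
)
```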

Then generate text in five lines:

```swift
import FFAI

let model = try await Model.load("unsloth/Llama-3.2-1B")
let result = try await model.generate(
    prompt: "Once upon a time",
    parameters: model.defaultGenerationParameters.with { $0.maxTokens = 64 }
)
print(result.text)
print("\(result.tokensPerSecond) tok/s")
```

Model.load resolves the HuggingFace repo, downloads the snapshot (or hits the cache), parses config.json, mmap-loads weights into per-tensor MTLBuffers, attaches the tokenizer, and prewarms the PSO cache. The first call costs a few seconds; subsequent loads of the same repo are near-instant.
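Streaming follows the same shape. The `generateStream` name below is an assumption for illustration, not the confirmed API; quickstart.md documents the real one:

```swift
import FFAI

// Hypothetical streaming sketch; see quickstart.md for the actual API.
let model = try await Model.load("unsloth/Llama-3.2-1B")
for try await token in model.generateStream(prompt: "Once upon a time") {
    print(token, terminator: "")  // emit tokens as they arrive
}
```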

CLI equivalent (the ffai executable target):

```sh
ffai --model unsloth/Llama-3.2-1B --prompt "Once upon a time"
```

See quickstart.md for streaming, chat templates, capability gating, and lower-level forward APIs. Using a non-default cache directory (external SSD, shared cache between Python tools, etc.)? See Custom model cache path.

Models Supported

Two architecture families ship today; both run real HuggingFace checkpoints end-to-end. Adding a new family is one Swift file plus test fixtures — see adding-a-model.md.

| Family | Variants | Sizes | Quantizations |
| --- | --- | --- | --- |
| Llama 3.x (`Llama.swift`) | LlamaDense (GQA + RoPE3 scaling + RMSNorm + SwiGLU MLP) | 1B / 3B / 8B / 70B | bf16 / 8bit / 6bit / 5bit / 4bit / 3bit |
| Qwen 3 (`Qwen3.swift`) | Qwen3Dense (Llama core + per-head q_norm/k_norm) | 0.6B / 1.7B / 4B / 8B / 14B / 32B | bf16 / 8bit / 6bit / 5bit / 4bit / 3bit |

Quant layouts follow the mlx-community packed-uint32 format (weights + scales + biases per group). Pass any HuggingFace repo ID and the loader resolves architecture, downloads the snapshot, and routes to the right family. See models.md for the full matrix and known gaps.
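The packed-uint32 affine layout is easiest to see in code. The function below is an illustrative CPU-side dequantizer, not FFAI's Metal kernel; it assumes 4-bit weights packed eight to a `UInt32`, low nibble first, with one scale and bias per group of 32 weights:

```swift
// Illustrative 4-bit affine dequantization (assumed layout, not FFAI's kernel).
func dequantize4bit(packed: [UInt32],
                    scales: [Float],
                    biases: [Float],
                    groupSize: Int = 32) -> [Float] {
    var out: [Float] = []
    out.reserveCapacity(packed.count * 8)
    for (i, word) in packed.enumerated() {
        for nibble in 0..<8 {                              // 8 codes per UInt32
            let q = Float((word >> (4 * nibble)) & 0xF)    // 4-bit code, 0…15
            let group = (i * 8 + nibble) / groupSize
            out.append(q * scales[group] + biases[group])  // w ≈ q·scale + bias
        }
    }
    return out
}
```

Higher bit widths follow the same pattern with a different codes-per-word count and mask.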

Coming next (per planning/plan.md): Qwen 3.5 hybrid (GDN + attention), Qwen 3.5 MoE, Mistral, Phi, Gemma, vision (Qwen 2.5/3.5-VL), audio (Whisper / Qwen-Omni).

High Level Architecture

```
┌─────────────────────────────────────────────────────────┐
│  FFAI (Swift)                                           │
│   • Tensor (MTLBuffer-backed)                           │
│   • Module / Linear / Embedding / RMSNorm               │
│   • Model definitions (Llama, Qwen, …)                  │
│   • SafeTensors loader                                  │
│   • KV cache, sampling, generate loop                   │
└────────────────────────┬────────────────────────────────┘
                         │ calls
┌────────────────────────▼────────────────────────────────┐
│  MetalTileSwift (Swift, in-repo)                        │
│   • Loads kernels.metallib (pre-compiled at build time) │
│   • PSO cache, function-constant specialization         │
│   • Generated typed wrappers (one per kernel)           │
└────────────────────────┬────────────────────────────────┘
                         │ resources from
┌────────────────────────▼────────────────────────────────┐
│  metaltile (Rust, sibling repo)                         │
│   • #[kernel] DSL → IR → MSL                            │
│   • `tile build --emit all` (metaltile-cli) produces:   │
│       kernels.metallib   (compiled by xcrun metal)      │
│       manifest.json      (kernel metadata)              │
│       MetalTileKernels.swift  (typed wrappers)          │
└─────────────────────────────────────────────────────────┘
```

For the longer-form view (build pipeline, model load sequence, inference dispatch loop) see planning/architecture.md and documentation/architecture.md.

Contributing

Read CONTRIBUTING.md first — it covers:

  • the community guidelines;
  • issue-first rule;
  • what good PRs look like;
  • how we deal with AI-assisted contributions; and
  • how to get started! 🚀

License

Apache-2.0.
