A native Rust multi-runtime inference layer built as a supervised
actor topology on top of atomr.
atomr-infer gives you a single mental model — one `Deployment` value
object, one routing CRDT, one supervision tree — that scales from a
single OpenAI-key script to a heterogeneous cluster blending owned GPU
hardware with managed APIs. The same `actor_ref.tell(msg)` lands a
request on an H100 two racks away or in another company's data center.
```rust
use atomr_infer::prelude::*;

// Same value object describes a vLLM-on-4×H100 replica or a Gemini
// Vertex deployment. The `runtime` field is the only thing that
// changes — and it's auto-inferred from the model name when omitted.
let dep = Deployment {
    name: "gpt-4o-mini".into(),
    model: "gpt-4o-mini".into(),
    runtime: None,
    runtime_config: None,
    gpus: None,
    replicas: 1,
    serving: Serving::default(),
    budget: None,
    idempotent: true,
};
```

Production AI rarely runs only on owned hardware. Frontier models, burst capacity, and compliance edge cases all push work onto managed APIs. Bolting on a provider SDK with its own retry / rate-limit / observability stack, separate from the one your local GPU pool uses, fragments the system — and the cracks are exactly where the 3 a.m. pages come from.
Heterogeneous workloads are the norm, not the exception. vLLM on a
DGX node, a Candle CPU model in a sidecar, an OpenAI call for the long
tail of hard prompts, an Anthropic fallback when OpenAI rate-limits —
that's one application, but today it's three SDKs, three retry
policies, three observability stacks. atomr-infer treats every runtime
as just another `ModelRunner` (sketched below). The gateway, request
actor, and routing CRDT don't know — and don't care — whether a
request lands on a local GPU or a remote API.
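A minimal sketch of that uniformity, assuming a simplified trait shape and placeholder request/response types; the real `ModelRunner` trait and typed `InferenceError` in `atomr-infer-core` carry more structure than this:

```rust
use async_trait::async_trait; // assumes the async-trait crate

// Placeholder types for illustration; `atomr-infer-core` defines richer
// ones (typed errors, batch primitives, streaming responses).
pub struct InferRequest { pub prompt: String }
pub struct InferResponse { pub text: String }

#[derive(Debug)]
pub enum InferenceError { RateLimited, CircuitOpen, Backend(String) }

#[async_trait]
pub trait ModelRunner: Send + Sync {
    async fn infer(&self, req: InferRequest) -> Result<InferResponse, InferenceError>;
}

// A remote-API runner and a local-GPU runner implement the same trait,
// so the gateway and routing CRDT never branch on where a model lives.
pub struct RemoteApiRunner;

#[async_trait]
impl ModelRunner for RemoteApiRunner {
    async fn infer(&self, req: InferRequest) -> Result<InferResponse, InferenceError> {
        // In reality: HTTP/2 + SSE to the provider, mapping 429 to RateLimited.
        Ok(InferResponse { text: format!("remote answer to: {}", req.prompt) })
    }
}
```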
Cost, latency, and reliability are coupled. A pipeline that
classifies cheaply on a local model and escalates to GPT-4o for hard
cases is also the pipeline that needs to fall back to Anthropic when
OpenAI is saturated and shed traffic when the hourly budget hits its
cap. Threading those concerns together by hand produces brittle glue.
atomr-infer encodes them as composable actors —
`InferenceCascade`, `RateLimiterActor` (CRDT-backed),
`CircuitBreakerActor`, `Budget { on_exceeded: Reject }` — under one
supervision tree with one trace and one backpressure story.
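To make that composition concrete, here is a self-contained sketch. Only the names `InferenceCascade`, `Budget`, and the `Reject` policy come from the crate; every builder method, field, and struct body below is invented for illustration:

```rust
// Hypothetical wiring, not the published API.
#[allow(dead_code)]
enum OnExceeded { Reject }

struct Budget { hourly_usd: f64, on_exceeded: OnExceeded }

struct InferenceCascade { tiers: Vec<&'static str>, budget: Option<Budget> }

impl InferenceCascade {
    fn over(tiers: Vec<&'static str>) -> Self { Self { tiers, budget: None } }
    fn with_budget(mut self, budget: Budget) -> Self { self.budget = Some(budget); self }
}

fn main() {
    // Cheap local tier first, escalate to GPT-4o for hard cases, fall back
    // to Anthropic when OpenAI saturates; reject outright past budget.
    let cascade = InferenceCascade::over(vec![
        "local-classifier",
        "gpt-4o",
        "claude-3-5-sonnet",
    ])
    .with_budget(Budget { hourly_usd: 25.0, on_exceeded: OnExceeded::Reject });
    assert!(cascade.budget.is_some());
    println!("{} tiers configured", cascade.tiers.len());
}
```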
Granular efficiency. Rust gives us deterministic resource use,
zero-cost abstractions, and ownership-as-concurrency-safety. Per-actor
footprint stays small; per-message cost stays low. The remote-network
tier is HTTP/2 + SSE + connection pooling with structured retry; the
local-GPU tier rides on top of atomr-accel's two-tier
device supervision. A `cargo build --features remote-only` produces a
binary with zero `cudarc`, zero `atomr-accel`, zero `candle`, zero
`pyo3` in the dependency graph — the layered crate split makes the
invariant load-bearing, not aspirational.
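The load-bearing part is ordinary Cargo feature gating. A simplified, hypothetical illustration of the pattern (the crate applies it across the workspace via optional dependencies, not inside one file, and `available_runtimes` is not a real function):

```rust
// Illustrative compile-time gating. The `openai` and `vllm` feature
// names match the crate's flags; everything else here is invented.
#[allow(unused_mut)]
pub fn available_runtimes() -> Vec<&'static str> {
    let mut runtimes = Vec::new();
    #[cfg(feature = "openai")] // HTTP-only: no cudarc/candle/pyo3 pulled in
    runtimes.push("openai");
    #[cfg(feature = "vllm")] // pulls atomr-accel and the GPU stack
    runtimes.push("vllm");
    runtimes
}
```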
| Crate | What it does |
|---|---|
| `atomr-infer` | Umbrella facade re-exporting the public surface, feature-flag-driven |
| `atomr-infer-core` | `Deployment` value object, `ModelRunner` trait, typed `InferenceError`, batch primitives |
| `atomr-infer-runtime` | Gateway, request actor, dp-coordinator, engine-core, two-tier worker, placement, deployment manager, metrics |
| `atomr-infer-remote-core` | Distributed rate limiter (CRDT), circuit breaker, retry/backoff, SSE parser, session lifecycle |
| `atomr-infer-runtime-{openai,anthropic,gemini,litellm}` | Per-provider `ModelRunner` against api.openai.com, api.anthropic.com, Vertex AI / AI Studio, and the LiteLLM proxy |
| `atomr-infer-runtime-{vllm,tensorrt,ort,mistralrs}` | Per-backend `ModelRunner` for local Rust-native and FFI runtimes; feature-gated so absent system libs don't break the workspace |
| `atomr-infer-pipeline` | atomr-streams integration plus `DynamicBatchingServer` / `InferenceCascade` / `ModelReplicaPool` / `FairShareScheduler` / `ModelHotSwapServer` / `SpeculativeDecoder` blueprints |
| `atomr-infer-testkit` | `MockRunner` + wiremock-backed provider mocks (`inject_429`, `inject_5xx`, …) |
| `atomr-infer-cli` | `atomr-infer serve --config <toml>` |
| `atomr-infer-py-bindings` | PyO3 bindings for `Cluster` / `Deployment` |
| `atomr-infer-python-bridge` | `PythonGpuBridge` + python-pinned dispatcher for vLLM-style runners |
Plus a Python facade — `pip install atomr-infer` — that exposes the
same `Cluster.connect(...).deploy(deployment)` shape from Python.
The umbrella crate is published on crates.io as `atomr-infer`:

```toml
[dependencies]
atomr-infer = { version = "0.3", features = ["openai", "anthropic", "pipeline"] }
```

Or pull in subsystem crates directly — `atomr-infer-core`,
`atomr-infer-runtime`, `atomr-infer-remote-core`, and the four
`atomr-infer-runtime-{openai,anthropic,gemini,litellm}` providers are
all on crates.io.
```rust
use atomr_infer::prelude::*;

# async fn run() -> Result<(), Box<dyn std::error::Error>> {
let cluster = Cluster::create("inference", Config::empty()).await?;
cluster.deploy(Deployment {
    name: "gpt-4o-mini".into(),
    model: "gpt-4o-mini".into(),
    replicas: 1,
    ..Default::default()
}).await?;
cluster.serve("0.0.0.0:8080").await?;
# Ok(()) }
```
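Because the gateway speaks the OpenAI wire format, any OpenAI-style client can point at it. A sketch with `reqwest` and `tokio`; the `/v1/chat/completions` route is assumed from "OpenAI-compatible" rather than confirmed:

```rust
// Hit the locally served gateway with a plain HTTP client.
// Requires reqwest (with the "json" feature), tokio, and serde_json.
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let resp = reqwest::Client::new()
        .post("http://127.0.0.1:8080/v1/chat/completions") // assumed route
        .json(&json!({
            "model": "gpt-4o-mini",
            "messages": [{ "role": "user", "content": "hello" }]
        }))
        .send()
        .await?;
    println!("{}", resp.text().await?);
    Ok(())
}
```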
```bash
# OpenAI-compatible gateway over real (or mocked) providers.
cargo run -p atomr-infer-cli --features all-remote -- serve --config demo.toml

# End-to-end demo (happy path / 429 retry / circuit-open) without
# spending a cent — wiremock under the hood.
cargo run --bin remote_only_demo

# Pure-remote binary, zero GPU deps in the graph.
cargo build -p atomr-infer --no-default-features --features remote-only
```

```bash
python -m venv .venv && source .venv/bin/activate
pip install atomr-infer
```

```python
from atomr_infer import Cluster, Deployment

cluster = Cluster.connect("inproc://app")
cluster.deploy(Deployment(name="gpt-4o-mini", model="gpt-4o-mini", replicas=1))
```

The 0.3 surface is intentionally narrow — `Deployment` value objects
and `Cluster.connect(...).deploy(...)`. Decorators and direct
`ActorRef` escape hatches land as the underlying Rust surface
stabilises.
```bash
# Rust
cargo build --workspace
cargo test --workspace
cargo build -p atomr-infer --no-default-features --features remote-only  # zero-GPU build

# Python bindings (requires maturin + a Python dev toolchain)
maturin develop --release
pytest python/tests -v
```

The workspace splits into layers so a remote-only egress server pulls no GPU dependencies whatsoever, while a heterogeneous cluster pulls exactly the runtimes it serves. Three preset shapes:
| Preset | What you get | What you skip |
|---|---|---|
| `remote-only` | OpenAI + Anthropic + Gemini + LiteLLM + pipeline + rate-limiting / circuit-breaker / cost tracking | All GPU code |
| `default-prod` | vLLM + TensorRT + ORT + OpenAI + Anthropic + pipeline | Other GPU runtimes; LiteLLM; Gemini |
| `all-runtimes` | Everything | — |
Detailed feature matrix: `docs/feature-matrix.md`.
```text
crates/                        Rust workspace
  atomr-infer-core/            foundation: traits, types, no actor / GPU / HTTP deps
  atomr-infer-runtime/         gateway, request, dp-coordinator, two-tier worker
  atomr-infer-remote-core/     rate limiter (CRDT), circuit breaker, retry, SSE
  atomr-infer-runtime-*/       per-provider / per-backend ModelRunner
  atomr-infer-pipeline/        atomr-streams + batching/cascade/replica blueprints
  atomr-infer-testkit/         MockRunner + wiremock-backed provider mocks
  atomr-infer-cli/             atomr-infer serve --config <toml>
  atomr-infer-py-bindings/     PyO3 bindings
  atomr-infer/                 rollup
ai-skills/                     Claude / Cursor / Codex / Gemini SKILL.md bundle
docs/                          Architecture (RFC v4), feature matrix, deployment guide
examples/remote_only_demo/     end-to-end happy-path / 429 / circuit-open demo
xtask/                         Cargo xtask (audit, bump, verify, release-checklist)
```
If you're using Claude Code, Cursor, or another AI coding assistant on
a project that depends on atomr-infer, install our
ai-skills bundle — seven skills covering quickstart,
choosing a runtime, wiring remote providers, composing pipelines,
deployment, typed-error troubleshooting, and extending with a new
backend.
```
/plugin marketplace add rustakka/atomr-infer
/plugin install atomr-infer-ai-skills@atomr-infer
```
Each `SKILL.md` is a thin router into the canonical docs. Other
harnesses (Cursor, Codex CLI, Gemini CLI, Aider, etc.) have install
instructions in `ai-skills/README.md`.
Companion bundles for the broader stack:

- atomr ai-skills — actor design, supervision, persistence, clustering, Python bindings.
- atomr-accel ai-skills — `DeviceActor`, kernel selection, two-tier GPU supervision, backend choice.
- `docs/feature-matrix.md` — every feature flag, what it pulls into the dep graph, when to enable it.
- `docs/rustakka-inference-architecture-v4.md` — full RFC v4 design (1,400+ lines): supervision tree, routing CRDT, distributed rate limiter, hybrid pipelines.
- `RELEASING.md` — versioning, allowlist, secrets, emergency-release runbook.
- `CONTRIBUTING.md` — dev setup, conventional commits, audit baselines.
Apache-2.0. See LICENSE.