
atomr-infer

A native Rust multi-runtime inference layer built as a supervised actor topology on top of atomr. atomr-infer gives you a single mental model — one Deployment value object, one routing CRDT, one supervision tree — that scales from a single OpenAI-key script to a heterogeneous cluster blending owned GPU hardware with managed APIs. The same actor_ref.tell(msg) lands a request on an H100 two racks away or in another company's data center.

use atomr_infer::prelude::*;

// Same value object describes a vLLM-on-4×H100 replica or a Gemini
// Vertex deployment. The `runtime` field is the only thing that
// changes — and it's auto-inferred from the model name when omitted.
let dep = Deployment {
    name: "gpt-4o-mini".into(),
    model: "gpt-4o-mini".into(),
    runtime: None,
    runtime_config: None,
    gpus: None,
    replicas: 1,
    serving: Serving::default(),
    budget: None,
    idempotent: true,
};

Why multi-runtime inference, in Rust, now

Production AI rarely runs only on owned hardware. Frontier models, burst capacity, and compliance edge cases all push work onto managed APIs. Bolting on a provider SDK that carries its own retry / rate-limit / observability stack, separate from the one your local GPU pool uses, fragments the system — and the cracks are exactly where the 3 a.m. pages come from.

Heterogeneous workloads are the norm, not the exception. vLLM on a DGX node, a Candle CPU model in a sidecar, an OpenAI call for the long tail of hard prompts, an Anthropic fallback when OpenAI rate-limits — that's one application, but today it means a separate SDK, a separate retry policy, and a separate observability stack for each runtime. atomr-infer treats every runtime as just-another-ModelRunner. The gateway, request actor, and routing CRDT don't know — and don't care — whether a request lands on a local GPU or a remote API.
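
To make that boundary concrete, here is a minimal sketch of the shape a ModelRunner can take. ModelRunner and InferenceError are names from atomr-infer-core; the request/response types and the exact method signature below are illustrative assumptions, not the published API.

use async_trait::async_trait;

// Hypothetical stand-ins for the real request/response/error types in
// atomr-infer-core; only the names ModelRunner and InferenceError are
// taken from the crate description.
pub struct InferenceRequest { pub prompt: String }
pub struct InferenceResponse { pub text: String }
#[derive(Debug)]
pub struct InferenceError;

// Illustrative trait shape, not the published signature.
#[async_trait]
pub trait ModelRunner: Send + Sync {
    /// One call boundary for every backend: the gateway, request actor,
    /// and routing CRDT only ever see this method, so a local vLLM
    /// replica and a remote OpenAI call are interchangeable behind it.
    async fn infer(&self, req: InferenceRequest) -> Result<InferenceResponse, InferenceError>;
}

A provider crate such as atomr-infer-runtime-openai and a local backend such as atomr-infer-runtime-vllm then differ only in which type implements the trait.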

Cost, latency, and reliability are coupled. A pipeline that classifies cheaply on a local model and escalates to GPT-4o for hard cases is also the pipeline that needs to fall back to Anthropic when OpenAI is saturated and shed traffic when the hourly budget hits. Threading those concerns by hand produces brittle glue. atomr-infer encodes them as composable actors — InferenceCascade, RateLimiterActor (CRDT-backed), CircuitBreakerActor, Budget { on_exceeded: Reject } — under one supervision tree with one trace and one backpressure story.
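
A minimal sketch of the budget half of that story, reusing the Deployment shape from the top of this README. Budget { on_exceeded: Reject } follows the wording above; the hourly_usd field is a hypothetical illustration, and the InferenceCascade and circuit-breaker wiring is omitted here.

use atomr_infer::prelude::*;

// Sketch only: `budget` and `on_exceeded: Reject` follow the paragraph
// above; `hourly_usd` is a hypothetical limit field, not a confirmed API.
let escalation = Deployment {
    name: "gpt-4o-hard-cases".into(),
    model: "gpt-4o".into(),
    replicas: 1,
    budget: Some(Budget {
        hourly_usd: 50.0,     // hypothetical field name, for illustration
        on_exceeded: Reject,  // shed traffic instead of overspending
    }),
    ..Default::default()
};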

Granular efficiency. Rust gives us deterministic resource use, zero-cost abstractions, and ownership-as-concurrency-safety. Per-actor footprint stays small; per-message cost stays low. The remote-network tier is HTTP/2 + SSE + connection pooling with structured retry; the local-GPU tier rides on top of atomr-accel's two-tier device supervision. A cargo build --features remote-only produces a binary with zero cudarc, zero atomr-accel, zero candle, zero pyo3 in the dependency graph — the layered crate split makes the invariant load-bearing, not aspirational.
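
In Cargo.toml terms the invariant is a one-line choice, mirroring the remote-only build command in the quick start below:

[dependencies]
atomr-infer = { version = "0.3", default-features = false, features = ["remote-only"] }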

What's in the box

Crate What it does
atomr-infer Umbrella facade re-exporting the public surface, feature-flag-driven
atomr-infer-core Deployment value object, ModelRunner trait, typed InferenceError, batch primitives
atomr-infer-runtime Gateway, request actor, dp-coordinator, engine-core, two-tier worker, placement, deployment manager, metrics
atomr-infer-remote-core Distributed rate limiter (CRDT), circuit breaker, retry/backoff, SSE parser, session lifecycle
atomr-infer-runtime-{openai,anthropic,gemini,litellm} Per-provider ModelRunner against api.openai.com, api.anthropic.com, Vertex AI / AI Studio, and the LiteLLM proxy
atomr-infer-runtime-{vllm,tensorrt,ort,mistralrs} Per-backend ModelRunner for local Rust-native and FFI runtimes; feature-gated so absent system libs don't break the workspace
atomr-infer-pipeline atomr-streams integration plus DynamicBatchingServer / InferenceCascade / ModelReplicaPool / FairShareScheduler / ModelHotSwapServer / SpeculativeDecoder blueprints
atomr-infer-testkit MockRunner + wiremock-backed provider mocks (inject_429, inject_5xx, …)
atomr-infer-cli atomr-infer serve --config <toml>
atomr-infer-py-bindings PyO3 bindings for Cluster / Deployment
atomr-infer-python-bridge PythonGpuBridge + python-pinned dispatcher for vLLM-style runners

Plus a Python facade — pip install atomr-infer — that exposes the same Cluster.connect(...).deploy(deployment) shape from Python.

Quick start (Rust)

The umbrella crate is published on crates.io as atomr-infer:

[dependencies]
atomr-infer = { version = "0.3", features = ["openai", "anthropic", "pipeline"] }

Or pull in subsystem crates directly — atomr-infer-core, atomr-infer-runtime, atomr-infer-remote-core, and the four atomr-infer-runtime-{openai,anthropic,gemini,litellm} providers are all on crates.io.
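
For example, a service that only ever fronts OpenAI could skip the umbrella and depend on just the pieces it uses. The versions below assume the subsystem crates track the same 0.3 line as the facade:

[dependencies]
# Version numbers are assumed to follow the umbrella's 0.3 line.
atomr-infer-core = "0.3"
atomr-infer-remote-core = "0.3"
atomr-infer-runtime-openai = "0.3"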

use atomr_infer::prelude::*;

# async fn run() -> Result<(), Box<dyn std::error::Error>> {
let cluster = Cluster::create("inference", Config::empty()).await?;
cluster.deploy(Deployment {
    name: "gpt-4o-mini".into(),
    model: "gpt-4o-mini".into(),
    replicas: 1,
    ..Default::default()
}).await?;
cluster.serve("0.0.0.0:8080").await?;
# Ok(()) }

# OpenAI-compatible gateway over real (or mocked) providers.
cargo run -p atomr-infer-cli --features all-remote -- serve --config demo.toml

# End-to-end demo (happy path / 429 retry / circuit-open) without
# spending a cent — wiremock under the hood.
cargo run --bin remote_only_demo

# Pure-remote binary, zero GPU deps in the graph.
cargo build -p atomr-infer --no-default-features --features remote-only

Quick start (Python)

python -m venv .venv && source .venv/bin/activate
pip install atomr-infer

from atomr_infer import Cluster, Deployment

cluster = Cluster.connect("inproc://app")
cluster.deploy(Deployment(name="gpt-4o-mini", model="gpt-4o-mini", replicas=1))

The 0.3 surface is intentionally narrow — Deployment value objects and Cluster.connect(...).deploy(...). Decorators and direct ActorRef escape hatches land as the underlying Rust surface stabilises.

Building from source

# Rust
cargo build --workspace
cargo test --workspace
cargo build -p atomr-infer --no-default-features --features remote-only  # zero-GPU build

# Python bindings (requires maturin + a Python dev toolchain)
maturin develop --release
pytest python/tests -v

Crate-layer picker

The workspace splits into layers so a remote-only egress server pulls no GPU dependencies whatsoever, while a heterogeneous cluster pulls exactly the runtimes it serves. Three preset shapes:

Preset What you get What you skip
remote-only OpenAI + Anthropic + Gemini + LiteLLM + pipeline + rate-limiting / circuit-breaker / cost tracking All GPU code
default-prod vLLM + TensorRT + ORT + OpenAI + Anthropic + pipeline Other GPU runtimes; LiteLLM; Gemini
all-runtimes Everything Nothing

Detailed feature matrix: docs/feature-matrix.md.
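
Assuming the preset names map one-to-one onto cargo features of the umbrella crate (remote-only demonstrably does, per the quick start above; default-prod and all-runtimes are assumed to follow the same naming), picking a preset is a single flag:

# default-prod as a feature name is an assumption; check docs/feature-matrix.md.
cargo build -p atomr-infer --no-default-features --features default-prod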

Layout

crates/                       Rust workspace
  atomr-infer-core/           foundation: traits, types, no actor / GPU / HTTP deps
  atomr-infer-runtime/        gateway, request, dp-coordinator, two-tier worker
  atomr-infer-remote-core/    rate limiter (CRDT), circuit breaker, retry, SSE
  atomr-infer-runtime-*/      per-provider / per-backend ModelRunner
  atomr-infer-pipeline/       atomr-streams + batching/cascade/replica blueprints
  atomr-infer-testkit/        MockRunner + wiremock-backed provider mocks
  atomr-infer-cli/            `atomr-infer serve --config <toml>`
  atomr-infer-py-bindings/    PyO3 bindings
  atomr-infer/                rollup
ai-skills/                    Claude / Cursor / Codex / Gemini SKILL.md bundle
docs/                         Architecture (RFC v4), feature matrix, deployment guide
examples/remote_only_demo/    end-to-end happy-path / 429 / circuit-open demo
xtask/                        Cargo xtask (audit, bump, verify, release-checklist)

AI-assisted development

If you're using Claude Code, Cursor, or another AI coding assistant on a project that depends on atomr-infer, install our ai-skills bundle — seven skills covering quickstart, choosing a runtime, wiring remote providers, composing pipelines, deployment, typed-error troubleshooting, and extending with a new backend.

/plugin marketplace add rustakka/atomr-infer
/plugin install atomr-infer-ai-skills@atomr-infer

Each SKILL.md is a thin router into the canonical docs. Other harnesses (Cursor, Codex CLI, Gemini CLI, Aider, etc.) have install instructions in ai-skills/README.md.

Companion bundles for the broader stack:

  • atomr ai-skills — actor design, supervision, persistence, clustering, Python bindings.
  • atomr-accel ai-skills — DeviceActor, kernel selection, two-tier GPU supervision, backend choice.


License

Apache-2.0. See LICENSE.
