A large release — roughly fifty PRs since v0.7.0 — landing three bodies of work plus a security-hardening pass:
- HTTP admin gateway. The new `astrid-gateway` crate fronts the kernel admin/request IPC surfaces over HTTP: principals, caps, quotas, groups, invites, per-principal capsule env, audit SSE + historical audit queries, agent-prompt SSE, OpenAPI spec emission, native rustls TLS, CORS, and Prometheus metrics, with a bus-direct admin path (285× admin throughput) and invite/keypair CLI verbs for operator parity.
- Runtime concurrency overhaul. The orchestration concurrency cliff is closed end-to-end: per-(capsule, topic, principal) routed IPC with DRR fairness and byte budgets, async Wasmtime (guest calls no longer pin tokio workers), dynamic host-sized per-capsule instance pools, split blocking/async-I/O host-call semaphores, and per-principal CPU-fuel + peak-memory ledgers with rate enforcement and usage reporting.
- Host process + introspection surface. The `astrid:process@1.0.0` persistent-process tier is implemented (background children that outlive the pooled WASM instance, id-keyed reattach/stdin/log-cursor ops, operator `allow_persistent` opt-in), capability introspection lands (`enumerate-capabilities` + a completed `check-capsule-capability`), `astrid mcp serve` exposes capsule tools + capability consent to any MCP client, and `astrid-emit` gives agent hook processes an agent-agnostic stdio→bus pipe.
- Security. The macOS 15+ native-subprocess sandbox is no longer silently disabled; capsule audit-feed subscriptions are principal-scoped by default; failed token redeems now audit as failures; `self:agent:list` no longer leaks the principal roster; both unauthenticated redeem routes share per-IP rate limiting; bearers gain revocation-on-principal-delete (wire format v2; in-flight v0.7.0 bearers must re-redeem).

Breaking changes to note: `Capsule.toml` moves to `[publish]` / `[subscribe]` tables as the only IPC-intent surface (`[[interceptor]]`, the `ipc_publish` / `ipc_subscribe` arrays, and `[[topic]]` are removed), the bearer wire format is v2, MSRV is 1.95, and the `astrid-openclaw` build path is removed.
Breaking
- `Capsule.toml`: the `[[interceptor]]` block is removed; interceptor bindings are now `[subscribe]` entries with a `handler` and an optional `priority`. A capsule used to bind an interceptor either via a `[[interceptor]]` block (`event` + `action` + optional `priority`) or a `[subscribe]` entry with a `handler`. The two overlapped, and only the legacy block could set a `priority` (the `[subscribe]` form hardcoded the default 100). `[subscribe]` now carries an optional `priority` (lower fires first; default 100; a `priority` on a handler-less ACL-only entry is rejected at parse time), making it the single interceptor-binding form, and `[[interceptor]]` is gone. A manifest still declaring `[[interceptor]]` no longer binds those handlers: the block is ignored. Migrate each block to a `[subscribe]` entry: `"<event>" = { wit = "<typed-payload>", handler = "<action>", priority = <n> }`. Note that `[subscribe]` requires a typed `wit` payload reference that `[[interceptor]]` did not, so each migrated binding declares its payload type (typed IPC everywhere). Closes #858.
- `Capsule.toml`: the `capabilities.ipc_publish` / `ipc_subscribe` string arrays are removed; `[publish]` / `[subscribe]` tables are the only IPC ACL. The tables already superseded the arrays (their keys are the ACL when present); the arrays are now gone, so `[publish]` / `[subscribe]` is the single way to declare what a capsule may publish or subscribe to (an empty table = may not publish/subscribe, fail-closed). A capsule still declaring `ipc_publish = [...]` / `ipc_subscribe = [...]` under `[capabilities]` no longer grants any IPC ACL: the unknown keys are ignored. Migrate each pattern to a table entry: a publish pattern becomes `[publish] "<topic>" = { wit = "<typed-payload>" }`; an ACL-only subscribe pattern (no handler, e.g. an uplink proxy) becomes `[subscribe] "<topic>" = { wit = "opaque" }`. Closes #864.
- `Capsule.toml`: the `[[topic]]` table is removed; topic schemas are sourced from the `[publish]` / `[subscribe]` `wit` refs. The `[[topic]]` block let a capsule self-describe a topic with an inline JSON Schema file, baked into `meta.json` at install. That self-description (flaky to keep in sync) is dropped in favour of the typed `wit` payload ref already on every `[publish]` / `[subscribe]` entry: the A2UI schema catalog (`SchemaCatalog`) now registers each topic from those tables and records its `wit` ref for the A2UI bridge to resolve to a schema + description via the WIT registry. Removed: the `[[topic]]` table (`TopicDef` / `TopicDirection`), the install-time `bake_topics` path, the `meta.topics` / `BakedTopic` install metadata, and the `astrid capsule list` baked-topic display. No capsule declared `[[topic]]`, so there is nothing to migrate. Closes #865.
- Bearer wire format bumped to v2 (4 segments, was 3). The token now carries an `iat` (issued-at epoch) claim alongside `principal` and `exp`. Required by the new revocation machinery: without `iat` the only revocation semantics available would be "blanket reject forever," which surprises an operator who later re-creates a principal with the same id. Dashboard sessions issued by the v0.7.0 gateway no longer verify after upgrade; clients must re-redeem. CLI `astrid invite redeem`, the existing pair-device flow, and every browser session that goes through `/api/auth/redeem` mint the new shape automatically; only pre-existing in-flight bearers are affected. Format spec: `b64url(principal) "." b64url(iat) "." b64url(exp) "." hex(sig)`, with sig over `principal:iat:exp`. Closes #772.
- MSRV bumped to 1.95.0. `surrealdb 3.0.0-beta.3`'s `kv-mem` feature pulled in `surrealmx v0.21.0` → `ferntree v0.7.0`, which uses `std::hint::cold_path` stabilised in Rust 1.95. Upstream declared no `rust-version`, so cargo's resolver silently picks 0.7 even though the workspace MSRV says 1.94. Bumping our MSRV is the smallest fix that keeps CI deterministic without committing `Cargo.lock` (which is intentionally gitignored). Affects `cargo install astrid` consumers: installers on 1.94 will see a clear "requires rustc 1.95" error rather than the cryptic `cold_path` failure.
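For client authors handling the bearer change, the four-segment v2 layout can be told apart from the three-segment v0.7.0 layout by segment count alone. The sketch below is illustrative only: `BearerShape` and `classify_bearer` are hypothetical names, not gateway API, and it neither decodes the segments nor verifies the signature.

```rust
/// Shape of a bearer token, judged only by its dot-separated segments.
/// Hypothetical helper type for illustration; not part of the gateway.
#[derive(Debug, PartialEq)]
pub enum BearerShape<'a> {
    /// v2: b64url(principal) . b64url(iat) . b64url(exp) . hex(sig)
    V2 {
        principal_b64: &'a str,
        iat_b64: &'a str,
        exp_b64: &'a str,
        sig_hex: &'a str,
    },
    /// Three segments: the pre-v2 layout, which has to be re-redeemed.
    Legacy,
    /// Anything else, including empty segments.
    Malformed,
}

/// Classify a token by segment count. base64url and hex never contain '.',
/// so splitting on '.' is unambiguous for well-formed tokens.
pub fn classify_bearer<'a>(token: &'a str) -> BearerShape<'a> {
    let parts: Vec<&'a str> = token.split('.').collect();
    if parts.iter().any(|segment| segment.is_empty()) {
        return BearerShape::Malformed;
    }
    match parts.as_slice() {
        &[principal_b64, iat_b64, exp_b64, sig_hex] => BearerShape::V2 {
            principal_b64,
            iat_b64,
            exp_b64,
            sig_hex,
        },
        &[_, _, _] => BearerShape::Legacy,
        _ => BearerShape::Malformed,
    }
}
```

A dashboard could use a check of this kind to send a stored three-segment token straight back through the redeem flow instead of waiting for a verification failure.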
Added
- `astrid mcp serve`: an MCP server surface exposing Astrid capsule tools + capability consent to any MCP client. Astrid already had an MCP client (`astrid-mcp` manages external tool servers) and a tool broker (`sage-mcp` discovers capsule tools and shapes MCP descriptors), but no way for an MCP client (the managed `claude -p`, or Codex/Gemini/any client) to consume Astrid's capsule tools over the standard MCP wire protocol. `astrid mcp serve` (a subcommand of the `astrid` CLI, not a separate binary) is a thin rmcp stdio `ServerHandler` that delegates to the existing `sage-mcp` broker over the already-allowlisted `astrid.v1.request.mcp.*` / `astrid.v1.response.<req_id>` topics. The crypto/audit/enforcement stay in Astrid; the shim only terminates the MCP wire protocol. `get_info` advertises tools + `tools.list_changed`; `list_tools` / `call_tool` publish on an uplink and await the single-segment reply (one-response invariant, so the shim never hangs); the calling principal is stamped from `--principal` (default = active/default principal). Capability approvals are relayed into the client's own UI via MCP elicitation (`ctx.peer.elicit::<ApprovalChoice>` `{ApproveOnce,ApproveSession,ApproveAlways,Deny}`, gated on the client advertising the elicitation capability; decline/cancel → Deny; never elicits secrets). Hot-reload subscribes to `astrid.v1.capsules_loaded`, re-enumerates, diffs, and calls `peer.notify_tool_list_changed()` (no-op notifications suppressed). Because `mcp serve` owns stdout for the JSON-RPC stream, logging is forced off stdout (to file, else stderr) for this command regardless of operator config, so a stray frame cannot corrupt the protocol stream. This is the foundation for registering a named `sage` server in the managed `claude -p` (so `mcp__sage__*` resolves natively and `mcp_tool` hooks / channels can bind by name) and for the agent-neutral backplane. rmcp `0.15 → 1.7.0` workspace-wide; the existing `astrid-mcp` client (`ClientHandler` / `RoleClient`) is migrated to the 1.7 API. No kernel / WIT / allowlist change. Closes #879.
- `astrid:process@1.0.0` persistent-process tier: implemented. A capsule can now spawn a background child that outlives the pooled, stateless WASM instance that started it. Previously an ephemeral `process-handle` was reaped when its instance reset on return to the dynamic pool, so a process started in one tool invocation could not survive to the next, and the split `spawn → read → stop` pattern was impossible. A new host-owned `PersistentProcessRegistry` (cloned into every pooled `HostState` exactly like the cancellation `ProcessTracker`, so a `process-id` survives instance churn) owns the child (spawned on the daemon runtime under the same `bwrap` / Seatbelt sandbox as the ephemeral tier), its per-stream log rings, and its stdin pipe. Implemented: `spawn-persistent` (returns a 256-bit host-minted CSPRNG `process-id`, lowercase base32 so it doubles as an IPC topic suffix; the registry stores only a keyed BLAKE3 hash, never the raw token), `status` / `status-many` / `list-processes`, `read-logs` (drain) + `read-since` (non-draining, cursor-addressed, byte-faithful `list<u8>`), `signal` (incl. `stop` / `cont`), bounded `wait`, `stop` (SIGTERM→grace→SIGKILL, frees the slot), `release-process`, and `write-stdin` / `close-stdin` (via `keep-stdin-open` capture). Every id-keyed call re-resolves the live `(principal, capsule)` and checks it against the recorded creator, so a leaked id is inert across the principal/capsule boundary: unknown / wrong-owner / wrong-capsule / reaped all collapse to `no-such-process` with no oracle. `spawn-persistent` refuses the owner-fallback principal (`persist-unsupported`) so tenants never share a `default` namespace. Lifecycle is enforced by a per-capsule reaper task: per-principal concurrent + retained-id caps, idle / max-lifetime / exit-retention TTLs (guest values clamped DOWN to host ceilings), and a kill-all on capsule unload / daemon graceful shutdown. Works on Linux and macOS. The macOS caveat (a daemon hard crash, not a graceful shutdown, can orphan a still-sandboxed child because Seatbelt has no `die-with-parent`) is a weaker cleanup guarantee, not a containment gap. Still deferred, honestly: `attach` (the resource-handle composition sugar; the id-keyed ops are its documented `attach(id)?.method()` equivalent), `watch` / `unwatch` (host-published lifecycle events, an OPEN publish-authority question in RFC host_abi, with `status` + bounded `wait` polling as the working alternative), and the WIT's own `(NOT YET …)` items (resource-limit enforcement, `cpu-ms` / `mem-bytes-peak`, instance-local pollables). Contract: unicity-astrid/wit#12. Design: unicity-astrid/rfcs#22. Closes #866.
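The "no oracle" property of the id-keyed calls can be sketched as a lookup that returns one indistinguishable error for every failure cause. This is a minimal illustration under stated assumptions: `Registry`, `Owner`, and the `u64` key are hypothetical stand-ins (the entry above describes a keyed BLAKE3 hash, which is not reproduced here), and the real `PersistentProcessRegistry` may be shaped quite differently.

```rust
use std::collections::HashMap;

/// The identity the host re-resolves on every id-keyed call.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Owner {
    pub principal: String,
    pub capsule: String,
}

/// The single error a caller ever sees for a bad id.
#[derive(Debug, PartialEq, Eq)]
pub struct NoSuchProcess;

/// Hypothetical registry: hashed process id -> recorded creator.
/// (A plain u64 stands in for the keyed hash; an assumption of this sketch.)
#[derive(Default)]
pub struct Registry {
    by_hashed_id: HashMap<u64, Owner>,
}

impl Registry {
    pub fn record(&mut self, hashed_id: u64, creator: Owner) {
        self.by_hashed_id.insert(hashed_id, creator);
    }

    /// Unknown id, wrong principal, and wrong capsule all take the same
    /// branch, so the error carries no hint about which check failed.
    pub fn resolve(&self, hashed_id: u64, caller: &Owner) -> Result<&Owner, NoSuchProcess> {
        match self.by_hashed_id.get(&hashed_id) {
            Some(owner) if owner == caller => Ok(owner),
            _ => Err(NoSuchProcess),
        }
    }
}
```

Collapsing the failure cases into one value is what keeps a leaked id from being probed: a caller cannot distinguish "exists but not yours" from "never existed".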
- `astrid:sys@1.0.0` capability introspection: `enumerate-capabilities` implemented, `check-capsule-capability` completed. A capsule can now read its OWN held capability names via the new infallible `enumerate-capabilities() -> list<string>`: the capability categories declared in its `[capabilities]` manifest block (`host_process`, `net_connect`, `fs_read`, …), meaning the names, not the scoped arguments within them (allowlists, `host:port`, paths). This lets a reusable supervisor binary deployed under different manifests ground its behaviour in what it can actually do instead of hard-coding it, and lets any capsule avoid code-vs-manifest drift. Because the WIT is infallible (a bare `list<string>`, no `result`), it reads an owned, lock-free snapshot taken once at load (`CapabilitiesDef::held_names`) and stored on `HostState` rather than the `capsule_registry`. There is no `registry-unavailable` failure mode to surface, and an empty list is the valid "no capabilities" answer; capsule capabilities are fixed at load (the grant/revoke model is principal-scoped, a separate axis), so the snapshot is correct for the capsule's whole lifetime and across pooled instances. In the same change, `check-capsule-capability` (previously a stub that answered only `allow_prompt_injection` and returned `false` for every other capability) is completed onto the same namespace: both host fns (`held_names()` the list, `has(name)` the per-name dual) are DERIVED from the struct's serialized fields rather than a hand-maintained list, so a capability added to `CapabilitiesDef` flows through both automatically and the two cannot drift: `n` appears in `held_names()` iff `has(n)`, and unknown names fail closed. Both are ungated, read-only, and audit-not-recorded: capability posture is structural metadata, not a secret (enforce-don't-conceal), and knowing a capability conveys no ability to use it. Lifecycle hooks and the `astrid-hooks` host, which run outside the capsule manifest/security-gate lifecycle, report an empty set (fail-closed). Contract: unicity-astrid/wit#13. Closes #868.
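The invariant that `n` appears in `held_names()` iff `has(n)` follows from deriving both queries from one source. The sketch below shows that idea only; it substitutes a small hand-written table for the serde-derived field names the entry describes, and `Capabilities` with its three boolean fields is a hypothetical stand-in for `CapabilitiesDef`.

```rust
/// Hypothetical stand-in for a capsule's capability block.
/// Real capabilities carry scoped arguments; booleans are an assumption here.
pub struct Capabilities {
    pub host_process: bool,
    pub net_connect: bool,
    pub fs_read: bool,
}

impl Capabilities {
    /// The single source both queries read from.
    fn table(&self) -> [(&'static str, bool); 3] {
        [
            ("host_process", self.host_process),
            ("net_connect", self.net_connect),
            ("fs_read", self.fs_read),
        ]
    }

    /// Names of every held capability; an empty list is a valid answer.
    pub fn held_names(&self) -> Vec<&'static str> {
        self.table()
            .iter()
            .filter(|(_, held)| *held)
            .map(|(name, _)| *name)
            .collect()
    }

    /// Per-name dual of `held_names`; unknown names fail closed (false).
    pub fn has(&self, name: &str) -> bool {
        self.table().iter().any(|(known, held)| *held && *known == name)
    }
}
```

Because neither method keeps its own list, adding a row to the table changes both answers together, which is the property the derived implementation gets for free.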
- Per-principal peak memory in usage reporting. `ResourceUsage` gains `memory_bytes_peak_total: Option<u64>`: the cross-capsule high-water linear memory a principal has driven (max across every capsule it invokes), read from the shared memory ledger and surfaced by the admin `UsageGet`, `GET /api/sys/principals/{id}/usage` (a new field on the OpenAPI `ResourceUsageView`), and `astrid quota show` (a new "memory peak" row). This fills the memory side of per-principal usage, which previously reported only the per-instance ceiling. Under pooled, shared Stores a live "current" total is not cleanly attributable, so the peak is the reported signal (the principal that grows a Store owns the peak); `memory_bytes_current_total` stays `None`. Refs #816.
- Operator overrides for capsule runtime sizing (config + env + CLI). New `[capsule]` config section with `host_blocking_concurrency`, `host_io_concurrency`, and `instance_pool_size` (all optional; unset → the host-derived default), the matching `ASTRID_CAPSULE_HOST_BLOCKING_CONCURRENCY` / `ASTRID_CAPSULE_HOST_IO_CONCURRENCY` / `ASTRID_CAPSULE_INSTANCE_POOL_SIZE` env vars, and `astrid-daemon --host-blocking-concurrency` / `--host-io-concurrency` / `--instance-pool-size` flags. Precedence is CLI flag > config file > env > host-derived default; the daemon resolves the values once at boot and the kernel forwards them, unmodified, to every capsule's `WasmEngine` (the same handle-plumbing shape as `FuelLedger`). A zero override is rejected at config-validation time (it would wedge a host-call class or leave a capsule with no instance to lease, rather than throttle). Refs #816.
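The stated precedence (CLI flag > config file > env > host-derived default) maps naturally onto `Option` chaining. A minimal sketch, assuming each source has already been parsed into an `Option<usize>`; `resolve_knob` is a hypothetical name, and rejecting a zero from any source (not only the config file) is an assumption of this sketch.

```rust
/// Resolve one sizing knob: CLI flag > config file > env > host default.
/// Hypothetical helper; the daemon's real resolution code may differ.
pub fn resolve_knob(
    cli: Option<usize>,
    config: Option<usize>,
    env: Option<usize>,
    host_default: usize,
) -> Result<usize, String> {
    // A zero override is an error rather than a silent fallback: it would
    // leave nothing to lease instead of throttling.
    let overrides = [("cli", cli), ("config", config), ("env", env)];
    for &(source, value) in overrides.iter() {
        if value == Some(0) {
            return Err(format!("{} override must be non-zero", source));
        }
    }
    // First source that is set wins, in precedence order.
    Ok(cli.or(config).or(env).unwrap_or(host_default))
}
```

For example, with no CLI flag, a config value of 4, and an env value of 2, the sketch resolves to 4.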
- Admin contract for per-principal resource-usage reporting. Adds `AdminRequestKind::UsageGet { principal }`, `AdminResponseBody::Usage(ResourceUsage)`, and the `ResourceUsage` payload (cross-capsule CPU total + configured ceilings + an `exempt` flag), scoped exactly like `QuotaGet` (`self:quota:get` / `quota:get`) so a principal reads its own usage and only an admin reads another's. The quota/usage admin handlers move into a new `admin/quota.rs` submodule. Refs #820.
- Per-principal CPU-rate budget is now enforced. A reactive 1-second windowed throttle (`FuelRateLimiter`) denies a principal's interceptor calls once it overruns its `max_cpu_fuel_per_sec` budget, until the window rolls, so it is never permanently bricked. Keyed on the invoking principal, sharded + per-principal-locked (no global lock), fail-secure (exemption fails closed, the window math fails open); admin / `system:resources:unbounded` / `net_bind` / `uplink` holders are exempt. Refs #819.
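A reactive windowed throttle of the kind described above can be sketched for a single principal as follows. This is not the `FuelRateLimiter` implementation: `WindowedBudget` is a hypothetical type, time is passed in as milliseconds to keep the example deterministic, and sharding, locking, and exemptions are left out.

```rust
const WINDOW_MS: u64 = 1_000;

/// Minimal fixed-window budget for one principal (illustrative only).
pub struct WindowedBudget {
    max_per_window: u64,
    window_start_ms: u64,
    used: u64,
}

impl WindowedBudget {
    pub fn new(max_per_window: u64) -> Self {
        WindowedBudget { max_per_window, window_start_ms: 0, used: 0 }
    }

    /// Start a fresh window once a full second has elapsed.
    fn roll(&mut self, now_ms: u64) {
        if now_ms.saturating_sub(self.window_start_ms) >= WINDOW_MS {
            self.window_start_ms = now_ms;
            self.used = 0;
        }
    }

    /// Reactive check: calls are admitted until the window is overrun,
    /// then denied until the window rolls.
    pub fn admit(&mut self, now_ms: u64) -> bool {
        self.roll(now_ms);
        self.used <= self.max_per_window
    }

    /// Charge the fuel a finished call actually consumed.
    pub fn charge(&mut self, now_ms: u64, fuel: u64) {
        self.roll(now_ms);
        self.used = self.used.saturating_add(fuel);
    }
}
```

The throttle is reactive because cost is only known after a call runs: the call that overruns the budget completes, and it is the following calls in the same window that are denied.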
- `UsageGet` now reports live per-principal usage. The stub is replaced with the real cross-capsule CPU total (read from the shared `FuelLedger`) and a real `exempt` flag computed with the same capability predicate enforcement uses, so displayed-exempt always matches enforced-exempt. Refs #820.
- `GET /api/sys/principals/{id}/usage`. Exposes per-principal resource usage over HTTP, mirroring the quotas route's self/admin scope and bearer-verified caller. Also fixes the `QuotasView` OpenAPI mirror, which omitted `max_cpu_fuel_per_sec`. Refs #820.
- `astrid quota show` now reports usage vs budget. Adds a usage-vs-budget section (consumed CPU, ceiling, exempt flag) and the previously-missing CPU-rate ceiling line. Refs #820.
- `astrid-emit`: an agent-agnostic stdio→bus pipe. New companion binary (its own `crates/astrid-emit` crate, co-installed alongside `astrid` exactly like `astrid-daemon` / `astrid-build`) that any agent's hook process can shell out to. It reads stdin to EOF as a UTF-8 string, wraps it in a fixed six-field, sage-validated envelope (`hook`, `payload`, `correlation_id` (always `null`), `principal_id`, `session_id`, `token`), connects to the daemon as an uplink, and publishes the envelope via `publish-as` on the topic it was handed. The topic is accepted both positionally (`astrid-emit sage.v1.hook.before_tool_call`, what shipped sage writes into `settings.local.json`) and via `--topic <name>` (exactly one required). The `hook` field is derived as the topic's trailing dot-segment so it matches sage's validator, and `payload` is forwarded as a verbatim string (never base64) so canonical subscribers receive the original hook JSON. The three transport identifiers are read from the required `ASTRID_PRINCIPAL_ID` / `ASTRID_SESSION_ID` / `ASTRID_HOOK_TOKEN` environment variables (set on the agent child at spawn); any missing-or-empty value is a soft failure. The binary is deliberately tiny and carries zero agent-protocol knowledge (no hook-name map, no stdin parsing, no verdict shaping, no merge, no await/response, no fail-closed behaviour): sage is the trust anchor, validator, and republisher, and the canonical hook-name map lives there. `astrid-emit` always writes `{"continue":true}` to stdout (success and failure alike) and exits `0` on a successful publish or `1` on any failure (missing env, connect failure, send failure); it never exits `2`. A new `astrid --emit-path` discovery flag (handled before banner/config so it works on a half-configured host) prints the absolute path to the co-installed `astrid-emit` so hook-bridge installers can wire commands without guessing the install layout. Refs #814.
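Two of the `astrid-emit` rules above are simple enough to sketch directly: deriving `hook` from the topic's trailing dot-segment, and the fixed stdout body paired with a 0/1 exit code. The function names are hypothetical and this is not the binary's source; treating an empty trailing segment as "no hook" is an assumption of the sketch.

```rust
/// Derive the envelope's `hook` field: the topic's trailing dot-segment.
/// Returns None for an empty trailing segment (assumed handling).
pub fn hook_from_topic(topic: &str) -> Option<&str> {
    topic.rsplit('.').next().filter(|segment| !segment.is_empty())
}

/// stdout body and exit code for a publish outcome. The body is the same
/// on success and failure; only the exit code differs, and it is never 2.
pub fn emit_outcome(publish_ok: bool) -> (&'static str, i32) {
    let exit_code = if publish_ok { 0 } else { 1 };
    ("{\"continue\":true}", exit_code)
}
```

With the topic from the entry above, `sage.v1.hook.before_tool_call`, the derived hook is `before_tool_call`.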
- Process-level + build-provenance metrics on the gateway `/metrics` endpoint. The Prometheus exposition now ships the standard `process_*` family (`process_cpu_seconds_total`, `process_resident_memory_bytes`, `process_threads`, `process_open_fds`, `process_start_time_seconds`) via the `metrics-process` collector, plus an `astrid_build_info{version,git_sha,rustc}` gauge for build-provenance joins. `rate(process_cpu_seconds_total)` is the direct, graphable detector for the idle 200–300% CPU class of incident that previously could only be inferred from `top`. The collector is pull-based: the `/metrics` handler calls `collect()` once per scrape, deliberately adding no background thread (you don't observe a background-CPU bug by spawning a polling thread to watch it). `git_sha` (short) and `rustc` are captured at compile time by a new `crates/astrid-gateway/build.rs`, each falling back to `"unknown"` so a source-tarball build still populates the metric; build constants are exposed on the unauthenticated endpoint deliberately, since they carry no principal, path, or secret. This is the foundation slice of the host-internal operational-metrics layer; the full design, catalogue, and phased roadmap (and its division of responsibility from the #705 capsule-telemetry layer) land in `docs/metrics.md`. Refs #791.
- Event-bus + daemon saturation metrics. The runtime's missing saturation signal, recorded into the daemon-wide Prometheus recorder the gateway exports: `astrid_bus_events_published_total{event_kind}` (publish throughput, labelled by the bounded `AstridEvent::event_type` set; IPC traffic collapses to `ipc`), `astrid_bus_receiver_lagged_total{subscriber}` (events a subscriber dropped by falling behind; a non-zero `rate()` is the signature of a feedback storm), `astrid_daemon_active_connections` (gauge) + `astrid_daemon_connections_opened_total` / `_closed_total`, and `astrid_daemon_background_ticks_total{loop}` across the idle / capsule-health / react-watchdog / bus-monitor loops (a flat `rate()` is a parked loop, a runaway `rate()` is a spin loop; this is the direct signal for the idle 200–300% CPU class of incident, complementing #787's storm log and `process_cpu_seconds_total`). The `subscriber` label is a fixed, code-assigned set: the long-lived daemon consumers (`kernel_router`, `admin_router`, `connection_tracker`, `capsule_dispatcher`, `bus_monitor`, `revocation_watcher`) are tagged via the new `EventBus::subscribe_as` / `subscribe_topic_as`; everything else (incl. dynamic capsule-supplied topic subscriptions) collapses to `untagged`, so a buggy capsule can't inflate cardinality. Adds the `metrics` facade to `astrid-events` and `astrid-kernel`. The kernel-side uplink socket-accept counter is deferred to a follow-up (the accept happens inside a capsule via the `astrid:net` host shim, a separate crate). Resolves the `astrid_bus_*` vs `astrid_router_*` naming question (§7.9) in `docs/metrics.md`. Refs #791.
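The bounded `subscriber` label can be pictured as a lookup against a fixed set with a single fallback. In the release the tags are assigned in code via `EventBus::subscribe_as`, so the lookup below is only an illustration of the cardinality bound; `subscriber_label` is a hypothetical function.

```rust
/// The fixed, code-assigned subscriber tags named in the entry above.
const TAGGED: [&str; 6] = [
    "kernel_router",
    "admin_router",
    "connection_tracker",
    "capsule_dispatcher",
    "bus_monitor",
    "revocation_watcher",
];

/// Map a requested tag to a label value. Anything outside the fixed set,
/// including no tag at all, collapses to "untagged", so the label can take
/// at most seven distinct values no matter what a capsule supplies.
pub fn subscriber_label(requested: Option<&str>) -> &'static str {
    requested
        .and_then(|name| TAGGED.iter().copied().find(|known| *known == name))
        .unwrap_or("untagged")
}
```

Returning a `&'static str` taken from the table, rather than the caller's string, is what guarantees that no caller-controlled text ever reaches the metric label.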
- `astrid build` injects the getrandom custom-backend cfg for `wasm32-unknown-unknown` capsules. Every capsule on that target must compile with `--cfg=getrandom_backend="custom"` so `uuid` v4 / `HashMap` seeding link against `astrid-sys`'s host-routed RNG (`astrid:sys/host.random-bytes`); getrandom 0.4 requires the cfg on the final binary crate, so a capsule whose `.cargo/config.toml` omitted it failed to build with a cryptic getrandom `compile_error!`. `astrid-build` now injects the cfg itself via a merged `CARGO_ENCODED_RUSTFLAGS`. Because host ≠ wasm this is a cross-compile, so the flag reaches only the wasm artifacts, never host build scripts / proc-macros. The injection merges rather than clobbers: it concatenates inherited `RUSTFLAGS` / `CARGO_ENCODED_RUSTFLAGS`, the capsule's own config `rustflags` (array or string form), and the getrandom cfg, de-duplicating only the getrandom cfg so multi-token flags like `-C opt-level=3` stay intact, so a capsule already carrying the flag is unaffected. This is a safety net for the canonical build tool; capsules still keep the flag in config because a plain `cargo build` / `cargo test` never runs through `astrid-build`. Closes #800.
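The merge-not-clobber rule can be sketched as below. Cargo separates the entries of `CARGO_ENCODED_RUSTFLAGS` with the ASCII unit separator (`0x1f`); everything else here is an assumption of the sketch: `merge_rustflags` is a hypothetical name, the capsule's flags are taken as an already-tokenised slice, and only the single-token `--cfg=...` spelling of the getrandom cfg is recognised.

```rust
/// The cfg every wasm32-unknown-unknown capsule needs, single-token form.
const GETRANDOM_CFG: &str = "--cfg=getrandom_backend=\"custom\"";

/// Merge inherited encoded flags with a capsule's own flags and append the
/// getrandom cfg exactly once. Illustrative only; not astrid-build's code.
pub fn merge_rustflags<'a>(inherited_encoded: &'a str, capsule_flags: &[&'a str]) -> String {
    let mut merged: Vec<&str> = Vec::new();
    let inherited = inherited_encoded
        .split('\u{1f}')
        .filter(|flag| !flag.is_empty());
    for flag in inherited.chain(capsule_flags.iter().copied()) {
        // De-duplicate only the getrandom cfg. Every other token is kept
        // verbatim, repeats included, so pairs like `-C opt-level=3`
        // (two tokens) are never split from each other.
        if flag != GETRANDOM_CFG {
            merged.push(flag);
        }
    }
    merged.push(GETRANDOM_CFG);
    merged.join("\u{1f}")
}
```

A general de-duplication pass would be wrong here: two different `-C` options share the same first token, so dropping "duplicates" would detach an option from its value.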
- Passive event-bus storm diagnostics. A new bus-activity monitor (`astrid-kernel/src/bus_monitor.rs`) subscribes to all events on its own subscriber, aggregates publish counts per topic over a rolling 5s window, and logs a `WARN` naming the hottest topics whenever the sustained rate crosses 100 events/s (`DEBUG` otherwise); events dropped to broadcast lag are folded into the tally via `drain_lagged` so an overflow spike still surfaces. A truly idle daemon publishes well under one event per second, so a sustained triple-digit rate with no client attached is the signature of a feedback loop / event storm: the failure mode that pegs CPU by waking every broadcast subscriber and re-invoking WASM interceptors. The monitor is a pure observer (it counts on its own subscriber, not inside `EventBus::publish`), so it adds zero hot-path overhead and keeps reporting even while a storm saturates the dispatcher; this makes the next such incident self-diagnosing instead of requiring a live profiler attach. Bumps `INTERNAL_SUBSCRIBER_COUNT` 4 → 5. Closes #786.
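The per-window decision the monitor makes can be sketched as a tally plus a threshold check. This is a simplified illustration, not `bus_monitor.rs`: `WindowTally` is a hypothetical type covering one 5-second window, and treating a rate of exactly 100 events/s as a storm is an assumption.

```rust
use std::collections::HashMap;

const WINDOW_SECS: u64 = 5;
const WARN_EVENTS_PER_SEC: u64 = 100;

/// Publish counts for one window (illustrative only).
#[derive(Default)]
pub struct WindowTally {
    per_topic: HashMap<String, u64>,
    total: u64,
}

impl WindowTally {
    /// Count `count` events on `topic`; lag-dropped events can be folded
    /// in the same way so an overflow spike still raises the total.
    pub fn record(&mut self, topic: &str, count: u64) {
        *self.per_topic.entry(topic.to_string()).or_insert(0) += count;
        self.total += count;
    }

    /// Some(hottest topics, busiest first) when the window's average rate
    /// reaches the warning threshold, None otherwise.
    pub fn storm_report(&self, top_n: usize) -> Option<Vec<(String, u64)>> {
        if self.total / WINDOW_SECS < WARN_EVENTS_PER_SEC {
            return None;
        }
        let mut hottest: Vec<(String, u64)> = self
            .per_topic
            .iter()
            .map(|(topic, count)| (topic.clone(), *count))
            .collect();
        // Sort by count descending, then by topic name for a stable order.
        hottest.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
        hottest.truncate(top_n);
        Some(hottest)
    }
}
```

Naming the hottest topics in the report is what makes the warning actionable: the topic that dominates the window is usually the one feeding the loop.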
- Typed OpenAPI schemas for gateway responses sourced from kernel types. Response payloads that flow through `astrid-core` types (`AgentSummary`, `CapabilityInfo`, `Quotas`, `GroupSummary`, `InviteIssued`, `InviteSummary`) previously surfaced in the OpenAPI spec as opaque `serde_json::Value` (`any`), so generated clients got no field-level types. Each now has a gateway-local `*View` schema mirror (`astrid-core` stays utoipa-free) that the `value_type` resolves to, giving codegen real typed shapes. Fixing this also surfaced and corrected drift in the old hand-written shape comments: `InviteIssued` carries no `fingerprint`, `InviteSummary`'s field is `token_fingerprint` (plus an always-present `issued_at_epoch`), and `GroupSummary`'s flag is `builtin` (not `built_in`). A regression test pins the mirrors so a payload can't silently revert to `any`. `env.default` and `distribution.branding` stay `Value` because they're legitimately polymorphic. Closes #783.
- Gateway default listen port changed `7777` → `2787` ("ASTR"). `7777` collides in practice with Terraria and assorted dev tooling; `2787` is the phone-keypad mnemonic for ASTR (A=2, S=7, T=8, R=7) and is effectively free in the wild. The gateway is opt-in and the runbook has operators set `listen` explicitly, so impact is limited to anyone relying on the default. Closes #784.
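The keypad mnemonic is easy to check mechanically. The helper below is a small illustration with a hypothetical name, using the standard telephone keypad letter groups.

```rust
/// Map a word to its phone-keypad digits (standard letter groups).
/// Returns None if the word contains anything other than ASCII letters.
pub fn keypad_digits(word: &str) -> Option<String> {
    word.chars()
        .map(|c| match c.to_ascii_uppercase() {
            'A'..='C' => Some('2'),
            'D'..='F' => Some('3'),
            'G'..='I' => Some('4'),
            'J'..='L' => Some('5'),
            'M'..='O' => Some('6'),
            'P'..='S' => Some('7'),
            'T'..='V' => Some('8'),
            'W'..='Z' => Some('9'),
            _ => None,
        })
        .collect()
}
```

`keypad_digits("ASTR")` yields `2787`, matching the new default port.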
- Gateway API client generation guide. New `docs/gateway-client.md` documents how to build a client against the gateway HTTP admin API: generate from the OpenAPI spec at `GET /api/openapi.json` (openapi-typescript / progenitor), and hand-write layers only where the spec can't help, namely the two SSE streams (`POST /api/agent/prompt`, `GET /api/events`), the `#[schema(value_type = String)]` stand-in fields whose real types live in `astrid-core`, and the redeem/refresh auth lifecycle (the bearer is treated as opaque, with no client-side signature verification). Records the placement decision: a gateway client is native/browser code and does not belong in the wasm-targeted capsule SDKs (`sdk-rust` / `sdk-js`); it earns its own repo when it graduates to a maintained library. Cross-linked from `core/README.md` under "Operator documentation".
- Gateway deployment runbook. New `docs/gateway-deployment.md` covers the full operator surface: quickstart, behind a reverse proxy (nginx + Caddy snippets, `trust-forwarded-from` discipline), native TLS via the `[tls]` block, monitoring (every counter / histogram explained, sample alert PromQL), authentication flow + bearer lifecycle, CORS allowlist grammar, key rotation, backup + restore, and a troubleshooting section for the common failure modes (401 cascade, rate-limiter lockout, CORS preflight mismatch, audit-history 502). Cross-linked from `core/README.md` under a new "Operator documentation" section. Closes #777.
- `GET /api/sys/audit`: historical-query endpoint over the persistent audit log. Companion to the live `GET /api/events` SSE feed: SSE only delivers events from the moment the connection opens, so a dashboard rendering "the last 24 h of admin activity" had no way to backfill. Now it does. Paginated with `?since` / `?until` (epoch seconds), `?method=AgentDelete`, `?principal=alice`, `?limit` (default 100, cap 1000), `?cursor` (opaque). Same trust shape as the SSE handler: `audit:read_all` holders see the firehose; everyone else is silently scoped to their own principal regardless of what they pass in `?principal`. Reads via `tokio::task::spawn_blocking` so the underlying `SurrealKV` query doesn't stall the runtime worker. The new endpoint is plumbed through `astrid-daemon::spawn_gateway` alongside `event_bus` so the gateway gets an `Arc<AuditLog>` + `SessionId` at boot; standalone-builder `GatewayState::new` returns 502 honestly when those handles are absent. OpenAPI annotation + drift test pin the new route. Closes #779.
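The scoping and pagination rules for the audit query can be sketched as a pure function from the caller's identity and raw query parameters to the query that actually runs. `EffectiveQuery` and `scope_query` are hypothetical names, and clamping an over-cap `?limit` (rather than rejecting it) is an assumption of this sketch.

```rust
const DEFAULT_LIMIT: usize = 100;
const MAX_LIMIT: usize = 1000;

/// The filter that is actually applied after scoping (illustrative).
#[derive(Debug, PartialEq)]
pub struct EffectiveQuery {
    /// None means "all principals" (only reachable with read-all rights).
    pub principal: Option<String>,
    pub limit: usize,
}

/// Apply the trust rules: callers with read-all rights get the filter they
/// asked for; everyone else is silently pinned to their own principal,
/// whatever they passed in `?principal`.
pub fn scope_query(
    caller: &str,
    caller_reads_all: bool,
    requested_principal: Option<&str>,
    requested_limit: Option<usize>,
) -> EffectiveQuery {
    let principal = if caller_reads_all {
        requested_principal.map(|p| p.to_string())
    } else {
        Some(caller.to_string())
    };
    let limit = requested_limit.unwrap_or(DEFAULT_LIMIT).min(MAX_LIMIT);
    EffectiveQuery { principal, limit }
}
```

Pinning silently instead of returning an error means the response never confirms whether the other principal exists.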
- Latency histograms + per-request structured logs for the gateway, backed by the `metrics` facade. The `/metrics` exposition now ships an `astrid_gateway_request_duration_seconds` Prometheus histogram per `(method, route, status)` with the standard HTTP buckets (5 ms → 10 s, plus `+Inf`), and `astrid_gateway_requests_total` carries a new `status` label so a 4xx/5xx spike decomposes separately from the 2xx traffic on the same route. Every request also emits one structured `tracing` event with `method`, `route` (matched template, never the raw URL, which keeps cardinality bounded), `status`, and `duration_ms`; `/healthz` and `/metrics` demote to DEBUG so the high-frequency liveness probes don't drown the INFO stream. Recording uses the `metrics` crate facade + `metrics-exporter-prometheus` instead of a hand-rolled counter/histogram pair, which decouples recording from export format so kernel-side or capsule-side observability can land later via the same `counter!()` / `histogram!()` macros without each subsystem reinventing a metrics layer. `astrid_gateway_auth_failures_total`, `astrid_gateway_redeem_attempts_total`, and `astrid_gateway_redeem_rate_limited_total` are registered at boot and render at `0` so dashboards have a stable series shape before the corresponding instrumentation lands. Closes #778.
- Native TLS termination in the gateway via rustls. New optional `[tls]` block in `etc/gateway-http.toml` (`cert-path`, `key-path`) flips the gateway from plain HTTP to rustls-terminated HTTPS, which is useful for single-box installs, Tailscale-fronted deployments, and anyone running without a reverse-proxy ops layer. Without the block, the daemon behaves exactly as v0.7.0 (plain HTTP, TLS-upstream expected). Backed by `axum-server 0.7` + `aws-lc-rs` (no openssl). New `GatewayConfig::validate()` runs at daemon boot: missing cert/key paths produce a clear "refusing to boot" error so misconfig fails fast instead of returning malformed TLS handshakes at runtime. Group/world-readable key files generate a `WARN` log at boot suggesting `chmod 0600`. `rustls::crypto::CryptoProvider::install_default()` is called idempotently in `tls::load_rustls_config` so rustls 0.23's multi-provider deferral doesn't panic the first handshake. HSTS (`Strict-Transport-Security: max-age=63072000; includeSubDomains`) is layered onto every TLS response, and only on the TLS dispatch path, since RFC 6797 forbids HSTS over plain HTTP. Binding a non-loopback address without a `[tls]` block now logs a `WARN` at boot suggesting the operator either enable TLS or confirm a reverse proxy is fronting the listener. Three new integration tests in `crates/astrid-integration-tests/tests/gateway_tls.rs` mint a self-signed cert at test time with `rcgen` and prove the round-trip end-to-end including HSTS presence on TLS / absence on plain HTTP. ACME / Let's Encrypt automation, mTLS client-cert auth, and HTTP/2 / h2 ALPN remain out of scope for v0.7 and tracked as follow-ups on the closed issue. Closes #773.
HTTP admin gateway (
astrid-gateway). New crate that fronts the kernel's existingastrid.v1.admin.*+astrid.v1.request.*IPC surfaces over HTTP for browser dashboards. Reads~/.astrid/run/system.token, handshakes with the daemon over the same Unix socket the CLI uses, and stampsIpcMessage.principalfrom an ed25519-signed bearer it verifies against its boot-time public key — never from the request body. Full route surface, sufficient for a dashboard to provision principals, set quotas, manage groups + caps + invites, and configure capsule env:- Discovery (unauthenticated):
GET /api/distribution,GET /api/distribution/onboarding - Auth:
`POST /api/auth/redeem`, `GET /api/auth/me`, `POST /api/auth/refresh` (extends an existing session without forcing a re-redeem), `POST /api/auth/pair-device` (authenticated — mint a pair-token for the caller's own principal), `POST /api/auth/pair-device/redeem` (unauthenticated — the new device sends its ed25519 public key; the kernel appends it to the principal's `AuthConfig.public_keys` and the gateway returns a fresh bearer for that principal)
- Principals (agent CRUD): `GET/POST /api/sys/principals`, `GET/PATCH/DELETE /api/sys/principals/{id}`, `POST .../enable`, `POST .../disable`
- Caps: `POST /api/sys/principals/{id}/caps` (grant), `DELETE /api/sys/principals/{id}/caps` (revoke)
- Quotas: `GET/PUT /api/sys/principals/{id}/quotas`
- Groups: `GET/POST /api/sys/groups`, `PATCH/DELETE /api/sys/groups/{name}`
- Invites: `POST/GET /api/sys/invites`, `DELETE /api/sys/invites/{fingerprint}`
- Capabilities catalog: `GET /api/sys/capabilities` (drift-checked against the kernel-side tables at test time) returns structured entries — each with `id`, `label`, `description`, `category` (agent / caps / quota / group / invite / capsule / system / approval), `scope` (self/global), and `danger` (safe/normal/elevated/extreme) — so dashboards can render Discord-style permissions panels with confirmation prompts on dangerous toggles, without hardcoding per-capability metadata client-side. Source of truth is `astrid_core::capability_grammar::CAPABILITY_CATALOG`; the kernel's drift tests pin every static cap-string against the catalog at test time.
- Capsules: `GET /api/capsules`, `POST /api/capsules` (cap-gated by `capsule:install`; kernel handler is a stub today but the route is forward-compatible), `GET /api/capsules/{id}`, `GET /api/capsules/{id}/topics`
- Capsule env (per-principal config): `GET /api/capsules/{id}/env` (schema from `Capsule.toml`), `POST /api/capsules/{id}/env/{field}` (writes to `FileSecretStore` for `secret`-typed fields, env JSON under the principal's home for `text`/`select`/`array`)
- Audit stream: `GET /api/events` — Server-Sent Events feed of audit entries. Operators with `audit:read_all` see the firehose; everyone else sees only entries whose principal matches their own. Kernel publishes a flat JSON event (ts_epoch, method, required_capability, principal, target_principal, params, outcome) to topic `astrid.v1.audit.entry` on every `record_admin_audit` call. 15s SSE keep-alive so NAT/proxy state survives idle stretches.
- System: `GET /api/sys/status`, `POST /api/sys/capsules/reload`
- Ops probes (unauthenticated): `GET /healthz` (200 iff daemon socket reachable; ~zero IPC cost so safe for high-frequency liveness probes) and `GET /metrics` (Prometheus text-exposition format — `astrid_gateway_requests_total{method,route}`, `astrid_gateway_auth_failures_total`, `astrid_gateway_redeem_attempts_total`, `astrid_gateway_redeem_rate_limited_total`). Restrict access via reverse proxy / firewall.
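The audit-stream visibility rule in the route list (holders of `audit:read_all` see every entry, everyone else sees only entries whose principal matches their own) can be sketched as a single predicate. This is a minimal illustration, not the gateway's code: the `AuditEntry` and `Caller` types and the `visible_to` function are hypothetical names, and the plain string comparison for the capability is an assumption (the real check may resolve group grants or wildcards).

```rust
/// Hypothetical, trimmed-down audit event. The real event also carries
/// ts_epoch, required_capability, target_principal and params.
#[derive(Debug, Clone, PartialEq)]
struct AuditEntry {
    principal: String,
    method: String,
    outcome: String,
}

/// Hypothetical caller view: a principal id plus the capability strings it holds.
struct Caller {
    principal: String,
    caps: Vec<String>,
}

/// Returns true when `caller` may see `entry` on the audit SSE feed.
fn visible_to(caller: &Caller, entry: &AuditEntry) -> bool {
    // Assumption: the capability check is an exact string match.
    let sees_all = caller.caps.iter().any(|c| c == "audit:read_all");
    sees_all || entry.principal == caller.principal
}

/// Filters a batch of entries down to what `caller` is allowed to receive.
fn filter_feed<'a>(caller: &Caller, feed: &'a [AuditEntry]) -> Vec<&'a AuditEntry> {
    feed.iter().filter(|e| visible_to(caller, e)).collect()
}
```

With a feed containing one entry from `alice` and one from `bob`, a caller `alice` without `audit:read_all` receives only her own entry, while an operator holding the capability receives both.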
Localhost-bound by default; TLS expected upstream. Spawned by `astrid-daemon` when `etc/gateway-http.toml` has `enabled = true`; a missing or disabled config is a no-op, so single-tenant deployments are unaffected. The `POST /api/capsules` route is wired through to `KernelRequest::InstallCapsule` and the kernel-side handler ships in this release — see the `astrid-capsule-install` entry below. `ApprovalRequired` responses from the kernel are passed through structurally so dashboards can render approval prompts when capability gating triggers them. (#756)
- Discovery (unauthenticated):
- Invite-token primitives in the kernel. Four new `AdminRequestKind` variants — `InviteIssue`, `InviteRedeem`, `InviteList`, `InviteRevoke` — plus matching `AdminResponseBody::{Invite, InviteRedeemed, InviteList}`. Backed by a file store at `etc/invites.toml` that persists only `hex(sha256(token))` (never the raw token) with 0600 perms and atomic write-then-rename. `InviteRedeem` bypasses the cap-gate by explicit match in the dispatcher (the token IS the auth) and reuses the existing `AgentCreate` machinery under `admin_write_lock` so concurrent redeems can't double-spend. The audit sanitiser redacts both the raw token and the ed25519 public key from persisted audit rows, replacing each with its SHA-256 fingerprint. New `invite:issue` / `invite:list` / `invite:revoke` capabilities; the built-in `admin` group's `*` covers them. (#756)
- `astrid invite {issue, list, revoke, redeem}` CLI verbs. Operator parity with the gateway HTTP routes — useful for scripting and for end-to-end testing the invite flow without the HTTP front. `redeem` connects as `PrincipalId::default()` so a fresh-machine onboarding flow doesn't need a pre-existing `cli-context.toml`. `redeem` accepts either `--public-key <hex>` or `--keypair <name>` (reading the pubkey from the local keystore) and an optional `--switch` flag that auto-updates `cli-context.toml` to the freshly-minted principal.
- `astrid keypair {generate, list, show, pubkey, delete}` CLI verbs. Multi-key local ed25519 keystore at `~/.astrid/keys/local/<name>.{ed25519, pub.hex, meta.toml}` — 0700 parent dir, 0600 private file, 0644 public, atomic write-then-rename throughout, secret bytes zeroized on drop via `ed25519-dalek`'s zeroize feature. Each keypair carries a TOML metadata sidecar (schema_version, fingerprint, created_at, optional note, optional `bound_principal` set on a successful invite redeem). `pubkey --format openssh` emits `ssh-ed25519 AAAA…` so the same key can be reused with SSH-style tooling. Forward-compatible with a future `AuthMethod::HardwareKey` backend — only the meta `backend` field changes.
- `astrid_core::capability_grammar::KNOWN_CAPABILITIES`. Canonical static slice of every capability identifier the kernel recognises. The HTTP gateway's `/api/sys/capabilities` route references it; new kernel-side tests in `kernel_router/capability_catalog_tests.rs` enumerate every string returned by `required_capability` / `required_capability_for_admin_request` and assert each appears in the catalog — so adding a kernel cap without updating the catalog now fails CI. Compile-time `KNOWN_CAPABILITIES_COUNT` pin catches off-by-one omissions.
- `astrid-capsule-install` crate — kernel-side `KernelRequest::InstallCapsule` handler. Install machinery (file layout, content-addressing of WASM/WIT into `bin/<hash>.wasm` / `wit/<hash>.wit`, lifecycle hooks, topic baking, `meta.json` writes, `.capsule` archive unpacking with traversal/symlink defense) extracted from `astrid-cli` so the daemon and the CLI reach disk through the same code path. The kernel handler is path-only by construction: network sources (`@org/repo`, GitHub URLs, `openclaw:…`, `gh:`, raw HTTPS) are rejected with a structured error pointing callers at the gateway's registry route — the daemon never fetches arbitrary bytes during install. On success the handler triggers `load_all_capsules` so the new capsule is live without a daemon restart, and returns a flat JSON `InstallOutput` (target_dir, phase, installed_version, previous_version, wasm_hash, env_path, env_needs_prompt, missing_imports, export_conflicts) the dashboard can render directly. CLI behaviour is unchanged — `astrid capsule install` still handles source resolution (GitHub release-asset download with clone+build fallback, OpenClaw transpile, archive auto-detect, Rust-source auto-build), and the post-resolution install delegates to the new crate. (#756)
- Bus-direct admin path — 285× admin throughput increase, 1000s-of-agents ready. Gateway admin routes used to dial the kernel over the Unix socket through the `astrid-capsule-cli` proxy capsule, which capped at ~19 RPS regardless of concurrency (`MAX_ACTIVE_STREAMS = 8` + per-stream 50ms poll budget). For a deployment hosting thousands of agents that was a hard wall. New `astrid_gateway::bus_admin::BusAdminClient` publishes admin requests directly onto the in-process `kernel.event_bus` (which the gateway already holds for the SSE audit + agent routes) and subscribes to the response topic locally. Same envelope shape, same `request_id` correlation, same kernel-side dispatcher — three fewer hops. Every admin route now goes through `state.admin_client(caller)` rather than `AdminClient::connect(...).await` over the socket. Measured ceiling: 5,400 RPS reads at c=50 (p99=15ms) vs the previous 19 RPS / 1054ms ceiling. Requests were also serialised through the admin dispatcher's single-task loop; that loop is now parallelised too (`tokio::spawn` per request in `spawn_admin_router`) so reads no longer block behind writes, while the existing `admin_write_lock` continues to serialise the actual write handlers. Writes now peak at ~46 RPS (bound by `admin_write_lock` + the TOML rewrite; a future scalability item, file an issue for an LSM-backed invite/profile store when it bites). All 34 e2e stories continue to pass on the new path; the socket-based `AdminClient` is unchanged and still serves the CLI's external uplink use case. (#756)
- `POST /api/agent/prompt` — agent invocation over HTTP, SSE response stream. Dashboard clients can now talk to the AI through the gateway, not only via the CLI's `user.v1.prompt` socket path. The route publishes `IpcPayload::UserInput { text, session_id, context }` directly onto the in-process kernel event bus (no proxy round-trip — sidesteps the capsule-cli stability issue tracked at unicity-astrid/capsule-cli#18) and returns a Server-Sent Events stream emitting `event: ready` (correlation handshake), `event: delta` (incremental tokens from `agent.v1.stream.delta`), `event: response` (final reply on `agent.v1.response`, closes the stream), `event: session_changed` (`agent.v1.session_changed`), and `event: elicit` (forwarded from `astrid.v1.elicit.*` for follow-up-input prompts). Per-session filtering at the consumer end so multiple concurrent dashboards don't see each other's chunks. 5-minute upper bound on the stream — dashboards re-POST to continue. New `agent` tag in the OpenAPI spec; new `PromptRequest` + `PromptReady` schemas. Verified end-to-end against the live daemon: `ready` event arrives with the caller's principal echoed back. Full LLM round-trip requires the react/identity/LLM-provider capsules to be installed. (#756)
- Path-parameter routes fix — every `{id}`-style route was silently broken. Gateway is built on `axum 0.7.9` which uses `:id` syntax for path captures; `{id}` is `axum 0.8+`. The routes module registered every parameterised path with `{id}` literally, so requests to `/api/sys/principals/alice`, `/api/sys/principals/alice/quotas`, `/api/capsules/foo/env/api_key`, `/api/sys/groups/ops-team`, `/api/sys/invites/<fp>`, and every other parameterised admin route returned a 404 — only the no-params routes worked. The OpenAPI spec uses `{id}` (correct, that's the OpenAPI standard) so the discrepancy didn't surface in the drift test. Caught by running the multi-story e2e against a live daemon; every parameterised route now routes correctly. The OpenAPI spec is unchanged (it's already correct for that format). (#756)
- End-to-end multi-perspective story test plan. 34-assertion script (`scripts/e2e-stories.sh`, gated on a live daemon — not part of the cargo test matrix) walks the full admin gateway from three perspectives: bootstrap admin (issue invites, create custom groups, grant caps), team operator (redeems into a custom group, can issue invites but not delete principals, can't pair-device because the team group lacks `self:auth:pair`), and regular agent (redeems into the built-in `agent` group, can pair devices, sees only themselves in `/api/sys/principals`, can't issue invites or read system status until admin grants the cap, can read their own quota after admin sets it). Also covers `/api/openapi.json` (29 paths + 32 schemas), `/api/sys/capabilities` (34 caps, 8 categories), bearer refresh, unauthenticated 401 paths, and the agent SSE handshake. All 34 stories pass against the live daemon as of this commit. (#756)
- Gateway end-to-end smoke test + `ConnectInfo` fix. New `astrid-integration-tests/tests/gateway_e2e.rs` boots a real `Kernel` and `GatewayState` against a tempdir `$ASTRID_HOME` and proves the boot artefacts (Unix socket, persistent KV directory, gateway signing key at `keys/gateway.ed25519`) land on disk, the unauthenticated routes (`/api/distribution`, `/api/openapi.json`, `/healthz`) return 200 against the live state, bearers round-trip via the same signing material the middleware verifies, and the kernel-bus → SSE wiring delivers audit events to subscribers. The full `/api/auth/redeem` socket loop is gated on the `astrid-capsule-cli` proxy capsule (out of this crate's scope to load) and was exercised manually against a built daemon. Bug found by manual exercise: `axum::serve(listener, router)` without `into_make_service_with_connect_info::<SocketAddr>()` means the `ConnectInfo<SocketAddr>` request extension is missing, so `POST /api/auth/redeem` (which extracts it for per-IP rate limiting) returned 500 `Missing request extension`. Fixed by switching the serve shape — every tower-style in-process test still works because `oneshot` doesn't require a real connection, but the production daemon does. (#756)
- `GET /api/openapi.json` — OpenAPI 3.x spec emission via `utoipa`. Every gateway handler now carries a `#[utoipa::path(...)]` annotation and every request/response type derives `ToSchema`. The aggregated `ApiDoc` lists all 35 routes under `paths(...)` and 32 schemas under `components(schemas(...))`. The spec declares a `bearerAuth` HTTP security scheme (description spells out the verification posture) as the default requirement; unauthenticated routes (`/api/distribution`, `/api/distribution/onboarding`, `/api/auth/redeem`, `/api/auth/pair-device/redeem`, `/healthz`, `/metrics`, `/api/openapi.json` itself) clear it via `security(())`. Twelve `tags(...)` group routes by family (auth, principals, caps, quotas, groups, invites, capsules, env, audit, system, discovery, ops) so Swagger UI / Redoc / Scalar render coherent sections. Type-system boundary: a handful of response types originate in `astrid-core` (`PrincipalId`, `Quotas`, `AgentSummary`, `GroupSummary`, `InviteIssued`, `InviteSummary`, `DaemonStatus`, `CapabilityInfo`) and don't carry `ToSchema` — pulling utoipa across the kernel-side dep graph would balloon the build for one observability concern. Those fields use `#[schema(value_type = serde_json::Value)]` to render as generic objects in the spec with prose describing the exact field shape in the surrounding text. New drift test in `tests/router.rs` enumerates every registered route in `routes::build` and asserts each appears in the spec — adding a route without an annotation fails CI. New `ErrorBody` struct in `error.rs` documents the unified failure shape every status code emits (`{ error, reason?, retry_after_secs? }`); the `IntoResponse` impl still writes the same wire format via `json!()` macros so no body changes for existing clients. Drop the URL into `openapi-typescript` / `openapi-generator` / `kiota` to get a typed client; drop it into Swagger UI / Redoc / Scalar for browsable docs. (#756)
- Source-direct content-addressing during install. Previously the install copied the entire capsule tree into the target directory, then read the `.wasm` back out, BLAKE3-hashed it, wrote `bin/<hash>.wasm`, and deleted the per-capsule copy — same dance for `wit/`. The new lib hashes WASM and WIT from the source directly, writes to the content store once (atomic temp+rename so concurrent writers on identical bytes converge harmlessly), and the per-capsule directory copy excludes `*.wasm` and the top-level `wit/` by construction. No transient staging copy. The runtime contract is unchanged — loader still reads via `resolve_content_addressed_wasm` — only the install path is cleaner. Pre-flight hashing also means content-addressing failures (bad source, hash collision, disk full on `bin/`) leave the existing install intact: no rollback needed because `target_dir` hasn't been touched yet. (#756)
- `Distro.toml` `[invites]` and `[branding]` sections (additive). `[invites] { issuers, default-group, default-expires, max-principals }` declares the deployment's onboarding policy (empty `issuers` = single-tenant, no registration UI). `[branding] { icon, primary-color, accent-color }` carries dashboard visual hints. Parser-validated: non-empty `issuers` requires `default-group`; durations parse as `Ns`/`Nm`/`Nh`/`Nd`; colours match `#RGB` or `#RRGGBB`; icons are capped at 64 KiB on parse.
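The three parser rules for `Distro.toml` stated above (non-empty `issuers` requires `default-group`; durations of the form `Ns`/`Nm`/`Nh`/`Nd`; colours of the form `#RGB` or `#RRGGBB`) can be sketched with the standard library alone. This is an illustration under assumptions, not the project's parser: the function names are hypothetical, and rejecting uppercase unit letters or a leading `+` sign is a choice made here, since the notes do not specify either.

```rust
/// Parses `Ns`, `Nm`, `Nh` or `Nd` into a number of seconds.
/// Returns None for anything else (including overflow).
fn parse_duration_secs(input: &str) -> Option<u64> {
    let input = input.trim();
    // The last character is the unit; everything before it must be digits.
    let unit = input.chars().last()?;
    let digits = &input[..input.len() - unit.len_utf8()];
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let n: u64 = digits.parse().ok()?;
    let multiplier: u64 = match unit {
        's' => 1,
        'm' => 60,
        'h' => 3_600,
        'd' => 86_400,
        _ => return None,
    };
    n.checked_mul(multiplier)
}

/// Accepts `#RGB` or `#RRGGBB` with hexadecimal digits only.
fn is_valid_colour(input: &str) -> bool {
    match input.strip_prefix('#') {
        Some(hex) => {
            (hex.len() == 3 || hex.len() == 6) && hex.bytes().all(|b| b.is_ascii_hexdigit())
        }
        None => false,
    }
}

/// Enforces "non-empty issuers requires default-group".
/// An empty issuer list is the single-tenant case and needs no group.
fn validate_invites(issuers: &[&str], default_group: Option<&str>) -> Result<(), String> {
    if !issuers.is_empty() && default_group.is_none() {
        return Err("non-empty `issuers` requires `default-group`".to_string());
    }
    Ok(())
}
```

Under these assumptions `parse_duration_secs("7d")` yields `Some(604800)`, `is_valid_colour("#1a2B3c")` is true, and `validate_invites(&["admin"], None)` returns an error.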
Changed
- Guest `ERROR`-level logs now surface in the daemon log, not only the per-capsule log file. When a run-loop capsule's `run()` returns `Err` before signaling ready, the daemon log showed only a contextless `Capsule run loop exited before signaling ready` — for the sole-socket-owner uplink (the cli proxy) that takes the whole daemon down, and diagnosing it meant hours of guessing. The reason was never actually lost: the SDK `#[astrid::run]` macro logs `run loop exited with error: {e}` at ERROR level before returning, but `sys::log` writes guest logs to a per-capsule file (`effective_capsule_log()`) and only fell back to the daemon's tracing subscriber when there was no per-capsule file — so the reason landed in a separate file operators don't check during a crash. ERROR-level guest logs are now emitted to the daemon's tracing subscriber in addition to the per-capsule file (lower levels stay file-only when captured, preserving the daemon log's signal-to-noise). The silent run-loop crash is now a one-line diagnosis (`ERROR …host::sys: run loop exited with error: … plugin=<capsule>`) sitting right next to the kernel's message. No ABI/WIT change. Refs #884.
- Publish/subscribe ACL wildcard matching now uses the route-layer subtree semantics — a declared trailing `*` covers the whole namespace. ACL authorization used the strict `topic::topic_matches` (a `*` matches exactly one segment) while event delivery uses `astrid_events::TopicMatcher` (a trailing `*` is a subtree wildcard matching one or more segments at any depth). That divergence forced capsule manifests to enumerate wildcard depth (`astrid.v1.admin.*` / `*.*` / `*.*.*`) merely to authorize publishing topics whose depth varies or is unknown (uuid suffixes, variable sub-paths) — the same matcher asymmetry behind the cli run-loop subscribe confusion, now on the publish side. The two ACL checks (`publish_inner`, `check_subscribe_acl`) now authorize via a single allocation-free `astrid_events::topic_pattern_matches` — extracted as the one source of truth now used by routed delivery (`TopicMatcher`), broadcast delivery (`EventReceiver`), and ACL authorization alike — so a declared `astrid.v1.admin.*` authorizes every admin topic at any depth and the three paths can never silently diverge again (this also folds away a duplicated matcher and fixes a latent equal-segment-count bug in the broadcast path's hand-rolled iterator form). Scope is the two ACL sites only — interceptor dispatch keeps strict matching, and the runtime "wildcard must be terminal" subscribe gate is unchanged. The change is permissive (authorizes more, denies nothing previously allowed; delivery was already subtree); breadth is the operator's decision at install, not something the matcher enforces by forcing enumeration. Refs #882.
- The `astrid:process` WIT now documents the id-keyed persistent `write-stdin` / `close-stdin` as implemented. These landed in #867 — the persistent-process registry retains the child's stdin pipe host-side across pooled-instance resets (1 MiB-capped, backpressured, ownership-checked, audited) — but the WIT still tagged them `(NOT YET IMPLEMENTED)` and carried two stale `PERSISTENT TIER` "stubbed until the registry lands" banners. The tags are dropped and the banners corrected; function signatures are unchanged (doc-only). The ephemeral `ProcessHandle` form, `attach`, and `watch` / `unwatch` remain genuinely deferred and stay tagged. An acceptance test now locks the id-keyed path: deliver bytes → a by-id re-write (needing only registry + id, exactly what a post-reset instance holds) reaches the same child → `close-stdin` yields a clean EOF exit → over-cap returns `too-large` → wrong-owner returns `no-such-process` → post-close returns `closed`. Refs #870.
- Host-call concurrency is split into separate blocking vs async-I/O semaphores — the LLM-path throughput gate. A single `host_semaphore` sized at `cores - 2` previously gated every host call: both the blocking ones that `block_in_place` + `block_on` and pin a tokio worker for the whole permit-held wait (KV, identity, sys, fs, the net/process security gates, DNS, sockets) and the async-I/O ones that `.await` real I/O and free the worker (HTTP request/stream, `ipc::recv`). The `cores - 2` cap is right for the blocking class — it must not approach the worker-pool size or blocking host work starves the scheduler — but it throttles outbound I/O far below what the host sustains, capping the LLM/HTTP path. `HostState` now carries two gates: `blocking_semaphore` (host-derived `cores - 2`, the same ceiling — but no longer contended by I/O calls, so strictly more headroom) and `io_semaphore` (host-derived, cores-scaled and clamped by half the process `RLIMIT_NOFILE` — floored at 1, so on an fd-scarce host the descriptor budget wins and the ceiling can fall below the `IO_MIN` floor, preventing `EMFILE` — since each in-flight async call may hold a socket fd). The four generic `bounded_*` host helpers are unchanged; each call site passes the semaphore matching its class (the `block_on` helpers take blocking, the `await` helpers take I/O). Net stays on the blocking semaphore for now — reclassifying its socket/accept/DNS paths, which do real I/O but currently use the blocking `block_on` helpers, to the async semaphore is a follow-up. Refs #816.
- Per-Store memory limiting now also meters per-principal peak usage. The plain `wasmtime::StoreLimits` that capped each capsule instance's linear memory is replaced by a `StoreMemoryMeter` that keeps the same enforcement (the per-invocation `max_memory_bytes` ceiling) and records, into a kernel-owned shared `MemoryLedger`, the high-water linear-memory size each invoking principal grows a Store to — the RAM analogue of the per-principal `FuelLedger`. It is keyed by the same invoking principal (caller → owner → default) the fuel ledger charges, and re-targeted per invocation since a pooled Store is leased across principals. Sharded + atomic like the fuel ledger, so recording stays off the lock — and unlike the fuel ledger it is bounded (capped at a max principal count, evicting the lowest-peak entry when full) so a flood of ephemeral sub-agent principals cannot grow it without limit (astrid#827). Telemetry only — no deny path. Refs #816.
- The per-capsule instance pool is now dynamic and host-sized — the fixed `INSTANCE_POOL_SIZE = 16` is gone. Each non-run-loop capsule used to eagerly instantiate exactly 16 `(Store, Instance)` pairs at load regardless of machine size or actual concurrency — 16× the linear-memory footprint up front even for an idle capsule, and a hard concurrency ceiling of 16 on a large host. The pool now warm-starts at a small `min_idle`, grows lazily (a checkout that finds no warm instance mints a fresh one while holding a permit, so the total never exceeds the max) toward a host-derived max (cores-scaled, replacing the magic 16), and an idle-eviction timer trims the warm set back down to `min_idle` after a burst subsides, reclaiming the memory of instances built under load. Net effect: less resting memory for idle capsules, more peak concurrency on big hosts. The max is operator-overridable via `[capsule].instance_pool_size`, `ASTRID_CAPSULE_INSTANCE_POOL_SIZE`, or `astrid-daemon --instance-pool-size`. Run-loop and `host_process` capsules stay pinned to a single Store — the `host_process` carve-out can never lease a second one (enforced by `allow_grow = false`, belt-and-suspenders over its always-warm single instance). Free-checkout soundness is unchanged: a lazily-grown instance is built by the same `HostState` factory as an eager one and reset identically on return. A RAM-budget-derived max lands with the per-principal memory ledger. Refs #816.
- Per-principal interceptor CPU is now summed cross-capsule into one kernel-owned `FuelLedger`. Each capsule's `WasmEngine` previously kept its own per-principal fuel `HashMap`, fragmenting a principal that drove N capsules into N independent sub-totals; the ledger is now a single kernel-owned `Arc<DashMap<PrincipalId, AtomicU64>>` cloned into every engine, summing a principal's interceptor CPU across all capsules. Sharded + atomic (never a global mutex), so it does not reserialise the hot interceptor path. Telemetry only — no read/deny path in this change. Refs #819.
- Per-capsule WASM instance pool: principals' interceptors now run concurrently instead of serialising through one `Store`. Each non-run-loop capsule was a single `Store<HostState>` behind one mutex, so `invoke_interceptor` processed exactly one invocation at a time per capsule — the throughput floor that kept the #813 orchestration cliff open even after the async-Wasmtime work removed worker-pinning (one LLM turn every ~3 s, invariant to concurrency across a 50× range, measured directly as a 2000+ deep invocation backlog on one `Store`). `WasmEngine` now holds a `pool::CapsuleInstancePool` of N `(Store, Instance)` pairs built from one shared `InstancePre`; `invoke_interceptor` leases a free instance (semaphore-gated `checkout().await`) so up to N principals execute on independent `Store`s at once. Lease lifecycle is one RAII guard (`PoolCheckout`) whose `Drop` folds the Phase-3 CLEAR (resetting every `invocation_*` field) and the return-to-pool onto every exit path — normal return, `?`, panic-unwind, and future-drop on caller cancellation — replacing the old inline `ClearOnDrop`. Free checkout is sound because the per-capsule pool-safety audit confirmed no pooled capsule relies on in-WASM-memory state surviving across invocations (interceptor capsules use wasmtime resources within a single invocation). Pool size is a host-derived dynamic max (the originally-fixed `INSTANCE_POOL_SIZE = 16` is superseded by the dynamic, host-sized pool — see the dedicated entry above), or 1 for run-loop capsules (they keep their dedicated `Store`, owned by the run loop) and for `host_process` capsules — a fail-secure carve-out for `astrid-capsule-shell`, whose background-process handles live across invocations and must stay single-`Store`; keyed off the existing capability, so no manifest/contract change. `ipc_limiter` is now a shared `Arc` across a capsule's instances so the per-capsule IPC throughput budget is not multiplied by pool size; the resource-table-mirror counters (`net_stream_count`, `subscription_count`) stay per-`Store` (correctly scoped to each instance's table). Refs #816.
- Async Wasmtime: guest invocations no longer pin tokio workers (`block_in_place` removed from the WASM hot path). Component-Model async is now on across the kernel — every guest entry goes through `Linker::instantiate_async` / `TypedFunc::call_async`, and the per-capsule `Store` is now wrapped in a `tokio::sync::Mutex` instead of a `std::sync::Mutex`. The orchestration cliff fix from #813 protected the per-(capsule, topic, principal) routing layer; this finishes the job at the wasmtime layer by ensuring a parallel `invoke_interceptor` waiter `.await`s on the lock rather than holding a worker via `block_in_place` for the full lifetime of the currently-running guest call (#816). Per-capsule serialisation is preserved (one guest call at a time per capsule, same as before), but the executor is free to schedule other capsules while one is queued. `ExecutionEngine::invoke_interceptor` and `Capsule::invoke_interceptor` are now `async fn`; the dispatcher / kernel callers `.await` accordingly. `engine::wasm::run_lifecycle` is `async fn` too, with `astrid-capsule-install::lifecycle` driving it through the available runtime handle. Cancellation safety: the existing `ClearOnDrop` RAII guard already runs on the future-drop path, so a cancelled interceptor still clears `caller_context`, `interceptor_active`, and every `invocation_*` field on `HostState` before releasing the store lock — `tokio::sync::Mutex` has no poisoning to recover from, so the previous "poisoned lock" branches are gone. Host trait impls remain synchronous: this lands the runtime infrastructure for async hosts but does not yet flip the bindgen-side `imports: { default: async }` flag — the slow-blocking host fns (`ipc::recv`, `elicit`, `net.read/write`, `http::request`) still use `bounded_block_on_cancellable` internally, scoped to a single host call rather than the entire guest invocation. Migrating those to truly async host fns is a follow-up. Refs #816.
- `SocketClient` and `AdminClient` moved out of `astrid-cli` into a new `astrid-uplink` crate. Both the CLI and the new HTTP gateway are kernel uplinks with the same trust shape (read `system.token`, handshake, stamp `IpcMessage.principal`); the framing / handshake / admin-request correlation logic now lives in one place. CLI keeps thin shims at `crate::{socket,admin}_client::*` so verb modules don't change. Behaviour-preserving refactor. `SocketClient::send_input` now takes an explicit `caller: &PrincipalId` instead of looking up the CLI's active-agent context internally — the CLI passes its context-resolved principal through a `send_input_as_active_agent` helper.
- `astrid-uplink::KernelClient`. Sibling of `AdminClient` for the `KernelRequest` / `KernelResponse` family that flows over `astrid.v1.request.*` (capsule list, status, reload, etc.). Embeds a per-call UUID in the topic suffix so concurrent HTTP requests can't pick up each other's responses on the shared bus.
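The difference between the strict matcher and the subtree matcher described in the ACL wildcard entry of this list can be sketched as two small functions. This is not the code behind `astrid_events::topic_pattern_matches`; the function names and bodies are hypothetical, and treating a non-trailing `*` as a single-segment wildcard in the subtree form is an assumption made here. Both functions walk the segment iterators without allocating.

```rust
/// Strict form: segment counts must be equal and `*` matches exactly one segment.
fn strict_matches(pattern: &str, topic: &str) -> bool {
    let mut p = pattern.split('.');
    let mut t = topic.split('.');
    loop {
        match (p.next(), t.next()) {
            (None, None) => return true,
            (Some(ps), Some(ts)) if ps == "*" || ps == ts => continue,
            _ => return false,
        }
    }
}

/// Subtree form: a trailing `*` covers one or more remaining segments.
/// Assumption: a `*` that is not the last pattern segment matches one segment.
fn subtree_matches(pattern: &str, topic: &str) -> bool {
    let mut p = pattern.split('.').peekable();
    let mut t = topic.split('.');
    while let Some(ps) = p.next() {
        let is_last = p.peek().is_none();
        match t.next() {
            // Topic ran out first: even a trailing `*` needs at least one segment.
            None => return false,
            Some(ts) => {
                if ps == "*" && is_last {
                    // Trailing wildcard swallows this segment and anything deeper.
                    return true;
                }
                if ps != "*" && ps != ts {
                    return false;
                }
            }
        }
    }
    // Pattern exhausted without a trailing wildcard: topic must be exhausted too.
    t.next().is_none()
}
```

With the pattern `astrid.v1.admin.*`, the strict form accepts `astrid.v1.admin.invite` but rejects `astrid.v1.admin.invite.issue`, which is why manifests had to enumerate depths; the subtree form accepts both and still rejects the bare `astrid.v1.admin`.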
Deprecated
- `astrid_capsule_pending_tail_overflow_total` — removed. Replaced by `astrid_capsule_route_byte_evictions_total{capsule, principal_class}` and `astrid_capsule_route_quantum_starved_total{capsule, principal_class}` from the new publish-side routing demux. Operator dashboards alerting on the old counter must migrate.
- `astrid_capsule_interceptor_permit_wait_seconds_total` — removed. The per-capsule interceptor semaphore is gone (see Removed below); there is no permit wait to measure. Operator dashboards alerting on the histogram must drop the series.
Removed
- `astrid-openclaw` crate and the entire OpenClaw plugin build path (#829). The TypeScript/JavaScript-to-WASM compiler that absorbed the OpenClaw plugin ecosystem is gone — it was broken three ways and was the last extism vestige in `core`. (1) The Tier 1 QuickJS kernel was built from a personal fork of extism's js-pdk (`github.com/nicholasgasior/extism-js@v1.6.0`) that no longer exists (HTTP 404), while `build.rs` separately pointed at upstream `extism/js-pdk` — the two build paths already disagreed. (2) `kernel/engine.wasm` was never committed (only its BLAKE3 hash), and the committed hash corresponded to a fork build that can't be reproduced from upstream, so a clean checkout had no path back to a working kernel. (3) The compiler emitted extism-PDK core-module exports (`__invoke_i32`, named exports) while the capsule host migrated off Extism to the wasmtime Component Model / WIT — so its output could no longer load. It was also a CI breaker (the QuickJS auto-build + wasi-sdk exhausted the runner disk → SIGBUS). Removed alongside the crate: `astrid-build`'s `openclaw` builder, the hidden `--wizer-internal` subcommand, the `openclaw` project type and `openclaw.plugin.json` autodetection; the CLI `openclaw:` install source; the kernel-router `openclaw:` remote-source rejection; the capsule watcher's `openclaw.plugin.json` detection; `packages/openclaw-mcp-bridge` (the Tier 2 Node bridge); `scripts/build-quickjs-kernel.sh` and `scripts/compile-test-plugin.sh`; and the now-orphaned `oxc` / `oxc_allocator` / `wizer` workspace dependencies. The live channel-registration infrastructure that carried the legacy name is renamed on this branch: `UplinkSource::OpenClaw` → `UplinkSource::Bridge` (constructor `new_openclaw` → `new_bridge`, serde wire tag `open_claw` → `bridge`, `Display` `openclaw(…)` → `bridge(…)`). The serde wire-tag rename stays backward-compatible via a `#[serde(alias = "open_claw")]` on the `Bridge` variant: new writes serialize as `bridge`, but any `UplinkDescriptor` persisted or transmitted under the legacy `open_claw` tag still deserializes. (`UplinkSource` / `UplinkDescriptor` derive `Serialize` / `Deserialize` and `UplinkDescriptor` is documented as deserializable for trusted persistence, so the alias is the safe path rather than assuming the value is in-memory-only.) Breaking: `astrid build --type openclaw`, `astrid capsule install openclaw:…`, and installing a bare `openclaw.plugin.json` directory are no longer supported.
- Per-capsule interceptor semaphore (`MAX_CONCURRENT_INTERCEPTORS`) removed. The Wasmtime `Store` mutex at `engine/wasm/mod.rs:1286` already serializes invocations inside `tokio::task::block_in_place`, so the per-capsule cap-of-4 could never produce more than 1 concurrent guest execution — it was a redundant ceiling left over from the pre-#813 SET/CALL/CLEAR race, which is now closed by `ClearOnDrop` (Layer 1) and the per-(capsule, principal) chain mutex (Layer 3). Run-loop capsules don't construct a `Store` at all, so for them the semaphore was pure overhead on every interceptor call. The `Capsule::interceptor_semaphore` trait method, the `CompositeCapsule.interceptor_semaphore` field, and the two `acquire_owned().await` sites in `dispatcher.rs` are gone. `astrid_capsule_interceptor_permit_wait_seconds_total` is removed with no replacement — there is no permit wait to measure. Closes #813.
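The backward-compatible wire-tag rename in the OpenClaw removal entry (new writes use `bridge`, legacy `open_claw` still reads) behaves like the following hand-rolled stand-in. The real code uses serde's `alias` attribute on the `Bridge` variant; this sketch deliberately avoids the serde dependency so it compiles with the standard library alone. The `UplinkSourceTag` type and its two methods are hypothetical, and only the one variant discussed in the entry is modelled.

```rust
/// Hypothetical stand-in for the wire tag of `UplinkSource`.
/// Only the renamed variant is modelled here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum UplinkSourceTag {
    Bridge,
}

impl UplinkSourceTag {
    /// Serialisation always emits the new tag.
    fn to_wire(self) -> &'static str {
        match self {
            UplinkSourceTag::Bridge => "bridge",
        }
    }

    /// Deserialisation accepts the new tag and the legacy alias,
    /// so descriptors persisted before the rename still load.
    fn from_wire(tag: &str) -> Option<Self> {
        match tag {
            "bridge" | "open_claw" => Some(UplinkSourceTag::Bridge),
            _ => None,
        }
    }
}
```

A value read from a legacy `open_claw` record and written back out is emitted as `bridge`, so stored data migrates to the new tag as it is rewritten.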
Fixed
- Capsule tools reach the LLM again on the OpenAI-compatible path (
react→prompt-builder→ provider). The pipeline was handing the model zero tools.prompt-builder'scollect_tool_schemas()fans outtool.v1.request.describeand drainstool.v1.response.describe.*, but its manifest granted neither topic, so the fan-out failedCapabilityDeniedbefore it ever fired — a code/manifest mismatch left by the per-domain-WIT migration. And even once unblocked, the KV-persisted__tool_schema_cachewas never invalidated: its only trigger is aprompt_builder.v1.invalidate_tool_cacheevent that nothing publishes, so a stale (or empty) tool list survived across daemon restarts and masked freshly installed tool capsules. Fixed incapsule-prompt-builder(unicity-astrid/capsule-prompt-builder#19) by declaring the two describe-fan-out ACLs the code already uses and invalidating the cache on the kernel'sastrid.v1.capsules_loadedbroadcast. This is the consumer half of the tool-discovery restoration; the producer half — the#[capsule]-generatedtool_describearm publishing its descriptor instead of returning it (an interceptor return is not fanned out under the current ABI) — ships in astrid-sdk 0.7.1. All three (ACL grant + cache invalidation + a tool-capsule rebuild against 0.7.1) are required for tools to reach the model; verified end-to-end with a fake-LLM harness (0 → 16 tools across system/fs/http/skills). Refs #892, #625. - Per-principal quotas now apply on the
ipc::recvpath, not just the interceptor path. A run+recv capsule resolved the invoking principal'sPrincipalProfileonly on the dispatcher-driven interceptor path (invoke_interceptor); the guest-pulledipc::recvpath (install_recv_invocation_context) leftinvocation_profile = None, soeffective_profile()fell back to the process-global default profile. Every principal driving a run-loop capsule therefore ran under the default quota (max_background_processes, IPC throughput, HTTP streams) regardless of its operator-configured per-principal profile — a documented gap (host_state.rs: "quotas remain the capsule owner's … move to a real lookup when per-principal quota enforcement is needed"). The recv path now resolves the publishing principal's profile through the sharedPrincipalProfileCache(threaded intoHostState) and installs it, so per-principal ceilings apply on both invocation paths. The profile is resolved for every publisher, the capsule owner included —effective_profile()'s fallback is the process-global default, never the owner's configured profile, so (matching the interceptor path) an owner-published message must resolve the owner's profile or its on-disk quotas would be silently ignored. Best-effort on the recv path, which has no deny channel: a failed load logs and falls back to the default. The pooled-interceptor and lifecycle-hook constructions receive the cache handle; the one-shotrun_lifecyclepath passesNone(it runs no recv loop). Fail-safe and bounded before this fix (the default quota is a real ceiling, never unbounded), but the per-principal value was not honoured. Closes #877. - Dispatcher no longer drops events when a per-topic consumer is reclaimed under a burst. Each
(capsule, topic, principal) route has its own idle-evicting mpsc consumer. Under a concurrent burst the consumer could leave a closed sender in the queue map — the idle-evict TOCTOU (a stale clone outliving the sender_strong_count == 1 check) or a consumer task that ended. get_or_spawn_consumer then handed that dead sender back, so every subsequent try_send failed Closed and the event was dropped permanently, stalling all delivery for that route (observed as a react/user.v1.prompt prompt stall) even though the capsule itself stayed healthy. The fix never returns a closed sender: get_or_spawn_consumer removes a mapped entry whose receiver is gone (is_closed()) and re-spawns a fresh consumer — the explicit remove also closes a leak in the degrade-to-shared path, where the re-keyed insert would otherwise never overwrite the stale (capsule, Some(principal)) entry. dispatch_single keeps a re-spawn-and-retry backstop for the narrow window where a sender closes between return and send; Full (a live consumer with a saturated bounded queue) stays an intentional, per-principal-bounded shed-load drop — recoverable via the requester's IPC/SSE timeout — and is distinguished from Closed (a bug, never dropped, now flagged as a security_event if it ever recurs post-respawn). A live 100-wide concurrent-prompt smoke that previously delivered 4/100 with 39 silent channel-closed drops now delivers 95/100 with zero drops. Closes #837.
- Dispatcher mpsc partitioned per-(CapsuleId, PrincipalKey). Layer 1's bus-side routing fix narrowed the publish surface to
PrincipalKey granularity, but the capsule dispatcher continued to key its mpsc consumer queues and chain mutexes on the 3-bucket PrincipalClass enum — so 1000 distinct user-class principals still collapsed onto a single 256-slot queue and the cliff persisted at the dispatcher layer. Each queue is now keyed on the full PrincipalKey (Option<String>) with capacity 64, a 60-second idle-eviction grace that returns the queue map to the working set, and a MAX_DISPATCHER_QUEUES_PER_CAPSULE = 10_000 cap that degrades to a single shared (capsule, None) queue (with audit-logged eviction) on pathological fan-in. chain_locks widens to the same key so chains for distinct principals on the same capsule run concurrently. Closes #813.
- Gateway SSE migrated to
subscribe_topic_routed. The four agent SSE topics (agent.v1.response, agent.v1.stream.delta, agent.v1.session_changed, astrid.v1.elicit) and the audit firehose (astrid.v1.audit.entry) now subscribe via the bus's routed surface (publish-side per-(topic, principal) DRR fairness + byte-budget eviction) instead of the broadcast channel. Eliminates the ~4 s disconnect under N=100 fan-in that surfaced as event:ready followed by silence: a routed subscription naturally fans across every principal that matches the topic — the per-principal DRR machinery provides fairness within the route, so no wildcard-principal API is added or needed. The post-receive session_id filter in agent.rs is retained — session is a payload concern, not addressing. A new GatewayState.gateway_route_uuid (fresh per boot) pairs all gateway SSE routes under capsule="gateway" for telemetry. Accepted regression: there's no explicit Lagged signal from a routed receiver, so silent publish-side eviction replaces the broadcast-channel lag-disconnect; the 5-minute STREAM_TIMEOUT remains the only stalled-stream catch (open follow-up: expose a publish-side eviction signal via drain_lagged()). Closes #813.
- Native per-(capsule, topic, principal) IPC routing on the kernel event bus. Closes the structural root cause of #813's "concurrency cliff" rather than relying on the per-recv pending-bucket workaround.
EventBus gains an internal routes: Arc<parking_lot::RwLock<HashMap<RouteKey, Mutex<RouteEntry>>>> (keyed on (capsule_uuid, topic_pattern, subscription_rep)) populated by a new EventBus::subscribe_topic_routed entrypoint that capsule guests use in place of the broadcast-shaped subscribe_topic. Each RouteEntry owns a demand-allocated HashMap<PrincipalKey, PrincipalQueue> so a bus with 5000 idle principals costs zero per-principal entries; only active principals materialise a sub-queue, ~96 bytes each plus payload bytes. Publish-side fan-out happens AFTER broadcast::send (so a slow routed enqueue can never delay untargeted consumers — kernel_router, admin_router, bus_monitor, connection_tracker stay on the broadcast subscribe path) and applies deficit-round-robin (DRR) drain with a 1 MiB per-subscription byte budget and a 4 KiB quantum floor — 5000 active principals still make per-round progress. Under sustained byte pressure the bucket whose head was enqueued earliest gives up its head message until the new payload fits; streaming response terminators are preserved by construction (they're always the tail of their principal's queue, head-eviction trims the prefix not the tail). Each eviction emits tracing::error!(target: "astrid.audit.ipc", security_event = true, capsule, principal, evicted_topic, …) and bumps astrid_capsule_route_byte_evictions_total{capsule, principal_class}; sustained per-round back pressure surfaces via astrid_capsule_route_quantum_starved_total{capsule, principal_class}. The legacy SubscriptionEntry::pending / principal_order requeue path in crates/astrid-capsule/src/engine/wasm/host/ipc.rs is deleted — routed receivers never see mixed-principal batches in the first place because the demux happens publish-side, not consumer-side.
Dispatcher gains a sibling chain_locks: HashMap<(CapsuleId, PrincipalClass), Arc<tokio::sync::Mutex<()>>> held across each chain step so cross-class invocations on the same capsule run concurrently while same-class invocations serialise FIFO; the dispatch_single mpsc queue map is keyed on (CapsuleId, PrincipalClass) for the same reason. PrincipalClass is a new bounded enum (System/User/Agent) extracted into crates/astrid-capsule/src/principal_class.rs so dispatcher.rs and the routing demux agree on label cardinality (3 buckets × capsule_count). WIT publish/recv signatures unchanged. Closes #813.
- Capsule orchestration concurrency cliff resolved (
ipc::recv truncation + invoke_interceptor lock-window collapse + per-capsule semaphore wiring). Three independent defects collapsed throughput at ~100 concurrent principals and silently dropped cross-principal IPC traffic. (1) ipc::poll / ipc::recv truncated any mixed-principal batch at the first principal boundary — the tail was dropped, not deferred — because the per-recv principal context is keyed off a single publisher. The drain now partitions the tail into per-principal pending buckets on the subscription resource (cap 64 messages per bucket, 8 distinct principals per subscription, total 512 queued ≈ 0.5x the broadcast capacity; oldest-first drop on overflow, least-recently-pushed bucket eviction on the 9th principal), each drop logged as tracing::error!(target: "astrid.audit.ipc", security_event = true) and counted via a new astrid_capsule_pending_tail_overflow_total{capsule, principal_class} metric (principal_class ∈ system/user/agent — never raw PrincipalId, to keep label cardinality bounded). The next recv/poll round-robin-drains the pending buckets before reading fresh events, so cross-principal traffic surfaces instead of vanishing. (2) invoke_interceptor held a three-phase lock window (SET → drop → CALL → drop → CLEAR) which let a parallel chain dispatch observe another principal's caller_context between SET and CALL — the cross-principal race the 100-wide collapse turned on. The window is now collapsed to a single store.lock() held across SET + CALL + CLEAR, with a ClearOnDrop RAII guard so the per-invocation fields are wiped on every exit path (normal return, early ?, panic-unwind through func.call). Trade-off: chain steps for the same capsule now serialize on Store contention (they already did via the per-capsule mpsc on single-match dispatch); the existing per-capsule semaphore (cap 4) protects burst memory.
(3) The per-capsule interceptor semaphore was declared but never acquired by the dispatcher; chain and single-match invokes both now acquire_owned() a permit immediately before each invoke_interceptor call, with Err from a closed semaphore treated as "capsule unloading" (debug log + return for chains, continue for the consumer loop so hot-reload doesn't tear down the replacement capsule). Wait time recorded via astrid_capsule_interceptor_permit_wait_seconds_total{capsule} so the cap-of-4 is observable from day one. Companion: the EventReceiver::recv Lagged arm upgrades from warn! to tracing::error!(target: "astrid.bus", security_event = true) to match the audit-pipeline keying used by astrid-capabilities/kernel-router. Companion capsule fix in capsule-react: TurnState::load and load_active_sessions switch from kv::get_json to kv::get_json_opt, treating cold-start Ok(None) as the default instead of an error case, which previously generated ERROR floods on every fresh session. Closes #813.
- Per-invocation env overlay reaches capsule
env::var(...) calls. The gateway's POST /api/capsules/{id}/env/{field} route writes operator-supplied env values to $ASTRID_HOME/home/<principal>/.config/env/<capsule>.env.json — but the kernel's get_config host-fn was reading only self.config (the manifest defaults loaded once at capsule boot from the load-time principal's home). For any principal other than default the env-write route was effectively write-only: an operator setting base_url = http://localhost:1234 on astrid-capsule-openai-compat for a gateway-minted bearer still saw their LLM request hit api.openai.com (the manifest default). The dispatcher now loads <home>/.config/env/<capsule>.env.json into a new HostState::invocation_env_overlay whenever an interceptor is dispatched under a non-load-time principal (and the recv-context installer mirrors the load on every fresh inbound principal in a run-loop subscription); get_config checks the overlay first, falls through to self.config on miss, so the manifest default still wins for keys the operator hasn't overridden. Three regression tests pin overlay-wins-over-default, fall-through-on-miss, and absent-overlay-falls-back-to-default. Defensive size cap of 1 MiB on the env JSON read keeps a misconfigured file from blocking every dispatch on a slow disk read.
- Caller-context principal preserved across empty inner recvs. Capsules that follow the "receive an event, fan out to plugin hooks, drain responses, finish" pattern (prompt-builder, registry — any
#[astrid::run] capsule that nests ipc::recv inside its event-handling code) were silently flipping every post-hook publish to the capsule's load-time principal (default) under any non-default caller. The host's ipc::recv / ipc::poll paths called clear_recv_invocation_context on empty drains, wiping the caller_context that recv had installed when the original event arrived — so when the fan-out timed out and the run loop continued with its follow-up publishes (e.g. session.v1.request.get_messages, prompt_builder.v1.response.assemble), each one stamped default and the kernel's mixed-principal recv-batch security gate truncated downstream consumers. End-to-end effect: every gateway-minted bearer (which always yields a new non-default principal) hit a ready SSE event on POST /api/agent/prompt and then nothing — no delta, no response, no LM Studio traffic at all. The fix has two parts. (a) Empty recv/poll drains no longer touch caller_context — they only update it when a new message arrives via install_recv_invocation_context, so a follow-up publish between recvs continues to stamp the most recently received message's principal. (b) A new interceptor_active flag on HostState shorts the recv-driven install path while an interceptor (#[astrid::interceptor]) is dispatching — the interceptor's caller is owned by WasmEngine::invoke_interceptor, not by recv, so a nested recv inside the handler can no longer rewrite the outer caller. Three unit tests pin the invariants. The now-redundant clear_recv_invocation_context is removed.
- Event-bus subscribers yield more cooperatively under a storm.
EventReceiver::recv filters by topic, and broadcast::recv().await returns buffered items without yielding — so a filtered subscriber draining a backlog of non-matching events could hold its worker for up to 100 synchronous iterations, and on Lagged (the broadcast buffer overran the receiver — a storm) it caught up with no yield at all. The non-matching drain now yields every 32 events (named YIELD_AFTER_SKIPPED), and the Lagged arm yields before catching up. Dampens worker monopolization during an event storm; the storm's root cause is surfaced separately by the bus-activity diagnostics. Closes #805.
- Connection topic aligned to the WIT contract (
client.v1.connect). The connection-tracker work below (and the capsule-cli proxy) used client.v1.connected (past tense), but the WIT client interface (astrid-bus:client@1.0.0, wit/interfaces/client.wit) specifies that uplinks publish client.v1.connect on attach. A WIT-conforming uplink publishing the contract topic would not have been counted. The tracker now matches client.v1.connect; the client.v1.disconnect side already matched. Companion: capsule-cli aligns its publish. Closes #793.
- Connection tracker now recognises
client.v1.* topics, not just typed payloads. The kernel's connection tracker matched only the typed IpcPayload::Connect / Disconnect variants — which no capsule can ever produce, because the SDK publish surface is JSON-only (publish / publish_json / publish_as / publish_json_as) and never exposes those variants. So uplink capsules (the CLI proxy) could not populate active_connections, leaving total_connection_count() structurally pinned at zero: the ephemeral idle-shutdown gate and astrid who saw no connections regardless of reality, and the typed Disconnect the CLI client sends over the socket was flattened to RawJson by the proxy and lost. The tracker now classifies connection lifecycle by topic (client.v1.connect / client.v1.disconnect) in addition to the typed payload, via a pure, unit-tested connection_signal() helper; native producers' typed payloads continue to work. Stale comments that described the non-existent wiring are corrected. The capsule-cli companion has the proxy publish those topics carrying the authenticated principal, with stream-close as the authoritative disconnect. Closes #788.
- Daemon singleton guard: real
flock lock + fail-closed socket-path resolution. Two idle daemons were observed running at once. The guard was advisory-only: kernel_socket_path() silently fell back to /tmp/.astrid/run/system.sock when AstridHome::resolve() failed, so two processes with divergent env bound different sockets and coexisted (split-brain), and prepare_socket_path had a connect-probe → bind TOCTOU window with no lock. The kernel now (1) resolves the bind path strictly — a daemon whose ASTRID_HOME can't be resolved refuses to boot rather than using /tmp (fail-closed, matching generate_session_token); and (2) holds an exclusive non-blocking advisory lock (std::fs::File::try_lock) on <run>/system.lock for the process lifetime, so a second daemon fails fast and exits. The OS releases the lock on crash, so a restart is never wedged. Behaviour change: unresolvable ASTRID_HOME is now a fatal boot error. Closes #790.
- cors_allow_origins is actually wired into the router now. The gateway shipped in v0.7.0 with a cors_allow_origins: Vec<String> field that the router never consumed — tower-http::CorsLayer was on the dep list, but no layer was applied. An operator setting the allowlist saw nothing happen at runtime; browsers fell back to same-origin (which was the correct secure default but not the configured one). routes::build now applies a CorsLayer when the allowlist is non-empty: Access-Control-Allow-Origin (per-request match), …-Allow-Methods (GET/POST/PUT/PATCH/DELETE/OPTIONS), …-Allow-Headers (authorization/content-type/accept), Vary: Origin, and a 1-hour preflight cache. Empty allowlist stays no-CORS (browsers refuse cross-origin) so single-tenant deployments don't grow a Vary header. New GatewayConfig::validate rejects malformed origins (scheme other than http/https, trailing slash, path/query/fragment, unparseable strings, embedded userinfo that browsers strip, and raw IDNs that wouldn't match the Punycode form browsers send) at daemon boot so misconfig fails fast instead of silently no-op-ing.
Five new end-to-end tests in crates/astrid-gateway/tests/cors.rs cover preflight accept/reject, actual-request ACAO, empty-allowlist secure default, and per-origin echo for multi-origin allowlists. Closes #771.
- Symlink-escape defence in
copy_capsule_dir (gateway/CLI install path). Previously the install copier dereferenced symlinks via fs::metadata(). A malicious capsule tree could ship evil -> /etc/shadow and have the host secret copied bytewise into the per-capsule directory (which the capsule's WASM sandbox could then read via home://), and a directory symlink would either infinite-loop on an ancestor target or balloon the copy across an unrelated tree. The copier now canonicalizes the source root once at entry, threads it through recursion, refuses directory symlinks outright (npm only produces file symlinks under node_modules/.bin/, so no real use case is lost), and validates that any file symlink resolves inside the canonical root via Path::starts_with. Two new tests pin both threats: copy_capsule_dir_refuses_file_symlink_pointing_outside_root and copy_capsule_dir_refuses_directory_symlink. (#756, Gemini r2)
- PID-only temp filenames in three install paths were race-prone under sibling tokio tasks.
astrid-capsule-install::wasm::write_atomic, astrid-capsule-install::wit::write_atomic, and astrid-gateway::routes::env::write_env_string all used std::process::id() as the temp-file disambiguator. Sibling tasks in the same daemon process share that PID, so two concurrent installs of different content-addressed WASM blobs (or two concurrent env writes from different dashboards) could collide on the same temp path and clobber each other before rename. Each call now uses uuid::Uuid::new_v4().simple() so the temp path is unique even across in-process concurrency. (#756, Gemini r2)
- POST /api/capsules/{id}/env/{field} was reading the manifest from the wrong directory. The handler looked up Capsule.toml under home.root().join("capsules").join(...) — a path that doesn't exist in the FHS layout (installed manifests live under each principal's home at principal_home(p).capsules_dir()). Every env-write request fell through to a "manifest not found" 500 against a real install. Fixed to use the principal-scoped path resolver. (#756, Gemini r2)
- write_env was doing synchronous file I/O on the tokio worker thread. At the gateway's measured 5,400 RPS read ceiling, a single slow fsync on the env path would have stalled every other in-flight HTTP request. The secret-store write, the text/select write, and the array append now run inside tokio::task::spawn_blocking so the worker stays free. (#756, Gemini r2)
- Audit-stream cap probe used the socket-based
AdminClient instead of the bus-direct path. caller_holds on the GET /api/events handler dialled the daemon over the Unix socket through astrid-capsule-cli to ask whether the caller held audit:read_all, adding the ~50ms proxy handshake to every SSE open — exactly the latency the rest of the gateway routes were rewritten to skip. Now uses state.admin_client(principal) like every other route, so SSE first-byte time is microseconds. (#756, Gemini r2)
- POST /api/auth/redeem rate limiter was per-proxy, not per-client, behind a reverse proxy. The redeem path keyed RateLimiter on ConnectInfo<SocketAddr>::ip(), which is the proxy's IP under a typical nginx/Caddy/cloud-LB deployment — one abusive client would lock out every legitimate user globally. New GatewayConfig.trust_forwarded_from: Vec<IpAddr> lists the reverse-proxy IPs the gateway trusts; when the immediate peer is on the list, the limiter resolves the real client from X-Forwarded-For (first hop) then X-Real-IP, falling back to peer. Empty list = no forwarded-header trust (peer IP is used directly), preserving the previous behaviour for direct-internet deployments. Operators must set this when the gateway is behind a proxy; the docstring on the config field calls this out explicitly. (#756, Gemini r2)
- cargo publish -p astrid escaped the published tarball via a workspace-rooted include_str! path. astrid setup embedded the bundled AppArmor profile with include_str!("../../../../dist/apparmor/astrid") — a path that resolves outside the crate during cargo publish, so packaging the CLI failed the verifier compile (same class as the wit-staging / chrono publish fixes folded into 0.7.0). The profile moved to crates/astrid-cli/apparmor/astrid (content unchanged) and the include is now crate-relative; the workspace-rooted dist/ directory is gone. Closes #765.
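The trusted-proxy resolution described in the POST /api/auth/redeem item can be sketched as below. This is an illustration only, not the gateway's code: the function name resolve_client_ip_sketch, its signature, and the header values passed as plain Option<&str> are assumptions, and `trusted` stands in for the trust_forwarded_from list.

```rust
use std::net::IpAddr;

/// Hypothetical helper: choose the IP a per-IP rate limiter should key on.
/// `trusted` plays the role of `trust_forwarded_from`; the two header
/// arguments are the raw header values, if the request carried them.
fn resolve_client_ip_sketch(
    peer: IpAddr,
    trusted: &[IpAddr],
    x_forwarded_for: Option<&str>,
    x_real_ip: Option<&str>,
) -> IpAddr {
    // Forwarded headers are believed only when the immediate peer is a
    // listed reverse proxy. An empty list therefore always yields the peer.
    if !trusted.contains(&peer) {
        return peer;
    }
    // First hop of X-Forwarded-For, then X-Real-IP, then the peer itself.
    x_forwarded_for
        .and_then(|v| v.split(',').next())
        .and_then(|s| s.trim().parse::<IpAddr>().ok())
        .or_else(|| x_real_ip.and_then(|s| s.trim().parse::<IpAddr>().ok()))
        .unwrap_or(peer)
}
```

In this sketch an unparseable header value is ignored and the lookup falls through to the next source. Whether the real limiter treats malformed headers the same way is not stated in these notes.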
Security
- Capsule audit-feed subscriptions are now scoped to the subscriber's own principal by default. A capsule that declared
ipc_subscribe = ["astrid.v1.audit.entry"] (or any astrid.v1.* superset) in its manifest previously received every principal's audit entries — the cross-principal firehose was the default, gated by no capability and guarded by no reserved namespace (check_subscribe_acl matched topics syntactically, and the per-principal bus demux is fairness-only, not access control). An audit-covering subscription is now self-scoped at enqueue to the capsule's load-time owner principal, so a foreign principal's entries never enter the route's byte budget and a noisy co-principal can never head-evict the owner's own entries. The catalogued audit:read_all capability — resolved against the owner's profile and the live group config, never the self-declared manifest, so a capsule cannot grant itself the firehose — lifts the scope back to the full firehose, matching the gateway SSE model. Unscoped subscriptions are byte-identical to before; wildcard supersets are detected via the route-layer topic matcher. No WIT change, no new capability. Refs #850.
- Persistent host-process spawning now requires an explicit operator
allow_persistent sub-grant. astrid:process.spawn-persistent was gated only on the host_process capability plus an authenticated principal, even though the WIT documents it as additionally requiring an operator allow_persistent opt-in — the sub-grant did not exist, so the gate the contract promised was never enforced. CapabilitiesDef gains an allow_persistent manifest field (bool, #[serde(default)], fail-closed): host_process alone grants the ephemeral spawn / spawn-background tiers, while a persistent child — which outlives the pooled instance and, on macOS, has no die-with-parent — additionally requires this opt-in. Without it, spawn-persistent returns capability-denied (audited); the ephemeral tiers are unaffected. The grant is surfaced by enumerate-capabilities through the serde-derived held_names / has, so a capsule can introspect it. No WIT change — the contract already specified the gate; this makes the host conform. Capsules that spawn persistent processes (e.g. astrid-capsule-shell) must add allow_persistent to their manifest [capabilities]. Refs #872.
- macOS native-subprocess sandbox is no longer silently disabled on macOS 15+ (Darwin >= 24).
SandboxCommand::wrap — the macOS arm of the host_process spawn path — carried a version guard that returned the spawned subprocess completely unsandboxed on every current Mac (Darwin 24 = macOS 15 Sequoia, Darwin 25 = macOS 26), logging only a tracing::warn!. The stated premise (that sandbox-exec is deprecated and therefore unusable on macOS 15+) was wrong: sandbox-exec is deprecated but still enforces on current macOS. The SIGABRT that originally motivated the guard (introduced in #603) was a fail-closed profile defect, not an OS incompatibility — wrap's inline profile was a stale duplicate that omitted the (allow file-read* (literal "/")) rule a dynamically-linked binary (e.g. node) needs to stat the filesystem root at startup, so Seatbelt correctly aborted the process and the guard turned that fail-closed signal into a silent fail-open passthrough. On host_process-capable capsules (astrid-capsule-shell) every native subprocess then inherited the host user's full filesystem reach (~/.ssh, dotfiles, arbitrary writes). The version guard and the stale inline profile are both removed; wrap's macOS arm now routes through build_seatbelt_prefix, the single profile that already carries (literal "/") + (allow mach*) (added for Node.js in #534) and that the MCP spawn path has been running on macOS 15+ all along — so both spawn paths share one profile and containment no longer depends on which path a capsule happens to use. If sandbox-exec genuinely cannot run, the subprocess spawn now fails rather than launching unsandboxed. A macOS-only test spawns a real node under the generated profile and asserts it runs, with a contrast proving the same profile without (literal "/") fails closed. Closes #855.
- POST /api/auth/pair-device/redeem is now per-IP rate-limited.
The device-pairing redeem route is unauthenticated and public (the pair-token is the auth), but unlike its sibling POST /api/auth/redeem it had no brute-force fence in front of the kernel's constant-time pair-token scan — an attacker could enumerate pair-tokens as fast as the network allowed. It now runs the same per-IP throttle as invite-redeem, deliberately sharing the one redeem_limiter so the per-IP budget is spent across both unauthenticated redeem routes and cannot be dodged by alternating between them. X-Forwarded-For is honoured only behind a configured trusted proxy (resolve_client_ip), and the 429 (with retry_after_secs) is now part of the route's documented OpenAPI contract. A new router test pre-seeds the shared limiter and asserts a pair-redeem from that IP is rejected before it ever reaches the daemon. Surfaced by the CLI/API sysadmin parity audit.
- Failed token redeems now record a failure-outcome audit row.
InviteRedeem and PairDeviceRedeem bypass the capability preamble (the token is the auth, since the redeemer's principal does not exist yet), so the kernel admin dispatcher special-cased them — but it stamped the audit row AuthorizationProof::System + AuditOutcome::success before dispatching the handler. A redeem the handler then rejected (invalid / expired / consumed / forged token, or an internal store error) still left a success row, so brute-force / forged-token attempts were invisible in the audit log itself and surfaced only as a tracing warn!(security_event=true). The dispatcher now records the audit row after dispatch with the real outcome — a rejected token writes Denied + Failure(reason), a mint writes System + Success — which is the "allow OR deny" the surrounding comment always promised. The decision is a small pure helper (redeem_audit_proof) with a unit test pinning the failure-on-Error mapping. A security team relying on audit rows (not tracing) can now detect token brute-forcing. No WIT/contract change. Surfaced by the CLI/API sysadmin parity audit.
- self:agent:list no longer leaks the full principal roster. AgentList always resolves to AuthorityScope::Self_, so the required capability is self:agent:list — which the agent builtin holds via self:*. That lowering is deliberate (it lets an agent resolve its own group-inherited caps without being handed the admin-tier agent:list), but the kernel handler returned every principal's profile regardless of the caller — an information-disclosure: any ordinary agent could enumerate every other principal's id, groups, grants and revokes via GET /api/sys/principals.
The gateway already documented the intended behaviour ("operators with agent:list see everyone; an agent with self:agent:list sees only themselves; the kernel filters server-side") — the kernel just never implemented it. agent_list now filters to the caller's own row unless the caller also holds the global agent:list capability; self:* does not match agent:list, so self-scoped callers are correctly narrowed. Both halves are required for the full roster — the AgentList preamble independently requires self:agent:list, which a bare global agent:list grant does not satisfy (the grammar does not make a global cap imply its self-scoped form) — so in practice only the admin group's * (matching both) sees everyone. Fail-secure: an unresolvable caller profile yields self-only. The CLI is unaffected (its default principal is admin-seeded). GroupList is intentionally left a full read — it is system config, not per-principal data, and an agent needs it to resolve group→capability inheritance. Behavioural change for API callers holding only self:agent:list. Surfaced by the CLI/API sysadmin parity audit.
- Host
random-bytes now sources from the OS CSPRNG. The astrid:sys/host.random-bytes implementation filled buffers from rand::thread_rng() — a userspace ChaCha CSPRNG seeded from OS entropy — while the WIT contract and the Rust/JS SDKs all advertise the bytes as coming from "the host's OS-level CSPRNG". It now fills from rand::rngs::OsRng so the implementation matches the documented guarantee, and uses try_fill_bytes (not fill_bytes) so a practically-impossible entropy-source failure fails secure as error-code::unknown(...) instead of panicking inside the host call. No WIT/contract change required. Closes #799.
- Security-response-headers stack applied to every gateway response. The gateway returns JSON, SSE, plain text, and Prometheus — never HTML — but every response now carries
X-Content-Type-Options: nosniff, X-Frame-Options: DENY, Referrer-Policy: no-referrer, and Content-Security-Policy: default-src 'none'; frame-ancestors 'none'. The headers are if_not_present so a handler that intentionally sets one wins; defaults fill in everywhere else. Defends against MIME-confusion (nosniff), clickjacking against any HTML that lands in the surface later (DENY + CSP frame-ancestors), and Referer leakage of principal-id-bearing URLs to third-party origins (no-referrer). Two new e2e tests pin headers on success and 401 paths. Closes #771.
- Bearer revocation on principal delete. Session bearers shipped in v0.7.0 as 8-hour ed25519-signed tokens with no revocation mechanism — an admin who deleted a compromised principal still had to wait out the bearer lifetime (or rotate the gateway signing key, which logs out every other user too). The gateway now subscribes to the kernel's audit-event firehose (
astrid.v1.audit.entry); when an AgentDelete admin op records a success outcome, the target principal is added to a per-gateway revoked_at map (persisted atomically to $ASTRID_HOME/etc/gateway-revocations.json) and every bearer with an iat at-or-before the recorded epoch is rejected by verify_bearer. Survives daemon restart. Resilient to clock skew between the kernel and the gateway because the timestamp baked into the revocation entry comes from the kernel's own audit envelope, not gateway-local wall clock. Closes #772.
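The revocation rule in the bearer item above (reject any bearer whose iat is at or before the recorded epoch) can be sketched as below. This is a minimal in-memory illustration, not the gateway's implementation: the Revocations type, its method names, and the use of u64 epoch seconds are assumptions. Persistence to gateway-revocations.json and signature verification are left out.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory view of the revocation map:
/// principal id -> epoch seconds taken from the kernel's audit envelope.
struct Revocations {
    revoked_at: HashMap<String, u64>,
}

impl Revocations {
    fn new() -> Self {
        Revocations { revoked_at: HashMap::new() }
    }

    /// Record a revocation using the audit envelope's timestamp.
    /// Assumption: if a principal is revoked more than once, keep the
    /// latest epoch so no older bearer becomes valid again.
    fn revoke(&mut self, principal: &str, audit_epoch: u64) {
        let entry = self
            .revoked_at
            .entry(principal.to_string())
            .or_insert(audit_epoch);
        *entry = (*entry).max(audit_epoch);
    }

    /// A bearer is rejected when its `iat` is at or before the recorded
    /// epoch. Principals with no entry are never rejected here.
    fn is_revoked(&self, principal: &str, iat: u64) -> bool {
        self.revoked_at
            .get(principal)
            .map_or(false, |&at| iat <= at)
    }
}
```

Because the comparison uses the kernel-supplied epoch on one side and the bearer's own iat on the other, the check itself never reads the gateway's wall clock, which is the clock-skew property the item describes.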
Install
From source (requires Rust 1.95+):
cargo install astrid
Pre-built binaries:
Download the archive for your platform, extract, and add to PATH:
tar xzf astrid-*-$(uname -m)-*.tar.gz
sudo mv astrid-*/astrid astrid-*/astrid-daemon astrid-*/astrid-build astrid-*/astrid-emit /usr/local/bin/
Then run astrid init to set up capsules.
With many thanks from the following Astrinauts 🚀
- Joshua J. Bouw