v0.8.0

@github-actions github-actions released this 11 Jun 14:44
· 29 commits to main since this release
0360425

A large release — roughly fifty PRs since v0.7.0 — landing three bodies of work plus a security-hardening pass:

  • HTTP admin gateway. The new astrid-gateway crate fronts the kernel admin/request IPC surfaces over HTTP: principals, caps, quotas, groups, invites, per-principal capsule env, audit SSE + historical audit queries, agent-prompt SSE, OpenAPI spec emission, native rustls TLS, CORS, Prometheus metrics — with a bus-direct admin path (285× admin throughput) and invite/keypair CLI verbs for operator parity.
  • Runtime concurrency overhaul. The orchestration concurrency cliff is closed end-to-end: per-(capsule, topic, principal) routed IPC with DRR fairness and byte budgets, async Wasmtime (guest calls no longer pin tokio workers), dynamic host-sized per-capsule instance pools, split blocking/async-I/O host-call semaphores, and per-principal CPU-fuel + peak-memory ledgers with rate enforcement and usage reporting.
  • Host process + introspection surface. The astrid:process@1.0.0 persistent-process tier is implemented (background children that outlive the pooled WASM instance, id-keyed reattach/stdin/log-cursor ops, operator allow_persistent opt-in), capability introspection lands (enumerate-capabilities + a completed check-capsule-capability), astrid mcp serve exposes capsule tools + capability consent to any MCP client, and astrid-emit gives agent hook processes an agent-agnostic stdio→bus pipe.
  • Security. The macOS 15+ native-subprocess sandbox is no longer silently disabled; capsule audit-feed subscriptions are principal-scoped by default; failed token redeems now audit as failures; self:agent:list no longer leaks the principal roster; both unauthenticated redeem routes share per-IP rate limiting; bearers gain revocation-on-principal-delete (wire format v2 — in-flight v0.7.0 bearers must re-redeem).

Breaking changes to note: Capsule.toml moves to [publish] / [subscribe] tables as the only IPC-intent surface ([[interceptor]], the ipc_publish / ipc_subscribe arrays, and [[topic]] are removed), the bearer wire format is v2, MSRV is 1.95, and the astrid-openclaw build path is removed.

Breaking

  • Capsule.toml: the [[interceptor]] block is removed — interceptor bindings are now [subscribe] entries with a handler and an optional priority. A capsule used to bind an interceptor either via a [[interceptor]] block (event + action + optional priority) or a [subscribe] entry with a handler — the two overlapped, and only the legacy block could set a priority (the [subscribe] form hardcoded the default 100). [subscribe] now carries an optional priority (lower fires first; default 100; a priority on a handler-less ACL-only entry is rejected at parse time), making it the single interceptor-binding form, and [[interceptor]] is gone. A manifest still declaring [[interceptor]] no longer binds those handlers — the block is ignored. Migrate each block to a [subscribe] entry: "<event>" = { wit = "<typed-payload>", handler = "<action>", priority = <n> }. Note [subscribe] requires a typed wit payload reference that [[interceptor]] did not, so each migrated binding declares its payload type (typed IPC everywhere). Closes #858.
  • Capsule.toml: the capabilities.ipc_publish / ipc_subscribe string arrays are removed — [publish] / [subscribe] tables are the only IPC ACL. The tables already superseded the arrays (their keys are the ACL when present); the arrays are now gone, so [publish] / [subscribe] is the single way to declare what a capsule may publish or subscribe to (an empty table = may not publish/subscribe, fail-closed). A capsule still declaring ipc_publish = [...] / ipc_subscribe = [...] under [capabilities] no longer grants any IPC ACL — the unknown keys are ignored. Migrate each pattern to a table entry: a publish pattern becomes [publish] "<topic>" = { wit = "<typed-payload>" }; an ACL-only subscribe pattern (no handler, e.g. an uplink proxy) becomes [subscribe] "<topic>" = { wit = "opaque" }. Closes #864.
  • Capsule.toml: the [[topic]] table is removed; topic schemas are sourced from the [publish] / [subscribe] wit refs. The [[topic]] block let a capsule self-describe a topic with an inline JSON Schema file, baked into meta.json at install. That self-description (flaky to keep in sync) is dropped in favour of the typed wit payload ref already on every [publish] / [subscribe] entry: the A2UI schema catalog (SchemaCatalog) now registers each topic from those tables and records its wit ref for the A2UI bridge to resolve to a schema + description via the WIT registry. Removed: the [[topic]] table (TopicDef / TopicDirection), the install-time bake_topics path, the meta.topics / BakedTopic install metadata, and the astrid capsule list baked-topic display. No capsule declared [[topic]], so there is nothing to migrate. Closes #865.
  • Bearer wire format bumped to v2 (4 segments, was 3). The token now carries an iat (issued-at epoch) claim alongside principal and exp. Required by the new revocation machinery: without iat the only revocation semantics available would be "blanket reject forever," which surprises an operator who later re-creates a principal with the same id. Dashboard sessions issued by the v0.7.0 gateway no longer verify after upgrade — clients must re-redeem. CLI astrid invite redeem, the existing pair-device flow, and every browser session that goes through /api/auth/redeem mint the new shape automatically; only pre-existing in-flight bearers are affected. Format spec: b64url(principal) "." b64url(iat) "." b64url(exp) "." hex(sig), with sig over principal:iat:exp. Closes #772.
  • MSRV bumped to 1.95.0. surrealdb 3.0.0-beta.3's kv-mem feature pulled in surrealmx v0.21.0 → ferntree v0.7.0, which uses std::hint::cold_path, stabilised in Rust 1.95. Upstream declared no rust-version, so cargo's resolver silently picks 0.7 even though the workspace MSRV says 1.94. Bumping our MSRV is the smallest fix that keeps CI deterministic without committing Cargo.lock (which is intentionally gitignored). Affects cargo install astrid consumers: installers on 1.94 will see a clear "requires rustc 1.95" error rather than the cryptic cold_path failure.
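
To make the v2 bearer layout from the wire-format entry above concrete, here is a minimal parsing sketch. It follows the stated spec (four dot-separated segments, with the signature covering principal:iat:exp), but several details are assumptions and not confirmed by these notes: that iat and exp are decimal ASCII before base64url encoding, that the base64url is unpadded, and the names BearerV2, parse_bearer_v2, and signed_message, which are hypothetical. Signature verification itself is omitted.

```rust
/// Claims carried by a v2 bearer (hypothetical type; field encodings are assumptions).
#[derive(Debug, PartialEq)]
pub struct BearerV2 {
    pub principal: String,
    pub iat: u64,
    pub exp: u64,
    pub sig_hex: String,
}

/// Decode unpadded base64url; returns None on any character outside the alphabet.
/// Trailing partial bits are dropped without validation (good enough for a sketch).
fn b64url_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len() * 3 / 4);
    let mut buf: u32 = 0;
    let mut bits: u32 = 0;
    for c in input.bytes() {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'-' => 62,
            b'_' => 63,
            _ => return None,
        };
        buf = (buf << 6) | u32::from(v);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buf >> bits) as u8);
            buf &= (1 << bits) - 1; // keep only the bits not yet emitted
        }
    }
    Some(out)
}

/// Split a token into its four v2 segments and decode the claims.
/// A three-segment token is reported as a v1 bearer that must re-redeem.
pub fn parse_bearer_v2(token: &str) -> Result<BearerV2, &'static str> {
    let parts: Vec<&str> = token.split('.').collect();
    match parts.len() {
        4 => {}
        3 => return Err("v1 bearer (3 segments): re-redeem required"),
        _ => return Err("malformed bearer"),
    }
    let text = |seg: &str| -> Result<String, &'static str> {
        let bytes = b64url_decode(seg).ok_or("invalid base64url")?;
        String::from_utf8(bytes).map_err(|_| "claim is not UTF-8")
    };
    let principal = text(parts[0])?;
    let iat = text(parts[1])?.parse::<u64>().map_err(|_| "iat is not an integer")?;
    let exp = text(parts[2])?.parse::<u64>().map_err(|_| "exp is not an integer")?;
    let sig_hex = parts[3].to_string();
    if sig_hex.is_empty() || !sig_hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err("signature is not hex");
    }
    Ok(BearerV2 { principal, iat, exp, sig_hex })
}

/// The byte string the signature is computed over, per the format spec.
pub fn signed_message(b: &BearerV2) -> String {
    format!("{}:{}:{}", b.principal, b.iat, b.exp)
}
```

A client that treats the bearer as opaque never needs this; the sketch is only meant to show why a three-segment v0.7.0 token cannot be mistaken for a v2 one.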

Added

  • astrid mcp serve — an MCP server surface exposing Astrid capsule tools + capability consent to any MCP client. Astrid already had an MCP client (astrid-mcp manages external tool servers) and a tool broker (sage-mcp discovers capsule tools and shapes MCP descriptors), but no way for an MCP client — the managed claude -p, or Codex/Gemini/any client — to consume Astrid's capsule tools over the standard MCP wire protocol. astrid mcp serve (a subcommand of the astrid CLI, not a separate binary) is a thin rmcp stdio ServerHandler that delegates to the existing sage-mcp broker over the already-allowlisted astrid.v1.request.mcp.* / astrid.v1.response.<req_id> topics — the crypto/audit/enforcement stay in Astrid; the shim only terminates the MCP wire protocol. get_info advertises tools + tools.list_changed; list_tools / call_tool publish on an uplink and await the single-segment reply (one-response invariant, so the shim never hangs); the calling principal is stamped from --principal (default = active/default principal). Capability approvals are relayed into the client's own UI via MCP elicitation (ctx.peer.elicit::<ApprovalChoice>{ApproveOnce,ApproveSession,ApproveAlways,Deny}, gated on the client advertising the elicitation capability; decline/cancel → Deny; never elicits secrets). Hot-reload subscribes astrid.v1.capsules_loaded, re-enumerates, diffs, and peer.notify_tool_list_changed() (no-op notifications suppressed). mcp serve owns stdout for the JSON-RPC stream, so logging is forced off stdout (to file, else stderr) for this command regardless of operator config, so a stray frame cannot corrupt the protocol stream. This is the foundation for registering a named sage server in the managed claude -p (so mcp__sage__* resolves natively and mcp_tool hooks / channels can bind by name) and for the agent-neutral backplane. rmcp 0.15 → 1.7.0 workspace-wide; the existing astrid-mcp client (ClientHandler / RoleClient) is migrated to the 1.7 API. 
No kernel / WIT / allowlist change. Closes #879.

  • astrid:process@1.0.0 persistent-process tier — implemented. A capsule can now spawn a background child that outlives the pooled, stateless WASM instance that started it. Previously an ephemeral process-handle is reaped when its instance resets on return to the dynamic pool, so a process started in one tool invocation could not survive to the next — the split spawn → read → stop pattern was impossible. A new host-owned PersistentProcessRegistry — cloned into every pooled HostState exactly like the cancellation ProcessTracker, so a process-id survives instance churn — owns the child (spawned on the daemon runtime under the same bwrap/Seatbelt sandbox as the ephemeral tier), its per-stream log rings, and its stdin pipe. Implemented: spawn-persistent (returns a 256-bit host-minted CSPRNG process-id, lowercase base32 so it doubles as an IPC topic suffix; the registry stores only a keyed BLAKE3 hash, never the raw token), status / status-many / list-processes, read-logs (drain) + read-since (non-draining, cursor-addressed, byte-faithful list<u8>), signal (incl. stop/cont), bounded wait, stop (SIGTERM→grace→SIGKILL, frees the slot), release-process, and write-stdin / close-stdin (via keep-stdin-open capture). Every id-keyed call re-resolves the live (principal, capsule) and checks it against the recorded creator, so a leaked id is inert across the principal/capsule boundary — unknown / wrong-owner / wrong-capsule / reaped all collapse to no-such-process with no oracle; spawn-persistent refuses the owner-fallback principal (persist-unsupported) so tenants never share a default namespace. Lifecycle is enforced by a per-capsule reaper task: per-principal concurrent + retained-id caps, idle / max-lifetime / exit-retention TTLs (guest values clamped DOWN to host ceilings), and a kill-all on capsule unload / daemon graceful shutdown. 
Works on Linux and macOS (the macOS caveat — a daemon hard crash, not a graceful shutdown, can orphan a still-sandboxed child because Seatbelt has no die-with-parent — is a weaker cleanup guarantee, not a containment gap). Still deferred, honestly: attach (the resource-handle composition sugar; the id-keyed ops are its documented attach(id)?.method() equivalent), watch / unwatch (host-published lifecycle events — an OPEN publish-authority question in RFC host_abi, with status + bounded wait polling as the working alternative), and the WIT's own (NOT YET …) items (resource-limit enforcement, cpu-ms / mem-bytes-peak, instance-local pollables). Contract: unicity-astrid/wit#12. Design: unicity-astrid/rfcs#22. Closes #866.

  • astrid:sys@1.0.0 capability introspection — enumerate-capabilities implemented, check-capsule-capability completed. A capsule can now read its OWN held capability names via the new infallible enumerate-capabilities() -> list<string>: the capability categories declared in its [capabilities] manifest block (host_process, net_connect, fs_read, …) — the names, not the scoped arguments within them (allowlists, host:port, paths). This lets a reusable supervisor binary deployed under different manifests ground its behaviour in what it can actually do instead of hard-coding it, and lets any capsule avoid code-vs-manifest drift. Because the WIT is infallible (a bare list<string>, no result), it reads an owned, lock-free snapshot taken once at load (CapabilitiesDef::held_names) and stored on HostState rather than the capsule_registry — there is no registry-unavailable failure mode to surface, and an empty list is the valid "no capabilities" answer; capsule capabilities are fixed at load (the grant/revoke model is principal-scoped, a separate axis), so the snapshot is correct for the capsule's whole lifetime and across pooled instances. In the same change, check-capsule-capability — previously a stub that answered only allow_prompt_injection and returned false for every other capability — is completed onto the same namespace: both host fns (held_names() the list, has(name) the per-name dual) are DERIVED from the struct's serialized fields rather than a hand-maintained list, so a capability added to CapabilitiesDef flows through both automatically and the two cannot drift — n appears in held_names() iff has(n), and unknown names fail closed. Both are ungated, read-only, and audit-not-recorded: capability posture is structural metadata, not a secret (enforce-don't-conceal) — knowing a capability conveys no ability to use it. Lifecycle hooks and the astrid-hooks host, which run outside the capsule manifest/security-gate lifecycle, report an empty set (fail-closed). 
Contract: unicity-astrid/wit#13. Closes #868.

  • Per-principal peak memory in usage reporting. ResourceUsage gains memory_bytes_peak_total: Option<u64> — the cross-capsule high-water linear memory a principal has driven (max across every capsule it invokes), read from the shared memory ledger and surfaced by the admin UsageGet, GET /api/sys/principals/{id}/usage (a new field on the OpenAPI ResourceUsageView), and astrid quota show (a new "memory peak" row). This fills the memory side of per-principal usage, which previously reported only the per-instance ceiling. Under pooled, shared Stores a live "current" total is not cleanly attributable, so the peak is the reported signal — the principal that grows a Store owns the peak; memory_bytes_current_total stays None. Refs #816.

  • Operator overrides for capsule runtime sizing (config + env + CLI). New [capsule] config section with host_blocking_concurrency, host_io_concurrency, and instance_pool_size (all optional; unset → the host-derived default), the matching ASTRID_CAPSULE_HOST_BLOCKING_CONCURRENCY / ASTRID_CAPSULE_HOST_IO_CONCURRENCY / ASTRID_CAPSULE_INSTANCE_POOL_SIZE env vars, and astrid-daemon --host-blocking-concurrency / --host-io-concurrency / --instance-pool-size flags. Precedence is CLI flag > config file > env > host-derived default; the daemon resolves the values once at boot and the kernel forwards them, unmodified, to every capsule's WasmEngine (the same handle-plumbing shape as FuelLedger). A zero override is rejected at config-validation time (it would wedge a host-call class or leave a capsule with no instance to lease, rather than throttle). Refs #816.
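
The precedence rule in the entry above (CLI flag > config file > env > host-derived default, with zero rejected) can be sketched as a single resolution function. The function name resolve_override and its signature are hypothetical; the daemon's real plumbing is not shown in these notes.

```rust
/// Resolve one sizing knob (for example instance_pool_size) from its three
/// optional override sources, falling back to the host-derived default.
/// A zero from any source is rejected rather than silently ignored.
pub fn resolve_override(
    cli: Option<usize>,
    config: Option<usize>,
    env: Option<usize>,
    host_default: usize,
) -> Result<usize, String> {
    for (source, value) in [("cli", cli), ("config", config), ("env", env)] {
        if value == Some(0) {
            return Err(format!("{source} override must be non-zero"));
        }
    }
    // Option::or keeps the first Some, which encodes the precedence order.
    Ok(cli.or(config).or(env).unwrap_or(host_default))
}
```

For example, resolve_override(None, Some(4), Some(2), 16) yields Ok(4): the config file wins over the environment when no flag is given.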

  • Admin contract for per-principal resource-usage reporting. Adds AdminRequestKind::UsageGet { principal }, AdminResponseBody::Usage(ResourceUsage), and the ResourceUsage payload (cross-capsule CPU total + configured ceilings + an exempt flag), scoped exactly like QuotaGet (self:quota:get / quota:get) so a principal reads its own usage and only an admin reads another's. The quota/usage admin handlers move into a new admin/quota.rs submodule. Refs #820.

  • Per-principal CPU-rate budget is now enforced. A reactive 1-second windowed throttle (FuelRateLimiter) denies a principal's interceptor calls once it overruns its max_cpu_fuel_per_sec budget, until the window rolls — never permanently bricking it. Keyed on the invoking principal, sharded + per-principal-locked (no global lock), fail-secure (exemption fails closed, the window math fails open); admin / system:resources:unbounded / net_bind / uplink holders are exempt. Refs #819.
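
As an illustration of the reactive windowed throttle described above, the sketch below tracks fuel per principal in a fixed 1-second window and denies further calls once the budget is overrun, until the window rolls. It is a simplification under stated assumptions: a single unsharded map, caller-supplied millisecond timestamps, and no exemption logic. WindowedThrottle and its methods are hypothetical names, not the FuelRateLimiter API.

```rust
use std::collections::HashMap;

const WINDOW_MS: u64 = 1_000;

/// Per-principal (window_start_ms, fuel_used) pairs.
#[derive(Default)]
pub struct WindowedThrottle {
    windows: HashMap<String, (u64, u64)>,
}

impl WindowedThrottle {
    pub fn new() -> Self {
        Self::default()
    }

    /// Fetch the principal's window, starting a fresh one if the old one expired.
    fn current(&mut self, principal: &str, now_ms: u64) -> &mut (u64, u64) {
        let w = self.windows.entry(principal.to_string()).or_insert((now_ms, 0));
        if now_ms.saturating_sub(w.0) >= WINDOW_MS {
            *w = (now_ms, 0);
        }
        w
    }

    /// Record fuel consumed by a call that has already run (hence "reactive").
    pub fn charge(&mut self, principal: &str, now_ms: u64, fuel: u64) {
        let w = self.current(principal, now_ms);
        w.1 = w.1.saturating_add(fuel);
    }

    /// A call is allowed unless the principal has overrun its budget in this window.
    pub fn allows(&mut self, principal: &str, now_ms: u64, budget_per_sec: u64) -> bool {
        self.current(principal, now_ms).1 <= budget_per_sec
    }
}
```

Because the window resets on its own, an overrun only costs the principal the remainder of the current second, which matches the "never permanently bricking it" property.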

  • UsageGet now reports live per-principal usage. The stub is replaced with the real cross-capsule CPU total (read from the shared FuelLedger) and a real exempt flag computed with the same capability predicate enforcement uses, so displayed-exempt always matches enforced-exempt. Refs #820.

  • GET /api/sys/principals/{id}/usage. Exposes per-principal resource usage over HTTP, mirroring the quotas route's self/admin scope and bearer-verified caller. Also fixes the QuotasView OpenAPI mirror, which omitted max_cpu_fuel_per_sec. Refs #820.

  • astrid quota show now reports usage vs budget. Adds a usage-vs-budget section (consumed CPU, ceiling, exempt flag) and the previously-missing CPU-rate ceiling line. Refs #820.

  • astrid-emit — an agent-agnostic stdio→bus pipe. New companion binary (its own crates/astrid-emit crate, co-installed alongside astrid exactly like astrid-daemon / astrid-build) that any agent's hook process can shell out to. It reads stdin to EOF as a UTF-8 string, wraps it in a fixed six-field, sage-validated envelope (hook, payload, correlation_id (always null), principal_id, session_id, token), connects to the daemon as an uplink, and publishes the envelope via publish-as on the topic it was handed — accepted both positionally (astrid-emit sage.v1.hook.before_tool_call, what shipped sage writes into settings.local.json) and via --topic <name> (exactly one required). The hook field is derived as the topic's trailing dot-segment so it matches sage's validator, and payload is forwarded as a verbatim string (never base64) so canonical subscribers receive the original hook JSON. The three transport identifiers are read from the required ASTRID_PRINCIPAL_ID / ASTRID_SESSION_ID / ASTRID_HOOK_TOKEN environment variables (set on the agent child at spawn); any missing-or-empty value is a soft failure. The binary is deliberately tiny and carries zero agent-protocol knowledge — no hook-name map, no stdin parsing, no verdict shaping, no merge, no await/response, no fail-closed behaviour: sage is the trust anchor, validator, and republisher, and the canonical hook-name map lives there. astrid-emit always writes {"continue":true} to stdout (success and failure alike) and exits 0 on a successful publish or 1 on any failure (missing env, connect failure, send failure) — it never exits 2. A new astrid --emit-path discovery flag (handled before banner/config so it works on a half-configured host) prints the absolute path to the co-installed astrid-emit so hook-bridge installers can wire commands without guessing the install layout. Refs #814.
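
Three of the astrid-emit rules above are simple enough to sketch directly: the hook field is the topic's trailing dot-segment, a missing or empty transport variable is a failure, and the exit code is 0 or 1. The function names below are hypothetical and the sketch does no bus I/O.

```rust
/// Derive the envelope's `hook` field from the topic's trailing dot-segment.
/// Returns None when the last segment is empty (for example "a.b.").
pub fn hook_from_topic(topic: &str) -> Option<&str> {
    topic.rsplit('.').next().filter(|seg| !seg.is_empty())
}

/// Treat a missing or empty environment value as a failure, naming the variable.
pub fn require_non_empty(name: &str, value: Option<String>) -> Result<String, String> {
    match value {
        Some(v) if !v.is_empty() => Ok(v),
        _ => Err(format!("{name} is missing or empty")),
    }
}

/// Exit code per the notes: 0 on a successful publish, 1 on any failure.
pub fn exit_code(published: &Result<(), String>) -> i32 {
    if published.is_ok() { 0 } else { 1 }
}
```

In a real binary the value passed to require_non_empty would come from std::env::var(name).ok().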

  • Process-level + build-provenance metrics on the gateway /metrics endpoint. The Prometheus exposition now ships the standard process_* family (process_cpu_seconds_total, process_resident_memory_bytes, process_threads, process_open_fds, process_start_time_seconds) via the metrics-process collector, plus an astrid_build_info{version,git_sha,rustc} gauge for build-provenance joins. rate(process_cpu_seconds_total) is the direct, graphable detector for the idle 200–300% CPU class of incident that previously could only be inferred from top. The collector is pull-based — the /metrics handler calls collect() once per scrape, deliberately adding no background thread (you don't observe a background-CPU bug by spawning a polling thread to watch it). git_sha (short) and rustc are captured at compile time by a new crates/astrid-gateway/build.rs, each falling back to "unknown" so a source-tarball build still populates the metric; build constants are exposed on the unauthenticated endpoint deliberately, since they carry no principal, path, or secret. This is the foundation slice of the host-internal operational-metrics layer; the full design, catalogue, and phased roadmap — and its division of responsibility from the #705 capsule-telemetry layer — land in docs/metrics.md. Refs #791.

  • Event-bus + daemon saturation metrics. The runtime's missing saturation signal, recorded into the daemon-wide Prometheus recorder the gateway exports: astrid_bus_events_published_total{event_kind} (publish throughput, labelled by the bounded AstridEvent::event_type set — IPC traffic collapses to ipc), astrid_bus_receiver_lagged_total{subscriber} (events a subscriber dropped by falling behind — a non-zero rate() is the signature of a feedback storm), astrid_daemon_active_connections (gauge) + astrid_daemon_connections_opened_total/_closed_total, and astrid_daemon_background_ticks_total{loop} across the idle / capsule-health / react-watchdog / bus-monitor loops (a flat rate() is a parked loop, a runaway rate() is a spin loop — the direct signal for the idle 200–300% CPU class of incident, complementing #787's storm log and process_cpu_seconds_total). The subscriber label is a fixed, code-assigned set: the long-lived daemon consumers (kernel_router, admin_router, connection_tracker, capsule_dispatcher, bus_monitor, revocation_watcher) are tagged via the new EventBus::subscribe_as/subscribe_topic_as; everything else (incl. dynamic capsule-supplied topic subscriptions) collapses to untagged, so a buggy capsule can't inflate cardinality. Adds the metrics facade to astrid-events and astrid-kernel. The kernel-side uplink socket-accept counter is deferred to a follow-up (the accept happens inside a capsule via the astrid:net host shim, a separate crate). Resolves the astrid_bus_* vs astrid_router_* naming question (§7.9) in docs/metrics.md. Refs #791.

  • astrid build injects the getrandom custom-backend cfg for wasm32-unknown-unknown capsules. Every capsule on that target must compile with --cfg=getrandom_backend="custom" so uuid v4 / HashMap seeding link against astrid-sys's host-routed RNG (astrid:sys/host.random-bytes); getrandom 0.4 requires the cfg on the final binary crate, so a capsule whose .cargo/config.toml omitted it failed to build with a cryptic getrandom compile_error!. astrid-build now injects the cfg itself via a merged CARGO_ENCODED_RUSTFLAGS — because host ≠ wasm this is a cross-compile, so the flag reaches only the wasm artifacts, never host build scripts / proc-macros. The injection merges rather than clobbers — it concatenates inherited RUSTFLAGS/CARGO_ENCODED_RUSTFLAGS, the capsule's own config rustflags (array or string form), and the getrandom cfg, de-duplicating only the getrandom cfg so multi-token flags like -C opt-level=3 stay intact — so a capsule already carrying the flag is unaffected. This is a safety net for the canonical build tool; capsules still keep the flag in config because a plain cargo build / cargo test never runs through astrid-build. Closes #800.
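
The merge-rather-than-clobber behaviour described above can be sketched as follows. Cargo's CARGO_ENCODED_RUSTFLAGS separates arguments with the ASCII unit separator (0x1f), which is why multi-token flags survive intact. The function name merge_rustflags is hypothetical, and this sketch only de-duplicates the single-token spelling of the cfg; a two-token form (--cfg followed by the value) would need extra handling.

```rust
pub const GETRANDOM_CFG: &str = "--cfg=getrandom_backend=\"custom\"";

/// Concatenate inherited flags and the capsule's own rustflags, then append the
/// getrandom cfg exactly once. Other flags are passed through untouched, so a
/// pair like ["-C", "opt-level=3"] keeps its order and both tokens.
pub fn merge_rustflags(inherited: &[&str], capsule: &[&str]) -> String {
    let mut flags: Vec<&str> = inherited
        .iter()
        .chain(capsule.iter())
        .copied()
        .filter(|f| *f != GETRANDOM_CFG)
        .collect();
    flags.push(GETRANDOM_CFG);
    // 0x1f is the argument separator CARGO_ENCODED_RUSTFLAGS expects.
    flags.join("\x1f")
}
```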

  • Passive event-bus storm diagnostics. A new bus-activity monitor (astrid-kernel/src/bus_monitor.rs) subscribes to all events on its own subscriber, aggregates publish counts per topic over a rolling 5s window, and logs a WARN naming the hottest topics whenever the sustained rate crosses 100 events/s (DEBUG otherwise); events dropped to broadcast lag are folded into the tally via drain_lagged so an overflow spike still surfaces. A truly idle daemon publishes well under one event per second, so a sustained triple-digit rate with no client attached is the signature of a feedback loop / event storm — the failure mode that pegs CPU by waking every broadcast subscriber and re-invoking WASM interceptors. The monitor is a pure observer (it counts on its own subscriber, not inside EventBus::publish), so it adds zero hot-path overhead and keeps reporting even while a storm saturates the dispatcher; this makes the next such incident self-diagnosing instead of requiring a live profiler attach. Bumps INTERNAL_SUBSCRIBER_COUNT 4 → 5. Closes #786.
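
The storm check described above reduces to a small piece of arithmetic over one window's tallies. The sketch below assumes the threshold is inclusive and that lagged events count toward the total but not toward any single topic; storm_report and the constants are hypothetical names, not the contents of bus_monitor.rs.

```rust
use std::collections::HashMap;

pub const WINDOW_SECS: u64 = 5;
pub const STORM_EVENTS_PER_SEC: u64 = 100;

/// Given per-topic publish counts for one window plus the number of events
/// dropped to broadcast lag, return the hottest topics if the sustained rate
/// reaches the storm threshold, or None for a quiet window.
pub fn storm_report(
    counts: &HashMap<String, u64>,
    lagged: u64,
    top_n: usize,
) -> Option<Vec<(String, u64)>> {
    let total = counts.values().sum::<u64>() + lagged;
    if total < STORM_EVENTS_PER_SEC * WINDOW_SECS {
        return None;
    }
    let mut hottest: Vec<(String, u64)> =
        counts.iter().map(|(t, c)| (t.clone(), *c)).collect();
    // Highest count first; ties broken by topic name for stable output.
    hottest.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    hottest.truncate(top_n);
    Some(hottest)
}
```

A caller would log the returned topics at WARN and fall back to a DEBUG summary when the result is None.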

  • Typed OpenAPI schemas for gateway responses sourced from kernel types. Response payloads that flow through astrid-core types (AgentSummary, CapabilityInfo, Quotas, GroupSummary, InviteIssued, InviteSummary) previously surfaced in the OpenAPI spec as opaque serde_json::Value (any), so generated clients got no field-level types. Each now has a gateway-local *View schema mirror (astrid-core stays utoipa-free) that the value_type resolves to, giving codegen real typed shapes. Fixing this also surfaced and corrected drift in the old hand-written shape comments: InviteIssued carries no fingerprint, InviteSummary's field is token_fingerprint (plus an always-present issued_at_epoch), and GroupSummary's flag is builtin (not built_in). A regression test pins the mirrors so a payload can't silently revert to any. env.default and distribution.branding stay Value — they're legitimately polymorphic. Closes #783.

  • Gateway default listen port changed 7777 → 2787 ("ASTR"). 7777 collides in practice with Terraria and assorted dev tooling; 2787 is the phone-keypad mnemonic for ASTR (A=2, S=7, T=8, R=7) and is effectively free in the wild. The gateway is opt-in and the runbook has operators set listen explicitly, so impact is limited to anyone relying on the default. Closes #784.
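
The keypad arithmetic behind the new port can be checked mechanically. This is a small illustrative sketch using the standard phone-keypad letter groups; keypad_digit and keypad_number are hypothetical helper names.

```rust
/// Map a letter to its digit on a standard phone keypad.
pub fn keypad_digit(c: char) -> Option<u32> {
    match c.to_ascii_uppercase() {
        'A'..='C' => Some(2),
        'D'..='F' => Some(3),
        'G'..='I' => Some(4),
        'J'..='L' => Some(5),
        'M'..='O' => Some(6),
        'P'..='S' => Some(7),
        'T'..='V' => Some(8),
        'W'..='Z' => Some(9),
        _ => None,
    }
}

/// Spell a word as a decimal number, one keypad digit per letter.
/// Returns None for non-letters or on overflow.
pub fn keypad_number(word: &str) -> Option<u32> {
    word.chars().try_fold(0u32, |acc, c| {
        let d = keypad_digit(c)?;
        acc.checked_mul(10)?.checked_add(d)
    })
}
```

With these definitions, keypad_number("ASTR") evaluates to Some(2787).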

  • Gateway API client generation guide. New docs/gateway-client.md documents how to build a client against the gateway HTTP admin API: generate from the OpenAPI spec at GET /api/openapi.json (openapi-typescript / progenitor), with hand-written layers only where the spec can't help — the two SSE streams (POST /api/agent/prompt, GET /api/events), the #[schema(value_type = String)] stand-in fields whose real types live in astrid-core, and the redeem/refresh auth lifecycle (the bearer is treated as opaque — no client-side signature verification). Records the placement decision: a gateway client is native/browser code and does not belong in the wasm-targeted capsule SDKs (sdk-rust / sdk-js); it earns its own repo when it graduates to a maintained library. Cross-linked from core/README.md under "Operator documentation".

  • Gateway deployment runbook. New docs/gateway-deployment.md covers the full operator surface: quickstart, behind a reverse proxy (nginx + Caddy snippets, trust-forwarded-from discipline), native TLS via the [tls] block, monitoring (every counter / histogram explained, sample alert PromQL), authentication flow + bearer lifecycle, CORS allowlist grammar, key rotation, backup + restore, and a troubleshooting section for the common failure modes (401 cascade, rate-limiter lockout, CORS preflight mismatch, audit-history 502). Cross-linked from core/README.md under a new "Operator documentation" section. Closes #777.

  • GET /api/sys/audit — historical-query endpoint over the persistent audit log. Companion to the live GET /api/events SSE feed: SSE only delivers events from the moment the connection opens, so a dashboard rendering "the last 24 h of admin activity" had no way to backfill. Now it does. Paginated with ?since/?until (epoch seconds), ?method=AgentDelete, ?principal=alice, ?limit (default 100, cap 1000), ?cursor (opaque). Same trust shape as the SSE handler — audit:read_all holders see the firehose; everyone else is silently scoped to their own principal regardless of what they pass in ?principal. Reads via tokio::task::spawn_blocking so the underlying SurrealKV query doesn't stall the runtime worker. The new endpoint is plumbed through astrid-daemon::spawn_gateway alongside event_bus so the gateway gets an Arc<AuditLog> + SessionId at boot; standalone-builder GatewayState::new returns 502 honestly when those handles are absent. OpenAPI annotation + drift test pin the new route. Closes #779.
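
The scoping and pagination rules for the historical audit query can be sketched as a pure function over the request parameters. Two points are assumptions: that an over-cap ?limit is clamped rather than rejected, and the names EffectiveQuery and effective_query, which are hypothetical.

```rust
pub const DEFAULT_LIMIT: u32 = 100;
pub const MAX_LIMIT: u32 = 1000;

/// The filter actually applied after trust scoping and limit handling.
#[derive(Debug, PartialEq)]
pub struct EffectiveQuery {
    pub principal: Option<String>,
    pub limit: u32,
}

/// Holders of audit:read_all may filter by any principal (or none, for the
/// firehose). Everyone else is scoped to their own principal regardless of
/// what they passed in ?principal.
pub fn effective_query(
    caller: &str,
    has_read_all: bool,
    requested_principal: Option<&str>,
    requested_limit: Option<u32>,
) -> EffectiveQuery {
    let principal = if has_read_all {
        requested_principal.map(str::to_string)
    } else {
        Some(caller.to_string())
    };
    let limit = requested_limit.unwrap_or(DEFAULT_LIMIT).min(MAX_LIMIT);
    EffectiveQuery { principal, limit }
}
```

Overriding the filter silently, instead of returning an error, mirrors the SSE handler's trust shape: an unprivileged caller cannot learn anything by probing other principal ids.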

  • Latency histograms + per-request structured logs for the gateway, backed by the metrics facade. The /metrics exposition now ships an astrid_gateway_request_duration_seconds Prometheus histogram per (method, route, status) with the standard HTTP buckets (5 ms → 10 s + +Inf), and astrid_gateway_requests_total carries a new status label so a 4xx/5xx spike decomposes separately from the 2xx traffic on the same route. Every request also emits one structured tracing event with method, route (matched template, never the raw URL — keeps cardinality bounded), status, and duration_ms; /healthz and /metrics demote to DEBUG so the high-frequency liveness probes don't drown the INFO stream. Recording uses the metrics crate facade + metrics-exporter-prometheus instead of a hand-rolled counter/histogram pair — decouples recording from export format so kernel-side or capsule-side observability can land later via the same counter!() / histogram!() macros without each subsystem reinventing a metrics layer. astrid_gateway_auth_failures_total, astrid_gateway_redeem_attempts_total, and astrid_gateway_redeem_rate_limited_total are registered at boot and render at 0 so dashboards have a stable series shape before the corresponding instrumentation lands. Closes #778.

  • Native TLS termination in the gateway via rustls. New optional [tls] block in etc/gateway-http.toml (cert-path, key-path) flips the gateway from plain HTTP to rustls-terminated HTTPS — useful for single-box installs, Tailscale-fronted deployments, and anyone running without a reverse-proxy ops layer. Without the block, the daemon behaves exactly as v0.7.0 (plain HTTP, TLS-upstream expected). Backed by axum-server 0.7 + aws-lc-rs (no openssl). New GatewayConfig::validate() runs at daemon boot: missing cert/key paths produce a clear "refusing to boot" error so misconfig fails fast instead of returning malformed TLS handshakes at runtime. Group/world-readable key files generate a WARN log at boot suggesting chmod 0600. rustls::crypto::CryptoProvider::install_default() is called idempotently in tls::load_rustls_config so rustls 0.23's multi-provider deferral doesn't panic the first handshake. HSTS (Strict-Transport-Security: max-age=63072000; includeSubDomains) is layered onto every TLS response — only on the TLS dispatch path, since RFC 6797 forbids HSTS over plain HTTP. Binding a non-loopback address without a [tls] block now logs a WARN at boot suggesting the operator either enable TLS or confirm a reverse proxy is fronting the listener. Three new integration tests in crates/astrid-integration-tests/tests/gateway_tls.rs mint a self-signed cert at test time with rcgen and prove the round-trip end-to-end including HSTS presence on TLS / absence on plain HTTP. ACME / Let's Encrypt automation, mTLS client-cert auth, and HTTP/2 / h2 ALPN remain out of scope for v0.7 and tracked as follow-ups on the closed issue. Closes #773.

  • HTTP admin gateway (astrid-gateway). New crate that fronts the kernel's existing astrid.v1.admin.* + astrid.v1.request.* IPC surfaces over HTTP for browser dashboards. Reads ~/.astrid/run/system.token, handshakes with the daemon over the same Unix socket the CLI uses, and stamps IpcMessage.principal from an ed25519-signed bearer it verifies against its boot-time public key — never from the request body. Full route surface, sufficient for a dashboard to provision principals, set quotas, manage groups + caps + invites, and configure capsule env:

    • Discovery (unauthenticated): GET /api/distribution, GET /api/distribution/onboarding
    • Auth: POST /api/auth/redeem, GET /api/auth/me, POST /api/auth/refresh (extends an existing session without forcing a re-redeem), POST /api/auth/pair-device (authenticated — mint a pair-token for the caller's own principal), POST /api/auth/pair-device/redeem (unauthenticated — the new device sends its ed25519 public key; the kernel appends it to the principal's AuthConfig.public_keys and the gateway returns a fresh bearer for that principal)
    • Principals (agent CRUD): GET/POST /api/sys/principals, GET/PATCH/DELETE /api/sys/principals/{id}, POST .../enable, POST .../disable
    • Caps: POST /api/sys/principals/{id}/caps (grant), DELETE /api/sys/principals/{id}/caps (revoke)
    • Quotas: GET/PUT /api/sys/principals/{id}/quotas
    • Groups: GET/POST /api/sys/groups, PATCH/DELETE /api/sys/groups/{name}
    • Invites: POST/GET /api/sys/invites, DELETE /api/sys/invites/{fingerprint}
    • Capabilities catalog: GET /api/sys/capabilities (drift-checked against the kernel-side tables at test time) returns structured entries — each with id, label, description, category (agent / caps / quota / group / invite / capsule / system / approval), scope (self / global), and danger (safe / normal / elevated / extreme) — so dashboards can render Discord-style permissions panels with confirmation prompts on dangerous toggles, without hardcoding per-capability metadata client-side. Source of truth is astrid_core::capability_grammar::CAPABILITY_CATALOG; the kernel's drift tests pin every static cap-string against the catalog at test time.
    • Capsules: GET /api/capsules, POST /api/capsules (cap-gated by capsule:install; the kernel-side handler ships in this release, see the astrid-capsule-install entry below), GET /api/capsules/{id}, GET /api/capsules/{id}/topics
    • Capsule env (per-principal config): GET /api/capsules/{id}/env (schema from Capsule.toml), POST /api/capsules/{id}/env/{field} (writes to FileSecretStore for secret-typed fields, env JSON under the principal's home for text/select/array)
    • Audit stream: GET /api/events — Server-Sent Events feed of audit entries. Operators with audit:read_all see the firehose; everyone else sees only entries whose principal matches their own. Kernel publishes a flat JSON event (ts_epoch, method, required_capability, principal, target_principal, params, outcome) to topic astrid.v1.audit.entry on every record_admin_audit call. 15s SSE keep-alive so NAT/proxy state survives idle stretches.
    • System: GET /api/sys/status, POST /api/sys/capsules/reload
    • Ops probes (unauthenticated): GET /healthz (200 iff daemon socket reachable; ~zero IPC cost so safe for high-frequency liveness probes) and GET /metrics (Prometheus text-exposition format — astrid_gateway_requests_total{method,route}, astrid_gateway_auth_failures_total, astrid_gateway_redeem_attempts_total, astrid_gateway_redeem_rate_limited_total). Restrict access via reverse proxy / firewall.

    localhost-bound by default; TLS expected upstream. Spawned by astrid-daemon when etc/gateway-http.toml has enabled = true; missing/disabled = no-op so single-tenant deployments are unaffected. The POST /api/capsules route is wired through to KernelRequest::InstallCapsule and the kernel-side handler ships in this release — see the astrid-capsule-install entry below. ApprovalRequired responses from the kernel are passed through structurally so dashboards can render approval prompts when capability gating triggers them. (#756)
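The principal scoping on the GET /api/events audit stream can be sketched as a plain filter. This is a minimal illustration, not the gateway's code: AuditEntry, Caller, and visible_to are hypothetical names, the entry carries only three of the documented fields, and the method strings in the usage note are made up.

```rust
/// Hypothetical, trimmed-down view of one audit entry (the real event also
/// carries ts_epoch, required_capability, target_principal, and params).
#[derive(Debug, Clone, PartialEq)]
pub struct AuditEntry {
    pub principal: String,
    pub method: String,
    pub outcome: String,
}

/// Hypothetical caller context: the principal taken from the verified bearer
/// plus the capabilities it holds.
pub struct Caller {
    pub principal: String,
    pub capabilities: Vec<String>,
}

impl Caller {
    /// Assumption: only an exact `audit:read_all` grant unlocks the firehose.
    fn can_read_all(&self) -> bool {
        self.capabilities.iter().any(|c| c == "audit:read_all")
    }
}

/// Operators with `audit:read_all` see every entry; everyone else sees only
/// entries whose principal matches their own.
pub fn visible_to<'a>(caller: &Caller, entries: &'a [AuditEntry]) -> Vec<&'a AuditEntry> {
    let firehose = caller.can_read_all();
    entries
        .iter()
        .filter(|e| firehose || e.principal == caller.principal)
        .collect()
}
```

Given entries from alice and bob, a caller holding audit:read_all gets both back, while alice with no extra capabilities gets only her own entry.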

  • Invite-token primitives in the kernel. Four new AdminRequestKind variants — InviteIssue, InviteRedeem, InviteList, InviteRevoke — plus matching AdminResponseBody::{Invite, InviteRedeemed, InviteList}. Backed by a file-at-etc/invites.toml store that persists only hex(sha256(token)) (never the raw token) with 0600 perms and atomic write-then-rename. InviteRedeem bypasses the cap-gate by explicit match in the dispatcher (the token IS the auth) and reuses the existing AgentCreate machinery under admin_write_lock so concurrent redeems can't double-spend. The audit sanitiser redacts both the raw token and the ed25519 public key from persisted audit rows, replacing each with its SHA-256 fingerprint. New invite:issue / invite:list / invite:revoke capabilities; the built-in admin group's * covers them. (#756)

  • astrid invite {issue, list, revoke, redeem} CLI verbs. Operator parity with the gateway HTTP routes — useful for scripting and for end-to-end testing the invite flow without the HTTP front. redeem connects as PrincipalId::default() so a fresh-machine onboarding flow doesn't need a pre-existing cli-context.toml. redeem accepts either --public-key <hex> or --keypair <name> (reading the pubkey from the local keystore) and an optional --switch flag that auto-updates cli-context.toml to the freshly-minted principal.

  • astrid keypair {generate, list, show, pubkey, delete} CLI verbs. Multi-key local ed25519 keystore at ~/.astrid/keys/local/<name>.{ed25519, pub.hex, meta.toml} — 0700 parent dir, 0600 private file, 0644 public, atomic write-then-rename throughout, secret bytes zeroized on drop via ed25519-dalek's zeroize feature. Each keypair carries a TOML metadata sidecar (schema_version, fingerprint, created_at, optional note, optional bound_principal set on a successful invite redeem). pubkey --format openssh emits ssh-ed25519 AAAA… so the same key can be reused with SSH-style tooling. Forward-compatible with a future AuthMethod::HardwareKey backend — only the meta backend field changes.

  • astrid_core::capability_grammar::KNOWN_CAPABILITIES. Canonical static slice of every capability identifier the kernel recognises. The HTTP gateway's /api/sys/capabilities route references it; new kernel-side tests in kernel_router/capability_catalog_tests.rs enumerate every string returned by required_capability / required_capability_for_admin_request and assert each appears in the catalog — so adding a kernel cap without updating the catalog now fails CI. Compile-time KNOWN_CAPABILITIES_COUNT pin catches off-by-one omissions.

  • astrid-capsule-install crate — kernel-side KernelRequest::InstallCapsule handler. Install machinery (file layout, content-addressing of WASM/WIT into bin/<hash>.wasm / wit/<hash>.wit, lifecycle hooks, topic baking, meta.json writes, .capsule archive unpacking with traversal/symlink defense) extracted from astrid-cli so the daemon and the CLI reach disk through the same code path. The kernel handler is path-only by construction: network sources (@org/repo, GitHub URLs, openclaw:…, gh:, raw HTTPS) are rejected with a structured error pointing callers at the gateway's registry route — the daemon never fetches arbitrary bytes during install. On success the handler triggers load_all_capsules so the new capsule is live without a daemon restart, and returns a flat JSON InstallOutput (target_dir, phase, installed_version, previous_version, wasm_hash, env_path, env_needs_prompt, missing_imports, export_conflicts) the dashboard can render directly. CLI behaviour is unchanged — astrid capsule install still handles source resolution (GitHub release-asset download with clone+build fallback, OpenClaw transpile, archive auto-detect, Rust-source auto-build), and the post-resolution install delegates to the new crate. (#756)

  • Bus-direct admin path: 285× admin throughput increase, 1000s-of-agents ready. Gateway admin routes used to dial the kernel over the Unix socket through the astrid-capsule-cli proxy capsule, which capped at ~19 RPS regardless of concurrency (MAX_ACTIVE_STREAMS = 8 + per-stream 50ms poll budget). For a deployment hosting thousands of agents that was a hard wall. New astrid_gateway::bus_admin::BusAdminClient publishes admin requests directly onto the in-process kernel.event_bus (which the gateway already holds for the SSE audit + agent routes) and subscribes to the response topic locally. Same envelope shape, same request_id correlation, same kernel-side dispatcher, three fewer hops. Every admin route now goes through state.admin_client(caller) rather than AdminClient::connect(...).await over the socket. Measured ceiling: 5,400 RPS reads at c=50 (p99=15ms) vs the previous 19 RPS / 1054ms ceiling. Writes were also serialised through the admin dispatcher's single-task loop; that is now parallelised too (tokio::spawn per request in spawn_admin_router) so reads no longer block behind writes, while the existing admin_write_lock continues to serialise the actual write handlers. Writes now peak at ~46 RPS, bound by admin_write_lock + the TOML rewrite (a future scalability item: file an issue for an LSM-backed invite/profile store when it bites). All 34 e2e stories continue to pass on the new path; the socket-based AdminClient is unchanged and still serves the CLI's external uplink use case. (#756)

  • POST /api/agent/prompt — agent invocation over HTTP, SSE response stream. Dashboard clients can now talk to the AI through the gateway, not only via the CLI's user.v1.prompt socket path. The route publishes IpcPayload::UserInput { text, session_id, context } directly onto the in-process kernel event bus (no proxy round-trip — sidesteps the capsule-cli stability issue tracked at unicity-astrid/capsule-cli#18) and returns a Server-Sent Events stream emitting event: ready (correlation handshake), event: delta (incremental tokens from agent.v1.stream.delta), event: response (final reply on agent.v1.response, closes the stream), event: session_changed (agent.v1.session_changed), and event: elicit (forwarded from astrid.v1.elicit.* for follow-up-input prompts). Per-session filtering at the consumer end so multiple concurrent dashboards don't see each other's chunks. 5-minute upper bound on the stream — dashboards re-POST to continue. New agent tag in the OpenAPI spec; new PromptRequest + PromptReady schemas. Verified end-to-end against the live daemon: ready event arrives with the caller's principal echoed back. Full LLM round-trip requires the react/identity/LLM-provider capsules to be installed. (#756)

  • Path-parameter routes fix — every {id}-style route was silently broken. Gateway is built on axum 0.7.9 which uses :id syntax for path captures; {id} is axum 0.8+. The routes module registered every parameterised path with {id} literally, so requests to /api/sys/principals/alice, /api/sys/principals/alice/quotas, /api/capsules/foo/env/api_key, /api/sys/groups/ops-team, /api/sys/invites/<fp>, and every other parameterised admin route returned a 404 — only the no-params routes worked. The OpenAPI spec uses {id} (correct, that's the OpenAPI standard) so the discrepancy didn't surface in the drift test. Caught by running the multi-story e2e against a live daemon; every parameterised route now routes correctly. The OpenAPI spec is unchanged (it's already correct for that format). (#756)

  • End-to-end multi-perspective story test plan. 34-assertion script (scripts/e2e-stories.sh, gated on a live daemon — not part of the cargo test matrix) walks the full admin gateway from three perspectives: bootstrap admin (issue invites, create custom groups, grant caps), team operator (redeems into a custom group, can issue invites but not delete principals, can't pair-device because the team group lacks self:auth:pair), and regular agent (redeems into the built-in agent group, can pair devices, sees only themselves in /api/sys/principals, can't issue invites or read system status until admin grants the cap, can read their own quota after admin sets it). Also covers /api/openapi.json (29 paths + 32 schemas), /api/sys/capabilities (34 caps, 8 categories), bearer refresh, unauthenticated 401 paths, and the agent SSE handshake. All 34 stories pass against the live daemon as of this commit. (#756)

  • Gateway end-to-end smoke test + ConnectInfo fix. New astrid-integration-tests/tests/gateway_e2e.rs boots a real Kernel and GatewayState against a tempdir $ASTRID_HOME and proves the boot artefacts (Unix socket, persistent KV directory, gateway signing key at keys/gateway.ed25519) land on disk, the unauthenticated routes (/api/distribution, /api/openapi.json, /healthz) return 200 against the live state, bearers round-trip via the same signing material the middleware verifies, and the kernel-bus → SSE wiring delivers audit events to subscribers. The full /api/auth/redeem socket loop is gated on the astrid-capsule-cli proxy capsule (out of this crate's scope to load) and was exercised manually against a built daemon. Bug found by manual exercise: axum::serve(listener, router) without into_make_service_with_connect_info::<SocketAddr>() means the ConnectInfo<SocketAddr> request extension is missing, so POST /api/auth/redeem (which extracts it for per-IP rate limiting) returned 500 Missing request extension. Fixed by switching the serve shape — every tower-style in-process test still works because oneshot doesn't require a real connection, but the production daemon does. (#756)

  • GET /api/openapi.json — OpenAPI 3.x spec emission via utoipa. Every gateway handler now carries a #[utoipa::path(...)] annotation and every request/response type derives ToSchema. The aggregated ApiDoc lists all 35 routes under paths(...) and 32 schemas under components(schemas(...)). The spec declares a bearerAuth HTTP security scheme (description spells out the verification posture) as the default requirement; unauthenticated routes (/api/distribution, /api/distribution/onboarding, /api/auth/redeem, /api/auth/pair-device/redeem, /healthz, /metrics, /api/openapi.json itself) clear it via security(()). Twelve tags(...) group routes by family (auth, principals, caps, quotas, groups, invites, capsules, env, audit, system, discovery, ops) so Swagger UI / Redoc / Scalar render coherent sections. Type-system boundary: a handful of response types originate in astrid-core (PrincipalId, Quotas, AgentSummary, GroupSummary, InviteIssued, InviteSummary, DaemonStatus, CapabilityInfo) and don't carry ToSchema — pulling utoipa across the kernel-side dep graph would balloon the build for one observability concern. Those fields use #[schema(value_type = serde_json::Value)] to render as generic objects in the spec with prose describing the exact field shape in the surrounding text. New drift test in tests/router.rs enumerates every registered route in routes::build and asserts each appears in the spec — adding a route without an annotation fails CI. New ErrorBody struct in error.rs documents the unified failure shape every status code emits ({ error, reason?, retry_after_secs? }); the IntoResponse impl still writes the same wire format via json!() macros so no body changes for existing clients. Drop the URL into openapi-typescript / openapi-generator / kiota to get a typed client; drop it into Swagger UI / Redoc / Scalar for browsable docs. (#756)

  • Source-direct content-addressing during install. Previously the install copied the entire capsule tree into the target directory, then read the .wasm back out, BLAKE3-hashed it, wrote bin/<hash>.wasm, and deleted the per-capsule copy — same dance for wit/. The new lib hashes WASM and WIT from the source directly, writes to the content store once (atomic temp+rename so concurrent writers on identical bytes converge harmlessly), and the per-capsule directory copy excludes *.wasm and the top-level wit/ by construction. No transient staging copy. The runtime contract is unchanged — loader still reads via resolve_content_addressed_wasm — only the install path is cleaner. Pre-flight hashing also means content-addressing failures (bad source, hash collision, disk full on bin/) leave the existing install intact: no rollback needed because target_dir hasn't been touched yet. (#756)

  • Distro.toml [invites] and [branding] sections (additive). [invites] { issuers, default-group, default-expires, max-principals } declares the deployment's onboarding policy (empty issuers = single-tenant, no registration UI). [branding] { icon, primary-color, accent-color } carries dashboard visual hints. Parser-validated: non-empty issuers requires default-group; durations parse as Ns/Nm/Nh/Nd; colours match #RGB or #RRGGBB; icons are capped at 64 KiB on parse.
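The duration and colour rules above are small enough to sketch directly. The helper names (parse_duration, is_valid_colour) are hypothetical and this is not the Distro.toml parser itself; it only mirrors the stated grammar (Ns/Nm/Nh/Nd, and #RGB or #RRGGBB), under the assumption that N is an unsigned decimal integer.

```rust
use std::time::Duration;

/// Parse `Ns` / `Nm` / `Nh` / `Nd` into a Duration.
/// Returns None for an unknown unit, a missing or non-decimal count, or overflow.
pub fn parse_duration(s: &str) -> Option<Duration> {
    let unit = s.chars().last()?;
    let digits = &s[..s.len() - unit.len_utf8()];
    // Reject empty counts and anything that is not plain ASCII digits (e.g. "+5").
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let n: u64 = digits.parse().ok()?;
    let secs_per_unit: u64 = match unit {
        's' => 1,
        'm' => 60,
        'h' => 3_600,
        'd' => 86_400,
        _ => return None,
    };
    n.checked_mul(secs_per_unit).map(Duration::from_secs)
}

/// Accept `#RGB` or `#RRGGBB` with hexadecimal digits only.
pub fn is_valid_colour(s: &str) -> bool {
    match s.strip_prefix('#') {
        Some(hex) => {
            (hex.len() == 3 || hex.len() == 6) && hex.bytes().all(|b| b.is_ascii_hexdigit())
        }
        None => false,
    }
}
```

Under these assumptions "30m" parses to 1800 seconds and "7d" to 604800 seconds, while "10x", "m", and "#ffff" are rejected.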

Changed

  • Guest ERROR-level logs now surface in the daemon log, not only the per-capsule log file. When a run-loop capsule's run() returns Err before signaling ready, the daemon log showed only a contextless Capsule run loop exited before signaling ready — for the sole-socket-owner uplink (the cli proxy) that takes the whole daemon down, and diagnosing it meant hours of guessing. The reason was never actually lost: the SDK #[astrid::run] macro logs run loop exited with error: {e} at ERROR level before returning, but sys::log writes guest logs to a per-capsule file (effective_capsule_log()) and only fell back to the daemon's tracing subscriber when there was no per-capsule file — so the reason landed in a separate file operators don't check during a crash. ERROR-level guest logs are now emitted to the daemon's tracing subscriber in addition to the per-capsule file (lower levels stay file-only when captured, preserving the daemon log's signal-to-noise). The silent run-loop crash is now a one-line diagnosis (ERROR …host::sys: run loop exited with error: … plugin=<capsule>) sitting right next to the kernel's message. No ABI/WIT change. Refs #884.
  • Publish/subscribe ACL wildcard matching now uses the route-layer subtree semantics — a declared trailing * covers the whole namespace. ACL authorization used the strict topic::topic_matches (a * matches exactly one segment) while event delivery uses astrid_events::TopicMatcher (a trailing * is a subtree wildcard matching one or more segments at any depth). That divergence forced capsule manifests to enumerate wildcard depth (astrid.v1.admin.* / *.* / *.*.*) merely to authorize publishing topics whose depth varies or is unknown (uuid suffixes, variable sub-paths) — the same matcher asymmetry behind the cli run-loop subscribe confusion, now on the publish side. The two ACL checks (publish_inner, check_subscribe_acl) now authorize via a single allocation-free astrid_events::topic_pattern_matches — extracted as the one source of truth now used by routed delivery (TopicMatcher), broadcast delivery (EventReceiver), and ACL authorization alike — so a declared astrid.v1.admin.* authorizes every admin topic at any depth and the three paths can never silently diverge again (this also folds away a duplicated matcher and fixes a latent equal-segment-count bug in the broadcast path's hand-rolled iterator form). Scope is the two ACL sites only — interceptor dispatch keeps strict matching, and the runtime "wildcard must be terminal" subscribe gate is unchanged. The change is permissive (authorizes more, denies nothing previously allowed; delivery was already subtree); breadth is the operator's decision at install, not something the matcher enforces by forcing enumeration. Refs #882.
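The difference between the two matcher semantics can be shown side by side. This is an illustrative sketch only: strict_matches and subtree_matches are hypothetical stand-ins, not the bodies of topic::topic_matches or astrid_events::topic_pattern_matches, and the topic names in the usage note are examples. Both walk the dot-separated segments without allocating.

```rust
/// Strict form: every `*` matches exactly one segment, so pattern and topic
/// must have the same number of segments.
pub fn strict_matches(pattern: &str, topic: &str) -> bool {
    let mut p = pattern.split('.');
    let mut t = topic.split('.');
    loop {
        let (ps, ts) = match (p.next(), t.next()) {
            (None, None) => return true,
            (Some(ps), Some(ts)) => (ps, ts),
            _ => return false, // segment counts differ
        };
        if ps != "*" && ps != ts {
            return false;
        }
    }
}

/// Subtree form: a trailing `*` matches one or more remaining segments at any
/// depth; a `*` elsewhere still matches exactly one segment.
pub fn subtree_matches(pattern: &str, topic: &str) -> bool {
    let mut p = pattern.split('.').peekable();
    let mut t = topic.split('.');
    loop {
        let (ps, ts) = match (p.next(), t.next()) {
            (None, None) => return true,
            (Some(ps), Some(ts)) => (ps, ts),
            _ => return false,
        };
        // Trailing wildcard with at least one topic segment left: accept the subtree.
        if ps == "*" && p.peek().is_none() {
            return true;
        }
        if ps != "*" && ps != ts {
            return false;
        }
    }
}
```

With the pattern astrid.v1.admin.*, the strict form accepts astrid.v1.admin.status but rejects astrid.v1.admin.invite.issue, which is why manifests had to enumerate *.* and *.*.*; the subtree form accepts both and still rejects the bare astrid.v1.admin.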
  • The astrid:process WIT now documents the id-keyed persistent write-stdin / close-stdin as implemented. These landed in #867 — the persistent-process registry retains the child's stdin pipe host-side across pooled-instance resets (1 MiB-capped, backpressured, ownership-checked, audited) — but the WIT still tagged them (NOT YET IMPLEMENTED) and carried two stale PERSISTENT TIER "stubbed until the registry lands" banners. The tags are dropped and the banners corrected; function signatures are unchanged (doc-only). The ephemeral ProcessHandle form, attach, and watch / unwatch remain genuinely deferred and stay tagged. An acceptance test now locks the id-keyed path: deliver bytes → a by-id re-write (needing only registry + id, exactly what a post-reset instance holds) reaches the same child → close-stdin yields a clean EOF exit → over-cap returns too-large → wrong-owner returns no-such-process → post-close returns closed. Refs #870.
  • Host-call concurrency is split into separate blocking vs async-I/O semaphores — the LLM-path throughput gate. A single host_semaphore sized at cores - 2 previously gated every host call: both the blocking ones that block_in_place + block_on and pin a tokio worker for the whole permit-held wait (KV, identity, sys, fs, the net/process security gates, DNS, sockets) and the async-I/O ones that .await real I/O and free the worker (HTTP request/stream, ipc::recv). The cores - 2 cap is right for the blocking class — it must not approach the worker-pool size or blocking host work starves the scheduler — but it throttles outbound I/O far below what the host sustains, capping the LLM/HTTP path. HostState now carries two gates: blocking_semaphore (host-derived cores - 2, the same ceiling — but no longer contended by I/O calls, so strictly more headroom) and io_semaphore (host-derived, cores-scaled and clamped by half the process RLIMIT_NOFILE — floored at 1, so on an fd-scarce host the descriptor budget wins and the ceiling can fall below the IO_MIN floor, preventing EMFILE — since each in-flight async call may hold a socket fd). The four generic bounded_* host helpers are unchanged; each call site passes the semaphore matching its class (the block_on helpers take blocking, the await helpers take I/O). Net stays on the blocking semaphore for now — reclassifying its socket/accept/DNS paths, which do real I/O but currently use the blocking block_on helpers, to the async semaphore is a follow-up. Refs #816.
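The sizing rules for the two gates reduce to a few lines of arithmetic. A rough sketch, with loud assumptions: IO_PERMITS_PER_CORE and the value of IO_MIN are invented for illustration (the notes name IO_MIN but give no numbers), the floor of 1 on the blocking gate is an assumption, and the real code reads cores and RLIMIT_NOFILE from the host rather than taking them as arguments.

```rust
/// Hypothetical tuning constants; the release notes do not state the real values.
const IO_PERMITS_PER_CORE: usize = 8;
const IO_MIN: usize = 16;

/// Blocking-class ceiling: `cores - 2`.
/// The floor of 1 is an assumption so that very small hosts still make progress.
pub fn blocking_permits(cores: usize) -> usize {
    cores.saturating_sub(2).max(1)
}

/// Async-I/O ceiling: cores-scaled with an IO_MIN floor, then clamped by half
/// the file-descriptor limit. The fd budget itself is floored at 1, so on an
/// fd-scarce host the result can fall below IO_MIN.
pub fn io_permits(cores: usize, rlimit_nofile: usize) -> usize {
    let scaled = cores.saturating_mul(IO_PERMITS_PER_CORE).max(IO_MIN);
    let fd_budget = (rlimit_nofile / 2).max(1);
    scaled.min(fd_budget)
}
```

With these made-up constants an 8-core host with a 65536 descriptor limit gets 6 blocking permits and 64 I/O permits, and the same host limited to 20 descriptors drops to 10 I/O permits, below the IO_MIN floor, because the descriptor budget wins.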
  • Per-Store memory limiting now also meters per-principal peak usage. The plain wasmtime::StoreLimits that capped each capsule instance's linear memory is replaced by a StoreMemoryMeter that keeps the same enforcement (the per-invocation max_memory_bytes ceiling) and records, into a kernel-owned shared MemoryLedger, the high-water linear-memory size each invoking principal grows a Store to — the RAM analogue of the per-principal FuelLedger. It is keyed by the same invoking principal (caller → owner → default) the fuel ledger charges, and re-targeted per invocation since a pooled Store is leased across principals. Sharded + atomic like the fuel ledger, so recording stays off the lock — and unlike the fuel ledger it is bounded (capped at a max principal count, evicting the lowest-peak entry when full) so a flood of ephemeral sub-agent principals cannot grow it without limit (astrid#827). Telemetry only — no deny path. Refs #816.
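The bounded high-water ledger can be illustrated with a single-threaded sketch. PeakMemoryLedger and its methods are hypothetical names; the real MemoryLedger is described as sharded and atomic, which this std-only version does not attempt, and keying by String is a simplification of the principal type.

```rust
use std::collections::HashMap;

/// Single-threaded stand-in for a bounded per-principal peak-memory ledger.
pub struct PeakMemoryLedger {
    max_principals: usize,
    peaks: HashMap<String, u64>,
}

impl PeakMemoryLedger {
    pub fn new(max_principals: usize) -> Self {
        PeakMemoryLedger {
            max_principals: max_principals.max(1),
            peaks: HashMap::new(),
        }
    }

    /// Record that `principal` grew a Store to `bytes`; only the high-water mark is kept.
    pub fn record(&mut self, principal: &str, bytes: u64) {
        if let Some(peak) = self.peaks.get_mut(principal) {
            *peak = (*peak).max(bytes);
            return;
        }
        if self.peaks.len() >= self.max_principals {
            // Ledger is full: evict the entry with the lowest recorded peak.
            let victim = self
                .peaks
                .iter()
                .min_by_key(|(_, peak)| **peak)
                .map(|(name, _)| name.clone());
            if let Some(name) = victim {
                self.peaks.remove(&name);
            }
        }
        self.peaks.insert(principal.to_string(), bytes);
    }

    pub fn peak(&self, principal: &str) -> Option<u64> {
        self.peaks.get(principal).copied()
    }

    pub fn len(&self) -> usize {
        self.peaks.len()
    }
}
```

The cap is what keeps a flood of short-lived principals from growing the map: once full, each new principal displaces the current lowest-peak entry instead of adding a row.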
  • The per-capsule instance pool is now dynamic and host-sized — the fixed INSTANCE_POOL_SIZE = 16 is gone. Each non-run-loop capsule used to eagerly instantiate exactly 16 (Store, Instance) pairs at load regardless of machine size or actual concurrency — 16× the linear-memory footprint up front even for an idle capsule, and a hard concurrency ceiling of 16 on a large host. The pool now warm-starts at a small min_idle, grows lazily (a checkout that finds no warm instance mints a fresh one while holding a permit, so the total never exceeds the max) toward a host-derived max (cores-scaled, replacing the magic 16), and an idle-eviction timer trims the warm set back down to min_idle after a burst subsides, reclaiming the memory of instances built under load. Net effect: less resting memory for idle capsules, more peak concurrency on big hosts. The max is operator-overridable via [capsule].instance_pool_size, ASTRID_CAPSULE_INSTANCE_POOL_SIZE, or astrid-daemon --instance-pool-size. Run-loop and host_process capsules stay pinned to a single Store — the host_process carve-out can never lease a second one (enforced by allow_grow = false, belt-and-suspenders over its always-warm single instance). Free-checkout soundness is unchanged: a lazily-grown instance is built by the same HostState factory as an eager one and reset identically on return. A RAM-budget-derived max lands with the per-principal memory ledger. Refs #816.
  • Per-principal interceptor CPU is now summed cross-capsule into one kernel-owned FuelLedger. Each capsule's WasmEngine previously kept its own per-principal fuel HashMap, fragmenting a principal that drove N capsules into N independent sub-totals; the ledger is now a single kernel-owned Arc<DashMap<PrincipalId, AtomicU64>> cloned into every engine, summing a principal's interceptor CPU across all capsules. Sharded + atomic (never a global mutex), so it does not reserialise the hot interceptor path. Telemetry only — no read/deny path in this change. Refs #819.
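The shape of a single shared ledger cloned into every engine can be sketched with the standard library. Note the swap: the entry names Arc<DashMap<PrincipalId, AtomicU64>>, while this sketch uses Arc<RwLock<HashMap<String, AtomicU64>>> to stay dependency-free, so it is not sharded and takes a brief write lock the first time a principal appears. The charge and total method names are assumptions.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

/// Std-only stand-in for a kernel-owned fuel ledger shared by all engines.
#[derive(Clone, Default)]
pub struct FuelLedger {
    inner: Arc<RwLock<HashMap<String, AtomicU64>>>,
}

impl FuelLedger {
    /// Add `fuel` to the principal's running total.
    pub fn charge(&self, principal: &str, fuel: u64) {
        {
            // Fast path: the principal already has a counter, a read lock suffices.
            let map = self.inner.read().unwrap();
            if let Some(total) = map.get(principal) {
                total.fetch_add(fuel, Ordering::Relaxed);
                return;
            }
        }
        // Slow path: first charge for this principal, insert under the write lock.
        let mut map = self.inner.write().unwrap();
        map.entry(principal.to_string())
            .or_insert_with(|| AtomicU64::new(0))
            .fetch_add(fuel, Ordering::Relaxed);
    }

    /// Current total for a principal, or 0 if it has never been charged.
    pub fn total(&self, principal: &str) -> u64 {
        self.inner
            .read()
            .unwrap()
            .get(principal)
            .map_or(0, |t| t.load(Ordering::Relaxed))
    }
}
```

Cloning the ledger clones only the Arc, so two engines charging the same principal add into one counter instead of keeping separate sub-totals.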
  • Per-capsule WASM instance pool: principals' interceptors now run concurrently instead of serialising through one Store. Each non-run-loop capsule was a single Store<HostState> behind one mutex, so invoke_interceptor processed exactly one invocation at a time per capsule — the throughput floor that kept the #813 orchestration cliff open even after the async-Wasmtime work removed worker-pinning (one LLM turn every ~3 s, invariant to concurrency across a 50× range, measured directly as a 2000+ deep invocation backlog on one Store). WasmEngine now holds a pool::CapsuleInstancePool of N (Store, Instance) pairs built from one shared InstancePre; invoke_interceptor leases a free instance (semaphore-gated checkout().await) so up to N principals execute on independent Stores at once. Lease lifecycle is one RAII guard (PoolCheckout) whose Drop folds the Phase-3 CLEAR (resetting every invocation_* field) and the return-to-pool onto every exit path — normal return, ?, panic-unwind, and future-drop on caller cancellation — replacing the old inline ClearOnDrop. Free checkout is sound because the per-capsule pool-safety audit confirmed no pooled capsule relies on in-WASM-memory state surviving across invocations (interceptor capsules use wasmtime resources within a single invocation). Pool size is a host-derived dynamic max (the originally-fixed INSTANCE_POOL_SIZE = 16 is superseded by the dynamic, host-sized pool — see the dedicated entry above), or 1 for run-loop capsules (they keep their dedicated Store, owned by the run loop) and for host_process capsules — a fail-secure carve-out for astrid-capsule-shell, whose background-process handles live across invocations and must stay single-Store; keyed off the existing capability, so no manifest/contract change. 
ipc_limiter is now a shared Arc across a capsule's instances so the per-capsule IPC throughput budget is not multiplied by pool size; the resource-table-mirror counters (net_stream_count, subscription_count) stay per-Store (correctly scoped to each instance's table). Refs #816.
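The idea of folding the clear and the return-to-pool into one Drop can be sketched as follows. LeaseGuard, Instance, and Pool are hypothetical stand-ins for the role the notes give PoolCheckout; the sketch uses a std Mutex and a single invocation field, and returns None when no instance is free, where the real checkout is described as awaiting a semaphore.

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for a pooled (Store, Instance) pair with one per-invocation field.
#[derive(Debug, Default, PartialEq)]
pub struct Instance {
    pub invocation_principal: Option<String>,
}

pub type Pool = Arc<Mutex<Vec<Instance>>>;

/// RAII lease: holds one instance and gives it back, cleared, when dropped.
pub struct LeaseGuard {
    instance: Option<Instance>,
    pool: Pool,
}

impl LeaseGuard {
    /// Take a free instance and stamp the invoking principal on it.
    pub fn lease(pool: &Pool, principal: &str) -> Option<LeaseGuard> {
        let mut instance = pool.lock().unwrap().pop()?;
        instance.invocation_principal = Some(principal.to_string());
        Some(LeaseGuard { instance: Some(instance), pool: Arc::clone(pool) })
    }

    pub fn instance(&self) -> &Instance {
        self.instance.as_ref().expect("instance is present until drop")
    }
}

impl Drop for LeaseGuard {
    fn drop(&mut self) {
        if let Some(mut instance) = self.instance.take() {
            // Clear per-invocation state before the instance can be leased again.
            instance.invocation_principal = None;
            // If the pool mutex is poisoned the instance is simply discarded.
            if let Ok(mut idle) = self.pool.lock() {
                idle.push(instance);
            }
        }
    }
}
```

Because the reset lives in Drop, it runs on a normal return, on an early return through ?, and during unwinding, so no exit path can hand back an instance that still carries the previous caller's state.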
  • Async Wasmtime: guest invocations no longer pin tokio workers (block_in_place removed from the WASM hot path). Component-Model async is now on across the kernel — every guest entry goes through Linker::instantiate_async / TypedFunc::call_async, and the per-capsule Store is now wrapped in a tokio::sync::Mutex instead of a std::sync::Mutex. The orchestration cliff fix from #813 protected the per-(capsule, topic, principal) routing layer; this finishes the job at the wasmtime layer by ensuring a parallel invoke_interceptor waiter .awaits on the lock rather than holding a worker via block_in_place for the full lifetime of the currently-running guest call (#816). Per-capsule serialisation is preserved (one guest call at a time per capsule, same as before), but the executor is free to schedule other capsules while one is queued. ExecutionEngine::invoke_interceptor and Capsule::invoke_interceptor are now async fn; the dispatcher / kernel callers .await accordingly. engine::wasm::run_lifecycle is async fn too, with astrid-capsule-install::lifecycle driving it through the available runtime handle. Cancellation safety: the existing ClearOnDrop RAII guard already runs on the future-drop path, so a cancelled interceptor still clears caller_context, interceptor_active, and every invocation_* field on HostState before releasing the store lock — tokio::sync::Mutex has no poisoning to recover from, so the previous "poisoned lock" branches are gone. Host trait impls remain synchronous: this lands the runtime infrastructure for async hosts but does not yet flip the bindgen-side imports: { default: async } flag — the slow-blocking host fns (ipc::recv, elicit, net.read/write, http::request) still use bounded_block_on_cancellable internally, scoped to a single host call rather than the entire guest invocation. Migrating those to truly async host fns is a follow-up. Refs #816.
  • SocketClient and AdminClient moved out of astrid-cli into a new astrid-uplink crate. Both the CLI and the new HTTP gateway are kernel uplinks with the same trust shape (read system.token, handshake, stamp IpcMessage.principal); the framing / handshake / admin-request correlation logic now lives in one place. CLI keeps thin shims at crate::{socket,admin}_client::* so verb modules don't change. Behaviour-preserving refactor. SocketClient::send_input now takes an explicit caller: &PrincipalId instead of looking up the CLI's active-agent context internally — the CLI passes its context-resolved principal through a send_input_as_active_agent helper.
  • astrid-uplink::KernelClient. Sibling of AdminClient for the KernelRequest / KernelResponse family that flows over astrid.v1.request.* (capsule list, status, reload, etc.). Embeds a per-call UUID in the topic suffix so concurrent HTTP requests can't pick up each other's responses on the shared bus.

Deprecated

  • astrid_capsule_pending_tail_overflow_total — removed. Replaced by astrid_capsule_route_byte_evictions_total{capsule, principal_class} and astrid_capsule_route_quantum_starved_total{capsule, principal_class} from the new publish-side routing demux. Operator dashboards alerting on the old counter must migrate.
  • astrid_capsule_interceptor_permit_wait_seconds_total: removed. The per-capsule interceptor semaphore is gone (see Removed below); there is no permit wait to measure. Operator dashboards alerting on the histogram must drop the series.

Removed

  • astrid-openclaw crate and the entire OpenClaw plugin build path (#829). The TypeScript/JavaScript-to-WASM compiler that absorbed the OpenClaw plugin ecosystem is gone: it was broken three ways and was the last extism vestige in core. (1) The Tier 1 QuickJS kernel was built from a personal fork of extism's js-pdk (github.com/nicholasgasior/extism-js @ v1.6.0) that no longer exists (HTTP 404), while build.rs separately pointed at upstream extism/js-pdk, so the two build paths already disagreed. (2) kernel/engine.wasm was never committed (only its BLAKE3 hash), and the committed hash corresponded to a fork build that can't be reproduced from upstream, so a clean checkout had no path back to a working kernel. (3) The compiler emitted extism-PDK core-module exports (__invoke_i32, named exports) while the capsule host migrated off Extism to the wasmtime Component Model / WIT, so its output could no longer load. It was also a CI breaker (the QuickJS auto-build + wasi-sdk exhausted the runner disk → SIGBUS). Removed alongside the crate: astrid-build's openclaw builder, the hidden --wizer-internal subcommand, the openclaw project type and openclaw.plugin.json autodetection; the CLI openclaw: install source; the kernel-router openclaw: remote-source rejection; the capsule watcher's openclaw.plugin.json detection; packages/openclaw-mcp-bridge (the Tier 2 Node bridge); scripts/build-quickjs-kernel.sh and scripts/compile-test-plugin.sh; and the now-orphaned oxc / oxc_allocator / wizer workspace dependencies. The live channel-registration infrastructure that carried the legacy name is renamed on this branch: UplinkSource::OpenClaw → UplinkSource::Bridge (constructor new_openclaw → new_bridge, serde wire tag open_claw → bridge, Display openclaw(…) → bridge(…)).
The serde wire-tag rename stays backward-compatible via a #[serde(alias = "open_claw")] on the Bridge variant: new writes serialize as bridge, but any UplinkDescriptor persisted or transmitted under the legacy open_claw tag still deserializes. (UplinkSource/UplinkDescriptor derive Serialize/Deserialize and UplinkDescriptor is documented as deserializable for trusted persistence, so the alias is the safe path rather than assuming the value is in-memory-only.) Breaking: astrid build --type openclaw, astrid capsule install openclaw:…, and installing a bare openclaw.plugin.json directory are no longer supported.
  • Per-capsule interceptor semaphore (MAX_CONCURRENT_INTERCEPTORS) removed. The Wasmtime Store mutex at engine/wasm/mod.rs:1286 already serializes invocations inside tokio::task::block_in_place, so the per-capsule cap-of-4 could never produce more than 1 concurrent guest execution — it was a redundant ceiling left over from the pre-#813 SET/CALL/CLEAR race, which is now closed by ClearOnDrop (Layer 1) and the per-(capsule, principal) chain mutex (Layer 3). Run-loop capsules don't construct a Store at all, so for them the semaphore was pure overhead on every interceptor call. The Capsule::interceptor_semaphore trait method, the CompositeCapsule.interceptor_semaphore field, and the two acquire_owned().await sites in dispatcher.rs are gone. astrid_capsule_interceptor_permit_wait_seconds_total is removed with no replacement — there is no permit wait to measure. Closes #813.

Fixed

  • Capsule tools reach the LLM again on the OpenAI-compatible path (react → prompt-builder → provider). The pipeline was handing the model zero tools. prompt-builder's collect_tool_schemas() fans out tool.v1.request.describe and drains tool.v1.response.describe.*, but its manifest granted neither topic, so the fan-out failed with CapabilityDenied before it ever fired: a code/manifest mismatch left by the per-domain-WIT migration. And even once unblocked, the KV-persisted __tool_schema_cache was never invalidated: its only trigger is a prompt_builder.v1.invalidate_tool_cache event that nothing publishes, so a stale (or empty) tool list survived across daemon restarts and masked freshly installed tool capsules. Fixed in capsule-prompt-builder (unicity-astrid/capsule-prompt-builder#19) by declaring the two describe-fan-out ACLs the code already uses and invalidating the cache on the kernel's astrid.v1.capsules_loaded broadcast. This is the consumer half of the tool-discovery restoration; the producer half ships in astrid-sdk 0.7.1: the #[capsule]-generated tool_describe arm publishes its descriptor instead of returning it (an interceptor return is not fanned out under the current ABI). All three (ACL grant + cache invalidation + a tool-capsule rebuild against 0.7.1) are required for tools to reach the model; verified end-to-end with a fake-LLM harness (0 → 16 tools across system/fs/http/skills). Refs #892, #625.
  • Per-principal quotas now apply on the ipc::recv path, not just the interceptor path. A run+recv capsule resolved the invoking principal's PrincipalProfile only on the dispatcher-driven interceptor path (invoke_interceptor); the guest-pulled ipc::recv path (install_recv_invocation_context) left invocation_profile = None, so effective_profile() fell back to the process-global default profile. Every principal driving a run-loop capsule therefore ran under the default quota (max_background_processes, IPC throughput, HTTP streams) regardless of its operator-configured per-principal profile — a documented gap (host_state.rs: "quotas remain the capsule owner's … move to a real lookup when per-principal quota enforcement is needed"). The recv path now resolves the publishing principal's profile through the shared PrincipalProfileCache (threaded into HostState) and installs it, so per-principal ceilings apply on both invocation paths. The profile is resolved for every publisher, the capsule owner included — effective_profile()'s fallback is the process-global default, never the owner's configured profile, so (matching the interceptor path) an owner-published message must resolve the owner's profile or its on-disk quotas would be silently ignored. Best-effort on the recv path, which has no deny channel: a failed load logs and falls back to the default. The pooled-interceptor and lifecycle-hook constructions receive the cache handle; the one-shot run_lifecycle path passes None (it runs no recv loop). Fail-safe and bounded before this fix (the default quota is a real ceiling, never unbounded), but the per-principal value was not honoured. Closes #877.
  • Dispatcher no longer drops events when a per-topic consumer is reclaimed under a burst. Each (capsule, topic, principal) route has its own idle-evicting mpsc consumer. Under a concurrent burst the consumer could leave a closed sender in the queue map — the idle-evict TOCTOU (a stale clone outliving the sender_strong_count == 1 check) or a consumer task that ended. get_or_spawn_consumer then handed that dead sender back, so every subsequent try_send failed Closed and the event was dropped permanently, stalling all delivery for that route (observed as a react/user.v1.prompt prompt stall) even though the capsule itself stayed healthy. The fix never returns a closed sender: get_or_spawn_consumer removes a mapped entry whose receiver is gone (is_closed()) and re-spawns a fresh consumer — the explicit remove also closes a leak in the degrade-to-shared path, where the re-keyed insert would otherwise never overwrite the stale (capsule, Some(principal)) entry. dispatch_single keeps a re-spawn-and-retry backstop for the narrow window where a sender closes between return and send; Full (a live consumer with a saturated bounded queue) stays an intentional, per-principal-bounded shed-load drop — recoverable via the requester's IPC/SSE timeout — and is distinguished from Closed (a bug, never dropped, now flagged as a security_event if it ever recurs post-respawn). A live 100-wide concurrent-prompt smoke that previously delivered 4/100 with 39 silent channel-closed drops now delivers 95/100 with zero drops. Closes #837.
  • Dispatcher mpsc partitioned per-(CapsuleId, PrincipalKey). Layer 1's bus-side routing fix narrowed the publish surface to PrincipalKey granularity, but the capsule dispatcher continued to key its mpsc consumer queues and chain mutexes on the 3-bucket PrincipalClass enum — so 1000 distinct user-class principals still collapsed onto a single 256-slot queue and the cliff persisted at the dispatcher layer. Each queue is now keyed on the full PrincipalKey (Option<String>) with capacity 64, a 60-second idle-eviction grace that returns the queue map to the working set, and a MAX_DISPATCHER_QUEUES_PER_CAPSULE = 10_000 cap that degrades to a single shared (capsule, None) queue (with audit-logged eviction) on pathological fan-in. chain_locks widens to the same key so chains for distinct principals on the same capsule run concurrently. Closes #813.
  • Gateway SSE migrated to subscribe_topic_routed. The four agent SSE topics (agent.v1.response, agent.v1.stream.delta, agent.v1.session_changed, astrid.v1.elicit) and the audit firehose (astrid.v1.audit.entry) now subscribe via the bus's routed surface (publish-side per-(topic, principal) DRR fairness + byte-budget eviction) instead of the broadcast channel. Eliminates the ~4 s disconnect under N=100 fan-in that surfaced as event:ready followed by silence: a routed subscription naturally fans across every principal that matches the topic — the per-principal DRR machinery provides fairness within the route, so no wildcard-principal API is added or needed. The post-receive session_id filter in agent.rs is retained — session is a payload concern, not addressing. A new GatewayState.gateway_route_uuid (fresh per boot) pairs all gateway SSE routes under capsule="gateway" for telemetry. Accepted regression: there's no explicit Lagged signal from a routed receiver, so silent publish-side eviction replaces the broadcast-channel lag-disconnect; the 5-minute STREAM_TIMEOUT remains the only stalled-stream catch (open follow-up: expose a publish-side eviction signal via drain_lagged()). Closes #813.
  • Native per-(capsule, topic, principal) IPC routing on the kernel event bus. Closes the structural root cause of #813's "concurrency cliff" rather than relying on the per-recv pending-bucket workaround. EventBus gains an internal routes: Arc<parking_lot::RwLock<HashMap<RouteKey, Mutex<RouteEntry>>>> (keyed on (capsule_uuid, topic_pattern, subscription_rep)) populated by a new EventBus::subscribe_topic_routed entrypoint that capsule guests use in place of the broadcast-shaped subscribe_topic. Each RouteEntry owns a demand-allocated HashMap<PrincipalKey, PrincipalQueue> so a bus with 5000 idle principals costs zero per-principal entries; only active principals materialise a sub-queue, ~96 bytes each plus payload bytes. Publish-side fan-out happens AFTER broadcast::send (so a slow routed enqueue can never delay untargeted consumers — kernel_router, admin_router, bus_monitor, connection_tracker stay on the broadcast subscribe path) and applies deficit-round-robin (DRR) drain with a 1 MiB per-subscription byte budget and a 4 KiB quantum floor — 5000 active principals still make per-round progress. Under sustained byte pressure the bucket whose head was enqueued earliest gives up its head message until the new payload fits; streaming response terminators are preserved by construction (they're always the tail of their principal's queue, head-eviction trims the prefix not the tail). Each eviction emits tracing::error!(target: "astrid.audit.ipc", security_event = true, capsule, principal, evicted_topic, …) and bumps astrid_capsule_route_byte_evictions_total{capsule, principal_class}; sustained per-round back pressure surfaces via astrid_capsule_route_quantum_starved_total{capsule, principal_class}. The legacy SubscriptionEntry::pending/principal_order requeue path in crates/astrid-capsule/src/engine/wasm/host/ipc.rs is deleted — routed receivers never see mixed-principal batches in the first place because the demux happens publish-side, not consumer-side. 
Dispatcher gains a sibling chain_locks: HashMap<(CapsuleId, PrincipalClass), Arc<tokio::sync::Mutex<()>>> held across each chain step so cross-class invocations on the same capsule run concurrently while same-class invocations serialise FIFO; the dispatch_single mpsc queue map is keyed on (CapsuleId, PrincipalClass) for the same reason. PrincipalClass is a new bounded enum (System/User/Agent) extracted into crates/astrid-capsule/src/principal_class.rs so dispatcher.rs and the routing demux agree on label cardinality (3 buckets × capsule_count). WIT publish/recv signatures unchanged. Closes #813.
  • Capsule orchestration concurrency cliff resolved (ipc::recv truncation + invoke_interceptor lock-window collapse + per-capsule semaphore wiring). Three independent defects collapsed throughput at ~100 concurrent principals and silently dropped cross-principal IPC traffic. (1) ipc::poll / ipc::recv truncated any mixed-principal batch at the first principal boundary — the tail was dropped, not deferred — because the per-recv principal context is keyed off a single publisher. The drain now partitions the tail into per-principal pending buckets on the subscription resource (cap 64 messages per bucket, 8 distinct principals per subscription, total 512 queued ≈ 0.5x the broadcast capacity; oldest-first drop on overflow, least-recently-pushed bucket eviction on the 9th principal), each drop logged as tracing::error!(target: "astrid.audit.ipc", security_event = true) and counted via a new astrid_capsule_pending_tail_overflow_total{capsule, principal_class} metric (principal_class ∈ system/user/agent — never raw PrincipalId, to keep label cardinality bounded). The next recv/poll round-robin-drains the pending buckets before reading fresh events, so cross-principal traffic surfaces instead of vanishing. (2) invoke_interceptor held a three-phase lock window (SET → drop → CALL → drop → CLEAR) which let a parallel chain dispatch observe another principal's caller_context between SET and CALL — the cross-principal race the 100-wide collapse turned on. The window is now collapsed to a single store.lock() held across SET + CALL + CLEAR, with a ClearOnDrop RAII guard so the per-invocation fields are wiped on every exit path (normal return, early ?, panic-unwind through func.call). Trade-off: chain steps for the same capsule now serialize on Store contention (they already did via the per-capsule mpsc on single-match dispatch); the existing per-capsule semaphore (cap 4) protects burst memory. 
(3) The per-capsule interceptor semaphore was declared but never acquired by the dispatcher; chain and single-match invokes both now acquire_owned() a permit immediately before each invoke_interceptor call, with Err from a closed semaphore treated as "capsule unloading" (debug log + return for chains, continue for the consumer loop so hot-reload doesn't tear down the replacement capsule). Wait time recorded via astrid_capsule_interceptor_permit_wait_seconds_total{capsule} so the cap-of-4 is observable from day one. Companion: the EventReceiver::recv Lagged arm upgrades from warn! to tracing::error!(target: "astrid.bus", security_event = true) to match the audit-pipeline keying used by astrid-capabilities/kernel-router. Companion capsule fix in capsule-react: TurnState::load and load_active_sessions switch from kv::get_json to kv::get_json_opt, treating cold-start Ok(None) as the default instead of an error case, which previously generated ERROR floods on every fresh session. Closes #813.
  • Per-invocation env overlay reaches capsule env::var(...) calls. The gateway's POST /api/capsules/{id}/env/{field} route writes operator-supplied env values to $ASTRID_HOME/home/<principal>/.config/env/<capsule>.env.json — but the kernel's get_config host-fn was reading only self.config (the manifest defaults loaded once at capsule boot from the load-time principal's home). For any principal other than default the env-write route was effectively write-only: an operator setting base_url = http://localhost:1234 on astrid-capsule-openai-compat for a gateway-minted bearer still saw their LLM request hit api.openai.com (the manifest default). The dispatcher now loads <home>/.config/env/<capsule>.env.json into a new HostState::invocation_env_overlay whenever an interceptor is dispatched under a non-load-time principal (and the recv-context installer mirrors the load on every fresh inbound principal in a run-loop subscription); get_config checks the overlay first, falls through to self.config on miss, so the manifest default still wins for keys the operator hasn't overridden. Three regression tests pin overlay-wins-over-default, fall-through-on-miss, and absent-overlay-falls-back-to-default. Defensive size cap of 1 MiB on the env JSON read keeps a misconfigured file from blocking every dispatch on a slow disk read.
  • Caller-context principal preserved across empty inner recvs. Capsules that follow the "receive an event, fan out to plugin hooks, drain responses, finish" pattern (prompt-builder, registry — any #[astrid::run] capsule that nests ipc::recv inside its event-handling code) were silently flipping every post-hook publish to the capsule's load-time principal (default) under any non-default caller. The host's ipc::recv / ipc::poll paths called clear_recv_invocation_context on empty drains, wiping the caller_context that recv had installed when the original event arrived — so when the fan-out timed out and the run loop continued with its follow-up publishes (e.g. session.v1.request.get_messages, prompt_builder.v1.response.assemble), each one stamped default and the kernel's mixed-principal recv-batch security gate truncated downstream consumers. End-to-end effect: every gateway-minted bearer (which always yields a new non-default principal) hit a ready SSE event on POST /api/agent/prompt and then nothing — no delta, no response, no LM Studio traffic at all. The fix has two parts. (a) Empty recv / poll drains no longer touch caller_context — they only update it when a new message arrives via install_recv_invocation_context, so a follow-up publish between recvs continues to stamp the most recently received message's principal. (b) A new interceptor_active flag on HostState shorts the recv-driven install path while an interceptor (#[astrid::interceptor]) is dispatching — the interceptor's caller is owned by WasmEngine::invoke_interceptor, not by recv, so a nested recv inside the handler can no longer rewrite the outer caller. Three unit tests pin the invariants. The now-redundant clear_recv_invocation_context is removed.
  • Event-bus subscribers yield more cooperatively under a storm. EventReceiver::recv filters by topic, and broadcast::recv().await returns buffered items without yielding — so a filtered subscriber draining a backlog of non-matching events could hold its worker for up to 100 synchronous iterations, and on Lagged (the broadcast buffer overran the receiver — a storm) it caught up with no yield at all. The non-matching drain now yields every 32 events (named YIELD_AFTER_SKIPPED), and the Lagged arm yields before catching up. Dampens worker monopolization during an event storm; the storm's root cause is surfaced separately by the bus-activity diagnostics. Closes #805.
  • Connection topic aligned to the WIT contract (client.v1.connect). The connection-tracker work above (and the capsule-cli proxy) used client.v1.connected (past tense), but the WIT client interface (astrid-bus:client@1.0.0, wit/interfaces/client.wit) specifies uplinks publish client.v1.connect on attach. A WIT-conforming uplink publishing the contract topic would not have been counted. The tracker now matches client.v1.connect; the client.v1.disconnect side already matched. Companion: capsule-cli aligns its publish. Closes #793.
  • Connection tracker now recognises client.v1.* topics, not just typed payloads. The kernel's connection tracker matched only the typed IpcPayload::Connect/Disconnect variants — which no capsule can ever produce, because the SDK publish surface is JSON-only (publish/publish_json/publish_as/publish_json_as) and never exposes those variants. So uplink capsules (the CLI proxy) could not populate active_connections, leaving total_connection_count() structurally pinned at zero: the ephemeral idle-shutdown gate and astrid who saw no connections regardless of reality, and the typed Disconnect the CLI client sends over the socket was flattened to RawJson by the proxy and lost. The tracker now classifies connection lifecycle by topic (client.v1.connect / client.v1.disconnect) in addition to the typed payload, via a pure, unit-tested connection_signal() helper; native producers' typed payloads continue to work. Stale comments that described the non-existent wiring are corrected. The capsule-cli companion has the proxy publish those topics carrying the authenticated principal, with stream-close as the authoritative disconnect. Closes #788.
  • Daemon singleton guard: real flock lock + fail-closed socket-path resolution. Two idle daemons were observed running at once. The guard was advisory-only: kernel_socket_path() silently fell back to /tmp/.astrid/run/system.sock when AstridHome::resolve() failed, so two processes with divergent env bound different sockets and coexisted (split-brain), and prepare_socket_path had a connect-probe → bind TOCTOU window with no lock. The kernel now (1) resolves the bind path strictly — a daemon whose ASTRID_HOME can't be resolved refuses to boot rather than using /tmp (fail-closed, matching generate_session_token); and (2) holds an exclusive non-blocking advisory lock (std::fs::File::try_lock) on <run>/system.lock for the process lifetime, so a second daemon fails fast and exits. The OS releases the lock on crash, so a restart is never wedged. Behaviour change: unresolvable ASTRID_HOME is now a fatal boot error. Closes #790.
  • cors_allow_origins is actually wired into the router now. The gateway shipped in v0.7.0 with a cors_allow_origins: Vec<String> field that the router never consumed — tower-http::CorsLayer was on the dep list, but no layer was applied. An operator setting the allowlist saw nothing happen at runtime; browsers fell back to same-origin (which was the correct secure default but not the configured one). routes::build now applies a CorsLayer when the allowlist is non-empty: Access-Control-Allow-Origin (per-request match), …-Allow-Methods (GET/POST/PUT/PATCH/DELETE/OPTIONS), …-Allow-Headers (authorization/content-type/accept), Vary: Origin, and a 1-hour preflight cache. Empty allowlist stays no-CORS (browsers refuse cross-origin) so single-tenant deployments don't grow a Vary header. New GatewayConfig::validate rejects malformed origins (scheme other than http/https, trailing slash, path/query/fragment, unparseable strings, embedded userinfo that browsers strip, and raw IDNs that wouldn't match the Punycode form browsers send) at daemon boot so misconfig fails fast instead of silently no-op-ing. Five new end-to-end tests in crates/astrid-gateway/tests/cors.rs cover preflight accept/reject, actual-request ACAO, empty-allowlist secure default, and per-origin echo for multi-origin allowlists. Closes #771.
  • Symlink-escape defence in copy_capsule_dir (gateway/CLI install path). Previously the install copier dereferenced symlinks via fs::metadata(). A malicious capsule tree could ship evil -> /etc/shadow and have the host secret copied bytewise into the per-capsule directory (which the capsule's WASM sandbox could then read via home://), and a directory symlink would either infinite-loop on an ancestor target or balloon the copy across an unrelated tree. The copier now canonicalizes the source root once at entry, threads it through recursion, refuses directory symlinks outright (npm only produces file symlinks under node_modules/.bin/, so no real use case is lost), and validates that any file symlink resolves inside the canonical root via Path::starts_with. Two new tests pin both threats: copy_capsule_dir_refuses_file_symlink_pointing_outside_root and copy_capsule_dir_refuses_directory_symlink. (#756, Gemini r2)
  • PID-only temp filenames in three install paths were race-prone under sibling tokio tasks. astrid-capsule-install::wasm::write_atomic, astrid-capsule-install::wit::write_atomic, and astrid-gateway::routes::env::write_env_string all used std::process::id() as the temp-file disambiguator. Sibling tasks in the same daemon process share that PID, so two concurrent installs of different content-addressed WASM blobs (or two concurrent env writes from different dashboards) could collide on the same temp path and clobber each other before rename. Each call now uses uuid::Uuid::new_v4().simple() so the temp path is unique even across in-process concurrency. (#756, Gemini r2)
  • POST /api/capsules/{id}/env/{field} was reading the manifest from the wrong directory. The handler looked up Capsule.toml under home.root().join("capsules").join(...) — a path that doesn't exist in the FHS layout (installed manifests live under each principal's home at principal_home(p).capsules_dir()). Every env-write request fell through to a "manifest not found" 500 against a real install. Fixed to use the principal-scoped path resolver. (#756, Gemini r2)
  • write_env was doing synchronous file I/O on the tokio worker thread. At the gateway's measured 5,400 RPS read ceiling, a single slow fsync on the env path would have stalled every other in-flight HTTP request. The secret-store write, the text/select write, and the array append now run inside tokio::task::spawn_blocking so the worker stays free. (#756, Gemini r2)
  • Audit-stream cap probe used the socket-based AdminClient instead of the bus-direct path. caller_holds on the GET /api/events handler dialled the daemon over the Unix socket through astrid-capsule-cli to ask whether the caller held audit:read_all, adding the ~50ms proxy handshake to every SSE open — exactly the latency the rest of the gateway routes were rewritten to skip. Now uses state.admin_client(principal) like every other route, so SSE first-byte time is microseconds. (#756, Gemini r2)
  • POST /api/auth/redeem rate limiter was per-proxy, not per-client, behind a reverse proxy. The redeem path keyed RateLimiter on ConnectInfo<SocketAddr>::ip(), which is the proxy's IP under a typical nginx/Caddy/cloud-LB deployment — one abusive client would lock out every legitimate user globally. New GatewayConfig.trust_forwarded_from: Vec<IpAddr> lists the reverse-proxy IPs the gateway trusts; when the immediate peer is on the list, the limiter resolves the real client from X-Forwarded-For (first hop) then X-Real-IP, falling back to peer. Empty list = no forwarded-header trust (peer IP is used directly), preserving the previous behaviour for direct-internet deployments. Operators must set this when the gateway is behind a proxy; the docstring on the config field calls this out explicitly. (#756, Gemini r2)
  • cargo publish -p astrid escaped the published tarball via a workspace-rooted include_str! path. astrid setup embedded the bundled AppArmor profile with include_str!("../../../../dist/apparmor/astrid") — a path that resolves outside the crate during cargo publish, so packaging the CLI failed the verifier compile (same class as the wit-staging / chrono publish fixes folded into 0.7.0). The profile moved to crates/astrid-cli/apparmor/astrid (content unchanged) and the include is now crate-relative; the workspace-rooted dist/ directory is gone. Closes #765.
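
The trusted-proxy rule from the POST /api/auth/redeem rate-limiter fix above can be sketched as a pure function. This is a minimal illustration, not the gateway's code: the notes name resolve_client_ip and trust_forwarded_from, but the signature, parameter names, and the handling of unparseable headers below are assumptions.

```rust
use std::net::IpAddr;

/// Hypothetical signature: the release notes name `resolve_client_ip`
/// but do not show its parameters.
fn resolve_client_ip(
    peer: IpAddr,
    trusted_proxies: &[IpAddr],
    x_forwarded_for: Option<&str>,
    x_real_ip: Option<&str>,
) -> IpAddr {
    // Forwarded headers are ignored unless the immediate peer is a
    // configured trusted proxy (an empty list trusts nobody).
    if !trusted_proxies.contains(&peer) {
        return peer;
    }
    // First hop of X-Forwarded-For, i.e. the leftmost entry.
    let from_xff = x_forwarded_for
        .and_then(|value| value.split(',').next())
        .and_then(|hop| hop.trim().parse::<IpAddr>().ok());
    // Then X-Real-IP, and finally the peer address itself.
    // Assumption: a header that does not parse as an IP is skipped.
    from_xff
        .or_else(|| x_real_ip.and_then(|value| value.trim().parse::<IpAddr>().ok()))
        .unwrap_or(peer)
}
```

With an empty trusted list the function always returns the peer address, which matches the "no forwarded-header trust" default described for direct-internet deployments.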

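The deficit-round-robin drain mentioned in the per-(capsule, topic, principal) routing entry above can be illustrated with a small standalone sketch. It shows only the textbook DRR step under assumed types: PrincipalQueue's fields, drr_round, and the quantum parameter are hypothetical, and the byte-budget head-eviction described in the notes is not modelled.

```rust
use std::collections::VecDeque;

/// Hypothetical per-principal sub-queue; the real `PrincipalQueue` layout is not shown in the notes.
struct PrincipalQueue {
    principal: String,
    deficit: usize,
    msgs: VecDeque<Vec<u8>>,
}

impl PrincipalQueue {
    fn new(principal: &str, msgs: Vec<Vec<u8>>) -> Self {
        Self { principal: principal.to_string(), deficit: 0, msgs: msgs.into() }
    }
}

/// One DRR round: every non-empty queue earns `quantum` bytes of credit and
/// sends head messages while the credit covers them.
fn drr_round(queues: &mut [PrincipalQueue], quantum: usize) -> Vec<(String, Vec<u8>)> {
    let mut delivered = Vec::new();
    for queue in queues.iter_mut() {
        if queue.msgs.is_empty() {
            // Idle queues do not accumulate credit.
            queue.deficit = 0;
            continue;
        }
        queue.deficit += quantum;
        while let Some(head) = queue.msgs.front() {
            let len = head.len();
            if len > queue.deficit {
                // Not enough credit yet; the remainder carries over to the next round.
                break;
            }
            queue.deficit -= len;
            if let Some(msg) = queue.msgs.pop_front() {
                delivered.push((queue.principal.clone(), msg));
            }
        }
        if queue.msgs.is_empty() {
            queue.deficit = 0;
        }
    }
    delivered
}
```

Because unused credit carries over, a principal whose head message is larger than one quantum still makes progress after enough rounds, while principals with small messages are never blocked behind it.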
Security

  • Capsule audit-feed subscriptions are now scoped to the subscriber's own principal by default. A capsule that declared ipc_subscribe = ["astrid.v1.audit.entry"] (or any astrid.v1.* superset) in its manifest previously received every principal's audit entries — the cross-principal firehose was the default, gated by no capability and guarded by no reserved namespace (check_subscribe_acl matched topics syntactically, and the per-principal bus demux is fairness-only, not access control). An audit-covering subscription is now self-scoped at enqueue to the capsule's load-time owner principal, so a foreign principal's entries never enter the route's byte budget and a noisy co-principal can never head-evict the owner's own entries. The catalogued audit:read_all capability — resolved against the owner's profile and the live group config, never the self-declared manifest, so a capsule cannot grant itself the firehose — lifts the scope back to the full firehose, matching the gateway SSE model. Unscoped subscriptions are byte-identical to before; wildcard supersets are detected via the route-layer topic matcher. No WIT change, no new capability. Refs #850.
  • Persistent host-process spawning now requires an explicit operator allow_persistent sub-grant. astrid:process.spawn-persistent was gated only on the host_process capability plus an authenticated principal, even though the WIT documents it as additionally requiring an operator allow_persistent opt-in — the sub-grant did not exist, so the gate the contract promised was never enforced. CapabilitiesDef gains an allow_persistent manifest field (bool, #[serde(default)], fail-closed): host_process alone grants the ephemeral spawn / spawn-background tiers, while a persistent child — which outlives the pooled instance and, on macOS, has no die-with-parent — additionally requires this opt-in. Without it, spawn-persistent returns capability-denied (audited); the ephemeral tiers are unaffected. The grant is surfaced by enumerate-capabilities through the serde-derived held_names / has, so a capsule can introspect it. No WIT change — the contract already specified the gate; this makes the host conform. Capsules that spawn persistent processes (e.g. astrid-capsule-shell) must add allow_persistent to their manifest [capabilities]. Refs #872.
  • macOS native-subprocess sandbox is no longer silently disabled on macOS 15+ (Darwin >= 24). SandboxCommand::wrap — the macOS arm of the host_process spawn path — carried a version guard that returned the spawned subprocess completely unsandboxed on every current Mac (Darwin 24 = macOS 15 Sequoia, Darwin 25 = macOS 26), logging only a tracing::warn!. The stated premise (that sandbox-exec is deprecated and therefore unusable on macOS 15+) was wrong: sandbox-exec is deprecated but still enforces on current macOS. The SIGABRT that originally motivated the guard (introduced in #603) was a fail-closed profile defect, not an OS incompatibility — wrap's inline profile was a stale duplicate that omitted the (allow file-read* (literal "/")) rule a dynamically-linked binary (e.g. node) needs to stat the filesystem root at startup, so Seatbelt correctly aborted the process and the guard turned that fail-closed signal into a silent fail-open passthrough. On host_process-capable capsules (astrid-capsule-shell) every native subprocess then inherited the host user's full filesystem reach (~/.ssh, dotfiles, arbitrary writes). The version guard and the stale inline profile are both removed; wrap's macOS arm now routes through build_seatbelt_prefix, the single profile that already carries (literal "/") + (allow mach*) (added for Node.js in #534) and that the MCP spawn path has been running on macOS 15+ all along — so both spawn paths share one profile and containment no longer depends on which path a capsule happens to use. If sandbox-exec genuinely cannot run, the subprocess spawn now fails rather than launching unsandboxed. A macOS-only test spawns a real node under the generated profile and asserts it runs, with a contrast proving the same profile without (literal "/") fails closed. Closes #855.
  • POST /api/auth/pair-device/redeem is now per-IP rate-limited. The device-pairing redeem route is unauthenticated and public (the pair-token is the auth), but unlike its sibling POST /api/auth/redeem it had no brute-force fence in front of the kernel's constant-time pair-token scan — an attacker could enumerate pair-tokens as fast as the network allowed. It now runs the same per-IP throttle as invite-redeem, deliberately sharing the one redeem_limiter so the per-IP budget is spent across both unauthenticated redeem routes and cannot be dodged by alternating between them. X-Forwarded-For is honoured only behind a configured trusted proxy (resolve_client_ip), and the 429 (with retry_after_secs) is now part of the route's documented OpenAPI contract. A new router test pre-seeds the shared limiter and asserts a pair-redeem from that IP is rejected before it ever reaches the daemon. Surfaced by the CLI/API sysadmin parity audit.
  • Failed token redeems now record a failure-outcome audit row. InviteRedeem and PairDeviceRedeem bypass the capability preamble (the token is the auth, since the redeemer's principal does not exist yet), so the kernel admin dispatcher special-cased them — but it stamped the audit row AuthorizationProof::System + AuditOutcome::success before dispatching the handler. A redeem the handler then rejected (invalid / expired / consumed / forged token, or an internal store error) still left a success row, so brute-force / forged-token attempts were invisible in the audit log itself and surfaced only as a tracing warn!(security_event=true). The dispatcher now records the audit row after dispatch with the real outcome — a rejected token writes Denied + Failure(reason), a mint writes System + Success — which is the "allow OR deny" the surrounding comment always promised. The decision is a small pure helper (redeem_audit_proof) with a unit test pinning the failure-on-Error mapping. A security team relying on audit rows (not tracing) can now detect token brute-forcing. No WIT/contract change. Surfaced by the CLI/API sysadmin parity audit.
  • self:agent:list no longer leaks the full principal roster. AgentList always resolves to AuthorityScope::Self_, so the required capability is self:agent:list — which the agent builtin holds via self:*. That lowering is deliberate (it lets an agent resolve its own group-inherited caps without being handed the admin-tier agent:list), but the kernel handler returned every principal's profile regardless of the caller — an information-disclosure: any ordinary agent could enumerate every other principal's id, groups, grants and revokes via GET /api/sys/principals. The gateway already documented the intended behaviour ("operators with agent:list see everyone; an agent with self:agent:list sees only themselves; the kernel filters server-side") — the kernel just never implemented it. agent_list now filters to the caller's own row unless the caller also holds the global agent:list capability; self:* does not match agent:list, so self-scoped callers are correctly narrowed. Both halves are required for the full roster — the AgentList preamble independently requires self:agent:list, which a bare global agent:list grant does not satisfy (the grammar does not make a global cap imply its self-scoped form) — so in practice only the admin group's * (matching both) sees everyone. Fail-secure: an unresolvable caller profile yields self-only. The CLI is unaffected (its default principal is admin-seeded). GroupList is intentionally left a full read — it is system config, not per-principal data, and an agent needs it to resolve group→capability inheritance. Behavioural change for API callers holding only self:agent:list. Surfaced by the CLI/API sysadmin parity audit.
  • Host random-bytes now sources from the OS CSPRNG. The astrid:sys/host.random-bytes implementation filled buffers from rand::thread_rng() — a userspace ChaCha CSPRNG seeded from OS entropy — while the WIT contract and the Rust/JS SDKs all advertise the bytes as coming from "the host's OS-level CSPRNG". It now fills from rand::rngs::OsRng so the implementation matches the documented guarantee, and uses try_fill_bytes (not fill_bytes) so a practically-impossible entropy-source failure fails secure as error-code::unknown(...) instead of panicking inside the host call. No WIT/contract change required. Closes #799.
  • Security-response-headers stack applied to every gateway response. The gateway returns JSON, SSE, plain text, and Prometheus — never HTML — but every response now carries X-Content-Type-Options: nosniff, X-Frame-Options: DENY, Referrer-Policy: no-referrer, and Content-Security-Policy: default-src 'none'; frame-ancestors 'none'. The headers are if_not_present so a handler that intentionally sets one wins; defaults fill in everywhere else. Defends against MIME-confusion (nosniff), clickjacking against any HTML that lands in the surface later (DENY + CSP frame-ancestors), and Referer leakage of principal-id-bearing URLs to third-party origins (no-referrer). Two new e2e tests pin headers on success and 401 paths. Closes #771.
  • Bearer revocation on principal delete. Session bearers shipped in v0.7.0 as 8-hour ed25519-signed tokens with no revocation mechanism — an admin who deleted a compromised principal still had to wait out the bearer lifetime (or rotate the gateway signing key, which logs out every other user too). The gateway now subscribes to the kernel's audit-event firehose (astrid.v1.audit.entry); when an AgentDelete admin op records a success outcome, the target principal is added to a per-gateway revoked_at map (persisted atomically to $ASTRID_HOME/etc/gateway-revocations.json) and every bearer with an iat at-or-before the recorded epoch is rejected by verify_bearer. Survives daemon restart. Resilient to clock skew between the kernel and the gateway because the timestamp baked into the revocation entry comes from the kernel's own audit envelope, not gateway-local wall clock. Closes #772.
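
The revocation rule in the last entry (reject every bearer whose iat is at or before the recorded epoch) can be sketched as follows. This is an assumption-laden illustration: RevocationMap, record_delete, and is_revoked are hypothetical names, timestamps are plain u64 seconds, and keeping the latest epoch on repeated deletes is a choice made here, not something the notes state. Persistence to disk is omitted.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the gateway's per-principal `revoked_at` map.
struct RevocationMap {
    revoked_at: HashMap<String, u64>,
}

impl RevocationMap {
    fn new() -> Self {
        Self { revoked_at: HashMap::new() }
    }

    /// Record a successful delete using the timestamp from the kernel's audit
    /// envelope, not the gateway's own clock.
    fn record_delete(&mut self, principal: &str, kernel_epoch: u64) {
        let entry = self
            .revoked_at
            .entry(principal.to_string())
            .or_insert(kernel_epoch);
        // Assumption: a later delete only ever moves the cutoff forward.
        if kernel_epoch > *entry {
            *entry = kernel_epoch;
        }
    }

    /// A bearer is rejected when it was issued at or before the recorded epoch.
    fn is_revoked(&self, principal: &str, iat: u64) -> bool {
        self.revoked_at
            .get(principal)
            .map_or(false, |&epoch| iat <= epoch)
    }
}
```

A bearer issued after the recorded epoch, for example to a principal re-created under the same id, passes the check, while every bearer minted before the delete is rejected.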

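The failure-outcome audit mapping described in the failed-token-redeems entry above amounts to deciding the audit row from the handler's result after dispatch. A minimal sketch, assuming simplified stand-in types: the enum shapes and the Result<(), String> parameter are hypothetical, and only the name redeem_audit_proof and the two outcome pairings come from the notes.

```rust
/// Simplified stand-ins for the kernel's audit types (shapes assumed).
#[derive(Debug, PartialEq)]
enum AuthorizationProof {
    System,
    Denied,
}

#[derive(Debug, PartialEq)]
enum AuditOutcome {
    Success,
    Failure(String),
}

/// Decide the audit row from the redeem handler's result, after dispatch.
/// Hypothetical signature: the real helper's input type is not shown.
fn redeem_audit_proof(result: &Result<(), String>) -> (AuthorizationProof, AuditOutcome) {
    match result {
        // A successful mint is recorded as System + Success.
        Ok(()) => (AuthorizationProof::System, AuditOutcome::Success),
        // A rejected token (invalid, expired, consumed, forged) or a store
        // error is recorded as Denied + Failure(reason).
        Err(reason) => (
            AuthorizationProof::Denied,
            AuditOutcome::Failure(reason.clone()),
        ),
    }
}
```

Keeping the decision in a pure function is what makes the failure-on-error mapping easy to pin with a unit test, as the entry describes.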
Install

From source (requires Rust 1.95+):

cargo install astrid

Pre-built binaries:
Download the archive for your platform, extract, and add to PATH:

tar xzf astrid-*-$(uname -m)-*.tar.gz
sudo mv astrid-*/astrid astrid-*/astrid-daemon astrid-*/astrid-build astrid-*/astrid-emit /usr/local/bin/

Then run astrid init to set up capsules.


With many thanks to the following Astrinauts 🚀

  • Joshua J. Bouw