Conversation
Worker::run now returns Result<WorkerOutcome> with variants Success, Partial, Cancelled, Timeout, Failed — replaces the prior Result<String> so callers can distinguish a clean exit from a max-segments partial, a wall-clock timeout, a cancellation, or a hard failure without peeking at the result text. Adds CortexConfig.worker_wall_clock_timeout_secs (default 1800s), distinct from the existing supervisor idle-kill bound. Worker::run wraps run_inner in tokio::time::timeout; on expiry the inner future is dropped and we return Timeout, reading segment progress from a shared atomic so the caller knows how far the worker got. spawn_worker_task is now generic over Result<WorkerOutcome>; the OpenCode worker boundary wraps its text result into Success at the spawn site. Cortex pickup_one_ready_task path consolidates into a single Result<WorkerOutcome, WorkerCompletionError> dispatch and splits on success vs failure once. Lays the groundwork for phase 4 to add a Blocked variant for captcha / login-wall detection in the browser tool.
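A minimal sketch of that wrapper shape, assuming simplified signatures (the real `Worker::run` carries the worker's full context; `anyhow`, the enum payloads, and `run_inner`'s shape here are stand-ins):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::Duration;

// Stand-in for the real enum; payload shapes are assumptions.
pub enum WorkerOutcome {
    Success(String),
    Partial { segments_done: u32 },
    Cancelled,
    Timeout { segments_done: u32 },
    Failed(String),
}

pub async fn run(wall_clock_timeout_secs: u64) -> anyhow::Result<WorkerOutcome> {
    let progress = Arc::new(AtomicU32::new(0));
    let inner = run_inner(Arc::clone(&progress));
    match tokio::time::timeout(Duration::from_secs(wall_clock_timeout_secs), inner).await {
        Ok(result) => result,
        // On expiry the inner future has already been dropped; the shared
        // atomic survives, so the caller still learns how far the worker got.
        Err(_elapsed) => Ok(WorkerOutcome::Timeout {
            segments_done: progress.load(Ordering::Relaxed),
        }),
    }
}

async fn run_inner(progress: Arc<AtomicU32>) -> anyhow::Result<WorkerOutcome> {
    // The real loop bumps the counter once per completed segment.
    progress.fetch_add(1, Ordering::Relaxed);
    Ok(WorkerOutcome::Success("result text".into()))
}
```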
Adds WorkerOutcome::Blocked with structured BlockReason (Captcha, LoginWall, RateLimit, FraudDetect, Unknown) and BlockEvidence (final URL + truncated HTML snippet) so a worker that hits an external defense exits cleanly and surfaces enough for parent-app escalation without retry-looping on a dead end. Detection lives in a new pure module (tools/browser_detection.rs) operating on DOM markers — recaptcha / hcaptcha / Cloudflare Turnstile iframe srcs, two-marker Cloudflare interstitial heuristic, login-path final URL with optional same-host gate. Conservative on purpose: a single Cloudflare marker doesn't fire (would false-positive on every CF-hosted site); login-wall on a different host is treated as an intentional federation, not a block. Header-based heuristics deferred until chromiumoxide exposes response headers cleanly. The browser tools (navigate, click) run detection after their action and write to a worker-shared BlockSignal slot; the worker reads the slot at each loop iteration and short-circuits before the next LLM turn. Mirrors the SpacebotHook.outcome_signaled pattern so the worker loop stays the single arbiter of terminal state. Channel-level browser tool servers (cortex / interactive) pass None for the signal — only worker-driven sessions short-circuit.
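A sketch of the conservative marker logic; the marker strings and the reason chosen for the interstitial case are illustrative guesses, not the actual contents of `tools/browser_detection.rs`:

```rust
// Illustrative only: real marker lists live in tools/browser_detection.rs.
pub enum BlockReason {
    Captcha,
    LoginWall,
    RateLimit,
    FraudDetect,
    Unknown,
}

pub fn detect_captcha(html: &str) -> Option<BlockReason> {
    let lower = html.to_lowercase();
    // A single captcha-widget iframe src is unambiguous, so one hit fires.
    for marker in [
        "google.com/recaptcha",
        "hcaptcha.com",
        "challenges.cloudflare.com/turnstile",
    ] {
        if lower.contains(marker) {
            return Some(BlockReason::Captcha);
        }
    }
    // Cloudflare interstitial: require two independent markers, because a
    // single CF marker appears on every CF-hosted site and would
    // false-positive (the reason mapping here is an assumption).
    let mut cf_hits = 0;
    for marker in ["cf-browser-verification", "cf-chl", "checking your browser"] {
        if lower.contains(marker) {
            cf_hits += 1;
        }
    }
    if cf_hits >= 2 {
        return Some(BlockReason::Unknown);
    }
    None
}
```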
Introduces SecretScope { InstanceShared | Agent(id) } orthogonal to the
existing SecretCategory { System | Tool }. Worker subprocess env now
only carries InstanceShared(Tool) ∪ Agent(this_agent, Tool) — a tenant's
worker can no longer printenv another tenant's GH_TOKEN. System secrets
(LLM keys, messaging tokens) stay InstanceShared since their consumers
are singletons.
Storage layout: redb keys are now NUL-delimited "s\x00<NAME>" or
"a\x00<AGENT_ID>\x00<NAME>". Pre-scope keys are detected and rewritten
to InstanceShared on first store open — idempotent, non-destructive.
Existing instances keep working without admin intervention; promotion
of tool secrets to per-agent is an explicit op.
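A sketch of that key layout; the helper names are hypothetical, and detecting legacy keys by the absence of a NUL delimiter is an assumption:

```rust
pub enum SecretScope {
    InstanceShared,
    Agent(String),
}

// "s\x00<NAME>" or "a\x00<AGENT_ID>\x00<NAME>"
fn storage_key(scope: &SecretScope, name: &str) -> Vec<u8> {
    let mut key = Vec::new();
    match scope {
        SecretScope::InstanceShared => key.extend_from_slice(b"s\x00"),
        SecretScope::Agent(id) => {
            key.extend_from_slice(b"a\x00");
            key.extend_from_slice(id.as_bytes());
            key.push(0);
        }
    }
    key.extend_from_slice(name.as_bytes());
    key
}

// Pre-scope keys were bare names; rewriting them as InstanceShared on
// first open stays idempotent because scoped keys always contain a NUL.
fn is_pre_scope_key(key: &[u8]) -> bool {
    !key.contains(&0)
}
```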
Sandbox now owns the agent_id and reads scoped secrets in tool_secrets().
Worker prompt rendering, the scrubber, the SecretSetTool, and the
browser_type secret-resolution path all flow agent_id through. Channel
and cortex tool servers pass None for the optional agent_id — those
contexts fall back to instance-shared only. The browser secret resolver
tries the agent scope first, then shared, so worker-scoped GH_TOKEN can
shadow a shared one without breaking the channel-level fallback.
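A sketch of that resolution order, with stand-in store types:

```rust
pub enum SecretScope {
    InstanceShared,
    Agent(String),
}

pub struct SecretStore;

impl SecretStore {
    // Stand-in signature for the real scoped read.
    pub fn get(&self, _scope: &SecretScope, _name: &str) -> Option<String> {
        None
    }
}

fn resolve_browser_secret(
    store: &SecretStore,
    agent_id: Option<&str>,
    name: &str,
) -> Option<String> {
    if let Some(id) = agent_id {
        // Worker context: an agent-scoped value shadows the shared one.
        if let Some(value) = store.get(&SecretScope::Agent(id.to_string()), name) {
            return Some(value);
        }
    }
    // Channel and cortex tool servers pass None and land here directly.
    store.get(&SecretScope::InstanceShared, name)
}
```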
API surface keeps the existing PUT/DELETE/INFO endpoints as admin-level
InstanceShared writes; list now returns scope per row. Per-agent secret
endpoints + dashboard work follow.
Tests: 3 new — per-agent isolation, shadow-shared precedence, and legacy
key migration. 874/874 lib tests pass.
Adds CortexConfig.cron_default_timeout_secs (Option<u64>, None = system default 1500s). Resolution chain in cron/scheduler is per-job override → per-agent default → DEFAULT_CRON_TIMEOUT_SECS. Lifts the previous hard-coded 1500s out of the scheduler so research-heavy agent classes can raise it without setting timeout_secs on every job, and short-task classes can lower it without a per-job hint.
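The chain is small enough to sketch directly (the function shape is assumed; only the 1500s constant and the Option<u64> field come from the text):

```rust
const DEFAULT_CRON_TIMEOUT_SECS: u64 = 1500;

fn effective_cron_timeout(
    per_job_override: Option<u64>,          // the job's own timeout_secs hint
    cron_default_timeout_secs: Option<u64>, // per-agent CortexConfig field
) -> u64 {
    per_job_override
        .or(cron_default_timeout_secs)
        .unwrap_or(DEFAULT_CRON_TIMEOUT_SECS)
}
```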
Adds CortexMode { Active, Dormant } and lets agents skip every cortex
loop (warmup, tick, association, ready-task pickup) when dormant. Cuts
periodic LLM-backed bulletin generation to zero on idle agents — the
deployment shape this whole branch is aimed at.
Wake substrate (src/agent/wake.rs): a single mpsc-fed WakeManager task
runs at startup with a shared HashMap<AgentId, AgentDeps> registry.
Tools and cron fire wakes by sending the receiving agent's id over
AgentDeps.wake_tx; the manager looks up the receiver and runs
cortex::wake_one (one ready-task pickup pass). Active-mode agents are
also valid wake targets — wake_one races with the spawn_ready_task_loop
harmlessly because task_store.claim_next_ready is transactional.
Trigger sites:
- send_agent_message post-task_store.create
- cron fire path
- POST /api/agents/{id}/wake (admin / debugging)
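A sketch of the wake manager loop described above the trigger list, under stand-in types (the real AgentDeps and wake_one live in src/agent/wake.rs and cortex; the unbounded channel is an assumption):

```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::{mpsc, RwLock};

type AgentId = Arc<str>;

#[derive(Clone)]
pub struct AgentDeps; // stand-in for the real per-agent handle bundle

async fn wake_one(_deps: AgentDeps) {
    // One ready-task pickup pass. Racing the active-mode spawn loop is
    // harmless because task_store.claim_next_ready is transactional.
}

pub fn spawn_wake_manager(
    registry: Arc<RwLock<HashMap<AgentId, AgentDeps>>>,
) -> mpsc::UnboundedSender<AgentId> {
    let (wake_tx, mut wake_rx) = mpsc::unbounded_channel::<AgentId>();
    tokio::spawn(async move {
        while let Some(agent_id) = wake_rx.recv().await {
            // Unknown ids (agent removed between send and recv) drop here.
            let deps = registry.read().await.get(&agent_id).cloned();
            if let Some(deps) = deps {
                wake_one(deps).await;
            }
        }
    });
    wake_tx
}
```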
Memory janitor (src/agent/maintenance.rs): MemoryJanitorConfig.enabled
(default false) toggles a single instance-wide task that walks every
registered agent on interval_secs (default daily) and runs the existing
memory::maintenance machinery. Required when running dormant agents at
scale; opt-in. Idempotent on active-mode agents.
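A sketch of that janitor task with stand-in types; the 600s per-agent bound reflects a review fix that appears later in this thread:

```rust
use std::time::Duration;

pub struct MemoryJanitorConfig {
    pub enabled: bool,      // default false (opt-in)
    pub interval_secs: u64, // default daily
}

pub struct AgentDeps; // stand-in

async fn run_maintenance_for_agent(_deps: &AgentDeps) -> Result<(), String> {
    Ok(()) // would invoke the existing memory::maintenance machinery
}

pub fn spawn_memory_janitor(cfg: MemoryJanitorConfig, agents: Vec<AgentDeps>) {
    if !cfg.enabled {
        return;
    }
    tokio::spawn(async move {
        let mut tick = tokio::time::interval(Duration::from_secs(cfg.interval_secs));
        loop {
            tick.tick().await;
            for deps in &agents {
                // Bound each agent so one hung call can't stall the sweep.
                let bounded = tokio::time::timeout(
                    Duration::from_secs(600),
                    run_maintenance_for_agent(deps),
                );
                match bounded.await {
                    Ok(Ok(())) => {}
                    Ok(Err(error)) => tracing::warn!("janitor pass failed: {}", error),
                    Err(_elapsed) => tracing::warn!("janitor pass timed out"),
                }
            }
        }
    });
}
```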
Deferred from doc plan: bulletin caching with TTL (current minimum
viable wake just kicks pickup; the worker prompt rendering still pulls
the latest bulletin from memory state), messaging/webhook routing-loop
wake hooks (channels still spawn naturally — cortex loops are not on
that path).
Walkthrough
Implements per-agent secret scoping with migration, dormant cortex mode (wake manager + memory janitor), WorkerOutcome with wall-clock timeouts and blocked outcomes, browser block detection/evidence, per-agent cron defaults, and related API/state/tools/sandbox wiring and tests.
Changes
Agentic Backend Readiness (Integrated Feature Delivery)
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/config/load.rs (1)
2639-2663: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Whitelist the new `[memory_janitor]` top-level section.
Adding `memory_janitor` here without updating `KNOWN_TOP_LEVEL_KEYS` means every valid `[memory_janitor]` config will still emit the generic "unknown top-level key" warning from `warn_unknown_config_keys`, which makes the new setting look ignored.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/config/load.rs` around lines 2639 - 2663, The new MemoryJanitorConfig (constructed as memory_janitor in load.rs) was added to Config but not added to the whitelist of known top-level keys, so warn_unknown_config_keys still flags a valid [memory_janitor] section; fix by adding "memory_janitor" to the KNOWN_TOP_LEVEL_KEYS collection (or equivalent constant/vec used by warn_unknown_config_keys) so that the new top-level key is recognized, and ensure any unit tests that verify known keys are updated to include MemoryJanitorConfig if present.

src/agent/worker.rs (1)
841-985: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Follow-up runs never consume `blocked_signal`.
The initial task loop exits with `WorkerOutcome::Blocked`, but the interactive follow-up loop goes straight from tool/browser failures to generic `Failed`. If a resumed or interactive worker hits a captcha/login wall during follow-up, phase 4 regresses to an opaque failure instead of the structured blocked outcome.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/agent/worker.rs` around lines 841 - 985, The follow-up loop currently turns tool/browser/captcha/login stalls into generic failures instead of honoring blocked_signal and returning WorkerOutcome::Blocked; update the error handling in the follow-up retry loop (the match that produces follow_up_result and the Err(error) arms) to check/consume the same blocked_signal used in the main task loop and, when present, set the worker state and outcome to the blocked path (mirroring WorkerOutcome::Blocked behavior) instead of treating it as a normal failure; specifically, when a tool-nudge/captcha/login condition is detected (e.g., via SpacebotHook::is_tool_nudge_reason or the same signal used elsewhere), call the same blocked_signal consumption logic and produce the blocked outcome (and avoid writing a generic failed state) so follow-up runs resume to the structured blocked handling rather than WorkerState::Failed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/design-docs/agentic-backend-readiness.md`:
- Line 11: Typo in the use-case bullet: change the word "pollluted" to
"polluted" in the sentence "Owns its entity's history forever (not pollluted by
other entities' context)"; update that exact bullet text so it reads "Owns its
entity's history forever (not polluted by other entities' context)".
- Around line 125-137: The fenced code block that begins with "```" and contains
the flow sketch starting "trigger fires → cortex::wake(agent_id, trigger) ..."
is missing a language tag; update that opening fence to include the language
"text" (i.e., change ``` to ```text) so markdownlint MD040 is satisfied and
syntax highlighting is preserved for the flow sketch.
In `@src/agent/cortex.rs`:
- Around line 3974-3982: The code treats task_store.update(...) which returns
Result<Option<_>, _> as a simple success/failure; explicitly handle the Ok(None)
case in both persistence branches where task_store.update(...) is awaited (the
lines around the current call to task_store.update(task.task_number,
UpdateTaskInput { status: Some(TaskStatus::Done), .. })) so that Ok(Some(_))
proceeds to emit TaskUpdated, WorkerComplete and delegation notifications but
Ok(None) is treated as “no row updated” and does not emit those events; update
the matching logic in the same blocks that already check Err(_) (and mirror the
timeout-branch matching that handles Ok(Some), Ok(None), Err) to return or log a
no-update path instead of treating Ok(None) as success.
- Around line 3947-3973: The current code flattens all non-success worker
completions into a generic retry path by using map_worker_completion(normalized)
and then requeuing to TaskStatus::Ready; instead, preserve and inspect the
actual WorkerOutcome (do not discard the Ok(outcome) variant in normalized) and
handle WorkerOutcome::Timeout and WorkerOutcome::Blocked specially (e.g., set
distinct TaskStatus variants like TimedOut/Blocked or route to separate
backoff/retry logic) rather than immediately marking them Ready. Concretely:
stop converting Ok(Ok(outcome)) into only a result_text/success boolean
flow—keep the outcome value (from normalized or the original worker_result),
call map_worker_completion only for text/notify but then match on the preserved
WorkerOutcome to choose the correct TaskStatus (handling Timeout and Blocked
separately), while keeping the existing failure handling for
WorkerCompletionError and panics.
In `@src/agent/maintenance.rs`:
- Around line 46-55: The per-agent maintenance loop currently awaits
run_maintenance_for_agent(&deps).await serially with no timeout, so a hung call
can stall the whole janitor; wrap each per-agent call in a Tokio timeout
(tokio::time::timeout) or spawn it as a cancellable task and await with timeout
so a slow/hung agent is aborted and the loop continues; specifically, replace
the direct await of run_maintenance_for_agent(&deps).await in the agents loop
(and the similar section covering lines 60–76) with a timeout-wrapped call that
logs a distinct timeout warning (including agent_id) and continues to the next
agent when the timeout elapses, while still logging other errors from
run_maintenance_for_agent.
- Around line 60-77: run_maintenance_for_agent currently discards the
maintenance report from memory::maintenance::run_maintenance so prunes/merges
never mark knowledge synthesis dirty; change it to capture the maintenance
report result, inspect it for pruning/merge indicators (e.g.,
pruned_count/merged_count or a modified flag returned by run_maintenance), and
when non-zero call into the cortex runtime (via deps.runtime_config.cortex) to
bump or mark knowledge_synthesis_version/knowledge_synthesis_dirty using the
same mechanism the other maintenance path uses so dormant agents see the update.
In `@src/api/secrets.rs`:
- Around line 102-116: list_secrets() currently returns both InstanceShared and
Agent-scoped secrets while put_secret(), delete_secret(), and secret_info() are
hard-coded to SecretScope::shared(), causing inconsistent visibility and
inability to act on agent-scoped rows; either (A) restrict list_secrets() to
only include shared secrets (filter metadata by SecretScope::shared()) so the
API remains shared-only, or (B) thread an explicit scope parameter through the
CRUD handlers (modify the HTTP handlers and signatures for put_secret,
delete_secret, secret_info to accept a scope and use it when calling store.put,
store.delete, store.info) and update SecretListItem/response handling to include
the scope so clients can target agent-scoped secrets. Ensure all usages of
list_secrets, put_secret, delete_secret, and secret_info are adjusted
consistently (including places referenced around the other occurrences) so
listing and mutation use the same scope semantics.
In `@src/config/load.rs`:
- Around line 199-204: The worker_wall_clock_timeout_secs value is currently
accepted as-is so a configured 0 would be allowed; compute the resolved value
(using
overrides.worker_wall_clock_timeout_secs.unwrap_or(defaults.worker_wall_clock_timeout_secs))
and validate it the same way maintenance_interval_secs is validated—ensure the
resolved worker_wall_clock_timeout_secs is >= 1 and return an Err (with a clear
message) from the config loader if it is < 1 before returning Ok(config); update
the resolution logic near the
cron_default_timeout_secs/worker_wall_clock_timeout_secs lines and reference
worker_wall_clock_timeout_secs, overrides, defaults, and the surrounding config
resolution function when making the change.
In `@src/cron/scheduler.rs`:
- Around line 1405-1409: The wake is currently fired too early (using
context.deps.wake_tx and crate::agent::wake::fire_wake) before the cron channel
produces any follow-up tasks, risking the one-shot dormant wake being consumed
before work exists; move the wake invocation out of the current pre-channel
location and either (a) invoke
crate::agent::wake::fire_wake(context.deps.wake_tx, &context.deps.agent_id)
immediately after the cron channel processing completes, or (b) call it from the
specific side-effect creation path that enqueues follow-up tasks so the dormant
wake is delivered only when new work/associations are actually created.
In `@src/main.rs`:
- Around line 2776-2784: The wake_registry created for spawn_wake_manager is
only populated in initialize_agents and becomes stale when agents are
added/removed later; move its ownership to live agent map scope (the same place
the runtime agent HashMap is stored) or add explicit register/unregister hooks
on the wake-manager that are invoked from the main-loop add/remove paths so new
agents call wake_manager.register(agent_id, AgentDeps) and removed agents call
wake_manager.unregister(agent_id); update uses of wake_registry,
spawn_wake_manager, api_state.wake_tx and the add/remove handling in the main
loop to call these hooks so the registry stays in sync with runtime agent
changes.
- Around line 3752-3764: The startup warmup pass must be gated by the same
dormant-mode check so dormant agents don't run warmup; update the code that
schedules run_warmup_once to first read each agent's cortex mode
(agent.deps.runtime_config.cortex.load().mode) and skip calling or scheduling
run_warmup_once when cortex_mode.is_dormant() is true (the same check used in
the loop that skips the recurring cortex loops), ensuring agents iteration
(agents, agent_id, agent) never triggers run_warmup_once for dormant agents.
In `@src/secrets/store.rs`:
- Around line 748-791: The tool_env_vars() two-pass merge currently leaves a
shared secret in place when an agent-scoped secret exists but read_value(scope,
name) fails; change the second-pass behavior so that for each agent-scoped entry
(match SecretScope::Agent { id } if id == agent_id.as_ref()) you either insert
the agent value on success or explicitly remove any existing entry from result
when read_value returns None, ensuring unreadable agent overrides do not fall
back to shared secrets; apply the same change to the other similar merge block
that follows the same pattern (the other tool-secret-merge routine using
list_metadata and read_value).
- Around line 592-624: The delete() method currently performs mutations without
acquiring the shared mutation_guard, allowing race conditions with
enable_encryption(), rotate_key(), and set(); fix it by acquiring and holding
the mutation_guard for the duration of the delete operation (before begin_write
and until after commit) so the delete's write_txn and its removals from
SECRETS_TABLE and METADATA_TABLE cannot interleave with encryption/rotation/set
operations; ensure you use the same mutation_guard used by enable_encryption(),
rotate_key(), and set() so the lock is taken at the start of delete() and
released only after write_txn.commit() completes.
In `@src/tools/browser_detection.rs`:
- Around line 102-130: detect_login_wall currently flags any final_url path
matching login_path_signals as BlockReason::LoginWall when the final host equals
navigation_target_host, which blocks intentional sign-in navigations; update
detect_login_wall to also consider the requested/navigation target path (or
accept an explicit allow_expected_login boolean) so it only classifies as
LoginWall when the navigation target path differs from the final path (i.e., an
unexpected redirect to a login page) or when callers do not mark the login as
expected; use the existing symbols (detect_login_wall, final_url,
navigation_target_host, login_path_signals, BlockReason::LoginWall) to locate
the logic and add the extra comparison or parameter and short-circuit
accordingly.
In `@src/tools/browser.rs`:
- Around line 851-860: When resolving secrets in Browser::get (the block using
self.agent_id, SecretScope::agent and store.get), only fall back to
SecretScope::shared() if the agent-scoped get returns a NotFound error; if
store.get(&SecretScope::agent(...), name) returns any other error (e.g.
decrypt/unreadable), propagate that error wrapped in BrowserError::new instead
of trying the shared scope. Concretely: replace the current unconditional
fall-through logic with a match on the agent-scoped store.get result —
Ok(secret) => Ok(secret), Err(err) => if err indicates NotFound then attempt
store.get(&SecretScope::shared(), name) and map errors into
BrowserError::new(format!(...)), else return
Err(BrowserError::new(format!(...err...))). Ensure you reference store.get,
SecretScope::agent, SecretScope::shared, self.agent_id and BrowserError::new
when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2c73ff7f-40cf-48a8-aeb3-d2dadb34c70f
📒 Files selected for processing (31)
- docs/design-docs/agentic-backend-readiness.md
- src/agent.rs
- src/agent/branch.rs
- src/agent/channel.rs
- src/agent/channel_dispatch.rs
- src/agent/cortex.rs
- src/agent/invariant_harness.rs
- src/agent/maintenance.rs
- src/agent/wake.rs
- src/agent/worker.rs
- src/api/agents.rs
- src/api/providers.rs
- src/api/secrets.rs
- src/api/server.rs
- src/api/state.rs
- src/config/load.rs
- src/config/toml_schema.rs
- src/config/types.rs
- src/cron/scheduler.rs
- src/lib.rs
- src/main.rs
- src/opencode/worker.rs
- src/sandbox.rs
- src/secrets/scrub.rs
- src/secrets/store.rs
- src/tools.rs
- src/tools/browser.rs
- src/tools/browser_detection.rs
- src/tools/secret_set.rs
- src/tools/send_agent_message.rs
- src/tools/spawn_worker.rs
👮 Files not reviewed due to content moderation or server errors (4)
- src/api/agents.rs
- src/config/types.rs
- src/agent/wake.rs
- src/agent/channel_dispatch.rs
Secret isolation:
- delete() now under mutation_guard so it can't interleave with set / enable_encryption / rotate_key
- tool_env_vars: an unreadable agent override explicitly removes any same-named shared fallback instead of silently injecting the wrong tenant's value
- browser_type secret resolver only falls back to shared on NotFound; decryption / read errors surface so the wrong tenant's credential isn't silently typed

Cortex pickup:
- Distinguish Ok(None) from Err in task_store.update — previously a missing row was treated as success and would emit TaskUpdated + WorkerComplete events for a task that wasn't actually updated
- Route Cancelled / Timeout / Blocked to terminal Done instead of requeuing Ready — they would hot-loop on the next pickup pass otherwise (captcha won't fix itself, wall-clock won't unstick)
- Failed and unexpected errors keep the existing requeue-Ready behavior since those may be transient
- New DetachedRouting enum + handle_detached_completion helper consolidates the success / terminal / requeue paths

Wake / dormant mode:
- Cron wake fires after the channel completes (was firing before, so the one-shot wake was consumed before the cron created any work)
- Startup warmup pass gated by dormant-mode check — dormant agents no longer touch model/embedding warmup at boot
- Wake registry moved onto ApiState so runtime agent create/delete paths in api/agents.rs can keep it in sync; previously, agents added at runtime were never wakeable

Worker follow-up:
- Honor blocked_signal during interactive follow-up turns. A captcha hit during a follow-up no longer regresses to a generic Failed — surfaces structured WorkerOutcome::Blocked

Janitor:
- Per-agent tokio::time::timeout (600s) so one hung embedding call doesn't stall the whole instance-wide janitor pass
- Bump knowledge_synthesis_version after prune/merge — mirrors the cortex-loop maintenance path so dormant agents don't get a stale bulletin / synthesis after a janitor sweep

Browser detection:
- detect_login_wall now requires the requested URL and only fires on unexpected redirect to a login path. A worker that intentionally navigates to /login or clicks a Sign-in button is no longer classified as blocked. Cross-host SSO bounces also pass through.

Config validation:
- worker_wall_clock_timeout_secs must be >= 1 (matches existing maintenance_interval_secs validation pattern)
- memory_janitor added to KNOWN_TOP_LEVEL_KEYS so a configured [memory_janitor] section doesn't trigger the unknown-key warning

API:
- list_secrets restricted to InstanceShared until per-agent CRUD endpoints land — matches the scope the existing PUT/DELETE/INFO endpoints can actually operate on

Doc:
- pollluted → polluted
- text language tag on the wake-sequence fenced block
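A sketch of the routing that consolidation describes, using stand-in types (the Partial arm is an assumption, since the notes above don't specify it):

```rust
// Stand-in types; only the routing decisions mirror the notes above.
enum WorkerOutcome {
    Success(String),
    Partial(String),
    Cancelled,
    Timeout,
    Failed(String),
    Blocked(String),
}

struct WorkerCompletionError(String);

enum DetachedRouting {
    Success(String),
    Terminal(String), // mark Done: retrying can't help
    Requeue(String),  // back to Ready: may be transient
}

fn handle_detached_completion(
    outcome: Result<WorkerOutcome, WorkerCompletionError>,
) -> DetachedRouting {
    match outcome {
        Ok(WorkerOutcome::Success(text)) | Ok(WorkerOutcome::Partial(text)) => {
            DetachedRouting::Success(text)
        }
        // A captcha won't solve itself and a wall-clock expiry won't
        // unstick, so requeueing these would hot-loop the next pickup pass.
        Ok(WorkerOutcome::Cancelled) => DetachedRouting::Terminal("cancelled".into()),
        Ok(WorkerOutcome::Timeout) => DetachedRouting::Terminal("wall-clock timeout".into()),
        Ok(WorkerOutcome::Blocked(reason)) => DetachedRouting::Terminal(reason),
        // Failures and unexpected errors may be transient: keep requeueing.
        Ok(WorkerOutcome::Failed(text)) => DetachedRouting::Requeue(text),
        Err(WorkerCompletionError(text)) => DetachedRouting::Requeue(text),
    }
}
```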
```rust
}
WorkerOutcome::Blocked {
    reason,
    url,
```
[🟠 High] [🔵 Bug]
spawn_worker_task now scrubs text fields inside WorkerOutcome, but in the Blocked branch it leaves url untouched and later classify_worker_completion embeds that URL into WorkerComplete.result, which is logged and injected back into channel flow. That creates a credential-leak path when blocked URLs include bearer tokens/session params. Scrub the URL in scrub_outcome before outcome classification so all emitted blocked messages are sanitized.
```rust
// src/agent/channel_dispatch.rs
WorkerOutcome::Blocked {
    reason,
    url,
    mut evidence,
} => {
```

Suggested change:

```diff
-    url,
+    url: url.map(|url| scrub(url))
```
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/design-docs/agentic-backend-readiness.md`:
- Around line 28-29: The doc contains two conflicting current cron default
timeout values: one saying "Today cron timeout is 120s" in the introductory
summary of section "Per-agent cron defaults" and another saying "1500s (25min)".
Decide the correct current default (either 120s or 1500s), replace the incorrect
value so both the summary and the detailed section (§5 / "Per-agent cron
defaults") match, and run a quick search through the document to update any
other mentions (e.g., the detailed paragraph that currently states 1500s) so all
references are consistent.
In `@src/api/agents.rs`:
- Around line 421-437: The wake_agent endpoint currently always reports success;
change it to surface delivery failures by having crate::agent::wake::fire_wake
return a Result (or bool) indicating whether the send actually succeeded (e.g.,
propagate the tokio mpsc send error) and update wake_agent to inspect that
result: call fire_wake(tx, &target), map an OK/send-success to fired: true/"wake
delivered" and return Json(…) with StatusCode::OK, and map a send failure to
fired: false with an explanatory message and an appropriate error StatusCode
(e.g., SERVICE_UNAVAILABLE or INTERNAL_SERVER_ERROR). Update types/signature
references in wake_agent and adjust WakeAgentResponse usage accordingly so the
HTTP response reflects real delivery success.
- Around line 1357-1362: The wake_registry removal is happening too early:
currently the block that constructs key from agent_id and calls
state.wake_registry.write().await.remove(&key) runs before the subsequent
fallible teardown (config read/write and delete) so a failed delete leaves the
agent unwakeable; move that removal block to after all fallible teardown steps
complete successfully (i.e., after the functions that read/write config and
perform the agent deletion finish without error), keeping the same key
construction (let key: crate::AgentId =
std::sync::Arc::from(agent_id.as_str());) and using
state.wake_registry.write().await.remove(&key) only on successful completion.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 94310976-2983-466e-9af9-500efbb35bf1
📒 Files selected for processing (13)
- docs/design-docs/agentic-backend-readiness.md
- src/agent/cortex.rs
- src/agent/maintenance.rs
- src/agent/worker.rs
- src/api/agents.rs
- src/api/secrets.rs
- src/api/state.rs
- src/config/load.rs
- src/cron/scheduler.rs
- src/main.rs
- src/secrets/store.rs
- src/tools/browser.rs
- src/tools/browser_detection.rs
🚧 Files skipped from review as they are similar to previous changes (9)
- src/config/load.rs
- src/agent/cortex.rs
- src/main.rs
- src/cron/scheduler.rs
- src/tools/browser_detection.rs
- src/api/secrets.rs
- src/tools/browser.rs
- src/secrets/store.rs
- src/agent/worker.rs
- scrub_outcome now scrubs Blocked.url and Blocked.evidence.final_url (capy-ai HIGH/Bug). Blocked URLs flow into WorkerComplete.result and channel logs; query params and path segments routinely carry bearer tokens / session ids, so an unsanitized URL was a credential-leak path. Aligns with the existing scrubbing of result text and html_snippet.
- wake_agent endpoint now reports failure honestly: 404 when the agent isn't registered (e.g. typo or torn-down), 503 when the mpsc send fails (manager not receiving). Previously always returned fired:true / "wake delivered" regardless. Response message also changes to "wake queued" since delivery is async.
- delete_agent moves the wake_registry.remove call to after the fallible config.toml write. A failed write previously left the agent alive but unwakeable.
- Doc: line 28 cron-default reference updated 120s → 1500s to match §5 detail.
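A sketch of the resulting endpoint behavior in axum style; the state fields, response type, and handler signature are stand-ins, only the status routing mirrors the fix described above:

```rust
use std::collections::HashMap;
use std::sync::Arc;
use axum::{http::StatusCode, Json};
use serde::Serialize;
use tokio::sync::{mpsc, RwLock};

type AgentId = Arc<str>;

#[derive(Serialize)]
struct WakeAgentResponse {
    fired: bool,
    message: String,
}

struct ApiState {
    wake_registry: Arc<RwLock<HashMap<AgentId, ()>>>, // values elided
    wake_tx: mpsc::UnboundedSender<AgentId>,
}

async fn wake_agent(state: &ApiState, agent_id: &str) -> (StatusCode, Json<WakeAgentResponse>) {
    let key: AgentId = Arc::from(agent_id);
    // 404: unknown or torn-down agent, instead of pretending success.
    if !state.wake_registry.read().await.contains_key(&key) {
        return (
            StatusCode::NOT_FOUND,
            Json(WakeAgentResponse { fired: false, message: "agent not registered".into() }),
        );
    }
    match state.wake_tx.send(key) {
        // Delivery is async, hence "wake queued" rather than "delivered".
        Ok(()) => (
            StatusCode::ACCEPTED,
            Json(WakeAgentResponse { fired: true, message: "wake queued".into() }),
        ),
        // 503: the manager task is not receiving.
        Err(_) => (
            StatusCode::SERVICE_UNAVAILABLE,
            Json(WakeAgentResponse { fired: false, message: "wake manager not running".into() }),
        ),
    }
}
```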
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/api/agents.rs`:
- Around line 415-418: The OpenAPI annotation for wake_agent is missing the 404
response; update the #[utoipa::path] responses for the wake_agent handler to
include a (status = 404, description = "Agent not found" or similar) entry so
the docs match the actual StatusCode::NOT_FOUND returned by the wake_agent
function (check the wake_agent function and where StatusCode::NOT_FOUND is
returned and add the corresponding response tuple to the responses list).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 420ab31e-ffb7-4d15-844d-ef87b739bc92
📒 Files selected for processing (3)
- docs/design-docs/agentic-backend-readiness.md
- src/agent/channel_dispatch.rs
- src/api/agents.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/agent/channel_dispatch.rs
```rust
responses(
    (status = 202, body = WakeAgentResponse),
    (status = 503, description = "Wake manager not running"),
),
```
wake_agent OpenAPI responses are missing the 404 case.
wake_agent returns StatusCode::NOT_FOUND (Line 433), but the #[utoipa::path] response list declares only 202 and 503. This leaves API docs/client generation out of sync.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/api/agents.rs` around lines 415 - 418, The OpenAPI annotation for
wake_agent is missing the 404 response; update the #[utoipa::path] responses for
the wake_agent handler to include a (status = 404, description = "Agent not
found" or similar) entry so the docs match the actual StatusCode::NOT_FOUND
returned by the wake_agent function (check the wake_agent function and where
StatusCode::NOT_FOUND is returned and add the corresponding response tuple to
the responses list).
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/agent/channel_dispatch.rs (1)
1106-1115: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Scrub panic payloads before logging or emitting them.
The panic branch bypasses `scrub_outcome()` and logs `panic_message` raw, then wraps that same raw text into `WorkerComplete.result`. A panic from browser/tool code can easily include HTML, URLs, or echoed secret material, so this reopens the leak path the new scrubber is meant to close. Run `panic_message` through the existing `scrub` closure before logging it or converting it into `WorkerCompletionError::failed`.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/agent/channel_dispatch.rs` around lines 1106 - 1115, The panic path currently converts the panic payload with panic_payload_to_string and logs/returns it raw; instead pass panic_message through the existing scrub closure used elsewhere (the scrub/scrub_outcome logic) before using it in tracing::error and WorkerCompletionError::failed so secrets/HTML/URLs are removed; update the block that sets panic_message (and uses worker_id and WorkerCompletionError::failed) to call the scrub closure on panic_message and use the scrubbed string for both the log field and the error text.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/secrets/store.rs`:
- Around line 1178-1185: import_all() currently calls set() for each ExportEntry
which recreates created_at/updated_at; modify import_all() to preserve
timestamps by either (A) calling a new method set_with_metadata(scope, name,
value, category, created_at, updated_at) that writes both value and metadata
atomically, or (B) after set() immediately update the stored record's
created_at/updated_at using the ExportEntry timestamps (e.g., via an
update_by_scope_name helper) so imported entries keep ExportEntry.created_at and
ExportEntry.updated_at; update or add functions (set_with_metadata or
update_timestamps) and use them in import_all() when importing entries.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 21d2b91e-e8c5-4f0e-a874-3df4a4cbeb4a
📒 Files selected for processing (2)
- src/agent/channel_dispatch.rs
- src/secrets/store.rs
```diff
 for entry in &data.entries {
-    if self.exists(&entry.name) && !overwrite {
-        skipped.push(entry.name.clone());
+    if self.exists(&entry.scope, &entry.name) && !overwrite {
+        skipped.push(format!("{}:{}", entry.scope, entry.name));
         continue;
     }

-    self.set(&entry.name, &entry.value, entry.category)?;
+    self.set(&entry.scope, &entry.name, &entry.value, entry.category)?;
     imported += 1;
```
Preserve exported timestamps during import.
import_all() routes every entry through set(), so created_at and updated_at get regenerated from Utc::now() instead of restoring the values from ExportEntry. That makes backup/restore lossy and changes audit metadata on every import. Persist the imported timestamps after set(), or add an import path that writes both value and metadata together.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/secrets/store.rs` around lines 1178 - 1185, import_all() currently calls
set() for each ExportEntry which recreates created_at/updated_at; modify
import_all() to preserve timestamps by either (A) calling a new method
set_with_metadata(scope, name, value, category, created_at, updated_at) that
writes both value and metadata atomically, or (B) after set() immediately update
the stored record's created_at/updated_at using the ExportEntry timestamps
(e.g., via an update_by_scope_name helper) so imported entries keep
ExportEntry.created_at and ExportEntry.updated_at; update or add functions
(set_with_metadata or update_timestamps) and use them in import_all() when
importing entries.
Branch force-pushed from 1f2ab05 to 7e99156.
Summary
Five surgical changes from `docs/design-docs/agentic-backend-readiness.md` that take spacebot from "single-user instance" to "ready to be the agentic backend of a production multi-tenant application." Each phase is one commit; phases are independent except 3→4 (4 needs the `WorkerOutcome` enum 3 invents).

The deployment shape this enables: thousands of mostly-idle agents on shared infrastructure, each with isolated credentials, woken on external events instead of polling, with predictable per-task latency and clean failure surfacing for things humans need to escalate.
Phase 1 — per-agent secret isolation
`SecretScope { InstanceShared | Agent(id) }` orthogonal to the existing `SecretCategory { System | Tool }`. Worker subprocess env now only carries `InstanceShared(Tool) ∪ Agent(this_agent, Tool)` — a tenant's worker can no longer `printenv` another tenant's `GH_TOKEN`. System secrets (LLM keys, messaging tokens) stay `InstanceShared` since their consumers are singletons. redb keys are NUL-delimited; pre-scope keys auto-migrate to `InstanceShared` on first open. Browser secret resolver tries agent scope, then shared.

Phase 2 — dormant cortex mode

`CortexMode { Active, Dormant }`. Dormant agents skip all four cortex loops (warmup / tick / association / ready-task). External events deliver wakes via an mpsc-fed `WakeManager` — `send_agent_message` post-insert, cron fire, and `POST /api/agents/{id}/wake` (admin / debugging) are the trigger sites. Wake is a one-shot ready-task pickup; idempotent on active-mode agents (the spawn loop and wake race harmlessly). New `MemoryJanitorConfig` (opt-in) replaces the periodic maintenance the dormant cortex no longer runs.

Phase 3 — structured `WorkerOutcome` + wall-clock worker timeout

`Worker::run` returns `Result<WorkerOutcome>` instead of `Result<String>` — variants `Success`, `Partial`, `Cancelled`, `Timeout`, `Failed`, `Blocked` (added in phase 4). Wall-clock budget from new `CortexConfig.worker_wall_clock_timeout_secs` (default 1800s, distinct from the existing supervisor idle-kill `worker_timeout_secs`). `tokio::time::timeout` wraps `run_inner`; on expiry the inner future is dropped and we return `Timeout` with progress read from a shared atomic so the caller knows how far the worker got.
DOM-only detection (
tools/browser_detection.rs) for recaptcha, hcaptcha, Cloudflare Turnstile, Cloudflare interstitial (two-marker heuristic to avoid false-positive on every CF-hosted site), and login-wall final-URL paths with optional same-host gate. The browser tool writes a positive detection into a worker-sharedBlockSignalslot; worker reads it at each loop iteration and short-circuits toWorkerOutcome::Blockedwith structuredBlockReason+BlockEvidence. Header-based heuristics deferred until chromiumoxide exposes response headers cleanly.Phase 5 — per-agent cron defaults
CortexConfig.cron_default_timeout_secs: Option<u64>. Resolution: per-job override → per-agent default → 1500s system default. Lifts the previously hard-coded constant out of the scheduler.What's new at the surface
What's deferred (called out as v1)
- Per-job `worker_wall_clock_timeout_secs` override — system + per-agent default sufficient for v1.
- Header-based heuristics (`Retry-After`, WAF banners) — chromiumoxide doesn't expose response headers cleanly. DOM heuristics cover the high-value path.
- Surfacing `BlockEvidence` (HTML snippet, screenshot) into `ProcessEvent` — text result + worker logs sufficient for v1.

Test plan
- `cargo test --lib` — 874/874 pass (includes 3 new secret-scope tests, 1 new browser-detection test suite of 10, 4 new worker-outcome classification tests)
- Set `cortex.mode = "dormant"` for one agent; confirm no LLM bulletin generation in logs at idle
- `send_agent_message` to a dormant receiver; verify task is picked up immediately (not on next tick)
- `POST /api/agents/{id}/wake` against a dormant agent with a `ready` task
- `browser_navigate` to a known captcha-fronted URL; verify worker exits with `Blocked` instead of looping
- Store `Agent(alice, GH_TOKEN)` and `Agent(bob, GH_TOKEN)` with different values; spawn a worker for each; verify env contains the right one
- Pre-existing `secrets.redb` opens cleanly, all keys readable as `InstanceShared`