
Tighten restore and recover safety#600

Merged
justinmoon merged 17 commits into master from wt-dvlzikgo on Mar 11, 2026

Conversation


@justinmoon justinmoon commented Mar 11, 2026

Summary

  • Add admin restore from backup flow
  • Reject reset without a current environment
  • Tighten reset confirmation and backup metadata
  • Add backup visibility and reset safety
  • Add OpenClaw launch path
  • Fix customer test borrow regressions
  • Lock inflight dashboard recovery actions
  • Fix inflight provision convergence
  • Add managed environment activity timeline
  • Clarify recovery fallback semantics
  • Fix customer control-loop safety and recovery
  • Add customer managed OpenClaw dashboard
  • Make OpenClaw agent runtime deterministic

Test plan

  • CI passes (pre-merge checks)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • VM restore-from-backup: recover VMs to prior backup states via a new restore API and manager flow
    • Admin dashboard: shows restore readiness and an action column to review/trigger restores
    • Restore confirmation UI/workflow with security checks and audit events
    • Restore helper and runtime control enhancements for safer restore/rollback
  • Tests

    • Expanded coverage for restore flows, error handling, contracts, and UI paths


coderabbitai bot commented Mar 11, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

Adds end-to-end VM restore: client methods to call spawner restore API, server admin UI and CSRF/ticket flow, agent-side orchestration with events and DB updates, and spawner restore implementation plus helper script and config changes.

Changes

Cohort / File(s) Summary
VM Restore Client
crates/pika-agent-microvm/src/lib.rs
Adds restore_vm and restore_vm_with_request_id to POST /vms/{vm_id}/restore, propagating optional x-request-id, surfacing upstream status/request_id/body, and decoding VmResponse on success.
Admin Restore UI & Handlers
crates/pika-server/src/admin.rs, crates/pika-server/templates/admin/dashboard.html, crates/pika-server/templates/admin/restore_confirm.html
Implements restore confirmation ticket flow, CSRF verification, restore_confirm_page and restore_from_backup endpoints, dashboard action column, and restore confirmation template.
Admin Route Registration
crates/pika-server/src/main.rs
Re-exports and registers GET /admin/environments/:npub/restore/confirm and POST /admin/environments/:npub/restore.
Agent Restore Orchestration
crates/pika-server/src/agent_api.rs
Adds restore_managed_environment_from_backup (pub(crate)) implementing restore orchestration with event logging (requested/succeeded/failed) and a mark_agent_errored_preserving_vm helper.
Customer Tests
crates/pika-server/src/customer.rs
Updates test expectations for dashboard recovery messaging.
VM Spawner Handler & Config
crates/vm-spawner/src/main.rs, crates/vm-spawner/src/config.rs, crates/vm-spawner/Cargo.toml
Adds POST /vms/:id/restore handler, extends Config with gcroots_dir and restore_cmd (env-driven), and moves tempfile to runtime dependencies.
VM Restore Manager Implementation
crates/vm-spawner/src/manager.rs
Adds restore method, run_restore_helper, stop_vm_runtime, RestoreHelperResult struct, constant RESTORE_HELPER_RESULT_SCHEMA_V1, uses gcroots_dir for gcroot paths, and integrates restore/rollback/restart flows; updates tests.
Restore Script & Nix
infra/nix/modules/builder.nix
Enhances microvm-home-restore script: --json, --target-root, --host, validation, JSON output; adds jq to system packages and switches restic host resolution.
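The request-id propagation mentioned in the VM Restore Client row above amounts to conditionally attaching an `x-request-id` header to the outgoing call. A minimal std-only sketch of that pattern; the names here are illustrative (the real `with_request_id` operates on a reqwest `RequestBuilder`, not a plain header map):

```rust
use std::collections::HashMap;

/// Conditionally attach an x-request-id header so the spawner can
/// correlate a restore call with the admin request that triggered it.
fn with_request_id(
    mut headers: HashMap<String, String>,
    request_id: Option<&str>,
) -> HashMap<String, String> {
    if let Some(id) = request_id {
        headers.insert("x-request-id".to_string(), id.to_string());
    }
    headers
}

fn main() {
    let tagged = with_request_id(HashMap::new(), Some("req-123"));
    assert_eq!(tagged.get("x-request-id").map(String::as_str), Some("req-123"));
    // Absent request id: no header is added at all.
    assert!(with_request_id(HashMap::new(), None).is_empty());
}
```

Echoing the id unchanged through every hop is what lets the restore events in the database be joined back to a single admin action.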

Sequence Diagram

sequenceDiagram
    participant Admin as Admin UI
    participant Server as Pika Server
    participant AgentAPI as Agent API
    participant Spawner as VM Spawner
    participant Database as Database

    Admin->>Server: POST /admin/environments/:npub/restore\n(csrf_token, confirmation_token)
    Server->>Server: verify_restore_confirmation_ticket(request token)
    Server->>AgentAPI: restore_managed_environment_from_backup(owner_npub, request_id, params)
    AgentAPI->>Database: record EVENT_RESTORE_REQUESTED
    AgentAPI->>Spawner: POST /vms/{vm_id}/restore (propagate x-request-id)
    alt Spawner success
        Spawner->>Spawner: stop_vm_runtime -> run_restore_helper -> restart VM
        Spawner-->>AgentAPI: 200 + VmResponse
        AgentAPI->>Database: record EVENT_RESTORE_SUCCEEDED
        AgentAPI-->>Server: success
        Server-->>Admin: redirect success
    else Spawner failure
        Spawner-->>AgentAPI: non-2xx (status, x-request-id, body)
        AgentAPI->>Database: mark agent errored, record EVENT_RESTORE_FAILED
        AgentAPI-->>Server: error
        Server-->>Admin: error response
    end
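The event flow in the diagram is a small state machine: every restore records a requested event first, then exactly one terminal event chosen by the spawner's response. A hedged std-only sketch (the enum and helper are illustrative, not the server's actual event schema):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum RestoreEvent {
    Requested,
    Succeeded,
    Failed,
}

/// Map the spawner's HTTP status to the terminal event the agent API
/// records, mirroring the alt branches in the sequence diagram.
fn terminal_event(spawner_status: u16) -> RestoreEvent {
    if (200..300).contains(&spawner_status) {
        RestoreEvent::Succeeded
    } else {
        RestoreEvent::Failed
    }
}

fn main() {
    // Every restore records Requested first, then exactly one terminal event.
    let timeline = [RestoreEvent::Requested, terminal_event(200)];
    assert_eq!(timeline, [RestoreEvent::Requested, RestoreEvent::Succeeded]);
    assert_eq!(terminal_event(502), RestoreEvent::Failed);
}
```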

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐰 Hop, I scurried through the bytes and sod,
I stitched a token, stamped the rod,
A POST, a helper, a restarted hum—
Back comes the VM, drum by drum. 🥕

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 38.28%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them to satisfy the coverage threshold.

✅ Passed checks (2)

  • Description Check (✅ Passed): check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check (✅ Passed): the title 'Tighten restore and recover safety' directly relates to the main changes: restore/recover safety features (restore confirmation flow, validation checks, error handling, safety constraints) across multiple modules.


devin-ai-integration[bot]

This comment was marked as resolved.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (1)
crates/pika-agent-microvm/src/lib.rs (1)

194-225: Consider adding an explicit timeout for restore operations.

The comment notes that durable-home restores can run for several minutes, and no timeout is set on the request. While this may be intentional, an unbounded request could potentially hang indefinitely if the upstream becomes unresponsive.

Consider adding a generous but finite timeout (e.g., 10-15 minutes) to prevent indefinite blocking:

+const RESTORE_VM_TIMEOUT: Duration = Duration::from_secs(600);
+
     pub async fn restore_vm_with_request_id(
         &self,
         vm_id: &str,
         request_id: Option<&str>,
     ) -> anyhow::Result<VmResponse> {
         let url = format!("{}/vms/{vm_id}/restore", self.base_url);
         // Durable-home restores can legitimately run for several minutes.
-        let resp = with_request_id(self.client.post(&url), request_id)
+        let resp = with_request_id(self.client.post(&url).timeout(RESTORE_VM_TIMEOUT), request_id)
             .send()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/pika-agent-microvm/src/lib.rs` around lines 194 - 225, The
restore_vm_with_request_id path currently sends an unbounded POST (via
with_request_id(self.client.post(&url))...) which can hang; add a finite timeout
on the RequestBuilder before .send() (e.g. call
.timeout(Duration::from_secs(600)) or 900) so the request fails after a generous
period, and import std::time::Duration if necessary; ensure you apply the
timeout to the same RequestBuilder returned by with_request_id and keep the
.context and error handling unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cli/src/main.rs`:
- Around line 603-606: The Command::PublishKp branch currently builds the JSON
payload using cli.relay which ignores key-package-specific relay resolution;
change it to call resolve_kp_relays(cli) and use its result for the "relays"
field so --kp-relay and default key-package relays are honored (update the match
arm handling Command::PublishKp to construct the payload from
resolve_kp_relays(cli) instead of cli.relay), and adjust the new test to
exercise --kp-relay / default fallback rather than --relay.

In `@crates/pika-server/src/admin.rs`:
- Around line 658-668: The restore path currently validates the confirmation
token with verify_restore_confirmation_ticket but then immediately calls
restore_managed_environment_from_backup, which allows race conditions; change
the flow to atomically persist a "restore in-flight" marker or consume the
single-use confirmation token before invoking
restore_managed_environment_from_backup. Concretely, after
verify_restore_confirmation_ticket(&state, ...) succeeds, perform an atomic DB
update (or call the existing persistence helper used by agent_api.rs) that (a)
marks the VM/durable home as restore-in-flight or (b) marks the confirmation
token as consumed, and only then call
restore_managed_environment_from_backup(&state, ...); ensure errors from the
persistence step are propagated via map_agent_api_error so duplicate concurrent
POSTs cannot both reach the spawner.
- Around line 643-657: Re-check the backup state before calling
restore_managed_environment_from_backup: after validating vm_id (the block using
load_managed_environment_status and the vm_id unwrap), reload or re-inspect the
managed environment backup status (from the same load_managed_environment_status
result or by calling the appropriate loader) and reject requests whose backup
state is Missing or NotProvisioned with an appropriate error (e.g., return
StatusCode::CONFLICT and a clear message) instead of proceeding to
restore_managed_environment_from_backup; ensure you reference the same
status/row fields you used earlier and keep the existing error shape.

In `@crates/pika-server/src/agent_api.rs`:
- Around line 1535-1549: The transaction result is currently ignored which can
hide failures from mark_agent_errored_preserving_vm or
insert_managed_environment_event; change the code that calls conn.transaction
(replace the `let _ = conn.transaction::<AgentInstance, anyhow::Error, _>(|conn|
{ ... });` pattern) to capture and handle the Result instead—propagate the error
with `?` or map it into an anyhow::Error with context (e.g., using .map_err(|e|
anyhow::anyhow!("restore transaction failed: {}", e))?) so failures roll back
and are returned/upstream-handled; ensure you reference the existing closure and
the functions mark_agent_errored_preserving_vm and
insert_managed_environment_event when implementing the error propagation.

In `@crates/vm-spawner/src/manager.rs`:
- Around line 367-372: The code currently treats failures from
self.backup_status(&vm.id) as "no info" and silently falls back to
self.cfg.host_id; change this to fail closed: if backup_status() returns an
error/None or yields an empty backup_host, do not substitute cfg.host_id but
instead return an error (or propagate None) from the surrounding routine so the
restore is aborted. Locate the backup_host assignment (using
self.backup_status(&vm.id) and self.cfg.host_id) and replace the unwrap_or_else
fallback with explicit handling that returns Err/None with a clear error message
when backup metadata is unreadable or backup_host is empty. Ensure callers
handle the propagated error/None appropriately so no implicit fallback to
cfg.host_id occurs.

In `@ios/UITests/PikaUITests.swift`:
- Around line 110-138: The helper openAgentProvisioningFromChatList currently
returns as soon as the agentMenuItem is tapped even if the "OpenClaw" button
never appears; modify it so it only returns after the OpenClaw option has been
successfully found and tapped (or after a clear next-screen condition is
visible). Concretely, after tapping agentMenuItem, wait for the
app.buttons["OpenClaw"] firstMatch to exist within a short timeout and tap it;
if the button never appears continue the retry loop (or fail after the overall
deadline). Ensure the function checks a stable postcondition (e.g., presence of
the expected next-screen element or confirmation UI) before returning so the
helper reflects a completed provisioning flow.
- Around line 688-715: The testE2E_openclawAgent_createAndReply currently allows
agentApiUrl to be empty and thus silently uses the app's built-in backend;
change the validation to require agentApiUrl just like relays and kpRelays by
checking agentApiUrl.isEmpty and calling XCTFail("Missing env var:
PIKA_AGENT_API_URL (or PIKA_SERVER_URL)") then return when empty, ensuring the
test fails fast; locate the agentApiUrl variable and the app.launchEnvironment
block in the testE2E_openclawAgent_createAndReply function to implement this
check before setting app.launchEnvironment and calling app.launch().
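The admin.rs item above asks that the single-use confirmation token be consumed atomically before the restore call, so two concurrent POSTs cannot both reach the spawner. A std-only sketch of that consume-once pattern; all names are illustrative, and the real flow would persist the marker in the database rather than in memory:

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// In-memory stand-in for a single-use confirmation-token store.
struct TicketStore {
    outstanding: Mutex<HashSet<String>>,
}

impl TicketStore {
    fn new() -> Self {
        Self { outstanding: Mutex::new(HashSet::new()) }
    }

    fn issue(&self, token: &str) {
        self.outstanding.lock().unwrap().insert(token.to_string());
    }

    /// Atomically consume the token: returns true exactly once per
    /// issued token, so only one concurrent restore request can win.
    fn consume(&self, token: &str) -> bool {
        self.outstanding.lock().unwrap().remove(token)
    }
}

fn main() {
    let store = TicketStore::new();
    store.issue("ticket-1");
    assert!(store.consume("ticket-1"));  // first request wins
    assert!(!store.consume("ticket-1")); // duplicate POST is rejected
}
```

The key property is that check and invalidation happen under one lock (one DB transaction in practice), not as separate verify-then-act steps.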


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b9f5f5b0-615f-45b4-bea8-82c1ee3abe01

📥 Commits

Reviewing files that changed from the base of the PR and between c1c46f6 and 761968c.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • flake.lock is excluded by !**/*.lock
📒 Files selected for processing (24)
  • cli/src/main.rs
  • crates/pika-agent-microvm/Cargo.toml
  • crates/pika-agent-microvm/src/lib.rs
  • crates/pika-server/src/admin.rs
  • crates/pika-server/src/agent_api.rs
  • crates/pika-server/src/customer.rs
  • crates/pika-server/src/main.rs
  • crates/pika-server/templates/admin/dashboard.html
  • crates/pika-server/templates/admin/restore_confirm.html
  • crates/pikahut/tests/guardrails.rs
  • crates/vm-spawner/Cargo.toml
  • crates/vm-spawner/src/config.rs
  • crates/vm-spawner/src/main.rs
  • crates/vm-spawner/src/manager.rs
  • flake.nix
  • infra/nix/modules/builder.nix
  • infra/nix/modules/microvm-host.nix
  • ios/Tests/AgentTests.swift
  • ios/Tests/AppManagerTests.swift
  • ios/UITests/PikaUITests.swift
  • just/agent.just
  • scripts/agent-chat-demo.sh
  • scripts/agent-demo.sh
  • scripts/demo-agent-microvm.sh
💤 Files with no reviewable changes (1)
  • ios/Tests/AgentTests.swift

Comment on lines +603 to +606
Command::PublishKp => Ok(serde_json::json!({
    "cmd": "publish_keypackage",
    "relays": cli.relay,
})),

⚠️ Potential issue | 🟠 Major

Use key-package relay resolution for remote publish-kp.

Line 605 currently serializes cli.relay, but local publish-kp uses resolve_kp_relays(cli). In --remote mode that ignores --kp-relay entirely and also drops the default key-package relays when --relay is unset, so key packages can be published to the wrong relay set or none at all. Please build this payload from resolve_kp_relays(cli) instead, and update the new test to exercise --kp-relay / default fallback rather than --relay.

Suggested fix
         Command::PublishKp => Ok(serde_json::json!({
             "cmd": "publish_keypackage",
-            "relays": cli.relay,
+            "relays": resolve_kp_relays(cli),
         })),

Comment on lines +643 to +657
let status = load_managed_environment_status(&state, &owner_npub, &request_context.request_id)
    .await
    .map_err(map_agent_api_error)?;
let Some(row) = status.row.as_ref() else {
    return Err((
        StatusCode::NOT_FOUND,
        "managed environment not found".to_string(),
    ));
};
let Some(vm_id) = row.vm_id.as_deref() else {
    return Err((
        StatusCode::CONFLICT,
        "restore from backup requires a current VM assignment".to_string(),
    ));
};

⚠️ Potential issue | 🟠 Major

Re-check backup eligibility in the POST handler.

This endpoint only revalidates the current vm_id; it never re-checks backup freshness before restoring. If the backup became Missing or NotProvisioned after the confirm page was rendered, the request still flows into restore_managed_environment_from_backup, and that downstream path marks the agent errored on restore failure in crates/pika-server/src/agent_api.rs:1468-1558. Please load backup status again here and reject non-restorable states up front.
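A hedged sketch of the eligibility gate this comment asks for: re-load backup state inside the POST handler and reject non-restorable states before invoking the restore path. The `BackupState` enum and helper below are illustrative stand-ins, not the server's real types:

```rust
#[derive(Debug, PartialEq)]
enum BackupState {
    Ready,
    Missing,
    NotProvisioned,
}

/// Reject restores whose backup became non-restorable after the confirm
/// page was rendered; mirrors the CONFLICT shape used for a missing vm_id.
fn check_restorable(state: &BackupState) -> Result<(), (u16, String)> {
    match state {
        BackupState::Ready => Ok(()),
        BackupState::Missing | BackupState::NotProvisioned => Err((
            409,
            "restore from backup requires a restorable backup".to_string(),
        )),
    }
}

fn main() {
    assert!(check_restorable(&BackupState::Ready).is_ok());
    assert_eq!(check_restorable(&BackupState::Missing).unwrap_err().0, 409);
    assert_eq!(check_restorable(&BackupState::NotProvisioned).unwrap_err().0, 409);
}
```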


Comment on lines +1535 to +1549
if let Ok(mut conn) = state.db_pool.get() {
    let _ = conn.transaction::<AgentInstance, anyhow::Error, _>(|conn| {
        let _ = mark_agent_errored_preserving_vm(conn, &active)
            .map_err(|inner| anyhow::anyhow!(inner.error_code()))?;
        insert_managed_environment_event(
            conn,
            owner_npub,
            Some(&active.agent_id),
            Some(&vm_id),
            EVENT_RESTORE_FAILED,
            "Restore from backup failed. The managed environment was left in error for operator review.",
            Some(request_id),
        )?;
        Ok(active.clone())
    });

⚠️ Potential issue | 🟠 Major

Handle the restore-failure transaction result instead of dropping it.

If either mark_agent_errored_preserving_vm or insert_managed_environment_event fails here, the transaction rolls back and the error is silently discarded. The request still returns 500, but the row can remain ready and restore_failed never gets recorded, which breaks the restore audit trail this path is supposed to preserve.

Suggested direction
 if let Ok(mut conn) = state.db_pool.get() {
-    let _ = conn.transaction::<AgentInstance, anyhow::Error, _>(|conn| {
-        let _ = mark_agent_errored_preserving_vm(conn, &active)
-            .map_err(|inner| anyhow::anyhow!(inner.error_code()))?;
+    if let Err(tx_err) = conn.transaction::<(), anyhow::Error, _>(|conn| {
+        mark_agent_errored_preserving_vm(conn, &active)
+            .map(|_| ())
+            .map_err(|inner| anyhow::anyhow!(inner.error_code()))?;
         insert_managed_environment_event(
             conn,
             owner_npub,
             Some(&active.agent_id),
             Some(&vm_id),
             EVENT_RESTORE_FAILED,
             "Restore from backup failed. The managed environment was left in error for operator review.",
             Some(request_id),
         )?;
-        Ok(active.clone())
-    });
+        Ok(())
+    }) {
+        tracing::error!(
+            request_id = %request_id,
+            owner_npub = %owner_npub,
+            agent_id = %active.agent_id,
+            vm_id = %vm_id,
+            error = %tx_err,
+            "failed to persist restore failure state"
+        );
+    }
 }

Comment on lines +367 to +372
let backup_host = self
    .backup_status(&vm.id)
    .ok()
    .map(|status| status.backup_host)
    .filter(|host| !host.trim().is_empty())
    .unwrap_or_else(|| self.cfg.host_id.clone());

⚠️ Potential issue | 🟠 Major

Fail closed when backup metadata can't tell you which host owns the snapshot.

This turns any backup_status() read/parse failure into “restore from cfg.host_id”. That is unsafe now that backup ownership can legitimately differ from the current host: a metadata regression here can restore from the wrong machine instead of the recorded backup source.

Suggested direction
-        let backup_host = self
-            .backup_status(&vm.id)
-            .ok()
-            .map(|status| status.backup_host)
-            .filter(|host| !host.trim().is_empty())
-            .unwrap_or_else(|| self.cfg.host_id.clone());
+        let backup = self
+            .backup_status(&vm.id)
+            .with_context(|| format!("load backup metadata for vm {id}"))?;
+        anyhow::ensure!(
+            backup.successful_backup_known,
+            "no successful backup is recorded for vm {id}"
+        );
+        let backup_host = backup.backup_host.trim();
+        anyhow::ensure!(
+            !backup_host.is_empty(),
+            "backup metadata is missing a backup host for vm {id}"
+        );
+        let backup_host = backup_host.to_string();

Comment on lines +110 to +138
private func openAgentProvisioningFromChatList(_ app: XCUIApplication, timeout: TimeInterval = 20) {
    let chatsNav = app.navigationBars["Chats"]
    XCTAssertTrue(chatsNav.waitForExistence(timeout: timeout))

    let window = app.windows.element(boundBy: 0)
    let deadline = Date().addingTimeInterval(timeout)

    while Date() < deadline {
        window.coordinate(withNormalizedOffset: CGVector(dx: 0.94, dy: 0.08)).tap()

        let agentMenuItem = app.buttons.matching(identifier: "chatlist_agent").firstMatch
        if agentMenuItem.waitForExistence(timeout: 1.0) {
            agentMenuItem.tap()

            let openclawChoice = app.buttons["OpenClaw"].firstMatch
            if openclawChoice.waitForExistence(timeout: 1.0) {
                openclawChoice.tap()
            }
            return
        }

        // Dismiss the menu before retrying. The agent item appears only after
        // the allowlist probe resolves and the chat list re-renders.
        window.coordinate(withNormalizedOffset: CGVector(dx: 0.10, dy: 0.20)).tap()
        Thread.sleep(forTimeInterval: 0.5)
    }

    XCTFail("Agent menu item did not appear on chat list")
}

⚠️ Potential issue | 🟠 Major

Wait for a real postcondition before returning from the provisioning helper.

Line 128 returns even when Line 124's "OpenClaw" button never appears. That makes this helper succeed before the flow has actually reached a stable next state, so the real failure only shows up later as a long timeout in the test body. Return only after the OpenClaw option is tapped, or after some explicit next-screen condition is visible.


Comment on lines +688 to +715
func testE2E_openclawAgent_createAndReply() throws {
    let env = ProcessInfo.processInfo.environment
    let dotenv = loadDotenv()
    let testNsec = env["PIKA_UI_E2E_NSEC"]
        ?? env["PIKA_TEST_NSEC"]
        ?? dotenv["PIKA_UI_E2E_NSEC"]
        ?? dotenv["PIKA_TEST_NSEC"]
        ?? ""
    let relays = env["PIKA_UI_E2E_RELAYS"] ?? dotenv["PIKA_UI_E2E_RELAYS"] ?? ""
    let kpRelays = env["PIKA_UI_E2E_KP_RELAYS"] ?? dotenv["PIKA_UI_E2E_KP_RELAYS"] ?? ""
    let agentApiUrl = env["PIKA_AGENT_API_URL"]
        ?? env["PIKA_SERVER_URL"]
        ?? dotenv["PIKA_AGENT_API_URL"]
        ?? dotenv["PIKA_SERVER_URL"]
        ?? ""

    if testNsec.isEmpty { XCTFail("Missing env var: PIKA_UI_E2E_NSEC (or PIKA_TEST_NSEC)"); return }
    if relays.isEmpty { XCTFail("Missing env var: PIKA_UI_E2E_RELAYS"); return }
    if kpRelays.isEmpty { XCTFail("Missing env var: PIKA_UI_E2E_KP_RELAYS"); return }

    let app = XCUIApplication()
    app.launchEnvironment["PIKA_UI_TEST_RESET"] = "1"
    app.launchEnvironment["PIKA_RELAY_URLS"] = relays
    app.launchEnvironment["PIKA_KEY_PACKAGE_RELAY_URLS"] = kpRelays
    if !agentApiUrl.isEmpty {
        app.launchEnvironment["PIKA_AGENT_API_URL"] = agentApiUrl
    }
    app.launch()

⚠️ Potential issue | 🟠 Major

Require the agent API endpoint explicitly for this E2E.

Right now this flow silently falls back to whatever backend URL the app was built with when PIKA_AGENT_API_URL / PIKA_SERVER_URL is missing. For a test that provisions real agent state, that reintroduces non-determinism and can hit the wrong environment. Fail fast here the same way the relay config is validated above.

Suggested change
         if testNsec.isEmpty { XCTFail("Missing env var: PIKA_UI_E2E_NSEC (or PIKA_TEST_NSEC)"); return }
         if relays.isEmpty { XCTFail("Missing env var: PIKA_UI_E2E_RELAYS"); return }
         if kpRelays.isEmpty { XCTFail("Missing env var: PIKA_UI_E2E_KP_RELAYS"); return }
+        if agentApiUrl.isEmpty { XCTFail("Missing env var: PIKA_AGENT_API_URL (or PIKA_SERVER_URL)"); return }

         let app = XCUIApplication()
         app.launchEnvironment["PIKA_UI_TEST_RESET"] = "1"
         app.launchEnvironment["PIKA_RELAY_URLS"] = relays
         app.launchEnvironment["PIKA_KEY_PACKAGE_RELAY_URLS"] = kpRelays
-        if !agentApiUrl.isEmpty {
-            app.launchEnvironment["PIKA_AGENT_API_URL"] = agentApiUrl
-        }
+        app.launchEnvironment["PIKA_AGENT_API_URL"] = agentApiUrl
         app.launch()

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
infra/nix/modules/builder.nix (2)

328-329: ⚠️ Potential issue | 🟠 Major

Tighten VM_ID validation to the real microVM ID format.

The current check still accepts . and .., so the derived restore path can escape the intended /var/lib/microvms/<vm>/home subtree. For this flow, please validate against the actual VM ID shape instead of a generic character allowlist.

Suggested fix
-      if [ -z "$VM_ID" ] || [[ "$VM_ID" =~ [^A-Za-z0-9._-] ]]; then
+      if [[ ! "$VM_ID" =~ ^vm-[0-9a-f]{8}$ ]]; then
         echo "invalid vm-id: $VM_ID" >&2
         exit 2
       fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/nix/modules/builder.nix` around lines 328 - 329, Current VM_ID
validation uses a loose allowlist and still permits "." and ".."; tighten it by
replacing the existing check with a strict pattern match against the real
microVM ID shape (e.g. require only alphanumerics, dash and underscore and no
dot) — change the conditional that references VM_ID (the if using [ -z "$VM_ID"
] || [[ "$VM_ID" =~ [^A-Za-z0-9._-] ]]) to first reject empty, then require a
full-match regex like ^[A-Za-z0-9_-]+$ (and explicitly reject "." and ".." if
you keep any dots), so the restore path cannot escape
/var/lib/microvms/<vm>/home.

325-366: ⚠️ Potential issue | 🟠 Major

Validate SNAPSHOT before passing it to restic.

An input like --help or any leading-dash value will be parsed as a CLI option by restic, bypassing snapshot selection. Restrict this to latest or a valid snapshot ID (hex characters only) before invoking restic.

Suggested fix
       VM_ID="$1"
       SNAPSHOT="''${2:-latest}"
       ENV_FILE="${microvmBackupEnvFile}"
+
+      if [[ "$SNAPSHOT" != "latest" && ! "$SNAPSHOT" =~ ^[0-9a-fA-F]+$ ]]; then
+        echo "invalid snapshot: $SNAPSHOT" >&2
+        exit 2
+      fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/nix/modules/builder.nix` around lines 325 - 366, Validate the SNAPSHOT
value before building restore_args and calling restic: ensure SNAPSHOT equals
"latest" or matches a snapshot-id pattern of only hex characters (e.g.,
/^[0-9a-fA-F]+$/); if it does not, print an error like "invalid snapshot" to
stderr and exit with non-zero status. Update the logic around the SNAPSHOT
variable (used where restore_args is set and where restic is invoked) to perform
this check and refuse leading-dash or non-hex values before calling restic.
♻️ Duplicate comments (4)
crates/pika-server/src/admin.rs (2)

643-668: ⚠️ Potential issue | 🟠 Major

Re-check backup freshness in the POST handler.

This reloads the managed-environment row, but it still never reloads backup eligibility. If the backup flipped to Missing or NotProvisioned after the confirm page rendered, the request still reaches restore_managed_environment_from_backup, and that downstream failure path marks the environment errored. Reject non-restorable backup states here before calling the restore path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/pika-server/src/admin.rs` around lines 643 - 668, After reloading the
managed environment via load_managed_environment_status and before calling
restore_managed_environment_from_backup, re-check the backup eligibility/state
on the returned row (e.g., inspect row.backup_status or the relevant field) and
reject the request if it has transitioned to a non-restorable state such as
Missing or NotProvisioned by returning a suitable error (StatusCode::CONFLICT
with a clear message) instead of proceeding; keep the existing
verify_restore_confirmation_ticket and only call
restore_managed_environment_from_backup when the backup state is confirmed
restorable.
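The recheck can be modeled in miniature. This is an in-memory sketch only: `BackupStatus`, `ensure_restorable`, and the `(status code, message)` error shape are hypothetical stand-ins for the real admin types, and the actual handler would run this against the freshly reloaded row before calling the restore path.

```rust
// Hypothetical model of the eligibility guard; not the project's real API.
#[derive(Debug, PartialEq)]
enum BackupStatus {
    Ready,
    Missing,
    NotProvisioned,
}

// Re-evaluate eligibility on the reloaded row, not the state the confirm
// page was rendered against. Non-restorable states map to 409 CONFLICT.
fn ensure_restorable(status: &BackupStatus) -> Result<(), (u16, String)> {
    match status {
        BackupStatus::Ready => Ok(()),
        other => Err((
            409,
            format!("backup is not restorable in state {other:?}"),
        )),
    }
}

fn main() {
    assert!(ensure_restorable(&BackupStatus::Ready).is_ok());
    let err = ensure_restorable(&BackupStatus::Missing).unwrap_err();
    assert_eq!(err.0, 409);
}
```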

658-668: ⚠️ Potential issue | 🔴 Critical

Serialize restore submissions before issuing /restore.

verify_restore_confirmation_ticket is reusable until expiry, and nothing is persisted between that check and restore_managed_environment_from_backup. Two concurrent submits can therefore send duplicate restore requests for the same VM. Claim the restore atomically first, or consume the confirmation token as single-use.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/pika-server/src/admin.rs` around lines 658 - 668, The verification
step (verify_restore_confirmation_ticket) is racy because it doesn't mark the
token as consumed before calling restore_managed_environment_from_backup,
allowing duplicate concurrent restores; modify the flow to atomically
claim/consume the confirmation token (e.g. add and call a
consume_restore_confirmation_ticket or claim_restore_request DB operation that
returns error if already claimed) immediately after verification and before
restore_managed_environment_from_backup, or combine verification+claim into a
single atomic DB operation so only one successful caller proceeds to call
restore_managed_environment_from_backup; update error handling to map the new
"already claimed" result via map_agent_api_error.
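The single-use semantics the comment asks for can be sketched with an in-memory store. `TicketStore` and `consume` are invented names; in the server the same verify-and-consume step would be one atomic DB write (a conditional `UPDATE` or `DELETE`) committed before the restore call, so only the first submit proceeds.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

// In-memory stand-in for single-use confirmation tickets; the real fix
// consumes the ticket inside the existing DB transaction.
struct TicketStore {
    unclaimed: Mutex<HashSet<String>>,
}

impl TicketStore {
    // Atomically verify-and-consume: removal succeeds exactly once per
    // ticket, so a concurrent duplicate submit observes `false`.
    fn consume(&self, ticket: &str) -> bool {
        self.unclaimed.lock().unwrap().remove(ticket)
    }
}

fn main() {
    let store = TicketStore {
        unclaimed: Mutex::new(HashSet::from(["tkt-1".to_string()])),
    };
    assert!(store.consume("tkt-1"));  // first submit wins
    assert!(!store.consume("tkt-1")); // duplicate submit is rejected
}
```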
crates/pika-server/src/agent_api.rs (2)

1475-1523: ⚠️ Potential issue | 🔴 Critical

Persist a restore-in-flight claim before dropping the DB connection.

This function reads the active row, records restore_requested, then calls the spawner with no state transition that excludes a second caller. Concurrent restores can therefore record duplicate requests and issue duplicate /restore commands for the same VM. Mark the environment restore-in-flight in the same transaction, or gate on a single-use restore token before the external call.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/pika-server/src/agent_api.rs` around lines 1475 - 1523, The code
records EVENT_RESTORE_REQUESTED then drops the DB connection before calling
MicrovmSpawnerClient::restore_vm_with_request_id, allowing concurrent callers to
race and issue duplicate restores; update the logic in the handler that calls
load_visible_agent_row, record_managed_environment_event, and
is_inflight_provision_row to persist a "restore-in-flight" flag/token on the
agent row (or atomically insert a single-use restore token) inside the same DB
transaction that records EVENT_RESTORE_REQUESTED (i.e., before drop(conn)), and
have subsequent requests check that flag/token (use is_inflight_provision_row or
a new is_restore_inflight check) and return AgentApiErrorCode::InvalidRequest if
set so only one concurrent restore proceeds to
MicrovmSpawnerClient::restore_vm_with_request_id.
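The exactly-one-winner claim the prompt describes can be illustrated with a compare-and-swap. `Environment`, `try_claim_restore`, and `release_restore` are hypothetical; the server-side equivalent would be a conditional `UPDATE` persisted in the same transaction that records `EVENT_RESTORE_REQUESTED`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// In-memory stand-in for a per-environment restore-in-flight claim.
struct Environment {
    restore_in_flight: AtomicBool,
}

impl Environment {
    // compare_exchange flips false -> true for exactly one caller; every
    // concurrent caller gets Err and must back off before hitting the spawner.
    fn try_claim_restore(&self) -> bool {
        self.restore_in_flight
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    // Clear the claim once the restore completes or fails.
    fn release_restore(&self) {
        self.restore_in_flight.store(false, Ordering::Release);
    }
}

fn main() {
    let env = Environment { restore_in_flight: AtomicBool::new(false) };
    assert!(env.try_claim_restore());
    assert!(!env.try_claim_restore()); // concurrent caller is rejected
    env.release_restore();
    assert!(env.try_claim_restore()); // claimable again after completion
}
```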

1535-1549: ⚠️ Potential issue | 🟠 Major

Don't discard the restore-failure transaction result.

If mark_agent_errored_preserving_vm or insert_managed_environment_event fails here, the transaction rolls back and the error is dropped. The request still returns 500, but the row can stay ready and restore_failed never lands in the activity log. Handle the transaction result and log or propagate the persistence failure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/pika-server/src/agent_api.rs` around lines 1535 - 1549, The
transaction result returned by conn.transaction is currently ignored, so
failures from mark_agent_errored_preserving_vm or
insert_managed_environment_event are dropped; change the code around the
transaction (the conn.transaction call in agent_api.rs that wraps
mark_agent_errored_preserving_vm and insert_managed_environment_event) to
capture its Result, and on Err either log the error with context (including
request_id, vm_id, agent_id) or propagate it (return Err(anyhow::Error)) so the
rollback isn't silently swallowed; ensure you handle the Result from
conn.transaction and do not discard it.
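The shape of the fix is simply to bind the `Result` instead of dropping it. As a minimal sketch, where `persist_restore_failure` is a hypothetical stand-in for the `conn.transaction(...)` call:

```rust
// Stand-in for the failure-path transaction in agent_api.rs.
fn persist_restore_failure(should_fail: bool) -> Result<(), String> {
    if should_fail {
        Err("db write failed".to_string())
    } else {
        Ok(())
    }
}

// Capture the Result instead of `let _ = ...` so a rolled-back
// errored-state write surfaces in the logs rather than vanishing.
fn handle_restore_error(persist_fails: bool) -> Vec<String> {
    let mut log = Vec::new();
    if let Err(err) = persist_restore_failure(persist_fails) {
        log.push(format!("failed to persist restore failure: {err}"));
    }
    log
}

fn main() {
    assert!(handle_restore_error(false).is_empty());
    assert_eq!(handle_restore_error(true).len(), 1);
}
```

In the real handler the log line would carry `request_id`, `vm_id`, and `agent_id` as structured fields, matching the existing tracing style.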
🧹 Nitpick comments (3)
just/checks.just (1)

33-35: Consider adding a tracking issue reference for the skipped test.

The skip comment explains the reasoning well, but without a linked issue (e.g., # TODO(#123): ...), this workaround may be forgotten once the timing flake is no longer top-of-mind. Additionally, verify whether pre-merge-pika-rust (invoked via Nix on Darwin/arm64) applies the same skip—otherwise CI behavior differs across platforms.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@just/checks.just` around lines 33 - 35, The paging regression test is skipped
without a tracking issue; update the skip comment next to the cargo test
invocation for paging_loads_older_messages_in_pages in checks.just to include a
TODO with an issue number (e.g., "# TODO(`#123`): skip due to timing flake - add
tracking issue/PR link") and ensure the comment references the test name;
additionally, verify that the pre-merge-pika-rust invocation (Nix on
Darwin/arm64) honors the same skip and, if it doesn't, modify the
pre-merge-pika-rust invocation or CI config to apply the same "-- --skip
paging_loads_older_messages_in_pages" flag so behavior is consistent across
platforms.
crates/vm-spawner/src/manager.rs (2)

424-436: Log rollback failures for operational visibility.

During rollback on lines 430-434, errors from stop_vm_runtime, remove_path_if_exists, fs::rename, and recreate_prebuilt_vm_with_existing_home are silently discarded. While best-effort semantics are appropriate here, operators need visibility into partial recovery states for debugging.

♻️ Suggested direction
             warn!(
                 vm_id = %id,
                 error = %err,
                 "restore recreate failed; attempting rollback to previous durable home"
             );
-            let _ = self.stop_vm_runtime(&vm.id, &vm.tap_name).await;
+            if let Err(stop_err) = self.stop_vm_runtime(&vm.id, &vm.tap_name).await {
+                warn!(vm_id = %id, error = %stop_err, "rollback: failed to stop vm runtime");
+            }
             if let Some(previous_home_backup) = previous_home_backup.as_ref() {
-                let _ = remove_path_if_exists(&home_dir);
-                let _ = fs::rename(previous_home_backup, &home_dir);
-                let _ = self.recreate_prebuilt_vm_with_existing_home(&vm).await;
+                if let Err(rm_err) = remove_path_if_exists(&home_dir) {
+                    warn!(vm_id = %id, error = %rm_err, "rollback: failed to remove restored home");
+                }
+                if let Err(mv_err) = fs::rename(previous_home_backup, &home_dir) {
+                    warn!(vm_id = %id, error = %mv_err, "rollback: failed to restore previous home");
+                }
+                if let Err(recreate_err) = self.recreate_prebuilt_vm_with_existing_home(&vm).await {
+                    warn!(vm_id = %id, error = %recreate_err, "rollback: failed to recreate vm");
+                }
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/vm-spawner/src/manager.rs` around lines 424 - 436, The rollback path
currently swallows errors from best-effort calls; update it to log failures for
operational visibility while preserving best-effort semantics: capture the
Result of stop_vm_runtime(&vm.id, &vm.tap_name).await,
remove_path_if_exists(&home_dir), fs::rename(previous_home_backup, &home_dir),
and the second recreate_prebuilt_vm_with_existing_home(&vm).await and emit a
warn! (or error!) with context and the error value for each failure (include
vm_id = %id, previous_home_backup and home_dir context where relevant) instead
of discarding via let _ = ... so operators can see partial-recovery failures
without changing the control flow or return value.

385-396: Log recreate failure when restore helper fails.

When run_restore_helper fails, the code attempts to restart the previous environment on line 392, but silently discards any error. If this recreate also fails, the VM is left in an unknown state with no log visibility.

♻️ Suggested direction
             Err(err) => {
                 warn!(
                     vm_id = %id,
                     error = %err,
                     "restore helper failed; attempting to restart previous environment if possible"
                 );
                 if had_existing_home {
-                    let _ = self.recreate_prebuilt_vm_with_existing_home(&vm).await;
+                    if let Err(recreate_err) = self.recreate_prebuilt_vm_with_existing_home(&vm).await {
+                        warn!(vm_id = %id, error = %recreate_err, "failed to restart previous environment after restore helper failure");
+                    }
                 }
                 return Err(err);
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/vm-spawner/src/manager.rs` around lines 385 - 396, When
run_restore_helper fails and you call
recreate_prebuilt_vm_with_existing_home(&vm), don't discard its Result; capture
it and log any failure so we have visibility. Replace the current let _ =
self.recreate_prebuilt_vm_with_existing_home(&vm).await with code that matches
on the Result (e.g., if let Err(e) =
self.recreate_prebuilt_vm_with_existing_home(&vm).await) and emit a log entry
(warn! or error!) including vm_id = %id and error = %e with a clear message like
"recreate_prebuilt_vm_with_existing_home failed after restore helper failure";
keep returning the original err from the restore helper afterwards.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 05fe0ae6-e145-483d-9d8b-4be8f98d91ae

📥 Commits

Reviewing files that changed from the base of the PR and between 761968c and 48e3393.

📒 Files selected for processing (13)
  • crates/pika-agent-microvm/src/lib.rs
  • crates/pika-server/src/admin.rs
  • crates/pika-server/src/agent_api.rs
  • crates/pika-server/src/customer.rs
  • crates/pika-server/src/main.rs
  • crates/pika-server/templates/admin/dashboard.html
  • crates/pika-server/templates/admin/restore_confirm.html
  • crates/vm-spawner/Cargo.toml
  • crates/vm-spawner/src/config.rs
  • crates/vm-spawner/src/main.rs
  • crates/vm-spawner/src/manager.rs
  • infra/nix/modules/builder.nix
  • just/checks.just
🚧 Files skipped from review as they are similar to previous changes (6)
  • crates/vm-spawner/src/config.rs
  • crates/pika-server/templates/admin/dashboard.html
  • crates/vm-spawner/src/main.rs
  • crates/pika-server/templates/admin/restore_confirm.html
  • crates/pika-server/src/customer.rs
  • crates/pika-server/src/main.rs

Comment on lines +198 to +225
    pub async fn restore_vm_with_request_id(
        &self,
        vm_id: &str,
        request_id: Option<&str>,
    ) -> anyhow::Result<VmResponse> {
        let url = format!("{}/vms/{vm_id}/restore", self.base_url);
        // Durable-home restores can legitimately run for several minutes.
        let resp = with_request_id(self.client.post(&url), request_id)
            .send()
            .await
            .context("send restore vm request")?;
        let status = resp.status();
        if !status.is_success() {
            let upstream_request_id = response_request_id(resp.headers());
            let text = resp.text().await.unwrap_or_default();
            anyhow::bail!(
                "{}",
                upstream_error_message(
                    "restore vm",
                    Some(vm_id),
                    status,
                    upstream_request_id.as_deref(),
                    &text
                )
            );
        }
        resp.json().await.context("decode restore vm response")
    }

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In reqwest, if no timeout is configured on the Client or RequestBuilder, does the request wait indefinitely by default?

💡 Result:

Yes—by default reqwest applies no total request timeout. If you don’t set a timeout on the ClientBuilder (or per-request), the request can wait indefinitely (until the server/connection errors out, or something external cancels it). The ClientBuilder::timeout docs explicitly say “Default is no timeout,” and the default config sets timeout: None. (docs.rs)

Note: Older reqwest versions (around v0.8) introduced a 30s default timeout, but current reqwest (e.g., 0.13.x) does not. (docs.rs)

Citations:


🏁 Script executed:

# First, let's read the file around the lines mentioned
cd crates/pika-agent-microvm && wc -l src/lib.rs

Repository: sledtools/pika

Length of output: 73


🏁 Script executed:

# Read the relevant section
cd crates/pika-agent-microvm && sed -n '190,230p' src/lib.rs

Repository: sledtools/pika

Length of output: 1461


🏁 Script executed:

# Search for timeout constants defined in the file
cd crates/pika-agent-microvm && rg "TIMEOUT.*Duration" src/lib.rs -A 1

Repository: sledtools/pika

Length of output: 358


🏁 Script executed:

# Search for other RPC methods to verify they use timeouts
cd crates/pika-agent-microvm && rg "\.timeout\(" src/lib.rs -B 2 -A 2

Repository: sledtools/pika

Length of output: 1053


Give restore calls a bounded timeout.

All the other spawner RPCs here use explicit timeouts, but restore_vm_with_request_id currently lacks one and will inherit reqwest's default no-timeout behavior. A stalled spawner or half-open connection will leave the admin restore request hanging indefinitely. Add a dedicated long restore timeout (suggested 10 minutes) to match the pattern of other RPC methods.

🛠️ Suggested change
const DELETE_VM_TIMEOUT: Duration = Duration::from_secs(30);
const RECOVER_VM_TIMEOUT: Duration = Duration::from_secs(60);
+const RESTORE_VM_TIMEOUT: Duration = Duration::from_secs(10 * 60);
const GET_VM_TIMEOUT: Duration = Duration::from_secs(10);
const GET_VM_BACKUP_STATUS_TIMEOUT: Duration = Duration::from_secs(10);
@@
-        let resp = with_request_id(self.client.post(&url), request_id)
+        let resp = with_request_id(
+            self.client.post(&url).timeout(RESTORE_VM_TIMEOUT),
+            request_id,
+        )
             .send()
             .await
             .context("send restore vm request")?;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/pika-agent-microvm/src/lib.rs` around lines 198 - 225, The
restore_vm_with_request_id function lacks a request timeout and can hang
indefinitely; update the RequestBuilder created from self.client.post(&url) (in
restore_vm_with_request_id) to set a bounded timeout of about 10 minutes (e.g.,
Duration::from_secs(600)) before calling send(). Ensure you import/use
std::time::Duration if needed and follow the same pattern used by other RPC
methods that call .timeout(...) on their RequestBuilder.

Comment on lines +356 to +372
      if [ -z "$TARGET_ROOT" ]; then
        TARGET_ROOT="$(mktemp -d "/tmp/microvm-restore-$VM_ID-XXXXXX")"
      fi
      mkdir -p "$TARGET_ROOT"
      TARGET_ROOT="$(cd "$TARGET_ROOT" && pwd)"

      restore_args=(restore "$SNAPSHOT" --target "$TARGET_ROOT" --include "$INCLUDE_PATH")
      if [ "$SNAPSHOT" = "latest" ]; then
        restore_args+=(--host "$BACKUP_HOST" --tag microvm-home --path "$INCLUDE_PATH")
      fi
      restic "''${restore_args[@]}"

      RESTORED_HOME="$TARGET_ROOT$INCLUDE_PATH"
      if [ ! -d "$RESTORED_HOME" ]; then
        echo "restore did not produce expected durable-home path: $RESTORED_HOME" >&2
        exit 1
      fi

⚠️ Potential issue | 🟠 Major

Refuse restores into a pre-populated target root.

If --target-root points at an existing tree, this can produce a mixed result with stale files from a previous restore or manual contents, and the final -d check still passes. For a recovery path, the target should be empty or freshly created by this script.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/nix/modules/builder.nix` around lines 356 - 372, Detect and refuse
using a pre-populated TARGET_ROOT before running the restic restore: after
resolving TARGET_ROOT (the mktemp fallback and TARGET_ROOT="$(cd "$TARGET_ROOT"
&& pwd)" logic) check whether the path already exists and is non-empty and, if
so, print an error and exit; only proceed with mkdir -p/restore when TARGET_ROOT
is newly created or empty. Reference TARGET_ROOT, restore_args/restic
invocation, and RESTORED_HOME to locate where to add the existence-and-non-empty
check and the early exit.

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (6)
infra/nix/modules/builder.nix (1)

356-360: ⚠️ Potential issue | 🟠 Major

Refuse non-empty --target-root values before restore.

If TARGET_ROOT already contains files, restic restore can leave stale contents behind and the later RESTORED_HOME check still succeeds. Please fail fast unless the directory is newly created or empty.

Suggested fix
       if [ -z "$TARGET_ROOT" ]; then
         TARGET_ROOT="$(mktemp -d "/tmp/microvm-restore-$VM_ID-XXXXXX")"
       fi
       mkdir -p "$TARGET_ROOT"
       TARGET_ROOT="$(cd "$TARGET_ROOT" && pwd)"
+      if [ -n "$(find "$TARGET_ROOT" -mindepth 1 -maxdepth 1 -print -quit)" ]; then
+        echo "refusing to restore into non-empty target root: $TARGET_ROOT" >&2
+        exit 2
+      fi
 
       restore_args=(restore "$SNAPSHOT" --target "$TARGET_ROOT" --include "$INCLUDE_PATH")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/nix/modules/builder.nix` around lines 356 - 360, Ensure restore refuses
a non-empty user-supplied TARGET_ROOT: after normalizing TARGET_ROOT (the
TARGET_ROOT="$(cd "$TARGET_ROOT" && pwd)" block) check whether the directory
pre-existed and contains any entries; if TARGET_ROOT was not created by this
script and is not empty, log an error and exit non‑zero. Use the existing
TARGET_ROOT variable to detect non-empty contents (e.g., test for any file/dir
under TARGET_ROOT) and fail fast before running restic or relying on
RESTORED_HOME.
crates/pika-agent-microvm/src/lib.rs (1)

198-225: ⚠️ Potential issue | 🟠 Major

Bound the restore RPC with a timeout.

This is the only spawner call here without .timeout(...), so a stalled restore can block the admin flow indefinitely.

🛠️ Suggested change
 const DELETE_VM_TIMEOUT: Duration = Duration::from_secs(30);
 const RECOVER_VM_TIMEOUT: Duration = Duration::from_secs(60);
+const RESTORE_VM_TIMEOUT: Duration = Duration::from_secs(10 * 60);
 const GET_VM_TIMEOUT: Duration = Duration::from_secs(10);
 const GET_VM_BACKUP_STATUS_TIMEOUT: Duration = Duration::from_secs(10);
@@
-        let resp = with_request_id(self.client.post(&url), request_id)
+        let resp = with_request_id(
+            self.client.post(&url).timeout(RESTORE_VM_TIMEOUT),
+            request_id,
+        )
             .send()
             .await
             .context("send restore vm request")?;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/pika-agent-microvm/src/lib.rs` around lines 198 - 225, The
restore_vm_with_request_id function currently issues a POST without a
per-request timeout; wrap the RequestBuilder returned by
with_request_id(self.client.post(&url), request_id) with a .timeout(...) to
bound the RPC (e.g., Duration::from_secs(120) or a configurable value) so a
stalled restore cannot block forever; update restore_vm_with_request_id to call
.timeout(...) before .send().await and ensure the chosen timeout is configurable
via the client or settings if needed.
crates/vm-spawner/src/manager.rs (1)

367-372: ⚠️ Potential issue | 🔴 Critical

Fail closed when backup metadata is unreadable or incomplete.

This still turns missing/invalid backup metadata into cfg.host_id, so restore can pull from the current host instead of the recorded snapshot source. Please abort restore unless the backup status proves a successful backup with an explicit host; the new restore tests that assume the cfg.host_id fallback will need to flip with that change as well.

crates/pika-server/src/admin.rs (2)

658-670: ⚠️ Potential issue | 🟠 Major

Consider serializing concurrent restore submissions.

verify_restore_confirmation_ticket is stateless, and the downstream restore flow only checks inflight state before the external restore call. Two concurrent POSTs with the same valid token can both pass validation and send duplicate /restore requests for the same VM. Consider either:

  1. Consuming the confirmation token atomically (single-use token stored in DB)
  2. Marking the environment as restore-in-flight before calling the spawner
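Option 1 can be illustrated with an in-memory stand-in for the database transaction; TokenStore is hypothetical, with the mutex playing the role of the DB's atomicity:

```rust
use std::collections::HashSet;
use std::sync::Mutex;

// Toy model of single-use confirmation tokens (option 1 above): the DB
// transaction that deletes the token on first use is modeled by a mutex,
// so only one of two concurrent submissions can win.
struct TokenStore {
    unused: Mutex<HashSet<String>>,
}

impl TokenStore {
    // Atomically consume a token; a second caller with the same token fails.
    fn consume(&self, token: &str) -> bool {
        self.unused.lock().unwrap().remove(token)
    }
}

fn main() {
    let store = TokenStore {
        unused: Mutex::new(HashSet::from(["tok-1".to_string()])),
    };
    assert!(store.consume("tok-1")); // first restore submission proceeds
    assert!(!store.consume("tok-1")); // duplicate submission is rejected
    println!("duplicate rejected");
}
```

With a real database, the same effect comes from deleting the token row inside the transaction and checking the affected row count before calling the spawner.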


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/pika-server/src/admin.rs` around lines 658 - 670, The current flow
lets two concurrent POSTs pass verify_restore_confirmation_ticket and both call
restore_managed_environment_from_backup, causing duplicate restores; to fix,
make the confirmation token consumption or restore-in-flight mark atomic before
calling restore_managed_environment_from_backup: update the DB in a single
transactional operation (e.g., in the same function or a helper) that verifies
the token and either deletes/flags it as used or sets a "restore_in_flight" flag
on the environment row, returning an error if already consumed or in-flight, and
only then call restore_managed_environment_from_backup; ensure the transaction
is committed before the external spawner call and handle rollback/clearing the
in-flight flag on spawn failure so verify_restore_confirmation_ticket +
restore_managed_environment_from_backup cannot be executed concurrently for the
same VM.

643-657: ⚠️ Potential issue | 🟠 Major

Re-check backup eligibility before proceeding with restore.

The handler validates vm_id exists but doesn't re-verify backup freshness. If the backup became Missing or NotProvisioned after the confirmation page was rendered, the request still proceeds to restore_managed_environment_from_backup. Consider loading backup status here and rejecting non-restorable states (e.g., return StatusCode::CONFLICT with a clear message) before calling the restore workflow.
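A minimal sketch of that gate, with a hypothetical BackupState enum standing in for the loaded status row:

```rust
// Hypothetical gate: re-check backup state right before restore and map
// non-restorable states to 409 Conflict, per the suggestion above.
const CONFLICT: u16 = 409;

#[derive(Debug)]
enum BackupState {
    Fresh,
    Missing,
    NotProvisioned,
}

fn gate_restore(state: &BackupState) -> Result<(), (u16, &'static str)> {
    match state {
        BackupState::Fresh => Ok(()),
        BackupState::Missing | BackupState::NotProvisioned => {
            Err((CONFLICT, "backup is not available for restore"))
        }
    }
}

fn main() {
    assert!(gate_restore(&BackupState::Fresh).is_ok());
    assert_eq!(
        gate_restore(&BackupState::Missing),
        Err((409, "backup is not available for restore"))
    );
    println!("gate ok");
}
```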


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/pika-server/src/admin.rs` around lines 643 - 657, Re-check the
backup's current state after loading the managed environment status and before
calling restore_managed_environment_from_backup: call
load_managed_environment_status (or the appropriate status loader) again or
inspect the freshly-loaded status.row backup field (e.g., row.backup_status /
row.backup_state / row.backup) and if it is Missing, NotProvisioned, or
otherwise not restorable return Err((StatusCode::CONFLICT, "backup is not
available for restore".to_string())) instead of proceeding; keep existing vm_id
check and only call restore_managed_environment_from_backup when the backup
state is valid.
crates/pika-server/src/agent_api.rs (1)

1535-1549: ⚠️ Potential issue | 🟠 Major

Handle the restore-failure transaction result instead of dropping it.

The transaction result is silently discarded with `let _ = conn.transaction(...)`. If mark_agent_errored_preserving_vm or insert_managed_environment_event fails, the transaction rolls back but the error is lost: the agent row may remain in the ready state and restore_failed never gets recorded, breaking the audit trail.

Suggested fix
 if let Ok(mut conn) = state.db_pool.get() {
-    let _ = conn.transaction::<AgentInstance, anyhow::Error, _>(|conn| {
-        let _ = mark_agent_errored_preserving_vm(conn, &active)
-            .map_err(|inner| anyhow::anyhow!(inner.error_code()))?;
+    if let Err(tx_err) = conn.transaction::<(), anyhow::Error, _>(|conn| {
+        mark_agent_errored_preserving_vm(conn, &active)
+            .map(|_| ())
+            .map_err(|inner| anyhow::anyhow!(inner.error_code()))?;
         insert_managed_environment_event(
             conn,
             owner_npub,
             Some(&active.agent_id),
             Some(&vm_id),
             EVENT_RESTORE_FAILED,
             "Restore from backup failed. The managed environment was left in error for operator review.",
             Some(request_id),
         )?;
-        Ok(active.clone())
-    });
+        Ok(())
+    }) {
+        tracing::error!(
+            request_id = %request_id,
+            owner_npub = %owner_npub,
+            agent_id = %active.agent_id,
+            vm_id = %vm_id,
+            error = %tx_err,
+            "failed to persist restore failure state"
+        );
+    }
 }
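The pattern in the suggested fix, capturing the transaction result and logging it instead of discarding it with `let _`, reduces to this toy sketch (run_tx and persist_restore_failure are illustrative, not the real handlers):

```rust
// Stand-in for the DB transaction that records the restore failure.
fn run_tx(fail: bool) -> Result<(), String> {
    if fail {
        Err("insert_managed_environment_event failed".to_string())
    } else {
        Ok(())
    }
}

// Capture and surface the transaction error rather than dropping it.
fn persist_restore_failure(fail: bool) -> Option<String> {
    match run_tx(fail) {
        Ok(()) => None,
        Err(e) => {
            // In the real handler this would be tracing::error! with the
            // request_id, owner_npub, agent_id, and vm_id fields.
            eprintln!("failed to persist restore failure state: {e}");
            Some(e)
        }
    }
}

fn main() {
    assert!(persist_restore_failure(false).is_none());
    assert!(persist_restore_failure(true).is_some());
    println!("tx errors surfaced");
}
```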


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/pika-server/src/agent_api.rs` around lines 1535 - 1549, The
transaction result from state.db_pool.get().transaction(...) is currently
discarded (let _ = ...) so any rollback error from
mark_agent_errored_preserving_vm or insert_managed_environment_event is lost;
change this to capture the result (e.g., let tx_res =
conn.transaction::<AgentInstance, anyhow::Error, _>(|conn| { ... });) and then
handle tx_res by propagating or converting the error (return Err(...) from the
enclosing function or log the error and return an anyhow error) so failures are
not ignored; reference the state.db_pool.get(), conn.transaction,
mark_agent_errored_preserving_vm, insert_managed_environment_event,
EVENT_RESTORE_FAILED and request_id to find the exact block to update.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

In `@infra/nix/modules/builder.nix`:
- Around line 356-360: Ensure restore refuses a non-empty user-supplied
TARGET_ROOT: after normalizing TARGET_ROOT (the TARGET_ROOT="$(cd "$TARGET_ROOT"
&& pwd)" block) check whether the directory pre-existed and contains any
entries; if TARGET_ROOT was not created by this script and is not empty, log an
error and exit non‑zero. Use the existing TARGET_ROOT variable to detect
non-empty contents (e.g., test for any file/dir under TARGET_ROOT) and fail fast
before running restic or relying on RESTORED_HOME.
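The same guard, sketched in Rust rather than the builder.nix shell script (ensure_restore_target_empty is a hypothetical helper): it fails fast when the target pre-exists with contents.

```rust
use std::{fs, io, path::Path};

// Refuse to restore into a TARGET_ROOT that already contains entries,
// mirroring the fail-fast check requested for the shell script.
fn ensure_restore_target_empty(root: &Path) -> io::Result<()> {
    if root.is_dir() && fs::read_dir(root)?.next().is_some() {
        return Err(io::Error::new(
            io::ErrorKind::AlreadyExists,
            format!("refusing restore: {} is not empty", root.display()),
        ));
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("restore-target-demo");
    let _ = fs::remove_dir_all(&dir);
    fs::create_dir_all(&dir)?;
    assert!(ensure_restore_target_empty(&dir).is_ok()); // empty dir is fine
    fs::write(dir.join("leftover"), b"x")?;
    assert!(ensure_restore_target_empty(&dir).is_err()); // non-empty fails fast
    println!("guard ok");
    Ok(())
}
```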

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 995ab3f9-5773-41a6-938b-e205777e6fd1

📥 Commits

Reviewing files that changed from the base of the PR and between 48e3393 and 591bb7f.

📒 Files selected for processing (12)
  • crates/pika-agent-microvm/src/lib.rs
  • crates/pika-server/src/admin.rs
  • crates/pika-server/src/agent_api.rs
  • crates/pika-server/src/customer.rs
  • crates/pika-server/src/main.rs
  • crates/pika-server/templates/admin/dashboard.html
  • crates/pika-server/templates/admin/restore_confirm.html
  • crates/vm-spawner/Cargo.toml
  • crates/vm-spawner/src/config.rs
  • crates/vm-spawner/src/main.rs
  • crates/vm-spawner/src/manager.rs
  • infra/nix/modules/builder.nix
🚧 Files skipped from review as they are similar to previous changes (4)
  • crates/vm-spawner/src/config.rs
  • crates/pika-server/src/customer.rs
  • crates/pika-server/src/main.rs
  • crates/vm-spawner/Cargo.toml

@justinmoon justinmoon merged commit a926877 into master Mar 11, 2026
18 of 20 checks passed
@justinmoon justinmoon restored the wt-dvlzikgo branch March 21, 2026 02:04
@justinmoon justinmoon deleted the wt-dvlzikgo branch March 21, 2026 02:05