Skip to content

Fix agent recovery races and add demo flow#475

Closed
justinmoon wants to merge 1 commit intosledtools:masterfrom
justinmoon:new-agent-button
Closed

Fix agent recovery races and add demo flow#475
justinmoon wants to merge 1 commit intosledtools:masterfrom
justinmoon:new-agent-button

Conversation

@justinmoon
Copy link
Collaborator

@justinmoon justinmoon commented Mar 7, 2026

Summary

  • fix agent recovery/reprovision race handling in pika-server and Rust core
  • simplify native agent button plumbing and regenerate checked-in project/bindings artifacts
  • add a repeatable just agent-demo flow for live reset/recover/chat verification

Validation

  • cargo fmt
  • cargo test -p pika-server agent_api -- --nocapture
  • cargo test -p pika_core agent:: --no-run
  • cargo test -p pika_core create_chat_waits_for_local_key_package_publication -- --nocapture
  • cargo test -p pika_core key_package_publish_failure_clears_pending_direct_chat -- --nocapture
  • cargo test -p pika_core stale_key_package_publish_failure_is_ignored -- --nocapture
  • cargo test -p pika_core allowlist_probe_error_does_not_override_known_state -- --nocapture
  • cargo test -p pika-server prepare_agent_for_reprovision_clears_active_constraint_for_missing_vm_id_row -- --nocapture
  • just ios-xcframework
  • just ios-xcodeproj
  • live CLI smoke against https://api.pikachat.org via just agent-demo

Open with Devin

Summary by CodeRabbit

Release Notes

  • New Features

    • Implemented agent allowlist status tracking with UI integration
    • Enhanced agent recovery mechanism for improved reliability
    • Upgraded agent button appearance with improved visual indicator
    • Added demo utility for agent testing and validation
  • Bug Fixes

    • Improved error handling for agent provisioning failures
    • Better recovery pathways for interrupted agent creation

@coderabbitai
Copy link

coderabbitai bot commented Mar 7, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

Across Rust backend, iOS, and Android platforms, these changes introduce AgentMenuItemState as a unified agent button state representation. Agent button state transitions from computed UI methods to observable AppState. Backend refactoring adds VM-not-found detection, recovery paths, and agent allowlist probing. Key package publishing now uses token-based async coordination to safely defer direct chat creation.

Changes

Cohort / File(s) Summary
Agent State Model Definition
rust/src/state.rs, android/app/src/main/java/com/pika/app/rust/pika_core.kt, ios/Sources/ViewState.swift
Introduces AgentMenuItemState struct with title and is_busy/isBusy fields across all platforms. Adds agent_button: Option<AgentMenuItemState> to AppState. iOS replaces AgentButtonState struct with typealias to AgentMenuItemState.
Backend Agent Provisioning & Recovery
crates/pika-server/src/agent_api.rs
Refactors agent provisioning flow: removes load_recoverable_agent_row, adds mark_agent_errored, prepare_agent_for_reprovision, is_vm_not_found_error, and provision_agent_for_owner. Updates ensure_agent, get_my_agent, and recover_my_agent to handle VM-not-found as recoverable condition with reprovisioning logic.
Client-Side Agent Allowlist & Key Package Flow
rust/src/core/mod.rs, rust/src/core/agent.rs, rust/src/core/session.rs, rust/src/updates.rs
Introduces AgentAllowlistState enum and token-based coordination for agent allowlist probing and key package publishing. Adds new InternalEvent variants: KeyPackagePublished { token, ok, error } and AgentAllowlistResolved { token, pubkey, allowlisted, error }. Defers direct chat creation until local key package is published via pending state tracking.
iOS Agent Logic Simplification
ios/Sources/Agent.swift, ios/Sources/AppManager.swift, ios/Sources/ContentView.swift, ios/Sources/Views/ChatListView.swift, ios/Tests/AgentTests.swift
Removes isAgentEligible and makeAgentButtonState functions from Agent.swift. Removes agentButtonState(for:) and currentNpub() from AppManager.swift. Updates ContentView to read agentButton from state rather than manager helper. Changes agent button icon from "desktopcomputer" to "sparkles". Deletes AgentTests.swift eligibility and button state tests.
iOS Test Updates
ios/Tests/AppManagerTests.swift, ios/Sources/PreviewData.swift
Adds updateRequired, agentButton, and mediaGallery fields to test AppState construction. Replaces manager.agentButtonState(for:) calls with direct manager.state.agentButton access. Updates test names and expectations to reflect AgentMenuItemState type.
Android FFI Converters & State Wiring
android/app/src/main/java/com/pika/app/AppManager.kt
Initializes agentButton: null in AppState constructor usage within AppManager. Adds FfiConverterTypeAgentMenuItemState and FfiConverterOptionalTypeAgentMenuItemState to handle Rust↔Kotlin conversion of AgentMenuItemState and optional variants.
Demo & Build Configuration
justfile, scripts/agent-demo.sh
Adds agent-demo Justfile recipe to execute agent demo script. Adds agent-demo.sh Bash script that orchestrates live agent chat demo: manages VM lifecycle, handles agent creation/recovery, coordinates key package publishing, and executes pikachat CLI agent flow.

Sequence Diagram

sequenceDiagram
    participant Client as iOS/Android Client
    participant Core as AppCore
    participant Allowlist as Agent Allowlist<br/>Service
    participant KeyPkg as Key Package<br/>Publisher
    participant Server as pika-server<br/>agent_api
    participant VM as Agent VM

    Client->>Core: ensure_agent(npub)
    Core->>Core: validate_local_keys()
    Core->>Allowlist: refresh_agent_allowlist()<br/>(token: 1)
    Allowlist->>Server: check_allowlist(npub)
    Server->>Allowlist: allowlisted=true
    Allowlist->>Core: InternalEvent::AgentAllowlistResolved<br/>{token:1, allowlisted:true}
    Core->>Core: update_agent_button_state(allowed)
    Client->>Core: create_direct_chat(peer_npub)
    Core->>KeyPkg: publish_key_package()<br/>(token: 1)
    Note over KeyPkg: Store pending_direct_chat_creation<br/>if not yet published
    KeyPkg->>Server: publish_key_package()
    Server->>KeyPkg: success
    KeyPkg->>Core: InternalEvent::KeyPackagePublished<br/>{token:1, ok:true}
    Core->>Core: continue_pending_chat_creation()
    Core->>Server: ensure_agent()
    Server->>VM: provision_agent_for_owner()
    VM->>Server: AgentInstance created
    Server->>Core: AgentInstance
    Core->>Client: update_app_state(agent_button)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐰 A button springs to life with sparkles bright,
Token-bound key packages dance through the night,
Allowlist whispers from server to core,
Agent VM recovers when lost—what a tour!
State flows like carrots, from Rust to the view. ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 17.81% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main objectives: fixing agent recovery races and adding a demo flow, which aligns with the core changes across server, Rust core, and native app layers.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@justinmoon
Copy link
Collaborator Author

Superseded by #476 so CI can run from an in-repo branch.

Copy link

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

View 6 additional findings in Devin Review.

Open in Devin Review

Comment on lines +298 to 303
AgentAppState::Error => {
recover_my_agent(&client, &keys, &base_url).await?;
if attempt < AGENT_POLL_MAX_ATTEMPTS {
tokio::time::sleep(AGENT_POLL_DELAY).await;
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Unbounded recover calls in Error state can spawn up to 45 orphaned VMs

In run_agent_flow, the AgentAppState::Error branch calls recover_my_agent on every poll iteration without any rate-limiting guard. Unlike the Creating state, which uses recovered_stalled_creating to limit recovery to one attempt, the Error state will call recover on every 2-second cycle. If the server-side recovery provisions a new VM that immediately fails (e.g., spawner network error), the new agent enters Error state, and the next poll triggers another recovery — creating yet another VM. Over 45 iterations this can produce up to 45 orphaned VMs.

Comparison with the guarded Creating handler

The Creating handler at rust/src/core/agent.rs:289-296 correctly guards with recovered_stalled_creating, ensuring recovery is attempted at most once. The Error handler at line 298 has no equivalent guard.

On the server side (crates/pika-server/src/agent_api.rs:436-448), each recover call to an errored agent provisions a brand new VM via provision_agent_for_owner. If the spawner is down, each call creates a new DB row that transitions to Error, and the cycle repeats.

Prompt for agents
In rust/src/core/agent.rs, function run_agent_flow, around lines 283-314: Add a guard (similar to recovered_stalled_creating) to prevent repeated recover_my_agent calls in the AgentAppState::Error branch. For example, add a `recovered_error` boolean flag that is set to true after the first recover attempt in the Error state, and skip subsequent recover calls. Alternatively, reuse the existing recovered_stalled_creating flag or introduce a single recovered flag for both Error and stalled-Creating states. The key change is: in the Error match arm (line 298), only call recover_my_agent if a guard flag is false, then set the flag to true after the call.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant