feat: complete Phase 3 multi-node mobility testing #6

Merged
simonovic86 merged 2 commits into main from claude/beautiful-pascal on Mar 3, 2026
@simonovic86
Owner

Summary

Complete Phase 3 (Autonomy) by implementing multi-node mobility testing and capability validation across migration hops.

Changes

Core Features

  • Multi-node migration testing (internal/migration/multinode_test.go)

    • Chain migration A→B→C→A with state preservation
    • Budget conservation across hops (RE-3 validated)
    • Capability preservation across multiple hops
    • Stress testing: 20 rapid round-trip migrations
    • Capability rejection handling (source safety per EI-6)
  • Host module re-registration (internal/hostcall/registry.go)

    • Close existing igor module before re-instantiating
    • Supports receiving a second agent after the first migration
  • Orphaned checkpoint cleanup (internal/migration/service.go)

    • Delete failed checkpoint on LoadAgent failure
    • Delete on Init failure
    • Delete on LoadCheckpointFromStorage failure
    • Prevents stale checkpoints blocking future migrations
  • Per-node capability overrides (internal/migration/service.go)

    • New SetNodeCapabilities() method
    • Enables heterogeneous capability sets across nodes
    • Used for capability rejection testing
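
The orphaned checkpoint cleanup described above can be sketched roughly as follows. This is a minimal illustration, not the code in internal/migration/service.go: `checkpointStore`, `step`, `acceptMigration`, and `errStage` are hypothetical names, and an in-memory map stands in for the real checkpoint storage.

```go
package main

import (
	"errors"
	"fmt"
)

// errStage is a hypothetical sentinel error used to simulate a failing stage.
var errStage = errors.New("stage failed")

// checkpointStore is a hypothetical in-memory stand-in for the node's
// checkpoint storage; the real storage layer is not shown in this PR page.
type checkpointStore map[string][]byte

// step is one stage of accepting an incoming migration (load, init, resume).
type step struct {
	name string
	run  func() error
}

// acceptMigration saves the incoming checkpoint and then runs each stage.
// If any stage fails, the checkpoint written for this migration is deleted
// so that it cannot block a later migration of the same agent.
func acceptMigration(store checkpointStore, agentID string, checkpoint []byte, steps []step) error {
	store[agentID] = checkpoint
	for _, st := range steps {
		if err := st.run(); err != nil {
			delete(store, agentID) // orphaned checkpoint cleanup
			return fmt.Errorf("%s failed: %w", st.name, err)
		}
	}
	return nil
}

func main() {
	store := checkpointStore{}
	ok := func() error { return nil }
	fail := func() error { return errStage }

	err := acceptMigration(store, "agent-1", []byte("state"), []step{
		{"load", ok}, {"init", fail}, {"resume", ok},
	})
	_, stillStored := store["agent-1"]
	fmt.Println("error:", err, "checkpoint still stored:", stillStored)
}
```

Running the stages through a single loop keeps the cleanup in one place, so a newly added stage cannot forget to delete the checkpoint on failure.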

Documentation

  • Updated README: Phase 3 complete, Phase 4 (Economics) next
  • Updated ROADMAP: Task 9 complete, noted chain migration testing
  • Updated IMPLEMENTATION_STATUS: Added all Phase 3 completion items
  • Clarified known limitation: chain migration is tested, but there is no routing protocol

Validation

All Phase 3 success criteria now met:

  • ✅ Capability membrane MVP
  • ✅ Replay engine
  • ✅ Agent SDK & developer tooling
  • ✅ Multi-node mobility testing (new)

Test Coverage

  • TestChainMigration_ABC_A: 3-node chain with state/capability preservation
  • TestChainMigration_BudgetConservation: Budget is never created or destroyed across hops
  • TestStressMigration_RapidRoundTrips: 20 back-and-forth migrations
  • TestCapabilityRejection_MigrationFails: Failed migration keeps source agent
  • TestCapabilityPreservation_AcrossHops: Manifest faithful across hops
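
To illustrate what the chain and budget-conservation tests check, here is a simplified sketch of an A→B→C→A round trip. It is not taken from multinode_test.go: the `agent` and `node` types, `migrate`, and `totalBudget` are assumptions made for this example, and execution cost per tick is deliberately ignored so that only the migration step is modelled.

```go
package main

import "fmt"

// agent is a hypothetical, simplified agent record. Budget is an integer
// count of the smallest currency unit to avoid floating-point drift.
type agent struct {
	ID      string
	Budget  int64
	Counter uint64 // stands in for the agent's preserved state
}

// node is a hypothetical host that holds agents keyed by ID.
type node struct {
	name   string
	agents map[string]agent
}

func newNode(name string) *node {
	return &node{name: name, agents: map[string]agent{}}
}

// migrate moves an agent from src to dst. The agent is copied to dst and
// then removed from src, so it exists on exactly one node afterwards.
func migrate(src, dst *node, id string) error {
	a, ok := src.agents[id]
	if !ok {
		return fmt.Errorf("agent %q is not on node %s", id, src.name)
	}
	dst.agents[id] = a
	delete(src.agents, id)
	return nil
}

// totalBudget sums the budget held for id across all nodes. Conservation
// means this value is identical before and after every hop.
func totalBudget(id string, nodes ...*node) int64 {
	var sum int64
	for _, n := range nodes {
		if a, ok := n.agents[id]; ok {
			sum += a.Budget
		}
	}
	return sum
}

func main() {
	a, b, c := newNode("A"), newNode("B"), newNode("C")
	a.agents["agent-1"] = agent{ID: "agent-1", Budget: 1_000_000, Counter: 7}
	before := totalBudget("agent-1", a, b, c)

	for _, hop := range [][2]*node{{a, b}, {b, c}, {c, a}} {
		if err := migrate(hop[0], hop[1], "agent-1"); err != nil {
			panic(err)
		}
		if got := totalBudget("agent-1", a, b, c); got != before {
			panic(fmt.Sprintf("budget changed: %d != %d", got, before))
		}
	}
	fmt.Println("back on A with state:", a.agents["agent-1"])
}
```

Summing across all nodes, rather than checking only the destination, also catches the duplication case where an agent is left behind on the source.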

simonovic86 and others added 2 commits March 3, 2026 09:26
Add 5 integration tests verifying capability-aware agent migration
across 3-node chains (A→B→C→A), budget conservation (RE-3),
capability rejection (CE-5), manifest preservation, and rapid
round-trip stress testing (20 hops).

Fix host module re-registration: close existing igor module before
re-instantiating so agents can migrate back to previously visited
nodes. Add per-node capability overrides for heterogeneous testing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…3 completion

Fix orphaned checkpoint bug: when agent load, init, or resume fails
during incoming migration, the already-saved checkpoint is now deleted
to prevent stale state from blocking future migrations.

Update README, ROADMAP, and IMPLEMENTATION_STATUS to reflect Phase 3
complete with Task 9 validated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@simonovic86 merged commit f0326bb into main on Mar 3, 2026
1 check passed
@simonovic86 deleted the claude/beautiful-pascal branch on March 3, 2026 08:35

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a9197f3a3e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

)
if err != nil {
	s.logger.Error("Failed to load agent", "error", err)
	s.deleteOrphanedCheckpoint(ctx, pkg.AgentID)
[P1] Avoid deleting existing checkpoint on migration load failure

This cleanup call removes pkg.AgentID's checkpoint unconditionally after LoadAgentFromBytes fails, but incoming migrations can be duplicated/retried while the target already has an active instance for the same agent ID. In that case the handler has already written the incoming checkpoint, and this branch now deletes the checkpoint file entirely, leaving the still-running target agent without durable state until a later tick/save succeeds. A transient retry path can therefore cause checkpoint loss and break restart/migration safety.
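
One way to address this concern is to roll back to the previous checkpoint instead of deleting unconditionally. The sketch below is only an illustration of that idea under assumed names (`service`, `handleIncoming`, `errLoad`, and an in-memory `checkpointStore`); it is not the project's handler, and rejecting the duplicate migration before writing anything would be an equally valid design.

```go
package main

import (
	"errors"
	"fmt"
)

// errLoad is a hypothetical sentinel error used to simulate a failed load.
var errLoad = errors.New("load failed")

// checkpointStore is a hypothetical in-memory stand-in for checkpoint storage.
type checkpointStore map[string][]byte

// service is a hypothetical, stripped-down migration service.
type service struct {
	store checkpointStore
}

// handleIncoming writes the incoming checkpoint and tries to load the agent.
// On failure it restores whatever checkpoint existed before this migration,
// and deletes only when there was nothing to restore. A running agent with
// the same ID therefore keeps its durable state after a failed retry.
func (s *service) handleIncoming(agentID string, incoming []byte, load func([]byte) error) error {
	prev, hadPrev := s.store[agentID]
	s.store[agentID] = incoming

	if err := load(incoming); err != nil {
		if hadPrev {
			s.store[agentID] = prev // roll back to the pre-existing checkpoint
		} else {
			delete(s.store, agentID) // nothing existed before: remove the orphan
		}
		return fmt.Errorf("load agent %q: %w", agentID, err)
	}
	return nil
}

func main() {
	s := &service{store: checkpointStore{"agent-1": []byte("live-state")}}
	fail := func([]byte) error { return errLoad }

	err := s.handleIncoming("agent-1", []byte("retried-state"), fail)
	fmt.Printf("error: %v, checkpoint: %q\n", err, s.store["agent-1"])
}
```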

Comment on lines +531 to +532
func (s *Service) SetNodeCapabilities(caps []string) {
	s.nodeCapabilities = caps
[P2] Synchronize node capability overrides with stream handlers

SetNodeCapabilities writes s.nodeCapabilities without synchronization, while handleIncomingMigration reads the same field from libp2p stream-handler goroutines. If capability overrides are changed while migrations are in flight, this introduces a data race on the slice header and can yield nondeterministic capability checks (or race-detector failures). Guard this field with locking or an atomic/copy-on-write approach.
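
A minimal sketch of the locking approach suggested here, assuming a `sync.RWMutex` field named `mu` and a reader helper named `capabilities` (both hypothetical; only `Service`, `SetNodeCapabilities`, and `nodeCapabilities` appear in the diff). The setter and the reader each copy the slice so callers never share a backing array with the service.

```go
package main

import (
	"fmt"
	"sync"
)

// Service is reduced here to the one field under discussion.
type Service struct {
	mu               sync.RWMutex // hypothetical guard for nodeCapabilities
	nodeCapabilities []string
}

// SetNodeCapabilities stores a private copy of caps under the write lock,
// so later mutation of the caller's slice cannot affect the service.
func (s *Service) SetNodeCapabilities(caps []string) {
	cp := append([]string(nil), caps...)
	s.mu.Lock()
	s.nodeCapabilities = cp
	s.mu.Unlock()
}

// capabilities returns a snapshot under the read lock. A stream handler
// would call this once per migration and check against the snapshot.
func (s *Service) capabilities() []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return append([]string(nil), s.nodeCapabilities...)
}

func main() {
	s := &Service{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if i%2 == 0 {
				s.SetNodeCapabilities([]string{"clock", "rand"})
			} else {
				_ = s.capabilities() // concurrent reader, as a handler would be
			}
		}(i)
	}
	wg.Wait()
	fmt.Println(s.capabilities())
}
```

Running a program like this with `go run -race` is a quick way to confirm that the concurrent reads and writes no longer trip the race detector.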

simonovic86 added a commit that referenced this pull request Mar 4, 2026
Implements the remaining optimization items from IMPROVEMENTS.md:

- #9 Arena-backed event log allocation to reduce GC pressure
- #3 Observation-weighted snapshot retention (replaces FIFO eviction)
- #5 Configurable replay divergence escalation (log/pause/intensify/migrate)
- #4 Multi-tick chain replay verification (N ticks in single wazero instance)
- #7 SDK checkpoint serialization helpers (Encoder/Decoder with chainable API)
- #6 Adaptive tick rate with agent hint (Tick() returns bool, 10ms/1s intervals)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>