feat: complete Phase 3 multi-node mobility testing #6
Add 5 integration tests verifying capability-aware agent migration across 3-node chains (A→B→C→A), budget conservation (RE-3), capability rejection (CE-5), manifest preservation, and rapid round-trip stress testing (20 hops). Fix host module re-registration: close existing igor module before re-instantiating so agents can migrate back to previously visited nodes. Add per-node capability overrides for heterogeneous testing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
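The re-registration fix described above can be illustrated with a toy model. This is a minimal sketch, assuming a runtime that refuses to instantiate a second live module under the same name; `fakeRuntime`, `module`, and `registerHostModule` are hypothetical stand-ins and not the code in `internal/hostcall/registry.go`.

```go
package main

import (
	"errors"
	"fmt"
)

// module is a hypothetical stand-in for an instantiated host module.
type module struct {
	name   string
	closed bool
}

// Close marks the module as released so its name can be reused.
func (m *module) Close() error {
	m.closed = true
	return nil
}

// fakeRuntime mimics a runtime that rejects a second live module
// registered under an already-used name (an assumption of this sketch).
type fakeRuntime struct {
	live map[string]*module
}

var errDuplicate = errors.New("module already instantiated")

// instantiate fails if a module with the same name is still open.
func (r *fakeRuntime) instantiate(name string) (*module, error) {
	if m, ok := r.live[name]; ok && !m.closed {
		return nil, errDuplicate
	}
	m := &module{name: name}
	r.live[name] = m
	return m, nil
}

// registerHostModule closes any live module with the same name before
// instantiating, so an agent returning to a visited node can register again.
func registerHostModule(r *fakeRuntime, name string) (*module, error) {
	if old, ok := r.live[name]; ok && !old.closed {
		if err := old.Close(); err != nil {
			return nil, fmt.Errorf("close existing %q: %w", name, err)
		}
	}
	return r.instantiate(name)
}

func main() {
	r := &fakeRuntime{live: map[string]*module{}}
	_, _ = registerHostModule(r, "igor")

	// Without closing first, the second instantiation is rejected.
	_, err := r.instantiate("igor")
	fmt.Println("naive re-instantiate:", err)

	// Closing the old module first lets the registration succeed.
	_, err = registerHostModule(r, "igor")
	fmt.Println("close-then-register:", err)
}
```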
…3 completion Fix orphaned checkpoint bug: when agent load, init, or resume fails during incoming migration, the already-saved checkpoint is now deleted to prevent stale state from blocking future migrations. Update README, ROADMAP, and IMPLEMENTATION_STATUS to reflect Phase 3 complete with Task 9 validated. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a9197f3a3e
    )
    if err != nil {
        s.logger.Error("Failed to load agent", "error", err)
        s.deleteOrphanedCheckpoint(ctx, pkg.AgentID)
Avoid deleting existing checkpoint on migration load failure
This cleanup call removes pkg.AgentID's checkpoint unconditionally after LoadAgentFromBytes fails, but incoming migrations can be duplicated/retried while the target already has an active instance for the same agent ID. In that case the handler has already written the incoming checkpoint, and this branch now deletes the checkpoint file entirely, leaving the still-running target agent without durable state until a later tick/save succeeds. A transient retry path can therefore cause checkpoint loss and break restart/migration safety.
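One possible guard for the scenario described in this comment is sketched below. It is an in-memory toy, not the handler in `internal/migration/service.go`: `targetNode`, `handleIncoming`, and the `load` callback are hypothetical names, and the sketch assumes the handler can tell whether an instance for the agent ID is already running. On load failure it restores the previous checkpoint when a live instance owns one and deletes only a true orphan.

```go
package main

import (
	"errors"
	"fmt"
)

// targetNode models only what this sketch needs: stored checkpoints and
// the set of agent IDs that currently have a running instance.
type targetNode struct {
	checkpoints map[string][]byte
	active      map[string]bool
}

var errLoad = errors.New("load failed")

// handleIncoming saves the incoming checkpoint and then tries to load the
// agent. If loading fails, it puts back the checkpoint of an already-running
// instance; otherwise it removes the orphaned checkpoint it just wrote.
func (n *targetNode) handleIncoming(agentID string, incoming []byte, load func([]byte) error) error {
	prev, hadPrev := n.checkpoints[agentID]
	n.checkpoints[agentID] = incoming

	if err := load(incoming); err != nil {
		if n.active[agentID] && hadPrev {
			n.checkpoints[agentID] = prev // keep the live agent's durable state
		} else {
			delete(n.checkpoints, agentID) // nothing else owns this checkpoint
		}
		return fmt.Errorf("incoming migration for %q: %w", agentID, err)
	}
	n.active[agentID] = true
	return nil
}

func main() {
	failing := func([]byte) error { return errLoad }

	// A retried migration hits a node that already runs the agent.
	n := &targetNode{
		checkpoints: map[string][]byte{"agent-1": []byte("live-state")},
		active:      map[string]bool{"agent-1": true},
	}
	err := n.handleIncoming("agent-1", []byte("retried-payload"), failing)
	fmt.Println(err, "| checkpoint:", string(n.checkpoints["agent-1"]))

	// A first-time migration that fails leaves no checkpoint behind.
	fresh := &targetNode{checkpoints: map[string][]byte{}, active: map[string]bool{}}
	_ = fresh.handleIncoming("agent-2", []byte("payload"), failing)
	_, left := fresh.checkpoints["agent-2"]
	fmt.Println("orphan left behind:", left)
}
```

The sketch ignores concurrency; a real handler would also need to serialize access per agent ID, or reject the duplicate before writing anything.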
    func (s *Service) SetNodeCapabilities(caps []string) {
        s.nodeCapabilities = caps
Synchronize node capability overrides with stream handlers
SetNodeCapabilities writes s.nodeCapabilities without synchronization, while handleIncomingMigration reads the same field from libp2p stream-handler goroutines. If capability overrides are changed while migrations are in flight, this introduces a data race on the slice header and can yield nondeterministic capability checks (or race-detector failures). Guard this field with locking or an atomic/copy-on-write approach.
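The locking option mentioned here could look roughly like the following. This is a minimal sketch, assuming a `sync.RWMutex` on the service and defensive copies on both write and read; the `mu` field, `capabilitiesSnapshot`, `hasAll`, and the capability names `"clock"` and `"rand"` are hypothetical and not taken from the repository.

```go
package main

import (
	"fmt"
	"sync"
)

// Service is a stand-in for the migration service; only the capability
// field and its guard are modeled here.
type Service struct {
	mu               sync.RWMutex
	nodeCapabilities []string
}

// SetNodeCapabilities stores a private copy under the write lock, so the
// caller cannot mutate the slice after handing it over.
func (s *Service) SetNodeCapabilities(caps []string) {
	cp := append([]string(nil), caps...)
	s.mu.Lock()
	s.nodeCapabilities = cp
	s.mu.Unlock()
}

// capabilitiesSnapshot returns a copy that a stream handler can inspect
// without holding the lock.
func (s *Service) capabilitiesSnapshot() []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return append([]string(nil), s.nodeCapabilities...)
}

// hasAll reports whether every required capability is present on the node.
func (s *Service) hasAll(required []string) bool {
	have := make(map[string]struct{})
	for _, c := range s.capabilitiesSnapshot() {
		have[c] = struct{}{}
	}
	for _, r := range required {
		if _, ok := have[r]; !ok {
			return false
		}
	}
	return true
}

func main() {
	s := &Service{}
	s.SetNodeCapabilities([]string{"clock", "rand"})

	// Readers stand in for stream-handler goroutines checking capabilities.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = s.hasAll([]string{"clock"})
		}()
	}

	// The override changes while reads are in flight.
	s.SetNodeCapabilities([]string{"clock"})
	wg.Wait()

	fmt.Println(s.hasAll([]string{"clock"}), s.hasAll([]string{"rand"}))
}
```

Running a program like this with `go run -race` is a quick way to check that the guarded version no longer trips the race detector.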
Implements the remaining optimization items from IMPROVEMENTS.md:
- #9 Arena-backed event log allocation to reduce GC pressure
- #3 Observation-weighted snapshot retention (replaces FIFO eviction)
- #5 Configurable replay divergence escalation (log/pause/intensify/migrate)
- #4 Multi-tick chain replay verification (N ticks in single wazero instance)
- #7 SDK checkpoint serialization helpers (Encoder/Decoder with chainable API)
- #6 Adaptive tick rate with agent hint (Tick() returns bool, 10ms/1s intervals)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
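Item #6 can be pictured with a small scheduling sketch. The commit message gives only the `Tick() returns bool` signature and the 10ms/1s intervals; the reading that `true` means "more work pending, tick again soon" is an assumption, and `Agent`, `burstAgent`, `nextInterval`, and `schedule` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

const (
	fastInterval = 10 * time.Millisecond // used when the agent hints at pending work
	slowInterval = 1 * time.Second       // used when the agent reports it is idle
)

// Agent is a hypothetical interface: Tick reports whether more work is
// pending (assumed meaning of the bool hint).
type Agent interface {
	Tick() bool
}

// nextInterval maps the agent's hint to the delay before the next tick.
func nextInterval(moreWork bool) time.Duration {
	if moreWork {
		return fastInterval
	}
	return slowInterval
}

// burstAgent has work pending for a fixed number of ticks, then goes idle.
type burstAgent struct{ remaining int }

// Tick consumes one unit of work and reports whether any is left.
func (a *burstAgent) Tick() bool {
	if a.remaining > 0 {
		a.remaining--
	}
	return a.remaining > 0
}

// schedule runs n ticks and returns the delay chosen after each one.
// It records the delays instead of sleeping so the example finishes at once.
func schedule(a Agent, n int) []time.Duration {
	out := make([]time.Duration, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, nextInterval(a.Tick()))
	}
	return out
}

func main() {
	fmt.Println(schedule(&burstAgent{remaining: 3}, 4))
}
```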
Summary
Complete Phase 3 (Autonomy) by implementing multi-node mobility testing and capability validation across migration hops.
Changes
Core Features
- Multi-node migration testing (internal/migration/multinode_test.go)
- Host module re-registration (internal/hostcall/registry.go)
- Orphaned checkpoint cleanup (internal/migration/service.go)
- Per-node capability overrides (internal/migration/service.go): SetNodeCapabilities() method
Documentation
Validation
All Phase 3 success criteria now met:
Test Coverage
- TestChainMigration_ABC_A: 3-node chain with state/capability preservation
- TestChainMigration_BudgetConservation: Budget never created/destroyed
- TestStressMigration_RapidRoundTrips: 20 back-and-forth migrations
- TestCapabilityRejection_MigrationFails: Failed migration keeps source agent
- TestCapabilityPreservation_AcrossHops: Manifest faithful across hops
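The budget conservation property checked by the tests above can be stated as an invariant: the sum of budgets over all nodes is the same before and after every hop. The toy model below illustrates that check on an A→B→C→A chain. It is not the code in `multinode_test.go`; `agent`, `node`, `migrate`, `totalBudget`, and the integer micro-unit budget are assumptions made for illustration.

```go
package main

import "fmt"

// agent carries a budget in integer micro-units (an assumption of this
// sketch, chosen so sums compare exactly).
type agent struct {
	id     string
	budget int64
}

// node holds the agents currently hosted on it.
type node struct {
	name   string
	agents map[string]*agent
}

func newNode(name string) *node {
	return &node{name: name, agents: map[string]*agent{}}
}

// migrate copies the agent to dst and then removes it from src, so the
// agent and its budget exist on exactly one node afterwards.
func migrate(src, dst *node, id string) error {
	a, ok := src.agents[id]
	if !ok {
		return fmt.Errorf("agent %q not on node %s", id, src.name)
	}
	dst.agents[id] = &agent{id: a.id, budget: a.budget}
	delete(src.agents, id)
	return nil
}

// totalBudget sums the budgets of every agent on the given nodes.
func totalBudget(nodes ...*node) int64 {
	var sum int64
	for _, n := range nodes {
		for _, a := range n.agents {
			sum += a.budget
		}
	}
	return sum
}

func main() {
	a, b, c := newNode("A"), newNode("B"), newNode("C")
	a.agents["agent-1"] = &agent{id: "agent-1", budget: 1_000_000}
	want := totalBudget(a, b, c)

	// Walk the chain A→B→C→A and check the invariant after every hop.
	hops := [][2]*node{{a, b}, {b, c}, {c, a}}
	for _, h := range hops {
		if err := migrate(h[0], h[1], "agent-1"); err != nil {
			panic(err)
		}
		if got := totalBudget(a, b, c); got != want {
			panic(fmt.Sprintf("budget changed: got %d, want %d", got, want))
		}
	}
	fmt.Println("budget conserved:", totalBudget(a, b, c) == want)
}
```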