Skip to content

fix(runtime): harden Phase 3 — validate bounds, fix leaks, add tests#7

Merged
simonovic86 merged 1 commit intomainfrom
claude/nice-sanderson
Mar 3, 2026
Merged

fix(runtime): harden Phase 3 — validate bounds, fix leaks, add tests#7
simonovic86 merged 1 commit intomainfrom
claude/nice-sanderson

Conversation

@simonovic86
Copy link
Copy Markdown
Owner

Changes

Input Validation & Bounds Checking

  • Config validation: Added Validate() method to enforce positive PricePerSecond, non-negative ReplayWindowSize/VerifyInterval, and valid ReplayMode values
  • Checkpoint parsing: Validate budget and price are non-negative when deserializing checkpoints
  • Hostcall bounds: Cap log_emit buffer to 1 MB and rand_bytes to 64 MB; return error code -2 for oversized requests
  • Memory safety: Guard against int64 overflow when calculating execution costs

Memory Leak Fixes

  • Slice eviction: Replace slice reslicing with explicit copy when evicting old entries in EventLog.SealTick() and Instance.Tick() to release references and free underlying arrays
  • Stale temp files: Add cleanStaleTempFiles() to FSProvider.NewFSProvider() to remove leftover .tmp checkpoint files from interrupted writes

Migration & Error Handling

  • Delete checkpoint atomicity: Fail migration if local checkpoint deletion fails (prevents EI-1 violations on restart)
  • Lock scope: Simplify MigrateAgent() lock handling and ensure Close is called under lock
  • Null pointer check: Validate malloc success in Instance.Resume()

Testing

  • Added comprehensive tests for config validation (price/interval bounds, replay mode)
  • Added checkpoint header validation tests (negative budget/price detection)
  • Added event log memory leak test verifying capacity is properly pruned
  • Added storage provider stale temp file cleanup test
  • Added protocol message serialization round-trip tests for AgentPackage, AgentTransfer, and AgentStarted

Code Quality

  • Modernized logging calls from logging.Info/Error to direct logger methods
  • Improved error messages and logging clarity

Hardening pass before Phase 4 (Economics):

- Validate hostcall bounds: cap rand_bytes at 64MB and log_emit at 1MB
  to prevent OOM panics (EI-6), check malloc null pointer return
- Reject negative budget/price in checkpoint parsing (RE-3), guard
  int64 overflow in cost calculation
- Fix migration: return error on stale checkpoint deletion failure
  (EI-1), hold lock during instance Close to prevent use-after-close
- Plug memory leaks: copy-to-new-slice on eventlog and replay window
  eviction instead of slice truncation
- Clean stale .tmp files on FSProvider startup (RE-1 crash recovery)
- Add config validation (PricePerSecond, ReplayWindowSize, ReplayMode)
- Remove unused protocol types, consolidate logging in cmd/igord
- Add 13 new tests covering all hardening fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@simonovic86 simonovic86 merged commit 14e28a5 into main Mar 3, 2026
1 check passed
@simonovic86 simonovic86 deleted the claude/nice-sanderson branch March 3, 2026 08:56
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 81aa9bbd91

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/hostcall/rand.go
Comment on lines +24 to +25
if length == 0 {
return 0
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Record zero-length rand calls for deterministic replay

Return early on length == 0 skips eventLog.Record, but replay always consumes one RandBytes entry per rand_bytes invocation (internal/replay/engine.go registerReplayRand calls iter.next unconditionally). In replay modes (on-migrate/full), any tick that calls rand_bytes(ptr, 0) will shift the replay iterator and produce a false divergence/rejection even though execution was valid on the source node.

Useful? React with 👍 / 👎.

Comment thread internal/hostcall/log.go
Comment on lines +21 to +22
if length == 0 || length > maxLogBytes {
return
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve log_emit entries when dropping oversized or empty logs

This early return drops log_emit calls from the event log, but replay consumes one LogEmit entry for every hostcall invocation (internal/replay/engine.go registerReplayLog uses iter.next on each call). In replay-enabled flows, agents that emit empty logs or logs above the cap will produce missing entries and trigger deterministic replay failures during verification/migration.

Useful? React with 👍 / 👎.

simonovic86 added a commit that referenced this pull request Mar 4, 2026
Implements the remaining optimization items from IMPROVEMENTS.md:

- #9 Arena-backed event log allocation to reduce GC pressure
- #3 Observation-weighted snapshot retention (replaces FIFO eviction)
- #5 Configurable replay divergence escalation (log/pause/intensify/migrate)
- #4 Multi-tick chain replay verification (N ticks in single wazero instance)
- #7 SDK checkpoint serialization helpers (Encoder/Decoder with chainable API)
- #6 Adaptive tick rate with agent hint (Tick() returns bool, 10ms/1s intervals)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
simonovic86 added a commit that referenced this pull request Mar 4, 2026
Implements the remaining optimization items from IMPROVEMENTS.md:

- #9 Arena-backed event log allocation to reduce GC pressure
- #3 Observation-weighted snapshot retention (replaces FIFO eviction)
- #5 Configurable replay divergence escalation (log/pause/intensify/migrate)
- #4 Multi-tick chain replay verification (N ticks in single wazero instance)
- #7 SDK checkpoint serialization helpers (Encoder/Decoder with chainable API)
- #6 Adaptive tick rate with agent hint (Tick() returns bool, 10ms/1s intervals)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant