feat: add --per-depth-timeout option for kontrol prove #1141

Draft: Stevengre wants to merge 1 commit into master from progressive-depth-timeout


@Stevengre Stevengre commented May 12, 2026

Summary

Adds --per-depth-timeout SECONDS to kontrol prove. When set (>0), each prove attempt runs in a forked subprocess. The parent treats current_depth * per_depth_timeout as a stall window rather than a hard total budget: every window the parent polls the proof's on-disk subdir for file mtime changes; any change means the prove committed a step and the window resets for another round. Only when a full window passes with no progress does the parent SIGKILL the entire subprocess session (Python + KoreServer subprocess + parallel-frontier worker threads), halve current_depth (floor 1), and start the next attempt — which resumes from the disk-persisted KCFG state. Default 0 disables the wrapper.

Example: --max-depth 1000 --per-depth-timeout 10 starts with a 10000s stall window at depth=1000. As long as KCFG keeps growing within each 10000s window, the prove runs as long as it likes. If no node is committed for 10000s, kill → halve to depth=500 with a 5000s window → … → depth=1 with a 10s window.
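The schedule in this example can be reproduced with a short sketch. The function name `stall_window_schedule` is hypothetical and not part of the PR. It assumes integer halving (`depth // 2`) and that the schedule stops once depth 1 has been tried; the description only states "halve (floor 1)".

```python
def stall_window_schedule(max_depth: int, per_depth_timeout: float) -> list[tuple[int, float]]:
    """Return (depth, stall_window_seconds) pairs, halving depth down to 1.

    Hypothetical helper: assumes integer halving and that depth 1 is the
    last attempt, which the PR description does not spell out.
    """
    if max_depth < 1 or per_depth_timeout <= 0:
        raise ValueError('max_depth must be >= 1 and per_depth_timeout must be > 0')
    schedule = []
    depth = max_depth
    while True:
        # The stall window scales linearly with the current depth.
        schedule.append((depth, depth * per_depth_timeout))
        if depth == 1:
            break
        depth = max(1, depth // 2)
    return schedule


if __name__ == '__main__':
    for depth, window in stall_window_schedule(1000, 10):
        print(f'depth={depth} window={window}s')
```

Under these assumptions, `--max-depth 1000 --per-depth-timeout 10` gives ten attempts at most, from a 10000s window at depth 1000 down to a 10s window at depth 1.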

Why stall window (vs. hard wall-clock cap)

A hard total budget kills proofs that are productively making progress just because the wall-clock crossed a threshold. The intent of progressive halving is to detect when execute_depth is too coarse to break out of a stuck step — exactly the case where no new nodes appear. Stall-window semantics matches the symptom directly: if the prove keeps writing to disk, leave it alone; if it stops writing, the next step is stuck and shrinking execute_depth is the right move.

Why subprocess + SIGKILL (vs. callback inside run_prover)

advance_proof's maintenance callback only fires after step_proof returns. If a single step takes minutes (deep symbolic execution, expensive SMT), the callback never gets a chance to run, so a callback-based timeout cannot interrupt it. Running the attempt in a subprocess and sending SIGKILL to its process group gives a wall-clock cutoff regardless of what the prover is doing internally: the kernel delivers the signal to every process in the group, and each process's threads die with it.
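A minimal, self-contained sketch of the fork + `os.setsid()` + `os.killpg` pattern, for illustration only. The names `_attempt` and `run_with_hard_kill` are hypothetical, the sleeping helper process merely stands in for KoreServer, and the guard against signalling before the child has called `setsid()` is an addition of this sketch rather than something the PR describes.

```python
import multiprocessing as mp
import os
import signal
import subprocess
import sys
import time


def _attempt() -> None:
    # Hypothetical stand-in for the child side of one prove attempt.
    os.setsid()  # new session; the child also becomes a process-group leader
    # Stand-in for KoreServer: a helper process that inherits the child's group.
    subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])
    time.sleep(60)  # stand-in for a single step that never returns


def run_with_hard_kill(timeout_s: float) -> bool:
    """Run _attempt in a forked child; return True if it had to be killed."""
    proc = mp.get_context('fork').Process(target=_attempt)
    proc.start()
    proc.join(timeout=timeout_s)
    if not proc.is_alive():
        return False
    try:
        if os.getpgid(proc.pid) == proc.pid:
            # Child leads its own group: signal the whole group at once.
            os.killpg(proc.pid, signal.SIGKILL)
        else:
            # setsid() has not run yet; killpg would hit the parent's group.
            proc.kill()
    except ProcessLookupError:
        pass  # child exited between the liveness check and the signal
    proc.join()
    return True


if __name__ == '__main__':
    print('killed:', run_with_hard_kill(1.0))
```

Note that `killpg` targets the process group, not the session: a descendant that moves itself into a different process group would survive the signal.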

Implementation

  • cli.py: new --per-depth-timeout flag.
  • options.py: new ProveOptions.per_depth_timeout (default 0).
  • prove.py:
    • When per_depth_timeout > 0, init_and_run_proof delegates to _init_and_run_proof_progressive, which loops over halving depths.
    • _run_attempt_under_timeout(test, attempt_max_depth, budget_s) runs one attempt:
      • Uses mp.get_context('fork').Process so the child inherits the parent's state via copy-on-write (no pickling, no spawn cost).
      • The child calls os.setsid() to become a session leader (and therefore a process-group leader), sets options.max_depth to the attempt depth and options.per_depth_timeout to 0 in its fork-local copy (preventing recursion), then re-enters init_and_run_proof. The new KoreServer it starts is in the same session and process group.
      • The parent waits one window at a time via proc.join(timeout=budget_s). If the child has exited, it breaks out and reads the result via Pipe. If the child is still alive, it walks foundry.proofs_dir / test.id and compares the max file mtime against the previous snapshot.
      • Marker changed: grant another budget_s window. Marker unchanged: os.killpg(os.getpgid(proc.pid), SIGKILL) kills the child's whole process group, and the attempt returns _ATTEMPT_TIMEOUT.
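The polling side of the steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in prove.py: `progress_marker` and `wait_until_stalled` are hypothetical names, and `proc` is any object with the `join(timeout=...)` and `is_alive()` methods of `multiprocessing.Process`.

```python
import os
from pathlib import Path


def progress_marker(proof_dir: Path) -> float:
    """Newest mtime among all files under proof_dir, or 0.0 if there are none."""
    newest = 0.0
    for root, _dirs, files in os.walk(proof_dir):
        for name in files:
            try:
                newest = max(newest, (Path(root) / name).stat().st_mtime)
            except FileNotFoundError:
                continue  # file vanished between listing and stat
    return newest


def wait_until_stalled(proc, proof_dir: Path, budget_s: float) -> bool:
    """Return True if proc exited by itself, False if a full window saw no writes."""
    last = progress_marker(proof_dir)
    while True:
        proc.join(timeout=budget_s)  # block for at most one stall window
        if not proc.is_alive():
            return True
        current = progress_marker(proof_dir)
        if current == last:
            return False  # no file changed during the window: treat as a stall
        last = current  # progress observed: grant another window
```

A caller would kill the child's process group when `wait_until_stalled` returns False and read the result from the pipe when it returns True.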

Test plan

  • kontrol prove --help lists --per-depth-timeout.
  • Default-off: existing test suite passes unchanged (no subprocess overhead, run_prover(...) is the unaltered code path).
  • Stall detection: a contrived proof that hangs in a single long step_proof is killed within budget_s + poll_overhead; logs show depth=N attempt exhausted Ms budget; halving.
  • No false positives: a productive proof making steady progress (any committed step inside each window) runs to completion without being killed.
  • Hard kill cleanup: while a --per-depth-timeout 10 --max-depth 100 proof runs, pgrep kore-rpc-booster | wc -l returns to baseline within a second of the kill (kore-rpc subprocess killed via process-group SIGKILL, not orphaned).
  • Resumption: after a forced halving, the next attempt's KCFG starts from where the previous one left off (no full restart).
  • workers > 1: each test's per-attempt subprocess is independent; no cross-contamination of progress signals (the marker walks proofs_dir / test.id, scoped per test).

Caveats

  • Uses mp.get_context('fork'); requires POSIX (already a kontrol requirement for kore-rpc-booster).
  • Each halving spawns a fresh KoreServer; per-attempt startup cost adds a few seconds.
  • Atomicity: pyk's proof.write_proof_data() is assumed to write atomically (temp + rename). If that assumption does not hold, a SIGKILL during a write could yield a partial state, but pyk's loader can re-fetch from logs.
  • Polling cadence equals budget_s. Kill latency is therefore bounded by one extra window beyond actual stall onset: not millisecond-precise, but adequate for budgets of a second or more.
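For reference, the temp + rename pattern that the atomicity caveat assumes looks roughly like the sketch below. This is a generic illustration, not pyk's actual implementation; `write_atomically` is a hypothetical name.

```python
import os
import tempfile
from pathlib import Path


def write_atomically(path: Path, data: str) -> None:
    """Write data to path so readers see either the old or the new content."""
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + '.', suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)  # atomic replacement on POSIX
    except BaseException:
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

With this pattern, a SIGKILL that lands mid-write can leave a stray temporary file behind, but the target file itself is never half-written.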

@Stevengre Stevengre force-pushed the progressive-depth-timeout branch from 1f0cc6a to 36d97eb Compare May 13, 2026 02:38
@Stevengre Stevengre force-pushed the progressive-depth-timeout branch from 36d97eb to a20aacb Compare May 26, 2026 03:07
@Stevengre Stevengre force-pushed the progressive-depth-timeout branch from a20aacb to bd56284 Compare May 26, 2026 09:55
@Stevengre Stevengre changed the title feat: add --per-depth-timeout option for progressive depth-halving prove feat: add --per-depth-timeout option for kontrol prove (blocked by kevm-pyk #2850) May 26, 2026
@Stevengre Stevengre force-pushed the progressive-depth-timeout branch from bd56284 to a0fce21 Compare May 26, 2026 12:16
@Stevengre Stevengre changed the title feat: add --per-depth-timeout option for kontrol prove (blocked by kevm-pyk #2850) feat: add --per-depth-timeout option for kontrol prove May 26, 2026
Adds progressive depth-halving with a per-attempt **stall window**.
When `--per-depth-timeout S` is set, each attempt is given an initial
window of `current_depth * S` seconds. The parent polls the proof's
on-disk subdir every window: if any file mtime changed (i.e. the prove
committed at least one step), the window is reset for another round.
If no progress is observed across a full window, the entire subprocess
session is reaped with `os.killpg(..., SIGKILL)` (Python + KoreServer
+ parallel-frontier worker threads), `current_depth` is halved (floor
1), and the next attempt resumes from the disk-persisted KCFG state.

Each attempt runs in a forked subprocess that calls `os.setsid()` to
become its own session leader, so a single `killpg` reaps the entire
subtree. The proof state is persisted by `advance_proof`'s maintenance
loop (maintenance_rate=1, default), so on-disk KCFG state is current
up to the last committed step at the moment of the kill.

Default 0 disables the wrapper: the existing single-attempt
`run_prover(...)` path is taken with no subprocess overhead.
@Stevengre Stevengre force-pushed the progressive-depth-timeout branch from a0fce21 to 62ed2fd Compare May 26, 2026 13:42