feat: add --per-depth-timeout option for kontrol prove#1141
Draft
Stevengre wants to merge 1 commit into
Draft
Conversation
1f0cc6a to
36d97eb
Compare
36d97eb to
a20aacb
Compare
4 tasks
a20aacb to
bd56284
Compare
--per-depth-timeout option for progressive depth-halving prove--per-depth-timeout option for kontrol prove (blocked by kevm-pyk #2850)
bd56284 to
a0fce21
Compare
--per-depth-timeout option for kontrol prove (blocked by kevm-pyk #2850)--per-depth-timeout option for kontrol prove
Adds progressive depth-halving with a per-attempt **stall window**. When `--per-depth-timeout S` is set, each attempt is given an initial window of `current_depth * S` seconds. The parent polls the proof's on-disk subdir every window: if any file mtime changed (i.e. the prove committed at least one step), the window is reset for another round. If no progress is observed across a full window, the entire subprocess session is reaped with `os.killpg(..., SIGKILL)` (Python + KoreServer + parallel-frontier worker threads), `current_depth` is halved (floor 1), and the next attempt resumes from the disk-persisted KCFG state. Each attempt runs in a forked subprocess that calls `os.setsid()` to become its own session leader, so a single `killpg` reaps the entire subtree. The proof state is persisted by `advance_proof`'s maintenance loop (maintenance_rate=1, default), so on-disk KCFG state is current up to the last committed step at the moment of the kill. Default 0 disables the wrapper: the existing single-attempt `run_prover(...)` path is taken with no subprocess overhead.
a0fce21 to
62ed2fd
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds
--per-depth-timeout SECONDStokontrol prove. When set (>0), each prove attempt runs in a forked subprocess. The parent treatscurrent_depth * per_depth_timeoutas a stall window rather than a hard total budget: every window the parent polls the proof's on-disk subdir for file mtime changes; any change means the prove committed a step and the window resets for another round. Only when a full window passes with no progress does the parent SIGKILL the entire subprocess session (Python + KoreServer subprocess + parallel-frontier worker threads), halvecurrent_depth(floor 1), and start the next attempt — which resumes from the disk-persisted KCFG state. Default0disables the wrapper.Example:
--max-depth 1000 --per-depth-timeout 10starts with a 10000s stall window at depth=1000. As long as KCFG keeps growing within each 10000s window, the prove runs as long as it likes. If no node is committed for 10000s, kill → halve to depth=500 with a 5000s window → … → depth=1 with a 10s window.Why stall window (vs. hard wall-clock cap)
A hard total budget kills proofs that are productively making progress just because the wall-clock crossed a threshold. The intent of progressive halving is to detect when
execute_depthis too coarse to break out of a stuck step — exactly the case where no new nodes appear. Stall-window semantics matches the symptom directly: if the prove keeps writing to disk, leave it alone; if it stops writing, the next step is stuck and shrinkingexecute_depthis the right move.Why subprocess + SIGKILL (vs. callback inside
run_prover)advance_proof's maintenance callback only fires afterstep_proofreturns. If a single step takes minutes (deep symbolic execution, expensive SMT), the callback can't fire and a callback-based timeout can't interrupt. Subprocess + session-group SIGKILL gives precise wall-clock cutoff regardless of what the prover is doing internally — the kernel atomically reaps every relevant process and thread.Implementation
cli.py: new--per-depth-timeoutflag.options.py: newProveOptions.per_depth_timeout(default0).prove.py:per_depth_timeout > 0,init_and_run_proofdelegates to_init_and_run_proof_progressive, which loops over halving depths._run_attempt_under_timeout(test, attempt_max_depth, budget_s)runs one attempt:mp.get_context('fork').Processso the child inherits everything via CoW (no pickling, no spawn cost).os.setsid()to become a session leader, mutates its local fork-copy ofoptions.max_depthandoptions.per_depth_timeout = 0(preventing recursion), then re-entersinit_and_run_proof. The new KoreServer it starts is in the same session.budget_sviaproc.join(timeout=budget_s). If child exited → break out and read the result viaPipe. If still alive → walkfoundry.proofs_dir / test.idand compare max file mtime against the previous snapshot.budget_swindow. Marker unchanged →os.killpg(os.getpgid(proc.pid), SIGKILL)reaps the whole session, return_ATTEMPT_TIMEOUT.Test plan
kontrol prove --helplists--per-depth-timeout.run_prover(...)is the unaltered code path).step_proofis killed withinbudget_s + poll_overhead; logs showdepth=N attempt exhausted Ms budget; halving.--per-depth-timeout 10 --max-depth 100proof runs,pgrep kore-rpc-booster | wc -lreturns to baseline within a second of the kill (kore-rpc subprocess reaped via session-group SIGKILL, not orphaned).workers > 1: each test's per-attempt subprocess is independent; no cross-contamination of progress signals (the marker walksproofs_dir / test.id, scoped per test).Caveats
mp.get_context('fork'); requires POSIX (already a kontrol requirement for kore-rpc-booster).proof.write_proof_data()is assumed to write atomically (temp + rename). A SIGKILL during write could yield a partial state, but pyk's loader can re-fetch from logs.budget_s. Kill latency is therefore bounded by one extra window beyond actual stall onset — not millisecond-precise but adequate for budgets >= seconds.