feat: RunLivenessWatchdog shadow rule flags implausibly-long Runs#273
Merged
Step 1 of the trust-first observe-stream reframe: a SHADOW, observe-only rule inside the existing RunSupervisor loop that logs (run_liveness.would_flag) a Run that has been Running past an operator ceiling, the de-facto-dead scan a human must currently catch by hand. It records NO Decision and issues NO command. Off by default (run_liveness_ceiling_seconds None = off, plus run_supervisor_enabled).

It is a RULE, not a new agent (runs under the existing RunSupervisor identity), and keys on a NEW running_since proj_run_summary column, set on RunStarted and RESET on RunResumed, not created_at (which over-counts overnight Held intervals and would false-alarm). The edge-trigger memory (liveness) is walled off from the beam-Hold FSM (memory); NULL running_since never flags.

- additive nullable running_since migration (no backfill: NULL never flags)
- running_since surfaced on list_runs (RunSummaryItem / SELECT)
- is_run_stale pure rule (inclusive >= ceiling) + the shadow pass before the beam read (independent of beam I/O)
- run_liveness_ceiling_seconds setting + >0-when-set validator

Advise mode (a Decision(choice=SupervisionQuieted)) and the per-channel observe-stream feeder are the gated next rungs.

Gate review (4 agents: 3 baseline + migration-safety): 3 APPROVE, 1 APPROVE-WITH-NITS, 0 P0/P1. R2's 3 coverage nits (multi-run filtering, liveness prune, discard/re-flag arm) folded here. AST command-ban deferred to advise mode (shadow is log-only; observe-only proven behaviorally).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
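To make the running_since versus created_at argument concrete, here is a minimal sketch of the pure rule and the projection behaviour described above. It is not the PR's code: the is_run_stale signature, the RunSummaryRow class, the apply_event helper, and the "RunHeld" event name are all assumptions made for illustration. Only the inclusive >= comparison, the NULL-never-flags default, and the set-on-RunStarted / reset-on-RunResumed behaviour come from the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_run_stale(
    running_since: Optional[datetime],
    now: datetime,
    ceiling_seconds: Optional[int],
) -> bool:
    """Pure rule: True once a Run has been Running for at least the ceiling."""
    # A None ceiling means the rule is off; a NULL running_since never flags.
    if ceiling_seconds is None or running_since is None:
        return False
    # Inclusive boundary: exactly at the ceiling counts as stale.
    return (now - running_since).total_seconds() >= ceiling_seconds


@dataclass
class RunSummaryRow:
    """Hypothetical stand-in for one proj_run_summary row."""
    created_at: datetime
    running_since: Optional[datetime] = None


def apply_event(row: RunSummaryRow, event_name: str, at: datetime) -> None:
    """Hypothetical projection step: RunStarted sets, RunResumed resets."""
    if event_name in ("RunStarted", "RunResumed"):
        row.running_since = at
    # Assumption: other events (e.g. a hold) leave running_since untouched.


# Scenario: started in the evening, held overnight, resumed in the morning.
t0 = datetime(2026, 6, 1, 18, 0, tzinfo=timezone.utc)
row = RunSummaryRow(created_at=t0)
apply_event(row, "RunStarted", t0)
apply_event(row, "RunHeld", t0 + timedelta(hours=1))  # hypothetical event name
resumed = t0 + timedelta(hours=15)
apply_event(row, "RunResumed", resumed)

now = resumed + timedelta(minutes=10)
ceiling = 4 * 3600  # illustrative 4-hour ceiling

created_at_verdict = is_run_stale(row.created_at, now, ceiling)
running_since_verdict = is_run_stale(row.running_since, now, ceiling)
print(created_at_verdict, running_since_verdict)
```

In this scenario a created_at-keyed rule would flag a Run that resumed ten minutes ago, while the running_since-keyed rule stays quiet, which is the false alarm the description warns about.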
xmap added a commit that referenced this pull request Jun 21, 2026

…leet (#290)

The recent agent PRs (#233 gated resume, #266 ClearanceWatcher, #273 run-liveness, #288 observation-signal rules) shipped code-only; the hand-authored module docs had drifted. Shape-level catch-up, no internals:

- agent/index.md: five -> six seeded agents (add ClearanceWatcher, a passive flag-only periodic-loop agent recording a ClearanceProgress Decision); note RunSupervisor also carries shadow observe-only rules.
- run/index.md: RunSupervisor now does gated autonomous resume (not wind-down only) and carries shadow run-liveness / signal-quality / signal-stall rules that log-only; add is_simulated to the Observation shape + the entries_run_observations DDL with a one-line rationale.

mkdocs build --strict passes.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
xmap added a commit that referenced this pull request Jun 22, 2026

…l rules (#294)

* feat(decision): add the three RunSupervision advise-rung choices

Slice A of the observation-signal advise rung. Adds SupervisionQuieted (run-age liveness backstop), SupervisionStalled (Rule R rate-dropout), and SupervisionBreached (Rule Q quality-below-limit) to the RunSupervisionChoice Literal + RUN_SUPERVISION_CHOICES frozenset (7 -> 10), with the vocab test updated to the 10-value set + a work-noun guard on the new dispositions.

WHY: promoting the shipped shadow observation-signal + run-liveness rules one rung (observe -> advise) means the supervisor records one Decision per breach edge for a human; that Decision's choice must exist in the closed set first. Decision-only dispositions (never a command).

SupervisionBreached is the naming-r3 rename of the originally-proposed SupervisionDoubted: "Doubted" read as the supervisor's epistemic state; "Breached" names the objective limit-crossing, family-uniform with Deferred / Conflicted / Stalled. This slice adds vocabulary only; the supervisor emission lands next.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(api): promote the RunSupervisor shadow rules to the advise rung

Slice B of the observation-signal advise rung. Adds run_supervisor_advise_enabled (default off, a further opt-in above each rule's own enable) and, when on, emits exactly one Decision per breach EDGE from the three shadow rules -- still issuing NO command (advise rung):

- run-liveness backstop -> SupervisionQuieted
- Rule R rate-dropout -> SupervisionStalled
- Rule Q quality breach -> SupervisionBreached

WHY: the shadow rules (#288 / #273) log would_flag but leave no durable record a human can triage. The advise rung climbs exactly one step (observe -> advise), recording one RunSupervision Decision per breach episode for a human while keeping the act rung (auto-Hold) deferred.

Emission is edge-triggered off the already-walled per-rule memory (one Decision per episode; nothing on a standing breach across ticks), beam-free (the liveness rule runs before the beam read), and reuses the existing DecisionRegistered shape under the RunSupervisor identity + Authorize path. Shadow logging is unchanged; advise only adds the Decision. cannot-tell still defers (no Decision).

Tests cover advise-off (no Decision), each disposition under advise-on (one Decision, no command), and edge-triggering (one Decision across two ticks of a standing breach).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(api): cover advise-rung edge-trigger + cannot-tell gates

Gate-review follow-ups (the advise diff drew 2 ship + 1 changes_needed, the last purely a test-coverage gap; the correctness/trust lens passed clean). Adds three tests:

- advise liveness is edge-triggered: two ticks of a standing stale Run record only ONE SupervisionQuieted Decision (parity with the quality + stall edge-trigger tests).
- advise records no Decision when the quality channel has no observation (cannot-tell -> defer; pins that the value-None path never emits, which a reviewer worried about -- the decider returns would_flag=False on None).
- advise records no Decision when the rule is disabled (snr_limit None): advise respects each rule's own enable, not just the global advise flag.

Test-only; no production change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(api): cover the advise-emitter ConcurrencyError no-op branch

The diff-coverage gate (hard 90% on changed lines) flagged _run_supervisor.py at 88.9%: the new _record_supervision_advice except ConcurrencyError branch (lines 490-491) was uncovered. Adds an idempotency test that re-derives the same advise Decision id (via a FixedIdGenerator repeating the id) so the second append collides and is swallowed -- mirrors the existing test_record_decision_is_idempotent_on_repeated_id for the beam-Hold path.

Test-only; covers the cross-restart re-emission no-op.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
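The advise-rung behaviour in #294 (one Decision per breach edge, a colliding re-emission swallowed as a no-op) can be sketched as follows. This is an illustrative stand-in under stated assumptions, not the code in _run_supervisor.py: InMemoryDecisionStore, the rule keys in RULE_TO_CHOICE, the record_supervision_advice signature, and this local ConcurrencyError class are all hypothetical. Only the three choice names, the advise opt-in, and the swallow-on-collision behaviour come from the commit messages.

```python
class ConcurrencyError(Exception):
    """Local stand-in for the store's id-collision error."""


# Choice names are from the commit message; the rule keys are hypothetical.
RULE_TO_CHOICE = {
    "run_liveness": "SupervisionQuieted",
    "rule_r": "SupervisionStalled",
    "rule_q": "SupervisionBreached",
}


class InMemoryDecisionStore:
    """Hypothetical append-only store that rejects a repeated Decision id."""

    def __init__(self) -> None:
        self.decisions: dict[str, tuple[str, str]] = {}

    def append(self, decision_id: str, choice: str, run_id: str) -> None:
        if decision_id in self.decisions:
            raise ConcurrencyError(decision_id)
        self.decisions[decision_id] = (choice, run_id)


def record_supervision_advice(
    store: InMemoryDecisionStore,
    decision_id: str,
    rule: str,
    run_id: str,
    advise_enabled: bool,
) -> bool:
    """Record one advise Decision; return True only if a new one was stored."""
    if not advise_enabled:
        return False  # advise is a further opt-in; shadow logging is separate
    try:
        store.append(decision_id, RULE_TO_CHOICE[rule], run_id)
    except ConcurrencyError:
        # A re-derived id (e.g. re-emission after a restart) is a no-op.
        return False
    return True


store = InMemoryDecisionStore()
first = record_supervision_advice(store, "d-1", "run_liveness", "run-7", True)
second = record_supervision_advice(store, "d-1", "run_liveness", "run-7", True)
print(first, second, len(store.decisions))
```

Note that no command is issued anywhere in this path: the only side effect is the stored Decision.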
What
Step 1 of the trust-first observe-stream reframe: a shadow, observe-only rule inside the existing RunSupervisor loop that flags a Run which has been Running past an operator ceiling, i.e. the de-facto-dead scan a human must currently catch by hand. In shadow mode it only logs run_liveness.would_flag -- it records no Decision and issues no command. Off by default.

It is a rule, not a new agent (it runs under the existing RunSupervisor identity), and the per-channel observe-stream feeder is deliberately not built here (the 102-agent stress test's headline: decouple the source-agnostic rule from the structurally-expensive feeder).
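The off-by-default posture (a None ceiling plus run_supervisor_enabled, with a >0-when-set validator) could look roughly like the sketch below. The SupervisorSettings class, its use of a plain dataclass, the liveness_rule_active property, and the False default for run_supervisor_enabled are assumptions for illustration; the real settings object may be built differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SupervisorSettings:
    """Hypothetical settings holder for the two gates on the shadow rule."""

    run_supervisor_enabled: bool = False  # assumed default
    run_liveness_ceiling_seconds: Optional[int] = None  # None = rule off

    def __post_init__(self) -> None:
        ceiling = self.run_liveness_ceiling_seconds
        # >0-when-set: None is allowed (off), but zero or negative is rejected.
        if ceiling is not None and ceiling <= 0:
            raise ValueError("run_liveness_ceiling_seconds must be > 0 when set")

    @property
    def liveness_rule_active(self) -> bool:
        # Both gates must be open before the shadow pass does anything.
        return (
            self.run_supervisor_enabled
            and self.run_liveness_ceiling_seconds is not None
        )


defaults = SupervisorSettings()
opted_in = SupervisorSettings(
    run_supervisor_enabled=True,
    run_liveness_ceiling_seconds=6 * 3600,  # illustrative value
)
print(defaults.liveness_rule_active, opted_in.liveness_rule_active)
```

Rejecting a non-positive ceiling at construction time keeps a misconfiguration from silently flagging every Running Run on the first tick.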
Why
The one beamtime-stewardship failure CORA still cannot detect autonomously is a silently-hung run burning allocation while status stays Running. This rung delivers that detection now, on a signal that CORA owns and that cannot be spoofed, while building the live-run-watch loop without any feeder.

The signal: running_since (not created_at)

The rule keys on a new running_since column on proj_run_summary, set on RunStarted and RESET on RunResumed, so it measures actual Running-duration. created_at over-counts overnight Held intervals and would false-alarm (the failure the design warns against); updated_at is reset by every transition and isn't on the list read surface. (This corrects the design memo's earlier "no migration / use created_at" framing.)

Changes

- additive nullable running_since migration (no backfill -- NULL never flags, the safe default)
- running_since surfaced on list_runs (RunSummaryItem / SELECT)
- is_run_stale pure rule (inclusive >= ceiling) + the shadow pass, run before the beam read so it's independent of beam I/O
- run_liveness_ceiling_seconds setting (default None = off) + >0-when-set validator
- edge-trigger memory (liveness) walled off from the beam-Hold FSM (memory)

Posture
Off + inert by default (two gates: the None ceiling AND run_supervisor_enabled). Observe-only: a flagged Run leaves hold/resume calls empty (proven behaviorally). Honors every stress-test fix: no feeder, no ControlPort hoist, no seeded-Agent grant, own walled memory, own (future) choice -- not the overloaded SupervisionDeferred.

Test plan
ruff, pyright, tach, full architecture fitness suite, and the run/api/decision unit suites green (verified locally + by the pre-push hook). New tests: is_run_stale (old/recent/inclusive-boundary/None), shadow tick (observe-only, off-when-None, recent-not-flagged = the resumed-overnight regression, NULL-never-flags, edge-trigger stable, multi-run only-stale-flagged, liveness-prune-on-leave, discard-then-re-flag), projection (RunStarted writes running_since; RunResumed resets it), config validator.

Gate review
4 agents (3 baseline + migration-safety): 3 APPROVE, 1 APPROVE-WITH-NITS, 0 P0/P1. R2's three coverage nits (multi-run filtering, liveness prune, discard/re-flag arm) were folded before commit. R1's AST command-ban is deferred to advise mode (shadow is log-only; observe-only is proven behaviorally).
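The shadow-tick behaviours named in the test plan (edge-trigger stable, multi-run only-stale-flagged, liveness prune on leave, discard then re-flag) can be illustrated with a small sketch. Everything here is an assumption apart from the log key run_liveness.would_flag and the rule semantics: RunItem, LivenessShadow, its tick method, and the logger name are hypothetical, and the real pass lives inside the RunSupervisor loop.

```python
import logging
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional

log = logging.getLogger("run_supervisor")  # hypothetical logger name


@dataclass
class RunItem:
    """Hypothetical slice of a list_runs row."""
    run_id: str
    status: str
    running_since: Optional[datetime]


@dataclass
class LivenessShadow:
    """Observe-only pass with its own memory, separate from any Hold state."""
    ceiling_seconds: Optional[int]
    flagged: set = field(default_factory=set)

    def tick(self, runs: list, now: datetime) -> list:
        if self.ceiling_seconds is None:
            return []  # rule is off
        stale = {
            r.run_id
            for r in runs
            if r.status == "Running"
            and r.running_since is not None  # NULL never flags
            and (now - r.running_since).total_seconds() >= self.ceiling_seconds
        }
        newly = sorted(stale - self.flagged)  # edge: only first crossing logs
        for run_id in newly:
            log.info("run_liveness.would_flag run_id=%s", run_id)
        # Remember standing breaches and prune runs that are no longer stale,
        # so a run that leaves and later goes stale again is flagged again.
        self.flagged = stale
        return newly


now = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
runs = [
    RunItem("r1", "Running", now - timedelta(hours=5)),
    RunItem("r2", "Running", now - timedelta(minutes=5)),
    RunItem("r3", "Running", None),
]
shadow = LivenessShadow(ceiling_seconds=4 * 3600)
first_tick = shadow.tick(runs, now)
second_tick = shadow.tick(runs, now)
print(first_tick, second_tick)
```

The pass only returns and logs run ids. It records no Decision and calls no hold or resume function, which is the observe-only property the tests check behaviourally.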
Next rungs (deferred)
Advise mode -- a Decision(context=RunSupervision, choice=SupervisionQuieted) per quiet episode, promoted once the ceiling is calibrated from shadow logs. Then, only when staff confirm a real machine-readable signal (SNR-1/PROG-1), the per-channel observe-stream feeder. See project_run_liveness_watchdog_design.md.

🤖 Generated with Claude Code