Abort foreground fix loop on agent quota or session limit #696
Shared classifier in a new internal/agentlimit package, used by the daemon worker (preserves cooldown + failover for daemon jobs) and the foreground roborev fix loop (strict abort with reset time). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Corrects daemon retry semantics (quota errors skip retries entirely and go straight to failoverOrFail), completes the day-1 pattern table to match the existing isQuotaError set bit-for-bit, adds an unexported classifyWithRules helper plus injectable classifier function on WorkerPool / fix-loop options for test isolation, and clarifies that PR 1 ships the framework while a Claude-specific rule follows once a real session-cap message is captured. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "no behavior change" claim was wrong: foreground roborev fix gains strict abort for KindQuota in PR 1, which changes outcomes for Gemini and Codex sessions too. Spelled out explicitly. Also dropped the proposed exported test helper — Classification and the Kind* constants are already exported, so cross-package tests can return a literal struct from their stub classifier without extra plumbing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Eleven TDD-driven tasks splitting the work into a shared agentlimit package, daemon-worker integration that preserves existing Gemini/Codex behavior bit-for-bit, and CLI fix-loop abort plumbing exercised against an injected stub classifier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…esetTime Replace candidate.Add(24*time.Hour) with time.Date(..., Day()+1, ...) so the rollover stays at the same wall-clock time across DST transitions (the 24h-add lands an hour off in March/November). Switch candidate.Before(now) to !candidate.After(now) so equality also rolls forward — avoids returning a zero-duration cooldown when the agent's clock matches ours to the second. Adds tests for the rollover boundary, timezone propagation, and the conservative trailing-content behavior that documents what the parser does NOT handle yet (deferred until a real Claude message is captured).
- ParseResetTime doc comment was stale after the DST fix; update to describe at-or-before rollover and DST-safe wall-clock semantics.
- TestParseResetTimeAcrossDST: real IANA-zone test (America/New_York) that exercises the spring-forward and fall-back transitions where Add(24h) would land an hour off but time.Date day arithmetic does not.
- Plan: formatAgentLimitMessage now branches on cls.Kind for the user-facing label so a Gemini/Codex KindQuota abort says "quota limit" instead of "session limit".
- Plan: TestRunFixBatchAbortsOnQuotaError uses batchSize=1 (forces one job per agent call so agentCalls=1 unambiguously means abort) and drops the brittle "second job not reviewed" assertion that would fail because runFixBatch prefetches all reviews up-front.
- Spec: replace "try again at 17:42 UTC" with "try again at 17:42" and document that timezone suffixes are out of scope until a real session-cap message is captured.
Closes review jobs 18483, 18487, 18498. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Test asserts cooldown + retry_count==0 with no backup configured, so the failover path is not exercised. Rename clarifies the actual behavior under test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refactor failOrRetryInner to classify the agent error once at the top of the agent-error path via a switch over Classification.Kind. The existing KindQuota/KindSession branch is preserved verbatim. KindNone now emits a single WARN line (with newlines flattened and truncated to 200 runes) when errorMsg is non-empty, so future unmatched session-cap or quota wordings get captured for adding a rule. KindTransient and KindNone both fall through to the unchanged isContextWindowError check and retry path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After fixJobDirect returns an error, classify it via opts.classify and abort the loop with an agentLimitError on KindQuota/KindSession. Print a warning on KindNone (mirroring the daemon-side WARN). Other kinds (KindTransient) fall through to the existing return path. Removes the //nolint:unused directives from agentLimitError, its Error() method, formatAgentLimitMessage, and agentLimitLabel now that they are wired into the abort path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- 18506 (worker_test.go): tighten unclassified-agent log assertion to "from test:" so it cannot pass on the workerID's "test-worker" substring.
- 18513 (fix.go runFixWithSeen): detect agentLimitError via errors.As before discovery-mode warn-and-continue, otherwise quota/session aborts get demoted to warnings and the re-query loop keeps invoking the exhausted agent until every queued job is burned through.
- Regression test TestRunFixWithSeenDiscoveryAbortsOnAgentLimit covers the discovery path: KindQuota classification produces an agentLimitError that propagates out of seen != nil mode without marking the second job seen.
Closes review jobs 18506 and 18513.
The retry/failover section still pointed at the removed isQuotaError helper. Update it to describe the agentlimit.Classify-based path with the new KindQuota/KindSession kinds and ParseResetDuration/ParseResetTime helpers that own the cooldown extraction.
The classifier only had two consumers (the daemon worker and the foreground fix loop), so a separate package was overkill. Move the files into internal/agent/ and rename the exported names with a Limit prefix (LimitKind, LimitClassification, LimitClassifier, ClassifyLimit, LimitKindNone/Transient/Quota/Session). ParseResetDuration and ParseResetTime keep their names; they were already specific. Drop the internal/agentlimit/ directory.
Address review finding 18515 (medium): fixJobDirect's commit-retry swallowed any retry error as a "Warning: commit agent failed:" line, so a quota or session-limit hit on that retry slipped past the classifier and the fix loop kept invoking the exhausted agent. Plumb agent.ClassifyLimit through fixJobParams.Classify (defaulting when nil so existing direct callers do not break) and abort with *agentLimitError on LimitKindQuota / LimitKindSession; non-limit retry errors keep today's warn-and-continue. fixSingleJob and runFixBatch now short-circuit on a returned *agentLimitError before re-classifying err.Error(), since the formatted user-facing message is no longer the original agent error.
Regression test TestFixJobDirect_RetryClassifiesAgentLimit drives the retry path with a FakeAgent that succeeds-with-uncommitted-changes on the first call and returns a synthetic session-cap on the second; asserts *agentLimitError propagates and the formatted message identifies the agent and limit kind.
Reviewers keep raising "function X not defined" or "package Y not imported" findings against diffs that build cleanly — the symbol just lives elsewhere in the file or package, which the reviewer cannot see. Add a guideline that explicitly excludes compile-level findings and points at the local toolchain (go build, go vet, golangci-lint, the pre-commit hook) as the source of truth.
defaultLimitRules contains only the nine quota substrings inherited from isQuotaError; nothing in production produces LimitKindSession today. Claude's 5-hour cap will fall through to the unmatched-error path until a real session-cap message is captured and a narrow rule is added. Update the LimitKindSession const comment, the defaultLimitRules comment (with a TODO and a note on why speculative substrings are not added), and the CLAUDE.md cooldown reference so the daemon-side documentation reflects what the production classifier actually emits today vs. what the plumbing supports once a rule lands. Addresses the medium review finding about session-limit handling being unreachable in production.
Two findings on commit 657052b:
1. fixJobDirect's retry-path limit abort lost the "Changes were made but not committed" warning that the success path emits. If the first agent call left changes and the commit retry hit a quota or session limit, the user got only a cooldown message and a dirty tree. Print the same warning to out before returning the limit error so the user sees both states.
2. ParseResetDuration and parseResetTimeAt computed indices against strings.ToLower(errMsg) and then sliced errMsg with those indices. A rune like "İ" U+0130 shortens to "i" under ToLower (2 bytes → 1 byte), so the indices can land in the middle of a UTF-8 sequence or extract the wrong substring. Slice the lowercased copy instead. time.ParseDuration and time.Parse work fine with lowercase unit suffixes, so the lowercase format strings are sufficient (drop the redundant uppercase "3:04 PM" / "3:04PM" entries).
Add regression tests:
- TestParseResetDurationToLowerByteSafety with "İ reset after 30m"
- TestParseResetTimeToLowerByteSafety with "İ resets at 5:42 PM"
- TestFixJobDirect_RetryClassifiesAgentLimit captures fixJobDirect's output via bytes.Buffer and asserts the dirty-tree warning fires alongside the agentLimitError propagation.
unrelated CI flake
Summary
- Shared limit classifier in internal/agent (LimitKind, LimitClassification, LimitClassifier, ClassifyLimit, plus ParseResetDuration / ParseResetTime). The default ruleset reproduces the nine quota substrings the daemon worker already used; reset-time parsing handles relative (reset after 48m) and absolute (resets at 5:42 PM) hints, and rolls forward via time.Date(..., Day()+1, ...) so wall-clock time is preserved across DST transitions.
- The daemon worker calls agent.ClassifyLimit via an injectable WorkerPool.classify field. Cooldown duration order is unchanged: defaultCooldown → Classification.CooldownFor → time.Until(ResetAt). LimitKindQuota cools down and skips retries; LimitKindNone now logs a one-line WARN with a truncated error preview so unmatched agent errors are visible.
- Adds agentLimitError, formatAgentLimitMessage, and agentLimitLabel in cmd/roborev/fix.go. The foreground roborev fix loop classifies each fixJobDirect error and aborts with agentLimitError on LimitKindQuota / LimitKindSession; LimitKindNone prints a stderr warning. The abort path is wired into fixSingleJob, runFixBatch, and runFixWithSeen (so discovery mode no longer demotes quota aborts to per-job warnings).
- fixJobDirect's commit-retry path also classifies its own error, so a quota or session limit on the second Agent.Review call returns *agentLimitError instead of being swallowed as a "Warning: commit agent failed:" line. The classifier is plumbed through fixJobParams.Classify (defaults to agent.ClassifyLimit when nil).
- LimitKindSession plumbing (cooldown on the daemon side, abort on the CLI side) is in place, but no production rule produces it yet: Claude's 5-hour cap falls through to the unmatched-error path until a real session-cap message is captured. Speculative substrings are deliberately not added to defaultLimitRules because they would also match policy errors and transient 429s, and a false positive would abort roborev fix and cool down the agent when retrying might have worked. Tests exercise the session-limit path via an injected classifier.
- Replaces isQuotaError, parseQuotaCooldown, minCooldown, and maxCooldown with the classifier-backed flow.
- Updates the .roborev.toml review guidelines to tell reviewers not to flag suspected compile errors, missing imports, or undeclared identifiers: the local toolchain catches those before merge, and reviewers cannot see the rest of the package.