Abort foreground fix loop on agent quota or session limit by wesm · Pull Request #696 · roborev-dev/roborev

wesm · 2026-05-05T21:47:37Z

Summary

Add a rate-limit classifier inside internal/agent (LimitKind, LimitClassification, LimitClassifier, ClassifyLimit, plus ParseResetDuration / ParseResetTime). The default ruleset reproduces the nine quota substrings the daemon worker already used; reset-time parsing handles relative (reset after 48m) and absolute (resets at 5:42 PM) hints, and rolls forward via time.Date(..., Day()+1, ...) so wall-clock time is preserved across DST transitions.
Route the daemon worker's quota detection through agent.ClassifyLimit via an injectable WorkerPool.classify field. Cooldown duration order is unchanged: defaultCooldown → Classification.CooldownFor → time.Until(ResetAt). LimitKindQuota cools down and skips retries; LimitKindNone now logs a one-line WARN with a truncated error preview so unmatched agent errors are visible.
Add agentLimitError, formatAgentLimitMessage, and agentLimitLabel in cmd/roborev/fix.go. The foreground roborev fix loop classifies each fixJobDirect error and aborts with agentLimitError on LimitKindQuota / LimitKindSession; LimitKindNone prints a stderr warning. The abort path is wired into fixSingleJob, runFixBatch, and runFixWithSeen (so discovery mode no longer demotes quota aborts to per-job warnings).
fixJobDirect's commit-retry path also classifies its own error so a quota or session limit on the second Agent.Review call returns *agentLimitError instead of being swallowed as a Warning: commit agent failed: line. The classifier is plumbed through fixJobParams.Classify (defaults to agent.ClassifyLimit when nil).
LimitKindSession plumbing (cooldown on the daemon side, abort on the CLI side) is in place but is not produced by any production rule yet — Claude's 5-hour cap falls through to the unmatched-error path until a real session-cap message is captured. Speculative substrings are deliberately not added to defaultLimitRules because they would also match policy errors and transient 429s, and a false positive would abort roborev fix and cool down the agent when retrying might have worked. Tests exercise the session-limit path via an injected classifier.
Replace the daemon's isQuotaError, parseQuotaCooldown, minCooldown, and maxCooldown with the classifier-backed flow.
Tighten .roborev.toml review guidelines to tell reviewers not to flag suspected compile errors, missing imports, or undeclared identifiers — the local toolchain catches those before merge, and reviewers cannot see the rest of the package.

Shared classifier in a new internal/agentlimit package, used by the daemon worker (preserves cooldown + failover for daemon jobs) and the foreground roborev fix loop (strict abort with reset time). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Corrects daemon retry semantics (quota errors skip retries entirely and go straight to failoverOrFail), completes the day-1 pattern table to match the existing isQuotaError set bit-for-bit, adds an unexported classifyWithRules helper plus injectable classifier function on WorkerPool / fix-loop options for test isolation, and clarifies that PR 1 ships the framework while a Claude-specific rule follows once a real session-cap message is captured. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The "no behavior change" claim was wrong: foreground roborev fix gains strict abort for KindQuota in PR 1, which changes outcomes for Gemini and Codex sessions too. Spelled out explicitly. Also dropped the proposed exported test helper — Classification and the Kind* constants are already exported, so cross-package tests can return a literal struct from their stub classifier without extra plumbing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Eleven TDD-driven tasks splitting the work into a shared agentlimit package, daemon-worker integration that preserves existing Gemini/Codex behavior bit-for-bit, and CLI fix-loop abort plumbing exercised against an injected stub classifier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ooldown

…esetTime Replace candidate.Add(24*time.Hour) with time.Date(..., Day()+1, ...) so the rollover stays at the same wall-clock time across DST transitions (the 24h-add lands an hour off in March/November). Switch candidate.Before(now) to !candidate.After(now) so equality also rolls forward — avoids returning a zero-duration cooldown when the agent's clock matches ours to the second. Adds tests for the rollover boundary, timezone propagation, and the conservative trailing-content behavior that documents what the parser does NOT handle yet (deferred until a real Claude message is captured).

- ParseResetTime doc comment was stale after the DST fix; update to describe at-or-before rollover and DST-safe wall-clock semantics. - TestParseResetTimeAcrossDST: real IANA-zone test (America/New_York) that exercises the spring-forward and fall-back transitions where Add(24h) would land an hour off but time.Date day arithmetic does not. - Plan: formatAgentLimitMessage now branches on cls.Kind for the user- facing label so a Gemini/Codex KindQuota abort says "quota limit" instead of "session limit". - Plan: TestRunFixBatchAbortsOnQuotaError uses batchSize=1 (forces one job per agent call so agentCalls=1 unambiguously means abort) and drops the brittle "second job not reviewed" assertion that would fail because runFixBatch prefetches all reviews up-front. - Spec: replace "try again at 17:42 UTC" with "try again at 17:42" and document that timezone suffixes are out of scope until a real session-cap message is captured. Closes review jobs 18483, 18487, 18498. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Test asserts cooldown + retry_count==0 with no backup configured, so the failover path is not exercised. Rename clarifies the actual behavior under test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Refactor failOrRetryInner to classify the agent error once at the top of the agent-error path via a switch over Classification.Kind. The existing KindQuota/KindSession branch is preserved verbatim. KindNone now emits a single WARN line (with newlines flattened and truncated to 200 runes) when errorMsg is non-empty, so future unmatched session-cap or quota wordings get captured for adding a rule. KindTransient and KindNone both fall through to the unchanged isContextWindowError check and retry path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After fixJobDirect returns an error, classify it via opts.classify and abort the loop with an agentLimitError on KindQuota/KindSession. Print a warning on KindNone (mirroring the daemon-side WARN). Other kinds (KindTransient) fall through to the existing return path. Removes the //nolint:unused directives from agentLimitError, its Error() method, formatAgentLimitMessage, and agentLimitLabel now that they are wired into the abort path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- 18506 (worker_test.go): tighten unclassified-agent log assertion to "from test:" so it cannot pass on the workerID's "test-worker" substring. - 18513 (fix.go runFixWithSeen): detect agentLimitError via errors.As before discovery-mode warn-and-continue, otherwise quota/session aborts get demoted to warnings and the re-query loop keeps invoking the exhausted agent until every queued job is burned through. - Regression test TestRunFixWithSeenDiscoveryAbortsOnAgentLimit covers the discovery path: KindQuota classification produces an agentLimitError that propagates out of seen != nil mode without marking the second job seen. Closes review jobs 18506 and 18513.

The retry/failover section still pointed at the removed isQuotaError helper. Update it to describe the agentlimit.Classify-based path with the new KindQuota/KindSession kinds and ParseResetDuration/ParseResetTime helpers that own the cooldown extraction.

roborev-ci · 2026-05-05T21:50:04Z

roborev: Combined Review (`a0bb883`)

Review verdict: one medium issue remains; no high or critical findings were reported.

Medium

cmd/roborev/fix.go:272 - The second params.Agent.Review call used for commit instructions still swallows agent errors as a warning, so quota/session-limit failures on that retry path bypass the new classifier and the fix loop can continue after the agent is exhausted.

Suggested fix: Propagate that retry error, or classify it at the call site and return agentLimitError for KindQuota / KindSession. Add a test covering a limit error on the commit retry path.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

codecov-commenter · 2026-05-05T21:54:50Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 85.49223% with 28 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
cmd/roborev/fix.go	80.64%	15 Missing and 3 partials ⚠️
internal/daemon/worker.go	82.60%	3 Missing and 1 partial ⚠️
internal/agent/limit.go	86.95%	2 Missing and 1 partial ⚠️
internal/agent/limit_parse.go	94.44%	2 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

roborev-ci · 2026-05-05T21:57:46Z

roborev: Combined Review (`a3f244b`)

High-risk compile errors and one runtime parsing bug need fixing before merge.

High

cmd/roborev/fix.go:506
The new errors.As call uses the errors package, but errors was not added to the import block, so cmd/roborev will not compile.
Fix: Add errors to the imports in cmd/roborev/fix.go.
cmd/roborev/fix.go:1027, cmd/roborev/fix.go:1288, internal/daemon/worker.go:805
The code calls agent.CanonicalName, but no exported CanonicalName function exists in internal/agent; the existing alias resolver is unexported. This creates undefined-symbol compile failures.
Fix: Add an exported CanonicalName helper in internal/agent that wraps alias resolution, or use already canonical agent names at these call sites.
internal/agentlimit/parse.go:26, internal/agentlimit/parse.go:74
The parser finds byte indexes in the lowercased string but slices the original mixed-case string. Unicode case changes can misalign byte indexes and cause a runtime panic. It also preserves uppercase duration units such as 30M, which time.ParseDuration rejects.
Fix: Slice from the lowercased string, e.g. rest := lower[idx+len("reset after "):] and rest := lower[idx:].

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

The classifier only had two consumers (the daemon worker and the foreground fix loop), so a separate package was overkill. Move the files into internal/agent/ and rename the exported names with a Limit prefix (LimitKind, LimitClassification, LimitClassifier, ClassifyLimit, LimitKindNone/Transient/Quota/Session). ParseResetDuration and ParseResetTime keep their names; they were already specific. Drop the internal/agentlimit/ directory. Address review finding 18515 (medium): fixJobDirect's commit-retry swallowed any retry error as a "Warning: commit agent failed:" line, so a quota or session-limit hit on that retry slipped past the classifier and the fix loop kept invoking the exhausted agent. Plumb agent.ClassifyLimit through fixJobParams.Classify (defaulting when nil so existing direct callers do not break) and abort with *agentLimitError on LimitKindQuota / LimitKindSession; non-limit retry errors keep today's warn-and-continue. fixSingleJob and runFixBatch now short-circuit on a returned *agentLimitError before re-classifying err.Error(), since the formatted user-facing message is no longer the original agent error. Regression test TestFixJobDirect_RetryClassifiesAgentLimit drives the retry path with a FakeAgent that succeeds-with-uncommitted- changes on the first call and returns a synthetic session-cap on the second; asserts *agentLimitError propagates and the formatted message identifies the agent and limit kind.

roborev-ci · 2026-05-05T22:07:13Z

roborev: Combined Review (`3fecb4b`)

Build-breaking issues remain in the PR.

High

cmd/roborev/fix.go:409, internal/daemon/worker.go:803
Calls were added to agent.CanonicalName, but internal/agent does not export a CanonicalName function. This will fail to compile. Add the exported helper or use the existing alias-resolution path.
cmd/roborev/fix.go:520
errors.As is used without importing the errors package, which also breaks compilation. Add errors to the imports in cmd/roborev/fix.go.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Reviewers keep raising "function X not defined" or "package Y not imported" findings against diffs that build cleanly — the symbol just lives elsewhere in the file or package, which the reviewer cannot see. Add a guideline that explicitly excludes compile-level findings and points at the local toolchain (go build, go vet, golangci-lint, the pre-commit hook) as the source of truth.

roborev-ci · 2026-05-05T22:12:32Z

roborev: Combined Review (`a7ac2c1`)

Medium: session-limit handling is unreachable in production.

Medium

internal/agent/limit.go:38 - Production ClassifyLimit never returns LimitKindSession because defaultLimitRules only includes quota patterns and omits Claude session-cap patterns. As a result, the new daemon/fix handling for session limits is only exercised by tests that inject a classifier; real session-cap failures will still follow the normal retry/continue path.
- Fix: Add production session-cap rule(s) for the observed Claude error wording with a ClassifyLimit test fixture, or remove the advertised/session-specific handling until a real pattern is supported.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

defaultLimitRules contains only the nine quota substrings inherited from isQuotaError; nothing in production produces LimitKindSession today. Claude's 5-hour cap will fall through to the unmatched-error path until a real session-cap message is captured and a narrow rule is added. Update the LimitKindSession const comment, the defaultLimitRules comment (with a TODO and a note on why speculative substrings are not added), and the CLAUDE.md cooldown reference so the daemon-side documentation reflects what the production classifier actually emits today vs. what the plumbing supports once a rule lands. Addresses the medium review finding about session-limit handling being unreachable in production.

roborev-ci · 2026-05-05T23:49:12Z

roborev: Combined Review (`657052b`)

Verdict: Two medium-severity correctness issues need attention before merge.

Medium

cmd/roborev/fix.go:410
- Problem: If the first fix pass leaves uncommitted changes and the commit-retry pass hits a quota/session limit, fixJobDirect returns agentLimitError immediately. That bypasses the existing caller path that warns users when changes were made but not committed, leaving a dirty worktree with only a cooldown message.
- Fix: Before returning the limit error from the retry path, preserve or emit the uncommitted-changes warning, or include that state in the returned error/result so callers can report it.
internal/agent/limit_parse.go:26, internal/agent/limit_parse.go:72
- Problem: The code slices errMsg using byte indices derived from lower := strings.ToLower(errMsg). ToLower can change byte length for some Unicode characters, so those indices may not safely map back to the original string and could panic or produce invalid strings.
- Fix: Slice lower instead of errMsg, for example rest := lower[idx+len("reset after "):] and rest := lower[idx:].

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Two findings on commit 657052b: 1. fixJobDirect's retry-path limit abort lost the "Changes were made but not committed" warning that the success path emits. If the first agent call left changes and the commit retry hit a quota or session limit, the user got only a cooldown message and a dirty tree. Print the same warning to out before returning the limit error so the user sees both states. 2. ParseResetDuration and parseResetTimeAt computed indices against strings.ToLower(errMsg) and then sliced errMsg with those indices. A rune like "İ" U+0130 shortens to "i" under ToLower (2 bytes → 1 byte), so the indices can land in the middle of a UTF-8 sequence or extract the wrong substring. Slice the lowercased copy instead. time.ParseDuration and time.Parse work fine with lowercase unit suffixes, so the lowercase format strings are sufficient (drop the redundant uppercase "3:04 PM" / "3:04PM" entries). Add regression tests: - TestParseResetDurationToLowerByteSafety with "İ reset after 30m" - TestParseResetTimeToLowerByteSafety with "İ resets at 5:42 PM" - TestFixJobDirect_RetryClassifiesAgentLimit captures fixJobDirect's output via bytes.Buffer and asserts the dirty-tree warning fires alongside the agentLimitError propagation.

roborev-ci · 2026-05-05T23:58:03Z

roborev: Combined Review (`333420a`)

No Medium, High, or Critical findings were reported; code looks clean for PR-blocking review purposes.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

wesm · 2026-05-06T00:15:25Z

unrelated CI flake

wesm and others added 19 commits May 5, 2026 12:57

feat(agentlimit): create package with Kind and Classification types

2115905

feat(agentlimit): port reset-duration parsing from worker.parseQuotaC…

d580130

…ooldown

feat(agentlimit): add absolute reset-time parser with same-day rollover

c00b2e9

feat(agentlimit): add Classify with nine production quota patterns

44f74f7

refactor(daemon): route quota detection through internal/agentlimit

62e7aca

test(daemon): cover session-limit cooldown via injected classifier

b34ea49

feat(fix): add classifier hook and agentLimitError plumbing

073f816

feat(fix): abort runFixBatch on KindQuota or KindSession

7823a65

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

remove specs

a3f244b

wesm merged commit d306c1e into main May 6, 2026
7 of 8 checks passed

wesm deleted the fix/fix-rate-limit-degradation branch May 6, 2026 00:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Abort foreground fix loop on agent quota or session limit#696

Abort foreground fix loop on agent quota or session limit#696
wesm merged 24 commits intomainfrom
fix/fix-rate-limit-degradation

wesm commented May 5, 2026 •

edited

Loading

Uh oh!

roborev-ci Bot commented May 5, 2026

Uh oh!

codecov-commenter commented May 5, 2026 •

edited

Loading

Uh oh!

roborev-ci Bot commented May 5, 2026

Uh oh!

roborev-ci Bot commented May 5, 2026

Uh oh!

roborev-ci Bot commented May 5, 2026

Uh oh!

roborev-ci Bot commented May 5, 2026

Uh oh!

roborev-ci Bot commented May 5, 2026

Uh oh!

wesm commented May 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wesm commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

roborev-ci Bot commented May 5, 2026

roborev: Combined Review (a0bb883)

Medium

Uh oh!

codecov-commenter commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

roborev-ci Bot commented May 5, 2026

roborev: Combined Review (a3f244b)

High

Uh oh!

roborev-ci Bot commented May 5, 2026

roborev: Combined Review (3fecb4b)

High

Uh oh!

roborev-ci Bot commented May 5, 2026

roborev: Combined Review (a7ac2c1)

Medium

Uh oh!

roborev-ci Bot commented May 5, 2026

roborev: Combined Review (657052b)

Medium

Uh oh!

roborev-ci Bot commented May 5, 2026

roborev: Combined Review (333420a)

Uh oh!

wesm commented May 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

wesm commented May 5, 2026 •

edited

Loading

roborev: Combined Review (`a0bb883`)

codecov-commenter commented May 5, 2026 •

edited

Loading

roborev: Combined Review (`a3f244b`)

roborev: Combined Review (`3fecb4b`)

roborev: Combined Review (`a7ac2c1`)

roborev: Combined Review (`657052b`)

roborev: Combined Review (`333420a`)