Skip to content

Abort foreground fix loop on agent quota or session limit#696

Merged
wesm merged 24 commits intomainfrom
fix/fix-rate-limit-degradation
May 6, 2026
Merged

Abort foreground fix loop on agent quota or session limit#696
wesm merged 24 commits intomainfrom
fix/fix-rate-limit-degradation

Conversation

@wesm
Copy link
Copy Markdown
Collaborator

@wesm wesm commented May 5, 2026

Summary

  • Add a rate-limit classifier inside internal/agent (LimitKind, LimitClassification, LimitClassifier, ClassifyLimit, plus ParseResetDuration / ParseResetTime). The default ruleset reproduces the nine quota substrings the daemon worker already used; reset-time parsing handles relative (reset after 48m) and absolute (resets at 5:42 PM) hints, and rolls forward via time.Date(..., Day()+1, ...) so wall-clock time is preserved across DST transitions.
  • Route the daemon worker's quota detection through agent.ClassifyLimit via an injectable WorkerPool.classify field. Cooldown duration order is unchanged: defaultCooldownClassification.CooldownFortime.Until(ResetAt). LimitKindQuota cools down and skips retries; LimitKindNone now logs a one-line WARN with a truncated error preview so unmatched agent errors are visible.
  • Add agentLimitError, formatAgentLimitMessage, and agentLimitLabel in cmd/roborev/fix.go. The foreground roborev fix loop classifies each fixJobDirect error and aborts with agentLimitError on LimitKindQuota / LimitKindSession; LimitKindNone prints a stderr warning. The abort path is wired into fixSingleJob, runFixBatch, and runFixWithSeen (so discovery mode no longer demotes quota aborts to per-job warnings).
  • fixJobDirect's commit-retry path also classifies its own error so a quota or session limit on the second Agent.Review call returns *agentLimitError instead of being swallowed as a Warning: commit agent failed: line. The classifier is plumbed through fixJobParams.Classify (defaults to agent.ClassifyLimit when nil).
  • LimitKindSession plumbing (cooldown on the daemon side, abort on the CLI side) is in place but is not produced by any production rule yet — Claude's 5-hour cap falls through to the unmatched-error path until a real session-cap message is captured. Speculative substrings are deliberately not added to defaultLimitRules because they would also match policy errors and transient 429s, and a false positive would abort roborev fix and cool down the agent when retrying might have worked. Tests exercise the session-limit path via an injected classifier.
  • Replace the daemon's isQuotaError, parseQuotaCooldown, minCooldown, and maxCooldown with the classifier-backed flow.
  • Tighten .roborev.toml review guidelines to tell reviewers not to flag suspected compile errors, missing imports, or undeclared identifiers — the local toolchain catches those before merge, and reviewers cannot see the rest of the package.

wesm and others added 19 commits May 5, 2026 12:57
Shared classifier in a new internal/agentlimit package, used by the
daemon worker (preserves cooldown + failover for daemon jobs) and the
foreground roborev fix loop (strict abort with reset time).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Corrects daemon retry semantics (quota errors skip retries entirely
and go straight to failoverOrFail), completes the day-1 pattern table
to match the existing isQuotaError set bit-for-bit, adds an
unexported classifyWithRules helper plus injectable classifier
function on WorkerPool / fix-loop options for test isolation, and
clarifies that PR 1 ships the framework while a Claude-specific rule
follows once a real session-cap message is captured.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "no behavior change" claim was wrong: foreground roborev fix
gains strict abort for KindQuota in PR 1, which changes outcomes for
Gemini and Codex sessions too. Spelled out explicitly. Also dropped
the proposed exported test helper — Classification and the Kind*
constants are already exported, so cross-package tests can return a
literal struct from their stub classifier without extra plumbing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Eleven TDD-driven tasks splitting the work into a shared agentlimit
package, daemon-worker integration that preserves existing
Gemini/Codex behavior bit-for-bit, and CLI fix-loop abort plumbing
exercised against an injected stub classifier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…esetTime

Replace candidate.Add(24*time.Hour) with time.Date(..., Day()+1, ...)
so the rollover stays at the same wall-clock time across DST
transitions (the 24h-add lands an hour off in March/November).

Switch candidate.Before(now) to !candidate.After(now) so equality
also rolls forward — avoids returning a zero-duration cooldown when
the agent's clock matches ours to the second.

Adds tests for the rollover boundary, timezone propagation, and the
conservative trailing-content behavior that documents what the
parser does NOT handle yet (deferred until a real Claude message is
captured).
- ParseResetTime doc comment was stale after the DST fix; update to
  describe at-or-before rollover and DST-safe wall-clock semantics.
- TestParseResetTimeAcrossDST: real IANA-zone test (America/New_York)
  that exercises the spring-forward and fall-back transitions where
  Add(24h) would land an hour off but time.Date day arithmetic does not.
- Plan: formatAgentLimitMessage now branches on cls.Kind for the user-
  facing label so a Gemini/Codex KindQuota abort says "quota limit"
  instead of "session limit".
- Plan: TestRunFixBatchAbortsOnQuotaError uses batchSize=1 (forces one
  job per agent call so agentCalls=1 unambiguously means abort) and
  drops the brittle "second job not reviewed" assertion that would
  fail because runFixBatch prefetches all reviews up-front.
- Spec: replace "try again at 17:42 UTC" with "try again at 17:42" and
  document that timezone suffixes are out of scope until a real
  session-cap message is captured.

Closes review jobs 18483, 18487, 18498.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Test asserts cooldown + retry_count==0 with no backup configured, so
the failover path is not exercised. Rename clarifies the actual
behavior under test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refactor failOrRetryInner to classify the agent error once at the top
of the agent-error path via a switch over Classification.Kind. The
existing KindQuota/KindSession branch is preserved verbatim. KindNone
now emits a single WARN line (with newlines flattened and truncated to
200 runes) when errorMsg is non-empty, so future unmatched session-cap
or quota wordings get captured for adding a rule. KindTransient and
KindNone both fall through to the unchanged isContextWindowError check
and retry path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After fixJobDirect returns an error, classify it via opts.classify and
abort the loop with an agentLimitError on KindQuota/KindSession. Print
a warning on KindNone (mirroring the daemon-side WARN). Other kinds
(KindTransient) fall through to the existing return path.

Removes the //nolint:unused directives from agentLimitError, its
Error() method, formatAgentLimitMessage, and agentLimitLabel now that
they are wired into the abort path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- 18506 (worker_test.go): tighten unclassified-agent log assertion to
  "from test:" so it cannot pass on the workerID's "test-worker"
  substring.
- 18513 (fix.go runFixWithSeen): detect agentLimitError via errors.As
  before discovery-mode warn-and-continue, otherwise quota/session
  aborts get demoted to warnings and the re-query loop keeps invoking
  the exhausted agent until every queued job is burned through.
- Regression test TestRunFixWithSeenDiscoveryAbortsOnAgentLimit covers
  the discovery path: KindQuota classification produces an
  agentLimitError that propagates out of seen != nil mode without
  marking the second job seen.

Closes review jobs 18506 and 18513.
The retry/failover section still pointed at the removed isQuotaError
helper. Update it to describe the agentlimit.Classify-based path with
the new KindQuota/KindSession kinds and ParseResetDuration/ParseResetTime
helpers that own the cooldown extraction.
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 5, 2026

roborev: Combined Review (a0bb883)

Review verdict: one medium issue remains; no high or critical findings were reported.

Medium

  • cmd/roborev/fix.go:272 - The second params.Agent.Review call used for commit instructions still swallows agent errors as a warning, so quota/session-limit failures on that retry path bypass the new classifier and the fix loop can continue after the agent is exhausted.

    Suggested fix: Propagate that retry error, or classify it at the call site and return agentLimitError for KindQuota / KindSession. Add a test covering a limit error on the commit retry path.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 5, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 85.49223% with 28 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
cmd/roborev/fix.go 80.64% 15 Missing and 3 partials ⚠️
internal/daemon/worker.go 82.60% 3 Missing and 1 partial ⚠️
internal/agent/limit.go 86.95% 2 Missing and 1 partial ⚠️
internal/agent/limit_parse.go 94.44% 2 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 5, 2026

roborev: Combined Review (a3f244b)

High-risk compile errors and one runtime parsing bug need fixing before merge.

High

  • cmd/roborev/fix.go:506
    The new errors.As call uses the errors package, but errors was not added to the import block, so cmd/roborev will not compile.
    Fix: Add errors to the imports in cmd/roborev/fix.go.

  • cmd/roborev/fix.go:1027, cmd/roborev/fix.go:1288, internal/daemon/worker.go:805
    The code calls agent.CanonicalName, but no exported CanonicalName function exists in internal/agent; the existing alias resolver is unexported. This creates undefined-symbol compile failures.
    Fix: Add an exported CanonicalName helper in internal/agent that wraps alias resolution, or use already canonical agent names at these call sites.

  • internal/agentlimit/parse.go:26, internal/agentlimit/parse.go:74
    The parser finds byte indexes in the lowercased string but slices the original mixed-case string. Unicode case changes can misalign byte indexes and cause a runtime panic. It also preserves uppercase duration units such as 30M, which time.ParseDuration rejects.
    Fix: Slice from the lowercased string, e.g. rest := lower[idx+len("reset after "):] and rest := lower[idx:].


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

The classifier only had two consumers (the daemon worker and the
foreground fix loop), so a separate package was overkill. Move the
files into internal/agent/ and rename the exported names with a Limit
prefix (LimitKind, LimitClassification, LimitClassifier, ClassifyLimit,
LimitKindNone/Transient/Quota/Session). ParseResetDuration and
ParseResetTime keep their names; they were already specific. Drop
the internal/agentlimit/ directory.

Address review finding 18515 (medium): fixJobDirect's commit-retry
swallowed any retry error as a "Warning: commit agent failed:" line,
so a quota or session-limit hit on that retry slipped past the
classifier and the fix loop kept invoking the exhausted agent. Plumb
agent.ClassifyLimit through fixJobParams.Classify (defaulting when
nil so existing direct callers do not break) and abort with
*agentLimitError on LimitKindQuota / LimitKindSession; non-limit
retry errors keep today's warn-and-continue.

fixSingleJob and runFixBatch now short-circuit on a returned
*agentLimitError before re-classifying err.Error(), since the
formatted user-facing message is no longer the original agent
error.

Regression test TestFixJobDirect_RetryClassifiesAgentLimit drives
the retry path with a FakeAgent that succeeds-with-uncommitted-
changes on the first call and returns a synthetic session-cap on
the second; asserts *agentLimitError propagates and the formatted
message identifies the agent and limit kind.
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 5, 2026

roborev: Combined Review (3fecb4b)

Build-breaking issues remain in the PR.

High

  • cmd/roborev/fix.go:409, internal/daemon/worker.go:803
    Calls were added to agent.CanonicalName, but internal/agent does not export a CanonicalName function. This will fail to compile. Add the exported helper or use the existing alias-resolution path.

  • cmd/roborev/fix.go:520
    errors.As is used without importing the errors package, which also breaks compilation. Add errors to the imports in cmd/roborev/fix.go.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Reviewers keep raising "function X not defined" or "package Y not
imported" findings against diffs that build cleanly — the symbol just
lives elsewhere in the file or package, which the reviewer cannot see.
Add a guideline that explicitly excludes compile-level findings and
points at the local toolchain (go build, go vet, golangci-lint, the
pre-commit hook) as the source of truth.
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 5, 2026

roborev: Combined Review (a7ac2c1)

Medium: session-limit handling is unreachable in production.

Medium

  • internal/agent/limit.go:38 - Production ClassifyLimit never returns LimitKindSession because defaultLimitRules only includes quota patterns and omits Claude session-cap patterns. As a result, the new daemon/fix handling for session limits is only exercised by tests that inject a classifier; real session-cap failures will still follow the normal retry/continue path.
    • Fix: Add production session-cap rule(s) for the observed Claude error wording with a ClassifyLimit test fixture, or remove the advertised/session-specific handling until a real pattern is supported.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

defaultLimitRules contains only the nine quota substrings inherited
from isQuotaError; nothing in production produces LimitKindSession
today. Claude's 5-hour cap will fall through to the unmatched-error
path until a real session-cap message is captured and a narrow rule
is added.

Update the LimitKindSession const comment, the defaultLimitRules
comment (with a TODO and a note on why speculative substrings are
not added), and the CLAUDE.md cooldown reference so the daemon-side
documentation reflects what the production classifier actually
emits today vs. what the plumbing supports once a rule lands.

Addresses the medium review finding about session-limit handling
being unreachable in production.
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 5, 2026

roborev: Combined Review (657052b)

Verdict: Two medium-severity correctness issues need attention before merge.

Medium

  • cmd/roborev/fix.go:410

    • Problem: If the first fix pass leaves uncommitted changes and the commit-retry pass hits a quota/session limit, fixJobDirect returns agentLimitError immediately. That bypasses the existing caller path that warns users when changes were made but not committed, leaving a dirty worktree with only a cooldown message.
    • Fix: Before returning the limit error from the retry path, preserve or emit the uncommitted-changes warning, or include that state in the returned error/result so callers can report it.
  • internal/agent/limit_parse.go:26, internal/agent/limit_parse.go:72

    • Problem: The code slices errMsg using byte indices derived from lower := strings.ToLower(errMsg). ToLower can change byte length for some Unicode characters, so those indices may not safely map back to the original string and could panic or produce invalid strings.
    • Fix: Slice lower instead of errMsg, for example rest := lower[idx+len("reset after "):] and rest := lower[idx:].

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Two findings on commit 657052b:

1. fixJobDirect's retry-path limit abort lost the "Changes were made
   but not committed" warning that the success path emits. If the
   first agent call left changes and the commit retry hit a quota or
   session limit, the user got only a cooldown message and a dirty
   tree. Print the same warning to out before returning the limit
   error so the user sees both states.

2. ParseResetDuration and parseResetTimeAt computed indices against
   strings.ToLower(errMsg) and then sliced errMsg with those indices.
   A rune like "İ" U+0130 shortens to "i" under ToLower (2 bytes →
   1 byte), so the indices can land in the middle of a UTF-8 sequence
   or extract the wrong substring. Slice the lowercased copy instead.
   time.ParseDuration and time.Parse work fine with lowercase unit
   suffixes, so the lowercase format strings are sufficient (drop
   the redundant uppercase "3:04 PM" / "3:04PM" entries).

Add regression tests:
- TestParseResetDurationToLowerByteSafety with "İ reset after 30m"
- TestParseResetTimeToLowerByteSafety with "İ resets at 5:42 PM"
- TestFixJobDirect_RetryClassifiesAgentLimit captures fixJobDirect's
  output via bytes.Buffer and asserts the dirty-tree warning fires
  alongside the agentLimitError propagation.
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 5, 2026

roborev: Combined Review (333420a)

No Medium, High, or Critical findings were reported; code looks clean for PR-blocking review purposes.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@wesm
Copy link
Copy Markdown
Collaborator Author

wesm commented May 6, 2026

unrelated CI flake

@wesm wesm merged commit d306c1e into main May 6, 2026
7 of 8 checks passed
@wesm wesm deleted the fix/fix-rate-limit-degradation branch May 6, 2026 00:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants