fix(usage-metrics): replace the per-version inline retry with a multi-pass approach #3620
Conversation
Try all versions in a first pass, collect failures, wait for the rate-limit window to reset, then retry only the failed versions. Repeat until all succeed or a maximum number of passes is reached. This avoids cascading failures, where retries for one version burn the rate budget that the next version needs.
Summary by CodeRabbit
Walkthrough
Switches usage metrics collection to a deduplicating, multi-pass retry loop that queries pending versions sequentially and sorts the CSV on disk after appends. The GitHub Actions workflow now uses month-based branches and avoids recreating existing branches or open PRs.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Collector as Usage Metrics Collector
    participant GitHub as GitHub API
    participant FS as Filesystem (CSV)
    Note over Collector: prepare versions → trim, filter, dedupe → pending set
    loop Passes (up to maxPasses)
        alt pending versions exist
            Collector->>GitHub: query usage for version V_i
            GitHub-->>Collector: usage result / error
            alt success
                Collector->>Collector: store in map[version]=metric
                Collector->>FS: append metric row to CSV
                FS-->>Collector: ack
            else retryable error
                Collector->>Collector: mark V_i pending for next pass
            else non-retryable error
                Collector-->>Collector: abort with error
            end
            Collector->>Collector: wait interRequestWait
        end
        Note right of Collector: after pass ends, if pending non-empty<br/>wait passCooldown then continue
    end
    Collector->>FS: sortCSV(file, by (date asc, version asc))
    FS-->>Collector: sorted file ready
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@usage-metrics/collect-metrics.go`:
- Around line 58-67: Replace the current construction of pending (which only
trims empties but allows duplicates and can result in no queries) with logic
that trims each entry from versions, skips empty strings, and inserts into a
temporary map[string]struct{} to deduplicate; then build pending as a slice from
that map so pending contains a unique, non-empty set of versions to query
(references: the variables versions, pending, and metrics in
collect-metrics.go).
- Around line 93-98: queryGitHubUsage failures are currently always requeued;
change the flow so only retryable errors (e.g., rate limit) are retried: update
queryGitHubUsage (or wrap its call) to return or be examined for a
distinguishable retryable error (e.g., an error type, sentinel, or helper
isRetryableError(err)), then in the loop that increments queriesMade and appends
to failed, only append/continue for retryable errors and for non-retryable
errors log the failure for version and skip further retries (do not requeue
across passes 1–4); reference queryGitHubUsage, the err variable, failed slice,
queriesMade and pass when making this change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e3ab7b35-2828-4b2b-b9ec-7743f1261715
📒 Files selected for processing (1)
usage-metrics/collect-metrics.go
Actionable comments posted: 1
🧹 Nitpick comments (1)
.github/workflows/usage-metrics.yml (1)
78-80: Consider deduplication when re-running within the same month. The month-based branch strategy means multiple `workflow_dispatch` runs in the same month will push additional commits to the same branch. Since `appendToCSV` (in `collect-metrics.go`) unconditionally appends rows without deduplication, re-runs on the same date will produce duplicate (date, version) entries in the CSV. This is fine for the scheduled monthly run but may need attention if manual re-runs are common.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/usage-metrics.yml around lines 78 - 80, The CSV append logic in appendToCSV (collect-metrics.go) unconditionally appends rows, causing duplicate (date, version) entries when the monthly branch is reused; update appendToCSV to read the existing CSV first, check for an existing row matching the new row's date and version, and either skip adding the duplicate or replace the existing row before writing back (e.g., load CSV into a map keyed by date+version, update/insert the entry, then write the deduplicated rows back to disk). Ensure you still create the file if missing and preserve header handling in appendToCSV.
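The upsert described in this prompt can be sketched in isolation. The row layout (`[date, version, count]`) and the `upsertRow` name are assumptions for illustration; the real `appendToCSV` would apply this before writing the rows back:

```go
package main

import "fmt"

// upsertRow inserts or replaces a CSV row keyed by (date, version), so
// re-running the collector within the same month cannot create duplicate
// entries. It replaces an existing match in place, else appends.
func upsertRow(rows [][]string, row []string) [][]string {
	key := row[0] + "|" + row[1]
	for i, r := range rows {
		if len(r) >= 2 && r[0]+"|"+r[1] == key {
			rows[i] = row // same (date, version): replace, don't duplicate
			return rows
		}
	}
	return append(rows, row)
}

func main() {
	rows := [][]string{{"2026-04-01", "v0.35.0", "10"}}
	rows = upsertRow(rows, []string{"2026-04-01", "v0.35.0", "12"}) // replaces
	rows = upsertRow(rows, []string{"2026-04-01", "v0.36.0", "3"})  // inserts
	fmt.Println(rows) // [[2026-04-01 v0.35.0 12] [2026-04-01 v0.36.0 3]]
}
```

A map keyed by `date+"|"+version` would make the lookup O(1), but a linear scan is simpler and fine at this data size (one row per version per month).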
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/usage-metrics.yml:
- Around line 82-97: The current checkout of an existing remote branch (git
checkout "$BRANCH_NAME") can overwrite the just-staged docs/usage-metrics.csv;
instead, when origin/$BRANCH_NAME exists, avoid switching branches and simply
create a commit locally and push it to that remote branch (use the staged file
docs/usage-metrics.csv and commit with git commit -m "Update usage metrics",
then push with git push origin HEAD:"$BRANCH_NAME"); keep the existing git diff
--staged check and only perform the commit+push when there are staged changes,
and reference BRANCH_NAME and docs/usage-metrics.csv to locate the change.
---
Nitpick comments:
In @.github/workflows/usage-metrics.yml:
- Around line 78-80: The CSV append logic in appendToCSV (collect-metrics.go)
unconditionally appends rows, causing duplicate (date, version) entries when the
monthly branch is reused; update appendToCSV to read the existing CSV first,
check for an existing row matching the new row's date and version, and either
skip adding the duplicate or replace the existing row before writing back (e.g.,
load CSV into a map keyed by date+version, update/insert the entry, then write
the deduplicated rows back to disk). Ensure you still create the file if missing
and preserve header handling in appendToCSV.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: dace4a3e-4c78-4dbc-943a-cd60e250a576
📒 Files selected for processing (2)
.github/workflows/usage-metrics.yml
usage-metrics/collect-metrics.go
🚧 Files skipped from review as they are similar to previous changes (1)
- usage-metrics/collect-metrics.go
Actionable comments posted: 1
🧹 Nitpick comments (2)
usage-metrics/collect-metrics.go (2)
221-225: Consider write-to-temp-then-rename for atomic updates. `os.Create` truncates the file before writing. If the process crashes between truncation and `WriteAll` completion, the CSV data is lost. For higher durability, write to a temporary file and rename atomically.

♻️ Atomic write pattern

```diff
-	out, err := os.Create(absPath)
+	tmpPath := absPath + ".tmp"
+	out, err := os.Create(tmpPath)
 	if err != nil {
 		return fmt.Errorf("create file: %w", err)
 	}
-	defer out.Close()
 	writer := csv.NewWriter(out)
 	if err := writer.Write(header); err != nil {
+		out.Close()
+		os.Remove(tmpPath)
 		return fmt.Errorf("write header: %w", err)
 	}
 	if err := writer.WriteAll(data); err != nil {
+		out.Close()
+		os.Remove(tmpPath)
 		return fmt.Errorf("write records: %w", err)
 	}
+	if err := out.Close(); err != nil {
+		os.Remove(tmpPath)
+		return fmt.Errorf("close temp file: %w", err)
+	}
+	if err := os.Rename(tmpPath, absPath); err != nil {
+		return fmt.Errorf("rename temp to final: %w", err)
+	}
 	return nil
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@usage-metrics/collect-metrics.go` around lines 221 - 225, Replace the direct os.Create(absPath) usage with an atomic write-to-temp-then-rename pattern: create a temp file in the same directory (use os.CreateTemp with the directory from absPath), write your data to that temp file (the code that currently calls WriteAll should target the temp file), sync/close the temp file, then atomically rename it to absPath with os.Rename; ensure you still handle and return errors from create/write/sync/close/rename and use the same file permissions as the original if needed (references: absPath, out variable, and the write call that currently writes the CSV).
127-146: Consider returning a distinct error or exit code for partial failures. When versions remain pending after all passes, the function logs a warning but returns `nil`. This allows partial results to be written, which matches the PR objective. However, it also means CI workflows cannot distinguish between full success and partial failure without parsing logs. If partial failure should surface in CI (e.g., to flag data gaps), consider returning a wrapped error or setting a non-zero exit code while still writing collected metrics.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@usage-metrics/collect-metrics.go` around lines 127 - 146, The code logs pending versions but returns nil, hiding partial failures; modify the function that contains this block so that after appending metrics and successfully calling sortCSV(csvPath) it checks len(pending)>0 and returns a distinct, wrapped error (e.g., fmt.Errorf("partial failure: %d version(s) pending after %d passes: %s", len(pending), maxPasses, strings.Join(pending, ", "))) or a sentinel ErrPartialFailure while still writing metrics via appendToCSV and sorting via sortCSV; ensure the error is returned after sortCSV succeeds so CI can detect partial failures without losing written data.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@usage-metrics/collect-metrics.go`:
- Around line 214-219: The sort comparator for sort.SliceStable over variable
data currently indexes data[i][0] and data[i][1] directly and can panic on
malformed rows; update the comparator used in sort.SliceStable to first check
that both data[i] and data[j] have length >= 2 (e.g., len(data[i]) >= 2 &&
len(data[j]) >= 2) and handle cases where a row is short by treating missing
values as less/greater (consistent ordering) or placing malformed rows at the
end, then perform the date-then-version comparison only after those bounds
checks so indexing is always safe.
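The bounds-checked comparator this comment asks for could look like the sketch below (the `sortRows` name is illustrative; the real code sorts inside `sortCSV`). Malformed rows are pushed to the end with a consistent ordering so indexing is always safe:

```go
package main

import (
	"fmt"
	"sort"
)

// sortRows orders CSV rows by (date asc, version asc). Rows with fewer than
// two fields can't be compared, so they sort after all well-formed rows,
// and the length checks run before any indexing to avoid panics.
func sortRows(data [][]string) {
	sort.SliceStable(data, func(i, j int) bool {
		iOK, jOK := len(data[i]) >= 2, len(data[j]) >= 2
		if !iOK || !jOK {
			return iOK && !jOK // well-formed rows sort before malformed ones
		}
		if data[i][0] != data[j][0] {
			return data[i][0] < data[j][0] // primary key: date ascending
		}
		return data[i][1] < data[j][1] // tiebreak: version ascending
	})
}

func main() {
	data := [][]string{{"2026-04-01", "v2"}, {"bad"}, {"2026-03-01", "v1"}}
	sortRows(data)
	fmt.Println(data) // [[2026-03-01 v1] [2026-04-01 v2] [bad]]
}
```

`sort.SliceStable` keeps equal (and malformed) rows in their original relative order, which makes the output deterministic across runs.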
---
Nitpick comments:
In `@usage-metrics/collect-metrics.go`:
- Around line 221-225: Replace the direct os.Create(absPath) usage with an
atomic write-to-temp-then-rename pattern: create a temp file in the same
directory (use os.CreateTemp with the directory from absPath), write your data
to that temp file (the code that currently calls WriteAll should target the temp
file), sync/close the temp file, then atomically rename it to absPath with
os.Rename; ensure you still handle and return errors from
create/write/sync/close/rename and use the same file permissions as the original
if needed (references: absPath, out variable, and the write call that currently
writes the CSV).
- Around line 127-146: The code logs pending versions but returns nil, hiding
partial failures; modify the function that contains this block so that after
appending metrics and successfully calling sortCSV(csvPath) it checks
len(pending)>0 and returns a distinct, wrapped error (e.g., fmt.Errorf("partial
failure: %d version(s) pending after %d passes: %s", len(pending), maxPasses,
strings.Join(pending, ", "))) or a sentinel ErrPartialFailure while still
writing metrics via appendToCSV and sorting via sortCSV; ensure the error is
returned after sortCSV succeeds so CI can detect partial failures without losing
written data.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 62e2e17c-417a-470a-a453-3a2a45255f56
📒 Files selected for processing (1)
usage-metrics/collect-metrics.go
What does this PR do?
Replace the per-version inline retry strategy in `collect-metrics.go` with a multi-pass approach. Instead of retrying each version up to 3 times with 60s backoffs inline (blocking the pipeline), the collector now queries every version once per pass, collects rate-limited versions in a `failed` list, waits for the rate-limit window to reset, and retries only the failed versions in the next pass. The `queryGitHubUsageWithRetry` wrapper is removed; `queryGitHubUsage` is called directly from the multi-pass loop.

Why is it important?
The April 2026 run showed cascading 429 failures: retrying one version inline consumed rate budget that subsequent versions needed, causing most versions to be skipped. The resulting PR #3619 only contained 9 out of ~30 versions.
The multi-pass approach is more efficient because successful versions proceed without unnecessary waits, and the 120s cooldown between passes lets the rate limit window fully reset before retrying failures.
Related issues