
fix(usage-metrics): replace the per-version inline retry with a multi-pass approach#3620

Merged
mdelapenya merged 4 commits into testcontainers:main from mdelapenya:usage-metrics-refinement
Apr 1, 2026

Conversation

@mdelapenya
Member

@mdelapenya mdelapenya commented Apr 1, 2026

What does this PR do?

Replace the per-version inline retry strategy in collect-metrics.go with a multi-pass approach. Instead of retrying each version up to 3 times with 60s backoffs inline (blocking the pipeline), the collector now:

  1. Pass 1: queries all versions sequentially with 7s inter-request delays
  2. Collects any versions that failed (HTTP 429, 403, etc.) into a failed list
  3. Waits 120s for the GitHub rate limit window to fully reset
  4. Pass 2+: retries only the failed versions with the same 7s inter-request delays
  5. Repeats up to 5 passes total

The queryGitHubUsageWithRetry wrapper is removed — queryGitHubUsage is called directly from the multi-pass loop.
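The multi-pass loop described above can be sketched in Go. This is a minimal illustration under stated assumptions, not the actual collect-metrics.go: `queryUsage` is a stand-in for the real GitHub query, and the pacing sleeps are commented out so the sketch runs instantly.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	maxPasses        = 5
	interRequestWait = 7 * time.Second
	passCooldown     = 120 * time.Second
)

var errRateLimited = errors.New("rate limited")

// queryUsage stands in for the real GitHub query; here it fails exactly once
// for any version flagged in failOnce, simulating a transient 429.
func queryUsage(version string, failOnce map[string]bool) (int, error) {
	if failOnce[version] {
		failOnce[version] = false
		return 0, errRateLimited
	}
	return 42, nil
}

// collect runs up to maxPasses passes, retrying only the versions that failed
// in the previous pass, instead of retrying each version inline.
func collect(versions []string, failOnce map[string]bool) map[string]int {
	metrics := map[string]int{}
	pending := versions
	for pass := 1; pass <= maxPasses && len(pending) > 0; pass++ {
		var failed []string
		for _, v := range pending {
			n, err := queryUsage(v, failOnce)
			if err != nil {
				failed = append(failed, v) // requeue for the next pass
				continue
			}
			metrics[v] = n
			// time.Sleep(interRequestWait) // inter-request pacing, elided here
		}
		pending = failed
		if len(pending) > 0 {
			// time.Sleep(passCooldown) // let the rate-limit window reset
		}
	}
	return metrics
}

func main() {
	failOnce := map[string]bool{"v0.30.0": true}
	m := collect([]string{"v0.29.0", "v0.30.0"}, failOnce)
	fmt.Println(len(m)) // prints 2: both versions succeed within two passes
}
```

Successful versions never wait on retries for their neighbors, which is the cascading-failure fix the PR describes.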

Why is it important?

The April 2026 run showed cascading 429 failures: retrying one version inline consumed rate budget that subsequent versions needed, causing most versions to be skipped. The resulting PR #3619 only contained 9 out of ~30 versions.

The multi-pass approach is more efficient because successful versions proceed without unnecessary waits, and the 120s cooldown between passes lets the rate limit window fully reset before retrying failures.

Related issues

…-pass approach

Try all versions in a first pass, collect failures, wait for the rate limit window to reset,
then retry only the failed versions. Repeat until all succeed or a max number of passes is hit.
This avoids cascading failures where retries for one version burn rate budget for the next.
@mdelapenya mdelapenya requested a review from a team as a code owner April 1, 2026 11:16
@netlify

netlify bot commented Apr 1, 2026

Deploy Preview for testcontainers-go ready!

🔨 Latest commit: e76e151
🔍 Latest deploy log: https://app.netlify.com/projects/testcontainers-go/deploys/69cd040821010f000813b3bc
😎 Deploy Preview: https://deploy-preview-3620--testcontainers-go.netlify.app

@coderabbitai

coderabbitai bot commented Apr 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c38dfaa7-f2fa-4a98-8d98-36f7706d7d03

📥 Commits

Reviewing files that changed from the base of the PR and between 8c8b979 and e76e151.

📒 Files selected for processing (1)
  • .github/workflows/usage-metrics.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/usage-metrics.yml

Summary by CodeRabbit

  • Refactor

    • Metrics collection now deduplicates and validates version input, retries transient failures in staged passes with controlled pacing, and aborts on unrecoverable errors
    • CSV output is deterministically sorted on disk (date asc, version asc) so exports are consistently ordered
  • Chores

    • Automation uses month-based update branches, restores/stages collected CSV onto existing branches, and skips creating duplicate branches or PRs when no changes exist

Walkthrough

Switches usage metrics collection to a deduplicating, multi-pass retry loop that queries pending versions sequentially and sorts the CSV on disk after appends. The GitHub Actions workflow now uses month-based branches and avoids recreating existing branches or open PRs.

Changes

  • Usage metrics collection — usage-metrics/collect-metrics.go: Trim/filter/deduplicate input versions; replace the ordered slice with map[string]usageMetric; perform up to maxPasses multi-version passes with interRequestWait between requests and passCooldown between passes; classify errors with isRetryableError and retry only retryable failures; abort on non-retryable errors; append successful metrics, then run sortCSV to sort the CSV by (date asc, version asc).
  • CI workflow: branch & PR handling — .github/workflows/usage-metrics.yml: Use a month-based branch name (chore/update-usage-metrics-$MONTH); if the branch exists, fetch/checkout and restore the collected CSV onto that branch, stage, and exit early if there is no diff; skip gh pr create when an open PR with --head $BRANCH_NAME exists; use $MONTH in the PR title/body.

Sequence Diagram(s)

sequenceDiagram
    participant Collector as Usage Metrics Collector
    participant GitHub as GitHub API
    participant FS as Filesystem (CSV)
    Note over Collector: prepare versions → trim, filter, dedupe → pending set
    loop Passes (up to maxPasses)
        alt pending versions exist
            Collector->>GitHub: query usage for version V_i
            GitHub-->>Collector: usage result / error
            alt success
                Collector->>Collector: store in map[version]=metric
                Collector->>FS: append metric row to CSV
                FS-->>Collector: ack
            else retryable error
                Collector->>Collector: mark V_i pending for next pass
            else non-retryable error
                Collector-->>Collector: abort with error
            end
            Collector->>Collector: wait interRequestWait
        end
        Note right of Collector: after pass ends, if pending non-empty\nwait passCooldown then continue
    end
    Collector->>FS: sortCSV(file, by (date asc, version asc))
    FS-->>Collector: sorted file ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

chore, github_actions

Poem

I hop through passes, trimming names in line,
Retry the ones that wait and shine.
I write the rows, then sort with care,
Month-named branches float on air.
🐇✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 20.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Title check — ✅ Passed: the title accurately summarizes the main change, replacing per-version inline retry logic with a multi-pass approach, which directly aligns with the primary objective of the changeset.
  • Description check — ✅ Passed: the description is comprehensive and directly related to the changeset, detailing the multi-pass retry strategy, its motivation, and the specific changes to collect-metrics.go.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@usage-metrics/collect-metrics.go`:
- Around line 58-67: Replace the current construction of pending (which only
trims empties but allows duplicates and can result in no queries) with logic
that trims each entry from versions, skips empty strings, and inserts into a
temporary map[string]struct{} to deduplicate; then build pending as a slice from
that map so pending contains a unique, non-empty set of versions to query
(references: the variables versions, pending, and metrics in
collect-metrics.go).
- Around line 93-98: queryGitHubUsage failures are currently always requeued;
change the flow so only retryable errors (e.g., rate limit) are retried: update
queryGitHubUsage (or wrap its call) to return or be examined for a
distinguishable retryable error (e.g., an error type, sentinel, or helper
isRetryableError(err)), then in the loop that increments queriesMade and appends
to failed, only append/continue for retryable errors and for non-retryable
errors log the failure for version and skip further retries (do not requeue
across passes 1–4); reference queryGitHubUsage, the err variable, failed slice,
queriesMade and pass when making this change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e3ab7b35-2828-4b2b-b9ec-7743f1261715

📥 Commits

Reviewing files that changed from the base of the PR and between 440ee5c and 31d00b2.

📒 Files selected for processing (1)
  • usage-metrics/collect-metrics.go

Comment thread usage-metrics/collect-metrics.go
Comment thread usage-metrics/collect-metrics.go Outdated

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
.github/workflows/usage-metrics.yml (1)

78-80: Consider deduplication when re-running within the same month.

The month-based branch strategy means multiple workflow_dispatch runs in the same month will push additional commits to the same branch. Since appendToCSV (in collect-metrics.go) unconditionally appends rows without deduplication, re-runs on the same date will produce duplicate (date, version) entries in the CSV.

This is fine for the scheduled monthly run but may need attention if manual re-runs are common.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/usage-metrics.yml around lines 78 - 80, The CSV append
logic in appendToCSV (collect-metrics.go) unconditionally appends rows, causing
duplicate (date, version) entries when the monthly branch is reused; update
appendToCSV to read the existing CSV first, check for an existing row matching
the new row's date and version, and either skip adding the duplicate or replace
the existing row before writing back (e.g., load CSV into a map keyed by
date+version, update/insert the entry, then write the deduplicated rows back to
disk). Ensure you still create the file if missing and preserve header handling
in appendToCSV.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/usage-metrics.yml:
- Around line 82-97: The current checkout of an existing remote branch (git
checkout "$BRANCH_NAME") can overwrite the just-staged docs/usage-metrics.csv;
instead, when origin/$BRANCH_NAME exists, avoid switching branches and simply
create a commit locally and push it to that remote branch (use the staged file
docs/usage-metrics.csv and commit with git commit -m "Update usage metrics",
then push with git push origin HEAD:"$BRANCH_NAME"); keep the existing git diff
--staged check and only perform the commit+push when there are staged changes,
and reference BRANCH_NAME and docs/usage-metrics.csv to locate the change.

---

Nitpick comments:
In @.github/workflows/usage-metrics.yml:
- Around line 78-80: The CSV append logic in appendToCSV (collect-metrics.go)
unconditionally appends rows, causing duplicate (date, version) entries when the
monthly branch is reused; update appendToCSV to read the existing CSV first,
check for an existing row matching the new row's date and version, and either
skip adding the duplicate or replace the existing row before writing back (e.g.,
load CSV into a map keyed by date+version, update/insert the entry, then write
the deduplicated rows back to disk). Ensure you still create the file if missing
and preserve header handling in appendToCSV.
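The dedup-on-append idea from this nitpick could look roughly like the following in Go. `upsertRow` is a hypothetical helper, not a function from the PR; it keys on the first two columns (date, version), replacing an existing row instead of appending a duplicate.

```go
package main

import "fmt"

// upsertRow replaces an existing (date, version) row or appends a new one,
// so re-running the collector on the same monthly branch never duplicates
// entries for the same date and version.
func upsertRow(rows [][]string, row []string) [][]string {
	for i, r := range rows {
		if len(r) >= 2 && len(row) >= 2 && r[0] == row[0] && r[1] == row[1] {
			rows[i] = row // same date+version: replace in place
			return rows
		}
	}
	return append(rows, row)
}

func main() {
	rows := [][]string{{"2026-04-01", "v0.30.0", "10"}}
	rows = upsertRow(rows, []string{"2026-04-01", "v0.30.0", "12"}) // replaces
	rows = upsertRow(rows, []string{"2026-04-01", "v0.31.0", "3"})  // appends
	fmt.Println(len(rows)) // prints 2
}
```

In the real appendToCSV this would sit between reading the existing CSV and writing it back, with header handling and file creation preserved as the prompt notes.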

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dace4a3e-4c78-4dbc-943a-cd60e250a576

📥 Commits

Reviewing files that changed from the base of the PR and between 31d00b2 and 22c4a61.

📒 Files selected for processing (2)
  • .github/workflows/usage-metrics.yml
  • usage-metrics/collect-metrics.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • usage-metrics/collect-metrics.go

Comment thread .github/workflows/usage-metrics.yml

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
usage-metrics/collect-metrics.go (2)

221-225: Consider write-to-temp-then-rename for atomic updates.

os.Create truncates the file before writing. If the process crashes between truncation and WriteAll completion, the CSV data is lost. For higher durability, write to a temporary file and rename atomically.

♻️ Atomic write pattern
-	out, err := os.Create(absPath)
+	tmpPath := absPath + ".tmp"
+	out, err := os.Create(tmpPath)
 	if err != nil {
 		return fmt.Errorf("create file: %w", err)
 	}
-	defer out.Close()
 
 	writer := csv.NewWriter(out)
 	if err := writer.Write(header); err != nil {
+		out.Close()
+		os.Remove(tmpPath)
 		return fmt.Errorf("write header: %w", err)
 	}
 	if err := writer.WriteAll(data); err != nil {
+		out.Close()
+		os.Remove(tmpPath)
 		return fmt.Errorf("write records: %w", err)
 	}
+	if err := out.Close(); err != nil {
+		os.Remove(tmpPath)
+		return fmt.Errorf("close temp file: %w", err)
+	}
+	if err := os.Rename(tmpPath, absPath); err != nil {
+		return fmt.Errorf("rename temp to final: %w", err)
+	}
 
 	return nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@usage-metrics/collect-metrics.go` around lines 221 - 225, Replace the direct
os.Create(absPath) usage with an atomic write-to-temp-then-rename pattern:
create a temp file in the same directory (use os.CreateTemp with the directory
from absPath), write your data to that temp file (the code that currently calls
WriteAll should target the temp file), sync/close the temp file, then atomically
rename it to absPath with os.Rename; ensure you still handle and return errors
from create/write/sync/close/rename and use the same file permissions as the
original if needed (references: absPath, out variable, and the write call that
currently writes the CSV).

127-146: Consider returning a distinct error or exit code for partial failures.

When versions remain pending after all passes, the function logs a warning but returns nil. This allows partial results to be written, which matches the PR objective. However, it also means CI workflows cannot distinguish between full success and partial failure without parsing logs.

If partial failure should surface in CI (e.g., to flag data gaps), consider returning a wrapped error or setting a non-zero exit code while still writing collected metrics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@usage-metrics/collect-metrics.go` around lines 127 - 146, The code logs
pending versions but returns nil, hiding partial failures; modify the function
that contains this block so that after appending metrics and successfully
calling sortCSV(csvPath) it checks len(pending)>0 and returns a distinct,
wrapped error (e.g., fmt.Errorf("partial failure: %d version(s) pending after %d
passes: %s", len(pending), maxPasses, strings.Join(pending, ", "))) or a
sentinel ErrPartialFailure while still writing metrics via appendToCSV and
sorting via sortCSV; ensure the error is returned after sortCSV succeeds so CI
can detect partial failures without losing written data.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@usage-metrics/collect-metrics.go`:
- Around line 214-219: The sort comparator for sort.SliceStable over variable
data currently indexes data[i][0] and data[i][1] directly and can panic on
malformed rows; update the comparator used in sort.SliceStable to first check
that both data[i] and data[j] have length >= 2 (e.g., len(data[i]) >= 2 &&
len(data[j]) >= 2) and handle cases where a row is short by treating missing
values as less/greater (consistent ordering) or placing malformed rows at the
end, then perform the date-then-version comparison only after those bounds
checks so indexing is always safe.
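The bounds-checked comparator could look like this; `sortRows` is a hypothetical stand-in for the sorting step inside the PR's sortCSV, shown in isolation.

```go
package main

import (
	"fmt"
	"sort"
)

// sortRows orders CSV rows by (date asc, version asc), pushing malformed
// rows (fewer than two fields) to the end instead of panicking on them.
func sortRows(data [][]string) {
	sort.SliceStable(data, func(i, j int) bool {
		iOK, jOK := len(data[i]) >= 2, len(data[j]) >= 2
		if !iOK || !jOK {
			return iOK && !jOK // well-formed rows sort before malformed ones
		}
		if data[i][0] != data[j][0] {
			return data[i][0] < data[j][0] // date ascending
		}
		return data[i][1] < data[j][1] // then version ascending
	})
}

func main() {
	rows := [][]string{
		{"2026-04-01", "v0.30.0"},
		{"bad"},
		{"2026-03-01", "v0.29.0"},
	}
	sortRows(rows)
	fmt.Println(rows[0][0]) // prints 2026-03-01: earliest date first
}
```

Note that lexicographic comparison of the date column only matches chronological order because the dates are ISO-8601 formatted.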

---

Nitpick comments:
In `@usage-metrics/collect-metrics.go`:
- Around line 221-225: Replace the direct os.Create(absPath) usage with an
atomic write-to-temp-then-rename pattern: create a temp file in the same
directory (use os.CreateTemp with the directory from absPath), write your data
to that temp file (the code that currently calls WriteAll should target the temp
file), sync/close the temp file, then atomically rename it to absPath with
os.Rename; ensure you still handle and return errors from
create/write/sync/close/rename and use the same file permissions as the original
if needed (references: absPath, out variable, and the write call that currently
writes the CSV).
- Around line 127-146: The code logs pending versions but returns nil, hiding
partial failures; modify the function that contains this block so that after
appending metrics and successfully calling sortCSV(csvPath) it checks
len(pending)>0 and returns a distinct, wrapped error (e.g., fmt.Errorf("partial
failure: %d version(s) pending after %d passes: %s", len(pending), maxPasses,
strings.Join(pending, ", "))) or a sentinel ErrPartialFailure while still
writing metrics via appendToCSV and sorting via sortCSV; ensure the error is
returned after sortCSV succeeds so CI can detect partial failures without losing
written data.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 62e2e17c-417a-470a-a453-3a2a45255f56

📥 Commits

Reviewing files that changed from the base of the PR and between 22c4a61 and 8c8b979.

📒 Files selected for processing (1)
  • usage-metrics/collect-metrics.go

Comment thread usage-metrics/collect-metrics.go
@mdelapenya mdelapenya self-assigned this Apr 1, 2026
@mdelapenya mdelapenya added the chore Changes that do not impact the existing functionality label Apr 1, 2026
@mdelapenya mdelapenya merged commit cc0e33d into testcontainers:main Apr 1, 2026
16 checks passed
@mdelapenya mdelapenya deleted the usage-metrics-refinement branch April 1, 2026 11:44

Labels

chore Changes that do not impact the existing functionality
