
fix(usage-metrics): replace the per-version inline retry with a multi-pass approach#3620

Merged
mdelapenya merged 4 commits into testcontainers:main from mdelapenya:usage-metrics-refinement
Apr 1, 2026

Conversation

@mdelapenya
Member

@mdelapenya mdelapenya commented Apr 1, 2026

What does this PR do?

Replace the per-version inline retry strategy in collect-metrics.go with a multi-pass approach. Instead of retrying each version up to 3 times with 60s backoffs inline (blocking the pipeline), the collector now:

  1. Pass 1: queries all versions sequentially with 7s inter-request delays
  2. Collects any versions that failed (HTTP 429, 403, etc.) into a failed list
  3. Waits 120s for the GitHub rate limit window to fully reset
  4. Pass 2+: retries only the failed versions with the same 7s inter-request delays
  5. Repeats up to 5 passes total

The queryGitHubUsageWithRetry wrapper is removed — queryGitHubUsage is called directly from the multi-pass loop.
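The multi-pass loop described above can be sketched in Go. This is a minimal illustration under stated assumptions, not the actual collect-metrics.go: `queryUsage` is a stand-in for the real GitHub query, and the pacing sleeps are commented out so the sketch runs instantly.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	maxPasses        = 5
	interRequestWait = 7 * time.Second
	passCooldown     = 120 * time.Second
)

var errRateLimited = errors.New("rate limited")

// queryUsage stands in for the real GitHub query; here it fails exactly once
// for any version flagged in failOnce, simulating a transient 429.
func queryUsage(version string, failOnce map[string]bool) (int, error) {
	if failOnce[version] {
		failOnce[version] = false
		return 0, errRateLimited
	}
	return 42, nil
}

// collect runs up to maxPasses passes, retrying only the versions that failed
// in the previous pass, instead of retrying each version inline.
func collect(versions []string, failOnce map[string]bool) map[string]int {
	metrics := map[string]int{}
	pending := versions
	for pass := 1; pass <= maxPasses && len(pending) > 0; pass++ {
		var failed []string
		for _, v := range pending {
			n, err := queryUsage(v, failOnce)
			if err != nil {
				failed = append(failed, v) // requeue for the next pass
				continue
			}
			metrics[v] = n
			// time.Sleep(interRequestWait) // inter-request pacing, elided here
		}
		pending = failed
		if len(pending) > 0 {
			// time.Sleep(passCooldown) // let the rate-limit window reset
		}
	}
	return metrics
}

func main() {
	failOnce := map[string]bool{"v0.30.0": true}
	m := collect([]string{"v0.29.0", "v0.30.0"}, failOnce)
	fmt.Println(len(m)) // prints 2: both versions succeed within two passes
}
```

Successful versions never wait on retries for their neighbors, which is the cascading-failure fix the PR describes.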

Why is it important?

The April 2026 run showed cascading 429 failures: retrying one version inline consumed rate budget that subsequent versions needed, causing most versions to be skipped. The resulting PR #3619 only contained 9 out of ~30 versions.

The multi-pass approach is more efficient because successful versions proceed without unnecessary waits, and the 120s cooldown between passes lets the rate limit window fully reset before retrying failures.

Related issues

…-pass approach

Try all versions in a first pass, collect failures, wait for the rate limit window to reset,
then retry only the failed versions. Repeat until all succeed or a max number of passes is hit.
This avoids cascading failures where retries for one version burn rate budget for the next.
@mdelapenya mdelapenya requested a review from a team as a code owner April 1, 2026 11:16
@netlify

netlify bot commented Apr 1, 2026

Deploy Preview for testcontainers-go ready!

🔨 Latest commit: e76e151
🔍 Latest deploy log: https://app.netlify.com/projects/testcontainers-go/deploys/69cd040821010f000813b3bc
😎 Deploy Preview: https://deploy-preview-3620--testcontainers-go.netlify.app

@coderabbitai

coderabbitai bot commented Apr 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c38dfaa7-f2fa-4a98-8d98-36f7706d7d03

📥 Commits

Reviewing files that changed from the base of the PR and between 8c8b979 and e76e151.

📒 Files selected for processing (1)
  • .github/workflows/usage-metrics.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/usage-metrics.yml

Summary by CodeRabbit

  • Refactor

    • Metrics collection now deduplicates and validates version input, retries transient failures in staged passes with controlled pacing, and aborts on unrecoverable errors
    • CSV output is deterministically sorted on disk (date asc, version asc) so exports are consistently ordered
  • Chores

    • Automation uses month-based update branches, restores/stages collected CSV onto existing branches, and skips creating duplicate branches or PRs when no changes exist

Walkthrough

Switches usage metrics collection to a deduplicating, multi-pass retry loop that queries pending versions sequentially and sorts the CSV on disk after appends. The GitHub Actions workflow now uses month-based branches and avoids recreating existing branches or open PRs.

Changes

  • Usage metrics collection — usage-metrics/collect-metrics.go: Trim/filter/deduplicate input versions; replace the ordered slice with map[string]usageMetric; perform up to maxPasses multi-version passes with interRequestWait between requests and passCooldown between passes; classify errors with isRetryableError and retry only retryable failures; abort on non-retryable errors; append successful metrics, then run sortCSV to sort the CSV by (date asc, version asc).
  • CI workflow: branch & PR handling — .github/workflows/usage-metrics.yml: Use a month-based branch name (chore/update-usage-metrics-$MONTH); if the branch exists, fetch/checkout and restore the collected CSV onto that branch, stage, and exit early if there is no diff; skip gh pr create when an open PR with --head $BRANCH_NAME exists; use $MONTH in the PR title/body.

Sequence Diagram(s)

sequenceDiagram
    participant Collector as Usage Metrics Collector
    participant GitHub as GitHub API
    participant FS as Filesystem (CSV)
    Note over Collector: prepare versions → trim, filter, dedupe → pending set
    loop Passes (up to maxPasses)
        alt pending versions exist
            Collector->>GitHub: query usage for version V_i
            GitHub-->>Collector: usage result / error
            alt success
                Collector->>Collector: store in map[version]=metric
                Collector->>FS: append metric row to CSV
                FS-->>Collector: ack
            else retryable error
                Collector->>Collector: mark V_i pending for next pass
            else non-retryable error
                Collector-->>Collector: abort with error
            end
            Collector->>Collector: wait interRequestWait
        end
        Note right of Collector: after pass ends, if pending non-empty\nwait passCooldown then continue
    end
    Collector->>FS: sortCSV(file, by (date asc, version asc))
    FS-->>Collector: sorted file ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

chore, github_actions

Poem

I hop through passes, trimming names in line,
Retry the ones that wait and shine.
I write the rows, then sort with care,
Month-named branches float on air.
🐇✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 20.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Title check — ✅ Passed: the title accurately summarizes the main change, replacing per-version inline retry logic with a multi-pass approach, which directly aligns with the primary objective of the changeset.
  • Description check — ✅ Passed: the description is comprehensive and directly related to the changeset, detailing the multi-pass retry strategy, its motivation, and the specific changes to collect-metrics.go.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@usage-metrics/collect-metrics.go`:
- Around line 58-67: Replace the current construction of pending (which only
trims empties but allows duplicates and can result in no queries) with logic
that trims each entry from versions, skips empty strings, and inserts into a
temporary map[string]struct{} to deduplicate; then build pending as a slice from
that map so pending contains a unique, non-empty set of versions to query
(references: the variables versions, pending, and metrics in
collect-metrics.go).
- Around line 93-98: queryGitHubUsage failures are currently always requeued;
change the flow so only retryable errors (e.g., rate limit) are retried: update
queryGitHubUsage (or wrap its call) to return or be examined for a
distinguishable retryable error (e.g., an error type, sentinel, or helper
isRetryableError(err)), then in the loop that increments queriesMade and appends
to failed, only append/continue for retryable errors and for non-retryable
errors log the failure for version and skip further retries (do not requeue
across passes 1–4); reference queryGitHubUsage, the err variable, failed slice,
queriesMade and pass when making this change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e3ab7b35-2828-4b2b-b9ec-7743f1261715

📥 Commits

Reviewing files that changed from the base of the PR and between 440ee5c and 31d00b2.

📒 Files selected for processing (1)
  • usage-metrics/collect-metrics.go

Comment thread usage-metrics/collect-metrics.go
Comment thread usage-metrics/collect-metrics.go Outdated

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
.github/workflows/usage-metrics.yml (1)

78-80: Consider deduplication when re-running within the same month.

The month-based branch strategy means multiple workflow_dispatch runs in the same month will push additional commits to the same branch. Since appendToCSV (in collect-metrics.go) unconditionally appends rows without deduplication, re-runs on the same date will produce duplicate (date, version) entries in the CSV.

This is fine for the scheduled monthly run but may need attention if manual re-runs are common.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/usage-metrics.yml around lines 78 - 80, The CSV append
logic in appendToCSV (collect-metrics.go) unconditionally appends rows, causing
duplicate (date, version) entries when the monthly branch is reused; update
appendToCSV to read the existing CSV first, check for an existing row matching
the new row's date and version, and either skip adding the duplicate or replace
the existing row before writing back (e.g., load CSV into a map keyed by
date+version, update/insert the entry, then write the deduplicated rows back to
disk). Ensure you still create the file if missing and preserve header handling
in appendToCSV.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/usage-metrics.yml:
- Around line 82-97: The current checkout of an existing remote branch (git
checkout "$BRANCH_NAME") can overwrite the just-staged docs/usage-metrics.csv;
instead, when origin/$BRANCH_NAME exists, avoid switching branches and simply
create a commit locally and push it to that remote branch (use the staged file
docs/usage-metrics.csv and commit with git commit -m "Update usage metrics",
then push with git push origin HEAD:"$BRANCH_NAME"); keep the existing git diff
--staged check and only perform the commit+push when there are staged changes,
and reference BRANCH_NAME and docs/usage-metrics.csv to locate the change.

---

Nitpick comments:
In @.github/workflows/usage-metrics.yml:
- Around line 78-80: The CSV append logic in appendToCSV (collect-metrics.go)
unconditionally appends rows, causing duplicate (date, version) entries when the
monthly branch is reused; update appendToCSV to read the existing CSV first,
check for an existing row matching the new row's date and version, and either
skip adding the duplicate or replace the existing row before writing back (e.g.,
load CSV into a map keyed by date+version, update/insert the entry, then write
the deduplicated rows back to disk). Ensure you still create the file if missing
and preserve header handling in appendToCSV.
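The dedup-on-append idea from this nitpick could look roughly like the following in Go. `upsertRow` is a hypothetical helper, not a function from the PR; it keys on the first two columns (date, version), replacing an existing row instead of appending a duplicate.

```go
package main

import "fmt"

// upsertRow replaces an existing (date, version) row or appends a new one,
// so re-running the collector on the same monthly branch never duplicates
// entries for the same date and version.
func upsertRow(rows [][]string, row []string) [][]string {
	for i, r := range rows {
		if len(r) >= 2 && len(row) >= 2 && r[0] == row[0] && r[1] == row[1] {
			rows[i] = row // same date+version: replace in place
			return rows
		}
	}
	return append(rows, row)
}

func main() {
	rows := [][]string{{"2026-04-01", "v0.30.0", "10"}}
	rows = upsertRow(rows, []string{"2026-04-01", "v0.30.0", "12"}) // replaces
	rows = upsertRow(rows, []string{"2026-04-01", "v0.31.0", "3"})  // appends
	fmt.Println(len(rows)) // prints 2
}
```

In the real appendToCSV this would sit between reading the existing CSV and writing it back, with header handling and file creation preserved as the prompt notes.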

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dace4a3e-4c78-4dbc-943a-cd60e250a576

📥 Commits

Reviewing files that changed from the base of the PR and between 31d00b2 and 22c4a61.

📒 Files selected for processing (2)
  • .github/workflows/usage-metrics.yml
  • usage-metrics/collect-metrics.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • usage-metrics/collect-metrics.go

Comment thread .github/workflows/usage-metrics.yml

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
usage-metrics/collect-metrics.go (2)

221-225: Consider write-to-temp-then-rename for atomic updates.

os.Create truncates the file before writing. If the process crashes between truncation and WriteAll completion, the CSV data is lost. For higher durability, write to a temporary file and rename atomically.

♻️ Atomic write pattern
-	out, err := os.Create(absPath)
+	tmpPath := absPath + ".tmp"
+	out, err := os.Create(tmpPath)
 	if err != nil {
 		return fmt.Errorf("create file: %w", err)
 	}
-	defer out.Close()
 
 	writer := csv.NewWriter(out)
 	if err := writer.Write(header); err != nil {
+		out.Close()
+		os.Remove(tmpPath)
 		return fmt.Errorf("write header: %w", err)
 	}
 	if err := writer.WriteAll(data); err != nil {
+		out.Close()
+		os.Remove(tmpPath)
 		return fmt.Errorf("write records: %w", err)
 	}
+	if err := out.Close(); err != nil {
+		os.Remove(tmpPath)
+		return fmt.Errorf("close temp file: %w", err)
+	}
+	if err := os.Rename(tmpPath, absPath); err != nil {
+		return fmt.Errorf("rename temp to final: %w", err)
+	}
 
 	return nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@usage-metrics/collect-metrics.go` around lines 221 - 225, Replace the direct
os.Create(absPath) usage with an atomic write-to-temp-then-rename pattern:
create a temp file in the same directory (use os.CreateTemp with the directory
from absPath), write your data to that temp file (the code that currently calls
WriteAll should target the temp file), sync/close the temp file, then atomically
rename it to absPath with os.Rename; ensure you still handle and return errors
from create/write/sync/close/rename and use the same file permissions as the
original if needed (references: absPath, out variable, and the write call that
currently writes the CSV).

127-146: Consider returning a distinct error or exit code for partial failures.

When versions remain pending after all passes, the function logs a warning but returns nil. This allows partial results to be written, which matches the PR objective. However, it also means CI workflows cannot distinguish between full success and partial failure without parsing logs.

If partial failure should surface in CI (e.g., to flag data gaps), consider returning a wrapped error or setting a non-zero exit code while still writing collected metrics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@usage-metrics/collect-metrics.go` around lines 127 - 146, The code logs
pending versions but returns nil, hiding partial failures; modify the function
that contains this block so that after appending metrics and successfully
calling sortCSV(csvPath) it checks len(pending)>0 and returns a distinct,
wrapped error (e.g., fmt.Errorf("partial failure: %d version(s) pending after %d
passes: %s", len(pending), maxPasses, strings.Join(pending, ", "))) or a
sentinel ErrPartialFailure while still writing metrics via appendToCSV and
sorting via sortCSV; ensure the error is returned after sortCSV succeeds so CI
can detect partial failures without losing written data.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@usage-metrics/collect-metrics.go`:
- Around line 214-219: The sort comparator for sort.SliceStable over variable
data currently indexes data[i][0] and data[i][1] directly and can panic on
malformed rows; update the comparator used in sort.SliceStable to first check
that both data[i] and data[j] have length >= 2 (e.g., len(data[i]) >= 2 &&
len(data[j]) >= 2) and handle cases where a row is short by treating missing
values as less/greater (consistent ordering) or placing malformed rows at the
end, then perform the date-then-version comparison only after those bounds
checks so indexing is always safe.
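The bounds-checked comparator could look like this; `sortRows` is a hypothetical stand-in for the sorting step inside the PR's sortCSV, shown in isolation.

```go
package main

import (
	"fmt"
	"sort"
)

// sortRows orders CSV rows by (date asc, version asc), pushing malformed
// rows (fewer than two fields) to the end instead of panicking on them.
func sortRows(data [][]string) {
	sort.SliceStable(data, func(i, j int) bool {
		iOK, jOK := len(data[i]) >= 2, len(data[j]) >= 2
		if !iOK || !jOK {
			return iOK && !jOK // well-formed rows sort before malformed ones
		}
		if data[i][0] != data[j][0] {
			return data[i][0] < data[j][0] // date ascending
		}
		return data[i][1] < data[j][1] // then version ascending
	})
}

func main() {
	rows := [][]string{
		{"2026-04-01", "v0.30.0"},
		{"bad"},
		{"2026-03-01", "v0.29.0"},
	}
	sortRows(rows)
	fmt.Println(rows[0][0]) // prints 2026-03-01: earliest date first
}
```

Note that lexicographic comparison of the date column only matches chronological order because the dates are ISO-8601 formatted.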

---

Nitpick comments:
In `@usage-metrics/collect-metrics.go`:
- Around line 221-225: Replace the direct os.Create(absPath) usage with an
atomic write-to-temp-then-rename pattern: create a temp file in the same
directory (use os.CreateTemp with the directory from absPath), write your data
to that temp file (the code that currently calls WriteAll should target the temp
file), sync/close the temp file, then atomically rename it to absPath with
os.Rename; ensure you still handle and return errors from
create/write/sync/close/rename and use the same file permissions as the original
if needed (references: absPath, out variable, and the write call that currently
writes the CSV).
- Around line 127-146: The code logs pending versions but returns nil, hiding
partial failures; modify the function that contains this block so that after
appending metrics and successfully calling sortCSV(csvPath) it checks
len(pending)>0 and returns a distinct, wrapped error (e.g., fmt.Errorf("partial
failure: %d version(s) pending after %d passes: %s", len(pending), maxPasses,
strings.Join(pending, ", "))) or a sentinel ErrPartialFailure while still
writing metrics via appendToCSV and sorting via sortCSV; ensure the error is
returned after sortCSV succeeds so CI can detect partial failures without losing
written data.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 62e2e17c-417a-470a-a453-3a2a45255f56

📥 Commits

Reviewing files that changed from the base of the PR and between 22c4a61 and 8c8b979.

📒 Files selected for processing (1)
  • usage-metrics/collect-metrics.go

Comment thread usage-metrics/collect-metrics.go
@mdelapenya mdelapenya self-assigned this Apr 1, 2026
@mdelapenya mdelapenya added the chore Changes that do not impact the existing functionality label Apr 1, 2026
@mdelapenya mdelapenya merged commit cc0e33d into testcontainers:main Apr 1, 2026
16 checks passed
@mdelapenya mdelapenya deleted the usage-metrics-refinement branch April 1, 2026 11:44

Labels

chore Changes that do not impact the existing functionality
