deploy: retry uncordon with exponential backoff in bluegreen strategy #4851

Merged
lubien merged 2 commits into master from retry-uncordon-with-exponential-backoff
Apr 22, 2026

@lubien lubien commented Apr 22, 2026

Change Summary

  • What and Why: the problem (a transient 408 causing an unnecessary full rollback) and the goal (resilience without touching the rollback logic)
  • How: implementation details, namely retry.Do, the two new fields and their defaults, how tests stay fast (zero delay in the helper), the new mock counter, and the three new sub-tests
  • Related to: n/a, since this is a self-contained fix
  • Documentation: n/a checked, since this is an internal deployment-engine change with no user-facing docs needed

A transient HTTP 408 from the uncordon API endpoint was enough to fail an otherwise healthy bluegreen deployment and trigger a full rollback, destroying every newly-created green machine.

MarkGreenMachinesAsReadyForTraffic now wraps each Uncordon call with retry.Do using exponential backoff (up to 5 attempts, starting at 500 ms, capped at 30 s). A retry line is printed to stderr so the operator can see what is happening.

Two new fields on blueGreen control the retry behaviour:
uncordonRetryAttempts – total attempts (default 5)
uncordonRetryDelay – initial delay (default 500 ms)

Both are zeroed out in the test helper so existing and new tests finish instantly.

Tests added in TestMarkGreenMachinesAsReadyForTrafficRetries:

  • succeeds immediately when no errors occur
  • succeeds after transient uncordon failures are retried
  • fails after all retry attempts are exhausted

The mock gains an uncordonTransientFailures counter that fails exactly N times before succeeding, allowing precise retry-path coverage.
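
A mock of that shape might look like the sketch below. Only the `uncordonTransientFailures` name comes from the description; the `mockUncordoner` type, the `uncordonCalls` field, the `Uncordon` signature, and the error text are hypothetical and chosen for illustration.

```go
package main

import (
	"errors"
	"sync"
)

// mockUncordoner is a hypothetical test double. Uncordon fails while
// uncordonTransientFailures is positive, decrementing it each time,
// and succeeds on every call after that.
type mockUncordoner struct {
	mu                        sync.Mutex
	uncordonTransientFailures int // number of calls that should still fail
	uncordonCalls             int // total calls seen, useful for assertions
}

func (m *mockUncordoner) Uncordon() error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.uncordonCalls++
	if m.uncordonTransientFailures > 0 {
		m.uncordonTransientFailures--
		return errors.New("request timeout (408)")
	}
	return nil
}
```

Setting the counter to one less than the attempt limit exercises the "succeeds after retries" path, while setting it to the limit or higher exercises the "all attempts exhausted" path.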

@lubien lubien requested review from dangra and rianmcguirefly April 22, 2026 12:52
What and Why:
A transient HTTP 408 from the uncordon API endpoint was enough to fail
an otherwise healthy bluegreen deployment and trigger a full rollback,
destroying every newly-created green machine even though they had all
started and passed health checks. Retrying the uncordon call with
exponential backoff makes the deployment resilient to transient API
timeouts without any change to the rollback logic.

How:
MarkGreenMachinesAsReadyForTraffic wraps each Uncordon call in
retry.Do (already imported) with exponential backoff: up to 5 attempts,
initial delay 500 ms, capped at 30 s. A message is printed to stderr on
each retry so the operator can see what is happening.

Two new fields on blueGreen control the behaviour:
  uncordonRetryAttempts – total attempts   (default 5)
  uncordonRetryDelay    – initial delay    (default 500 ms)

Both are zeroed in the test helper so all tests remain instant.
The mock gains an uncordonTransientFailures counter (protected by the
existing mutex) that fails exactly N times then succeeds, enabling
targeted retry-path coverage in TestMarkGreenMachinesAsReadyForTrafficRetries:
  - succeeds immediately when no errors occur
  - succeeds after transient uncordon failures are retried
  - fails after all retry attempts are exhausted

Related to: n/a

---

- [x] Documentation: n/a

@dangra dangra left a comment

The PR is marked as draft but it looks good to me.

@lubien lubien marked this pull request as ready for review April 22, 2026 15:15
@lubien lubien merged commit d9b20ae into master Apr 22, 2026
39 of 41 checks passed
@lubien lubien deleted the retry-uncordon-with-exponential-backoff branch April 22, 2026 15:16