deploy: retry uncordon with exponential backoff in bluegreen strategy#4851
Merged
A transient HTTP 408 from the uncordon API endpoint was enough to fail an otherwise healthy bluegreen deployment and trigger a full rollback, destroying every newly-created green machine.

MarkGreenMachinesAsReadyForTraffic now wraps each Uncordon call with retry.Do using exponential backoff (up to 5 attempts, starting at 500 ms, capped at 30 s). A retry line is printed to stderr so the operator can see what is happening.

Two new fields on blueGreen control the retry behaviour:

- uncordonRetryAttempts – total attempts (default 5)
- uncordonRetryDelay – initial delay (default 500 ms)

Both are zeroed out in the test helper so existing and new tests finish instantly.

Tests added in TestMarkGreenMachinesAsReadyForTrafficRetries:

- succeeds immediately when no errors occur
- succeeds after transient uncordon failures are retried
- fails after all retry attempts are exhausted

The mock gains an uncordonTransientFailures counter that fails exactly N times before succeeding, allowing precise retry-path coverage.
What and Why:

A transient HTTP 408 from the uncordon API endpoint was enough to fail an otherwise healthy bluegreen deployment and trigger a full rollback, destroying every newly-created green machine even though they had all started and passed health checks. Retrying the uncordon call with exponential backoff makes the deployment resilient to transient API timeouts without any change to the rollback logic.

How:

MarkGreenMachinesAsReadyForTraffic wraps each Uncordon call in retry.Do (already imported) with exponential backoff: up to 5 attempts, initial delay 500 ms, capped at 30 s. A message is printed to stderr on each retry so the operator can see what is happening.

Two new fields on blueGreen control the behaviour:

- uncordonRetryAttempts – total attempts (default 5)
- uncordonRetryDelay – initial delay (default 500 ms)

Both are zeroed in the test helper so all tests remain instant.

The mock gains an uncordonTransientFailures counter (protected by the existing mutex) that fails exactly N times then succeeds, enabling targeted retry-path coverage in TestMarkGreenMachinesAsReadyForTrafficRetries:

- succeeds immediately when no errors occur
- succeeds after transient uncordon failures are retried
- fails after all retry attempts are exhausted

Related to: n/a

---

- [x] n/a
dangra (Member) approved these changes on Apr 22, 2026 and left a comment:
The PR is marked as draft but it looks good to me.
Change Summary
- **What and Why**: the problem (transient 408 → unnecessary full rollback) and the goal (resilience without touching rollback logic)
- **How**: retry.Do, the two new fields and their defaults, how tests stay fast (zero delay in helper), the new mock counter, and the three new sub-tests
- **Related to**: n/a, since this is a self-contained fix
- **Checklist**: n/a checked, since this is an internal deployment-engine change with no user-facing docs needed