Fix flaky upgrade test: replace fixed sleeps with polling and increase timeouts #12245
…th polling

- Bump helmTimeout from "3m" to "5m" to handle slow CI runners
- Add controlPlaneTimeout (4m) for waitForControlPlane, up from 2m
- Replace fixed 3s apiServiceDeregistrationWait sleep with polling that checks Discovery API until the api.ucp.dev/v1alpha3 group is gone
- Replace fixed 5s sleep in testPreflightDisabled with a polling loop that checks for the job across multiple attempts

These changes address the root causes of the intermittent 503 errors: the helm install/upgrade exceeding the tight 3m timeout, and the aggregated APIService still being registered (returning 503) when the control plane readiness check runs.

Closes #11841

Co-authored-by: sylvainsf <540991+sylvainsf@users.noreply.github.com>
Dependency Review
✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.
Scanned Files: None
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #12245      +/-   ##
==========================================
- Coverage   52.88%   52.87%   -0.02%
==========================================
  Files         751      751
  Lines       48353    48353
==========================================
- Hits        25573    25568       -5
- Misses      20383    20386       +3
- Partials     2397     2399       +2
…ated

Re-point the `rad startup` / `rad shutdown` state-storage lifecycle test at this PR's deploy workflow model and remove the `RADIUS_STATE_E2E` gate so it runs in CI on its own dedicated, isolated cluster.

- Add a `statestore-noncloud` leg to the non-cloud functional matrix. Each matrix leg runs on its own runner with its own KinD cluster, so the test's destructive install/uninstall/reinstall cycle never affects other legs. The shared "Install Radius" step is skipped for this leg because the test drives its own install.
- Drive install with the build-under-test images (chart + per-RP image flags from testutil.SetDefault, DE_IMAGE/DE_TAG, and the secure local registry CA), mirroring the shared Install Radius step, plus `database.enabled=true` for the state backend.
- Harden the lifecycle against the flakes seen in the upgrade test (#12245) by replacing fixed sleeps with polling: wait for the control plane treating 503 from the UCP aggregated APIService as retryable, and poll discovery until `api.ucp.dev/v1alpha3` deregisters before reinstalling so the next install doesn't race the teardown.
- Add the `test-functional-statestore-noncloud` make target and a 40m timeout for the leg.

Related: #12118

Signed-off-by: Sylvain Niles <sylvainniles@microsoft.com>
Pull request overview

This PR reduces flakiness in the functional upgrade test suite by replacing fixed sleeps with polling and by increasing timeouts around Helm operations and control plane availability, targeting intermittent 503s during aggregated APIService lifecycle transitions.

Changes:

- Increase Helm install/upgrade `--timeout` from 3m to 5m.
- Increase post-Helm control plane readiness wait from 2m to 4m.
- Replace fixed sleeps with polling for (a) "no preflight job created" checks and (b) aggregated APIService deregistration.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Sylvain Niles <sylvainniles@microsoft.com>
Description

`Test_PreflightContainer` flakes with 503 "service unavailable" because the 3m helm timeout is too tight for shared CI runners and fixed sleeps race against actual API service lifecycle transitions.

Changes:

- `helmTimeout` 3m → 5m: a cold kind cluster regularly exceeds 3m installing the full chart under `--wait`
- `controlPlaneTimeout` 2m → 4m: the UCP aggregated APIService briefly returns 503 while pods roll after helm completes
- Replace the fixed 3s `apiServiceDeregistrationWait` sleep with Discovery API polling: polls `ServerGroupsAndResources()` until `api.ucp.dev/v1alpha3` is gone (30s timeout), eliminating the race where a new install starts while the old APIService is still registered
- Replace the fixed 5s sleep in `testPreflightDisabled` with a polling loop: `helm upgrade --wait` already synchronizes completion; poll for the job across 15 attempts instead of sleeping an arbitrary duration