Fix flaky upgrade test: replace fixed sleeps with polling and increase timeouts #12245

Merged
brooke-hamilton merged 5 commits into main from copilot/fix-flaky-upgrade-noncloud-tests
Jun 25, 2026
Conversation

Copilot AI commented Jun 25, 2026

Contributor

Description

Test_PreflightContainer flakes with 503 "service unavailable" because the 3m helm timeout is too tight for shared CI runners and fixed sleeps race against actual API service lifecycle transitions.

Changes:

  • helmTimeout 3m → 5m — cold kind cluster regularly exceeds 3m installing the full chart under --wait
  • controlPlaneTimeout 2m → 4m — the UCP aggregated APIService briefly returns 503 while pods roll after helm completes
  • Replace 3s apiServiceDeregistrationWait sleep with Discovery API polling — polls ServerGroupsAndResources() until api.ucp.dev/v1alpha3 is gone (30s timeout), eliminating the race where a new install starts while the old APIService is still registered
  • Replace 5s sleep in testPreflightDisabled with polling loop: helm upgrade --wait already synchronizes completion, so poll for the job across 15 attempts instead of sleeping an arbitrary duration

Type of change

  • This pull request fixes a bug in Radius and has an approved issue (issue link required).

Contributor checklist

Please verify that the PR meets the following requirements, where applicable:

  • An overview of proposed schema changes is included in a linked GitHub issue.
    • Not applicable
  • A design document is added or updated under eng/design-notes/ in this repository, if new APIs are being introduced.
    • Not applicable
  • The design document has been reviewed and approved by Radius maintainers/approvers.
    • Not applicable
  • A PR for resource-types-contrib is created, if resource types or recipes are affected by the changes in this PR.
    • Not applicable
  • A PR for dashboard is created, if the Radius Dashboard is affected by the changes in this PR.
    • Not applicable
  • A PR for the documentation repository is created, if the changes in this PR affect the documentation or any user facing updates are made.
    • Not applicable

…th polling

- Bump helmTimeout from "3m" to "5m" to handle slow CI runners
- Add controlPlaneTimeout (4m) for waitForControlPlane, up from 2m
- Replace fixed 3s apiServiceDeregistrationWait sleep with polling that
  checks Discovery API until the api.ucp.dev/v1alpha3 group is gone
- Replace fixed 5s sleep in testPreflightDisabled with a polling loop
  that checks for the job across multiple attempts

These changes address the root causes of the intermittent 503 errors:
the helm install/upgrade exceeding the tight 3m timeout, and the
aggregated APIService still being registered (returning 503) when
the control plane readiness check runs.

Closes #11841

Co-authored-by: sylvainsf <540991+sylvainsf@users.noreply.github.com>
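
As an illustration of the two timeout bumps in the commit message above, here is a minimal sketch of how such values might be declared and passed to Helm. The constant names and values come from the commit message. The `time.Duration` type for `controlPlaneTimeout`, the `helmUpgradeArgs` helper, and its parameters are assumptions made for this example only.

```go
package main

import "time"

const (
	// helmTimeout is passed verbatim to Helm's --timeout flag (was "3m").
	helmTimeout = "5m"
	// controlPlaneTimeout bounds the post-Helm readiness wait (was 2m).
	// Its type in the real test is an assumption of this sketch.
	controlPlaneTimeout = 4 * time.Minute
)

// helmUpgradeArgs is a hypothetical helper that builds the argument list for
// `helm upgrade` with --wait and the shared timeout.
func helmUpgradeArgs(release, chart, namespace string) []string {
	return []string{
		"upgrade", release, chart,
		"--namespace", namespace,
		"--wait",
		"--timeout", helmTimeout,
	}
}
```

Keeping both values as named constants means a future adjustment for slower runners touches a single place.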
Copilot AI changed the title from "[WIP] Fix flaky test for upgrade noncloud functional tests" to "Fix flaky upgrade test: replace fixed sleeps with polling and increase timeouts" Jun 25, 2026
Copilot AI requested a review from sylvainsf June 25, 2026 00:58
@github-actions


Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

github-actions Bot commented Jun 25, 2026


Unit Tests

    2 files  ±0    450 suites  ±0   7m 42s ⏱️ +5s
5 591 tests ±0  5 589 ✅ ±0  2 💤 ±0  0 ❌ ±0 
6 788 runs  ±0  6 786 ✅ ±0  2 💤 ±0  0 ❌ ±0 

Results for commit b98b208. ± Comparison against base commit 2626297.

♻️ This comment has been updated with latest results.

codecov Bot commented Jun 25, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.87%. Comparing base (2626297) to head (b98b208).

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #12245      +/-   ##
==========================================
- Coverage   52.88%   52.87%   -0.02%     
==========================================
  Files         751      751              
  Lines       48353    48353              
==========================================
- Hits        25573    25568       -5     
- Misses      20383    20386       +3     
- Partials     2397     2399       +2     


sylvainsf added a commit that referenced this pull request Jun 25, 2026
…ated

Re-point the `rad startup` / `rad shutdown` state-storage lifecycle test at this
PR's deploy workflow model and remove the `RADIUS_STATE_E2E` gate so it runs in
CI on its own dedicated, isolated cluster.

- Add a `statestore-noncloud` leg to the non-cloud functional matrix. Each matrix
  leg runs on its own runner with its own KinD cluster, so the test's destructive
  install/uninstall/reinstall cycle never affects other legs. The shared "Install
  Radius" step is skipped for this leg because the test drives its own install.
- Drive install with the build-under-test images (chart + per-RP image flags from
  testutil.SetDefault, DE_IMAGE/DE_TAG, and the secure local registry CA), mirroring
  the shared Install Radius step, plus `database.enabled=true` for the state backend.
- Harden the lifecycle against the flakes seen in the upgrade test (#12245): replace
  fixed sleeps with polling — wait for the control plane treating 503 from the UCP
  aggregated APIService as retryable, and poll discovery until `api.ucp.dev/v1alpha3`
  deregisters before reinstalling so the next install doesn't race the teardown.
- Add the `test-functional-statestore-noncloud` make target and a 40m timeout for
  the leg.

Related: #12118
Signed-off-by: Sylvain Niles <sylvainniles@microsoft.com>
sylvainsf mentioned this pull request Jun 25, 2026
12 tasks
sylvainsf marked this pull request as ready for review June 25, 2026 05:35
sylvainsf requested review from a team as code owners June 25, 2026 05:35
Copilot AI review requested due to automatic review settings June 25, 2026 05:35

Copilot AI left a comment

Contributor

Pull request overview

This PR reduces flakiness in the functional upgrade test suite by replacing fixed sleeps with polling and by increasing timeouts around Helm operations and control plane availability, targeting intermittent 503s during aggregated APIService lifecycle transitions.

Changes:

  • Increase Helm install/upgrade --timeout from 3m to 5m.
  • Increase post-Helm control plane readiness wait from 2m to 4m.
  • Replace fixed sleeps with polling for (a) “no preflight job created” checks and (b) aggregated APIService deregistration.

Comment thread test/functional-portable/upgrade/upgrade_test.go Outdated
Comment thread test/functional-portable/upgrade/upgrade_test.go Outdated
sylvainsf and others added 2 commits June 24, 2026 22:46
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Sylvain Niles <sylvainniles@microsoft.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Sylvain Niles <sylvainniles@microsoft.com>
radius-functional-tests Bot commented Jun 25, 2026


Radius functional test overview

🔍 Go to test action run

Click here to see the test run details
Name        Value
Repository  radius-project/radius
Commit ref  b98b208
Unique ID   func7e7bfcd24e
Image tag   pr-func7e7bfcd24e
  • KinD: v0.29.0
  • Dapr: 1.14.4
  • Azure KeyVault CSI driver: 1.4.2
  • Azure Workload identity webhook: 1.3.0
  • Bicep recipe location ghcr.io/radius-project/dev/test/testrecipes/test-bicep-recipes/<name>:pr-func7e7bfcd24e
  • Terraform recipe location http://tf-module-server.radius-test-tf-module-server.svc.cluster.local/<name>.zip (in cluster)
  • applications-rp test image location: ghcr.io/radius-project/dev/applications-rp:pr-func7e7bfcd24e
  • dynamic-rp test image location: ghcr.io/radius-project/dev/dynamic-rp:pr-func7e7bfcd24e
  • controller test image location: ghcr.io/radius-project/dev/controller:pr-func7e7bfcd24e
  • ucp test image location: ghcr.io/radius-project/dev/ucpd:pr-func7e7bfcd24e
  • deployment-engine test image location: ghcr.io/radius-project/deployment-engine:latest

Test Status

⌛ Building Radius and pushing container images for functional tests...
✅ Container images build succeeded
⌛ Publishing Bicep Recipes for functional tests...
✅ Recipe publishing succeeded
⌛ Starting ucp-cloud functional tests...
⌛ Starting corerp-cloud functional tests...
✅ ucp-cloud functional tests succeeded
✅ corerp-cloud functional tests succeeded

brooke-hamilton added this pull request to the merge queue Jun 25, 2026
Merged via the queue into main with commit 98f5623 Jun 25, 2026
75 checks passed
brooke-hamilton deleted the copilot/fix-flaky-upgrade-noncloud-tests branch June 25, 2026 15:47

4 participants