Fix Test_Flux_Complex namespace teardown flake by willdavsmith · Pull Request #12013 · radius-project/radius

willdavsmith · 2026-05-28T22:07:14Z

Description

Test_Flux_Complex (and the other tests sharing testFluxIntegration) intermittently hits the 10-minute timeout in CI on the final namespace-delete wait. The root cause is three async Radius delete cascades racing against each other, and against the K8s namespace delete that the test fires the instant step 3 returns, while the DeploymentTemplate reconciler publishes Status.Phrase = Ready before residual DeploymentResource deletions have actually drained. controller-runtime's default exponential rate limiter then turns a couple of early dependency errors into multi-minute backoffs that blow past any practical test budget.

This PR ships three layered fixes:

1. Sequential test teardown (`testFluxIntegration`)

testFluxIntegration now tears down each DeploymentTemplate (and waits for full drainage of its owned DeploymentResources, finalizers, and underlying Radius resources) before deleting the K8s namespaces. The namespace delete no longer competes with in-flight Radius cascades, so the previously-flaky Eventually becomes a trivial wait for K8s GC.

We union DTs across all steps when iterating, because some tests (e.g. Test_Flux_Complex step 3) intentionally remove DTs in later steps and iterating only the last step would miss DTs created by earlier steps.

The per-step `defer opts.Client.Delete(ctx, deploymentTemplate)` is left in place as a belt-and-suspenders safeguard for tests that fail mid-loop.
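The ordering described above can be sketched without a cluster. This is a minimal illustration, not the code in flux_test.go: `templateRef`, `unionTemplates`, and `teardown` are hypothetical names, and the delete/wait operations are injected as plain functions so only the sequencing is shown.

```go
package main

import "fmt"

// templateRef is a hypothetical stand-in for a DeploymentTemplate's namespace/name.
type templateRef struct {
	Namespace string
	Name      string
}

// unionTemplates returns every template seen in any step, in first-seen order,
// so templates dropped by a later step are still torn down.
func unionTemplates(steps [][]templateRef) []templateRef {
	seen := make(map[templateRef]bool)
	var all []templateRef
	for _, step := range steps {
		for _, ref := range step {
			if !seen[ref] {
				seen[ref] = true
				all = append(all, ref)
			}
		}
	}
	return all
}

// teardown deletes each template and waits for it to drain before touching
// any namespace. Both operations are injected (assumption) so the ordering
// can be exercised without Kubernetes.
func teardown(steps [][]templateRef, deleteAndWait func(templateRef), deleteNamespace func(string)) {
	refs := unionTemplates(steps)
	for _, ref := range refs {
		deleteAndWait(ref)
	}
	seenNS := make(map[string]bool)
	for _, ref := range refs {
		if !seenNS[ref.Namespace] {
			seenNS[ref.Namespace] = true
			deleteNamespace(ref.Namespace)
		}
	}
}

func main() {
	steps := [][]templateRef{
		{{"flux-a", "dt-1"}, {"flux-b", "dt-2"}},
		{{"flux-a", "dt-1"}}, // a later step drops dt-2
	}
	teardown(steps,
		func(r templateRef) { fmt.Println("delete template and wait:", r.Namespace+"/"+r.Name) },
		func(ns string) { fmt.Println("delete namespace:", ns) },
	)
}
```

Iterating only the last step here would skip `flux-b/dt-2`, which is the case the union is meant to cover.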

2. Make `DeploymentTemplate.Status.Phrase = Ready` honest

Today, DeploymentTemplateReconciler.reconcileOperation issues r.Client.Delete() on residual DRs from a removed output and then immediately stamps Phrase = Ready. The K8s Delete only sets a DeletionTimestamp; the underlying UCP delete cascade is still in flight. This causes user-visible inconsistency on every rename/remove cycle (status reports success while orphan resources still exist) and feeds the test race above.

This PR adds a new transitional phrase:

DeploymentTemplatePhraseReadyPendingCleanup DeploymentTemplatePhrase = "ReadyPendingCleanup"

After the diff-delete loop, the reconciler now lists owned DRs and compares their Spec.Id against the new outputResources set. If any DR's ID is not in the new set, the DT holds ReadyPendingCleanup and requeues. On the next reconcile (woken by the existing Owns(&DeploymentResource{}) watch when an owned DR is finally GC'd), reconcileUpdate re-evaluates and flips to Ready only once the residuals are gone.

Backwards-compatible: anything reading Phrase and treating non-Ready as "not done yet" continues to work — better, in fact, because Ready no longer lies.
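A rough sketch of the residual check, under stated assumptions: `ownedResource` and `nextPhrase` are hypothetical names, IDs are compared as exact strings, and the real reconciler works on Kubernetes objects rather than plain slices.

```go
package main

import "fmt"

type phrase string

const (
	phraseReady               phrase = "Ready"
	phraseReadyPendingCleanup phrase = "ReadyPendingCleanup"
)

// ownedResource is a hypothetical stand-in for an owned DeploymentResource;
// ID mirrors Spec.Id.
type ownedResource struct{ ID string }

// nextPhrase reports ReadyPendingCleanup while any owned resource has an ID
// outside the latest output set, and Ready once none remain.
// Exact string comparison of IDs is an assumption of this sketch.
func nextPhrase(owned []ownedResource, outputIDs []string) phrase {
	wanted := make(map[string]struct{}, len(outputIDs))
	for _, id := range outputIDs {
		wanted[id] = struct{}{}
	}
	for _, dr := range owned {
		if _, ok := wanted[dr.ID]; !ok {
			return phraseReadyPendingCleanup
		}
	}
	return phraseReady
}

func main() {
	outputs := []string{"res-a"}

	// res-b was removed from the template but its deletion has not drained yet.
	fmt.Println(nextPhrase([]ownedResource{{"res-a"}, {"res-b"}}, outputs))

	// Once res-b is gone, the template can report Ready.
	fmt.Println(nextPhrase([]ownedResource{{"res-a"}}, outputs))
}
```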

3. Cap DR delete-retry backoff with explicit `RequeueAfter`

DeploymentResourceReconciler previously returned UCP delete-path errors directly to controller-runtime, which applied its default rate-limiter (exponential, climbing to ~16 minutes). A couple of early dependency errors during a delete cascade could burn 5+ minutes before another retry — retries that would have succeeded immediately.

This PR replaces all three delete-path `return ctrl.Result{}, err` sites with `return ctrl.Result{RequeueAfter: r.deleteRetryDelay()}, nil`. The default is 30s, overridable in tests via a new `DeleteRetryInterval` reconciler field (parallel to the existing `DelayInterval`/`PollingDelay`/`requeueDelay()` pattern).

Type of change

This pull request fixes a bug in Radius and has an approved issue (issue link required).

Fixes: #11874

Contributor checklist

Please verify that the PR meets the following requirements, where applicable:

An overview of proposed schema changes is included in a linked GitHub issue.
- Not applicable
A design document is added or updated under eng/design-notes/ in this repository, if new APIs are being introduced.
- Not applicable
The design document has been reviewed and approved by Radius maintainers/approvers.
- Not applicable
A PR for resource-types-contrib is created, if resource types or recipes are affected by the changes in this PR.
- Not applicable
A PR for dashboard is created, if the Radius Dashboard is affected by the changes in this PR.
- Not applicable
A PR for the documentation repository is created, if the changes in this PR affect the documentation or any user facing updates are made.
- Not applicable

Verification

New envtest: Test_DeploymentTemplateReconciler_ReadyPendingCleanup (pkg/controller/reconciler) — exercises the new phrase transition.
New envtest: Test_DeploymentResourceReconciler_DeleteRetryBackoff — verifies a failed UCP delete triggers a fresh retry within the bounded interval rather than controller-runtime's exponential rate-limiter.
Full pkg/controller/... envtest suite passes locally (~30s).
Functional verification (recommended before merge): make test-functional-kubernetes-noncloud GOTEST_OPTS='-run Test_Flux_Complex -v -count=5'.

Out of scope

#8164 (Delete sub-resources when application is deleted via API; cascade-safe env/app deletion in applications-rp) is the architecturally correct long-term fix. Change 2 above substantially reduces the window in which it matters, and change 3 makes the remaining failures cheap, but the underlying same-owner-only dependency check in checkForDeploymentResourceDependencies is still load-bearing until #8164 lands.

Three layered changes to make Test_Flux_Complex deterministic and to fix the user-observable races it exposes:

* testFluxIntegration: tear down each DeploymentTemplate (and wait for full drainage of its owned DeploymentResources) before deleting the K8s namespaces, so the namespace delete no longer races parallel Radius delete cascades.
* DeploymentTemplateReconciler: introduce a new ReadyPendingCleanup phrase and gate the transition to Ready on full drainage of DeploymentResources removed by the latest diff, so observers (Flux, the test harness, custom operators) do not act on a premature 'Ready' signal.
* DeploymentResourceReconciler: bound delete-path retries with an explicit RequeueAfter (default 30s, test-overridable via DeleteRetryInterval) instead of returning the error to controller-runtime, whose default exponential rate-limiter climbs to ~16 minutes and blows past practical drain budgets.

Fixes: radius-project#11874

Signed-off-by: willdavsmith <willdavsmith@gmail.com>

codecov · 2026-05-28T22:22:44Z

Codecov Report

❌ Patch coverage is 73.33333% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.97%. Comparing base (fc4f38b) to head (498c8e3).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...roller/reconciler/deploymenttemplate_reconciler.go | 77.27% | 7 Missing and 8 partials ⚠️ |
| ...roller/reconciler/deploymentresource_reconciler.go | 44.44% | 5 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #12013      +/-   ##
==========================================
+ Coverage   51.90%   51.97%   +0.06%     
==========================================
  Files         732      732              
  Lines       46272    46305      +33     
==========================================
+ Hits        24016    24065      +49     
+ Misses      19957    19945      -12     
+ Partials     2299     2295       -4


Copilot

Pull request overview

Fixes intermittent Test_Flux_Complex teardown timeouts by removing races between Kubernetes namespace deletion and in-flight Radius delete cascades, and by making controller status/backoff behavior more accurate and bounded during delete/cleanup paths.

Changes:

Sequence functional-test teardown to delete DeploymentTemplates (and wait for their cascades to drain) before deleting Kubernetes namespaces.
Add DeploymentTemplatePhraseReadyPendingCleanup and keep Status.Phrase from reporting Ready until residual DeploymentResource deletions are fully drained.
Replace controller-runtime exponential backoff on DeploymentResource delete-path errors with a bounded RequeueAfter via a configurable DeleteRetryInterval.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.


| File | Description |
|---|---|
| test/functional-portable/kubernetes/noncloud/flux_test.go | Collect DTs across steps and delete/wait them before namespace deletion to avoid teardown races. |
| test/functional-portable/kubernetes/noncloud/deploymenttemplate_test.go | Adds `deleteDeploymentTemplateAndWait` helper used to sequence Radius-side teardown. |
| pkg/controller/reconciler/deploymenttemplate_reconciler.go | Introduces ReadyPendingCleanup logic and residual DR detection before setting DT to Ready. |
| pkg/controller/reconciler/deploymenttemplate_reconciler_test.go | Adds envtest coverage for Ready → ReadyPendingCleanup → Ready transition. |
| pkg/controller/reconciler/deploymentresource_reconciler.go | Adds bounded delete retry (`DeleteRetryInterval` + `RequeueAfter`) to avoid exponential rate limiting. |
| pkg/controller/reconciler/deploymentresource_reconciler_test.go | Adds envtest ensuring failed delete retries occur within the bounded interval. |
| pkg/controller/reconciler/const.go | Adds `DeleteRetryDelay` default for bounded delete retries. |
| pkg/controller/api/radapp.io/v1alpha3/deploymenttemplate_types.go | Adds the new `ReadyPendingCleanup` status phrase constant. |

DariuszPorowski

LGTM, one non-blocking comment in-line

* Use require.NoErrorf / require.Eventuallyf in deleteDeploymentTemplateAndWait so the %s/%s placeholders are formatted in failure messages (the variadic msgAndArgs form of NoError / Eventually does not run through fmt.Sprintf).
* Tighten isOwnedBy to also match OwnerReference.UID. Without this, a DeploymentTemplate that is deleted and recreated with the same name would see its predecessor's still-draining DeploymentResources as residuals, holding the new DT in ReadyPendingCleanup indefinitely.

Signed-off-by: willdavsmith <willdavsmith@gmail.com>
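The UID tightening can be illustrated with stand-in types. This is a sketch under assumptions: `ownerRef` and `object` model only the fields the check needs (the real types would be `metav1.OwnerReference` and Kubernetes objects), and the signature of `isOwnedBy` here is hypothetical.

```go
package main

import "fmt"

// ownerRef and object are minimal stand-ins for metav1.OwnerReference and a
// Kubernetes object; only the fields the ownership check needs are modeled.
type ownerRef struct {
	Kind string
	Name string
	UID  string
}

type object struct {
	Name   string
	UID    string
	Owners []ownerRef
}

// isOwnedBy matches on kind, name, and UID. Matching the UID is what keeps a
// recreated owner with the same name from adopting its predecessor's children.
func isOwnedBy(child, owner object, ownerKind string) bool {
	for _, ref := range child.Owners {
		if ref.Kind == ownerKind && ref.Name == owner.Name && ref.UID == owner.UID {
			return true
		}
	}
	return false
}

func main() {
	oldDT := object{Name: "my-template", UID: "uid-1"}
	newDT := object{Name: "my-template", UID: "uid-2"} // deleted and recreated

	// A resource still draining from the old template.
	dr := object{
		Name:   "my-resource",
		UID:    "uid-dr",
		Owners: []ownerRef{{Kind: "DeploymentTemplate", Name: "my-template", UID: "uid-1"}},
	}

	fmt.Println("owned by old template:", isOwnedBy(dr, oldDT, "DeploymentTemplate"))
	fmt.Println("owned by new template:", isOwnedBy(dr, newDT, "DeploymentTemplate"))
}
```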

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

* DeploymentTemplateReconciler.reconcileOperation: run the diff-delete loop even when the deployment response has nil OutputResources. Treating a nil response as an empty set ensures that a redeploy which produces no outputs still tears down the previously-tracked DRs; without this, the new hasResidualDeploymentResources check would flag the prior DRs as residuals forever and stick the DT in ReadyPendingCleanup.
* Add Test_DeploymentTemplateReconciler_NilOutputResourcesTriggersCleanup to lock in the behavior above.
* Fix grammar in deleteDeploymentTemplateAndWait godoc (plural subject).

Signed-off-by: willdavsmith <willdavsmith@gmail.com>
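The nil-as-empty behavior is easy to show in isolation. A minimal sketch, assuming resources are identified by plain string IDs; `staleIDs` is a hypothetical helper, not a function from the reconciler.

```go
package main

import "fmt"

// staleIDs returns the tracked IDs that are absent from outputs. A nil
// outputs slice is handled like an empty one, so every tracked ID is stale.
func staleIDs(tracked, outputs []string) []string {
	keep := make(map[string]bool, len(outputs))
	for _, id := range outputs {
		keep[id] = true
	}
	var stale []string
	for _, id := range tracked {
		if !keep[id] {
			stale = append(stale, id)
		}
	}
	return stale
}

func main() {
	tracked := []string{"res-a", "res-b"}

	// A redeploy that keeps res-a only.
	fmt.Println(staleIDs(tracked, []string{"res-a"}))

	// A redeploy that reports no outputs at all: everything tracked is stale.
	fmt.Println(staleIDs(tracked, nil))
}
```

Skipping this computation when the response is nil would leave both tracked IDs in place, which is the stuck state the commit describes.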

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Capture the polling error and surface it via require.NoErrorf outside the Eventually loop, so a real API/RBAC/connectivity failure during teardown is reported immediately rather than masquerading as a 10-minute timeout.

Signed-off-by: willdavsmith <willdavsmith@gmail.com>

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

A DR removed by the latest deployment diff may already be gone (GC-ed or manually deleted); returning the NotFound error from r.Client.Delete would fail the reconcile and reintroduce controller-runtime backoff during cleanup. Wrap with client.IgnoreNotFound so the diff-delete loop is idempotent.

Signed-off-by: willdavsmith <willdavsmith@gmail.com>

radius-functional-tests · 2026-05-29T20:09:12Z

Radius functional test overview

🔍 Go to test action run

Click here to see the test run details

| Name | Value |
|---|---|
| Repository | willdavsmith/radius |
| Commit ref | `498c8e3` |
| Unique ID | funcbf8b73bbd2 |
| Image tag | pr-funcbf8b73bbd2 |

gotestsum 1.13.0
KinD: v0.29.0
Dapr: 1.14.4
Azure KeyVault CSI driver: 1.4.2
Azure Workload identity webhook: 1.3.0
Bicep recipe location ghcr.io/radius-project/dev/test/testrecipes/test-bicep-recipes/<name>:pr-funcbf8b73bbd2
Terraform recipe location http://tf-module-server.radius-test-tf-module-server.svc.cluster.local/<name>.zip (in cluster)
applications-rp test image location: ghcr.io/radius-project/dev/applications-rp:pr-funcbf8b73bbd2
dynamic-rp test image location: ghcr.io/radius-project/dev/dynamic-rp:pr-funcbf8b73bbd2
controller test image location: ghcr.io/radius-project/dev/controller:pr-funcbf8b73bbd2
ucp test image location: ghcr.io/radius-project/dev/ucpd:pr-funcbf8b73bbd2
deployment-engine test image location: ghcr.io/radius-project/deployment-engine:latest

Test Status

⌛ Building Radius and pushing container images for functional tests...
✅ Container images build succeeded
⌛ Publishing Bicep Recipes for functional tests...
✅ Recipe publishing succeeded
⌛ Starting corerp-cloud functional tests...
⌛ Starting ucp-cloud functional tests...
✅ corerp-cloud functional tests succeeded
✅ ucp-cloud functional tests succeeded
✅ corerp-cloud functional tests succeeded

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

willdavsmith added the pr:standard label May 29, 2026

willdavsmith marked this pull request as ready for review May 29, 2026 17:48

Copilot AI review requested due to automatic review settings May 29, 2026 17:48

willdavsmith requested review from a team as code owners May 29, 2026 17:48

Copilot started reviewing on behalf of willdavsmith May 29, 2026 17:48

Copilot AI reviewed May 29, 2026

Comment thread test/functional-portable/kubernetes/noncloud/deploymenttemplate_test.go

Comment thread test/functional-portable/kubernetes/noncloud/deploymenttemplate_test.go Outdated

Comment thread pkg/controller/reconciler/deploymenttemplate_reconciler.go

DariuszPorowski previously approved these changes May 29, 2026


Comment thread pkg/controller/reconciler/deploymentresource_reconciler.go

willdavsmith dismissed DariuszPorowski’s stale review via f01d127 May 29, 2026 18:20

Merge branch 'main' into willdavsmith/flux-fix-052826

b321f50

willdavsmith requested review from DariuszPorowski and Copilot May 29, 2026 18:28

Copilot started reviewing on behalf of willdavsmith May 29, 2026 18:28

Copilot AI reviewed May 29, 2026

Comment thread pkg/controller/reconciler/deploymenttemplate_reconciler.go

Comment thread test/functional-portable/kubernetes/noncloud/deploymenttemplate_test.go

willdavsmith requested a review from Copilot May 29, 2026 18:45

Copilot started reviewing on behalf of willdavsmith May 29, 2026 18:46

Copilot AI reviewed May 29, 2026

Comment thread test/functional-portable/kubernetes/noncloud/deploymenttemplate_test.go

willdavsmith requested a review from Copilot May 29, 2026 19:55

Copilot started reviewing on behalf of willdavsmith May 29, 2026 19:55

Copilot AI reviewed May 29, 2026

Comment thread pkg/controller/reconciler/deploymenttemplate_reconciler.go

willdavsmith requested a review from Copilot May 29, 2026 20:10

Copilot started reviewing on behalf of willdavsmith May 29, 2026 20:10

Copilot AI reviewed May 29, 2026

DariuszPorowski approved these changes May 29, 2026


willdavsmith enabled auto-merge (squash) May 29, 2026 20:33

willdavsmith merged commit 52d5ac5 into radius-project:main May 29, 2026
56 checks passed

rad-ci-bot mentioned this pull request May 29, 2026

Update auto-generated documentation (PR #12013) radius-project/docs#1896

Open

willdavsmith merged 6 commits into radius-project:main from willdavsmith:willdavsmith/flux-fix-052826


3 participants
