Fix Test_Flux_Complex namespace teardown flake #12013

Merged: willdavsmith merged 6 commits into radius-project:main from willdavsmith:willdavsmith/flux-fix-052826 on May 29, 2026
@willdavsmith
Contributor

Description

Test_Flux_Complex (and the other tests sharing testFluxIntegration) intermittently hits the 10-minute timeout in CI on the final namespace-delete wait. The root cause is three async Radius delete cascades racing against each other, and against the K8s namespace delete that the test fires the instant step 3 returns, while the DeploymentTemplate reconciler publishes Status.Phrase = Ready before residual DeploymentResource deletions have actually drained. controller-runtime's default exponential rate limiter then turns a couple of early dependency errors into multi-minute backoffs that blow past any practical test budget.

This PR ships three layered fixes:

1. Sequential test teardown (testFluxIntegration)

testFluxIntegration now tears down each DeploymentTemplate (and waits for its owned DeploymentResources, finalizers, and underlying Radius resources to drain fully) before deleting the K8s namespaces. The namespace delete no longer competes with in-flight Radius cascades, so the previously flaky Eventually becomes a trivial wait for K8s GC.

We union DTs across all steps when iterating, because some tests (e.g. Test_Flux_Complex step 3) intentionally remove DTs in later steps and iterating only the last step would miss DTs created by earlier steps.

The per-step defer opts.Client.Delete(ctx, deploymentTemplate) is left in place as a belt-and-suspenders safeguard for tests that fail mid-loop.
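
The union-then-teardown ordering described above can be sketched roughly as follows. This is a minimal illustration, not the code in flux_test.go: the templateKey and step types and the unionTemplates helper are hypothetical names, and the actual delete-and-wait calls are represented only by print statements.

```go
package main

import "fmt"

// templateKey identifies a DeploymentTemplate by namespace and name.
// Hypothetical type for illustration; not taken from the PR.
type templateKey struct {
	Namespace string
	Name      string
}

// step is a hypothetical stand-in for one test step and the
// DeploymentTemplates that exist while it runs.
type step struct {
	Templates []templateKey
}

// unionTemplates returns every template seen in any step, in first-seen
// order, with duplicates removed. Iterating only the last step would miss
// templates that a later step intentionally removed.
func unionTemplates(steps []step) []templateKey {
	seen := make(map[templateKey]struct{})
	var out []templateKey
	for _, s := range steps {
		for _, k := range s.Templates {
			if _, ok := seen[k]; ok {
				continue
			}
			seen[k] = struct{}{}
			out = append(out, k)
		}
	}
	return out
}

func main() {
	steps := []step{
		{Templates: []templateKey{{"flux", "a"}, {"flux", "b"}}},
		{Templates: []templateKey{{"flux", "a"}}}, // "b" is dropped in a later step
	}
	// Radius-side teardown comes first, one template at a time.
	for _, k := range unionTemplates(steps) {
		fmt.Printf("delete and wait: %s/%s\n", k.Namespace, k.Name)
	}
	// Only then are the namespaces deleted.
	fmt.Println("delete namespaces")
}
```

Keying the set on namespace plus name keeps the teardown list stable even when the same template appears in several steps.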

2. Make DeploymentTemplate.Status.Phrase = Ready honest

Today, DeploymentTemplateReconciler.reconcileOperation issues r.Client.Delete() on residual DRs from a removed output and then immediately stamps Phrase = Ready. The K8s Delete only sets a DeletionTimestamp; the underlying UCP delete cascade is still in flight. This causes user-visible inconsistency on every rename/remove cycle (status reports success while orphan resources still exist) and feeds the test race above.

This PR adds a new transitional phrase:

DeploymentTemplatePhraseReadyPendingCleanup DeploymentTemplatePhrase = "ReadyPendingCleanup"

After the diff-delete loop, the reconciler now lists owned DRs and compares their Spec.Id against the new outputResources set. If any DR's ID is not in the new set, the DT holds ReadyPendingCleanup and requeues. On the next reconcile (woken by the existing Owns(&DeploymentResource{}) watch when an owned DR is finally GC'd), reconcileUpdate re-evaluates and flips to Ready only once the residuals are gone.

Backwards-compatible: anything reading Phrase and treating non-Ready as "not done yet" continues to work, and works better than before, because Ready no longer lies.
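
The residual check described above amounts to a set-membership test. The sketch below assumes a trimmed-down deploymentResource type and a phraseFor helper, both hypothetical; the real reconciler works on Kubernetes objects and may normalize IDs before comparing them, which this sketch does not do.

```go
package main

import "fmt"

// deploymentResource is a hypothetical, trimmed-down stand-in for the
// DeploymentResource custom resource; only the ID matters here.
type deploymentResource struct {
	Name string
	ID   string // mirrors Spec.Id
}

// phraseFor returns "ReadyPendingCleanup" while any owned resource has an
// ID that is absent from the new output set, and "Ready" otherwise.
func phraseFor(owned []deploymentResource, outputIDs []string) string {
	desired := make(map[string]struct{}, len(outputIDs))
	for _, id := range outputIDs {
		desired[id] = struct{}{}
	}
	for _, dr := range owned {
		if _, ok := desired[dr.ID]; !ok {
			return "ReadyPendingCleanup" // a residual is still draining
		}
	}
	return "Ready"
}

func main() {
	// Placeholder IDs; real resource IDs look different.
	owned := []deploymentResource{
		{Name: "dr-a", ID: "id-a"},
		{Name: "dr-b", ID: "id-b"},
	}
	// Output "id-b" was removed by the latest deployment, but its
	// resource has not been garbage collected yet.
	fmt.Println(phraseFor(owned, []string{"id-a"}))
	// Once the residual is gone, the template can report Ready.
	fmt.Println(phraseFor(owned[:1], []string{"id-a"}))
}
```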

3. Cap DR delete-retry backoff with explicit RequeueAfter

DeploymentResourceReconciler previously returned UCP delete-path errors directly to controller-runtime, which applied its default rate limiter (exponential, climbing to ~16 minutes). A couple of early dependency errors during a delete cascade could burn 5+ minutes waiting for the next retry, even though that retry would have succeeded immediately.

This PR replaces all three delete-path return ctrl.Result{}, err sites with return ctrl.Result{RequeueAfter: r.deleteRetryDelay()}, nil. The default is 30s, overridable in tests via a new DeleteRetryInterval reconciler field (parallel to the existing DelayInterval/PollingDelay/requeueDelay() pattern).
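
A minimal sketch of the bounded-retry pattern follows. The result type is a local stand-in for ctrl.Result so the example runs without controller-runtime, and handleDeleteError, defaultDeleteRetryDelay, and errDependency are hypothetical names; only DeleteRetryInterval, deleteRetryDelay, and the 30s default come from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// result is a local stand-in for controller-runtime's ctrl.Result.
type result struct {
	RequeueAfter time.Duration
}

// defaultDeleteRetryDelay matches the 30s default named in the description.
const defaultDeleteRetryDelay = 30 * time.Second

// errDependency is a hypothetical delete-path failure used by the demo.
var errDependency = errors.New("dependent resource still exists")

type reconciler struct {
	// DeleteRetryInterval overrides the default when non-zero (used by tests).
	DeleteRetryInterval time.Duration
}

// deleteRetryDelay returns the override if set, else the default.
func (r *reconciler) deleteRetryDelay() time.Duration {
	if r.DeleteRetryInterval > 0 {
		return r.DeleteRetryInterval
	}
	return defaultDeleteRetryDelay
}

// handleDeleteError turns a delete-path error into a fixed-delay requeue.
// The nil error is deliberate: a non-nil error would hand the retry timing
// back to the controller's exponential rate limiter.
func (r *reconciler) handleDeleteError(err error) (result, error) {
	if err == nil {
		return result{}, nil
	}
	// A real reconciler would log err here before requeueing.
	return result{RequeueAfter: r.deleteRetryDelay()}, nil
}

func main() {
	prod := &reconciler{}
	res, err := prod.handleDeleteError(errDependency)
	fmt.Println(res.RequeueAfter, err)

	fast := &reconciler{DeleteRetryInterval: 50 * time.Millisecond}
	res, err = fast.handleDeleteError(errDependency)
	fmt.Println(res.RequeueAfter, err)
}
```

The trade-off is that the error no longer reaches controller-runtime's own error handling, so logging it before the requeue matters more than it did previously.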

Type of change

  • This pull request fixes a bug in Radius and has an approved issue (issue link required).

Fixes: #11874

Contributor checklist

Please verify that the PR meets the following requirements, where applicable:

  • An overview of proposed schema changes is included in a linked GitHub issue.
    • Not applicable
  • A design document is added or updated under eng/design-notes/ in this repository, if new APIs are being introduced.
    • Not applicable
  • The design document has been reviewed and approved by Radius maintainers/approvers.
    • Not applicable
  • A PR for resource-types-contrib is created, if resource types or recipes are affected by the changes in this PR.
    • Not applicable
  • A PR for dashboard is created, if the Radius Dashboard is affected by the changes in this PR.
    • Not applicable
  • A PR for the documentation repository is created, if the changes in this PR affect the documentation or any user facing updates are made.
    • Not applicable

Verification

  • New envtest: Test_DeploymentTemplateReconciler_ReadyPendingCleanup (pkg/controller/reconciler) — exercises the new phrase transition.
  • New envtest: Test_DeploymentResourceReconciler_DeleteRetryBackoff — verifies a failed UCP delete triggers a fresh retry within the bounded interval rather than controller-runtime's exponential rate-limiter.
  • Full pkg/controller/... envtest suite passes locally (~30s).
  • Functional verification (recommended before merge): make test-functional-kubernetes-noncloud GOTEST_OPTS='-run Test_Flux_Complex -v -count=5'.

Out of scope

Three layered changes to make Test_Flux_Complex deterministic and to fix
the user-observable races it exposes:

* testFluxIntegration: tear down each DeploymentTemplate (and wait for full
  drainage of its owned DeploymentResources) before deleting the K8s
  namespaces, so the namespace delete no longer races parallel Radius delete
  cascades.

* DeploymentTemplateReconciler: introduce a new ReadyPendingCleanup phrase
  and gate the transition to Ready on full drainage of DeploymentResources
  removed by the latest diff, so observers (Flux, the test harness, custom
  operators) do not act on a premature 'Ready' signal.

* DeploymentResourceReconciler: bound delete-path retries with an explicit
  RequeueAfter (default 30s, test-overridable via DeleteRetryInterval)
  instead of returning the error to controller-runtime, whose default
  exponential rate-limiter climbs to ~16 minutes and blows past practical
  drain budgets.

Fixes: radius-project#11874
Signed-off-by: willdavsmith <willdavsmith@gmail.com>

codecov Bot commented May 28, 2026

Codecov Report

❌ Patch coverage is 73.33333% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.97%. Comparing base (fc4f38b) to head (498c8e3).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...roller/reconciler/deploymenttemplate_reconciler.go | 77.27% | 7 Missing and 8 partials ⚠️ |
| ...roller/reconciler/deploymentresource_reconciler.go | 44.44% | 5 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #12013      +/-   ##
==========================================
+ Coverage   51.90%   51.97%   +0.06%     
==========================================
  Files         732      732              
  Lines       46272    46305      +33     
==========================================
+ Hits        24016    24065      +49     
+ Misses      19957    19945      -12     
+ Partials     2299     2295       -4     


@willdavsmith willdavsmith marked this pull request as ready for review May 29, 2026 17:48
Copilot AI review requested due to automatic review settings May 29, 2026 17:48
@willdavsmith willdavsmith requested review from a team as code owners May 29, 2026 17:48
Contributor

Copilot AI left a comment

Pull request overview

Fixes intermittent Test_Flux_Complex teardown timeouts by removing races between Kubernetes namespace deletion and in-flight Radius delete cascades, and by making controller status/backoff behavior more accurate and bounded during delete/cleanup paths.

Changes:

  • Sequence functional-test teardown to delete DeploymentTemplates (and wait for their cascades to drain) before deleting Kubernetes namespaces.
  • Add DeploymentTemplatePhraseReadyPendingCleanup and keep Status.Phrase from reporting Ready until residual DeploymentResource deletions are fully drained.
  • Replace controller-runtime exponential backoff on DeploymentResource delete-path errors with a bounded RequeueAfter via a configurable DeleteRetryInterval.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| test/functional-portable/kubernetes/noncloud/flux_test.go | Collect DTs across steps and delete/wait them before namespace deletion to avoid teardown races. |
| test/functional-portable/kubernetes/noncloud/deploymenttemplate_test.go | Adds deleteDeploymentTemplateAndWait helper used to sequence Radius-side teardown. |
| pkg/controller/reconciler/deploymenttemplate_reconciler.go | Introduces ReadyPendingCleanup logic and residual DR detection before setting DT to Ready. |
| pkg/controller/reconciler/deploymenttemplate_reconciler_test.go | Adds envtest coverage for Ready → ReadyPendingCleanup → Ready transition. |
| pkg/controller/reconciler/deploymentresource_reconciler.go | Adds bounded delete retry (DeleteRetryInterval + RequeueAfter) to avoid exponential rate limiting. |
| pkg/controller/reconciler/deploymentresource_reconciler_test.go | Adds envtest ensuring failed delete retries occur within the bounded interval. |
| pkg/controller/reconciler/const.go | Adds DeleteRetryDelay default for bounded delete retries. |
| pkg/controller/api/radapp.io/v1alpha3/deploymenttemplate_types.go | Adds the new ReadyPendingCleanup status phrase constant. |

Comment thread test/functional-portable/kubernetes/noncloud/deploymenttemplate_test.go Outdated
Comment thread pkg/controller/reconciler/deploymenttemplate_reconciler.go
Member

@DariuszPorowski DariuszPorowski left a comment

LGTM, one non-blocking comment in-line

Comment thread pkg/controller/reconciler/deploymentresource_reconciler.go
* Use require.NoErrorf / require.Eventuallyf in deleteDeploymentTemplateAndWait
  so the %s/%s placeholders are formatted in failure messages (the variadic
  msgAndArgs form of NoError / Eventually does not run through fmt.Sprintf).

* Tighten isOwnedBy to also match OwnerReference.UID. Without this, a
  DeploymentTemplate that is deleted and recreated with the same name would
  see its predecessor's still-draining DeploymentResources as residuals,
  holding the new DT in ReadyPendingCleanup indefinitely.

Signed-off-by: willdavsmith <willdavsmith@gmail.com>
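
The UID tightening in the second bullet can be illustrated with a small sketch. The ownerReference type here is a local stand-in for the Kubernetes metav1.OwnerReference fields that matter, and the isOwnedBy signature is an assumption; the helper in the PR may take different arguments.

```go
package main

import "fmt"

// ownerReference is a local stand-in for the relevant fields of a
// Kubernetes owner reference.
type ownerReference struct {
	Kind string
	Name string
	UID  string
}

// isOwnedBy reports whether refs contains an owner with the given kind,
// name, and UID. Matching on UID distinguishes a recreated owner from its
// deleted predecessor of the same name.
func isOwnedBy(refs []ownerReference, kind, name, uid string) bool {
	for _, ref := range refs {
		if ref.Kind == kind && ref.Name == name && ref.UID == uid {
			return true
		}
	}
	return false
}

func main() {
	// A resource still draining from a template that was deleted and
	// then recreated under the same name. UIDs are placeholders.
	stale := []ownerReference{{Kind: "DeploymentTemplate", Name: "app", UID: "uid-old"}}

	// The recreated template has a new UID, so the old resource is not
	// counted among its residuals.
	fmt.Println(isOwnedBy(stale, "DeploymentTemplate", "app", "uid-new"))
	// The original owner would still match.
	fmt.Println(isOwnedBy(stale, "DeploymentTemplate", "app", "uid-old"))
}
```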
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Comment thread pkg/controller/reconciler/deploymenttemplate_reconciler.go
* DeploymentTemplateReconciler.reconcileOperation: run the diff-delete loop
  even when the deployment response has nil OutputResources. Treating a nil
  response as an empty set ensures that a redeploy which produces no outputs
  still tears down the previously-tracked DRs; without this, the new
  hasResidualDeploymentResources check would flag the prior DRs as residuals
  forever and stick the DT in ReadyPendingCleanup.

* Add Test_DeploymentTemplateReconciler_NilOutputResourcesTriggersCleanup
  to lock in the behavior above.

* Fix grammar in deleteDeploymentTemplateAndWait godoc (plural subject).

Signed-off-by: willdavsmith <willdavsmith@gmail.com>
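
The nil-as-empty behavior from the first bullet can be sketched as a plain diff over IDs. The staleIDs helper and the string-slice inputs are hypothetical simplifications; the point is only that a nil output list must not short-circuit the diff.

```go
package main

import "fmt"

// staleIDs returns the tracked IDs that are absent from the new outputs.
// A nil outputs slice is treated as an empty set, so every tracked ID is
// reported as stale instead of the diff being skipped.
func staleIDs(tracked, outputs []string) []string {
	keep := make(map[string]struct{}, len(outputs))
	for _, id := range outputs { // ranging over a nil slice is a no-op
		keep[id] = struct{}{}
	}
	var stale []string
	for _, id := range tracked {
		if _, ok := keep[id]; !ok {
			stale = append(stale, id)
		}
	}
	return stale
}

func main() {
	tracked := []string{"id-a", "id-b"} // placeholder IDs

	// A normal redeploy that drops one output.
	fmt.Println(staleIDs(tracked, []string{"id-a"}))
	// A redeploy that produces no outputs at all: everything is stale.
	fmt.Println(staleIDs(tracked, nil))
}
```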
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Capture the polling error and surface it via require.NoErrorf outside the Eventually loop, so a real API/RBAC/connectivity failure during teardown is reported immediately rather than masquerading as a 10-minute timeout.

Signed-off-by: willdavsmith <willdavsmith@gmail.com>
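
One way to picture the capture-and-surface pattern from this commit is the sketch below. It avoids testify so that it runs standalone: eventually is a local stand-in for require.Eventually, and waitForDeletion, errNotFound, and errForbidden are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors for the demo.
var (
	errNotFound  = errors.New("not found")
	errForbidden = errors.New("forbidden")
)

// eventually polls cond every tick until it returns true or the timeout
// elapses. Local stand-in for testify's require.Eventually.
func eventually(cond func() bool, timeout, tick time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(tick)
	}
}

// waitForDeletion polls get until it reports errNotFound. Any other error
// is captured, ends the polling at once, and is returned to the caller.
func waitForDeletion(get func() error, timeout, tick time.Duration) error {
	var pollErr error
	done := eventually(func() bool {
		err := get()
		switch {
		case errors.Is(err, errNotFound):
			return true // the object is gone
		case err != nil:
			pollErr = err // unexpected failure: stop polling now
			return true
		default:
			return false // still exists, keep waiting
		}
	}, timeout, tick)
	if pollErr != nil {
		return fmt.Errorf("polling failed: %w", pollErr)
	}
	if !done {
		return errors.New("timed out waiting for deletion")
	}
	return nil
}

func main() {
	calls := 0
	err := waitForDeletion(func() error {
		calls++
		if calls < 3 {
			return nil
		}
		return errNotFound
	}, time.Second, 10*time.Millisecond)
	fmt.Println("deleted:", err)

	err = waitForDeletion(func() error { return errForbidden }, time.Second, 10*time.Millisecond)
	fmt.Println("failed fast:", err)
}
```

In a test, the returned error would be checked outside the polling loop, which is where the commit places its require.NoErrorf call.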
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comment thread pkg/controller/reconciler/deploymenttemplate_reconciler.go
A DR removed by the latest deployment diff may already be gone (GC-ed or manually deleted); returning the NotFound error from r.Client.Delete would fail the reconcile and reintroduce controller-runtime backoff during cleanup. Wrap with client.IgnoreNotFound so the diff-delete loop is idempotent.

Signed-off-by: willdavsmith <willdavsmith@gmail.com>
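
The idempotent delete loop can be sketched without a cluster as shown below. Here ignoreNotFound imitates the shape of controller-runtime's client.IgnoreNotFound using a local sentinel error, and the store type and deleteAll helper are hypothetical stand-ins for the API client and the diff-delete loop.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a local sentinel standing in for a Kubernetes NotFound
// API error.
var errNotFound = errors.New("not found")

// ignoreNotFound returns nil for not-found errors and passes every other
// error through unchanged.
func ignoreNotFound(err error) error {
	if errors.Is(err, errNotFound) {
		return nil
	}
	return err
}

// store is a hypothetical in-memory stand-in for the API server.
type store map[string]struct{}

// remove deletes name, returning a wrapped errNotFound if it is absent.
func (s store) remove(name string) error {
	if _, ok := s[name]; !ok {
		return fmt.Errorf("delete %q: %w", name, errNotFound)
	}
	delete(s, name)
	return nil
}

// deleteAll removes every named entry. Entries that are already gone are
// skipped, so running it twice has the same effect as running it once.
func deleteAll(s store, names []string) error {
	for _, name := range names {
		if err := ignoreNotFound(s.remove(name)); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	s := store{"dr-a": {}}
	// "dr-b" was already garbage collected or deleted by hand.
	fmt.Println(deleteAll(s, []string{"dr-a", "dr-b"}), len(s))
	// A second pass finds nothing left and still succeeds.
	fmt.Println(deleteAll(s, []string{"dr-a", "dr-b"}), len(s))
}
```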

radius-functional-tests Bot commented May 29, 2026

Radius functional test overview

| Name | Value |
| --- | --- |
| Repository | willdavsmith/radius |
| Commit ref | 498c8e3 |
| Unique ID | funcbf8b73bbd2 |
| Image tag | pr-funcbf8b73bbd2 |
  • gotestsum 1.13.0
  • KinD: v0.29.0
  • Dapr: 1.14.4
  • Azure KeyVault CSI driver: 1.4.2
  • Azure Workload identity webhook: 1.3.0
  • Bicep recipe location ghcr.io/radius-project/dev/test/testrecipes/test-bicep-recipes/<name>:pr-funcbf8b73bbd2
  • Terraform recipe location http://tf-module-server.radius-test-tf-module-server.svc.cluster.local/<name>.zip (in cluster)
  • applications-rp test image location: ghcr.io/radius-project/dev/applications-rp:pr-funcbf8b73bbd2
  • dynamic-rp test image location: ghcr.io/radius-project/dev/dynamic-rp:pr-funcbf8b73bbd2
  • controller test image location: ghcr.io/radius-project/dev/controller:pr-funcbf8b73bbd2
  • ucp test image location: ghcr.io/radius-project/dev/ucpd:pr-funcbf8b73bbd2
  • deployment-engine test image location: ghcr.io/radius-project/deployment-engine:latest

Test Status

⌛ Building Radius and pushing container images for functional tests...
✅ Container images build succeeded
⌛ Publishing Bicep Recipes for functional tests...
✅ Recipe publishing succeeded
⌛ Starting corerp-cloud functional tests...
⌛ Starting ucp-cloud functional tests...
✅ corerp-cloud functional tests succeeded
✅ ucp-cloud functional tests succeeded

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

@willdavsmith willdavsmith enabled auto-merge (squash) May 29, 2026 20:33
@willdavsmith willdavsmith merged commit 52d5ac5 into radius-project:main May 29, 2026
56 checks passed
Development

Successfully merging this pull request may close these issues.

Functional test failure: stuck namespace deletion in Test_Flux_Complex

3 participants