Fix Test_Flux_Complex namespace teardown flake#12013
Conversation
Three layered changes to make Test_Flux_Complex deterministic and to fix the user-observable races it exposes: * testFluxIntegration: tear down each DeploymentTemplate (and wait for full drainage of its owned DeploymentResources) before deleting the K8s namespaces, so the namespace delete no longer races parallel Radius delete cascades. * DeploymentTemplateReconciler: introduce a new ReadyPendingCleanup phrase and gate the transition to Ready on full drainage of DeploymentResources removed by the latest diff, so observers (Flux, the test harness, custom operators) do not act on a premature 'Ready' signal. * DeploymentResourceReconciler: bound delete-path retries with an explicit RequeueAfter (default 30s, test-overridable via DeleteRetryInterval) instead of returning the error to controller-runtime, whose default exponential rate-limiter climbs to ~16 minutes and blows past practical drain budgets. Fixes: radius-project#11874 Signed-off-by: willdavsmith <willdavsmith@gmail.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #12013 +/- ##
==========================================
+ Coverage 51.90% 51.97% +0.06%
==========================================
Files 732 732
Lines 46272 46305 +33
==========================================
+ Hits 24016 24065 +49
+ Misses 19957 19945 -12
+ Partials 2299 2295 -4 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Fixes intermittent Test_Flux_Complex teardown timeouts by removing races between Kubernetes namespace deletion and in-flight Radius delete cascades, and by making controller status/backoff behavior more accurate and bounded during delete/cleanup paths.
Changes:
- Sequence functional-test teardown to delete
DeploymentTemplates (and wait for their cascades to drain) before deleting Kubernetes namespaces. - Add
DeploymentTemplatePhraseReadyPendingCleanupand keepStatus.Phrasefrom reportingReadyuntil residualDeploymentResourcedeletions are fully drained. - Replace controller-runtime exponential backoff on
DeploymentResourcedelete-path errors with a boundedRequeueAftervia a configurableDeleteRetryInterval.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| test/functional-portable/kubernetes/noncloud/flux_test.go | Collect DTs across steps and delete/wait them before namespace deletion to avoid teardown races. |
| test/functional-portable/kubernetes/noncloud/deploymenttemplate_test.go | Adds deleteDeploymentTemplateAndWait helper used to sequence Radius-side teardown. |
| pkg/controller/reconciler/deploymenttemplate_reconciler.go | Introduces ReadyPendingCleanup logic and residual DR detection before setting DT to Ready. |
| pkg/controller/reconciler/deploymenttemplate_reconciler_test.go | Adds envtest coverage for Ready → ReadyPendingCleanup → Ready transition. |
| pkg/controller/reconciler/deploymentresource_reconciler.go | Adds bounded delete retry (DeleteRetryInterval + RequeueAfter) to avoid exponential rate limiting. |
| pkg/controller/reconciler/deploymentresource_reconciler_test.go | Adds envtest ensuring failed delete retries occur within the bounded interval. |
| pkg/controller/reconciler/const.go | Adds DeleteRetryDelay default for bounded delete retries. |
| pkg/controller/api/radapp.io/v1alpha3/deploymenttemplate_types.go | Adds the new ReadyPendingCleanup status phrase constant. |
DariuszPorowski
left a comment
There was a problem hiding this comment.
LGTM, one non-blocking comment in-line
* Use require.NoErrorf / require.Eventuallyf in deleteDeploymentTemplateAndWait so the %s/%s placeholders are formatted in failure messages (the variadic msgAndArgs form of NoError / Eventually does not run through fmt.Sprintf). * Tighten isOwnedBy to also match OwnerReference.UID. Without this, a DeploymentTemplate that is deleted and recreated with the same name would see its predecessor's still-draining DeploymentResources as residuals, holding the new DT in ReadyPendingCleanup indefinitely. Signed-off-by: willdavsmith <willdavsmith@gmail.com>
* DeploymentTemplateReconciler.reconcileOperation: run the diff-delete loop even when the deployment response has nil OutputResources. Treating a nil response as an empty set ensures that a redeploy which produces no outputs still tears down the previously-tracked DRs; without this, the new hasResidualDeploymentResources check would flag the prior DRs as residuals forever and stick the DT in ReadyPendingCleanup. * Add Test_DeploymentTemplateReconciler_NilOutputResourcesTriggersCleanup to lock in the behavior above. * Fix grammar in deleteDeploymentTemplateAndWait godoc (plural subject). Signed-off-by: willdavsmith <willdavsmith@gmail.com>
Capture the polling error and surface it via require.NoErrorf outside the Eventually loop, so a real API/RBAC/connectivity failure during teardown is reported immediately rather than masquerading as a 10-minute timeout. Signed-off-by: willdavsmith <willdavsmith@gmail.com>
A DR removed by the latest deployment diff may already be gone (GC-ed or manually deleted); returning the NotFound error from r.Client.Delete would fail the reconcile and reintroduce controller-runtime backoff during cleanup. Wrap with client.IgnoreNotFound so the diff-delete loop is idempotent. Signed-off-by: willdavsmith <willdavsmith@gmail.com>
Radius functional test overviewClick here to see the test run details
Test Status⌛ Building Radius and pushing container images for functional tests... |
Description
Test_Flux_Complex(and the other tests sharingtestFluxIntegration) intermittently times out for 10 minutes in CI on the final namespace-delete wait. The root cause is three async Radius delete cascades racing against each other — and against the K8s namespace delete the test fires the instant step 3 returns — while theDeploymentTemplatereconciler publishesStatus.Phrase = Readybefore residualDeploymentResourcedeletions have actually drained. controller-runtime's default exponential rate-limiter then turns a couple of early dependency errors into multi-minute backoffs that blow past any practical test budget.This PR ships three layered fixes:
1. Sequential test teardown (
testFluxIntegration)testFluxIntegrationnow tears down eachDeploymentTemplate(and waits for full drainage of its ownedDeploymentResources, finalizers, and underlying Radius resources) before deleting the K8s namespaces. The namespace delete no longer competes with in-flight Radius cascades, so the previously-flakyEventuallybecomes a trivial wait for K8s GC.We union DTs across all steps when iterating, because some tests (e.g.
Test_Flux_Complexstep 3) intentionally remove DTs in later steps and iterating only the last step would miss DTs created by earlier steps.The per-step
defer opts.Client.Delete(ctx, deploymentTemplate)is left in place as a belt-and-suspenders for tests that fail mid-loop.2. Make
DeploymentTemplate.Status.Phrase = ReadyhonestToday,
DeploymentTemplateReconciler.reconcileOperationissuesr.Client.Delete()on residual DRs from a removed output and then immediately stampsPhrase = Ready. The K8sDeleteonly sets aDeletionTimestamp; the underlying UCP delete cascade is still in flight. This causes user-visible inconsistency on every rename/remove cycle (status reports success while orphan resources still exist) and feeds the test race above.This PR adds a new transitional phrase:
After the diff-delete loop, the reconciler now lists owned DRs and compares their
Spec.Idagainst the newoutputResourcesset. If any DR's ID is not in the new set, the DT holdsReadyPendingCleanupand requeues. On the next reconcile (woken by the existingOwns(&DeploymentResource{})watch when an owned DR is finally GC'd),reconcileUpdatere-evaluates and flips toReadyonly once the residuals are gone.Backwards-compatible: anything reading
Phraseand treating non-Readyas "not done yet" continues to work — better, in fact, becauseReadyno longer lies.3. Cap DR delete-retry backoff with explicit
RequeueAfterDeploymentResourceReconcilerpreviously returned UCP delete-path errors directly to controller-runtime, which applied its default rate-limiter (exponential, climbing to ~16 minutes). A couple of early dependency errors during a delete cascade could burn 5+ minutes before another retry — retries that would have succeeded immediately.This PR replaces all three delete-path
return ctrl.Result{}, errsites withreturn ctrl.Result{RequeueAfter: r.deleteRetryDelay()}, nil. The default is 30s, overridable in tests via a newDeleteRetryIntervalreconciler field (parallel to the existingDelayInterval/PollingDelay/requeueDelay()pattern).Type of change
Fixes: #11874
Contributor checklist
Please verify that the PR meets the following requirements, where applicable:
eng/design-notes/in this repository, if new APIs are being introduced.Verification
Test_DeploymentTemplateReconciler_ReadyPendingCleanup(pkg/controller/reconciler) — exercises the new phrase transition.Test_DeploymentResourceReconciler_DeleteRetryBackoff— verifies a failed UCP delete triggers a fresh retry within the bounded interval rather than controller-runtime's exponential rate-limiter.pkg/controller/...envtest suite passes locally (~30s).make test-functional-kubernetes-noncloud GOTEST_OPTS='-run Test_Flux_Complex -v -count=5'.Out of scope
applications-rp) — the architecturally correct long-term fix. Change Create a single integration test for environment setup #2 above substantially reduces the window in which it matters; Create a single integration test for application deployment #3 makes the remaining failures cheap; but the underlying same-owner-only dependency check incheckForDeploymentResourceDependenciesis still load-bearing until Delete sub-resources when application is deleted via API #8164 lands.