rvs: Fix race of ApprovedImage adoption with TEC object deletion#302
rvs: Fix race of ApprovedImage adoption with TEC object deletion#302iroykaufman wants to merge 1 commit into
Conversation
Reviewer's guide (collapsed on small PRs)Reviewer's GuideMoves ApprovedImage ownership adoption into the ApprovedImage controller reconcile path (before finalizer registration) to avoid races with TrustedExecutionCluster deletion, and removes the now-redundant bulk adoption logic and its tests, updating controller tests accordingly. Sequence diagram for ApprovedImage adoption on first reconcilesequenceDiagram
actor User
participant KubeAPI
participant ApprovedImageController as ApprovedImageController_image_reconcile
participant TEC as TrustedExecutionCluster
participant GC as KubernetesGC
User->>KubeAPI: create ApprovedImage
Note over KubeAPI,ApprovedImageController: Reconcile event for ApprovedImage
KubeAPI-->>ApprovedImageController: image_reconcile(image)
ApprovedImageController->>KubeAPI: get TrustedExecutionCluster
KubeAPI-->>ApprovedImageController: TrustedExecutionCluster (cluster)
alt cluster exists and not owner
ApprovedImageController->>KubeAPI: adopt_approved_image(name, cluster)
KubeAPI-->>ApprovedImageController: ApprovedImage with ownerReferences
end
ApprovedImageController->>KubeAPI: finalizer(APPROVED_IMAGE_FINALIZER, image, handler)
KubeAPI-->>ApprovedImageController: ApprovedImage with finalizer
User->>KubeAPI: delete TrustedExecutionCluster
KubeAPI-->>GC: TEC deleted, cascade delete owned ApprovedImage
GC-->>KubeAPI: delete ApprovedImage
KubeAPI-->>ApprovedImageController: image_reconcile(delete event)
ApprovedImageController-->>KubeAPI: finalize and remove APPROVED_IMAGE_FINALIZER
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- In
image_reconcile, consider skipping adoption when theTrustedExecutionClusterhas adeletion_timestampset to avoid newly attaching an owner reference to a TEC that is already being deleted. - The ownership-check closures in
image_reconcile(uid_owns/cluster_owns) could be extracted into a small helper function to make this logic more readable and reusable if other controllers need to make the same check.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `image_reconcile`, consider skipping adoption when the `TrustedExecutionCluster` has a `deletion_timestamp` set to avoid newly attaching an owner reference to a TEC that is already being deleted.
- The ownership-check closures in `image_reconcile` (`uid_owns`/`cluster_owns`) could be extracted into a small helper function to make this logic more readable and reusable if other controllers need to make the same check.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| let kube_client = Arc::<Client>::unwrap_or_clone(client); | ||
| let err = "ApprovedImage had no name"; | ||
| let name = image.metadata.name.clone().context(err)?; | ||
| let cluster = get_opt_trusted_execution_cluster(kube_client.clone()) |
There was a problem hiding this comment.
I know I approved this in #273 and I'm sorry that I'm only complaining now, but I did wonder about something now that we have space to discuss it, so thank you for raising separately. Thus:
Is this really an improvement? When image creation and cluster deletion fall close to one another, this cluster might also be None, to the same effect.
Or did this cause a test failure? If it did, I wonder if that test should lower its expectations on how quick an adoption happens. test_approved_image_readoption has been doing that since 6664071 (which was only merged a few days ago -- did this patch make a test pass more consistently since that?)
There was a problem hiding this comment.
If the cluster has been deleted before adopt image was called, then yes, we will have the same problem. But we can add a condition where, if the cluster has been deleted and the image has not adopted yet, we will delete the image.
Before this patch, the test failed on test_approved_image_readoption. I don't remember if I used 6664071 but waiting for a bit longer probably can solve it but in my opinion it's not an ideal solution. WDYT
There was a problem hiding this comment.
If the cluster has been deleted before adopt image was called, then yes, we will have the same problem. But we can add a condition where, if the cluster has been deleted and the image has not adopted yet, we will delete the image. Before this patch, the test failed on
test_approved_image_readoption.
hmm the intention is that you may have ApprovedImages before a TrustedExecutionCluster exists, but with your suggestion (IIUIC) they'd always be removed, even if a TrustedExecutionCluster had never existed
I don't remember if I used 6664071 but waiting for a bit longer probably can solve it but in my opinion it's not an ideal solution. WDYT
FWIW that commit doesn't wait a fixed time in seconds, but awaits the ownership
|
Thanks for raising the PR, I like the fact that we are now adopting the image as part of its reconciliation loop. @Jakob-Naucke I know I introduce this, but probably we have the same pattern also on other object we should check it. |
I realize the tests passed on this one, but logically, should the image even be adopted as part of its reconciliation when a cluster is created? What if the cluster is created later? |
Ah I see, you mean it should be part of the reconciliation loop of TEC right? I guess it should be on both depending who is created first |
Yes, and I didn't clarify this in my earlier answer, but this is already what we do today: Line 140 in 103fa3f operator/operator/src/reference_values.rs Line 290 in 103fa3f |
We probably need both. Roy's fix looks correct to me but we need to leave the adoption of the approved images also in the TEC reconcile loop |
This make sense. I'm on it. |
The race scenario: - Assume the operator is installed and TEC object exists - Apply an approved-image yaml file (create a new CR instance) - Delete the TEC object (immediately) - The new approved-image is not deleted with the TEC object This happens because kube-rs finalizer() does not call the Apply handler on the first reconcile — it only adds the finalizer and returns. This means adopt_approved_image inside image_add_reconcile never runs on the first reconcile. If the TEC is deleted before the second reconcile, the owner reference is never set and GC (garbage collector) cannot cascade-delete the ApprovedImage. Move adoption before the finalizer() call in image_reconcile so it runs on the very first reconcile. For more information see https://docs.rs/kube/latest/kube/runtime/finalizer/fn.finalizer.html in sections Guarantees, Assumptions (Third point) and Expected Flow (from 1 to 6). Signed-off-by: Roy Kaufman <rkaufman@redhat.com>
Jakob-Naucke
left a comment
There was a problem hiding this comment.
Fine. I still think 6664071 fixed the test failure that this patch tries to improve, and that the race is still possible after this patch, but can see how it might be less likely.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: alicefr, iroykaufman, Jakob-Naucke The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
was added in error, thinking that the `labeled` event could not apply to `pull_request` when it can. This lead to failures because checkout is not enabled by default on `pull_request_target` for security reasons, as happened in trusted-execution-clusters#302. Signed-off-by: Jakob Naucke <jnaucke@redhat.com>
The race scenario:
This happens because kube-rs finalizer() does not call the Apply handler on the first reconcile — it only adds the finalizer and returns. This means adopt_approved_image inside image_add_reconcile never runs on the first reconcile. If the TEC is deleted before the second reconcile, the owner reference is never set and GC (garbage collector) cannot cascade-delete the ApprovedImage.
Move adoption before the finalizer() call in image_reconcile so it runs on the very first reconcile.
For more information see https://docs.rs/kube/latest/kube/runtime/finalizer/fn.finalizer.html in sections Guarantees, Assumptions (Third point) and Expected Flow(from 1 to 6).
Summary by Sourcery
Ensure ApprovedImage resources are adopted by the owning TrustedExecutionCluster on first reconcile to avoid orphaned images during TEC deletion, and remove the now-unnecessary bulk adoption logic from the main reconcile path.
Bug Fixes:
Enhancements: