Skip to content

rvs: Fix race of ApprovedImage adoption with TEC object deletion#302

Open
iroykaufman wants to merge 1 commit into
trusted-execution-clusters:mainfrom
iroykaufman:rvs-race
Open

rvs: Fix race of ApprovedImage adoption with TEC object deletion#302
iroykaufman wants to merge 1 commit into
trusted-execution-clusters:mainfrom
iroykaufman:rvs-race

Conversation

@iroykaufman

@iroykaufman iroykaufman commented Jul 1, 2026

Copy link
Copy Markdown
Member

The race scenario:

  • Assume the operator is installed and TEC object exists
  • Apply an approved-image yaml file (create a new CR instance)
  • Delete the TEC object (immediately)
  • The new approved-image is not deleted with the TEC object

This happens because kube-rs finalizer() does not call the Apply handler on the first reconcile — it only adds the finalizer and returns. This means adopt_approved_image inside image_add_reconcile never runs on the first reconcile. If the TEC is deleted before the second reconcile, the owner reference is never set and GC (garbage collector) cannot cascade-delete the ApprovedImage.

Move adoption before the finalizer() call in image_reconcile so it runs on the very first reconcile.

For more information see https://docs.rs/kube/latest/kube/runtime/finalizer/fn.finalizer.html in sections Guarantees, Assumptions (Third point) and Expected Flow(from 1 to 6).

Summary by Sourcery

Ensure ApprovedImage resources are adopted by the owning TrustedExecutionCluster on first reconcile to avoid orphaned images during TEC deletion, and remove the now-unnecessary bulk adoption logic from the main reconcile path.

Bug Fixes:

  • Fix race where ApprovedImage resources could miss owner references and not be garbage-collected when their TrustedExecutionCluster is deleted.

Enhancements:

  • Adopt ApprovedImage resources within the image reconcile flow before finalizer handling, instead of via a separate bulk adoption step in the main reconcile.
  • Simplify tests and controller interactions by removing bulk ApprovedImage adoption and its dedicated test coverage.

@sourcery-ai

sourcery-ai Bot commented Jul 1, 2026

Copy link
Copy Markdown
Reviewer's guide (collapsed on small PRs)

Reviewer's Guide

Moves ApprovedImage ownership adoption into the ApprovedImage controller reconcile path (before finalizer registration) to avoid races with TrustedExecutionCluster deletion, and removes the now-redundant bulk adoption logic and its tests, updating controller tests accordingly.

Sequence diagram for ApprovedImage adoption on first reconcile

sequenceDiagram
    actor User
    participant KubeAPI
    participant ApprovedImageController as ApprovedImageController_image_reconcile
    participant TEC as TrustedExecutionCluster
    participant GC as KubernetesGC

    User->>KubeAPI: create ApprovedImage
    Note over KubeAPI,ApprovedImageController: Reconcile event for ApprovedImage
    KubeAPI-->>ApprovedImageController: image_reconcile(image)

    ApprovedImageController->>KubeAPI: get TrustedExecutionCluster
    KubeAPI-->>ApprovedImageController: TrustedExecutionCluster (cluster)

    alt cluster exists and not owner
        ApprovedImageController->>KubeAPI: adopt_approved_image(name, cluster)
        KubeAPI-->>ApprovedImageController: ApprovedImage with ownerReferences
    end

    ApprovedImageController->>KubeAPI: finalizer(APPROVED_IMAGE_FINALIZER, image, handler)
    KubeAPI-->>ApprovedImageController: ApprovedImage with finalizer

    User->>KubeAPI: delete TrustedExecutionCluster
    KubeAPI-->>GC: TEC deleted, cascade delete owned ApprovedImage
    GC-->>KubeAPI: delete ApprovedImage

    KubeAPI-->>ApprovedImageController: image_reconcile(delete event)
    ApprovedImageController-->>KubeAPI: finalize and remove APPROVED_IMAGE_FINALIZER
Loading

File-Level Changes

Change Details Files
Adopt ApprovedImage resources during image reconcile before finalizer registration to ensure owner references are set on the first reconcile.
  • Introduce helper closures to check if an ApprovedImage is already owned by a given TrustedExecutionCluster UID.
  • In image_reconcile, if a cluster exists and does not already own the image, call adopt_approved_image before invoking kube::runtime::finalizer.
  • Remove equivalent owner-check/adoption logic from image_add_reconcile to avoid duplication.
operator/src/reference_values.rs
Remove bulk adoption of all ApprovedImages during cluster reconcile and its tests as adoption is now handled per-image.
  • Delete adopt_approved_images helper that listed and adopted all ApprovedImage objects for a cluster.
  • Remove the call to adopt_approved_images from the main TrustedExecutionCluster reconcile function.
  • Delete associated unit tests for adopt_approved_images and adjust existing tests to the new reconcile flow and API interaction counts.
operator/src/reference_values.rs
operator/src/main.rs

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've left some high level feedback:

  • In image_reconcile, consider skipping adoption when the TrustedExecutionCluster has a deletion_timestamp set to avoid newly attaching an owner reference to a TEC that is already being deleted.
  • The ownership-check closures in image_reconcile (uid_owns/cluster_owns) could be extracted into a small helper function to make this logic more readable and reusable if other controllers need to make the same check.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `image_reconcile`, consider skipping adoption when the `TrustedExecutionCluster` has a `deletion_timestamp` set to avoid newly attaching an owner reference to a TEC that is already being deleted.
- The ownership-check closures in `image_reconcile` (`uid_owns`/`cluster_owns`) could be extracted into a small helper function to make this logic more readable and reusable if other controllers need to make the same check.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

let kube_client = Arc::<Client>::unwrap_or_clone(client);
let err = "ApprovedImage had no name";
let name = image.metadata.name.clone().context(err)?;
let cluster = get_opt_trusted_execution_cluster(kube_client.clone())

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I approved this in #273 and I'm sorry that I'm only complaining now, but I did wonder about something now that we have space to discuss it, so thank you for raising separately. Thus:

Is this really an improvement? When image creation and cluster deletion fall close to one another, this cluster might also be None, to the same effect.

Or did this cause a test failure? If it did, I wonder if that test should lower its expectations on how quick an adoption happens. test_approved_image_readoption has been doing that since 6664071 (which was only merged a few days ago -- did this patch make a test pass more consistently since that?)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the cluster has been deleted before adopt image was called, then yes, we will have the same problem. But we can add a condition where, if the cluster has been deleted and the image has not adopted yet, we will delete the image.
Before this patch, the test failed on test_approved_image_readoption. I don't remember if I used 6664071 but waiting for a bit longer probably can solve it but in my opinion it's not an ideal solution. WDYT

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the cluster has been deleted before adopt image was called, then yes, we will have the same problem. But we can add a condition where, if the cluster has been deleted and the image has not adopted yet, we will delete the image. Before this patch, the test failed on test_approved_image_readoption.

hmm the intention is that you may have ApprovedImages before a TrustedExecutionCluster exists, but with your suggestion (IIUIC) they'd always be removed, even if a TrustedExecutionCluster had never existed

I don't remember if I used 6664071 but waiting for a bit longer probably can solve it but in my opinion it's not an ideal solution. WDYT

FWIW that commit doesn't wait a fixed time in seconds, but awaits the ownership

Comment thread operator/src/reference_values.rs
@alicefr

alicefr commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Thanks for raising the PR, I like the fact that we are now adopting the image as part of its reconciliation loop. @Jakob-Naucke I know I introduce this, but probably we have the same pattern also on other object we should check it.

@Jakob-Naucke

Copy link
Copy Markdown
Member

I like the fact that we are now adopting the image as part of its reconciliation loop

I realize the tests passed on this one, but logically, should the image even be adopted as part of its reconciliation when a cluster is created? What if the cluster is created later?

@alicefr

alicefr commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

I like the fact that we are now adopting the image as part of its reconciliation loop

I realize the tests passed on this one, but logically, should the image even be adopted as part of its reconciliation when a cluster is created? What if the cluster is created later?

Ah I see, you mean it should be part of the reconciliation loop of TEC right? I guess it should be on both depending who is created first

@Jakob-Naucke

Jakob-Naucke commented Jul 1, 2026

Copy link
Copy Markdown
Member

I realize the tests passed on this one, but logically, should the image even be adopted as part of its reconciliation when a cluster is created? What if the cluster is created later?

Ah I see, you mean it should be part of the reconciliation loop of TEC right? I guess it should be on both depending who is created first

Yes, and I didn't clarify this in my earlier answer, but this is already what we do today:

reference_values::adopt_approved_images(kube_client, &cluster).await?;
adopt_approved_image(client.clone(), name, &cluster).await?;

@alicefr

alicefr commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Yes, and I didn't clarify this in my earlier answer, but this is already what we do today:

reference_values::adopt_approved_images(kube_client, &cluster).await?;

adopt_approved_image(client.clone(), name, &cluster).await?;

We probably need both. Roy's fix looks correct to me but we need to leave the adoption of the approved images also in the TEC reconcile loop

@iroykaufman

Copy link
Copy Markdown
Member Author

Yes, and I didn't clarify this in my earlier answer, but this is already what we do today:

reference_values::adopt_approved_images(kube_client, &cluster).await?;

adopt_approved_image(client.clone(), name, &cluster).await?;

We probably need both. Roy's fix looks correct to me but we need to leave the adoption of the approved images also in the TEC reconcile loop

This make sense. I'm on it.

The race scenario:
- Assume the operator is installed and TEC object exists
- Apply an approved-image yaml file (create a new CR instance)
- Delete the TEC object (immediately)
- The new approved-image is not deleted with the TEC object

This happens because kube-rs finalizer() does not call the Apply handler on the first
reconcile — it only adds the finalizer and returns. This means
adopt_approved_image inside image_add_reconcile never runs on the
first reconcile. If the TEC is deleted before the second reconcile,
the owner reference is never set and GC (garbage collector) cannot cascade-delete the
ApprovedImage.

Move adoption before the finalizer() call in image_reconcile so
it runs on the very first reconcile.

For more information see https://docs.rs/kube/latest/kube/runtime/finalizer/fn.finalizer.html
in sections Guarantees, Assumptions (Third point) and Expected Flow (from 1 to 6).

Signed-off-by: Roy Kaufman <rkaufman@redhat.com>

@Jakob-Naucke Jakob-Naucke left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine. I still think 6664071 fixed the test failure that this patch tries to improve, and that the race is still possible after this patch, but can see how it might be less likely.

@openshift-ci

openshift-ci Bot commented Jul 1, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alicefr, iroykaufman, Jakob-Naucke

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Jakob-Naucke added a commit to Jakob-Naucke/trusted-cluster-operator that referenced this pull request Jul 1, 2026
was added in error, thinking that the `labeled` event could not apply
to `pull_request` when it can. This lead to failures because checkout
is not enabled by default on `pull_request_target` for security
reasons, as happened in trusted-execution-clusters#302.

Signed-off-by: Jakob Naucke <jnaucke@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants