Use event updates for PR reconciliation#564
Conversation
|
The following is the coverage report on the affected files.
|
| if parent, err := tektonObj.GetParent(ctx, o.Pipelineclientset); err != nil { | ||
| return err | ||
| } else if parent != nil { | ||
| if err := MarkParentSigned(ctx, parent, o.Pipelineclientset); err != nil { |
There was a problem hiding this comment.
Do we need to wait until all Tasks are done before doing this?
In particular I'm thinking about if we have a matrix pipeline that builds images for different arch in different Tasks - if we set this too early the Pipeline sig might still clash with other Tasks.
Maybe we invert this relationship and have objects query for their parent status? 🤔
There was a problem hiding this comment.
Do we need to wait until all Tasks are done before doing this?
In reality, we just need to wait until the last Task within the Pipeline that has signable subjects. We could enhance the logic to handle that, not sure how much more beneficial that would be though.
In particular I'm thinking about if we have a matrix pipeline that builds images for different arch in different Tasks - if we set this too early the Pipeline sig might still clash with other Tasks.
I think in this case, each of those Tasks produces a different image (unique digest). So the Tasks themselves wouldn't interfere with each other's signatures and attestations.
Maybe we invert this relationship and have objects query for their parent status?
Interesting idea. It would mean that signatures and attestations may have a perceivable delay in being created (maybe?). Not sure if that's an issue.
There was a problem hiding this comment.
In reality, we just need to wait until the last Task within the Pipeline that has signable subjects. We could enhance the logic to handle that, not sure how much more beneficial that would be though.
Makes sense! This is just waiting until 1 Task is done before annotating the PipelineRun to go though, right? That's why I'm wondering if we should just wait until the pipeline clears first before we handle the TaskRuns 🤔
I think in this case, each of those Tasks produces a different image (unique digest). So the Tasks themselves wouldn't interfere with each other's signatures and attestations.
Hmmm I think I'm not fully grasping the problem then - if TaskRuns are pushing unique digests that won't interfere with each other, how does this interfere with the PipelineRun? 👀
Interesting idea. It would mean that signatures and attestations may have a perceivable delay in being created (maybe?). Not sure if that's an issue.
I'm not super concerned - chains is already async so there's an expectation of some delay. We could use https://pkg.go.dev/knative.dev/pkg/controller#Impl.EnqueueKey to force an immediate requeue of sub resources to help speed things up if needed.
There was a problem hiding this comment.
Makes sense! This is just waiting until 1 Task is done before annotating the PipelineRun to go though, right?
This updates the PipelineRun whenever a TaskRun finishes. So if the PipelineRun contains 5 TaskRuns, the annotation on the PipelineRun will get updated 5 times.
That's why I'm wondering if we should just wait until the pipeline clears first before we handle the TaskRuns
I think to make this work, we would need to update an annotation on the TaskRun to indicate whether or not it can be proceed by Chains, right? That might work. I'm not sure what the main benefit would be though. Can you help me understand?
The reason I chose the current approach is because I wanted to reconcile resources as early as possible. Waiting to reconcile TaskRuns until the whole PipelineRun completes goes against that. I'm not saying that we shouldn't do that, just clarifying my thought process.
Hmmm I think I'm not fully grasping the problem then - if TaskRuns are pushing unique digests that won't interfere with each other, how does this interfere with the PipelineRun?
Because if you want a PipelineRun attestation for each one of those images, they need to become Pipeline results. Let's say a PipelineRun contains three TaskRuns, each building a unique image (e.g. multi-arch). Since they're unique, they have different digests, so even if they're pushed to the same OCI repo, they'll each have their own signature and attestation images. (This is also the case prior to TEP-84). Now let's say the PipelineRun produces multiple images in its result, which makes sense since at the scope of a PipelineRun, multiple images were built. The PipelineRun reconciler is going to generate the PipelineRun attestation (which is the same for all three images), and push them not only to the same OCI repo, but also to the same sha256-<digest>.att tag. cosign is smart enough to not overwrite the attestations that are already there, so you end up with attestation images that contain two layers. The first layer contains the TaskRun attestation (unique across all three images), and the second containing the PipelineRun attestation (the same across all three images).
There was a problem hiding this comment.
Okay, now that I'm caught up and grok everything, I think we can actually do this with less changes!
I think to make this work, we would need to update an annotation on the TaskRun to indicate whether or not it can be proceed by Chains, right? That might work. I'm not sure what the main benefit would be though. Can you help me understand?
I think we already have this in the chains.tekton.dev/signed annotation + the reconciled check here:
chains/pkg/chains/annotations.go
Lines 36 to 43 in ad5a650
I think what we probably want to do is swap out the direct client in
for the lister cache via the informer from https://pkg.go.dev/github.com/tektoncd/pipeline@v0.40.0/pkg/client/injection/informers/pipeline/v1beta1/taskrun . This should be safe to query as much as we want and should generally have the latest-ish values.To trigger the resync from the TaskRun, you can just requeue the parent object key using https://pkg.go.dev/knative.dev/pkg@v0.0.0-20220921024409-d1d5c849073b/controller#Impl.EnqueueKey
This should block the PipelineRun reconcile until all the TaskRuns are done and minimize the # of API calls the PipelineRun reconciler is making to query TaskRun state.
The reason I chose the current approach is because I wanted to reconcile resources as early as possible. Waiting to reconcile TaskRuns until the whole PipelineRun completes goes against that. I'm not saying that we shouldn't do that, just clarifying my thought process.
Got it. Makes sense!
There was a problem hiding this comment.
ooh I like this idea.
I replaced the line with:
taskruninformer.Get(ctx).Lister().TaskRuns(pr.Namespace).Get(name)But I'm getting a panic when that line is run:
Unable to fetch github.com/tektoncd/pipeline/pkg/client/informers/externalversions/pipeline/v1beta1.TaskRunInformer from context.I'm not well-versed here, but it sounds like maybe I also need to do some setup on the informer.
There was a problem hiding this comment.
Yeah I always forget where the client injection needs to go. Try passing it when the controller is created similar to the pipelinerun informer.
There was a problem hiding this comment.
Thanks for the tip 🙏 Pushed a new commit that takes this approach. It should further improve performance.
For the second part of your suggestion, I'm not sure if it'll work. It would require the TaskRun controller/reconciler to add something on the queue of the PipelineRun controller/reconciler. I'm likely missing something obvious here though.
There was a problem hiding this comment.
Shout out to @vaikas for this!
We should be able to use a tracker for this to watch a PipelineRun's children:
https://github.com/sigstore/policy-controller/blob/main/pkg/reconciler/clusterimagepolicy/clusterimagepolicy.go#L204 for an example
There was a problem hiding this comment.
|
Also a few general questions:
|
Unlike the taskrun reconciler, the pipelinerun reconciler doesn't sign images, it only creates attestations. But, yes, the same concept applies. This is not an issue when other types of storage are used, afaik.
Yes, we lose data and events. If both the taskrun and pipelinerun reconcilers run at the same time, either the taskrun or pipelinerun attestation is lost. We saw this early on in the POC. This is why the check was added to only reconcile pipelineruns once their taskruns have been reconciled. If we don't poll (like the code used to), and don't update the parent resource (like this PR does), then there's nothing that triggers another pipelinerun reconciliation, thus the pipelinerun attestation is not created. (Oddly enough, if you do delete the pipelinerun then the attestation is created due to the use of finalizers. I don't think we should rely on that though.) |
e752bae to
bc87aa8
Compare
|
The following is the coverage report on the affected files.
|
Could this be avoided if we used a different naming scheme for OCI storage for Run attestations to avoid collision? |
Something like |
I was misunderstanding the problem. 🙃 I get it now - we can hit this case any time we try to attach things to an OCI image, which can happen for simple signing, but also when we try and attach the build provenance to the image as well. I was getting tripped up thinking about the case where we are just using an OCI registry as a storage bucket for build provenance w/o attaching it to an image, which should be safe from this. +1 to staying consistent with cosign. This could be solved with a bit of retries if we could pass an etag, though looks like this is still WIP - google/go-containerregistry#1333. /cc @imjasonh - any OCI wisdom you can pass down here? |
|
The following is the coverage report on the affected files.
|
|
/test pull-tekton-chains-integration-tests |
|
The following is the coverage report on the affected files.
|
|
So, this looks right to me! Glad the Tracker worked out :) The test seems to be failing (I think?) because of this: And what's is expected is this: Notice the entryPoint and arguments are different. Could it be a newer version of ko, or something else? |
A PipelineRun should only be processed when the underlying TaskRun tasks have been signed. This prevents timing issues where the Chains controller may attempt to push both TaskRun and PipelineRun attestations to an OCI repository at the same time, causing one of them to be lost. Prior to this change, a PipelineRun that was not ready to be processed was requeued so it could be re-processed in 15 seconds. Given that this scenario happens often, this caused the Chains controller to be overwhelmed if there were too many TaskRun and/or PipelineRun resources on the cluster. This commit modifies the logic so when Chains signs a TaskRun, it also updates an annotation, `chains.tekton.dev/child-signed`, on the parent PipelineRun, if any. The annotation update triggers the PipelineRun to be reconciled again. Thus the logic is driven by "events" instead of a polling mechanism. Signed-off-by: Luiz Carvalho <lucarval@redhat.com> Notify parent object after child is reconciled Signed-off-by: Luiz Carvalho <lucarval@redhat.com>
Signed-off-by: Luiz Carvalho <lucarval@redhat.com>
Signed-off-by: Luiz Carvalho <lucarval@redhat.com>
f493b69 to
02539eb
Compare
|
The following is the coverage report on the affected files.
|
|
@wlynch, lgtm label removed because I had to rebase to pickup the e2e test workaround 🙏 |
|
@wlynch, @priyawadhwa, this PR is approved but needs lgtm label. Can I just add it to my own PRs? Or is that not a recommended practice? |
Not all registries support ETags today, and though there's guidance to recommend it, there's currently no way to tell whether the ETag you pass will be honored. That being said, you should still set them optimistically. When/if that ggcr PR lands you should just get it For Free™️, but I don't know at this time whether or when it will land. I'll poke Jon to give his feedback. |
wlynch
left a comment
There was a problem hiding this comment.
Sorry about the delay! Was OOO Mon-Tue 🙏
/lgtm
/approve
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: wlynch The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Changes
A PipelineRun should only be processed when the underlying TaskRun tasks have been signed. This prevents timing issues where the Chains controller may attempt to push both TaskRun and PipelineRun attestations to an OCI repository at the same time, causing one of them to be lost.
Prior to this change, a PipelineRun that was not ready to be processed was requeued so it could be re-processed in 15 seconds. Given that this scenario happens often, this caused the Chains controller to be overwhelmed if there were too many TaskRun and/or PipelineRun resources on the cluster.
This commit modifies the logic so when Chains signs a TaskRun, it also updates an annotation,chains.tekton.dev/child-signed, on the parent PipelineRun, if any. The annotation update triggers the PipelineRun to be reconciled again. Thus the logic is driven by "events" instead of a polling mechanism.This commit modifies the logic in three significant ways. First, it removes the polling every 15 seconds. Second, if the PipelineRun cannot be reconciled because the TaskRun is not yet signed, a tracker is created so the PipelineRun controller is called again automatically by knative if the TaskRun changes. Finally, instead of querying the k8s API for the latest version of the TaskRun resource from the PipelineRun controller, an informer is used to minimize API calls.