Use event updates for PR reconciliation by lcarva · Pull Request #564 · tektoncd/chains

lcarva · 2022-09-22T19:08:35Z

Changes

A PipelineRun should only be processed when the underlying TaskRun tasks have been signed. This prevents timing issues where the Chains controller may attempt to push both TaskRun and PipelineRun attestations to an OCI repository at the same time, causing one of them to be lost.

Prior to this change, a PipelineRun that was not ready to be processed was requeued so it could be re-processed in 15 seconds. Given that this scenario happens often, this caused the Chains controller to be overwhelmed if there were too many TaskRun and/or PipelineRun resources on the cluster.

This commit modifies the logic so when Chains signs a TaskRun, it also updates an annotation, chains.tekton.dev/child-signed, on the parent PipelineRun, if any. The annotation update triggers the PipelineRun to be reconciled again. Thus the logic is driven by "events" instead of a polling mechanism.

This commit modifies the logic in three significant ways. First, it removes the polling every 15 seconds. Second, if the PipelineRun cannot be reconciled because the TaskRun is not yet signed, a tracker is created so the PipelineRun controller is called again automatically by knative if the TaskRun changes. Finally, instead of querying the k8s API for the latest version of the TaskRun resource from the PipelineRun controller, an informer is used to minimize API calls.

tekton-robot · 2022-09-22T19:12:38Z

The following is the coverage report on the affected files.
Say /test pull-tekton-chains-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/chains/annotations.go	87.9%	82.9%	-5.0
pkg/chains/signing.go	75.0%	73.0%	-2.0

wlynch · 2022-09-23T14:55:14Z

+	if parent, err := tektonObj.GetParent(ctx, o.Pipelineclientset); err != nil {
+		return err
+	} else if parent != nil {
+		if err := MarkParentSigned(ctx, parent, o.Pipelineclientset); err != nil {


Do we need to wait until all Tasks are done before doing this?

In particular I'm thinking about if we have a matrix pipeline that builds images for different arch in different Tasks - if we set this too early the Pipeline sig might still clash with other Tasks.

Maybe we invert this relationship and have objects query for their parent status? 🤔

Do we need to wait until all Tasks are done before doing this?

In reality, we just need to wait until the last Task within the Pipeline that has signable subjects. We could enhance the logic to handle that, not sure how much more beneficial that would be though.

In particular I'm thinking about if we have a matrix pipeline that builds images for different arch in different Tasks - if we set this too early the Pipeline sig might still clash with other Tasks.

I think in this case, each of those Tasks produces a different image (unique digest). So the Tasks themselves wouldn't interfere with each other's signatures and attestations.

Maybe we invert this relationship and have objects query for their parent status?

Interesting idea. It would mean that signatures and attestations may have a perceivable delay in being created (maybe?). Not sure if that's an issue.

In reality, we just need to wait until the last Task within the Pipeline that has signable subjects. We could enhance the logic to handle that, not sure how much more beneficial that would be though.

Makes sense! This is just waiting until 1 Task is done before annotating the PipelineRun to go though, right? That's why I'm wondering if we should just wait until the pipeline clears first before we handle the TaskRuns 🤔

I think in this case, each of those Tasks produces a different image (unique digest). So the Tasks themselves wouldn't interfere with each other's signatures and attestations.

Hmmm I think I'm not fully grasping the problem then - if TaskRuns are pushing unique digests that won't interfere with each other, how does this interfere with the PipelineRun? 👀

Interesting idea. It would mean that signatures and attestations may have a perceivable delay in being created (maybe?). Not sure if that's an issue.

I'm not super concerned - chains is already async so there's an expectation of some delay. We could use https://pkg.go.dev/knative.dev/pkg/controller#Impl.EnqueueKey to force an immediate requeue of sub resources to help speed things up if needed.

Makes sense! This is just waiting until 1 Task is done before annotating the PipelineRun to go though, right?

This updates the PipelineRun whenever a TaskRun finishes. So if the PipelineRun contains 5 TaskRuns, the annotation on the PipelineRun will get updated 5 times.

That's why I'm wondering if we should just wait until the pipeline clears first before we handle the TaskRuns

I think to make this work, we would need to update an annotation on the TaskRun to indicate whether or not it can be proceed by Chains, right? That might work. I'm not sure what the main benefit would be though. Can you help me understand?

The reason I chose the current approach is because I wanted to reconcile resources as early as possible. Waiting to reconcile TaskRuns until the whole PipelineRun completes goes against that. I'm not saying that we shouldn't do that, just clarifying my thought process.

Hmmm I think I'm not fully grasping the problem then - if TaskRuns are pushing unique digests that won't interfere with each other, how does this interfere with the PipelineRun?

Because if you want a PipelineRun attestation for each one of those images, they need to become Pipeline results. Let's say a PipelineRun contains three TaskRuns, each building a unique image (e.g. multi-arch). Since they're unique, they have different digests, so even if they're pushed to the same OCI repo, they'll each have their own signature and attestation images. (This is also the case prior to TEP-84). Now let's say the PipelineRun produces multiple images in its result, which makes sense since at the scope of a PipelineRun, multiple images were built. The PipelineRun reconciler is going to generate the PipelineRun attestation (which is the same for all three images), and push them not only to the same OCI repo, but also to the same sha256-<digest>.att tag. cosign is smart enough to not overwrite the attestations that are already there, so you end up with attestation images that contain two layers. The first layer contains the TaskRun attestation (unique across all three images), and the second containing the PipelineRun attestation (the same across all three images).

Okay, now that I'm caught up and grok everything, I think we can actually do this with less changes!

I think to make this work, we would need to update an annotation on the TaskRun to indicate whether or not it can be proceed by Chains, right? That might work. I'm not sure what the main benefit would be though. Can you help me understand?

I think we already have this in the chains.tekton.dev/signed annotation + the reconciled check here:

chains/pkg/chains/annotations.go

Lines 36 to 43 in ad5a650

func Reconciled(obj objects.TektonObject) bool {

annotations := obj.GetAnnotations()

val, ok := annotations[ChainsAnnotation]

if !ok {

return false

}

return val == "true" || val == "failed"

}

I think what we probably want to do is swap out the direct client in

chains/pkg/reconciler/pipelinerun/pipelinerun.go

Line 94 in ad5a650

tr, err := r.Pipelineclientset.TektonV1beta1().TaskRuns(pr.Namespace).Get(ctx, name, v1.GetOptions{})

for the lister cache via the informer from https://pkg.go.dev/github.com/tektoncd/pipeline@v0.40.0/pkg/client/injection/informers/pipeline/v1beta1/taskrun . This should be safe to query as much as we want and should generally have the latest-ish values.

To trigger the resync from the TaskRun, you can just requeue the parent object key using https://pkg.go.dev/knative.dev/pkg@v0.0.0-20220921024409-d1d5c849073b/controller#Impl.EnqueueKey

This should block the PipelineRun reconcile until all the TaskRuns are done and minimize the # of API calls the PipelineRun reconciler is making to query TaskRun state.

The reason I chose the current approach is because I wanted to reconcile resources as early as possible. Waiting to reconcile TaskRuns until the whole PipelineRun completes goes against that. I'm not saying that we shouldn't do that, just clarifying my thought process.

Got it. Makes sense!

ooh I like this idea.

I replaced the line with:

taskruninformer.Get(ctx).Lister().TaskRuns(pr.Namespace).Get(name)

But I'm getting a panic when that line is run:

Unable to fetch github.com/tektoncd/pipeline/pkg/client/informers/externalversions/pipeline/v1beta1.TaskRunInformer from context.

I'm not well-versed here, but it sounds like maybe I also need to do some setup on the informer.

Yeah I always forget where the client injection needs to go. Try passing it when the controller is created similar to the pipelinerun informer.

chains/pkg/reconciler/pipelinerun/controller.go

Line 33 in ad5a650

pipelineRunInformer := pipelineruninformer.Get(ctx)

Thanks for the tip 🙏 Pushed a new commit that takes this approach. It should further improve performance.

For the second part of your suggestion, I'm not sure if it'll work. It would require the TaskRun controller/reconciler to add something on the queue of the PipelineRun controller/reconciler. I'm likely missing something obvious here though.

Shout out to @vaikas for this!

We should be able to use a tracker for this to watch a PipelineRun's children:

https://github.com/sigstore/policy-controller/blob/main/pkg/reconciler/clusterimagepolicy/clusterimagepolicy.go#L204 for an example

Kudos to @vaikas!

Added a commit that takes this approach.

@wlynch, if the current changes look good to you, I can squash the commits if you prefer.

(The failing e2e test seems to be unrelated to these changes because it's also failing on my PR that updates just the docs.)

wlynch · 2022-09-23T14:58:15Z

Also a few general questions:

Is this primarily affecting OCI artifacts w/ OCI storage since we're trying to update the same .sig file, or did you observe this for tekton artifacts as well?
Are we actually losing data / events when this happens (i.e. the events are successfully writing over each other), or are we just getting a conflict error and we'll eventually get to the right state?

lcarva · 2022-09-23T17:06:54Z

1. Is this primarily affecting OCI artifacts w/ OCI storage since we're trying to update the same `.sig` file, or did you observe this for tekton artifacts as well?

Unlike the taskrun reconciler, the pipelinerun reconciler doesn't sign images, it only creates attestations. But, yes, the same concept applies. This is not an issue when other types of storage are used, afaik.

2. Are we actually losing data / events when this happens (i.e. the events are successfully writing over each other), or are we just getting a conflict error and we'll eventually get to the right state?

Yes, we lose data and events. If both the taskrun and pipelinerun reconcilers run at the same time, either the taskrun or pipelinerun attestation is lost. We saw this early on in the POC. This is why the check was added to only reconcile pipelineruns once their taskruns have been reconciled. If we don't poll (like the code used to), and don't update the parent resource (like this PR does), then there's nothing that triggers another pipelinerun reconciliation, thus the pipelinerun attestation is not created. (Oddly enough, if you do delete the pipelinerun then the attestation is created due to the use of finalizers. I don't think we should rely on that though.)

tekton-robot · 2022-09-23T18:30:44Z

The following is the coverage report on the affected files.
Say /test pull-tekton-chains-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/chains/annotations.go	87.9%	82.9%	-5.0
pkg/chains/signing.go	75.0%	73.0%	-2.0

wlynch · 2022-09-26T16:07:35Z

Unlike the taskrun reconciler, the pipelinerun reconciler doesn't sign images, it only creates attestations. But, yes, the same concept applies. This is not an issue when other types of storage are used, afaik.

Could this be avoided if we used a different naming scheme for OCI storage for Run attestations to avoid collision?

lcarva · 2022-09-26T17:30:43Z

Could this be avoided if we used a different naming scheme for OCI storage for Run attestations to avoid collision?

Something like sha256-<digest>.pr.at ? Yes, it would absolutely avoid this sort of issue. I'm not sure if that's a good idea though since cosign wouldn't know about it.

wlynch · 2022-09-26T20:07:24Z

Could this be avoided if we used a different naming scheme for OCI storage for Run attestations to avoid collision?

Something like sha256-<digest>.pr.at ? Yes, it would absolutely avoid this sort of issue. I'm not sure if that's a good idea though since cosign wouldn't know about it.

I was misunderstanding the problem. 🙃 I get it now - we can hit this case any time we try to attach things to an OCI image, which can happen for simple signing, but also when we try and attach the build provenance to the image as well. I was getting tripped up thinking about the case where we are just using an OCI registry as a storage bucket for build provenance w/o attaching it to an image, which should be safe from this.

+1 to staying consistent with cosign.

This could be solved with a bit of retries if we could pass an etag, though looks like this is still WIP - google/go-containerregistry#1333. /cc @imjasonh - any OCI wisdom you can pass down here?

tekton-robot · 2022-09-27T16:19:36Z

The following is the coverage report on the affected files.
Say /test pull-tekton-chains-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/chains/annotations.go	87.9%	82.9%	-5.0
pkg/chains/signing.go	75.0%	73.0%	-2.0
pkg/reconciler/pipelinerun/controller.go	94.4%	95.0%	0.6

lcarva · 2022-09-27T16:52:38Z

/test pull-tekton-chains-integration-tests

tekton-robot · 2022-09-27T19:32:08Z

The following is the coverage report on the affected files.
Say /test pull-tekton-chains-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/chains/signing.go	75.0%	74.5%	-0.5
pkg/reconciler/pipelinerun/controller.go	94.4%	95.2%	0.8
pkg/reconciler/pipelinerun/pipelinerun.go	70.3%	71.8%	1.5

wlynch

Looks great!

/lgtm

vaikas · 2022-09-27T22:36:25Z

So, this looks right to me! Glad the Tracker worked out :)

The test seems to be failing (I think?) because of this:
The attestation that we get back has this:

        {
          "entryPoint": "/ko-app/imagedigestexporter",
          "arguments": [
            "-images",
            "[{\"name\":\"builtImage\",\"type\":\"image\",\"url\":\"gcr.io/foo/bar\",\"digest\":\"\",\"OutputImageDir\":\"/workspace/sourcerepo\"}]"
          ],
          "environment": {
            "container": "image-digest-exporter-4nq2b",
            "image": "docker-pullable://gcr.io/tekton-nightly/github.com/tektoncd/pipeline/cmd/imagedigestexporter@sha256:ea087a3978cde45884aaa01354939d82352d9c49b3d6c44a2fb1c22a681e59aa"
          },
          "annotations": null
        }

And what's is expected is this:

                        {
                            "entryPoint": "",
                            "arguments": null,
                            "environment": {
                                "container": "image-digest-exporter-4nq2b",
                                "image": "docker-pullable://gcr.io/tekton-nightly/github.com/tektoncd/pipeline/cmd/imagedigestexporter@sha256:ea087a3978cde45884aaa01354939d82352d9c49b3d6c44a2fb1c22a681e59aa"
                            },
                            "annotations": null
                        }

Notice the entryPoint and arguments are different. Could it be a newer version of ko, or something else?

A PipelineRun should only be processed when the underlying TaskRun tasks have been signed. This prevents timing issues where the Chains controller may attempt to push both TaskRun and PipelineRun attestations to an OCI repository at the same time, causing one of them to be lost. Prior to this change, a PipelineRun that was not ready to be processed was requeued so it could be re-processed in 15 seconds. Given that this scenario happens often, this caused the Chains controller to be overwhelmed if there were too many TaskRun and/or PipelineRun resources on the cluster. This commit modifies the logic so when Chains signs a TaskRun, it also updates an annotation, `chains.tekton.dev/child-signed`, on the parent PipelineRun, if any. The annotation update triggers the PipelineRun to be reconciled again. Thus the logic is driven by "events" instead of a polling mechanism. Signed-off-by: Luiz Carvalho <lucarval@redhat.com> Notify parent object after child is reconciled Signed-off-by: Luiz Carvalho <lucarval@redhat.com>

Signed-off-by: Luiz Carvalho <lucarval@redhat.com>

tekton-robot · 2022-10-04T20:04:17Z

The following is the coverage report on the affected files.
Say /test pull-tekton-chains-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/chains/signing.go	75.0%	74.5%	-0.5
pkg/reconciler/pipelinerun/controller.go	94.4%	95.2%	0.8
pkg/reconciler/pipelinerun/pipelinerun.go	70.3%	71.8%	1.5

lcarva · 2022-10-04T20:33:56Z

@wlynch, lgtm label removed because I had to rebase to pickup the e2e test workaround 🙏

lcarva · 2022-10-10T17:21:35Z

@wlynch, @priyawadhwa, this PR is approved but needs lgtm label. Can I just add it to my own PRs? Or is that not a recommended practice?

imjasonh · 2022-10-10T18:49:03Z

This could be solved with a bit of retries if we could pass an etag, though looks like this is still WIP - google/go-containerregistry#1333. /cc @imjasonh - any OCI wisdom you can pass down here?

Not all registries support ETags today, and though there's guidance to recommend it, there's currently no way to tell whether the ETag you pass will be honored.

That being said, you should still set them optimistically. When/if that ggcr PR lands you should just get it For Free™️, but I don't know at this time whether or when it will land. I'll poke Jon to give his feedback.

wlynch

Sorry about the delay! Was OOO Mon-Tue 🙏

/lgtm
/approve

tekton-robot · 2022-10-12T14:36:26Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wlynch

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [wlynch]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tekton-robot requested review from priyawadhwa and wlynch September 22, 2022 19:08

tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 22, 2022

wlynch reviewed Sep 23, 2022

View reviewed changes

lcarva force-pushed the fix-perf-sync-pr-att branch from e752bae to bc87aa8 Compare September 23, 2022 18:27

wlynch approved these changes Sep 27, 2022

View reviewed changes

tekton-robot assigned wlynch Sep 27, 2022

tekton-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 27, 2022

lcarva added 3 commits October 4, 2022 16:00

Use informer to query for TaskRuns

d1b205c

Signed-off-by: Luiz Carvalho <lucarval@redhat.com>

Use tracker to be notified of TaskRun updates

02539eb

Signed-off-by: Luiz Carvalho <lucarval@redhat.com>

lcarva force-pushed the fix-perf-sync-pr-att branch from f493b69 to 02539eb Compare October 4, 2022 20:00

tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 4, 2022

wlynch approved these changes Oct 12, 2022

View reviewed changes

tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 12, 2022

tekton-robot merged commit 201dad8 into tektoncd:main Oct 12, 2022

lcarva mentioned this pull request Nov 1, 2022

Chains can stop PipelineRun's from being deleted if the TaskRun has been deleted #602

Closed

	func Reconciled(obj objects.TektonObject) bool {
	annotations := obj.GetAnnotations()
	val, ok := annotations[ChainsAnnotation]
	if !ok {
	return false
	}
	return val == "true" \|\| val == "failed"
	}

Uh oh!

Conversation

lcarva commented Sep 22, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Uh oh!

tekton-robot commented Sep 22, 2022

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lcarva Sep 26, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wlynch commented Sep 23, 2022

Uh oh!

lcarva commented Sep 23, 2022

Uh oh!

tekton-robot commented Sep 23, 2022

Uh oh!

wlynch commented Sep 26, 2022

Uh oh!

lcarva commented Sep 26, 2022

Uh oh!

wlynch commented Sep 26, 2022

Uh oh!

tekton-robot commented Sep 27, 2022

Uh oh!

lcarva commented Sep 27, 2022

Uh oh!

tekton-robot commented Sep 27, 2022

Uh oh!

wlynch left a comment

Choose a reason for hiding this comment

Uh oh!

vaikas commented Sep 27, 2022

Uh oh!

tekton-robot commented Oct 4, 2022

Uh oh!

lcarva commented Oct 4, 2022

Uh oh!

lcarva commented Oct 10, 2022

Uh oh!

imjasonh commented Oct 10, 2022

Uh oh!

wlynch left a comment

Choose a reason for hiding this comment

Uh oh!

tekton-robot commented Oct 12, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

lcarva commented Sep 22, 2022 •

edited

Loading

lcarva Sep 26, 2022 •

edited

Loading