Skip to content

Use event updates for PR reconciliation#564

Merged
tekton-robot merged 3 commits into
tektoncd:mainfrom
lcarva:fix-perf-sync-pr-att
Oct 12, 2022
Merged

Use event updates for PR reconciliation#564
tekton-robot merged 3 commits into
tektoncd:mainfrom
lcarva:fix-perf-sync-pr-att

Conversation

@lcarva

@lcarva lcarva commented Sep 22, 2022

Copy link
Copy Markdown
Contributor

Changes

A PipelineRun should only be processed when the underlying TaskRun tasks have been signed. This prevents timing issues where the Chains controller may attempt to push both TaskRun and PipelineRun attestations to an OCI repository at the same time, causing one of them to be lost.

Prior to this change, a PipelineRun that was not ready to be processed was requeued so it could be re-processed in 15 seconds. Given that this scenario happens often, this caused the Chains controller to be overwhelmed if there were too many TaskRun and/or PipelineRun resources on the cluster.

This commit modifies the logic so when Chains signs a TaskRun, it also updates an annotation, chains.tekton.dev/child-signed, on the parent PipelineRun, if any. The annotation update triggers the PipelineRun to be reconciled again. Thus the logic is driven by "events" instead of a polling mechanism.

This commit modifies the logic in three significant ways. First, it removes the polling every 15 seconds. Second, if the PipelineRun cannot be reconciled because the TaskRun is not yet signed, a tracker is created so the PipelineRun controller is called again automatically by knative if the TaskRun changes. Finally, instead of querying the k8s API for the latest version of the TaskRun resource from the PipelineRun controller, an informer is used to minimize API calls.

@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 22, 2022
@tekton-robot

Copy link
Copy Markdown

The following is the coverage report on the affected files.
Say /test pull-tekton-chains-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/chains/annotations.go 87.9% 82.9% -5.0
pkg/chains/signing.go 75.0% 73.0% -2.0

Comment thread pkg/chains/objects/objects.go Outdated
Comment thread pkg/chains/signing.go Outdated
if parent, err := tektonObj.GetParent(ctx, o.Pipelineclientset); err != nil {
return err
} else if parent != nil {
if err := MarkParentSigned(ctx, parent, o.Pipelineclientset); err != nil {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to wait until all Tasks are done before doing this?

In particular I'm thinking about if we have a matrix pipeline that builds images for different arch in different Tasks - if we set this too early the Pipeline sig might still clash with other Tasks.

Maybe we invert this relationship and have objects query for their parent status? 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to wait until all Tasks are done before doing this?

In reality, we just need to wait until the last Task within the Pipeline that has signable subjects. We could enhance the logic to handle that, not sure how much more beneficial that would be though.

In particular I'm thinking about if we have a matrix pipeline that builds images for different arch in different Tasks - if we set this too early the Pipeline sig might still clash with other Tasks.

I think in this case, each of those Tasks produces a different image (unique digest). So the Tasks themselves wouldn't interfere with each other's signatures and attestations.

Maybe we invert this relationship and have objects query for their parent status?

Interesting idea. It would mean that signatures and attestations may have a perceivable delay in being created (maybe?). Not sure if that's an issue.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In reality, we just need to wait until the last Task within the Pipeline that has signable subjects. We could enhance the logic to handle that, not sure how much more beneficial that would be though.

Makes sense! This is just waiting until 1 Task is done before annotating the PipelineRun to go though, right? That's why I'm wondering if we should just wait until the pipeline clears first before we handle the TaskRuns 🤔

I think in this case, each of those Tasks produces a different image (unique digest). So the Tasks themselves wouldn't interfere with each other's signatures and attestations.

Hmmm I think I'm not fully grasping the problem then - if TaskRuns are pushing unique digests that won't interfere with each other, how does this interfere with the PipelineRun? 👀

Interesting idea. It would mean that signatures and attestations may have a perceivable delay in being created (maybe?). Not sure if that's an issue.

I'm not super concerned - chains is already async so there's an expectation of some delay. We could use https://pkg.go.dev/knative.dev/pkg/controller#Impl.EnqueueKey to force an immediate requeue of sub resources to help speed things up if needed.

@lcarva lcarva Sep 26, 2022

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! This is just waiting until 1 Task is done before annotating the PipelineRun to go though, right?

This updates the PipelineRun whenever a TaskRun finishes. So if the PipelineRun contains 5 TaskRuns, the annotation on the PipelineRun will get updated 5 times.

That's why I'm wondering if we should just wait until the pipeline clears first before we handle the TaskRuns

I think to make this work, we would need to update an annotation on the TaskRun to indicate whether or not it can be proceed by Chains, right? That might work. I'm not sure what the main benefit would be though. Can you help me understand?

The reason I chose the current approach is because I wanted to reconcile resources as early as possible. Waiting to reconcile TaskRuns until the whole PipelineRun completes goes against that. I'm not saying that we shouldn't do that, just clarifying my thought process.

Hmmm I think I'm not fully grasping the problem then - if TaskRuns are pushing unique digests that won't interfere with each other, how does this interfere with the PipelineRun?

Because if you want a PipelineRun attestation for each one of those images, they need to become Pipeline results. Let's say a PipelineRun contains three TaskRuns, each building a unique image (e.g. multi-arch). Since they're unique, they have different digests, so even if they're pushed to the same OCI repo, they'll each have their own signature and attestation images. (This is also the case prior to TEP-84). Now let's say the PipelineRun produces multiple images in its result, which makes sense since at the scope of a PipelineRun, multiple images were built. The PipelineRun reconciler is going to generate the PipelineRun attestation (which is the same for all three images), and push them not only to the same OCI repo, but also to the same sha256-<digest>.att tag. cosign is smart enough to not overwrite the attestations that are already there, so you end up with attestation images that contain two layers. The first layer contains the TaskRun attestation (unique across all three images), and the second containing the PipelineRun attestation (the same across all three images).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, now that I'm caught up and grok everything, I think we can actually do this with less changes!

I think to make this work, we would need to update an annotation on the TaskRun to indicate whether or not it can be proceed by Chains, right? That might work. I'm not sure what the main benefit would be though. Can you help me understand?

I think we already have this in the chains.tekton.dev/signed annotation + the reconciled check here:

func Reconciled(obj objects.TektonObject) bool {
annotations := obj.GetAnnotations()
val, ok := annotations[ChainsAnnotation]
if !ok {
return false
}
return val == "true" || val == "failed"
}

I think what we probably want to do is swap out the direct client in

tr, err := r.Pipelineclientset.TektonV1beta1().TaskRuns(pr.Namespace).Get(ctx, name, v1.GetOptions{})
for the lister cache via the informer from https://pkg.go.dev/github.com/tektoncd/pipeline@v0.40.0/pkg/client/injection/informers/pipeline/v1beta1/taskrun . This should be safe to query as much as we want and should generally have the latest-ish values.

To trigger the resync from the TaskRun, you can just requeue the parent object key using https://pkg.go.dev/knative.dev/pkg@v0.0.0-20220921024409-d1d5c849073b/controller#Impl.EnqueueKey

This should block the PipelineRun reconcile until all the TaskRuns are done and minimize the # of API calls the PipelineRun reconciler is making to query TaskRun state.

The reason I chose the current approach is because I wanted to reconcile resources as early as possible. Waiting to reconcile TaskRuns until the whole PipelineRun completes goes against that. I'm not saying that we shouldn't do that, just clarifying my thought process.

Got it. Makes sense!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooh I like this idea.

I replaced the line with:

taskruninformer.Get(ctx).Lister().TaskRuns(pr.Namespace).Get(name)

But I'm getting a panic when that line is run:

Unable to fetch github.com/tektoncd/pipeline/pkg/client/informers/externalversions/pipeline/v1beta1.TaskRunInformer from context.

I'm not well-versed here, but it sounds like maybe I also need to do some setup on the informer.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I always forget where the client injection needs to go. Try passing it when the controller is created similar to the pipelinerun informer.

pipelineRunInformer := pipelineruninformer.Get(ctx)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tip 🙏 Pushed a new commit that takes this approach. It should further improve performance.

For the second part of your suggestion, I'm not sure if it'll work. It would require the TaskRun controller/reconciler to add something on the queue of the PipelineRun controller/reconciler. I'm likely missing something obvious here though.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shout out to @vaikas for this!

We should be able to use a tracker for this to watch a PipelineRun's children:

https://github.com/sigstore/policy-controller/blob/main/pkg/reconciler/clusterimagepolicy/clusterimagepolicy.go#L204 for an example

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kudos to @vaikas!

Added a commit that takes this approach.

@wlynch, if the current changes look good to you, I can squash the commits if you prefer.

(The failing e2e test seems to be unrelated to these changes because it's also failing on my PR that updates just the docs.)

@wlynch

wlynch commented Sep 23, 2022

Copy link
Copy Markdown
Member

Also a few general questions:

  1. Is this primarily affecting OCI artifacts w/ OCI storage since we're trying to update the same .sig file, or did you observe this for tekton artifacts as well?
  2. Are we actually losing data / events when this happens (i.e. the events are successfully writing over each other), or are we just getting a conflict error and we'll eventually get to the right state?

@lcarva

lcarva commented Sep 23, 2022

Copy link
Copy Markdown
Contributor Author
1. Is this primarily affecting OCI artifacts w/ OCI storage since we're trying to update the same `.sig` file, or did you observe this for tekton artifacts as well?

Unlike the taskrun reconciler, the pipelinerun reconciler doesn't sign images, it only creates attestations. But, yes, the same concept applies. This is not an issue when other types of storage are used, afaik.

2. Are we actually losing data / events when this happens (i.e. the events are successfully writing over each other), or are we just getting a conflict error and we'll eventually get to the right state?

Yes, we lose data and events. If both the taskrun and pipelinerun reconcilers run at the same time, either the taskrun or pipelinerun attestation is lost. We saw this early on in the POC. This is why the check was added to only reconcile pipelineruns once their taskruns have been reconciled. If we don't poll (like the code used to), and don't update the parent resource (like this PR does), then there's nothing that triggers another pipelinerun reconciliation, thus the pipelinerun attestation is not created. (Oddly enough, if you do delete the pipelinerun then the attestation is created due to the use of finalizers. I don't think we should rely on that though.)

@lcarva lcarva force-pushed the fix-perf-sync-pr-att branch from e752bae to bc87aa8 Compare September 23, 2022 18:27
@tekton-robot

Copy link
Copy Markdown

The following is the coverage report on the affected files.
Say /test pull-tekton-chains-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/chains/annotations.go 87.9% 82.9% -5.0
pkg/chains/signing.go 75.0% 73.0% -2.0

@wlynch

wlynch commented Sep 26, 2022

Copy link
Copy Markdown
Member

Unlike the taskrun reconciler, the pipelinerun reconciler doesn't sign images, it only creates attestations. But, yes, the same concept applies. This is not an issue when other types of storage are used, afaik.

Could this be avoided if we used a different naming scheme for OCI storage for Run attestations to avoid collision?

@lcarva

lcarva commented Sep 26, 2022

Copy link
Copy Markdown
Contributor Author

Could this be avoided if we used a different naming scheme for OCI storage for Run attestations to avoid collision?

Something like sha256-<digest>.pr.at ? Yes, it would absolutely avoid this sort of issue. I'm not sure if that's a good idea though since cosign wouldn't know about it.

@wlynch

wlynch commented Sep 26, 2022

Copy link
Copy Markdown
Member

Could this be avoided if we used a different naming scheme for OCI storage for Run attestations to avoid collision?

Something like sha256-<digest>.pr.at ? Yes, it would absolutely avoid this sort of issue. I'm not sure if that's a good idea though since cosign wouldn't know about it.

I was misunderstanding the problem. 🙃 I get it now - we can hit this case any time we try to attach things to an OCI image, which can happen for simple signing, but also when we try and attach the build provenance to the image as well. I was getting tripped up thinking about the case where we are just using an OCI registry as a storage bucket for build provenance w/o attaching it to an image, which should be safe from this.

+1 to staying consistent with cosign.

This could be solved with a bit of retries if we could pass an etag, though looks like this is still WIP - google/go-containerregistry#1333. /cc @imjasonh - any OCI wisdom you can pass down here?

@tekton-robot

Copy link
Copy Markdown

The following is the coverage report on the affected files.
Say /test pull-tekton-chains-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/chains/annotations.go 87.9% 82.9% -5.0
pkg/chains/signing.go 75.0% 73.0% -2.0
pkg/reconciler/pipelinerun/controller.go 94.4% 95.0% 0.6

@lcarva

lcarva commented Sep 27, 2022

Copy link
Copy Markdown
Contributor Author

/test pull-tekton-chains-integration-tests

@tekton-robot

Copy link
Copy Markdown

The following is the coverage report on the affected files.
Say /test pull-tekton-chains-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/chains/signing.go 75.0% 74.5% -0.5
pkg/reconciler/pipelinerun/controller.go 94.4% 95.2% 0.8
pkg/reconciler/pipelinerun/pipelinerun.go 70.3% 71.8% 1.5

@wlynch wlynch left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

/lgtm

@tekton-robot tekton-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 27, 2022
@vaikas

vaikas commented Sep 27, 2022

Copy link
Copy Markdown
Contributor

So, this looks right to me! Glad the Tracker worked out :)

The test seems to be failing (I think?) because of this:
The attestation that we get back has this:

        {
          "entryPoint": "/ko-app/imagedigestexporter",
          "arguments": [
            "-images",
            "[{\"name\":\"builtImage\",\"type\":\"image\",\"url\":\"gcr.io/foo/bar\",\"digest\":\"\",\"OutputImageDir\":\"/workspace/sourcerepo\"}]"
          ],
          "environment": {
            "container": "image-digest-exporter-4nq2b",
            "image": "docker-pullable://gcr.io/tekton-nightly/github.com/tektoncd/pipeline/cmd/imagedigestexporter@sha256:ea087a3978cde45884aaa01354939d82352d9c49b3d6c44a2fb1c22a681e59aa"
          },
          "annotations": null
        }

And what's is expected is this:

                        {
                            "entryPoint": "",
                            "arguments": null,
                            "environment": {
                                "container": "image-digest-exporter-4nq2b",
                                "image": "docker-pullable://gcr.io/tekton-nightly/github.com/tektoncd/pipeline/cmd/imagedigestexporter@sha256:ea087a3978cde45884aaa01354939d82352d9c49b3d6c44a2fb1c22a681e59aa"
                            },
                            "annotations": null
                        }

Notice the entryPoint and arguments are different. Could it be a newer version of ko, or something else?

lcarva added 3 commits October 4, 2022 16:00
A PipelineRun should only be processed when the underlying TaskRun tasks
have been signed. This prevents timing issues where the Chains
controller may attempt to push both TaskRun and PipelineRun attestations
to an OCI repository at the same time, causing one of them to be lost.

Prior to this change, a PipelineRun that was not ready to be processed
was requeued so it could be re-processed in 15 seconds. Given that this
scenario happens often, this caused the Chains controller to be
overwhelmed if there were too many TaskRun and/or PipelineRun resources
on the cluster.

This commit modifies the logic so when Chains signs a TaskRun, it also
updates an annotation, `chains.tekton.dev/child-signed`, on the parent
PipelineRun, if any. The annotation update triggers the PipelineRun to
be reconciled again. Thus the logic is driven by "events" instead of a
polling mechanism.

Signed-off-by: Luiz Carvalho <lucarval@redhat.com>

Notify parent object after child is reconciled

Signed-off-by: Luiz Carvalho <lucarval@redhat.com>
Signed-off-by: Luiz Carvalho <lucarval@redhat.com>
Signed-off-by: Luiz Carvalho <lucarval@redhat.com>
@lcarva lcarva force-pushed the fix-perf-sync-pr-att branch from f493b69 to 02539eb Compare October 4, 2022 20:00
@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 4, 2022
@tekton-robot

Copy link
Copy Markdown

The following is the coverage report on the affected files.
Say /test pull-tekton-chains-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/chains/signing.go 75.0% 74.5% -0.5
pkg/reconciler/pipelinerun/controller.go 94.4% 95.2% 0.8
pkg/reconciler/pipelinerun/pipelinerun.go 70.3% 71.8% 1.5

@lcarva

lcarva commented Oct 4, 2022

Copy link
Copy Markdown
Contributor Author

@wlynch, lgtm label removed because I had to rebase to pickup the e2e test workaround 🙏

@lcarva

lcarva commented Oct 10, 2022

Copy link
Copy Markdown
Contributor Author

@wlynch, @priyawadhwa, this PR is approved but needs lgtm label. Can I just add it to my own PRs? Or is that not a recommended practice?

@imjasonh

Copy link
Copy Markdown
Member

This could be solved with a bit of retries if we could pass an etag, though looks like this is still WIP - google/go-containerregistry#1333. /cc @imjasonh - any OCI wisdom you can pass down here?

Not all registries support ETags today, and though there's guidance to recommend it, there's currently no way to tell whether the ETag you pass will be honored.

That being said, you should still set them optimistically. When/if that ggcr PR lands you should just get it For Free™️, but I don't know at this time whether or when it will land. I'll poke Jon to give his feedback.

@wlynch wlynch left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about the delay! Was OOO Mon-Tue 🙏

/lgtm
/approve

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 12, 2022
@tekton-robot

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wlynch

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants