Cache secrets in interceptors with Reflector #594

Conversation

tragiclifestories
Contributor

Changes

This is an alternative implementation of #585 using client-go's cache package. We create a reflector for secrets across all namespaces and use it as a caching layer for secrets in both the GitHub and GitLab webhook parsers and in CEL compareSecret calls.

I'm putting this up as a draft to get immediate feedback. I still need to test it. Also, now that it also backs CEL functions, WebhookSecretStore is probably the wrong name ...
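
As a rough illustration of the caching pattern described above, here is a minimal sketch that uses only the Go standard library. It does not use client-go, and the `event`, `secretCache`, `run`, and `get` names are hypothetical. It models the idea behind a reflector: seed a local store from an initial listing, keep it current by applying watch events, and serve every read from the local copy.

```go
package main

import (
	"fmt"
	"sync"
)

// event is a hypothetical stand-in for a watch notification about one secret.
type event struct {
	key     string // "namespace/name"
	value   []byte
	deleted bool
}

// secretCache is an in-memory copy of the secrets, safe for concurrent use.
type secretCache struct {
	mu    sync.RWMutex
	items map[string][]byte
}

// newSecretCache seeds the cache from an initial listing.
func newSecretCache(initial map[string][]byte) *secretCache {
	c := &secretCache{items: make(map[string][]byte, len(initial))}
	for k, v := range initial {
		c.items[k] = v
	}
	return c
}

// run applies watch events to the local copy until the channel is closed.
func (c *secretCache) run(events <-chan event) {
	for ev := range events {
		c.mu.Lock()
		if ev.deleted {
			delete(c.items, ev.key)
		} else {
			c.items[ev.key] = ev.value
		}
		c.mu.Unlock()
	}
}

// get reads from the local copy only; it never goes back to the source.
func (c *secretCache) get(key string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.items[key]
	return v, ok
}

func main() {
	c := newSecretCache(map[string][]byte{"default/github": []byte("old")})
	events := make(chan event)
	done := make(chan struct{})
	go func() {
		c.run(events)
		close(done)
	}()

	events <- event{key: "default/github", value: []byte("new")}
	events <- event{key: "default/gitlab", value: []byte("token")}
	close(events)
	<-done // wait until every event has been applied

	v, ok := c.get("default/github")
	fmt.Println(string(v), ok)
}
```

Because readers only ever touch the local map, the cost of a lookup does not depend on how many triggers ask for the same secret.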

@tekton-robot tekton-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 3, 2020
@linux-foundation-easycla

linux-foundation-easycla bot commented Jun 3, 2020

CLA Check
The committers are authorized under a signed CLA.

@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 3, 2020
@tekton-robot

Hi @tragiclifestories. Thanks for your PR.

I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 3, 2020
pkg/interceptors/interceptors.go Outdated
Get(sr triggersv1.SecretRef) ([]byte, error)
}

type webhookSecretStore struct {

@jace-ys jace-ys Jun 3, 2020

To address your question, maybe just secretStore would work?

Contributor Author

Yes, which of course was what it was when I first wrote this struct def out 😅

ns = eventListenerNamespace
// Get returns the secret value for a given SecretRef.
func (ws *WebhookSecretStore) Get(sr triggersv1.SecretRef) ([]byte, error) {
cachedObj, ok, _ := ws.store.GetByKey(getKey(sr))

Not very familiar with the cache store, so curious why we're ignoring the error here?

Contributor Author

It turns out that if you spelunk into the code, the error is always nil for this particular store implementation. However, I will probably handle it anyway, since a different store may turn out to make more sense in the future.

@tragiclifestories tragiclifestories force-pushed the cache-trigger-secrets-reflector branch 2 times, most recently from 9ae0f53 to 6eacf24 Compare June 3, 2020 16:38
pkg/sink/sink.go Outdated
case i.GitLab != nil:
interceptor = gitlab.NewInterceptor(i.GitLab, r.KubeClientSet, r.EventListenerNamespace, log)
interceptor = gitlab.NewInterceptor(i.GitHub, r.WebhookSecretStore, log)
Member

This looks like a copy/paste mistake?

Contributor Author

it does indeed ...

@bigkevmcd
Member

This would definitely help with scaling both the volume of requests and the number of triggers that use secrets.

This caching layer is something we do need to consider if we go for a process-based plugin mechanism.

@tragiclifestories
Contributor Author

@bigkevmcd I'll be picking this up again. There are actually a whole host of uncached requests going on in processTrigger and it seems like a lot of caching will be necessary to keep this performant with a lot of triggers.

@tekton-robot tekton-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 30, 2020
@tragiclifestories
Contributor Author

tragiclifestories commented Jun 30, 2020

So, on further investigation, the fix for this problem implemented in #595 did not work. The reason seems to be that the triggers are executed asynchronously, so they all immediately stampede to get the secret at the very top of the ExecuteTrigger methods; in practice you get 100% cache misses, 100% of the time. Which was not what was intended, to put it mildly.
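
The stampede described above can be reproduced in miniature. The following is an illustrative model only, not code from either PR: `lazyCache` and `countFetches` are hypothetical names, and the barrier is an artificial way to make "everyone checks the cache at the same instant" deterministic. With a cache that is filled lazily, every concurrent lookup misses; with a cache that is populated up front, none do.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// lazyCache is filled only after a caller misses and fetches the value itself.
type lazyCache struct {
	mu   sync.Mutex
	data map[string][]byte
}

func (c *lazyCache) get(key string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.data[key]
	return v, ok
}

func (c *lazyCache) put(key string, v []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = v
}

// countFetches runs n concurrent lookups for key and returns how many of them
// had to go to the backend because the cache did not have the value yet.
func countFetches(c *lazyCache, key string, n int) int64 {
	var fetches int64
	var checked, done sync.WaitGroup
	checked.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			_, hit := c.get(key)
			checked.Done()
			checked.Wait() // barrier: all lookups happen before any fetch completes
			if hit {
				return
			}
			atomic.AddInt64(&fetches, 1) // stands in for one API server round trip
			c.put(key, []byte("s3cr3t"))
		}()
	}
	done.Wait()
	return fetches
}

func main() {
	cold := &lazyCache{data: map[string][]byte{}}
	fmt.Println("cold cache fetches:", countFetches(cold, "default/github", 500))

	warm := &lazyCache{data: map[string][]byte{"default/github": []byte("s3cr3t")}}
	fmt.Println("warm cache fetches:", countFetches(warm, "default/github", 500))
}
```

This is why a store that is kept populated in the background helps where a fill-on-miss cache does not.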

This version speeds things up very significantly compared to master in my tests - more thorough than last time around ;-). It takes about 3s to process the hideous 500-github-trigger YAML I put in the examples in this PR, as opposed to something like 3 minutes on master - and most of that is now due to another, far less severe performance problem with compiling CEL expressions (will raise an issue when I get a second).

}

return make(map[string]interface{})
lw := cache.NewListWatchFromClient(cs.CoreV1().RESTClient(), "secrets", metav1.NamespaceAll, fields.Everything())
Contributor Author

Using NamespaceAll here is probably not a long-term solution, since it requires cluster-wide permission to list secrets in RBAC. I guess we could wrap a map of stores here - one per namespace referenced in calls to Get ...
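
One possible shape for the "map of stores" idea is sketched below, using only the standard library. All names (`namespaceStore`, `storeFactory`, `multiStore`) are hypothetical, and the factory is a placeholder for whatever would start a namespace-scoped list/watch in a real implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// namespaceStore is a hypothetical cache of secret data for one namespace.
// In this sketch it is filled once by the factory and read-only afterwards.
type namespaceStore struct {
	namespace string
	items     map[string][]byte
}

// storeFactory builds the store for one namespace. In a real system this is
// where a namespace-scoped list/watch would be started.
type storeFactory func(namespace string) *namespaceStore

// multiStore lazily creates one store per namespace, so access is only ever
// needed for namespaces that are actually referenced in calls to get.
type multiStore struct {
	mu      sync.Mutex
	stores  map[string]*namespaceStore
	factory storeFactory
}

func newMultiStore(f storeFactory) *multiStore {
	return &multiStore{stores: make(map[string]*namespaceStore), factory: f}
}

// forNamespace returns the existing store for ns, creating it on first use.
func (m *multiStore) forNamespace(ns string) *namespaceStore {
	m.mu.Lock()
	defer m.mu.Unlock()
	if s, ok := m.stores[ns]; ok {
		return s
	}
	s := m.factory(ns)
	m.stores[ns] = s
	return s
}

// get looks up a secret by namespace and name in that namespace's store.
func (m *multiStore) get(ns, name string) ([]byte, bool) {
	v, ok := m.forNamespace(ns).items[name]
	return v, ok
}

func main() {
	created := 0
	m := newMultiStore(func(ns string) *namespaceStore {
		created++ // only called while multiStore holds its lock
		return &namespaceStore{
			namespace: ns,
			items:     map[string][]byte{"webhook": []byte(ns + "-secret")},
		}
	})

	v, _ := m.get("team-a", "webhook")
	fmt.Println(string(v))
	m.get("team-a", "webhook")
	m.get("team-b", "webhook")
	fmt.Println("stores created:", created)
}
```

The trade-off is that the first lookup in a new namespace pays the cost of building that store, which a single cluster-wide store avoids.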

@tragiclifestories tragiclifestories marked this pull request as ready for review June 30, 2020 15:38
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 30, 2020
@@ -80,12 +82,15 @@ func main() {
logger.Fatal(err)
}

webhookSecretStore := interceptors.NewWebhookSecretStore(kubeClient, sinkArgs.ElNamespace, 5*time.Second, stopCh)
Contributor Author

The interval should probably be configurable here ...

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 7, 2020
@tekton-robot

@tragiclifestories: PR needs rebase.

James Turley and others added 4 commits July 9, 2020 09:42
r.URL is nil sometimes, in which case this code will panic. This fix
just handles the nil case.
Co-authored-by: Jace Tan <jaceys.tan@gmail.com>
@tragiclifestories tragiclifestories force-pushed the cache-trigger-secrets-reflector branch 2 times, most recently from cb3018a to 3b4cc64 Compare July 9, 2020 11:16
@tekton-robot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot

@tekton-robot: Closed this PR.

@tekton-robot tekton-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 15, 2020
@tragiclifestories
Contributor Author

/reopen

@tekton-robot tekton-robot reopened this Aug 15, 2020
@tekton-robot

@tragiclifestories: Reopened this PR.

@tekton-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign wlynch
You can assign the PR to them by writing /assign @wlynch in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tragiclifestories
Contributor Author

I hope I'll get some time to update this in the next week or so. Any ideas on how to write tests for this, and on why they don't currently pass, are very much welcome. FWIW, we've had a version of this PR running in production for over a month without incident.

@dibyom dibyom removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 13, 2020
@tekton-robot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 11, 2021
@tragiclifestories
Contributor Author

/remove-lifecycle stale Promise I'll get to this soon ...

@tragiclifestories
Contributor Author

/remove-lifecycle stale

@tekton-robot tekton-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 11, 2021
Base automatically changed from master to main March 10, 2021 15:03
@tekton-robot

@tragiclifestories: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

@tekton-robot tekton-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Mar 10, 2021
@bobcatfish
Collaborator

hey @tragiclifestories ! do you have any strong feelings about us closing this PR for now? You can absolutely re-open it (or open a new one) when you're ready to get back to it

@tragiclifestories
Contributor Author

At this point the rebase would probably take longer than redoing it from scratch 😅 . I think this PR can be closed.

I will try to find a moment to see if the bug still exists - it may well not for all I know ...

Labels
do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.