
Refactor the way timeouts are handled #3500

Merged: 1 commit into tektoncd:master on Nov 9, 2020

Conversation

@mattmoor (Member) commented on Nov 5, 2020

/kind cleanup

Fixes: #2905

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

  • Includes tests (if functionality changed/added)
  • Includes docs (if user facing)
  • Commit messages follow commit message best practices
  • Release notes block has been filled in or deleted (only if no user facing changes)

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

Fixes an issue where TaskRuns and PipelineRuns may not properly timeout.

@tekton-robot tekton-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Nov 5, 2020
@tekton-robot tekton-robot requested review from bobcatfish and a user November 5, 2020 18:35
@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 5, 2020
@mattmoor (Member, Author) commented on Nov 5, 2020

Staging this to let Prow test the first chunk as I look at doing the same elsewhere in the codebase. Will split each round into separate commits, if folks want to take a look and give feedback incrementally.

cc @imjasonh @vdemeester @afrittoli

@imjasonh (Member) commented on Nov 5, 2020

cc @yaoxiaoqi

@@ -1748,8 +1748,8 @@ func TestReconcileInvalidTaskRuns(t *testing.T) {

// Check actions and events
actions := clients.Kube.Actions()
if len(actions) != 3 || actions[0].Matches("namespaces", "list") {
@mattmoor (Member, Author) commented on this diff:

Note that all of these checks were previously incorrect because they were missing ! and had their arguments transposed. 🙃
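
As an aside for readers of the diff, the fix this comment describes looks roughly like the following. This is a simplified reconstruction, assuming client-go's `Action.Matches(verb, resource string)` signature; the exact assertions in the PR may differ.

```go
package sketch

import (
	"testing"

	clienttesting "k8s.io/client-go/testing"
)

// assertListsNamespaces mirrors the kind of check being fixed: the verb comes
// first in Matches, and the condition must be negated so the test fails when
// the expected action is missing.
func assertListsNamespaces(t *testing.T, actions []clienttesting.Action) {
	t.Helper()
	// Buggy form (per the comment above): actions[0].Matches("namespaces", "list")
	// with no "!". With the arguments transposed it could never match, so the
	// check degenerated to the length test alone.
	if len(actions) != 3 || !actions[0].Matches("list", "namespaces") {
		t.Errorf("unexpected actions: %v", actions)
	}
}
```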

@mattmoor (Member, Author) commented on Nov 5, 2020

Alright, I have two more commits staged that:

  1. Make the same change to PipelineRun
  2. rm -rf pkg/timeout (this resulted in some dep changes)

I kinda want to see the integration test results before pushing that, but heads up that it's coming (again separate commits).

@mattmoor mattmoor changed the title [WIP] Refactor the way timeouts are handled in TaskRun [WIP] Refactor the way timeouts are handled Nov 5, 2020
@mattmoor (Member, Author) commented on Nov 5, 2020

OK, it passed. I'm going to push the other two commits now and remove the WIP.

@tekton-robot tekton-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 5, 2020
@mattmoor mattmoor changed the title [WIP] Refactor the way timeouts are handled Refactor the way timeouts are handled Nov 5, 2020
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 5, 2020
@mattmoor (Member, Author) commented on Nov 5, 2020

Apparently I didn't delete enough code 🤣

@mattmoor (Member, Author) commented on Nov 5, 2020

Alright, e2e tests have passed twice (once with pipelinerun changes).

Hopefully this time everything is green, but I'd appreciate any feedback so we can squash the timeout_test flake (or start chasing what's left).

@bobcatfish (Collaborator) commented:

Thanks for working on this @mattmoor! Looking forward to taking a look. Tomorrow our team is having an "offsite", so there'll be a delay for me personally at least.

so we can hopefully squash the timeout_test flake (or start chasing what's left).

Are you seeing these flakes locally, or in recent PRs, or somewhere else?

@imjasonh (Member) left a comment:

Thanks for this change; I think it's going to be a useful simplification. I just have some concerns about resource-exhaustion handling.

AIUI, for my own understanding, the old timeout handling code ended up calling Enqueue anyway, so we were always reliant on the workqueue to schedule the timeout check, and if the workqueue backed up we'd start to fall behind on timeout checks, so this shouldn't introduce any new behavior except that we rely on EnqueueAfter's internal delay mechanism instead of our own. Is that all correct?

(Review threads on pkg/reconciler/pipelinerun/pipelinerun.go, pkg/reconciler/taskrun/taskrun.go, and pkg/reconciler/taskrun/taskrun_test.go)
@mattmoor (Member, Author) left a comment:

AIUI, for my own understanding, the old timeout handling code ended up calling Enqueue anyway,

Yes

so we were always reliant on the workqueue to schedule the timeout check, and if the workqueue backed up we'd start to fall behind on timeout checks,

The problem I saw was that we could end up calling the callback and processing the key before the resource had actually timed out [1], and if we missed that one shot, timeout handling was broken entirely, because the logic that kicked it off was very edge-triggered.

[1] I suspect this is due to jitter in the .status.StartTime from the stale-informer issues we saw previously in #3460 (with PipelineRun, where we clamped the StartTime to the child TaskRun's time), so there is likely more we can do here (cc @pritidesai, who called this out), but this is certainly worth doing anyway, as it's much more resilient than what's there now.

so this shouldn't introduce any new behavior except that we rely on EnqueueAfter's internal delay mechanism instead of our own. Is that all correct?

Essentially yes. In theory we could have used the old method here, but each invocation consumes a NEW goroutine (it's not idempotent), which I suspect would tax the system under load. AFAIK the workqueue doesn't use goroutines for EnqueueAfter, so I suspect this is more efficient than what's there now, and idempotent to boot, so we can blindly EnqueueAfter and let it deduplicate internally.
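
To make that concrete, here is a minimal sketch of the EnqueueAfter-style check, with illustrative names rather than the actual Tekton reconciler code; it assumes knative.dev/pkg's controller.Impl, whose EnqueueAfter hands the key to the delaying workqueue so repeated calls are deduplicated instead of each holding a timer goroutine:

```go
// Illustrative sketch: check a run's timeout during Reconcile and, if it has
// not expired yet, ask to be reconciled again when it would. Names here are
// assumptions for the sketch, not the PR's exact code.
package sketch

import (
	"context"
	"time"

	"knative.dev/pkg/controller"
)

// run stands in for a TaskRun/PipelineRun with only the fields this needs.
type run struct {
	StartTime *time.Time    // .status.startTime
	Timeout   time.Duration // resolved timeout (e.g. via GetTimeout(ctx))
}

// checkTimeout reports whether r has exceeded its timeout. If it has not, it
// re-enqueues obj for the moment the timeout would lapse; because this goes
// through the workqueue, calling it on every reconcile is idempotent.
func checkTimeout(ctx context.Context, impl *controller.Impl, obj interface{}, r *run, now time.Time) bool {
	if r.StartTime == nil {
		return false // not started, nothing can time out yet
	}
	elapsed := now.Sub(*r.StartTime)
	if elapsed >= r.Timeout {
		return true // caller marks the run as failed with a timed-out condition
	}
	impl.EnqueueAfter(obj, r.Timeout-elapsed)
	return false
}
```

Checked this way on every reconcile, a missed or early wake-up is harmless: the next pass simply re-reads the clock and requeues again.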

@@ -797,12 +804,12 @@ func combineTaskRunAndTaskSpecAnnotations(pr *v1beta1.PipelineRun, pipelineTask
return annotations
}

-func getPipelineRunTimeout(ctx context.Context, pr *v1beta1.PipelineRun) metav1.Duration {
+func getPipelineRunTimeout(ctx context.Context, pr *v1beta1.PipelineRun) time.Duration {
A Member commented on this diff:

We need one of these for TaskRun timeouts too; the same config is used for both (which seems sort of odd to me 🤔), so maybe we can just write one GetTimeout(context.Context, *metav1.Duration) time.Duration and share it in both places.
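
A shared helper along those lines might look roughly like this; it's a sketch, not necessarily the exact function the PR added, and it assumes Tekton's pkg/apis/config package (FromContextOrDefaults, DefaultTimeoutMinutes) for the configurable default:

```go
// Sketch of a shared timeout-defaulting helper (assumed name getTimeout):
// an explicit spec-level timeout wins; otherwise fall back to the
// cluster-configured default rather than a hard-coded constant.
package sketch

import (
	"context"
	"time"

	"github.com/tektoncd/pipeline/pkg/apis/config"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func getTimeout(ctx context.Context, specified *metav1.Duration) time.Duration {
	if specified != nil {
		return specified.Duration
	}
	defaults := config.FromContextOrDefaults(ctx).Defaults
	return time.Duration(defaults.DefaultTimeoutMinutes) * time.Minute
}
```

Both the TaskRun and PipelineRun reconcilers could then call the same helper with their respective spec.Timeout fields.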

@mattmoor (Member, Author) replied:

I noticed there is a GetTimeout() above these lines in the taskrun reconciler, so I'm going to use that to handle defaulting, but it just uses the static default, not the configurable default. This should probably be fixed as well?

@mattmoor (Member, Author) replied:

Alright, fixed in a separate commit.

@tekton-robot (Collaborator) commented:

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
|------|--------------|--------------|-------|
| pkg/apis/pipeline/v1beta1/pipelinerun_types.go | 80.3% | 75.4% | -4.9 |
| pkg/apis/pipeline/v1beta1/taskrun_types.go | 77.6% | 76.3% | -1.3 |

`{Task,Pipeline}Run` now handle timeouts via `EnqueueAfter` on the workqueue.

`pkg/timeout` is now removed.

We now have consistent `GetTimeout(ctx)` methods on types.

@vdemeester (Member) left a comment:

/meow

@tekton-robot (Collaborator) commented:

@vdemeester: cat image

In response to this:

/meow

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot (Collaborator) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 9, 2020
@dlorenc (Contributor) commented on Nov 9, 2020

/lgtm

nice!

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 9, 2020
@tekton-robot tekton-robot merged commit 8eaaeaa into tektoncd:master Nov 9, 2020
@mattmoor mattmoor deleted the refactor-timeout branch November 9, 2020 18:55
imjasonh added a commit to imjasonh/pipeline that referenced this pull request Apr 14, 2021
This permission was previously needed to support how we enforced
timeouts, by listing all TaskRuns/PipelineRuns across all namespaces and
determining whether they were past their timeout. Since
tektoncd#3500 this check was changed
to not require listing all namespaces, so I believe the permission is no
longer necessary.
tekton-robot pushed a commit that referenced this pull request Apr 15, 2021
This permission was previously needed to support how we enforced
timeouts, by listing all TaskRuns/PipelineRuns across all namespaces and
determining whether they were past their timeout. Since
#3500 this check was changed
to not require listing all namespaces, so I believe the permission is no
longer necessary.
Labels: approved, kind/cleanup, lgtm, release-note, size/XL

Successfully merging this pull request may close these issues:

The timeoutHandler is only instructed to wait when it creates pods
6 participants