
[concurrency] Concurrency controller fails to cancel matching PipelineRuns #932

Closed · lbernick opened this issue Jan 5, 2023 · 9 comments

Labels: kind/bug (Categorizes issue or PR as related to a bug), lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed)

lbernick (Member) commented Jan 5, 2023

Expected Behavior

When a PipelineRun is created, the concurrency webhook marks it as pending w/ label "tekton.dev/ok-to-start=true". When the PipelineRun is reconciled for the first time, the concurrency controller cancels all PipelineRuns matching relevant concurrency controls (based on their labels), removes the label "tekton.dev/ok-to-start=true", and adds the label "tekton.dev/concurrency=true". The controller ignores PipelineRuns labeled "tekton.dev/concurrency=true", so the PipelineRun is not modified on subsequent reconcile loops.
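For illustration, a minimal sketch of the state a new PipelineRun should be in after the webhook admits it (the run name is a placeholder; the label names and pending status come from the description above):

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: example-run                  # placeholder name
  labels:
    tekton.dev/ok-to-start: "true"   # added by the concurrency webhook
spec:
  pipelineRef:
    name: teps-linter
  status: PipelineRunPending         # webhook marks the run as pending

On its first reconcile, the controller should then cancel matching runs, replace "tekton.dev/ok-to-start=true" with "tekton.dev/concurrency=true", and unmark the pending status so the run can start.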

Actual Behavior

The controller does not cancel other PipelineRuns when the PipelineRun is first reconciled. I think the PipelineRun doesn't yet have the relevant labels set when it's first reconciled, so no other PipelineRuns are canceled; it's marked with "tekton.dev/concurrency=true" anyway and only gets the relevant labels later. This means it never matches the concurrency control.

Steps to Reproduce the Problem

  1. Enable debug logging via the logging-config ConfigMap in the tekton-concurrency namespace.
  2. Create the following ConcurrencyControl custom resource:
apiVersion: tekton.dev/v1alpha1
kind: ConcurrencyControl
metadata:
  name: teps-linter
  namespace: tekton-ci
spec:
  groupBy:
  - tekton.dev/pr-number
  selector:
    matchLabels:
      tekton.dev/pipeline: teps-linter
  strategy: GracefullyCancel
  3. Kick off two PipelineRuns in quick succession (see the example below).
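For example, each run might look like the following sketch (names and the PR number are placeholders; the tekton.dev/pr-number label matches the control's groupBy key, and the PipelineRun controller later derives the tekton.dev/pipeline label from the pipelineRef):

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: teps-linter-run-   # placeholder; create two of these in quick succession
  namespace: tekton-ci
  labels:
    tekton.dev/pr-number: "1234"   # placeholder PR number, shared by both runs
spec:
  pipelineRef:
    name: teps-linter              # source of the tekton.dev/pipeline: teps-linter label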

Evidence

From the labels, we can see that each PipelineRun initially appears as "PipelineRunPending", with generation 2 and the label "tekton.dev/concurrency:true". Neither of them is ever canceled. We can see in the kubectl describe output for each PipelineRun some events indicating that patch calls may have failed due to concurrent writes:

  Warning  Error          2m3s                 PipelineRun  Operation cannot be fulfilled on pipelineruns.tekton.dev "pull-community-teps-lint-vfssn": the object has been modified; please apply your changes to the latest version and try again
...
Warning  InternalError  2m2s                 PipelineRun  1 error occurred:
           * Operation cannot be fulfilled on pipelineruns.tekton.dev "pull-community-teps-lint-vfssn": the object has been modified; please apply your changes to the latest version and try again

We can also see in the controller logs that this line of code was reached for both PipelineRuns, indicating that they were each reconciled once by the controller before the "tekton.dev/concurrency:true" label was applied, but we don't see this line being reached, indicating that the concurrency control never matched the PipelineRun, despite both PipelineRuns always having the label "tekton.dev/pipeline:teps-linter" in kubectl output.

Additional Info

Client Version: v1.25.4
Kustomize Version: v4.5.7
Server Version: v1.23.11-gke.300
WARNING: version difference between client (1.25) and server (1.23) exceeds the supported minor version skew of +/-1
Tekton Pipeline version: devel
@lbernick lbernick added the kind/bug Categorizes issue or PR as related to a bug. label Jan 5, 2023
lbernick (Member, Author) commented Jan 5, 2023

/assign

lbernick (Member, Author) commented Jan 5, 2023

I think I know what's going on:

The concurrency control relies on a label, "tekton.dev/pipeline:teps-linter", which is added by the PipelineRun controller rather than hard-coded in the trigger template. When I switch the concurrency control to rely on a label ("tekton.dev/check-name: pull-community-teps-lint") that is set by the trigger template rather than the PipelineRun controller, the previous PipelineRun is successfully canceled. I think the first reconcile loop of the concurrency controller is happening before the PipelineRun controller has set this label, so the concurrency control never matches the PipelineRun.
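For reference, the working variant of the control is the same as the one above except for the selector (a sketch; only matchLabels changes):

apiVersion: tekton.dev/v1alpha1
kind: ConcurrencyControl
metadata:
  name: teps-linter
  namespace: tekton-ci
spec:
  groupBy:
  - tekton.dev/pr-number
  selector:
    matchLabels:
      tekton.dev/check-name: pull-community-teps-lint   # set by the trigger template, so present at creation time
  strategy: GracefullyCancel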

I'm still not sure why this seems to happen every time, rather than failing only some of the time (as I'd expect if there were a race condition between the PipelineRun controller and the concurrency controller). One possibility is that the event w/ message "the object has been modified; please apply your changes to the latest version and try again" is coming from the update call in the PipelineRun controller that sets these labels, which fails because the concurrency webhook has already patched the labels when the PipelineRun was created. (Not sure if this is how reconciler queues and patches/updates actually work?)

I'm also not sure what to do about this. One option would be to skip a concurrency controller reconcile loop if the label applied by the PipelineRun controller is not present yet (a bit hacky, and we'd have to build these assumptions into the unit tests). What if a project has another controller that sets labels on PipelineRuns somehow and wants to use them with concurrency controls?

lbernick (Member, Author) commented:
Update: the reason this is deterministic is that the pipeline is not fetched until after the PipelineRun starts executing; i.e. the concurrency controller marks the PipelineRun as no longer pending, the PipelineRun begins executing, and only then does the PipelineRun controller apply the label "tekton.dev/pipeline". Therefore, I'm not sure how we can simultaneously prevent a PipelineRun from starting until concurrency controls are applied, and also allow concurrency controls to select on the label "tekton.dev/pipeline" (unless we change the PipelineRun controller behavior).

chengjoey (Member) commented Jan 30, 2023

I think the PipelineRun controller behavior should be modified: a field similar to bindQueue could be added to the PipelineRun spec, or the queue could be resolved via a selector. If a run is bound to a queue, it should enter the queue first and wait to be processed according to the queue's policy before any subsequent reconcile logic runs (a hypothetical sketch follows).
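A purely hypothetical sketch of what such a field might look like (no bindQueue field exists in Tekton today; the field and queue name here are invented for illustration):

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: example-run              # placeholder
spec:
  pipelineRef:
    name: teps-linter
  bindQueue: teps-linter-queue   # hypothetical field: the run waits in this queue until admitted per its policy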

chengjoey (Member) commented:

tektoncd/community#951

tekton-robot commented:
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 7, 2023
tekton-robot commented:
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 6, 2023
tekton-robot commented:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

tekton-robot commented:
@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
