
[concurrency] Concurrency controller fails to cancel matching PipelineRuns #932

Closed · lbernick opened this issue Jan 5, 2023 · 9 comments

Labels: kind/bug (Categorizes issue or PR as related to a bug), lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed)

lbernick (Member) commented Jan 5, 2023

Expected Behavior

When a PipelineRun is created, the concurrency webhook marks it as pending w/ label "tekton.dev/ok-to-start=true". When the PipelineRun is reconciled for the first time, the concurrency controller cancels all PipelineRuns matching relevant concurrency controls (based on their labels), removes the label "tekton.dev/ok-to-start=true", and adds the label "tekton.dev/concurrency=true". The controller ignores PipelineRuns labeled "tekton.dev/concurrency=true", so the PipelineRun is not modified on subsequent reconcile loops.
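For illustration, a minimal sketch of the state a new PipelineRun should be in after the webhook admits it (the run name is a placeholder; the label names and pending status come from the description above):

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: example-run                  # placeholder name
  labels:
    tekton.dev/ok-to-start: "true"   # added by the concurrency webhook
spec:
  pipelineRef:
    name: teps-linter
  status: PipelineRunPending         # webhook marks the run as pending

On its first reconcile, the controller should then cancel matching runs, replace "tekton.dev/ok-to-start=true" with "tekton.dev/concurrency=true", and unmark the pending status so the run can start.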

Actual Behavior

The controller does not cancel other PipelineRuns when the PipelineRun is first reconciled. I think the PipelineRun doesn't yet have the relevant labels set when it's first reconciled, so no other PipelineRuns are canceled; it's marked with "tekton.dev/concurrency=true" anyway and only gets the relevant labels later. This means it never matches the concurrency control.

Steps to Reproduce the Problem

  1. Enable debug logging via the logging-config ConfigMap in the tekton-concurrency namespace.
  2. Create the following ConcurrencyControl custom resource:
apiVersion: tekton.dev/v1alpha1
kind: ConcurrencyControl
metadata:
  name: teps-linter
  namespace: tekton-ci
spec:
  groupBy:
  - tekton.dev/pr-number
  selector:
    matchLabels:
      tekton.dev/pipeline: teps-linter
  strategy: GracefullyCancel
  3. Kick off two PipelineRuns in quick succession (see the example below).
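For example, each run might look like the following sketch (names and the PR number are placeholders; the tekton.dev/pr-number label matches the control's groupBy key, and the PipelineRun controller later derives the tekton.dev/pipeline label from the pipelineRef):

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: teps-linter-run-   # placeholder; create two of these in quick succession
  namespace: tekton-ci
  labels:
    tekton.dev/pr-number: "1234"   # placeholder PR number, shared by both runs
spec:
  pipelineRef:
    name: teps-linter              # source of the tekton.dev/pipeline: teps-linter label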

Evidence

From the labels, we can see that each PipelineRun initially appears as "PipelineRunPending", with generation 2 and the label "tekton.dev/concurrency:true". Neither of them is ever canceled. We can see in the kubectl describe output for each PipelineRun some events indicating that patch calls may have failed due to concurrent writes:

  Warning  Error          2m3s                 PipelineRun  Operation cannot be fulfilled on pipelineruns.tekton.dev "pull-community-teps-lint-vfssn": the object has been modified; please apply your changes to the latest version and try again
...
Warning  InternalError  2m2s                 PipelineRun  1 error occurred:
           * Operation cannot be fulfilled on pipelineruns.tekton.dev "pull-community-teps-lint-vfssn": the object has been modified; please apply your changes to the latest version and try again

We can also see in the controller logs that this line of code was reached for both PipelineRuns, indicating that they were each reconciled once by the controller before the "tekton.dev/concurrency:true" label was applied, but we don't see this line being reached, indicating that the concurrency control never matched the PipelineRun, despite both PipelineRuns always having the label "tekton.dev/pipeline:teps-linter" in kubectl output.

Additional Info

Client Version: v1.25.4
Kustomize Version: v4.5.7
Server Version: v1.23.11-gke.300
WARNING: version difference between client (1.25) and server (1.23) exceeds the supported minor version skew of +/-1
Tekton Pipeline version: devel
@lbernick lbernick added the kind/bug Categorizes issue or PR as related to a bug. label Jan 5, 2023
lbernick (Member, Author) commented Jan 5, 2023

/assign

lbernick (Member, Author) commented Jan 5, 2023

I think I know what's going on:

The concurrency control relies on a label, "tekton.dev/pipeline:teps-linter", which is added by the PipelineRun controller rather than hard-coded in the trigger template. When I switch the concurrency control to rely on a label ("tekton.dev/check-name: pull-community-teps-lint") that is set by the trigger template rather than the PipelineRun controller, the previous PipelineRun is successfully canceled. I think the first reconcile loop of the concurrency controller is happening before the PipelineRun controller has set this label, so the concurrency control never matches the PipelineRun.
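For reference, the working variant of the control is the same as the one above except for the selector (a sketch; only matchLabels changes):

apiVersion: tekton.dev/v1alpha1
kind: ConcurrencyControl
metadata:
  name: teps-linter
  namespace: tekton-ci
spec:
  groupBy:
  - tekton.dev/pr-number
  selector:
    matchLabels:
      tekton.dev/check-name: pull-community-teps-lint   # set by the trigger template, so present at creation time
  strategy: GracefullyCancel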

I'm still not sure why this seems to happen every time, rather than failing only some of the time (as I'd expect if there were a race condition between the PipelineRun controller and the concurrency controller). One possibility is that the event w/ message "the object has been modified; please apply your changes to the latest version and try again" is coming from the update call in the PipelineRun controller that sets these labels, which fails because the concurrency webhook has already patched the labels when the PipelineRun was created. (Not sure if this is how reconciler queues and patches/updates actually work?)

I'm also not sure what to do about this. One option would be to skip a concurrency controller reconcile loop if the label applied by the PipelineRun controller is not present yet (a bit hacky, and we'd have to build these assumptions into the unit tests). What if a project has another controller that sets labels on PipelineRuns somehow and wants to use them with concurrency controls?

lbernick (Member, Author) commented:
Update: the reason this is deterministic is that the pipeline is not fetched until after the PipelineRun starts executing; i.e. the concurrency controller marks the PipelineRun as no longer pending, the PipelineRun begins executing, and only then does the PipelineRun controller apply the label "tekton.dev/pipeline". Therefore, I'm not sure how we can simultaneously prevent a PipelineRun from starting until concurrency controls are applied, and also allow concurrency controls to select on the label "tekton.dev/pipeline" (unless we change the PipelineRun controller behavior).

chengjoey (Member) commented Jan 30, 2023

I think the PipelineRun controller behavior should be modified: a field similar to bindQueue could be added to the PipelineRun spec, or the queue could be resolved via a selector. If a run is bound to a queue, it should enter the queue first and wait to be processed according to the queue's policy before any subsequent reconcile logic runs (a hypothetical sketch follows).
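A purely hypothetical sketch of what such a field might look like (no bindQueue field exists in Tekton today; the field and queue name here are invented for illustration):

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: example-run              # placeholder
spec:
  pipelineRef:
    name: teps-linter
  bindQueue: teps-linter-queue   # hypothetical field: the run waits in this queue until admitted per its policy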

chengjoey (Member) commented:

tektoncd/community#951

tekton-robot commented:
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 7, 2023
tekton-robot commented:
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 6, 2023
tekton-robot commented:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

tekton-robot commented:
@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
