
Improvements on pipeline cancel #2543

Merged

Conversation


@afrittoli afrittoli commented May 5, 2020

Changes

When a pipeline is cancelled, detect it early in the reconcile
cycle and cancel all taskruns defined in the pipelinerun status,
so that we don't spend time building the DAG and resolving all
resources.

The patch to cancel a TaskRun is the same for all TaskRuns, so
build it once and apply it to all taskruns.
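
For illustration only, here is a minimal sketch of that approach (not the actual Tekton code; the patcher interface and helper names are hypothetical), showing the cancel patch built once and then applied to each TaskRun name taken from the PipelineRun status:

// Minimal sketch, not the real pkg/reconciler/pipelinerun/cancel.go.
// taskRunPatcher is a hypothetical stand-in for the Tekton client call
// clientSet.TektonV1alpha1().TaskRuns(namespace).Patch(...).
package cancel

import (
	"encoding/json"
	"fmt"

	"k8s.io/apimachinery/pkg/types"
)

type taskRunPatcher interface {
	Patch(name string, pt types.PatchType, data []byte) error
}

func cancelTaskRuns(taskRunNames []string, patcher taskRunPatcher) []string {
	// The patch is identical for every TaskRun, so marshal it only once.
	patch, err := json.Marshal([]map[string]interface{}{{
		"op":    "add",
		"path":  "/spec/status",
		"value": "TaskRunCancelled",
	}})
	if err != nil {
		return []string{err.Error()}
	}

	var errs []string
	for _, name := range taskRunNames {
		if err := patcher.Patch(name, types.JSONPatchType, patch); err != nil {
			errs = append(errs, fmt.Errorf("failed to patch TaskRun %q with cancellation: %w", name, err).Error())
		}
	}
	return errs
}

In the actual change the TaskRun names come from the PipelineRun status, as discussed in the review below.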

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you review them:

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

@afrittoli afrittoli requested review from vdemeester and a user May 5, 2020 09:02
@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 5, 2020
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/cancel.go 72.7% 73.7% 1.0
pkg/reconciler/pipelinerun/pipelinerun.go 72.4% 72.3% -0.1

@afrittoli afrittoli force-pushed the issues/2082-simplify-pipeline-cancel branch from 67936f4 to 8ad2727 Compare May 5, 2020 09:45
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/cancel.go 72.7% 73.7% 1.0
pkg/reconciler/pipelinerun/pipelinerun.go 72.4% 72.2% -0.2

afrittoli added a commit to afrittoli/pipeline that referenced this pull request May 5, 2020
Emit additional events:
- Pipeline Start

Emit all events through the events.go module.
Align and simplify the reconcile structure to have clear points
for error handling and emitting events.

Depends on tektoncd#2543
Depends on tektoncd#2526
@afrittoli afrittoli force-pushed the issues/2082-simplify-pipeline-cancel branch from 8ad2727 to 14e6785 Compare May 5, 2020 12:41
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/cancel.go 72.7% 73.7% 1.0
pkg/reconciler/pipelinerun/pipelinerun.go 72.4% 72.3% -0.1

@afrittoli afrittoli force-pushed the issues/2082-simplify-pipeline-cancel branch from 14e6785 to 69541df Compare May 5, 2020 16:48
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/cancel.go 72.7% 73.7% 1.0
pkg/reconciler/pipelinerun/pipelinerun.go 72.4% 72.2% -0.2

@vdemeester vdemeester left a comment

/meow

@tekton-robot

@vdemeester: cat image

In response to this:

/meow

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 5, 2020
@bobcatfish bobcatfish left a comment

Just a few questions from me!

if _, err := clientSet.TektonV1alpha1().TaskRuns(pr.Namespace).Patch(rprt.TaskRunName, types.JSONPatchType, b, ""); err != nil {
errs = append(errs, fmt.Errorf("Failed to patch TaskRun `%s` with cancellation: %s", rprt.TaskRunName, err).Error())
// Loop over the TaskRuns in the PipelineRun status.
// If a TaskRun is not in the status yet we should not cancel it anyways.
Collaborator

nice!

Collaborator

Just double checking, can we guarantee that all started TaskRuns will be in the status? I'm guessing we can, but wanted to check: I think this would mean that in the same reconcile loop where we create a TaskRun, we always update the status. (I'm also wondering whether, if there was an error partway through, we might end up in an inconsistent state, e.g. we create the taskrun and then a transient networking error prevents us from updating the status?)

Member Author

Yeah, I thought about that. In my understanding the reconciler would not process the same key from the queue concurrently. When we start to reconcile we get the latest version of the pr from the API, which should have all running taskruns. It might be that a previous reconcile failed to update the status of the pipeline, but the same may be true for the taskrun status too, so I feel we are not introducing a new issue.

Member Author

The current implementation uses the pipelineState, which requires fetching the pipeline, taskruns, tasks and pipeline resources, and validating and binding them all together... I was hoping to move the handling of cancel earlier in the reconcile and thus avoid calculating the pipelineState.

Collaborator

It might be that a previous reconcile failed to update the status of the pipeline, but the same may be true for the taskrun status too, so I feel we are not introducing a new issue.

I guess in that case the next reconcile would notice that the status needed to be updated anyway, then update it, and then maybe the "is cancelled" block would catch any taskruns that need to be cancelled that weren't before 🤔

it seems like eventually it would end up in the right state?

Member Author

Thank you, you make a very good point.
I agree that the scenario you describe is not covered.

Looking deeper into the code, it looks like it is not covered today either 😅
ResolvePipelineRun is used to build the pipeline run state. That function behaves as follows:

  • for each task in the spec:
      • look in the pipelinerun status for a matching taskrun
      • when not found, generate a new name
      • try to fetch the taskrun; if that fails with a NotFound error, ignore it

This means that if we fail to sync the pipelinerun status, we never go and list TaskRuns by owner, so the taskruns created during the reconcile cycle where the issue happened become orphaned. A new name is generated for those tasks, so presumably dag.GetSchedulable will return the task again and we will end up running the same task(s) twice.

It is a rather unlikely case, since it requires being able to create a taskrun and then, a few milliseconds later, being unable to update the pipelinerun status; still, it may well happen.

I think a way forward would be to ensure that the pipelinerun status is valid.
To make sure that the status is in sync with the list of taskruns owned by the pipelinerun, we can run this check before we try and use the pipelinerun status as source of truth. To do so we only need the pipelinerun status and the object ref of the pipelinerun, so we do not need to go through the resolution and validation and conversion and re-conversion madness.

I will propose a patch that does that, keep this on hold, and once (if) the other patch is approved, it should then be fine to merge this roughly as it is today. Does that sound like a plan?
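
To make the idea concrete, a rough sketch of that status-sync check (purely illustrative; the real proposal lives in #2558, and the type and function names here are hypothetical stand-ins for the Tekton API objects):

// Rough sketch of reconciling the PipelineRun status with the TaskRuns it
// actually owns, before the status is used as the source of truth.
// All names here are illustrative, not the real Tekton API.
package statussync

// ownedTaskRun carries the only fields this check needs from a TaskRun that
// has the PipelineRun as its owner (e.g. found via an owner or label selector).
type ownedTaskRun struct {
	Name             string
	PipelineTaskName string
}

// syncStatusWithOwnedTaskRuns adds to the status map any owned TaskRun that a
// previous, partially-failed reconcile left out, so it is not orphaned.
// statusTaskRuns maps TaskRun name to pipeline task name.
func syncStatusWithOwnedTaskRuns(statusTaskRuns map[string]string, owned []ownedTaskRun) map[string]string {
	if statusTaskRuns == nil {
		statusTaskRuns = map[string]string{}
	}
	for _, tr := range owned {
		if _, seen := statusTaskRuns[tr.Name]; !seen {
			// Orphaned TaskRun: re-attach it to the PipelineRun status so later
			// steps (including cancellation) can see it.
			statusTaskRuns[tr.Name] = tr.PipelineTaskName
		}
	}
	return statusTaskRuns
}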


Member Author

@bobcatfish Since, given #2558, this patch does not change the situation compared to today, I wonder if we could go ahead with this and fix #2558 afterwards.

Member Author

I set up a unit test to verify that we do not recover orphaned taskruns, assuming that they may occur if the pipelinerun status sync fails.
This WIP fix shows how we could solve the issue: #2558

Collaborator

This means that if we fail to sync the pipelinerun status, we never go and list TaskRuns by owner, so the taskruns created during the reconcile cycle where the issue happened become orphaned. A new name is generated for those tasks, so presumably dag.GetSchedulable will return the task again and we will end up running the same task(s) twice.

yikes!! nice find :D

To make sure that the status is in sync with the list of taskruns owned by the pipelinerun, we can run this check before we try and use the pipelinerun status as source of truth

kk sounds good to me!

Thanks for looking into this and opening the other issue 🙏

@@ -48,6 +48,7 @@ var (
stats.UnitDimensionless)
)

// Recorder holds keys for Tekton metrics
Collaborator

thanks for adding docs!

@@ -160,7 +160,8 @@ func (c *Reconciler) Reconcile(ctx context.Context, key string) error {
// to update, and return the original error combined with any update error
var merr error

if pr.IsDone() {
switch {
Collaborator

nice :D
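
As a reader aid, a hedged sketch of the early-exit shape this switch introduces (helper and type names are illustrative, not the exact code in pipelinerun.go): handle a finished run first, then a cancelled run before any DAG building, and only fall through to the full reconcile otherwise.

// Illustrative only: the ordering of checks at the top of Reconcile.
// pipelineRun, reconcileFns and dispatch are hypothetical names.
package reconcilersketch

import "context"

// pipelineRun exposes just the two predicates the early checks rely on;
// the real type is the Tekton PipelineRun.
type pipelineRun interface {
	IsDone() bool
	IsCancelled() bool
}

// reconcileFns bundles illustrative handlers for the three branches.
type reconcileFns struct {
	finishDone       func(context.Context, pipelineRun) error
	cancelFromStatus func(context.Context, pipelineRun) error
	reconcile        func(context.Context, pipelineRun) error
}

// dispatch mirrors the switch added in this diff: finished runs first, then
// cancelled runs (before any DAG building or resource resolution), then the
// normal reconcile path.
func dispatch(ctx context.Context, pr pipelineRun, fns reconcileFns) error {
	switch {
	case pr.IsDone():
		return fns.finishDone(ctx, pr)
	case pr.IsCancelled():
		return fns.cancelFromStatus(ctx, pr)
	default:
		return fns.reconcile(ctx, pr)
	}
}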

@afrittoli

/hold

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 6, 2020
When a pipeline is cancelled, detect it early in the reconcile
cycle and cancel all taskruns defined in the pipelinerun status,
so that we don't spend time building the DAG and resolving all
resources.

The patch to cancel a TaskRun is the same for all TaskRuns, so
build it once and apply it to all taskruns.
@afrittoli afrittoli force-pushed the issues/2082-simplify-pipeline-cancel branch from 69541df to 6f1a6ea Compare May 7, 2020 11:27
@afrittoli

/hold cancel

@tekton-robot tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 7, 2020
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/cancel.go 72.7% 73.7% 1.0
pkg/reconciler/pipelinerun/pipelinerun.go 72.4% 72.2% -0.2

@bobcatfish

Thanks for the in-depth back and forth :D 💃

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label May 7, 2020
@tekton-robot tekton-robot merged commit c861b37 into tektoncd:master May 7, 2020
afrittoli added a commit to afrittoli/pipeline that referenced this pull request May 13, 2020
Emit additional events:
- Pipeline Start

Emit all events through the events.go module.
Align and simplify the reconcile structure to have clear points
for error handling and emitting events.

Depends on tektoncd#2543
Depends on tektoncd#2526
afrittoli added a commit to afrittoli/pipeline that referenced this pull request May 15, 2020
Emit additional events:
- Pipeline Start

Emit all events through the events.go module.
Align and simplify the reconcile structure to have clear points
for error handling and emitting events.

Depends on tektoncd#2543
Depends on tektoncd#2526