TEP-0121: Refine Retries for TaskRuns and CustomRuns by XinruZhang · Pull Request #816 · tektoncd/community

XinruZhang · 2022-09-08T15:50:15Z

Retries is designed inconsistently between TaskRun and CustomRun, see the Motivation section for details. In this TEP, we propose to align the inconsistency by

solidifying TimeOut is for each retry attempt, updating TEP-069 and docs/runs.md to reflect the alignment.
migrating Retries implementation to TaskRun reconciler.

jerop · 2022-09-08T16:06:12Z

/kind tep

lbernick · 2022-09-08T16:15:19Z

/assign

lbernick · 2022-09-12T16:08:24Z

/assign @pritidesai

XinruZhang · 2022-09-13T12:43:12Z

Pushed a new commit comparing related works.

XinruZhang · 2022-09-21T17:27:34Z

I'm moving this TEP to implementable as we've fully discussed the options in Tekton API Working Group, PTAL~

lbernick · 2022-09-22T00:31:50Z

+We propose to support `retries` in TaskRuns, and solidify that `Timeout` is for each Retry attempt in CustomRun by updating [TEP-0069](https://github.com/tektoncd/community/blob/33ca1d5254a405b1d479f2350443f6c7979a0b72/teps/0069-support-retries-for-custom-task-in-a-pipeline.md) and the Developer guidance [docs/runs.md](https://github.com/tektoncd/pipeline/blob/0dd9f9719177054bdf60928ce9b4aab4b85af5bd/docs/runs.md). 
+
+### Notes and Caveats
+One caveat is that **it's hard for PipelineRun Controller to know whether retries is implemented correctly by customers**. If customers want to use a specific CustomRun, they need to check (the documentation or even the code) whether the functionality is supported properly. 


With the proposed option, I think we should strongly consider removing retriesStatus from CustomRun status (while still keeping retries in the spec). this will allow customRun controllers to implement retries however they want and prevent the pipelinerun controller from making assumptions about how retries are implemented.

Similarly, I think this discussion is also related to this comment thread because IMO we need the retriesStatus in pipelineController to decide whether the retries is exhausted, let's move our discussions there.

sorry to continue this thread-- I think this is actually a separate point. My idea here was that the pipelinerun controller can determine whether a run is running, failed, or succeeded based on its status.conditions. If we want to hand retries responsibility entirely off to run controllers to implement however they would like, why require them to update status.retriesstatus in a certain way? The run controller can keep track of retries in status.extrafields.

No problem at all. I'm wondering how does pipelinerun controller determine the run finishes: like what if it failed (the condition is marked as failed), and would be retried in the next reconciliation loop?

I think we'd have to clarify our run controller guidance not to update the ConditionSucceeded until the run is finished running, including exhausting all retries (i.e. most runs would probably implement retries differently than how they're currently implemented on taskrun, but that's OK).

If we want to hand retries responsibility entirely off to run controllers to implement however they would like, why require them to update status.retriesstatus in a certain way? The run controller can keep track of retries in status.extrafields.

status.retriesStatus is a natural field to store status of each execution. It is part of an implicit contract between tekton controller and custom controller. status.extrafields is a very generic and not ideal for storing such key information.

What is the motivation behind this request?

I made this request because I think we want to minimize the implicit contract between the tekton controller and custom controller as much as possible. Part of the motivation for adding retries to custom tasks is to allow them to implement retries in whatever way makes the most sense, and we take away some of that flexibility if we require them to update retries status in a certain way.

Thanks for the good discussion on this topic.

I definitely agree that we need to have a clear separation between the PipelineRun controller and the TaskRun / CustomRun controller, and what @lbernick highlighted in tektoncd/pipeline#5248 - i.e. that the TaskRun reconciler will use the length of its RetriesStatus to determine what to name the pod, is not great.

About RetriesStatus specifically, the fact that RetriesStatus exists in a TaskRun is not ideal, because that's not something that the TaskRun controller should be concerned with, in the current implementation.
Similarly, we should not need RetriesStatus in CustomRuns:

if retries are implemented by the PipelineRun controller, it should keep in the PipelineRun this status info

if retries are implemented by the CustomRun controller, it's not something that the PipelineRun controller needs to know about. If the CustomRun controller needs to store data in the status, it can do so, but it doesn't have to be a standard location as the PipelineRun controller will not use it.

My preferred overall setup would be:

Focus on PipelineRun driven retries. Advantages:

Users have a consistent interface for retries

Custom task controller developers get a default implementation of retries for free (by embedding in a pipeline)

"Pipelines in pipeline" can be retried the same as the other resources

No changes to the PipelineRun API (not in the spec at least)

No changes to the TaskRun API (not in the spec at least)

Remove RetriesStatus from TaskRuns and CustomRuns

Improve the retries of TaskRuns by using separate TaskRuns for each retry

This means we can see the logs of failed retries too

Dashboard and CLI may need extra work for this

Deprecate spec.retries in Run

Thanks for the detailed comment.

For the advantages of the your poposed overall setup:

Users have a consistent interface for retries

For RuntimeObject driven retries, the interface is also consistent -- we pass PipelineTask.retries to the retries field in child runtime objects.

Custom task controller developers get a default implementation of retries for free (by embedding in a pipeline)

This harms the flexibility of Custom Task -- what if a custom task controller wants to have different retry stragety/policy, instead of just recreating on failure. Most CI/CD systems have the configurable retry stragety. See the related work.

"Pipelines in pipeline" can be retried the same as the other resources

This is an area that I'm not quite sure what needs to be done, and

I'm not sure if this discussion should be part of this TEP.

No changes to the PipelineRun API (not in the spec at least)

For RuntimeObject driven retries, no changes to the PipelineRun API neither.

No changes to the TaskRun API (not in the spec at least)

For RuntimeObject driven retries, indeed, it adds an extra API field retries, but this would be an addictive change rather than backward-incompatible change, wdyt?

I believe we should leave the retry implementation to the CustomRun controller, pipelinerun controller shouldn't touch it at all -- it shouldn't rely on the retriesStatus to decide whether the CustomRun is finished or not etc.

I totally agree that we can get rid of retriesStatus for both CustomRun and TaskRun. For TaskRun, instead of updating retriesStatus, we can create a new TaskRun in taskrun controller for each retry attempt.

Therefore, having TaskRun controller implements the retry can have most of the advantages you mentiond plus

taskrun retry on its own

we don't need to deprecate retries in Run

Thanks for all of the great discussion in the WG meeting Andrea @afrittoli!

Yeah as you mentioned, it doesn't neccesarrily harm the flexibility of CustomTask. The more accurate way to state it is it harms the UX: say a CustomRun controller implemented its own retry strategy, for the users of the Custom Task, they would be confused about which retries field to use (PipelineTask.retries or PipelineTask.TaskSpec.SpecializedRetries):

apiVersion: tekton.dev/v1beta1 kind: PipelineRun metadata: generateName: pr-custom-task- spec: pipelineSpec: tasks: - name: wait timeout: "1s" retries: 1 // The common retries field in the PipelineTask taskSpec: specialized-retries: 5 // Specialized retries field in Custom Task Spec. other-spec-fields: foobar

Besides, would you mind pointing out which field we need to update in PipelineRun API spec?

XinruZhang · 2022-09-26T17:58:10Z

We need to document clearly the end condition of a Custom Run.
- For now, we rely on both Condition and retriesStatus: IsDone() & len(retriesStatus)==retries
  - And it is not aligned with what we documented, because the pr controller thinks the run should be retried when it times out.
- An alternative: make the decision solely based on Condition.
  - consider removing retriesStatus from run status because we'd like to give customers as much flexibility as possible.
Do we want to change the documentation of v1beta1 or both v1alpha1 and v1beta1?

A piece of background information: the current implementation is different (timeout for per retry) from what documented (timeout for all retry attempts), but is aligned with TaskRun behavior.

The current end condition when a Run fails:

// isFailure returns true only if the run has failed and will not be retried.
// If the PipelineTask has a Matrix, isFailure returns true if any run has failed (no remaining retries)
// and all other runs are done.
func (t ResolvedPipelineTask) isFailure() bool {
	//...
	return isDone && c.IsFalse() && !t.hasRemainingRetries()
}

// IsDone returns true if the Run's status indicates that it is done.
func (r *Run) IsDone() bool {
 return !r.Status.GetCondition(apis.ConditionSucceeded).IsUnknown()
}

We propose to not rely on the retriesStatus as part of the end condition, instead, instruct CustomRun users that DO NOT update run condition to ConditionSucceeded until all retries be exhausted.

@lbernick, @pritidesai , @abayer wdyt?

cc @dibyom

lbernick · 2022-09-26T18:00:05Z

thanks Xinru for this summary! I think we should remove status.retriesStatus from v1beta1.CustomRun instead of only changing the docs and PR reconciler code.

XinruZhang · 2022-09-26T18:06:59Z

thanks Xinru for this summary! I think we should remove status.retriesStatus from v1beta1.CustomRun instead of only changing the docs and PR reconciler code.

Thanks @lbernick! Right, updated the sumary to reflect the proposal.

pritidesai · 2022-09-28T02:34:07Z

+
+### Goals
+1. Align the behavior of `Retries` between TaskRun and CustomRun.
+2. Retry TaskRun independently.


What is the use case of being able to retry taskRun independently? What is the driving factor for creating just a taskRun outside of pipeline? I am trying to recollect and gather any such use case. I have hardly seen a requirement of creating just a taskRun. I can think that it could be beneficial for testing purpose, but if I am eventually making that task part of a pipeline, its easier to create a pipeline to begin with. Do we know if our catalog is using taskRun for their testing purpose?

Thanks for the comment Priti!

We do have standalone TaskRuns use cases:

https://github.com/tektoncd/catalog/tree/main/task/send-to-webhook-slack/0.1

https://github.com/tektoncd/catalog/tree/main/task/sendmail/0.1

It sounds weird to me If we believe all custom runs need to support the retries field while we don't necessarily need to support retries for all Task Runs.

IMHO the reason why we didn't see a lot of use cases (i.e. feature request iessues) is creating feature requests is not an effortless action. People may not bother to create feature requests for stand-alone TaskRuns potentially because they can actually write scripts to retry them. By supporting retries natively, we can group their retry attempts all together, saving them times on writing scripts etc.

Transient errors are everywhere especially in the Cloud Environment, services can be down for a short period of time making the entire TaskRun fails. I truly believe retry is something necessary for TaskRuns to avoid flaky errors.

https://learn.microsoft.com/en-us/azure/architecture/best-practices/transient-faults#why-do-transient-faults-occur-in-the-cloud explained how common the transient errors are in the Cloud env. It deepened my understanding about the nature of Cloud Services and how to make things more stable. Hope it to be helpful.

@XinruZhang @pritidesai we do have use cases where we run tasks by themselves in Tekton CI, typically to interact with GitHub, however, two things are worth noticing:

if a standalone task required retries, that would be possible today by using a single-task pipeline

we are in the process of replacing those with custom tasks because when the only thing done by a task is calling out to an API, running a Pod introduces overhead, and a custom task controller can handle the failure/retry logic much better anyways

IMHO the reason why we didn't see a lot of use cases (i.e. feature request iessues) is creating feature requests is not an effortless action. People may not bother to create feature requests for stand-alone TaskRuns potentially because they can actually write scripts to retry them. By supporting retries natively, we can group their retry attempts all together, saving them times on writing scripts etc.

I'm not sure I fully agree with this. Retrying a multi-step Task is not trivial.
Creating a feature request is not effortless, but I don't think the bar is too high if the feature solves an actual pain point.

Thanks Andrea for the use cases. That's true, users can always write a sinle-task pipeline in order to retry a Task. Therefore making TaskRun retriable on its own should not be the reason to introduce the new retries field for TaskRun.

pritidesai · 2022-09-28T17:16:20Z

+| **Attempts amount** |supported|supported|supported|supported|supported|supported|
+| **Timeout for each attempt** |supported|supported|supported|[supported](https://docs.gitlab.com/ee/ci/yaml/#retrywhen)|supported|supported|
+| **Timeout for all attempts** |[supported](https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepstimeout-minutes)|[supported](https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepstimeout-minutes)|[?](https://stackoverflow.com/questions/71424429/argo-workflowtemplate-timeout)|-|-|-|
+


Thank you @XinruZhang for this detailed comparison with an industry standard.

Tekton pipelines do support retries when a task is part of a pipeline. The task can be a native Tekton task or a custom task.

This table does not represent apples to apples comparison. For example, Argo Workflow is not equivalent to Tekton task but its more or less similar to pipeline or even Tekton workflow. This proposal is focused on independent taskRun and customRun CRDs. We need to identify and document the list of resources in the industry which are equivalent to a task or taskRun in Tekton.

This table is misleading and may confuse users while comparing Tekton with the other CI/CD systems.

@jerop and I had documented this kind of comparison last year (which might be outdated at this point), but something for your reference:

Thanks for the review @pritidesai and the comparison work you and Jerop did!

This related work section is written for comparing the general retry strategy (instead of demonstrating that retry is a neccesary functionality in TaskRun), especially about if we should set Timeout for all retry attempts or each retry attempt. I updated the Related Work section, hopefuly the confusion is alleviated.

Thanks for the concept comparison! It seems like there are no concepts equivalent to Task, becuase in Tekton, PipelineRun, TaskRun and CustomRun can all run on their own, while in other CI/CD systems, they only support the top level Pipeline/Workflow to run on its own. So it's hard for us to find an industry standard to follow to decide if we want to support retry in TaskRun.

XinruZhang · 2022-10-17T18:03:55Z

Thanks @pritidesai @jerop @lbernick for your detailed review, appreciate it a lot! ❤️
I Pushed a new commit to reflect what we've discussed in the comment thread, PTAL :)

pritidesai · 2022-10-19T23:43:18Z

+- 1.b: Update `retriesStatus` for each retry attempt for `TaskRun`, deprecate `retriesStatus` for `CustomRun`
+  - No implementation restrictions of `retriesStatus` for `CustomRun`
+  - Need to implement a strategy for clients to get the previous pod and read its logs.
+- 1.c: Deprecate `retriesStatus` for both `TaskRun` and `CustomRun`, create a new `TaskRun` for each retry attempt.


For independent taskRun, the users might have to query metadata (labels and annotations) to get the list of taskRun CRDs for each retry attempt. Let's say there is a taskRun which was retried five times, each five taskRuns and pod will have a label saying it was created for a particular instance.

For the taskRun created for a pipelineTask, how do we retrieve the collection of taskRuns for each attempt? Using the same metadata?

taskRun with retriesStatus works as a master with a list of children as each attempt. It hides complexity of querying a list of taskRuns for a given pipelineTask and checking the status of each attempted taskRun.

Summary our recent dicussion offline around option 1.c, the details are attached in Appendix if you are interested.

TaskRun reconciler creates a new TaskRun for each retry attempt.

Creates a new field RetryAttempts in TaskRun.Status to record the retry history.

Creates two new labels:

tekton.dev/retry-count: <retry number>, if the number is 0, the TaskRun is not initiated for a retry.

tekton.dev/retry-parent: <parent taskrun name>, record the parent TaskRun of a retry attempt TaskRun.

We need to prototype this solution before moving this TEP to implementable.

Thanks for all of the inputs and suggestions @pritidesai !

In [TEP-0121][tep-0121], we are redesigining `Retries` in `TaskRuns` and `CustomRuns`. In this change, we propose excluding `retries` and `retriesStatus` in the initial release of `CustomRuns`. We will figure out a way forward in [TEP-0121][tep-0121], which could be reintroducing the fields implementing replacement fields. This allows us to unblock release of Custom Tasks Beta and Tekton Pipelines V1. /kind tep [tep-0121]: tektoncd#816

XinruZhang · 2022-10-20T20:45:28Z

/hold Talking with @pritidesai offline about some important details of Option 1, related links:

Will update the TEP soon.

In [TEP-0121][tep-0121], we are redesigining `Retries` in `TaskRuns` and `CustomRuns`. In this change, we propose excluding `retries` and `retriesStatus` in the initial release of `CustomRuns`. We will figure out a way forward in [TEP-0121][tep-0121], which could be reintroducing the fields implementing replacement fields. This allows us to unblock release of Custom Tasks Beta and Tekton Pipelines V1. /kind tep [tep-0121]: #816

ScrapCodes · 2022-10-26T17:27:25Z

Hello folks, I am just trying to catch up with all the updates and might have missed some. Do we ensure that a custom task treats that timeout is for each retry and how?

XinruZhang · 2022-10-26T17:50:46Z

Hi @ScrapCodes, thanks for the comment! Yes, that's a great question. I don't think we have a way to enforce that, this is more of a contract we build with our customers: i.e. if you don't implement the custom task controller in this way, the pipeline controller may not work as intended.

XinruZhang · 2022-10-31T20:16:45Z

We(me and Prashant) have reached agreement in an offline discussion on last Friday, see the meeting note. Thanks for the great discussion! @ScrapCodes

Propose to refine Retries for TaskRun and CustomRun

pritidesai · 2022-11-03T19:23:56Z

thanks @XinruZhang for all your efforts and patience 🤗

/approve

tekton-robot · 2022-11-03T19:24:04Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jerop, lbernick, pritidesai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~teps/OWNERS~~ [jerop,lbernick,pritidesai]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

dibyom · 2022-11-03T19:46:17Z

LGTM since have all the approvals now

/lgtm

tekton-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 8, 2022

tekton-robot requested review from a-roberts and mnuttall September 8, 2022 15:50

tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 8, 2022

XinruZhang mentioned this pull request Sep 8, 2022

TEP-0114: Clarify the Behavior of Custom Task Retries tektoncd/pipeline#5393

Closed

7 tasks

tekton-robot added the kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). label Sep 8, 2022

tekton-robot assigned lbernick Sep 8, 2022

jerop self-assigned this Sep 8, 2022

tekton-robot assigned pritidesai Sep 12, 2022

jerop reviewed Sep 14, 2022

View reviewed changes

Comment thread teps/0121-unified-retries-for-runtime-objects.md Outdated

Comment thread teps/0121-unified-retries-for-runtime-objects.md Outdated

Comment thread teps/0121-unified-retries-for-runtime-objects.md Outdated

dibyom reviewed Sep 15, 2022

View reviewed changes

Comment thread teps/0121-unified-retries-for-runtime-objects.md Outdated

dibyom reviewed Sep 15, 2022

View reviewed changes

Comment thread teps/0121-unified-retries-for-runtime-objects.md Outdated

XinruZhang marked this pull request as ready for review September 20, 2022 18:01

tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 20, 2022

tekton-robot requested review from chmouel and sthaha September 20, 2022 18:01

lbernick reviewed Sep 21, 2022

View reviewed changes

lbernick reviewed Sep 22, 2022

View reviewed changes

pritidesai reviewed Sep 28, 2022

View reviewed changes

afrittoli mentioned this pull request Oct 3, 2022

TaskRun retries: Improve separation of concerns between PipelineRun and TaskRun reconciler tektoncd/pipeline#5248

Closed

XinruZhang changed the title ~~TEP-0121: Unified Retries for Runtime Objects~~ TEP-0121: Refine Retries for TaskRuns and CustomRuns Oct 3, 2022

tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 14, 2022

lbernick reviewed Oct 17, 2022

View reviewed changes

Comment thread teps/0121-refine-retries-for-taskruns-and-customruns.md Outdated

pritidesai reviewed Oct 19, 2022

View reviewed changes

jerop mentioned this pull request Oct 20, 2022

TEP-0114: Reconciler, event, config, webhook support for CustomRuns tektoncd/pipeline#5662

Merged

7 tasks

jerop mentioned this pull request Oct 20, 2022

TEP-0114: Custom Tasks Beta - Retries #862

Merged

tekton-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Oct 20, 2022

XinruZhang mentioned this pull request Oct 27, 2022

TEP-0114: Support Both V1alpha1 and V1beta1 for Custom Task in the Initial Release #867

Merged

TEP-0121: Refine Retries for TaskRuns and CustomRuns

b5dac39

Propose to refine Retries for TaskRun and CustomRun

tekton-robot assigned dibyom Nov 3, 2022

tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 3, 2022

tekton-robot merged commit e328094 into tektoncd:main Nov 3, 2022

Conversation

XinruZhang commented Sep 8, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jerop commented Sep 8, 2022

Uh oh!

lbernick commented Sep 8, 2022

Uh oh!

lbernick commented Sep 12, 2022

Uh oh!

XinruZhang commented Sep 13, 2022

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

XinruZhang commented Sep 21, 2022

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lbernick Sep 22, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lbernick Sep 22, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lbernick Sep 29, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

afrittoli Oct 3, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

XinruZhang Oct 3, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

XinruZhang commented Sep 26, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lbernick commented Sep 26, 2022

Uh oh!

XinruZhang commented Sep 26, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

XinruZhang commented Oct 17, 2022

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

XinruZhang commented Sep 8, 2022 •

edited

Loading

lbernick Sep 22, 2022 •

edited

Loading

lbernick Sep 22, 2022 •

edited

Loading

lbernick Sep 29, 2022 •

edited

Loading

afrittoli Oct 3, 2022 •

edited

Loading

XinruZhang Oct 3, 2022 •

edited

Loading

XinruZhang commented Sep 26, 2022 •

edited

Loading