TEP-0121: Retries - Move to Implementable #879

XinruZhang · 2022-11-11T21:19:14Z

This PR proposes a solution for TEP-0121 and moves it to implementable. PTAL.

Signed-Off-By: @jerop @XinruZhang
cc @tektoncd/core-maintainers

XinruZhang · 2022-11-11T21:19:59Z

/kind tep

jerop

/assign

@XinruZhang please update readme.md with the updated status and date (that's why linting is failing) and update the table of contents too

jerop · 2022-11-11T21:33:36Z

cc @afrittoli @abayer @dibyom @lbernick @pritidesai @vdemeester

teps/0121-refine-retries-for-taskruns-and-customruns.md

afrittoli

Thank you for this!

As I commented previously, personally I'd prefer the option where retries are managed by the parent controller - it would give custom runs a default retry approach available out of the box.

Nonetheless I think this proposal is sound and a valid alternative, and it has the very nice benefit of maintaining the same custom task API we have today.

We should get +1 from @lbernick @pritidesai and @jerop before we merge this.

/approve

afrittoli · 2022-11-14T14:15:58Z

/assign @lbernick
/assign @pritidesai

jerop

I strongly support this change because it fixes Conditions.Succeeded: going forward, setting Conditions.Succeeded to "True" or "False" is the final status. Removing the caveat of remaining retries in the "False" case makes the status reporting clearer.

/approve

vdemeester

I do like this approach 🙃

afrittoli · 2022-11-14T15:05:07Z

I think I had an idea that would combine the best of both worlds (Pipeline driven and Custom/TaskRun driven retries)!
In a nutshell, we define a way for the Custom/TaskRun controller to signal back to the PipelineRun controller.
It would work as follows:

- When pipelineTask retries are specified, they are passed down to the CustomRun, like today
- If the CustomRun controller implements retries, it will take care of setting the Succeeded condition to false only when all retries failed. The custom run controller MUST include retries data in the status, at a minimum something like "retriesHandled: true", probably something better :)
- If the CustomRun controller implements retries, it will set the Succeeded condition to false after the first failed attempt
- The PipelineRun controller watches for the CustomRun status. When it reaches "failed" it looks for retry data in the status. If none is found it handles retries
- This can be extended to TaskRuns as well, by adding retries to the TaskRun

This approach combines the best of both worlds: it provides a default implementation of retries, but it allows custom controllers and the TaskRun controller to take over control of retries if they want to, all with no changes in the pipeline definition from a user point of view, and no changes to the retries field in CustomRun.
The only change would be for CustomRun controllers that do implement retries: they would have to signal this in the status.

This approach also means that we could start with no TaskRun controller implementation in the beginning (like today). Once we implement it, it would take over from the PipelineRun implementation and provide efficient matrix retries out of the box.
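
A minimal sketch of the PipelineRun-side decision described above, in Python for brevity. The `retries_handled` field mirrors the "retriesHandled: true" idea from the list; the class, function, and parameter names are hypothetical and not part of any Tekton API.

```python
from dataclasses import dataclass


@dataclass
class CustomRunStatus:
    # Value of the Succeeded condition: "True", "False", or "Unknown".
    succeeded: str = "Unknown"
    # Hypothetical signal set by a CustomRun controller that implements
    # retries itself ("retriesHandled: true" in the proposal).
    retries_handled: bool = False


def pipelinerun_should_retry(status: CustomRunStatus,
                             retries_requested: int,
                             attempts_made: int) -> bool:
    """Return True if the PipelineRun controller should start another attempt.

    attempts_made counts every attempt so far, including the failed one.
    """
    if status.succeeded != "False":
        # Still running or succeeded: nothing to retry.
        return False
    if status.retries_handled:
        # The child controller already exhausted its own retries,
        # so this failure is final.
        return False
    # Default (parent-driven) retries: one initial attempt plus N retries.
    return attempts_made < retries_requested + 1
```

With `retries_requested=2`, a first failure without the signal yields another attempt, while the same failure with `retries_handled=True` is treated as final.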

@XinruZhang @lbernick @jerop @pritidesai wdyt?

XinruZhang · 2022-11-14T15:25:26Z

Thanks for the comment Andrea @afrittoli !

I understand that implementing Retries is not trivial for custom task authors. This is a very valuable conversation. I'm thinking of keeping this proposal as it is and moving it forward. In the meantime, we can create an issue to talk about bringing built-in Retries support to CustomRun.

We are confident that the current proposal does not block built-in Retries support for CustomRun, and in the meantime, we can unblock Custom Task Beta promotion and the V1 CRD release by sorting Retries out.

wdyt? cc @lbernick @jerop @pritidesai @dibyom

XinruZhang · 2022-11-14T17:17:58Z

I created an issue to track this: tektoncd/pipeline#5751

teps/0121-refine-retries-for-taskruns-and-customruns.md

jerop · 2022-11-14T19:20:50Z

At API WG today, we agreed to move forward with the design and explore the built-in implementation that @afrittoli suggested in future work - tektoncd/pipeline#5751

lbernick · 2022-11-14T20:57:46Z

I agree with Andrea that if I were designing this functionality from scratch, I'd prefer an approach where the pipelinerun controller creates a new instance of each child object for each retry; however this is a reasonable proposal to move forward with. Thanks for documenting that this blocks our v1 software release rather than our v1 api version of TaskRun.

/approve

tekton-robot · 2022-11-14T20:57:50Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli, jerop, lbernick, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~teps/OWNERS~~ [afrittoli,jerop,lbernick,vdemeester]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Propose a solution and move the tep to implementable

XinruZhang · 2022-11-14T22:56:34Z

Updated the TEP per Priti's suggestion! Thanks for those suggestions @pritidesai!

pritidesai · 2022-11-15T05:10:23Z

Thank you @XinruZhang 👍

The changes look great so far. Nothing blocking here, but an important detail still needs to be designed and included.

When does a taskRun controller or customRun controller trigger a retry of an existing taskRun or customRun, i.e. when is the next attempt scheduled?

Before TEP-0121:

The pipelineRun controller creates a taskRun for a given pipelineTask. The taskRun reconciler marks the taskRun as failed (initializes a condition of type Succeeded and sets its status to False) in case of a step failure. The pipelineRun controller checks whether the taskRun has been declared failed to decide whether to schedule a next attempt. If a pipelineTask is ready to retry, the pipelineRun controller archives its existing status into retriesStatus and clears the existing status.
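
A rough sketch of this pre-TEP-0121 flow, in Python for brevity. The status model is a simplified stand-in for the real TaskRun status, "clearing" is modelled as resetting the condition to Unknown, and the function name is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TaskRunStatus:
    succeeded: str = "Unknown"   # Succeeded condition: "True", "False", "Unknown"
    reason: str = ""
    retries_status: list = field(default_factory=list)  # archived attempts


def pipelinerun_retry_before_tep0121(status: TaskRunStatus, retries: int) -> bool:
    """PipelineRun controller side: retry a failed taskRun if retries remain."""
    if status.succeeded != "False":
        return False  # not declared failed, nothing to do
    if len(status.retries_status) >= retries:
        return False  # retries exhausted, the failure stands
    # Archive the failed attempt, then clear the live status so the
    # taskRun reconciler runs the next attempt.
    status.retries_status.append(
        {"succeeded": status.succeeded, "reason": status.reason})
    status.succeeded = "Unknown"
    status.reason = ""
    return True
```

Note that in this flow a Succeeded status of False is not necessarily final, because the parent controller may still reset it.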

After TEP-0121:

The pipelineRun controller creates a taskRun for a given pipelineTask. The taskRun reconciler marks the taskRun as running (initializes a condition of type Succeeded, sets its status to Unknown, and sets the reason to FailedWithNRetries) in case of a step failure. The taskRun controller checks whether the taskRun failed with more than 0 retries left (Succeeded condition status set to Unknown, reason set to FailedWithNRetries, and len(retriesStatus) < taskRunSpec.retries). If a taskRun is ready to retry, the taskRun controller archives its existing status into retriesStatus and clears the existing reason.
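
A rough sketch of this post-TEP-0121 flow, in Python for brevity. It assumes that retries remain while len(retriesStatus) is below the configured retries, uses the reason string exactly as written in this comment, and invents a plain "Failed" reason for the final failure; none of the names below come from the Tekton code base.

```python
from dataclasses import dataclass, field

RETRY_REASON = "FailedWithNRetries"  # reason string as written in this comment


@dataclass
class TaskRunStatus:
    succeeded: str = "Unknown"   # Succeeded condition: "True", "False", "Unknown"
    reason: str = ""
    retries_status: list = field(default_factory=list)  # archived attempts


def on_step_failure(status: TaskRunStatus, retries: int) -> None:
    """TaskRun reconciler: record a step failure."""
    if len(status.retries_status) < retries:
        # Retries remain: stay "running" so the parent keeps waiting.
        status.succeeded, status.reason = "Unknown", RETRY_REASON
    else:
        # No retries left: False is now the final status.
        # "Failed" is a placeholder reason (assumption).
        status.succeeded, status.reason = "False", "Failed"


def maybe_retry(status: TaskRunStatus, retries: int) -> bool:
    """TaskRun controller: archive the attempt and start the next one."""
    ready = (status.succeeded == "Unknown"
             and status.reason == RETRY_REASON
             and len(status.retries_status) < retries)
    if ready:
        status.retries_status.append(
            {"succeeded": status.succeeded, "reason": status.reason})
        status.reason = ""  # clear the reason; the next attempt is running
    return ready
```

The pipelineRun controller only needs to wait for Succeeded to become True or False, since intermediate failures are reported as Unknown.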

Thoughts?

I am merging this TEP after leaving this comment and with enough approvals.

/lgtm

XinruZhang · 2022-11-15T16:29:47Z

@pritidesai Thanks for the comment Priti, I'm writing POCs for this implementation detail. Meanwhile, I think @afrittoli made a great point about the on-demand retry proposed in TEP-0123 if we go with the proposed solution. I need to think more about it.

pritidesai · 2022-11-15T22:47:21Z

> When pipelineTask retries are specified, they are passed down to the CustomRun, like today

> If the CustomRun controller implements retries, it will take care of setting the Succeeded condition to false only when all retries failed. The custom run controller MUST include retries data in the status, at a minimum something like "retriesHandled: true", probably something better :)

👍

> If the CustomRun controller implements retries, it will set the Succeeded condition to false after the first failed attempt

I think this is missing a "not"? 🤔 It should read "If the CustomRun controller does not implement retries ..."; otherwise it contradicts the second bullet.

> The PipelineRun controller watches for the CustomRun status. When it reaches "failed" it looks for retry data in the status. If none is found it handles retries

> This can be extended to TaskRuns as well, by adding retries to the TaskRun

Please refer to my comment here for details on the status: how the status is set to running with a particular reason while retries are being exhausted, to identify when to initiate a next attempt and at the same time making sure pipelineRun controller waits for the final outcome of the taskRun.

XinruZhang · 2022-11-16T16:04:03Z

Thanks @pritidesai and @afrittoli ❤️

I wrote a POC (tektoncd/pipeline#5766) of implementing Retries in TaskRun. From the implementation perspective, we are able to delegate the pipeline-driven retry strategy to respective controllers in EmitOnRetry.

This implementation is slightly different from what @pritidesai described, but the main ideas match: we are able to control "when to initiate a next attempt and at the same time making sure pipelineRun controller waits for the final outcome of the taskRun."

tekton-robot requested review from piyush-garg and wlynch November 11, 2022 21:19

tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 11, 2022

tekton-robot added the kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). label Nov 11, 2022

jerop reviewed Nov 11, 2022

tekton-robot assigned jerop Nov 11, 2022

jerop changed the title ~~TEP-0121: Move to Implementable~~ TEP-0121: Retries - Move to Implementable Nov 11, 2022

tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 11, 2022

afrittoli reviewed Nov 14, 2022

teps/0121-refine-retries-for-taskruns-and-customruns.md (outdated, resolved)

afrittoli reviewed Nov 14, 2022

tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 14, 2022

tekton-robot assigned lbernick and pritidesai Nov 14, 2022

afrittoli self-assigned this Nov 14, 2022

jerop reviewed Nov 14, 2022

lbernick mentioned this pull request Nov 14, 2022

[TEP-0121] Add examples for custom tasks #878

Closed

vdemeester approved these changes Nov 14, 2022

lbernick reviewed Nov 14, 2022

teps/0121-refine-retries-for-taskruns-and-customruns.md (outdated, resolved)

tekton-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 14, 2022

XinruZhang mentioned this pull request Nov 14, 2022

Built-in Retries support for CustomRun tektoncd/pipeline#5751

Closed

TEP-0121: Move to Implementable

9bf1ecb

Propose a solution and move the tep to implementable

tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 15, 2022

tekton-robot merged commit 549f15a into tektoncd:main Nov 15, 2022

afrittoli mentioned this pull request Nov 15, 2022

TEP-0123 - proposal to specify on-demand-retry in a pipelineTask #823

Merged

pritidesai mentioned this pull request Nov 18, 2022

TEP-0121 - add status transition details #882

Closed
