Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEP-0121: Retries - Move to Implementable #879

Merged
merged 1 commit into from
Nov 15, 2022
Merged

TEP-0121: Retries - Move to Implementable #879

merged 1 commit into from
Nov 15, 2022

Conversation

XinruZhang
Copy link
Member

This PR proposes a solution for TEP-0121 and moves it to implementable. PTAL.

Signed-Off-By: @jerop @XinruZhang
cc @tektoncd/core-maintainers

@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 11, 2022
@XinruZhang
Copy link
Member Author

/kind tep

@tekton-robot tekton-robot added the kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). label Nov 11, 2022
Copy link
Member

@jerop jerop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/assign

@XinruZhang please update readme.md with the updated status and date (that's why linting is failing) and update the table of contents too

@jerop
Copy link
Member

jerop commented Nov 11, 2022

@jerop jerop changed the title TEP-0121: Move to Implementable TEP-0121: Retries - Move to Implementable Nov 11, 2022
@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 11, 2022
Copy link
Member

@afrittoli afrittoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this!

As I commented previously, personally I'd prefer the option where retries are managed by the parent controller - it would give custom runs a default retry approach available out of the box.

Nonetheless I think this proposal is sound and a valid alternative, and it has the very nice benefit of maintaining the same custom task API we have today.

We should get +1 from @lbernick @pritidesai and @jerop before we merge this.

/approve

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 14, 2022
@afrittoli
Copy link
Member

/assign @lbernick
/assign @pritidesai

Copy link
Member

@jerop jerop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly support this change because it fixes Conditions.Succeeded - going forward, setting Conditions.Succeeded to "True" or "False" is the final status. Removing the caveat of remaining retries in "False" case makes the status reporting clearer.

/approve

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do like this approach 🙃

@afrittoli
Copy link
Member

afrittoli commented Nov 14, 2022

I think I had an idea that would combine the best of both worlds (Pipeline driven and Custom/TaskRun driven retries)!
In a nutshell, we define a way for the Custom/TaskRun controller to signal back to the PipelineRun controller.
It would work as follows:

  • When pipelineTask retries are specified, they are passed down to the CustomRun, like today
  • If the CustomRun controller implements retries, it will take care of setting the Succeeded condition to false only when all retries failed. The custom run controller MUST include retries data in the status, at a minimum something like "retriesHandled: true", probably something better :)
  • If the CustomRun controller implements retries, it will set the Succeeded condition to false after the first failed attempt
  • The PipelineRun controller watches for the CustomRun status. When it reaches "failed" it looks for retry data in the status. If none is found it handles retries
  • This can be extended to TaskRuns as well, by adding retries to the TaskRun

This approach combines the best of both worlds - since it provides a default implementation of retries, but it allows custom controllers and taskrun controller to take over control of retries if they want to, all with no changes in the pipeline definition from a user point of view, and no changes on the retries field in CustomRun.
The only change would be for CustomRun controller that do implement retries, they would have to signal this in the status.

This approach also means that the we could start with having no taskrun controller implementation in the beginning (like today). Once we implement it would take over the pipelinerun implementation, and provide out of the box efficient matrix retries.

@XinruZhang @lbernick @jerop @pritidesai wdyt?

@XinruZhang
Copy link
Member Author

XinruZhang commented Nov 14, 2022

Thanks for the comment Andrea @afrittoli !

I understand that implementing Retries is not Trivia for custom task authors. This is a very valuable conversation. I'm thinking of the possibility of keeping this proposal as it is and move it forward. In the meantime, we can create an issue to talk about bringing built-in Retries support for CustomRun.

We are sure of that this current proposal definitely does not block the built-in Retries support for CustomRun, and in the meantime, we can unblock Custom Task Beta promotion and V1 CRD release by sorting Retries out.

wdyt? cc @lbernick @jerop @pritidesai @dibyom

@XinruZhang
Copy link
Member Author

XinruZhang commented Nov 14, 2022

I created an issue: tektoncd/pipeline#5751 to track

@jerop
Copy link
Member

jerop commented Nov 14, 2022

At API WG today, we agreed to move forward with the design and explore the built-in implementation that @afrittoli suggested in future work - tektoncd/pipeline#5751

@tekton-robot tekton-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 14, 2022
@lbernick
Copy link
Member

I agree with Andrea that if I were designing this functionality from scratch, I'd prefer an approach where the pipelinerun controller creates a new instance of each child object for each retry; however this is a reasonable proposal to move forward with. Thanks for documenting that this blocks our v1 software release rather than our v1 api version of TaskRun.

/approve

@tekton-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli, jerop, lbernick, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Propose a solution and move the tep to implementable
@XinruZhang
Copy link
Member Author

Updated the TEP per Priti's suggestion! Thanks for those suggestions @pritidesai!

@pritidesai
Copy link
Member

Thank you @XinruZhang 👍

The changes look great so far. Nothing blocking here but an important detail to be designed and included.

when does a taskRun controller or customRun controller triggers a retry of an existing taskRun or customRun? i.e. when is a next attempt scheduled?

Before TEP-0121:

Pipelinerun controller creates a taskRun for a given pipelineTask. taskRun reconciler marks a taskRun as failed (initialize a condition of type Succeeded and set the Status to False) in case of step failure. Pipelinerun controller checks for a taskRun if it has been declared failed to decide whether to schedule a next attempt. If a pipelineTask is ready to retry, pipelinerun controller, archives its existing status into retryStatus and clears the existing status.

After TEP-0121:

Pipelinerun controller creates a taskRun for a given pipelineTask. taskRun reconciler marks a taskRun as running (initialize a condition of type Succeeded, set the Status to Unknown, and the reason to FailedWithNRetries) in case of a step failure. TaskRun controller checks if the taskRun failed with more than 0 number of retries left (succeeded condition status set to unknown and reason is set to FailedWithNRetries and len(retriesStatus) == taskRunSpec.retries). If a taskRun is ready to retry, taskRun controller archives its existing status into retryStatus and clears the existing reason.

Thoughts?

I am merging this TEP after leaving this comment and with enough approvals.

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 15, 2022
@tekton-robot tekton-robot merged commit 549f15a into tektoncd:main Nov 15, 2022
@XinruZhang
Copy link
Member Author

XinruZhang commented Nov 15, 2022

@pritidesai Thanks for the comment Priti, I'm writing POCs for this implementation detail. Meanwhile, I think @afrittoli made a great point about on demand retry poposed in TEP-0123 if we go with the proposed solution. Need to think more about it.

@pritidesai
Copy link
Member

  • When pipelineTask retries are specified, they are passed down to the CustomRun, like today
  • If the CustomRun controller implements retries, it will take care of setting the Succeeded condition to false only when all retries failed. The custom run controller MUST include retries data in the status, at a minimum something like "retriesHandled: true", probably something better :)

👍

  • If the CustomRun controller implements retries, it will set the Succeeded condition to false after the first failed attempt

I think this is missing not? 🤔 If the CustomRun controller does not implement retries ... otherwise it contradicts with second bullet.

  • The PipelineRun controller watches for the CustomRun status. When it reaches "failed" it looks for retry data in the status. If none is found it handles retries
  • This can be extended to TaskRuns as well, by adding retries to the TaskRun

Please refer to the my comment here for the details on the status. How status is set to running with a particular reason while all retries are being exhausted to identify when to initiate a next attempt and at the same time making sure pipelineRun controller waits for the final outcome of the taskRun.

@XinruZhang
Copy link
Member Author

Thanks @pritidesai and @afrittoli ❤️

I wrote a POC (tektoncd/pipeline#5766) of implementing Retries in TaskRun. From the implementation perspective, we are able to delegate the pipeline-driven retry strategy to respective controllers in EmitOnRetry.

This implementation is slightly different from @pritidesai described, but the main ideas match: we are able to control "when to initiate a next attempt and at the same time making sure pipelineRun controller waits for the final outcome of the taskRun."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: UnAssigned
Development

Successfully merging this pull request may close these issues.

7 participants