Recreate pod on TaskRun's pod deletion #758

Merged: 1 commit, Apr 24, 2019

dicarlo2 (Contributor) commented Apr 13, 2019

Changes

A TaskRun's pod may be deleted either manually by the user or due to system constraints (e.g. node recreation). This change modifies the TaskRun reconciliation logic to recreate pods that are not found.

Fixes #618
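
As a rough illustration of the behavior described above, here is a minimal sketch in Go. It is not the code from this PR: `PodStore`, `memStore`, `reconcilePod`, and the simplified `TaskRun` and `Pod` types are hypothetical stand-ins, and `ErrNotFound` plays the role that a Kubernetes "not found" API error (typically checked with `IsNotFound` from `k8s.io/apimachinery/pkg/api/errors`) would play in a real controller.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound stands in for a Kubernetes "not found" API error (assumption).
var ErrNotFound = errors.New("pod not found")

// Pod and TaskRun are simplified stand-ins, not the Kubernetes or Tekton types.
type Pod struct{ Name string }

type TaskRun struct {
	Name    string
	PodName string // pod last created for this run; empty if none yet
}

// PodStore is a hypothetical, minimal view of the pod API.
type PodStore interface {
	Get(name string) (*Pod, error)
	Create(taskRunName string) (*Pod, error)
}

// memStore is an in-memory PodStore used only for this illustration.
type memStore struct {
	pods map[string]*Pod
	next int
}

func (s *memStore) Get(name string) (*Pod, error) {
	if p, ok := s.pods[name]; ok {
		return p, nil
	}
	return nil, ErrNotFound
}

func (s *memStore) Create(taskRunName string) (*Pod, error) {
	s.next++
	p := &Pod{Name: fmt.Sprintf("%s-pod-%d", taskRunName, s.next)}
	s.pods[p.Name] = p
	return p, nil
}

// reconcilePod returns the TaskRun's pod. It creates a new pod when none was
// recorded yet, or when the recorded pod can no longer be found.
func reconcilePod(tr *TaskRun, store PodStore) (*Pod, error) {
	if tr.PodName != "" {
		pod, err := store.Get(tr.PodName)
		if err == nil {
			return pod, nil
		}
		if !errors.Is(err, ErrNotFound) {
			return nil, err // unrelated API error: surface it, do not recreate
		}
		// The recorded pod is gone: continue below and recreate it.
	}
	pod, err := store.Create(tr.Name)
	if err != nil {
		return nil, err
	}
	tr.PodName = pod.Name
	return pod, nil
}

func main() {
	store := &memStore{pods: map[string]*Pod{}}
	tr := &TaskRun{Name: "build"}

	first, _ := reconcilePod(tr, store)
	delete(store.pods, first.Name) // simulate manual deletion or node loss
	second, _ := reconcilePod(tr, store)

	fmt.Println(first.Name, second.Name)
}
```

The sketch only recreates on a "not found" result; any other lookup error is returned to the caller so that a transient API failure does not spawn a second pod.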

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

See the contribution guide
for more details.

Release Notes

Recreate TaskRun pods on deletion.

googlebot added the "cla: yes" label (Trying to make the CLA bot happy with ppl from different companies work on one commit) on Apr 13, 2019
tekton-robot added the "size/M" (Denotes a PR that changes 30-99 lines, ignoring generated files.) and "needs-ok-to-test" (Indicates a PR that requires an org member to verify it is safe to test.) labels on Apr 13, 2019

abayer (Contributor) commented Apr 14, 2019

/ok-to-test

tekton-robot removed the "needs-ok-to-test" label (Indicates a PR that requires an org member to verify it is safe to test.) on Apr 14, 2019

bobcatfish (Collaborator) left a comment

Nice, thanks for catching and fixing this @dicarlo2 !!

I have a request: now that the logic to get a TaskRun's associated pod is getting more complicated, can we move it into a different package with its own unit tests (one that doesn't depend on the reconciler)? ❤️
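
One way such a split could look is sketched below. This is only an illustration of the idea, not the refactoring that landed in this PR: `GetPodFunc`, `LookupPod`, and `errNotFound` are hypothetical names. The lookup takes the pod getter as a plain function value, so a unit test can pass a fake without constructing a reconciler.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for a Kubernetes "not found" API error (assumption).
var errNotFound = errors.New("not found")

// Pod is a simplified stand-in for the Kubernetes pod type.
type Pod struct{ Name string }

// GetPodFunc is a hypothetical seam between the lookup logic and the cluster.
type GetPodFunc func(name string) (*Pod, error)

// LookupPod reports (pod, true, nil) when the pod exists, (nil, false, nil)
// when it was never created or has since been deleted, and (nil, false, err)
// for any other error.
func LookupPod(get GetPodFunc, podName string) (*Pod, bool, error) {
	if podName == "" {
		return nil, false, nil
	}
	pod, err := get(podName)
	switch {
	case err == nil:
		return pod, true, nil
	case errors.Is(err, errNotFound):
		return nil, false, nil
	default:
		return nil, false, err
	}
}

func main() {
	// A fake getter, as a unit test might define it.
	fake := func(name string) (*Pod, error) {
		if name == "alive" {
			return &Pod{Name: name}, nil
		}
		return nil, errNotFound
	}
	for _, name := range []string{"alive", "deleted", ""} {
		_, found, err := LookupPod(fake, name)
		fmt.Printf("%q found=%v err=%v\n", name, found, err)
	}
}
```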

Review thread on pkg/reconciler/v1alpha1/taskrun/taskrun.go (outdated, resolved)
tekton-robot added the "size/L" label (Denotes a PR that changes 100-499 lines, ignoring generated files.) and removed the "size/M" label (Denotes a PR that changes 30-99 lines, ignoring generated files.) on Apr 23, 2019

bobcatfish (Collaborator) commented

niiiice, looks great! 😎 thanks @dicarlo2 ❤️ !!

/lgtm
/approve
/meow space

tekton-robot (Collaborator) commented

@bobcatfish: cat image

tekton-robot added the "lgtm" label (Indicates that a PR is ready to be merged.) on Apr 24, 2019

tekton-robot (Collaborator) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobcatfish, dicarlo2

tekton-robot added the "approved" label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Apr 24, 2019
@tekton-robot tekton-robot merged commit 11e03b7 into tektoncd:master Apr 24, 2019

imjasonh (Member) commented

Should this require explicit opt-in from the user? Some tasks will not be idempotent, and I can imagine a scenario where automatically re-running them when they fail due to underlying platform issues would be surprising to users.

If I understand correctly, this change also makes tasks work on preemptible VMs, since the controller will recreate and restart a task if its underlying node gets preempted. That sounds really compelling for a cheap CI solution where unit tests are likely idempotent, and I'd love to see a demo/guide for setting that up. But it still seems like something users should have to explicitly opt in to, rather than assuming all tasks can handle that gracefully.

WDYT @dicarlo2 ?

dicarlo2 (Contributor, Author) commented

Yes, you're right, it should require opt-in. I'm happy to submit a PR for it. The only question I have is how we would like the user to configure it in the context of #658. Is it the same option? Is it considered a retry?

IIRC, argo workflows use two separate options, one to enable retrying system failures (argo, kubernetes, etc.) and one for retrying user failures, which at first is a bit confusing and adds to the cognitive overhead of configuring argo, so I'm not sure if we want to follow that approach here or not.

@bobcatfish

imjasonh (Member) commented

@dicarlo2 I don't think it should be considered as a "retry" if it's retrying because of platform issues. An idempotent taskrun that gets really unlucky could be preempted dozens of times, and should only be "retried" in terms of task failure once or twice. It would be confusing if those both counted toward the same retry limit.

I think the option to enable this should be phrased as something like idempotent: true (or idempotencyMode: Idempotent or something).

WDYT?
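
To make the distinction in this comment concrete, here is a minimal sketch of how platform-caused pod loss could be kept apart from the task retry budget. It illustrates the proposal only and is not existing Tekton behavior or API: `Policy`, its `Idempotent` and `MaxRetries` fields, `State`, `Cause`, and `decide` are all hypothetical names.

```go
package main

import "fmt"

// Cause records why a pod stopped running (hypothetical classification).
type Cause int

const (
	TaskFailed Cause = iota // the task's own steps exited with an error
	PodLost                 // the pod vanished: manual deletion, node preemption
)

// Action is what the controller would do next.
type Action string

const (
	Recreate Action = "recreate"
	Retry    Action = "retry"
	Fail     Action = "fail"
)

// Policy holds the two independent knobs discussed in the thread.
// Both field names are assumptions, not confirmed Tekton API.
type Policy struct {
	Idempotent bool // opt-in: safe to re-run after the pod is lost
	MaxRetries int  // budget for genuine task failures only
}

// State tracks the two counters separately.
type State struct {
	RetriesUsed int
	Recreations int
}

// decide picks the next action. Pod loss never consumes the retry budget,
// and is only recovered from when the user has opted in.
func decide(p Policy, s *State, c Cause) Action {
	switch c {
	case PodLost:
		if !p.Idempotent {
			return Fail
		}
		s.Recreations++
		return Recreate
	case TaskFailed:
		if s.RetriesUsed < p.MaxRetries {
			s.RetriesUsed++
			return Retry
		}
	}
	return Fail
}

func main() {
	p := Policy{Idempotent: true, MaxRetries: 1}
	s := &State{}
	for _, c := range []Cause{PodLost, PodLost, PodLost, TaskFailed, TaskFailed} {
		fmt.Println(decide(p, s, c))
	}
}
```

With this shape, a run that is preempted many times still has its full retry budget available for real task failures, and a run that has not opted in fails on pod loss instead of being silently re-executed.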

dicarlo2 (Contributor, Author) commented May 3, 2019

@imjasonh SGTM, I'll get a PR up shortly.

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
cla: yes: Trying to make the CLA bot happy with ppl from different companies work on one commit
lgtm: Indicates that a PR is ready to be merged.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Does not restart pods
6 participants