Configurable grace period for TaskRun pods in ImagePullBackOff #5987

Open
rinckm opened this issue Jan 12, 2023 · 12 comments
Assignees
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/feature Categorizes issue or PR as related to a new feature.

Comments

@rinckm
Contributor

rinckm commented Jan 12, 2023

Feature request

Introduce a configurable grace period for TaskRun pods to be in ImagePullBackOff without failing the TaskRun.

Use case

We use Tekton to execute TaskRuns. In our setup, images for TaskRun pods cannot be pulled directly, because container registry credentials must not exist in the namespace for security reasons. Instead, another component in the cluster pulls the images for TaskRun pods. If image pulling is delayed, a TaskRun’s pod may enter ImagePullBackOff and recover once the image has been provisioned on the respective node.

With PR #4921 (fail TaskRuns on ImagePullBackOff), we now see TaskRuns failing sporadically.

We propose introducing a configurable grace period during which ImagePullBackOff is tolerated and does not fail a TaskRun.

@rinckm rinckm added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 12, 2023
@jerop
Member

jerop commented Jan 12, 2023

@rinckm thank you for sharing your use case for this feature request. We had discussed supporting this as future work for TEP-0092. The TEP has not been implemented yet, and we welcome contributions to implement it.

cc @bobcatfish @Aleromerog

/help-wanted

@jerop jerop added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jan 12, 2023
@shuheiktgw

shuheiktgw commented Jan 22, 2023

Hi, may I try implementing TEP-0092? 👋
/assign

@afrittoli
Member

Hello @shuheiktgw, thank you for offering to implement this, that would be great.
The TEP already exists and is in an implementable state.
@jerop @bobcatfish @Aleromerog FYI

@jerop
Member

jerop commented Jan 24, 2023

@shuheiktgw thank you for offering to implement -- @EmmaMunley has also started looking into implementing TEP-0092, maybe you can collaborate?

@shuheiktgw

Sure, I'm happy to collaborate 🙂 Hi @EmmaMunley, how is the implementation going so far? I'd appreciate it if you could push any WIP changes so that I can see whether there is anything I can help with!

@EmmaMunley
Contributor

Hi @shuheiktgw! Sure! I am working on implementing the scheduling timeout feature first as part of this issue: #4078.

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 25, 2023
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 25, 2023
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.


@dibyom dibyom removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 5, 2024
@dibyom dibyom reopened this Feb 5, 2024
@pritidesai
Member

Thanks @dibyom for reopening this issue.

We are running into this issue and looking for a potential solution.

The issue is that the node where the pod is scheduled is often rate-limited by the registry. An image pull failure caused by the rate limit returns the same warning (reason: Failed and message: ImagePullBackOff). The pod can potentially recover after waiting long enough for the rate limit to expire. Kubernetes can then successfully pull the image and bring the pod up.

We have this issue reported by multiple end users: #7184

The problem statement for this issue is to be able to continue waiting when a step or sidecar container reports that it is waiting with ImagePullBackOff.
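
As a rough illustration of that check, the Go sketch below scans container statuses for the waiting reason `ImagePullBackOff`. The `containerStatus` type is a hypothetical, trimmed-down stand-in for the Kubernetes container status (only the waiting reason is modelled), and `waitingOnImagePull` is an illustrative name, not a function from the Tekton code base.

```go
package main

import "fmt"

// containerStatus is a hypothetical, simplified stand-in for the Kubernetes
// container status. Only the waiting reason is modelled here.
type containerStatus struct {
	Name          string
	WaitingReason string // empty when the container is not in the waiting state
}

// reasonImagePullBackOff is the waiting reason reported by the kubelet while
// it backs off between image pull attempts.
const reasonImagePullBackOff = "ImagePullBackOff"

// waitingOnImagePull returns the names of all containers (steps or sidecars)
// that are currently waiting with ImagePullBackOff.
func waitingOnImagePull(statuses []containerStatus) []string {
	var names []string
	for _, s := range statuses {
		if s.WaitingReason == reasonImagePullBackOff {
			names = append(names, s.Name)
		}
	}
	return names
}

func main() {
	statuses := []containerStatus{
		{Name: "step-build", WaitingReason: "ImagePullBackOff"},
		{Name: "sidecar-proxy", WaitingReason: ""},
		{Name: "step-test", WaitingReason: "ContainerCreating"},
	}
	fmt.Println(waitingOnImagePull(statuses)) // [step-build]
}
```

A controller could keep waiting while this list is non-empty and only fail the TaskRun once a configured limit has been exceeded.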

TEP-0092 was proposed to solve this issue, but it is now marked as deferred while TEP-0132 is being proposed. TEP-0132 has a much wider scope and focuses on a generic solution for creating a queue of PipelineRuns, TaskRuns, etc. Until we resume TEP-0132, I would like to check with the community about proposing a solution for this particular problem statement. Thoughts? @tektoncd/core-maintainers

@pritidesai
Member

pritidesai commented Feb 5, 2024

Revisiting TEP-0132, it might not be able to resolve this particular issue, as the TaskRun controller treats ImagePullBackOff as a permanent error. Simply avoiding ImagePullBackOff is not enough to overcome this limitation, since other non-Tekton resources in the cluster can cause the rate limit problem. We need an opt-in solution that avoids treating ImagePullBackOff as a permanent error.

pritidesai added a commit to pritidesai/pipeline that referenced this issue Feb 13, 2024
We have implemented imagePullBackOff as fail fast. The issue with this approach
is that the node where the pod is scheduled is often rate-limited by the registry.
An image pull failure caused by the rate limit returns the same warning
(reason: Failed and message: ImagePullBackOff). The pod can potentially recover
after waiting long enough for the rate limit to expire. Kubernetes can then
successfully pull the image and bring the pod up.

Introduce a default configuration that specifies a cluster-level timeout to allow
the imagePullBackOff to retry for a given duration. Once that duration has
passed, return a permanent failure.

tektoncd#5987
tektoncd#7184

Signed-off-by: Priti Desai <pdesai@us.ibm.com>

wait for a given duration in case of imagePullBackOff

Signed-off-by: Priti Desai <pdesai@us.ibm.com>
pritidesai added a commit to pritidesai/pipeline that referenced this issue Feb 14, 2024
pritidesai added a commit to pritidesai/pipeline that referenced this issue Feb 14, 2024
pritidesai added a commit to pritidesai/pipeline that referenced this issue Feb 14, 2024
pritidesai added a commit to pritidesai/pipeline that referenced this issue Feb 14, 2024
pritidesai added a commit to pritidesai/pipeline that referenced this issue Feb 15, 2024
This is a manual cherry-pick of tektoncd#7666

Signed-off-by: Priti Desai <pdesai@us.ibm.com>
tekton-robot pushed a commit that referenced this issue Feb 15, 2024
tekton-robot pushed a commit to tekton-robot/pipeline that referenced this issue Feb 15, 2024
pritidesai added a commit to pritidesai/pipeline that referenced this issue Feb 15, 2024
tekton-robot pushed a commit that referenced this issue Feb 26, 2024
tekton-robot pushed a commit that referenced this issue Feb 26, 2024
l-qing pushed a commit to l-qing/pipeline that referenced this issue Mar 19, 2024
Projects
Status: In Progress

8 participants