TaskRun fails with recoverable mount error #6960

Open
RafaeLeal opened this issue Jul 21, 2023 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@RafaeLeal
Contributor

Expected Behavior

TaskRun's pods should be able to recover from transient mount errors

Actual Behavior

When such an error occurs, the pod goes into the CreateContainerConfigError state and the TaskRun immediately fails.
The pod often recovers shortly afterwards, but by then it's too late.
This behavior was introduced in #1907

Steps to Reproduce the Problem

We're not sure exactly how to reproduce this, but we have a fairly big Tekton cluster and it happens quite often with a volume backed by AWS EFS.
What happens is that we notice a pod status like this:

status:
  conditions:
    - ...
    - type: "ContainersReady"
      status: "False"
      lastProbeTime: null
      lastTransitionTime: "2023-05-12T14:00:14Z"
      reason: "ContainersNotReady"
      message: "containers with unready status: [step-checkout]"
  containerStatuses:
    - name: "step-checkout"
      state:
        waiting:
          reason: "CreateContainerConfigError"
          message: "failed to create subPath directory for volumeMount \"ws-dmnjx\" of container \"step-checkout\""

Additional Info

  • Kubernetes version:

    Output of kubectl version:

Server Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.3-eks-a5565ad", GitCommit:"78c8293d1c65e8a153bf3c03802ab9358c0e1a14", GitTreeState:"clean", BuildDate:"2023-06-16T17:32:40Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

v0.48.0
RafaeLeal added the kind/bug label on Jul 21, 2023
@RafaeLeal
Contributor Author

I can help with the fix...
I was considering adding a grace period before setting the TaskRun status to error.
I'm not sure whether we should hard-code this grace period or make it configurable via the Tekton controller's config maps. WDYT?
Do we need a TEP for this?
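
To make the idea concrete, here is a minimal sketch of what such a grace period could look like on the controller side (Go; the function name, the constant, and how the reconciler would use the result are all assumptions, not Tekton's actual code):

// Illustrative only: delay treating CreateContainerConfigError as fatal.
package pod

import (
    "time"

    corev1 "k8s.io/api/core/v1"
)

// configErrorGracePeriod could be hard-coded or read from the controller's
// config maps; two minutes is an arbitrary placeholder.
const configErrorGracePeriod = 2 * time.Minute

// shouldFailOnConfigError reports whether a CreateContainerConfigError seen on
// the TaskRun's pod should fail the TaskRun now, or whether the reconciler
// should requeue and give a transient mount error a chance to clear.
func shouldFailOnConfigError(pod *corev1.Pod, taskRunStart time.Time) bool {
    for _, cs := range pod.Status.ContainerStatuses {
        if w := cs.State.Waiting; w != nil && w.Reason == "CreateContainerConfigError" {
            // Inside the grace period: keep waiting instead of failing.
            return time.Since(taskRunStart) >= configErrorGracePeriod
        }
    }
    return false
}

Whether that duration should be hard-coded or come from the controller's config maps is exactly the open question above.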

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot added the lifecycle/stale label on Oct 19, 2023
@vdemeester
Member

/remove-lifecycle stale
@RafaeLeal I think that can make sense (having a grace period for this). I feel we might not necessarily need a TEP for this.
cc @afrittoli
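
If it does end up being configurable, one possible shape (the key name below is purely hypothetical, not an existing Tekton option) would be an entry in the controller's config-defaults ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: tekton-pipelines
data:
  # Hypothetical setting, named here only for illustration.
  default-container-config-error-grace-period: "2m"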

tekton-robot removed the lifecycle/stale label on Oct 23, 2023
@codegold79

codegold79 commented Mar 25, 2024

My team has tried to recover from a CreateContainerConfigError because the TaskRun hasn't really failed: note that there is no completionTime, and the single status.steps entry is waiting, not terminated.

TaskRun

status:
  conditions:
  - lastTransitionTime: "2024-03-22T18:09:56Z"
    message: Failed to create pod due to config error
    reason: CreateContainerConfigError
    status: "False"
    type: Succeeded
  startTime: "2024-03-22T18:09:40Z"
  steps:
  - container: step-check-step
    name: check-step
    waiting:
      message: secret "oci-store" not found
      reason: CreateContainerConfigError

In that waiting (but failed) state, we tried to provide the correct configuration to pull the image, but the task never recovered. The pipeline that spawned the task was already in a terminated/failed, non-waiting, non-recoverable state.

We also went the other way and waited for the pod to time out while waiting, but the TaskRun never switches to timed out. The PipelineRun, of course, stays failed, and the pod hangs around, never getting deleted.

I wonder, @RafaeLeal: you mentioned that the TaskRun fails and the pod recovers, but too late. At that point, is the TaskRun terminated with a completionTime, or is its step still waiting? I wonder whether your problem is the same as ours, or whether we need to open a separate issue.
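
For reference, a quick way to check which of those two states a TaskRun is in (the TaskRun name below is a placeholder):

kubectl get taskrun <taskrun-name> -o yaml | grep -E 'completionTime|waiting:|terminated:'

If completionTime appears, the run is terminated; if only a waiting: block appears under the step, it is stuck in the state described above.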
