PipelineRun status does not accurately reflect the state of the runtime #2268

Closed
steveodonovan opened this issue Mar 23, 2020 · 8 comments
Labels: kind/bug, kind/question, lifecycle/rotten, lifecycle/stale

Comments

@steveodonovan
Member

steveodonovan commented Mar 23, 2020

Expected Behavior

PipelineRuns and TaskRuns status should reflect the state of the runtimes that make up a pipeline.

Actual Behavior

There are some cases where a pod will not run, but the pipelineRun and taskRun status does not entirely reflect this (although the information is in there). This may be by design, given the pods are recoverable, but it is worth raising as a question at least.

For example, sample resources with a missing config map:

apiVersion: tekton.dev/v1alpha1
kind: Task
metadata:
  name: task1
spec:
  steps:
  - name: task-one-step-one
    env:
      - name: ENV
        valueFrom:
          configMapKeyRef:
            name: environment-properties
            key: environment
    image: ubuntu
    command: ["/bin/bash"]
    args: ['-c', 'echo $ENV']
---
apiVersion: tekton.dev/v1alpha1
kind: Pipeline
metadata:
  name: bad-map-ref-pipe
spec:
  tasks:
  - name: task1
    taskRef:
      name: task1
---
apiVersion: tekton.dev/v1alpha1
kind: PipelineRun
metadata:
  name: bad-map-ref-run
spec:
  pipelineRef:
    name: bad-map-ref-pipe

We end up with a pipelineRun that appears to be running, but on inspecting the taskRun we see it is stuck with the message build step "step-task-one-step-one" is pending with reason "configmap \"environment-properties\" not found" and the reason CreateContainerConfigError (the standalone taskRun and the pipelineRun -> taskRun status both show this).

NAME              SUCCEEDED   REASON    STARTTIME   COMPLETIONTIME
bad-map-ref-run   Unknown     Running   19s

This means that, observing the pipelineRun, you don't get an accurate view of its state, given it will never complete as things stand. However, if the configmap were created the pod would recover and run.

PipelineRun

Name:         bad-map-ref-run
Namespace:    default
Labels:       tekton.dev/pipeline=bad-map-ref-pipe
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"tekton.dev/v1alpha1","kind":"Pipeline","metadata":{"annotations":{},"name":"bad-map-ref-pipe","namespace":"default"},"spec"...
API Version:  tekton.dev/v1alpha1
Kind:         PipelineRun
Metadata:
  Creation Timestamp:  2020-03-23T17:40:26Z
  Generation:          1
  Resource Version:    227000
  Self Link:           /apis/tekton.dev/v1alpha1/namespaces/default/pipelineruns/bad-map-ref-run
  UID:                 359d58c3-ca76-4607-90fc-bfce4f802139
Spec:
  Pipeline Ref:
    Name:   bad-map-ref-pipe
  Timeout:  24h0m0s
Status:
  Conditions:
    Last Transition Time:  2020-03-23T17:40:26Z
    Message:               Not all Tasks in the Pipeline have finished executing
    Reason:                Running
    Status:                Unknown
    Type:                  Succeeded
  Start Time:              2020-03-23T17:40:26Z
  Task Runs:
    bad-map-ref-run-task1-lnt28:
      Pipeline Task Name:  task1
      Status:
        Conditions:
          Last Transition Time:  2020-03-23T17:40:31Z
          Message:               build step "step-task-one-step-one" is pending with reason "configmap \"environment-properties\" not found"
          Reason:                CreateContainerConfigError
          Status:                Unknown
          Type:                  Succeeded
        Pod Name:                bad-map-ref-run-task1-lnt28-pod-xtxnr
        Start Time:              2020-03-23T17:40:26Z
        Steps:
          Container:  step-task-one-step-one
          Name:       task-one-step-one
          Waiting:
            Message:  configmap "environment-properties" not found
            Reason:   CreateContainerConfigError
Events:               <none>

The same happens if a PVC is missing. Sample resources with a missing PVC:

apiVersion: tekton.dev/v1alpha1
kind: Task
metadata:
  name: task1-pvc
spec:
  steps:
  - name: task-one-step-one
    image: ubuntu
    command: ["/bin/bash"]
    args: ['-c', 'echo running']
    volumeMounts:
      - mountPath: /artifacts
        name: task-volume
  volumes:
    - name: task-volume
      persistentVolumeClaim:
        claimName: missing-claim
---
apiVersion: tekton.dev/v1alpha1
kind: Pipeline
metadata:
  name: missing-pvc-pipe
spec:
  tasks:
  - name: task1-pvc
    taskRef:
      name: task1-pvc
---
apiVersion: tekton.dev/v1alpha1
kind: PipelineRun
metadata:
  name: missing-pvc-run
spec:
  pipelineRef:
    name: missing-pvc-pipe

The taskRun/pod relationship is the same as in the config map example above.

Name:         missing-pvc-run
Namespace:    default
Labels:       tekton.dev/pipeline=missing-pvc-pipe
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"tekton.dev/v1alpha1","kind":"Pipeline","metadata":{"annotations":{},"name":"missing-pvc-pipe","namespace":"default"},"spec"...
API Version:  tekton.dev/v1alpha1
Kind:         PipelineRun
Metadata:
  Creation Timestamp:  2020-03-23T17:46:55Z
  Generation:          1
  Resource Version:    227552
  Self Link:           /apis/tekton.dev/v1alpha1/namespaces/default/pipelineruns/missing-pvc-run
  UID:                 e70a7537-1f5d-4540-830e-5ce6db40cbf6
Spec:
  Pipeline Ref:
    Name:   missing-pvc-pipe
  Timeout:  24h0m0s
Status:
  Conditions:
    Last Transition Time:  2020-03-23T17:46:55Z
    Message:               Not all Tasks in the Pipeline have finished executing
    Reason:                Running
    Status:                Unknown
    Type:                  Succeeded
  Start Time:              2020-03-23T17:46:55Z
  Task Runs:
    missing-pvc-run-task1-pvc-xfpw6:
      Pipeline Task Name:  task1-pvc
      Status:
        Conditions:
          Last Transition Time:  2020-03-23T17:46:55Z
          Message:               pod status "PodScheduled":"False"; message: "persistentvolumeclaim \"missing-claim\" not found"
          Reason:                Pending
          Status:                Unknown
          Type:                  Succeeded
        Pod Name:                missing-pvc-run-task1-pvc-xfpw6-pod-gbkxt
        Start Time:              2020-03-23T17:46:55Z
Events:
  Type     Reason             Age   From                 Message
  ----     ------             ----  ----                 -------
  Warning  PipelineRunFailed  26s   pipeline-controller  PipelineRun failed to update labels/annotations

There are other cases where a pod will fail to schedule and the pipelineRun will be ambiguous about the state of the run, for example taints or resource limits preventing the pod from scheduling.

Steps to Reproduce the Problem

  1. Run the above sample resources
  2. Check the status of the pipelineRun

Additional Info

  • Kubernetes version:
    v1.16

  • Tekton Pipeline version:
    v0.10.1

I'm not sure this is a legitimate issue given that all these states are recoverable and the recovery is handled at the pod level, but it does mean that, at a given point, for a pipelineRun to say it's running is not accurate. It may also tie into #1684, given we are considering creating dependencies on the status of these resources.

EDIT: I should have spotted this earlier: the PVC case looks like a legitimate issue given there's an event indicating it failed.

@ghost

ghost commented Apr 1, 2020

I'm going to assign this a "bug" label for the PVC issue - that looks to me like we should do a better job of detecting the failure. Although I'm actually not clear whether the pod itself has entered a completely failed state here or is re-attempting.

With the configmap example I'm not really sure what the best approach would be. I guess we could indicate some kind of Initializing state but then again it's quite possible that some TaskRun pods are executing fine while others in the same PipelineRun are still booting up or experiencing issues. In that case it's a little confusing to call the PipelineRun "Initializing".

@ghost added the kind/bug and kind/question labels on Apr 1, 2020
@GregDritschler
Contributor

I've been looking into this issue.

In both cases a pod is waiting for some external resource (pvc, configmap) to become available.
Once it becomes available the pod's containers will run.

I've been thinking about introducing a new PipelineRun condition Reason for this situation. "Initializing" and "Pending" both sound like something that would apply before anything is running, while these conditions can happen to any task within the pipeline. I am leaning toward "Waiting" because it doesn't have the connotation of being an initial block of some sort.

The problem is that, as @sbwsg points out, some tasks may be running while others are waiting. One option is to use the new Reason only when no tasks are running; the status would change to Waiting only if the pipeline were completely blocked. Another option is to add yet another Reason that reflects the mixed condition, something like "Running/Waiting" or "Partially Running"; I don't know the right term yet. But someone doing a short display of the PipelineRun might want to see that something's up, for example:

  $ kubectl get pr 
  NAME    SUCCEEDED   REASON             STARTTIME   COMPLETIONTIME
  my-pr   Unknown     Partially Running  4m10s
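
A rough sketch of that reason selection (Go; "Waiting" and "PartiallyRunning" are placeholder reason strings from this discussion, not existing constants in the codebase):

  // Sketch only: derive an aggregate PipelineRun reason from counts of
  // incomplete TaskRuns. The non-"Running" reason names are hypothetical.
  func pipelineRunReason(running, waiting int) string {
      switch {
      case waiting == 0:
          return "Running"
      case running == 0:
          return "Waiting" // every incomplete task is blocked on something external
      default:
          return "PartiallyRunning" // mixed: some tasks run while others are blocked
      }
  }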

The other question is what to do with the PR condition's message in this case.
Currently that message looks like this:

  'Tasks Completed: 0, Incomplete: 1, Skipped: 0'

(This used to be 'Not all Tasks in the Pipeline have finished executing' as shown in the issue description.)

One option is to change it to decompose the incomplete count into running and waiting:

  'Tasks Completed: 0, Running: 0, Waiting: 1, Skipped: 0'

Taskruns with ConditionUnknown and Reason Running are running.
Taskruns with ConditionUnknown and any other Reason are waiting.
This is based on my reading of updateIncompleteTaskRun in pkg/pod/status.go.
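
As a minimal sketch of that classification (simplified stand-in types, not the real TaskRun status types; the real logic would live alongside updateIncompleteTaskRun):

  package main

  import "fmt"

  // Condition is a simplified stand-in for a TaskRun's Succeeded condition.
  type Condition struct {
      Status string // "True", "False", or "Unknown"
      Reason string // e.g. "Running", "Pending", "CreateContainerConfigError"
  }

  // countTaskRuns buckets TaskRuns as described above: ConditionUnknown with
  // Reason "Running" counts as running, ConditionUnknown with any other Reason
  // counts as waiting; anything else has completed.
  func countTaskRuns(conds []Condition) (completed, running, waiting int) {
      for _, c := range conds {
          switch {
          case c.Status != "Unknown":
              completed++
          case c.Reason == "Running":
              running++
          default:
              waiting++
          }
      }
      return
  }

  func main() {
      // One TaskRun stuck on the missing-configmap error from this issue.
      conds := []Condition{{Status: "Unknown", Reason: "CreateContainerConfigError"}}
      completed, running, waiting := countTaskRuns(conds)
      fmt.Printf("Tasks Completed: %d, Running: %d, Waiting: %d, Skipped: 0\n",
          completed, running, waiting)
  }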

However, this won't tell you which TaskRuns are waiting; you'd have to scan the taskrun conditions to find that. Unfortunately there isn't a consistent Reason used in the TaskRuns: it may be Pending (e.g. the missing PVC case), CreateContainerConfigError (e.g. the missing configmap case), or ExceededNodeResources (unschedulable due to resource limits).

So the other option is to use a different message when there is at least one waiting taskrun.

  'Taskrun %s is waiting; %s'

where it would tell you the taskrun and its message. If there is more than one waiting taskrun, you either just get one of them, or I guess we could list them all; it might get lengthy.
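
For the "list them all" variant, something like this (sketch only, assuming the standard fmt and strings imports; names and messages are illustrative):

  // waitingMessage builds one entry per waiting TaskRun; in the controller the
  // names and condition messages would come from the TaskRun statuses.
  func waitingMessage(waiting map[string]string) string {
      parts := make([]string, 0, len(waiting))
      for name, msg := range waiting {
          parts = append(parts, fmt.Sprintf("Taskrun %s is waiting; %s", name, msg))
      }
      return strings.Join(parts, ". ")
  }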

What do you think @sbwsg? @afrittoli?

@steveodonovan
Member Author

steveodonovan commented Apr 6, 2020

My 2c is that something like SUCCEEDED: False, REASON: Error, treated as a recoverable state, makes sense. We should reflect the state of the pipeline; if some tasks are running and others cannot, that is an accurate state. It would occur where we get a FailedScheduling pod error on a taskRun, for example. We would document this error state as a new state reflecting issues with the resources. Also, I'm afraid we have inconsistent behaviour across these kinds of errors: where we have a missing serviceAccount, for example, we fail hard - that case would definitely be better represented as Error instead of Failed:

Name:         run1
Namespace:    default
Labels:       tekton.dev/pipeline=999-working-logs-long
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"tekton.dev/v1alpha1","kind":"Pipeline","metadata":{"annotations":{},"name":"999-working-logs-long","namespace":"default"},"...
API Version:  tekton.dev/v1alpha1
Kind:         PipelineRun
Metadata:
  Creation Timestamp:  2020-04-06T10:14:47Z
  Generation:          1
  Resource Version:    218588
  Self Link:           /apis/tekton.dev/v1alpha1/namespaces/default/pipelineruns/run1
  UID:                 8366b7b2-2a47-45e8-9537-3f8afc83f51b
Spec:
  Pipeline Ref:
    Name:                999-working-logs-long
  Service Account Name:  missingsa
  Timeout:               24h0m0s
Status:
  Completion Time:  2020-04-06T10:14:47Z
  Conditions:
    Last Transition Time:  2020-04-06T10:14:47Z
    Message:               TaskRun run1-task1-8zvkd has failed
    Reason:                Failed
    Status:                False
    Type:                  Succeeded
  Start Time:              2020-04-06T10:14:47Z
  Task Runs:
    run1-task1-8zvkd:
      Pipeline Task Name:  task1
      Status:
        Conditions:
          Last Transition Time:  2020-04-06T10:14:47Z
          Message:               Missing or invalid Task default/task1: translating Build to Pod: serviceaccounts "missingsa" not found
          Reason:                CouldntGetTask
          Status:                False
          Type:                  Succeeded
        Pod Name:                
        Start Time:              2020-04-06T10:14:47Z
    run1-task2-pjwh5:
      Pipeline Task Name:  task2
      Status:
        Conditions:
          Last Transition Time:  2020-04-06T10:14:47Z
          Message:               Missing or invalid Task default/task2: translating Build to Pod: serviceaccounts "missingsa" not found
          Reason:                CouldntGetTask
          Status:                False
          Type:                  Succeeded
        Pod Name:                
        Start Time:              2020-04-06T10:14:47Z
Events:
  Type     Reason  Age   From                 Message
  ----     ------  ----  ----                 -------
  Warning  Failed  24s   pipeline-controller  TaskRun run1-task1-8zvkd has failed

@steveodonovan
Member Author

@dibyom @pritidesai just on #1684 and how it relates to this: Error, or broadly something reflecting this state, seems like something to consider in that design as a terminal state. Another question here is how the runOn states map to the pipelineRun state. Is it a separate abstraction, or does it map to a particular field in the pipelineRun?

@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot added the lifecycle/rotten and lifecycle/stale labels on Aug 13, 2020
@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
