
Tekton unable to handle PipelineRuns too big #6076

Closed
RafaeLeal opened this issue Jan 31, 2023 · 6 comments · Fixed by #6095
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@RafaeLeal
Contributor

Expected Behavior

Tekton should be able to run large PipelineRuns, or at least fail them with a clear error.

Actual Behavior

The PipelineRun gets stuck in the same status in the cluster and does not respect any timeouts.

Steps to Reproduce the Problem

  1. Create a PipelineRun with a lot of embedded tasks and status (a rough generation sketch follows this list)
  2. Watch the Tekton controller logs
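A rough sketch of step 1 in Go, assuming the v1beta1 API layout of the v0.35.x line reported below (Step and EmbeddedTask fields are arranged differently in newer releases); it only builds the object and prints its serialized size against etcd's default ~1.5 MiB request limit:

package main

import (
	"encoding/json"
	"fmt"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pr := v1beta1.PipelineRun{
		ObjectMeta: metav1.ObjectMeta{Name: "huge-run", Namespace: "pipelines"},
		Spec:       v1beta1.PipelineRunSpec{PipelineSpec: &v1beta1.PipelineSpec{}},
	}
	// Hundreds of embedded tasks: with embedded-status "full", each one later adds
	// a complete TaskRun status copy to .status.taskRuns as well.
	for i := 0; i < 500; i++ {
		pr.Spec.PipelineSpec.Tasks = append(pr.Spec.PipelineSpec.Tasks, v1beta1.PipelineTask{
			Name: fmt.Sprintf("task-%d", i),
			TaskSpec: &v1beta1.EmbeddedTask{TaskSpec: v1beta1.TaskSpec{
				Steps: []v1beta1.Step{{
					Container: corev1.Container{Name: "run", Image: "busybox"},
					Script:    "echo hello && sleep 1",
				}},
			}},
		})
	}
	raw, _ := json.Marshal(pr)
	fmt.Printf("spec alone serializes to %d bytes (etcd's default request limit is ~1572864)\n", len(raw))
}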

Versions Info

Kubernetes Version (output of `kubectl version -o yaml`)
$ kubectl version -o yaml
clientVersion:
  buildDate: "2022-10-12T10:47:25Z"
  compiler: gc
  gitCommit: 434bfd82814af038ad94d62ebe59b133fcb50506
  gitTreeState: clean
  gitVersion: v1.25.3
  goVersion: go1.19.2
  major: "1"
  minor: "25"
  platform: darwin/amd64
kustomizeVersion: v4.5.7
serverVersion:
  buildDate: "2022-11-29T18:41:42Z"
  compiler: gc
  gitCommit: 52e500d139bdef42fbc4540c357f0565c7867a81
  gitTreeState: clean
  gitVersion: v1.22.16-eks-ffeb93d
  goVersion: go1.16.15
  major: "1"
  minor: 22+
  platform: linux/amd64
Tekton Pipeline Version (output of `tkn version`)
Client version: 0.26.0
Pipeline version: v0.35.1
Triggers version: v0.20.0
Dashboard version: v0.26.0

Additional Info

  • We (Nubank) manage a very large CI/CD cluster (we reached over 700k TaskRuns/week), and we've built some abstractions on top of Tekton to make it easier for our users to declare new pipelines.
  • We started to observe that some PipelineRuns were getting stuck in our cluster, not finishing even hours after their timeout.
  • Investigating the logs, we found these (reformatted here for readability):
{ 
   caller: pipelinerun/reconciler.go:268
   commit: 422a468
   error: etcdserver: request is too large
   knative.dev/controller: github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler
   knative.dev/key: pipelines/itaipu-tests-main-run-ie3c7
   knative.dev/kind: tekton.dev.PipelineRun
   knative.dev/traceid: d1c3c461-cd4e-4656-a913-900fede9ba6e
   level: warn
   logger: tekton-pipelines-controller
   msg: Failed to update resource status
   targetMethod: ReconcileKind
   ts: 2023-01-18T13:54:54.256Z
}
{ 
   caller: controller/controller.go:566
   commit: 422a468
   duration: 14.036001784
   error: etcdserver: request is too large
   knative.dev/controller: github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler
   knative.dev/key: pipelines/itaipu-tests-main-run-ie3c7
   knative.dev/kind: tekton.dev.PipelineRun
   knative.dev/traceid: d1c3c461-cd4e-4656-a913-900fede9ba6e
   level: error
   logger: tekton-pipelines-controller
   msg: Reconcile error
   stacktrace: github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).handleErr
        github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:566
github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem
        github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:543
github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).RunContext.func3
        github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:491
   ts: 2023-01-18T13:54:54.256Z
}
  • These two errors indicate that the Tekton Pipelines controller was trying to update the status but kept getting the request is too large error.
  • Searching for other logs, we found these:
{
   caller: pipelinerun/pipelinerun.go:1360
   commit: 422a468
   knative.dev/controller: github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler
   knative.dev/key: pipelines/itaipu-tests-main-run-ie3c7
   knative.dev/kind: tekton.dev.PipelineRun
   knative.dev/traceid: d1c3c461-cd4e-4656-a913-900fede9ba6e
   level: info
   logger: tekton-pipelines-controller
   msg: Found a TaskRun itaipu-tests-main-run-ie3c7-bors-ready that was missing from the PipelineRun status
   ts: 2023-01-18T13:54:40.233Z
}
{ 
   caller: taskrun/taskrun.go:117
   commit: 422a468
   knative.dev/controller: github.com.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler
   knative.dev/key: pipelines/itaipu-tests-main-run-ie3c7-bors-ready
   knative.dev/kind: tekton.dev.TaskRun
   knative.dev/traceid: 74f9a53a-8c95-49c0-9093-2e6095e0ffa3
   level: info
   logger: tekton-pipelines-controller
   msg: taskrun done : itaipu-tests-main-run-ie3c7-bors-ready 
   ts: 2023-01-18T13:59:38.227Z
}
  • These logs indicated that there was nothing wrong at the TaskRun level: the TaskRun itaipu-tests-main-run-ie3c7-bors-ready was done, but it was missing from the PipelineRun status, because the controller was failing to update that status due to the etcdserver: request is too large error.
  • This left the PipelineRun stuck: the controller couldn't update its status at all, not even to mark it as timed out.
  • We are still using the full embedded status. We know that's not ideal for our scenario, and we want to switch to minimal as soon as possible. That said, I still believe we could hit this error in the future for pipelines that get really huge (a rough way to check a run's serialized size is sketched below).
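One rough way to check whether a given run has outgrown etcd is to fetch it and measure its serialized size; a minimal sketch with the Tekton clientset (the namespace and name come from the logs above, and the ~1.5 MiB figure is etcd's default --max-request-bytes, which your cluster may override):

package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	"github.com/tektoncd/pipeline/pkg/client/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := versioned.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	pr, err := client.TektonV1beta1().PipelineRuns("pipelines").
		Get(context.Background(), "itaipu-tests-main-run-ie3c7", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	raw, _ := json.Marshal(pr)
	const etcdDefaultLimit = 1572864 // etcd's default --max-request-bytes (~1.5 MiB)
	fmt.Printf("PipelineRun is %d bytes (etcd default request limit: %d)\n", len(raw), etcdDefaultLimit)
	if len(raw) > etcdDefaultLimit {
		fmt.Println("status updates for this object will keep failing with 'request is too large'")
	}
}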
@RafaeLeal added the kind/bug label on Jan 31, 2023
@RafaeLeal
Contributor Author

My first thought was to simply drop the .Status.TaskRuns field when we receive an etcdserver: request is too large error, but that error is handled here, in generated code that we probably shouldn't touch, right?
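For illustration only, a sketch of what that first idea might look like as a wrapper around the status update instead of a change to the generated code (updateStatusDroppingTaskRunsOnOverflow and the update callback are hypothetical, not existing Tekton helpers):

package sketch

import (
	"context"
	"strings"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// updateStatusDroppingTaskRunsOnOverflow retries a too-large status update with the
// copied TaskRun statuses removed, keeping only the PipelineRun's own conditions.
// The update callback stands in for whatever call actually persists the status.
func updateStatusDroppingTaskRunsOnOverflow(ctx context.Context, pr *v1beta1.PipelineRun,
	update func(context.Context, *v1beta1.PipelineRun) error) error {

	err := update(ctx, pr)
	if err == nil || !strings.Contains(err.Error(), "request is too large") {
		return err
	}
	// The object is too big for etcd: drop the full embedded TaskRun statuses and retry once.
	trimmed := pr.DeepCopy()
	trimmed.Status.TaskRuns = nil
	return update(ctx, trimmed)
}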

@RafaeLeal
Contributor Author

So what we could do is remove the field when the PipelineRun's timeout is reached, but the tricky part is that we can't tell whether we are hitting the error or not.

After thinking about this problem a bit more, I realized that the issue is not the condition that changed, but the TaskRun statuses that don't fit within the manifest size limit. So we should not get request is too large if we return the timeout condition before reconciling the TaskRun statuses. Or we could simply re-copy the existing status before changing the reason to timeout.
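A minimal sketch of that idea, assuming the v1beta1 API (markTimedOut is a hypothetical helper, and the reason/condition helper names can differ between releases): flip only the PipelineRun's own condition and leave .status.taskRuns as it was last persisted, so the update stays roughly the same size as one etcd already accepted:

package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"knative.dev/pkg/apis"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// markTimedOut sets only the Succeeded condition to a timeout failure without
// rebuilding the (potentially over-sized) TaskRun status map.
func markTimedOut(pr *v1beta1.PipelineRun) {
	pr.Status.SetCondition(&apis.Condition{
		Type:    apis.ConditionSucceeded,
		Status:  corev1.ConditionFalse,
		Reason:  v1beta1.PipelineRunReasonTimedOut.String(),
		Message: "PipelineRun " + pr.Name + " failed to finish within its timeout",
	})
}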

@afrittoli
Member

Hello @RafaeLeal - which version of Tekton do you use?
Recently we introduced a configuration option that makes Tekton embed in the status only a reference to the owned resources, as opposed to a copy of their whole status. In the latest release this behaviour is the default.
Depending on the number of TaskRuns this can make a significant difference in the overall size.
See "embedded status" in https://tekton.dev/docs/pipelines/install/#customizing-the-pipelines-controller-behavior

@RafaeLeal
Contributor Author

Hey @afrittoli, as I stated in the additional info, we're still using the full embedded status. We are going to change to minimal as soon as possible, but I still think that the Tekton controller should be able to handle this error. Since we generate PipelineRuns, it's quite possible that we would generate one that is simply too big to run.

I'm OK with Tekton just surfacing an error in this scenario, but having the run get stuck in the cluster is quite a problem, because we use the number of TaskRuns in the cluster to know whether we have "space" for more, and a stuck run "locks" part of that quota.

@afrittoli
Member

Hi @RafaeLeal - thanks, I missed that bit of information.

You're correct that the code where the error is managed is generated, so we cannot change it directly in Tekton.
If I read the flow correctly, the error is treated as a non-permanent one, which means the key is re-queued and will eventually be picked up by the PipelineRun controller again (with rate limiting, but still...).

The problem is that by the time the key is reconciled again, the controller has no knowledge of the previous error and thus no way to prevent it.

The only strategy I could think of would be, in case the timeout has expired by more than a certain threshold, to skip any status update except for the PipelineRun's own condition. Even that would not guarantee an update in all cases, but it should help in your case by letting the PipelineRun eventually time out. It would still require checking the logs to discover the actual cause of the failure, though.
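A rough sketch of that strategy (the grace period, the helper name, and the use of spec.timeout rather than spec.timeouts are all assumptions, not Tekton code):

package sketch

import (
	"time"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// statusUpdateGracePeriod is an arbitrary threshold: once a run is past its
// timeout by more than this, stop persisting the full child status and only
// persist the PipelineRun's own condition so the update fits into etcd.
const statusUpdateGracePeriod = 10 * time.Minute

// shouldSkipChildStatusUpdate is a hypothetical helper the reconciler could
// consult before copying TaskRun statuses into the PipelineRun status.
func shouldSkipChildStatusUpdate(pr *v1beta1.PipelineRun, now time.Time) bool {
	if pr.Status.StartTime == nil || pr.Spec.Timeout == nil {
		return false
	}
	deadline := pr.Status.StartTime.Add(pr.Spec.Timeout.Duration)
	return now.After(deadline.Add(statusUpdateGracePeriod))
}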

@RafaeLeal
Contributor Author

You're correct that the code where the error is managed is generated, so we cannot change it directly in Tekton. If I read the flow correctly, the error is treated as a non-permanent one, which means the key is re-queued and will eventually be picked up by the PipelineRun controller again (with rate limiting, but still...).

The problem is that by the time the key is reconciled again, the controller has no knowledge of the previous error and thus no way to prevent it.

Thanks a lot for your input @afrittoli
That's exactly the problem as I understood it!

The only strategy I could think of would be, in case the timeout has expired by more than a certain threshold, to skip any status update except for the PipelineRun's own condition. Even that would not guarantee an update in all cases, but it should help in your case by letting the PipelineRun eventually time out. It would still require checking the logs to discover the actual cause of the failure, though.

This strategy you proposed was something I was thinking of too.
I've implemented it here.

While implementing it, I considered that maybe we could always do a two-step timeout; that way we could avoid arbitrary thresholds. The first reconciliation would check pr.HasTimedOut(), mark the status, and return controller.NewRequeueImmediately(). This would trigger an UpdateStatus with only the condition change; then, in the second reconciliation, we could try to update the rest of the status (the childReferences, for example).

This could work, I think. But the problem I have is that we still depend a lot on the order of execution to make everything work properly. In the second reconciliation we already have the timed-out condition, which means pr.IsDone() is true, and that changes the whole reconciliation process.
https://github.com/tektoncd/pipeline/blob/main/pkg/reconciler/pipelinerun/pipelinerun.go#L205-L225
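A sketch of the two-step flow described above (not the actual reconciler; the timedOut argument stands in for pr.HasTimedOut(), whose exact signature depends on the Tekton version, and MarkFailed is assumed to be available on PipelineRunStatus):

package sketch

import (
	"context"

	"knative.dev/pkg/controller"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// reconcileTimeout sketches the two-step timeout idea.
func reconcileTimeout(ctx context.Context, pr *v1beta1.PipelineRun, timedOut bool) error {
	if timedOut && !pr.IsDone() {
		// First pass: only the Succeeded condition changes, so the status update
		// stays small enough for etcd.
		pr.Status.MarkFailed(v1beta1.PipelineRunReasonTimedOut.String(),
			"PipelineRun %q failed to finish within its timeout", pr.Name)
		// Returning this from ReconcileKind lets the generated reconciler persist
		// the condition change and put the key straight back on the work queue.
		return controller.NewRequeueImmediately()
	}
	// Second pass: pr.IsDone() is now true; the larger update of childReferences
	// (or .status.taskRuns) can be attempted here, separately from the condition.
	return nil
}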

Looking at this, does it make sense to "fail fast" the reconciliation process?
For example, even inside this pr.IsDone() branch, it runs several cleanup processes. If one fails, should we really prevent the next one from executing? 🤔
