(Kubernetes backend) terminated node causes runtime error when handling step exit code #3330

fernandrone · 2024-02-05T18:27:23Z

Component

agent

Describe the bug

We were running Woodpecker v2.1.1 with the Kubernetes backend on a multi-node cluster on AWS.

We've seen a few `panic: runtime error` logs in our agent like this:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x16f7594]

goroutine 99699 [running]:
go.woodpecker-ci.org/woodpecker/v2/pipeline/backend/kubernetes.(*kube).WaitStep(0xc00050a140, {0x1e32748, 0xc0005048c0}, 0xc00116c500?, {0xc0024529b0, 0x5})
    /src/pipeline/backend/kubernetes/kubernetes.go:251 +0x594
go.woodpecker-ci.org/woodpecker/v2/pipeline.(*Runtime).exec(0xc001d80b80, 0xc00116c500)
    /src/pipeline/pipeline.go:269 +0x196
go.woodpecker-ci.org/woodpecker/v2/pipeline.(*Runtime).execAll.func1()
    /src/pipeline/pipeline.go:206 +0x1ba
golang.org/x/sync/errgroup.(*Group).Go.func1()
    /src/vendor/golang.org/x/sync/errgroup/errgroup.go:75 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 41
    /src/vendor/golang.org/x/sync/errgroup/errgroup.go:72 +0x96

Tracking it down to https://github.com/woodpecker-ci/woodpecker/blob/v2.1.1/pipeline/backend/kubernetes/kubernetes.go#L251, it's likely that either ContainerStatuses is empty or Terminated is nil, either of which would cause a panic.

Simply adding error handling for both cases seems like a viable option here, which is what I did internally (and I hope to submit that for review shortly). With it in place we were able to track down at least one occurrence of the bug: the node hosting the pod is killed before the agent can retrieve the exit code. In our case the cause was the Amazon Auto Scaling Group rebalancing across multiple AZs despite active pipelines being executed on the node.

System Info

Woodpecker v2.1.1

Additional context

No response

Validations

Read the docs.
Check that there isn't already an issue that reports the same bug to avoid creating a duplicate.
Checked that the bug isn't fixed in the next version already [https://woodpecker-ci.org/faq#which-version-of-woodpecker-should-i-use]

Fixes #3330

This adds error handling to the agent's WaitStep function in two places where it could hit a `panic: runtime error: invalid memory address or nil pointer dereference` when it could no longer access complete information about a specific pod. This error was found to happen if the node on which the pod was running was terminated during the step's execution.

Now, instead of a panic in the agent's logs and undefined behavior in the UI, a more helpful error message is displayed in the UI.

### Additional context

We first observed the bug on v2.1.1, but tested the fix internally on top of 2.3.0.

![image](https://github.com/woodpecker-ci/woodpecker/assets/7269710/dfbcf089-85f7-4b5d-8102-f21af95c5cda)

fernandrone added the bug Something isn't working label Feb 5, 2024

fernandrone mentioned this issue Feb 5, 2024

fix: agent panic when node is terminated during step execution #3331

Merged

6543 closed this as completed in #3331 Feb 5, 2024
