(Kubernetes backend) terminated node causes runtime error when handling step exit code #3330

Closed
3 tasks done
fernandrone opened this issue Feb 5, 2024 · 0 comments · Fixed by #3331
Labels
bug Something isn't working

Component

agent

Describe the bug

We were running Woodpecker v2.1.1 with the Kubernetes backend on a multi-node cluster on AWS.

We saw several panic: runtime error logs in our agent, like this:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x16f7594]

goroutine 99699 [running]:
go.woodpecker-ci.org/woodpecker/v2/pipeline/backend/kubernetes.(*kube).WaitStep(0xc00050a140, {0x1e32748, 0xc0005048c0}, 0xc00116c500?, {0xc0024529b0, 0x5})
    /src/pipeline/backend/kubernetes/kubernetes.go:251 +0x594
go.woodpecker-ci.org/woodpecker/v2/pipeline.(*Runtime).exec(0xc001d80b80, 0xc00116c500)
    /src/pipeline/pipeline.go:269 +0x196
go.woodpecker-ci.org/woodpecker/v2/pipeline.(*Runtime).execAll.func1()
    /src/pipeline/pipeline.go:206 +0x1ba
golang.org/x/sync/errgroup.(*Group).Go.func1()
    /src/vendor/golang.org/x/sync/errgroup/errgroup.go:75 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 41
    /src/vendor/golang.org/x/sync/errgroup/errgroup.go:72 +0x96

Tracking it down to https://github.com/woodpecker-ci/woodpecker/blob/v2.1.1/pipeline/backend/kubernetes/kubernetes.go#L251, it's likely that either ContainerStatuses is empty or Terminated is nil, either of which would cause a panic.

Adding error handling for both cases seems like a viable option here, which is what I did internally (and I hope to submit that for review shortly). With that in place we were able to track down at least one occurrence of the bug: it happens when the node hosting the pod is killed before the agent can retrieve the exit code. In our case it was caused by the Amazon Auto Scaling Group trying to rebalance across multiple AZs despite active pipelines being executed on the node.

System Info

Woodpecker v2.1.1

Additional context

No response

Validations

  • Read the docs.
  • Check that there isn't already an issue that reports the same bug to avoid creating a duplicate.
  • Checked that the bug isn't fixed in the next version already [https://woodpecker-ci.org/faq#which-version-of-woodpecker-should-i-use]
@fernandrone fernandrone added the bug Something isn't working label Feb 5, 2024
@6543 6543 closed this as completed in #3331 Feb 5, 2024
6543 pushed a commit that referenced this issue Feb 5, 2024
Fixes #3330

This adds error handling on the agent's WaitStep function, on two
sections where it could encounter a `panic: runtime error: invalid
memory address or nil pointer dereference` in case it could no longer
access complete information about a specific pod.

This error was found to happen if the node in which the pod was running
was terminated during the step's execution.

Now, instead of a panic in the agent's logs and undefined behavior in
the UI, the UI displays a more helpful error message.

### Additional context

We observed the bug first on v2.1.1, but tested the fix internally on
top of 2.3.0.


![image](https://github.com/woodpecker-ci/woodpecker/assets/7269710/dfbcf089-85f7-4b5d-8102-f21af95c5cda)