(Kubernetes backend) terminated node causes runtime error when handling step exit code #3330
Closed
3 tasks done
Labels
bug
Something isn't working
Component
agent
Describe the bug
We were running Woodpecker v2.1.1 with the Kubernetes backend on a multi-node cluster on AWS.
We've seen a few `panic: runtime error` logs in our agent.
Tracking it down to https://github.com/woodpecker-ci/woodpecker/blob/v2.1.1/pipeline/backend/kubernetes/kubernetes.go#L251, it's likely that either `ContainerStatuses` is empty or `Terminated` is nil, both of which would cause a panic.
Simply adding error handling for either case seems like a viable option here, which is what I did internally (and I hope to submit that for review shortly). This way we were able to track down at least one occurrence of the bug: the node hosting the pod is killed before the agent can retrieve the exit code. In our case it was caused by the Amazon Auto Scaling Group rebalancing across multiple AZs despite active pipelines running on the node.
System Info
Additional context
No response
Validations
next
version already [https://woodpecker-ci.org/faq#which-version-of-woodpecker-should-i-use]