Description
Checks
- I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- I am using charts that are officially provided
Controller Version
0.11.0
Deployment Method
Helm
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions).
- I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
1. Deploy a runner scale set
2. Delete the runner pod 5 times; the time between deletes does not matter
Describe the bug
Hi, I've got runner scale sets deployed on a Kubernetes cluster backed by Karpenter, which regularly consolidates nodes and moves pods around.
Normally this isn't a problem. My runner images have `preStop` hooks that are tolerant of the pod being deleted: if a run is in progress, the hook blocks until the run is done (if there is no run in progress, it immediately acquiesces), cleans everything up, and exits 0. The controller simply deploys the pod again, the runner agent re-registers itself, and all is good.
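For context, the relevant part of my scale set values looks roughly like this; the image name, script path, and grace period below are simplified placeholders rather than my exact setup:

```yaml
template:
  spec:
    terminationGracePeriodSeconds: 3600   # placeholder: long enough for a job to finish
    containers:
      - name: runner
        image: registry.example.com/custom-runner:latest   # placeholder custom image
        command: ["/home/runner/run.sh"]
        lifecycle:
          preStop:
            exec:
              # Placeholder script baked into the image: waits for any in-progress
              # job to finish, cleans up, then exits 0.
              command: ["/bin/bash", "-c", "/home/runner/wait-then-cleanup.sh"]
```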
However, on scale sets that aren't used very often, I'll occasionally find ephemeral runners stuck in `Failed` with reason `TooManyPodFailures` ("Pod has failed to start more than 5 times:"). It appears this is because the controller has a simple hardcoded threshold of 5 pod "failures". It isn't even checking for a non-zero exit code (my runner image exits 0 after a graceful `preStop`); any pod exit at all is counted as a failure.
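For reference, the status I end up with looks roughly like this (paraphrased, not a verbatim dump from `kubectl get ephemeralrunner -o yaml`):

```yaml
# Paraphrased EphemeralRunner status; field values abbreviated.
status:
  phase: Failed
  reason: TooManyPodFailures
  message: 'Pod has failed to start more than 5 times: ...'
```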
This doesn't trigger very often, but it is a problem for less-used scale sets: an idle runner can stick around long enough for Karpenter to reschedule it five times, and at that point the failure is permanent.
Suggestions for dealing with this:
- Make the (currently hardcoded) threshold of 5 configurable (see the sketch below)
- Take timing into account (5 exits within 5 minutes may indicate a problem, but 5 exits spread over 5 days wouldn't)
- Allow the pod to return a specific exit code which indicates to the controller "everything's cool, this was expected, just re-launch the pod"
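To illustrate the first two suggestions, configuration could look something like this in the scale set values. These keys are hypothetical and do not exist in the current chart; this is only a sketch of the shape such an option might take:

```yaml
# Hypothetical values.yaml keys -- neither exists in the 0.11.0 chart.
ephemeralRunnerPodFailurePolicy:
  maxFailures: 5        # keep 5 as the default to preserve current behavior
  window: 24h           # only count pod exits within this rolling window
```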
Describe the expected behavior
Multiple pod exits should not cause a permanent runner failure
Additional Context
N/A
Controller Logs
Not included, sorry. If needed, I can set up a test to provide log context, but as is, the logs would take a lot of work to redact. The issue should be reproducible on any standard install.
Runner Pod Logs
Not included, sorry. If needed, I can set up a test to provide log context, but as is, the logs would take a lot of work to redact. The issue should be reproducible on any standard install.