Ephemeral Runner failure limit hardcoded to 5 causing issues #4102

Closed
@rfinnie-epic

Description

Controller Version

0.11.0

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Deploy a runner scale set
2. Delete the runner pod 5 times; the time between deletes does not matter (a small client-go sketch that automates this is included after this list)
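
For illustration only, a loop like the following (using client-go) can simulate the repeated evictions. The namespace and label selector here are assumptions, not the exact values from my setup:

```go
// Sketch only: repeatedly delete the runner pods to simulate Karpenter
// consolidation. Namespace and label selector are assumptions.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	namespace := "arc-runners"                      // assumption
	selector := "actions.github.com/scale-set-name" // assumption: "label exists" selector

	ctx := context.Background()
	for round := 1; round <= 6; round++ {
		pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			log.Fatal(err)
		}
		for _, pod := range pods.Items {
			fmt.Printf("round %d: deleting %s\n", round, pod.Name)
			if err := clientset.CoreV1().Pods(namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
				log.Fatal(err)
			}
		}
		// Give the controller time to recreate the pod before the next round.
		time.Sleep(2 * time.Minute)
	}
}
```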

Describe the bug

Hi, I've got runner scale sets deployed on a Kubernetes cluster backed by Karpenter, which regularly consolidates nodes and moves pods around.

Now, normally this isn't a problem. My runner images have preStop hooks that are pretty tolerant of the pod being deleted -- if a run is in progress, the hook blocks until the run is done (if there is no run in progress, it acquiesces immediately), everything gets cleaned up, the container exits 0, and so on. The controller simply deploys the pod again, the runner agent re-registers itself, and all is good.
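
To make that concrete, the preStop behaviour is roughly of this shape. This is a simplified sketch of the idea, not the actual hook from my images, and the job-in-progress marker file is a hypothetical mechanism:

```go
// Simplified sketch of the preStop behaviour described above. The marker file
// is a hypothetical signal that the runner entrypoint would maintain while a
// job is running.
package main

import (
	"fmt"
	"os"
	"time"
)

const jobMarker = "/home/runner/.job-in-progress" // hypothetical marker path

func main() {
	for {
		if _, err := os.Stat(jobMarker); os.IsNotExist(err) {
			// No run in progress: acquiesce immediately and let the pod go.
			fmt.Println("no job in progress, allowing termination")
			os.Exit(0) // clean exit -- yet it still counts as a "failure"
		}
		fmt.Println("job in progress, holding off termination")
		time.Sleep(5 * time.Second)
	}
}
```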

However, on scale sets that aren't used very frequently, I'll occasionally find Failed ephemeralrunners with TooManyPodFailures ("Pod has failed to start more than 5 times:"). It appears this is because the controller has a simple hardcoded limit of 5 pod "failures". It isn't even checking for a non-zero exit code (my runner image exits 0 after a graceful preStop); any sort of pod exit is counted as a failure.
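
I haven't traced the exact controller source, but the observed behaviour is roughly this (type and field names below are my assumptions, not the real code):

```go
// Approximation of the observed behaviour, not the actual controller source:
// every pod termination is recorded as a "failure", and once more than 5 have
// accumulated the EphemeralRunner is marked permanently Failed, regardless of
// exit code or of how far apart the terminations were.
package main

import "fmt"

// Assumed shape: one entry per terminated pod, never expired or reset.
type ephemeralRunnerStatus struct {
	failures map[string]bool
}

const maxPodFailures = 5 // the hardcoded limit this issue is about

func shouldMarkFailed(status ephemeralRunnerStatus) (bool, string) {
	if len(status.failures) > maxPodFailures {
		return true, fmt.Sprintf("Pod has failed to start more than %d times:", maxPodFailures)
	}
	return false, ""
}

func main() {
	status := ephemeralRunnerStatus{failures: map[string]bool{}}
	for i := 1; i <= 6; i++ {
		// Even a graceful exit 0 (e.g. after my preStop hook) is recorded here.
		status.failures[fmt.Sprintf("runner-pod-%d", i)] = true
	}
	fmt.Println(shouldMarkFailed(status)) // true "Pod has failed to start more than 5 times:"
}
```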

This doesn't trigger very often, but it is a problem for less-used scale sets: an idle runner can stick around long enough for Karpenter to evict its pod enough times to hit the limit, and at that point the failure is permanent.

Suggestions for dealing with this:

  • Make the (currently hardcoded) limit of 5 configurable
  • Take backoff into account (5 exits within 5 minutes may be a problem, but 5 exits within 5 days wouldn't be); a rough sketch combining this with the previous suggestion follows this list
  • Allow the pod to return a specific exit code which indicates to the controller "everything's cool, this was expected, just re-launch the pod"
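
To illustrate the first two suggestions, something like the following would already help. The option names and the shape of the policy are made up for illustration and are not existing ARC configuration:

```go
// Hedged sketch: a configurable failure limit that only counts pod exits
// inside a sliding time window. Names (MaxFailures, FailureWindow) are
// hypothetical.
package main

import (
	"fmt"
	"time"
)

type failurePolicy struct {
	MaxFailures   int           // currently hardcoded to 5 in the controller
	FailureWindow time.Duration // 0 = no window, i.e. today's behaviour
}

// tooManyFailures reports whether the recorded pod exit timestamps exceed the
// policy: only exits within the window count toward the limit.
func tooManyFailures(p failurePolicy, exits []time.Time, now time.Time) bool {
	count := 0
	for _, t := range exits {
		if p.FailureWindow == 0 || now.Sub(t) <= p.FailureWindow {
			count++
		}
	}
	return count > p.MaxFailures
}

func main() {
	p := failurePolicy{MaxFailures: 5, FailureWindow: 5 * time.Minute}
	now := time.Now()

	// Six exits spread over five days: fine under a windowed policy.
	spread := []time.Time{
		now.Add(-5 * 24 * time.Hour), now.Add(-4 * 24 * time.Hour),
		now.Add(-3 * 24 * time.Hour), now.Add(-2 * 24 * time.Hour),
		now.Add(-24 * time.Hour), now.Add(-time.Hour),
	}
	fmt.Println(tooManyFailures(p, spread, now)) // false

	// Six exits within a couple of minutes: genuinely broken, mark it failed.
	var burst []time.Time
	for i := 0; i < 6; i++ {
		burst = append(burst, now.Add(-time.Duration(i)*20*time.Second))
	}
	fmt.Println(tooManyFailures(p, burst, now)) // true
}
```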

Describe the expected behavior

Multiple pod exits (particularly graceful, exit-0 terminations) should not cause a permanent runner failure.

Additional Context

N/A

Controller Logs

Not included, sorry. If needed, I can set up a test to provide log context, but as is, the logs would take a lot of work to redact. The issue should be reproducible on any standard install.

Runner Pod Logs

Not included, sorry. If needed, I can set up a test to provide log context, but as is, the logs would take a lot of work to redact. The issue should be reproducible on any standard install.

Labels

bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode), needs triage (Requires review from the maintainers)
