EphemeralRunner left stuck Running after node drain/pod termination #4148

Open
@tyrken

Description

Controller Version

0.12.0

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Start a long-running GHA job.
2. Run `kubectl drain <node-name>` on the EKS node running the pod for the allocated EphemeralRunner. (Directly deleting the runner pod with `kubectl delete pod <pod-name>` has the same effect, but isn't what we normally do or experience.)
3. Observe that the Runner disappears from the GHE list of active runners.
4. Observe that the EphemeralRunner in K8s stays in the `Running` state forever (see the command sketch below).
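One way to surface the stuck state from step 4, using the same status fields the workaround script below relies on (the namespace is from our install and may differ for you):

kubectl get ephemeralrunners -n gha-runner-scale-set \
  -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,READY:.status.ready,JOB_REPO:.status.jobRepositoryName'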

Describe the bug

While the Runner (as recorded in the GitHub Actions list of org-attached Runners on the Settings page) goes away, the EphemeralRunner stays allocated forever.

This leads the AutoscalingRunnerSet to think it doesn't need to scale up any further, and we observe long wait times for new Runners to be allocated to Jobs. Until this is fixed we have to delete the stuck EphemeralRunners manually with a script like the one below:

#!/usr/bin/env bash

set -euo pipefail

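# Find EphemeralRunners that report Running but are no longer ready while still
# holding a job assignment - the signature of the stuck state described above.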
STUCK_RUNNERS=$(kubectl get ephemeralrunners -n gha-runner-scale-set -o json \
  | jq -r '.items[] | select(.status.phase == "Running" and .status.ready == false and .status.jobRepositoryName != null) | .metadata.name' \
  | tr '\n' ' ')

if [ -z "$STUCK_RUNNERS" ]; then
  echo "No stuck EphemeralRunners."
  exit 0
fi

echo "Deleting: $STUCK_RUNNERS"
kubectl delete ephemeralrunners -n gha-runner-scale-set $STUCK_RUNNERS
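
Until there is a proper fix this cleanup has to run repeatedly; a minimal sketch of a polling wrapper (the five-minute interval and script name are arbitrary placeholders):

# Re-run the cleanup script every 5 minutes; interval and path are placeholders.
while true; do
  ./delete-stuck-ephemeralrunners.sh
  sleep 300
done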

Describe the expected behavior

For ARC to at least notice that the Runner has disappeared and delete the stuck EphemeralRunner automatically.

The best solution would be for ARC to resubmit the job for a re-run when it sees this condition, or at least to emit a specific K8s event so that we could easily build such automation on top via a custom watcher (sketched below).
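
A minimal sketch of what such a watcher could look like, assuming ARC emitted a dedicated event; the `RunnerDisappeared` reason is hypothetical (no such event exists today) and the namespace is from our install:

#!/usr/bin/env bash
# Hypothetical watcher: reacts to a dedicated event (the "RunnerDisappeared"
# reason is made up for illustration; ARC does not emit it today) and deletes
# the EphemeralRunner the event points at.
set -euo pipefail

NAMESPACE=gha-runner-scale-set

kubectl get events -n "$NAMESPACE" --watch -o json \
  | jq -r --unbuffered 'select(.reason == "RunnerDisappeared" and .involvedObject.kind == "EphemeralRunner") | .involvedObject.name' \
  | while read -r runner; do
      echo "Deleting stuck EphemeralRunner: $runner"
      kubectl delete ephemeralrunner -n "$NAMESPACE" "$runner"
    done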

Additional Context

Here's the redacted YAML for the ER itself, complete with status: https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-x86-64-h8696-runner-jzgnk-yaml

Controller Logs

See https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-controller-logs-txt - the node was drained around 19:11 UTC.

See also the listener logs, in case they are of interest: https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-listener-logs-txt

Runner Pod Logs

See https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-runner-logs-tsv

(Note: the runner pod logs were copied from our logging server, as the pod itself is deleted while reproducing the bug.)

Labels

bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode), needs triage (Requires review from the maintainers)