Description
Checks
- I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- I am using charts that are officially provided
Controller Version
0.12.0
Deployment Method
Helm
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions).
- I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
1. Start a long-running GHA job
2. Run `kubectl drain <node-name>` on the EKS node running the pod for the allocated EphemeralRunner. (Directly deleting the runner pod with `kubectl delete pod <pod-name>` has the same effect, but that isn't how we normally hit this.)
3. Observe that the Runner disappears from the GHE list of active runners.
4. Observe that the EphemeralRunner in Kubernetes stays in the `Running` state forever (a verification sketch follows these steps).
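For reference, a minimal sketch of how we trigger and observe this, assuming the scale set is installed in the `gha-runner-scale-set` namespace (the node name is a placeholder):

```bash
# Identify the node hosting the busy runner pod (NODE column).
kubectl get pods -n gha-runner-scale-set -o wide

# Drain that node; the runner pod is evicted along with everything else.
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# The Runner disappears from the org's runner list in GitHub, but the
# EphemeralRunner resource keeps reporting phase Running indefinitely.
kubectl get ephemeralrunners -n gha-runner-scale-set
```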
Describe the bug
While the Runner (as shown in the GitHub Actions list of org-attached Runners on the Settings page) disappears, the EphemeralRunner resource stays allocated forever.
This leads the AutoscalingRunnerSet to think it doesn't need to scale up any further, and we see long wait times before new Runners are allocated to jobs. Until this is fixed we have to delete the stuck EphemeralRunners manually with a script like the one below:
```bash
#!/usr/bin/env bash
set -euo pipefail

# Find EphemeralRunners that still report phase Running but are no longer
# ready and still have a job assigned (status.jobRepositoryName set).
STUCK_RUNNERS=$(kubectl get ephemeralrunners -n gha-runner-scale-set -o json \
  | jq -r '.items[] | select(.status.phase == "Running" and .status.ready == false and .status.jobRepositoryName != null) | .metadata.name' \
  | tr '\n' ' ')

if [ -z "$STUCK_RUNNERS" ]; then
  echo "No stuck EphemeralRunners."
  exit 0
fi

echo "Deleting: $STUCK_RUNNERS"
# Intentionally unquoted so multiple runner names expand to separate arguments.
kubectl delete ephemeralrunners -n gha-runner-scale-set $STUCK_RUNNERS
```
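The filter keys on runners that are still in phase `Running` but no longer `ready` and still have `status.jobRepositoryName` set, which matches the stuck state in the EphemeralRunner YAML linked under Additional Context; adjust the namespace if your scale set is installed elsewhere.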
Describe the expected behavior
ARC should at least notice that the Runner has disappeared and delete the stuck EphemeralRunner automatically.
The best solution would be for ARC to resubmit the job for a re-run when it detects this condition, or at least emit a specific Kubernetes event so that we could easily build such automation on top via a custom watcher (sketched below).
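For illustration only, and assuming ARC emitted such an event (the reason `RunnerOffline` below is hypothetical, not something ARC produces today), a crude watcher could react to it and clean up or alert:

```bash
#!/usr/bin/env bash
# Hypothetical watcher: "RunnerOffline" is an assumed event reason that ARC
# does not currently emit; this only sketches what we would build on top of it.
kubectl get events -n gha-runner-scale-set --watch -o json \
  | jq --unbuffered -r 'select(.reason == "RunnerOffline") | .involvedObject.name' \
  | while read -r runner; do
      echo "Runner ${runner} reported offline; deleting stuck EphemeralRunner"
      kubectl delete ephemeralrunner -n gha-runner-scale-set "${runner}"
    done
```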
Additional Context
Here's the redacted YAML for the EphemeralRunner itself, complete with status: https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-x86-64-h8696-runner-jzgnk-yaml
Controller Logs
See https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-controller-logs-txt - the node was drained around 19:11 UTC.
See also the listener logs, in case they are of interest: https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-listener-logs-txt
Runner Pod Logs
See https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-runner-logs-tsv
(Note: copied from our logging server, since the runner pod is deleted during bug reproduction.)