Closed
Description
Checks
- I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- I am using charts that are officially provided
Controller Version
0.12.0
Deployment Method
ArgoCD
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions).
- I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
Happens randomly after a few days
Describe the bug
A few days after a fresh install, the AutoscalingRunnerSet reports incorrect stats and builds get stuck.
For example:
No EphemeralRunners are in a failed state, but some EphemeralRunners have been in the running state for a few hours without a runner pod.
I cancelled one workflow and triggered a new run, but no runner was created.
The AutoscalingRunnerSet stats:
status:
  currentRunners: 4
  pendingEphemeralRunners: 0
  runningEphemeralRunners: 4
In summary: when I run a new job, it never gets picked up, and the AutoscalingRunnerSet thinks there are running jobs when in reality there are none. If I recreate the AutoscalingRunnerSet, it starts working again.
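For anyone trying to confirm the same mismatch, here is a minimal sketch of the check described above: list the EphemeralRunners that report a running phase but have no matching pod. It rests on assumptions that are not verified in this issue: that each runner pod shares its EphemeralRunner's name, that the phase is exposed as status.phase with the value "Running", and that the inputs are the JSON lists from kubectl get ... -o json. The function name runners_without_pods is hypothetical.

```python
from __future__ import annotations


def runners_without_pods(runners: dict, pods: dict) -> list[str]:
    """Return names of EphemeralRunners marked Running that have no pod.

    Both arguments are parsed `kubectl get ... -o json` list objects.
    Assumption: a runner pod has the same name as its EphemeralRunner.
    Assumption: the runner phase is stored in status.phase as "Running".
    """
    # Collect the names of all pods that actually exist.
    pod_names = {pod["metadata"]["name"] for pod in pods.get("items", [])}

    orphaned = []
    for runner in runners.get("items", []):
        name = runner["metadata"]["name"]
        phase = runner.get("status", {}).get("phase")
        # A runner counted as running with no backing pod is the suspect state.
        if phase == "Running" and name not in pod_names:
            orphaned.append(name)
    return sorted(orphaned)
```

To use it, save the EphemeralRunner list and the pod list from the runner namespace as JSON, load both with json.load, and pass them in. If the result has as many entries as runningEphemeralRunners, the status counts runners that have no pod behind them.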
Describe the expected behavior
No job gets stuck, and the stats of the AutoscalingRunnerSet/EphemeralRunnerSet should be correct.
Additional Context
We run EKS and Karpenter.
Listener logs: https://gist.github.com/andresrsanchez/11828b134de057c3fbaf8e6bf308901c
Controller Logs
https://gist.github.com/andresrsanchez/a57261b5ba976f3f283b4dccc42e2d1c
Runner Pod Logs
No runner gets triggered