Checks
- I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- I am using charts that are officially provided
Controller Version
0.12.0
Deployment Method
ArgoCD
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions).
- I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
Not consistently reproducible; there's a small chance of it happening whenever a job succeeds.
Just run some workload, then check that no EphemeralRunners are stuck in a failed state.
Describe the bug
After the workload succeeds, the controller marks the EphemeralRunner as failed and doesn't create a new pod; the resource just hangs in a failed state until it is manually removed. There is always exactly one failure with a timestamp: "status": { "failures": {"<uuid>": "<timestamp>"} }.
The runner probably lingers in the GitHub API a bit longer after the pod dies, and the controller treats that as a failure: it calls deletePodAsFailed, which is what's visible in the log excerpt.
After that the EphemeralRunner goes into backoff, but it is never processed again. Once the backoff period elapses there are no further logs referencing the EphemeralRunner, and it remains stuck and unmanaged.
For now we are removing these orphans periodically, but they seem to negatively impact CI job startup times.
The runners are eventually removed from the GitHub API; I checked manually and they were no longer present in GitHub. Yet the EphemeralRunner resources remain stuck.
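A quick way to list the stuck resources is shown in the sketch below; it assumes the runners live in an arc-runners namespace and that jq is available, so adjust it to your environment.

```bash
# List EphemeralRunners that have recorded failures, with their phase and failure keys.
# The "arc-runners" namespace is an assumption; adjust as needed.
kubectl get ephemeralrunners.actions.github.com -n arc-runners -o json \
  | jq -r '.items[]
      | select((.status.failures // {}) | length > 0)
      | [.metadata.name, (.status.phase // ""), ((.status.failures // {}) | keys | join(","))]
      | @tsv'
```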
Describe the expected behavior
The EphemeralRunner is cleanly removed once GitHub releases it. The controller should keep reconciling it after the backoff period elapses instead of silently giving up on it.
Additional Context
Using a simple runner scale set in DinD mode with a cloud GitHub organization installation (via a GitHub App).
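For reference, a minimal sketch of that kind of setup is below; the release name, namespace, and secret name are illustrative, and the secret is assumed to already hold the GitHub App credentials.

```bash
# Minimal sketch of a DinD-mode runner scale set install (names are illustrative).
# "github-app-secret" is assumed to be a pre-created secret with the GitHub App credentials.
helm install arc-runner-set \
  --namespace arc-runners --create-namespace \
  --set githubConfigUrl="https://github.com/<your-org>" \
  --set githubConfigSecret=github-app-secret \
  --set containerMode.type=dind \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --version 0.12.0
```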
Controller Logs
https://gist.github.com/Dawnflash/0a3fc1da0f99dfe67fc17b6987821a53
Runner Pod Logs
Don't have those, but the jobs succeed normally. All green in GitHub.
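If this recurs, one way to capture a runner pod's logs before the controller deletes the pod might be the following; the namespace and the runner container name are assumptions based on the chart defaults.

```bash
# Stream a runner pod's logs to a file before the pod goes away.
# The "arc-runners" namespace and the "runner" container name are assumed defaults.
kubectl logs -n arc-runners <runner-pod-name> -c runner -f > runner-pod.log
```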
github-actions commented on Jun 18, 2025
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
kaplan-shaked commented on Jun 19, 2025
We are suffering from the same issue
kyrylomiro commented on Jun 19, 2025
@nikola-jokic I think this is connected to the change from this PR: #4059. We are seeing the same thing: the pod gets deleted, but the EphemeralRunner object stays in the Running state.
nikola-jokic commented on Jun 19, 2025
Hey, this might not be related to the actual controller change. Looking at the log, we see that the ephemeral runner finishes, but it still exists. That shouldn't happen. After the ephemeral runner is done executing the job, it should self-delete. Therefore, the issue might be on the back-end side. Can you please share the workflow run URL?
nikola-jokic commented on Jun 19, 2025
Hey, is anyone running ARC with a version older than 0.12.0 and experiencing this?
kyrylomiro commented on Jun 19, 2025
@nikola-jokic Yes, I can share the workflow URL, but first I want to show you how bad the situation is. These are all our runners right now: https://gist.github.com/kyrylomiro/64c559e7d3608fd459443f4a25328c12. All of the ones with errors don't actually have pods, but their state keeps saying they're running, and our scheduling time is now reaching n minutes. Trying to find the workflow URL now.
kyrylomiro commented on Jun 19, 2025
@nikola-jokic This is the URL and the exact job that caused the runner to end up in this state:
m-runner-hvlmr-runner-4x2cs Running map[53171800-7476-4af6-8a7b-00f286b15671:2025-06-19T15:13:22Z]
kyrylomiro commented on Jun 19, 2025
@nikola-jokic The run is this one, where GitHub shows the workflow as still running but the runner is already gone.
nimjor commented on Jun 19, 2025
I can corroborate this issue; I was going to open it myself but didn't get the chance yet. The EphemeralRunners exist indefinitely with .status.phase: Running. I'll share the CronJob setup I added to buy myself time to continue investigating without blocking our users' job startups: zombie-runner-cleanup.yaml
(obviously change the namespace as needed for your env)
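A minimal sketch of the kind of cleanup such a CronJob could run is below. This is not the attached file; the namespace, the jq dependency, and the assumption that the runner pod shares its EphemeralRunner's name are all illustrative.

```bash
#!/usr/bin/env bash
# Illustrative zombie-runner cleanup, not the attached zombie-runner-cleanup.yaml.
# Assumes the "arc-runners" namespace, jq, and that the runner pod is named
# after its EphemeralRunner resource.
set -euo pipefail
NAMESPACE="arc-runners"

stuck=$(kubectl get ephemeralrunners.actions.github.com -n "$NAMESPACE" -o json \
  | jq -r '.items[] | select((.status.failures // {}) | length > 0) | .metadata.name')

for er in $stuck; do
  # Only delete runners whose backing pod is already gone.
  if ! kubectl get pod -n "$NAMESPACE" "$er" >/dev/null 2>&1; then
    echo "Deleting stuck EphemeralRunner ${er}"
    kubectl delete ephemeralrunners.actions.github.com -n "$NAMESPACE" "$er" --wait=false
  fi
done
```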
This issue was not happening to our runners on 0.11.0. It is not intermittent in the sense that the overall issue comes and goes by the day; it is always affecting some percentage of our runners. It IS intermittent in the sense that it seems to hit jobs randomly, with no discernible difference, affecting roughly 2-20 jobs an hour for us. If we don't clean them up, the controller seems to count those runners as part of the current scale, so it doesn't think it needs to add more runners to meet demand, hence the growing job queue.
nimjor commented on Jun 20, 2025
@nikola-jokic I don't know if anyone still on 0.11.0 is experiencing this same issue, but I doubt it, given that it started immediately after we upgraded from 0.11.0 to 0.12.0. The disappointing part is that we eagerly upgraded because of #3685, only to get hit with this arguably worse bug.
andresrsanchez commented on Jun 20, 2025
@nimjor We also upgraded from 0.9.3 because of that bug; now I don't know which one I prefer, lol.
I can confirm that with your script we didn't have issues this morning; let's see.
Thanks!