Description
Checks
- I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- I am using charts that are officially provided
Controller Version
0.12.0
Deployment Method
ArgoCD
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions).
- I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes.
To Reproduce
Not consistently reproducible; there is a small chance of it happening whenever a job succeeds.
Just run some workload, then check whether any EphemeralRunners are stuck in a failed state.
Describe the bug
After the workload succeeds, the controller marks the EphemeralRunner as failed and does not create a new pod; the runner just hangs in a failed state until manually removed. There is always exactly one failure with a timestamp: "status": { "failures": {"<uuid>": "<timestamp>"}}.
The runner probably lingers in the GitHub API for a bit after the pod dies, so the controller treats it as a failure and calls deletePodAsFailed, which is what's visible in the log excerpt.
After that the EphemeralRunner goes into backoff, but it is never processed again. Once the backoff period elapses there are no further logs referencing the EphemeralRunner, and it remains stuck and unmanaged.
For now we are removing these orphans periodically, but they seem to negatively impact CI job startup times.
The runners are eventually removed from the GitHub API; I checked them manually and they were no longer present in GitHub. Yet the EphemeralRunners remain stuck.
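For reference, a stuck EphemeralRunner roughly looks like the sketch below when dumped with kubectl. This is illustrative only: the apiVersion, name, and namespace are my assumptions, and <uuid>/<timestamp> stand in for the single entry in status.failures.

```yaml
# Illustrative sketch of a stuck EphemeralRunner; apiVersion, name, and namespace
# are assumptions, and <uuid>/<timestamp> are placeholders for the failure entry.
apiVersion: actions.github.com/v1alpha1
kind: EphemeralRunner
metadata:
  name: <runner-scale-set>-runner-<suffix>
  namespace: <runner-namespace>
status:
  phase: Failed              # stuck phase observed here; some reports below show Running instead
  failures:
    <uuid>: "<timestamp>"    # always exactly one failure entry with a timestamp
```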
Describe the expected behavior
The EphemeralRunner is cleanly removed once GitHub releases it, and the controller keeps reconciling it after the backoff period elapses instead of silently giving up on it.
Additional Context
Using a simple runner set with DinD mode and a cloud GitHub organization installation (via a GitHub App).
Controller Logs
https://gist.github.com/Dawnflash/0a3fc1da0f99dfe67fc17b6987821a53
Runner Pod Logs
I don't have those, but the jobs succeed normally. All green in GitHub.
Activity
github-actions commented on Jun 18, 2025
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
kaplan-shaked commented on Jun 19, 2025
We are suffering from the same issue
kyrylomiro commented on Jun 19, 2025
@nikola-jokic I think this is connected to the change from PR #4059; we are seeing the same thing. The pod gets deleted, but the EphemeralRunner object stays in the Running state, like this:
nikola-jokic commented on Jun 19, 2025
Hey, this might not be related to the actual controller change. Looking at the log, we see that the ephemeral runner finishes, but it still exists. That shouldn't happen: after the ephemeral runner is done executing the job, it should self-delete. Therefore, the issue might be on the back-end side. Can you please share the workflow run URL?
nikola-jokic commented on Jun 19, 2025
Hey, is anyone running ARC with a version older than 0.12.0 and experiencing this?
kyrylomiro commented on Jun 19, 2025
@nikola-jokic yes, I can share the workflow URL, but first I want to show you how bad the situation is. This is all of our runners right now: https://gist.github.com/kyrylomiro/64c559e7d3608fd459443f4a25328c12. All of the ones with errors don't actually have pods, but the state keeps saying Running, and our scheduling time is now reaching n minutes. I'm trying to find the workflow URL now.
kyrylomiro commented on Jun 19, 2025
@nikola-jokic this is the URL and the exact job that caused the runner to end up like this:
m-runner-hvlmr-runner-4x2cs Running map[53171800-7476-4af6-8a7b-00f286b15671:2025-06-19T15:13:22Z]
kyrylomiro commented on Jun 19, 2025
@nikola-jokic the run is this one, where GitHub shows the workflow as still running but the runner is already gone.
nimjor commented on Jun 19, 2025
I can corroborate this issue; I was going to open it myself but didn't get a chance yet. The EphemeralRunners exist indefinitely with .status.phase: Running. I'll share the CronJob setup I added to buy me time to continue investigating without blocking our users' job startups: zombie-runner-cleanup.yaml (obviously change the namespace as needed for your env).
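A minimal sketch of such a cleanup CronJob, assuming the namespace arc-runners and a zombie-runner-cleanup ServiceAccount with RBAC to get, list, and delete ephemeralrunners.actions.github.com (the image and schedule are placeholders, not the exact attached manifest):

```yaml
# Hypothetical sketch of a zombie-runner cleanup CronJob, not the attached file.
# Assumes a ServiceAccount bound to a Role that can get/list/delete
# ephemeralrunners.actions.github.com in the runner namespace.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: zombie-runner-cleanup
  namespace: arc-runners                  # change for your environment
spec:
  schedule: "*/10 * * * *"                # placeholder interval
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: zombie-runner-cleanup
          restartPolicy: Never
          containers:
            - name: cleanup
              image: bitnami/kubectl:latest   # placeholder; any image with kubectl works
              command:
                - /bin/sh
                - -c
                - |
                  # Delete EphemeralRunners whose status carries failure entries,
                  # i.e. the stuck runners described in this issue.
                  for r in $(kubectl get ephemeralrunners.actions.github.com -n arc-runners \
                      -o go-template='{{range .items}}{{if .status}}{{if .status.failures}}{{.metadata.name}} {{end}}{{end}}{{end}}'); do
                    kubectl delete ephemeralrunners.actions.github.com -n arc-runners "$r"
                  done
```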
This issue was not happening to our runners on 0.11.0. It is not intermittent in the sense that the overall issue comes and goes by the day; it is always affecting some percentage of our runners. But it IS intermittent in the sense that it seems to hit jobs at random, with no discernible difference between them; it affects roughly 2-20 jobs an hour for us. If we don't clean them up, the controller seems to count the stuck runners toward the current scale, so it doesn't think it needs to add runners to meet demand, hence the growing job queue.
nimjor commented on Jun 20, 2025
@nikola-jokic I don't know if there are others still on 0.11.0 who are experiencing this same issue, but I doubt it, given that this started immediately after we upgraded from 0.11.0 to 0.12.0. The disappointing part is that we eagerly upgraded because of #3685, only to get hit with this arguably worse bug.
andresrsanchez commented on Jun 20, 2025
@nimjor we also upgraded from 0.9.3 because of that bug; now I don't know which one I prefer lol.
I can confirm that with your script we didn't have issues this morning, let's see.
Thanks!
nikola-jokic commented on Jun 24, 2025
Hey everyone, I just wanted to let you all know that we identified the problem.
The check to see whether the runner still exists within the service can sometimes return a false positive. Even though this will be fixed on the back-end, PR #4142 should also resolve the issue, since we don't need this check.
As long as the runner image is properly built (i.e. the entrypoint will return the exit code of the runner), the check we are doing right now is not necessary. Therefore, we will remove it.
shivansh-ptr commented on Jun 24, 2025
We are running into a similar issue to the one @kyrylomiro mentioned, but for me the pod is also still running even though the workflow has completed, and this is causing high queue times for workflows.
Controller Version: 0.11.0
nimjor commented on Jun 24, 2025
@shivansh-ptr that sounds like a separate type of problem and belongs in a separate issue
nikola-jokic commented on Jun 24, 2025
Hey @shivansh-ptr,
That is exactly the root of the problem. After the workflow is done, we check whether the runner still exists. Since it does (in this case), we mark the ephemeral runner as failed, which creates this entry in Failures. It would then start the crash loop (since at that point the runner registration is invalid) and would cause the ephemeral runner to reach the failed state.
mgs-garcia commented on Jun 24, 2025
Hi, I'm using both the controller and the gha-runner-scale-set chart at version 0.12.0 and I'm experiencing something very similar, with the difference that my workflow never gets to run even once. From the previous comments, my understanding is that the issue affects subsequent runs after at least one successful execution. In my case the EphemeralRunner stays in Pending status, and if I describe it I get the same failure as shown above:
"status": { "failures": {"<uuid>": "<timestamp>"}}
Some useful excerpts of the controller's log are:
@nikola-jokic I wonder whether #4142 also fixes my situation, because from what was said before I believe it might only fix the scenario for certain runners (i.e. runner ID N where N>1, but not N=1).
Thanks!
avadhanij commented on Jun 25, 2025
Any timeline on when the fix for this will be rolled out as part of a new release?
nikola-jokic commented on Jun 26, 2025
Hey everyone, just to let you all know, we are targeting Monday for the next patch release that will include this fix.
tyrken commented on Jun 26, 2025
FYI, a similar bug we still have with 0.12.0 that others might be experiencing, but which isn't quite the same (our stuck runners stay forever in the Running state with failures in status): #4148
nikola-jokic commented on Jun 27, 2025
Hey everyone, we decided to publish a new release today! 0.12.1 is out! 😄
Tal-E commented on Jul 7, 2025
@nikola-jokic Hi, I still face this issue even on version 0.12.1