
EphemeralRunners are stuck in failed state after the job succeeds #4136

@Dawnflash

Description


Controller Version

0.12.0

Deployment Method

ArgoCD

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes.

To Reproduce

Not consistently reproducible; there is a small chance of it happening whenever a job succeeds.
Just run some workload, then check that no EphemeralRunners are stuck in a failed state.

Describe the bug

After the workload succeeds, the controller marks the EphemeralRunner as failed and doesn't create a new pod; the runner just hangs in a failed state until manually removed. There is always exactly one failure with a timestamp: "status": { "failures": {"<uuid>": "<timestamp>"}}.

The runner probably lingers in the GitHub API a bit longer after the pod dies, and the controller treats that as a failure: it calls deletePodAsFailed, which is what's visible in the log excerpt.

After that it goes into backoff but is never processed again. Once the backoff period elapses there are no further logs referencing the EphemeralRunner, and it remains stuck and unmanaged.
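A quick way to spot the affected objects (a sketch, assuming jq is available and the runners live in a gha-runners namespace; adjust names for your setup):

# List EphemeralRunners that have at least one entry in .status.failures
kubectl -n gha-runners get ephemeralrunners -o json \
  | jq -r '.items[] | select((.status.failures // {}) | length > 0) | .metadata.name'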

For now we are removing these orphans periodically, but they seem to negatively impact CI job startup times.

The runners are eventually removed from the GitHub API (I checked manually and they were no longer present in GitHub), yet the EphemeralRunner resources remain stuck.
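To cross-check the GitHub side, the organization runners endpoint can be queried (a sketch assuming the gh CLI is authenticated with permission to list org runners; <your-org> is a placeholder):

# Names of runners currently registered in GitHub for the org; a stuck
# EphemeralRunner whose name is absent here is gone from GitHub but not from the cluster.
gh api --paginate /orgs/<your-org>/actions/runners --jq '.runners[].name'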

Describe the expected behavior

The EphemeralRunner should be cleanly removed once GitHub releases it. The controller should keep reconciling after the backoff period elapses instead of silently giving up.

Additional Context

Using a simple runner scale set with DinD mode and a cloud GitHub organization installation (via a GitHub App).

Controller Logs

https://gist.github.com/Dawnflash/0a3fc1da0f99dfe67fc17b6987821a53

Runner Pod Logs

I don't have those, but the jobs succeed normally. Everything is green in GitHub.

Activity

Labels added on Jun 18, 2025: bug (Something isn't working), needs triage (Requires review from the maintainers), gha-runner-scale-set (Related to the gha-runner-scale-set mode)
github-actions (Contributor) commented on Jun 18, 2025

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

kaplan-shaked commented on Jun 19, 2025

We are suffering from the same issue

kyrylomiro commented on Jun 19, 2025

@nikola-jokic I think this is connected to the change from PR #4059. We are seeing the same thing: the pod gets deleted, but the EphemeralRunner object stays in the Running state, like this:

Status:
  Failures:
    9b0c5e46-bf2c-41cc-90f2-6fcc4f57f599:  2025-06-19T11:29:42Z
  Job Repository Name:                     xxxxshared-workflows-github-actions
  Job Workflow Ref:                        xxxxx.yaml@refs/heads/main
  Phase:                                   Running
  Ready:                                   false
  Runner Id:                               15718317
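
For reference, a fleet-wide view of the same fields can be pulled with custom columns (a sketch; the gha-runners namespace is assumed, adjust to your install):

kubectl -n gha-runners get ephemeralrunners \
  -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,READY:.status.ready,FAILURES:.status.failures'
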
nikola-jokic (Collaborator) commented on Jun 19, 2025

Hey, this might not be related to the actual controller change. Looking at the log, we see that the ephemeral runner finishes but still exists. That shouldn't happen: after the ephemeral runner is done executing the job, it should self-delete. Therefore, the issue might be on the back-end side. Can you please share the workflow run URL?

nikola-jokic (Collaborator) commented on Jun 19, 2025

Hey, is anyone running ARC with a version older than 0.12.0 and experiencing this?

kyrylomiro commented on Jun 19, 2025

@nikola-jokic yes, I can share the workflow URL, but first I want to show you how bad the situation is. This is all of our runners right now: https://gist.github.com/kyrylomiro/64c559e7d3608fd459443f4a25328c12. All the ones with errors actually don't have pods, but their state keeps saying it's running, and our scheduling time is now reaching n minutes. I'm trying to find the workflow URL now.

kyrylomiro commented on Jun 19, 2025

@nikola-jokic this is the URL and the exact job that caused the runner to end up as m-runner-hvlmr-runner-4x2cs Running map[53171800-7476-4af6-8a7b-00f286b15671:2025-06-19T15:13:22Z]

kyrylomiro commented on Jun 19, 2025

@nikola-jokic the run is this one, where GitHub shows that the workflow is still running but the runner is already gone.

nimjor commented on Jun 19, 2025

I can corroborate this issue; I was going to open it myself but hadn't gotten a chance to yet. The EphemeralRunners exist indefinitely with .status.phase: Running. I'll share the CronJob setup I added to buy time to continue investigating without blocking our users' job startups:

zombie-runner-cleanup.yaml

(obviously change the namespace as needed for your env)

apiVersion: v1
kind: ServiceAccount
metadata:
  name: zombie-runner-cleanup
  namespace: gha-runners
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: zombie-runner-cleanup
  namespace: gha-runners
rules:
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - get
      - list
  - apiGroups:
      - actions.github.com
    resources:
      - ephemeralrunners
    verbs:
      - get
      - list
      - delete
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: zombie-runner-cleanup
  namespace: gha-runners
subjects:
  - kind: ServiceAccount
    name: zombie-runner-cleanup
    namespace: gha-runners
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: zombie-runner-cleanup
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: zombie-runner-cleanup
  namespace: gha-runners
spec:
  schedule: "*/10 * * * *" # Every 10 minutes
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: zombie-runner-cleanup
          containers:
            - name: zombie-runner-cleanup
              image: alpine:3
              command:
                - /bin/sh
                - -c
                - |
                  set -e
                  # Install CLI dependencies at runtime (alpine base image)
                  apk add --no-cache kubectl yq

                  echo ""
                  echo "Starting cleanup task..."

                  PODS_FILE="/tmp/pods.txt"
                  RUNNERS_FILE="/tmp/runners.txt"
                  DIFF_FILE="/tmp/runners_diff.txt"
                  NS="gha-runners"
                  SELECTOR="app.kubernetes.io/part-of=gha-runner-scale-set"

                  # Write both lists pre-sorted so comm(1) works under plain /bin/sh
                  # (process substitution is a bash feature and is unavailable in BusyBox ash)
                  kubectl -n $NS get pods -l $SELECTOR -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | sort > $PODS_FILE
                  kubectl -n $NS get ephemeralrunners -o yaml | yq '.items[] | select(.status.phase == "Running") | .metadata.name' | sort > $RUNNERS_FILE

                  ## Subtract pods from running EphemeralRunners to find runners that no longer have a pod
                  comm -13 $PODS_FILE $RUNNERS_FILE > $DIFF_FILE

                  echo "Runner pods: $(wc -l $PODS_FILE | awk -F' ' '{print $1}')"
                  echo "Ephemeral runners: $(wc -l $RUNNERS_FILE | awk -F' ' '{print $1}')"
                  echo "Found $(wc -l $DIFF_FILE | awk -F' ' '{print $1}') ephemeral runners without pods"

                  for runner in $(cat $DIFF_FILE); do 
                      kubectl -n $NS delete ephemeralrunner $runner
                  done
                  rm $PODS_FILE $RUNNERS_FILE $DIFF_FILE

                  echo "Done."

          restartPolicy: OnFailure
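
To deploy the manifest above and kick off a run immediately instead of waiting for the schedule (a sketch using standard kubectl; the file and resource names match the manifest above):

# Apply the cleanup resources, then trigger a one-off run from the CronJob
kubectl apply -f zombie-runner-cleanup.yaml
kubectl -n gha-runners create job --from=cronjob/zombie-runner-cleanup zombie-runner-cleanup-manual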

This issue was not happening to our runners on 0.11.0. It is not intermittent in the sense that the overall issue comes and goes by the day; it always affects some percentage of our runners. It IS intermittent in the sense that it seems to hit jobs at random, with no discernible difference between affected and unaffected ones. It affects roughly 2-20 jobs an hour for us. If we don't clean the stuck runners up, the controller seems to count them toward the current scale, so it doesn't think it needs to add more runners to meet demand, hence the growing job queue.
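A quick spot check for that symptom (a sketch; the namespace and label selector are the ones used in the CronJob script above):

NS=gha-runners
SELECTOR="app.kubernetes.io/part-of=gha-runner-scale-set"
# EphemeralRunners the controller believes are Running vs. runner pods that actually exist
echo "Running EphemeralRunners: $(kubectl -n $NS get ephemeralrunners -o yaml | yq '[.items[] | select(.status.phase == "Running")] | length')"
echo "Runner pods:              $(kubectl -n $NS get pods -l $SELECTOR --no-headers | wc -l)"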

nimjor commented on Jun 20, 2025

@nikola-jokic I don't know if there are others still on 0.11.0 experiencing this same issue, but I doubt it, given that the issue started immediately after we upgraded from 0.11.0 to 0.12.0. The disappointing part is that we eagerly upgraded because of #3685, only to get hit with this arguably worse bug.

andresrsanchez commented on Jun 20, 2025

@nimjor we upgraded from 0.9.3, also because of that bug; now I don't know which one I prefer, lol.
I can confirm that with your script we didn't have issues this morning, let's see.
Thanks!

22 remaining items
