
EphemeralRunners are stuck in failed state after the job succeeds #4136

@Dawnflash

Description


Controller Version

0.12.0

Deployment Method

ArgoCD

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes.

To Reproduce

Not consistently reproducible; there is a small chance of it happening whenever a job succeeds.
Just run some workload, then check that no EphemeralRunners are stuck in a failed state.

Describe the bug

After the workload succeeds, the controller marks the EphemeralRunner as failed and doesn't create a new pod; the runner just hangs in a failed state until manually removed. There is always exactly one failure with a timestamp: "status": { "failures": {"<uuid>": "<timestamp>"}}.

The runner probably lingers in the GitHub API a bit longer after the pod dies, and the controller treats that as a failure: it calls deletePodAsFailed, which is what's visible in the log excerpt.

After that it goes into backoff but is never processed again. Once the backoff period elapses there are no further logs referencing the EphemeralRunner, and it remains stuck and unmanaged.
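A quick way to spot the affected objects (a sketch, assuming jq is available and the runners live in a gha-runners namespace; adjust names for your setup):

# List EphemeralRunners that have at least one entry in .status.failures
kubectl -n gha-runners get ephemeralrunners -o json \
  | jq -r '.items[] | select((.status.failures // {}) | length > 0) | .metadata.name'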

For now we are removing these orphans periodically, but they seem to negatively impact CI job startup times.

The runners are eventually removed from the GitHub API (I checked manually and they were no longer present in GitHub), yet the EphemeralRunner resources remain stuck.
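To cross-check the GitHub side, the organization runners endpoint can be queried (a sketch assuming the gh CLI is authenticated with permission to list org runners; <your-org> is a placeholder):

# Names of runners currently registered in GitHub for the org; a stuck
# EphemeralRunner whose name is absent here is gone from GitHub but not from the cluster.
gh api --paginate /orgs/<your-org>/actions/runners --jq '.runners[].name'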

Describe the expected behavior

The EphemeralRunner should be cleanly removed once GitHub releases it. The controller should keep reconciling after the backoff period elapses instead of silently giving up.

Additional Context

Using a simple runner scale set with DinD mode and a cloud GitHub organization installation (via a GitHub App).

Controller Logs

https://gist.github.com/Dawnflash/0a3fc1da0f99dfe67fc17b6987821a53

Runner Pod Logs

I don't have those, but the jobs succeed normally. Everything is green in GitHub.

Activity

Labels added on Jun 18, 2025: bug (Something isn't working), needs triage (Requires review from the maintainers), gha-runner-scale-set (Related to the gha-runner-scale-set mode)
github-actions (Contributor) commented on Jun 18, 2025

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

kaplan-shaked commented on Jun 19, 2025

We are suffering from the same issue

kyrylomiro commented on Jun 19, 2025

@nikola-jokic I think this is connected to the change from PR #4059. We are seeing the same thing: the pod gets deleted, but the EphemeralRunner object stays in the Running state, like this:

Status:
  Failures:
    9b0c5e46-bf2c-41cc-90f2-6fcc4f57f599:  2025-06-19T11:29:42Z
  Job Repository Name:                     xxxxshared-workflows-github-actions
  Job Workflow Ref:                        xxxxx.yaml@refs/heads/main
  Phase:                                   Running
  Ready:                                   false
  Runner Id:                               15718317
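
For reference, a fleet-wide view of the same fields can be pulled with custom columns (a sketch; the gha-runners namespace is assumed, adjust to your install):

kubectl -n gha-runners get ephemeralrunners \
  -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,READY:.status.ready,FAILURES:.status.failures'
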
nikola-jokic (Collaborator) commented on Jun 19, 2025

Hey, this might not be related to the actual controller change. Looking at the log, we see that the ephemeral runner finishes but still exists. That shouldn't happen: after the ephemeral runner is done executing the job, it should self-delete. Therefore, the issue might be on the back-end side. Can you please share the workflow run URL?

nikola-jokic (Collaborator) commented on Jun 19, 2025

Hey, is anyone running ARC with a version older than 0.12.0 and experiencing this?

kyrylomiro commented on Jun 19, 2025

@nikola-jokic yes, I can share the workflow URL, but first I want to show you how bad the situation is. This is all of our runners right now: https://gist.github.com/kyrylomiro/64c559e7d3608fd459443f4a25328c12. All the ones with errors actually don't have pods, but their state keeps saying it's running, and our scheduling time is now reaching n minutes. I'm trying to find the workflow URL now.

kyrylomiro commented on Jun 19, 2025

@nikola-jokic this is the URL and the exact job that caused the runner to end up as m-runner-hvlmr-runner-4x2cs Running map[53171800-7476-4af6-8a7b-00f286b15671:2025-06-19T15:13:22Z]

kyrylomiro commented on Jun 19, 2025

@nikola-jokic the run is this one, where GitHub shows that the workflow is still running but the runner is already gone.

nimjor commented on Jun 19, 2025

I can corroborate this issue; I was going to open it myself but hadn't gotten a chance to yet. The EphemeralRunners exist indefinitely with .status.phase: Running. I'll share the CronJob setup I added to buy time to continue investigating without blocking our users' job startups:

zombie-runner-cleanup.yaml

(obviously change the namespace as needed for your env)

apiVersion: v1
kind: ServiceAccount
metadata:
  name: zombie-runner-cleanup
  namespace: gha-runners
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: zombie-runner-cleanup
  namespace: gha-runners
rules:
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - get
      - list
  - apiGroups:
      - actions.github.com
    resources:
      - ephemeralrunners
    verbs:
      - get
      - list
      - delete
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: zombie-runner-cleanup
  namespace: gha-runners
subjects:
  - kind: ServiceAccount
    name: zombie-runner-cleanup
    namespace: gha-runners
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: zombie-runner-cleanup
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: zombie-runner-cleanup
  namespace: gha-runners
spec:
  schedule: "*/10 * * * *" # Every 10 minutes
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: zombie-runner-cleanup
          containers:
            - name: zombie-runner-cleanup
              image: alpine:3
              command:
                - /bin/sh
                - -c
                - |
                  set -e
                  # Install CLI dependencies at runtime (alpine base image)
                  apk add --no-cache kubectl yq

                  echo ""
                  echo "Starting cleanup task..."

                  PODS_FILE="/tmp/pods.txt"
                  RUNNERS_FILE="/tmp/runners.txt"
                  DIFF_FILE="/tmp/runners_diff.txt"
                  NS="gha-runners"
                  SELECTOR="app.kubernetes.io/part-of=gha-runner-scale-set"

                  # Write both lists pre-sorted so comm(1) works under plain /bin/sh
                  # (process substitution is a bash feature and is unavailable in BusyBox ash)
                  kubectl -n $NS get pods -l $SELECTOR -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | sort > $PODS_FILE
                  kubectl -n $NS get ephemeralrunners -o yaml | yq '.items[] | select(.status.phase == "Running") | .metadata.name' | sort > $RUNNERS_FILE

                  ## Subtract pods from running EphemeralRunners to find runners that no longer have a pod
                  comm -13 $PODS_FILE $RUNNERS_FILE > $DIFF_FILE

                  echo "Runner pods: $(wc -l $PODS_FILE | awk -F' ' '{print $1}')"
                  echo "Ephemeral runners: $(wc -l $RUNNERS_FILE | awk -F' ' '{print $1}')"
                  echo "Found $(wc -l $DIFF_FILE | awk -F' ' '{print $1}') ephemeral runners without pods"

                  for runner in $(cat $DIFF_FILE); do 
                      kubectl -n $NS delete ephemeralrunner $runner
                  done
                  rm $PODS_FILE $RUNNERS_FILE $DIFF_FILE

                  echo "Done."

          restartPolicy: OnFailure
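
To deploy the manifest above and kick off a run immediately instead of waiting for the schedule (a sketch using standard kubectl; the file and resource names match the manifest above):

# Apply the cleanup resources, then trigger a one-off run from the CronJob
kubectl apply -f zombie-runner-cleanup.yaml
kubectl -n gha-runners create job --from=cronjob/zombie-runner-cleanup zombie-runner-cleanup-manual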

This issue was not happening to our runners on 0.11.0. It is not intermittent in the sense that the overall issue comes and goes by the day; it always affects some percentage of our runners. It IS intermittent in the sense that it seems to hit jobs at random, with no discernible difference between affected and unaffected ones. It affects roughly 2-20 jobs an hour for us. If we don't clean the stuck runners up, the controller seems to count them toward the current scale, so it doesn't think it needs to add more runners to meet demand, hence the growing job queue.
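A quick spot check for that symptom (a sketch; the namespace and label selector are the ones used in the CronJob script above):

NS=gha-runners
SELECTOR="app.kubernetes.io/part-of=gha-runner-scale-set"
# EphemeralRunners the controller believes are Running vs. runner pods that actually exist
echo "Running EphemeralRunners: $(kubectl -n $NS get ephemeralrunners -o yaml | yq '[.items[] | select(.status.phase == "Running")] | length')"
echo "Runner pods:              $(kubectl -n $NS get pods -l $SELECTOR --no-headers | wc -l)"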

nimjor commented on Jun 20, 2025

@nikola-jokic I don't know if there are others still on 0.11.0 experiencing this same issue, but I doubt it, given that the issue started immediately after we upgraded from 0.11.0 to 0.12.0. The disappointing part is that we eagerly upgraded because of #3685, only to get hit with this arguably worse bug.

andresrsanchez commented on Jun 20, 2025

@nimjor we upgraded from 0.9.3, also because of that bug; now I don't know which one I prefer, lol.
I can confirm that with your script we didn't have issues this morning, let's see.
Thanks!

22 remaining items
