Description
Checks
- I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- I am using charts that are officially provided
Controller Version
v0.12.1
Deployment Method
ArgoCD
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions).
- I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
1. Create an AutoScalingRunnerSet targeting an arm64 runner with limited memory.
2. Trigger a GitHub Actions job that consumes a large amount of memory during startup.
3. The pod will get OOMKilled, but:
- It remains in a Running state.
- The controller does not detect the failure.
- The EphemeralRunner resource is not deleted.
4. Observe that new jobs remain stuck in the queued state due to the zombie runner.
Describe the bug
Ephemeral runner pods that are OOMKilled are not cleaned up by the controller. Although the runner is no longer functioning after the OOM kill, the pod stays in the Running state and the associated EphemeralRunner resource is not removed. This leaves zombie runners that block new job assignments, because the controller believes an active runner is still available.
(In v0.12.0 it is at least easier to detect the killed pod, because the EphemeralRunner is left without any pods.)
Describe the expected behavior
When an ephemeral runner pod is OOMKilled, the controller should detect the failure, mark the associated EphemeralRunner resource as failed, clean up the pod, and optionally create a replacement runner if one is needed. This ensures that no stale resources or zombie runners block new job assignments.
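For illustration, the check described above could look roughly like the following Python sketch. It is not the controller's actual implementation: the function names and the action strings are hypothetical, and only the `state.terminated.reason` and `lastState.terminated.reason` fields come from the standard Kubernetes pod status. It also only helps when the kubelet actually records the termination in the container status.

```python
from typing import Any, Dict, List


def oom_killed_containers(pod_status: Dict[str, Any]) -> List[str]:
    """Return the names of containers whose current or last state is an OOM kill."""
    names = []
    for cs in pod_status.get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(cs.get("name", "<unnamed>"))
                break  # one hit per container is enough
    return names


def desired_action(pod_status: Dict[str, Any]) -> str:
    """Hypothetical mapping from pod status to the cleanup this issue asks for."""
    if oom_killed_containers(pod_status):
        return "mark-failed-and-delete-pod"
    return "keep"


# Example statuses (made up for illustration, shaped like `kubectl get pod -o yaml`).
killed = {"containerStatuses": [{
    "name": "runner",
    "state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
    "lastState": {},
}]}
healthy = {"containerStatuses": [{
    "name": "runner",
    "state": {"running": {"startedAt": "2025-06-27T16:39:21Z"}},
    "lastState": {},
}]}

print(desired_action(killed))
print(desired_action(healthy))
```

A pure function over the status keeps the decision easy to test apart from any reconcile loop; whether a replacement runner is created afterwards would be a separate step.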
Additional Context
githubConfigUrl: https://github.com/<REDACTED>
controllerServiceAccount:
  namespace: arc-system
  name: arc-controller-gha-rs-controller
githubConfigSecret:
  github_app_id: <REDACTED>
  github_app_installation_id: <REDACTED>
  github_app_private_key: <REDACTED>
containerMode:
  type: "dind"
minRunners: 4
runnerGroup: "k8s"
template:
  spec:
    serviceAccountName: gha-runner
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "2"
            memory: 8Gi
          limits:
            memory: 16Gi
...
Controller Logs
The controller still reports the runner as healthy after the OOM kill:
2025-06-27T17:44:53Z INFO EphemeralRunner Ephemeral runner container is still running {"version": "0.12.1", "ephemeralrunner": {"name":"2cpu-runner-j4k8q","namespace":"arc-runners"}}
2025-06-27T17:44:53Z INFO EphemeralRunner Updating ephemeral runner status {"version": "0.12.1", "ephemeralrunner": {"name":"2cpu-runner-j4k8q","namespace":"arc-runners"}, "statusPhase": "Running", "statusReason": "", "statusMessage": "", "ready": true}
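To pull the reported status out of controller lines like the ones above, a small parser along these lines may help during triage. This is only a sketch: the regular expression assumes the whitespace-separated layout shown in this issue (timestamp, level, logger, message, JSON fields), and `parse_controller_line` and `reports_healthy` are hypothetical helper names, not part of ARC.

```python
import json
import re
from typing import Any, Dict, Optional

# timestamp, level, logger, free-text message, then a trailing JSON object
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+"
    r"(?P<msg>.*?)\s*(?P<fields>\{.*\})\s*$"
)


def parse_controller_line(line: str) -> Optional[Dict[str, Any]]:
    """Split one controller log line into its parts; None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    record: Dict[str, Any] = m.groupdict()
    record["fields"] = json.loads(record["fields"])
    return record


def reports_healthy(record: Dict[str, Any]) -> bool:
    """True when the controller logged the runner as Running and ready."""
    fields = record["fields"]
    return fields.get("statusPhase") == "Running" and fields.get("ready") is True


line = (
    '2025-06-27T17:44:53Z INFO EphemeralRunner Updating ephemeral runner status '
    '{"version": "0.12.1", "ephemeralrunner": {"name":"2cpu-runner-j4k8q",'
    '"namespace":"arc-runners"}, "statusPhase": "Running", "statusReason": "", '
    '"statusMessage": "", "ready": true}'
)
record = parse_controller_line(line)
print(record["fields"]["ephemeralrunner"]["name"], reports_healthy(record))
```

Comparing the runners flagged this way against the runners that actually pick up jobs is one way to list zombie candidates.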
Runner Pod Logs
containerStatuses:
  - containerID: >
      containerd://
    image: >
    imageID: >
    lastState: {}
    name: runner
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: '2025-06-27T16:39:21Z'
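The status above can be condensed into a one-line summary to make the problem easier to see. The following is a minimal sketch, assuming the status has been loaded into a plain dict (for example from `kubectl get pod -o json`); `summarize` is a hypothetical helper and the redacted fields are left out.

```python
from typing import Any, Dict


def summarize(cs: Dict[str, Any]) -> str:
    """One-line summary of a single containerStatuses entry."""
    # The state dict has exactly one key: running, waiting, or terminated.
    state = next(iter(cs.get("state") or {}), "unknown")
    last = ((cs.get("lastState") or {}).get("terminated") or {}).get("reason")
    return (
        f"{cs.get('name')}: state={state} ready={cs.get('ready')} "
        f"restarts={cs.get('restartCount')} last_termination={last}"
    )


# The runner container status as posted in this issue.
reported = {
    "name": "runner",
    "ready": True,
    "restartCount": 0,
    "started": True,
    "lastState": {},
    "state": {"running": {"startedAt": "2025-06-27T16:39:21Z"}},
}

print(summarize(reported))
# runner: state=running ready=True restarts=0 last_termination=None
```

Nothing in this summary distinguishes the pod from a healthy one: no restart is counted and no termination reason is recorded, which matches the controller reporting the runner as still running.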
Logs
√ Connected to GitHub
[RUNNER 2025-06-27 16:41:28Z INFO Terminal] WRITE LINE:
[RUNNER 2025-06-27 16:41:28Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:28Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:28Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[RUNNER 2025-06-27 16:41:28Z ERR GitHubActionsService] POST request to https://broker.actions.githubusercontent.com/session failed. HTTP Status: Conflict
[RUNNER 2025-06-27 16:41:28Z ERR BrokerMessageListener] Catch exception during create session.
[RUNNER 2025-06-27 16:41:28Z ERR BrokerMessageListener] GitHub.DistributedTask.WebApi.TaskAgentSessionConflictException: Error: Conflict
[RUNNER 2025-06-27 16:41:28Z ERR BrokerMessageListener] at GitHub.Actions.RunService.WebApi.BrokerHttpClient.CreateSessionAsync(TaskAgentSession session, CancellationToken cancellationToken)
[RUNNER 2025-06-27 16:41:28Z ERR BrokerMessageListener] at GitHub.Runner.Common.BrokerServer.CreateSessionAsync(TaskAgentSession session, CancellationToken cancellationToken)
[RUNNER 2025-06-27 16:41:28Z ERR BrokerMessageListener] at GitHub.Runner.Listener.BrokerMessageListener.CreateSessionAsync(CancellationToken token)
[RUNNER 2025-06-27 16:41:28Z INFO BrokerMessageListener] The session for this runner already exists.
[RUNNER 2025-06-27 16:41:28Z ERR Terminal] WRITE ERROR: A session for this runner already exists.
A session for this runner already exists.
[RUNNER 2025-06-27 16:41:28Z INFO BrokerMessageListener] The session conflict exception haven't reached retry limit.
[RUNNER 2025-06-27 16:41:28Z INFO BrokerMessageListener] Sleeping for 30 seconds before retrying.
[RUNNER 2025-06-27 16:41:58Z INFO BrokerMessageListener] Attempt to create session.
[RUNNER 2025-06-27 16:41:58Z INFO BrokerMessageListener] Connecting to the Broker Server...
[RUNNER 2025-06-27 16:41:58Z INFO ConfigurationStore] HasCredentials()
[RUNNER 2025-06-27 16:41:58Z INFO ConfigurationStore] stored True
[RUNNER 2025-06-27 16:41:58Z INFO CredentialManager] GetCredentialProvider
[RUNNER 2025-06-27 16:41:58Z INFO CredentialManager] Creating type OAuth
[RUNNER 2025-06-27 16:41:58Z INFO CredentialManager] Creating credential type: OAuth
[RUNNER 2025-06-27 16:41:58Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:58Z INFO BrokerMessageListener] VssConnection created
[RUNNER 2025-06-27 16:41:58Z INFO BrokerMessageListener] Connecting to the Runner server...
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] EstablishVssConnection
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] Establish connection with 100 seconds timeout.
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] EstablishVssConnection
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] Establish connection with 60 seconds timeout.
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] EstablishVssConnection
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] Establish connection with 60 seconds timeout.
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:59Z INFO BrokerMessageListener] VssConnection created
[RUNNER 2025-06-27 16:41:59Z INFO Terminal] WRITE LINE:
√ Connected to GitHub
[RUNNER 2025-06-27 16:41:59Z INFO Terminal] WRITE LINE:
[RUNNER 2025-06-27 16:41:59Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:59Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:59Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[RUNNER 2025-06-27 16:42:00Z INFO BrokerMessageListener] Session created.
[RUNNER 2025-06-27 16:42:00Z INFO Terminal] WRITE LINE: 2025-06-27 16:42:00Z: Runner reconnected.
2025-06-27 16:42:00Z: Runner reconnected.
[RUNNER 2025-06-27 16:42:00Z INFO Terminal] WRITE LINE: Current runner version: '2.325.0'
Current runner version: '2.325.0'
[RUNNER 2025-06-27 16:42:00Z INFO Terminal] WRITE LINE: 2025-06-27 16:42:00Z: Listening for Jobs
2025-06-27 16:42:00Z: Listening for Jobs
[RUNNER 2025-06-27 16:42:00Z INFO JobDispatcher] Set runner/worker IPC timeout to 30 seconds.
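Since the pod status gives no hint, the runner log itself may be the only signal. The sketch below measures how long the runner took to get a session back after the `HTTP Status: Conflict` error. It assumes the `[RUNNER <timestamp>Z <LEVEL> <Component>] <message>` layout seen above; `session_recovery_seconds` is a hypothetical helper. Reading a session conflict as a sign that the listener was restarted is an interpretation, not something these logs confirm.

```python
import re
from datetime import datetime
from typing import Iterable, Optional

RUNNER_RE = re.compile(
    r"^\[(?P<proc>\w+) (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})Z "
    r"(?P<level>\w+)\s+(?P<component>\w+)\] ?(?P<msg>.*)$"
)


def session_recovery_seconds(lines: Iterable[str]) -> Optional[float]:
    """Seconds from the first session conflict to the next 'Session created.'."""
    conflict_at = None
    for line in lines:
        m = RUNNER_RE.match(line)
        if m is None:
            continue  # plain terminal echo lines carry no prefix
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
        if conflict_at is None and "HTTP Status: Conflict" in m["msg"]:
            conflict_at = ts
        elif conflict_at is not None and m["msg"].startswith("Session created"):
            return (ts - conflict_at).total_seconds()
    return None  # no conflict seen, or the session never came back


sample = [
    "[RUNNER 2025-06-27 16:41:28Z ERR GitHubActionsService] POST request to "
    "https://broker.actions.githubusercontent.com/session failed. HTTP Status: Conflict",
    "A session for this runner already exists.",
    "[RUNNER 2025-06-27 16:41:28Z INFO BrokerMessageListener] Sleeping for 30 seconds before retrying.",
    "[RUNNER 2025-06-27 16:42:00Z INFO BrokerMessageListener] Session created.",
]
print(session_recovery_seconds(sample))
```

For the excerpt above this gives 32 seconds (16:41:28 to 16:42:00), consistent with the single 30-second retry shown in the log.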