EphemeralRunner and its pods left stuck Running after runner OOMKILL #4155

Open
@kennedy-whytech

Description

Controller Version

v0.12.1

Deployment Method

ArgoCD

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Create an AutoscalingRunnerSet targeting an arm64 runner with limited memory.
2. Trigger a GitHub Actions job from a repository that consumes a large amount of memory during startup.
3. The runner pod is OOMKilled, but:
- It remains in the Running state.
- The controller does not detect the failure.
- The EphemeralRunner resource is not deleted.
4. Observe that new jobs remain stuck in the queued state because of the zombie runner.

Describe the bug

Ephemeral runner pods that are OOMKilled are not cleaned up by the controller. Although the pod is no longer functioning after the OOM kill, it stays in the Running state and the associated EphemeralRunner resource is not removed. This leaves zombie runners that block new job assignments, because the controller believes an active runner is still available.

(In v0.12.0 the killed pod was at least easier to detect, because the EphemeralRunner was left without any pods.)

Describe the expected behavior

When an ephemeral runner pod is OOMKilled, the controller should detect the failure, mark the associated EphemeralRunner resource as failed, clean up the pod, and optionally create a replacement runner if needed. This ensures that no stale resources or zombie runners block new job assignments.
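
To make the expected behavior concrete, here is a minimal Python sketch of the decision I have in mind. It is not ARC's reconcile logic; `Observation`, `Action`, and `plan` are hypothetical names, and the "only recreate when below the minimum" rule is my assumption about what "if needed" should mean.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Action(Enum):
    KEEP = "keep"
    MARK_FAILED = "mark_failed"
    DELETE_POD = "delete_pod"
    CREATE_REPLACEMENT = "create_replacement"


@dataclass(frozen=True)
class Observation:
    """Hypothetical summary of what a controller saw for one runner."""
    oom_killed: bool       # an OOM kill was detected for this runner
    healthy_runners: int   # healthy runners, not counting this one
    min_runners: int       # desired floor, e.g. minRunners: 4


def plan(obs: Observation) -> List[Action]:
    """Return the ordered cleanup actions for one observed runner."""
    if not obs.oom_killed:
        return [Action.KEEP]
    # Failure detected: record it on the resource, then remove the pod.
    actions = [Action.MARK_FAILED, Action.DELETE_POD]
    # Optional step: only replace the runner if the set fell below its floor.
    if obs.healthy_runners < obs.min_runners:
        actions.append(Action.CREATE_REPLACEMENT)
    return actions
```

With `minRunners: 4` and three healthy runners left, `plan(Observation(True, 3, 4))` yields mark failed, delete pod, create replacement. A runner that was not OOMKilled is left alone.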

Additional Context

githubConfigUrl: https://github.com/<REDACTED>

controllerServiceAccount:
  namespace: arc-system
  name: arc-controller-gha-rs-controller

githubConfigSecret:
  github_app_id: <REDACTED>
  github_app_installation_id: <REDACTED>
  github_app_private_key: <REDACTED>

containerMode:
  type: "dind"

minRunners: 4

runnerGroup: "k8s"

template:
  spec:
    serviceAccountName: gha-runner
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "2"
            memory: 8Gi
          limits:
            memory: 16Gi
...

Controller Logs

The controller still reports the runner as healthy after the OOM kill:

2025-06-27T17:44:53Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"version": "0.12.1", "ephemeralrunner": {"name":"2cpu-runner-j4k8q","namespace":"arc-runners"}}
2025-06-27T17:44:53Z	INFO	EphemeralRunner	Updating ephemeral runner status	{"version": "0.12.1", "ephemeralrunner": {"name":"2cpu-runner-j4k8q","namespace":"arc-runners"}, "statusPhase": "Running", "statusReason": "", "statusMessage": "", "ready": true}
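
For anyone triaging similar logs, a small Python sketch that pulls the status fields out of a controller line like the one above and compares the log time with the container's `startedAt` from the pod status further down. It assumes the fields are tab-separated with a trailing JSON object, which is how the lines appear here; `parse_controller_line` is a hypothetical helper, not part of ARC.

```python
import json
from datetime import datetime, timezone

# The "Updating ephemeral runner status" line from this report.
LOG_LINE = (
    "2025-06-27T17:44:53Z\tINFO\tEphemeralRunner\t"
    "Updating ephemeral runner status\t"
    '{"version": "0.12.1", "ephemeralrunner": '
    '{"name":"2cpu-runner-j4k8q","namespace":"arc-runners"}, '
    '"statusPhase": "Running", "statusReason": "", '
    '"statusMessage": "", "ready": true}'
)


def parse_utc(ts: str) -> datetime:
    """Parse a timestamp such as 2025-06-27T17:44:53Z as UTC."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def parse_controller_line(line: str) -> dict:
    """Split one controller log line into its five tab-separated parts."""
    ts, level, logger, message, fields = line.split("\t", 4)
    return {
        "time": parse_utc(ts),
        "level": level,
        "logger": logger,
        "message": message,
        "fields": json.loads(fields),
    }


entry = parse_controller_line(LOG_LINE)
# startedAt of the runner container, taken from the pod status in this report.
started_at = parse_utc("2025-06-27T16:39:21Z")
elapsed = entry["time"] - started_at
```

Here `entry["fields"]["statusPhase"]` is `"Running"` with `ready` set to true, and `elapsed` is just over an hour (3932 seconds) after the container started.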

Runner Pod Logs

containerStatuses:
    - containerID: >
        containerd://
      image: >
      imageID: >
      lastState: {}
      name: runner
      ready: true
      restartCount: 0
      started: true
      state:
        running:
          startedAt: '2025-06-27T16:39:21Z'


Logs
√ Connected to GitHub
[RUNNER 2025-06-27 16:41:28Z INFO Terminal] WRITE LINE:
[RUNNER 2025-06-27 16:41:28Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:28Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:28Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[RUNNER 2025-06-27 16:41:28Z ERR  GitHubActionsService] POST request to https://broker.actions.githubusercontent.com/session failed. HTTP Status: Conflict
[RUNNER 2025-06-27 16:41:28Z ERR  BrokerMessageListener] Catch exception during create session.
[RUNNER 2025-06-27 16:41:28Z ERR  BrokerMessageListener] GitHub.DistributedTask.WebApi.TaskAgentSessionConflictException: Error: Conflict
[RUNNER 2025-06-27 16:41:28Z ERR  BrokerMessageListener]    at GitHub.Actions.RunService.WebApi.BrokerHttpClient.CreateSessionAsync(TaskAgentSession session, CancellationToken cancellationToken)
[RUNNER 2025-06-27 16:41:28Z ERR  BrokerMessageListener]    at GitHub.Runner.Common.BrokerServer.CreateSessionAsync(TaskAgentSession session, CancellationToken cancellationToken)
[RUNNER 2025-06-27 16:41:28Z ERR  BrokerMessageListener]    at GitHub.Runner.Listener.BrokerMessageListener.CreateSessionAsync(CancellationToken token)
[RUNNER 2025-06-27 16:41:28Z INFO BrokerMessageListener] The session for this runner already exists.
[RUNNER 2025-06-27 16:41:28Z ERR  Terminal] WRITE ERROR: A session for this runner already exists.
A session for this runner already exists.
[RUNNER 2025-06-27 16:41:28Z INFO BrokerMessageListener] The session conflict exception haven't reached retry limit.
[RUNNER 2025-06-27 16:41:28Z INFO BrokerMessageListener] Sleeping for 30 seconds before retrying.
[RUNNER 2025-06-27 16:41:58Z INFO BrokerMessageListener] Attempt to create session.
[RUNNER 2025-06-27 16:41:58Z INFO BrokerMessageListener] Connecting to the Broker Server...
[RUNNER 2025-06-27 16:41:58Z INFO ConfigurationStore] HasCredentials()
[RUNNER 2025-06-27 16:41:58Z INFO ConfigurationStore] stored True
[RUNNER 2025-06-27 16:41:58Z INFO CredentialManager] GetCredentialProvider
[RUNNER 2025-06-27 16:41:58Z INFO CredentialManager] Creating type OAuth
[RUNNER 2025-06-27 16:41:58Z INFO CredentialManager] Creating credential type: OAuth
[RUNNER 2025-06-27 16:41:58Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:58Z INFO BrokerMessageListener] VssConnection created
[RUNNER 2025-06-27 16:41:58Z INFO BrokerMessageListener] Connecting to the Runner server...
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] EstablishVssConnection
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] Establish connection with 100 seconds timeout.
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] EstablishVssConnection
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] Establish connection with 60 seconds timeout.
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] EstablishVssConnection
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] Establish connection with 60 seconds timeout.
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:59Z INFO BrokerMessageListener] VssConnection created
[RUNNER 2025-06-27 16:41:59Z INFO Terminal] WRITE LINE:
√ Connected to GitHub
[RUNNER 2025-06-27 16:41:59Z INFO Terminal] WRITE LINE:
[RUNNER 2025-06-27 16:41:59Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:59Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:59Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[RUNNER 2025-06-27 16:42:00Z INFO BrokerMessageListener] Session created.
[RUNNER 2025-06-27 16:42:00Z INFO Terminal] WRITE LINE: 2025-06-27 16:42:00Z: Runner reconnected.
2025-06-27 16:42:00Z: Runner reconnected.
[RUNNER 2025-06-27 16:42:00Z INFO Terminal] WRITE LINE: Current runner version: '2.325.0'
Current runner version: '2.325.0'
[RUNNER 2025-06-27 16:42:00Z INFO Terminal] WRITE LINE: 2025-06-27 16:42:00Z: Listening for Jobs
2025-06-27 16:42:00Z: Listening for Jobs
[RUNNER 2025-06-27 16:42:00Z INFO JobDispatcher] Set runner/worker IPC timeout to 30 seconds.

Labels: bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode), needs triage (Requires review from the maintainers)
