Queued/stuck builds and incorrect runner stats #4138

Closed
@andresrsanchez

Description

Controller Version

0.12.0

Deployment Method

ArgoCD

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

Happens randomly after a few days

Describe the bug

A few days after a fresh install, the AutoscalingRunnerSet reports incorrect stats and builds get stuck.
For example:

There are no failed EphemeralRunners, but some EphemeralRunners have been in the Running state for a few hours without a runner pod.
I cancelled one workflow and triggered a new run, but no runner was created.
The AutoscalingRunnerSet stats:

status:
  currentRunners: 4
  pendingEphemeralRunners: 0
  runningEphemeralRunners: 4

In summary: when I run a new job, it never gets picked up, and the AutoscalingRunnerSet thinks there are running jobs when in reality there are none. If I recreate the AutoscalingRunnerSet, it starts working again.
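The mismatch described above (EphemeralRunners counted as running with no pod behind them) could be checked with a small script. This is a minimal sketch, not part of the controller: the function name `find_orphaned_runners` is hypothetical, and it assumes that each EphemeralRunner exposes its state in `status.phase` and that its runner pod has the same name as the EphemeralRunner. Both assumptions should be verified against the installed version.

```python
def find_orphaned_runners(runners, pods):
    """Return names of EphemeralRunners marked Running that have no pod.

    `runners` and `pods` are parsed JSON lists, e.g. the output of
    `kubectl get ephemeralrunners -o json` and `kubectl get pods -o json`.
    Assumption: the runner pod has the same name as its EphemeralRunner.
    Assumption: the runner state is reported in `status.phase`.
    """
    # Collect the names of all pods that actually exist.
    pod_names = {pod["metadata"]["name"] for pod in pods.get("items", [])}

    orphaned = []
    for runner in runners.get("items", []):
        name = runner["metadata"]["name"]
        phase = runner.get("status", {}).get("phase")
        # A runner counted as Running but without a pod is the stuck case.
        if phase == "Running" and name not in pod_names:
            orphaned.append(name)
    return sorted(orphaned)
```

To use it, load the JSON output of `kubectl get ephemeralrunners -n <namespace> -o json` and `kubectl get pods -n <namespace> -o json` with `json.loads` and pass both to the function. A non-empty result while `runningEphemeralRunners` is above zero would match the state reported here.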

Describe the expected behavior

No job gets stuck, and the stats of the AutoscalingRunnerSet/EphemeralRunnerSet are accurate.

Additional Context

We run EKS and Karpenter.
Listener logs: https://gist.github.com/andresrsanchez/11828b134de057c3fbaf8e6bf308901c

Controller Logs

https://gist.github.com/andresrsanchez/a57261b5ba976f3f283b4dccc42e2d1c

Runner Pod Logs

No runner gets triggered

Labels

bug, gha-runner-scale-set, needs triage
