ephemeralrunner should support a TTL #4100

Open
@srob-ntap

Description

What would you like added?

It would be nice if ephemeral runners supported a time-to-live (TTL) value, so that the pod and its ephemeralrunner resource expire cleanly once the TTL elapses.

Why is this needed?

I have an ephemeralrunner setup defined via gha-runner-scale-set. I have minRunners set, and when the runners start up they connect to GHE and wait for work. Part of my pod spec includes a custom volume that I need to "keep fresh" by rotating it every 2 hours. (The volume is time-consuming to set up, and our volume provisioner abstracts that time away.) We added a liveness probe that makes the pod expire after 2 hours. This works at most 5 times because of this code:

	if len(ephemeralRunner.Status.Failures) > maxFailures {
		log.Info(fmt.Sprintf("EphemeralRunner has failed more than %d times. Deleting ephemeral runner so it can be re-created", maxFailures))
		if err := r.Delete(ctx, ephemeralRunner); err != nil {
			log.Error(fmt.Errorf("failed to delete ephemeral runner after %d failures: %w", maxFailures, err), "Failed to delete ephemeral runner")
			return ctrl.Result{}, err
		}
		return ctrl.Result{}, nil
	}

Once that condition is hit, the ephemeralrunner resource hangs around with no new pod. Eventually (after ~10 hours of idle time) none of the ephemeralrunners has a running pod, meaning we lose all our GH runners.

Metadata

Assignees

No one assigned

Labels

community (Community contribution), enhancement (New feature or request), needs triage (Requires review from the maintainers)
