Ephemeral Runner failure limit hardcoded to 5 causing issues #4102

Closed
@rfinnie-epic

Description

Controller Version

0.11.0

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Deploy a runner scale set
2. Delete the runner pod 5 times; the time between deletes does not matter (a small client-go sketch that automates this is included after this list)
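
For illustration only, a loop like the following (using client-go) can simulate the repeated evictions. The namespace and label selector here are assumptions, not the exact values from my setup:

```go
// Sketch only: repeatedly delete the runner pods to simulate Karpenter
// consolidation. Namespace and label selector are assumptions.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	namespace := "arc-runners"                      // assumption
	selector := "actions.github.com/scale-set-name" // assumption: "label exists" selector

	ctx := context.Background()
	for round := 1; round <= 6; round++ {
		pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			log.Fatal(err)
		}
		for _, pod := range pods.Items {
			fmt.Printf("round %d: deleting %s\n", round, pod.Name)
			if err := clientset.CoreV1().Pods(namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
				log.Fatal(err)
			}
		}
		// Give the controller time to recreate the pod before the next round.
		time.Sleep(2 * time.Minute)
	}
}
```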

Describe the bug

Hi, I've got runner scale sets deployed on a Kubernetes cluster backed by Karpenter, which regularly consolidates nodes and moves pods around.

Now, normally this isn't a problem. My runner images have preStop hooks that are pretty tolerant of the pod being deleted -- if a run is in progress, the hook blocks until the run is done (if there is no run in progress, it acquiesces immediately), everything gets cleaned up, the container exits 0, and so on. The controller simply deploys the pod again, the runner agent re-registers itself, and all is good.
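
To make that concrete, the preStop behaviour is roughly of this shape. This is a simplified sketch of the idea, not the actual hook from my images, and the job-in-progress marker file is a hypothetical mechanism:

```go
// Simplified sketch of the preStop behaviour described above. The marker file
// is a hypothetical signal that the runner entrypoint would maintain while a
// job is running.
package main

import (
	"fmt"
	"os"
	"time"
)

const jobMarker = "/home/runner/.job-in-progress" // hypothetical marker path

func main() {
	for {
		if _, err := os.Stat(jobMarker); os.IsNotExist(err) {
			// No run in progress: acquiesce immediately and let the pod go.
			fmt.Println("no job in progress, allowing termination")
			os.Exit(0) // clean exit -- yet it still counts as a "failure"
		}
		fmt.Println("job in progress, holding off termination")
		time.Sleep(5 * time.Second)
	}
}
```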

However, on scale sets that aren't used very frequently, I'll occasionally find Failed ephemeralrunners with TooManyPodFailures ("Pod has failed to start more than 5 times:"). It appears this is because the controller has a simple hardcoded limit of 5 pod "failures". It isn't even checking for a non-zero exit code (my runner image exits 0 after a graceful preStop); any sort of pod exit is counted as a failure.
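
I haven't traced the exact controller source, but the observed behaviour is roughly this (type and field names below are my assumptions, not the real code):

```go
// Approximation of the observed behaviour, not the actual controller source:
// every pod termination is recorded as a "failure", and once more than 5 have
// accumulated the EphemeralRunner is marked permanently Failed, regardless of
// exit code or of how far apart the terminations were.
package main

import "fmt"

// Assumed shape: one entry per terminated pod, never expired or reset.
type ephemeralRunnerStatus struct {
	failures map[string]bool
}

const maxPodFailures = 5 // the hardcoded limit this issue is about

func shouldMarkFailed(status ephemeralRunnerStatus) (bool, string) {
	if len(status.failures) > maxPodFailures {
		return true, fmt.Sprintf("Pod has failed to start more than %d times:", maxPodFailures)
	}
	return false, ""
}

func main() {
	status := ephemeralRunnerStatus{failures: map[string]bool{}}
	for i := 1; i <= 6; i++ {
		// Even a graceful exit 0 (e.g. after my preStop hook) is recorded here.
		status.failures[fmt.Sprintf("runner-pod-%d", i)] = true
	}
	fmt.Println(shouldMarkFailed(status)) // true "Pod has failed to start more than 5 times:"
}
```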

This doesn't trigger very often, but it is a problem for less-used scale sets: an idle runner can stick around long enough for Karpenter to evict its pod enough times to hit the limit, and at that point the failure is permanent.

Suggestions for dealing with this:

  • Make the (currently hardcoded) limit of 5 configurable
  • Take backoff into account (5 exits within 5 minutes may be a problem, but 5 exits within 5 days wouldn't be); a rough sketch combining this with the previous suggestion follows this list
  • Allow the pod to return a specific exit code which indicates to the controller "everything's cool, this was expected, just re-launch the pod"
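
To illustrate the first two suggestions, something like the following would already help. The option names and the shape of the policy are made up for illustration and are not existing ARC configuration:

```go
// Hedged sketch: a configurable failure limit that only counts pod exits
// inside a sliding time window. Names (MaxFailures, FailureWindow) are
// hypothetical.
package main

import (
	"fmt"
	"time"
)

type failurePolicy struct {
	MaxFailures   int           // currently hardcoded to 5 in the controller
	FailureWindow time.Duration // 0 = no window, i.e. today's behaviour
}

// tooManyFailures reports whether the recorded pod exit timestamps exceed the
// policy: only exits within the window count toward the limit.
func tooManyFailures(p failurePolicy, exits []time.Time, now time.Time) bool {
	count := 0
	for _, t := range exits {
		if p.FailureWindow == 0 || now.Sub(t) <= p.FailureWindow {
			count++
		}
	}
	return count > p.MaxFailures
}

func main() {
	p := failurePolicy{MaxFailures: 5, FailureWindow: 5 * time.Minute}
	now := time.Now()

	// Six exits spread over five days: fine under a windowed policy.
	spread := []time.Time{
		now.Add(-5 * 24 * time.Hour), now.Add(-4 * 24 * time.Hour),
		now.Add(-3 * 24 * time.Hour), now.Add(-2 * 24 * time.Hour),
		now.Add(-24 * time.Hour), now.Add(-time.Hour),
	}
	fmt.Println(tooManyFailures(p, spread, now)) // false

	// Six exits within a couple of minutes: genuinely broken, mark it failed.
	var burst []time.Time
	for i := 0; i < 6; i++ {
		burst = append(burst, now.Add(-time.Duration(i)*20*time.Second))
	}
	fmt.Println(tooManyFailures(p, burst, now)) // true
}
```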

Describe the expected behavior

Multiple pod exits (particularly graceful, exit-0 terminations) should not cause a permanent runner failure.

Additional Context

N/A

Controller Logs

Not included, sorry. If needed, I can set up a test to provide log context, but as is, the logs would take a lot of work to redact. The issue should be reproducible on any standard install.

Runner Pod Logs

Not included, sorry. If needed, I can set up a test to provide log context, but as is, the logs would take a lot of work to redact. The issue should be reproducible on any standard install.

Labels

bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode), needs triage (Requires review from the maintainers)
