Dealing with jobs failing with "lost communication with the server" errors #466

I haven't run into this myself yet, but I believe any job on a self-hosted GitHub runner can fail with this error because of a race condition between the runner agent and GitHub.

This isn't specific to actions-runner-controller, and I believe it's an upstream issue. Still, I'd like to gather voices and knowledge around it and hopefully find a workaround.

Please see the related issues for more information.

- https://github.com/actions/runner/issues/510
- https://github.com/microsoft/azure-pipelines-agent/issues/2261
- https://github.com/apache/airflow/issues/14337
- https://github.com/actions/runner/issues/921#issuecomment-821118769

This issue is mainly to gather experiences from anyone who has been affected by the error. I'd appreciate it if you could share your stories, workarounds, fixes, etc., so that it can ideally be fixed upstream or in actions-runner-controller.

## Verifying if you're affected by this problem

Note that the error can also happen when:

- The runner container got OOM-killed because your runner pod has insufficient resources. Set higher resource requests/limits.
- The runner container got OOM-killed because your node has insufficient resources and your runner pod had low priority. Use a node with more resources.
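
To rule out the OOM cases above, one option is to inspect the pod's container statuses. The following is a minimal sketch, not part of actions-runner-controller: it assumes you still have the pod object available (for example, the parsed output of `kubectl get pod <runner-pod> -o json`, captured before the pod is deleted) and that Kubernetes reports the kill as a terminated state with reason `OOMKilled`. The function name `oom_killed_containers` is hypothetical.

```python
from typing import Any, Dict, List


def oom_killed_containers(pod: Dict[str, Any]) -> List[str]:
    """Return the names of containers in a pod object whose current or
    previous state is 'terminated' with reason 'OOMKilled'.

    `pod` is assumed to be the parsed JSON of a Kubernetes Pod.
    """
    names: List[str] = []
    statuses = (pod.get("status") or {}).get("containerStatuses") or []
    for cs in statuses:
        # A container that was killed and restarted reports the kill under
        # lastState; one that was killed and not restarted reports it under state.
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(cs.get("name", "<unknown>"))
                break
    return names
```

An empty result does not prove the race condition is the cause, but a non-empty one suggests fixing resources first.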

If you encounter the error even after tweaking your pod and node resources, it is likely due to the race between the runner agent and GitHub.

## Information

- Even GitHub support seems to say that [stopping the runner](https://github.community/t/is-there-a-way-to-cause-a-self-hosted-runner-to-disconnect-after-completing-a-single-job/17623/2) and [using `--once`](https://github.community/t/how-to-shutdown-self-hosted-runners-gracefully/127142/2) are the go-to solutions. But I believe both are subject to this race condition.

## Possible workarounds

- Disabling ephemeral runners (#457) (i.e. removing the `--once` flag from `run.sh`) may "alleviate" this issue, but not eliminate it.
- Don't use ephemeral runners, and stop runners only during a maintenance window you've defined, while telling your colleagues not to run jobs during that window. (The downside of this approach is that you can't rolling-update runners outside of the maintenance window.)
- Restart the whole workflow run whenever any job in it fails. (Note that [we can't retry individual jobs on GitHub Actions today](https://github.community/t/manually-restart-actions-and-entire-workflows/16262/4).)
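
The last workaround could be automated along these lines. This is only a sketch under assumptions: it uses the GitHub REST endpoint `POST /repos/{owner}/{repo}/actions/runs/{run_id}/rerun`, takes the job list in the shape returned by the "list jobs for a workflow run" endpoint, and re-runs on any failed job without checking whether the failure was really the "lost communication" error. The helper names and the `max_attempts` cap are hypothetical.

```python
import urllib.request
from typing import Any, Dict, List

API = "https://api.github.com"


def should_rerun(jobs: List[Dict[str, Any]], run_attempt: int, max_attempts: int = 2) -> bool:
    """Decide whether to re-run the whole workflow run.

    Re-runs only if at least one job concluded with 'failure' and the run
    has not already been attempted `max_attempts` times (an assumed cap to
    avoid looping on genuine failures).
    """
    failed = any(job.get("conclusion") == "failure" for job in jobs)
    return failed and run_attempt < max_attempts


def build_rerun_request(owner: str, repo: str, run_id: int, token: str) -> urllib.request.Request:
    """Build (but do not send) the request that re-runs a whole workflow run."""
    url = f"{API}/repos/{owner}/{repo}/actions/runs/{run_id}/rerun"
    return urllib.request.Request(
        url,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
```

Sending the request is then a matter of passing it to `urllib.request.urlopen`. Since this re-runs every job, not only the one that lost its runner, the cap on attempts matters more as workflows get longer.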
