Description
I have not yet encountered this myself, but I believe any job on a self-hosted GitHub runner can hit this error due to a race condition between the runner agent and GitHub.
This isn't specific to actions-runner-controller and I believe it's an upstream issue. But I'd still like to gather voices and knowledge around it and hopefully find a work-around.
Please see the related issues for more information.
- Ephemeral (single use) runner registrations runner#510
- Hosted agent "lost communication with the server" microsoft/azure-pipelines-agent#2261
- Self hosted runners for GitHub actions fail very often on apache/airflow#14337
- Self hosted runner in Linux container exits job prematurely runner#921 (comment)
This issue is mainly to gather experiences from whoever has been affected by the error. I'd appreciate it if you could share your stories, workarounds, fixes, etc. around the issue so that it can ideally be fixed upstream or in actions-runner-controller.
Verifying if you're affected by this problem
Note that the error can also happen when:
- The runner container got OOM-killed because your runner pod has insufficient resources. Set higher resource requests/limits.
- The runner container got OOM-killed because your node has insufficient resources and your runner pod had low priority. Use a node with more capacity.
If you encounter the error even after tweaking your pod and node resources, it is likely due to the race between the runner agent and GitHub.
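For the OOM case above, resource requests/limits can be set on the runner pod template. A minimal sketch, assuming actions-runner-controller's `RunnerDeployment` CRD (the name, repository, and values here are illustrative placeholders):

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runner        # illustrative name
spec:
  replicas: 2
  template:
    spec:
      repository: myorg/myrepo   # placeholder
      # Give the runner container enough headroom so it is not OOM-killed
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          memory: 4Gi
```

If the runner container keeps getting killed even with a generous limit, check the pod's `lastState.terminated.reason` to confirm whether `OOMKilled` is actually the cause before suspecting the race condition.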
Information
- Even GitHub support seems to say that stopping the runner and using `--once` are the go-to solutions. But I believe both are subject to this race condition issue.
Possible workarounds
- Disabling ephemeral runners (Ephemeral Runner: Can we make this optional? #457), i.e. removing the `--once` flag from `run.sh`, may "alleviate" this issue, but not completely.
- Don't use ephemeral runners, and stop runners only within a maintenance window you've defined, telling your colleagues not to run jobs during that window. (The downside of this approach is that you can't rolling-update runners outside of the maintenance window.)
- Restart the whole workflow run whenever any job in it fails (note that we can't retry an individual job on GitHub Actions today).
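For that last workaround, re-running the whole run can be automated via GitHub's REST API (`POST /repos/{owner}/{repo}/actions/runs/{run_id}/rerun`). A minimal sketch, assuming a token with `repo` scope; the owner, repo, and run id are placeholders:

```python
import urllib.request

API = "https://api.github.com"


def rerun_url(owner: str, repo: str, run_id: int) -> str:
    """Build the REST endpoint for re-running a whole workflow run."""
    return f"{API}/repos/{owner}/{repo}/actions/runs/{run_id}/rerun"


def rerun_workflow(owner: str, repo: str, run_id: int, token: str) -> int:
    """POST to the re-run endpoint; GitHub returns HTTP 201 on success."""
    req = urllib.request.Request(
        rerun_url(owner, repo, run_id),
        method="POST",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

This could be wired into a small watcher that re-runs a workflow when its failure log contains the "lost communication with the server" message, though distinguishing that error from genuine job failures is left to the caller.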