Dealing with jobs failing with "lost communication with the server" errors #466

Open
@mumoshu

Description

I don't think I have encountered this myself yet, but I believe any job on a self-hosted GitHub Actions runner can fail with this error because of a race condition between the runner agent and GitHub.

This isn't specific to actions-runner-controller; I believe it's an upstream issue. Still, I'd like to gather voices and knowledge around it and hopefully find a workaround.

Please see the related issues for more information.

This issue is mainly for gathering experiences from anyone who has been affected by the error. I'd appreciate it if you could share your stories, workarounds, fixes, etc., so that the problem can ideally be fixed upstream or in actions-runner-controller.

Verifying if you're affected by this problem

Note that the error can also happen when:

  • The runner container got OOM-killed because your runner pod has insufficient resources. Set higher resource requests/limits.
  • The runner container got OOM-killed because your node has insufficient resources and your runner pod had low priority. Use a node with more resources.
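
To rule out the OOM-kill cases above, one option is to inspect the pod status that `kubectl get pod <pod-name> -o json` prints and look for containers that terminated with the reason `OOMKilled`. The following is only a minimal sketch: the helper name `oom_killed_containers` and the sample pod data (including the container names `runner` and `docker`) are hypothetical and for illustration, and if the runner pod has already been deleted there is no status left to inspect.

```python
import json
from typing import Any, Dict, List


def oom_killed_containers(pod: Dict[str, Any]) -> List[str]:
    """Return the names of containers whose current or previous
    termination reason is OOMKilled, given a pod object as a dict."""
    names = []
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for status in statuses:
        # A container may have been restarted since the OOM kill, so
        # check both the current state and the last recorded state.
        for key in ("state", "lastState"):
            terminated = status.get(key, {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                names.append(status["name"])
                break
    return names


# Hypothetical, trimmed-down pod status used only as sample input.
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {"name": "runner",
       "state": {"running": {}},
       "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
      {"name": "docker",
       "state": {"running": {}},
       "lastState": {}}
    ]
  }
}
""")

print(oom_killed_containers(sample_pod))
```

If the list comes back empty for the failed runner pod, the OOM-kill explanations become less likely and the race described in this issue becomes the more plausible cause.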

If you encounter the error even after tweaking your pod and node resources, it is likely that it's due to the race between the runner agent and GitHub.

Information

  • Even GitHub support seems to say that stopping the runner and using --once are the go-to solutions. But I believe both are subject to this race condition.

Possible workarounds

  • Disabling ephemeral runners (Ephemeral Runner: Can we make this optional? #457) (i.e. removing the --once flag from run.sh) may "alleviate" this issue, but not completely.
  • Don't use ephemeral runners, and stop runners only during a maintenance window you've defined, while telling your colleagues not to run jobs during that window. (The downside of this approach is that you can't perform rolling updates of runners outside the maintenance window.)
  • Restart the whole workflow run whenever any job in it fails. (Note that we can't retry an individual job on GitHub Actions today.)
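
The last workaround, restarting the whole workflow run, can be scripted against the GitHub REST API endpoint for re-running a workflow run (`POST /repos/{owner}/{repo}/actions/runs/{run_id}/rerun`). This is a rough sketch rather than a recommended implementation: the function names are hypothetical, the owner, repository, run ID, and token are placeholders you would supply yourself, and nothing here decides whether a failure was actually caused by the lost-communication error.

```python
import urllib.request

GITHUB_API = "https://api.github.com"


def build_rerun_request(owner: str, repo: str, run_id: int,
                        token: str) -> urllib.request.Request:
    """Build (but do not send) the request that re-runs a whole workflow run."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/actions/runs/{run_id}/rerun"
    return urllib.request.Request(
        url,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            # The token is assumed to have permission to re-run workflows
            # in the target repository.
            "Authorization": f"Bearer {token}",
        },
    )


def rerun_workflow_run(owner: str, repo: str, run_id: int, token: str) -> int:
    """Send the re-run request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    """
    request = build_rerun_request(owner, repo, run_id, token)
    with urllib.request.urlopen(request) as response:
        return response.status
```

Building the request separately from sending it makes it possible to check the URL and headers without touching the network. If you automate this, consider capping the number of re-runs per workflow run so that a genuinely broken job does not retry forever.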
