Description
We have been seeing an increased number of users losing communication with their runners during otherwise healthy workflow runs. After digging into the issue, we discovered that at some point the scale-down lambda receives blank runner info for an instance. As designed, the runner is then tagged as orphaned; from that point it is not checked again, gets deleted, and is therefore removed from AWS. This is what causes the loss of communication (a rough sketch of the flow is shown below).
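A minimal sketch of that flow, with hypothetical function and type names (this is not the actual lambda code, just an illustration of the behaviour described above):

```typescript
// Illustrative only: if the GitHub listing comes back without the runner,
// the instance is tagged as orphaned and later removed without re-checking.
interface GhRunner {
  id: number;
  name: string;
  busy: boolean;
}

async function scaleDownCheck(
  instanceId: string,
  listGhRunners: () => Promise<GhRunner[]>,
  tagAsOrphan: (id: string) => Promise<void>,
  terminate: (id: string) => Promise<void>,
): Promise<void> {
  // Ask GitHub which runners it currently knows about.
  const runners = await listGhRunners();
  const match = runners.find((r) => r.name === instanceId);

  if (!match) {
    // A paginated listing that silently drops a healthy runner lands here:
    // the instance is tagged as orphaned and removed on a later pass,
    // which kills the job it was still running.
    await tagAsOrphan(instanceId);
    return;
  }

  if (!match.busy) {
    await terminate(instanceId); // normal idle scale-down path
  }
}
```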
We then looked into why we were getting a blank response and found that it is most likely due to pagination and data slippage. A quick Postman check against the API returns fresh data on every request, and that data is not consistent across requests in terms of pagination. Speaking with GitHub, this appears to be a known limitation of the REST API but not of GraphQL, where you can, for instance, use cursor-based pagination or sort the result set before it is paginated. This looks very much like our issue. For example, if runner X initially sits on page 7 but moves to page 5 before we fetch page 7, it will no longer be there by the time we request page 7; it goes missing and gets marked as orphaned. See https://docs.github.com/en/enterprise-cloud@latest/rest/actions/self-hosted-runners?apiVersion=2022-11-28#list-self-hosted-runners-for-an-organization.
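To make the failure mode concrete, here is a small self-contained TypeScript sketch (the data and helper names are made up) that simulates page-based listing while the underlying set changes between page requests, so an item shifts to an already-fetched page and is never returned:

```typescript
// Simulate "data slippage" with offset/page pagination.
type Runner = { id: number; name: string };

function page<T>(items: T[], pageNo: number, perPage: number): T[] {
  return items.slice((pageNo - 1) * perPage, pageNo * perPage);
}

function listAcrossPages(snapshots: Runner[][], perPage: number): Runner[] {
  // Each page request may see a different snapshot of the data, mimicking
  // server-side deletions/re-ordering between requests.
  const seen: Runner[] = [];
  for (let p = 1; ; p++) {
    const snapshot = snapshots[Math.min(p - 1, snapshots.length - 1)];
    const chunk = page(snapshot, p, perPage);
    if (chunk.length === 0) break;
    seen.push(...chunk);
  }
  return seen;
}

// Snapshot 1: runner-x sits on page 2 (per_page = 2).
const before: Runner[] = [
  { id: 1, name: 'runner-a' },
  { id: 2, name: 'runner-b' },
  { id: 3, name: 'runner-x' },
  { id: 4, name: 'runner-c' },
];
// Snapshot 2: runner-a was removed, so runner-x has moved up to page 1,
// which was already fetched from the first snapshot.
const after: Runner[] = before.filter((r) => r.name !== 'runner-a');

const result = listAcrossPages([before, after], 2);
console.log(result.some((r) => r.name === 'runner-x')); // false: runner-x "went missing"
```

Cursor-based pagination avoids this because each page request continues from a stable cursor rather than a positional offset that shifts when items are added or removed.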
Since this causes loss of communication, it is not a desirable outcome at all, which is why I have logged it here. I do plan to do something to help mitigate this; it is a WIP at the moment.
Activity
npalm commented on May 14, 2025
Seems to be the same problem as reported here: #4376
stuartp44 commented on May 22, 2025
I have created a PR that will help reduce the effect of this issue, but it will not fix the root cause of the data slippage!
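One way to reduce the blast radius of a single bad listing (purely illustrative, and not necessarily what the PR above implements) is to require a runner to be reported missing on several consecutive scale-down runs before treating it as orphaned:

```typescript
// Illustrative mitigation: only terminate after N consecutive "missing"
// observations, so one paginated listing that drops a healthy runner
// does not immediately kill the instance.
const ORPHAN_THRESHOLD = 2;

// instanceId -> number of consecutive runs where the runner was not found
const missingCounts = new Map<string, number>();

function shouldTerminateAsOrphan(instanceId: string, foundInListing: boolean): boolean {
  if (foundInListing) {
    missingCounts.delete(instanceId); // seen again, reset the counter
    return false;
  }
  const count = (missingCounts.get(instanceId) ?? 0) + 1;
  missingCounts.set(instanceId, count);
  return count >= ORPHAN_THRESHOLD;
}

// First miss (possibly pagination slippage): keep the instance.
console.log(shouldTerminateAsOrphan('i-0abc', false)); // false
// Second consecutive miss: now treat it as a real orphan.
console.log(shouldTerminateAsOrphan('i-0abc', false)); // true
```

In a real lambda the counter would need to be persisted between invocations (for example in an instance tag), since lambda executions are stateless.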
kalinstaykov commented on Jun 20, 2025
We've been experiencing this problem a lot; here's what's happening for us:
This shows several attempts to kill the runner; eventually it gets marked as orphaned and killed. The job that was running was at 19% when the runner was marked as orphaned, which was about 20 minutes after the job started.
I hope that this fix/workaround will help.