
Cancelled workflows lead to extra runners #206

Open
alfred-stokespace opened this issue May 23, 2024 · 2 comments

Comments

@alfred-stokespace
Contributor

This case is specific to two scenarios we have in our org.

  1. The run in question is cancelled prior to being scheduled
  2. Dependent runs are cancelled when parent runs fail (I think...)

In both cases we see that myshoes

  1. is notified
  2. creates a resource
  3. notices that the runner is idle
    • 2024/05/23 19:20:32 7a8d1181-4452-49d4-93d4-272bada8dc76 is idle and not running 6h0m0s, so not will delete (created_at: 2024-05-23 19:14:55 +0000 UTC, now: 2024-05-23 19:20:32.330051792 +0000 UTC)
  4. waits 6 hours before killing the idle runner

We're using some pretty expensive EC2 instances and have several contingent runs (that sometimes fail), so keeping unnecessary instances running for 6 hours is costly.

Having looked through your code base and understood the challenges, I can see why this hasn't been solved.

I modified my copy of the myshoes code base to handle this reasonably well. I'll post my solution in a follow-up comment.

@alfred-stokespace
Contributor Author

Here is the result (after my local code changes) for both use cases mentioned above...


2024/05/23 19:20:32 will delete runner: 7a8d1181-4452-49d4-93d4-272bada8dc76
2024/05/23 19:20:32 7a8d1181-4452-49d4-93d4-272bada8dc76 is idle and not running 6h0m0s, so not will delete (created_at: 2024-05-23 19:14:55 +0000 UTC, now: 2024-05-23 19:20:32.330051792 +0000 UTC)
2024/05/23 19:20:32 7a8d1181-4452-49d4-93d4-272bada8dc76 is idle and not running 6h0m0s, a recent cancel was found; we will shut down
2024/05/23 19:20:32 will delete runner with GitHub: 7a8d1181-4452-49d4-93d4-272bada8dc76
2024/05/23 19:16:46 7a8d1181-4452-49d4-93d4-272bada8dc76 is not running MustRunningTime
2024/05/23 19:15:46 7a8d1181-4452-49d4-93d4-272bada8dc76 is not running MustRunningTime
2024/05/23 19:14:15 instance create successfully! (job: 7a8d1181-4452-49d4-93d4-272bada8dc76, cloud ID: i-04d5c846a5d0c4adb)
2024/05/23 19:14:14 start create instance (job: 7a8d1181-4452-49d4-93d4-272bada8dc76)
2024/05/23 19:14:14 start job (job id: 7a8d1181-4452-49d4-93d4-272bada8dc76)

The line that indicates the new behavior, and hints at how it's implemented, is:

2024/05/23 19:20:32 7a8d1181-4452-49d4-93d4-272bada8dc76 is idle and not running 6h0m0s, a recent cancel was found; we will shut down

Basically, I had to expand the webhook code to catch the cancelled-job case, and I chose to put the event details into a sync.Map.
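
To give a sense of the shape, here is a minimal sketch (not my actual diff). It assumes the payload is parsed with google/go-github (the module version in the import path may differ), and the cancelledJobs / recordIfCancelled names are made up for illustration:

package webhook

import (
    "sync"

    "github.com/google/go-github/v62/github"
)

// cancelledJobs maps a cancelled workflow_job ID to the "owner/repo" it came from.
var cancelledJobs sync.Map

// recordIfCancelled stores the event if it represents a cancelled job. A
// cancelled job shows up as a "completed" workflow_job event whose conclusion
// is "cancelled".
func recordIfCancelled(event *github.WorkflowJobEvent) {
    if event.GetAction() == "completed" &&
        event.GetWorkflowJob().GetConclusion() == "cancelled" {
        cancelledJobs.Store(
            event.GetWorkflowJob().GetID(),
            event.GetRepo().GetFullName(),
        )
    }
}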

I then do two things...

  1. periodically clear the cancelled sync.Map when no runners exist (to prevent a memory leak)
  2. if we notice that a host has been idle and is in that 6-hour category, we walk the cancelled map and find the first entry that matches the repo the runner was created for,
    • we then remove that entry from the map (rough sketches of both items follow this list).
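
Roughly like this (a sketch in the same hypothetical package as the one above, not my actual code; tryConsumeCancel and clearCancels are made-up names):

// tryConsumeCancel walks the cancelled map and removes the first entry whose
// repository matches the idle runner's repository. A true result is what
// drives the "a recent cancel was found; we will shut down" path instead of
// waiting out the 6h idle timeout.
func tryConsumeCancel(runnerRepoFullName string) bool {
    consumed := false
    cancelledJobs.Range(func(key, value any) bool {
        if repo, ok := value.(string); ok && repo == runnerRepoFullName {
            cancelledJobs.Delete(key)
            consumed = true
            return false // stop after the first match
        }
        return true // keep walking
    })
    return consumed
}

// clearCancels drops everything that has been recorded; it runs periodically
// when no runners exist, so the map can't grow without bound.
func clearCancels() {
    cancelledJobs.Range(func(key, _ any) bool {
        cancelledJobs.Delete(key)
        return true
    })
}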

Problems...

I know that GitHub does the scheduling and you don't get to control which runner runs which job, so there's no perfect solution here. We're just making a best-effort attempt to match a cancel with a stale runner.

So far so good, but it's only been running for a couple of days.

Let me know if you want actual code examples and I can put them in another comment.

@alfred-stokespace
Contributor Author

I just discovered a race condition that I thought would be worth sharing for anyone else trying to address this problem in a similar way.

I had a case where

  1. a cancel happened
  2. the existing delete_runner.go logic detects that a runner has been idle for too long [...] is idle and not running [...]
  3. my suggestion above kicks in [...] a recent cancel was found [...]
  4. the delete_runner.go logic continues and attempts to perform the deleteRunnerWithGitHub(...) behaviour
  5. the delete fails due to the race condition; we get a 422 from GitHub: 422 Bad request - Runner "myshoes-<uuid>" is still running a job

I went through the GitHub logs and found that the runner was indeed chosen at the same time and beat the delete code to the finish line.

In my case the ramification was that we consumed the cancel but didn't reduce the runner count, so we remained in an over-producing state.

One thought I'm having now is to track "in-flight cancels": if this race is lost again, we put the in-flight cancel back into the cancel pool so it can hopefully consume another one of the idle runners.
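
Purely speculative, but something along these lines is what I mean. It builds on the hypothetical cancelledJobs map from the sketches above; deleteFn stands in for the existing deleteRunnerWithGitHub path, and detecting the 422 via go-github's ErrorResponse is an assumption about how the error surfaces (extra imports: "context", "errors", "net/http"):

// deleteWithInFlightCancel treats the consumed cancel as "in flight": it is
// only gone for good once the GitHub-side delete succeeds. If GitHub answers
// 422 ("still running a job") because the runner picked up work in the
// meantime, the cancel is returned to the pool so it can match another idle
// runner later.
func deleteWithInFlightCancel(ctx context.Context, runnerUUID string,
    cancelKey int64, repo string,
    deleteFn func(context.Context, string) error) error {

    // Take the cancel out of the pool while the delete is in flight.
    cancelledJobs.Delete(cancelKey)

    err := deleteFn(ctx, runnerUUID)
    if err == nil {
        return nil
    }

    // GitHub raced us and handed the "idle" runner a job; put the cancel back.
    var ghErr *github.ErrorResponse
    if errors.As(err, &ghErr) && ghErr.Response != nil &&
        ghErr.Response.StatusCode == http.StatusUnprocessableEntity {
        cancelledJobs.Store(cancelKey, repo)
    }
    return err
}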
