Cancelled workflows lead to extra runners #206
Here is the result (after my local code changes) for both use cases mentioned above...
The line that indicates the new behavior, and hints at how it's implemented, is...
Basically, I had to expand the webhook code to catch the canceled-job case, and I chose to put the event details into a sync.Map. I then do two things...
Problems... I know that GH does the scheduling and you don't get to control which runner runs which job, so there's no perfection here. We're just making a best effort to match up a cancel with a stale runner. So far so good, but it's only been running for a couple of days. Let me know if you want actual code examples and I can put them in another comment.
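For anyone following along, here is a minimal sketch of the idea described above, not the actual change made to myshoes. It assumes cancel events are keyed by job ID and matched to idle runners by repository; `CancelEvent`, `CancelPool`, `Record`, and `Claim` are hypothetical names, and parsing of the real webhook payload is left out.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// CancelEvent is a hypothetical record of a canceled job, holding the
// details a webhook handler might keep. The fields are assumptions.
type CancelEvent struct {
	JobID      int64
	Repo       string
	ReceivedAt time.Time
}

// CancelPool holds cancels that have not yet been matched to a runner.
type CancelPool struct {
	events sync.Map // JobID (int64) -> CancelEvent
}

// Record stores a cancel event seen by the webhook handler.
func (p *CancelPool) Record(ev CancelEvent) {
	p.events.Store(ev.JobID, ev)
}

// Claim removes and returns one pending cancel for the given repo.
// LoadAndDelete guarantees that only one caller wins a given event,
// even if several goroutines scan the pool at the same time.
func (p *CancelPool) Claim(repo string) (CancelEvent, bool) {
	var claimed CancelEvent
	found := false
	p.events.Range(func(key, value any) bool {
		ev := value.(CancelEvent)
		if ev.Repo != repo {
			return true // keep scanning
		}
		if _, ok := p.events.LoadAndDelete(key); ok {
			claimed, found = ev, true
			return false // stop: we own this cancel now
		}
		return true // someone else claimed it first; keep scanning
	})
	return claimed, found
}

func main() {
	pool := &CancelPool{}
	pool.Record(CancelEvent{JobID: 101, Repo: "org/app", ReceivedAt: time.Now()})

	// Later, when an idle runner for org/app is found:
	if ev, ok := pool.Claim("org/app"); ok {
		fmt.Printf("matched cancel for job %d; idle runner can be deleted\n", ev.JobID)
	}
}
```

Since GitHub decides which runner gets which job, a claimed cancel only says that one idle runner for that repo is probably surplus, not which one.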
I just discovered a race condition that I thought would help anyone else who might be trying to address this problem in a similar way. I had a case where
I went through the GH logs and found that the runner was indeed chosen at the same time and beat the delete code to the finish line. In my case the ramification was that we consumed the cancel but didn't reduce the runner count, so we remained in an over-producing state. One thought I'm having now is to track "in-flight cancels", and if this race is lost again, put the in-flight cancel back into the cancel pool so it can hopefully consume another one of the idle runners.
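The "in-flight cancels" idea could look roughly like the sketch below. It is only an illustration under assumptions: `InFlightPool`, `ConsumeCancel`, and `ErrRunnerBusy` are hypothetical names, and the `deleteRunner` callback stands in for whatever code actually removes the runner and detects that it picked up a job first.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrRunnerBusy is a hypothetical error meaning the runner picked up a
// job before the delete could complete (the race described above).
var ErrRunnerBusy = errors.New("runner picked up a job before delete")

// InFlightPool tracks pending cancels and cancels currently being applied.
type InFlightPool struct {
	mu       sync.Mutex
	pending  []int64        // canceled job IDs waiting for an idle runner
	inFlight map[int64]bool // cancels claimed but not yet confirmed
}

func NewInFlightPool() *InFlightPool {
	return &InFlightPool{inFlight: make(map[int64]bool)}
}

// Add queues a cancel for later matching.
func (p *InFlightPool) Add(jobID int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.pending = append(p.pending, jobID)
}

// Pending reports how many cancels are still waiting for a runner.
func (p *InFlightPool) Pending() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.pending)
}

// ConsumeCancel takes one pending cancel, marks it in-flight, and runs
// deleteRunner. If the delete fails, the cancel goes back into the pool
// so it can be retried against another idle runner.
func (p *InFlightPool) ConsumeCancel(deleteRunner func() error) bool {
	p.mu.Lock()
	if len(p.pending) == 0 {
		p.mu.Unlock()
		return false
	}
	jobID := p.pending[0]
	p.pending = p.pending[1:]
	p.inFlight[jobID] = true
	p.mu.Unlock()

	err := deleteRunner() // called without the lock; may be slow

	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.inFlight, jobID)
	if err != nil {
		p.pending = append(p.pending, jobID) // lost the race: return the cancel
		return false
	}
	return true
}

func main() {
	pool := NewInFlightPool()
	pool.Add(101)

	// First attempt loses the race, so the cancel is put back.
	ok := pool.ConsumeCancel(func() error { return ErrRunnerBusy })
	fmt.Println("first attempt consumed:", ok, "pending:", pool.Pending())

	// Second attempt, against another idle runner, succeeds.
	ok = pool.ConsumeCancel(func() error { return nil })
	fmt.Println("second attempt consumed:", ok, "pending:", pool.Pending())
}
```

The cancel is only dropped after the delete is confirmed, so a lost race no longer leaves the runner count too high.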
This case is specific to two scenarios we have in our org.
In both cases we see that myshoes
2024/05/23 19:20:32 7a8d1181-4452-49d4-93d4-272bada8dc76 is idle and not running 6h0m0s, so not will delete (created_at: 2024-05-23 19:14:55 +0000 UTC, now: 2024-05-23 19:20:32.330051792 +0000 UTC)
We're using some pretty expensive EC2 instances, and we have several contingent runs (that sometimes fail), so having unnecessary instances running for 6 hours gets costly.
Having looked through your code base and understood the challenges, I can see why this hasn't been solved.
I modified my copy of the myshoes code base to handle this reasonably well. I'll post my solution in a follow-up comment.