Runner to workflow pods take 3 minutes to start on RWX & containerMode: Kubernetes #207
Comments
@alexgaganashvili @nikola-jokic Hey Nikola & Alex, I've seen y'all encounter similar issues like this before; let me know if you see something! Deeply appreciated.
I don't think it's slowness in PV provisioning, since it's the same PV shared between the runner and the workflow pod. Maybe K8s is trying to find a node that fits your resource requests (ACTIONS_RUNNER_USE_KUBE_SCHEDULER=true)? Also check the kube-scheduler logs.
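For reference, a minimal sketch of where that flag usually lives in the gha-runner-scale-set Helm values; the image and container layout below mirror the chart defaults and may differ in your setup:

```yaml
# values.yaml sketch for the gha-runner-scale-set chart (containerMode: kubernetes).
# Only the env entry matters here; the rest follows the chart's default runner spec.
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          # Let the kube scheduler place the workflow pod instead of pinning it
          # to the runner's node; this requires a ReadWriteMany work volume.
          - name: ACTIONS_RUNNER_USE_KUBE_SCHEDULER
            value: "true"
```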
Hey @alexgaganashvili, thanks for the comment. I checked the kube-scheduler logs. The workflow pod does have room on the node (5000m CPU allocatable to be requested), with one workflow pod requesting 3000m CPU. I feel it has something to do with this process: if you look at the timestamps, it's stuck for a minute repeating the same pod logs. I wonder what the best way to debug this further would be.
Sorry, hard to tell what's causing it. I have not personally run into this issue. I'd suggest you also ask in the Discussions.
@jonathan-fileread, cc: @Link-, @nikola-jokic
Hey everyone, I transferred this issue here since it is related to the container hook, not ARC. Most likely, the latency comes from K8s itself, where NFS volumes are slow to mount across multiple nodes. We need to find a better way to allow workflow pods to land on different nodes without having to rely on a shared volume. The runner and the workflow pod have to share some files, but we can probably find another solution that doesn't rely on RWX volumes.
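For context, the piece that forces RWX here is the work volume both pods mount. A sketch of the relevant Helm values, assuming the documented containerMode fields for recent ARC versions (the storage class name is a placeholder):

```yaml
# Sketch: the runner pod and the workflow pod mount this same work volume,
# which is why ReadWriteMany is required once the workflow pod may land on
# a different node than the runner.
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteMany"]
    storageClassName: "azurefile-csi-nfs"   # placeholder for an NFS-backed RWX class
    resources:
      requests:
        storage: 10Gi
```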
Thanks, @nikola-jokic.
Hi @nikola-jokic. Is there an official plan and ETA to move away from RWX volumes? The current pairing of runner and job/workflow pods makes it problematic to schedule pods when they have resource requests/limits set. For example, we get into a situation where the runner pod fits on a node, but the job/workflow pod can't fit on the same node due to its resource requests; in that case the job fails because the workflow pod was never scheduled. We use Karpenter, which exacerbates the issue even further since it keeps node utilization pretty high (>80%), so most job/workflow pods fail to schedule on the same node. We tried the alternatives as well, and it really feels like there is no good way to get resource requests/limits working with ARC.
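One partial mitigation worth sketching: the Kubernetes hook supports a pod template extension, so you can declare explicit requests/limits for the job container and make the scheduler aware of the workflow pod's footprint. This is a sketch that assumes the documented ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE mechanism and the $job container name for your hook version; the sizes are placeholders:

```yaml
# Sketch of a workflow-pod template file referenced by the runner container via
# the ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE environment variable (e.g. mounted
# from a ConfigMap). The resource sizes below are placeholders.
apiVersion: v1
kind: PodTemplate
metadata:
  name: workflow-pod-template
spec:
  containers:
    # "$job" targets the job container the hook creates for the workflow pod.
    - name: "$job"
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          cpu: "3"
          memory: 6Gi
```

It doesn't solve the same-node pairing problem, but it at least makes the workflow pod's footprint explicit to the scheduler.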
Checks
Controller Version
0.9.3
Deployment Method
Helm
To Reproduce
Describe the bug
After the runner pod initializes (which is fairly immediate), the GitHub Actions jobs (6 of them) seem to get stuck polling for 2-3 minutes while waiting for the workflow pod to spin up and continue the job.
The runner pod logs show the same line repeating every 5-10 seconds for 2-3 minutes before the container hook is called and the workflow pod is spun up.
See lines 6-52 in the scale set logs gist below; you'll see this line logged every few seconds:
[WORKER 2024-12-03 19:21:58Z INFO HostContext] Well known directory 'Root': '/home/runner'
This bug started occurring when we switched to RWX with a new storage class using NFS-based Azure Files. I suspect it might be the slowness of provisioning a PVC with Azure Files compared to the traditional disk-based RWO setup.
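For reference, the NFS-based Azure Files class looks roughly like the sketch below; the class name and SKU are representative rather than the exact values from our cluster:

```yaml
# Sketch of an NFS-backed Azure Files storage class used for the RWX work volume.
# Name and SKU are assumptions; adjust for your cluster.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi-nfs
provisioner: file.csi.azure.com
parameters:
  protocol: nfs            # NFS shares (SMB is the Azure Files default)
  skuName: Premium_LRS     # NFS shares require a premium (FileStorage) account
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
```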
Describe the expected behavior
After the runner pod initializes for a new GitHub Actions job, the workflow pods should spin up near-immediately to process the Docker builds from each GHA job.
Additional Context
Controller Logs
ARC Controller & Scaleset Logs: https://gist.github.com/jonathan-fileread/fd0978bef66784e20d6b50bce50cd3b9
Runner Pod Logs
ARC Controller & Scaleset Logs: https://gist.github.com/jonathan-fileread/fd0978bef66784e20d6b50bce50cd3b9