[k8s][autoscaler][core] { "CPU": 0 } is not working for avoid task scheduling in rayHeadType Pod #16935
Comments
@DmitriGekhtman take a look?
Taking a look.
Reproduced.
Confirming that this is related to the resource annotation (it does not happen without the annotation).
Reproduced with AWSNodeProvider: a 0 CPU resource override for the head node prevents workers from launching. Specifically, it looks like no request for CPU resources is registered with the autoscaler when the tasks are submitted. Digging into why. This is important to fix, as setting 0 CPU on the head node to make it a pure control node is a common setup for large clusters.
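To make the expected behavior concrete, here is a minimal toy model in plain Python. It is not Ray's scheduler or autoscaler code; the names `fits` and `unmet_demands` are hypothetical and exist only for this illustration. It assumes a simple first-fit placement: tasks that fit on no existing node should surface as pending demand, which is what would normally prompt the autoscaler to launch workers.

```python
from typing import Dict, List

Resources = Dict[str, float]


def fits(demand: Resources, available: Resources) -> bool:
    """Return True if every requested resource is available in full."""
    return all(available.get(name, 0.0) >= amount
               for name, amount in demand.items())


def unmet_demands(demands: List[Resources],
                  nodes: List[Resources]) -> List[Resources]:
    """Place demands first-fit and return those that no node can host."""
    free = [dict(node) for node in nodes]  # copy so callers' dicts stay intact
    unmet = []
    for demand in demands:
        for node in free:
            if fits(demand, node):
                for name, amount in demand.items():
                    node[name] -= amount  # reserve the resources on this node
                break
        else:
            unmet.append(demand)  # nothing fits: should count as pending demand
    return unmet


# Head node overridden to 0 CPU, three tasks each asking for 1 CPU.
head = {"CPU": 0}
tasks = [{"CPU": 1}] * 3
pending = unmet_demands(tasks, [head])
```

Under these assumptions all three tasks end up in `pending`. The behavior reported in this thread corresponds to that list appearing empty to the autoscaler, so no workers are requested.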
No resource demands are visible in Ray status when the tasks are submitted, which suggests a Ray Core issue.
cc @mwtian you might also be interested in the interaction of the C++ core and the autoscaler.
Actually, I'm no longer able to reproduce on AWS. Looks like a K8s issue, but I still have no idea why I'm not getting any resource demands from GCS.
Easiest repro @DmitriGekhtman and I have found is
cc @ericl
@wuisawesome Ping me on the PR fixing this when core team gets to it -- I'm pretty curious about what silliness is happening here...
I experience this problem on GCP, without K8s. My config is:
The cluster user reports:
While
What is a solution for this problem, please? My additional expectation is that
This is on my TODO list. Aiming to have it fixed in Ray 1.7.
The fix came too late to make it into Ray 1.7. It is currently available in the Ray nightly wheels.
What is the problem?
I'm trying to avoid task scheduling on rayHead by setting
.values.rayResources: { "CPU": 0 }
but it keeps trying to schedule tasks on the headPod, which fails due to lack of resources, so the cluster gets stuck without scaling up any workerPods.
Reproduction (REQUIRED)
If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".