[Feature] Allow different LocalQueue label for head and worker groups #2099

Closed
2 tasks done
shaowei-su opened this issue Apr 24, 2024 · 6 comments
Labels
enhancement New feature or request kueue rayjob

Comments

@shaowei-su

Search before asking

  • I have searched the issues and found no similar feature request.

Description

Per the docs (https://kueue.sigs.k8s.io/docs/tasks/run/rayjobs/), the LocalQueue label is a global setting that applies to both head and worker pods. As a result, the head pod (which is usually lightweight) can accidentally be scheduled onto the scarce node types reserved for workers, inheriting all of the ResourceFlavor labels and tolerations.

Use case

No response

Related issues

#2098

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@shaowei-su shaowei-su added enhancement New feature or request triage labels Apr 24, 2024
@kevin85421
Member

cc @andrewsykim @alculquicondor

@alculquicondor

This would break a lot of semantics.

A better practice, which is already supported, is to define two resource flavors for each CQ: one with GPUs and one without. You can use affinities or taints/tolerations to match heads and workers to each of the flavors. https://kueue.sigs.k8s.io/docs/concepts/resource_flavor/#resourceflavor-labels
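
As a rough illustration of the two-flavor setup described above (a sketch, not a tested configuration): the flavor and queue names, the `example.com/...` label and taint keys, and all quota values are hypothetical placeholders. It assumes the `kueue.x-k8s.io/v1beta1` API version that appears in the controller log later in this thread.

```yaml
# Sketch only: every name, label key, taint key, and quota below is a
# hypothetical placeholder, not taken from the issue author's cluster.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: cpu-flavor
spec:
  nodeLabels:
    example.com/node-class: cpu      # nodes intended for Ray head pods
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-flavor
spec:
  nodeLabels:
    example.com/node-class: gpu      # nodes intended for Ray worker pods
  nodeTaints:
  - key: example.com/gpu             # pod sets must tolerate this taint to use the flavor
    value: "true"
    effect: NoSchedule
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ray-cluster-queue
spec:
  namespaceSelector: {}              # admit workloads from all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: cpu-flavor
      resources:
      - name: cpu
        nominalQuota: 16
      - name: memory
        nominalQuota: 64Gi
      - name: nvidia.com/gpu
        nominalQuota: 0              # no GPUs available in the CPU flavor
    - name: gpu-flavor
      resources:
      - name: cpu
        nominalQuota: 64
      - name: memory
        nominalQuota: 256Gi
      - name: nvidia.com/gpu
        nominalQuota: 8
```

With a layout like this, the head group's pod template would carry a nodeSelector (or required node affinity) for the CPU label, and the worker group's template one for the GPU label plus a toleration for the GPU taint, so that each pod set can only fit one of the flavors.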

@shaowei-su
Author

Thanks @kevin85421 @alculquicondor, this makes a lot of sense. I'll try the two-flavor setup instead.

@shaowei-su
Author

Hi @alculquicondor, we ran some quick tests, and it appears that heads and workers cannot be admitted to different flavors at the same time. Do you have an existing setup or example that you could share for reference?

Here are some related controller logs:

24T23:39:21.793884669Z","logger":"events","caller":"recorder/recorder.go:104","msg":"couldn't assign flavors to pod set default: flavor default-cpu doesn't match node affinity, untolerated taint {specialized NoSchedule NoSchedule <nil>} in flavor a100xl-us-east-1a","type":"Normal","object":{"kind":"Workload","namespace":"xxx","name":"xxx-d429d3a2-0293-11ef-86c3-caa498469caf-2816f","uid":"44ac9406-2826-43c6-8970-f9424e632c66","apiVersion":"kueue.x-k8s.io/v1beta1","resourceVersion":"807027005"},"reason":"Pending"}

@shaowei-su shaowei-su reopened this Apr 25, 2024
@andrewsykim
Contributor

@shaowei-su can you share your full RayJob YAML and your ClusterQueue? (Redact any internal details.)

@shaowei-su
Author

Thanks @andrewsykim, I just got it working by deleting and re-applying the CRDs (resource flavors, cluster queues). For some reason, it did not work with in-place editing. Closing this issue, and thanks for looking into this!
