[Feature] Support batch scheduling and queueing #213
Comments
That's an interesting use case. What constitutes a resource constraint? If a cluster needs 10 replicas (worker pods) but only 9 can be scheduled, is that a resource constraint? Or does it only count when zero pods for a RayCluster can be scheduled?

Currently the Ray operator has a reconciliation loop that tries to schedule any pods missing from a RayCluster; it fetches the list of RayClusters from K8s periodically. Kubernetes Cluster Autoscaling can be used to add more nodes when Ray pods are Pending due to a lack of resources. Having the Ray operator keep a queue would make it stateful, and it is preferable to keep state in K8s.

As for batch scheduling configuration, I think it is a very valid feature request: if the user wants a RayCluster of 100 workers and gets a cluster with 1 head node and zero workers, that might not be useful at all. However, adding this feature would mean the Ray operator has to keep track of the available resources in the K8s cluster, which would add to its complexity and to the latency of processing each RayCluster creation request.
Thanks for the reply. For the case with 100 workers: suppose different users created two such clusters in the same Kubernetes cluster. Neither of them has sufficient workers, but both would still be competing for the same resources in the pool. This can lead to resource starvation. If we had support for gang scheduling, resource utilization would be better. Would it be possible to add an option to integrate with an existing batch scheduler, like volcano.sh?
I see what you mean. Do you have in mind a design where KubeRay users can choose whether or not to take on this Volcano dependency, i.e., where gang scheduling becomes optional? Today we have the option of using
FYI, we have noticed that Volcano does not work with Kubernetes versions >= 1.22 out of the box.
Here is some demo code that can support Volcano in the Ray operator.
@loleek thanks for sharing! I think there are several tools in the K8s ecosystem that can help with this.
In my opinion, batch scheduling and queueing should not be directly supported out of the box. @loleek, from a quick read of the branch you posted, it seems to me the same functionality could be achieved with the existing operator code by correctly configuring a RayCluster CR. Of course, having the functionality built in is certainly nice and might be better for your use case. On the other hand, we don't want to be too opinionated about which external schedulers to use. It'd be great if we could figure out how to support using Volcano without modifying the operator code and then document that for the community.
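To make the "configure the CR" suggestion concrete, here is a minimal sketch of a RayCluster fragment expressing a fixed worker-group size. The field names follow the KubeRay `v1alpha1` CRD as discussed in this thread; the metadata name and elided pod templates are placeholders, so check the CRD shipped with your operator version.

```yaml
# Hypothetical RayCluster fragment: the worker group's desired size is
# declarative, but without gang scheduling the pods may sit Pending
# individually rather than being admitted all-or-nothing.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: example-cluster   # placeholder name
spec:
  headGroupSpec:
    template: {}           # head pod template elided
  workerGroupSpecs:
    - groupName: workers
      replicas: 10         # desired workers
      minReplicas: 10      # autoscaler lower bound, not a scheduling guarantee
      maxReplicas: 10
      template: {}         # worker pod template elided
```

Note that `minReplicas` only bounds the autoscaler's target; it does not prevent the scheduler from placing a subset of the pods.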
Configuring a RayCluster using minReplicas might not solve the problem when stateful Ray actors are involved under preemption. For example, job x requires 10 actors at 1 CPU per actor, and job y also requires 10 actors at 1 CPU per actor, but there are only 15 CPUs in total. If job y preempts job x, half of job x's actors are lost and wait for retry. Retrying the actors a fixed number of times may always fail, because it is hard to predict job y's running time, and retrying infinitely is not good practice either. It is better to let a gang scheduler remove job x entirely and reschedule it after job y finishes.
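The scenario above is exactly what a Volcano PodGroup addresses: a `minMember` equal to the full job size makes admission all-or-nothing, and on preemption the whole group can be reclaimed and requeued instead of individual actors retrying indefinitely. A sketch, assuming the `scheduling.volcano.sh/v1beta1` API (names like `job-x` are illustrative):

```yaml
# Gang-scheduling sketch for "job x" from the example: 10 actor pods,
# 1 CPU each. Volcano admits the group only when all 10 pods fit.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: job-x              # illustrative name
spec:
  minMember: 10            # schedule only if all 10 pods can be placed
  minResources:
    cpu: "10"              # 10 actors x 1 CPU per actor
```

Each of job x's pods would then reference this PodGroup, so the scheduler treats them as one unit for both admission and preemption.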
@wolvever that's a great point, Ray doesn't have built-in gang scheduling and fault tolerance primitives that would mitigate this scenario. While Ray does have some built-in fault tolerance mechanisms, we generally recommend thinking about fault tolerance at the application level, i.e. you should consider various failure scenarios and design your Ray app with those in mind. I think this issue focuses on gang-scheduling at the level of K8s pods, though, rather than at the level of the Ray internals. |
Note that #755 has landed, which adds support for Volcano. The interface has been designed in such a way to make it possible to add new schedulers without a major refactor. |
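For readers landing here later, the Volcano integration is opt-in per cluster. The sketch below reflects my reading of the KubeRay docs at the time (operator started with batch scheduling enabled, cluster opting in via labels); the label keys and queue name here are assumptions, so verify against the docs for the version you deploy.

```yaml
# Hedged sketch of opting a RayCluster into the Volcano integration.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: volcano-cluster                  # placeholder name
  labels:
    ray.io/scheduler-name: volcano       # route this cluster through Volcano
    volcano.sh/queue-name: kuberay-test  # optional: target a Volcano queue
spec: {}                                 # head/worker group specs elided
```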
MCAD is also an option for batch scheduling functionality. Seems this is good to close. Feel free to open further discussion on batch scheduling and queueing! |
@DmitriGekhtman would MCAD work for long-lived clusters? I.e., say we have a Ray cluster and we submit a job to it, and this job tells the autoscaler to spawn 5 worker pods. Can MCAD be used with Ray in its current state to watch this type of scenario? The original issue talks about queueing the instantiation of new Ray clusters.
Search before asking
Description
KubeRay currently does not seem to support scheduling policies for Ray clusters. Examples include batch (gang) scheduling and queueing.
Possible scheduler implementations include https://volcano.sh/en/ and https://github.com/kubernetes-sigs/kueue.
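For the Kueue option, queueing works by pointing workloads at a LocalQueue backed by a ClusterQueue. A minimal sketch, assuming the `kueue.x-k8s.io/v1beta1` API; Kueue's API has evolved across releases, and the queue names here are placeholders, so check the Kueue docs for the exact version you run.

```yaml
# Hedged sketch: a LocalQueue that workloads (e.g. RayCluster pods) can
# target via a queue-name label; admission is governed by the backing
# ClusterQueue's quota.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: ray-queue          # placeholder queue name
  namespace: default
spec:
  clusterQueue: cluster-queue  # assumed ClusterQueue name
```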
Use case
Example use cases:
Related issues
No response
Are you willing to submit a PR?