
[Feature] Support batch scheduling and queueing #213

Closed
richardsliu opened this issue Mar 23, 2022 · 12 comments
Labels: discussion (Need community members' input), enhancement (New feature or request), P1 (Issue that should be fixed within a few weeks)

Comments

@richardsliu
Contributor

Search before asking

  • I have searched the issues and found no similar feature request.

Description

KubeRay currently does not seem to support scheduling policies for Ray clusters. Examples include:

  • Queue scheduling
  • Gang scheduling
  • Job preemption

Possible scheduler implementations include https://volcano.sh/en/ and https://github.com/kubernetes-sigs/kueue.

Use case

Example use cases:

  • Multiple users try to deploy Ray clusters in the same Kubernetes cluster. Not all of the requests can be granted due to resource constraints. It would be nice to have some requests stored in a queue and processed at a later time.
  • Clusters can have configurable retry policies
  • Clusters can be annotated with priority classes (see the sketch after this list)
  • Batch scheduling configurations: a Ray cluster only gets deployed if a minimum quorum of nodes can be scheduled.
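
The priority-class item above can already be sketched today, since each RayCluster group embeds a full Kubernetes pod template. A minimal, illustrative sketch (all names and the image tag are made up, and this only assigns priority; queueing and preemption semantics would still have to come from a batch scheduler):

 apiVersion: scheduling.k8s.io/v1
 kind: PriorityClass
 metadata:
   name: ray-batch-low              # illustrative name
 value: 1000
 globalDefault: false
 ---
 apiVersion: ray.io/v1alpha1
 kind: RayCluster
 metadata:
   name: example-cluster            # illustrative name
 spec:
   headGroupSpec:
     template:
       spec:
         priorityClassName: ray-batch-low   # standard k8s pod field
         containers:
           - name: ray-head
             image: rayproject/ray:1.11.0   # illustrative image tag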

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@richardsliu richardsliu added the enhancement New feature or request label Mar 23, 2022
@akanso
Collaborator

akanso commented Mar 23, 2022

That's an interesting use case.

What constitutes a resource constraint? If a cluster needs 10 replicas (worker pods) but only 9 can be scheduled, is that a resource constraint? Or is it only a resource constraint if zero pods of a RayCluster can be scheduled?

Currently the Ray operator has a reconciliation loop that tries to schedule any pods that are missing in a RayCluster. It fetches the list of RayClusters from K8s periodically.

Autoscaling (the K8s Cluster Autoscaler) can be used to add more nodes when Ray pods are pending due to lack of resources.

Having the Ray Operator keep a queue would make it stateful, and it is preferable to keep the state in K8s.

As for batch scheduling configuration, I think it is a very valid feature request: if, for example, the user wants a RayCluster of 100 workers and instead gets a cluster with 1 head node and zero workers, that might not be useful at all.

However, adding this feature would mean that the Ray Operator has to keep track of the available resources in the K8s cluster, which would add to its complexity and to the latency of processing each request to create a RayCluster.

@richardsliu
Contributor Author

Thanks for the reply.

For the case with 100 workers - suppose that different users created two such clusters in the same Kubernetes cluster. Neither of them has sufficient workers, but both would still be competing for the same resources in the pool. This can lead to resource starvation.

If we had support for gang scheduling, resources would be utilized much more efficiently.

Would it be possible to add an option to integrate with an existing batch scheduler, like volcano.sh?

@akanso
Collaborator

akanso commented Mar 24, 2022

I see what you mean.
If we use a scheduling framework like Volcano, we would introduce a dependency on the Volcano API:
apiVersion: scheduling.volcano.sh
Moreover, part of deploying KubeRay would involve deploying Volcano.

Do you have in mind a design where KubeRay users can choose whether or not to take on this Volcano dependency? In other words, would gang scheduling become optional?

Today we have the option of using minReplicas. Does it satisfy your use case if the RayCluster pods are created if and only if the minimum number of pods can be scheduled?

 replicas: {{ .Values.worker.replicas }}
 minReplicas: {{ .Values.worker.miniReplicas | default 1 }}
 maxReplicas: {{ .Values.worker.maxiReplicas | default 2147483647 }}
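
For reference, those chart values render into a worker group spec along these lines (a minimal sketch; the group name and image tag are illustrative, and whether the operator actually withholds pod creation below minReplicas is exactly the question above):

 apiVersion: ray.io/v1alpha1
 kind: RayCluster
 metadata:
   name: example-cluster          # illustrative
 spec:
   workerGroupSpecs:
     - groupName: workers         # illustrative
       replicas: 10               # desired worker count
       minReplicas: 10            # lower bound
       maxReplicas: 10            # upper bound
       template:
         spec:
           containers:
             - name: ray-worker
               image: rayproject/ray:1.11.0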

@Jeffwan Jeffwan added the discussion Need community members' input label May 30, 2022
@asm582
Contributor

asm582 commented Jun 2, 2022

FYI, we have noticed that Volcano does not work out of the box with Kubernetes versions >= 1.22.

@loleek

loleek commented Oct 28, 2022

Here is demo code that adds Volcano support to the Ray operator:
loleek@745e12c
I read the queue and priority class from the head pod spec; it works well for me.
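
For readers skimming the thread: the idea, as described, is to carry the scheduling hints on the head pod's spec, roughly like the sketch below. The annotation key here is hypothetical; the authoritative keys are defined in the linked branch:

 apiVersion: ray.io/v1alpha1
 kind: RayCluster
 metadata:
   name: example-cluster
 spec:
   headGroupSpec:
     template:
       metadata:
         annotations:
           volcano.sh/queue-name: default   # hypothetical key; see the branch
       spec:
         schedulerName: volcano             # hand the pods to the Volcano scheduler
         priorityClassName: high-priority   # standard k8s PriorityClass
         containers:
           - name: ray-head
             image: rayproject/ray:2.0.0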

@DmitriGekhtman
Collaborator

DmitriGekhtman commented Oct 28, 2022

@loleek thanks for sharing!

I think there are several tools in the k8s ecosystem that can help with this.
For example, @asm582 has demo'd and documented how to use MCAD to gang-schedule KubeRay-managed resources
https://ray-project.github.io/kuberay/guidance/kuberay-with-MCAD/

cc @sihanwang41 @architkulkarni @kevin85421

@DmitriGekhtman
Collaborator

In my opinion, batch scheduling and queuing should not be directly supported out-of-the-box.
However, clear docs on the use of external tools like https://ray-project.github.io/kuberay/guidance/kuberay-with-MCAD/ are much appreciated!

@loleek from a quick read of the branch you posted, it seems to me the same functionality could be achieved with the existing operator code, by correctly configuring a RayCluster CR.

Of course, having the functionality built in is certainly nice and might be better for your use case. On the other hand, we don't want to be too opinionated about which external schedulers to use.

It'd be great if we can figure out how to support using Volcano without modifying the operator code and then document that for the community.
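
One possible no-operator-change shape, sketched under the assumption that Volcano's standard PodGroup mechanics apply to KubeRay's pods as they do to any other pods (untested here; names are illustrative):

 apiVersion: scheduling.volcano.sh/v1beta1
 kind: PodGroup
 metadata:
   name: ray-cluster-pg           # illustrative; created before the RayCluster
 spec:
   minMember: 11                  # e.g. 1 head + 10 workers must be schedulable together
 ---
 # Each group's pod template in the RayCluster CR would then carry:
 #   metadata:
 #     annotations:
 #       scheduling.k8s.io/group-name: ray-cluster-pg
 #   spec:
 #     schedulerName: volcano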

@DmitriGekhtman DmitriGekhtman added the P1 Issue that should be fixed within a few weeks label Oct 28, 2022
@wolvever

wolvever commented Oct 31, 2022

Configuring a RayCluster using minReplicas might not solve the problem when stateful Ray actors are involved under preemption.

For example, job x requires 10 actors at 1 actor per CPU, and job y also requires 10 actors at 1 actor per CPU, but there are only 15 CPUs in total. If job y preempts job x, half of job x's actors are lost and sit waiting for retry.

Retrying the actors several times might always fail, because it is hard to predict job y's running time, and retrying infinitely is not good practice either. It would be better to let the gang scheduler remove job x entirely and reschedule it after job y finishes.

@DmitriGekhtman
Collaborator

@wolvever that's a great point; Ray doesn't have built-in gang scheduling primitives that would mitigate this scenario. While Ray does have some built-in fault tolerance mechanisms, we generally recommend thinking about fault tolerance at the application level, i.e. you should consider various failure scenarios and design your Ray app with those in mind.

I think this issue focuses on gang-scheduling at the level of K8s pods, though, rather than at the level of the Ray internals.

@tgaddair
Contributor

tgaddair commented Dec 2, 2022

Note that #755 has landed, which adds support for Volcano. The interface has been designed in such a way to make it possible to add new schedulers without a major refactor.
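
For anyone landing here later, the integration is opt-in, roughly along these lines (flag and label names as I recall them from the KubeRay docs accompanying that change; check the current docs for the authoritative spelling):

 # Start the operator with batch scheduling enabled:
 #   --enable-batch-scheduler=true
 apiVersion: ray.io/v1alpha1
 kind: RayCluster
 metadata:
   name: example-cluster
   labels:
     ray.io/scheduler-name: volcano     # route this cluster through Volcano
     volcano.sh/queue-name: kuberay     # optional: target a specific Volcano queue
 spec:
   headGroupSpec:
     template:
       spec:
         containers:
           - name: ray-head
             image: rayproject/ray:2.0.0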

@DmitriGekhtman
Collaborator

MCAD is also an option for batch scheduling functionality.
https://www.anyscale.com/blog/gang-scheduling-ray-clusters-on-kubernetes-with-multi-cluster-app-dispatcher

Seems this is good to close. Feel free to open further discussion on batch scheduling and queueing!

@peterghaddad

peterghaddad commented Aug 21, 2023

@DmitriGekhtman would MCAD work for long-lived clusters? I.e., say we have a Ray cluster and we submit a job to it, and the job tells the autoscaler to spawn 5 worker pods. Can MCAD be used, in Ray's current state, to handle this type of scenario? The original issue talks about queueing the instantiation of new Ray clusters.
