
[Feature] Support batch scheduling and queueing #213

Closed
richardsliu opened this issue Mar 23, 2022 · 12 comments
Labels: discussion (Need community members' input), enhancement (New feature or request), P1 (Issue that should be fixed within a few weeks)

Comments

@richardsliu
Contributor

Search before asking

  • I have searched the issues and found no similar feature request.

Description

KubeRay currently does not seem to support scheduling policies for Ray clusters. Examples include:

  • Queue scheduling
  • Gang scheduling
  • Job preemption

Possible scheduler implementations include https://volcano.sh/en/ and https://github.com/kubernetes-sigs/kueue.

Use case

Example use cases:

  • Multiple users try to deploy Ray clusters in the same Kubernetes cluster. Not all of the requests can be granted due to resource constraints. It would be nice to have some requests stored in a queue and processed at a later time.
  • Clusters can have configurable retry policies
  • Clusters can be annotated with priority classes (see the sketch after this list)
  • Batch scheduling configurations: a Ray cluster only gets deployed if a minimum quorum of nodes can be scheduled.
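
The priority-class item above can already be sketched today, since each RayCluster group embeds a full Kubernetes pod template. A minimal, illustrative sketch (all names and the image tag are made up, and this only assigns priority; queueing and preemption semantics would still have to come from a batch scheduler):

 apiVersion: scheduling.k8s.io/v1
 kind: PriorityClass
 metadata:
   name: ray-batch-low              # illustrative name
 value: 1000
 globalDefault: false
 ---
 apiVersion: ray.io/v1alpha1
 kind: RayCluster
 metadata:
   name: example-cluster            # illustrative name
 spec:
   headGroupSpec:
     template:
       spec:
         priorityClassName: ray-batch-low   # standard k8s pod field
         containers:
           - name: ray-head
             image: rayproject/ray:1.11.0   # illustrative image tag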

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@richardsliu richardsliu added the enhancement New feature or request label Mar 23, 2022
@akanso
Collaborator

akanso commented Mar 23, 2022

That's an interesting use case.

What constitutes a resource constraint? If a cluster needs 10 replicas (worker pods) but only 9 can be scheduled, is that a resource constraint? Or is it only a resource constraint if zero pods of a RayCluster can be scheduled?

Currently the Ray operator has a reconciliation loop that tries to schedule any pods that are missing in a RayCluster. It fetches the list of RayClusters from K8s periodically.

Autoscaling (the K8s Cluster Autoscaler) can be used to add more nodes when Ray pods are pending due to lack of resources.

Having the Ray Operator keep a queue would make it stateful, and it is preferable to keep the state in K8s.

As for batch scheduling configuration, I think it is a very valid feature request: if, for example, the user wants a RayCluster of 100 workers and instead gets a cluster with 1 head node and zero workers, that might not be useful at all.

However, adding this feature would mean that the Ray Operator has to keep track of the available resources in the K8s cluster, which would add to its complexity and to the latency of processing each request to create a RayCluster.

@richardsliu
Contributor Author

Thanks for the reply.

For the case with 100 workers - suppose that different users created two such clusters in the same Kubernetes cluster. Neither of them has sufficient workers, but both would still be competing for the same resources in the pool. This can lead to resource starvation.

If we had support for gang scheduling, resources would be utilized much more efficiently.

Would it be possible to add an option to integrate with an existing batch scheduler, like volcano.sh?

@akanso
Collaborator

akanso commented Mar 24, 2022

I see what you mean.
If we use a scheduling framework like Volcano, we would introduce a dependency on the Volcano API:
apiVersion: scheduling.volcano.sh
Moreover, part of deploying KubeRay would involve deploying Volcano.

Do you have in mind a design where KubeRay users can choose whether or not to take on this Volcano dependency? In other words, would gang scheduling become optional?

Today we have the option of using minReplicas. Does it satisfy your use case if the RayCluster pods are created if and only if the minimum number of pods can be scheduled?

 replicas: {{ .Values.worker.replicas }}
 minReplicas: {{ .Values.worker.miniReplicas | default 1 }}
 maxReplicas: {{ .Values.worker.maxiReplicas | default 2147483647 }}
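
For reference, those chart values render into a worker group spec along these lines (a minimal sketch; the group name and image tag are illustrative, and whether the operator actually withholds pod creation below minReplicas is exactly the question above):

 apiVersion: ray.io/v1alpha1
 kind: RayCluster
 metadata:
   name: example-cluster          # illustrative
 spec:
   workerGroupSpecs:
     - groupName: workers         # illustrative
       replicas: 10               # desired worker count
       minReplicas: 10            # lower bound
       maxReplicas: 10            # upper bound
       template:
         spec:
           containers:
             - name: ray-worker
               image: rayproject/ray:1.11.0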

@Jeffwan Jeffwan added the discussion Need community members' input label May 30, 2022
@asm582
Contributor

asm582 commented Jun 2, 2022

FYI, we have noticed that Volcano does not work out of the box with Kubernetes versions >= 1.22.

@loleek

loleek commented Oct 28, 2022

Here is demo code that adds Volcano support to the Ray operator:
loleek@745e12c
I read the queue and priority class from the head pod spec; it works well for me.
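
For readers skimming the thread: the idea, as described, is to carry the scheduling hints on the head pod's spec, roughly like the sketch below. The annotation key here is hypothetical; the authoritative keys are defined in the linked branch:

 apiVersion: ray.io/v1alpha1
 kind: RayCluster
 metadata:
   name: example-cluster
 spec:
   headGroupSpec:
     template:
       metadata:
         annotations:
           volcano.sh/queue-name: default   # hypothetical key; see the branch
       spec:
         schedulerName: volcano             # hand the pods to the Volcano scheduler
         priorityClassName: high-priority   # standard k8s PriorityClass
         containers:
           - name: ray-head
             image: rayproject/ray:2.0.0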

@DmitriGekhtman
Collaborator

DmitriGekhtman commented Oct 28, 2022

@loleek thanks for sharing!

I think there are several tools in the k8s ecosystem that can help with this.
For example, @asm582 has demo'd and documented how to use MCAD to gang-schedule KubeRay-managed resources
https://ray-project.github.io/kuberay/guidance/kuberay-with-MCAD/

cc @sihanwang41 @architkulkarni @kevin85421

@DmitriGekhtman
Collaborator

In my opinion, batch scheduling and queuing should not be directly supported out-of-the-box.
However, clear docs on the use of external tools like https://ray-project.github.io/kuberay/guidance/kuberay-with-MCAD/ are much appreciated!

@loleek from a quick read of the branch you posted, it seems to me the same functionality could be achieved with the existing operator code, by correctly configuring a RayCluster CR.

Of course, having the functionality built in is certainly nice and might be better for your use case. On the other hand, we don't want to be too opinionated about which external schedulers to use.

It'd be great if we can figure out how to support using Volcano without modifying the operator code and then document that for the community.
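
One possible no-operator-change shape, sketched under the assumption that Volcano's standard PodGroup mechanics apply to KubeRay's pods as they do to any other pods (untested here; names are illustrative):

 apiVersion: scheduling.volcano.sh/v1beta1
 kind: PodGroup
 metadata:
   name: ray-cluster-pg           # illustrative; created before the RayCluster
 spec:
   minMember: 11                  # e.g. 1 head + 10 workers must be schedulable together
 ---
 # Each group's pod template in the RayCluster CR would then carry:
 #   metadata:
 #     annotations:
 #       scheduling.k8s.io/group-name: ray-cluster-pg
 #   spec:
 #     schedulerName: volcano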

@DmitriGekhtman DmitriGekhtman added the P1 Issue that should be fixed within a few weeks label Oct 28, 2022
@wolvever

wolvever commented Oct 31, 2022

Configuring a RayCluster using minReplicas might not solve the problem when stateful Ray actors are involved under preemption.

For example, job x requires 10 actors at 1 actor per CPU, and job y also requires 10 actors at 1 actor per CPU, but there are only 15 CPUs in total. If job y preempts job x, half of job x's actors are lost and sit waiting for retry.

Retrying the actors several times might always fail, because it is hard to predict job y's running time, and retrying infinitely is not good practice either. It would be better to let the gang scheduler remove job x entirely and reschedule it after job y finishes.

@DmitriGekhtman
Collaborator

@wolvever that's a great point; Ray doesn't have built-in gang scheduling primitives that would mitigate this scenario. While Ray does have some built-in fault tolerance mechanisms, we generally recommend thinking about fault tolerance at the application level, i.e. you should consider various failure scenarios and design your Ray app with those in mind.

I think this issue focuses on gang-scheduling at the level of K8s pods, though, rather than at the level of the Ray internals.

@tgaddair
Contributor

tgaddair commented Dec 2, 2022

Note that #755 has landed, which adds support for Volcano. The interface has been designed in such a way to make it possible to add new schedulers without a major refactor.
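
For anyone landing here later, the integration is opt-in, roughly along these lines (flag and label names as I recall them from the KubeRay docs accompanying that change; check the current docs for the authoritative spelling):

 # Start the operator with batch scheduling enabled:
 #   --enable-batch-scheduler=true
 apiVersion: ray.io/v1alpha1
 kind: RayCluster
 metadata:
   name: example-cluster
   labels:
     ray.io/scheduler-name: volcano     # route this cluster through Volcano
     volcano.sh/queue-name: kuberay     # optional: target a specific Volcano queue
 spec:
   headGroupSpec:
     template:
       spec:
         containers:
           - name: ray-head
             image: rayproject/ray:2.0.0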

@DmitriGekhtman
Collaborator

MCAD is also an option for batch scheduling functionality.
https://www.anyscale.com/blog/gang-scheduling-ray-clusters-on-kubernetes-with-multi-cluster-app-dispatcher

Seems this is good to close. Feel free to open further discussion on batch scheduling and queueing!

@peterghaddad

peterghaddad commented Aug 21, 2023

@DmitriGekhtman would MCAD work for long-lived clusters? I.e., say we have a Ray cluster and we submit a job to it, and the job tells the autoscaler to spawn 5 worker pods. Can MCAD be used, in Ray's current state, to handle this type of scenario? The original issue talks about queueing the instantiation of new Ray clusters.
