
[Feature] [scheduler-plugins] Support second scheduler mode #3852


Open

CheyuWu wants to merge 3 commits into master from feat/second-schedule

Conversation

@CheyuWu
Contributor

CheyuWu commented Jul 9, 2025

Why are these changes needed?

Currently, KubeRay only supports scheduler-plugins when it is deployed as the single (default) scheduler.
This change adds support for using scheduler-plugins as a second scheduler that runs alongside the default scheduler.
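
At the Pod level, the core of second scheduler mode is simply pointing the Ray head and worker Pods at the scheduler-plugins scheduler, while everything else keeps using the default scheduler. A minimal sketch of that idea (illustrative only; the helper name is made up and this is not the exact code in this PR):

package schedulerplugins

import corev1 "k8s.io/api/core/v1"

// setSchedulerName is an illustrative helper: in "second scheduler" mode the
// default scheduler keeps handling all other Pods, and only the Ray head and
// worker Pods are pointed at the scheduler-plugins scheduler via
// spec.schedulerName.
func setSchedulerName(pod *corev1.Pod, schedulerName string) {
	// e.g. schedulerName = "scheduler-plugins-scheduler"
	pod.Spec.SchedulerName = schedulerName
}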

Manual Testing

Common Portion

Ray operator setup

Set helm-chart/kuberay-operator/values.yaml's batchScheduler.name to scheduler-plugins-scheduler

batchScheduler:
  enabled: false
  name: "scheduler-plugins-scheduler"

Testing YAML file

  • Create a YAML file named deploy.yaml
#### deploy.yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-kuberay
  labels:
    ray.io/gang-scheduling-enabled: "true"
    ray.io/scheduler-name: scheduler-plugins-scheduler
spec:
  rayVersion: '2.46.0'
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.46.0
          resources:
            limits:
              cpu: 1
              memory: 2G
            requests:
              cpu: 1
              memory: 2G
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
  workerGroupSpecs:
  - replicas: 3
    minReplicas: 1
    maxReplicas: 5
    groupName: workergroup
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.46.0
          resources:
            limits:
              cpu: 1
              memory: 1G
            requests:
              cpu: 1
              memory: 1G

Single scheduler

CoScheduler setup

Follow the instructions - Reference

Some things that differ from the instructions:

  • Install vim in kube-scheduler-kind-control-plane

    $ apt update
    $ apt install vim
  • Fix the permission problem in kube-scheduler-kind-control-plane

    $ chmod 644 /etc/kubernetes/scheduler.conf
  • Apply missing YAML

    $ k apply -f manifests/crds/scheduling.x-k8s.io_elasticquotas.yaml
  • /etc/kubernetes/sched-cc.yaml

    Keep both default-scheduler and scheduler-plugins-scheduler profiles so that the Ray operator can still be deployed.

    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      # (Optional) Change true to false if you are not running a HA control-plane.
      leaderElect: true
    clientConnection:
      kubeconfig: /etc/kubernetes/scheduler.conf
    profiles:
    - schedulerName: default-scheduler
      plugins:
        queueSort:
          enabled:
            - name: Coscheduling
          disabled:
            - name: PrioritySort
        multiPoint:
          enabled:
            - name: Coscheduling
    - schedulerName: scheduler-plugins-scheduler
      plugins:
        queueSort:
          enabled:
            - name: Coscheduling
          disabled:
            - name: PrioritySort
        multiPoint:
          enabled:
          - name: Coscheduling

Apply deploy.yaml

Run the command to deploy the RayCluster with scheduler-plugins-scheduler and gang scheduling enabled:

$ k apply -f deploy.yaml

Result

Get Status

$ k get raycluster

NAME                 DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
raycluster-kuberay   3                 3                   4      5G       0      ready    47s
$ k get podgroup raycluster-kuberay -o yaml

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  creationTimestamp: "2025-07-13T10:28:09Z"
  generation: 1
  name: raycluster-kuberay
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1
    kind: RayCluster
    name: raycluster-kuberay
    uid: 2ec902e3-5d4f-4a82-b153-0ee088a8d1fe
  resourceVersion: "4685"
  uid: 9d59585d-a2a1-4523-8225-6df31b9eabd0
spec:
  minMember: 3
  minResources:
    cpu: "3"
    memory: 4G
status:
  occupiedBy: default/raycluster-kuberay
  phase: Running
  running: 4

Get the scheduler name for the Ray operator, Ray head, and Ray workers

$ k get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.schedulerName}{"\n"}{end}'

kuberay-operator-5f997dbf6c-gf9g2       default-scheduler
raycluster-kuberay-head scheduler-plugins-scheduler
raycluster-kuberay-workergroup-worker-44jd4     scheduler-plugins-scheduler
raycluster-kuberay-workergroup-worker-cqbks     scheduler-plugins-scheduler
raycluster-kuberay-workergroup-worker-nnl79     scheduler-plugins-scheduler

Modify the deploy.yaml and apply it

  workerGroupSpecs:
  - replicas: 100
    minReplicas: 1
    maxReplicas: 200

Run the command to check whether all of the Pods are in Pending status:

$ kubectl get pods -A

NAMESPACE            NAME                                            READY   STATUS    RESTARTS   AGE
default              kuberay-operator-5f997dbf6c-gf9g2               1/1     Running   0          51m
default              raycluster-kuberay-head                         0/1     Pending   0          2m58s
default              raycluster-kuberay-workergroup-worker-2h25p     0/1     Pending   0          2m57s
... (many Pending worker Pods omitted)
default              raycluster-kuberay-workergroup-worker-xqvzv     0/1     Pending   0          2m57s
default              raycluster-kuberay-workergroup-worker-z4dhc     0/1     Pending   0          2m54s
default              raycluster-kuberay-workergroup-worker-zbl9f     0/1     Pending   0          2m54s
kube-system          coredns-6f6b679f8f-4fcbz                        1/1     Running   0          74m
kube-system          coredns-6f6b679f8f-v8fm6                        1/1     Running   0          74m
kube-system          etcd-kind-control-plane                         1/1     Running   0          74m
kube-system          kindnet-9p4st                                   1/1     Running   0          74m
kube-system          kube-apiserver-kind-control-plane               1/1     Running   0          74m
kube-system          kube-controller-manager-kind-control-plane      1/1     Running   0          74m
kube-system          kube-proxy-5xv2w                                1/1     Running   0          74m
kube-system          kube-scheduler-kind-control-plane               1/1     Running   0          57m
local-path-storage   local-path-provisioner-57c5987fd4-sfx5n         1/1     Running   0          74m
scheduler-plugins    scheduler-plugins-controller-845cfd89c6-vvg4p   1/1     Running   0          60m

Second scheduler

Follow the instructions - Reference

Install the scheduler-plugins

$ helm install --repo https://scheduler-plugins.sigs.k8s.io scheduler-plugins scheduler-plugins

Check that scheduler-plugins is running:

$ kubectl get deploy

NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
kuberay-operator               1/1     1            1           19s
scheduler-plugins-controller   1/1     1            1           72s
scheduler-plugins-scheduler    1/1     1            1           72s

Ray operator setup

Set helm-chart/kuberay-operator/values.yaml's batchScheduler.name to scheduler-plugins-scheduler

batchScheduler:
  enabled: false
  name: "scheduler-plugins-scheduler"

Apply deploy.yaml

Run the command to deploy the RayCluster with scheduler-plugins-scheduler and gang scheduling enabled:

$ k apply -f deploy.yaml

Result

Get Status

$ k get raycluster

NAME                 DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
raycluster-kuberay   3                 3                   4      5G       0      ready    8m41s
$ k get podgroup raycluster-kuberay -o yaml

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  creationTimestamp: "2025-07-13T11:15:31Z"
  generation: 1
  name: raycluster-kuberay
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1
    kind: RayCluster
    name: raycluster-kuberay
    uid: 626e1351-ca01-4759-a3ec-96fb9747019c
  resourceVersion: "2760"
  uid: 802b40d9-3ec9-4234-aaa1-f4580189a403
spec:
  minMember: 4
  minResources:
    cpu: "4"
    memory: 5G
status:
  occupiedBy: default/raycluster-kuberay
  phase: Running
  running: 4

Get the scheduler name for the Ray operator, Ray head, and Ray workers

$ k get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.schedulerName}{"\n"}{end}'

kuberay-operator-5f997dbf6c-mdj8c       default-scheduler
raycluster-kuberay-head scheduler-plugins-scheduler
raycluster-kuberay-workergroup-worker-kgjbc     scheduler-plugins-scheduler
raycluster-kuberay-workergroup-worker-rgml8     scheduler-plugins-scheduler
raycluster-kuberay-workergroup-worker-xlpwp     scheduler-plugins-scheduler
scheduler-plugins-controller-845cfd89c6-f7bqt   default-scheduler
scheduler-plugins-scheduler-5dd667cb77-99t6x    default-scheduler

Modify the deploy.yaml and apply it

  workerGroupSpecs:
  - replicas: 100
    minReplicas: 1
    maxReplicas: 200

Run the command to check whether all of the Pods are in Pending status:

$ kubectl get pods -A

NAMESPACE            NAME                                            READY   STATUS    RESTARTS   AGE
default              kuberay-operator-5f997dbf6c-mdj8c               1/1     Running   0          13m
default              raycluster-kuberay-head                         0/1     Pending   0          22s
default              raycluster-kuberay-workergroup-worker-24rss     0/1     Pending   0          15s
default              raycluster-kuberay-workergroup-worker-2l7cp     0/1     Pending   0          19s
default              raycluster-kuberay-workergroup-worker-2lrxx     0/1     Pending   0          15s
... (many Pending worker Pods omitted)
default              raycluster-kuberay-workergroup-worker-xrw7w     0/1     Pending   0          15s
default              raycluster-kuberay-workergroup-worker-xz88j     0/1     Pending   0          16s
default              raycluster-kuberay-workergroup-worker-z98xm     0/1     Pending   0          22s
default              raycluster-kuberay-workergroup-worker-zl6qs     0/1     Pending   0          17s
default              raycluster-kuberay-workergroup-worker-zv4md     0/1     Pending   0          17s
default              scheduler-plugins-controller-845cfd89c6-f7bqt   1/1     Running   0          14m
default              scheduler-plugins-scheduler-5dd667cb77-99t6x    1/1     Running   0          14m
kube-system          coredns-6f6b679f8f-gjcvt                        1/1     Running   0          22m
kube-system          coredns-6f6b679f8f-ldrt2                        1/1     Running   0          22m
kube-system          etcd-kind-control-plane                         1/1     Running   0          22m
kube-system          kindnet-mpbzl                                   1/1     Running   0          22m
kube-system          kube-apiserver-kind-control-plane               1/1     Running   0          22m
kube-system          kube-controller-manager-kind-control-plane      1/1     Running   0          22m
kube-system          kube-proxy-ftjnh                                1/1     Running   0          22m
kube-system          kube-scheduler-kind-control-plane               1/1     Running   0          22m
local-path-storage   local-path-provisioner-57c5987fd4-w2sx9         1/1     Running   0          22m

Related issue number

Closes #3769

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: Cheyu Wu <cheyu1220@gmail.com>
@CheyuWu force-pushed the feat/second-schedule branch from 09c3853 to ea71807 on July 9, 2025 14:52
@CheyuWu
Contributor Author

CheyuWu commented Jul 9, 2025

Hi @kevin85421, PTAL

@kevin85421
Member

Why do you use single scheduler for manual test?

@kevin85421
Member

cc @troychiu for review

@CheyuWu
Contributor Author

CheyuWu commented Jul 10, 2025

Why do you use single scheduler for manual test?

Hi @kevin85421
Although both default-scheduler and scheduler-plugins are configured in /etc/kubernetes/sched-cc.yaml,
the Ray Pods (head and workers) are explicitly assigned to the scheduler-plugins scheduler, as shown in:

labels:
  ray.io/scheduler-name: scheduler-plugins

and verified via:

$ kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.schedulerName}{"\n"}{end}'

This setup follows the multi-scheduler approach, where the KubeRay operator itself is scheduled by default-scheduler, and the RayCluster Pods are scheduled by scheduler-plugins.

I’ll revise the wording in the PR description to avoid confusion around the "single scheduler" statement.

@kevin85421
Member

kevin85421 left a comment

As I understand it, you deploy scheduler-plugins in "single scheduler" mode to replace the default scheduler. For "second scheduler" mode, you need to use the Helm chart to install scheduler-plugins in a separate Pod.

https://github.com/kubernetes-sigs/scheduler-plugins/blob/93126eabdf526010bf697d5963d849eab7e8e898/doc/install.md#as-a-second-scheduler

@CheyuWu
Contributor Author

CheyuWu commented Jul 10, 2025

As I understand it, you deploy scheduler-plugins in "single scheduler" mode to replace the default scheduler. For "second scheduler" mode, you need to use the Helm chart to install scheduler-plugins in a separate Pod.

https://github.com/kubernetes-sigs/scheduler-plugins/blob/93126eabdf526010bf697d5963d849eab7e8e898/doc/install.md#as-a-second-scheduler

Oops, I had a misunderstanding. I will use the second scheduler mode instead.

@CheyuWu
Contributor Author

CheyuWu commented Jul 11, 2025

Hi @kevin85421 @troychiu, I have updated the manual testing procedure, PTAL

@CheyuWu
Contributor Author

CheyuWu commented Jul 12, 2025

I have also updated the 100 pods manual testing, and all of them are in pending status

@kevin85421
Member

I have also updated the 100 pods manual testing, and all of them are in pending status

Have you tested for both single scheduler and second scheduler for this 100 Pods RayCluster CR?

@@ -90,8 +90,7 @@ func (k *KubeScheduler) AddMetadataToPod(_ context.Context, app *rayv1.RayCluster
 	if k.isGangSchedulingEnabled(app) {
 		pod.Labels[kubeSchedulerPodGroupLabelKey] = app.Name
 	}
-	// TODO(kevin85421): Currently, we only support "single scheduler" mode. If we want to support
-	// "second scheduler" mode, we need to add `schedulerName` to the pod spec.
+	pod.Spec.SchedulerName = k.Name()

Comment on lines 136 to 143
if cluster.Labels == nil {
	cluster.Labels = make(map[string]string)
}
if tt.enableGang {
	cluster.Labels["ray.io/gang-scheduling-enabled"] = "true"
} else {
	delete(cluster.Labels, "ray.io/gang-scheduling-enabled")
}

Contributor

Will this be cleaner?

Suggested change
-if cluster.Labels == nil {
-	cluster.Labels = make(map[string]string)
-}
-if tt.enableGang {
-	cluster.Labels["ray.io/gang-scheduling-enabled"] = "true"
-} else {
-	delete(cluster.Labels, "ray.io/gang-scheduling-enabled")
-}
+cluster.Labels = make(map[string]string)
+if tt.enableGang {
+	cluster.Labels["ray.io/gang-scheduling-enabled"] = "true"
+}

scheduler := &KubeScheduler{}
scheduler.AddMetadataToPod(context.TODO(), &cluster, "worker", pod)

if tt.expectedPodGroup {

Contributor

Can we simply use enableGang instead of having another parameter? I think they have a similar intention.
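
For illustration, a self-contained sketch of that suggestion (the test name, table fields, and fixtures here are assumptions, not the actual test in this PR):

package schedulerplugins

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// Drive both the label setup and the expectation from a single enableGang flag,
// instead of carrying a separate expectedPodGroup parameter.
func TestAddMetadataToPod_GangScheduling(t *testing.T) {
	tests := []struct {
		name       string
		enableGang bool
	}{
		{name: "gang scheduling enabled", enableGang: true},
		{name: "gang scheduling disabled", enableGang: false},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			cluster := rayv1.RayCluster{
				ObjectMeta: metav1.ObjectMeta{Name: "raycluster-test", Labels: map[string]string{}},
			}
			if tt.enableGang {
				cluster.Labels["ray.io/gang-scheduling-enabled"] = "true"
			}
			pod := &corev1.Pod{
				ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{}},
			}

			scheduler := &KubeScheduler{}
			scheduler.AddMetadataToPod(context.TODO(), &cluster, "worker", pod)

			// The pod-group label should be present exactly when gang scheduling is enabled.
			_, hasPodGroup := pod.Labels["scheduling.x-k8s.io/pod-group"]
			if hasPodGroup != tt.enableGang {
				t.Errorf("pod-group label present = %v, want %v", hasPodGroup, tt.enableGang)
			}
		})
	}
}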

@troychiu
Contributor

troychiu commented Jul 13, 2025

As @kevin85421 mentioned, can you also double check if both modes work fine?

@CheyuWu
Contributor Author

CheyuWu commented Jul 13, 2025

@kevin85421 @troychiu,

  • I have updated the Manual Testing portion for both the single scheduler and the second scheduler.
  • Used scheduler-plugins-scheduler instead
  • Fixed the redundant parameter in the test

@@ -21,7 +21,7 @@ import (
 )

 const (
-	schedulerName string = "scheduler-plugins"
+	schedulerName string = "scheduler-plugins-scheduler"

@CheyuWu
Contributor Author

CheyuWu commented Jul 14, 2025

Yes, this is important. I will add the comment.

@@ -69,13 +69,13 @@ logging:
 #
 # 4. Use PodGroup
 # batchScheduler:
-#   name: scheduler-plugins
+#   name: scheduler-plugins-scheduler

Contributor

For user-facing config, I am not sure if we should use "scheduler-plugins" or "scheduler-plugins-scheduler". Wdyt?

Contributor Author

You are right, and it's easier to understand.

@CheyuWu
Contributor Author

CheyuWu commented Jul 14, 2025

But I think this is a little awkward; we cannot directly change GetPluginName, because

case schedulerplugins.GetPluginName():

If we need to change batchScheduler to scheduler-plugins, the code would probably be

const (
	schedulerName                 string = "scheduler-plugins"
+	defaultSchedulerName          string = "scheduler-plugins-scheduler"
	kubeSchedulerPodGroupLabelKey string = "scheduling.x-k8s.io/pod-group"
)

func GetPluginName() string {
	return schedulerName
}

func (k *KubeScheduler) Name() string {
	return defaultSchedulerName // Is it fine to change it to something like this?
}

I am not sure if there is a better idea.

Contributor

IMO, user experience is more important, so this is fine to me. However, we'll need good variable naming and comments explaining why there are two names and their corresponding responsibilities.
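
For illustration, one possible naming and commenting scheme (a sketch only; the constant names are suggestions, not the final code in this PR):

package schedulerplugins

// The KubeScheduler type is assumed to be the existing one in this package.
const (
	// pluginName is the user-facing value for batchScheduler.name in the
	// KubeRay operator configuration / Helm chart.
	pluginName string = "scheduler-plugins"
	// podSchedulerName is written into spec.schedulerName of the Ray Pods and
	// must match the scheduler deployed by the scheduler-plugins Helm chart.
	podSchedulerName string = "scheduler-plugins-scheduler"

	kubeSchedulerPodGroupLabelKey string = "scheduling.x-k8s.io/pod-group"
)

// GetPluginName returns the name users reference in the KubeRay config.
func GetPluginName() string { return pluginName }

// Name returns the scheduler name assigned to the Ray Pods.
func (k *KubeScheduler) Name() string { return podSchedulerName }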
