[Bug] No worker pods created after updating to Kuberay 1.1.0 #2105

Closed
tmyhu opened this issue Apr 25, 2024 · 1 comment
Labels: bug (Something isn't working), triage

Comments

tmyhu commented Apr 25, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

I've updated from 1.0.0, and it seems that under 1.1.0 no worker pods are created when a Ray Cluster is provisioned by the operator, so if the head node is configured to not run any workloads (`num-cpus: 0`) the cluster will never become ready.
Note that this did not seem to impact a pre-existing Ray cluster immediately, but I had to delete and recreate our cluster due to #2088, after which I observed this behaviour.

The following lines repeat in the operator logs:

kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.720Z","logger":"controllers.RayCluster","msg":"Reconciling Ingress","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcileHeadService","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","1 head service found":"ray-py311-raycluster-jc9qt-head-svc"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","Found 1 head Pod":"ray-py311-raycluster-jc9qt-head-n29gn","Pod status":"Running","Pod restart policy":"Always","Ray container terminated status":"nil"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","head Pod":"ray-py311-raycluster-jc9qt-head-n29gn","shouldDelete":false,"reason":"KubeRay does not need to delete the head Pod ray-py311-raycluster-jc9qt-head-n29gn. The Pod status is Running, and the Ray container terminated status is nil."}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","desired workerReplicas (always adhering to minReplicas/maxReplica)":1,"worker group":"small-group","maxReplicas":5,"minReplicas":1,"replicas":1}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","removing the pods in the scaleStrategy of":"small-group"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","workerReplicas":1,"runningPods":0,"diff":0}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","all workers already exist for group":"small-group"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"Environment variable RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV is not set, using default value of 300 seconds","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","cluster name":"ray-py311-raycluster-jc9qt"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"Unconditional requeue after","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","cluster name":"ray-py311-raycluster-jc9qt","seconds":300}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:30.240Z","logger":"controllers.RayService","msg":"Check the head Pod status of the pending RayCluster","RayService":{"name":"ray-py311","namespace":"ray"},"reconcileID":"54a437c6-9c87-4dcb-b057-eb6f9fb62033","RayCluster name":"ray-py311-raycluster-jc9qt"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:30.240Z","logger":"controllers.RayService","msg":"FetchHeadServiceURL","RayService":{"name":"ray-py311","namespace":"ray"},"reconcileID":"54a437c6-9c87-4dcb-b057-eb6f9fb62033","head service name":"ray-py311-raycluster-jc9qt-head-svc","namespace":"ray"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:30.240Z","logger":"controllers.RayService","msg":"FetchHeadServiceURL","RayService":{"name":"ray-py311","namespace":"ray"},"reconcileID":"54a437c6-9c87-4dcb-b057-eb6f9fb62033","head service URL":"ray-py311-raycluster-jc9qt-head-svc.ray.svc.cluster.local:8265","port":"dashboard"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:30.240Z","logger":"controllers.RayService","msg":"shouldUpdate","RayService":{"name":"ray-py311","namespace":"ray"},"reconcileID":"54a437c6-9c87-4dcb-b057-eb6f9fb62033","shouldUpdateServe":false,"reason":"Current Serve config matches cached Serve config, and some deployments have been deployed for cluster ray-py311-raycluster-jc9qt","cachedServeConfig":"OUR_CONFIGURATION"}

It looks like the diff is calculated incorrectly: it is reported as 0 even though workerReplicas is 1 and runningPods is 0, so the operator concludes that all workers already exist for the group and never creates the missing worker pod.
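
For reference, a minimal Go sketch of how I would expect the scale-up diff to be derived. The function names and the counting rule are my assumptions, not KubeRay's actual implementation; with the logged values (workerReplicas=1, runningPods=0) this yields 1, which is why the logged diff of 0 looks wrong to me.

// Hypothetical sketch of the expected scale-up calculation, not KubeRay's
// actual code. workersToCreate and countActiveWorkers are made-up names.
package main

import "fmt"

type workerPod struct {
	name  string
	phase string // e.g. "Running", "Pending"
}

// countActiveWorkers counts worker pods that should satisfy the desired replica count.
func countActiveWorkers(pods []workerPod) int {
	n := 0
	for _, p := range pods {
		if p.phase == "Running" || p.phase == "Pending" {
			n++
		}
	}
	return n
}

// workersToCreate is the number of worker pods the operator still needs to
// create for a group, clamped at zero.
func workersToCreate(desiredReplicas int, pods []workerPod) int {
	diff := desiredReplicas - countActiveWorkers(pods)
	if diff < 0 {
		return 0
	}
	return diff
}

func main() {
	// Values from the operator log above: workerReplicas=1, runningPods=0.
	fmt.Println(workersToCreate(1, nil)) // prints 1; the operator instead logs diff=0
}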

Reproduction script

Update to KubeRay operator 1.1.0 and create a RayService like:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ray-py311
  namespace: ray
spec:
  serveConfigV2: |
   YOUR SERVE CONFIG
  rayClusterConfig:
    rayVersion: '2.11.0'
    enableInTreeAutoscaling: true
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
        num-cpus: '0' # Prevent Ray workloads with non-zero CPU requirements from being scheduled on the head
      template:
        spec:
          serviceAccountName: ray
          containers:
            - name: ray-head
              image: rayproject/ray:2.11.0-py311
              resources:
                limits:
                  cpu: 2
                  memory: 4Gi
                requests:
                  cpu: 2
                  memory: 4Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
                - containerPort: 44217
                  name: as-metrics # autoscaler
                - containerPort: 44227
                  name: dash-metrics # dashboard
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh","-c","ray stop"]
    workerGroupSpecs:
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        groupName: small-group
        rayStartParams: {}
        template:
          spec:
            serviceAccountName: ray
            containers:
              - name: ray-worker
                image: rayproject/ray:2.11.0-py311
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh","-c","ray stop"]
                env:
                  - name: TZ
                    value: Pacific/Auckland
                resources:
                  limits:
                    cpu: 2
                    memory: 4Gi
                  requests:
                    cpu: 2
                    memory: 4Gi
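
To confirm the symptom after applying the manifest, a minimal client-go sketch like the one below lists the worker pods of the group. The label selector (ray.io/node-type, ray.io/group) is my assumption about the labels KubeRay puts on worker pods; under 1.1.0 this kind of check shows zero worker pods for small-group, matching the behaviour described above.

// Minimal sketch for checking worker pods with client-go; the label selector
// is an assumption about KubeRay's pod labels.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (assumes running outside the cluster).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List worker pods of the small-group worker group in the ray namespace.
	pods, err := clientset.CoreV1().Pods("ray").List(context.TODO(), metav1.ListOptions{
		LabelSelector: "ray.io/node-type=worker,ray.io/group=small-group",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("worker pods found: %d\n", len(pods.Items))
	for _, p := range pods.Items {
		fmt.Println(p.Name, p.Status.Phase)
	}
}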

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
tmyhu added the bug (Something isn't working) and triage labels on Apr 25, 2024
kevin85421 (Member) commented:

I think it has already been solved by #2087. I will prepare KubeRay v1.1.1 next week.
