Search before asking
I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
I've updated from 1.0.0 to 1.1.0, and under 1.1.0 no worker Pods are created when a RayCluster is provisioned by the operator. If the head node is configured to not run any workloads (`num-cpus: 0`), the cluster therefore never becomes ready.
Note that this did not seem to impact a pre-existing Ray cluster immediately, but I had to delete and recreate our cluster due to #2088, after which I observed this behaviour.
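For context, "configured to not run any workloads" refers to a head-group setting along these lines (a fragment for illustration, not our full spec):

```yaml
headGroupSpec:
  rayStartParams:
    num-cpus: "0"  # head Pod advertises zero CPUs, so all tasks require worker Pods
```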
The following messages repeat in the operator logs:
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.720Z","logger":"controllers.RayCluster","msg":"Reconciling Ingress","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcileHeadService","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","1 head service found":"ray-py311-raycluster-jc9qt-head-svc"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","Found 1 head Pod":"ray-py311-raycluster-jc9qt-head-n29gn","Pod status":"Running","Pod restart policy":"Always","Ray container terminated status":"nil"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","head Pod":"ray-py311-raycluster-jc9qt-head-n29gn","shouldDelete":false,"reason":"KubeRay does not need to delete the head Pod ray-py311-raycluster-jc9qt-head-n29gn. The Pod status is Running, and the Ray container terminated status is nil."}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","desired workerReplicas (always adhering to minReplicas/maxReplica)":1,"worker group":"small-group","maxReplicas":5,"minReplicas":1,"replicas":1}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","removing the pods in the scaleStrategy of":"small-group"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","workerReplicas":1,"runningPods":0,"diff":0}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","all workers already exist for group":"small-group"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"Environment variable RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV is not set, using default value of 300 seconds","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","cluster name":"ray-py311-raycluster-jc9qt"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"Unconditional requeue after","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","cluster name":"ray-py311-raycluster-jc9qt","seconds":300}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:30.240Z","logger":"controllers.RayService","msg":"Check the head Pod status of the pending RayCluster","RayService":{"name":"ray-py311","namespace":"ray"},"reconcileID":"54a437c6-9c87-4dcb-b057-eb6f9fb62033","RayCluster name":"ray-py311-raycluster-jc9qt"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:30.240Z","logger":"controllers.RayService","msg":"FetchHeadServiceURL","RayService":{"name":"ray-py311","namespace":"ray"},"reconcileID":"54a437c6-9c87-4dcb-b057-eb6f9fb62033","head service name":"ray-py311-raycluster-jc9qt-head-svc","namespace":"ray"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:30.240Z","logger":"controllers.RayService","msg":"FetchHeadServiceURL","RayService":{"name":"ray-py311","namespace":"ray"},"reconcileID":"54a437c6-9c87-4dcb-b057-eb6f9fb62033","head service URL":"ray-py311-raycluster-jc9qt-head-svc.ray.svc.cluster.local:8265","port":"dashboard"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:30.240Z","logger":"controllers.RayService","msg":"shouldUpdate","RayService":{"name":"ray-py311","namespace":"ray"},"reconcileID":"54a437c6-9c87-4dcb-b057-eb6f9fb62033","shouldUpdateServe":false,"reason":"Current Serve config matches cached Serve config, and some deployments have been deployed for cluster ray-py311-raycluster-jc9qt","cachedServeConfig":"OUR_CONFIGURATION"}
It looks like the diff is calculated incorrectly: with workerReplicas = 1 and runningPods = 0 the diff should be 1 (one worker Pod to create), yet it is reported as 0, so the operator concludes "all workers already exist" and never creates the missing worker.
Reproduction script
Update to KubeRay operator 1.1.0 and create a RayService like the sketch below.
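The original manifest was not attached; the following is a minimal sketch reconstructed from the logs above. The image, rayVersion, and Serve config are placeholder assumptions; the name, namespace, worker group name, replica bounds, and `num-cpus: "0"` match the values visible in the report.

```yaml
# Minimal sketch reconstructed from the operator logs — not the original manifest.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ray-py311
  namespace: ray
spec:
  serveConfigV2: |
    # Serve config elided in the original report ("OUR_CONFIGURATION")
  rayClusterConfig:
    rayVersion: "2.9.0"  # assumed
    headGroupSpec:
      rayStartParams:
        num-cpus: "0"  # head schedules no tasks, so a worker Pod is required
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0-py311  # assumed
    workerGroupSpecs:
      - groupName: small-group  # matches "worker group":"small-group" in the logs
        replicas: 1
        minReplicas: 1
        maxReplicas: 5
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.0-py311  # assumed
```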
Anything else
No response
Are you willing to submit a PR?