[Bug] No worker pods created after updating to Kuberay 1.1.0 #2105

Closed
tmyhu opened this issue Apr 25, 2024 · 1 comment
Labels: bug (Something isn't working), triage

Comments

tmyhu commented Apr 25, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

I've updated from 1.0.0, and it seems that under 1.1.0 no worker pods are created when a Ray Cluster is provisioned by the operator, so if the head node is configured to not run any workloads (`num-cpus: 0`) the cluster will never become ready.
Note that this did not seem to impact a pre-existing Ray cluster immediately, but I had to delete and recreate our cluster due to #2088, after which I observed this behaviour.

The following lines repeat in the operator logs:

kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.720Z","logger":"controllers.RayCluster","msg":"Reconciling Ingress","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcileHeadService","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","1 head service found":"ray-py311-raycluster-jc9qt-head-svc"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","Found 1 head Pod":"ray-py311-raycluster-jc9qt-head-n29gn","Pod status":"Running","Pod restart policy":"Always","Ray container terminated status":"nil"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","head Pod":"ray-py311-raycluster-jc9qt-head-n29gn","shouldDelete":false,"reason":"KubeRay does not need to delete the head Pod ray-py311-raycluster-jc9qt-head-n29gn. The Pod status is Running, and the Ray container terminated status is nil."}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","desired workerReplicas (always adhering to minReplicas/maxReplica)":1,"worker group":"small-group","maxReplicas":5,"minReplicas":1,"replicas":1}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","removing the pods in the scaleStrategy of":"small-group"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","workerReplicas":1,"runningPods":0,"diff":0}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","all workers already exist for group":"small-group"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"Environment variable RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV is not set, using default value of 300 seconds","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","cluster name":"ray-py311-raycluster-jc9qt"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:29.721Z","logger":"controllers.RayCluster","msg":"Unconditional requeue after","RayCluster":{"name":"ray-py311-raycluster-jc9qt","namespace":"ray"},"reconcileID":"17eb30fd-edb5-467e-bc3b-82910187b923","cluster name":"ray-py311-raycluster-jc9qt","seconds":300}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:30.240Z","logger":"controllers.RayService","msg":"Check the head Pod status of the pending RayCluster","RayService":{"name":"ray-py311","namespace":"ray"},"reconcileID":"54a437c6-9c87-4dcb-b057-eb6f9fb62033","RayCluster name":"ray-py311-raycluster-jc9qt"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:30.240Z","logger":"controllers.RayService","msg":"FetchHeadServiceURL","RayService":{"name":"ray-py311","namespace":"ray"},"reconcileID":"54a437c6-9c87-4dcb-b057-eb6f9fb62033","head service name":"ray-py311-raycluster-jc9qt-head-svc","namespace":"ray"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:30.240Z","logger":"controllers.RayService","msg":"FetchHeadServiceURL","RayService":{"name":"ray-py311","namespace":"ray"},"reconcileID":"54a437c6-9c87-4dcb-b057-eb6f9fb62033","head service URL":"ray-py311-raycluster-jc9qt-head-svc.ray.svc.cluster.local:8265","port":"dashboard"}
kuberay-operator-5f6864dc59-rvq5f kuberay-operator {"level":"info","ts":"2024-04-25T23:12:30.240Z","logger":"controllers.RayService","msg":"shouldUpdate","RayService":{"name":"ray-py311","namespace":"ray"},"reconcileID":"54a437c6-9c87-4dcb-b057-eb6f9fb62033","shouldUpdateServe":false,"reason":"Current Serve config matches cached Serve config, and some deployments have been deployed for cluster ray-py311-raycluster-jc9qt","cachedServeConfig":"OUR_CONFIGURATION"}

It looks like the diff is calculated incorrectly: it is reported as 0 even though workerReplicas is 1 and runningPods is 0, so the operator concludes that all workers already exist for the group and never creates the missing worker pod.
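
For reference, a minimal Go sketch of how I would expect the scale-up diff to be derived. The function names and the counting rule are my assumptions, not KubeRay's actual implementation; with the logged values (workerReplicas=1, runningPods=0) this yields 1, which is why the logged diff of 0 looks wrong to me.

// Hypothetical sketch of the expected scale-up calculation, not KubeRay's
// actual code. workersToCreate and countActiveWorkers are made-up names.
package main

import "fmt"

type workerPod struct {
	name  string
	phase string // e.g. "Running", "Pending"
}

// countActiveWorkers counts worker pods that should satisfy the desired replica count.
func countActiveWorkers(pods []workerPod) int {
	n := 0
	for _, p := range pods {
		if p.phase == "Running" || p.phase == "Pending" {
			n++
		}
	}
	return n
}

// workersToCreate is the number of worker pods the operator still needs to
// create for a group, clamped at zero.
func workersToCreate(desiredReplicas int, pods []workerPod) int {
	diff := desiredReplicas - countActiveWorkers(pods)
	if diff < 0 {
		return 0
	}
	return diff
}

func main() {
	// Values from the operator log above: workerReplicas=1, runningPods=0.
	fmt.Println(workersToCreate(1, nil)) // prints 1; the operator instead logs diff=0
}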

Reproduction script

Update to KubeRay operator 1.1.0 and create a RayService like:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ray-py311
  namespace: ray
spec:
  serveConfigV2: |
   YOUR SERVE CONFIG
  rayClusterConfig:
    rayVersion: '2.11.0'
    enableInTreeAutoscaling: true
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
        num-cpus: '0' # Prevent Ray workloads with non-zero CPU requirements from being scheduled on the head
      template:
        spec:
          serviceAccountName: ray
          containers:
            - name: ray-head
              image: rayproject/ray:2.11.0-py311
              resources:
                limits:
                  cpu: 2
                  memory: 4Gi
                requests:
                  cpu: 2
                  memory: 4Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
                - containerPort: 44217
                  name: as-metrics # autoscaler
                - containerPort: 44227
                  name: dash-metrics # dashboard
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh","-c","ray stop"]
    workerGroupSpecs:
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        groupName: small-group
        rayStartParams: {}
        template:
          spec:
            serviceAccountName: ray
            containers:
              - name: ray-worker
                image: rayproject/ray:2.11.0-py311
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh","-c","ray stop"]
                env:
                  - name: TZ
                    value: Pacific/Auckland
                resources:
                  limits:
                    cpu: 2
                    memory: 4Gi
                  requests:
                    cpu: 2
                    memory: 4Gi
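
To confirm the symptom after applying the manifest, a minimal client-go sketch like the one below lists the worker pods of the group. The label selector (ray.io/node-type, ray.io/group) is my assumption about the labels KubeRay puts on worker pods; under 1.1.0 this kind of check shows zero worker pods for small-group, matching the behaviour described above.

// Minimal sketch for checking worker pods with client-go; the label selector
// is an assumption about KubeRay's pod labels.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (assumes running outside the cluster).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List worker pods of the small-group worker group in the ray namespace.
	pods, err := clientset.CoreV1().Pods("ray").List(context.TODO(), metav1.ListOptions{
		LabelSelector: "ray.io/node-type=worker,ray.io/group=small-group",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("worker pods found: %d\n", len(pods.Items))
	for _, p := range pods.Items {
		fmt.Println(p.Name, p.Status.Phase)
	}
}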

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
tmyhu added the bug (Something isn't working) and triage labels on Apr 25, 2024
kevin85421 (Member) commented:

I think it has already been solved by #2087. I will prepare KubeRay v1.1.1 next week.
