
[Bug] KubeRay cluster resource status is reporting Ready when there are pods still pending #2188

Open
tsailiming opened this issue Jun 12, 2024 · 4 comments
Labels
bug Something isn't working triage

Comments

@tsailiming

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

apiserver

What happened + What you expected to happen

When pods are stuck in Pending because of insufficient resources, the RayCluster state is still reported as ready.

status:
  desiredCPU: "22"
  desiredGPU: "4"
  desiredMemory: 24G
  desiredTPU: "0"
  desiredWorkerReplicas: 2
  endpoints:
    client: "10001"
    dashboard: "8265"
    gcs: "6379"
    metrics: "8080"
  head:
    serviceIP: 172.30.12.150
  lastUpdateTime: "2024-06-12T13:35:00Z"
  maxWorkerReplicas: 2
  minWorkerReplicas: 2
  observedGeneration: 2
  state: ready

This is the status of the head pod:

status:
  phase: Pending
  conditions:
    - type: PodScheduled
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2024-06-12T13:55:11Z'
      reason: Unschedulable
      message: '0/5 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 3 node(s) didn''t match Pod''s node affinity/selector. preemption: 0/5 nodes are available: 1 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling..'
  qosClass: Burstable
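
A quick way to see the mismatch (a sketch only; it assumes the RayCluster is named raycluster-example in the current namespace and that its pods carry the ray.io/cluster label that KubeRay sets on the pods it creates) is to compare the reported state against the actual pod phases:

# Reported cluster state (returns "ready", per the status block above)
$ kubectl get raycluster raycluster-example -o jsonpath='{.status.state}'

# Actual pod phases for the same cluster (the head pod is Pending)
$ kubectl get pods -l ray.io/cluster=raycluster-example -o wide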

Reproduction script

  1. Submit a RayCluster that fits within the ClusterQueue quota, so that it is admitted and is not left in the Suspended state.
  2. The worker node(s) have insufficient resources to run the pods (a rough manifest sketch follows below).
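
As a rough sketch only (the cluster name, queue label, images, and group name below are placeholders, and the resource figures are chosen so the totals line up with the desired values in the status above, not copied from the actual manifest), such a RayCluster looks roughly like:

apiVersion: ray.io/v1   # may be ray.io/v1alpha1 on older KubeRay releases
kind: RayCluster
metadata:
  name: raycluster-example
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # admitted by the ClusterQueue, so not Suspended
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.7.1
            resources:
              requests:
                cpu: "14"       # more CPU than any schedulable node offers, so the head pod stays Pending
                memory: 16G
  workerGroupSpecs:
    - groupName: workers
      replicas: 2
      minReplicas: 2
      maxReplicas: 2
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.7.1
              resources:
                requests:
                  cpu: "4"
                  memory: 4G
                limits:
                  nvidia.com/gpu: "2"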

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@tsailiming tsailiming added bug Something isn't working triage labels Jun 12, 2024
@tsailiming tsailiming changed the title [Bug] KubeRay cluster resource status is reporting Ready when there are pods are still pending [Bug] KubeRay cluster resource status is reporting Ready when there are pods still pending Jun 12, 2024
@tsailiming
Author

@astefanutti Filed this as per your request.

@andrewsykim
Contributor

andrewsykim commented Jun 12, 2024

@tsailiming what's the KubeRay version? In previous versions it was a known issue that the RayCluster status stays ready indefinitely once it has observed all worker pods as running. There's some discussion about it in #1930

@tsailiming
Author

From one of the head pods. This is from OpenShift AI 2.9.1.

$ ray --version
ray, version 2.7.1

@andrewsykim
Contributor

@tsailiming I meant the KubeRay version, not the Ray version.
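
(For reference, one way to check it, assuming the KubeRay operator runs as a Deployment named kuberay-operator, is to read the operator image tag; the namespace depends on how it was installed:)

$ kubectl get deployment kuberay-operator -n <operator-namespace> \
    -o jsonpath='{.spec.template.spec.containers[0].image}'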
