
[Bug] Readiness probe failed: timeout on minikube #2158

Open
1 of 2 tasks
anovv opened this issue May 20, 2024 · 7 comments
Assignees: kevin85421
Labels: bug (Something isn't working), raycluster

Comments

@anovv
Contributor

anovv commented May 20, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

About 40s after the RayCluster is launched, the operator kills all worker pods due to a failed readiness probe. Nothing is restarted; only the head node, which passes the probe, stays. Events:

Readiness probe failed: command "bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success" timed out

This event is repeated for each worker pod, and each of those pods is then killed. The head node stays healthy; the workers are not restarted.
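
For reference, the same probe command can be run by hand inside a worker pod while it is still up; a minimal sketch (the pod name below is a placeholder):

# list the RayCluster pods and pick a worker (the pod name below is a placeholder)
kubectl get pods
# run the same readiness probe command the operator uses, inside the worker pod
kubectl exec -it raycluster-kuberay-worker-xxxxx -- \
  bash -c "wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success"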

Reproduction script

  1. Launch a minikube cluster
  2. Install kuberay-operator via Helm
  3. Install a RayCluster via Helm
  4. Wait ~40s and watch all worker pods get terminated by the raycluster-controller (a sketch of the commands follows this list)
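
A minimal sketch of the commands I mean, assuming the kuberay Helm repo still needs to be added (the repo URL and chart versions below are the standard ones from the KubeRay docs, not copied verbatim from my setup):

# 1. launch a local minikube cluster (in my case minikube runs inside colima on an M2 mac)
minikube start
# 2. install the KubeRay operator via Helm
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1
# 3. install a RayCluster with an aarch64 Ray image
helm install raycluster kuberay/ray-cluster --version 1.1.1 --set image.tag=2.22.0-py310-aarch64
# 4. after ~40s the worker pods fail their readiness probes and are terminated
kubectl get pods -w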

Anything else

I'm running minikube inside Colima on an M2 Mac. I tried different arm64 versions of the KubeRay operator (1.1.0 and 1.1.1) with the same result.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
anovv added the bug (Something isn't working) and triage labels on May 20, 2024
@kevin85421
Member

Which Ray images are you using? You should use images that include aarch64 in the image tag.

kevin85421 self-assigned this on May 21, 2024
@anovv
Contributor Author

anovv commented May 21, 2024

@kevin85421 yes, I'm using aarch64 images, 2.22.0-py310-aarch64 for Ray, to be exact.

@anovv
Contributor Author

anovv commented May 21, 2024

@kevin85421 do you have any idea what may be happening? This is blocking me.

@kevin85421
Member

I tried the following on my Mac M1, and my RayCluster is healthy; no pods have been killed.

kind create cluster
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1
helm install raycluster kuberay/ray-cluster --version 1.1.1 --set image.tag=2.22.0-py310-aarch64
  • We may have some differences: (1) kind vs minikube, (2) M1 vs M2, (3) different instructions.
    • You can try kind to determine whether the issue is minikube-only or not.
    • Use exactly the same instructions as above in your environment.

Btw, are you in the Ray Slack workspace? It would be helpful to join; other KubeRay users can share their experiences there. You can join the #kuberay-questions channel.

@anovv
Contributor Author

anovv commented May 26, 2024

@kevin85421 what container runtime do you use? Colima or Docker Desktop?

@kevin85421
Member

I use Docker.

@anovv
Contributor Author

anovv commented May 27, 2024

Ok @kevin85421, I think I found the culprit: some weird behaviour with the worker.minReplicas parameter when autoscaling is enabled (head.enableInTreeAutoscaling: true).

Example cases:

  • worker:
      replicas: 4
      minReplicas: 0
      maxReplicas: 1000

    I get 4 pods launched, then (after about 60s) all 4 fail the readiness probe and are killed.

  • worker:
      replicas: 4
      minReplicas: 2
      maxReplicas: 1000

    I get 4 pods launched, then (after about 60s) 2 fail the readiness probe and die, while 2 stay healthy and work.

  • If I set no minReplicas:

    worker:
      replicas: 4
      maxReplicas: 1000

    I get 4 pods launched, then (after about 60s) 3 fail the readiness probe and are killed, while 1 stays healthy and works.

  • If I set worker.replicas = worker.minReplicas = 4, all 4 pods work properly.

I also noticed that not setting worker.maxReplicas leads to weird behaviour as well (the number of pods does not match the request), and the head node reports errors about the autoscaler not working properly.

So I see two possible issues here (which may be interconnected):

  • KubeRay uses worker.minReplicas as the default when the autoscaler is on after a readiness probe failure (which is unexpected, as it should use the worker.replicas value)?
  • Readiness probes fail only on pods not tracked by the autoscaler (not sure why)?

Disabling enableInTreeAutoscaling makes everything work as expected.
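
For completeness, the two configurations that work for me, expressed as helm --set overrides (the chart keys are assumed to map onto the same fields as the YAML fragments above; pick one of the two installs):

# workaround 1: keep in-tree autoscaling on, but pin worker.replicas == worker.minReplicas
helm install raycluster kuberay/ray-cluster --version 1.1.1 \
  --set image.tag=2.22.0-py310-aarch64 \
  --set head.enableInTreeAutoscaling=true \
  --set worker.replicas=4 --set worker.minReplicas=4 --set worker.maxReplicas=1000

# workaround 2: disable in-tree autoscaling entirely
helm install raycluster kuberay/ray-cluster --version 1.1.1 \
  --set image.tag=2.22.0-py310-aarch64 \
  --set head.enableInTreeAutoscaling=false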

What do you think?
