
[Bug] Readiness probe failed: timeout on minikube #2158

Open
1 of 2 tasks
anovv opened this issue May 20, 2024 · 7 comments
Assignees: kevin85421
Labels: bug (Something isn't working), raycluster

Comments

@anovv
Contributor

anovv commented May 20, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

About 40s after the RayCluster is launched, the operator kills all worker pods due to a failed readiness probe. Nothing is restarted; only the head node, which passes the probe, stays. Events:

Readiness probe failed: command "bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success" timed out

This event is repeated for each worker pod, and each of those pods is then killed. The head node stays healthy; the workers are not restarted.
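
For reference, the same probe command can be run by hand inside a worker pod while it is still up; a minimal sketch (the pod name below is a placeholder):

# list the RayCluster pods and pick a worker (the pod name below is a placeholder)
kubectl get pods
# run the same readiness probe command the operator uses, inside the worker pod
kubectl exec -it raycluster-kuberay-worker-xxxxx -- \
  bash -c "wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success"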

Reproduction script

  1. Launch a minikube cluster
  2. Install kuberay-operator via Helm
  3. Install a RayCluster via Helm
  4. Wait ~40s and watch all worker pods get terminated by the raycluster-controller (a sketch of the commands follows this list)
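
A minimal sketch of the commands I mean, assuming the kuberay Helm repo still needs to be added (the repo URL and chart versions below are the standard ones from the KubeRay docs, not copied verbatim from my setup):

# 1. launch a local minikube cluster (in my case minikube runs inside colima on an M2 mac)
minikube start
# 2. install the KubeRay operator via Helm
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1
# 3. install a RayCluster with an aarch64 Ray image
helm install raycluster kuberay/ray-cluster --version 1.1.1 --set image.tag=2.22.0-py310-aarch64
# 4. after ~40s the worker pods fail their readiness probes and are terminated
kubectl get pods -w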

Anything else

I'm running minikube inside Colima on an M2 Mac. I tried different arm64 versions of the KubeRay operator (1.1.0 and 1.1.1) with the same result.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
anovv added the bug (Something isn't working) and triage labels on May 20, 2024
@kevin85421
Member

Which Ray images are you using? You should use images that include aarch64 in the image tag.

kevin85421 self-assigned this on May 21, 2024
@anovv
Contributor Author

anovv commented May 21, 2024

@kevin85421 yes, I'm using aarch64 images, 2.22.0-py310-aarch64 for Ray, to be exact.

@anovv
Contributor Author

anovv commented May 21, 2024

@kevin85421 do you have any idea what may be happening? This is blocking me.

@kevin85421
Member

I tried the following on my Mac M1, and my RayCluster is healthy; no pods have been killed.

kind create cluster
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1
helm install raycluster kuberay/ray-cluster --version 1.1.1 --set image.tag=2.22.0-py310-aarch64
  • We may have some differences: (1) kind vs minikube, (2) M1 vs M2, (3) different instructions.
    • You can try kind to determine whether the issue is minikube-only or not.
    • Use exactly the same instructions as above in your environment.

Btw, are you in the Ray Slack workspace? It would be helpful to join; other KubeRay users can share their experiences there. You can join the #kuberay-questions channel.

@anovv
Contributor Author

anovv commented May 26, 2024

@kevin85421 what container runtime do you use? Colima or Docker Desktop?

@kevin85421
Member

I use Docker.

@anovv
Contributor Author

anovv commented May 27, 2024

Ok @kevin85421, I think I found the culprit: some weird behaviour with the worker.minReplicas parameter when autoscaling is enabled (head.enableInTreeAutoscaling: true).

Example cases:

  • worker:
      replicas: 4
      minReplicas: 0
      maxReplicas: 1000

    I get 4 pods launched, then (after about 60s) all 4 fail the readiness probe and are killed.

  • worker:
      replicas: 4
      minReplicas: 2
      maxReplicas: 1000

    I get 4 pods launched, then (after about 60s) 2 fail the readiness probe and die, while 2 stay healthy and work.

  • If I set no minReplicas:

    worker:
      replicas: 4
      maxReplicas: 1000

    I get 4 pods launched, then (after about 60s) 3 fail the readiness probe and are killed, while 1 stays healthy and works.

  • If I set worker.replicas = worker.minReplicas = 4, all 4 pods work properly.

I also noticed that not setting worker.maxReplicas leads to weird behaviour as well (the number of pods does not match the request), and the head node reports errors about the autoscaler not working properly.

So I see two possible issues here (which may be interconnected):

  • KubeRay uses worker.minReplicas as the default when the autoscaler is on after a readiness probe failure (which is unexpected, as it should use the worker.replicas value)?
  • Readiness probes fail only on pods not tracked by the autoscaler (not sure why)?

Disabling enableInTreeAutoscaling makes everything work as expected.
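
For completeness, the two configurations that work for me, expressed as helm --set overrides (the chart keys are assumed to map onto the same fields as the YAML fragments above; pick one of the two installs):

# workaround 1: keep in-tree autoscaling on, but pin worker.replicas == worker.minReplicas
helm install raycluster kuberay/ray-cluster --version 1.1.1 \
  --set image.tag=2.22.0-py310-aarch64 \
  --set head.enableInTreeAutoscaling=true \
  --set worker.replicas=4 --set worker.minReplicas=4 --set worker.maxReplicas=1000

# workaround 2: disable in-tree autoscaling entirely
helm install raycluster kuberay/ray-cluster --version 1.1.1 \
  --set image.tag=2.22.0-py310-aarch64 \
  --set head.enableInTreeAutoscaling=false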

What do you think?
