Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Should we also set PublishNotReadyAddresses if the service is not headless? #2157

Open
2 tasks done
rueian opened this issue May 19, 2024 · 3 comments
Open
2 tasks done
Assignees
Labels
enhancement New feature or request raycluster

Comments

@rueian
Copy link
Contributor

rueian commented May 19, 2024

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

In the PR #2117, we found a Ray worker may have a serious delay in connecting to the Head through its headless service with the default PublishNotReadyAddresses=false. So we set PublishNotReadyAddresses=true for headless services in the PR.

This issue is meant to track the following questions:

  1. Determining the source of the delay. Why a RayCluster with one worker node with a default headless service can take 100 seconds to be fully ready in a fresh kind cluster?
  2. Should we also set PublishNotReadyAddresses=true for other service types?

Use case

https://github.com/ray-project/kuberay/pull/2117/files#r1603897666

Related issues

#2117

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@rueian rueian added enhancement New feature or request triage labels May 19, 2024
@rueian
Copy link
Contributor Author

rueian commented Jun 2, 2024

Regarding the source of the delay when PublishNotReadyAddresses=false, I have recorded a demo of creating a RayCluster with one worker in a fresh Kind environment:

Full Demo Recording:

output.mp4

Timeline Breakdown:

  1. The Kind took about 36 seconds to download the Ray image.
  2. The worker took about 53 seconds to pass the ray health-check.
  3. The worker took another 10 seconds to pass the k8s readiness check.

Result Screenshot:

image

The result is the RayCluster took about 100 seconds to be fully ready.

Root Cause Analysis

From the above timeline breakdown, we can see that most of the time was spent on multiple retires of ray health-check.

To inspect what is happening underneath, we can add the GRPC_VERBOSITY=DEBUG before doing the ray health-check and preserve its output:

diff --git a/ray-operator/controllers/ray/common/pod.go b/ray-operator/controllers/ray/common/pod.go
index 9939e63..1d747b7 100644
--- a/ray-operator/controllers/ray/common/pod.go
+++ b/ray-operator/controllers/ray/common/pod.go
@@ -172,16 +172,19 @@ func DefaultWorkerPodTemplate(ctx context.Context, instance rayv1.RayCluster, wo
 			Args: []string{
 				fmt.Sprintf(`
 					SECONDS=0
 					while true; do
 						if (( SECONDS <= 120 )); then
+							export GRPC_VERBOSITY=DEBUG
-							if ray health-check --address %s:%s > /dev/null 2>&1; then
+							if ray health-check --address %s:%s; then
 								break
 							fi
 							echo "$SECONDS seconds elapsed: Waiting for GCS to be ready."

Then we apply the RayCluster again to a fresh Kind environment:

Full Debug Recording:

output2.mp4

Debug Screenshot:

image

We now clearly confirm that the ray health-check can fail multiple times with the Domain name not found error which is indeed caused by PublishNotReadyAddresses=false.

@rueian
Copy link
Contributor Author

rueian commented Jun 2, 2024

Regarding the second question: Should we also set PublishNotReadyAddresses=true for other service types?

My previous thought was that other service types would not raise the Domain name not found error because they had a virtual IP, so there was no need to PublishNotReadyAddresses=true.

However, given that

  1. KubeRay guarantees a Head service will apply to only one Pod at any given time. No worry about accessing a wrong Head.
  2. Allowing users to access the Ray dashboard for troubleshooting through the service when readiness fails is a big use case.

I think we can safely set PublishNotReadyAddresses=true for all Head service types. WDYT @kevin85421? I'd value your feedback on this.

@kevin85421
Copy link
Member

The worker took about 53 seconds to pass the ray health-check.

Interesting. I am very surprised that DNS registration takes much longer than my expectation, although the head Pod is already ready. Thank you for the investigation!

I think we can safely set PublishNotReadyAddresses=true for all Head service types.

Make sense to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request raycluster
Projects
None yet
Development

No branches or pull requests

2 participants