
[Bug] KubeRay operator fails to get serve deployment status due to 500 Internal Server Error #1173

Merged: 3 commits merged into ray-project:master on Jun 21, 2023

Conversation

@kevin85421 (Member) commented Jun 18, 2023

Why are these changes needed?

Trace code

500 Internal Server Error Traceback (most recent call last):
 File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/optional_utils.py", line 279, in decorator
 raise e from None
 File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/optional_utils.py", line 271, in decorator
 ray.init(
 File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
 return func(*args, **kwargs)
 File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 1509, in init
 _global_node = ray._private.node.Node(
 File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/node.py", line 244, in __init__
 node_info = ray._private.services.get_node_to_connect_for_driver(
 File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/services.py", line 444, in get_node_to_connect_for_driver
 return global_state.get_node_to_connect_for_driver(node_ip_address)
 File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/state.py", line 752, in get_node_to_connect_for_driver
 node_info_str = self.global_state_accessor.get_node_to_connect_for_driver(
 File "python/ray/includes/global_state_accessor.pxi", line 156, in ray._raylet.GlobalStateAccessor.get_node_to_connect_for_driver
RuntimeError: b"This node has an IP address of 100.64.101.199, and Ray expects this IP address to be either the GCS address or one of the Raylet addresses. Connected to GCS at rayservice-raycluster-4tcq6-head-svc.ns-team-colligo-rayservice.svc.cluster.local and found raylets at 100.64.55.99 but none of these match this node's IP 100.64.101.199. Are any of these actually a different IP address for the same node?You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node."

#1125 reports the error message above. The user is running Ray 2.3.0, so I traced the Ray 2.3.0 code instead of 2.5.0.

  • Step 1: KubeRay uses an HTTP client to send a GET $DASHBOARD_AGENT_SERVICE:52365/api/serve/deployments/status request to retrieve Serve status.
  • Step 2: The request from Step 1 is received and processed by the get_all_deployment_statuses function in the Serve agent.
  • Step 3: The function get_all_deployment_statuses uses @optional_utils.init_ray_and_catch_exceptions() as a decorator. The decorator checks whether Ray is initialized; if not, it calls ray.init() (a simplified sketch of this behavior follows this list).
  • Step 4: The function ray.init() in worker.py invokes the following code to initialize a Ray node. The head=False parameter indicates that this node is a worker node.
             _global_node = ray._private.node.Node(
                 ray_params,
                 head=False,
                 shutdown_at_exit=False,
                 spawn_reaper=False,
                 connect_only=True,
             )
  • Step 5: The constructor of class Node (node.py) calls the function ray._private.services.get_node_to_connect_for_driver to retrieve address information from GCS.
  • Step 6: In services.py, get_node_to_connect_for_driver returns global_state.get_node_to_connect_for_driver(node_ip_address).
  • Step 7: In state.py, get_node_to_connect_for_driver(self, node_ip_address) calls self.global_state_accessor.get_node_to_connect_for_driver.
  • Step 8: In global_state_accessor.pxi (the Cython wrapper), get_node_to_connect_for_driver forwards the call to the C++ GlobalStateAccessor.
  • Step 9: In global_state_accessor.cc, GetNodeToConnectForDriver reports the RuntimeError shown in the traceback above when it cannot find a registered Raylet matching this node's IP address.
    
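To make Steps 3 and 4 concrete, below is a minimal sketch of the decorator's lazy-initialization behavior. This is not the actual Ray source; the structure and the address="auto" argument are simplified assumptions for illustration.

# Simplified sketch of the behavior described in Steps 3-4 (not the actual Ray source).
# The decorator lazily calls ray.init() the first time a handler runs on the dashboard
# agent. On a worker Pod, ray.init() connects as a non-head node and asks GCS for the
# Raylet registered at this node's IP address; if the node has not finished registering
# with GCS yet, the lookup fails with the RuntimeError shown above.
import functools

import ray

def init_ray_and_catch_exceptions():
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(self, *args, **kwargs):
            if not ray.is_initialized():
                # Corresponds to Step 4: Node(..., head=False, connect_only=True)
                ray.init(address="auto")
            return await func(self, *args, **kwargs)
        return wrapper
    return decorator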

Root cause of #1125

To summarize, the root cause is that KubeRay sends a request to the dashboard agent process on a worker node, but GCS does not have the address information for this Raylet. This means that the dashboard agent process starts serving requests before the node registration process with GCS finishes. See the "Node management" section in the Ray architecture whitepaper for more details.

KubeRay will start sending requests to the dashboard agent processes once the head Pod is running and ready. In other words, KubeRay does not check the status of workers, so it is possible for the dashboard agent processes on workers to receive requests at any moment. See #1074 for more details.

Solution

Without this PR, the Kubernetes service for the dashboard agent uses a round-robin algorithm to evenly distribute traffic among the available Pods, including the head Pod and worker Pods. Hence, this PR sends requests only to the head Pod by switching from the dashboard agent service to the head service.
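To illustrate the change in request target (illustrative only: the actual change is in the KubeRay operator, which is written in Go, and the service names below are hypothetical):

# Illustrative only: hypothetical service names; the real change lives in the KubeRay
# (Go) operator. Before this PR, status requests went to the dashboard agent service,
# whose round-robin load balancing may pick a worker Pod; after this PR, they target
# the head service, so only the head Pod's dashboard agent serves them.
import requests

OLD_URL = "http://raycluster-dashboard-svc:52365/api/serve/deployments/status"  # dashboard agent service (hypothetical name)
NEW_URL = "http://raycluster-head-svc:52365/api/serve/deployments/status"       # head service (hypothetical name)

resp = requests.get(NEW_URL, timeout=5)
print(resp.status_code, resp.text)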

Some things still need to be verified:

  1. Is it possible for the dashboard agent process to start serving requests before the node registration process with GCS finishes?
  2. Does it make sense to send GET and PUT requests only to the head Pod? In my understanding, requests are typically forwarded to the ServeController actor, which is expected to be running on the head Pod.

Related issue number

Closes #1125

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

The following experiment is used to verify the statement "KubeRay will start sending requests to the dashboard agent processes once the head Pod is running and ready. In other words, KubeRay does not check the status of workers, so it is possible for the dashboard agent processes on workers to receive requests at any moment.".

# Step 1: Create a Kind cluster
kind create cluster --image=kindest/node:v1.23.0

# Step 2: Install a KubeRay operator
helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0

# Step 3: Load the Ray image into the Kind cluster
docker pull rayproject/ray:2.4.0
kind load docker-image rayproject/ray:2.4.0

# Step 4: Install gist https://gist.github.com/kevin85421/d64e730acc2e1e3b5223610e450e4478
# In this gist, the worker Pod will sleep for 60 seconds before executing the `ray start` command.
kubectl apply -f rayservice.yaml

# Step 5: Run the following commands and the script `loop_curl.sh` immediately after the head Pod is ready
export DASHBOARD_AGENT_SVC=$(kubectl get svc -l=ray.io/cluster-dashboard  -o custom-columns=SERVICE:metadata.name --no-headers)
kubectl port-forward svc/$DASHBOARD_AGENT_SVC 52365
./loop_curl.sh 2>&1 | tee log

"""loop_curl.sh
#!/bin/bash

kubectl run curl --image=radial/busyboxplus:curl --command -- /bin/sh -c "while true; do sleep 10;done"
export DASHBOARD_AGENT_SVC=$(kubectl get svc -l=ray.io/cluster-dashboard  -o custom-columns=SERVICE:metadata.name --no-headers)
ITERATIONS=120

# Run the loop
for ((i=1; i<=$ITERATIONS; i++)); do
    echo -e "Iteration: $i\n"
    kubectl exec -it curl -- curl -X GET $DASHBOARD_AGENT_SVC:52365/api/serve/deployments/status
    echo -e "\n"
    sleep 1
done
"""

# You will observe some requests fail because they are forwarded to the worker Pod before it is ready.

# Step 6: Update the loop_curl script to send the request to the head service. No request should fail after the head Pod is ready and running.
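# A possible modification of loop_curl.sh for Step 6 (assumptions: the head service
# follows the <cluster>-head-svc naming convention seen in the traceback above, and it
# exposes the dashboard agent port 52365):
#
#   export HEAD_SVC=$(kubectl get svc --no-headers -o custom-columns=SERVICE:metadata.name | grep head-svc)
#   kubectl exec -it curl -- curl -X GET $HEAD_SVC:52365/api/serve/deployments/status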

@kevin85421 marked this pull request as ready for review on June 20, 2023 00:40
@kevin85421 (Member, Author) commented:

cc @anshulomar @msumitjain would you mind reviewing this PR? Thanks!

@kevin85421 (Member, Author) commented:

Hi @rickyyx,

  1. Is it possible for the dashboard agent process on a worker to start serving requests before the node registration process with GCS finishes?
  2. Is it possible for the dashboard agent process on the head to start serving requests before the node registration process with GCS finishes?

Thanks!

@architkulkarni (Contributor) left a comment:

Thanks for the detailed log of the debugging process.

The code looks good to me. I'll defer to @sihanwang41 / @zcin to confirm whether the overall approach makes sense (send requests to head node, not agent)

@kevin85421 (Member, Author) commented:

cc @sihanwang41 would you mind reviewing this PR? Thanks!

@kevin85421 merged commit c22fbfa into ray-project:master on Jun 21, 2023
20 checks passed
@rickyyx (Contributor) commented Jun 21, 2023:

  • Is it possible for the dashboard agent process on a worker to start serving requests before the node registration process with GCS finishes?
  • Is it possible for the dashboard agent process on the head to start serving requests before the node registration process with GCS finishes?

I think that might not happen since the agent should only be discoverable after node registration is done. (But I could be wrong). What's the implication of this variance? I prob need to take a look at the code to be 100% sure.

@kevin85421 (Member, Author) commented:

I think that might not happen since the agent should only be discoverable after node registration is done. (But I could be wrong).

Hi @rickyyx, thank you for the reply! Would you mind explaining more about "discoverable"?

Context:

  • Users get this error when KubeRay sends a GET /api/serve/deployments/status request to a worker's dashboard agent.
  • Without this PR, it is highly possible for KubeRay to send the GET requests to worker Pods before they are ready and running. That's why I guess that the issue is related to sending requests before node registration is done.
  • [Bug] Raylet missing on worker group #1063 has the same error message, and the root cause is that the Ray script is executed before the ray start command (i.e. before node registration?).

@rickyyx (Contributor) commented Jun 21, 2023:

How does KubeRay know the worker node's agent address? Is this something derivable outside of Ray (e.g. from the node's IP address and a port being passed in)?

If so, yes, I think it's possible for the request to arrive on the worker node before node registration is done (which will lead to the original error stack, where the GCS is not able to find the node)

@kevin85421 (Member, Author) commented:

How does KubeRay know the worker node's agent address? Is this something derivable outside of Ray (e.g. from the node's IP address and a port being passed in)?

KubeRay uses a Kubernetes service for the dashboard agents, which distributes traffic round-robin among the available Pods, including the head Pod and worker Pods.

If so, yes, I think it's possible for the request to arrive on the worker node before node registration is done (which will lead to the original error stack, where the GCS is not able to find the node)

Thank you for the confirmation!

@msumitjain (Contributor) commented:

Thank you for picking up this much-needed fix. Apologies for being late to the review party.
The changes look good to me.

@msumitjain (Contributor) left a comment:

LGTM

@anshulomar left a comment:

Thanks @kevin85421 for making this change! Looks good to me.

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request on Sep 24, 2023: [Bug] KubeRay operator fails to get serve deployment status due to 500 Internal Server Error (ray-project#1173)