
[Bug] KubeRay operator fails to get serve deployment status due to 500 Internal Server Error #1173

Merged: 3 commits merged into ray-project:master on Jun 21, 2023

Conversation

@kevin85421 (Member) commented Jun 18, 2023

Why are these changes needed?

Trace code

500 Internal Server Error Traceback (most recent call last):
 File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/optional_utils.py", line 279, in decorator
 raise e from None
 File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/optional_utils.py", line 271, in decorator
 ray.init(
 File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
 return func(*args, **kwargs)
 File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 1509, in init
 _global_node = ray._private.node.Node(
 File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/node.py", line 244, in __init__
 node_info = ray._private.services.get_node_to_connect_for_driver(
 File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/services.py", line 444, in get_node_to_connect_for_driver
 return global_state.get_node_to_connect_for_driver(node_ip_address)
 File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/state.py", line 752, in get_node_to_connect_for_driver
 node_info_str = self.global_state_accessor.get_node_to_connect_for_driver(
 File "python/ray/includes/global_state_accessor.pxi", line 156, in ray._raylet.GlobalStateAccessor.get_node_to_connect_for_driver
RuntimeError: b"This node has an IP address of 100.64.101.199, and Ray expects this IP address to be either the GCS address or one of the Raylet addresses. Connected to GCS at rayservice-raycluster-4tcq6-head-svc.ns-team-colligo-rayservice.svc.cluster.local and found raylets at 100.64.55.99 but none of these match this node's IP 100.64.101.199. Are any of these actually a different IP address for the same node?You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node."

#1125 reports the error message above. The user is running Ray 2.3.0, so I traced the Ray 2.3.0 code instead of 2.5.0.

  • Step 1: KubeRay uses an HTTP client to send a GET $DASHBOARD_AGENT_SERVICE:52365/api/serve/deployments/status request to retrieve Serve status.
  • Step 2: The request from Step 1 is received and processed by the get_all_deployment_statuses function in the Serve agent.
  • Step 3: The function get_all_deployment_statuses uses @optional_utils.init_ray_and_catch_exceptions() as a decorator. The decorator checks whether Ray is initialized; if not, it calls ray.init() (a simplified sketch of this behavior follows this list).
  • Step 4: The function ray.init() in worker.py invokes the following code to initialize a Ray node. The head=False parameter indicates that this node is a worker node.
             _global_node = ray._private.node.Node(
                 ray_params,
                 head=False,
                 shutdown_at_exit=False,
                 spawn_reaper=False,
                 connect_only=True,
             )
  • Step 5: The constructor of class Node (node.py) calls the function ray._private.services.get_node_to_connect_for_driver to retrieve address information from GCS.
  • Step 6: In services.py, get_node_to_connect_for_driver returns global_state.get_node_to_connect_for_driver(node_ip_address).
  • Step 7: In state.py, get_node_to_connect_for_driver(self, node_ip_address) calls self.global_state_accessor.get_node_to_connect_for_driver.
  • Step 8: In global_state_accessor.pxi (the Cython wrapper), get_node_to_connect_for_driver forwards the call to the C++ GlobalStateAccessor.
  • Step 9: In global_state_accessor.cc, GetNodeToConnectForDriver reports the RuntimeError shown in the traceback above when it cannot find a registered Raylet matching this node's IP address.
    
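To make Steps 3 and 4 concrete, below is a minimal sketch of the decorator's lazy-initialization behavior. This is not the actual Ray source; the structure and the address="auto" argument are simplified assumptions for illustration.

# Simplified sketch of the behavior described in Steps 3-4 (not the actual Ray source).
# The decorator lazily calls ray.init() the first time a handler runs on the dashboard
# agent. On a worker Pod, ray.init() connects as a non-head node and asks GCS for the
# Raylet registered at this node's IP address; if the node has not finished registering
# with GCS yet, the lookup fails with the RuntimeError shown above.
import functools

import ray

def init_ray_and_catch_exceptions():
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(self, *args, **kwargs):
            if not ray.is_initialized():
                # Corresponds to Step 4: Node(..., head=False, connect_only=True)
                ray.init(address="auto")
            return await func(self, *args, **kwargs)
        return wrapper
    return decorator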

Root cause of #1125

To summarize, the root cause is that KubeRay sends a request to the dashboard agent process on a worker node, but GCS does not have the address information for this Raylet. This means that the dashboard agent process starts serving requests before the node registration process with GCS finishes. See the "Node management" section in the Ray architecture whitepaper for more details.

KubeRay will start sending requests to the dashboard agent processes once the head Pod is running and ready. In other words, KubeRay does not check the status of workers, so it is possible for the dashboard agent processes on workers to receive requests at any moment. See #1074 for more details.

Solution

Without this PR, the Kubernetes service for the dashboard agent uses a round-robin algorithm to evenly distribute traffic among the available Pods, including the head Pod and worker Pods. Hence, this PR sends requests only to the head Pod by switching from the dashboard agent service to the head service.
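To illustrate the change in request target (illustrative only: the actual change is in the KubeRay operator, which is written in Go, and the service names below are hypothetical):

# Illustrative only: hypothetical service names; the real change lives in the KubeRay
# (Go) operator. Before this PR, status requests went to the dashboard agent service,
# whose round-robin load balancing may pick a worker Pod; after this PR, they target
# the head service, so only the head Pod's dashboard agent serves them.
import requests

OLD_URL = "http://raycluster-dashboard-svc:52365/api/serve/deployments/status"  # dashboard agent service (hypothetical name)
NEW_URL = "http://raycluster-head-svc:52365/api/serve/deployments/status"       # head service (hypothetical name)

resp = requests.get(NEW_URL, timeout=5)
print(resp.status_code, resp.text)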

Some things still need to be verified:

  1. Is it possible for the dashboard agent process to start serving requests before the node registration process with GCS finishes?
  2. Does it make sense to send GET and PUT requests only to the head Pod? In my understanding, requests are typically forwarded to the ServeController actor, which is expected to be running on the head Pod.

Related issue number

Closes #1125

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

The following experiment is used to verify the statement "KubeRay will start sending requests to the dashboard agent processes once the head Pod is running and ready. In other words, KubeRay does not check the status of workers, so it is possible for the dashboard agent processes on workers to receive requests at any moment.".

# Step 1: Create a Kind cluster
kind create cluster --image=kindest/node:v1.23.0

# Step 2: Install a KubeRay operator
helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0

# Step 3: Load the Ray image into the Kind cluster
docker pull rayproject/ray:2.4.0
kind load docker-image rayproject/ray:2.4.0

# Step 4: Install gist https://gist.github.com/kevin85421/d64e730acc2e1e3b5223610e450e4478
# In this gist, the worker Pod will sleep for 60 seconds before executing the `ray start` command.
kubectl apply -f rayservice.yaml

# Step 5: Run the following commands and the script `loop_curl.sh` immediately after the head Pod is ready
export DASHBOARD_AGENT_SVC=$(kubectl get svc -l=ray.io/cluster-dashboard  -o custom-columns=SERVICE:metadata.name --no-headers)
kubectl port-forward svc/$DASHBOARD_AGENT_SVC 52365
./loop_curl.sh 2>&1 | tee log

"""loop_curl.sh
#!/bin/bash

kubectl run curl --image=radial/busyboxplus:curl --command -- /bin/sh -c "while true; do sleep 10;done"
export DASHBOARD_AGENT_SVC=$(kubectl get svc -l=ray.io/cluster-dashboard  -o custom-columns=SERVICE:metadata.name --no-headers)
ITERATIONS=120

# Run the loop
for ((i=1; i<=$ITERATIONS; i++)); do
    echo -e "Iteration: $i\n"
    kubectl exec -it curl -- curl -X GET $DASHBOARD_AGENT_SVC:52365/api/serve/deployments/status
    echo -e "\n"
    sleep 1
done
"""

# You will observe some requests fail because they are forwarded to the worker Pod before it is ready.

# Step 6: Update the loop_curl script to send the request to the head service. No request should fail after the head Pod is ready and running.
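# A possible modification of loop_curl.sh for Step 6 (assumptions: the head service
# follows the <cluster>-head-svc naming convention seen in the traceback above, and it
# exposes the dashboard agent port 52365):
#
#   export HEAD_SVC=$(kubectl get svc --no-headers -o custom-columns=SERVICE:metadata.name | grep head-svc)
#   kubectl exec -it curl -- curl -X GET $HEAD_SVC:52365/api/serve/deployments/status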

@kevin85421 marked this pull request as ready for review on June 20, 2023 00:40
@kevin85421 (Member, Author) commented:

cc @anshulomar @msumitjain would you mind reviewing this PR? Thanks!

@kevin85421 (Member, Author) commented:

Hi @rickyyx,

  1. Is it possible for the dashboard agent process on a worker to start serving requests before the node registration process with GCS finishes?
  2. Is it possible for the dashboard agent process on the head to start serving requests before the node registration process with GCS finishes?

Thanks!

@architkulkarni (Contributor) left a comment:

Thanks for the detailed log of the debugging process.

The code looks good to me. I'll defer to @sihanwang41 / @zcin to confirm whether the overall approach makes sense (send requests to head node, not agent)

@kevin85421 (Member, Author) commented:

cc @sihanwang41 would you mind reviewing this PR? Thanks!

@kevin85421 merged commit c22fbfa into ray-project:master on Jun 21, 2023
20 checks passed
@rickyyx (Contributor) commented Jun 21, 2023:

  • Is it possible for the dashboard agent process on a worker to start serving requests before the node registration process with GCS finishes?
  • Is it possible for the dashboard agent process on the head to start serving requests before the node registration process with GCS finishes?

I think that might not happen since the agent should only be discoverable after node registration is done. (But I could be wrong). What's the implication of this variance? I prob need to take a look at the code to be 100% sure.

@kevin85421 (Member, Author) commented:

I think that might not happen since the agent should only be discoverable after node registration is done. (But I could be wrong).

Hi @rickyyx, thank you for the reply! Would you mind explaining more about "discoverable"?

Context:

  • Users get this error when KubeRay sends a GET /api/serve/deployments/status request to a worker's dashboard agent.
  • Without this PR, it is highly possible for KubeRay to send the GET requests to worker Pods before they are ready and running. That's why I guess that the issue is related to sending requests before node registration is done.
  • [Bug] Raylet missing on worker group #1063 has the same error message, and the root cause is that the Ray script is executed before the ray start command (i.e. before node registration?).

@rickyyx (Contributor) commented Jun 21, 2023:

How does KubeRay know the worker node's agent address? Is this something derivable outside of Ray (e.g. from the node's IP address and a port being passed in)?

If so, yes, I think it's possible for the request to arrive on the worker node before node registration is done (which will lead to the original error stack, where the GCS is not able to find the node)

@kevin85421 (Member, Author) commented:

How does KubeRay know the worker node's agent address? Is this something derivable outside of Ray (e.g. from the node's IP address and a port being passed in)?

KubeRay uses a Kubernetes service for the dashboard agents, which distributes traffic round-robin among the available Pods, including the head Pod and worker Pods.

If so, yes, I think it's possible for the request to arrive on the worker node before node registration is done (which will lead to the original error stack, where the GCS is not able to find the node)

Thank you for the confirmation!

@msumitjain (Contributor) commented:

Thank you for picking up this much-needed fix. Apologies for being late to the review party.
The changes look good to me.

@msumitjain (Contributor) left a comment:

LGTM

@anshulomar left a comment:

Thanks @kevin85421 for making this change! Looks good to me.

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request on Sep 24, 2023: [Bug] KubeRay operator fails to get serve deployment status due to 500 Internal Server Error (ray-project#1173)