Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug][RayService] KubeRay does not recreate Serve applications if a head Pod without GCS FT recovers from a failure. #1420

Merged
merged 2 commits into from
Sep 14, 2023

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Sep 14, 2023

Why are these changes needed?

The function checkIfNeedSubmitServeDeployment is used to determine whether to update or create Ray Serve applications. Without this PR, this function returns true in the following two conditions:

(1) When there is no Serve config cache in the KubeRay operator, indicating that either a new RayCluster has just become ready or the KubeRay operator has crashed.

(2) When the Serve config in the cache differs from the RayService CR's ServeConfigV2, indicating an in-place Serve config update.

However, in the case of a head Pod crash and restart without GCS FT, the KubeRay operator will not send a request to create new Ray Serve applications because the Serve config cache remains the same as the RayService CR's ServeConfigV2.

Furthermore, the RayCluster will not be labeled as unhealthy because the function getAndCheckServeStatus does not handle the edge case when the number of Ray Serve applications is 0.

TODO

e2e test

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
# Step 1: Install a KubeRay operator with an image built by this PR
# Step 2: Create a RayService (path: ray-operator/config/samples)
kubectl apply -f ray_v1alpha1_rayservice.yaml

# Step 3: Forward the dashboard
kubectl port-forward --address 0.0.0.0 svc/rayservice-sample-head-svc 8265

# Step 4: Check http://127.0.0.1:8265/#/serve

# Step 5: Delete the head Pod
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl delete pod $HEAD_POD

# Step 6: Check the dashboard again after the head Pod restarts
  • Step 4 screenshot
    Screen Shot 2023-09-14 at 3 28 54 PM
  • Step 6 screenshot
    Screen Shot 2023-09-14 at 3 29 49 PM

Without this PR, you cannot see any Serve application in Step 6.

  • Example:
    Screen Shot 2023-09-14 at 3 37 25 PM

@kevin85421 kevin85421 changed the title [WIP][Bug][RayService] KubeRay does not recreate Serve applications if a head Pod without GCS FT recovers from a failure. [Bug][RayService] KubeRay does not recreate Serve applications if a head Pod without GCS FT recovers from a failure. Sep 14, 2023
@kevin85421 kevin85421 marked this pull request as ready for review September 14, 2023 22:08
@architkulkarni architkulkarni self-assigned this Sep 14, 2023
Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. For my understanding, what happens to the old Serve application in this case (when the head pod without GCS FT crashes and restarts)?

@kevin85421
Copy link
Member Author

what happens to the old Serve application in this case (when the head pod without GCS FT crashes and restarts)?

Sync offline: When the head Pod without GCS FT crashes and restarts, the Ray head node will kill all workers because they are "unknown" workers for the head node. In addition, without this PR, the KubeRay operator will not create the Serve applications again.

@kevin85421 kevin85421 merged commit 9b26ba7 into ray-project:master Sep 14, 2023
20 of 21 checks passed
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…ead Pod without GCS FT recovers from a failure. (ray-project#1420)

[Bug][RayService] KubeRay does not recreate Serve applications if a head Pod without GCS FT recovers from a failure
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants