[RayJob][Status][2/n] Redefine ready for RayCluster to avoid using HTTP requests to check dashboard status #1733

Merged (3 commits) on Dec 12, 2023

@kevin85421 kevin85421 commented Dec 11, 2023

Why are these changes needed?

#1674 introduces default readiness and liveness probes for Ray Pods, regardless of whether GCS fault tolerance is enabled. For the head Pod, the readiness probe verifies the status of GCS and Raylet. The GCS check utilizes the API endpoint from the Ray dashboard at DASHBOARD_ADDRESS:8265/api/gcs_healthz.
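As a rough illustration of what the GCS part of that readiness check amounts to, here is a minimal Python sketch. The port, path, and the "success" response come from the description above; the helper names, the `http://` scheme, and the timeout are assumptions for illustration only, and this is not the probe command KubeRay actually injects.

```python
from urllib.error import URLError
from urllib.request import urlopen


def gcs_healthz_url(dashboard_address: str, port: int = 8265) -> str:
    """Build the health endpoint URL (scheme is an assumption)."""
    return f"http://{dashboard_address}:{port}/api/gcs_healthz"


def body_reports_success(body: str) -> bool:
    """The description says the endpoint returns "success" when healthy."""
    return "success" in body


def gcs_is_healthy(dashboard_address: str, timeout: float = 1.0) -> bool:
    """Hypothetical helper: True only if the endpoint answers with "success"."""
    try:
        with urlopen(gcs_healthz_url(dashboard_address), timeout=timeout) as resp:
            text = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and body_reports_success(text)
    except (URLError, OSError):
        # Connection refused, timeout, or a non-2xx response: treat as not ready.
        return False
```

Because the kubelet already runs this kind of check as part of the head Pod's readiness probe, the operator can rely on the Pod's Ready condition instead of repeating the request.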

I checked with @architkulkarni and @shrekris-anyscale. We can infer that the Ray dashboard is ready for “Ray Job Submission” and the Ray Serve RESTful API if DASHBOARD_ADDRESS:8265/api/gcs_healthz returns “success”. Hence, a "ready" RayCluster means that all Ray Pods, including the Ray head Pod, are running and ready.
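The redefined notion of "ready" can be sketched in Python as a pure function over Pod objects. This is only an illustration under stated assumptions: Pods are plain dicts shaped like Kubernetes Pod JSON (`status.phase`, `status.conditions`), the head Pod is identified by a `ray.io/node-type: head` label, and the function names are hypothetical. The actual change lives in the Go operator and may differ in detail.

```python
from typing import Any, Dict, List

Pod = Dict[str, Any]


def pod_is_running_and_ready(pod: Pod) -> bool:
    """True if the Pod phase is Running and its Ready condition is "True"."""
    status = pod.get("status") or {}
    if status.get("phase") != "Running":
        return False
    conditions = status.get("conditions") or []
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def is_head_pod(pod: Pod) -> bool:
    """Assumption: the head Pod carries the label ray.io/node-type=head."""
    labels = (pod.get("metadata") or {}).get("labels") or {}
    return labels.get("ray.io/node-type") == "head"


def ray_cluster_is_ready(pods: List[Pod]) -> bool:
    """Ready means a head Pod exists and every Ray Pod is running and ready."""
    has_head = any(is_head_pod(p) for p in pods)
    return has_head and all(pod_is_running_and_ready(p) for p in pods)
```

No HTTP request to the dashboard appears here: the head Pod's Ready condition already reflects the result of its readiness probe.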

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 kevin85421 changed the title WIP [RayJob][Status][2/n] Redefine ready for RayCluster to avoid using HTTP requests to check dashboard status Dec 11, 2023
@kevin85421 kevin85421 marked this pull request as ready for review December 11, 2023 23:07

@architkulkarni architkulkarni left a comment

Looks good to me!

Question: In the state "Wait for the RayCluster.Status.State to be ready", what is the JobDeploymentStatus?

@kevin85421

> Looks good to me!
>
> Question: In the state "Wait for the RayCluster.Status.State to be ready", what is the JobDeploymentStatus?

It should be "Initializing".

  • JobDeploymentStatus is set to Initializing after JobID and RayClusterName are set, to avoid duplicate creation.
  • Then KubeRay tries to create the RayCluster, and submits the Ray job only once the RayCluster is "ready".
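The two steps above can be sketched as a small decision function. This is a hypothetical Python model, not the operator's Go reconcile code: the `RayJobStatus` fields mirror the names mentioned above, while `Action`, `next_action`, and the `cluster_exists` / `cluster_state` parameters are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    SET_IDS_AND_MARK_INITIALIZING = "set JobID and RayClusterName, then mark Initializing"
    CREATE_RAY_CLUSTER = "create the RayCluster"
    WAIT_FOR_READY = "wait for RayCluster.Status.State to be ready"
    SUBMIT_JOB = "submit the Ray job"


@dataclass
class RayJobStatus:
    job_id: str = ""
    ray_cluster_name: str = ""
    job_deployment_status: str = ""


def next_action(status: RayJobStatus, cluster_exists: bool, cluster_state: str) -> Action:
    """Pick the next step for a RayJob that has not yet submitted its job."""
    if status.job_deployment_status != "Initializing":
        # IDs are recorded first, so a later pass reuses the same names
        # instead of creating a second cluster or job.
        return Action.SET_IDS_AND_MARK_INITIALIZING
    if not cluster_exists:
        return Action.CREATE_RAY_CLUSTER
    if cluster_state != "ready":
        # JobDeploymentStatus stays "Initializing" while waiting.
        return Action.WAIT_FOR_READY
    return Action.SUBMIT_JOB
```

In this model the JobDeploymentStatus remains "Initializing" for the whole wait, which matches the answer given above.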

@kevin85421

  • Ran the following command twice; all tests passed (4 tests per run).
    RAY_IMAGE=rayproject/ray:2.8.0 OPERATOR_IMAGE=controller:latest python3 tests/test_sample_rayjob_yamls.py 2>&1
