Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Add a utility to check GCS / Ray cluster health #23382

Merged
merged 10 commits into from
Apr 18, 2022

Conversation

mwtian
Copy link
Member

@mwtian mwtian commented Mar 21, 2022

Why are these changes needed?

  • Provide a utility to test connecting to a Ray cluster and verify it has the same Ray version. This is useful to check if a Ray cluster is available at a given address, without connecting to the cluster with the more heavyweight ray.init(). This utility is integrated with ray memory to provide a better error message when the Ray cluster is unavailable. There seem to be user demand for exposing this as an API as well.
  • Improve the error message when the address provided to Ray does not contain port.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@mwtian mwtian marked this pull request as ready for review March 22, 2022 04:09
@mwtian mwtian changed the title [WIP] Ping Ray cluster [Core] Add a utility to ping Ray cluster Mar 22, 2022
@rkooo567
Copy link
Contributor

Just curious, but how would this be different from

request = gcs_service_pb2.CheckAliveRequest()
?

@mwtian
Copy link
Member Author

mwtian commented Mar 23, 2022

They look very similar. The only difference is that ping_cluster() has the additional check for Python and Ray version compatibility. Should we update dashboard to use this utility or use CheckAliveRequest() here instead?

@rkooo567
Copy link
Contributor

They look very similar. The only difference is that ping_cluster() has the additional check for Python and Ray version compatibility. Should we update dashboard to use this utility or use CheckAliveRequest() here instead?

I don't have a strong opinion what to use, but I think we should unify to one (seems like both are for health checking the cluster).

cc @wuisawesome do you see any downsides of just using this util?

Copy link
Contributor

@wuisawesome wuisawesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please please add this functionality to ray healthcheck (which uses CheckAliveRequest). We shouldn't let this type of functionality deviate between components

@mwtian
Copy link
Member Author

mwtian commented Mar 23, 2022

How about using CheckAliveRequest() to check connectivity, and use cluster metadata to check Python & Ray versions? I'm not sure we want to add unnecessary data to the health check response.

@mwtian
Copy link
Member Author

mwtian commented Mar 24, 2022

I thought about the use case and only verifying Ray versions match should be sufficient.

@wuisawesome
Copy link
Contributor

Ok can we change the api name to make it clear that this is not a healthcheck or way to ping the cluster? some form of cluster metadata that we can expose through the dashboard api seems reasonable.

@mwtian
Copy link
Member Author

mwtian commented Mar 24, 2022

@wuisawesome which API are you referring to, ping_cluster() or CheckAlive?

@rkooo567
Copy link
Contributor

rkooo567 commented Apr 4, 2022

I will leave it up to @wuisawesome also cc @jon-chuang

@wuisawesome
Copy link
Contributor

@wuisawesome which API are you referring to, ping_cluster() or CheckAlive?

CheckAlive is a grpc health check, and powers ray health-check. (I don't think the grpc endpoint is stable across versions since only ray health-check is).

If you're trying to add a cluster metadata api and not a health check, I'm saying you should name it GetClusterMetadata. Calling it ping is extremely confusing since well known libraries like Redis use ping as a health check.

Btw it's also still not clear to me why you don't just add this in the health check. The main reason to separate them would be if we have some big metadata/expensive query that we didn't want to run in a health check (which we don't).

@mwtian
Copy link
Member Author

mwtian commented Apr 14, 2022

Btw it's also still not clear to me why you don't just add this in the health check. The main reason to separate them would be if we have some big metadata/expensive query that we didn't want to run in a health check (which we don't).

Actually I'm using the CheckAlive rpc right now. Renaming the function to health check.

@mwtian mwtian changed the title [Core] Add a utility to ping Ray cluster [Core] Add a utility to check GCS / Ray cluster health Apr 14, 2022
@mwtian
Copy link
Member Author

mwtian commented Apr 14, 2022

The plan is to add a public health check API in a subsequent PR.

python/ray/_private/gcs_utils.py Outdated Show resolved Hide resolved
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
@mwtian mwtian added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Apr 15, 2022
@jjyao jjyao merged commit d5d2ef4 into ray-project:master Apr 18, 2022
@mwtian mwtian deleted the ping branch April 18, 2022 19:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants