[core] The New GcsClient binding #46186

Open
wants to merge 31 commits into
base: master

Contributor

@rynewang rynewang commented Jun 21, 2024

Creates a direct Cython binding for ray::gcs::GcsClient and replaces the existing PythonGcsClient binding. The new binding is enabled by default; one can switch back with RAY_USE_OLD_GCS_CLIENT=1.

The new binding is in its own file gcs_client.pxi included by _raylet.pyx.

Changes:

  • Adds a Cython binding for GcsClient and uses it by default, replacing PythonGcsClient.
  • GcsClient: moves cluster_id from an argument to a GcsClientOptions field.
  • GcsClient: adds a timeout_ms argument to NodeInfoAccessor::AsyncGetAll and JobInfoAccessor::AsyncGetAll.
  • GcsClient: adds JobInfoAccessor::GetAll, NodeInfoAccessor::DrainNodes, and NodeResourceInfoAccessor::GetAllResourceUsage.
  • GcsClient: adds a non-cached version, NodeInfoAccessor::GetAllNoCache.
  • TaskEventBufferImpl: connects gcs_client_ after the thread starts instead of before.
  • Moves check_status from _raylet.pyx to common.pxi for reuse.
  • GcsClient now retries only on gRPC UNAVAILABLE and not on other codes, since we want to retry only when GCS is down and not in other cases (e.g. RESOURCE_EXHAUSTED). See python/ray/tune/tests/test_tune_restore.py::ResourceExhaustedTest::test_resource_exhausted_info
  • Fixes test fixtures: env vars were set after the modules that read them had already been loaded, so a module reload is added.
  • Improves a test message.
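
For illustration, the retry rule in the list above can be sketched in plain Python. This is a minimal sketch and not the code in this PR: StatusCode, RpcFailed, call_with_retry, and make_flaky are hypothetical names, and the enum is a stand-in for real gRPC status codes.

```python
import enum
import time
from typing import Callable, Tuple


class StatusCode(enum.Enum):
    # Hypothetical stand-in for gRPC status codes (not the grpc or Ray enum).
    OK = 0
    UNAVAILABLE = 1
    RESOURCE_EXHAUSTED = 2


class RpcFailed(Exception):
    """Hypothetical error carrying the status code of a failed call."""

    def __init__(self, code: StatusCode):
        super().__init__(code.name)
        self.code = code


def call_with_retry(
    rpc: Callable[[], Tuple[StatusCode, object]],
    max_attempts: int = 5,
    backoff_s: float = 0.0,
):
    """Retry only while the server looks unreachable (UNAVAILABLE)."""
    for attempt in range(1, max_attempts + 1):
        code, reply = rpc()
        if code is StatusCode.OK:
            return reply
        if code is not StatusCode.UNAVAILABLE:
            # Any other code is surfaced immediately; retrying would not help.
            raise RpcFailed(code)
        if attempt < max_attempts:
            time.sleep(backoff_s)
    raise RpcFailed(StatusCode.UNAVAILABLE)


def make_flaky(codes):
    """Build a fake RPC that returns the given status codes in order."""
    it = iter(codes)

    def rpc():
        code = next(it)
        return code, ("pong" if code is StatusCode.OK else None)

    return rpc


# Two UNAVAILABLE replies are retried; the third attempt succeeds.
reply = call_with_retry(
    make_flaky([StatusCode.UNAVAILABLE, StatusCode.UNAVAILABLE, StatusCode.OK])
)
```

With this policy a RESOURCE_EXHAUSTED reply fails on the first attempt, which is the behavior the referenced tune test relies on.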

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
rynewang and others added 6 commits June 24, 2024 11:36
@rynewang rynewang self-assigned this Jun 25, 2024
@rynewang rynewang added the go add ONLY when ready to merge, run all tests label Jun 25, 2024
@rynewang rynewang marked this pull request as ready for review June 27, 2024 17:54
@rynewang rynewang changed the title New gcs client sync [core] The New GcsClient binding Jun 27, 2024
@@ -661,6 +662,8 @@ def test_get_applications_while_gcs_down(
 ):
     # Test serve REST API availability when the GCS is down.
     monkeypatch.setenv("RAY_SERVE_KV_TIMEOUT_S", "3")
+    importlib.reload(ray.serve._private.constants) # to reload the constants set above
Contributor

Why do we need to change this?

Contributor Author

By default, Ray Serve uses an infinite timeout for internal KV put/get. To test the case where GCS is down, the test sets the timeout to 3s. However, this setting never took effect because the env var was not loaded: the constants module had already been imported before the env var was set.

The previous PythonGcsClient returned a GrpcUnavailable error when GCS was down, even with an infinite timeout. The new GcsClient correctly retries indefinitely, so the call hangs. To make the env var take effect, we need to reload the module.
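
The import-time behavior described above can be reproduced in isolation. This is a minimal sketch under stated assumptions: demo_constants and DEMO_KV_TIMEOUT_S are hypothetical names that stand in for a constants module reading an env var at import time; they are not Ray's.

```python
import importlib
import os
import sys
import tempfile
import textwrap

# Write a hypothetical constants module that reads an env var at import time.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "demo_constants.py"), "w") as f:
    f.write(
        textwrap.dedent(
            """
            import os
            KV_TIMEOUT_S = float(os.environ.get("DEMO_KV_TIMEOUT_S", "inf"))
            """
        )
    )
sys.path.insert(0, module_dir)

# First import happens with the env var unset, so the default is captured.
os.environ.pop("DEMO_KV_TIMEOUT_S", None)
demo_constants = importlib.import_module("demo_constants")

# Setting the env var afterwards does not change the already-computed constant.
os.environ["DEMO_KV_TIMEOUT_S"] = "3"
stale = demo_constants.KV_TIMEOUT_S

# Reloading re-executes the module body, which re-reads the env var.
importlib.reload(demo_constants)
fresh = demo_constants.KV_TIMEOUT_S
```

Here stale keeps the infinite default while fresh picks up the 3-second value, which is why the fixture needs the reload after monkeypatch.setenv.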


 cdef class GcsClientOptions:
     """Cython wrapper class of C++ `ray::gcs::GcsClientOptions`."""
     cdef:
         unique_ptr[CGcsClientOptions] inner

     @classmethod
-    def from_gcs_address(cls, gcs_address):
+    def from_gcs_address(cls, gcs_address, cluster_id_hex=None):
Contributor

This method name is no longer accurate.

Contributor

Also, could moving cluster_id into GcsClientOptions be its own PR?

@@ -104,7 +104,7 @@ def ping(self):

     gcs_client = GcsClient(address=ray.get_runtime_context().gcs_address)

-    with pytest.raises(ray.exceptions.RpcError):
+    with pytest.raises(ray.exceptions.RaySystemError):
Contributor

Are there any breaking changes? Ideally this PR shouldn't touch Serve code or any tests.

Contributor Author

This is a fix-forward. drain_node returns an error from the GCS side, which should not be treated as an RpcError (that type indicates a network issue). The Ray Serve changes fix bad test fixtures.
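
The distinction drawn here, between a failure to reach the server and an error the server itself returns, can be sketched as follows. This is a minimal sketch with hypothetical names (TransportError, ServerError, Reply, check_reply); it does not reproduce Ray's exception classes or its status-mapping code.

```python
from dataclasses import dataclass


class TransportError(Exception):
    """Hypothetical: the request never reached the server (network issue)."""


class ServerError(Exception):
    """Hypothetical: the server handled the request and reported an error."""


@dataclass
class Reply:
    delivered: bool        # False if the request never reached the server
    status_ok: bool = True  # server-side outcome, meaningful only if delivered
    message: str = ""


def check_reply(reply: Reply) -> None:
    """Raise the exception type that matches where the failure happened."""
    if not reply.delivered:
        raise TransportError("server unreachable")
    if not reply.status_ok:
        raise ServerError(reply.message)


def classify(reply: Reply) -> str:
    """Return the name of the exception check_reply raises, or 'ok'."""
    try:
        check_reply(reply)
    except (TransportError, ServerError) as e:
        return type(e).__name__
    return "ok"
```

Under this split, a test that expects a server-side rejection would assert on the server error type rather than on the transport error type.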

Labels
go add ONLY when ready to merge, run all tests

2 participants