[State Observability] Basic functionality for centralized data #23744

rkooo567 · 2022-04-06T15:59:39Z

Why are these changes needed?

Support listing actors, placement groups, jobs, nodes, and workers.

Design doc: https://docs.google.com/document/d/1IeEsJOiurg-zctOcBjY-tQVbsCmURFSnUCTkx_4a7Cw/edit#heading=h.9ub9e6yvu9p2

Note that the APIs in this PR return only IDs. I will add the remaining output fields in follow-up PRs.

Related issue number

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

src/ray/protobuf/gcs.proto

jjyao · 2022-04-06T23:17:07Z

dashboard/modules/node/node_head.py

@@ -319,6 +320,11 @@ def process_error(error_data):
            except Exception:
                logger.exception("Error receiving error info from GCS.")

+    @routes.get("/api/v0/nodes")
+    async def get_nodes(self, req) -> aiohttp.web.Response:
+        nodes = await self._dashboard_head.gcs_state_aggregator.get_nodes()


What's returned for each node? Are we returning the node ID as a hex string?

I believe it returns a hex string. I will add tests.
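For illustration, here is a minimal sketch of what "returning the node ID as a hex string" could look like. The `node_id_to_hex` and `serialize_nodes` helpers and the sample bytes are hypothetical; the `data`/`result` nesting is only inferred from the client snippet later in this PR (`r.json()["data"]["result"]`), and the real response may carry more fields.

```python
import json


def node_id_to_hex(node_id: bytes) -> str:
    """Encode a binary ID as a lowercase hex string so it is JSON-safe."""
    return node_id.hex()


def serialize_nodes(node_ids):
    """Build the JSON body of a hypothetical nodes response (IDs only)."""
    result = [{"node_id": node_id_to_hex(n)} for n in node_ids]
    return json.dumps({"data": {"result": result}})


# Arbitrary sample bytes standing in for a binary node ID from the GCS.
body = serialize_nodes([bytes.fromhex("0a1b2c")])
```

A test for the endpoint could then assert that every returned ID contains only hex characters.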

rkooo567 · 2022-04-08T15:03:47Z

This is ready for the initial review. I will add:

- tests
- Limit (it is passed to the server, but it is not applied yet).

The biggest questions are how to avoid a duplicated implementation, and whether the current way of generating the schema is the right approach (I assumed we cannot use Pydantic and need to stick to the aiohttp implementation for now).

rkooo567 · 2022-04-08T15:05:15Z

dashboard/modules/placement_group/placement_group_head.py

+    def __init__(self, dashboard_head):
+        super().__init__(dashboard_head)
+
+    @routes.get("/api/v0/placement_groups")


Originally, I was thinking of creating a state_head.py file and adding all routes there, but the current codebase makes that difficult. It assumes that modules cannot communicate with each other directly through an interface; only data is shared. This means that if we want to use some data from the job module, that is not possible within the state module.

Should we at least combine all the routes that use gcs_state_aggregator into a single head.py?

I think the code will be structured like this:

Routes (per each state)
------------------------
Post Processing (e.g., limit)
------------------------
Aggregator (from agent, GCS, and raylet)

so grouping routes by aggregator seems a bit unnatural to me. Wdyt?

No strong preference here. Initially I was thinking it would be nice to have all public APIs in a single place so people can find them easily.

Yeah, totally on the same page. I would prefer to go that route, but we might need to allow modules to share an interface in that case. I will take a look at it in the near future, but for now I will stick to the current style if you don't have a strong preference here!
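To make the layering discussed in this thread concrete, here is a rough sketch under stated assumptions: `FakeStateAggregator`, `apply_limit`, `actors_route`, and `nodes_route` are hypothetical names, and the aggregator returns canned data instead of calling the GCS, raylets, or agents. It only shows how thin per-state routes could share one post-processing step and one aggregator.

```python
import asyncio
from typing import Any, Dict, List


class FakeStateAggregator:
    """Aggregator layer: stands in for calls to the GCS, raylets, and agents."""

    async def get_actors(self) -> List[Dict[str, Any]]:
        return [{"actor_id": f"{i:04x}"} for i in range(5)]

    async def get_nodes(self) -> List[Dict[str, Any]]:
        return [{"node_id": f"{i:04x}"} for i in range(3)]


def apply_limit(entries: List[Dict[str, Any]], limit: int) -> List[Dict[str, Any]]:
    """Post-processing layer: truncate the aggregated result."""
    return entries[:limit]


async def actors_route(aggregator: FakeStateAggregator, limit: int = 1000):
    """Route layer for actors; each state gets its own thin route."""
    return {"result": apply_limit(await aggregator.get_actors(), limit)}


async def nodes_route(aggregator: FakeStateAggregator, limit: int = 1000):
    """Route layer for nodes, reusing the same aggregator and post-processing."""
    return {"result": apply_limit(await aggregator.get_nodes(), limit)}


actors = asyncio.run(actors_route(FakeStateAggregator(), limit=2))
nodes = asyncio.run(nodes_route(FakeStateAggregator()))
```

With this shape, the routes can live in separate modules or in a single file without changing the two lower layers.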

rkooo567 · 2022-04-08T15:07:16Z

python/ray/experimental/state/api.py

+    return _list("nodes", ListApiOptions(limit=limit, timeout=timeout), address=address)
+
+
+def list_jobs(address: str = None, limit: int = 1000, timeout: int = 30):


Right now, address is the API server address. Should we accept the GCS address instead? (That seems to be how other APIs do it.)

rkooo567 · 2022-04-08T15:32:13Z

python/ray/experimental/state/api.py

+
+
+# TODO(sang): Replace it with auto-generated methods.
+def list_actors(address: str = None, limit: int = 1000, timeout: int = 30):


Does the current style make sense? It accepts each option in the API spec as a separate argument. https://docs.google.com/document/d/1eyvSPnYgXBdEXB2-qm_gKDDfO2eEnHe7R1Mu6y-9FWU/edit#
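As a sketch of the style in question, the per-argument options could be bundled into one options object before the request is sent. The `ListApiOptions` body below is an assumption (only its `limit` and `timeout` arguments appear in the diff), and `list_actors_params` is a hypothetical helper, not part of this PR.

```python
from dataclasses import asdict, dataclass


@dataclass
class ListApiOptions:
    # Field names mirror the arguments in the diff; the real class may differ.
    limit: int = 1000
    timeout: int = 30


def list_actors_params(limit: int = 1000, timeout: int = 30) -> dict:
    """Turn per-argument options into query parameters for a list request."""
    options = ListApiOptions(limit=limit, timeout=timeout)
    return asdict(options)


params = list_actors_params(limit=10)
```

Keeping the options in one dataclass means a new option in the API spec is added in one place, which may also help with the auto-generation mentioned in the TODO.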

scv119

:dogscience:

dashboard/modules/placement_group/placement_group_head.py

dashboard/state_aggregator.py

python/ray/experimental/state/common.py

raulchen · 2022-04-13T02:47:58Z

@rkooo567 could you make the doc public?

jjyao · 2022-04-13T05:52:38Z

dashboard/modules/placement_group/placement_group_head.py

+    def __init__(self, dashboard_head):
+        super().__init__(dashboard_head)
+
+    @routes.get("/api/v0/placement_groups")


Should we at least combine all the routes that use gcs_state_aggregator into a single head.py?

dashboard/state_aggregator.py

python/ray/experimental/state/api.py

jjyao · 2022-04-13T06:02:06Z

dashboard/modules/actor/actor_head.py

@@ -230,6 +230,11 @@ async def kill_actor(self, req) -> aiohttp.web.Response:

        return rest_response(success=True, message=f"Killed actor with id {actor_id}")

+    @routes.get("/api/v0/actors")
+    async def get_actors(self, req) -> aiohttp.web.Response:


How are we supporting list options like timeout and limit?

I haven't added the code yet (sorry, my bad). limit will be applied during post-processing (after aggregation), and timeout will set the internal RPC timeout.

It will be handled in a separate PR. If you'd like me to, I can remove the options from this PR.
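A rough sketch of the intended semantics, assuming an asyncio-based server: `timeout` bounds the internal call and `limit` truncates the result afterwards. `slow_source` and `list_with_options` are hypothetical names, and the sleep stands in for a real RPC.

```python
import asyncio


async def slow_source(delay: float, entries: list) -> list:
    """Hypothetical stand-in for an RPC to the GCS or a raylet."""
    await asyncio.sleep(delay)
    return entries


async def list_with_options(source_coro, limit: int, timeout: float) -> list:
    # timeout bounds the internal call; limit truncates after aggregation.
    entries = await asyncio.wait_for(source_coro, timeout=timeout)
    return entries[:limit]


async def main():
    ok = await list_with_options(
        slow_source(0.01, list(range(10))), limit=3, timeout=1.0
    )
    try:
        await list_with_options(slow_source(0.5, []), limit=3, timeout=0.05)
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True
    return ok, timed_out


ok, timed_out = asyncio.run(main())
```

How a timeout should be reported to the caller is left to the error-handling follow-up.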

jjyao · 2022-04-13T06:02:52Z

python/ray/experimental/state/api.py

+def list_workers(api_server_url: str = None, limit: int = 1000, timeout: int = 30):
+    return _list(
+        "workers",
+        ListApiOptions(limit=limit, timeout=timeout),


no pagination?

We don't plan to support pagination within the scope of this project; we will only limit the output. Pagination could be a follow-up project if it is required.

python/ray/tests/test_state_api.py

src/ray/protobuf/gcs.proto

dashboard/modules/actor/actor_head.py

rkooo567 · 2022-04-13T10:03:13Z

@raulchen I need to make some modifications to the doc before making it public. Note that this is the project described in the state observability RFC, #22833.

The high-level idea is that we will expose all states of Ray (an extended version of global_state), and we are planning to use the API server to do this. I will create an RFC by the end of this week unless other priorities come up.

rkooo567 · 2022-04-13T14:41:56Z

@jjyao So, these items will be done in a follow-up:

- implement limit & timeout
- improve the API output and docstrings
- error handling according to the spec

dashboard/modules/actor/actor_head.py

python/ray/experimental/state/common.py

rkooo567 · 2022-04-13T14:58:10Z

python/ray/experimental/state/api.py

+
+
+# TODO(sang): Replace it with auto-generated methods.
+def _list(resource_name: str, options: ListApiOptions, api_server_url: str = None):


@edoakes should I implement this using SubmissionClient?

rkooo567 · 2022-04-13T14:59:37Z

dashboard/modules/placement_group/placement_group_head.py

+    def __init__(self, dashboard_head):
+        super().__init__(dashboard_head)
+
+    @routes.get("/api/v0/placement_groups")


Yeah, totally on the same page. I would prefer to go that route, but we might need to allow modules to share an interface in that case. I will take a look at it in the near future, but for now I will stick to the current style if you don't have a strong preference here!

rkooo567 · 2022-04-13T15:50:01Z

All comments are addressed

jjyao · 2022-04-13T18:21:46Z

python/ray/experimental/state/api.py

+    return r.json()["data"]["result"]
+
+
+def list_actors(api_server_url: str = None, limit: int = 1000, timeout: int = 30):


Do we have any guideline here? I feel limit and timeout would be better as kwargs.

Aren't these already kwargs? Or are you saying we should use list_actors(**kwargs)?

Like adding a * to force the arguments after it to be keyword-only.
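For reference, a bare `*` in a Python signature makes every parameter after it keyword-only. The sketch below uses a stand-in `list_actors` that just echoes its arguments, and the URL is a placeholder; it is not the function from this PR.

```python
def list_actors(api_server_url: str = None, *, limit: int = 1000, timeout: int = 30):
    """Hypothetical signature: everything after * must be passed by name."""
    return {"api_server_url": api_server_url, "limit": limit, "timeout": timeout}


# Passing limit by name works.
named = list_actors("http://localhost:8265", limit=10)

try:
    # Passing limit positionally is rejected with a TypeError.
    list_actors("http://localhost:8265", 10)
    rejected = False
except TypeError:
    rejected = True
```

This keeps call sites readable and allows options to be reordered or added later without breaking positional callers.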

The first iteration

0e5b042

rkooo567 requested review from wuisawesome, ericl, AmeerHajAli, robertnishihara, pcmoritz, raulchen, fishbone, scv119 and mwtian as code owners April 6, 2022 15:59

wuisawesome reviewed Apr 6, 2022

src/ray/protobuf/gcs.proto

jjyao reviewed Apr 6, 2022

Merge branch 'master' into basic-state-apis

6e5e674

rkooo567 assigned alanwguo, scv119 and edoakes Apr 8, 2022

rkooo567 commented Apr 8, 2022

Merge branch 'master' into basic-state-apis

f82084a

rkooo567 commented Apr 8, 2022

rkooo567 added 2 commits April 8, 2022 08:34

Improve some parts

1ab615d

Merge branch 'master' into basic-state-apis

7d53772

scv119 reviewed Apr 12, 2022

rkooo567 assigned jjyao Apr 12, 2022

jjyao reviewed Apr 13, 2022

rkooo567 commented Apr 13, 2022

Addressed code review.

07e0f7f

jjyao approved these changes Apr 13, 2022

rkooo567 added 2 commits April 14, 2022 00:27

Merge branch 'master' into basic-state-apis

9c08743

lint

9f3b170

rkooo567 merged commit 51a4a1a into ray-project:master Apr 14, 2022

amogkam added a commit that referenced this pull request Apr 14, 2022

Revert "[State Observability] Basic functionality for centralized data (

434158f

#23744)" This reverts commit 51a4a1a.

amogkam mentioned this pull request Apr 14, 2022

Revert "[State Observability] Basic functionality for centralized data" #23918

Merged

amogkam added a commit that referenced this pull request Apr 14, 2022

Revert "[State Observability] Basic functionality for centralized data (

fb14e82

#23744)" (#23918) This reverts commit 51a4a1a. breaking tune multinode tests and kuberay:test_autoscaling_e2e

rkooo567 added a commit to rkooo567/ray that referenced this pull request Apr 15, 2022

Revert "Revert "[State Observability] Basic functionality for central…

00c54dd

…ized data (ray-project#23744)" (ray-project#23918)" This reverts commit fb14e82.

rkooo567 mentioned this pull request Apr 15, 2022

Revert "Revert "[State Observability] Basic functionality for central… #23933

Merged

rkooo567 added a commit that referenced this pull request Apr 19, 2022

Revert "Revert "[State Observability] Basic functionality for central… (

1c3329f

#23933) …ized data (#23744)" (#23918)" This reverts commit fb14e82.
