Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[State Observability] Basic functionality for centralized data #23744

Merged
merged 8 commits into from
Apr 14, 2022

Conversation

rkooo567
Copy link
Contributor

@rkooo567 rkooo567 commented Apr 6, 2022

Why are these changes needed?

Support listing actor/pg/job/node/workers

Design doc: https://docs.google.com/document/d/1IeEsJOiurg-zctOcBjY-tQVbsCmURFSnUCTkx_4a7Cw/edit#heading=h.9ub9e6yvu9p2

Note that this PR doesn't contain any output except ids. I will update them in the follow-up PRs.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -319,6 +320,11 @@ def process_error(error_data):
except Exception:
logger.exception("Error receiving error info from GCS.")

@routes.get("/api/v0/nodes")
async def get_nodes(self, req) -> aiohttp.web.Response:
nodes = await self._dashboard_head.gcs_state_aggregator.get_nodes()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's returned for node? Are we returning the node id as hex string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it returns hex string. I will add tests

@rkooo567
Copy link
Contributor Author

rkooo567 commented Apr 8, 2022

This is ready for the initial review. I will add

  • tests
  • Limit (it is passed to the server, but it is not applied now).

The biggest question is how to avoid duplicated implementation + is the current way of generating schema the right approach (I assumed we cannot use Pydantic & need to stick to aiohttp implementation for now)?

def __init__(self, dashboard_head):
super().__init__(dashboard_head)

@routes.get("/api/v0/placement_groups")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally, I was thinking to create a state_head.py file and add all routes in there, but it seems like the current codebase makes it difficult to do so. It is because it assumes each of module cannot communicate each other directly through the interface, and only data is shared. This means if we want to use some of data from the job module, it is not possible within the state module.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we at least combine all the routes that use gcs_state_aggregator into a single head.py?

Copy link
Contributor Author

@rkooo567 rkooo567 Apr 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code will be structured in this way;

Routes (per each state)
------------------------
Post Processing (e.g., limit)
------------------------
Aggregator (from agent, GCS, and raylet)

so grouping routes by aggregator seems a bit unnatural to me. Wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong preference here: initially I was thinking it's nice to have all public apis in a single place so people can find them easily.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah totally on the same page. I prefer to go with this route, but we might need to allow each module to share interface in this case. I will take a look at it in the sooner future, but for now, I will stick to the current style if you don't have strong preference here!

return _list("nodes", ListApiOptions(limit=limit, timeout=timeout), address=address)


def list_jobs(address: str = None, limit: int = 1000, timeout: int = 30):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now address is API server address. Should we accept the GCS address instead? (that's how other APIs are doing it seems like)



# TODO(sang): Replace it with auto-generated methods.
def list_actors(address: str = None, limit: int = 1000, timeout: int = 30):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the current style make sense? It will accept options in the API spec as each argument. https://docs.google.com/document/d/1eyvSPnYgXBdEXB2-qm_gKDDfO2eEnHe7R1Mu6y-9FWU/edit#

Copy link
Contributor

@scv119 scv119 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:dogscience:

dashboard/state_aggregator.py Show resolved Hide resolved
dashboard/state_aggregator.py Outdated Show resolved Hide resolved
dashboard/state_aggregator.py Outdated Show resolved Hide resolved
dashboard/state_aggregator.py Outdated Show resolved Hide resolved
python/ray/experimental/state/common.py Show resolved Hide resolved
@raulchen
Copy link
Contributor

@rkooo567 could you make the doc public?

def __init__(self, dashboard_head):
super().__init__(dashboard_head)

@routes.get("/api/v0/placement_groups")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we at least combine all the routes that use gcs_state_aggregator into a single head.py?

dashboard/state_aggregator.py Show resolved Hide resolved
python/ray/experimental/state/api.py Outdated Show resolved Hide resolved
@@ -230,6 +230,11 @@ async def kill_actor(self, req) -> aiohttp.web.Response:

return rest_response(success=True, message=f"Killed actor with id {actor_id}")

@routes.get("/api/v0/actors")
async def get_actors(self, req) -> aiohttp.web.Response:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we supporting list options like timeout and limit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't added the code yet (sorry my bad). limit will be used for post-processing (after aggregation), and timeout will be used to decide the internal RPC timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be handled in a separate PR. If you'd like me to, I can remove the options from this PR.

def list_workers(api_server_url: str = None, limit: int = 1000, timeout: int = 30):
return _list(
"workers",
ListApiOptions(limit=limit, timeout=timeout),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no pagination?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't plan to support pagination in the scope of the project (but we will only limit the output). Pagination could be a follow-up project if it is required.

python/ray/tests/test_state_api.py Outdated Show resolved Hide resolved
python/ray/tests/test_state_api.py Show resolved Hide resolved
src/ray/protobuf/gcs.proto Outdated Show resolved Hide resolved
dashboard/modules/actor/actor_head.py Show resolved Hide resolved
@rkooo567
Copy link
Contributor Author

rkooo567 commented Apr 13, 2022

@raulchen I need some modification to the doc before making it public. Note that it is the project written in this RFC (state observability) #22833

The high-level idea is that we will expose all states of Ray (extended version of global_state), and we are planning to use the API server to do this. I will create a RFC by the end of this week unless other priorities come up.

@rkooo567
Copy link
Contributor Author

@jjyao So, these items will be done in a follow up;

  • implement limit & timeout
  • improving the API output and docstring
  • error handling according to the spec

dashboard/modules/actor/actor_head.py Show resolved Hide resolved
python/ray/experimental/state/common.py Show resolved Hide resolved


# TODO(sang): Replace it with auto-generated methods.
def _list(resource_name: str, options: ListApiOptions, api_server_url: str = None):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edoakes should I implement this using SubmissionClient?

def __init__(self, dashboard_head):
super().__init__(dashboard_head)

@routes.get("/api/v0/placement_groups")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah totally on the same page. I prefer to go with this route, but we might need to allow each module to share interface in this case. I will take a look at it in the sooner future, but for now, I will stick to the current style if you don't have strong preference here!

@rkooo567
Copy link
Contributor Author

All comments are addressed

return r.json()["data"]["result"]


def list_actors(api_server_url: str = None, limit: int = 1000, timeout: int = 30):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any guideline here, but I feel limit and timeout are better to be kw args?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this already a kw args? Or are you saying we should use list_actors(**kwargs)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like adding a * to force arguments after it being kwargs

@rkooo567 rkooo567 merged commit 51a4a1a into ray-project:master Apr 14, 2022
amogkam added a commit that referenced this pull request Apr 14, 2022
amogkam added a commit that referenced this pull request Apr 14, 2022
#23744)" (#23918)

This reverts commit 51a4a1a.

breaking tune multinode tests and kuberay:test_autoscaling_e2e
rkooo567 added a commit to rkooo567/ray that referenced this pull request Apr 15, 2022
rkooo567 added a commit that referenced this pull request Apr 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants