[core][observability] Report idle node information in status and dashboard #39638

Merged: rickyyx merged 13 commits into ray-project:master on Sep 25, 2023

Contributor

@vitsai vitsai commented Sep 13, 2023

Plumbs idle node information through so that it is reflected in ray status, both in the CLI and on the dashboard. This PR does not change the cluster tab of the dashboard UI, but it does plumb the status field through to the datasource for the dashboard to consume.

Changes to the ray status output:

  • Adds a list of idle nodes
  • In verbose mode, prints node activity (the reasons a node is not idle) for each node

======== Autoscaler status: 2023-09-22 23:08:42.399287 ========
GCS request time: 0.000781s

Node status
---------------------------------------------------------------
Active:
 1 node_328da3b1e9273cf946f6ac3dfee9404dacf429566cd0809a7f04f01c
Idle:
 (no idle nodes)
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 1.0/36.0 CPU
 0B/36.04GiB memory
 0B/18.02GiB object_store_memory

Total Demands:
 (no resource demands)

Node: 328da3b1e9273cf946f6ac3dfee9404dacf429566cd0809a7f04f01c
 Usage:
  1.0/36.0 CPU
  0B/36.04GiB memory
  0B/18.02GiB object_store_memory
 Activity:
  Resource: CPU currently in use.
  Busy workers on node.
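
The node status section of the output above groups nodes under Active, Idle, and Pending headings and prints a placeholder when a group is empty. A minimal C++ sketch of that layout follows. `NodeStatusSummary`, `AppendSection`, and `FormatNodeStatus` are hypothetical names invented for illustration, and the `(no active nodes)` placeholder is an assumption. The formatting code changed by this PR is not shown in this conversation and may be structured differently.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical container for this sketch; not a Ray type.
// Each map goes from a node label to the number of nodes with that label.
struct NodeStatusSummary {
  std::map<std::string, int> active;
  std::map<std::string, int> idle;
  std::map<std::string, int> pending;
};

// Appends one section ("Active:", "Idle:", ...) to the stream.
// Prints the placeholder text when the group has no nodes.
void AppendSection(std::ostringstream &out,
                   const std::string &title,
                   const std::map<std::string, int> &nodes,
                   const std::string &empty_text) {
  out << title << ":\n";
  if (nodes.empty()) {
    out << " " << empty_text << "\n";
    return;
  }
  for (const auto &entry : nodes) {
    assert(entry.second > 0);  // a listed label should have at least one node
    out << " " << entry.second << " " << entry.first << "\n";
  }
}

// Builds the three node groups in the order used by the output above.
std::string FormatNodeStatus(const NodeStatusSummary &summary) {
  std::ostringstream out;
  // "(no active nodes)" is assumed; only the idle and pending
  // placeholders appear in the output quoted in this PR.
  AppendSection(out, "Active", summary.active, "(no active nodes)");
  AppendSection(out, "Idle", summary.idle, "(no idle nodes)");
  AppendSection(out, "Pending", summary.pending, "(no pending nodes)");
  return out.str();
}
```

Keeping idle nodes in their own group, rather than folding them into the active group, is what lets the status output distinguish a node that is up but doing no work from one that is busy.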

Why are these changes needed?

Related issue number

#35411

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Contributor

@rickyyx rickyyx left a comment

Looks pretty good to me! Some nits and comments.

So I guess the actual printing or display on the dashboard will be in another PR?

dashboard/datacenter.py (resolved)
src/ray/gcs/gcs_server/gcs_resource_manager.cc (outdated, resolved)
src/ray/protobuf/autoscaler.proto (outdated, resolved)
src/ray/protobuf/gcs.proto (outdated, resolved)
src/ray/raylet/scheduling/local_resource_manager.cc (outdated, resolved)
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
Contributor

@rickyyx rickyyx left a comment

Looks pretty good to me!

The only thing I want to double-check is the snapshot status-setting logic (whether the idle status will get overridden), and whether we could have a test that exercises the new node activity reporting end to end. So far I think it is only being mocked. If an e2e test is non-trivial, then some manual tests with ray status -v, plus some unit tests, are also fine.

And a couple of nits, including:

  • Could you describe in the PR description what changes here in terms of output?

dashboard/datacenter.py (resolved)
python/ray/autoscaler/_private/autoscaler.py (outdated, resolved)
python/ray/autoscaler/_private/util.py (outdated, resolved)
python/ray/autoscaler/v2/tests/test_utils.py (resolved)
python/ray/core/generated/autoscaler_pb2.py (outdated, resolved)
python/ray/tests/test_resource_demand_scheduler.py (outdated, resolved)
src/ray/gcs/gcs_client/accessor.cc (resolved)
src/ray/gcs/gcs_server/gcs_resource_manager.cc (outdated, resolved)
src/ray/gcs/gcs_server/gcs_resource_manager.cc (outdated, resolved)
src/ray/raylet/scheduling/local_resource_manager.cc (outdated, resolved)
Contributor

rickyyx commented Sep 22, 2023

Also, I guess the PR doesn't update the dashboard view yet? Or is that handled automatically by the changes in this PR?

i.e. the active status:

[image]

Contributor

@jjyao jjyao left a comment

Could you add a PR description?

Ideally, also show some example output of ray status and the dashboard after your change.

python/ray/autoscaler/_private/autoscaler.py (outdated, resolved)
python/ray/autoscaler/_private/autoscaler.py (resolved)
Contributor Author

vitsai commented Sep 22, 2023

[Image: Screenshot 2023-09-22 at 10 19 52 AM]

Right now, the idle state is reflected in the dashboard here. I wasn't sure about adding it to that part of the cluster tab, because that view displays information at per-worker granularity, whereas we have idle information at per-node granularity.

Signed-off-by: vitsai <vitsai@cs.stanford.edu>
Contributor

@rickyyx rickyyx left a comment

Is the idle time work being handled outside of this PR? If so, let's document it in a TODO.

dashboard/datacenter.py (resolved)
python/ray/autoscaler/v2/tests/test_utils.py (outdated, resolved)
src/ray/protobuf/autoscaler.proto (resolved)
for (const auto &iter : last_idle_times_) {
  if (iter.second == absl::nullopt) {
    // If it is a WorkFootprint
    if (iter.first.index() == 0) {
      switch (std::get<WorkFootprint>(iter.first)) {
      case WorkFootprint::NODE_WORKERS:
        node_activity << " Node currently has leased workers." << std::endl;
        resources_data.add_node_activity("Busy workers on node.");
Contributor

Suggested change
- resources_data.add_node_activity("Busy workers on node.");
+ resources_data.add_node_activity("Active workers.");

?
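
The excerpt under review walks `last_idle_times_`, treats an entry with no idle timestamp as busy, and uses the variant index to tell a `WorkFootprint` apart from the other kind of key. A self-contained C++17 sketch of that pattern follows. The type aliases, the use of a resource name string as the second variant alternative, `std::optional` in place of `absl::optional`, and the function name `CollectNodeActivity` are all assumptions made for illustration; the actual definitions in `local_resource_manager.cc` are not shown in this conversation. The message strings are taken from the ray status output in the PR description.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for this sketch; the real Ray types differ.
enum class WorkFootprint { NODE_WORKERS };
// Index 0 is a WorkFootprint; index 1 is assumed to be a resource name.
using WorkArtifact = std::variant<WorkFootprint, std::string>;
// An empty optional means "not idle"; a value is the last idle timestamp.
using IdleTimes = std::map<WorkArtifact, std::optional<int64_t>>;

// Returns one human-readable reason per artifact that keeps the node busy.
std::vector<std::string> CollectNodeActivity(const IdleTimes &last_idle_times) {
  std::vector<std::string> activity;
  for (const auto &[artifact, idle_since] : last_idle_times) {
    if (idle_since.has_value()) {
      continue;  // this artifact is idle, so it contributes no reason
    }
    if (std::holds_alternative<WorkFootprint>(artifact)) {
      switch (std::get<WorkFootprint>(artifact)) {
      case WorkFootprint::NODE_WORKERS:
        activity.push_back("Busy workers on node.");
        break;
      }
    } else {
      assert(std::holds_alternative<std::string>(artifact));
      activity.push_back("Resource: " + std::get<std::string>(artifact) +
                         " currently in use.");
    }
  }
  return activity;
}
```

The sketch uses `std::holds_alternative<WorkFootprint>` instead of comparing `index()` with 0. Both checks are equivalent here, but the named form keeps working if the order of the variant alternatives ever changes.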

Contributor

rickyyx commented Sep 23, 2023

Contributor

@rickyyx rickyyx left a comment

via GIPHY

Signed-off-by: vitsai <vitsai@cs.stanford.edu>
@rickyyx rickyyx merged commit a2dedf1 into ray-project:master Sep 25, 2023
103 of 107 checks passed
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Sep 26, 2023
…board (ray-project#39638)


---------

Signed-off-by: vitsai <vitsai@cs.stanford.edu>
rkooo567 pushed a commit to rkooo567/ray that referenced this pull request Sep 28, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…board (ray-project#39638)

---------

Signed-off-by: vitsai <vitsai@cs.stanford.edu>
Signed-off-by: Victor <vctr.y.m@example.com>

3 participants