
[core][autoscaler] Add Pod names to the output of ray status -v #51192

Merged
edoakes merged 3 commits into ray-project:master from kevin85421:20250308-devbox1-tmux-6-ray2
Mar 10, 2025

Conversation

Member

@kevin85421 kevin85421 commented Mar 9, 2025

Why are these changes needed?

  1. Currently, the output of `ray status -v` only includes node types (i.e., group names in KubeRay) and Ray node IDs, so it is not easy to map a Ray node ID to the name of the corresponding Ray Pod (i.e., the instance ID in the Autoscaler).
[Screenshot 2025-03-08 at 11 50 43 PM]
  2. Refactor
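
To make the mapping problem in item 1 concrete, here is a minimal sketch of a lookup from Ray node ID to instance ID (the Pod name under KubeRay). It is not the code from this PR: `NodeRecord`, `build_node_id_to_instance_id`, and all IDs below are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class NodeRecord:
    """Hypothetical record for illustration; not a Ray type."""

    ray_node_id: str  # ID assigned by Ray
    instance_id: str  # ID assigned by the node provider (Pod name in KubeRay)
    node_type: str    # node type (group name in KubeRay)


def build_node_id_to_instance_id(nodes: Iterable[NodeRecord]) -> Dict[str, str]:
    """Return a lookup from Ray node ID to instance ID."""
    return {node.ray_node_id: node.instance_id for node in nodes}


# Made-up example data.
example_nodes = [
    NodeRecord("node-id-0001", "example-head-pod", "head_node"),
    NodeRecord("node-id-0002", "example-worker-pod", "worker_node"),
]
node_id_to_instance_id = build_node_id_to_instance_id(example_nodes)
```

With a lookup like this, a status report can print the Pod name next to (or in place of) the opaque Ray node ID.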

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: kaihsun <kaihsun@anyscale.com>
@kevin85421 kevin85421 changed the title from "[core][autoscaler v2] Add instance_id (Pod name) to the output of ray status -v" to "[core][autoscaler v2] Add Pod names to the output of ray status -v" Mar 9, 2025
@kevin85421 kevin85421 changed the title from "[core][autoscaler v2] Add Pod names to the output of ray status -v" to "[core][autoscaler] Add Pod names to the output of ray status -v" Mar 9, 2025
usage_by_node = {}
node_type_mapping = {}
idle_time_map = {}
def _node_usage_report(
Member Author

cc @ryanaoleary I refactored this function a bit:

  1. Avoid passing the whole ClusterStatus to the function to make it more unit testable.
  2. It's not necessary to pass verbose into this function.
  3. Rename the dictionaries using a `..._to_...` naming pattern.
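
As a rough illustration of the three points above (not the actual function in `python/ray/autoscaler/v2/utils.py`), a report helper that takes plain dictionaries instead of a whole `ClusterStatus` can be exercised in a unit test with hand-written inputs. The function name, parameter names, and output layout below are assumptions made for this sketch.

```python
from typing import Dict, List, Tuple


def format_node_usage(
    node_id_to_usage: Dict[str, Dict[str, Tuple[float, float]]],
    node_id_to_node_type: Dict[str, str],
    node_id_to_instance_id: Dict[str, str],
) -> List[str]:
    """Hypothetical helper: build report lines from plain dictionaries.

    Each usage entry maps a resource name to a (used, total) pair.
    """
    lines: List[str] = []
    for node_id in sorted(node_id_to_usage):
        # Fall back to the Ray node ID when no instance ID is known.
        label = node_id_to_instance_id.get(node_id, node_id)
        lines.append(f"Node: {label} ({node_id_to_node_type[node_id]})")
        lines.append(" Usage:")
        for resource, (used, total) in sorted(node_id_to_usage[node_id].items()):
            lines.append(f"  {used}/{total} {resource}")
    return lines


# Made-up inputs: no cluster or ClusterStatus object is needed.
report_lines = format_node_usage(
    node_id_to_usage={"node-id-0001": {"CPU": (1.0, 4.0)}},
    node_id_to_node_type={"node-id-0001": "head_node"},
    node_id_to_instance_id={"node-id-0001": "instance1"},
)
```

Because the inputs are ordinary dictionaries, a test only has to construct the data that the report actually reads.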

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Member Author

cc @ryanaoleary @rueian for review

@kevin85421 kevin85421 added the "go" label (add ONLY when ready to merge, run all tests) Mar 9, 2025
@kevin85421 kevin85421 marked this pull request as ready for review March 9, 2025 17:42
@kevin85421 kevin85421 requested a review from a team as a code owner March 9, 2025 17:42
Comment thread python/ray/autoscaler/v2/utils.py Outdated
Comment thread python/ray/autoscaler/v2/utils.py
Comment thread python/ray/autoscaler/v2/utils.py Outdated
{'GPU': 2, 'CPU': 100}: 2+ from request_resources()

Node: fffffffffffffffffffffffffffffffffffffffffffffffffff00001 (head_node)
Node: instance1 (head_node)
Collaborator

My understanding is that instance1 here is the name of the node that the autoscaler provides (in the case of KubeRay, it'd be the Pod name).

Am I correct?

Member Author

It should be the "instance id". In KubeRay, the "instance id" is the Pod name. You can see the screenshot I added in the PR description for more details.

Member Author

> Am I correct?

For KubeRay, you are correct.

Collaborator

(outside of the scope of this PR and a bit esoteric)

I think the terminology I used in the top-level comment is more general. That is, it's an abstraction leak to call it "instance ID" within the autoscaler because it is not an "instance" in all cases (e.g., it's a pod in Kubernetes). So using a more generic term such as "node name" would be preferable.

Collaborator

Take the concrete example of the line being changed here.

With the terminology you used, it's: Node: <instance_id (but sometimes pod name)>

With my suggestion, it's: Node: <node name>

Member Author

> it's an abstraction leak to call it "instance ID" within the autoscaler because it is not an "instance" in all cases

What's your definition of 'instance' here? Are you referring to a VM? In Autoscaler, an 'instance' is defined as the Ray node runner created by node providers.

https://docs.google.com/document/d/1NzQjA8Mh-oMc-QxXOa529oneWCoA8sDiVoNkBqqDb4U/edit?tab=t.0

image

Are you suggesting that node providers should also be able to set an "instance name", which may differ from the "instance id", so that ray status -v shows the "instance name" instead of the "instance id"?
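
For illustration only, the idea in this question could look like the sketch below: an optional display name that falls back to the instance ID when the provider does not set one. `InstanceInfo`, `instance_name`, and `display_name` are hypothetical and are not part of the Autoscaler API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceInfo:
    """Hypothetical type for illustration; not part of the Autoscaler."""

    instance_id: str                     # always set by the node provider
    instance_name: Optional[str] = None  # optional, provider-chosen label

    def display_name(self) -> str:
        """Prefer the provider's name; otherwise fall back to the ID."""
        return self.instance_name or self.instance_id


# Made-up examples: one provider sets a name, the other does not.
named = InstanceInfo(instance_id="id-123", instance_name="example-worker-pod")
unnamed = InstanceInfo(instance_id="id-456")
```

Under this sketch, existing node providers that never set a name would keep showing the instance ID unchanged.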

Collaborator

I'm basically just suggesting that choosing the name "instance" in this chart was misguided :)

Thanks for posting that link; it is quite helpful, actually.

Signed-off-by: kaihsun <kaihsun@anyscale.com>
@edoakes edoakes enabled auto-merge (squash) March 10, 2025 16:29
@edoakes edoakes merged commit 902b55a into ray-project:master Mar 10, 2025
park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025
…y-project#51192)

1. Currently, the output of `ray status -v` only includes information on
node types (i.e., group names in KubeRay) and Ray node IDs. However, it
is not easy to map a Ray node ID to the name of the corresponding Ray
Pod (i.e. instance id in Autoscaler).

<img width="496" alt="Screenshot 2025-03-08 at 11 50 43 PM"
src="https://github.com/user-attachments/assets/89c66096-88c2-47fb-80d6-08067c7b9d90"
/>

2. Refactor

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
dhakshin32 pushed a commit to dhakshin32/ray that referenced this pull request Mar 27, 2025
Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: Dhakshin Suriakannu <d_suriakannu@apple.com>

Labels

community-backlog, go (add ONLY when ready to merge, run all tests)
