Skip to content

[Core][GPU fraction][3/n] Encapsulate NodeResources::available behind wrapper methods#63184

Open
dancingactor wants to merge 6 commits into
ray-project:masterfrom
dancingactor:wrap_available_method
Open

[Core][GPU fraction][3/n] Encapsulate NodeResources::available behind wrapper methods#63184
dancingactor wants to merge 6 commits into
ray-project:masterfrom
dancingactor:wrap_available_method

Conversation

@dancingactor
Copy link
Copy Markdown
Contributor

@dancingactor dancingactor commented May 7, 2026

Description

Context

  • This PR is a preparatory refactor for the final goal to change available type from NodeResourceSet (scalar) to NodeResourceInstanceSet (per-instance vector) for accurate GPU fragmentation detection([core] Fix GPU fractional scheduling by tracking per-instance availability #62005)
  • To safely and gradually migrate this big change, we are leveraging branch by abstraction achieved via runtime polymorphism. We introduces a base case and lets both the old class and the new class inherit from it, and change the codebase to interact with the base class pointer. Then, we can use a config flag to toggle what the base class pointer actually points to (the old class or the new class) to control which logic the codebase uses

What This PR Does

  • To achieve runtime polymorphism, this PR encapsulates the direct access to the available field behind wrapper methods so that we can include the wrapper methods in the base case

Note: This PR contains no logic changes, it only wraps original direct access calls with wrapper methods

Introduced methods in NodeResources

Method Purpose
GetAvailableSum(ResourceID) Get the scalar available amount for a resource
GetAvailableResourceIds() Get the set of resource IDs that have explicit available entries
GetAvailableResourceMap() Return available resources as a name->value map
SubtractAvailable(ResourceSet) Subtract resources from available, clamping each entry to 0
SetAvailableResource(ResourceID, FixedPoint) Set a single resource's available to an explicit scalar value
SetAvailable(NodeResourceSet) Replace the entire available field from a NodeResourceSet
HasAvailableResource(ResourceID) Returns true if the resource has an explicit available entry
GetAvailable() Read-only access to the entire available resource set
TakeAvailable() Transfer ownership of the available resource set (move out)

Related PRs

Related to #62005

@dancingactor dancingactor requested a review from a team as a code owner May 7, 2026 06:30
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request encapsulates the available resource set in the NodeResources class by adding getter and setter methods, updating the GCS and cluster resource managers accordingly. The review feedback suggests optimizing the SetAvailable method and its usage by employing std::move to prevent unnecessary copies of resource sets.

Comment thread src/ray/common/scheduling/cluster_resource_data.h Outdated
Comment thread src/ray/raylet/scheduling/cluster_resource_manager.cc Outdated
Comment thread src/ray/common/scheduling/cluster_resource_data.h Outdated
Comment thread src/ray/common/scheduling/cluster_resource_data.h Outdated
@ray-gardener ray-gardener Bot added core Issues that should be addressed in Ray Core gpu GPU related issues community-contribution Contributed by the community labels May 7, 2026
@dancingactor dancingactor force-pushed the wrap_available_method branch from acabe64 to a39dbd6 Compare May 7, 2026 11:23
@dancingactor dancingactor changed the title [WIP][Core][GPU fraction][3/n] Encapsulate NodeResources::available behind wrapper methods [Core][GPU fraction][3/n] Encapsulate NodeResources::available behind wrapper methods May 7, 2026
@Yicheng-Lu-llll Yicheng-Lu-llll self-assigned this May 7, 2026
@Yicheng-Lu-llll Yicheng-Lu-llll added the go add ONLY when ready to merge, run all tests label May 7, 2026
@Yicheng-Lu-llll
Copy link
Copy Markdown
Member

Yicheng-Lu-llll commented May 7, 2026

Thanks so much for the PR! The code change LGTM.

Could you please update the PR description to explain why we need to first refactor to use wrapper methods?

@dancingactor
Copy link
Copy Markdown
Contributor Author

Thanks for review! I have updated the PR description to clarify the context and the reason for this refactor

Copy link
Copy Markdown
Member

@Yicheng-Lu-llll Yicheng-Lu-llll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. Thanks! cc @MengjinYan for review/merge.

Copy link
Copy Markdown
Contributor

@MengjinYan MengjinYan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good!

I have a couple of general questions:

  • Wondering are you aiming to replace all the reference to available with the newly added base functions?
  • After the change, the available field is still public so we cannot prevent future PRs from accessing it elsewhere. Should we make it private to complete the encapsulation?
  • I understand that it is not necessary to create the functions for accessing total. At the same time, the access will be asymmetric which might confuse people. So wondering what's the end state of the available and total fields?

@dancingactor
Copy link
Copy Markdown
Contributor Author

dancingactor commented May 12, 2026

Really appreciate your feedbacks!

  • Wondering are you aiming to replace all the reference to available with the newly added base functions?

Yes, the goal is to replace all direct references to NodeResources.available with these wrapper. Note that there's a really similiar class called NodeResourceInstances also have available field. That class is out of scope for this specific PR

  • After the change, the available field is still public so we cannot prevent future PRs from accessing it elsewhere. Should we make it private to complete the encapsulation?

Yes, I will make it private. Thanks!

  • I understand that it is not necessary to create the functions for accessing total. At the same time, the access will be asymmetric which might confuse people. So wondering what's the end state of the available and total fields?

I think we should encapsulate both the available and total fields. As you pointed out, total is not necessary right now, so I will add the changes for total in the future cleanup PR

@dancingactor dancingactor requested review from a team, dayshah and edoakes as code owners May 13, 2026 09:14
… wrapper methods

Signed-off-by: dancingactor <s990346@gmail.com>
Signed-off-by: dancingactor <s990346@gmail.com>
Signed-off-by: dancingactor <s990346@gmail.com>
- Modify tests to use wrapper methods
- Modify some missing direct access to use wrapper methods

Signed-off-by: dancingactor <s990346@gmail.com>
@dancingactor dancingactor force-pushed the wrap_available_method branch from 92aa44a to 24f1f0f Compare May 13, 2026 09:24
@dancingactor
Copy link
Copy Markdown
Contributor Author

@dayshah @edoakes Sorry for the accidental ping! A messy rebase temporarily triggered an automated review request. The history is clean now, so please feel free to ignore this PR.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 24f1f0f. Configure here.

Comment thread src/ray/raylet/scheduling/cluster_resource_manager.cc Outdated
@dancingactor
Copy link
Copy Markdown
Contributor Author

dancingactor commented May 13, 2026

Hi @MengjinYan, I have applied the changes in 24f1f0f. PTAL

  • Moved available field to private
  • Modified all tests to use wrapper methods
  • Updated remaining places of direct access to use wrapper methods.
  • Introduced HasAvailableResource, GetAvailable and TakeAvailable methods

Signed-off-by: dancingactor <s990346@gmail.com>
@dancingactor dancingactor force-pushed the wrap_available_method branch from a9d3949 to ad48475 Compare May 13, 2026 16:48
@dancingactor dancingactor requested a review from MengjinYan May 14, 2026 01:51
@dancingactor
Copy link
Copy Markdown
Contributor Author

Hi @MengjinYan, gentle ping on this PR for review/merge whenever you have a moment. Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests gpu GPU related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants