[WIP][Core][GPU fraction][6/n] Migrate Node struct and callers from original class to base class pointer type #63508

Open
dancingactor wants to merge 4 commits into ray-project:master from dancingactor:newNodeResource_third
@dancingactor

Thank you for contributing to Ray! 🚀
Please review the Ray Contribution Guide before opening a pull request.

⚠️ Remove these instructions before submitting your PR.

💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete.

Description

Briefly describe what this PR accomplishes and why it's needed.

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

… wrapper methods

Signed-off-by: dancingactor <s990346@gmail.com>
…ass for branch-by-abstraction migration strategy

Signed-off-by: dancingactor <s990346@gmail.com>
…d related methods from scalar to per-instance view version

Signed-off-by: dancingactor <s990346@gmail.com>
@dancingactor dancingactor requested a review from a team as a code owner May 19, 2026 16:11
…al class to base class pointer type

Signed-off-by: dancingactor <s990346@gmail.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a polymorphic resource management architecture by establishing a NodeResourcesBase abstract class and a NodeResourcesV2 implementation. The V2 implementation enables per-instance resource tracking (specifically for GPU fractional scheduling) using NodeResourceInstanceSet. The Node structure and ClusterResourceManager have been refactored to manage these resources via unique pointers. The review feedback identifies several critical issues, including unsafe static_cast operations that will cause undefined behavior when V2 nodes are enabled, object slicing in GetNodeResources resulting in data loss, and malformed JSON generation in the DebugString method. There is also a recommendation to replace manual type checking with a virtual Clone pattern to improve the robustness of polymorphic copying.

}
local_view->total.Set(resource_id, total);
local_view->available.Set(resource_id, available);
static_cast<NodeResources *>(local_view)->SetAvailableResource(resource_id, available);
critical

The static_cast<NodeResources *>(local_view) is unsafe and will lead to undefined behavior. NodeResourcesV2 and NodeResources are siblings inheriting from NodeResourcesBase; NodeResourcesV2 does not inherit from NodeResources. If enable_per_instance_resource_scheduling is true, local_view will be a NodeResourcesV2 instance, making this cast invalid. Additionally, NodeResourcesV2 does not have the scalar available field that NodeResources has, so even if the cast were valid, the logic would be incorrect for V2 nodes.
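The sibling-cast problem described above can be illustrated with a minimal sketch. The types below (`ResourcesBase`, `ScalarResources`, `PerInstanceResources`) and the helpers `MakeView` and `TrySetScalarAvailable` are hypothetical stand-ins, not Ray's actual classes, and the sketch assumes RTTI is enabled. It shows a checked downcast that fails cleanly for the sibling type instead of invoking undefined behavior.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the classes discussed in the review.
struct ResourcesBase {
  virtual ~ResourcesBase() = default;
  std::map<std::string, double> total;  // shared by both variants
};

// Plays the role of NodeResources: one scalar value per resource.
struct ScalarResources : ResourcesBase {
  std::map<std::string, double> available;
};

// Plays the role of NodeResourcesV2: one value per resource instance.
struct PerInstanceResources : ResourcesBase {
  std::map<std::string, std::vector<double>> available;
};

// Mirrors a config switch such as enable_per_instance_resource_scheduling.
std::unique_ptr<ResourcesBase> MakeView(bool per_instance) {
  if (per_instance) {
    return std::make_unique<PerInstanceResources>();
  }
  return std::make_unique<ScalarResources>();
}

// Returns true only if `view` really is a ScalarResources and was updated.
bool TrySetScalarAvailable(ResourcesBase *view, const std::string &id, double value) {
  assert(view != nullptr);
  // dynamic_cast yields nullptr for a sibling type; static_cast would not check.
  auto *scalar = dynamic_cast<ScalarResources *>(view);
  if (scalar == nullptr) {
    return false;
  }
  scalar->available[id] = value;
  return true;
}
```

A checked cast only makes the failure visible; callers still need a code path that updates the per-instance variant, which is why a virtual method on the base class is usually the cleaner fix.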


- resources->available -= resource_request.GetResourceSet();
- resources->available.RemoveNegative();
+ static_cast<NodeResources *>(resources)->SubtractAvailable(resource_request.GetResourceSet());
critical

Similar to the issue in UpdateResourceCapacity, this static_cast is unsafe when V2 nodes are enabled. Furthermore, NodeResourcesV2 currently lacks a SubtractAvailable implementation. This means that when per-instance scheduling is enabled, resource subtraction (which is critical for the scheduler's local view) will either crash or fail to update the per-instance availability correctly.
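One way to remove the cast entirely is to declare the subtraction on the base class and let each variant implement it. The sketch below uses hypothetical simplified types, not Ray's real ones, and the per-instance policy (draining instances in order) is purely an assumption for illustration; the PR does not specify how V2 should subtract.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical base class: callers subtract through the interface, no casts.
struct ResourcesBase {
  virtual ~ResourcesBase() = default;
  virtual void SubtractAvailable(const std::string &id, double amount) = 0;
  virtual double TotalAvailable(const std::string &id) const = 0;
};

struct ScalarResources : ResourcesBase {
  std::map<std::string, double> available;

  void SubtractAvailable(const std::string &id, double amount) override {
    assert(amount >= 0.0);
    auto it = available.find(id);
    if (it == available.end()) return;
    // Clamp at zero, in the spirit of the RemoveNegative() call in the old code.
    it->second = std::max(0.0, it->second - amount);
  }

  double TotalAvailable(const std::string &id) const override {
    auto it = available.find(id);
    return it == available.end() ? 0.0 : it->second;
  }
};

struct PerInstanceResources : ResourcesBase {
  std::map<std::string, std::vector<double>> available;

  void SubtractAvailable(const std::string &id, double amount) override {
    assert(amount >= 0.0);
    auto it = available.find(id);
    if (it == available.end()) return;
    // Assumed policy: take from instances in order until `amount` is covered.
    for (double &instance : it->second) {
      const double taken = std::min(instance, amount);
      instance -= taken;
      amount -= taken;
      if (amount <= 0.0) break;
    }
  }

  double TotalAvailable(const std::string &id) const override {
    auto it = available.find(id);
    if (it == available.end()) return 0.0;
    double sum = 0.0;
    for (double instance : it->second) sum += instance;
    return sum;
  }
};
```

With this shape, a function like SubtractNodeAvailableResources could call `resources->SubtractAvailable(...)` on the base pointer regardless of which variant is active.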

Comment on lines 151 to 155
  bool ClusterResourceManager::GetNodeResources(scheduling::NodeID node_id,
-                                               NodeResources *ret_resources) const {
+                                               NodeResourcesBase *ret_resources) const {
    auto it = nodes_.find(node_id);
    if (it != nodes_.end()) {
      *ret_resources = it->second.GetLocalView();
high

This implementation of GetNodeResources causes object slicing. The assignment *ret_resources = it->second.GetLocalView() only copies the members defined in NodeResourcesBase (like total and labels). The available resources, which are stored in the subclasses (NodeResources or NodeResourcesV2), are lost during this copy. This contradicts the comment in the header which suggests that V2 nodes should be cast/converted to V1. To fix this, you should implement a virtual Clone() or CopyTo() method in the base class, or change the API to return a reference/pointer to the existing polymorphic object.
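The slicing effect can be reproduced in isolation. In the sketch below, `ResourcesBase`, `ScalarResources`, `CopyThroughBase`, and `Borrow` are hypothetical names chosen for illustration. Assigning through base references copies only the base members, while handing out a pointer to the existing object (one of the alternatives mentioned above) keeps the derived state reachable.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical base: holds `total`, which is the only state operator= copies.
struct ResourcesBase {
  virtual ~ResourcesBase() = default;
  virtual double Available(const std::string &id) const = 0;
  std::map<std::string, double> total;
};

// Hypothetical derived type: `available` lives here and is lost when sliced.
struct ScalarResources : ResourcesBase {
  std::map<std::string, double> available;

  double Available(const std::string &id) const override {
    auto it = available.find(id);
    return it == available.end() ? 0.0 : it->second;
  }
};

// Mirrors the reviewed pattern: this invokes ResourcesBase::operator=, which
// copies `total` but never touches the derived `available` member.
void CopyThroughBase(const ResourcesBase &src, ResourcesBase *dst) {
  assert(dst != nullptr);
  *dst = src;
}

// Alternative: expose the existing polymorphic object instead of copying it.
const ResourcesBase *Borrow(const ResourcesBase &src) { return &src; }
```

Returning a borrowed pointer ties the caller to the lifetime of the stored node, so a `Clone()`-style copy may be preferable when the caller needs an independent snapshot.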

Comment on lines +306 to +318
std::string NodeResourcesV2::DebugString() const {
  std::stringstream buffer;
  buffer << "{\"total\":" << total.DebugString();
  buffer << "}, \"available\": " << available.DebugString();
  buffer << "}, \"labels\":{";
  for (const auto &[key, value] : labels) {
    buffer << "\"" << key << "\":\"" << value << "\",";
  }
  buffer << "}, \"is_draining\": " << is_draining;
  buffer << ", \"draining_deadline_timestamp_ms\": " << draining_deadline_timestamp_ms
         << "}";
  return buffer.str();
}
medium

The DebugString implementation for NodeResourcesV2 produces malformed JSON. The extra closing braces at the start of lines 309 and 310 will terminate the outer object prematurely. Also, ensure that boolean values and labels are formatted correctly for JSON (e.g., using true/false literals and handling trailing commas).

Suggested change
std::string NodeResourcesV2::DebugString() const {
  std::stringstream buffer;
  buffer << "{\"total\":" << total.DebugString();
  buffer << ", \"available\": " << available.DebugString();
  buffer << ", \"labels\":{";
  bool first = true;
  for (const auto &[key, value] : labels) {
    if (!first) buffer << ",";
    buffer << "\"" << key << "\":\"" << value << "\"";
    first = false;
  }
  buffer << "}, \"is_draining\": " << (is_draining ? "true" : "false");
  buffer << ", \"draining_deadline_timestamp_ms\": " << draining_deadline_timestamp_ms
         << "}";
  return buffer.str();
}
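The two formatting points in the suggestion (no trailing comma, `true`/`false` literals) can be checked in isolation. The helpers below, `LabelsToJson` and `DrainingToJson`, are hypothetical and not part of the PR. They write keys and values verbatim, so they assume labels contain no quotes or backslashes; a real implementation would need to escape them.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: renders labels as a JSON object without a trailing comma.
// No escaping is performed; keys and values are assumed to be plain text.
std::string LabelsToJson(const std::map<std::string, std::string> &labels) {
  std::ostringstream out;
  out << "{";
  const char *separator = "";  // empty before the first entry, "," afterwards
  for (const auto &[key, value] : labels) {
    out << separator << "\"" << key << "\":\"" << value << "\"";
    separator = ",";
  }
  out << "}";
  return out.str();
}

// Hypothetical helper: std::boolalpha makes the stream print true/false
// instead of 1/0, which matches JSON boolean literals.
std::string DrainingToJson(bool is_draining) {
  std::ostringstream out;
  out << std::boolalpha << "{\"is_draining\":" << is_draining << "}";
  return out.str();
}
```

The separator idiom avoids the `first` flag used in the suggestion; both approaches produce the same output.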

Comment on lines +496 to +518
Node(const Node &other) : local_view_modified_ts_(other.local_view_modified_ts_) {
  if (other.local_view_->IsV2()) {
    local_view_ = std::make_unique<NodeResourcesV2>(
        static_cast<const NodeResourcesV2 &>(*other.local_view_));
  } else {
    local_view_ = std::make_unique<NodeResources>(
        static_cast<const NodeResources &>(*other.local_view_));
  }
}

Node &operator=(const Node &other) {
  if (this != &other) {
    if (other.local_view_->IsV2()) {
      local_view_ = std::make_unique<NodeResourcesV2>(
          static_cast<const NodeResourcesV2 &>(*other.local_view_));
    } else {
      local_view_ = std::make_unique<NodeResources>(
          static_cast<const NodeResources &>(*other.local_view_));
    }
    local_view_modified_ts_ = other.local_view_modified_ts_;
  }
  return *this;
}
medium

The manual type checking (IsV2()) and explicit casting in the Node copy constructor and assignment operator are fragile and violate the Open/Closed Principle. It is highly recommended to add a virtual Clone() method to NodeResourcesBase to handle polymorphic copying cleanly.
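A minimal sketch of the recommended `Clone()` pattern follows. All types here are simplified, hypothetical stand-ins for `NodeResourcesBase`, `NodeResources`, `NodeResourcesV2`, and `Node`; only the copy logic is modeled. Each derived class copies itself, so `Node` no longer needs to know which concrete types exist.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical base class with a virtual copy hook.
struct ResourcesBase {
  virtual ~ResourcesBase() = default;
  virtual std::unique_ptr<ResourcesBase> Clone() const = 0;
  virtual bool IsV2() const = 0;
};

struct ScalarResources : ResourcesBase {
  std::map<std::string, double> available;
  std::unique_ptr<ResourcesBase> Clone() const override {
    return std::make_unique<ScalarResources>(*this);  // copies the derived state
  }
  bool IsV2() const override { return false; }
};

struct PerInstanceResources : ResourcesBase {
  std::map<std::string, std::vector<double>> available;
  std::unique_ptr<ResourcesBase> Clone() const override {
    return std::make_unique<PerInstanceResources>(*this);
  }
  bool IsV2() const override { return true; }
};

class Node {
 public:
  explicit Node(std::unique_ptr<ResourcesBase> view) : local_view_(std::move(view)) {
    assert(local_view_ != nullptr);
  }

  // Copying delegates to Clone(); no IsV2() branch or static_cast is needed.
  Node(const Node &other)
      : local_view_(other.local_view_->Clone()),
        local_view_modified_ts_(other.local_view_modified_ts_) {}

  Node &operator=(const Node &other) {
    if (this != &other) {
      local_view_ = other.local_view_->Clone();
      local_view_modified_ts_ = other.local_view_modified_ts_;
    }
    return *this;
  }

  const ResourcesBase &GetLocalView() const { return *local_view_; }

 private:
  std::unique_ptr<ResourcesBase> local_view_;
  int64_t local_view_modified_ts_ = 0;
};
```

Adding a third resource variant under this design would only require implementing `Clone()` in the new class, leaving `Node` unchanged.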

@dancingactor dancingactor force-pushed the newNodeResource_third branch 2 times, most recently from 36cd79f to 1a5bd77 Compare May 19, 2026 16:13
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 1a5bd77. Configure here.

- resources->available -= resource_request.GetResourceSet();
- resources->available.RemoveNegative();
+ static_cast<NodeResources *>(resources)->SubtractAvailable(
+     resource_request.GetResourceSet());
Unsafe static_cast to NodeResources causes undefined behavior

High Severity

Multiple functions (SubtractNodeAvailableResources, UpdateResourceCapacity, DeleteResources, AddNodeAvailableResources) perform static_cast<NodeResources *> on a NodeResourcesBase* without checking IsV2(). When enable_per_instance_resource_scheduling is true, nodes are NodeResourcesV2 instances, and these unchecked downcasts cause undefined behavior. The cast will reinterpret the NodeResourcesV2::available (NodeResourceInstanceSet) as if it were a NodeResources::available (NodeResourceSet), corrupting memory or crashing.

- class NodeResources {
+ /// Temporary Abstract base class for node resource.
+ /// Provides the common interface shared by NodeResources and NodeResourcesV2.
+ class NodeResourcesBase {
PR description is blank boilerplate without explanation

Low Severity

⚠️ This PR needs a clearer title and/or description.

To help reviewers, please ensure your PR includes:

  • Title: A concise summary of the change
  • Description:
    • What problem does this solve?
    • How does this PR solve it?
    • Any relevant context for reviewers such as:
      • Why is the problem important to solve?
      • Why was this approach chosen over others?

See this list of PRs as examples for PRs that have gone above and beyond:

Triggered by project rule: Bugbot Rules
