[WIP][Core][GPU fraction][6/n] Migrate Node struct and callers from original class to base class pointer type#63508
Conversation
… wrapper methods Signed-off-by: dancingactor <s990346@gmail.com>
…ass for branch-by-abstraction migration strategy Signed-off-by: dancingactor <s990346@gmail.com>
…d related methods from scalar to per-instance view version Signed-off-by: dancingactor <s990346@gmail.com>
…al class to base class pointer type Signed-off-by: dancingactor <s990346@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a polymorphic resource management architecture by establishing a NodeResourcesBase abstract class and a NodeResourcesV2 implementation. The V2 implementation enables per-instance resource tracking (specifically for GPU fractional scheduling) using NodeResourceInstanceSet. The Node structure and ClusterResourceManager have been refactored to manage these resources via unique pointers. The review feedback identifies several critical issues, including unsafe static_cast operations that will cause undefined behavior when V2 nodes are enabled, object slicing in GetNodeResources resulting in data loss, and malformed JSON generation in the DebugString method. There is also a recommendation to replace manual type checking with a virtual Clone pattern to improve the robustness of polymorphic copying.
| } | ||
| local_view->total.Set(resource_id, total); | ||
| local_view->available.Set(resource_id, available); | ||
| static_cast<NodeResources *>(local_view)->SetAvailableResource(resource_id, available); |
There was a problem hiding this comment.
The static_cast<NodeResources *>(local_view) is unsafe and will lead to undefined behavior. NodeResourcesV2 and NodeResources are siblings inheriting from NodeResourcesBase; NodeResourcesV2 does not inherit from NodeResources. If enable_per_instance_resource_scheduling is true, local_view will be a NodeResourcesV2 instance, making this cast invalid. Additionally, NodeResourcesV2 does not have the scalar available field that NodeResources has, so even if the cast were valid, the logic would be incorrect for V2 nodes.
|
|
||
| resources->available -= resource_request.GetResourceSet(); | ||
| resources->available.RemoveNegative(); | ||
| static_cast<NodeResources *>(resources)->SubtractAvailable(resource_request.GetResourceSet()); |
There was a problem hiding this comment.
Similar to the issue in UpdateResourceCapacity, this static_cast is unsafe when V2 nodes are enabled. Furthermore, NodeResourcesV2 currently lacks a SubtractAvailable implementation. This means that when per-instance scheduling is enabled, resource subtraction (which is critical for the scheduler's local view) will either crash or fail to update the per-instance availability correctly.
| bool ClusterResourceManager::GetNodeResources(scheduling::NodeID node_id, | ||
| NodeResources *ret_resources) const { | ||
| NodeResourcesBase *ret_resources) const { | ||
| auto it = nodes_.find(node_id); | ||
| if (it != nodes_.end()) { | ||
| *ret_resources = it->second.GetLocalView(); |
There was a problem hiding this comment.
This implementation of GetNodeResources causes object slicing. The assignment *ret_resources = it->second.GetLocalView() only copies the members defined in NodeResourcesBase (like total and labels). The available resources, which are stored in the subclasses (NodeResources or NodeResourcesV2), are lost during this copy. This contradicts the comment in the header which suggests that V2 nodes should be cast/converted to V1. To fix this, you should implement a virtual Clone() or CopyTo() method in the base class, or change the API to return a reference/pointer to the existing polymorphic object.
| std::string NodeResourcesV2::DebugString() const { | ||
| std::stringstream buffer; | ||
| buffer << "{\"total\":" << total.DebugString(); | ||
| buffer << "}, \"available\": " << available.DebugString(); | ||
| buffer << "}, \"labels\":{"; | ||
| for (const auto &[key, value] : labels) { | ||
| buffer << "\"" << key << "\":\"" << value << "\","; | ||
| } | ||
| buffer << "}, \"is_draining\": " << is_draining; | ||
| buffer << ", \"draining_deadline_timestamp_ms\": " << draining_deadline_timestamp_ms | ||
| << "}"; | ||
| return buffer.str(); | ||
| } |
There was a problem hiding this comment.
The DebugString implementation for NodeResourcesV2 produces malformed JSON. The extra closing braces at the start of lines 309 and 310 will terminate the outer object prematurely. Also, ensure that boolean values and labels are formatted correctly for JSON (e.g., using true/false literals and handling trailing commas).
| std::string NodeResourcesV2::DebugString() const { | |
| std::stringstream buffer; | |
| buffer << "{\"total\":" << total.DebugString(); | |
| buffer << "}, \"available\": " << available.DebugString(); | |
| buffer << "}, \"labels\":{"; | |
| for (const auto &[key, value] : labels) { | |
| buffer << "\"" << key << "\":\"" << value << "\","; | |
| } | |
| buffer << "}, \"is_draining\": " << is_draining; | |
| buffer << ", \"draining_deadline_timestamp_ms\": " << draining_deadline_timestamp_ms | |
| << "}"; | |
| return buffer.str(); | |
| } | |
| std::string NodeResourcesV2::DebugString() const { | |
| std::stringstream buffer; | |
| buffer << "{\"total\":" << total.DebugString(); | |
| buffer << ", \"available\": " << available.DebugString(); | |
| buffer << ", \"labels\":{"; | |
| bool first = true; | |
| for (const auto &[key, value] : labels) { | |
| if (!first) buffer << ","; | |
| buffer << "\"" << key << "\":\"" << value << "\""; | |
| first = false; | |
| } | |
| buffer << "}, \"is_draining\": " << (is_draining ? "true" : "false"); | |
| buffer << ", \"draining_deadline_timestamp_ms\": " << draining_deadline_timestamp_ms | |
| << "}"; | |
| return buffer.str(); | |
| } |
| Node(const Node &other) : local_view_modified_ts_(other.local_view_modified_ts_) { | ||
| if (other.local_view_->IsV2()) { | ||
| local_view_ = std::make_unique<NodeResourcesV2>( | ||
| static_cast<const NodeResourcesV2 &>(*other.local_view_)); | ||
| } else { | ||
| local_view_ = std::make_unique<NodeResources>( | ||
| static_cast<const NodeResources &>(*other.local_view_)); | ||
| } | ||
| } | ||
|
|
||
| Node &operator=(const Node &other) { | ||
| if (this != &other) { | ||
| if (other.local_view_->IsV2()) { | ||
| local_view_ = std::make_unique<NodeResourcesV2>( | ||
| static_cast<const NodeResourcesV2 &>(*other.local_view_)); | ||
| } else { | ||
| local_view_ = std::make_unique<NodeResources>( | ||
| static_cast<const NodeResources &>(*other.local_view_)); | ||
| } | ||
| local_view_modified_ts_ = other.local_view_modified_ts_; | ||
| } | ||
| return *this; | ||
| } |
There was a problem hiding this comment.
36cd79f to
1a5bd77
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 1a5bd77. Configure here.
| resources->available -= resource_request.GetResourceSet(); | ||
| resources->available.RemoveNegative(); | ||
| static_cast<NodeResources *>(resources)->SubtractAvailable( | ||
| resource_request.GetResourceSet()); |
There was a problem hiding this comment.
Unsafe static_cast to NodeResources causes undefined behavior
High Severity
Multiple functions (SubtractNodeAvailableResources, UpdateResourceCapacity, DeleteResources, AddNodeAvailableResources) perform static_cast<NodeResources *> on a NodeResourcesBase* without checking IsV2(). When enable_per_instance_resource_scheduling is true, nodes are NodeResourcesV2 instances, and these unchecked downcasts cause undefined behavior. The cast will reinterpret the NodeResourcesV2::available (NodeResourceInstanceSet) as if it were a NodeResources::available (NodeResourceSet), corrupting memory or crashing.
Additional Locations (2)
Reviewed by Cursor Bugbot for commit 1a5bd77. Configure here.
| class NodeResources { | ||
| /// Temporary Abstract base class for node resource. | ||
| /// Provides the common interface shared by NodeResources and NodeResourcesV2. | ||
| class NodeResourcesBase { |
There was a problem hiding this comment.
PR description is blank boilerplate without explanation
Low Severity
To help reviewers, please ensure your PR includes:
- Title: A concise summary of the change
- Description:
- What problem does this solve?
- How does this PR solve it?
- Any relevant context for reviewers such as:
- Why is the problem important to solve?
- Why was this approach chosen over others?
See this list of PRs as examples for PRs that have gone above and beyond:
- [Core] Introduce local port service discovery #59613
- [Core] Improve Large-Scale Resource View Synchronization Through Sync Message Batching #57641
- Remove node observability information from hot path of core components #56474
- [core][rdt] Support out-of-order actors by extracting metadata when creating #59610
- [core] fix open leak for plasma store memory (shm/fallback) by workers #52622
Triggered by project rule: Bugbot Rules
Reviewed by Cursor Bugbot for commit 1a5bd77. Configure here.


Description
Related issues
Additional information