-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix CHECK_FAIL when scheduling task with duplicate object requests #16063
Conversation
auto it = object_pull_requests_.find(object_id); | ||
RAY_CHECK(it != object_pull_requests_.end()); | ||
if (it != object_pull_requests_.end()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the change here? I think we should try to keep the invariant that an object that was in the active pull requests should always be in this map too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check is failing in another modin issue linked. I was not able to reproduce the reason why, but removing this add adding a defensive remove above also fixes the problem.
In general, I think we should prefer to write code in a more defensive style (correct even if invariants are violated), if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Can you log a warning, though? This case really should not occur so if it does, it's likely a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a bit more digging and fixed the root cause bug (see other comments), which the defensive erase above was also handling.
@@ -335,6 +335,7 @@ std::vector<ObjectID> PullManager::CancelPull(uint64_t request_id) { | |||
// here could be a no-op. | |||
it->second.bundle_request_ids.erase(bundle_it->first); | |||
if (it->second.bundle_request_ids.empty()) { | |||
active_object_pull_requests_.erase(obj_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defensive removal.
@@ -25,20 +25,29 @@ PullManager::PullManager( | |||
uint64_t PullManager::Pull(const std::vector<rpc::ObjectReference> &object_ref_bundle, | |||
bool is_worker_request, | |||
std::vector<rpc::ObjectReference> *objects_to_locate) { | |||
// To avoid edge cases dealing with duplicated object ids in the bundle, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also fixes an issue where we failed to issue a pull since num_objects_missing_sizes was calculated incorrectly with duplicate object ids.
if (!active_object_pull_requests_[obj_id].erase(request_it->first)) { | ||
// If a bundle contains multiple duplicated object ids, the active pull request | ||
// could've been already removed. Then do nothing. | ||
auto it = active_object_pull_requests_.find(obj_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of the []
operator caused the errant CHECK failure, since it default-allocated an empty set as the value.
auto it = active_object_pull_requests_.find(obj_id); | ||
if (it == active_object_pull_requests_.end() || | ||
!it->second.erase(request_it->first)) { | ||
// The object is already deactivated, no action is required. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically this isn't needed, but leaving this in just in case.
Looks like there is a build error on osx. |
Windows tests is existing flaky; also passed previously. |
This will be picked into 1.4 |
cc @devin-petersohn (I also followed up with him in slack) |
Why are these changes needed?
This fixes #15990 by handling tasks that request the same object multiple times correctly. The particular edge case happens if we have multiple tasks requesting the same object multiple times.
I was not able to easily reproduce this in a Python unit test, so a C++ unit test will have to do for now. It is easily reproducible in Modin CI tests.