Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix CHECK_FAIL when scheduling task with duplicate object requests #16063

Merged
merged 9 commits into from
May 26, 2021

Conversation

ericl
Copy link
Contributor

@ericl ericl commented May 25, 2021

Why are these changes needed?

This fixes #15990 by handling tasks that request the same object multiple times correctly. The particular edge case happens if we have multiple tasks requesting the same object multiple times.

I was not able to easily reproduce this in a Python unit test, so a C++ unit test will have to do for now. It is easily reproducible in Modin CI tests.

auto it = object_pull_requests_.find(object_id);
RAY_CHECK(it != object_pull_requests_.end());
if (it != object_pull_requests_.end()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the change here? I think we should try to keep the invariant that an object that was in the active pull requests should always be in this map too.

Copy link
Contributor Author

@ericl ericl May 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check is failing in another modin issue linked. I was not able to reproduce the reason why, but removing this add adding a defensive remove above also fixes the problem.

In general, I think we should prefer to write code in a more defensive style (correct even if invariants are violated), if possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Can you log a warning, though? This case really should not occur so if it does, it's likely a bug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a bit more digging and fixed the root cause bug (see other comments), which the defensive erase above was also handling.

@@ -335,6 +335,7 @@ std::vector<ObjectID> PullManager::CancelPull(uint64_t request_id) {
// here could be a no-op.
it->second.bundle_request_ids.erase(bundle_it->first);
if (it->second.bundle_request_ids.empty()) {
active_object_pull_requests_.erase(obj_id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defensive removal.

@@ -25,20 +25,29 @@ PullManager::PullManager(
uint64_t PullManager::Pull(const std::vector<rpc::ObjectReference> &object_ref_bundle,
bool is_worker_request,
std::vector<rpc::ObjectReference> *objects_to_locate) {
// To avoid edge cases dealing with duplicated object ids in the bundle,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also fixes an issue where we failed to issue a pull since num_objects_missing_sizes was calculated incorrectly with duplicate object ids.

if (!active_object_pull_requests_[obj_id].erase(request_it->first)) {
// If a bundle contains multiple duplicated object ids, the active pull request
// could've been already removed. Then do nothing.
auto it = active_object_pull_requests_.find(obj_id);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of the [] operator caused the errant CHECK failure, since it default-allocated an empty set as the value.

auto it = active_object_pull_requests_.find(obj_id);
if (it == active_object_pull_requests_.end() ||
!it->second.erase(request_it->first)) {
// The object is already deactivated, no action is required.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically this isn't needed, but leaving this in just in case.

@stephanie-wang
Copy link
Contributor

Looks like there is a build error on osx.

@ericl
Copy link
Contributor Author

ericl commented May 26, 2021

Windows tests is existing flaky; also passed previously.

@ericl ericl merged commit 2f4628f into ray-project:master May 26, 2021
@ericl
Copy link
Contributor Author

ericl commented May 26, 2021

This will be picked into 1.4

@rkooo567
Copy link
Contributor

cc @devin-petersohn (I also followed up with him in slack)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ray 1.3.0 CHECK fail during normal task scheduling
3 participants