Fix CHECK_FAIL when scheduling task with duplicate object requests #16063

ericl · 2021-05-25T19:08:27Z

Why are these changes needed?

This fixes #15990 by correctly handling tasks that request the same object multiple times. The edge case is triggered when multiple tasks each request the same object more than once.

I was not able to easily reproduce this in a Python unit test, so a C++ unit test will have to do for now. It is easily reproducible in Modin CI tests.

stephanie-wang · 2021-05-25T20:39:26Z

src/ray/object_manager/pull_manager.cc

  auto it = object_pull_requests_.find(object_id);
-  RAY_CHECK(it != object_pull_requests_.end());
+  if (it != object_pull_requests_.end()) {


Why the change here? I think we should try to keep the invariant that an object that was in the active pull requests should always be in this map too.

The check is failing in another linked Modin issue. I was not able to determine the reason why, but removing this check and adding a defensive remove above also fixes the problem.

In general, I think we should prefer to write code in a more defensive style (correct even if invariants are violated), if possible.

I see. Can you log a warning, though? This case really should not occur so if it does, it's likely a bug.

I did a bit more digging and fixed the root cause bug (see other comments), which the defensive erase above was also handling.
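To make the "defensive, but log a warning" suggestion from this thread concrete, here is a minimal sketch. It assumes standard-library containers and invents the names `ObjectId`, `PullRequestState`, and `RecordRetry` purely for illustration; none of them are taken from pull_manager.cc, and the real code would presumably use Ray's own logging macros rather than `std::cerr`.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for the real types; only the control flow matters here.
using ObjectId = std::string;
struct PullRequestState {
  int num_retries = 0;
};

// Returns true if the retry was recorded. A missing entry should never happen
// while the invariant holds, so it is reported as a warning; the process keeps
// running instead of aborting on a failed check.
bool RecordRetry(std::unordered_map<ObjectId, PullRequestState> &requests,
                 const ObjectId &object_id) {
  auto it = requests.find(object_id);
  if (it == requests.end()) {
    std::cerr << "WARNING: no pull request found for active object " << object_id
              << "; this likely indicates a bug\n";
    return false;
  }
  it->second.num_retries++;
  return true;
}
```

The trade-off is the one discussed above: the warning keeps a violated invariant visible in the logs without turning it into a crash.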

stephanie-wang · 2021-05-25T20:39:44Z

src/ray/object_manager/pull_manager.cc

@@ -335,6 +335,7 @@ std::vector<ObjectID> PullManager::CancelPull(uint64_t request_id) {
      // here could be a no-op.
      it->second.bundle_request_ids.erase(bundle_it->first);
      if (it->second.bundle_request_ids.empty()) {
+        active_object_pull_requests_.erase(obj_id);


Why the change?

Defensive removal.
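As a rough illustration of what a defensive removal looks like, the sketch below keeps two maps in step. The `Tracker` struct, its member names, and `Cancel` are hypothetical simplifications, not the actual `PullManager` members, and `std::unordered_map` is assumed in place of whatever container the real code uses.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>

using ObjectId = std::string;  // hypothetical stand-in for ObjectID

struct Tracker {
  // Object -> ids of the bundle requests that still want it.
  std::unordered_map<ObjectId, std::unordered_set<uint64_t>> bundle_request_ids;
  // Object -> ids of the bundle requests that are actively pulling it.
  std::unordered_map<ObjectId, std::unordered_set<uint64_t>> active;

  // Drop one request's interest in an object. Once nobody wants the object,
  // erase it from `active` as well, even if it is expected to be gone already.
  // Calling erase() with a key that is absent is a harmless no-op.
  void Cancel(const ObjectId &obj_id, uint64_t request_id) {
    auto it = bundle_request_ids.find(obj_id);
    if (it == bundle_request_ids.end()) {
      return;  // nothing is tracked for this object
    }
    it->second.erase(request_id);
    if (it->second.empty()) {
      active.erase(obj_id);  // defensive removal
      bundle_request_ids.erase(it);
    }
  }
};
```

Without the defensive erase, a stale entry in `active` would outlive the last request for the object, which is the kind of inconsistency a later check could trip over.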

ericl · 2021-05-26T06:40:46Z

src/ray/object_manager/pull_manager.cc

@@ -25,20 +25,29 @@ PullManager::PullManager(
 uint64_t PullManager::Pull(const std::vector<rpc::ObjectReference> &object_ref_bundle,
                           bool is_worker_request,
                           std::vector<rpc::ObjectReference> *objects_to_locate) {
+  // To avoid edge cases dealing with duplicated object ids in the bundle,


This also fixes an issue where we failed to issue a pull since num_objects_missing_sizes was calculated incorrectly with duplicate object ids.
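The deduplication step could look roughly like the following. This is only a sketch: `ObjectId` is a hypothetical string alias and `Deduplicate` is an illustrative helper, whereas the real `Pull` operates on `rpc::ObjectReference` values whose handling is not shown in this thread.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

using ObjectId = std::string;  // hypothetical stand-in for the real id type

// Return the bundle with repeated ids removed, keeping first-seen order.
std::vector<ObjectId> Deduplicate(const std::vector<ObjectId> &bundle) {
  std::unordered_set<ObjectId> seen;
  std::vector<ObjectId> unique;
  for (const auto &id : bundle) {
    // insert().second is true only when the id was not in the set before.
    if (seen.insert(id).second) {
      unique.push_back(id);
    }
  }
  return unique;
}
```

Any per-bundle counter, such as a count of objects whose sizes are still unknown, can then be initialized from the size of the deduplicated list, so that it matches the number of distinct objects rather than the number of entries in the request.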

ericl · 2021-05-26T06:43:41Z

src/ray/object_manager/pull_manager.cc

-    if (!active_object_pull_requests_[obj_id].erase(request_it->first)) {
-      // If a bundle contains multiple duplicated object ids, the active pull request
-      // could've been already removed. Then do nothing.
+    auto it = active_object_pull_requests_.find(obj_id);


The use of the [] operator caused the errant CHECK failure, since it default-allocated an empty set as the value.
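The pitfall is easy to show in isolation. The sketch below uses `std::unordered_map` as a stand-in (the container type used in pull_manager.cc is not visible in this thread), and the helper names `EraseWithBrackets` and `EraseWithFind` are made up for the comparison.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>

using ObjectId = std::string;  // hypothetical stand-in for ObjectID
using ActiveMap = std::unordered_map<ObjectId, std::unordered_set<uint64_t>>;

// Pitfall: operator[] inserts a default-constructed (empty) set when the key
// is absent, so a lookup that was meant to be read-only grows the map.
bool EraseWithBrackets(ActiveMap &active, const ObjectId &obj_id,
                       uint64_t request_id) {
  return active[obj_id].erase(request_id) > 0;
}

// find() never inserts, so an absent key leaves the map untouched.
bool EraseWithFind(ActiveMap &active, const ObjectId &obj_id,
                   uint64_t request_id) {
  auto it = active.find(obj_id);
  if (it == active.end()) {
    return false;
  }
  return it->second.erase(request_id) > 0;
}
```

After `EraseWithBrackets` runs on an empty map, the map contains the key with an empty set as its value, so any code that treats "key present" as "object is actively being pulled" is misled. `EraseWithFind` leaves the map empty.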

ericl · 2021-05-26T06:45:27Z

src/ray/object_manager/pull_manager.cc

+    auto it = active_object_pull_requests_.find(obj_id);
+    if (it == active_object_pull_requests_.end() ||
+        !it->second.erase(request_it->first)) {
+      // The object is already deactivated, no action is required.


Technically this isn't needed, but leaving this in just in case.

stephanie-wang · 2021-05-26T17:59:56Z

Looks like there is a build error on osx.

ericl · 2021-05-26T22:13:11Z

The Windows test failure is an existing flake; those tests also passed previously.

ericl · 2021-05-26T22:42:07Z

This will be picked into 1.4

rkooo567 · 2021-05-27T00:24:55Z

cc @devin-petersohn (I also followed up with him in Slack)

fix it

45db24a

ericl force-pushed the fix-dup-dup branch from b7c5e2c to 45db24a on May 25, 2021 19:08

ericl assigned stephanie-wang and rkooo567 May 25, 2021

stephanie-wang reviewed May 25, 2021

ericl added 2 commits May 25, 2021 23:03

fix the bug properly

3280642

fix it for good

7518436

ericl force-pushed the fix-dup-dup branch from b1a522c to 7518436 on May 26, 2021 06:40

lint

ecaac21

ericl commented May 26, 2021

comment

5e7eaaa

ericl commented May 26, 2021

rkooo567 approved these changes May 26, 2021

ericl commented May 26, 2021

Merge remote-tracking branch 'upstream/master' into fix-dup-dup

268c46f

stephanie-wang approved these changes May 26, 2021

ericl added 3 commits May 26, 2021 11:14

sleep

c9c39ff

remove check since it requires mutex

a5822de

remove sleep

6389fbe

ericl merged commit 2f4628f into ray-project:master May 26, 2021

DmitriGekhtman pushed a commit that referenced this pull request May 27, 2021

Fix CHECK_FAIL when scheduling task with duplicate object requests (#…

f82091d

…16063)
