[core] Make object directory robust to out-of-order updates #16314

stephanie-wang · 2021-06-08T17:50:49Z

Why are these changes needed?

The ownership-based object directory (OBOD) can lose updates if they arrive out of order. Under heavy load, and especially when there is thrashing, this can lead to memory leaks (a location that never gets deleted) and possibly hangs (the OBOD registers a location that doesn't actually exist). This PR fixes the issue by tracking all updates as a per-location count instead of adding or removing the location entry in a set.
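To make the difference concrete, here is a minimal sketch of count-based location tracking next to the set-based behavior. `LocationCounts` and the helper functions are hypothetical names for illustration, not types from this PR, and erasing entries whose count returns to zero is an assumption about how the cleanup might work.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-in for one object's location table (not Ray's actual type).
class LocationCounts {
 public:
  void Add(const std::string &node_id) { Bump(node_id, +1); }
  void Remove(const std::string &node_id) { Bump(node_id, -1); }

  // A node is reported as a location only while its net count is positive.
  bool Has(const std::string &node_id) const {
    auto it = counts_.find(node_id);
    return it != counts_.end() && it->second > 0;
  }

  // Number of tracked entries, including transiently negative ones.
  std::size_t NumEntries() const { return counts_.size(); }

 private:
  void Bump(const std::string &node_id, int delta) {
    int &count = counts_[node_id];
    count += delta;
    // Assumption: an entry whose adds and removes cancel out is dropped.
    if (count == 0) counts_.erase(node_id);
  }

  std::unordered_map<std::string, int> counts_;
};

// The Remove arrives before its matching Add: the count dips to -1, then
// returns to 0, so no location is ever reported and nothing is left behind.
bool CountVersionEndsEmpty() {
  LocationCounts locs;
  locs.Remove("node-A");
  bool visible_while_negative = locs.Has("node-A");
  locs.Add("node-A");
  return !visible_while_negative && !locs.Has("node-A") && locs.NumEntries() == 0;
}

// The same reordering with a set: the erase is a no-op and the late insert
// leaves a stale location behind.
bool SetVersionKeepsStaleLocation() {
  std::unordered_set<std::string> locs;
  locs.erase("node-A");
  locs.insert("node-A");
  return locs.count("node-A") == 1;
}
```

With the counter, a reordered add/remove pair cancels out regardless of arrival order, which is the property the set lacks.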

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

ericl · 2021-06-08T18:13:23Z

src/ray/core_worker/reference_count.cc

@@ -1012,8 +1017,11 @@ bool ReferenceCounter::RemoveObjectLocation(const ObjectID &object_id,
                  << " that doesn't exist in the reference table";
    return false;
  }
-  it->second.locations.erase(node_id);
-  PushToLocationSubscribers(it);
+  it->second.locations[node_id]--;


What if it goes negative due to out of order?

It's definitely okay for this to be negative. The assumption is that eventually you will receive the corresponding Add request and then this will go back to 0.

ericl

A possible complication of counting here is if we have duplicate add locs, we could leak entries (count > 0 persistently if there were two adds and one remove). Would this be a concern?

Also, what would happen if the count goes below zero?

stephanie-wang · 2021-06-08T18:52:26Z

> A possible complication of counting here is if we have duplicate add locs, we could leak entries (count > 0 persistently if there were two adds and one remove). Would this be a concern?
>
> Also, what would happen if the count goes below zero?

The assumption in this PR is that every add will have a corresponding remove (or a node failure). So a leak could definitely happen if there are duplicate adds, although it can happen today too just with message reordering.
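The duplicate-add case discussed here can be sketched as follows. This is only an illustration under the stated assumption that every add is paired with exactly one remove. `Counts`, `Apply`, and `HandleNodeFailure` are hypothetical names, and clearing the entry on node failure is a guess at what "or a node failure" implies.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical per-location counters for a single object.
using Counts = std::unordered_map<std::string, int>;

// Apply one update: +1 for an add, -1 for a remove.
void Apply(Counts &counts, const std::string &node_id, int delta) {
  counts[node_id] += delta;
}

// Assumption: a node failure discards whatever count that node had.
void HandleNodeFailure(Counts &counts, const std::string &node_id) {
  counts.erase(node_id);
}

// Two adds but only one remove: the count never returns to zero.
int CountAfterDuplicateAdd() {
  Counts counts;
  Apply(counts, "node-A", +1);  // add
  Apply(counts, "node-A", +1);  // duplicate add
  Apply(counts, "node-A", -1);  // the only remove
  return counts["node-A"];
}

// The leaked entry goes away only once the node itself is declared dead.
bool LeakClearedByNodeFailure() {
  Counts counts;
  Apply(counts, "node-A", +1);
  Apply(counts, "node-A", +1);
  Apply(counts, "node-A", -1);
  HandleNodeFailure(counts, "node-A");
  return counts.find("node-A") == counts.end();
}
```

The leftover count of 1 is the leak ericl describes: the directory keeps reporting `node-A` as a location until something resets the entry.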

I think we should merge this as is since it will be more robust under heavy loads, but in the future we could have a method to handle duplicates, e.g., resetting the directory. I'll add a note about this.

ericl · 2021-06-08T18:55:51Z

What about adding sequencer blocks around OBOD calls instead? I'm assuming add/remove for a particular loc only comes from the client at that loc, is that correct? The sequencer seems like a more robust way to guarantee this property without introducing edge cases like negative ref counts.
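For readers unfamiliar with the idea, a per-key sequencer can be sketched roughly as below: operations posted under the same key run one at a time, and the next one starts only after the previous one signals completion. `KeyedSequencer` is a hypothetical, single-threaded illustration and not an API taken from this PR. The `replies` vector stands in for RPC replies that arrive later.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical per-key sequencer (single-threaded sketch, no locking).
class KeyedSequencer {
 public:
  using Done = std::function<void()>;
  using Op = std::function<void(Done)>;

  // Queue `op` behind earlier ops for `key`; start it now if nothing is pending.
  void Post(const std::string &key, Op op) {
    auto &queue = pending_[key];
    queue.push_back(std::move(op));
    if (queue.size() == 1) RunFront(key);
  }

 private:
  void RunFront(const std::string &key) {
    // Copy the op so it stays alive even if the queue changes while it runs.
    Op op = pending_[key].front();
    op([this, key]() {
      auto it = pending_.find(key);
      it->second.pop_front();
      if (it->second.empty()) {
        pending_.erase(it);
      } else {
        RunFront(key);
      }
    });
  }

  std::unordered_map<std::string, std::deque<Op>> pending_;
};

// Posts an add and a remove for the same key and returns the order in which
// they were actually sent.
std::vector<std::string> SequencedOrder() {
  KeyedSequencer sequencer;
  std::vector<std::string> log;
  std::vector<KeyedSequencer::Done> replies;  // pending "RPC replies"

  auto make_op = [&](const std::string &name) {
    return [&log, &replies, name](KeyedSequencer::Done done) {
      log.push_back(name);                 // "send" the update
      replies.push_back(std::move(done));  // its reply arrives later
    };
  };

  sequencer.Post("node-A", make_op("add"));
  sequencer.Post("node-A", make_op("remove"));  // queued behind "add"
  assert(log.size() == 1);                      // only "add" has been sent

  // Deliver the reply for "add", which releases "remove".
  KeyedSequencer::Done first = std::move(replies[0]);
  first();
  return log;
}
```

Because the remove is not sent until the add has been acknowledged, the receiver cannot observe the pair out of order, so plain set semantics would suffice and no negative counts are needed.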


rkooo567 · 2021-06-09T07:16:11Z

(Btw, if you merge the latest master, the build issue will be gone)

rkooo567 · 2021-06-09T22:20:26Z

Oh, also, I think we should merge this regardless of the OBOD pubsub work, because it fixes a different path than the one OBOD pubsub handles. cc @clarkzinzow

stephanie-wang assigned ericl and rkooo567 Jun 8, 2021

stephanie-wang requested a review from clarkzinzow June 8, 2021 17:51

ericl reviewed Jun 8, 2021


ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 9, 2021

clarkzinzow self-assigned this Jun 14, 2021

Sequence ops

78dc958

stephanie-wang force-pushed the obod-out-of-order branch from 5d6091a to 78dc958 Compare June 17, 2021 17:03

stephanie-wang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 17, 2021

stephanie-wang added 3 commits June 17, 2021 10:15

id

ac57888

fix

47229c7

lint

f9cba2b

ericl approved these changes Jun 18, 2021


ericl merged commit 5eb51c8 into ray-project:master Jun 18, 2021
