Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Make object directory robust to out-of-order updates #16314

Merged
merged 4 commits into from
Jun 18, 2021

Conversation

stephanie-wang
Copy link
Contributor

Why are these changes needed?

The ownership-based object directory (OBOD) can lose updates if they arrive out of order. Under heavy load and especially if there's thrashing, this can lead to memory leaks (location that never gets deleted) and possibly hanging (the OBOD registers a location that doesn't actually exist). This fixes the issue by collecting all the updates as a per-location count instead of adding/removing the location entry from a set.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -1012,8 +1017,11 @@ bool ReferenceCounter::RemoveObjectLocation(const ObjectID &object_id,
<< " that doesn't exist in the reference table";
return false;
}
it->second.locations.erase(node_id);
PushToLocationSubscribers(it);
it->second.locations[node_id]--;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if it goes negative due to out of order?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's definitely okay for this to be negative. The assumption is that eventually you will receive the corresponding Add request and then this will go back to 0.

Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A possible complication of counting here is if we have duplicate add locs, we could leak entries (count > 0 persistently if there were two adds and one remove). Would this be a concern?

Also, what would happen if the count goes below zero?

@stephanie-wang
Copy link
Contributor Author

A possible complication of counting here is if we have duplicate add locs, we could leak entries (count > 0 persistently if there were two adds and one remove). Would this be a concern?

Also, what would happen if the count goes below zero?

The assumption in this PR is that every add will have a corresponding remove (or a node failure). So a leak could definitely happen if there are duplicate adds, although it can happen today too just with message reordering.

I think we should merge this as is since it will be more robust under heavy loads, but in the future we could have a method to handle duplicates, e.g., resetting the directory. I'll add a note about this.

@ericl
Copy link
Contributor

ericl commented Jun 8, 2021 via email

@rkooo567
Copy link
Contributor

rkooo567 commented Jun 9, 2021

(Btw, if you merge the latest master, the build issue will be gone)

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 9, 2021
@rkooo567
Copy link
Contributor

rkooo567 commented Jun 9, 2021

Oh also I think we should merge this regardless of the obod pubsub because it fixes the different path that obod pubsub handles cc @clarkzinzow

@clarkzinzow clarkzinzow self-assigned this Jun 14, 2021
@stephanie-wang stephanie-wang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 17, 2021
@ericl ericl merged commit 5eb51c8 into ray-project:master Jun 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants