Refactor object restoration path #14821
Conversation
This will definitely be way nicer once done; I left a few comments on an initial skim.
// Issue a restore request if the object is on local disk. This is only relevant
// if the local filesystem storage type is being used.
auto object_url = get_spilled_object_url_(object_id);
if (!object_url.empty()) {
  restore_spilled_object_(object_id, object_url, nullptr);
}
Instead of restoring the object directly here, I think that we should issue a local pull through the pull manager so that the restoration is done under the pull manager's admission control.
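To illustrate the idea, here is a minimal Python sketch of admission-controlled restoration (a toy model only; ToyPullManager and its method names are made up for illustration and are not Ray's actual PullManager API): the restore is queued as a pull request and only starts once it is admitted under the memory quota, instead of being issued unconditionally.

from collections import deque

class ToyPullManager:
    """Toy model of admission control: pulls (including local restores)
    only start once enough memory quota is available."""

    def __init__(self, memory_quota_bytes):
        self.available = memory_quota_bytes
        self.queue = deque()

    def pull(self, object_id, size_bytes, restore_fn):
        # Queue the request instead of restoring immediately.
        self.queue.append((object_id, size_bytes, restore_fn))
        self._admit()

    def _admit(self):
        # Start queued restores only while quota remains.
        while self.queue and self.queue[0][1] <= self.available:
            object_id, size_bytes, restore_fn = self.queue.popleft()
            self.available -= size_bytes
            restore_fn(object_id)

    def on_pull_done(self, size_bytes):
        # Return quota and admit more queued requests.
        self.available += size_bytes
        self._admit()

if __name__ == "__main__":
    pm = ToyPullManager(memory_quota_bytes=100)
    pm.pull("obj_a", 60, lambda oid: print(f"restoring {oid}"))
    pm.pull("obj_b", 60, lambda oid: print(f"restoring {oid}"))  # queued until quota frees
    pm.on_pull_done(60)  # obj_a done -> obj_b admitted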
I'm going to punt this for later, since #14817 will mitigate the issue and this isn't a regression.
In the long run, direct streaming from disk means we shouldn't need admission control here.
}

// TODO(ekl) should we more directly mark the object as lost in this case?
Hmm, @stephanie-wang, what should we do in this case? (This is an existing problem, I guess.) It seems we will be rapidly re-polling the object during Tick().
Isn't this something that can happen before the spilled object URL / location is updated (e.g., the object location is not available yet)? If so, don't we just need to ignore this case?
Hmm yeah it seems like we'll want to mark the object as failed. Or we can let the owner handle it now that we have the OBOD.
Hmm, if we publish the URL update at the same time we remove the location, then I don't see how this could happen. The only case would be node failure... I guess the task dependency manager will currently mark it as failed and cancel the pull at a higher level, hopefully.
Right now, the owner handles this case by marking the object as failed if it detects the failure of the primary location. It may already work for this case too (since the spilled location should also be the primary location), but we need to check.
Ready for review.
This is a really nice refactoring!! I have some minor comments.
@@ -60,6 +60,8 @@ def parse_url_with_offset(url_with_offset: str) -> Tuple[str, int, int]:
     query_dict = urllib.parse.parse_qs(parsed_result.query)
     # Split by ? to remove the query from the url.
     base_url = parsed_result.geturl().split("?")[0]
+    if "offset" not in query_dict or "size" not in query_dict:
Isn't this supposed to never happen? Why don't we use an assert here?
This gives a better error in the case of a mis-formatted URL (including the original URL).
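For context, here is a standalone Python sketch of the check being discussed (simplified from the function in the diff above; the exact error message and the plain-tuple return type are illustrative): raising a descriptive error echoes the offending URL, which a bare assert would not.

import urllib.parse
from typing import Tuple

def parse_url_with_offset(url_with_offset: str) -> Tuple[str, int, int]:
    """Parse a URL of the form base_url?offset=N&size=M (simplified sketch)."""
    parsed_result = urllib.parse.urlparse(url_with_offset)
    query_dict = urllib.parse.parse_qs(parsed_result.query)
    # Split by ? to remove the query from the url.
    base_url = parsed_result.geturl().split("?")[0]
    if "offset" not in query_dict or "size" not in query_dict:
        # A descriptive error that echoes the offending URL, rather than a
        # bare `assert` that would only print the failing condition.
        raise ValueError(f"Failed to parse URL: {url_with_offset}")
    offset = int(query_dict["offset"][0])
    size = int(query_dict["size"][0])
    return base_url, offset, size

# Example: a malformed URL produces an error that includes the URL itself.
# parse_url_with_offset("file:///tmp/spill/obj1")
# -> ValueError: Failed to parse URL: file:///tmp/spill/obj1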
    std::function<void(const ray::Status &)> callback) {
  if (objects_pending_restore_.count(object_id) > 0) {
    // If the same object is restoring, we dedup here.
    return;
  }

  if (is_external_storage_type_fs_ && node_id != self_node_id_) {
Do we still need is_external_storage_type_fs_?
Ah, we still need it for a couple of things, like setting the node ID to Nil vs. not Nil, so we can't entirely remove it.
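Roughly, the distinction is this (a hedged Python sketch, not the actual raylet code; the function and constant names are made up for illustration): with local-filesystem spilling only the spilling node can read the file back, so the published location keeps that node's ID, whereas with shared or cloud storage the node ID can be Nil because any node can restore.

NIL_NODE_ID = "NIL"

def spilled_object_location(is_external_storage_type_fs: bool, self_node_id: str) -> str:
    """Illustrative only: which node ID to publish alongside a spilled-object URL."""
    if is_external_storage_type_fs:
        # Local-disk spilling: only this node can read the spill file back,
        # so pin the spilled location to this node.
        return self_node_id
    # Cloud / shared storage: any node can restore, so no node is pinned.
    return NIL_NODE_ID

assert spilled_object_location(True, "node-1") == "node-1"
assert spilled_object_location(False, "node-1") == NIL_NODE_ID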
LGTM overall, although I think that is_external_storage_type_fs_ isn't being properly set.
//python/ray/tune:test_convergence FAILED in 3 out of 3 in 287.4s
Test unrelated, merging.
Why are these changes needed?
This PR refactors the object restoration path to unify it with the normal object pull path, both eliminating the short polling and removing the need for separate restore RPCs. The new strategy is as follows:
Note that this is latency-optimal, since PushManager will get a notification once the object becomes local again after the restore and will immediately start the push.
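As a toy model of why this is latency-optimal (a Python sketch; ToyPushManager and its methods are illustrative, not Ray's actual classes), the push is driven by the object-became-local notification emitted when the restore finishes, so there is no polling gap between restore completion and the start of the push:

class ToyPushManager:
    """Toy model: pushes wait on the object-local notification, not a poll loop."""

    def __init__(self):
        self.pending_pushes = {}  # object_id -> destination node

    def queue_push(self, object_id, dest_node):
        self.pending_pushes[object_id] = dest_node

    def on_object_local(self, object_id):
        # Called as soon as the restore makes the object local again.
        dest = self.pending_pushes.pop(object_id, None)
        if dest is not None:
            print(f"pushing {object_id} to {dest} immediately")

def restore_then_push(object_id, push_manager):
    # ... read the spilled copy back into the object store ...
    push_manager.on_object_local(object_id)  # notification triggers the push

push_manager = ToyPushManager()
push_manager.queue_push("obj_a", "node-2")
restore_then_push("obj_a", push_manager)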
TODO: