Refactor object restoration path #14821
Conversation
This will definitely be way nicer once done; I left a few comments on an initial skim.
// Issue a restore request if the object is on local disk. This is only relevant
// if the local filesystem storage type is being used.
auto object_url = get_spilled_object_url_(object_id);
if (!object_url.empty()) {
  restore_spilled_object_(object_id, object_url, nullptr);
}
Instead of restoring the object directly here, I think that we should issue a local pull through the pull manager so that the restoration is done under the pull manager's admission control.
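To illustrate the idea, here is a minimal Python sketch of admission-controlled restoration (a toy model only; ToyPullManager and its method names are made up for illustration and are not Ray's actual PullManager API): the restore is queued as a pull request and only starts once it is admitted under the memory quota, instead of being issued unconditionally.

from collections import deque

class ToyPullManager:
    """Toy model of admission control: pulls (including local restores)
    only start once enough memory quota is available."""

    def __init__(self, memory_quota_bytes):
        self.available = memory_quota_bytes
        self.queue = deque()

    def pull(self, object_id, size_bytes, restore_fn):
        # Queue the request instead of restoring immediately.
        self.queue.append((object_id, size_bytes, restore_fn))
        self._admit()

    def _admit(self):
        # Start queued restores only while quota remains.
        while self.queue and self.queue[0][1] <= self.available:
            object_id, size_bytes, restore_fn = self.queue.popleft()
            self.available -= size_bytes
            restore_fn(object_id)

    def on_pull_done(self, size_bytes):
        # Return quota and admit more queued requests.
        self.available += size_bytes
        self._admit()

if __name__ == "__main__":
    pm = ToyPullManager(memory_quota_bytes=100)
    pm.pull("obj_a", 60, lambda oid: print(f"restoring {oid}"))
    pm.pull("obj_b", 60, lambda oid: print(f"restoring {oid}"))  # queued until quota frees
    pm.on_pull_done(60)  # obj_a done -> obj_b admitted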
I'm going to punt this for later, since #14817 will mitigate the issue and this isn't a regression.
In the long run, direct streaming from disk means we shouldn't need admission control here.
}

// TODO(ekl) should we more directly mark the object as lost in this case?
Hmm, @stephanie-wang, what should we do in this case? (This is an existing problem, I guess.) It seems we will be rapidly re-polling the object during Tick().
Isn't this something that can happen before the spilled object URL / location is updated (e.g., the object location is not available yet)? If so, don't we just need to ignore this case?
Hmm yeah it seems like we'll want to mark the object as failed. Or we can let the owner handle it now that we have the OBOD.
Hmm, if we publish the URL update at the same time we remove the location, then I don't see how this could happen. The only case would be node failure... I guess the task dependency manager will currently mark it as failed and cancel the pull at a higher level, hopefully.
Right now, the owner handles this case by marking the object as failed if it detects the failure of the primary location. It may already work for this case too (since the spilled location should also be the primary location), but we need to check.
Ready for review.
This is a really nice refactoring!! I have some minor comments.
@@ -60,6 +60,8 @@ def parse_url_with_offset(url_with_offset: str) -> Tuple[str, int, int]:
     query_dict = urllib.parse.parse_qs(parsed_result.query)
     # Split by ? to remove the query from the url.
     base_url = parsed_result.geturl().split("?")[0]
+    if "offset" not in query_dict or "size" not in query_dict:
Isn't this supposed to never happen? Why don't we use an assert here?
This gives a better error in the case of a mis-formatted URL (including the original URL).
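For context, here is a standalone Python sketch of the check being discussed (simplified from the function in the diff above; the exact error message and the plain-tuple return type are illustrative): raising a descriptive error echoes the offending URL, which a bare assert would not.

import urllib.parse
from typing import Tuple

def parse_url_with_offset(url_with_offset: str) -> Tuple[str, int, int]:
    """Parse a URL of the form base_url?offset=N&size=M (simplified sketch)."""
    parsed_result = urllib.parse.urlparse(url_with_offset)
    query_dict = urllib.parse.parse_qs(parsed_result.query)
    # Split by ? to remove the query from the url.
    base_url = parsed_result.geturl().split("?")[0]
    if "offset" not in query_dict or "size" not in query_dict:
        # A descriptive error that echoes the offending URL, rather than a
        # bare `assert` that would only print the failing condition.
        raise ValueError(f"Failed to parse URL: {url_with_offset}")
    offset = int(query_dict["offset"][0])
    size = int(query_dict["size"][0])
    return base_url, offset, size

# Example: a malformed URL produces an error that includes the URL itself.
# parse_url_with_offset("file:///tmp/spill/obj1")
# -> ValueError: Failed to parse URL: file:///tmp/spill/obj1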
    std::function<void(const ray::Status &)> callback) {
  if (objects_pending_restore_.count(object_id) > 0) {
    // If the same object is restoring, we dedup here.
    return;
  }

  if (is_external_storage_type_fs_ && node_id != self_node_id_) {
Do we still need is_external_storage_type_fs_?
Ah, we still need it for a couple of things, like setting the node ID to Nil vs. not Nil, so we can't entirely remove it.
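Roughly, the distinction is this (a hedged Python sketch, not the actual raylet code; the function and constant names are made up for illustration): with local-filesystem spilling only the spilling node can read the file back, so the published location keeps that node's ID, whereas with shared or cloud storage the node ID can be Nil because any node can restore.

NIL_NODE_ID = "NIL"

def spilled_object_location(is_external_storage_type_fs: bool, self_node_id: str) -> str:
    """Illustrative only: which node ID to publish alongside a spilled-object URL."""
    if is_external_storage_type_fs:
        # Local-disk spilling: only this node can read the spill file back,
        # so pin the spilled location to this node.
        return self_node_id
    # Cloud / shared storage: any node can restore, so no node is pinned.
    return NIL_NODE_ID

assert spilled_object_location(True, "node-1") == "node-1"
assert spilled_object_location(False, "node-1") == NIL_NODE_ID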
LGTM overall, although I think that is_external_storage_type_fs_ isn't being properly set.
//python/ray/tune:test_convergence FAILED in 3 out of 3 in 287.4s
Test unrelated, merging.
Why are these changes needed?
This PR refactors the object restoration path to unify it with the normal object pull path, both eliminating the short polling and removing the need for separate restore RPCs. The new strategy is as follows:
Note that this is latency-optimal, since PushManager will get a notification once the object becomes local again after the restore and will immediately start the push.
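As a toy model of why this is latency-optimal (a Python sketch; ToyPushManager and its methods are illustrative, not Ray's actual classes), the push is driven by the object-became-local notification emitted when the restore finishes, so there is no polling gap between restore completion and the start of the push:

class ToyPushManager:
    """Toy model: pushes wait on the object-local notification, not a poll loop."""

    def __init__(self):
        self.pending_pushes = {}  # object_id -> destination node

    def queue_push(self, object_id, dest_node):
        self.pending_pushes[object_id] = dest_node

    def on_object_local(self, object_id):
        # Called as soon as the restore makes the object local again.
        dest = self.pending_pushes.pop(object_id, None)
        if dest is not None:
            print(f"pushing {object_id} to {dest} immediately")

def restore_then_push(object_id, push_manager):
    # ... read the spilled copy back into the object store ...
    push_manager.on_object_local(object_id)  # notification triggers the push

push_manager = ToyPushManager()
push_manager.queue_push("obj_a", "node-2")
restore_then_push("obj_a", push_manager)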
TODO: