-
Notifications
You must be signed in to change notification settings - Fork 576
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(object_store): refactor timeout and retry of object store interface #16231
Conversation
What's iterface? |
src/object_store/src/object/mod.rs
Outdated
|| async { | ||
let future = async { | ||
self.inner | ||
.upload(path, obj.clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the clone here necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parameter requirement must be a FnMut. How do we bypass clone?
src/object_store/src/object/mod.rs
Outdated
|| async { | ||
let future = async { | ||
self.inner | ||
.read(path, range.clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the clone here necessary?
src/object_store/src/object/mod.rs
Outdated
|| async { | ||
let future = async { | ||
self.inner | ||
.streaming_read(path, range.clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the clone here necessary?
…nto li0k/storage_object_retry
S3(#[source] BoxedError), | ||
#[error("s3 error: {inner}")] | ||
S3 { | ||
// TODO: remove this after switch s3 backend to opendal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0.0 Why? Any infomation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can move to opendal::Error after switching to opendal
src/object_store/src/object/mod.rs
Outdated
|| async { | ||
let future = async { | ||
self.inner | ||
.upload(path, obj.clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
…nto li0k/storage_object_retry
@@ -648,7 +648,7 @@ def section_storage(outer_panels): | |||
[50, 99], | |||
), | |||
panels.target( | |||
f"sum by(le, {COMPONENT_LABEL}) (rate({metric('state_store_sync_size_sum')}[$__rate_interval])) / sum by(le, {COMPONENT_LABEL}) (rate({metric('state_store_sync_size_count')}[$__rate_interval])) > 0", | |||
f"sum by(le, {COMPONENT_LABEL}) (rate({metric('state_store_sync_size_sum')}[$__rate_interval])) / sum by(le, {COMPONENT_LABEL}) (rate({metric('state_store_sync_size_count')}[$__rate_interval]))", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the conflict is not resolved correctly here in this file. > 0
and >= 0
are added to avoid showing NaN on division by zero.
@@ -82,7 +82,7 @@ def section_cluster_node(outer_panels): | |||
% (COMPONENT_LABEL, NODE_LABEL), | |||
), | |||
panels.target( | |||
f"sum(rate({metric('process_cpu_seconds_total')}[$__rate_interval])) by ({COMPONENT_LABEL}, {NODE_LABEL}) / avg({metric('process_cpu_core_num')}) by ({COMPONENT_LABEL}, {NODE_LABEL}) > 0", | |||
f"sum(rate({metric('process_cpu_seconds_total')}[$__rate_interval])) by ({COMPONENT_LABEL}, {NODE_LABEL}) / avg({metric('process_cpu_core_num')}) by ({COMPONENT_LABEL}, {NODE_LABEL})", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
src/object_store/src/object/mod.rs
Outdated
"streaming_upload write_bytes timeout", | ||
)) | ||
Err(ObjectError::timeout(format!( | ||
"{} timeout", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is not resolved.
src/object_store/src/object/mod.rs
Outdated
.await | ||
.unwrap_or_else(|_| { | ||
Err(ObjectError::timeout(format!( | ||
"{}_attempt_timeout_ms {:?} {}_retry_attempts {:?}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"retry attempts exhausted for {}. Please modify {}_attempt_timeout_ms (current=...) and {}_retry_attempts (current=...) under [storage.object_store.retry] in the config accordingly if needed."
…nto li0k/storage_object_retry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for the efforts!
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Prior to this PR, Object Store could only configure the total timeout for some operations and put retry behavior on backend execution. This design is not flexible. We configure a more conservative timeout for each operation by default, which may extend the system recovery time in some scenarios. Moreover, the total timeout is not intuitive for users.
Therefore, this PR refactors the retry and timeout logic of the Object Store
Checklist
./risedev check
(or alias,./risedev c
)Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.
[storage.object_store.retry]
refer to the following