refactor(object_store): refactor timeout and retry of object store interface #16231

Merged
56 commits merged into main from li0k/storage_object_retry on May 8, 2024

Conversation

Contributor

@Li0k Li0k commented Apr 10, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Prior to this PR, the object store could only configure a total timeout for some operations, and retry behavior was left to each backend implementation. This design is inflexible: the default timeout for each operation has to be conservative, which may extend system recovery time in some scenarios. Moreover, a total timeout is not intuitive for users.

Therefore, this PR refactors the retry and timeout logic of the object store:

  • Introduce separate timeout and retry configurations for each operation
  • Remove the retry logic from the object store backends
  • Handle retry and timeout uniformly in MonitoredObjectStore
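
To illustrate the idea of bounding every attempt with its own timeout and retrying in one wrapper layer, here is a minimal sketch. It is not the PR's code: the PR works on async futures, while this stand-in uses threads and channels from the standard library so that it stays self-contained. The names `AttemptError`, `run_with_timeout`, and `retry_with_timeout` are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical error type: an attempt either timed out or failed on its own.
#[derive(Debug, PartialEq)]
enum AttemptError {
    Timeout,
    Failed(String),
}

/// Run one attempt on a helper thread and wait at most `timeout` for its result.
/// The helper thread is not cancelled on timeout; it is simply abandoned.
fn run_with_timeout<T: Send + 'static>(
    timeout: Duration,
    op: impl FnOnce() -> Result<T, String> + Send + 'static,
) -> Result<T, AttemptError> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore that error.
        let _ = tx.send(op());
    });
    match rx.recv_timeout(timeout) {
        Ok(Ok(value)) => Ok(value),
        Ok(Err(e)) => Err(AttemptError::Failed(e)),
        // A dropped sender (e.g. a panicking operation) is treated like a timeout
        // to keep the sketch short.
        Err(_) => Err(AttemptError::Timeout),
    }
}

/// Retry wrapper: `make_op` builds a fresh operation for every attempt, and each
/// attempt gets the same per-attempt timeout. At least one attempt is made.
fn retry_with_timeout<T, F>(
    attempts: usize,
    timeout: Duration,
    mut make_op: impl FnMut() -> F,
) -> Result<T, AttemptError>
where
    T: Send + 'static,
    F: FnOnce() -> Result<T, String> + Send + 'static,
{
    let mut last = AttemptError::Timeout;
    for _ in 0..attempts.max(1) {
        match run_with_timeout(timeout, make_op()) {
            Ok(value) => return Ok(value),
            Err(e) => last = e,
        }
    }
    Err(last)
}
```

With this shape, a slow first attempt costs at most one per-attempt timeout before the next attempt starts, instead of consuming a single large total timeout.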

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

  1. Introduce new timeout and retry configurations for all object store interfaces. The new configurations are located under [storage.object_store.retry], for example:
[storage.object_store.retry]
req_backoff_interval_ms = 1000
req_backoff_max_delay_ms = 10000
req_backoff_factor = 2
upload_attempt_timeout_ms = 8000
upload_retry_attempts = 3
streaming_upload_attempt_timeout_ms = 5000
streaming_upload_retry_attempts = 3
  2. Deprecate the ambiguous timeout configurations:
[storage.object_store]
object_store_streaming_read_timeout_ms = 480000
object_store_streaming_upload_timeout_ms = 480000
object_store_upload_timeout_ms = 480000
object_store_read_timeout_ms = 480000

[storage.object_store.s3]
object_store_req_retry_interval_ms = 20
object_store_req_retry_max_delay_ms = 10000
object_store_req_retry_max_attempts = 8
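
As a rough way to reason about the [storage.object_store.retry] values listed above, the sketch below computes a backoff schedule and a worst-case duration for an upload. It rests on two assumptions that the release note does not confirm: the delay starts at req_backoff_interval_ms and is multiplied by req_backoff_factor after each retry (capped at req_backoff_max_delay_ms, no jitter), and upload_retry_attempts counts retries after the initial attempt. The `RetryConfig` struct and both functions are hypothetical.

```rust
use std::time::Duration;

/// Hypothetical mirror of a few `[storage.object_store.retry]` keys.
struct RetryConfig {
    req_backoff_interval_ms: u64,
    req_backoff_max_delay_ms: u64,
    req_backoff_factor: u64,
    upload_attempt_timeout_ms: u64,
    upload_retry_attempts: usize,
}

/// Delay before each retry under a capped exponential schedule
/// (assumption: no jitter, first delay equals the base interval).
fn backoff_delays(cfg: &RetryConfig, retries: usize) -> Vec<Duration> {
    let mut delay = cfg.req_backoff_interval_ms;
    let mut delays = Vec::with_capacity(retries);
    for _ in 0..retries {
        delays.push(Duration::from_millis(delay.min(cfg.req_backoff_max_delay_ms)));
        delay = delay.saturating_mul(cfg.req_backoff_factor);
    }
    delays
}

/// Upper bound on the wall time of one upload call
/// (assumption: `upload_retry_attempts` retries follow the initial attempt,
/// and every attempt runs into its timeout).
fn worst_case_upload(cfg: &RetryConfig) -> Duration {
    let retries = cfg.upload_retry_attempts;
    let attempts = retries as u32 + 1;
    let waiting: Duration = backoff_delays(cfg, retries).iter().sum();
    Duration::from_millis(cfg.upload_attempt_timeout_ms) * attempts + waiting
}
```

Under these assumptions, the example values give delays of 1 s, 2 s, and 4 s between attempts and a worst case of 39 s for an upload, compared with the deprecated 480 s total timeout.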

@TennyZhuang
Contributor

What's iterface?

|| async {
let future = async {
self.inner
.upload(path, obj.clone())
Collaborator

Is the clone here necessary?

Contributor

+1

Contributor Author

The parameter requirement must be a FnMut. How do we bypass clone?
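
The point about FnMut can be shown with a small sketch. A retry helper may call its closure several times, so the closure cannot move an owned argument into the call; it has to hand each attempt its own copy. The sketch is synchronous and uses only the standard library; `retry`, `upload`, and `upload_with_retry` are hypothetical names, and `Arc<[u8]>` stands in for a reference-counted buffer whose clone only bumps a counter rather than copying the bytes.

```rust
use std::sync::Arc;

/// Hypothetical synchronous stand-in for an async retry helper.
/// `op` is FnMut because it may be invoked once per attempt.
fn retry<T, E>(max_attempts: usize, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(_) => attempt += 1,
        }
    }
}

/// Hypothetical upload that takes ownership of the payload and fails on its first call.
fn upload(payload: Arc<[u8]>, calls: &mut usize) -> Result<usize, String> {
    *calls += 1;
    if *calls < 2 {
        Err("transient failure".to_string())
    } else {
        Ok(payload.len())
    }
}

/// Returns the upload result and the number of attempts that were made.
fn upload_with_retry(obj: Arc<[u8]>) -> (Result<usize, String>, usize) {
    let mut calls = 0;
    // Writing `upload(obj, ...)` here would move `obj` out of the closure on the
    // first call, making it FnOnce. Cloning per attempt keeps the closure FnMut.
    let result = retry(3, || upload(obj.clone(), &mut calls));
    (result, calls)
}
```

Whether the clone is cheap depends on the argument type: for a reference-counted buffer it is a counter increment, while for a plain `Vec<u8>` it would copy the data on every attempt.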

|| async {
let future = async {
self.inner
.read(path, range.clone())
Collaborator

Is the clone here necessary?

|| async {
let future = async {
self.inner
.streaming_read(path, range.clone())
Collaborator

Is the clone here necessary?

src/object_store/src/object/s3.rs (outdated, resolved)
src/object_store/src/object/s3.rs (outdated, resolved)
src/object_store/src/object/s3.rs (outdated, resolved)
src/object_store/src/object/error.rs (outdated, resolved)
@Li0k Li0k changed the title refactor(object_store): refactor timeout and retry of iterface refactor(object_store): refactor timeout and retry of object store interface Apr 11, 2024
@Li0k Li0k marked this pull request as ready for review April 12, 2024 11:03
src/common/src/config.rs (outdated, resolved)
S3(#[source] BoxedError),
#[error("s3 error: {inner}")]
S3 {
// TODO: remove this after switch s3 backend to opendal
Contributor

0.0 Why? Any information?

Contributor Author

We can move to opendal::Error after switching the S3 backend to opendal.

|| async {
let future = async {
self.inner
.upload(path, obj.clone())
Contributor

+1

src/object_store/src/object/mod.rs (resolved)
src/object_store/src/object/mod.rs (resolved)
@Li0k Li0k requested a review from wenym1 April 17, 2024 08:21
@@ -648,7 +648,7 @@ def section_storage(outer_panels):
[50, 99],
),
panels.target(
f"sum by(le, {COMPONENT_LABEL}) (rate({metric('state_store_sync_size_sum')}[$__rate_interval])) / sum by(le, {COMPONENT_LABEL}) (rate({metric('state_store_sync_size_count')}[$__rate_interval])) > 0",
f"sum by(le, {COMPONENT_LABEL}) (rate({metric('state_store_sync_size_sum')}[$__rate_interval])) / sum by(le, {COMPONENT_LABEL}) (rate({metric('state_store_sync_size_count')}[$__rate_interval]))",
Collaborator

I think the merge conflict in this file was not resolved correctly. The > 0 and >= 0 filters were added to avoid showing NaN on division by zero.

@@ -82,7 +82,7 @@ def section_cluster_node(outer_panels):
% (COMPONENT_LABEL, NODE_LABEL),
),
panels.target(
f"sum(rate({metric('process_cpu_seconds_total')}[$__rate_interval])) by ({COMPONENT_LABEL}, {NODE_LABEL}) / avg({metric('process_cpu_core_num')}) by ({COMPONENT_LABEL}, {NODE_LABEL}) > 0",
f"sum(rate({metric('process_cpu_seconds_total')}[$__rate_interval])) by ({COMPONENT_LABEL}, {NODE_LABEL}) / avg({metric('process_cpu_core_num')}) by ({COMPONENT_LABEL}, {NODE_LABEL})",
Collaborator

ditto

"streaming_upload write_bytes timeout",
))
Err(ObjectError::timeout(format!(
"{} timeout",
Collaborator

This comment is not resolved.

.await
.unwrap_or_else(|_| {
Err(ObjectError::timeout(format!(
"{}_attempt_timeout_ms {:?} {}_retry_attempts {:?}",
Collaborator

"retry attempts exhausted for {}. Please modify {}_attempt_timeout_ms (current=...) and {}_retry_attempts (current=...) under [storage.object_store.retry] in the config accordingly if needed."

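The wording suggested above could be produced by a small helper like the one below. This is only a sketch: `retry_exhausted_message` and its parameters are hypothetical and not taken from the PR, and the `current=...` placeholders are filled with the configured values.

```rust
/// Hypothetical helper that builds the suggested "retry attempts exhausted" message
/// for an operation such as "upload" or "streaming_upload".
fn retry_exhausted_message(op: &str, attempt_timeout_ms: u64, retry_attempts: usize) -> String {
    format!(
        "retry attempts exhausted for {op}. Please modify {op}_attempt_timeout_ms \
         (current={attempt_timeout_ms}) and {op}_retry_attempts (current={retry_attempts}) \
         under [storage.object_store.retry] in the config accordingly if needed."
    )
}
```

Naming the exact config keys in the error lets an operator go straight from a log line to the setting that needs tuning.
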
Collaborator

@hzxa21 hzxa21 left a comment

LGTM. Thanks for the efforts!

@Li0k Li0k added this pull request to the merge queue May 8, 2024
@Li0k Li0k added the need-cherry-pick-release-1.9 label May 8, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 8, 2024
@Li0k Li0k added this pull request to the merge queue May 8, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 8, 2024
@Li0k Li0k added this pull request to the merge queue May 8, 2024
Merged via the queue into main with commit 2929fb8 May 8, 2024
32 of 33 checks passed
@Li0k Li0k deleted the li0k/storage_object_retry branch May 8, 2024 18:00
Li0k added a commit that referenced this pull request May 9, 2024
Labels: block-release-v1.9, ci/run-e2e-single-node-tests, need-cherry-pick-release-1.9 (open a cherry-pick PR to branch release-1.9 after the current PR is merged), type/refactor, user-facing-changes (contains changes that are visible to users)

5 participants