Improve Objstore Client Level Failures and Retries #3907

Closed
bwplotka opened this issue Mar 10, 2021 · 6 comments

@bwplotka
Member

A good retry mechanism is critical. For example, the compactor does all the compaction work, and if the upload is aborted and the object storage client library decides not to retry, we crash the container and start from scratch. Since we never know at what moment we crashed, we have to download (potentially the same) blocks again and compact them again, which is inefficient if only the upload failed. The hashing added by @GiedriusS partially mitigates this by ensuring downloaded block consistency, but it is only a small mitigation.

We have had many attempts and many PRs adding bucket-level retries, e.g. #3894, #3756.

Adding retries at the object storage level? This has been discussed many times already. The main problem is that we already have (and should have) retries in the individual client implementations: the client knows more about its own side, so it can retry efficiently. Adding retry logic to all parent layers is unnecessary and should therefore be avoided.

In the end, we trust each client to have good retry logic, ideally even per HTTP multi-part request, AND with good visibility/metrics. Each client should also understand backpressure signals like rate-limit and try-later statuses, etc.

But... all of this is not the case in practice, unfortunately.

Solution?

I can see two options:

  1. Should we try to contribute this to at least the main clients, e.g. GCS/S3, or at least double-check what they already do?

  2. Create a "just in case" SINGLE bucket wrapper that applies some retries within a timeout, to handle the inconsistent retry logic across clients (see the rough sketch below)? 🤔
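
For concreteness, here is a minimal sketch of what option 2 could look like. The cut-down `bucket` interface and all names are illustrative only; the real `objstore.Bucket` interface has many more methods (Get, Iter, Delete, ...) that would need the same wrapping, and a real implementation would also have to decide which errors are worth retrying at all:

```go
package objstoreretry

import (
	"bytes"
	"context"
	"io"
	"time"
)

// bucket is a hypothetical, cut-down stand-in for objstore.Bucket used
// only for this sketch.
type bucket interface {
	Upload(ctx context.Context, name string, r io.Reader) error
}

// retryBucket wraps another bucket and retries Upload with exponential
// backoff, up to maxRetries attempts or until the context is cancelled.
type retryBucket struct {
	bucket
	maxRetries  int
	initialWait time.Duration
}

func (b retryBucket) Upload(ctx context.Context, name string, r io.Reader) error {
	// The reader can only be consumed once, so buffer it and restart each
	// attempt from the beginning. For large blocks a seekable source would
	// be preferable; buffering keeps the sketch short.
	buf, err := io.ReadAll(r)
	if err != nil {
		return err
	}

	wait := b.initialWait
	var lastErr error
	for attempt := 0; attempt < b.maxRetries; attempt++ {
		if lastErr = b.bucket.Upload(ctx, name, bytes.NewReader(buf)); lastErr == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(wait):
			wait *= 2 // Exponential backoff between attempts.
		}
	}
	return lastErr
}
```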

Thoughts welcome 🤗

@Biswajitghosh98
Contributor

I had a discussion with @prmsrswt and we came to a conclusion along the lines of your option 2.
I'm trying to refactor the exponential backoff code so that it can be added to runutil and then used in the way you suggested, with user-provided flag values for the maximum retry time and other parameters. @bwplotka
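
Something along these lines, purely as a sketch; the name, signature, and package are hypothetical, not an existing runutil API. The maximum total retry duration would come from a CLI flag as described above:

```go
package retry

import (
	"context"
	"time"
)

// Do calls f until it succeeds, the context is cancelled, or maxDuration
// has elapsed, doubling the wait between attempts (exponential backoff).
func Do(ctx context.Context, initialWait, maxDuration time.Duration, f func() error) error {
	deadline := time.Now().Add(maxDuration)
	wait := initialWait
	var err error
	for {
		if err = f(); err == nil {
			return nil
		}
		if time.Now().Add(wait).After(deadline) {
			return err // Out of time; surface the last error.
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(wait):
			wait *= 2
		}
	}
}
```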

@GiedriusS
Member

I even attempted to do this once upon a time: #2785 😄

@bwplotka
Member Author

Thanks @Biswajitghosh98, and yes, we could use @GiedriusS's implementation directly.

I am not yet fully convinced we want to go down this path, though. It is very inefficient to retry at this level, especially if the client already retries. Do we really want this? (:

@pracucci
Contributor

> It is very inefficient to retry at this level, especially if the client already retries.

I agree with this. I'm not convinced about doing retries in a wrapper. We would end up retrying 4xx errors, and retrying on top of already-retried requests for clients that already support it.
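
(For illustration, a generic wrapper would at minimum need something like the classification below so it does not blindly retry client errors. The helper is hypothetical; real clients typically surface typed errors rather than raw HTTP status codes, which is part of why a generic wrapper is hard to get right.)

```go
package retry

import "net/http"

// isRetryable is an illustrative sketch of status-code-based retry
// classification for a hypothetical generic wrapper.
func isRetryable(statusCode int) bool {
	switch {
	case statusCode == http.StatusTooManyRequests: // 429: back off and retry.
		return true
	case statusCode >= 500: // Server-side errors are usually transient.
		return true
	default: // Other 4xx errors indicate a caller bug; retrying will not help.
		return false
	}
}
```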

What if we start with a better analysis, like:

  1. Among the object store clients we support, which support retries and which do not?
  2. For the ones not supporting it, are there open issues in the respective projects? If not, can we open them? (e.g. MinIO developers have been very supportive so far)

@stale

stale bot commented Jun 2, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Jun 2, 2021
@stale

stale bot commented Jun 16, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Jun 16, 2021