downsample: retry objstore related errors #7194

Merged · 3 commits · Mar 18, 2024

Changes from 2 commits
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -19,6 +19,7 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re
- [#7122](https://github.com/thanos-io/thanos/pull/7122) Store Gateway: Fix lazy expanded postings estimate base cardinality using posting group with remove keys.

### Added
+- [#7194](https://github.com/thanos-io/thanos/pull/7194) Downsample: retry objstore related errors
- [#7105](https://github.com/thanos-io/thanos/pull/7105) Rule: add flag `--query.enable-x-functions` to allow usage of extended promql functions (xrate, xincrease, xdelta) in loaded rules
- [#6867](https://github.com/thanos-io/thanos/pull/6867) Query UI: Tenant input box added to the Query UI, in order to be able to specify which tenant the query should use.
- [#7175](https://github.com/thanos-io/thanos/pull/7175): Query: Add `--query.mode=distributed` which enables the new distributed mode of the Thanos query engine.
5 changes: 3 additions & 2 deletions cmd/thanos/downsample.go
@@ -21,6 +21,7 @@ import (
"github.com/prometheus/client_golang/prometheus/promauto"
"github.com/prometheus/prometheus/tsdb"
"github.com/prometheus/prometheus/tsdb/chunkenc"
"github.com/thanos-io/thanos/pkg/compact"

"github.com/thanos-io/objstore"
"github.com/thanos-io/objstore/client"
@@ -358,7 +359,7 @@ func processDownsampling(

	err := block.Download(ctx, logger, bkt, m.ULID, bdir, objstore.WithFetchConcurrency(blockFilesConcurrency))
	if err != nil {
-		return errors.Wrapf(err, "download block %s", m.ULID)
+		return compact.NewRetryError(errors.Wrapf(err, "download block %s", m.ULID))
	}
	level.Info(logger).Log("msg", "downloaded block", "id", m.ULID, "duration", time.Since(begin), "duration_ms", time.Since(begin).Milliseconds())

@@ -419,7 +420,7 @@ func processDownsampling(

	err = block.Upload(ctx, logger, bkt, resdir, hashFunc)
	if err != nil {
-		return errors.Wrapf(err, "upload downsampled block %s", id)
+		return compact.NewRetryError(errors.Wrapf(err, "upload downsampled block %s", id))
**Contributor:** From scanning the code, I don't think this is consumed anywhere right now, right? I don't think this will lead to retries currently!

**Contributor (author):** As I understand it, `processDownsampling` returns the error to `downsampleBucket`, which returns it to `compactMainFn` in `cmd/thanos/compact.go`; there, the `if compact.IsRetryError(err) {` check should trigger the retry.

**Contributor (author):** assuming we run the compactor with the `--wait` flag.
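For readers following the thread, here is a condensed sketch of the path just described. It is an illustration under assumptions: the loop shape and the `compactMainFn` wiring are paraphrased from the comments above, not copied from the real `cmd/thanos/compact.go`.

```go
package compactsketch

import (
	"github.com/go-kit/log"
	"github.com/go-kit/log/level"

	"github.com/thanos-io/thanos/pkg/compact"
)

// runWithWait is an illustrative stand-in for the compactor main loop
// under --wait: retriable errors are logged and the cycle re-runs,
// while any other error halts the process.
func runWithWait(logger log.Logger, compactMainFn func() error) error {
	for {
		err := compactMainFn() // eventually calls downsampleBucket -> processDownsampling
		if err == nil {
			// ... sleep until the next --wait interval, then loop ...
			continue
		}
		if compact.IsRetryError(err) {
			// Errors wrapped with compact.NewRetryError land here: log them
			// and let the next cycle retry instead of crashing the compactor.
			level.Error(logger).Log("msg", "retriable error", "err", err)
			continue
		}
		return err
	}
}
```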

**Contributor:** Maybe it's better to retry right here instead of returning an error. That way the compactor would not have to go through the whole cycle again and downsample the block from the beginning.

**Contributor (author):** I wanted to make it less intrusive: retry the way the compaction process is retried. I could retry the upload/download calls directly, but then it would be good to apply the same logic to compaction, retrying its upload/download calls too, and perhaps expose a parameter such as `--objstore.file-retries`, the maximum number of retries for fetching/uploading block files from object storage. What do you say, @fpetkovski?
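As an aside, an inline-retry variant like the one floated above might look roughly as follows. This is a hypothetical sketch, not part of the PR: the `uploadWithRetries` helper, its `maxRetries` parameter (which could be fed from the proposed `--objstore.file-retries` flag), and the fixed one-second backoff are all assumptions.

```go
package downsamplesketch

import (
	"context"
	"time"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
	"github.com/pkg/errors"
	"github.com/thanos-io/objstore"
	"github.com/thanos-io/thanos/pkg/block"
	"github.com/thanos-io/thanos/pkg/block/metadata"
)

// uploadWithRetries is a hypothetical helper: it retries block.Upload in
// place instead of bubbling a RetryError up to the main loop, so a
// transient objstore failure does not force re-downsampling the block.
func uploadWithRetries(ctx context.Context, logger log.Logger, bkt objstore.Bucket,
	resdir string, hashFunc metadata.HashFunc, maxRetries int) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = block.Upload(ctx, logger, bkt, resdir, hashFunc); err == nil {
			return nil
		}
		level.Warn(logger).Log("msg", "upload failed, retrying", "attempt", attempt, "err", err)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second): // naive fixed backoff, illustrative only
		}
	}
	return errors.Wrapf(err, "upload downsampled block after %d retries", maxRetries)
}
```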

**Contributor:** You are right, my bad!

**Contributor:** @xBazilio sounds good, we can keep things consistent for now. Nothing is set in stone anyway, so we can improve if needed.

	}

	level.Info(logger).Log("msg", "uploaded block", "id", id, "duration", time.Since(begin), "duration_ms", time.Since(begin).Milliseconds())
4 changes: 4 additions & 0 deletions pkg/compact/compact.go
@@ -967,6 +967,10 @@ type RetryError struct {
	err error
}

+func NewRetryError(err error) error {
+	return retry(err)
+}
+
func retry(err error) error {
	if IsHaltError(err) {
		return err
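To close the loop, here is a small usage sketch of how the newly exported constructor pairs with the existing `IsRetryError` check. `doDownload`, `downloadStep`, and `run` are made-up names for illustration; only `compact.NewRetryError` and `compact.IsRetryError` come from the PR and the existing package.

```go
package retrysketch

import (
	"github.com/pkg/errors"

	"github.com/thanos-io/thanos/pkg/compact"
)

// doDownload stands in for any objstore operation; hypothetical.
func doDownload() error { return nil }

// downloadStep marks objstore failures as retriable, mirroring what the
// downsample changes above do for block.Download and block.Upload.
func downloadStep() error {
	if err := doDownload(); err != nil {
		return compact.NewRetryError(errors.Wrap(err, "download block"))
	}
	return nil
}

// run shows the consuming side: a retriable error is swallowed so the
// caller can schedule another attempt, while anything else halts.
func run() error {
	if err := downloadStep(); err != nil {
		if compact.IsRetryError(err) {
			return nil // retry on the next cycle
		}
		return err
	}
	return nil
}
```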