
[Compactor] Detect the incomplete uploaded blocks and exclude them from compaction #6328

Open
namco1992 opened this issue May 2, 2023 · 5 comments

@namco1992 commented May 2, 2023

Is your proposal related to a problem?

As issue #5978 mentions, the compactor currently hits a halt error when it tries to compact an incomplete block (e.g. only meta.json or the index is uploaded, but not the chunks folder). The safeguard at https://github.com/thanos-io/thanos/blob/main/pkg/block/block.go#L156 doesn't really guarantee that a partial upload implies a missing or corrupted meta.json. #5859 suggests the same.

Although issue #5978 is closed, the solution there is to delete the bad blocks manually, which is toil and error-prone.

Describe the solution you'd like

We would like to propose that the compactor extend its detection of partially uploaded blocks. The assumption that "the presence of meta.json means a complete upload" has been proven not to hold in all cases. Simply checking meta.json doesn't guarantee the block is intact; a more comprehensive approach is needed to cover cases where the chunks are missing or only partially uploaded.

It can be done when collecting the block health stats:

thanos/pkg/compact/compact.go

Lines 1035 to 1059 in a1ec4d5

var stats block.HealthStats
if err := tracing.DoInSpanWithErr(ctx, "compaction_block_health_stats", func(ctx context.Context) (e error) {
	stats, e = block.GatherIndexHealthStats(cg.logger, filepath.Join(bdir, block.IndexFilename), meta.MinTime, meta.MaxTime)
	return e
}, opentracing.Tags{"block.id": meta.ULID}); err != nil {
	return errors.Wrapf(err, "gather index issues for block %s", bdir)
}
if err := stats.CriticalErr(); err != nil {
	return halt(errors.Wrapf(err, "block with not healthy index found %s; Compaction level %v; Labels: %v", bdir, meta.Compaction.Level, meta.Thanos.Labels))
}
if err := stats.OutOfOrderChunksErr(); err != nil {
	return outOfOrderChunkError(errors.Wrapf(err, "blocks with out-of-order chunks are dropped from compaction: %s", bdir), meta.ULID)
}
if err := stats.Issue347OutsideChunksErr(); err != nil {
	return issue347Error(errors.Wrapf(err, "invalid, but reparable block %s", bdir), meta.ULID)
}
if err := stats.OutOfOrderLabelsErr(); !cg.acceptMalformedIndex && err != nil {
	return errors.Wrapf(err,
		"block id %s, try running with --debug.accept-malformed-index", meta.ULID)
}
return nil

Or in the BestEffortCleanAbortedPartialUploads:

cleanPartialMarked := func() error {
	cleanMtx.Lock()
	defer cleanMtx.Unlock()

	if err := sy.SyncMetas(ctx); err != nil {
		return errors.Wrap(err, "syncing metas")
	}

	compact.BestEffortCleanAbortedPartialUploads(ctx, logger, sy.Partial(), bkt, compactMetrics.partialUploadDeleteAttempts, compactMetrics.blocksCleaned, compactMetrics.blockCleanupFailures)
	if err := blocksCleaner.DeleteMarkedBlocks(ctx); err != nil {
		return errors.Wrap(err, "cleaning marked blocks")
	}
	compactMetrics.cleanups.Inc()

	return nil
}
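
For illustration, here is a minimal sketch of what such a chunks-presence check could look like. This is not existing Thanos code: hasChunks is a hypothetical helper, and it only assumes that objstore's Iter lists the objects under a given prefix (import paths may differ between Thanos versions).

// A minimal sketch, not existing Thanos code: verify that a block's chunks/
// prefix contains at least one object before admitting it to compaction.
package blockcheck

import (
	"context"
	"path"

	"github.com/oklog/ulid"
	"github.com/thanos-io/objstore"
)

// hasChunks reports whether at least one object exists under <ulid>/chunks/.
// A block that has a meta.json but an empty chunks/ prefix is a partial
// upload and should be excluded from compaction (or marked for cleanup).
func hasChunks(ctx context.Context, bkt objstore.BucketReader, id ulid.ULID) (bool, error) {
	found := false
	chunksDir := path.Join(id.String(), "chunks") + "/"
	err := bkt.Iter(ctx, chunksDir, func(string) error {
		// Iter invokes the callback once per object (or sub-prefix); seeing
		// anything at all means the chunks directory is not empty.
		found = true
		return nil
	})
	return found, err
}

A check like this could run right before the GatherIndexHealthStats call above, or be used to extend what sy.Partial() considers a partial upload.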

@GiedriusS (Member)

I've personally seen this problem come up time and time again. But this sounds like a bug in minio-go: it returns success even though the upload fails. I wonder if you have any way of reproducing this?

@namco1992 (Author)

Hi @GiedriusS, sorry for the long-overdue reply 😅 It has kept happening since I raised the issue here. However, I increasingly suspect it has something to do with our in-house Ceph cluster.

I've personally seen this problem come up time and time again.

Do you mean you've personally encountered this issue in your work/project, or that you've seen it reported repeatedly? If it's the former, do you also use some sort of in-house storage solution? Unfortunately I don't have a way to reproduce this consistently because I don't fully understand the conditions that trigger it.

We recently had an outage of our Ceph cluster, and it led to corrupted chunks too. My point is that the current safeguards don't seem to be sufficient, especially when using an in-house storage solution that might not be as reliable as AWS S3. The compactor should have a data integrity check and skip corrupted blocks.
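
To make "skip" concrete, a rough sketch (again hypothetical: it reuses the hasChunks helper sketched in the issue description, filterCompleteBlocks is not part of the current code base, and the logging assumes go-kit log as used by Thanos):

import (
	"context"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
	"github.com/oklog/ulid"
	"github.com/thanos-io/objstore"
)

// filterCompleteBlocks drops blocks whose chunks/ prefix is empty before they
// are handed to the compaction planner, logging whatever gets excluded.
func filterCompleteBlocks(ctx context.Context, logger log.Logger, bkt objstore.BucketReader, ids []ulid.ULID) []ulid.ULID {
	complete := make([]ulid.ULID, 0, len(ids))
	for _, id := range ids {
		ok, err := hasChunks(ctx, bkt, id)
		if err != nil {
			level.Warn(logger).Log("msg", "cannot verify chunks; excluding block from compaction", "block", id, "err", err)
			continue
		}
		if !ok {
			level.Warn(logger).Log("msg", "meta.json present but chunks missing; excluding block from compaction", "block", id)
			continue
		}
		complete = append(complete, id)
	}
	return complete
}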

@seanschneeweiss

Can confirm that we see similar problems with an in-house MinIO deployment that has trouble syncing between replicas due to DNS lookup timeouts.

@dmclf commented Oct 31, 2023

Also experiencing this on AWS S3 (which presumably doesn't count as an in-house S3 solution). After deleting the key/ULID that has no chunks folder, things are happy again.

@outofrange (Contributor)

This would help a lot!

If they are excluded from compaction, will they still be deleted once retention kicks in, or will they have to be cleaned up manually?
