
[Compactor] Detect the incomplete uploaded blocks and exclude them from compaction #6328

Open
namco1992 opened this issue May 2, 2023 · 5 comments

@namco1992 commented May 2, 2023

Is your proposal related to a problem?

As issue #5978 mentions, the compactor currently hits a halt error when it tries to compact an incomplete block (e.g. only meta.json or the index is uploaded, but not the chunks folder). The safeguard at https://github.com/thanos-io/thanos/blob/main/pkg/block/block.go#L156 doesn't really guarantee that a partial upload implies a missing or corrupted meta.json. #5859 suggests the same.

Although issue #5978 is closed, the solution there is to delete the bad blocks manually, which is toil and error-prone.

Describe the solution you'd like

We would like to propose that the compactor extend its detection of partially uploaded blocks. The assumption that "the presence of meta.json means a complete upload" has been proven not to hold in all cases. Simply checking meta.json doesn't guarantee the block is intact; a more comprehensive approach is needed to cover cases where the chunks are missing or only partially uploaded.

It can be done when collecting the block health stats:

thanos/pkg/compact/compact.go

Lines 1035 to 1059 in a1ec4d5

var stats block.HealthStats
if err := tracing.DoInSpanWithErr(ctx, "compaction_block_health_stats", func(ctx context.Context) (e error) {
	stats, e = block.GatherIndexHealthStats(cg.logger, filepath.Join(bdir, block.IndexFilename), meta.MinTime, meta.MaxTime)
	return e
}, opentracing.Tags{"block.id": meta.ULID}); err != nil {
	return errors.Wrapf(err, "gather index issues for block %s", bdir)
}
if err := stats.CriticalErr(); err != nil {
	return halt(errors.Wrapf(err, "block with not healthy index found %s; Compaction level %v; Labels: %v", bdir, meta.Compaction.Level, meta.Thanos.Labels))
}
if err := stats.OutOfOrderChunksErr(); err != nil {
	return outOfOrderChunkError(errors.Wrapf(err, "blocks with out-of-order chunks are dropped from compaction: %s", bdir), meta.ULID)
}
if err := stats.Issue347OutsideChunksErr(); err != nil {
	return issue347Error(errors.Wrapf(err, "invalid, but reparable block %s", bdir), meta.ULID)
}
if err := stats.OutOfOrderLabelsErr(); !cg.acceptMalformedIndex && err != nil {
	return errors.Wrapf(err,
		"block id %s, try running with --debug.accept-malformed-index", meta.ULID)
}
return nil

Or in the BestEffortCleanAbortedPartialUploads:

cleanPartialMarked := func() error {
	cleanMtx.Lock()
	defer cleanMtx.Unlock()

	if err := sy.SyncMetas(ctx); err != nil {
		return errors.Wrap(err, "syncing metas")
	}

	compact.BestEffortCleanAbortedPartialUploads(ctx, logger, sy.Partial(), bkt, compactMetrics.partialUploadDeleteAttempts, compactMetrics.blocksCleaned, compactMetrics.blockCleanupFailures)
	if err := blocksCleaner.DeleteMarkedBlocks(ctx); err != nil {
		return errors.Wrap(err, "cleaning marked blocks")
	}
	compactMetrics.cleanups.Inc()

	return nil
}
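
For illustration, here is a minimal sketch of what such a chunks-presence check could look like. This is not existing Thanos code: hasChunks is a hypothetical helper, and it only assumes that objstore's Iter lists the objects under a given prefix (import paths may differ between Thanos versions).

// A minimal sketch, not existing Thanos code: verify that a block's chunks/
// prefix contains at least one object before admitting it to compaction.
package blockcheck

import (
	"context"
	"path"

	"github.com/oklog/ulid"
	"github.com/thanos-io/objstore"
)

// hasChunks reports whether at least one object exists under <ulid>/chunks/.
// A block that has a meta.json but an empty chunks/ prefix is a partial
// upload and should be excluded from compaction (or marked for cleanup).
func hasChunks(ctx context.Context, bkt objstore.BucketReader, id ulid.ULID) (bool, error) {
	found := false
	chunksDir := path.Join(id.String(), "chunks") + "/"
	err := bkt.Iter(ctx, chunksDir, func(string) error {
		// Iter invokes the callback once per object (or sub-prefix); seeing
		// anything at all means the chunks directory is not empty.
		found = true
		return nil
	})
	return found, err
}

A check like this could run right before the GatherIndexHealthStats call above, or be used to extend what sy.Partial() considers a partial upload.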

@GiedriusS (Member)

I've personally seen this problem come up time and time again. But this sounds like a bug in minio-go: it returns success even though the upload fails. I wonder if you have any way of reproducing this?

@namco1992 (Author)

Hi @GiedriusS, sorry for the long-overdue reply 😅 It has kept happening since I raised the issue here. However, I increasingly suspect it has something to do with our in-house Ceph cluster.

I've personally seen this problem come up time and time again.

Do you mean you've personally encountered this issue in your work/project, or that you've seen it reported repeatedly? If it's the former, do you also use some sort of in-house storage solution? Unfortunately I don't have a way to reproduce this consistently because I don't fully understand the conditions that trigger it.

We recently had an outage of our Ceph cluster, and it led to corrupted chunks too. My point is that the current safeguards don't seem to be sufficient, especially when using an in-house storage solution that might not be as reliable as AWS S3. The compactor should have a data integrity check and skip corrupted blocks.
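
To make "skip" concrete, a rough sketch (again hypothetical: it reuses the hasChunks helper sketched in the issue description, filterCompleteBlocks is not part of the current code base, and the logging assumes go-kit log as used by Thanos):

import (
	"context"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
	"github.com/oklog/ulid"
	"github.com/thanos-io/objstore"
)

// filterCompleteBlocks drops blocks whose chunks/ prefix is empty before they
// are handed to the compaction planner, logging whatever gets excluded.
func filterCompleteBlocks(ctx context.Context, logger log.Logger, bkt objstore.BucketReader, ids []ulid.ULID) []ulid.ULID {
	complete := make([]ulid.ULID, 0, len(ids))
	for _, id := range ids {
		ok, err := hasChunks(ctx, bkt, id)
		if err != nil {
			level.Warn(logger).Log("msg", "cannot verify chunks; excluding block from compaction", "block", id, "err", err)
			continue
		}
		if !ok {
			level.Warn(logger).Log("msg", "meta.json present but chunks missing; excluding block from compaction", "block", id)
			continue
		}
		complete = append(complete, id)
	}
	return complete
}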

@seanschneeweiss

Can confirm that we see similar problems with an in-house MinIO deployment that has trouble syncing between replicas due to DNS lookup timeouts.

@dmclf commented Oct 31, 2023

Also experiencing this on AWS S3 (which presumably doesn't count as an in-house S3 solution). After deleting the key/ULID that has no chunks folder, things are happy again.

@outofrange (Contributor)

This would help a lot!

If they are excluded from compaction, will they still be deleted once retention kicks in, or will they have to be cleaned up manually?
