thanos compactor fails with populate block: chunk iter: cannot populate chunk 8: segment index 0 out of range #5978

Closed
mateuszdrab opened this issue Dec 18, 2022 · 10 comments

Comments

@mateuszdrab

mateuszdrab commented Dec 18, 2022

Thanos, Prometheus and Golang version used: thanos:0.29.0

Object Storage Provider: s3/minio

What happened:

Similarly to issue #1300, I had too little storage assigned to the compactor, so it built up a big backlog and only started processing it after I expanded the volume yesterday. However, overnight it halted due to the error below, and restarting the compactor reproduces the exact same error message with the same blocks.

What you expected to happen:
Compact, or skip the problematic blocks and continue with the rest of the backlog. There are still over 300 compactions in the todo, and about half have been done.

How to reproduce it (as minimally and precisely as possible):
Not sure

Full logs to relevant components:

level=error ts=2022-12-18T03:30:59.254203234Z caller=compact.go:488 msg="critical error detected; halting" err="compaction: group 0@16259223971946823639: compact blocks [/data/compact/0@16259223971946823639/01GGM8RSMD74KF170SH7VJKGKE /data/compact/0@16259223971946823639/01GGMFMH57QV6C6GJ45MAH49EN /data/compact/0@16259223971946823639/01GGMPG8DMKVJVJMWAP97M84R5 /data/compact/0@16259223971946823639/01GGMXBZP09041HM47Y47HDXQJ]: populate block: chunk iter: cannot populate chunk 8: segment index 0 out of range"
@fpetkovski
Contributor

Do you have a retention policy configured in the compactor or in the bucket itself?

As a workaround, you might want to add a no-compact mark to the blocks for which compaction fails so that the compactor can get unstuck.

@mateuszdrab
Author

Do you have a retention policy configured in the compactor or in the bucket itself?

As a workaround, you might want to add a no-compact mark to the blocks for which compaction fails so that the compactor can get unstuck.

Hi @fpetkovski

Thanks for replying

I configured the retention on the compactor, not the bucket.

I suspect the issue could be due to partial upload or a corrupted block, but I'm not sure how to verify.

How do I mark the block as no-compact?

Another thing: the error mentions several blocks, so it's hard to tell which one is actually the problem.

@fpetkovski
Contributor

I configured the retention on the compactor, not the bucket.

Could the retention kick in while blocks are getting uploaded?

How do I mark the block as no-compact?
Another thing: the error mentions several blocks, so it's hard to tell which one is actually the problem.

You can use the thanos tools mark command for this: https://thanos.io/tip/components/tools.md/#bucket-mark and mark all blocks reported in that error.
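Something like this should do it (a rough sketch going by the linked docs; double-check thanos tools bucket mark --help for your version and point --objstore.config-file at your own bucket config):

thanos tools bucket mark \
    --objstore.config-file=bucket.yml \
    --marker=no-compact-mark.json \
    --details="cannot populate chunk 8: segment index 0 out of range" \
    --id=01GGM8RSMD74KF170SH7VJKGKE \
    --id=01GGMFMH57QV6C6GJ45MAH49EN \
    --id=01GGMPG8DMKVJVJMWAP97M84R5 \
    --id=01GGMXBZP09041HM47Y47HDXQJ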

@mateuszdrab
Author

@fpetkovski upon browsing minio looking for the blocks mentioned in the error, I found that block 01GGMXBZP09041HM47Y47HDXQJ did not have a chunks directory, just meta.json and index, so I deleted the block and compaction resumed. It must have been an incomplete upload.
It would be nice if Thanos could flag such blocks and continue with the others, or even delete them (the block was dated over a month ago).
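For anyone else hitting this, the equivalent check from the command line would be something like this (a sketch using s3cmd, assuming the bucket is called thanos - adjust for your setup):

s3cmd ls s3://thanos/01GGMXBZP09041HM47Y47HDXQJ/
# a healthy block lists meta.json, index and a chunks/ directory;
# this one only had meta.json and index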

@pjastrzabek-roche

We have the same issue in one bucket, but there are a lot of blocks like this, so deleting them one by one did not help.
Is there some way to find all broken blocks and mark them all at once?

@mateuszdrab
Author

mateuszdrab commented Jan 15, 2023

We have the same issue in one bucket, but there are a lot of blocks like this, so deleting them one by one did not help.

Is there some way to find all broken blocks and mark them all at once?

Have you tried https://thanos.io/tip/components/tools.md/#bucket-verify ?

Looks like it might be the command you need to determine broken chunks.
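Something along these lines, I think (just a sketch - I haven't double-checked the exact flags, so see thanos tools bucket verify --help for your version, and point --objstore.config-file at your bucket config):

thanos tools bucket verify \
    --objstore.config-file=bucket.yml \
    --issues=index_known_issues \
    --issues=overlapped_blocks \
    --log.level=debug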

@pjastrzabek-roche

pjastrzabek-roche commented Jan 16, 2023

Thanks for the tip.
The problematic block looks like this:

level=debug ts=2023-01-16T01:32:32.303097377Z caller=index_issue.go:140 verifiers=overlapped_blocks,index_known_issues verifier=index_known_issues stats="{TotalSeries:49338 OutOfOrderSeries:0 OutOfOrderChunks:0 DuplicatedChunks:0 OutsideChunks:0 CompleteOutsideChunks:0 Issue347OutsideChunks:0 OutOfOrderLabels:0 SeriesMinLifeDuration:0s SeriesAvgLifeDuration:865ms SeriesMaxLifeDuration:15.001s SeriesMinLifeDurationWithoutSingleSampleSeries:14.656s SeriesAvgLifeDurationWithoutSingleSampleSeries:14.976s SeriesMaxLifeDurationWithoutSingleSampleSeries:15.001s SeriesMinChunks:1 SeriesAvgChunks:1 SeriesMaxChunks:1 TotalChunks:49338 ChunkMinDuration:0s ChunkAvgDuration:865ms ChunkMaxDuration:15.001s ChunkMinSize:9223372036854775807 ChunkAvgSize:0 ChunkMaxSize:-9223372036854775808 SingleSampleSeries:46487 SingleSampleChunks:46487 LabelNamesCount:194 MetricLabelValuesCount:1168}" id=01GJ73EN2DN1XFHEA6087NFTQ7
level=debug ts=2023-01-16T01:32:32.303142098Z caller=index_issue.go:57 verifiers=overlapped_blocks,index_known_issues verifier=index_known_issues msg="no issue" id=01GJ73EN2DN1XFHEA6087NFTQ7

What's weird is ChunkMaxSize being -9223372036854775808 and ChunkMinSize being 9223372036854775807 - those are the int64 minimum and maximum, i.e. the initial values of those stats, which suggests no chunk sizes were ever read for the block.
I have 49 blocks like this.

@mchinkov-ionos

@fpetkovski upon browsing minio looking for the blocks mentioned in the error, I found that block 01GGMXBZP09041HM47Y47HDXQJ did not have a chunks directory, just meta.json and index, so I deleted the block and compaction resumed. It must have been an incomplete upload. It would be nice if Thanos could flag such blocks and continue with the others, or even delete them (the block was dated over a month ago).

Yes, I've had the same thing! I debugged for half a day via the Thanos command line and found nothing, and then simple brute-force block deletion from S3 itself solved the problem!

@outofrange
Contributor

outofrange commented Feb 9, 2024

This happens for us every few weeks, either due to some underlying network issues or other problems; we're not sure yet.

But fixing this is quite painful; usually it's

  • restart the compactor
  • look for the error / wait until the compactor hits the problematic block
  • get the block id from the logs and delete it
  • rinse and repeat

I had no luck with bucket verify, but I will test again; maybe I had already deleted the affected blocks manually when I tested it, I'm not sure anymore.
But even if it works, it takes a long time, and having to download everything or set up a backup bucket for automatic repair is more hassle than I'd expect just to get a list of blocks with no chunks attached to them to delete manually - if it even works that way.

If anyone else needs to fix it, here's a small script using s3cmd to find (and delete) blocks without chunks; disclaimer:

  • it won't check for "freshness", so it would delete blocks that are currently being uploaded - only run it when nothing is uploading, or expect data loss
  • it deletes every prefix with a missing or empty "chunks" dir - so only execute it against Thanos buckets
#!/bin/bash

THREAD=10     # number of parallel checks
BUCKET=thanos # object storage bucket holding the Thanos blocks

# List all block prefixes in the bucket (each "s3://$BUCKET/<ULID>/" is one block).
BUCKETS=$(s3cmd ls s3://$BUCKET | grep -Po "s3://.*")

# Delete a block prefix if its chunks/ directory is missing or empty.
function check_bucket() {
    bucket=$1
    CHUNKS=$(s3cmd ls "${bucket%/}/chunks/" | wc -l)
    if [[ $CHUNKS -eq 0 ]]; then
        echo "$bucket: no chunks, deleting"
        s3cmd rm -r "${bucket%/}"
    fi
}

export -f check_bucket
echo "$BUCKETS" | xargs -L1 -P$THREAD bash -c 'check_bucket "$@"' _

@bobykus31

bobykus31 commented Mar 3, 2024

Is there any way to download a block and check why it fails with "segment index out of range"? Mine looks like this:

@thanos-compact1:~$ ls -la data/01HQZD47X047YW61ZSYHFYJC8K/* 
-rw-rw-r-- 1 user user 8553936 Mar  2 11:00 data/01HQZD47X047YW61ZSYHFYJC8K/index
-rw-rw-r-- 1 user user     713 Mar  2 11:00 data/01HQZD47X047YW61ZSYHFYJC8K/meta.json

data/01HQZD47X047YW61ZSYHFYJC8K/chunks:
total 54400
drwxrwxr-x 2 user user       20 Mar  3 09:35 .
drwxrwxr-x 3 user user       50 Mar  3 09:35 ..
-rw-rw-r-- 1 user user 55702692 Mar  2 11:00 000001

I can't see anything wrong there with the naked eye.
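I guess the next thing to try is running the verifier against just this block, something like the following (not sure whether the block filter flag is --id or --id-whitelist on my version):

thanos tools bucket verify \
    --objstore.config-file=bucket.yml \
    --id=01HQZD47X047YW61ZSYHFYJC8K \
    --log.level=debug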
