Thanos sidecar doesn't upload full blocks #7234
Comments
Hey! Out of curiosity: how does meta.json look for a broken block? Does it only list the chunks that were actually uploaded?
Wonder why it is always "cannot populate chunk 8 from block"... We get such errors in the compactor periodically when object storage load/latency is high, like:
Last time there were even too many:
16 blocks in the last 24 hrs. Meanwhile, all (or some of?) these blocks or chunks look perfectly fine when run
Here is a meta.json
and a copy of the block on the local fs
Looks like some issue in the S3 object storage client, or what?
Hey @bobykus31, are you able to share the blocks by chance?
I wish I could, but I don't think I'm allowed to, sorry. You mean the contents, right?
I don't even know if the block itself would be of any help, but ...
So I turned on debugging for S3 via trace.enable: true (the relevant objstore config fragment is sketched after this comment). Here is what I can see.
First
Then
Then
Somehow I cannot see anything like
exactly the way I can see
For some reason the compactor just ignored /ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/chunks/000001, even though later on I was able to download it manually with "s3cmd get --recursive".
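For context, the trace.enable setting mentioned above lives in the S3 section of the Thanos objstore configuration. A minimal sketch, with placeholder bucket, endpoint and credentials rather than the reporter's real values:

```yaml
# Minimal Thanos S3 objstore config with client-side request tracing enabled.
# Bucket, endpoint and credentials below are placeholders.
type: S3
config:
  bucket: "example-bucket"
  endpoint: "s3.example.com"
  access_key: "EXAMPLE_ACCESS_KEY"
  secret_key: "EXAMPLE_SECRET_KEY"
  trace:
    enable: true   # log each request/response the S3 client issues, as used above to debug the missing chunk
```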
Hey @bobykus31, that's really interesting; can you retry with
Yes, re-downloading seems to make the compaction go well.
Seems to be an S3 issue itself. I was able to reproduce it with s3cmd for some blocks. Maybe the --consistency-delay setting can help (currently it is 2h).
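For reference, --consistency-delay is a flag on the thanos compact command that makes the compactor ignore blocks younger than the given age. A minimal sketch of where it would be set, assuming a Kubernetes-style deployment; the paths and the 2h value are illustrative, not the reporter's actual setup:

```yaml
# Hypothetical container args for the Thanos compactor; adjust paths/values to your environment.
args:
  - compact
  - --wait                                              # run continuously instead of one-shot
  - --data-dir=/var/thanos/compact                      # local scratch space for compaction
  - --objstore.config-file=/etc/thanos/objstore.yaml    # same bucket config as the sidecar
  - --consistency-delay=2h                              # skip blocks younger than 2h to tolerate slow/partial uploads
```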
The issue still exists, FYI. Increasing --consistency-delay does not help much. Looks similar to #1199
Not a fix for it by any means, but if we have already uploaded such blocks then #7282 should at least make the compactor and store not crash, and instead mark them as corrupted and increment the proper metrics.
This can happen when you use multi-site object storage and the data sync between sites is not consistent (for many reasons). Here is what the object storage provider recommends in such a case: Note: In general, you should use the “read-after-new-write” consistency control value. If requests aren't working correctly, change the application client behavior if possible. Or, configure the client to specify the consistency control for each API request. Set the consistency control at the bucket level only as a last resort.
How is this achievable with Thanos?
Thanos, Prometheus and Golang version used:
Object Storage Provider: Ceph / R2
What happened:
Every now and then, our sidecars upload incomplete blocks. This causes our compactors to panic while trying to read those blocks until we manually delete them.
The degree of incompleteness varies, from completely missing chunks:
to incomplete chunks:
What you expected to happen:
Either:
How to reproduce it (as minimally and precisely as possible):
Unknown at the moment. Submitting this mostly on the off chance anyone has seen this before.
Full logs to relevant components:
Anything else we need to know:
A few quirks of our environment:
Full prometheus command line:
Full sidecar command line:
/cc @MichaHoffmann