While debugging downsampling last week, I discovered at least one block that contained duplicated and out-of-order data.
For about 10% of its series, a sequence of 3 chunks was repeated about 180 times, followed by two chunks covering times before that sequence. I cannot explain how this could have happened. In principle, the original block with those 3 chunks might not have been GC'd and then been accidentally re-compacted over several iterations, but that alone should not get us to 180 repetitions.
As far as I can tell, no data loss occurred and this should be fully recoverable.
Even though the duplication is severe, the total data blow-up is fairly minimal AFAICT.
As we have no explanation of what caused this, the best ways to address it seem to be:

- Add handling to our normal reads and downsampling that accounts for the issue, so it is not user-facing.
- Add verification logic to our compactor that aborts if it detects such a case again. This way we detect the issue right away and have a chance to debug it properly, rather than only after several more compaction iterations (a sketch of such a check follows this list).
- Add a `thanos bucket check` command that walks existing blocks and detects the issue. It could also be extended to re-write affected blocks properly.
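For illustration, here is a minimal Go sketch of the per-series verification the compactor or a `thanos bucket check` command could run. The `ChunkMeta` struct and `verifyChunkOrder` function are hypothetical stand-ins, not existing Thanos APIs; the only assumption is that each chunk's index entry carries an inclusive `[MinTime, MaxTime]` range, as in Prometheus TSDB.

```go
package main

import "fmt"

// ChunkMeta is a hypothetical stand-in for per-chunk index metadata;
// in Prometheus TSDB each chunk covers an inclusive [MinTime, MaxTime] range.
type ChunkMeta struct {
	MinTime, MaxTime int64
}

// verifyChunkOrder returns an error for the first chunk whose time range
// overlaps with or precedes the previous chunk, i.e. the duplicated /
// out-of-order condition described above.
func verifyChunkOrder(chunks []ChunkMeta) error {
	for i := 1; i < len(chunks); i++ {
		prev, cur := chunks[i-1], chunks[i]
		if cur.MinTime <= prev.MaxTime {
			return fmt.Errorf("chunk %d [%d, %d] overlaps or is out of order with chunk %d [%d, %d]",
				i, cur.MinTime, cur.MaxTime, i-1, prev.MinTime, prev.MaxTime)
		}
	}
	return nil
}

func main() {
	// The pathological pattern from this report: a 3-chunk sequence repeated,
	// followed by chunks for times before that sequence.
	broken := []ChunkMeta{
		{0, 100}, {101, 200}, {201, 300},
		{0, 100}, {101, 200}, {201, 300}, // duplicated sequence
		{-200, -100}, // out-of-order chunk at the end
	}
	if err := verifyChunkOrder(broken); err != nil {
		fmt.Println("detected:", err) // compactor would abort here
	}
}
```

On a clean block the condition never fires, so a check like this should be cheap enough to run on every compaction.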