Data loss when running compactor for fixed time range #6293
-
Compactor in our cluster had accumulated so much backlog that it was no longer compacting data. Following the recommendation under Troubleshoot compactor backlog, I ran the compactor for just a short time range (49 days) with fixed min and max times. With this setup the compactor was able to compact, downsample, and delete old blocks. As expected it produced three 14-day blocks (42 days), but there are no blocks for the last week, and Thanos Query returns no data for that time range. Using `thanos tools bucket inspect` I confirmed that there are no blocks for that 7-day period. It looks like the compactor merged raw blocks into 14-day blocks of downsampled data, but the last 7 days went missing! I was expecting them to be compacted into their own block. Retention for 1-hour resolution is set to 3 years, and those data blocks are less than a year old. Does anyone know what is causing this behavior? Thanos version: v0.29
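For reference, a minimal sketch of such a fixed-window run, assuming the `--min-time`/`--max-time` block filters of `thanos compact` and `thanos tools bucket inspect`; the bucket config path and the timestamps are illustrative, not the exact ones used here:

```shell
# Run the compactor over a fixed window only (timestamps are illustrative).
# Without --wait, thanos compact processes the backlog once and exits.
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=bucket.yml \
  --min-time=2022-01-01T00:00:00Z \
  --max-time=2022-02-19T00:00:00Z

# Afterwards, verify which blocks now exist in the bucket:
thanos tools bucket inspect --objstore.config-file=bucket.yml
```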
-
This might be the reason. Since the compactor was blocked for nearly a year, I guess the "minimum age for the successive downsampling pass" (I'm not sure exactly what this means) must have grown larger than the retention period of raw blocks. Looks like I will need to raise the retention while I'm working through the backlog! Good that this is called out in the compactor doc; it would be nice to have the same warning in the compactor backlog doc as well.
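As a hedged sketch of that adjustment, assuming the standard `thanos compact` retention flags (the values below are only illustrative, not recommendations):

```shell
# Keep raw (and 5m) data long enough for the delayed downsampling
# passes to run before retention deletes their source blocks.
thanos compact \
  --objstore.config-file=bucket.yml \
  --retention.resolution-raw=120d \
  --retention.resolution-5m=1y \
  --retention.resolution-1h=3y
```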
-
Also filed an issue to update the compactor backlog documentation to provide more details on how to safely run multiple compactors for non-overlapping time ranges in parallel. #6298
After a bit more digging, it appears that downsampling requires source blocks to be of a certain size. In my case the 5m downsampling did occur, but not the 1h. The 1h downsampling requires blocks spanning at least 10 days, and since the time range I ran the compactor for was not a multiple of 10 days, the compacted blocks for the missing period were less than 10 days long, so downsampling skipped them. Then retention kicked in and cleaned up the raw and 5m blocks. It's possible there are other edge cases I'm missing.
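The arithmetic of this failure mode can be sketched as follows (the 14-day block size and the 10-day minimum span for 1h downsampling are taken from the observations above; exact compactor internals may differ):

```shell
window_days=49      # fixed min/max time range the compactor was run for
block_days=14       # largest compacted block length (2w)
min_1h_days=10      # minimum block span before 1h downsampling applies

full_blocks=$((window_days / block_days))   # 3 full 14-day blocks
leftover=$((window_days % block_days))      # 7-day remainder block

if [ "$leftover" -lt "$min_1h_days" ]; then
  echo "leftover ${leftover}d block skipped by 1h downsampling"
fi
echo "downsampled: $((full_blocks * block_days))d, at risk: ${leftover}d"
```

With these numbers the 42 downsampled days survive, while the 7-day remainder is never downsampled and is then deleted by raw/5m retention.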
Conclusion:
I guess the safest possible path forward is to compact with retention time increased to a large enough value. I am also not sure what the best approach is to work through a large backlog…