Data loss when running compactor for fixed time range #6293
-
Compactor in our cluster had accumulated so much backlog that it was no longer compacting data. Following the recommendation under Troubleshoot compactor backlog, I ran the compactor for just a short time range (49 days) with fixed min and max times. With this setup the compactor was able to compact, downsample, and delete old blocks. As expected it produced three 14-day blocks (42 days), but there are no blocks for the last week, and Thanos Query returns no data for that time range. Using `thanos tools bucket inspect` I confirmed that there are no blocks for that 7-day period. It looks like the compactor merged raw blocks into 14-day blocks of downsampled data, but the last 7 days went missing! I was expecting them to be compacted into their own block. Retention for 1-hour resolution is set to 3 years, and those data blocks are less than a year old. Does anyone know what is causing this behavior? Thanos version: v0.29
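For reference, a minimal sketch of such a fixed-window run, assuming the `--min-time`/`--max-time` block filters of `thanos compact` and `thanos tools bucket inspect`; the bucket config path and the timestamps are illustrative, not the exact ones used here:

```shell
# Run the compactor over a fixed window only (timestamps are illustrative).
# Without --wait, thanos compact processes the backlog once and exits.
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=bucket.yml \
  --min-time=2022-01-01T00:00:00Z \
  --max-time=2022-02-19T00:00:00Z

# Afterwards, verify which blocks now exist in the bucket:
thanos tools bucket inspect --objstore.config-file=bucket.yml
```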
-
This might be the reason. Since the compactor was blocked for nearly a year, I guess the "minimum age for the successive downsampling pass" (I'm not sure exactly what this means) must have grown larger than the retention period of raw blocks. Looks like I will need to raise the retention while I'm working through the backlog! Good that this is called out in the compactor doc; it would be nice to have the same warning in the compactor backlog doc as well.
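As a hedged sketch of that adjustment, assuming the standard `thanos compact` retention flags (the values below are only illustrative, not recommendations):

```shell
# Keep raw (and 5m) data long enough for the delayed downsampling
# passes to run before retention deletes their source blocks.
thanos compact \
  --objstore.config-file=bucket.yml \
  --retention.resolution-raw=120d \
  --retention.resolution-5m=1y \
  --retention.resolution-1h=3y
```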
-
Also filed an issue to update the compactor backlog documentation to provide more details on how to safely run multiple compactors for non-overlapping time ranges in parallel. #6298
After a bit more digging, it appears that downsampling requires source blocks to be of a certain size. In my case the 5m downsampling did occur, but not the 1h. The 1h downsampling requires blocks spanning at least 10 days, and since the time range I ran the compactor for was not a multiple of 10 days, the compacted blocks for the missing period were less than 10 days long, so downsampling skipped them. Then retention kicked in and cleaned up the raw and 5m blocks. It's possible there are other edge cases I'm missing.
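The arithmetic of this failure mode can be sketched as follows (the 14-day block size and the 10-day minimum span for 1h downsampling are taken from the observations above; exact compactor internals may differ):

```shell
window_days=49      # fixed min/max time range the compactor was run for
block_days=14       # largest compacted block length (2w)
min_1h_days=10      # minimum block span before 1h downsampling applies

full_blocks=$((window_days / block_days))   # 3 full 14-day blocks
leftover=$((window_days % block_days))      # 7-day remainder block

if [ "$leftover" -lt "$min_1h_days" ]; then
  echo "leftover ${leftover}d block skipped by 1h downsampling"
fi
echo "downsampled: $((full_blocks * block_days))d, at risk: ${leftover}d"
```

With these numbers the 42 downsampled days survive, while the 7-day remainder is never downsampled and is then deleted by raw/5m retention.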
Conclusion:
I guess the safest possible path forward is to compact with retention time increased to a large enough value. I am also not sure what the best approach is to work through a large backlog…