Improve handling of out-of-order chunks #4916

Closed
AndrewNiven1 opened this issue Nov 30, 2021 · 5 comments

Comments

@AndrewNiven1

AndrewNiven1 commented Nov 30, 2021

Is your proposal related to a problem?

When Compactor encounters an out-of-order chunk, it can skip compaction of that block if --compact.skip-block-with-out-of-order-chunks is set (from this PR). This is very useful for stopping a single broken pod from breaking compaction for an entire deployment.
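For reference, a minimal sketch of what enabling this looks like on the compactor command line (the surrounding flags --data-dir, --objstore.config-file and --wait are just ordinary compactor options shown for context; adjust paths and config for your deployment):

```bash
# Run the compactor, but mark and skip blocks containing out-of-order chunks
# instead of halting the whole compaction loop.
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=bucket.yml \
  --wait \
  --compact.skip-block-with-out-of-order-chunks
```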

What I am concerned about is the long-term behaviour here. We have observed that, when blocks go uncompacted for too long, metrics query performance becomes extremely poor. If too many blocks are having compaction skipped, I expect that they will eventually have the same effect on performance, and it will not be obvious to an operator why this is happening.

Describe the solution you'd like

EDIT: This solution is already implemented

As a quick and easy solution, we would like to add a metric to track blocks whose compaction is being skipped (e.g. thanos_compact_skipped_compaction_blocks_total). This would allow us to write dashboards and alerts that make it clear when this is happening.

Describe alternatives you've considered

It would be nice if Compactor had the ability to clean up the storage account that contains the out-of-order chunk. This could be as simple as backing up the block to a separate storage account, and deleting the offending out-of-order time series (or the entire chunk, or the entire block). This could lead to some data loss from the storage account, but it would be retrievable, and would make Compactor much more self-sufficient (i.e. requiring less manual intervention when things go wrong).

@wiardvanrij
Member

I haven't tested it myself, but for your quick and easy solution:

m.blocksMarked = promauto.With(reg).NewCounterVec(prometheus.CounterOpts{

Wouldn't metadata.OutOfOrderChunksNoCompactReason give you the info you need?
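If so, an alert on that counter would cover the dashboards/alerts use case. Here is a sketch of a Prometheus alerting rule, assuming the blocksMarked counter is exposed as thanos_compact_blocks_marked_total with a reason label and that OutOfOrderChunksNoCompactReason renders as block-index-out-of-order-chunk; the alert name is made up, and the metric/label values should be verified against the /metrics endpoint of the Thanos version you run:

```yaml
# Sketch only: metric name and reason value are assumptions, not verified here.
groups:
  - name: thanos-compact
    rules:
      - alert: ThanosCompactBlocksSkippedDueToOutOfOrderChunks
        expr: increase(thanos_compact_blocks_marked_total{reason="block-index-out-of-order-chunk"}[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Thanos Compactor is skipping blocks that contain out-of-order chunks"
```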

@AndrewNiven1
Author

Ah, excellent - I was looking for a standalone metric and didn't realise it would be a label on another metric. Thanks for pointing that out!

I'd still like to have a discussion about whether we could add an option to automate backup and deletion of problematic time series (at either the time-series, chunk, or block level). I'd like to know whether the Thanos team would be likely to accept this as a proposal, or if there are potential issues I haven't foreseen.

@yeya24
Contributor

yeya24 commented Dec 12, 2021

> I'd still like to have a discussion about whether we could add an option to automate backup and deletion of problematic time series (at either the time-series, chunk, or block level). I'd like to know whether the Thanos team would be likely to accept this as a proposal, or if there are potential issues I haven't foreseen.

I think this is a reasonable approach. Right now, when OOO chunks/series happen, the compactor halts, and the only thing I can do besides debugging is use the bucket rewrite tool to delete the problematic series.

This could be provided as a flag that just deletes the OOO chunks/series if users don't really care about debugging this case.
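For anyone hitting this in the meantime, a rough sketch of the manual workaround with the rewrite tool (the block ID and matcher are placeholders, and the flags and deletion-config schema should be double-checked against thanos tools bucket rewrite --help for your version):

```bash
# Placeholder matcher: select the problematic series to drop from the block.
cat > /tmp/to-delete.yaml <<EOF
- matchers: '{__name__="problematic_metric", instance="broken-pod"}'
EOF

# Dry-run is the default; add --no-dry-run to actually write the rewritten
# block, and --delete-blocks to mark the original block for deletion afterwards.
thanos tools bucket rewrite \
  --objstore.config-file=bucket.yml \
  --id=01ABCDEFGHJKMNPQRSTVWXYZ00 \
  --rewrite.to-delete-config-file=/tmp/to-delete.yaml
```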

@stale

stale bot commented Mar 2, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Mar 2, 2022
@stale

stale bot commented Apr 17, 2022

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Apr 17, 2022