Improving {Halt,Retry}Error semantics #2784

Closed
GiedriusS opened this issue Jun 19, 2020 · 13 comments
Comments

@GiedriusS
Member

This is a continuation of a small discussion we had on Slack a few weeks ago. Currently, we have HaltError and RetryError. A HaltError happens when some critical issue occurs and we have to halt the whole process so that the operator can investigate what went wrong. A RetryError happens when an operation fails but can be retried. Some relevant snippets of code:

    // The RetryError signals that we hit a retriable error (transient error, no connection).
    // You should alert on this being triggered too frequently.

    // The HaltError type signals that we hit a critical bug and should block
    // for investigation. You should alert on this being halted.
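
To make these semantics concrete, here is a minimal, self-contained sketch of the pattern; the type and function names below are illustrative, not the actual Thanos API. A retriable error causes the whole iteration to be re-run, while a halting error is surfaced to the operator.

```go
// Illustrative sketch of the halt/retry error pattern described above.
// These types and helpers are hypothetical, not Thanos' actual implementation.
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryError marks transient failures (e.g. a dropped connection) that are
// safe to retry; alert if it fires too frequently.
type retryError struct{ err error }

func (e retryError) Error() string { return fmt.Sprintf("retriable: %v", e.err) }

// haltError marks critical bugs: stop and wait for an operator to investigate.
type haltError struct{ err error }

func (e haltError) Error() string { return fmt.Sprintf("halt: %v", e.err) }

// runWithRetries re-runs the whole iteration on retryError and gives up
// immediately on haltError (or when retries are exhausted).
func runWithRetries(iteration func() error, maxRetries int, backoff time.Duration) error {
	for attempt := 0; ; attempt++ {
		err := iteration()
		if err == nil {
			return nil
		}
		var h haltError
		if errors.As(err, &h) {
			return fmt.Errorf("halting for operator investigation: %w", err)
		}
		var r retryError
		if errors.As(err, &r) && attempt < maxRetries {
			time.Sleep(backoff)
			continue // note: today this restarts from a wiped workspace
		}
		return err
	}
}

func main() {
	attempts := 0
	err := runWithRetries(func() error {
		attempts++
		if attempts < 3 {
			return retryError{err: errors.New("connection reset by peer")}
		}
		return nil
	}, 5, 100*time.Millisecond)
	fmt.Println("finished after", attempts, "attempts, err =", err)
}
```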

However, the way it works right now, if a RetryError occurs we retry the whole iteration, i.e. compactMainFn(). That's great, but all of the previously performed work disappears, because once we start that function again we wipe the workspace:

https://github.com/thanos-io/thanos/blob/master/pkg/compact/compact.go#L938-L941
https://github.com/thanos-io/thanos/blob/master/pkg/compact/compact.go#L464-L469

This means that all of the files that were previously downloaded successfully are lost.

I propose extending this logic to check what is on the disk before trying to overwrite the files. My idea is:

  • Extend meta.json to include file integrity checksums of all files in the directory besides meta.json itself and deletion-mark.json. These should be calculated by writers only if a flag has been specified, so that this doesn't incur a performance cost on all writers;
  • Make Thanos Compact consult those checksums, if present, before trying to download files again. If they match, we don't download the file; if they don't, we download the file again and overwrite it (a sketch follows below).
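
A rough sketch of what the verification step could look like, assuming checksums were recorded in meta.json as proposed. The SHA-256 choice, the path-to-digest layout, and the function names here are assumptions for illustration only:

```go
// Sketch of the proposed "skip already-downloaded files" check.
// The meta.json layout (a relative-path -> SHA-256 map) is an assumption.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// fileChecksums would be parsed from the (extended) meta.json:
// relative path -> expected SHA-256 hex digest.
type fileChecksums map[string]string

func sha256OfFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// needsDownload reports whether a file must be (re)downloaded: it is missing,
// unreadable, or its on-disk checksum does not match the one in meta.json.
func needsDownload(localPath, expected string) bool {
	got, err := sha256OfFile(localPath)
	if err != nil {
		return true // missing or unreadable: download it
	}
	return got != expected
}

func main() {
	checksums := fileChecksums{"chunks/000001": "deadbeef"} // example entry
	for rel, sum := range checksums {
		if needsDownload("/tmp/compact-workdir/"+rel, sum) {
			fmt.Println("would download:", rel)
		} else {
			fmt.Println("reusing local copy:", rel)
		}
	}
}
```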

Affected Thanos versions: <=0.13.0.

Thoughts?

GiedriusS added a commit to GiedriusS/thanos that referenced this issue Jun 19, 2020
In certain environments, especially private, on-premise clouds, transient
errors can sometimes happen due to misconfiguration or other reasons. In my
case, the S3 end-point sometimes spuriously reset the TCP connection. Without
functionality like this, it was almost impossible for Thanos Compact to finish
its work, considering that we sometimes have to download hundreds of gigabytes
of files.

Curiously enough, only the downloading path was affected, so this only adds
retries for that direction.

Opening this up without any tests to see whether this is something we want to
add to Thanos, or whether we should implement thanos-io#2784 instead. Or maybe
both?

Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
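
The commit above only adds retries for the download direction. As a minimal sketch of that idea (not the actual patch), assuming a plain exponential-backoff loop around a caller-supplied download function, with the real objstore wiring omitted:

```go
// Illustrative retry-with-backoff wrapper for the download path only.
// The download callback stands in for the real object-storage download call.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// downloadWithRetries retries only transient download failures, doubling the
// wait between attempts, and gives up after maxAttempts.
func downloadWithRetries(ctx context.Context, maxAttempts int, download func(context.Context) error) error {
	backoff := time.Second
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		lastErr = download(ctx)
		if lastErr == nil {
			return nil
		}
		if attempt == maxAttempts {
			break
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
			backoff *= 2
		}
	}
	return fmt.Errorf("download failed after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	calls := 0
	err := downloadWithRetries(context.Background(), 3, func(ctx context.Context) error {
		calls++
		if calls < 2 {
			return errors.New("connection reset by peer") // simulated transient error
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```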
@pracucci
Contributor

My gut feeling is that your proposed logic adds extra complexity to cover a case which shouldn't be the normal one. I wouldn't expect failures to happen so frequently and, if they do, fixing the root cause is probably better, but I'm open to discussing it further.

Extend meta.json to include file integrity checksums of all files in the directory besides meta.json itself and deletion-mark.json. These should be calculated by writers only if a flag has been specified, so that this doesn't incur a performance cost on all writers;

Some object storages already expose the checksum of objects when reading attributes (HEAD operation). Have you checked whether it's available for every object store we support? If so, you wouldn't need to store any checksum in meta.json.

@GiedriusS
Member Author

My gut feeling is that your proposed logic adds extra complexity to cover a case which shouldn't be the normal one. I wouldn't expect failures to happen so frequently and, if they do, fixing the root cause is probably better, but I'm open to discussing it further.

I actually had to implement a rough version of #2785 because it was impossible for Thanos Compactor to finish even one iteration due to transient errors. RetryError should help in this case, but right now it wipes everything, so it's pretty useless. It only "helps" by not restarting the whole process when --wait is specified.

If the object storage code doesn't support retries when some more serious error happens (and it shouldn't), and if our retrying logic is, IMHO, kind of useless, then where do you think such things should be handled?

RetryError seems like an elegant solution to me, but it needs the improvements outlined in the OP.

Extend meta.json to include file integrity checksums of all files in the directory besides meta.json itself and deletion-mark.json. These should be calculated by writers only if a flag has been specified, so that this doesn't incur a performance cost on all writers;

Some object storages already expose the checksum of objects when reading attributes (HEAD operation). Have you checked whether it's available for every object store we support? If so, you wouldn't need to store any checksum in meta.json.

That's true, but I don't think we should add another method for this to our object storage interface if we can avoid it, because that raises the bar for any other object storage that might not support such functionality.

@pracucci
Contributor

That's true, but I don't think we should add another method for this to our object storage interface if we can avoid it, because that raises the bar for any other object storage that might not support such functionality.

We have the attributes function, which was designed for this. Assuming it's feasible to add the checksum (I haven't checked whether it's implemented by all the storages we currently support), I agree that adding extra attributes there could be a blocker to supporting other storages in the future.
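
As a hypothetical sketch of what exposing a provider checksum through an attributes (HEAD) call could look like: this is not the actual Thanos objstore interface, and the Checksum field is exactly the extension under discussion, shown here only to illustrate the trade-off.

```go
// Hypothetical extension of an object-store attributes call with a
// provider-supplied checksum (e.g. from an S3 HEAD response).
package main

import (
	"context"
	"fmt"
	"time"
)

// ObjectAttributes mirrors the kind of metadata a HEAD request returns.
// The Checksum field is the hypothetical addition; it stays empty for
// providers that do not expose one.
type ObjectAttributes struct {
	Size         int64
	LastModified time.Time
	Checksum     string // e.g. an ETag/MD5 or SHA-256, provider-dependent
}

// AttributesGetter is a minimal stand-in for a bucket client.
type AttributesGetter interface {
	Attributes(ctx context.Context, name string) (ObjectAttributes, error)
}

// canSkipDownload returns true only when the provider exposes a checksum and
// it matches the hash of the local copy (localChecksum computed elsewhere).
func canSkipDownload(ctx context.Context, bkt AttributesGetter, name, localChecksum string) (bool, error) {
	attrs, err := bkt.Attributes(ctx, name)
	if err != nil {
		return false, err
	}
	if attrs.Checksum == "" {
		return false, nil // provider doesn't expose a checksum: must download
	}
	return attrs.Checksum == localChecksum, nil
}

// fakeBucket is a toy in-memory implementation for demonstration.
type fakeBucket map[string]ObjectAttributes

func (f fakeBucket) Attributes(_ context.Context, name string) (ObjectAttributes, error) {
	return f[name], nil
}

func main() {
	bkt := fakeBucket{"01ABC/chunks/000001": {Size: 512 << 20, Checksum: "abc123"}}
	ok, _ := canSkipDownload(context.Background(), bkt, "01ABC/chunks/000001", "abc123")
	fmt.Println("skip download:", ok)
}
```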

@stale
stale bot commented Jul 19, 2020

Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity for the next week, this issue will be closed (we can always reopen it if needed!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Jul 19, 2020
@GiedriusS
Member Author

WIP. I have basic hashes support in my own branch; now I need to add the required functionality in the Compactor.

@stale stale bot added the stale label Aug 19, 2020
@GiedriusS
Member Author

Attempting to implement this in #3031.

@stale stale bot removed the stale label Aug 19, 2020
@stale stale bot added the stale label Oct 18, 2020
@kakkoyun kakkoyun removed the stale label Oct 19, 2020
@stale stale bot added the stale label Dec 19, 2020
@GiedriusS
Member Author

Trying to push #3031 over the line.

@stale stale bot removed the stale label Dec 22, 2020
@stale stale bot added the stale label Feb 21, 2021
@GiedriusS GiedriusS removed the stale label Feb 22, 2021
@GiedriusS
Member Author

#3031 merged so closing 🎉

@locvo986

locvo986 commented Mar 1, 2021 via email
