Existing upload confuses shipper #934

Closed
SuperQ opened this issue Mar 18, 2019 · 8 comments

@SuperQ
Contributor

SuperQ commented Mar 18, 2019

Thanos, Prometheus and Golang version used

thanos, version 0.3.2 (branch: HEAD, revision: 4b7320c0e45e3f48a437bd19294f569785bafb02)
  build user:       root@e9a9c28f966a
  build date:       20190304-17:11:05
  go version:       go1.11.5

What happened

The shipper didn't write out thanos.shipper.json due to a failed upload of a compacted block.

What you expected to happen

Shipper should update thanos.shipper.json every time it uploads a block, not just when it completes a batch sync.
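
For reference, here is a minimal sketch (not the actual Thanos shipper code) of the behavior being asked for: persisting the shipped-block list after every successful upload instead of only once per batch sync. The `ShipperMeta` type and the `writeMetaFile`/`uploadBlock` names are hypothetical placeholders, and the real thanos.shipper.json format may differ.

```go
package shipper

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// ShipperMeta mirrors the idea behind thanos.shipper.json: the set of block
// ULIDs that have already been uploaded.
type ShipperMeta struct {
	Uploaded []string `json:"uploaded"`
}

// writeMetaFile persists the uploaded-block list to disk.
func writeMetaFile(dir string, meta *ShipperMeta) error {
	b, err := json.MarshalIndent(meta, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, "thanos.shipper.json"), b, 0o644)
}

// syncBlocks uploads each pending block and records it immediately, so a
// later failure (e.g. "context canceled" mid-upload) does not lose track of
// blocks that already made it to the bucket.
func syncBlocks(dir string, pending []string, uploadBlock func(id string) error) error {
	meta := &ShipperMeta{} // in reality this would be loaded from the existing file first
	for _, id := range pending {
		if err := uploadBlock(id); err != nil {
			return fmt.Errorf("upload %s: %w", id, err)
		}
		meta.Uploaded = append(meta.Uploaded, id)
		// Persist after every successful upload, not only after the whole batch.
		if err := writeMetaFile(dir, meta); err != nil {
			return err
		}
	}
	return nil
}
```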

How to reproduce it (as minimally and precisely as possible):

Bucket storage upload is canceled in the middle of a large compacted block upload.

Full logs of relevant components

2019-03-18_10:55:09.69871 level=info ts=2019-03-18T10:55:09.698596569Z caller=shipper.go:375 msg="upload new block" id=01CXQ2RF3N84BSXF8TGGN36ZST
2019-03-18_11:09:03.45688 level=info ts=2019-03-18T11:09:03.456809039Z caller=shipper.go:375 msg="upload new block" id=01CY8ETJP1F9MC3T9EQWFBR884
2019-03-18_11:20:27.20557 level=error ts=2019-03-18T11:20:27.205460156Z caller=shipper.go:342 msg="shipping failed" block=01CY8ETJP1F9MC3T9EQWFBR884 err="upload chunks: upload file /opt/prometheus/prometheus/data/thanos/upload/01CY8ETJP1F9MC3T9EQWFBR884/chunks/000041 as 01CY8ETJP1F9MC3T9EQWFBR884/chunks/000041: context canceled"
2019-03-18_11:20:27.33593 level=info ts=2019-03-18T11:20:27.335859618Z caller=shipper.go:226 msg="gathering all existing blocks from the remote bucket"
2019-03-18_11:23:48.78979 level=error ts=2019-03-18T11:23:48.789680688Z caller=shipper.go:326 msg="found overlap or error during sync, cannot upload compacted block" err="shipping compacted block 01CXQ2RF3N84BSXF8TGGN36ZST is blocked; overlap spotted: [mint: 1543147200000, maxt: 1543730400000, range: 162h0m0s, blocks: 2]: <ulid: 01CXQ2RF3N84BSXF8TGGN36ZST, mint: 1543147200000, maxt: 1543730400000, range: 162h0m0s>, <ulid: 01CXQ2RF3N84BSXF8TGGN36ZST, mint: 1543147200000, maxt: 1543730400000, range: 162h0m0s>"
@bwplotka
Member

Dived into this a bit more.

  1. We don't want to spend too much time & code on this. It is a one-time feature: most users will never exercise this code path, so support for it in the production path should be limited. Maybe we should add this functionality as a tool instead?
  2. Something is weird here. If your log is from a single run:
  • we iterate over 01CXQ2RF3N84BSXF8TGGN36ZST
  • then over 01CY8ETJP1F9MC3T9EQWFBR884, and that failed.
  • somehow we iterate over 01CXQ2RF3N84BSXF8TGGN36ZST again! (what?)
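
One plausible reading of the duplicate ULID in the overlap error above is that the block shipped in the earlier, interrupted run is now seen both locally and in the bucket. Below is a hedged sketch (not Thanos's actual shipper code; `BlockMeta`, `localMetas`, and `remoteULIDs` are illustrative names) of one way to avoid comparing a block against its own remote copy:

```go
package shipper

// BlockMeta is a stripped-down stand-in for TSDB block metadata.
type BlockMeta struct {
	ULID string
	MinT int64
	MaxT int64
}

// pendingBlocks drops local blocks whose ULID already exists in the remote
// bucket, so a block shipped by a previous (possibly interrupted) run cannot
// be flagged as overlapping with its own remote copy.
func pendingBlocks(localMetas []BlockMeta, remoteULIDs map[string]struct{}) []BlockMeta {
	var out []BlockMeta
	for _, m := range localMetas {
		if _, ok := remoteULIDs[m.ULID]; ok {
			continue // already in the bucket; nothing to upload or re-check
		}
		out = append(out, m)
	}
	return out
}
```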

@bwplotka
Member

There is another issue with this: if you restart the sidecar in the middle of the process you will hit the overlap issue as well, but that's another story.

@caarlos0

I still see this on v0.5.0 + prometheus 2.10.0

level=error ts=2019-06-25T17:09:38.818351347Z caller=shipper.go:310 msg="found overlap or error during sync, cannot upload compacted block" err="shipping compacted block 01DDJAMD7QJE1SAZAWQ2Q7ESR8 is blocked; overlap spotted: [mint: 1560751200000, maxt: 1560902400000, range: 42h0m0s, blocks: 2]: <ulid: 01DDQ3M3085SK8MSYG475ZG7RZ, mint: 1560751200000, maxt: 1560902400000, range: 42h0m0s>, <ulid: 01DDQ3KJMV8WX3MWPYA21Q52WE, mint: 1560751200000, maxt: 1560902400000, range: 42h0m0s>\n[mint: 1560902400000, maxt: 1561075200000, range: 48h0m0s, blocks: 2]: <ulid: 01DDW1HSGA2K8XAJ1A86JK7907, mint: 1560902400000, maxt: 1561075200000, range: 48h0m0s>, <ulid: 01DDW1J9SGY24J25MYPMHMHA8J, mint: 1560902400000, maxt: 1561075200000, range: 48h0m0s>\n[mint: 1561075200000, maxt: 1561248000000, range: 48h0m0s, blocks: 2]: <ulid: 01DE16BG0NVDWPSNQD1X4GVEYG, mint: 1561075200000, maxt: 1561248000000, range: 48h0m0s>, <ulid: 01DE16AZJ4B49AQK38RTR0WZH2, mint: 1561075200000, maxt: 1561248000000, range: 48h0m0s>\n[mint: 1561248000000, maxt: 1561420800000, range: 48h0m0s, blocks: 2]: <ulid: 01DE6B192N2135ZQFS9P5SJE20, mint: 1561248000000, maxt: 1561420800000, range: 48h0m0s>, <ulid: 01DE6B0PSYX6Q6ZP2QJFK3Z0ZG, mint: 1561248000000, maxt: 1561420800000, range: 48h0m0s>"

uploading 300d of old data 💭

It happened on another Prometheus instance with 40d of old data, too...

@bwplotka
Member

@caarlos0 what exactly do you want to accomplish? Are you running the sidecar with any special flags?

@caarlos0

I'm running with --shipper.upload-compacted; I wanted to upload the historical data of an existing instance...

@bwplotka
Member

bwplotka commented Jun 25, 2019

This is quite a manual, one-time feature. To make sure it is safe, it errors out instead of assuming anything. The best bet is to try to understand the error and mitigate it. I don't know your case, but at first glance it looks like you have blocks for exactly the same timestamps but with different ULIDs. Are you sure that:

  1. You set unique external labels?
  2. There is no global compactor running?
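
For context on what the check is complaining about: two blocks with distinct ULIDs covering the same time window (typically produced by a compactor, or by two Prometheus instances sharing external labels) trip the overlap detection. The simplified illustration below is not the real Thanos/TSDB overlap check (which also handles partially overlapping ranges); it just groups blocks by their exact [mint, maxt] window and flags windows with more than one block, similar in spirit to the "overlap spotted" errors in the logs above. The `blockInfo` and `findExactOverlaps` names are made up for this sketch.

```go
package shipper

import "fmt"

// blockInfo holds just the fields needed for this illustration.
type blockInfo struct {
	ULID string
	MinT int64
	MaxT int64
}

type window struct{ MinT, MaxT int64 }

// findExactOverlaps reports every time window covered by more than one block.
// The real check is more general, but the idea is the same: distinct ULIDs
// over the same time range are suspicious.
func findExactOverlaps(metas []blockInfo) []string {
	byWindow := map[window][]blockInfo{}
	for _, m := range metas {
		w := window{m.MinT, m.MaxT}
		byWindow[w] = append(byWindow[w], m)
	}
	var msgs []string
	for w, ms := range byWindow {
		if len(ms) > 1 {
			msgs = append(msgs, fmt.Sprintf("[mint: %d, maxt: %d, blocks: %d]", w.MinT, w.MaxT, len(ms)))
		}
	}
	return msgs
}
```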

@caarlos0

caarlos0 commented Jun 25, 2019

You set unique external labels?

yes

There is no global compactor running?

It was running at some point; I stopped it, but the errors continued...

@caarlos0

What happened, I think, is that I started it without the flag, stopped it, added the flag, and started it again.

Maybe it got lost in there?
