Retry on network failures (e.g uploads) #318

Closed
bwplotka opened this issue May 2, 2018 · 31 comments

Comments

@bwplotka
Member

bwplotka commented May 2, 2018

Not critical, since the compactor just restarted and continued fine, but it can be annoying.

level=error name=thanos-compactor ts=2018-04-28T10:32:12.73383864Z caller=main.go:147 msg="running command failed" err="first pass of downsampling failed: retrieve bucket block metas: get meta for block 01C6XZ1256S7VFNQP9D36XJ4F4: Get https://storage.googleapis.com/thanos-alpha/01C6XZ1256S7VFNQP9D36XJ4F4/meta.json: dial tcp [xxx]:443: connect: network is unreachable"
@bwplotka
Member Author

bwplotka commented Jun 1, 2018

Especially funny is that a single error during the meta sync causes the compactor to retry the WHOLE sync.

level=debug ts=2018-05-31T00:14:19.617242203Z caller=compact.go:165 msg="download meta" block=01C8146Z5B7AX5HYTV6S15G044
level=error ts=2018-05-31T00:14:34.617896977Z caller=compact.go:205 msg="retriable error" err="sync: retrieve bucket block metas: downloading meta.json for 01C8146Z5B7AX5HYTV6S15G044: decode meta.json for block 01C8146Z5B7AX5HYTV6S15G044: Get https://prod-int-spaces01.nyc3.internal.digitalocean.com/pandora-lts-ams2-hvs/01C8146Z5B7AX5HYTV6S15G044/meta.json: net/http: timeout awaiting response headers"
level=info ts=2018-05-31T00:15:57.63331754Z caller=compact.go:126 msg="start sync of metas"

I can see from @TimSimmons's logs that it happens quite often. We should retry only the request that failed.

@asbjxrn

asbjxrn commented Jul 2, 2018

This happens both for downloads and uploads (of compacted blocks) for me. I also see the same timeouts when uploading from the sidecars, so I think this issue applies to all components that communicate with the block store.

With a large enough number of Thanos sidecars this issue can be quite bad: once you fall behind, the number of files to upload or download gets large, which means a higher chance of hitting the issue, which may put you even further behind, and so on.
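To put a rough number on that feedback loop, here is a small sketch. It assumes each request fails independently with the same probability `p`, which is a simplification and not something measured in this thread; `pAnyFailure` is a hypothetical name.

```go
package main

import "math"

// pAnyFailure returns the probability that at least one of n requests fails,
// assuming (for illustration only) that every request fails independently
// with probability p: 1 - (1-p)^n.
func pAnyFailure(p float64, n int) float64 {
	return 1 - math.Pow(1-p, float64(n))
}
```

Under that assumption, a per-request failure rate of 0.1% gives roughly a 63% chance of at least one failure across 1000 requests, so a run that aborts on the first error becomes less likely to finish as the backlog grows.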

@bwplotka
Member Author

bwplotka commented Jul 2, 2018

yup, exactly.

@bwplotka bwplotka added this to To do in v0.2.0 Sep 21, 2018
xjewer added commits to xjewer/thanos that referenced this issue between Sep 26, 2018 and Oct 17, 2018
Add backoff retry for a single object storage request, except Range and Iter.
The error handler separates net/http errors from all others and retries the request to the object storage for the former.

Fixes thanos-io#318
@bwplotka
Member Author

OK, this is interesting, as the S3 client really does have retries: https://sourcegraph.com/github.com/minio/minio-go@master/-/blob/api.go#L524:17

Maybe it's worth reaching out to them?

@bwplotka bwplotka added this to To do in v0.3.0 via automation Dec 3, 2018
@bwplotka bwplotka removed this from To do in v0.2.0 Dec 3, 2018
@bwplotka bwplotka removed this from To do in v0.3.0 Feb 6, 2019
@bwplotka
Member Author

bwplotka commented Feb 6, 2019

We double-checked, and retries are already implemented in the minio and GCS clients. Every other client needs to be checked individually, and retries added there if they are missing.

@bwplotka bwplotka closed this as completed Feb 6, 2019
@swollo

swollo commented Mar 13, 2019

@bwplotka this still seems to happen in v0.3.1. The behavior I see is that the timeout occurs (I'm not sure whether the retry is triggered within minio or not), then the compactor exits and restarts. I'd assume that on restart it cleans the compaction directory and effectively starts from zero again.

@realdimas
Contributor

realdimas commented Mar 13, 2019

@bwplotka we are observing net/http: timeout awaiting response headers about every 15 mins.

The compactor is running without --wait, so this error forces the whole run to fail and retry.

Setup:

  • S3 bucket with ~1-2 TB of data
  • 200 sidecars uploading chunks
  • --retention.resolution-raw=10d
  • --retention.resolution-5m=15d
  • --retention.resolution-1h=30d
  • release v0.3.2
Logs
thanos-compactor-1552487171-vwdk5   0/1     Error              0          123m
thanos-compactor-1552487171-8t5ff   0/1     Error              0          88m
thanos-compactor-1552487171-qg22k   0/1     Error              0          76m
thanos-compactor-1552487171-p8xvm   0/1     Error              0          67m
thanos-compactor-1552487171-gxbx5   0/1     Error              0          36m
thanos-compactor-1552487171-w4xdp   0/1     Error              0          23m

thanos-compactor-1552487171-vwdk5

level=debug ts=2019-03-13T15:00:19.251252149Z caller=compact.go:721 compactionGroup="0@{REDACTED}" msg="downloaded and verified blocks" blocks="[/tmp/thanos-compact/0@{REDACTED}/01D55WPAZ0X59B623QNRBNSZW5 /tmp/thanos-compact/0@{REDACTED}/01D563J26MVFDGYVW9J01ZRV0F /tmp/thanos-compact/0@{REDACTED}/01D56ADSEMK3D2EQNR3MH4SAS5 /tmp/thanos-compact/0@{REDACTED}/01D56H9GPMTGPV539Y1Q5PK7SK]" duration=26.415816673s
level=info ts=2019-03-13T15:00:56.248042705Z caller=compact.go:391 msg="compact blocks" count=4 mint=1551744000000 maxt=1551772800000 ulid=01D5VS31C57FPR99W06ZNA2ZFC sources="[01D55WPAZ0X59B623QNRBNSZW5 01D563J26MVFDGYVW9J01ZRV0F 01D56ADSEMK3D2EQNR3MH4SAS5 01D56H9GPMTGPV539Y1Q5PK7SK]" duration=36.996722382s
level=debug ts=2019-03-13T15:00:56.316490692Z caller=compact.go:730 compactionGroup="0@{REDACTED}" msg="compacted blocks" blocks="[/tmp/thanos-compact/0@{REDACTED}/01D55WPAZ0X59B623QNRBNSZW5 /tmp/thanos-compact/0@{REDACTED}/01D563J26MVFDGYVW9J01ZRV0F /tmp/thanos-compact/0@{REDACTED}/01D56ADSEMK3D2EQNR3MH4SAS5 /tmp/thanos-compact/0@{REDACTED}/01D56H9GPMTGPV539Y1Q5PK7SK]" duration=37.065172117s
level=error ts=2019-03-13T15:01:14.050367663Z caller=main.go:181 msg="running command failed" err="compaction failed: compaction: upload of 01D5VS31C57FPR99W06ZNA2ZFC failed: upload chunks: upload file /tmp/thanos-compact/0@{REDACTED}/01D5VS31C57FPR99W06ZNA2ZFC/chunks/000001 as 01D5VS31C57FPR99W06ZNA2ZFC/chunks/000001: upload s3 object: Put https://REDACTED.s3.dualstack.eu-west-1.amazonaws.com/01D5VS31C57FPR99W06ZNA2ZFC/chunks/000001?partNumber=3&uploadId=REDACTED: net/http: timeout awaiting response headers"

thanos-compactor-1552487171-8t5ff

level=debug ts=2019-03-13T15:12:59.720173348Z caller=compact.go:721 compactionGroup="0@{REDACTED}" msg="downloaded and verified blocks" blocks="[/tmp/thanos-compact/0@{REDACTED}/01D5DKWGVBC0ESEBBV9B0ZSRSB /tmp/thanos-compact/0@{REDACTED}/01D5DTR8KCXK3CAZDQWBPSRBBD /tmp/thanos-compact/0@{REDACTED}/01D5E1KZRRXAME2EMDCC2PBS52 /tmp/thanos-compact/0@{REDACTED}/01D5E8FPYGTD4C6WMKWVPHKTYZ]" duration=3.043312731s
level=info ts=2019-03-13T15:13:02.457205517Z caller=compact.go:391 msg="compact blocks" count=4 mint=1552003200000 maxt=1552032000000 ulid=01D5VST7TMJ1NTP7NE4MPJ7FP4 sources="[01D5DKWGVBC0ESEBBV9B0ZSRSB 01D5DTR8KCXK3CAZDQWBPSRBBD 01D5E1KZRRXAME2EMDCC2PBS52 01D5E8FPYGTD4C6WMKWVPHKTYZ]" duration=2.736976322s
level=debug ts=2019-03-13T15:13:02.470046947Z caller=compact.go:730 compactionGroup="0@{REDACTED}" msg="compacted blocks" blocks="[/tmp/thanos-compact/0@{REDACTED}/01D5DKWGVBC0ESEBBV9B0ZSRSB /tmp/thanos-compact/0@{REDACTED}/01D5DTR8KCXK3CAZDQWBPSRBBD /tmp/thanos-compact/0@{REDACTED}/01D5E1KZRRXAME2EMDCC2PBS52 /tmp/thanos-compact/0@{REDACTED}/01D5E8FPYGTD4C6WMKWVPHKTYZ]" duration=2.749819539s
level=error ts=2019-03-13T15:13:18.284475363Z caller=main.go:181 msg="running command failed" err="compaction failed: compaction: upload of 01D5VST7TMJ1NTP7NE4MPJ7FP4 failed: upload chunks: upload file /tmp/thanos-compact/0@{REDACTED}/01D5VST7TMJ1NTP7NE4MPJ7FP4/chunks/000001 as 01D5VST7TMJ1NTP7NE4MPJ7FP4/chunks/000001: upload s3 object: Put https://REDACTED.s3.dualstack.eu-west-1.amazonaws.com/01D5VST7TMJ1NTP7NE4MPJ7FP4/chunks/000001?partNumber=2&uploadId=REDACTED: net/http: timeout awaiting response headers"

thanos-compactor-1552487171-qg22k

level=debug ts=2019-03-13T15:21:43.531599137Z caller=compact.go:721 compactionGroup="0@{REDACTED}" msg="downloaded and verified blocks" blocks="[/tmp/thanos-compact/0@{REDACTED}/01D58F32FYX0WBN5T6CTRZH0SG /tmp/thanos-compact/0@{REDACTED}/01D58NYSRKDHK69QPFEKXPR5H1 /tmp/thanos-compact/0@{REDACTED}/01D58WTGVPV9P54PEYZN61Y2MC /tmp/thanos-compact/0@{REDACTED}/01D593P88DJC3V3RJJF9NDDKNB]" duration=2.769941168s
level=info ts=2019-03-13T15:21:46.956397106Z caller=compact.go:391 msg="compact blocks" count=4 mint=1551830400000 maxt=1551859200000 ulid=01D5VTA7BRA7JGNGA7YGWTPBP1 sources="[01D58F32FYX0WBN5T6CTRZH0SG 01D58NYSRKDHK69QPFEKXPR5H1 01D58WTGVPV9P54PEYZN61Y2MC 01D593P88DJC3V3RJJF9NDDKNB]" duration=3.424717733s
level=debug ts=2019-03-13T15:21:46.966418662Z caller=compact.go:730 compactionGroup="0@{REDACTED}" msg="compacted blocks" blocks="[/tmp/thanos-compact/0@{REDACTED}/01D58F32FYX0WBN5T6CTRZH0SG /tmp/thanos-compact/0@{REDACTED}/01D58NYSRKDHK69QPFEKXPR5H1 /tmp/thanos-compact/0@{REDACTED}/01D58WTGVPV9P54PEYZN61Y2MC /tmp/thanos-compact/0@{REDACTED}/01D593P88DJC3V3RJJF9NDDKNB]" duration=3.434741475s
level=error ts=2019-03-13T15:22:02.754889288Z caller=main.go:181 msg="running command failed" err="compaction failed: compaction: upload of 01D5VTA7BRA7JGNGA7YGWTPBP1 failed: upload chunks: upload file /tmp/thanos-compact/0@{REDACTED}/01D5VTA7BRA7JGNGA7YGWTPBP1/chunks/000001 as 01D5VTA7BRA7JGNGA7YGWTPBP1/chunks/000001: upload s3 object: Put https://REDACTED.s3.dualstack.eu-west-1.amazonaws.com/01D5VTA7BRA7JGNGA7YGWTPBP1/chunks/000001?partNumber=1&uploadId=REDACTED: net/http: timeout awaiting response headers"

thanos-compactor-1552487171-p8xvm

level=debug ts=2019-03-13T15:49:56.838470615Z caller=compact.go:721 compactionGroup="0@{REDACTED}" msg="downloaded and verified blocks" blocks="[/tmp/thanos-compact/0@{REDACTED}/01D57KM4WNZV4N098VV5Z9FW85 /tmp/thanos-compact/0@{REDACTED}/01D57TFW4GZ8SREPC5HASRT7J2 /tmp/thanos-compact/0@{REDACTED}/01D581BKD3WY77J8X45W01N0WV /tmp/thanos-compact/0@{REDACTED}/01D5887AMGDTN1FJQ0GQZDK20A]" duration=1m38.510479563s
level=info ts=2019-03-13T15:52:23.551514954Z caller=compact.go:391 msg="compact blocks" count=4 mint=1551801600000 maxt=1551830400000 ulid=01D5VVXZT9QSR5SXWDE900CKVD sources="[01D57KM4WNZV4N098VV5Z9FW85 01D57TFW4GZ8SREPC5HASRT7J2 01D581BKD3WY77J8X45W01N0WV 01D5887AMGDTN1FJQ0GQZDK20A]" duration=2m26.71297677s
level=debug ts=2019-03-13T15:52:23.828198304Z caller=compact.go:730 compactionGroup="0@{REDACTED}" msg="compacted blocks" blocks="[/tmp/thanos-compact/0@{REDACTED}/01D57KM4WNZV4N098VV5Z9FW85 /tmp/thanos-compact/0@{REDACTED}/01D57TFW4GZ8SREPC5HASRT7J2 /tmp/thanos-compact/0@{REDACTED}/01D581BKD3WY77J8X45W01N0WV /tmp/thanos-compact/0@{REDACTED}/01D5887AMGDTN1FJQ0GQZDK20A]" duration=2m26.989662374s
level=error ts=2019-03-13T15:52:50.788765966Z caller=main.go:181 msg="running command failed" err="compaction failed: compaction: upload of 01D5VVXZT9QSR5SXWDE900CKVD failed: upload chunks: upload file /tmp/thanos-compact/0@{REDACTED}/01D5VVXZT9QSR5SXWDE900CKVD/chunks/000003 as 01D5VVXZT9QSR5SXWDE900CKVD/chunks/000003: upload s3 object: Put https://REDACTED.s3.dualstack.eu-west-1.amazonaws.com/01D5VVXZT9QSR5SXWDE900CKVD/chunks/000003?partNumber=3&uploadId=REDACTED: net/http: timeout awaiting response headers"

thanos-compactor-1552487171-gxbx5

level=debug ts=2019-03-13T16:03:26.488516059Z caller=compact.go:721 compactionGroup="0@{REDACTED}" msg="downloaded and verified blocks" blocks="[/tmp/thanos-compact/0@{REDACTED}/01D5CRDJWX3KXRYR9KH8XHCY7S /tmp/thanos-compact/0@{REDACTED}/01D5CZ9A4X2BN6HWYKD873V53R /tmp/thanos-compact/0@{REDACTED}/01D5D651D71TN0YVFWKC028XPH /tmp/thanos-compact/0@{REDACTED}/01D5DD0RPM576T0B4YMB704BNR]" duration=2.300868771s
level=info ts=2019-03-13T16:03:27.988367016Z caller=compact.go:391 msg="compact blocks" count=4 mint=1551974400000 maxt=1552003200000 ulid=01D5VWPKN6M9702HJH263GKC7C sources="[01D5CRDJWX3KXRYR9KH8XHCY7S 01D5CZ9A4X2BN6HWYKD873V53R 01D5D651D71TN0YVFWKC028XPH 01D5DD0RPM576T0B4YMB704BNR]" duration=1.499793223s
level=debug ts=2019-03-13T16:03:27.995430305Z caller=compact.go:730 compactionGroup="0@{REDACTED}" msg="compacted blocks" blocks="[/tmp/thanos-compact/0@{REDACTED}/01D5CRDJWX3KXRYR9KH8XHCY7S /tmp/thanos-compact/0@{REDACTED}/01D5CZ9A4X2BN6HWYKD873V53R /tmp/thanos-compact/0@{REDACTED}/01D5D651D71TN0YVFWKC028XPH /tmp/thanos-compact/0@{REDACTED}/01D5DD0RPM576T0B4YMB704BNR]" duration=1.506860238s
level=error ts=2019-03-13T16:03:43.388278661Z caller=main.go:181 msg="running command failed" err="compaction failed: compaction: upload of 01D5VWPKN6M9702HJH263GKC7C failed: upload chunks: upload file /tmp/thanos-compact/0@{REDACTED}/01D5VWPKN6M9702HJH263GKC7C/chunks/000001 as 01D5VWPKN6M9702HJH263GKC7C/chunks/000001: upload s3 object: Put https://REDACTED.s3.dualstack.eu-west-1.amazonaws.com/01D5VWPKN6M9702HJH263GKC7C/chunks/000001?partNumber=2&uploadId=REDACTED: net/http: timeout awaiting response headers"

thanos-compactor-1552487171-w4xdp

level=debug ts=2019-03-13T16:19:22.895728181Z caller=compact.go:721 compactionGroup="0@{REDACTED}" msg="downloaded and verified blocks" blocks="[/tmp/thanos-compact/0@{REDACTED}/01D5EFBD320884B27184AGYDQ6 /tmp/thanos-compact/0@{REDACTED}/01D5EP74B77CBDEV6M3B0JG4Z3 /tmp/thanos-compact/0@{REDACTED}/01D5EX2VK8H1DSCAR55MA6DT0S /tmp/thanos-compact/0@{REDACTED}/01D5F3YJMAF3884QWFEX076VXY]" duration=26.174313049s
level=info ts=2019-03-13T16:19:59.160326436Z caller=compact.go:391 msg="compact blocks" count=4 mint=1552032000000 maxt=1552060800000 ulid=01D5VXKVKD992DWAN2AGY2BEZG sources="[01D5EFBD320884B27184AGYDQ6 01D5EP74B77CBDEV6M3B0JG4Z3 01D5EX2VK8H1DSCAR55MA6DT0S 01D5F3YJMAF3884QWFEX076VXY]" duration=36.264544218s
level=debug ts=2019-03-13T16:19:59.242527786Z caller=compact.go:730 compactionGroup="0@{REDACTED}" msg="compacted blocks" blocks="[/tmp/thanos-compact/0@{REDACTED}/01D5EFBD320884B27184AGYDQ6 /tmp/thanos-compact/0@{REDACTED}/01D5EP74B77CBDEV6M3B0JG4Z3 /tmp/thanos-compact/0@{REDACTED}/01D5EX2VK8H1DSCAR55MA6DT0S /tmp/thanos-compact/0@{REDACTED}/01D5F3YJMAF3884QWFEX076VXY]" duration=36.346748295s
level=error ts=2019-03-13T16:20:24.118422931Z caller=main.go:181 msg="running command failed" err="compaction failed: compaction: upload of 01D5VXKVKD992DWAN2AGY2BEZG failed: upload chunks: upload file /tmp/thanos-compact/0@{REDACTED}/01D5VXKVKD992DWAN2AGY2BEZG/chunks/000003 as 01D5VXKVKD992DWAN2AGY2BEZG/chunks/000003: upload s3 object: Put https://REDACTED.s3.dualstack.eu-west-1.amazonaws.com/01D5VXKVKD992DWAN2AGY2BEZG/chunks/000003?partNumber=2&uploadId=REDACTED: net/http: timeout awaiting response headers"

@bwplotka
Member Author

So this is essentially connected to the minio library. If you are getting timeouts, it seems we should look into the reasons why. Are the blocks too big? Is there any way we can adjust the minio library (https://github.com/minio/minio-go) to improve that?

Retry is already in place; minio should handle retries. But if you are getting timeouts even with retries... I'm not sure that masking your issue with another retry is a good solution here (:

@bwplotka
Member Author

One way is to grab a single bigger block that constantly fails and try to upload it yourself using mc (the minio client), then adjust things there so you know how to adjust them in prod.

@swollo

swollo commented Mar 13, 2019

I'll give that a try. I believe some directories from the compaction are large, > 100 GB. I'll have to do some digging.
However, in the logs that @forkbomber posted, we see that the upload fails with an error log. I'm new to Go, but from my understanding, this should only log a warning and then retry, unless I'm misunderstanding something:

https://github.com/improbable-eng/thanos/blob/2b8669265ebcb3b60bbf4dadb887a504a4bfa56e/pkg/compact/compact.go#L575

@realdimas
Contributor

@bwplotka in our case these are different blocks each time. It does succeed, but sometimes only after a handful of cronjob restarts caused by net/http: timeout awaiting response headers.

@bwplotka
Member Author

bwplotka commented Mar 14, 2019

@GiedriusS re: #923 (comment)

See this issue here. But what's the point of retrying if the underlying client library already retries for us?
Essentially:

  • they have more control: they can split a request into an actual multipart upload and retry only the failed parts, etc.

The only problem is when the library we use has this logic broken; in that case I think we should report the issue to them. Double retrying is not a solution.

@bwplotka bwplotka reopened this Mar 14, 2019
@GiedriusS
Member

Oh, sorry, I hadn't seen this since it was closed. Could we rename the title to be a bit more generic, because this affects not only the compactor but the sidecar as well? :P Yes, I agree that this should be delegated to the underlying libraries that we use, but perhaps we could think of an even smarter solution, like double-checking which files (if any) were uploaded to remote storage and retrying the upload of only the missing ones, if they are still present on disk.
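The "check what was uploaded and retry only the rest" idea could be sketched like this. `pendingUploads` is a hypothetical helper; it takes the local file names of a block and the set of object names already present remotely, and assumes that listing has been fetched separately. Presence of an object says nothing about whether it is complete, so this is only a starting point.

```go
package main

import "sort"

// pendingUploads returns the local file names that are not yet present in the
// remote listing, sorted for deterministic output. remote maps object names
// (relative to the block directory) to true when they exist in the bucket.
func pendingUploads(local []string, remote map[string]bool) []string {
	var pending []string
	for _, name := range local {
		if !remote[name] {
			pending = append(pending, name)
		}
	}
	sort.Strings(pending)
	return pending
}
```

For a block whose first chunk file already made it to the bucket, only the remaining files would be returned and re-sent.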

@bwplotka bwplotka changed the title from "compactor: Retry on network failures" to "Retry on network failures (e.g uploads)" Mar 14, 2019
@bwplotka
Member Author

bwplotka commented Mar 14, 2019

I would follow up on every issue with the underlying provider and make them better. If we are really hit by this we can still evaluate that option, but in a perfect world (the open source world) we should not do it unless the provider says so.

E.g. how can we tell whether the error is even retriable? It does not make sense to always retry (500, 403, 404, etc.).
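One way to frame the retriability question is an explicit allow-list of status codes. The list below is an assumption chosen for illustration, not a rule taken from any provider's documentation; it follows the comment above in treating 500, 403 and 404 as not worth retrying. `isRetriableStatus` is a hypothetical name.

```go
package main

import "net/http"

// isRetriableStatus reports whether an HTTP status code is on an assumed
// allow-list of transient failures. Anything not listed, including 403, 404
// and (following the discussion above) 500, is treated as permanent.
func isRetriableStatus(code int) bool {
	switch code {
	case http.StatusTooManyRequests, // 429
		http.StatusBadGateway,         // 502
		http.StatusServiceUnavailable, // 503
		http.StatusGatewayTimeout:     // 504
		return true
	default:
		return false
	}
}
```

An allow-list keeps unknown codes on the safe side: they fail fast instead of being retried blindly. Each storage provider may document its own set, which is part of why the decision arguably belongs in the client library.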

@SuperQ
Contributor

SuperQ commented Mar 18, 2019

Related: #934

@xjewer
Contributor

xjewer commented Apr 12, 2019

Just rolled back to the build with v0.20; will see how it works.

For some reason, I see uploaded blocks in a corrupted state (missing files, which could be the index file or chunk files).

Example of one block with index file being absent:

mc ls -r s3/thanos/01D84YH6M4MPG0JZ6M9C411B72/
[2019-04-11 01:03:48 BST] 512MiB chunks/000001
[2019-04-11 01:03:51 BST] 512MiB chunks/000002
[2019-04-11 01:04:10 BST] 512MiB chunks/000003
[2019-04-11 01:04:19 BST] 512MiB chunks/000004
[2019-04-11 01:04:26 BST] 512MiB chunks/000005
[2019-04-11 01:04:35 BST] 113MiB chunks/000006
[2019-04-11 01:05:05 BST]   453B meta.json
Example of logs from radosgw:

[11/Apr/2019:01:03:40 +0000] "HEAD /thanos/01D84YH6M4MPG0JZ6M9C411B72/meta.json HTTP/1.1" 404 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:40 +0000] "POST /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000001 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:40 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000001 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:40 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000001 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:40 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000001 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:40 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000001 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:40 +0000] "PUT /thanos/debug/metas/01D84YH6M4MPG0JZ6M9C411B72.json HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:42 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000001 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:43 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000001 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:43 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000001 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:44 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000001 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:46 +0000] "POST /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000001 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:47 +0000] "POST /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000002 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:47 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000002 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:47 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000002 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:47 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000002 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:47 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000002 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:48 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000002 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:49 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000002 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:50 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000002 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:51 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000002 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:54 +0000] "POST /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000002 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:55 +0000] "POST /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000003 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:56 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000003 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:56 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000003 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:56 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000003 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:56 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000003 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:58 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000003 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:03:59 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000003 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:04 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000003 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:04 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000003 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:08 +0000] "POST /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000003 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:09 +0000] "POST /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000004 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:09 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000004 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:09 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000004 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:09 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000004 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:09 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000004 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:12 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000004 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:12 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000004 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:12 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000004 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:13 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000004 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:19 +0000] "POST /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000004 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:19 +0000] "POST /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000005 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:20 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000005 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:20 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000005 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:20 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000005 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:20 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000005 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:21 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000005 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:21 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000005 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:21 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000005 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:22 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000005 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:26 +0000] "POST /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000005 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:26 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/chunks/000006 HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:35 +0000] "POST /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:35 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:35 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:35 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:35 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:38 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:38 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:39 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:40 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:41 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:42 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:42 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:44 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:45 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:49 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:50 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:53 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:54 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 400 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:04:55 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:05:01 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 400 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:05:02 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:05:04 +0000] "DELETE /thanos/01D84YH6M4MPG0JZ6M9C411B72/index HTTP/1.1" 204 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
[11/Apr/2019:01:05:04 +0000] "PUT /thanos/01D84YH6M4MPG0JZ6M9C411B72/meta.json HTTP/1.1" 200 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)

@Alexvianet

Alexvianet commented Apr 18, 2019

I also have a similar issue with the sidecar and compactor on release v0.3.2, S3 provider.
prometheus2/c56c3589-a2f1-4d22-aa7a-dfaf9b05cecf: stdout | level=error ts=2019-04-18T15:03:01.700845076Z caller=shipper.go:342 msg="shipping failed" block=01D8RFCAP40FQ0HDCM3YVP3737 err="failed to clean block after upload issue. Partial block in system. Err: upload meta file: upload file /var/vcap/store/prometheus2/thanos/upload/01D8RFCAP40FQ0HDCM3YVP3737/meta.json as 01D8RFCAP40FQ0HDCM3YVP3737/meta.json: upload s3 object: Put https://s3/thanos-dc20-prod/01D8RFCAP40FQ0HDCM3YVP3737/meta.json: net/http: timeout awaiting response headers: upload meta file: upload file /var/vcap/store/prometheus2/thanos/upload/01D8RFCAP40FQ0HDCM3YVP3737/meta.json as 01D8RFCAP40FQ0HDCM3YVP3737/meta.json: upload s3 object: Put https://s3/thanos-dc20-prod/01D8RFCAP40FQ0HDCM3YVP3737/meta.json: net/http: timeout awaiting response headers"
thanos_compactor/7646e8e8-62d6-418e-8ab9-319ec593cb56: stdout | level=error ts=2019-04-18T15:05:39.924223397Z caller=compact.go:265 msg="retriable error" err="compaction failed: sync: retrieve bucket block metas: Get https://s3/thanos-dc20-prod/?delimiter=%2F&max-keys=1000&prefix=: net/http: timeout awaiting response headers"

@drax68

drax68 commented Apr 23, 2019

Same issue with the rc release and a 400+ GB block upload. The compactor fails with "net/http: timeout awaiting response headers" and retries the whole compaction for that group. That is quite inefficient and generates a large amount of traffic.
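
As an illustration of retrying only the failed operation instead of the whole compaction, here is a minimal sketch of a bounded retry with exponential backoff around a single upload. `retryWithBackoff`, `flakyUpload` and `demo` are hypothetical names made up for this example and are not Thanos code; the simulated upload simply fails a fixed number of times with the error seen in the logs.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff calls op until it succeeds or maxAttempts is reached,
// doubling the wait between attempts. It returns the number of attempts made.
// Hypothetical helper for illustration; not part of Thanos.
func retryWithBackoff(ctx context.Context, maxAttempts int, initial time.Duration, op func() error) (int, error) {
	var err error
	wait := initial
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return attempt, nil
		}
		if attempt == maxAttempts {
			break
		}
		select {
		case <-ctx.Done():
			return attempt, ctx.Err()
		case <-time.After(wait):
		}
		wait *= 2
	}
	return maxAttempts, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

// flakyUpload simulates uploading one file: it fails `failures` times and
// then succeeds. A real upload would have to reopen or rewind its reader
// before each new attempt.
func flakyUpload(failures int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errors.New("net/http: timeout awaiting response headers")
		}
		return nil
	}
}

// demo retries a simulated upload at most 5 times, starting with a 1ms wait.
func demo(failures int) (int, error) {
	return retryWithBackoff(context.Background(), 5, time.Millisecond, flakyUpload(failures))
}

func main() {
	attempts, err := demo(2)
	fmt.Println(attempts, err)
}
```

With this shape, a transient timeout on one chunk costs a few extra requests for that chunk only, rather than a re-download and re-compaction of the whole group.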

@bwplotka
Member Author

bwplotka commented Apr 23, 2019

Guys, can you make sure to mention:

  • What provider are you using
  • Thanos version
  • How you are sure that no retry actually happened. It might be that the provider code attempted a couple of times and gave up (which might indicate massive network connectivity issues)

Otherwise it is not very helpful ):

Ideally we would like to focus on each provider separately.
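
One way to check whether the provider client retried at all is to count the requests that actually leave the process. The sketch below wraps an `http.RoundTripper` with a counter. `countingTransport`, `failingTransport` and `countAttempts` are hypothetical names for this example, the URL is a placeholder, and the failing transport stands in for an unreachable object store so that nothing touches the network. How a given provider SDK accepts a custom transport is not shown and would need to be checked per provider.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"sync/atomic"
)

// countingTransport wraps another RoundTripper and counts every request that
// passes through it, including retries issued by a client library.
type countingTransport struct {
	next  http.RoundTripper
	count int64
}

func (t *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	n := atomic.AddInt64(&t.count, 1)
	resp, err := t.next.RoundTrip(req)
	if err != nil {
		fmt.Printf("attempt %d: %s %s failed: %v\n", n, req.Method, req.URL.Path, err)
	}
	return resp, err
}

// failingTransport simulates an object store that never answers in time.
type failingTransport struct{}

func (failingTransport) RoundTrip(*http.Request) (*http.Response, error) {
	return nil, errors.New("net/http: timeout awaiting response headers")
}

// countAttempts issues `tries` requests through the counting transport and
// reports how many requests the transport saw.
func countAttempts(tries int) int64 {
	ct := &countingTransport{next: failingTransport{}}
	client := &http.Client{Transport: ct}
	for i := 0; i < tries; i++ {
		// Placeholder URL; failingTransport never opens a connection.
		resp, err := client.Get("http://objstore.invalid/bucket/meta.json")
		if err == nil {
			resp.Body.Close()
		}
	}
	return atomic.LoadInt64(&ct.count)
}

func main() {
	fmt.Println("requests sent:", countAttempts(3))
}
```

If the counter shows a single request per failed operation, no retry happened at the HTTP level; several requests followed by a failure would point to a longer network outage instead.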

@drax68

drax68 commented Apr 23, 2019

  • s3
  • 0.4.0-rc.0
  • Logs from the time when it's happened, nothing more:
thanos_compactor[16064]: level=info ts=2019-04-23T12:03:21.942003329Z caller=compact.go:441 msg="compact blocks" count=7 mint=1554336000000 maxt=1555545600000 ulid=01D94MRNE52W85WSVD9RG6PS07 sources="[01D7S0BWZR5MNZFWVSX3RY52CP 01D7XVKN5AMQ1J40GRMK99NN2Q 01D82QNH8NHZ9T6Y8B7MXSE4PS 01D87XMJ3QK1SSQJ9GCR8CQFJB 01D8D77WAM0W1T7XPD24Y2NPXM 01D8JDC78D23WR7A67HAQCDYPR 01D930ZBMBD33HBZRGYAHH7XHT]" duration=3h38m47.722834567s
thanos_compactor[16064]: level=error ts=2019-04-23T12:24:23.825585304Z caller=main.go:182 msg="running command failed" err="error executing compaction: compaction failed: compaction failed for group 0@{prometheus_node=\"1234\",prometheus_stack_name=\"stack1\"}: upload of 01D94MRNE52W85WSVD9RG6PS07 failed: upload chunks: upload file /opt/thanos/compact/compact/0@{prometheus_node=\"1234\",prometheus_stack_name=\"stack1\"}/01D94MRNE52W85WSVD9RG6PS07/chunks/000075 as 01D94MRNE52W85WSVD9RG6PS07/chunks/000075: upload s3 object: Put https://s3_bucket/01D94MRNE52W85WSVD9RG6PS07/chunks/000075?partNumber=2&uploadId=asdzxc: net/http: timeout awaiting response headers"

@antonio
Contributor

antonio commented Apr 25, 2019

This is also happening constantly to me, on S3, with version 0.3.1.

I've spent some time today debugging the issue and I believe it might have been caused by #323: the 15-second timeout that was set in that PR is likely not enough for large blocks.

I'm testing a custom version in which I've increased the timeout to 2 minutes (🤷‍♂️ 😄) and so far I haven't seen any issues in a couple hours, where it used to fail every 5-10 minutes. I'll leave a few compacting processes running over the night and will report back tomorrow with the results.
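
For context, the error text "timeout awaiting response headers" comes from the `ResponseHeaderTimeout` field of Go's `http.Transport`, which limits how long the client waits for response headers after it has fully written the request, including the body. A minimal sketch of raising it to 2 minutes follows. `newTransport` and `demoHeaderTimeoutSeconds` are hypothetical helpers, the dial and TLS values are arbitrary illustration values, and this is not the actual change from the custom build.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// newTransport builds an http.Transport whose ResponseHeaderTimeout is set to
// headerTimeout. The other timeouts are arbitrary illustration values.
func newTransport(headerTimeout time.Duration) *http.Transport {
	return &http.Transport{
		DialContext: (&net.Dialer{
			Timeout:   30 * time.Second,
			KeepAlive: 30 * time.Second,
		}).DialContext,
		TLSHandshakeTimeout: 10 * time.Second,
		// Time allowed between finishing the request write and receiving
		// the response headers. Too small a value can fail large uploads
		// if the server is slow to acknowledge them.
		ResponseHeaderTimeout: headerTimeout,
	}
}

// demoHeaderTimeoutSeconds returns the configured header timeout in seconds
// for a transport built with a 2 minute limit.
func demoHeaderTimeoutSeconds() int {
	return int(newTransport(2 * time.Minute).ResponseHeaderTimeout.Seconds())
}

func main() {
	client := &http.Client{Transport: newTransport(2 * time.Minute)}
	tr := client.Transport.(*http.Transport)
	fmt.Println(tr.ResponseHeaderTimeout)
}
```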

@antonio
Contributor

antonio commented Apr 26, 2019

> I'll leave a few compacting processes running over the night and will report back tomorrow with the results.

All the processes are still working correctly after 12 hours.

@antonio
Contributor

antonio commented Apr 29, 2019

> I'm testing a custom version in which I've increased the timeout to 2 minutes

I haven't experienced a single error in the last 4 days. @bwplotka I'd be happy to contribute a patch for the timeout awaiting response headers issue, but I'd like to ask what your preferred option would be: to simply increase it to another arbitrary value (e.g. 2 minutes) or to add a configuration flag. The latter is more flexible, but at the same time adds complexity without (imho) adding much value. Pinging @alvaroaleman too as the creator of #323
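
To make the two options concrete, here is a rough sketch of the configurable variant using Go's standard `flag` package, with the 2 minute value as the default. The flag name `response-header-timeout` and the helpers `parseHeaderTimeout` and `timeoutSeconds` are hypothetical and do not correspond to an existing Thanos option.

```go
package main

import (
	"flag"
	"fmt"
	"net/http"
	"time"
)

// defaultHeaderTimeout is the value that would otherwise be hard-coded.
const defaultHeaderTimeout = 2 * time.Minute

// parseHeaderTimeout reads the hypothetical -response-header-timeout flag
// from args and falls back to defaultHeaderTimeout when it is absent.
func parseHeaderTimeout(args []string) (time.Duration, error) {
	fs := flag.NewFlagSet("example", flag.ContinueOnError)
	timeout := fs.Duration("response-header-timeout", defaultHeaderTimeout,
		"max time to wait for response headers from the object store")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *timeout, nil
}

// timeoutSeconds is a small convenience wrapper that returns the parsed
// timeout in whole seconds, or -1 if the flags could not be parsed.
func timeoutSeconds(args ...string) int {
	d, err := parseHeaderTimeout(args)
	if err != nil {
		return -1
	}
	return int(d.Seconds())
}

func main() {
	d, err := parseHeaderTimeout([]string{"-response-header-timeout=30s"})
	if err != nil {
		fmt.Println("bad flag:", err)
		return
	}
	transport := &http.Transport{ResponseHeaderTimeout: d}
	fmt.Println(transport.ResponseHeaderTimeout)
}
```

The extra code is small, but every flag also needs documentation and support, which is the complexity cost mentioned above.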

@alvaroaleman
Contributor

The headers are the first thing sent. Admittedly, the 10s we currently use is a bit low, but if you don't get them within two whole minutes I'd say you can safely assume they won't come later.

@kadern0
Contributor

kadern0 commented Apr 29, 2019

> I'm testing a custom version in which I've increased the timeout to 2 minutes
>
> I haven't experienced a single error in the last 4 days. @bwplotka I'd be happy to contribute a patch for the timeout awaiting response headers issue, but I'd like to ask what your preferred option would be: to simply increase it to another arbitrary value (e.g. 2 minutes) or to add a configuration flag. The latter is more flexible, but at the same time adds complexity without (imho) adding much value. Pinging @alvaroaleman too as the creator of #323

I wouldn't mind trying your approach, as I'm facing exactly the same issue. Could you share your changes? Thanks.

@antonio
Contributor

antonio commented Apr 29, 2019

@kadern0 I've included my change in #1094

@cspargo

cspargo commented May 16, 2019

I see the S3 header timeout issue with 0.3.2 and 0.4.0. One thing I noticed when I was trying 0.4.0 is that when the timeout happens, the Thanos compactor process exits and needs to be restarted. With 0.3.2 it does not exit, and just loops and tries again. Is this an expected change in behaviour in 0.4.0?

@Allex1
Contributor

Allex1 commented Jun 19, 2019

  • S3
  • thanos 0.5.0

Jun 19 05:46:05 HOST: level=error ts=2019-06-19T05:46:05.073782433Z caller=main.go:182 msg="running command failed" err="error executing compaction: compaction failed: compaction failed for group 0@{monitor=\"master\",replica=\"1\"}: upload of 01DDQ452TMKEX341JMXVXV90N4 failed: upload chunks: upload file /var/lib/thanos-compact/compact/0@{monitor=\"master\",replica=\"1\"}/01DDQ452TMKEX341JMXVXV90N4/chunks/000005 as 01DDQ452TMKEX341JMXVXV90N4/chunks/000005: upload s3 object: Put https://s3bucket.s3.dualstack.us-east-1.amazonaws.com/01DDQ452TMKEX341JMXVXV90N4/chunks/000005?partNumber=8&uploadId=xxx--: net/http: timeout awaiting response headers"

#1094, which was merged before the 0.5.0 release, doesn't seem to fix it for us.

@daixiang0
Member

@Allex1 hi, would you like to try with v0.10 rc?

@Allex1
Contributor

Allex1 commented Jan 9, 2020

@daixiang0 I haven't seen this error since upgrading to v0.8.1
Thanks

@daixiang0
Member

@bwplotka seems we can close it safely.

@bwplotka bwplotka closed this as completed Jan 9, 2020