@pettersolberg88 pettersolberg88 commented Dec 29, 2023

Store: Add flag ignore-deletion-marks-errors to be able to ignore errors while retrieving deletion marks.

Our S3 implementation (NetApp) has intermittent faults that cause timeouts when querying some non-existent objects.
The IgnoreDeletionMarkFilter queries every metrics block for its deletion-mark.json file, and when store receives a timeout or another error, it crashes. This flag makes store ignore all errors from fetching deletion marks, so it no longer crashes.

Fixes errors like this:

{"caller":"grpc.go:164","component":"store","err":"bucket store initial sync: sync block: filter metas: filter blocks marked for deletion: get file: 01HA07EAKT1YPCMYC6SDHS58S0/deletion-mark.json: Get \"https:<S3-URL>/thanos-metrics/01HA07EAKT1YPCMYC6SDHS58S0/deletion-mark.json\": dial tcp <IP-address>:443: i/o timeout","level":"info","msg":"internal server is shutdown gracefully","service":"gRPC/server","ts":"2023-10-31T07:06:03.987065433Z"}

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

  • Add flag ignore-deletion-marks-errors to be able to ignore errors while retrieving deletion marks.
  • Log other errors while processing deletion-marks.
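The intended behavior can be illustrated with a minimal Go sketch. This is not the code from this PR or from Thanos: `filterMarked`, `markerReader`, `errMarkerNotFound`, and `errTimeout` are hypothetical names, and the sketch assumes that a block whose marker lookup fails is kept rather than dropped when errors are ignored. It also drops marked blocks immediately, without any delay handling.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Hypothetical sentinel errors, used for illustration only.
var (
	// errMarkerNotFound means the block has no deletion-mark.json (the normal case).
	errMarkerNotFound = errors.New("deletion mark not found")
	// errTimeout stands in for an object-store failure such as "i/o timeout".
	errTimeout = errors.New("i/o timeout")
)

// markerReader is a hypothetical stand-in for the object-store lookup.
// It returns nil when a deletion mark exists for the block.
type markerReader func(blockID string) error

// filterMarked returns the IDs of blocks that remain visible.
// With ignoreErrors=false, the first unexpected error aborts the whole sync.
// With ignoreErrors=true, the error is logged and the block is kept (an assumption).
func filterMarked(ids []string, read markerReader, ignoreErrors bool) ([]string, error) {
	kept := make([]string, 0, len(ids))
	for _, id := range ids {
		err := read(id)
		switch {
		case err == nil:
			// A deletion mark exists: hide the block.
			continue
		case errors.Is(err, errMarkerNotFound):
			kept = append(kept, id)
		case ignoreErrors:
			log.Printf("ignoring deletion-mark error for block %s: %v", id, err)
			kept = append(kept, id)
		default:
			return nil, fmt.Errorf("filter blocks marked for deletion: get file: %s/deletion-mark.json: %w", id, err)
		}
	}
	return kept, nil
}

func main() {
	ids := []string{"block-a", "block-b", "block-c"}
	read := func(id string) error {
		switch id {
		case "block-a":
			return errMarkerNotFound // live block
		case "block-b":
			return nil // marked for deletion
		default:
			return errTimeout // flaky object store
		}
	}

	// Without the flag, one timeout fails the whole filter step.
	kept, err := filterMarked(ids, read, false)
	fmt.Println(kept, err)

	// With the flag, the timeout is logged and the sync continues.
	kept, err = filterMarked(ids, read, true)
	fmt.Println(kept, err)
}
```

The trade-off in this sketch is that a block with an unreadable marker stays visible even if it is in fact marked for deletion, which is why logging the ignored error matters.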

Verification

  • Tested against our S3 implementation. Store does not crash.

…ors while retrieving deletion marks

Signed-off-by: Petter Solberg <pettersolberg88@gmail.com>
@pull-request-size pull-request-size bot added size/M and removed size/L labels Dec 29, 2023
yeya24 (Contributor) commented Jan 3, 2024

> The IgnoreDeletionMarkFilter queries all metrics blocks for the file deletion-mark.json and when store receives an timeout or other error, it crashes.

This doesn't sound right. An I/O timeout shouldn't crash store.

pettersolberg88 (Author) commented Jan 4, 2024

I agree that an I/O timeout should not crash Thanos store. To me it seems that the error handling does not handle timeouts properly.

Here is a complete log from thanos-store, which is currently crashlooping. We are running two replicas of v0.32.5 and both are crashlooping. The workaround is to delete the whole chunk.

k logs thanos-store-cold-0 -c thanos-store -p
{"caller":"factory.go:53","level":"info","msg":"loading bucket configuration","ts":"2024-01-04T05:46:42.136611622Z"}
{"caller":"factory.go:35","level":"info","msg":"loading index cache configuration","ts":"2024-01-04T05:46:42.1370288Z"}
{"caller":"memcached.go:71","level":"info","msg":"created index cache","ts":"2024-01-04T05:46:42.137888234Z"}
{"caller":"options.go:26","level":"info","msg":"disabled TLS, key and cert must be set to enable","protocol":"gRPC","ts":"2024-01-04T05:46:42.13830578Z"}
{"caller":"store.go:520","level":"info","msg":"starting store node","ts":"2024-01-04T05:46:42.139639595Z"}
{"caller":"intrumentation.go:75","level":"info","msg":"changing probe status","status":"healthy","ts":"2024-01-04T05:46:42.139735228Z"}
{"address":"0.0.0.0:10902","caller":"http.go:73","component":"store","level":"info","msg":"listening for requests and metrics","service":"http/server","ts":"2024-01-04T05:46:42.139776183Z"}
{"caller":"store.go:418","level":"info","msg":"initializing bucket store","ts":"2024-01-04T05:46:42.139788709Z"}
{"address":"[::]:10902","caller":"tls_config.go:274","component":"store","level":"info","msg":"Listening on","service":"http/server","ts":"2024-01-04T05:46:42.139967167Z"}
{"address":"[::]:10902","caller":"tls_config.go:277","component":"store","http2":false,"level":"info","msg":"TLS is disabled.","service":"http/server","ts":"2024-01-04T05:46:42.139987835Z"}
{"caller":"intrumentation.go:67","level":"warn","msg":"changing probe status","reason":"bucket store initial sync: sync block: filter metas: filter blocks marked for deletion: get file: 01HH319ACH6FYV9Q28010T93YB/deletion-mark.json: Get \"https://<S3-URL>/thanos-metrics/01HH319ACH6FYV9Q28010T93YB/deletion-mark.json\": dial tcp <ip>:443: i/o timeout","status":"not-ready","ts":"2024-01-04T07:02:52.44066162Z"}
{"caller":"http.go:91","component":"store","err":"bucket store initial sync: sync block: filter metas: filter blocks marked for deletion: get file: 01HH319ACH6FYV9Q28010T93YB/deletion-mark.json: Get \"https://<S3-URL>/thanos-metrics/01HH319ACH6FYV9Q28010T93YB/deletion-mark.json\": dial tcp <ip>:443: i/o timeout","level":"info","msg":"internal server is shutting down","service":"http/server","ts":"2024-01-04T07:02:52.440757359Z"}
{"caller":"intrumentation.go:56","level":"info","msg":"changing probe status","status":"ready","ts":"2024-01-04T07:02:52.440804592Z"}
{"address":"0.0.0.0:10901","caller":"grpc.go:131","component":"store","level":"info","msg":"listening for serving gRPC","service":"gRPC/server","ts":"2024-01-04T07:02:52.440867161Z"}
{"caller":"http.go:110","component":"store","err":"bucket store initial sync: sync block: filter metas: filter blocks marked for deletion: get file: 01HH319ACH6FYV9Q28010T93YB/deletion-mark.json: Get \"https://<S3-URL>/thanos-metrics/01HH319ACH6FYV9Q28010T93YB/deletion-mark.json\": dial tcp <ip>:443: i/o timeout","level":"info","msg":"internal server is shutdown gracefully","service":"http/server","ts":"2024-01-04T07:02:52.44091739Z"}
{"caller":"intrumentation.go:81","level":"info","msg":"changing probe status","reason":"bucket store initial sync: sync block: filter metas: filter blocks marked for deletion: get file: 01HH319ACH6FYV9Q28010T93YB/deletion-mark.json: Get \"https://<S3-URL>/thanos-metrics/01HH319ACH6FYV9Q28010T93YB/deletion-mark.json\": dial tcp <ip>:443: i/o timeout","status":"not-healthy","ts":"2024-01-04T07:02:52.440966495Z"}
{"caller":"intrumentation.go:67","level":"warn","msg":"changing probe status","reason":"bucket store initial sync: sync block: filter metas: filter blocks marked for deletion: get file: 01HH319ACH6FYV9Q28010T93YB/deletion-mark.json: Get \"https://<S3-URL>/thanos-metrics/01HH319ACH6FYV9Q28010T93YB/deletion-mark.json\": dial tcp <ip>:443: i/o timeout","status":"not-ready","ts":"2024-01-04T07:02:52.441015326Z"}
{"caller":"grpc.go:138","component":"store","err":"bucket store initial sync: sync block: filter metas: filter blocks marked for deletion: get file: 01HH319ACH6FYV9Q28010T93YB/deletion-mark.json: Get \"https://<S3-URL>/thanos-metrics/01HH319ACH6FYV9Q28010T93YB/deletion-mark.json\": dial tcp <ip>:443: i/o timeout","level":"info","msg":"internal server is shutting down","service":"gRPC/server","ts":"2024-01-04T07:02:52.441033605Z"}
{"caller":"grpc.go:151","component":"store","level":"info","msg":"gracefully stopping internal server","service":"gRPC/server","ts":"2024-01-04T07:02:52.44105544Z"}
{"caller":"grpc.go:164","component":"store","err":"bucket store initial sync: sync block: filter metas: filter blocks marked for deletion: get file: 01HH319ACH6FYV9Q28010T93YB/deletion-mark.json: Get \"https://<S3-URL>/thanos-metrics/01HH319ACH6FYV9Q28010T93YB/deletion-mark.json\": dial tcp <ip>:443: i/o timeout","level":"info","msg":"internal server is shutdown gracefully","service":"gRPC/server","ts":"2024-01-04T07:02:52.441094163Z"}
{"caller":"main.go:161","err":"Get \"https://<S3-URL>/thanos-metrics/01HH319ACH6FYV9Q28010T93YB/deletion-mark.json\": dial tcp <IP-address-to-S3-provider>:443: i/o timeout get file: 01HH319ACH6FYV9Q28010T93YB/deletion-mark.json github.com/thanos-io/thanos/pkg/block/metadata.ReadMarker \t/app/pkg/block/metadata/markers.go:124 github.com/thanos-io/thanos/pkg/block.(*IgnoreDeletionMarkFilter).Filter.func1 \t/app/pkg/block/fetcher.go:859 golang.org/x/sync/errgroup.(*Group).Go.func1 \t/go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75 runtime.goexit \t/usr/local/go/src/runtime/asm_amd64.s:1650 filter blocks marked for deletion github.com/thanos-io/thanos/pkg/block.(*IgnoreDeletionMarkFilter).Filter \t/app/pkg/block/fetcher.go:904 github.com/thanos-io/thanos/pkg/block.(*BaseFetcher).fetch \t/app/pkg/block/fetcher.go:475 github.com/thanos-io/thanos/pkg/block.(*MetaFetcher).Fetch \t/app/pkg/block/fetcher.go:514 github.com/thanos-io/thanos/pkg/store.(*BucketStore).SyncBlocks \t/app/pkg/store/bucket.go:556 github.com/thanos-io/thanos/pkg/store.(*BucketStore).InitialSync \t/app/pkg/store/bucket.go:625 main.runStore.func5.1 \t/app/cmd/thanos/store.go:427 github.com/thanos-io/thanos/pkg/runutil.RetryWithLog \t/app/pkg/runutil/runutil.go:97 github.com/thanos-io/thanos/pkg/runutil.Retry \t/app/pkg/runutil/runutil.go:87 main.runStore.func5 \t/app/cmd/thanos/store.go:426 github.com/oklog/run.(*Group).Run.func1 \t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38 runtime.goexit \t/usr/local/go/src/runtime/asm_amd64.s:1650 filter metas github.com/thanos-io/thanos/pkg/block.(*BaseFetcher).fetch \t/app/pkg/block/fetcher.go:476 github.com/thanos-io/thanos/pkg/block.(*MetaFetcher).Fetch \t/app/pkg/block/fetcher.go:514 github.com/thanos-io/thanos/pkg/store.(*BucketStore).SyncBlocks \t/app/pkg/store/bucket.go:556 github.com/thanos-io/thanos/pkg/store.(*BucketStore).InitialSync \t/app/pkg/store/bucket.go:625 main.runStore.func5.1 \t/app/cmd/thanos/store.go:427 github.com/thanos-io/thanos/pkg/runutil.RetryWithLog \t/app/pkg/runutil/runutil.go:97 github.com/thanos-io/thanos/pkg/runutil.Retry \t/app/pkg/runutil/runutil.go:87 main.runStore.func5 \t/app/cmd/thanos/store.go:426 github.com/oklog/run.(*Group).Run.func1 \t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38 runtime.goexit \t/usr/local/go/src/runtime/asm_amd64.s:1650 sync block github.com/thanos-io/thanos/pkg/store.(*BucketStore).InitialSync \t/app/pkg/store/bucket.go:626 main.runStore.func5.1 \t/app/cmd/thanos/store.go:427 github.com/thanos-io/thanos/pkg/runutil.RetryWithLog \t/app/pkg/runutil/runutil.go:97 github.com/thanos-io/thanos/pkg/runutil.Retry \t/app/pkg/runutil/runutil.go:87 main.runStore.func5 \t/app/cmd/thanos/store.go:426 github.com/oklog/run.(*Group).Run.func1 \t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38 runtime.goexit \t/usr/local/go/src/runtime/asm_amd64.s:1650 bucket store initial sync main.runStore.func5 \t/app/cmd/thanos/store.go:432 github.com/oklog/run.(*Group).Run.func1 \t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38 runtime.goexit \t/usr/local/go/src/runtime/asm_amd64.s:1650 store command failed main.main \t/app/cmd/thanos/main.go:161 runtime.main \t/usr/local/go/src/runtime/proc.go:267 runtime.goexit \t/usr/local/go/src/runtime/asm_amd64.s:1650","level":"error","ts":"2024-01-04T07:02:52.441344799Z"}

Stig132 commented Mar 5, 2025

We've been running this version of Thanos in production for over a year now without issues.
Can you have another look at this, or is this change not acceptable for Thanos?
