store: "invalid size" when decoding postings. #6545

Closed
ryan-suther opened this issue Jul 20, 2023 · 10 comments · Fixed by #6602

@ryan-suther

Thanos, Prometheus and Golang version used:
Thanos: main@dc337b23
Prometheus: v0.44.1
Golang: 1.20.2

thanos, version 0.32.0-dev (branch: HEAD, revision: dc337b2)
build user: rsuther@
build date: 20230720-18:24:06
go version: go1.20.2
platform: darwin/amd64
tags: netgo

Object Storage Provider:
CEPH(S3)

What happened:
We were on v0.30.2 and upgraded to v0.31.0, where we experienced the high memory usage while rolling out new deployments. We then upgraded to the commit above, which contains the high-memory-usage fix, and began seeing the error below.

xxx.xxx.xxx.xxxx:10901 LabelSets: {shard labels} MinTime: 1678150240752 MaxTime: 1688850857186: rpc error: code = Aborted desc = fetch series for block 01H4SRWSPKY6K1WHN1XG79P85Y: expanded matching posting: expand: invalid size"}

Multiple refreshes of the Grafana dashboard produced the same error with different block IDs.

What you expected to happen:
Thanos to operate normally.

How to reproduce it (as minimally and precisely as possible):
We are not able to reliably reproduce the error. I was able to download one of the reported blocks and run some e2e tests using the original query from the dashboard, but could not reproduce the issue; the query did, however, return good results.

In the test, I used a filesystem bucket with the reported block and an in-memory index cache (a minimal sketch of that setup follows the test output below). Output of the test:

Start the store and run InitialSync().

level=info msg="created in-memory index cache" maxItemSizeBytes=131072000 maxSizeBytes=4000000000 maxItems=maxInt
level=debug component=block.BaseFetcher msg="fetching meta data" concurrency=20
level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=5.789802ms duration_ms=5 cached=1 returned=1 partial=0
level=debug msg="loading new block" id=01H4SRWSPKY6K1WHN1XG79P85Y
level=info msg="loaded new block" elapsed=58.86966ms id=01H4SRWSPKY6K1WHN1XG79P85Y

Submit a query via a store.Series(...) call.

level=debug msg="Blocks source resolutions" blocks=1 MaximumResolution=0 mint=1678150240752 maxt=1688850857186 lset="{_shard_="..."}" spans="Range: 1688601600000-1688774400000 Resolution: 0"

level=debug msg="stats query processed" request="min_time:1678150240752 max_time:1688850857186 matchers: <several matchers> shard_info:<shard_index:3 total_shards:4 labels:\"_shard_\" > " stats="&{blocksQueried:1 postingsTouched:6 PostingsTouchedSizeSum:2MiB538KiB853B postingsToFetch:0 postingsFetched:6 PostingsFetchedSizeSum:10MiB1KiB192B postingsFetchCount:5 PostingsFetchDurationSum:142.70863ms cachedPostingsCompressions:7 cachedPostingsCompressionErrors:0 CachedPostingsOriginalSizeSum:2MiB538KiB925B CachedPostingsCompressedSizeSum:781KiB687B CachedPostingsCompressionTimeSum:5.656442ms cachedPostingsDecompressions:0 cachedPostingsDecompressionErrors:0 CachedPostingsDecompressionTimeSum:0s seriesTouched:18 SeriesTouchedSizeSum:4KiB908B seriesFetched:18 SeriesFetchedSizeSum:371KiB768B seriesFetchCount:3 SeriesFetchDurationSum:2.3962ms chunksTouched:99 ChunksTouchedSizeSum:5KiB38B chunksFetched:99 ChunksFetchedSizeSum:109KiB864B chunksFetchCount:5 ChunksFetchDurationSum:9.772196ms GetAllDuration:81.33475ms mergedSeriesCount:7 mergedChunksCount:99 MergeDuration:3.801363ms DataDownloadedSizeSum:10MiB482KiB800B}" err=null

Run store.Sync(...)

level=debug component=block.BaseFetcher msg="fetching meta data" concurrency=20
level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=4.346155ms duration_ms=4 cached=1 returned=1 partial=0

Call store.Series(...) again with the same query.

level=debug msg="Blocks source resolutions" blocks=1 MaximumResolution=0 mint=1678150240752 maxt=1688850857186 lset="{_shard_="..."}" spans="Range: 1688601600000-1688774400000 Resolution: 0"

level=debug msg="stats query processed" request="min_time:1678150240752 max_time:1688850857186 matchers:<matchers> shard_info:<shard_index:3 total_shards:4 labels:\"_shard_\" > " stats="&{blocksQueried:1 postingsTouched:1 PostingsTouchedSizeSum:50B postingsToFetch:0 postingsFetched:0 PostingsFetchedSizeSum:0B postingsFetchCount:0 PostingsFetchDurationSum:0s cachedPostingsCompressions:0 cachedPostingsCompressionErrors:0 CachedPostingsOriginalSizeSum:0B CachedPostingsCompressedSizeSum:0B CachedPostingsCompressionTimeSum:0s cachedPostingsDecompressions:1 cachedPostingsDecompressionErrors:0 CachedPostingsDecompressionTimeSum:708ns seriesTouched:18 SeriesTouchedSizeSum:4KiB908B seriesFetched:0 SeriesFetchedSizeSum:0B seriesFetchCount:0 SeriesFetchDurationSum:0s chunksTouched:99 ChunksTouchedSizeSum:5KiB38B chunksFetched:99 ChunksFetchedSizeSum:109KiB864B chunksFetchCount:5 ChunksFetchDurationSum:1.679806ms GetAllDuration:60.138µs mergedSeriesCount:7 mergedChunksCount:99 MergeDuration:657.334µs DataDownloadedSizeSum:114KiB798B}" err=null

Resulting series looked good, no warnings.
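
For context, here is a minimal sketch of the kind of harness described above, assuming the objstore filesystem provider; the bucket path is hypothetical, and wiring it into a BucketStore with an in-memory index cache is omitted because those constructors differ between Thanos versions.

```go
package main

import (
	"context"
	"fmt"

	"github.com/thanos-io/objstore/providers/filesystem"
)

func main() {
	// Hypothetical local directory containing the downloaded block,
	// e.g. ./testdata/bucket/01H4SRWSPKY6K1WHN1XG79P85Y/.
	bkt, err := filesystem.NewBucket("./testdata/bucket")
	if err != nil {
		panic(err)
	}
	defer bkt.Close()

	// List the top-level entries (block ULIDs) that the store gateway would see.
	if err := bkt.Iter(context.Background(), "", func(name string) error {
		fmt.Println("found:", name)
		return nil
	}); err != nil {
		panic(err)
	}
}
```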

Full logs to relevant components:
The error above is not logged. All other logging for the block(s) appeared normal.

Anything else we need to know:

@GiedriusS
Member

@yeya24 maybe you've seen this on the Cortex side?

@GiedriusS
Member

I was able to repro this consistently; let me try to fix it.

@GiedriusS
Member

Added a repro with prod data in #6575; not sure how this data got generated in the first place. In one instance 0xc8 is left at the end of a chunk, and in another, 0x83. 🤔
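
For what it's worth, both 0xc8 and 0x83 have the high (continuation) bit set, so a varint reader that hits either byte at the end of a buffer concludes the value is truncated, which is the kind of condition the Decbuf layer reports as "invalid size". A quick standard-library illustration:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// Both stray bytes have the continuation (high) bit set, so a varint
	// decoder needs more input: binary.Uvarint signals this with n == 0.
	for _, buf := range [][]byte{{0xc8}, {0x83}} {
		v, n := binary.Uvarint(buf)
		fmt.Printf("buf=% x -> value=%d, n=%d (n == 0 means truncated varint)\n", buf, v, n)
	}
}
```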

@GiedriusS
Member

@ryan-suther do you use Redis caching in prod, perhaps?

@ryan-suther
Author

ryan-suther commented Aug 1, 2023

@GiedriusS No, we do not. We use a normal PVC. Our store args:

        - store
        - --data-dir=/data
        - --index-cache-size=1GB
        - --chunk-pool-size=4GB
        - --max-time=-9d
        - --min-time=-400d

@GiedriusS
Member

Somehow fixing the following data race also fixed this problem for me: #6575.

@GiedriusS
Member

GiedriusS commented Aug 10, 2023

Still reproducible, unfortunately; trying to fix it. It breaks long-term querying with the index cache :'/

GiedriusS added a commit to vinted/thanos that referenced this issue Aug 10, 2023
Snappy works at the byte level and can cut two different chunks in the middle of a varint. Thus, if there's an error from the Decbuf, fill up the buffer and try reading the varint again. Added a repro test.

Closes thanos-io#6545.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
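
As a rough sketch of the idea in the commit message above (not the actual Thanos code), a reader can treat a truncated varint at the end of the current decompressed block as "need more bytes", refill, and retry; readUvarint and its refill reader below are illustrative only.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// readUvarint decodes one uvarint from buf. If buf ends in the middle of a
// varint (binary.Uvarint reports n == 0), it pulls more bytes from r and
// retries, mirroring the "fill up the buffer and try again" idea. This is an
// illustrative helper, not the Thanos implementation.
func readUvarint(buf []byte, r io.Reader) (uint64, []byte, error) {
	tmp := make([]byte, 16)
	for {
		v, n := binary.Uvarint(buf)
		if n > 0 {
			return v, buf[n:], nil // Decoded successfully; return the remainder.
		}
		if n < 0 {
			return 0, buf, errors.New("uvarint overflows 64 bits")
		}
		// n == 0: the buffer ended mid-varint, e.g. at a compression block boundary.
		read, err := r.Read(tmp)
		if read > 0 {
			buf = append(buf, tmp[:read]...)
			continue
		}
		if err != nil {
			return 0, buf, err
		}
	}
}

func main() {
	// 0xc8 alone is a truncated varint; the rest (0x01) arrives with the
	// next block. Together they decode to 200.
	v, _, err := readUvarint([]byte{0xc8}, bytes.NewReader([]byte{0x01}))
	fmt.Println(v, err) // 200 <nil>
}
```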
@GiedriusS
Member

This will finally be fixed in #6602.

saswatamcode pushed a commit that referenced this issue Aug 10, 2023
@ylazaar

ylazaar commented Sep 9, 2023

I'm still seeing this issue in v0.32.2.
[screenshot of the error attached]

@yeya24
Contributor

yeya24 commented Sep 10, 2023

@ylazaar IIUC it is a different issue, as the error is from the compactor, not the store gateway. Would you mind creating a separate issue?
