
Store: Hundreds of errors in logs #1257

Closed
R4scal opened this issue Jun 17, 2019 · 4 comments · Fixed by #1274

Comments

@R4scal

R4scal commented Jun 17, 2019

Hi

I have hundreds of errors in the logs of the store daemon, like:

Logs

{"cacheType":"Postings","caller":"cache.go:254","curSize":8587457979,"itemSize":4253828,"iterations":500,"level":"error","maxItemSizeBytes":4294967296,"maxSizeBytes":8589934592,"msg":"After max sane iterations of LRU evictions, we still cannot allocate the item. Ignoring.","ts":"2019-06-17T07:07:37.983393384Z"}

I think this is not a real problem. Maybe change the log level to warning?

./thanos --version
thanos, version 0.5.0 (branch: HEAD, revision: 72820b3f41794140403fd04d6da82299f2c16447)
  build user:       root@7d72e9360b09
  build date:       20190606-10:49:10
  go version:       go1.12.5
@bwplotka
Member

Thanks for the report!

So it is an error that we recover from, but it suggests really high pressure on the postings index cache in the store gateway. It's an error because in this case your cache is effectively not working, as nothing can be removed from it.
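
For context, the behaviour behind that message is roughly the following bounded eviction loop (a simplified sketch, not the exact cache.go code; the constant 500 and the two log messages come from this issue, the type and field names here are just illustrative):

    // Simplified sketch: evict LRU items until the new item fits, giving up
    // after a bounded number of evictions or when the cache is empty.
    const saneMaxIterations = 500

    type lruCache struct {
        curSize, maxSizeBytes uint64
        itemSizes             []uint64 // sizes of cached items, least recently used first
    }

    func (c *lruCache) ensureFits(itemSize uint64) bool {
        for i := 0; c.curSize+itemSize > c.maxSizeBytes; i++ {
            if i >= saneMaxIterations {
                // "After max sane iterations of LRU evictions, we still cannot allocate the item. Ignoring."
                return false
            }
            if len(c.itemSizes) == 0 {
                // "LRU has nothing more to evict, but we still cannot allocate the item. Ignoring."
                return false
            }
            // Evict the least recently used item and reclaim its bytes.
            c.curSize -= c.itemSizes[0]
            c.itemSizes = c.itemSizes[1:]
        }
        return true
    }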

Are you sure this is with the store gateway on 0.5.0?

If that's true, then #1142 might still not be enough as a fix.

cc @GiedriusS

@R4scal
Author

R4scal commented Jun 17, 2019

Yes, store 0.5.0. I can try increasing index-cache-size from 8g to 10g, but the cache is actually not a critical feature for us, because we use fast local S3 storage and don't have high RPS on the store.
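
For reference, that would be bumping the store gateway flag to something like --index-cache-size=10GB (flag spelling as I remember it for the store subcommand; worth double-checking against thanos store --help).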

@abursavich
Contributor

abursavich commented Jun 20, 2019

In my v0.4.0 store (still waiting on the next prometheus-operator release before moving to v0.5.0), once the cache thinks it's full and starts triggering evictions, it quickly starts complaining about too many iterations. It stays in this mostly-working mode until it evicts everything (while still thinking it's nearly full). After that it keeps adding and evicting (small) items for days, but the hit ratio drops to near zero.

Cache overview:
[screenshot: cache metrics]

Logs (maxed y-axis cuts off millions of errors):
[screenshot: error logs]

The logs switch from "After max sane iterations of LRU evictions, we still cannot allocate the item. Ignoring." to "LRU has nothing more to evict, but we still cannot allocate the item. Ignoring." once the true number of items in the cache falls below saneMaxIterations. If you zoom in, there are still blips of "iteration" errors as the transition occurs.

Zoomed in on transition:
[screenshot: transition]

My hypothesis is that the saneMaxIterations value of 500 is too low due to the cache having a wide range of item sizes. For the same store as above:

  • Postings average size is ~30KB (x500 = ~15MB)
  • Series average size is ~100 bytes (x500 = ~50KB)
  • The overall average size is ~4KB (x500 = ~2MB)

I have "iteration" error logs when trying to insert items ranging from ~130KB to ~13MB (most are between ~1.5MB and ~3.5MB).

Unless "nothing more to evict" starts appearing in v0.5.0 logs, I think the issue is the iteration restriction.

Average item sizes:
[screenshot: average item sizes]

@bwplotka
Member

That is very plausible! Nice.
