Discussion: Improve memory use for queries #1649

Closed
ppanyukov opened this issue Oct 14, 2019 · 4 comments

ppanyukov commented Oct 14, 2019

This is a tracking issue for memory utilisation in Thanos when serving queries, and the steps we want to take to improve it.

Core of the issue:

  • Both Querier and Store Gateway allocate unbounded amounts of memory during query execution.
  • This can and does lead to OOM in resource-limited container environments (k8s etc).

What we want is:

  • A list of ideas to consider and try;
  • Status of each idea based on discussions, tests etc;

Repro

Base build

thanos, version 0.7.0 (branch: master, revision: ed045447fef8c4b5ff62697559463549ff522f31)
  build user:       philip@Mayas-MacBook-Pro.local
  build date:       20191010-13:52:26
  go version:       go1.13.1

Data

  • synthetically generated (uses thanosbench blockgen)
  • data size:
    • metrics: 200
    • targets: 100
    • sample interval: 15s
    • total time interval: 10 hours

Query

  • count({__name__=~".+"}) by (__name__)
  • time span: all generated data
  • curl:
curl 'http://localhost:11902/api/v1/query_range?query=count(%7B__name__%3D~%22.%2B%22%7D)%20by%20(__name__)&dedup=true&partial_response=true&start=1568634900&end=1569844500&step=4838&max_source_resolution=0s&_=1570622122482'

Memory profile

  • MacBook Pro 16GB RAM
  • Store Gateway: 20GB peak
  • Querier: 10GB peak before time-out
  • See chart

Chart

  • Green is Store Gateway; Red is Querier
  • Note that the query timed out, so the chart does not show the Querier in its entirety.

[Screenshot 2019-10-10 at 14 42 08: memory usage chart]

List of ideas

Per-query limits and flow optimisations look to be the way to go.

[1.] Simple global memory limiter in Querier. Rejected. See #1631 .

[1.1] Simple global memory limiter in SG? I've just realised that it may actually be OOMs in the Store Gateway, not the Querier, that cause the most pain. Worth considering?

[2.] Limit concurrent queries, max samples, and series(?). As per Cortex: #1631 (comment)
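A rough sketch of what such a per-query limit could look like (everything below is hypothetical and purely illustrative, not existing Thanos or Cortex code; Cortex exposes similar limits as configuration):

import "fmt"

// queryLimiter tracks series and samples accumulated by a single query and
// fails fast once a configured threshold is exceeded (0 means no limit).
type queryLimiter struct {
  maxSeries  int
  maxSamples int
  series     int
  samples    int
}

func (l *queryLimiter) AddSeries(numSamples int) error {
  l.series++
  l.samples += numSamples
  if l.maxSeries > 0 && l.series > l.maxSeries {
    return fmt.Errorf("query hit the max series limit (%d)", l.maxSeries)
  }
  if l.maxSamples > 0 && l.samples > l.maxSamples {
    return fmt.Errorf("query hit the max samples limit (%d)", l.maxSamples)
  }
  return nil
}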

[3.] Frontend query caching. See #1006

Users are starting to use frontend query caching. This changes the characteristics of traffic against the Querier a lot, as it splits queries by day; maybe we can take that as an advantage?

[4.] Result streaming.

Why does it not stream out the results? Is it something to do with deduplication perhaps? Aggregation?

@bwplotka: Mainly because the SeriesSet API of PromQL requires full series, iterating series by series.

But in querier.go we see a comment which we may want to investigate further:

// TODO(@fabxc ): this could potentially pushed further down into the store API
// to make true streaming possible.
sortDedupLabels(resp.seriesSet, q.replicaLabels)

[5.] Prevent knowingly bad queries?

[6.] Forking.

Maybe a crazy idea, but what if we fork a process for each query with simple memory limits?
This could be a way to have per-query memory limits etc.

[7.] (@bwplotka) Count allocations (roughly) per user / query.

[8.] Chop [tmin, tmax] in Querier. Can we chop the requested [tmin, tmax] into smaller (1h?) chunks [tmin, T1], [T1, T2], ... [Tn, tmax], request those from the Store Gateway one by one, assemble the results in the Querier like we do now, and then send them over to the client? Is this similar to what frontend query caching would do? Not worth it, see #1667
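A toy sketch of the chopping itself (illustrative only; the step size and millisecond timestamps are assumptions, not existing Querier code):

// splitRange chops [tmin, tmax] (timestamps in ms) into sub-ranges of at most
// step ms each, so they could be requested from the Store Gateway one by one.
func splitRange(tmin, tmax, step int64) [][2]int64 {
  var out [][2]int64
  for start := tmin; start < tmax; start += step {
    end := start + step
    if end > tmax {
      end = tmax
    }
    out = append(out, [2]int64{start, end})
  }
  return out
}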

[9.] (@bwplotka) Can we loosen StoreAPI a bit to allow series to be aligned differently? This means the returned values could be unordered, thus opening up a way to stream chunks.

[10.] Intern labels (Labels []Label). These probably consume tons of memory (strings), and there are probably lots of identical duplicate lists. Why keep them all in memory as distinct objects? Why don't we do something like this at a point close to proto unmarshal:

func internLabels(labels []Label) []Label {
  // Hash the label set and reuse a previously seen identical set if there is one.
  h := hash(labels)
  if interned, found := lookup(h); found {
    return interned
  }

  addToMap(h, labels)
  return labels
}

possibly look at other things we can intern?
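For context, a slightly more complete sketch of such an interner (hypothetical, not an existing Thanos type; it keys on the concatenated label string instead of a hash, and Label stands in for the store proto label type):

import (
  "strings"
  "sync"
)

// labelInterner returns a canonical []Label for each distinct label set, so
// identical sets decoded from separate Series responses share one backing slice.
type labelInterner struct {
  mu   sync.Mutex
  sets map[string][]Label
}

func newLabelInterner() *labelInterner {
  return &labelInterner{sets: map[string][]Label{}}
}

func (i *labelInterner) intern(labels []Label) []Label {
  var b strings.Builder
  for _, l := range labels {
    b.WriteString(l.Name)
    b.WriteByte(0xff) // separators that cannot appear in valid UTF-8 label strings
    b.WriteString(l.Value)
    b.WriteByte(0xfe)
  }
  key := b.String()

  i.mu.Lock()
  defer i.mu.Unlock()
  if interned, ok := i.sets[key]; ok {
    return interned
  }
  i.sets[key] = labels
  return labels
}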

[11.] Related work in the Prometheus codebase to unify the labels package, which will hopefully allow/lead to fewer allocations (alloc-free casts etc.). See prometheus/prometheus#6029

[12.] Optimise for the special case of: Querier -> 1x Store Gateway -> 1x S3 bucket. This way we can essentially do straight streaming without having to merge the results, right? [What about dedup?]

[12.1] Taking [12] further, look at this case: Querier -> 1x Store Gateway -> [1x S3 bucket + 1x Prometheus]. Can we assert that for a lot of queries the data will be either in S3 or Prometheus, thus allowing optimisation [12]?

[12.2] Can we take this optimisation further and do smarter query partitioning and merging when required?


[999.] Other. Inventive use of OS/k8s/other products which may make queries better.

ppanyukov changed the title from "Improve memory use for queries" to "Discussion: Improve memory use for queries" on Oct 14, 2019

bwplotka commented Oct 14, 2019

What about just... a chunk limit? It's similar to 2. but we operate on chunks rather than samples at the StoreAPI level really.
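A rough sketch of what a chunk limit could look like (hypothetical and purely illustrative; the types below are simplified stand-ins, not the real storepb types):

import "fmt"

// Simplified stand-in for a series frame coming over the StoreAPI stream.
type Series struct {
  Chunks [][]byte
}

// chunkLimiter counts chunks fetched for a single query and aborts once the
// configured limit is exceeded (0 means no limit).
type chunkLimiter struct {
  limit int
  seen  int
}

func (c *chunkLimiter) Add(s Series) error {
  c.seen += len(s.Chunks)
  if c.limit > 0 && c.seen > c.limit {
    return fmt.Errorf("query fetched too many chunks: %d (limit %d)", c.seen, c.limit)
  }
  return nil
}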

  1. Frontend query caching helps a lot, but it's (a) optional and (b) just something to think about when designing any limits. By splitting by day we have a higher chance to load-balance and parallelise queries; this means a lower chance of hitting high memory consumption given a concurrency limit (if we do it right on the query frontend)

Why does it not stream out the results? Is it something to do with deduplication perhaps? Aggregation?

Mainly because the SeriesSet API of PromQL requires full series, iterating series by series. This is the same for StoreAPIs, but you can imagine that with fanout we need all results in order to merge them (essentially a merge sort) into a single stream of series sorted by labels. Not much we can do here. On top of that, deduplication is applied, but it works on a per-series basis, so that part is technically streaming.
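To illustrate the merge step, a toy sketch of merging already-sorted per-store results into one sorted stream (label sets are simplified to plain strings here; this is not the actual Thanos merge code):

// mergeSorted merges per-store series (each slice already sorted by labels)
// into one globally sorted list; the Querier needs data from every store
// before it can emit even the first series in the right order.
func mergeSorted(stores [][]string) []string {
  var out []string
  idx := make([]int, len(stores))
  for {
    best := -1
    for s, series := range stores {
      if idx[s] >= len(series) {
        continue
      }
      if best == -1 || series[idx[s]] < stores[best][idx[best]] {
        best = s
      }
    }
    if best == -1 {
      return out
    }
    out = append(out, stores[best][idx[best]])
    idx[best]++
  }
}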

Also 8. Count allocations (roughly) per user (:

@ppanyukov

Count allocations (roughly) per user (:

Are we talking about memory accounting per query here?

@ppanyukov

Did a super quick trial of [10] with label interning in the Querier. Not sure it made much difference with my test data set, but it was a super quick and dirty job. Will look into this in more detail and in the SG; that may be a more promising avenue. Also see if there are easy wins using [11].


bwplotka commented Nov 1, 2019

Just to ensure it is transparent: the data we used for your tests, @ppanyukov:

Data

  • synthetically generated (uses thanosbench blockgen)
  • data size:
    - metrics: 200
    - targets: 100
    - sample interval: 15s
    - total time interval: 10 hours

... had many chunks with a single sample each (due to some misconfiguration in the initial version of thanosbench), so while they can repro the OOM they are not really representative. (:

Still, it is extremely useful to see those ideas. I aggregated an umbrella issue for them here: #1705. I copied most of your ideas there! Let's propose ideas one by one in the next issues (:

Closing this in favor of #1705

bwplotka closed this as completed Nov 1, 2019