Discussion: Improve memory use for queries #1649

Closed
ppanyukov opened this issue Oct 14, 2019 · 4 comments

ppanyukov commented Oct 14, 2019

This is a tracking issue for memory utilisation in Thanos when serving queries, and the steps we want to take to improve it.

Core of the issue:

  • Both Querier and Store Gateway allocate unbounded amounts of memory during query execution.
  • This can and does lead to OOM in resource-limited container environments (k8s etc).

What we want is:

  • A list of ideas to consider and try;
  • Status of each idea based on discussions, tests etc;

Repro

Base build

thanos, version 0.7.0 (branch: master, revision: ed045447fef8c4b5ff62697559463549ff522f31)
  build user:       philip@Mayas-MacBook-Pro.local
  build date:       20191010-13:52:26
  go version:       go1.13.1

Data

  • synthetically generated (uses thanosbench blockgen)
  • data size:
    • metrics: 200
    • targets: 100
    • sample interval: 15s
    • total time interval: 10 hours

Query

  • count({__name__=~".+"}) by (__name__)
  • time span: all generated data
  • curl:
curl 'http://localhost:11902/api/v1/query_range?query=count(%7B__name__%3D~%22.%2B%22%7D)%20by%20(__name__)&dedup=true&partial_response=true&start=1568634900&end=1569844500&step=4838&max_source_resolution=0s&_=1570622122482'

Memory profile

  • MacBook Pro 16GB RAM
  • Store Gateway: 20GB peak
  • Querier: 10GB peak before time-out
  • See chart

Chart

  • Green is Store Gateway; Red is Querier
  • Note that the query timed out, so the chart does not show the Querier in its entirety.

[Screenshot 2019-10-10 at 14 42 08: memory usage chart]

List of ideas

Per-query limits and flow optimisations look to be the way to go.

[1.] Simple global memory limiter in Querier. Rejected. See #1631 .

[1.1] Simple global memory limiter in SG? I've just realised that it may actually be OOMs in the Store Gateway, not the Querier, that cause the most pain. Worth considering?

[2.] Limit concurrent queries, max samples, and series(?). As per Cortex: #1631 (comment)
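A rough sketch of what such a per-query limit could look like (everything below is hypothetical and purely illustrative, not existing Thanos or Cortex code; Cortex exposes similar limits as configuration):

import "fmt"

// queryLimiter tracks series and samples accumulated by a single query and
// fails fast once a configured threshold is exceeded (0 means no limit).
type queryLimiter struct {
  maxSeries  int
  maxSamples int
  series     int
  samples    int
}

func (l *queryLimiter) AddSeries(numSamples int) error {
  l.series++
  l.samples += numSamples
  if l.maxSeries > 0 && l.series > l.maxSeries {
    return fmt.Errorf("query hit the max series limit (%d)", l.maxSeries)
  }
  if l.maxSamples > 0 && l.samples > l.maxSamples {
    return fmt.Errorf("query hit the max samples limit (%d)", l.maxSamples)
  }
  return nil
}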

[3.] Frontend query caching. See #1006

Users are starting to use frontend query caching. This changes the characteristics of traffic against the Querier a lot, as it splits queries by day; maybe we can take that as an advantage?

[4.] Result streaming.

Why does it not stream out the results? Is it something to do with deduplication perhaps? Aggregation?

@bwplotka: Mainly because the SeriesSet API of PromQL requires full series, iterating series by series.

But in querier.go we see a comment which we may want to investigate further:

// TODO(@fabxc ): this could potentially pushed further down into the store API
// to make true streaming possible.
sortDedupLabels(resp.seriesSet, q.replicaLabels)

[5.] Prevent knowingly bad queries?

[6.] Forking.

Maybe a crazy idea, but what if we fork a process for each query with simple memory limits?
This could be a way to have per-query memory limits etc.

[7.] (@bwplotka) Count allocations (roughly) per user / query.

[8.] Chop [tmin, tmax] in Querier. Can we chop the requested [tmin, tmax] into smaller (1h?) chunks [tmin, T1], [T1, T2], ... [Tn, tmax], request those from the Store Gateway one by one, assemble the results in the Querier like we do now, and then send them over to the client? Is this similar to what frontend query caching would do? Not worth it, see #1667
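A toy sketch of the chopping itself (illustrative only; the step size and millisecond timestamps are assumptions, not existing Querier code):

// splitRange chops [tmin, tmax] (timestamps in ms) into sub-ranges of at most
// step ms each, so they could be requested from the Store Gateway one by one.
func splitRange(tmin, tmax, step int64) [][2]int64 {
  var out [][2]int64
  for start := tmin; start < tmax; start += step {
    end := start + step
    if end > tmax {
      end = tmax
    }
    out = append(out, [2]int64{start, end})
  }
  return out
}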

[9.] (@bwplotka) Can we loosen StoreAPI a bit to allow series to be aligned differently? This means the returned values could be unordered, thus opening up a way to stream chunks.

[10.] Intern labels (Labels []Label). These probably consume tons of memory (strings), and there are probably lots of identical duplicate lists. Why keep them all in memory as distinct objects? Why don't we do something like this at a point close to proto unmarshal:

func internLabels(labels []Label) []Label {
  // Hash the label set and reuse a previously seen identical set if there is one.
  h := hash(labels)
  if interned, found := lookup(h); found {
    return interned
  }

  addToMap(h, labels)
  return labels
}

possibly look at other things we can intern?
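For context, a slightly more complete sketch of such an interner (hypothetical, not an existing Thanos type; it keys on the concatenated label string instead of a hash, and Label stands in for the store proto label type):

import (
  "strings"
  "sync"
)

// labelInterner returns a canonical []Label for each distinct label set, so
// identical sets decoded from separate Series responses share one backing slice.
type labelInterner struct {
  mu   sync.Mutex
  sets map[string][]Label
}

func newLabelInterner() *labelInterner {
  return &labelInterner{sets: map[string][]Label{}}
}

func (i *labelInterner) intern(labels []Label) []Label {
  var b strings.Builder
  for _, l := range labels {
    b.WriteString(l.Name)
    b.WriteByte(0xff) // separators that cannot appear in valid UTF-8 label strings
    b.WriteString(l.Value)
    b.WriteByte(0xfe)
  }
  key := b.String()

  i.mu.Lock()
  defer i.mu.Unlock()
  if interned, ok := i.sets[key]; ok {
    return interned
  }
  i.sets[key] = labels
  return labels
}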

[11.] Related work in the Prometheus codebase to unify the labels package, which will hopefully allow/lead to fewer allocations (alloc-free casts etc.). See prometheus/prometheus#6029

[12.] Optimise for the special case of: Querier -> 1x Store Gateway -> 1x S3 bucket. This way we can essentially do straight streaming without having to merge the results, right? [What about dedup?]

[12.1] Taking [12] further, look at this case: Querier -> 1x Store Gateway -> [1x S3 bucket + 1x Prometheus]. Can we assert that for a lot of queries the data will be either in S3 or Prometheus, thus allowing optimisation [12]?

[12.2] Can we take this optimisation further and do smarter query partitioning and merging when required?


[999.] Other. Inventive use of OS/k8s/other products which may make queries better.

ppanyukov changed the title from "Improve memory use for queries" to "Discussion: Improve memory use for queries" on Oct 14, 2019

bwplotka commented Oct 14, 2019

What about just... a chunk limit? It's similar to 2. but we operate on chunks rather than samples at the StoreAPI level really.
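A rough sketch of what a chunk limit could look like (hypothetical and purely illustrative; the types below are simplified stand-ins, not the real storepb types):

import "fmt"

// Simplified stand-in for a series frame coming over the StoreAPI stream.
type Series struct {
  Chunks [][]byte
}

// chunkLimiter counts chunks fetched for a single query and aborts once the
// configured limit is exceeded (0 means no limit).
type chunkLimiter struct {
  limit int
  seen  int
}

func (c *chunkLimiter) Add(s Series) error {
  c.seen += len(s.Chunks)
  if c.limit > 0 && c.seen > c.limit {
    return fmt.Errorf("query fetched too many chunks: %d (limit %d)", c.seen, c.limit)
  }
  return nil
}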

  1. Frontend query caching helps a lot, but it's (a) optional and (b) just something to think about when designing any limits. By splitting by day we have a higher chance to load-balance and parallelise queries; this means a lower chance of hitting high memory consumption given a concurrency limit (if we do it right on the query frontend)

Why does it not stream out the results? Is it something to do with deduplication perhaps? Aggregation?

Mainly because the SeriesSet API of PromQL requires full series, iterating series by series. This is the same for StoreAPIs, but you can imagine that with fanout we need all results in order to merge them (essentially a merge sort) into a single stream of series sorted by labels. Not much we can do here. On top of that, deduplication is applied, but it works on a per-series basis, so that part is technically streaming.
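To illustrate the merge step, a toy sketch of merging already-sorted per-store results into one sorted stream (label sets are simplified to plain strings here; this is not the actual Thanos merge code):

// mergeSorted merges per-store series (each slice already sorted by labels)
// into one globally sorted list; the Querier needs data from every store
// before it can emit even the first series in the right order.
func mergeSorted(stores [][]string) []string {
  var out []string
  idx := make([]int, len(stores))
  for {
    best := -1
    for s, series := range stores {
      if idx[s] >= len(series) {
        continue
      }
      if best == -1 || series[idx[s]] < stores[best][idx[best]] {
        best = s
      }
    }
    if best == -1 {
      return out
    }
    out = append(out, stores[best][idx[best]])
    idx[best]++
  }
}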

Also 8. Count allocations (roughly) per user (:

@ppanyukov

Count allocations (roughly) per user (:

Are we talking about memory accounting per query here?

@ppanyukov

Did a super quick trial of [10] with label interning in the Querier. Not sure it made much difference with my test data set, but it was a super quick and dirty job. Will look into this in more detail and in the SG; that may be a more promising avenue. Also see if there are easy wins using [11].


bwplotka commented Nov 1, 2019

Just to ensure it is transparent: the data we used for your tests, @ppanyukov:

Data

  • synthetically generated (uses thanosbench blockgen)
  • data size:
    - metrics: 200
    - targets: 100
    - sample interval: 15s
    - total time interval: 10 hours

... had many chunks with a single sample each (due to some misconfiguration in the initial version of thanosbench), so while they can repro the OOM they are not really representative. (:

Still, it is extremely useful to see those ideas. I aggregated an umbrella issue for them here: #1705. I copied most of your ideas there! Let's propose ideas one by one in the next issues (:

Closing this in favor of #1705

bwplotka closed this as completed Nov 1, 2019