Guidelines scaling thanos compact #1964

Closed

jaseemabid opened this issue Jan 8, 2020 · 7 comments

@jaseemabid (Contributor)

At Monzo we have over 100TB of Thanos metrics and we are noticing some serious performance bottlenecks. Since downsampling must run as a singleton, we are building up a huge backlog: the maximum throughput we can get out of our current setup is around 16MB/s. Compaction performance is comparable, but we manage that with a few shards.

This issue is a request for guidelines for other people operating at a similar scale.

Here is a sample downsampling event.

level=info ts=2020-01-08T13:44:33.768943954Z caller=downsample.go:284 msg="downsampled block" from=01DWRR1MBJE3G61NY0PKB91ZVT to=01DY2F87NB1S6J2J61PAXXXJ89 duration=2h4m45.237070743s

This is a 2w-long block at 5m resolution being downsampled to 1h.

01DWRR1MBJE3G61NY0PKB91ZVT is about 244 chunk files of 512MB each, with a 1.5GB index.
numSamples: 29904945369, numSeries: 7777499, numChunks: 73985300

01DY2F87NB1S6J2J61PAXXXJ89 has a 1.2GB index and 26 chunk files of 512MB each.
numSamples: 2546495789, numSeries: 7777499, numChunks: 28488448

Reducing the resolution from 5m to 60m led to roughly a 12x reduction in the number of samples (matching the 60m/5m ratio), around a 2.6x reduction in the number of chunks, and close to a 9x reduction in total storage. The number of series is the same, as expected.

0:23 spent downloading data (11:17 -> 11:40)
2:04 spent downsampling 120GB at about 16MB/s (120GB / 16MB/s ≈ 7,500s ≈ 2h05m, consistent with the duration above)
CPU pegged at 1 core, with ~20MB/s read from disk and about 30GB of memory usage.

It looks like the bottleneck is the single-threaded Thanos process rather than the network or disk.
Any tips on how to make this faster? I thought I'd start a discussion here before CPU-profiling the compactor.
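
As far as I can tell, Thanos serves Go's /debug/pprof endpoints on the same HTTP port as /metrics (10902 by default for compact), so a CPU profile of the running compactor can be grabbed with the standard Go tooling; the host and duration below are placeholders:

    # Assumes the compactor's HTTP port is reachable and that this Thanos build
    # registers the net/http/pprof handlers on it.
    go tool pprof "http://<compactor-host>:10902/debug/pprof/profile?seconds=60"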

This is one of the smaller blocks; we have seen some raw blocks approach almost a TB in size, and some downsampling/compaction runs take 11-12 hours.


Environment:

  • Thanos v0.8.1, using the official Docker image
  • Thanos runs inside Kubernetes pods on an AWS r4.8xlarge instance, with plenty of CPU and memory
  • Linux debug 4.19.43-coreos-r1 #1 SMP Tue Jun 11 07:24:09 -00 2019 x86_64 GNU/Linux
  • Data stored in S3

@bwplotka (Member) commented Jan 9, 2020

Thanks for the detailed description of the problem (:

Are you sharding, in the end? If yes, how far can you get with sharding? Sharding helps to quickly catch up with downsampling in the same way it does for compaction.

Within one "block stream", 11-12h is indeed a bit slow, although compaction to 2w blocks (and downsampling of 2w blocks) happens at most every 2w, so the compactor can usually catch up in between.

Anyway, let's discuss what we can improve here. Some initial thoughts:

  1. I believe we can add more concurrency to the downsampling process (we currently process each series sequentially); a rough sketch of what that could look like follows this list.
  2. Adding
  3. From v0.10.0 (to be released soon) block loading memory consumption should be reduced. This means we might be able to add more concurrency within a single shard (downsample multiple block "streams" at once). It should also reduce the memory usage of compaction, so we could likewise increase concurrency and compact multiple streams concurrently. As a side effect, compaction will take even longer in v0.10.0, see this. I believe this might make the latency even worse in your case.
  4. What's the cardinality of your blocks? How many series and samples?
  5. I wonder if it makes sense to split a block at some point once it reaches TB size... even upload and load take a lot of time, not to mention manual operations/debugging if needed.
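
To make point 1 concrete, here is a rough, hypothetical sketch (not Thanos code) of fanning per-series downsampling work out to a fixed pool of goroutines; downsampleSeries, the numeric series IDs and the worker count are all stand-ins:

    // Hypothetical sketch only: spread per-series work across a worker pool
    // instead of handling each series sequentially in a single goroutine.
    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    // downsampleSeries stands in for the real per-series aggregation; in the
    // compactor this would read the series' chunks and emit 1h aggregates.
    func downsampleSeries(id int) string {
        return fmt.Sprintf("series %d downsampled", id)
    }

    func main() {
        seriesIDs := make(chan int)
        results := make(chan string)

        workers := runtime.NumCPU() // e.g. one worker per core
        var wg sync.WaitGroup
        for i := 0; i < workers; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for id := range seriesIDs {
                    results <- downsampleSeries(id)
                }
            }()
        }

        go func() {
            for id := 0; id < 100; id++ { // stand-in for iterating the block's series
                seriesIDs <- id
            }
            close(seriesIDs)
            wg.Wait()
            close(results)
        }()

        for r := range results {
            fmt.Println(r)
        }
    }

The catch is that the output series still have to be written to the new block's index in sorted order, so results would need buffering or re-ordering before the write, which is part of what makes this non-trivial.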

cc @pracucci
cc @brian-brazil (:

@bandesz (Contributor) commented Jan 9, 2020

@bwplotka I work with @jaseemabid. It seems to us that downsampling doesn't honour the relabel config in the latest release; it lists blocks with a plain iteration over the whole bucket:

err := bkt.Iter(ctx, "", func(name string) error {

I've seen in the latest commits that the new meta syncer solves this problem, so I suspect we have to wait until 0.10 is released or run from master.
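
For context, we shard by applying a selector relabel config (the --selector.relabel-config flag) against each block's external labels, so each compactor only processes the blocks that survive the relabelling. A minimal sketch, where the label name and regex are just illustrative for our setup:

    - action: keep
      source_labels: ["cluster"]
      regex: "prod-eu-.*"

Each compactor shard gets a different regex, and together the shards need to cover every block exactly once; as noted above, the downsampling step did not honour this selector before 0.10.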

Just for reference, our biggest index files are currently around 40GB (but previously we managed to hit the 64GB TSDB index size limit) and I think the biggest block had ~2400 chunk files, so around 1.2TB.

stale bot commented Feb 8, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

stale bot added the stale label Feb 8, 2020

@bwplotka (Member) commented Feb 8, 2020

In review: #1922

stale bot removed the stale label Feb 8, 2020

@aaron-trout commented Feb 26, 2020

We are also running into similar issues, so it's good to see we are not the only ones 😆

[Screenshot: compactor CPU and memory usage graphs]

Our metrics bucket is a lot smaller than Monzo's, but as you can see, the CPU is still the bottleneck here. One issue, I think, is a gap in the Thanos docs: there are no guidelines on what numbers are sensible for the compactor. Here are the flags we are using on the compactor:

        - --log.level=debug
        - --retention.resolution-raw=14d
        - --retention.resolution-5m=6m
        - --retention.resolution-1h=10y
        - --consistency-delay=30m
        - --objstore.config-file=/etc/thanos/objstore-config.yaml
        - --data-dir=/data
        - --wait

I did try bumping --compact.concurrency from the default of 1, but the compact process still did not seem to use more than one core.

Another thing which would be good to know: is there an easy way to look at the current status of the metrics in the bucket, i.e. to find out how much of a backlog of work the compactor has right now? Presumably there is at least some backlog, since the compactor is constantly doing work; as soon as it finishes working on some blocks there are always more it can pick up right away.
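
One option that should give a rough view is the bucket inspect subcommand, which (if I read the docs right) prints a table of every block with its time range, resolution and compaction level, so the compaction/downsampling backlog can be eyeballed from it. A sketch of the invocation, assuming the same objstore config file as above (on newer Thanos releases the same command lives under "thanos tools bucket"):

    thanos bucket inspect \
      --objstore.config-file=/etc/thanos/objstore-config.yaml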

In other news though, the 0.10 release certainly did reduce memory usage! In the staging cluster pictured above, the memory usage of the compactor went down from ~4GB to <0.5GB!

@bwplotka (Member) commented Feb 26, 2020 via email

stale bot commented Mar 27, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

stale bot added the stale label Mar 27, 2020
stale bot closed this as completed Apr 3, 2020