receive: high cpu when upgrading from 0.12.2 with old data #2793

Closed
sepich opened this issue Jun 23, 2020 · 18 comments

@sepich
Contributor

sepich commented Jun 23, 2020

Thanos, Prometheus and Golang version used:
thanos, version 0.13.0 (branch: HEAD, revision: adf6fac)
build user: circleci@b8cd18f8b553
build date: 20200622-10:04:50
go version: go1.14.2

Object Storage Provider:
GCP

What happened:
We're using 3 thanos-receive v0.12.2 pods running with --receive.replication-factor=3.
At 3:30 I restarted one pod (green line) as v0.13.0, and its CPU usage doubled:
[image: CPU usage graph per pod]
Memory usage is 5-10% higher, which is fine.
Here is another graph, from the node where the pod has been running:
[image: node CPU usage graph]

What you expected to happen:
A statistically negligible change in resource usage between v0.12.2 and v0.13.0 for receive, as with other Thanos components.

How to reproduce it (as minimally and precisely as possible):
We run receive with the following args (a sketch of the referenced hashrings file follows the block):

          args:
          - receive
          - |
            --objstore.config=type: GCS
            config:
              bucket: jb-thanos
          - --tsdb.path=/data
          - --tsdb.retention=1d
          - --tsdb.min-block-duration=30m  # must be >2x the Prometheus-side 15m, see https://github.com/thanos-io/thanos/issues/2114
          - --tsdb.max-block-duration=30m
          - --label=replica="$(NAME)"
          - --receive.local-endpoint=$(NAME).thanos-receive.monitoring.svc.cluster.local:10901
          - --receive.hashrings-file=/cfg/hashrings.json
          - --receive.replication-factor=3
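
For reference, a minimal sketch of what the /cfg/hashrings.json referenced above could look like for this 3-replica setup (a single hashring; the endpoint names are assumptions based on the local-endpoint naming above):

[
  {
    "hashring": "default",
    "endpoints": [
      "thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901",
      "thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901",
      "thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901"
    ]
  }
]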

Full logs to relevant components:
These are new events which were not present in v0.12.2. They are also written every ~15s, compared to every ~15min in Prometheus v2.19.0 with almost the same TSDB settings (a rough query to quantify this follows the logs):

Logs

level=info ts=2020-06-23T00:37:15.498358444Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=308.339725ms
level=info ts=2020-06-23T00:37:22.528425096Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=287.226708ms
level=info ts=2020-06-23T00:37:25.321693858Z caller=head.go:734 component=receive tenant=default-tenant component=tsdb msg="WAL checkpoint complete" first=51 last=52 duration=2.793211656s
level=info ts=2020-06-23T00:37:34.117011112Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=305.063766ms
level=info ts=2020-06-23T00:37:42.249223736Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=353.091759ms
level=info ts=2020-06-23T00:37:45.317240183Z caller=head.go:734 component=receive tenant=default-tenant component=tsdb msg="WAL checkpoint complete" first=53 last=54 duration=3.067957396s
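
A rough way to quantify this from receive's own metrics (the head GC duration metric comes from the embedded TSDB; the label selector is an assumption matching our deployment labels):

rate(prometheus_tsdb_head_gc_duration_seconds_count{app="thanos-receive"}[5m])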

Anything else we need to know:

@brancz
Member

brancz commented Jun 23, 2020

Could you take a CPU profile of this?
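
For reference, Thanos components expose the standard Go pprof handlers on their HTTP port (10902 by default), so a 30s CPU profile can be captured roughly like this (pod name assumed from the setup above):

kubectl port-forward pod/thanos-receive-1 10902:10902
go tool pprof -seconds 30 http://localhost:10902/debug/pprof/profile
# or just save the raw profile for attaching:
curl -o cpu.pb.gz 'http://localhost:10902/debug/pprof/profile?seconds=30'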

@sepich
Contributor Author

sepich commented Jun 23, 2020

Please find 30s CPU profiles attached; they were taken at almost the same time:
Archive.zip

@bwplotka
Member

bwplotka commented Jun 23, 2020

Thanks, looks like GC is the current bottleneck:

DIFF 0.12.2 vs 0.13.0:
profile001.pdf
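
For anyone reproducing: such a diff can be produced with pprof's diff_base option (the file names here are placeholders for the attached profiles):

go tool pprof -diff_base=cpu-0.12.2.pb.gz cpu-0.13.0.pb.gz
(pprof) top      # functions with the largest increase, GC dominates
(pprof) pdf      # render the diff graph to a PDF like the one attached (needs graphviz)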

@bwplotka
Member

We can repro this on our side, looking into this more (:

@sepich
Contributor Author

sepich commented Jun 29, 2020

JFYI,
I've backported #2621 and #2679 to v0.12.2, and it does not lead to a CPU usage increase.

@bwplotka
Member

I think it's multiTSDB so #2012

@bwplotka
Member

It's not specific to v0.13.0, but this code certainly does not help: https://github.com/prometheus/prometheus/blob/b788986717e1597452ca25e5219510bb787165c7/tsdb/db.go#L812

The TSDB we use forces a GC multiple times. That might make sense for a single TSDB, but not really for multiple; see the sketch below.
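
A tiny Go illustration of why this multiplies with multiTSDB (illustration only, not the actual Thanos/Prometheus code; only the forced runtime.GC() mirrors the linked upstream line):

package main

import (
	"fmt"
	"runtime"
)

// compactHead stands in for the per-tenant head truncation done by db.Compact.
func compactHead(tenant string) {
	// ... truncate the in-memory head for this tenant ...
	runtime.GC() // upstream TSDB forces a GC here; cheap once, expensive N times
	fmt.Printf("tenant %s: head truncated, forced runtime.GC()\n", tenant)
}

func main() {
	// With multiTSDB every tenant TSDB runs its own compaction cycle,
	// so one receive instance pays for one forced GC per tenant per cycle.
	tenants := []string{"default-tenant", "tenant-a", "tenant-b"}
	for _, t := range tenants {
		compactHead(t)
	}
}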

@sepich
Contributor Author

sepich commented Jun 30, 2020

They are also written every ~15s, compared to every ~15min in Prometheus v2.19.0 with almost the same TSDB settings:

It turns out this high CPU usage happens when I start thanos-receive v0.13.0 on a non-empty TSDB folder from v0.12.2 (which has many subfolders like 01EC0YX4APXB1MGA39DWFTS96C, etc.).
This leads to each subfolder being converted to a separate TSDB:

...
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1KGAFNHYWZ6Z8CPMZ8TQQ6"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1HSC4SJ9KSZRYENXX9PS18"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1G2E84CNBJ3NFY7P4PE5RW"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1EBGTBP4HRW2WGG9KEXN7P"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1CMKR2521YKPS1283WGBKJ"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1AXMY36Z98AQ64QQRD10MP"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC196PXDRFDA39FW476KR7SJ"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC17FSYJY8AWTYXCP3QW9J9W"}
...

And then the log is full of component=tsdb msg="Head GC completed" events.
If I wipe the TSDB folder before starting v0.13.0, it consumes the same CPU as v0.12.2.
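
To illustrate the mismatch (the exact multiTSDB per-tenant directory names here are my assumption, based on tenant=default-tenant in the logs above):

# v0.12.2 wrote the WAL and block dirs directly under --tsdb.path:
/data/wal/
/data/01EC0YX4APXB1MGA39DWFTS96C/
/data/01EC1KGAFNHYWZ6Z8CPMZ8TQQ6/
# v0.13.0 multiTSDB expects one sub-directory per tenant, each holding its own TSDB:
/data/default-tenant/wal/
/data/default-tenant/<block ULIDs>/
# so every leftover block dir from 0.12.2 is opened as its own "tenant" TSDB on startup.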

@bwplotka
Member

Also #2823

@bwplotka
Member

This looks like something we experienced as well, but now that I recall it, we indeed wiped the dir.

It's a migration issue from non-multiTSDB to multiTSDB.

This can easily be reproduced in a unit test when porting old receive data. Let's fix it; it might be annoying for Receive users who want to upgrade to 0.13+.

@bwplotka bwplotka changed the title receive: high cpu in v0.13.0 receive: high cpu when upgrading from 0.12.2 with old WAL Jun 30, 2020
@bwplotka bwplotka changed the title receive: high cpu when upgrading from 0.12.2 with old WAL receive: high cpu when upgrading from 0.12.2 with old data Jun 30, 2020
@bwplotka
Member

I think for now you can let it run with the higher CPU long enough for the WAL from 0.12 to no longer exist, then wipe those dirs. In the meantime we will look into a fix. Still, help wanted (:
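
Rough sketch of that workaround (pod name and paths assumed from the setup above; only do this once the old 0.12.x blocks are uploaded to the bucket and past retention):

kubectl exec thanos-receive-1 -- ls /data                  # leftover 0.12.x ULID dirs show up here
kubectl exec thanos-receive-1 -- sh -c 'rm -rf /data/01*'  # wipe the old per-block "tenant" dirs
kubectl delete pod thanos-receive-1                        # restart so receive reopens a clean multiTSDB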

@brancz
Member

brancz commented Jul 1, 2020

Those look like the migration didn't work correctly; the dirs look like block directories. Just to make sure though, you are not running a multi-tenant setup, right?

@sepich
Contributor Author

sepich commented Jul 1, 2020

Correct, we're not running multi-tenancy, and these are just block dirs.

@stale

stale bot commented Jul 31, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jul 31, 2020
@stale

stale bot commented Aug 7, 2020

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Aug 7, 2020
@bwplotka
Member

bwplotka commented Aug 7, 2020

Still, the migration path has to be fixed in theory.

@bwplotka bwplotka reopened this Aug 7, 2020
@stale stale bot removed the stale label Aug 7, 2020
@stale

stale bot commented Sep 6, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Sep 6, 2020
@stale

stale bot commented Sep 13, 2020

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Sep 13, 2020