irate/rate query issue on the deduplicated metrics #5025

Closed
gabrielbatir opened this issue Jan 4, 2022 · 7 comments
@gabrielbatir

Thanos, Prometheus and Golang version used:

Thanos is deployed with kube-prometheus. The issue was first detected on Thanos 0.23.1. I upgraded only Thanos Query to 0.24.0, but the issue persists.

Thanos compact:
compact --wait --log.level=info --log.format=logfmt --objstore.config=$(OBJSTORE_CONFIG) --data-dir=/var/thanos/compact --debug.accept-malformed-index --retention.resolution-raw=360d --retention.resolution-5m=360d --retention.resolution-1h=360d --delete-delay=8h --deduplication.replica-label=prometheus_replica --deduplication.replica-label=rule_replica --deduplication.replica-label=replica

Thanos query:
query --grpc-address=0.0.0.0:10901 --http-address=0.0.0.0:9090 --log.level=debug --log.format=logfmt --query.replica-label=prometheus_replica --query.replica-label=rule_replica --query.replica-label=replica --store=dnssrv+_grpc._tcp.prometheus-network-monitoring-thanos-sidecar.network-monitoring.svc.cluster.local --store=dnssrv+_grpc._tcp.thanos-store-network-monitoring.network-monitoring.svc.cluster.local --query.timeout=5m --query.lookback-delta=15m --query.auto-downsampling

Prometheus: quay.io/prometheus/prometheus:v2.29.2
--web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries --config.file=/etc/prometheus/config_out/prometheus.env.yaml --storage.tsdb.path=/prometheus --storage.tsdb.retention.time=5d --web.enable-lifecycle --storage.tsdb.no-lockfile --web.external-url=https://prom-network-monitoring --web.route-prefix=/ --storage.tsdb.max-block-duration=2h --storage.tsdb.min-block-duration=2h

Thanos sidecar version: 0.23.1

Object Storage Provider:
Minio

What happened:
This is one of the metrics where this issue manifests:
[image: counter]

This is what an irate looks like:
[image: irate]

The red line is the irate of the deduplicated metric while the yellow and blue are the irates for the metrics from prometheus.

After pressing Execute multiple times, the graph sometimes looks like this:
[image: irate2]

What you expected to happen:
I would expect the irate of the deduplicated metric to look like the original from Prometheus.

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:
I could not find any error in thanos compact or thanos query logs.

Anything else we need to know:
This is what a rate query looks like on the same metric:
[image: rate]

It looks somewhat better than the irate query, but I would have expected it to be closer to the query on the original data from Prometheus.

This is what the values of the original and deduplicated metrics look like:

[image: counter_values]
The third line is the deduplicated metric.

Is this a bug or something wrong in my configuration?

@yeya24
Contributor

yeya24 commented Jan 4, 2022

Deduplication doesn't work in your case because it is meant for 1:1 deduplication, so mainly for receiver setups.
With HA Prometheus you cannot deduplicate this way, and the two series are merged into one. I think that's why irate doesn't work for you. If you disable deduplication, it should work.

@gabrielbatir
Author

Now I noticed that the deduplicated stream actually contains the metrics from both prometheus replicas and in most cases the values are identical for both.
I am not really sure how irate works but could this be the issue?

Could this be fixed by using --deduplication.func=penalty for the compactor? From the docs this option seems to break irate in other ways.

@gabrielbatir
Author

@yeya24 From what I can see at https://thanos.io/tip/components/compact.md/#enabling-vertical-compaction, I need to use that hidden flag to enable vertical compaction.
From that page I understand that it is not enabled by default. Yet in my case, as you can see above, I do not set it, but compact still does vertical compaction.

Maybe the note ("NOTE: This flag is ignored and (enabled) when --deduplication.replica-label flag is set.") could be put in the documentation.

@RayHuangCN
Contributor

I ran into the same case.

@yeya24
Contributor

yeya24 commented Jan 5, 2022

> Could this be fixed by using --deduplication.func=penalty for the compactor? From the docs this option seems to break irate in other ways.

Ideally it should solve the issue, but the functionality is currently not very stable when downsampling is also enabled, so I wouldn't recommend enabling it.

> I am not really sure how irate works but could this be the issue?

irate calculates the rate from only the last two data points of your series.
For example, suppose we have two series:

{replica=1} 1@1641055871  10000@1641055886
{replica=2} 1@1641055870  10001@1641055885

If the irate function is called at time 1641055885, the last two points of the merged series are 1@1641055871 and 10001@1641055885, so the value difference is 10000.
If it is called at any time from 1641055871 to 1641055884, the last two points are 1@1641055870 and 1@1641055871, so the value difference is 0.

This is just an example. If you are using rate, it might go wrong too, when the merged counter values appear to reset by mistake.
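
To make the example concrete, here is a minimal Python sketch; it is not Thanos or Prometheus code. It assumes the two replica series are naively interleaved by timestamp (the hypothetical `merge` helper) and uses a simplified `irate` that ignores the range window and lookback. The sample values come from the example above; all function names are illustrative.

```python
from bisect import bisect_right

# Samples as (timestamp, value), taken from the example above.
replica_1 = [(1641055871, 1.0), (1641055886, 10000.0)]
replica_2 = [(1641055870, 1.0), (1641055885, 10001.0)]


def merge(*series):
    """Interleave samples from all replicas by timestamp (assumed naive merge)."""
    return sorted(sample for one in series for sample in one)


def irate(samples, eval_time):
    """Per-second rate from the last two samples at or before eval_time.

    Simplified: no range window, no lookback. Returns None when fewer than
    two samples are available or when both share the same timestamp.
    """
    timestamps = [t for t, _ in samples]
    end = bisect_right(timestamps, eval_time)
    if end < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = samples[end - 2], samples[end - 1]
    if t_last == t_prev:
        return None
    # A drop in a counter is treated as a reset, so the whole last value counts.
    delta = v_last if v_last < v_prev else v_last - v_prev
    return delta / (t_last - t_prev)


merged = merge(replica_1, replica_2)
for t in (1641055880, 1641055885, 1641055886):
    print(t, irate(merged, t))
print("replica 1 only:", irate(replica_1, 1641055886))
```

With these inputs, the merged series gives 0 at 1641055880, about 714 per second at 1641055885, and 10000 per second at 1641055886, where the drop from 10001 to 10000 is taken for a counter reset. Replica 1 alone gives about 667 per second at 1641055886.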

> Maybe the note at could be put in the documentation.

Yeah that's a good idea. Would you like to contribute and update the doc?

@stale

stale bot commented Apr 17, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Apr 17, 2022
@stale

stale bot commented May 1, 2022

Closing for now as promised, let us know if you need this to be reopened! 🤗
