ruler: Thanos Ruler is miscalculating long time ranges in queries #7070

Open
KingRebo38 opened this issue Jan 17, 2024 · 0 comments

Thanos, Prometheus and Golang version used:

Thanos: v0.34.0-rc.0
Prometheus: v2.45.0

Object Storage Provider:
AWS S3

What happened:

We are using Thanos to store long-term data for our business metrics. Previously we had a retention time of 40 days on Prometheus in order to meet business requirements. To save memory and resources, we decided to reduce the retention to just 7 days and rely solely on Thanos Ruler for our long-term aggregations.

Currently we have the following retention configuration for our Thanos storage:

retentionResolutionRaw = 33d
retentionResolution5m = 120d
retentionResolution1h = 1y
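
These are the Helm-chart values we set; assuming the chart passes them through unchanged, they should map to the Thanos Compactor retention flags roughly like this (sketch, not our literal manifest):

args:
  - compact
  - '--retention.resolution-raw=33d'
  - '--retention.resolution-5m=120d'
  - '--retention.resolution-1h=1y'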

After making the switch and migrating all our long-term rules from Prometheus to Thanos Ruler, the resulting metrics are calculated incorrectly:

[Image: graph of the recorded 14-day request count, dropping from roughly 6.5 million before the switch to roughly 2.5 million after it]

As you can see in the image, before the switch we had about 6.5 million requests in the last 14 days. After the switch that number plummeted to about 2.5 million. After further investigation we noticed that the displayed numbers are very close to our 7-day aggregations, which suggests that the ruler does not receive all the data from our querier. The gaps inside the time series are not important; that is the time when the ruler wasn't running.
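
To illustrate the suspicion: the values we see are roughly what you get when comparing a 7-day window against the intended 14-day window (simplified sketch; the bare sum() is a placeholder for our actual aggregation):

# what the 14d recording rule should roughly correspond to
sum(increase(kong_http_requests_total[14d]))

# what the recorded values actually end up close to (7d = our Prometheus retention)
sum(increase(kong_http_requests_total[7d]))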

We are using rules like the following to make our aggregations:

rules:
  - expr: increase(kong_http_requests_total[14d])
    record: kong_http_requests_total:rate:14d
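
For context, this sits in a standard rule file loaded via --rule-file; a sketch of the full file shape (the group name and evaluation interval here are placeholders, not our exact values):

groups:
  - name: kong-long-term    # placeholder group name
    interval: 1h            # placeholder evaluation interval
    rules:
      - record: kong_http_requests_total:rate:14d
        expr: increase(kong_http_requests_total[14d])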

What you expected to happen:

[Image: the same recorded metric on our development stage, where the graph simply continues after the switch]

The behaviour should have been similar to this image, which is from our development stage where we made a similar switch: the graph simply continues after the switch. We could not find any meaningful differences between our development stage and this stage. We did experience a similar behaviour where the metric

How to reproduce it (as minimally and precisely as possible):

Sadly I don't know how to reproduce this issue outside of our own setup.

Configuration of our components

Thanos ruler

args:
  - rule
  - '--data-dir=/var/thanos/store'
  - '--log.level=info'
  - '--log.format=logfmt'
  - '--http-address=0.0.0.0:10902'
  - '--grpc-address=0.0.0.0:10901'
  - '--objstore.config-file=/etc/config/object-store.yaml'
  - '--rule-file=/etc/rules/*.yaml'
  - '--label=component="ruler"'
  - '--query=dnssrv+_http._tcp.kube-prometheus-stack-query-frontend-http.monitoring.svc.cluster.local'

Thanos Query Frontend

args:
  - query-frontend
  - '--log.level=debug'
  - '--log.format=logfmt'
  - '--http-address=0.0.0.0:10902'
  - '--query-range.split-interval=24h'
  - '--query-range.max-retries-per-request=5'
  - '--query-range.max-query-parallelism=14'
  - '--query-range.response-cache-max-freshness=1m'
  - >-
    --query-range.response-cache-config-file=etc/config/memcached-config.yaml
  - >-
    --query-frontend.downstream-url=http://kube-prometheus-stack-query-http.monitoring.svc.cluster.local:10902
  - '--query-frontend.compress-responses'
  - >-
    --labels.response-cache-config-file=etc/config/memcached-config.yaml
  - '--labels.split-interval=24h'
  - '--labels.max-retries-per-request=5'
  - '--labels.max-query-parallelism=14'
  - '--labels.response-cache-max-freshness=1m'

Thanos Query

args:
    - query
    - '--log.level=debug'
    - '--log.format=logfmt'
    - '--grpc-address=0.0.0.0:10901'
    - '--http-address=0.0.0.0:10902'
    - '--query.replica-label=prometheus_replica'
    - '--query.auto-downsampling'
    - '--store.sd-dns-resolver=miekgdns'
    - >-
      --store=dnssrv+_grpc._tcp.kube-prometheus-stack-store-grpc.monitoring.svc.cluster.local
    - >-
      --store=dnssrv+_grpc._tcp.kube-prometheus-stack-thanos-sidecar.monitoring.svc.cluster.local
    - >-
      --store=dnssrv+_grpc._tcp.kube-prometheus-stack-rule-grpc.monitoring.svc.cluster.local
    - '--store.sd-interval=5m'

Thanos store

args:
    - store
    - '--data-dir=/var/thanos/store'
    - '--log.level=info'
    - '--log.format=logfmt'
    - '--http-address=0.0.0.0:10902'
    - '--grpc-address=0.0.0.0:10901'
    - '--objstore.config-file=/etc/config/object-store.yaml'
    - '--index-cache-size=250MB'
    - '--chunk-pool-size=2GB'
    - '--store.grpc.series-max-concurrency=20'
    - '--sync-block-duration=3m'
    - '--block-sync-concurrency=20'
    - '--max-time=2y'

Thanos Sidecar

args:
    - sidecar
    - '--prometheus.url=http://127.0.0.1:9090/'
    - '--grpc-address=[$(POD_IP)]:10901'
    - '--http-address=[$(POD_IP)]:10902'
    - '--objstore.config=$(OBJSTORE_CONFIG)'
    - '--tsdb.path=/prometheus'
    - '--log.level=info'
    - '--log.format=logfmt'

Prometheus

args:
    - '--web.console.templates=/etc/prometheus/consoles'
    - '--web.console.libraries=/etc/prometheus/console_libraries'
    - '--config.file=/etc/prometheus/config_out/prometheus.env.yaml'
    - '--storage.tsdb.path=/prometheus'
    - '--storage.tsdb.retention.time=7d'
    - '--web.enable-lifecycle'
    - '--storage.tsdb.no-lockfile'
    - >-
      --web.external-url=http://kube-prometheus-stack-prometheus.monitoring:9090
    - '--web.route-prefix=/'
    - '--storage.tsdb.max-block-duration=2h'
    - '--storage.tsdb.min-block-duration=2h'

Full logs to relevant components:

We were not able to find any logs that could be directly correlated to this issue. If you have an idea as to where we need to look to find useful logs, please let me know.

Anything else we need to know:
