ruler: Thanos Ruler is miscalculating long time ranges in queries #7070

Open
KingRebo38 opened this issue Jan 17, 2024 · 0 comments

Thanos, Prometheus and Golang version used:

Thanos: v0.34.0-rc.0
Prometheus: v2.45.0

Object Storage Provider:
AWS S3

What happened:

We are using Thanos to store long-term data for our business metrics. Previously we had a retention time of 40 days on Prometheus in order to meet business requirements. To save memory and resources, we decided to reduce the retention to just 7 days and rely solely on Thanos Ruler for our long-term aggregations.

Currently we have the following retention configuration for our Thanos storage:

retentionResolutionRaw = 33d
retentionResolution5m = 120d
retentionResolution1h = 1y
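
These are the Helm-chart values we set; assuming the chart passes them through unchanged, they should map to the Thanos Compactor retention flags roughly like this (sketch, not our literal manifest):

args:
  - compact
  - '--retention.resolution-raw=33d'
  - '--retention.resolution-5m=120d'
  - '--retention.resolution-1h=1y'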

After making the switch and migrating all our long-term rules from Prometheus to Thanos Ruler, the resulting metrics are calculated incorrectly:

[Image: graph of the recorded 14-day request count, dropping from roughly 6.5 million before the switch to roughly 2.5 million after it]

As you can see in the image, before the switch we had about 6.5 million requests in the last 14 days. After the switch that number plummeted to about 2.5 million. After further investigation we noticed that the displayed numbers are very close to our 7-day aggregations, which suggests that the ruler does not receive all the data from our querier. The gaps inside the time series are not important; that is the time when the ruler wasn't running.
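
To illustrate the suspicion: the values we see are roughly what you get when comparing a 7-day window against the intended 14-day window (simplified sketch; the bare sum() is a placeholder for our actual aggregation):

# what the 14d recording rule should roughly correspond to
sum(increase(kong_http_requests_total[14d]))

# what the recorded values actually end up close to (7d = our Prometheus retention)
sum(increase(kong_http_requests_total[7d]))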

We are using rules like the following to make our aggregations:

rules:
  - expr: increase(kong_http_requests_total[14d])
    record: kong_http_requests_total:rate:14d
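
For context, this sits in a standard rule file loaded via --rule-file; a sketch of the full file shape (the group name and evaluation interval here are placeholders, not our exact values):

groups:
  - name: kong-long-term    # placeholder group name
    interval: 1h            # placeholder evaluation interval
    rules:
      - record: kong_http_requests_total:rate:14d
        expr: increase(kong_http_requests_total[14d])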

What you expected to happen:

[Image: the same recorded metric on our development stage, where the graph simply continues after the switch]

The behaviour should have been similar to this image, which is from our development stage where we made a similar switch: the graph simply continues after the switch. We could not find any meaningful differences between our development stage and this stage. We did experience a similar behaviour where the metric

How to reproduce it (as minimally and precisely as possible):

Sadly I don't know how to reproduce this issue outside of our own setup.

Configuration of our components

Thanos ruler

args:
  - rule
  - '--data-dir=/var/thanos/store'
  - '--log.level=info'
  - '--log.format=logfmt'
  - '--http-address=0.0.0.0:10902'
  - '--grpc-address=0.0.0.0:10901'
  - '--objstore.config-file=/etc/config/object-store.yaml'
  - '--rule-file=/etc/rules/*.yaml'
  - '--label=component="ruler"'
  - '--query=dnssrv+_http._tcp.kube-prometheus-stack-query-frontend-http.monitoring.svc.cluster.local'

Thanos Query Frontend

args:
  - query-frontend
  - '--log.level=debug'
  - '--log.format=logfmt'
  - '--http-address=0.0.0.0:10902'
  - '--query-range.split-interval=24h'
  - '--query-range.max-retries-per-request=5'
  - '--query-range.max-query-parallelism=14'
  - '--query-range.response-cache-max-freshness=1m'
  - >-
    --query-range.response-cache-config-file=etc/config/memcached-config.yaml
  - >-
    --query-frontend.downstream-url=http://kube-prometheus-stack-query-http.monitoring.svc.cluster.local:10902
  - '--query-frontend.compress-responses'
  - >-
    --labels.response-cache-config-file=etc/config/memcached-config.yaml
  - '--labels.split-interval=24h'
  - '--labels.max-retries-per-request=5'
  - '--labels.max-query-parallelism=14'
  - '--labels.response-cache-max-freshness=1m'

Thanos Query

args:
    - query
    - '--log.level=debug'
    - '--log.format=logfmt'
    - '--grpc-address=0.0.0.0:10901'
    - '--http-address=0.0.0.0:10902'
    - '--query.replica-label=prometheus_replica'
    - '--query.auto-downsampling'
    - '--store.sd-dns-resolver=miekgdns'
    - >-
      --store=dnssrv+_grpc._tcp.kube-prometheus-stack-store-grpc.monitoring.svc.cluster.local
    - >-
      --store=dnssrv+_grpc._tcp.kube-prometheus-stack-thanos-sidecar.monitoring.svc.cluster.local
    - >-
      --store=dnssrv+_grpc._tcp.kube-prometheus-stack-rule-grpc.monitoring.svc.cluster.local
    - '--store.sd-interval=5m'

Thanos store

args:
    - store
    - '--data-dir=/var/thanos/store'
    - '--log.level=info'
    - '--log.format=logfmt'
    - '--http-address=0.0.0.0:10902'
    - '--grpc-address=0.0.0.0:10901'
    - '--objstore.config-file=/etc/config/object-store.yaml'
    - '--index-cache-size=250MB'
    - '--chunk-pool-size=2GB'
    - '--store.grpc.series-max-concurrency=20'
    - '--sync-block-duration=3m'
    - '--block-sync-concurrency=20'
    - '--max-time=2y'

Thanos Sidecar

args:
    - sidecar
    - '--prometheus.url=http://127.0.0.1:9090/'
    - '--grpc-address=[$(POD_IP)]:10901'
    - '--http-address=[$(POD_IP)]:10902'
    - '--objstore.config=$(OBJSTORE_CONFIG)'
    - '--tsdb.path=/prometheus'
    - '--log.level=info'
    - '--log.format=logfmt'

Prometheus

args:
    - '--web.console.templates=/etc/prometheus/consoles'
    - '--web.console.libraries=/etc/prometheus/console_libraries'
    - '--config.file=/etc/prometheus/config_out/prometheus.env.yaml'
    - '--storage.tsdb.path=/prometheus'
    - '--storage.tsdb.retention.time=7d'
    - '--web.enable-lifecycle'
    - '--storage.tsdb.no-lockfile'
    - >-
      --web.external-url=http://kube-prometheus-stack-prometheus.monitoring:9090
    - '--web.route-prefix=/'
    - '--storage.tsdb.max-block-duration=2h'
    - '--storage.tsdb.min-block-duration=2h'

Full logs to relevant components:

We were not able to find any logs that could be directly correlated to this issue. If you have an idea as to where we need to look to find useful logs, please let me know.

Anything else we need to know:
