Thanos, Prometheus and Golang version used:
Thanos: v0.34.0-rc.0
Prometheus: v2.45.0
Object Storage Provider:
AWS S3
What happened:
We are using Thanos to store long-term data for our business metrics. Previously we had a 40-day retention on Prometheus in order to meet business requirements. To save memory and resources, we decided to reduce our retention to just 7 days and rely solely on Thanos Ruler for the long-term aggregations.
Currently we have the following retention configuration for our Thanos storage:
retentionResolutionRaw = 33d
retentionResolution5m = 120d
retentionResolution1h = 1y
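For context, those are the Helm-style value names; on the compactor itself they correspond to the standard retention flags. A rough sketch of the equivalent thanos compact invocation (paths and the object-store config file are placeholders, not our actual setup):

thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --retention.resolution-raw=33d \
  --retention.resolution-5m=120d \
  --retention.resolution-1h=1y \
  --wait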
After making the switch and migrating all our long-term rules from Prometheus to Thanos Ruler, the resulting metrics are calculated incorrectly:
As you can see in the image, before the switch we had about 6.5 million requests in the last 14 days. After the switch, the count plummeted to about 2.5 million. On further investigation we noticed that the numbers displayed are very close to our 7-day aggregations, which suggests that our ruler does not receive all the data from our querier. The gaps inside the time series are not important; that is the time when the ruler wasn't running.
We are using the following metric and rule to make our aggregations:
rules:
  - expr: increase(kong_http_requests_total[14d])
    record: kong_http_requests_total:rate:14d
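For completeness, that rule lives in a rule group loaded by the ruler; the group name and evaluation interval below are placeholders rather than our exact values:

groups:
  - name: kong-long-term
    interval: 1h
    rules:
      - record: kong_http_requests_total:rate:14d
        expr: increase(kong_http_requests_total[14d])

Since Prometheus now keeps only 7 days, the [14d] range in this expression can only be satisfied if the ruler's queries go through the querier and the querier merges in store gateway data.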
What you expected to happen:
The behaviour should have been similar to this image. This is from our development stage, where we made a similar switch: the graph simply continues after the switch. We could not find any meaningful differences between our development stage and this one. We did experience a similar behaviour where the metric
How to reproduce it (as minimally and precisely as possible):
Sadly I don't know how to reproduce this issue outside of our own setup.
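One way to narrow it down, though, is to run the rule expression directly against the querier and compare it with what the ruler actually recorded; a sketch using the querier's Prometheus-compatible HTTP API (URL and port are placeholders):

# Same 14d increase the ruler evaluates, asked of the querier directly.
curl -s 'http://thanos-query:10902/api/v1/query' \
  --data-urlencode 'query=increase(kong_http_requests_total[14d])'

# The series the ruler recorded, for comparison.
curl -s 'http://thanos-query:10902/api/v1/query' \
  --data-urlencode 'query=kong_http_requests_total:rate:14d'

If the first query already comes back with roughly the 7-day figure, the querier itself is not serving the full window; if it returns the expected ~6.5 million, the data is being lost on the ruler's query path.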
Configuration of our components
Thanos ruler
Thanos Query Frontend
Thanos Query
Thanos store
Thanos Sidecar
Prometheus
Full logs to relevant components:
We were not able to find any logs that could be directly correlated with this issue. If you have an idea of where we should look for useful logs, please let me know.
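So far we have only looked at the default log level; we can rerun the ruler and querier with increased verbosity if that would help (a sketch, other flags unchanged):

# Raise verbosity on the components involved in rule evaluation.
thanos rule --log.level=debug <existing flags>
thanos query --log.level=debug <existing flags>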
Anything else we need to know: