When requesting metrics in Grafana, data after some days seems quite inflated. #8155
Hey there, @mdraijer! 👋 I'm here to assist you with bug fixes, questions, and more while you wait for a human maintainer.

I found a similar unsolved discussion that might be relevant to your issue: Values in Grafana seem inflated after some time, possibly after Thanos compaction [1]. That discussion highlights inflated data in Grafana when using the sum function, particularly at the 1-hour resolution, which aligns with the problem you're experiencing.

Regarding the sum function and data resolution in Thanos: downsampling creates chunks that carry several aggregations per window, such as sum, count, min, max, and counter. This is done at different resolutions, such as 5 minutes and 1 hour, to balance data resolution against storage efficiency; the sum is stored as its own aggregate alongside the others.

The issue might be related to how Thanos chooses among these resolutions and aggregations when answering a query. If the problem persists, it is worth investigating further how Thanos handles the aggregation and resolution of your specific metrics, especially considering the known issues with the sum function and high-resolution data.
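To make the downsampling idea above concrete, here is a minimal, self-contained sketch in plain Go. It is not the actual Thanos code or its types, and the figure of 12 raw samples per 5m window is an assumption (roughly a 25s scrape interval); it only illustrates how raw samples get folded into per-window aggregates such as sum, count, min and max:

package main

import "fmt"

// windowAggr loosely mimics the per-window aggregates a downsampled chunk
// carries (counter handling is omitted for brevity).
type windowAggr struct {
    sum, min, max float64
    count         int
}

// downsample folds raw samples into fixed-size windows, e.g. 12 raw samples
// per 5m window.
func downsample(raw []float64, window int) []windowAggr {
    var out []windowAggr
    for i := 0; i < len(raw); i += window {
        end := i + window
        if end > len(raw) {
            end = len(raw)
        }
        a := windowAggr{min: raw[i], max: raw[i]}
        for _, v := range raw[i:end] {
            a.sum += v
            a.count++
            if v < a.min {
                a.min = v
            }
            if v > a.max {
                a.max = v
            }
        }
        out = append(out, a)
    }
    return out
}

func main() {
    // A gauge that sits at 5 (e.g. 5 open backends) for two 5m windows,
    // scraped 12 times per window.
    raw := make([]float64, 24)
    for i := range raw {
        raw[i] = 5
    }
    for _, a := range downsample(raw, 12) {
        fmt.Printf("window: sum=%g count=%d min=%g max=%g\n", a.sum, a.count, a.min, a.max)
    }
}

Note that the stored per-window sum (60 here) is twelve times the gauge's actual value (5); whether the query path reads that back as-is or divides by the count is exactly what the rest of this thread is about.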
A more precise example, for the query: the following graphs cover a period of 1 hour, 1.5 days in the past, at the resolutions raw, 5m and 1h respectively. Removing the sum shows something odd as well: in the 'Graph' view, for the query without sum, you see 'Result series: 131'. So, somehow, over that hour period Thanos recognizes 131 different series. With resolution 5m and raw it says 60 result series.
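One way to reproduce this comparison outside Grafana is to send the same instant query straight to the Thanos Querier HTTP API with different values of its max_source_resolution parameter (0s = raw only, 5m, 1h). The sketch below assumes a reachable querier; the host, timestamp and label selectors are placeholders, not values from this ticket:

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
)

func main() {
    // Placeholder endpoint and query; adjust to your environment.
    const querier = "http://thanos-query.example.com:9090"
    query := `sum by (datname,usename) (cnpg_backends_total{datname!="postgres",usename!="postgres"})`

    for _, res := range []string{"0s", "5m", "1h"} {
        v := url.Values{}
        v.Set("query", query)
        v.Set("time", "2025-01-01T12:00:00Z") // placeholder: a timestamp ~1.5 days in the past
        v.Set("max_source_resolution", res)

        resp, err := http.Get(querier + "/api/v1/query?" + v.Encode())
        if err != nil {
            panic(err)
        }
        body, _ := io.ReadAll(resp.Body)
        resp.Body.Close()
        fmt.Printf("max_source_resolution=%s:\n%s\n\n", res, body)
    }
}

If the 1h result comes back several times higher than the 5m and raw results for the same timestamp, that matches the inflation described above.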
After some more analysis of the Thanos source code it is even clearer that this is a bug. The downsampled AggrSum aggregate and the actual series value should not be mixed up: for the calculation of the sum in downsampled data, the average for the time period (i.e. AggrSum/AggrCount) should be used, possibly combined with some intelligent use of the min and max for the period, but not the AggrSum on its own.
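A small back-of-the-envelope sketch of that claim (the gauge value and scrape interval are assumptions, and this is not Thanos code, just the arithmetic):

package main

import "fmt"

func main() {
    // One 1h downsampled window for a gauge that sat at 5 the whole hour,
    // scraped every 30s: 120 raw samples in the window.
    const gauge, samples = 5.0, 120.0
    aggrSum := gauge * samples // what the per-window sum aggregate stores
    aggrCount := samples

    fmt.Println("value if AggrSum is used directly:  ", aggrSum)           // 600 -> ~120x inflated
    fmt.Println("value if AggrSum/AggrCount is used: ", aggrSum/aggrCount) // 5   -> matches the gauge
}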
Maybe a Thanos developer can have a look at this?
@dosu please tell me how I can get a Thanos maintainer to look at this ticket?
It seems like you've identified a potential issue with how Thanos handles the aggregation of metrics, particularly when using the sum function over downsampled data. To address this, the calculation of the sum in downsampled data should ideally use the average for the time period (i.e., AggrSum/AggrCount) rather than the raw AggrSum. Unfortunately, I can't directly contact Thanos maintainers or make changes to the code. However, you might consider creating a detailed issue on the Thanos GitHub repository, providing your analysis and findings. This would help bring the issue to the attention of the maintainers and facilitate further investigation.

This is a tough nut to crack, could really use your help on this, @kakkoyun.
When requesting metrics in Grafana with a sum function, data after some days seems quite inflated.
You can see it on the screenshots provided: same timeframe, but suddenly the data inflates to multiple times the expected values.
The query executed:
sum by (datname,usename) (cnpg_backends_total{namespace="$namespace",release="$release",pod=~"${cluster}-.*",usename!="cnpg_pooler_pgbouncer",datname!="postgres",usename!="postgres"})
It seems to resemble what happens in #922, but that was supposedly fixed years ago.
Originally posted by @luuktrs in #8099