When requesting metrics in Grafana, data after some days seems quite inflated. #8155

Open
mdraijer opened this issue Mar 12, 2025 · 6 comments
mdraijer commented Mar 12, 2025

When requesting metrics in Grafana with a sum function, data after some days seems quite inflated.

  • When the same data is requested on a later day, the query result can be heavily inflated. In the screenshots, the data looks normal at first, but the same timeframe requested on another day is suddenly inflated.
  • The problem seems to be tied to the resolution of the data: resolution 1h gives inflated data, whereas raw and 5m give the actual data.
  • When the metric in question has few data points, the resulting graph always seems to be correct, but when the metric has a lot of data, the graph shows this problem.
  • When the sum is removed, the resulting time series, summed up by hand, give the expected value at all resolutions, including 1h.

You can see this in the screenshots provided: the timeframe is the same, but the data is suddenly inflated several times over.

[Screenshots: inflated3, inflated 1]

The query executed:
sum by (datname,usename) (cnpg_backends_total{namespace="$namespace",release="$release",pod=~"${cluster}-.*",usename!="cnpg_pooler_pgbouncer",datname!="postgres",usename!="postgres"})

It resembles what happened in #922, but that should have been fixed years ago.

Originally posted by @luuktrs in #8099

@dosubot dosubot bot added the bug label Mar 12, 2025

dosubot bot commented Mar 12, 2025

Hey there, @mdraijer! 👋 I'm here to assist you with bug fixes, questions, and more while you wait for a human maintainer.

I found a similar unsolved discussion that might be relevant to your issue: Values in Grafana seem inflated after some time, possibly after Thanos compaction [1]. This discussion highlights issues with inflated data in Grafana when using the sum function, particularly at the 1-hour resolution, which aligns with the problem you're experiencing.

Regarding the sum function and data resolution in Thanos, the aggregation of metrics using the sum function involves creating downsampled chunks that include various aggregations such as sum, count, min, max, and counter. This is done at different resolutions, such as 5 minutes or 1 hour, to balance data resolution and storage efficiency. The sum function is stored in the Chunk sum field of the AggrChunk message, which is part of the downsampling process [2].
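To make the per-window aggregates concrete, here is a minimal sketch of collapsing the raw samples of one series in one downsampling window into count, sum, min, max and a counter value. It is a toy model, not Thanos code: `WindowAggregates`, `downsample_window` and the reset handling for `counter` are hypothetical names and simplifications chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class WindowAggregates:
    """Toy stand-in for one downsampled window of a single series."""
    count: int
    total: float    # sum of the raw sample values in the window
    min: float
    max: float
    counter: float  # reset-adjusted running total (simplified assumption)


def downsample_window(samples: list[float]) -> WindowAggregates:
    """Collapse the raw samples of ONE series in ONE window into aggregates."""
    if not samples:
        raise ValueError("window has no samples")
    counter = samples[0]
    for prev, cur in zip(samples, samples[1:]):
        # A drop is treated as a counter reset: the new value counts in full.
        counter += cur if cur < prev else cur - prev
    return WindowAggregates(
        count=len(samples),
        total=sum(samples),
        min=min(samples),
        max=max(samples),
        counter=counter,
    )


# A gauge that stayed at 25 for twelve scrapes inside the window.
agg = downsample_window([25.0] * 12)
print(agg.total, agg.count, agg.total / agg.count)  # 300.0 12 25.0
```

In this sketch the window's sum (300) is twelve times the gauge's actual value, while sum divided by count recovers 25.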

The issue might be related to how Thanos handles queries with different resolutions. Thanos uses a function called LookbackDeltaFactory to create lookback deltas based on resolution levels, ensuring the query uses the correct data granularity. The compaction process in Thanos involves downsampling, which can affect data resolution and accuracy, especially at coarser resolutions like 1 hour [3][2].

If the problem persists, it might be worth investigating further into how Thanos handles the aggregation and resolution of your specific metrics, especially considering the known issues with the sum function and high-resolution data.

mdraijer commented Mar 12, 2025

More precise example, for query:

sum by (datname,usename) (cnpg_backends_total{namespace="rws-dso-prd",release="cnpgprd",pod=~"cnpgprd-cnpgcluster-pg15-.*",usename="inception_dba",datname="inception"})

produces the following graphs for a 1-hour period, 1.5 days in the past, at resolutions raw, 5m and 1h respectively:

[Images: three graphs of the query result, at resolutions raw, 5m and 1h]

When you remove the sum part, you get a couple of separate time series. In the 'Table' view you see 5 series with value 1 and 1 series with value 25. That adds up to the value 30 that we see in the raw and 5m graphs.

However, in the 'Graph' view, for the query without sum, you see 'Result series: 131'. So, somehow over that hour period, Thanos recognizes 131 different series. With resolution 5m and raw it says 60 result series.

@mdraijer

After some more analysis of the Thanos source code it is even more clear that this is a bug.

  • The AggrSum value in the AggrChunk is the sum of successive values over a time period for one time series.
  • The sum() that is requested in the PromQL query is the sum of values in one moment for multiple time series.

These should not be mixed. For the calculation of the sum in downsampled data, the average for the time period (i.e. AggrSum/AggrCount) should be used, possibly with some intelligent calculations with the min and max for the period, but not the AggrSum.
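The difference between the two kinds of sum can be illustrated with a small toy model. It reuses the values from the 'Table' view above (five series at 1, one series at 25) and assumes each series is constant within a downsampling window. `SAMPLES_PER_WINDOW` and the dictionary keys are hypothetical, chosen only for this sketch; this is not Thanos code and does not claim to reproduce its query path.

```python
# Assumed number of raw samples per series inside one downsampled window.
SAMPLES_PER_WINDOW = 4

# Gauge value of each of the six series at any moment in the window.
gauge_values = [1, 1, 1, 1, 1, 25]

# Per-series aggregates over the window (sum over TIME for one series).
windows = [
    {"aggr_sum": v * SAMPLES_PER_WINDOW, "aggr_count": SAMPLES_PER_WINDOW}
    for v in gauge_values
]

# What PromQL sum() should return: sum across SERIES at one moment.
expected = sum(gauge_values)

# Mixing the two: feeding each series' AggrSum into the cross-series sum.
inflated = sum(w["aggr_sum"] for w in windows)

# Using the per-window average (AggrSum / AggrCount) per series instead.
averaged = sum(w["aggr_sum"] / w["aggr_count"] for w in windows)

print(expected, inflated, averaged)  # 30 120 30.0
```

In this model, the mixed-up result is inflated by exactly the number of samples per window, which would explain why metrics with many data points are affected most.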

@mdraijer

Maybe a Thanos developer can have a look at this?

mdraijer commented Apr 4, 2025

@dosu please tell me how I can get a Thanos maintainer to look at this ticket?

dosubot bot commented Apr 4, 2025

It seems like you've identified a potential issue with how Thanos handles the aggregation of metrics, particularly when using the sum function with downsampled data. The problem appears to be related to the use of AggrSum in the AggrChunk, which represents the sum of successive values over a time period for one time series, rather than the sum of values at one moment for multiple time series. This could lead to inflated results when querying downsampled data.

To address this, the calculation of the sum in downsampled data should ideally use the average for the time period (i.e., AggrSum/AggrCount) instead of directly using AggrSum. This approach would provide a more accurate representation of the data over the specified period.

Unfortunately, I can't directly contact Thanos maintainers or make changes to the code. However, you might consider creating a detailed issue on the Thanos GitHub repository, providing your analysis and findings. This would help bring the issue to the attention of the maintainers and facilitate further investigation.

This is a tough nut to crack, could really use your help on this, @kakkoyun.
