Metrics interval does not take scraping cron schedule into account #60
Currently every time a metric is scraped it will get a list of all measure points with a fixed interval, currently 1 minute.
This means that the scraping schedule, which is a cron, is not taken into account and can give false positives when using metrics such as average.
Title changed to "Metrics interval is hardcoded and does not take cron schedule into account" on Apr 26, 2018
I have an observation, which I think is an issue, related to this one, but I am not sure if it warrants its own issue.
Sorry for the long story in advance.
Here goes:
First consider a timeline, Azure Monitor will log a datapoint at intervals.
Then on top of this there is the concept of Azure Monitor data granularity. Which data granularities are supported depends on the underlying resource. When getting the resource's metric definition, it will list its supported data granularities. Common ones are 30s, 1m, 15m, 30m, 1h, 6h, 12h, 1d.
Azure Monitor already does aggregation from "raw" datapoints into the above data granularities. For example:
Raw: 10 bytes at T+15 seconds
The 1-minute data granularity interval for the TOTAL aggregation could then be 35 bytes.
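This bucketing can be sketched as follows; a minimal illustration, assuming raw points are `(timestamp_seconds, value)` pairs and adding a hypothetical second raw point of 25 bytes in the same minute so the TOTAL comes out to 35:

```python
from collections import defaultdict

def total_per_bucket(points, granularity_s=60):
    """Sum raw (timestamp_s, value) points into fixed-width buckets,
    mimicking a TOTAL aggregation per data-granularity interval."""
    buckets = defaultdict(float)
    for ts, value in points:
        # Floor the timestamp to the start of its granularity bucket.
        buckets[ts // granularity_s * granularity_s] += value
    return dict(buckets)

# Raw points in the first minute: 10 bytes at T+15s plus a hypothetical
# 25-byte point at T+40s, giving a 1-minute TOTAL of 35 bytes.
raw = [(15, 10), (40, 25)]
print(total_per_bucket(raw))  # {0: 35.0}
```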
Now we have a cron job which fires the data collector process every X interval.
Every run it grabs the past 5 days of data, in the metric's "data granularity" intervals.
There is a TODO in the code mentioning that this should be synced with the CRON job interval. Although both are related to time, I think they are different things and should therefore not be synced.
The first consideration why I think this wouldn't be ideal: you could have many metrics configured, each with a different data granularity, so it would be hard to have one CRON job schedule that matches the data granularity of every metric configured to be measured.
The second consideration, as said: I think the data granularity, instead of being hardcoded to 5m, should be configurable by the end user in the metric configuration.
The third consideration: with the granularity hardcoded to 5m, if you are interested in finer-grained data, say per minute, and Prometheus is set to scrape every minute, then 5 minutes is the best resolution you are ever going to see on the metric. And when the metric is updated it might be a couple of minutes off the latest value.
Below I did a quick example of what I mean (I am not that proficient in tables so it looks ugly but hopefully it's clear enough :-))
This is the desired behaviour I think:
If the scraper runs after T+7 minutes you would expect the V7 value for metric1 and the V2 value for metric2.
In the current build you would have
Thanks for the thorough report @jdevalk2!
I can see your concern where the aggregations are not aligned, and making them configurable might indeed be better. However, wouldn't it be a problem if you use a granularity of 1 min but only scrape every 5 min?
I don't think it is but just want to make sure we've thought this through.
Azure Monitor's lowest known resolution of 1 minute - depending on the total timeline you are looking at - can be pretty coarse already.
It does pre-aggregation within that 1 minute: min, max, average, etc. So at best, in Promitor we already get a collapsed, summarized view of the resource being monitored.
The case you mention, where the resource is queried at per-minute data granularity and the Promitor query loop runs by CRON every 5 minutes, could result in a loss of data fidelity. This depends on whether Promitor should have aggregation functionality of its own.
It could be that the Promitor loop has to run as fast as the lowest data granularity of all the resources you are monitoring. Example: monitor one resource with 1-minute and one with 5-minute intervals. This example means Promitor has to run every minute.
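Deriving that loop interval can be sketched in a few lines; a minimal example with hypothetical resource names, using the GCD of the configured granularities (which for 1m and 5m is simply the finest one, 1 minute):

```python
from functools import reduce
from math import gcd

# Hypothetical per-resource data granularities, in seconds: 1m and 5m.
granularities = {"resource_1": 60, "resource_2": 300}

# The loop must fire at least as often as the finest granularity; the GCD
# also keeps the coarser intervals aligned on their bucket boundaries.
loop_interval_s = reduce(gcd, granularities.values())
print(loop_interval_s)  # 60
```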
Another option is that Promitor does additional data aggregation itself. It also knows the last data point timestamp it queried for each resource. In this example, if Promitor is configured to scrape one resource with 1-minute granularity and one with 5-minute granularity, and Promitor runs every 5 minutes, then Promitor "knows" that it needs to aggregate the last 5 data points of the 1-minute resource and pass through the 5-minute one.
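A sketch of what that in-Promitor aggregation could look like; a minimal illustration (not Promitor's actual code), assuming `(timestamp_seconds, value)` datapoints and averaging as the aggregation:

```python
def collapse(points, target_granularity_s):
    """Average fine-grained (timestamp_s, value) points into coarser buckets;
    a series already at the target granularity passes through unchanged."""
    buckets = {}
    for ts, value in sorted(points):
        buckets.setdefault(ts // target_granularity_s, []).append(value)
    return [(bucket * target_granularity_s, sum(vs) / len(vs))
            for bucket, vs in sorted(buckets.items())]

# Five 1-minute datapoints gathered since the last 5-minute run:
one_minute_series = [(0, 10), (60, 20), (120, 10), (180, 40), (240, 85)]
print(collapse(one_minute_series, 300))  # [(0, 33.0)]

# A 5-minute series is passed through as-is:
five_minute_series = [(0, 20)]
print(collapse(five_minute_series, 300))  # [(0, 20.0)]
```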
Or Promitor has to query both resources with the same interval and let Azure Monitor do the aggregation for it. An option, but since CRON allows for many more schedules than every 1 or 5 minutes - like every 32 minutes, and not on an odd Saturday - I do not see how this can hold up except in the simplest of cases.
These were the ones I could think of, but there might be other alternatives.
However, I think the least complex would be for Promitor just to pass the data through and not make any interpretations of it itself.
If Promitor passes through the data at the lowest resolution, then any aggregation can be taken care of in the Prometheus query language, as that is what PromQL should be good at, with Prometheus being a time-series database.
(NOTE: Another factor is Prometheus itself querying the metric that Promitor exposes. In all my examples I just assume the endpoint is hit every 3 to 10 seconds or so, well within the minute time range. If Prometheus itself only queries the Promitor-exposed metrics every 5-10 minutes, that might change the story.)
For instance: resource 1 (R1) with 1-minute intervals, resource 2 (R2) with 5-minute intervals, over the window T+0 to T+5 minutes.
In the Prometheus DB you get 5 unique average values for R1 (10, 20, 10, 40, 85) for every 1 value of R2 (20).
Then in PromQL you can hopefully normalize those 5 values into 1 value when comparing R1 and R2. Again, I am relying on the almighty power of the time-series database here :-)
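In PromQL that normalization would be something like `avg_over_time(r1[5m])` (with `r1` a hypothetical series name); as plain arithmetic on the values above:

```python
# The five 1-minute averages for R1 within one 5-minute window,
# and the single 5-minute value for R2:
r1 = [10, 20, 10, 40, 85]
r2 = 20

# Normalize R1 down to R2's resolution before comparing the two series.
r1_normalized = sum(r1) / len(r1)
print(r1_normalized, r2)  # 33.0 20
```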
One final thought
Coming back to the case where a 1-minute-granularity resource is scraped every 5 minutes: given the alternative that no data aggregation is done, that means data fidelity is lost.
But from the point of view of taking metrics from a resource, all we do is look at the resource and observe the state it's in at that moment. If we missed the states before that, it is not necessarily wrong; it just means that when we have valleys and peaks, they will be chunkier, with not as high a resolution.
For what it's worth, Promitor does not do any aggregation at the moment; it just scrapes on a fixed schedule, but this is misleading indeed.
As the issue is bigger than aligning the CRON schedule for Promitor scraping with metrics aggregation, I suggest picking this up in #198 instead of here and then seeing if we still need this change.