
Metrics interval does not take scraping cron schedule into account #60

Closed · tomkerkhove opened this issue on Apr 26, 2018 · 5 comments

tomkerkhove (Owner) commented Apr 26, 2018

Currently, every time a metric is scraped, it fetches a list of all measured points at a fixed interval, currently 1 minute.

This means that the scraping schedule, which is a cron expression, is not taken into account, which can produce false positives for aggregations such as average.

@tomkerkhove tomkerkhove added this to the v1.1 milestone Apr 26, 2018

@tomkerkhove tomkerkhove changed the title Metrics interval is hardcoded and does not take cron schedule into account Metrics interval does not take scraping cron schedule into account Apr 26, 2018

@tomkerkhove tomkerkhove modified the milestones: vNext, v0.3 Aug 18, 2018

@tomkerkhove tomkerkhove modified the milestones: v0.3.0, v0.4.0 Oct 17, 2018

jdevalk2 commented Oct 18, 2018

I have an observation, which I think is an issue related to this one, but I'm not sure whether it warrants its own issue.

Apologies in advance for the long story. The short of it: when retrieving metrics, Promitor should respect the underlying metric's data granularity, and that granularity should (probably?) be configurable in the generic metric definition configuration.

Here goes:

First, consider a timeline: Azure Monitor logs a datapoint at intervals.

On top of this there is the concept of Azure Monitor data granularity. Which granularities are supported depends on the underlying resource: when you fetch the resource's metric definition, it lists the supported granularities. Common ones are 30s, 1m, 15m, 30m, 1h, 6h, 12h, and 1d.

Azure Monitor already aggregates "raw" datapoints into the granularities above. Let's say, for example, the granularity is one minute. Within that minute, Azure Monitor could have retrieved 2 "raw" datapoints from the resource, which are then aggregated (sum, total, avg, min, max):

Raw: 10 bytes at T + 15 seconds
Raw: 25 bytes at T + 20 seconds

Then the 1-minute granularity TOTAL aggregation would be 35 bytes.
If those are the only points within 15 minutes, then:
the 15-minute granularity TOTAL aggregation is also 35 bytes,
the 1-hour granularity TOTAL aggregation is also 35 bytes,
etc.
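
A minimal Python sketch of that bucketing (purely illustrative, not Promitor code, which is .NET):

```python
from datetime import datetime, timedelta

# The two "raw" datapoints from the example above, as (timestamp, value).
t0 = datetime(2018, 10, 18, 12, 0, 0)
raw = [
    (t0 + timedelta(seconds=15), 10),  # 10 bytes at T + 15 seconds
    (t0 + timedelta(seconds=20), 25),  # 25 bytes at T + 20 seconds
]

def total_per_bucket(points, start, granularity):
    """Sum raw values into fixed-size granularity buckets (TOTAL aggregation)."""
    buckets = {}
    for timestamp, value in points:
        index = int((timestamp - start) / granularity)
        buckets[index] = buckets.get(index, 0) + value
    return buckets

# Both points fall into the first bucket, so TOTAL is 35 bytes
# whether the granularity is 1 minute, 15 minutes, or 1 hour.
print(total_per_bucket(raw, t0, timedelta(minutes=1)))   # {0: 35}
print(total_per_bucket(raw, t0, timedelta(minutes=15)))  # {0: 35}
print(total_per_bucket(raw, t0, timedelta(hours=1)))     # {0: 35}
```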

Now we have a cron job that fires the data collector process every X interval.
The data collector's purpose is to get the latest measured datapoint for the observed metric.

It works like this: every interval, the past 5 days of data are fetched, in metric "data granularity" intervals.
That data granularity is currently hardcoded to 5 minutes.

There is a TODO in the code saying this should be synced with the CRON job interval. Although both are time-related, I think they are different things and should therefore not be synced.

First consideration why I think syncing wouldn't be ideal: you could have many metrics configured, each with a different data granularity, so it would be hard to have one CRON schedule that matches the granularity of every configured metric.

Second consideration: as said, I think the data granularity should be configurable by the end user in the metric configuration instead of being hardcoded to 5m.
In the Azure Monitor query API, the interval dictates the data granularity of the series (aggregated by Azure Monitor internally) that is returned; see the sketch below.
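
For illustration, roughly what such a query looks like with the Python azure-monitor-query client (Promitor itself is .NET, and the resource ID and metric name here are hypothetical; this only shows where a configurable granularity would plug in):

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricAggregationType, MetricsQueryClient

client = MetricsQueryClient(DefaultAzureCredential())

# Hypothetical resource; the granularity would come from the metric
# configuration instead of being hardcoded to 5 minutes.
response = client.query_resource(
    "/subscriptions/.../providers/Microsoft.ServiceBus/namespaces/demo",
    metric_names=["IncomingMessages"],
    timespan=timedelta(days=5),        # "the past 5 days of data"
    granularity=timedelta(minutes=5),  # today's hardcoded interval
    aggregations=[MetricAggregationType.TOTAL],
)
```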

Third consideration: with the granularity hardcoded to 5m, if you are interested in finer-grained data, say per minute, and Prometheus is set to scrape every minute, then 5 minutes is the best resolution you will ever see on the metric. And when the metric is updated, it might be a couple of minutes behind the latest value.

Below is a quick example of what I mean.

This is the desired behaviour, I think: if the scraper runs at T+7 minutes, you would expect the V7 value for metric1 and the V2 value for metric2.

Time   Metric1        Metric2       Scraper
T+0    V1             V1
T+1    V2
T+2    V3
T+3    V4
T+4    V5             V2 < latest
T+5    V6
T+6    V7 < latest
T+7    V8                           << scraper runs here

In the current build, if the scraper runs at T+7 minutes with the hardcoded 5-minute data granularity, you would get the V5 value for metric1 and the V2 value for metric2.

Time   Metric1        Metric2       Scraper
T+0    V1             V1
T+1    V2
T+2    V3
T+3    V4
T+4    V5 < this one  V2 < latest
T+5    V6
T+6    V7
T+7    V8                           << scraper runs here
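
A toy sketch of why the 5-minute granularity caps the resolution, assuming buckets aligned to T+0 (hypothetical helper, not Promitor code):

```python
def latest_complete_bucket_end(now_minutes, granularity_minutes):
    """Minute mark where the last fully completed granularity bucket ends."""
    return (now_minutes // granularity_minutes) * granularity_minutes

# At T+7 a 1-minute granularity exposes data up to T+7 (V7 above is from the
# latest completed minute), while the hardcoded 5-minute granularity only
# exposes the bucket ending at T+5, whose value is V5.
print(latest_complete_bucket_end(7, 1))  # 7
print(latest_complete_bucket_end(7, 5))  # 5
```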
tomkerkhove (Owner) commented Oct 22, 2018

Thanks for the thorough report @jdevalk2!

I can see your concern about the aggregations not being aligned, and making them configurable might indeed be better. However, wouldn't it be a problem if you use a granularity of 1 minute but only scrape every 5 minutes?

I don't think it is, but I just want to make sure we've thought this through.

jdevalk2 commented Oct 22, 2018

@tomkerkhove

Azure Monitor's lowest known resolution of 1 minute can already be pretty coarse, depending on the total timeline you are looking at.

It does pre-aggregation within that one minute: min, max, average, etc. So at best, Promitor already gets a collapsed, summarized view of the resource being monitored.

The case you mention, where the resource is queried at per-minute data granularity but the Promitor query loop runs via CRON every 5 minutes, could indeed result in a loss of data fidelity. It depends on whether Promitor should have aggregation functionality of its own.

Solution 1

The Promitor loop could run as fast as the finest data granularity of all the resources you are running. Example: monitoring one resource at 1-minute and another at 5-minute intervals means Promitor has to run every minute (see the sketch below).
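
A minimal sketch of that scheduling rule, assuming a hypothetical list of configured granularities:

```python
from datetime import timedelta

# Hypothetical granularities of the configured metrics.
granularities = [timedelta(minutes=1), timedelta(minutes=5)]

# The collector loop must run at least as often as the finest granularity.
loop_interval = min(granularities)
print(loop_interval)  # 0:01:00 -> run every minute
```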

Solution 2

Another option is for Promitor to do additional data aggregation itself. It also knows the last datapoint timestamp it queried for each resource. In this example, if Promitor is configured with one resource at 1-minute granularity and one at 5-minute granularity, and Promitor runs every 5 minutes, it "knows" it needs to aggregate the last 5 datapoints of the 1-minute resource and pass the 5-minute one through (sketched below).
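
A rough sketch of that Promitor-side aggregation, reusing the R1 numbers from the table further down (all names here are hypothetical):

```python
# Hypothetical: the five 1-minute averages fetched since the previous run.
one_minute_points = [10, 20, 10, 40, 85]

# Collapse them into a single value per 5-minute run; here a plain average,
# but the function used would follow the metric's configured aggregation type.
collapsed = sum(one_minute_points) / len(one_minute_points)
print(collapsed)  # 33.0

# A 5-minute-granularity resource needs no extra work: its single
# datapoint is passed through unchanged.
```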

Alternatively, Promitor could query both resources with the same interval and let Azure Monitor do the aggregation for it. An option, but since CRON allows far more schedules than every 1 or 5 minutes, such as "every 32 minutes, but not on an odd Saturday", I don't see how this can hold up except in the simplest of cases.

Solution 3

1 and 2 were the ones I could think of, but there might be other alternatives.

However, I think the least complex approach would be for Promitor to just pass the data through and not make any interpretations of its own.

If Promitor passes the data through at the finest resolution, then any aggregation can be taken care of in the Prometheus query language, e.g. something like avg_over_time(some_metric[5m]); that is what PromQL should be good at, with Prometheus being a time-series database.

(Note: another factor is Prometheus itself querying the metrics that Promitor exposes. In all my examples I assume the endpoint is hit every 3 to 10 seconds or so, well within the one-minute range. If Prometheus only scrapes Promitor's metrics every 5-10 minutes, that might change the story.)

For instance, resource 1 at 1-minute intervals and resource 2 at 5-minute intervals:

Time   Avg R1   Avg R2
T+0    10       20
T+1    20       20
T+2    10       20
T+3    40       20
T+4    85       20
T+5    100      30
T+6    60       30

In the Prometheus DB you get 5 unique average values for R1 (10, 20, 10, 40, 85) for every 1 value of R2 (20).
I think if the Prometheus scraper hits the Promitor endpoint 5 times in those 5 minutes, you would also get 20 five times as datapoints for R2. (?)

Then in PromQL you can hopefully normalize those 5 values into 1 when comparing R1 and R2. Again, I am relying on the almighty power of the time-series database here :-)

One final thought

Coming back to the case where a 1-minute-granularity resource is scraped every 5 minutes: given the alternative where no data aggregation is done, data fidelity is indeed lost.

But from the point of view of taking metrics from a resource, all we do is look at the resource and observe the state it is in at that moment. If we missed the states before that, it is not necessarily wrong; it just means our valleys and peaks get chunkier, at a lower resolution.

tomkerkhove (Owner) commented Oct 22, 2018

For what it's worth, Promitor does not do any aggregation at the moment; it just scrapes on a fixed schedule, but this is misleading indeed.

As the issue is bigger than aligning the CRON schedule for Promitor scraping with metrics aggregation, I suggest we pick this up in #198 instead and then see if we still need this change here.

tomkerkhove (Owner) commented Dec 14, 2018

This will be done via #256 & #258
