metrics-cpu-mpstat.rb inflating CPU when run from sensu #11

dalesit · 2016-05-25T13:17:21Z

One of my systems shows nearly 100% CPU utilisation according to metrics-cpu-mpstat, but 75% idle time according to sar. Measuring the CPU over a one-second sample seems to inflate the CPU stats on this system, although not on other identical servers, which are 88% idle (according to sar). Running top -d 0.5 shows that the check-rabbitmq checks use a lot of CPU when they come through. If they coincide with the CPU metrics run, the stats will be skewed. metrics-cpu.rb, however, just takes the counter, which is then compared against the value from the next run, so it gives a true picture of CPU utilisation over that interval.

dalesit · 2016-05-25T13:32:45Z

This will also be an issue with the CPU check: it samples (by default) for a second, but at least there you are only concerned with whether the CPU is over a high threshold. You can also mitigate the problem by increasing the sleep length, which lengthens the time the deltas are measured over and so reduces the weight of transient spikes in that measuring period.

If you are using metrics-cpu.rb and getting accurate stats into graphite, it might be better to run the CPU check by querying the graphite stats, rather than running a check on the client itself and trying to sample the CPU utilisation.

majormoses · 2017-05-02T15:38:17Z

@dalesit could you put together a PR with some of the recommendations outlined for getting more accurate results (such as increasing the default sleep)?

dalesit · 2017-05-03T15:30:44Z

I don't think there is a real solution for this module. For metrics, the sampling approach is flawed: the sensu checks and metrics arrive at the same time, so the CPU will be artificially high at that point. It is worse for machines (or VMs) with a restricted number of cores.

The metrics-cpu.rb check gives accurate figures, but it outputs raw ticks rather than CPU usage. These need converting to CPU utilisation in the graphing solution.
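To illustrate the conversion from raw ticks to utilisation, here is a minimal sketch in Ruby. It assumes two samples of cumulative tick counters with field names like those on the `cpu` line of `/proc/stat`. The helper name `cpu_utilisation`, the sample values, and the choice to count `iowait` as idle time are all assumptions for illustration, not something the plugin itself provides.

```ruby
# Hypothetical helper: turn two samples of cumulative CPU tick counters
# into a busy percentage for the interval between them.
# Assumption: idle and iowait ticks both count as "not busy".
IDLE_FIELDS = %i[idle iowait].freeze

def cpu_utilisation(previous, current)
  # Per-field tick deltas between the two samples.
  deltas = current.to_h { |field, ticks| [field, ticks - previous.fetch(field)] }
  total = deltas.values.sum
  # No ticks elapsed (or counters reset): report 0 rather than divide by zero.
  return 0.0 unless total.positive?

  idle = deltas.values_at(*IDLE_FIELDS).compact.sum
  (100.0 * (total - idle) / total).round(2)
end

# Made-up sample values, one interval apart.
previous = { user: 1000, nice: 0, system: 500, idle: 8000, iowait: 500 }
current  = { user: 1150, nice: 0, system: 600, idle: 8700, iowait: 550 }

puts cpu_utilisation(previous, current) # prints 25.0
```

Because the counters are cumulative, the result covers the whole interval between the two runs, so a short burst at collection time only contributes its actual share of the ticks.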

For the CPU check, at least there is the option to change the sampling window. For a multi-CPU system it is less of a concern, as the sensu checks are less likely to be hogging the CPU for that second. However, for smaller VMs, with one or two cores, there is a higher likelihood of false positives from a sampling approach, and extending the sampling window will mitigate this. It then becomes a tradeoff between the length of time to complete the check and the representativeness of the sample.
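As a rough illustration of that tradeoff, the sketch below models a check burst that saturates one core for a fixed time on top of a steady background load, and shows what a sampler would report for different window lengths. The model is deliberately simplistic and the function name `observed_utilisation` and all numbers are hypothetical. It assumes the worst case, where the burst falls entirely inside the sampling window.

```ruby
# Hypothetical model of a sampled CPU reading.
# background_pct: steady load across the whole machine, in percent
# burst_seconds:  time for which the checks saturate ONE core
# window_seconds: length of the sampling window
# cores:          number of cores on the machine
def observed_utilisation(background_pct, burst_seconds, window_seconds, cores)
  # Only the part of the burst that fits in the window can be observed.
  overlap = [burst_seconds, window_seconds].min.to_f
  # Share of the machine's capacity in this window consumed by the burst.
  burst_pct = 100.0 * overlap / (window_seconds * cores)
  [background_pct + burst_pct, 100.0].min.round(1)
end

[1, 5, 10].each do |window|
  puts format('%2ds window: %.1f%%', window, observed_utilisation(25.0, 1.0, window, 2))
end
# prints:
#  1s window: 75.0%
#  5s window: 35.0%
# 10s window: 30.0%
```

Under these assumptions a two-core VM with 25% real load reads 75% over a one-second window, but only 30% over ten seconds, at the cost of the check taking ten seconds to complete.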

majormoses · 2017-05-03T16:27:21Z

@dalesit I see. When I have some time I will read through the code in depth and validate, but I am of the opinion that if a check is inherently flawed and no reasonable solution can be found, we should remove it in a major release. Have you played around at all with the sampling in the CPU check and determined a window that seems reasonable? This will obviously depend on the hardware of the machine, but I was just curious whether you had any findings to share. If not, when I have some time I will try to see what seems like a reasonable window with a couple of VMs.

majormoses added the "Status: Thinking" and "Type: Bug" labels on Oct 7, 2017
