metrics-cpu-mpstat.rb inflating CPU when run from sensu #11

Open
dalesit opened this issue May 25, 2016 · 4 comments

Comments

@dalesit
Contributor

dalesit commented May 25, 2016

One of my systems shows nearly 100% CPU utilisation according to metrics-cpu-mpstat, but 75% idle time according to sar. The act of measuring the CPU over a second seems to be pushing up the CPU stats, although this does not happen on other identical servers, which are 88% idle (according to sar). Running top -d 0.5 shows that the check-rabbitmq checks use a lot of CPU when they come through. If they coincide with the CPU metrics run, the stats will be skewed. By contrast, metrics-cpu.rb just takes the counter and compares it against the value the next time it is run, so it gives a true picture of CPU utilisation over that interval.

@dalesit
Contributor Author

dalesit commented May 25, 2016

This will also be an issue with the CPU check: it samples for a second by default, but there you are only concerned with whether the CPU is over a high threshold. You can also mitigate the problem by increasing the sleep length, which lengthens the interval the deltas are measured over and so reduces the weight of transient spikes in that measuring period.
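To make the effect of the sampling window concrete, here is a rough sketch in Ruby. It assumes a machine that is 25% busy on average (matching the 75% idle figure from sar) and a hypothetical 0.5 second burst at 100% CPU from a check that lands inside the window. The method name and the burst length are made up for illustration and are not part of any plugin.

```ruby
# Hypothetical illustration: average busy percentage seen by one sampling
# window when a short burst of 100% CPU falls entirely inside it.
def observed_busy_percent(baseline_percent:, burst_seconds:, window_seconds:)
  # The burst cannot occupy more than the window itself.
  burst = [burst_seconds, window_seconds].min.to_f
  quiet = window_seconds - burst
  (burst * 100.0 + quiet * baseline_percent) / window_seconds
end

# Assumed values: 25% baseline load, 0.5 s burst, three candidate windows.
[1, 5, 10].each do |window|
  pct = observed_busy_percent(baseline_percent: 25.0,
                              burst_seconds: 0.5,
                              window_seconds: window)
  puts format('window=%2ds observed=%.2f%%', window, pct)
end
```

Under these assumptions the one-second window reports 62.50%, while the five and ten second windows report 32.50% and 28.75%, much closer to the real 25%.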

If you are using metrics-cpu.rb and getting accurate stats into Graphite, it might be better to run the CPU check by querying the Graphite stats, rather than running a check on the client itself and trying to sample the CPU utilisation.
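A minimal sketch of what such a Graphite-based check could look like, in Ruby. It assumes the data has already been fetched from Graphite's render endpoint with `format=json`, and that the series already holds a busy percentage. The metric name `cpu.busy_percent`, the thresholds and the sample payload are all made up for illustration. The fetch itself is left out.

```ruby
require 'json'

# Average the non-null values of the first series in a Graphite render
# response (format=json). Returns nil when there are no usable points.
def mean_from_render_json(body)
  series = JSON.parse(body).first
  return nil if series.nil?

  # Each datapoint is a [value, timestamp] pair; value may be null.
  values = series['datapoints'].map(&:first).compact
  return nil if values.empty?

  values.sum.to_f / values.size
end

# Map a mean utilisation to a Sensu-style exit status
# (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
def status_for(mean, warn:, crit:)
  return 3 if mean.nil?
  return 2 if mean >= crit
  return 1 if mean >= warn

  0
end

# Made-up payload in the shape the render API returns for format=json.
sample = '[{"target": "cpu.busy_percent", "datapoints": ' \
         '[[22.0, 1464170400], [null, 1464170460], [31.0, 1464170520]]}]'

mean = mean_from_render_json(sample)
puts "mean=#{mean} status=#{status_for(mean, warn: 80, crit: 95)}"
```

Because the check averages over several stored intervals, a single spike caused by the checks themselves has much less influence than it does on a one-second sample.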

@majormoses
Member

majormoses commented May 2, 2017

@dalesit could you put together a PR with some of the recommendations outlined above for getting more accurate results (such as increasing the default sleep)?

@dalesit
Contributor Author

dalesit commented May 3, 2017

I don't think there is a real solution for this module. For metrics, the sampling approach is flawed: the sensu checks and metrics arrive at the same time, so the CPU will be artificially high at that point. It is worse for machines (or VMs) with a restricted number of cores.

The metrics-cpu.rb check gives accurate figures, but it outputs raw ticks rather than CPU usage. The ticks need converting to CPU utilisation in the graphing solution.
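For reference, the conversion from raw ticks to utilisation can be sketched in Ruby as below. This assumes the ticks are the cumulative counters from the aggregate `cpu` line of Linux `/proc/stat`, taken at two points in time. The sample lines and method names are invented for illustration and this is not the plugin's own code.

```ruby
# Order of the first eight columns on the aggregate "cpu" line of /proc/stat.
# The guest columns are skipped because guest time is already counted in user.
FIELDS = %i[user nice system idle iowait irq softirq steal].freeze

# Parse a "cpu  ..." line into a hash of cumulative tick counters.
def parse_cpu_line(line)
  ticks = line.split.drop(1).first(FIELDS.size).map(&:to_i)
  FIELDS.zip(ticks).to_h
end

# Percentage of elapsed ticks spent in each state between two snapshots.
def utilisation(before, after)
  deltas = FIELDS.map { |f| [f, after[f] - before[f]] }.to_h
  total = deltas.values.sum
  return {} if total.zero?

  deltas.transform_values { |d| (100.0 * d / total).round(2) }
end

# Made-up snapshots from two consecutive runs.
earlier = parse_cpu_line('cpu  1000 0 500 8000 100 0 0 0 0 0')
later   = parse_cpu_line('cpu  1150 0 550 8750 150 0 0 0 0 0')

usage = utilisation(earlier, later)
puts usage
puts "busy=#{(100.0 - usage[:idle] - usage[:iowait]).round(2)}%"
```

With these sample counters, 1000 ticks elapse between the runs, of which 750 are idle and 50 are iowait, so the busy figure comes out at 20.0%. The result covers the whole interval between runs, which is why it is not distorted by whatever happens to be running at the moment of collection.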

For the CPU check, at least there is the option to change the sampling window. For a multi-CPU system it is less of a concern, as the sensu checks are less likely to be hogging the CPU for that second. However, for smaller VMs, with one or two cores, there is a higher likelihood of false positives from a sampling approach, and extending the sampling window will mitigate this. It then becomes a tradeoff between the length of time to complete the check and the representativeness of the sample.

@majormoses
Member

@dalesit I see. When I have some time I will read through the code in depth and validate, but I am of the opinion that if a check is inherently flawed and no reasonable solution can be found, we should remove it in a major release. Have you played around at all with the sampling in the CPU check and determined a window that seems reasonable? This will obviously depend on the hardware of the machine, but I was curious whether you had any findings to share. If not, when I have some time I will try to see what seems like a reasonable window with a couple of VMs.


2 participants