Performance tweak for the CheckMK Raw Edition.

fastchecker

Warning: this is proof-of-concept (POC) code.

This is a CheckMK helper that keeps the CheckMK Python code loaded in memory, so a host check takes 0.05s instead of 2s.

By default, Nagios double-forks and then runs a Python script built by check_mk. This consumes a lot of CPU: starting the Python VM takes a long time, and the check file is read from disk again and again. Yet the check itself is just a TCP connection to the server, followed by a connection to a Unix socket to report the result back to Nagios.

CheckMK offers its own core called CMC (used instead of Nagios/Icinga), which doesn't have these issues because the Python check code is kept preloaded. But it is not open-source software.

So here is the idea:

I wrote a small daemon that:

  • Loads the CheckMK modules and all host checks into memory (so it takes some time to start).
  • Provides an HTTP server to run checks and inventory.
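The two points above can be sketched as follows. This is only an illustration of the preload-then-serve pattern, not the actual fastchecker.py: `run_check`, `run_inventory`, and `dispatch` are hypothetical names, the `/inventory/<host>` path is an assumption (only `/check/<host>` appears in this README), and the sketch targets Python 3. The response layout (exit state on the first line, plugin output on the second) follows the curl example in this README.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


# Hypothetical stand-ins for the preloaded CheckMK code. In a real daemon
# these would call into modules imported once at startup, which is where
# the saving comes from.
def run_check(host):
    return 0, "OK - placeholder output for %s" % host


def run_inventory(host):
    return 0, "OK - placeholder inventory for %s" % host


ACTIONS = {"check": run_check, "inventory": run_inventory}


def dispatch(path):
    """Map '/<action>/<host>' to an (http_status, body) pair."""
    parts = path.strip("/").split("/")
    if len(parts) != 2 or parts[0] not in ACTIONS or not parts[1]:
        return 404, "unknown request\n"
    state, output = ACTIONS[parts[0]](parts[1])
    # First line: plugin exit state. Second line: plugin output.
    return 200, "%d\n%s\n" % (state, output)


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = dispatch(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=5001):
    # Listen on localhost only; the port matches the curl example.
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling `serve()` starts the loop. Keeping the routing in a plain function (`dispatch`) makes it testable without opening a socket.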

Performance example:

$ time cmk www ; echo $?
OK - Agent version 1.4.0p24, execution time 0.5 sec|execution_time=0.541 user_time=0.070 system_time=0.010 children_user_time=0.000 children_system_time=0.000 time_agent=0.464

real    0m1.742s
user    0m1.220s
sys     0m0.052s
0

We can see that the overhead of the Python VM and module loading is about 1.2 seconds of user CPU time.

$ time curl -s http://localhost:5001/check/www
0
OK - Agent version 1.4.0p24, execution time 0.5 sec|execution_time=0.508 user_time=0.040 system_time=0.000 children_user_time=0.000 children_system_time=0.000 time_agent=0.464

real    0m0.523s
user    0m0.008s
sys     0m0.004s

Here, the overhead of curl is only 0.008s of user CPU time.

So for our monitoring setup, which checks 657 hosts every minute, that is 1.2 * 657 / 60 == 13.14 minutes of CPU time wasted on each cycle.
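The estimate can be reproduced with a few lines of Python. The variable names are illustrative. The last figure is a derived reading of the same numbers and not something measured in this README: CPU seconds needed per second of wall-clock time.

```python
hosts = 657           # hosts checked per cycle (from the text above)
overhead_s = 1.2      # per-check startup overhead in CPU seconds
interval_s = 60       # length of one check cycle in seconds

cpu_seconds = hosts * overhead_s      # total overhead per cycle
cpu_minutes = cpu_seconds / 60        # the figure quoted above
cores_busy = cpu_seconds / interval_s  # CPU seconds per wall-clock second

print("overhead per cycle: %.1f s = %.2f min" % (cpu_seconds, cpu_minutes))
print("equivalent fully busy cores: %.2f" % cores_busy)
```

Because the cycle itself lasts one minute, the number of wasted CPU minutes per cycle equals the number of cores that would be kept busy by startup overhead alone.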

The load on our tiny Intel NUC server was around 40-60; now it's 0.6.

The goal of fastchecker is to have Icinga/Nagios handle only the active checks, scheduling, and notifications. fastchecker runs all the Python code.

Also, fastpinger pings your whole infrastructure with fping and dumps a report. The fastchecker hook reads the report and mimics the check_icmp output.
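As a rough sketch of the "mimic check_icmp" step: the code below parses one per-host summary line of the kind `fping -q -c N` prints and turns it into a check_icmp-style status line. The format of the actual fastpinger report is not described here, so the input format is an assumption, as are the function names, the thresholds (200 ms / 40% warning, 500 ms / 80% critical), and the exact layout of the output string, which only approximates what check_icmp prints.

```python
import re

# Assumed input: an fping summary line such as
#   "www : xmt/rcv/%loss = 3/3/0%, min/avg/max = 0.41/0.52/0.61"
# The min/avg/max part is missing when no reply was received.
FPING_LINE = re.compile(
    r"^(?P<host>\S+)\s*:\s*xmt/rcv/%loss = \d+/\d+/(?P<loss>\d+)%"
    r"(?:, min/avg/max = [\d.]+/(?P<avg>[\d.]+)/[\d.]+)?"
)


def parse_fping_line(line):
    """Return {'host', 'loss', 'rta'} or None if the line does not match."""
    m = FPING_LINE.match(line.strip())
    if m is None:
        return None
    avg = m.group("avg")
    return {
        "host": m.group("host"),
        "loss": int(m.group("loss")),
        "rta": float(avg) if avg else None,  # None means no reply at all
    }


def icmp_like(result, warn=(200.0, 40), crit=(500.0, 80)):
    """Build an (exit_state, text) pair resembling check_icmp output."""
    rta, loss = result["rta"], result["loss"]
    if rta is None or loss >= crit[1] or rta >= crit[0]:
        state, label = 2, "CRITICAL"
    elif loss >= warn[1] or rta >= warn[0]:
        state, label = 1, "WARNING"
    else:
        state, label = 0, "OK"
    rta_text = "nan" if rta is None else "%.3fms" % rta
    perf_rta = 0.0 if rta is None else rta
    text = "%s - %s: rta %s, lost %d%%|rta=%.3fms;%.3f;%.3f;0; pl=%d%%;%d;%d;;" % (
        label, result["host"], rta_text, loss,
        perf_rta, warn[0], crit[0], loss, warn[1], crit[1],
    )
    return state, text
```

Doing one fping run for all hosts and answering each Nagios ping check from the parsed report avoids spawning one check_icmp process per host.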

Configuration

Look at the prepare.sh script to prepare the environment, then edit etc/icinga/conf.d/check_mk_templates.cfg and replace the command_line of the following commands (the original lines are kept as comments):

# Calling check_mk with precompiled checks
define command {
  command_name  check-mk
  # command_line  python $USER4$/var/check_mk/precompiled/"$HOSTNAME$"
  command_line  $USER4$/etc/check_mk/fastchecker/hook.sh check "$HOSTNAME$"
}

# Inventory check
define command {
  command_name  check-mk-inventory
  # command_line  check_mk --cache --check-discovery "$HOSTNAME$"
  command_line  $USER4$/etc/check_mk/fastchecker/hook.sh inventory "$HOSTNAME$"
}

# Commands for services of PING-only hosts
define command {
  command_name  check-mk-ping
  # command_line  $USER4$/lib/nagios/plugins/check_icmp $ARG1$
  command_line  $USER4$/etc/check_mk/fastchecker/hook.sh ping $ARG1$
}

# Host check commands
define command {
  command_name  check-mk-host-ping
  # command_line  $USER4$/lib/nagios/plugins/check_icmp $ARG1$ $HOSTADDRESS$
  command_line  $USER4$/etc/check_mk/fastchecker/hook.sh ping $ARG1$ $HOSTADDRESS$
}

define command {
  command_name  check-mk-host-ping-cluster
  # command_line  $USER4$/lib/nagios/plugins/check_icmp -m 1 $ARG1$ $_HOSTNODEIPS$
  command_line  $USER4$/etc/check_mk/fastchecker/hook.sh ping -m 1 $ARG1$ $_HOSTNODEIPS$
}
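
To illustrate what a hook called this way has to do for the check and inventory commands, here is a minimal Python sketch. It is not the contents of hook.sh: the function names are hypothetical, the `/inventory/<host>` URL is an assumption, and the response format (exit state on the first line, plugin output after it) is inferred from the curl example earlier in this README. Nagios expects the plugin output on stdout and the state as the process exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import sys
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:5001"  # port taken from the curl example


def split_response(body):
    """Split '<state>\\n<output>' into (exit_code, output).

    Anything that does not start with a valid Nagios state maps to UNKNOWN.
    """
    first, _, rest = body.partition("\n")
    try:
        state = int(first.strip())
    except ValueError:
        state = -1
    if state not in (0, 1, 2, 3):
        return 3, "UNKNOWN - unexpected response: %r" % body[:80]
    return state, rest.rstrip("\n")


def main(argv):
    if len(argv) != 3:
        print("UNKNOWN - usage: hook check|inventory HOSTNAME")
        return 3
    action, host = argv[1], argv[2]
    url = "%s/%s/%s" % (BASE_URL, quote(action), quote(host))
    try:
        with urlopen(url, timeout=30) as resp:
            body = resp.read().decode("utf-8", "replace")
    except OSError as exc:  # covers connection errors and HTTP errors
        print("UNKNOWN - fastchecker unreachable: %s" % exc)
        return 3
    state, output = split_response(body)
    print(output)
    return state
```

A script built on this would end with `sys.exit(main(sys.argv))`. Reporting UNKNOWN when the daemon is unreachable keeps a stopped fastchecker from being mistaken for a host failure.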
