Monitoring with check_mk

Mathias Kettner, January 22nd, 2009
firstname.lastname@example.org
www.mathias-kettner.de
----------------------------------------

1. The basic principle of check_mk

The classical approach for integrating service and host checks into
Nagios is to specify small external programs ("plugins") that Nagios
runs at regular intervals. For OS monitoring, the plugin contacts a
remote daemon running on the target machine in order to fetch one item
of information. This could be information about one specific partition,
a network interface or simply the current memory utilisation. The most
common daemon for monitoring Linux and UNIX is NRPE (Nagios Remote
Plugin Executor). Alternatives are SNMP or SSH.

check_mk takes a different approach, with some crucial advantages as we
will see later. The basic idea of check_mk is to fetch *all* information
about a target host at once. For each host to be monitored, check_mk is
called by Nagios only once per time period (e.g. once per minute). It
contacts a small daemon called "mknagios" on the target machine, which
outputs all relevant information about the host, regardless of which
items are actually being monitored. That daemon does not take any
parameters and does not need to be configured.

check_mk then processes that information, extracts all items that have
been configured for monitoring and checks them against configured levels
(the configuration is done in one file, main.mk, on the Nagios server).
It then sends the check results (OK/WARNING/CRITICAL plus performance
data, if any) to Nagios via passive service checks. Nagios processes
these passive service checks just like active ones.

One of the main advantages of this approach as shown so far is a massive
reduction in the CPU resources needed, both on the monitoring host and
on the target host. Consider an average of 100 different services per
host.
In the classic approach with NRPE, in each check period 100 separate
processes have to be started on the Nagios host, 100 TCP connections
have to be opened and closed, and 100 plugins have to be executed
locally on the target machine. check_mk, in contrast, needs just one
plugin to be started, one TCP connection to be opened and one plugin to
be run locally on the target host. The estimated speed-up is at least a
factor of 5 to 10. This allows us to implement a much larger number of
checks per time period on the monitoring server and at the same time
save CPU resources on the target machines.

2. Implementation of the plugin and the agent

The plugin check_mk is installed only on the monitoring server itself
and is implemented in Python. The local agent on the target hosts is
implemented as a shell script and run via inetd or xinetd. That way no
binary code is needed. Currently Linux and Solaris are supported.
Porting the script to other Unices is straightforward. The Linux version
of the script has been successfully tested and used on SLES 9 and
SLES 10, on both 32-bit and 64-bit architectures.

3. Automated inventory

check_mk provides further advantages beyond the actual monitoring. One
key feature is the automated inventory function. It scans one, several
or all hosts for items not yet monitored and automatically configures
them for monitoring. This way new partitions, database instances,
network interfaces, multipath devices, RAID arrays and other items are
automatically added to the monitoring and cannot be forgotten. Adding
one new host or even a list of hosts to the monitoring environment is
also very easy. check_mk can do this because of the nature of the
mknagios agent, which always sends a complete list of all items of the
host that are relevant for monitoring.

4. Automatic creation of Nagios configuration

For all items checked by check_mk, the Nagios configuration files are
created automatically on demand.
Nagios parameters can be customized manually by making use of the
Nagios template mechanism.

5. Direct RRD updates

check_mk supports graphical statistics for all checks that produce
measurable values. It does this by using PNP4Nagios and round robin
databases (RRDs). For the sake of performance, check_mk does *not* hand
the data for the RRDs over to Nagios but writes the values directly
into the RRDs itself. This saves a significant amount of CPU resources.

6. Service aggregations

check_mk lets you configure "service aggregations". If you use that
feature, a host H-summary is created for each host H. Groups of services
on H are aggregated into one single service on H-summary. This way it
is possible, for example, to group all services belonging to a specific
database instance into one single aggregated service for that instance
on H-summary. The aggregated service is considered to be CRITICAL if at
least one of the underlying services on H is CRITICAL. That way it is
much easier to get an overview of the overall state of the hosts. The
aggregated services can also be used for notifications.

7. Clusters

If you monitor services running on HA clusters, you have the problem
that the monitoring server does not know which of the physical nodes a
service is currently running on. If the cluster has a service IP
address, you can use that for monitoring. Some types of clusters do not
have a service address, however. check_mk supports monitoring such
clusters by fetching the data from all physical nodes and checking
whether the specific service is running on at least one of them. In
Nagios such clusters appear as artificial hosts with no IP address.

8. SNMP monitoring

check_mk also supports the "fetch all at once" principle for monitoring
networking devices via SNMP, with all the advantages explained above.
This includes an automatic inventory of Ethernet and FibreChannel
switches for used and unused ports.

9. Integration into Nagios

From Nagios' point of view, check_mk is just one plugin among others.
It is straightforward to mix check_mk-based and classical checks on one
Nagios host.
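
The "fetch all at once" principle from section 1 can be illustrated
with a minimal Python sketch. It is not the actual check_mk code. The
agent output layout (sections introduced by "<<<name>>>" lines), the
sample values, the check list and all function names are assumptions
made for this illustration only. The one real interface used is the
Nagios external command PROCESS_SERVICE_CHECK_RESULT, which is how
passive service check results are submitted to Nagios.

```python
# Illustrative sketch only: one agent dump per host, many checks derived from it.
# The section layout and all names below are hypothetical.

SAMPLE_AGENT_OUTPUT = """\
<<<df>>>
/ 82
/var 95
<<<mem>>>
used_percent 40
"""

STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

# Hypothetical check configuration: (section, item, warn level, crit level)
CONFIGURED_CHECKS = [
    ("df", "/", 80, 90),
    ("df", "/var", 80, 90),
    ("mem", "used_percent", 80, 90),
]


def parse_sections(text):
    """Split one agent dump into {section_name: [[token, ...], ...]}."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("<<<") and line.endswith(">>>"):
            current = sections.setdefault(line[3:-3], [])
        elif current is not None:
            current.append(line.split())
    return sections


def check_percent(value, warn, crit):
    """Map a percentage to a Nagios state code using upper levels."""
    if value >= crit:
        return 2
    if value >= warn:
        return 1
    return 0


def run_checks(hostname, agent_output, checks, now):
    """Evaluate all configured checks against a single agent dump.

    Returns one Nagios external command line per check, in the
    PROCESS_SERVICE_CHECK_RESULT format used for passive checks.
    """
    sections = parse_sections(agent_output)
    commands = []
    for section, item, warn, crit in checks:
        rows = {row[0]: row for row in sections.get(section, [])}
        if item not in rows:
            state = 3
            text = "UNKNOWN - item not found in agent output"
        else:
            value = float(rows[item][1])
            state = check_percent(value, warn, crit)
            text = "%s - %.0f%% used (levels at %d/%d)" % (
                STATE_NAMES[state], value, warn, crit)
        commands.append("[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s %s;%d;%s" % (
            now, hostname, section, item, state, text))
    return commands


if __name__ == "__main__":
    import time
    for cmd in run_checks("web01", SAMPLE_AGENT_OUTPUT,
                          CONFIGURED_CHECKS, int(time.time())):
        print(cmd)
```

However many checks are configured, the agent output is parsed only
once per host and check period. In a real setup the resulting command
lines would be written to the Nagios command file. Where that file
lives depends on the local Nagios installation.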