Initial file system monitor plugin
Currently limited to monitoring a single file system.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
troglobit committed Dec 20, 2023
1 parent 28390e0 commit 9af73c5
Showing 14 changed files with 160 additions and 49 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/coverity.yml
@@ -56,8 +56,8 @@ jobs:
- name: Configure
run: |
./autogen.sh
./configure --with-generic --with-loadavg --with-filenr \
--with-meminfo --enable-examples
./configure --with-generic --with-loadavg --with-filenr --with-fsmon \
--with-meminfo --enable-examples --enable-compat
- name: Build
run: |
export PATH=`pwd`/coverity/bin:$PATH
1 change: 1 addition & 0 deletions README.md
@@ -81,6 +81,7 @@ aspects of the system, such as:
- Load average
- Memory leaks
- File descriptor leaks
- File system usage
- Process live locks
- Reset counter, e.g., for snmpEngineBoots (RFC 2574)
- Generic script
20 changes: 15 additions & 5 deletions configure.ac
@@ -64,6 +64,10 @@ AC_ARG_WITH([filenr],
AS_HELP_STRING([--with-filenr[=SEC]], [Enable file descriptor leak monitor, poll: 300 sec]),
[with_filenr=$withval], [with_filenr=no])

AC_ARG_WITH([fsmon],
AS_HELP_STRING([--with-fsmon[=SEC]], [Enable file system monitor, poll: 300 sec]),
[with_fsmon=$withval], [with_fsmon=no])

AC_ARG_WITH([meminfo],
[AS_HELP_STRING([--with-meminfo[=SEC]], [Enable memory leak monitor, poll: 300 sec])],
[with_meminfo=$withval], [with_meminfo=no])
@@ -83,6 +87,10 @@ AS_IF([test "x$with_filenr" != "xno"], [
AS_IF([test "x$with_filenr" = "xyes"], [with_filenr=300])
AC_DEFINE_UNQUOTED(FILENR_PLUGIN, $with_filenr, [Enable file descriptor leak monitor])])

AS_IF([test "x$with_fsmon" != "xno"], [
AS_IF([test "x$with_fsmon" = "xyes"], [with_fsmon=300])
AC_DEFINE_UNQUOTED(FSMON_PLUGIN, $with_fsmon, [Enable file system monitor])])

AS_IF([test "x$with_meminfo" != "xno"], [
AS_IF([test "x$with_meminfo" = "xyes"], [with_meminfo=300])
AC_DEFINE_UNQUOTED(MEMINFO_PLUGIN, $with_meminfo, [Enable memory leak monitor])])
@@ -108,11 +116,12 @@ AS_IF([test "x$with_systemd" = "xyes" -o "x$with_systemd" = "xauto"], [
AS_IF([test "x$with_systemd" != "xno"],
[AC_SUBST([systemddir], [$with_systemd])])

AM_CONDITIONAL([HAVE_SYSTEMD], [test "x$with_systemd" != "xno"])
AM_CONDITIONAL(LOADAVG_PLUGIN, [test "x$with_loadavg" != "xno"])
AM_CONDITIONAL(FILENR_PLUGIN, [test "x$with_filenr" != "xno"])
AM_CONDITIONAL(MEMINFO_PLUGIN, [test "x$with_meminfo" != "xno"])
AM_CONDITIONAL(GENERIC_PLUGIN, [test "x$with_generic" != "xno"])
AM_CONDITIONAL([HAVE_SYSTEMD], [test "x$with_systemd" != "xno"])
AM_CONDITIONAL(LOADAVG_PLUGIN, [test "x$with_loadavg" != "xno"])
AM_CONDITIONAL(FILENR_PLUGIN, [test "x$with_filenr" != "xno"])
AM_CONDITIONAL(FSMON_PLUGIN, [test "x$with_fsmon" != "xno"])
AM_CONDITIONAL(MEMINFO_PLUGIN, [test "x$with_meminfo" != "xno"])
AM_CONDITIONAL(GENERIC_PLUGIN, [test "x$with_generic" != "xno"])
AM_CONDITIONAL(ENABLE_EXAMPLES, [test "$enable_examples" = yes])

# Expand $sbindir early, into $SBINDIR, for systemd unit file
@@ -159,6 +168,7 @@ cat <<EOF
generic script (sec): $with_generic
loadavg poll (sec)..: $with_loadavg
filenr poll (sec)...: $with_filenr
fsmon poll (sec)....: $with_fsmon
meminfo poll (sec)..: $with_meminfo

------------- Compiler version --------------
6 changes: 0 additions & 6 deletions doc/TODO.md
@@ -12,9 +12,3 @@ Add system health monitor with capabilities to monitor:
* Temperature sensor
* Network connectivity, e.g. ping with optional outbound iface and a
script to run if ping (three attempts) fails
* RAM disks used for log files. Best way is probably to implement this
as a generic checker that the user can define any way they like. E.g,

fs-monitor /var { warning = 90%, critical 95% }

Use the C library API statfs(2).
76 changes: 52 additions & 24 deletions doc/features.md
@@ -16,8 +16,36 @@ Built-in Monitors
-----------------

[watchdogd(8)][] supports optional monitoring of several system resources that
can be enabled in [watchdogd.conf(5)][]. First, system load average that can
be monitored with:
can be enabled in [watchdogd.conf(5)][].

All of these monitors can be *very* useful on an embedded or headless
system with little or no operator supervision.

The two values, `warning` and `critical`, are the warning and reboot
levels in percent. The latter is optional, if it is omitted reboot is
disabled. A script can also be run instead of reboot, see the `.conf`
file for details.

Determining suitable system load average levels is tricky. It always
depends on the system and use-case, not just the number of CPU cores.
Peak loads of 16.00 on an 8 core system may be responsive and still
useful but 2.00 on a 2 core system may be completely bogged down. Make
sure to read up on the subject and thoroughly test your system before
enabling a reboot trigger value. `watchdogd` uses an average of the
first two load average values, the one (1) and five (5) minute. For
more information on the UNIX load average, see this [StackOverflow
question][loadavg].

The RAM usage monitor only triggers on systems without swap. This is
detected by reading the file `/proc/meminfo`, looking for the
`SwapTotal:` value. For more details on the underlying mechanisms of
file descriptor usage, see [this article][filenr]. For more info on the
details of memory usage, see [this article][meminfo].


### System Load

The system load average can be monitored with:

```
loadavg {
@@ -35,7 +63,9 @@ syslog output for load average looks like this:

watchdogd[2323]: Loadavg: 0.32, 0.07, 0.02 (1, 5, 15 min)

Second, the memory leak detector, a value of 1.0 means 100% memory use:
### Memory Usage

The memory leak detector, a value of 1.0 means 100% memory use:

```
meminfo {
@@ -51,7 +81,9 @@ The syslog output looks like this:

watchdogd[2323]: Meminfo: 59452 kB, cached: 23912 kB, total: 234108 kB

Third, file descriptor leak detector:
### File Descriptor Usage

File descriptor leak detector:

```
filenr {
@@ -67,29 +99,25 @@ The syslog output looks like this:

watchdogd[2323]: File nr: 288/17005

All of these monitors can be *very* useful on an embedded or headless
system with little or no operator supervision.

The two values, `warning` and `critical`, are the warning and reboot
levels in percent. The latter is optional, if it is omitted reboot is
disabled. A script can also be run instead of reboot, see the `.conf`
file for details.
### File System Usage

Determining suitable system load average levels is tricky. It always
depends on the system and use-case, not just the number of CPU cores.
Peak loads of 16.00 on an 8 core system may be responsive and still
useful but 2.00 on a 2 core system may be completely bogged down. Make
sure to read up on the subject and thoroughly test your system before
enabling a reboot trigger value. `watchdogd` uses an average of the
first two load average values, the one (1) and five (5) minute. For
more information on the UNIX load average, see this [StackOverflow
question][loadavg].
Currently only a single file system can be monitored. In this example we
monitor `/var` every five minutes.

The RAM usage monitor only triggers on systems without swap. This is
detected by reading the file `/proc/meminfo`, looking for the
`SwapTotal:` value. For more details on the underlying mechanisms of
file descriptor usage, see [this article][filenr]. For more info on the
details of memory usage, see [this article][meminfo].
```
fsmon /var {
enabled = true
interval = 300 # Every five minutes
logmark = true
warning = 0.8
critical = 0.95
}
```

The syslog output looks like this:

watchdogd[2323]: Fsmon /var: blocks 404/28859 inodes 389/28874
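
The plugin source, `fsmon.c`, is not shown in this diff, so the following is only a minimal sketch of how block and inode usage might be compared against the `warning` and `critical` levels. It assumes the POSIX `statvfs(3)` call rather than the `statfs(2)` call mentioned in the old TODO entry, and the helper names `used_fraction()`, `check_level()`, and `fs_check()` are hypothetical; they are not part of `watchdogd`.

```c
#include <stdio.h>
#include <sys/statvfs.h>

/* Hypothetical helper: fraction of a resource in use, 0.0 to 1.0 */
static double used_fraction(unsigned long long total, unsigned long long avail)
{
	if (total == 0)
		return 0.0;	/* nothing reported, e.g., no inode accounting */
	if (avail > total)
		avail = total;

	return (double)(total - avail) / (double)total;
}

/*
 * Hypothetical helper: 0 = OK, 1 = warning, 2 = critical.
 * Assumption: a level of 0.0 means that watermark is disabled.
 */
static int check_level(double level, double warning, double critical)
{
	if (critical > 0.0 && level >= critical)
		return 2;
	if (warning > 0.0 && level >= warning)
		return 1;

	return 0;
}

/* Worst of the block and inode levels for path, or -1 if statvfs() fails */
static int fs_check(const char *path, double warning, double critical)
{
	struct statvfs f;
	int blocks, inodes;

	if (statvfs(path, &f))
		return -1;

	blocks = check_level(used_fraction(f.f_blocks, f.f_bavail), warning, critical);
	inodes = check_level(used_fraction(f.f_files, f.f_favail), warning, critical);

	return blocks > inodes ? blocks : inodes;
}
```

With the example settings above, `fs_check("/var", 0.8, 0.95)` would return 1 once either blocks or inodes reach 80% use, and 2 at 95%.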


Generic Script
2 changes: 2 additions & 0 deletions man/watchdogd.8
@@ -46,6 +46,8 @@ Memory leaks
.It
File descriptor leaks
.It
File system usage
.It
Process live locks
.It
Reset counter, e.g., for snmpEngineBoots (RFC 2574)
43 changes: 43 additions & 0 deletions man/watchdogd.conf.5
@@ -33,6 +33,8 @@ Process supervisor, monitor the heartbeat of processes
.It Cm filenr
File descriptor monitor, also covers sockets, and other descriptor based
resources
.It Cm fsmon
File system monitor, checks both available blocks and inodes.
.It Cm loadavg
CPU load average monitor
.It Cm meminfo
@@ -178,6 +180,47 @@ action is used. The scripts are called in the same way as the global
script, same arguments.
.El
.El
.Ss File System Monitor
.Bl -tag -width TERM
.It Cm fsmon Ar /mountpoint {}
Monitors the given path
.Ar /mountpoint
for block and inode usage. If either exceeds the configured watermarks,
action is taken.
.Pp
The script is called with the
.Cm fsmon
label as the first argument, and the monitored path and exceeded
resource are available as environment variables:
.Pp
.Bl -tag -compact
.It Cm FSMON_TYPE
One of 'blocks' or 'inodes' that exceeded the watermark.
.It Cm FSMON_NAME
Name of monitored path.
.El
.Pp
The settings are the same as the other monitor plugins:
.Bl -tag -width TERM
.It Cm enabled = Ar true | false
Enable or disable plugin, default: disabled
.It Cm interval = Ar SEC
Poll interval, default: 300 sec
.It Cm logmark = Ar true | false
Log current stats every poll interval. Default: disabled
.It Cm warning = Ar LEVEL
High watermark level, alert sent to log.
.It Cm critical = Ar LEVEL
Critical watermark level, alert sent to log, followed by reboot or
script action.
.It Cm script = Ar "/path/to/reboot-action.sh"
Optional script to run instead of reboot if critical watermark level is
reached. If omitted the global
.Ql script
action is used. The scripts are called in the same way as the global
script, same arguments.
.El
.El
.Ss CPU Load Average Monitor
.Bl -tag -width TERM
.It Cm loadavg Ar {}
3 changes: 3 additions & 0 deletions src/Makefile.am
@@ -14,6 +14,9 @@ watchdogd_SOURCES = watchdogd.c private.h \
if FILENR_PLUGIN
watchdogd_SOURCES += filenr.c
endif
if FSMON_PLUGIN
watchdogd_SOURCES += fsmon.c
endif
if LOADAVG_PLUGIN
watchdogd_SOURCES += loadavg.c
endif
22 changes: 14 additions & 8 deletions src/conf.c
@@ -27,27 +27,29 @@

static char *fn;

#if defined(LOADAVG_PLUGIN) || defined(MEMINFO_PLUGIN) || defined(FILENR_PLUGIN)
#if defined(LOADAVG_PLUGIN) || defined(MEMINFO_PLUGIN) || defined(FILENR_PLUGIN) || defined(FSMON_PLUGIN)
static int checker(uev_ctx_t *ctx, cfg_t *cfg, const char *sect,
int (*init)(uev_ctx_t *, int, int, float, float, char *))
int (*init)(uev_ctx_t *, const char *, int, int, float, float, char *))
{
int rc;
cfg_t *sec;
int rc;

sec = cfg_getnsec(cfg, sect, 0);
if (sec && cfg_getbool(sec, "enabled")) {
int period, logmark;
char *script;
const char *name;
float warn, crit;
char *script;

name = cfg_title(sec);
period = cfg_getint(sec, "interval");
logmark = cfg_getbool(sec, "logmark");
warn = cfg_getfloat(sec, "warning");
crit = cfg_getfloat(sec, "critical");
script = cfg_getstr(sec, "script");
rc = init(ctx, period, logmark, warn, crit, script);
rc = init(ctx, name, period, logmark, warn, crit, script);
} else {
rc = init(ctx, 0, 0, 0.0, 0.0, NULL);
rc = init(ctx, NULL, 0, 0, 0.0, 0.0, NULL);
}

return rc;
@@ -57,10 +59,10 @@ static int checker(uev_ctx_t *ctx, cfg_t *cfg, const char *sect,
#if defined(GENERIC_PLUGIN)
static int generic_plugin_checker(uev_ctx_t *ctx, cfg_t *cfg)
{
cfg_t *sec;
int warn_level, crit_level;
char *script, *monitor;
int period, timeout;
int warn_level, crit_level;
cfg_t *sec;

sec = cfg_getnsec(cfg, "generic", 0);
if (!sec || !cfg_getbool(sec, "enabled")) {
@@ -214,6 +216,7 @@ int conf_parse_file(uev_ctx_t *ctx, char *file)
CFG_SEC ("reset-reason", reset_reason_opts, CFGF_NONE),
CFG_STR ("script", NULL, CFGF_NONE),
CFG_SEC ("filenr", checker_opts, CFGF_NONE),
CFG_SEC ("fsmon", checker_opts, CFGF_MULTI | CFGF_TITLE),
CFG_SEC ("loadavg", checker_opts, CFGF_NONE),
CFG_SEC ("meminfo", checker_opts, CFGF_NONE),
CFG_SEC ("generic", generic_plugin_opts, CFGF_NONE),
@@ -280,6 +283,9 @@ int conf_parse_file(uev_ctx_t *ctx, char *file)
#ifdef FILENR_PLUGIN
checker(ctx, cfg, "filenr", filenr_init);
#endif
#ifdef FSMON_PLUGIN
checker(ctx, cfg, "fsmon", fsmon_init);
#endif
#ifdef LOADAVG_PLUGIN
checker(ctx, cfg, "loadavg", loadavg_init);
#endif
5 changes: 4 additions & 1 deletion src/filenr.c
@@ -82,8 +82,11 @@ static void cb(uev_t *w, void *arg, int events)
}
}

int filenr_init(uev_ctx_t *ctx, int T, int mark, float warn, float crit, char *script)
int filenr_init(uev_ctx_t *ctx, const char *name, int T, int mark,
float warn, float crit, char *script)
{
(void)name;

if (!T) {
INFO("File descriptor leak monitor disabled.");
return uev_timer_stop(&watcher);
5 changes: 4 additions & 1 deletion src/loadavg.c
@@ -104,8 +104,11 @@ static void cb(uev_t *w, void *arg, int events)
* Every T seconds we check loadavg
* First run is after 1 sec on init, then every period seconds
*/
int loadavg_init(uev_ctx_t *ctx, int T, int mark, float warn, float crit, char *script)
int loadavg_init(uev_ctx_t *ctx, const char *name, int T, int mark,
float warn, float crit, char *script)
{
(void)name;

if (!T) {
INFO("Load average monitor disabled.");
return uev_timer_stop(&watcher);
5 changes: 4 additions & 1 deletion src/meminfo.c
@@ -125,8 +125,11 @@ static void cb(uev_t *w, void *arg, int events)
}
}

int meminfo_init(uev_ctx_t *ctx, int T, int mark, float warn, float crit, char *script)
int meminfo_init(uev_ctx_t *ctx, const char *name, int T, int mark,
float warn, float crit, char *script)
{
(void)name;

if (!T) {
INFO("Memory leak monitor disabled.");
return uev_timer_stop(&watcher);
2 changes: 2 additions & 0 deletions src/monitor.h
@@ -19,6 +19,8 @@
#define WDOG_MONITOR_H_

int filenr_init (uev_ctx_t *ctx, const char *name, int T, int mark, float warn, float crit, char *script);
int fsmon_init (uev_ctx_t *ctx, const char *name, int T, int mark, float warn, float crit, char *script);
int loadavg_init (uev_ctx_t *ctx, const char *name, int T, int mark, float warn, float crit, char *script);
int meminfo_init (uev_ctx_t *ctx, const char *name, int T, int mark, float warn, float crit, char *script);
