Initial implementation of zed (ZFS Event Daemon) #2085

Closed · wants to merge 9 commits


@dun
Member
dun commented Jan 29, 2014

zed monitors ZFS events. When a zevent is posted, zed will run any scripts that have been enabled for the corresponding zevent class. Multiple scripts may be invoked for a given zevent. The zevent nvpairs are passed to the scripts as environment variables. Refer to the zed(8) manpage for details.
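
For a concrete picture, here is a minimal sketch of such a script (the ZEVENT_POOL, ZEVENT_VDEV_GUID, and ZEVENT_SUBCLASS variables appear in the bundled scripts and debug output quoted later in this thread; everything else is illustrative):

    #!/bin/sh
    # Hypothetical zedlet: zed runs every enabled script matching the
    # zevent class, with each zevent nvpair exported as a ZEVENT_* variable.
    test -f ./zed.rc && . ./zed.rc    # the scripts currently source zed.rc from their own dir
    logger -t zed "pool=${ZEVENT_POOL} vdev=${ZEVENT_VDEV_GUID} subclass=${ZEVENT_SUBCLASS}"
    exit 0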

Initial scripts have been developed to log events to syslog, automatically rebuild to a hot spare device, and send email in response to checksum, resilver.finish, and scrub.finish events. To enable email notifications, configure ZED_EMAIL in zed.rc. To enable hot sparing, uncomment ZED_SPARE_ON_IO_ERRORS and ZED_SPARE_ON_CHECKSUM_ERRORS in zed.rc. (Note that the autoexpand property is not yet supported.)
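
Those knobs live in zed.rc, which is plain shell sourced by the enabled scripts, so enabling them looks roughly like this (the values shown are illustrative, not the shipped defaults):

    # /etc/zfs/zed.d/zed.rc (excerpt; values illustrative)
    ZED_EMAIL="root@localhost"
    ZED_SPARE_ON_IO_ERRORS=1
    ZED_SPARE_ON_CHECKSUM_ERRORS=1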

An EID (Event IDentifier) has been added to each event to uniquely identify it throughout the lifetime of the loaded ZFS kernel module; it is a monotonically increasing integer that resets to 1 each time the module is loaded. To support hot spare functionality, the following members have been added to all zevent ereports that include a vdev: vdev_spare_paths, vdev_spare_guids, vdev_read_errors, vdev_write_errors, and vdev_cksum_errors.

zed is a work-in-progress. There are still rough edges and dark corners where you may be eaten by a grue. Due to its size, I wanted to start the pull request now while I continue with code cleanup and fixes. It has not yet been fully integrated into the spec file to start at boot.

Assimilates issue #1896 and lays the groundwork for moving forward on issue #2.

behlendorf and others added some commits Nov 22, 2013
@behlendorf @dun behlendorf Add a unique "eid" value to all zevents
Tagging each zevent with a unique monotonically increasing EID
(Event IDentifier) provides the required infrastructure for a user
space daemon to reliably process zevents.  By writing the EID to
persistent storage the daemon can safely resume where it left off
in the event stream when it's restarted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
e9f69b6
@behlendorf @dun behlendorf Add zpool_events_seek() functionality
The ZFS_IOC_EVENTS_SEEK ioctl was added to allow user space callers
to seek around the zevent file descriptor by EID.  When a specific
EID is passed and it exists, the cursor will be positioned there.
If the EID is no longer cached by the kernel, ENOENT is returned.
The caller may also pass ZEVENT_SEEK_START or ZEVENT_SEEK_END to seek
to those respective locations.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
09111e3
@behlendorf @dun behlendorf Clarify zpool_events_next() comment
Due to the very poorly chosen argument name 'cleanup_fd' it was
completely unclear that this file descriptor is used to track the
current cursor location.  When the file descriptor is created by
opening ZFS_DEV a private cursor is created in the kernel for the
returned file descriptor.  Subsequent calls to zpool_events_next()
and zpool_events_seek() then require the file descriptor as an
argument to reposition the cursor.  When the file descriptor is
closed the kernel state tracking the cursor is destroyed.

This patch contains no functional change, it just changes a
few variable names and clarifies the documentation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
11de213
@dun dun Add defs for makefile installation dir vars
Add macro definitions to AM_CPPFLAGS to propagate makefile installation
directory variables for libexecdir, runstatedir, sbindir, and
sysconfdir.

https://www.gnu.org/software/autoconf/manual/autoconf-2.69/html_node/Installation-Directory-Variables.html

  A corollary is that you should not use these variables except
  in makefiles. For instance, instead of trying to evaluate
  datadir in configure and hard-coding it in makefiles using e.g.,
  'AC_DEFINE_UNQUOTED([DATADIR], ["$datadir"], [Data directory.])',
  you should add -DDATADIR='$(datadir)' to your makefile's definition
  of CPPFLAGS (AM_CPPFLAGS if you are also using Automake).

The runstatedir directory is for "installing data files which the
programs modify while they run, that pertain to one specific machine,
and which need not persist longer than the execution of the program".

https://www.gnu.org/prep/standards/html_node/Directory-Variables.html

It will be defined by autoconf 2.70 or later, and default to
"$(localstatedir)/run".

http://git.savannah.gnu.org/gitweb/?p=autoconf.git;a=commit;h=a197431414088a417b407b9b20583b2e8f7363bd

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
ad057bc
@dun dun Initial implementation of zed (ZFS Event Daemon)
zed monitors ZFS events.  When a zevent is posted, zed will run any
scripts that have been enabled for the corresponding zevent class.
Multiple scripts may be invoked for a given zevent.  The zevent
nvpairs are passed to the scripts as environment variables.

Initial scripts have been developed to log events to syslog and send
email in response to checksum, resilver.finish, and scrub.finish
events.  By default, email will only be sent if the ZED_EMAIL variable
is configured in zed.rc.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
a2db376
@behlendorf @dun behlendorf Make command line guid parsing more tolerant
Several of the zfs utilities allow you to pass a vdev's guid rather
than the device name.  However, the utilities are not consistent in
how they parse that guid.  For example, 'zinject' expects the guid
to be passed as a hex value while 'zpool replace' wants it as a
decimal.  The user is forced to just know what format to use.

This patch improves things by making the parsing more tolerant.
When strtol(3) is called using 0 for the base, rather than say
10 or 16, it will accept hex, decimal, or octal input based
on the prefix.  From the man page:

    If base is zero or 16, the string may then include a "0x"
    prefix, and  the number  will  be read in base 16; otherwise,
    a zero base is taken as 10 (decimal) unless the next character
    is '0', in which case it  is  taken as 8 (octal).

NOTE: There may be additional conversions not caught by this patch.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
66c444c
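
In practice this means a vdev guid can now be given to either tool in any of the strtol(3) notations; for instance, both of these hypothetical invocations parse to the same guid (pool, device, and guid value illustrative):

    # strtol(..., 0) accepts both notations, so these are equivalent:
    zpool replace tank 3735928559 sdb     # decimal
    zpool replace tank 0xdeadbeef sdb     # hex (0xdeadbeef == 3735928559)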
@behlendorf @dun behlendorf Add missing DATA_TYPE_STRING_ARRAY output
This functionality has always been missing.  But until now there
were no zevents which included an array of strings, so it wasn't
missed.  However, that has now changed, so to ensure this information
is output correctly by 'zpool events -v', support for
DATA_TYPE_STRING_ARRAY has been implemented.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
2366416
@behlendorf @dun behlendorf Add automatic hot spare functionality
When a vdev starts getting I/O or checksum errors it is now
possible to automatically rebuild to a hot spare device.

To cleanly support this functionality in a shell script some
additional information was added to all zevent ereports which
include a vdev.  This covers both io and checksum zevents, but
it may be used by other scripts.

In the Illumos FMA solution the same information is required
but it is retrieved through the libzfs library interface.
Specifically the following members were added:

  vdev_spare_paths  - List of vdev paths for all hot spares.
  vdev_spare_guids  - List of vdev guids for all hot spares.
  vdev_read_errors  - Read errors for the problematic vdev.
  vdev_write_errors - Write errors for the problematic vdev.
  vdev_cksum_errors - Checksum errors for the problematic vdev.

By default the required hot spare scripts are installed but this
functionality is disabled.  To enable hot sparing uncomment the
ZED_SPARE_ON_IO_ERRORS and ZED_SPARE_ON_CHECKSUM_ERRORS in the
/etc/zfs/zed.d/zed.rc configuration file.

These scripts do not add support for the autoexpand property. At
a minimum this requires adding a new udev rule to detect when
a new device is added to the system.  It also requires that the
autoexpand policy be ported from Illumos, see:

  https://github.com/illumos/illumos-gate/blob/master/usr/src/cmd/syseventd/modules/zfs_mod/zfs_mod.c

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
c410014
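
Assuming the nvpair-to-variable mapping follows the ZEVENT_VDEV_GUID pattern used by io-spare.sh (so the new members surface as ZEVENT_VDEV_CKSUM_ERRORS, ZEVENT_VDEV_SPARE_PATHS, and so on), a script can consume them roughly like this (the threshold logic is a sketch, not the shipped script):

    # Hypothetical excerpt: kick in a spare once the checksum error
    # count attached to the ereport crosses the configured threshold.
    if [ -n "${ZED_SPARE_ON_CHECKSUM_ERRORS}" ] && \
       [ "${ZEVENT_VDEV_CKSUM_ERRORS:-0}" -ge "${ZED_SPARE_ON_CHECKSUM_ERRORS}" ]; then
        for SPARE in ${ZEVENT_VDEV_SPARE_PATHS}; do
            ${ZPOOL} replace "${ZEVENT_POOL}" "${ZEVENT_VDEV_GUID}" "${SPARE}" && break
        done
    fi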
@FransUrbo
Member

AWESOME!!! Can't wait to look at it.

Maybe rebase as one commit?

@FransUrbo
Member

Some notes (without having tested it, just looking through the code - looks awesome by the way):

  • Default path to zed.rc is ./ for the scripts, but SYSCONFDIR/zfs/zed.conf in zed.h.
    Even though it seems that it first does a chdir to that dir, there's still a mismatch between the filenames. And I still think the script(s) should have SYSCONFDIR/zfs/zed.conf set - the script(s) might be located somewhere else...
  • What does this actually (in practice) mean:
# This script only provides the functionality for automatically kicking in
# a hot spare.  It does not provide any of the autoreplace functionality.
# This means that once the required repair is complete the hot spare must
# be manually retired using the 'zpool detach' command.

especially since the code further down looks like this:

              if [ "${STATUS}" = "AVAIL" ]; then
                      ${ZPOOL} replace ${ZEVENT_POOL} \
                          ${ZEVENT_VDEV_GUID} ${SPARE} && break
              fi

Looks like a replace to me...

@behlendorf
Member

Nice job @dun! This is a great first step, thanks for posting it so we can get some feedback.

Maybe rebase as one commit?

We could potentially squash together all the commits which add the new infrastructure. But a few of them are just bug fixes, and I think it makes sense to keep those separate. They could be safely merged independently of the rest of this.

zed.conf and zed.rc, there's still a mismatch between the file names.

Good eye, this is actually intentional at the moment and perhaps should be described in the merge commit message. The zed.rc file is expected to be just a stopgap configuration file for the zed scripts, with zed.conf eventually replacing it once the config file parser is written. We just didn't want that work to block getting an initial pull request open for comment.

It might not be such a bad idea though to put the path for the default zed.rc in the environment of the script rather than assuming it's in the same directory.
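
Something along these lines, say (ZED_RCFILE is a hypothetical variable name, not part of this pull request):

    # Today the scripts assume zed.rc sits in their own directory:
    test -f ./zed.rc && . ./zed.rc
    # zed could instead export the path and have the scripts source that:
    test -f "${ZED_RCFILE}" && . "${ZED_RCFILE}"   # ZED_RCFILE: hypothetical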

What does this actually (in practice) mean:

First off let me say that all the scripts should be considered works in progress. We wanted to first build out a flexible infrastructure and then put together some scripts to take advantage of it. What I'm hoping the community can do is to jump on board and help flesh out additional useful scripts and document the expected events. We've just provided some obvious initial scripts as examples to get people started.

The hot spare script you've mentioned is a first pass at implementing something very close to the policy used under Illumos. We'd love feedback on it. All it does is fault/degrade the failing vdev and then replace it with a configured hot spare. The comment refers to the fact that autoreplace is not yet supported so if you were to hot swap the failed drive it won't immediately rebuild to it. That still takes a manual step. However, I agree the comment could be clearer.
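
Concretely, after the failed drive has been swapped out and the pool has resilvered, returning the spare to the spares list is still a manual step, e.g. (pool and device names illustrative):

    # Manual step until autoreplace is supported:
    zpool detach mypool scsi-SATA_VBOX_HARDDISK_VB002ae305-086dd5c3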

Thanks for the feedback!

@FransUrbo
Member

For parsing the config file, why not have a look at http://www.hyperrealm.com/libconfig/? That way we don't reinvent the wheel (again! :).

@behlendorf
Member

@FransUrbo Both @dun and myself have written enough parsers that neither of us is too excited about doing it again. So if there's something out there already which we can use and which meets our needs, I'm all for it.

I wasn't familiar with libconfig so I took a quick look and it does seem to check all the boxes.

  • Good C bindings.
  • Portable to other POSIX platforms (FreeBSD, Illumos).
  • Development libraries in the standard Debian, Ubuntu, and Fedora package repositories.
  • A well-defined grammar.
  • LGPL licensed.
  • Good API documentation.

Others have suggested using Lua or perhaps YAML, but we really haven't decided anything yet. The config file was something we definitely didn't want to get hung up on, which is why we punted on it for now.

@FransUrbo If you have some time you might want to start thinking about how you could use this framework to retire/simplify the existing libshare infrastructure. My thinking was that if we started generating share/unshare events we could use the scripting infrastructure provided here to invoke the right exportfs, samba, or iscsi commands. This would make maintaining that code far easier and allow people to easily modify the behavior. The two tricky bits which come to mind are:

  • Extending this infrastructure to provide a way to synchronously run events and return status.
  • Determining what, if any, caching needs to be done to handle sharing thousands of datasets.
@FransUrbo
Member

@behlendorf So have I (but probably not as many as you :) and I hate doing it in C!! Which is why I've been looking for something that does the work. I haven't actually tested libconfig, but I do think it looks good. And it seems quite simple.

Eventually (if someone doesn't beat me to it), I'll be happy to do the implementation for zEventd (PLEEEEESE can we have that spelling!! :).

I've been thinking about how to use the zEventd for libshare ever since I first saw the code (yesterday :), but so far there are too many "can't"s. I'll let it simmer a couple more days, and maybe I'll come up with at least some basic idea.

Looking at zed_event_service() I see that the events are already handled synchronously. This is the way we do it today (and it seems to be working just fine - the time it takes to share seems to be way less than the time it takes to do the mounting). I'd much rather have asynchronous sharing - just fork off all the sharing in one big swoop. Not that we seem to need it, but...

Doing this for SMB and iSCSI is no problem. Neither of them has any problem with this. A child dataset of a shared dataset is automatically visible to any mounting client. So sharing a child dataset would essentially be its own standalone ('top') share.

And iSCSI doesn't do child sharing. Each ZVOL is always (?) the "end target" (no child sharing).

However, this doesn't work for NFS. Each child filesystem needs to be shared separately. So some locking, or 'share order', needs to be figured out (unless we're so lucky that each share event comes from the core synchronously?). That part isn't visible in this pull request. I understand that the actual event 'thingie' has been there for quite some time (I think I saw a bug report a few days ago mentioning version 2.something). There just hasn't been a consumer for it until now.

The more I think about it (and actually write my thoughts down and read them back :), the more I think we have no problems! We could start consuming share events right this minute. Doing the share scripts wouldn't take more than a few minutes...

It would be nice if someone (with a working pool :) could do some tests with thousands of NFS, SMB, and iSCSI shares and see how long it takes to share them. I did that early last year, and if I remember correctly, it's not a problem. I only had problems with the mount part...

So I don't think we need any caching. And technically, this could be done in the script - making it fork itself and return quickly to the zEventd, which doesn't seem to care about any return or failure code anyway. See the sketch below.
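
The fork-and-return idea would only be a few lines in the share script itself; a minimal sketch, where do_share is a placeholder for the real exportfs/samba/iscsi invocation:

    # Hypothetical share zedlet: run the slow work in the background
    # so control returns to zed immediately.
    (
        do_share "${ZEVENT_POOL}"   # do_share: placeholder, not a real helper
    ) &
    exit 0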

@FransUrbo
Member

Doing the tests with my SMB and iSCSI pull request, that is. Just to see how long it will take...

@aarcane
aarcane commented Jan 31, 2014

Purely for novelty purposes, I'd like to propose that the name be changed from zed to simply z (pronounced "zed" everywhere except the USA) and that all internal interfaces, mechanisms, objects, classes, methods, and structures be similarly renamed.

@dun
Member
dun commented Feb 2, 2014

Whoops! Looks like I broke it during cleanup. Let me get that fixed...

@dun dun Fix zed read of new state file
This fixes zed_conf_read_state() when reading a new state file,
arising from either the initial run or from use of the '-Z'
command-line option.  A read of 0 bytes is not considered an error,
and now resets the eid to 0.
92b6e88
@pyavdr
Contributor
pyavdr commented Feb 3, 2014

zed is running with all options in zed.rc activated. I removed a vdev in a zpool; spares are available. After an export/import cycle, the pool is degraded. zed stays ready, but there is no event for this situation, so it does nothing. What went wrong here?

raidz3-20              DEGRADED     0     0     0
        /test/vols/vfil401   ONLINE       0     0     0
        /test/vols/vfil402   ONLINE       0     0     0
        /test/vols/vfil403   ONLINE       0     0     0
        /test/vols/vfil404   ONLINE       0     0     0
        /test/vols/vfil405   ONLINE       0     0     0
        /test/vols/vfil406   ONLINE       0     0     0
        /test/vols/vfil407   ONLINE       0     0     0
        /test/vols/vfil408   ONLINE       0     0     0
        /test/vols/vfil409   ONLINE       0     0     0
        /test/vols/vfil410   ONLINE       0     0     0
        /test/vols/vfil411   ONLINE       0     0     0
        /test/vols/vfil412   ONLINE       0     0     0
        /test/vols/vfil413   ONLINE       0     0     0
        /test/vols/vfil414   ONLINE       0     0     0
        /test/vols/vfil415   ONLINE       0     0     0
        /test/vols/vfil416   ONLINE       0     0     0
        /test/vols/vfil417   ONLINE       0     0     0
        /test/vols/vfil418   ONLINE       0     0     0
        /test/vols/vfil419   ONLINE       0     0     0
        8237275327943044029  UNAVAIL      0     0     0  was /test/vols/vfil420
    logs
      /test/vols/vfil421     ONLINE       0     0     0
      /test/vols/vfil422     ONLINE       0     0     0
    spares
      /test/vols/vfil423     AVAIL   
      /test/vols/vfil424     AVAIL   
      /test/vols/vfil425     AVAIL   
      /test/vols/vfil426     AVAIL   
      /test/vols/vfil427     AVAIL   
      /test/vols/vfil428     AVAIL   
      /test/vols/vfil429     AVAIL   
      /test/vols/vfil430     AVAIL   
      /test/vols/vfil431     AVAIL   
      /test/vols/vfil432     AVAIL   
      /test/vols/vfil433     AVAIL   

errors: No known data errors

@behlendorf
Member

@FransUrbo @aarcane Alternate naming suggestions duly noted. But I think we should stick with zed since it follows the existing naming convention used in the code and that prefix is already used extensively. But thanks for the feedback.

@pyavdr Thanks for giving this a spin; we want all the feedback we can get since we know this is just an initial implementation. The hot sparing as currently implemented only kicks in for two cases:

  • A leaf VDEV has observed an I/O error from the device.
  • A leaf VDEV has observed a significant number of checksum errors.

For both of these cases the offending leaf VDEV will be retired and replaced by a spare. There is currently no policy for your case where you import a pool with a missing leaf VDEV. There's also no policy for when a leaf VDEV is manually offlined.

These areas are exactly where we'd love help fleshing out a flexible policy and writing the needed scripts. I think it's worth asking the following questions for these cases:

  • What would be a reasonable default policy for the majority of users?
  • What sort of common tunings make sense?
  • What does Illumos do in a similar situation, and should we stay consistent with that?

These are all open questions from our point of view, so let us know what you think.

@pyavdr
Contributor
pyavdr commented Feb 4, 2014

Maybe you could post these questions to the zfs-discuss Google group? I just want to have a test case to ensure that disk replacement is working, so I can use it on my production ZFS pools. Something like pulling a vdev out of the enclosure and having zed replace it with a spare.
Another case is SMART. When there are error reports from SMART, it should replace that disk too.

@behlendorf
Member

@pyavdr I can sympathize with wanting to roll this out into production right away. But I want to be clear that we view the changes proposed here as just an initial starting point to build on. The idea is to get the framework into 0.6.3 so people like yourself can begin thinking about and developing scripts which leverage it. The initial hot spare functionality is just one example.

That said, you could test the hot sparing by pulling a drive from a live system (or a VM) to ensure that it does kick in the spare. Alternatively, you could use dd to damage half of a mirror/raidz and then kick off a scrub, which will detect the damaged checksums and kick in a spare.
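
For example, on a throwaway VM something along these lines would plant the damage (device name, offsets, and sizes are illustrative; never aim this at a pool you care about):

    # Overwrite part of one side of a mirror, skipping the front labels,
    # then scrub so the damage shows up as checksum errors:
    dd if=/dev/urandom of=/dev/disk/by-id/DISK bs=1M seek=10 count=64   # DISK: placeholder
    zpool scrub mypool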

@FransUrbo
Member

I can't get it to replace a disk either:

Feb  5 17:24:54 DebianZFS-Wheezy-SCST2 kernel: [  524.983178] sd 6:0:0:0: [sde] Synchronizing SCSI cache
Feb  5 17:24:54 DebianZFS-Wheezy-SCST2 kernel: [  524.984343] sd 6:0:0:0: [sde] Stopping disk
Feb  5 17:24:54 DebianZFS-Wheezy-SCST2 kernel: [  524.986098] ata7.00: disabled
Feb  5 17:25:25 DebianZFS-Wheezy-SCST2 zed: Invoking "all-syslog.sh" eid=31 pid=5170
Feb  5 17:25:25 DebianZFS-Wheezy-SCST2 zed: eid=31 class=vdev.too_small pool=mypool
Feb  5 17:25:25 DebianZFS-Wheezy-SCST2 zed: Finished "all-syslog.sh" eid=31 pid=5170 exit=0
Feb  5 17:25:25 DebianZFS-Wheezy-SCST2 zed: Invoking "all-syslog.sh" eid=32 pid=5172
Feb  5 17:25:25 DebianZFS-Wheezy-SCST2 zed: eid=32 class=scrub.start pool=mypool
Feb  5 17:25:25 DebianZFS-Wheezy-SCST2 zed: Finished "all-syslog.sh" eid=32 pid=5172 exit=0
Feb  5 17:25:25 DebianZFS-Wheezy-SCST2 zed: Invoking "all-syslog.sh" eid=33 pid=5174
Feb  5 17:25:25 DebianZFS-Wheezy-SCST2 zed: eid=33 class=scrub.finish pool=mypool
Feb  5 17:25:25 DebianZFS-Wheezy-SCST2 zed: Finished "all-syslog.sh" eid=33 pid=5174 exit=0
Feb  5 17:25:25 DebianZFS-Wheezy-SCST2 zed: Invoking "scrub.finish-email.sh" eid=33 pid=5176
Feb  5 17:25:25 DebianZFS-Wheezy-SCST2 zed: Finished "scrub.finish-email.sh" eid=33 pid=5176 exit=0

The pool looks like this (after removing one disk and then issuing a scrub):

        mypool                                           DEGRADED     0     0     0
          raidz3-0                                       DEGRADED     0     0     0
            scsi-SATA_VBOX_HARDDISK_VB4782108c-e0486278  ONLINE       0     0     0
            scsi-SATA_VBOX_HARDDISK_VB0817e509-cd01666d  ONLINE       0     0     0
            scsi-SATA_VBOX_HARDDISK_VB00bf6a15-b84628e5  ONLINE       0     0     0
            scsi-SATA_VBOX_HARDDISK_VBd0588e0f-3168f063  UNAVAIL      0     0     0  corrupted data
        spares
          scsi-SATA_VBOX_HARDDISK_VB002ae305-086dd5c3    AVAIL
@FransUrbo
Member

And if I just remove a device and write to the pool, I get:

DebianZFS-Wheezy-SCST2:~# tail -f /var/log/syslog  -n0
Feb  5 17:33:52 DebianZFS-Wheezy-SCST2 kernel: [  117.119281] sd 5:0:0:0: [sdd] Synchronizing SCSI cache
Feb  5 17:33:52 DebianZFS-Wheezy-SCST2 kernel: [  117.121265] sd 5:0:0:0: [sdd] Stopping disk
Feb  5 17:33:52 DebianZFS-Wheezy-SCST2 kernel: [  117.121531] ata6.00: disabled

Feb  5 17:34:11 DebianZFS-Wheezy-SCST2 kernel: [  136.891325] ZFS: zio error=19 type=2 offset=4362240 size=131072 flags=60440 delay=0
Feb  5 17:34:11 DebianZFS-Wheezy-SCST2 kernel: [  136.892145] ZFS: zio error=19 type=1 offset=270336 size=8192 flags=2c441 delay=0
Feb  5 17:34:11 DebianZFS-Wheezy-SCST2 kernel: [  136.892157] ZFS: zio error=19 type=1 offset=8578932736 size=8192 flags=2c441 delay=0
Feb  5 17:34:11 DebianZFS-Wheezy-SCST2 kernel: [  136.892163] ZFS: zio error=19 type=1 offset=8579194880 size=8192 flags=2c441 delay=0
Feb  5 17:34:11 DebianZFS-Wheezy-SCST2 zed: Invoking "all-syslog.sh" eid=25 pid=3303
Feb  5 17:34:11 DebianZFS-Wheezy-SCST2 zed: eid=25 class=io pool=mypool
Feb  5 17:34:11 DebianZFS-Wheezy-SCST2 zed: Finished "all-syslog.sh" eid=25 pid=3303 exit=0
Feb  5 17:34:11 DebianZFS-Wheezy-SCST2 zed: Invoking "io-spare.sh" eid=25 pid=3305
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: Finished "io-spare.sh" eid=25 pid=3305 exit=0
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: Invoking "all-syslog.sh" eid=26 pid=3320
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: eid=26 class=io pool=mypool
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: Finished "all-syslog.sh" eid=26 pid=3320 exit=0
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: Invoking "io-spare.sh" eid=26 pid=3322
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: Finished "io-spare.sh" eid=26 pid=3322 exit=0
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: Invoking "all-syslog.sh" eid=27 pid=3336
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: eid=27 class=io pool=mypool
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: Finished "all-syslog.sh" eid=27 pid=3336 exit=0
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: Invoking "io-spare.sh" eid=27 pid=3338
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: Finished "io-spare.sh" eid=27 pid=3338 exit=0
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: Invoking "all-syslog.sh" eid=28 pid=3352
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: eid=28 class=probe_failure pool=mypool
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: Finished "all-syslog.sh" eid=28 pid=3352 exit=0
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: Invoking "all-syslog.sh" eid=29 pid=3354
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: eid=29 class=config.sync pool=mypool
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: Finished "all-syslog.sh" eid=29 pid=3354 exit=0
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: Invoking "all-syslog.sh" eid=30 pid=3356
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: eid=30 class=config.sync pool=mypool
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: Finished "all-syslog.sh" eid=30 pid=3356 exit=0
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: Invoking "all-syslog.sh" eid=31 pid=3358
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: eid=31 class=config.sync pool=mypool
Feb  5 17:34:15 DebianZFS-Wheezy-SCST2 zed: Finished "all-syslog.sh" eid=31 pid=3358 exit=0

With the pool looking like this:

        mypool                                           DEGRADED     0     0     0
          raidz3-0                                       DEGRADED     0     0     0
            scsi-SATA_VBOX_HARDDISK_VB4782108c-e0486278  ONLINE       0     0     0
            scsi-SATA_VBOX_HARDDISK_VB0817e509-cd01666d  ONLINE       0     0     0
            scsi-SATA_VBOX_HARDDISK_VB00bf6a15-b84628e5  FAULTED      3     3     0  too many errors
            scsi-SATA_VBOX_HARDDISK_VBd0588e0f-3168f063  ONLINE       0     0     6
        spares
          scsi-SATA_VBOX_HARDDISK_VB002ae305-086dd5c3    AVAIL   

So this time, at least it ran the io-spare.sh event script. I'll look into why it didn't do a replace...

@FransUrbo
Member

I'm adding some debugging to the io-spare.sh script by echoing variables into a log file - I can't seem to get any output from the script otherwise.

FEATURE REQUEST: Anything on stderr should be caught and logged with the rest of the zed logs...

The logging shows that it finds the correct device to replace (both the broken one and the spare). Kinda:

DEBUG: class=ereport.fs.zfs.io
DEBUG: action=fault
DEBUG: vdev_status(): pool=mypool, VDEV=scsi-SATA_VBOX_HARDDISK_VB00bf6a15-b84628e5-part1
DEBUG: status=
DEBUG: spares='/dev/disk/by-id/scsi-SATA_VBOX_HARDDISK_VB002ae305-086dd5c3-part1'
DEBUG: vdev_status(): pool=mypool, VDEV=scsi-SATA_VBOX_HARDDISK_VB002ae305-086dd5c3-part1
DEBUG: spare status=

So it finds the correct spare, BUT with '-part1' added, which makes the vdev_status() fail because it can't find that device in the pool.

@FransUrbo
Member

Hacking the script (by removing '-part?' from the device line) kicks in the replace, but with:

        mypool                                                   DEGRADED     0     0     0
          raidz3-0                                               DEGRADED     0     0     0
            scsi-SATA_VBOX_HARDDISK_VB4782108c-e0486278          ONLINE       0     0     0
            scsi-SATA_VBOX_HARDDISK_VB0817e509-cd01666d          ONLINE       0     0     0
            spare-2                                              FAULTED      0     0     0
              scsi-SATA_VBOX_HARDDISK_VB00bf6a15-b84628e5        FAULTED      3     1     0  too many errors
              scsi-SATA_VBOX_HARDDISK_VB002ae305-086dd5c3-part1  ONLINE       0     0     0
            scsi-SATA_VBOX_HARDDISK_VBd0588e0f-3168f063          ONLINE       0     0     0
        spares
          scsi-SATA_VBOX_HARDDISK_VB002ae305-086dd5c3            INUSE     currently in use

It adds the '-part1' to the spare-2 vdev...

Could be fixed as well, but for those that actually USE a partition there, this won't work...
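
A naive fix sketch would be to strip the suffix before matching, with exactly that caveat:

    # Strip a trailing partition suffix such as "-part1" from the spare
    # name.  Works for whole-disk pools, but as noted above it would
    # break for anyone whose pool member really is a partition.
    SPARE="${SPARE%-part[0-9]*}"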

@FransUrbo FransUrbo referenced this pull request in dun/zfs Feb 5, 2014
Closed

zEventd improvenments #1

@FransUrbo
Member

Ok, my pull request now has something that should work for everyone in this regard.

@FransUrbo
Member

When running this last time, I noticed some interesting subclasses in the debug log: ZEVENT_SUBCLASS=probe_failure and ZEVENT_SUBCLASS=vdev.spare.

The first one sounds obvious - ZFS couldn't read from a vdev. That might be important. I'm going to up the allowed write faults and see if I can trigger only that, and not a replace...

The second seems to be a spare being added to the system.

I'm thinking of a system with multiple admins. Sending out a mail about added vdevs might be a good thing?

I just can't seem to trigger it on its own...

@FransUrbo
Member

Also, some events don't seem to repeat - I'm trying to get it to mail me when getting vdev.removed, but I only got it once and now can't repeat it!

@behlendorf
Member

FEATURE REQUEST: Anything on stderr should be caught and logged with the rest of the zed logs...

I'm not sure what the right solution is for this. On the surface, allowing everything from stderr to be logged sounds reasonable. The problem is that this only works today because zed serializes everything. Eventually zed will handle events concurrently, at which point you're asking for a horrible mishmash of unreadable gobbledygook in the log.

When I was developing the initial/prototype sparing script I just redirected all stderr and stdout from the script to a file. That worked, but it was somewhat inconvenient. Perhaps the cleanest thing to do in the short term is to explicitly log errors/warnings to syslog, as sketched below.
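
For instance, a script can already do its own syslog logging with the stock logger(1) utility (tag, priority, and message are illustrative):

    # Inside a zedlet: bypass stderr entirely and go straight to syslog.
    logger -t zed -p daemon.warning "io-spare.sh: no usable spare for pool ${ZEVENT_POOL}"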

Partitions / whole disks.

Yeah, I thought partitions might pose a problem. When I originally wrote this script I wanted to use the GUID for the spares so there would be no ambiguity. Unfortunately, the way the tools are written, you can't pass the GUID of a spare to a zpool replace. This isn't an issue in the Illumos code because they use a C program to kick in the spare which is linked against libzfs. That gives them some additional functionality which isn't possible through the existing command line tools.

For now the thing to do may be to update the script to first attempt the plausible spare names. Longer term, pulling over the rest of the Illumos FMA C code would be nice. One step at a time.

Also, some events don't seem to repeat - I'm trying to get it to mail me when getting vdev.removed, but I only got it once and now can't repeat it!

We're still trying to get a handle on the meaning of all the events too. We've preserved all the events Illumos would generate, but exactly what they mean and when they get generated isn't documented anywhere but in the source. We'd like to get that information into a man page once we understand it, so anyone can write their own scripts. I'm sure there are new events we'd like to add too.

Sending out a mail about added vdevs might be a good thing?

I could see that being useful, and it's the sort of thing it would be great to provide a script for. Along similar lines, emailing an admin if the pool was suspended would be nice too.

@behlendorf
Member

@dun It would be great if you could review @FransUrbo's proposed improvements when you next refresh this. There's some good stuff there; I've made a few comments on them. Also, if you could include a systemd unit in the next refresh, that would be helpful. The rest of the systemd improvements were merged.

@aarcane
aarcane commented Feb 6, 2014

I would posit that for the stderr log we should take a multi-tiered approach. Each spawned process should be logged to /var/log/zed/$HANDLER-$PID.err.log, but as processes finish, those logs should be rolled into /var/log/zed/error.log, wrapped in a message akin to:
"blah blah handler finished. The output was:
$file
"

The purpose of piping the output to a file is diagnostics in the event of a zed crash. I think this could be optional, with output simply being cached in memory and dumped in similar fashion.

In general, stdout and stdlog should also be captured. stdlog should be logged, but stdout should be discarded unless it's requested for diagnostic purposes. A sketch of the roll-up follows.
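
A rough sketch of the proposed roll-up, with $HANDLER and $PID as placeholders per the comment above and all paths illustrative:

    # Per-process log while the handler runs...
    LOG="/var/log/zed/${HANDLER}-${PID}.err.log"
    # ...then rolled into the main log once the handler finishes:
    {
        echo "${HANDLER} finished. The output was:"
        cat "${LOG}"
    } >> /var/log/zed/error.log
    rm -f "${LOG}"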

@FransUrbo
Member

We could always use the file handle 8 that is used by some scripts to log anything. Then check if that file is >0 bytes and, if so, merge it into the log just before 'Finishing ...'.

@FransUrbo
Member

About the vdev.removed event: I think I've figured out that it is only triggered when doing a zpool detach .... As in, detaching the spare used to replace a failing drive when the (previously) failed drive comes back. Or (just tested) when detaching a device from a mirror.

So maybe it should be renamed to vdev.detached and a new vdev.removed should be created?

@FransUrbo
Member

I've looked into the possibility of using this for sharing as well. But at first glance it seems that might not be possible. The events are 'pool' oriented, not 'dataset' oriented (i.e., they require an open pool - the spa struct).

This is way too close to the core for me to know what I'm doing, and I feel a little ... anxious messing that close to the core :)

@dun
Member
dun commented Feb 7, 2014

@FransUrbo @behlendorf @aarcane If I need stderr for script debugging/development, I typically do something like this towards the top of the script:

# Append all stdout/stderr to a per-script file and trace each command:
exec >>/tmp/$(basename "$0").out 2>&1
set -x

That's essentially what I did in zed.d/all-debug.sh.

Longer term, I'm not sure what the best solution is for this. Something to think about...

@thegreatgazoo thegreatgazoo commented on the diff Feb 19, 2014
module/zfs/fm.c
@@ -509,6 +518,12 @@ struct erpt_kstat {
return;
}
+ eid = atomic_inc_64_nv(&zevent_eid);
+ if (nvlist_add_uint64(nvl, FM_EREPORT_EID, eid)) {
+ atomic_add_64(&erpt_kstat_data.erpt_set_failed.value.ui64, 1);
+ return;
@thegreatgazoo
thegreatgazoo Feb 19, 2014 Member

If we return from here, nvl and cb haven't been saved in a zevent_t. In this case, how does the caller know when to free nvl? Looking at call sites, it seemed that nvl would be freed in the cb callback, but since cb is not saved it appeared to me that cb would not be called. I'm new to the code, so I guess I must have missed something here.

@behlendorf
behlendorf Feb 19, 2014 Member

You're right, this looks like a leak in the case you describe. It's also possible in the other two early returns. This function really shouldn't return void but should instead return an error code, so it's clear whether the nvlist must be cleaned up.

@thegreatgazoo thegreatgazoo commented on the diff Feb 19, 2014
module/zfs/fm.c
@@ -527,6 +542,7 @@ struct erpt_kstat {
ev->ev_nvl = nvl;
ev->ev_detector = detector;
ev->ev_cb = cb;
+ ev->ev_eid = eid;
@thegreatgazoo
thegreatgazoo Feb 19, 2014 Member

So cb is saved in ev->ev_cb and called later in zfs_zevent_free():
/* Run provided cleanup callback */
ev->ev_cb(ev->ev_nvl, ev->ev_detector);

But it seemed OK to call zfs_zevent_post(cb==NULL):
zfs_ereport_send_interim_checksum(zio_cksum_report_t *report)
{
#ifdef _KERNEL
zfs_zevent_post(report->zcr_ereport, report->zcr_detector, NULL);

Then would zfs_zevent_free() dereference a NULL pointer when calling ev_cb without checking it against NULL?

@behlendorf
behlendorf Feb 19, 2014 Member

Yes, it would be better to either not allow NULL to be set as a valid callback, or to handle the NULL properly.

Since both of these issues aren't directly related to these proposed changes, could you open up an issue for each of them so we don't lose track of making these improvements? Better yet, propose a patch as well!

@thegreatgazoo
thegreatgazoo Feb 19, 2014 Member

OK, will open an issue and come up with a patch.

@behlendorf
Member

@dun When you get this rebased and updated with the next round of changes go ahead and open a new pull request. Then you can close this one out and we'll just reference the new pull request from here.

@dun
Member
dun commented Feb 22, 2014

@behlendorf OK, will do. I need to finish the systemd integration and work through a pull request. Do we have a patch for the potential nvl leak?

@behlendorf
Member

No patch for the potential leak yet but it's not a new issue either. We should fix it but in practice it just doesn't happen.

@behlendorf behlendorf closed this Apr 2, 2014
@lundman lundman referenced this pull request in openzfsonosx/zfs Apr 3, 2014
Closed

zfs event daemon merge and test #138

@lundman
Contributor
lundman commented Apr 8, 2014

I was ready to make a pull request after porting zed over to OSX, but disappointingly, I didn't have to change anything.

# uname -a
Darwin OSX109.local 13.1.0 Darwin Kernel Version 13.1.0: Thu Jan 16 19:40:37 PST 2014; root:xnu-2422.90.20~2/RELEASE_X86_64 x86_64
bash-3.2# ./cmd/zed/zed -vfF
zed: Registered script "all-debug.sh"
zed: Registered script "all-syslog.sh"
zed: Registered script "checksum-email.sh"
zed: Registered script "checksum-spare.sh"
zed: Registered script "data-email.sh"
zed: Registered script "generic-email.sh"
zed: Registered script "io-email.sh"
zed: Registered script "io-spare.sh"
zed: Registered script "resilver.finish-email.sh"
zed: Registered script "scrub.finish-email.sh"
zed: Registered script "zed.rc"
zed: ZFS Event Daemon 0.6.2-rc1_77_ge3deed1
zed: Processing events since eid=0

I've yet to test an event, but in terms of porting, there was no work required.

About the only thing confusing was:

zed: ZFS Event Daemon 0.6.2-rc1_77_ge3deed1
zed: Failed to create directory "/usr": File exists

This is easily fixed with mkdir -p /usr/local/var/run - it is just that the error message truncates the path used for the pidfile.

Thanks for the great work.

@dun
Member
dun commented Apr 8, 2014

I'm glad to hear the port went well. I see how that error message is getting triggered. I should be able to get that fixed tomorrow.

@lundman
Contributor
lundman commented Apr 8, 2014

Even the scripts are so well written that we've had no issues using the default ones. The only thing we are missing is "flock -x", but I can ship a perl script performing the equivalent call, and it doesn't stop the script from working. Hell, it even looks like we can use the event class=config.sync to rename over the zpool.cache.tmp file to work around the missing kernel VNOP_RENAME.

@dun
Member
dun commented Apr 8, 2014

Awesome news! This makes my day.

@behlendorf behlendorf added this to the 0.6.3 milestone Apr 8, 2014
@behlendorf behlendorf added the Feature label Apr 8, 2014
@dun
Member
dun commented Apr 8, 2014

@lundman I've submitted pull #2248 to fix the directory creation error you saw.

@lundman
Contributor
lundman commented Apr 9, 2014

I see it, excellent.
With autoconf, we noticed that on "make install":

  • zed.rc is installed 644, but requires 755 by zed
  • /usr/local/var/run is not created

Which I temporarily fixed with

 install-data-local:
        $(MKDIR_P) "$(DESTDIR)$(zedconfdir)"
+       $(MKDIR_P) "$(runstatedir)"
        for f in $(zedconfdefaults); do \
          test -f "$(DESTDIR)$(zedconfdir)/$${f}" -o \
               -L "$(DESTDIR)$(zedconfdir)/$${f}" || \
            ln -s "$(zedexecdir)/$${f}" "$(DESTDIR)$(zedconfdir)"; \
        done
+
+install-data-hook:
+       chmod 755 "$(DESTDIR)$(zedconfdir)/zed.rc"

but I think that is more of a local peculiarity, so not something upstream needs to worry about.

@dun
Member
dun commented Apr 9, 2014

@lundman zed doesn't directly use zed.rc. It is only sourced by the enabled scripts when they're exec'd. 644 perms should be sufficient since execute perms aren't necessary to source a file. Does OS X differ in this regard?

Recent Linux distributions have moved /var/run (or /run) to a tmpfs filesystem. As such, any required directories there now need to be created by the daemon at runtime instead of getting created by the Makefile and installed by the package. Creating it via the install-data-local target should no longer be necessary once the above pull request gets merged in.

@lundman
Contributor
lundman commented Apr 9, 2014

Ah, I see where I went wrong then; it was due to:

# chmod 644 /usr/local/etc/zfs/zed.d/zed.rc
# zed
zed: Registered script "scrub.finish-email.sh"
zed: Ignoring "zed.rc": not executable by user
zed: Registered script "zpool.destroy.sh"

That made me think it had failed to start zed, but what it is saying is that it will ignore the non-executable script 'zed.rc' - as it should. Great, neither change will be needed once I merge. Sorry for the mixup.

Meanwhile, I added an event for pool import, to match zpool.destroy, and made our script issue an OS X notification. It's turning out great for us.

We will probably use it for mounting snapshots, since we cannot issue mount from the kernel.

@behlendorf
Member

@lundman Please feel free to open pull requests for any improvements or fixes you make which may make sense for us to take upstream.

@lundman
Contributor
lundman commented Apr 10, 2014

@behlendorf Thanks, appreciated. But I think the serious abuse of zed is so far related to OSX :)

But we finally have automatic mounting of snapshots:

# ls -la /TEST/.zfs/snapshot/
Apr 10 10:26:00 OSX109.local zed[2067]: eid=9 class=snapshot.mount pool=TEST
zed: Invoking "snapshot.mount.sh" eid=9 pid=2068
Apr 10 10:26:00 OSX109.local zed[2070]: Snapshot mount TEST@send
zed: Finished "snapshot.mount.sh" eid=9 pid=2068 exit=0

# df
TEST@send         129746      839    128907     1%      11  128907    0%   /TEST/.zfs/snapshot/send

It is a bit naughty to use zed for tasks not related to "event reporting", but I didn't want to have 2 root daemons either. Thanks for making zed so flexible!

@lundman
Contributor
lundman commented Apr 11, 2014

@behlendorf @dun
I am playing with some changes that could skirt around the line of interest for you:

https://gist.github.com/lundman/10435317

1: In OSX the kernel module loading is async, so it might not be ready when we start zed. It would be convenient if zed simply sat around waiting for it. I overloaded the "force to become daemon" argument to include this, simply making the libzfs_init call sleep and retry if it fails.

2: It is my guess that in Linux, if the kernel module is unloaded, you receive SIGPIPE and zed can clean up and exit? In OSX the ioctl call gets errno 19 (ENODEV). I had to make a change in libzfs_pool to accept this errno and bubble it up the stack. In doing so, I had to remove the error = zfs_error_fmt and replace it with (void), or zpool_events_next() would return 0 even when there was an error (and presumably the call to zfs_error_fmt would blow away errno). Not sure it was intended to return 0 even when there was an actual error?

I could simply return the error in my case ENODEV, but I also wanted the zfs shutdown message. There is room here to change it around for sure.

3: In zed_event_service we need to detect the error and quit the program. This finally lets zed exit cleanly on OSX; before, it would CPU-spin on the ioctl getting error 19.
To encourage the main loop to exit, I send a signal to my own process, which sets _got_exit. This is not normally how I would handle it. Possibly I would change zed_event_service from void to int and look for failure in the main loop. But I was going for the minimal changeset in this case. Thoughts?

4: And finally, if the force-daemon argument was set, simply loop back to waiting for /dev/zfs. This allows zed to keep running while the kernel module is unloaded and re-loaded.

@behlendorf
Member

We have a similar issue on Linux which we do want to try to address. It would be good if whatever solution we come up with worked for both use cases. Let me try to explain the Linux behavior...

1: It would be convenient if zed simply sat around waiting for it.

I could see this being a useful feature. Under Linux I can imagine a scenario where you would just want to start zed and allow it to lazily connect once the /dev/zfs device appears. But right now on Linux, calling libzfs_init will result in the modules being automatically loaded if possible. If we did something along these lines it really should have its own option and a well-defined behavior.

2: It is my guess that in Linux, if kernel module is unloaded, you receive SIGPIPE, and zed can clean up and exit?

Actually, no. Under Linux, when zed opens /dev/zfs the kernel takes a reference on the module. This reference prevents the kernel module from being unloaded as long as the user space consumer has the file descriptor open. So as it stands we must SIGTERM zed to unload the kernel modules.

Normally this isn't a problem, but I can see us needing a better long-term solution to handle things like package updates. It's also an issue for developers who want zed to be running but need to keep unloading/loading development builds of the kernel modules. It might be useful to have a mechanism the kernel module could use to signal zed that it should close its open file handle for /dev/zfs. zed could then fall back on the retry behavior above to reopen the device once the kernel modules were swapped out.

@lundman
Contributor
lundman commented Apr 15, 2014

To avoid the infinite loop, we had to fix 3:

It is interesting that your zed holds a reference, whereas on OSX you can unload the kext all you want and the ioctl just receives ENODEV. Possibly I am lacking an "isbusy()" check somewhere.

On issue 1, our first thought was to simply set an ENVVAR in zed before libzfs_init and, with a test for it in libzfs, skip the kernel module load. But if there is a sexier way, that is just fine with us.
