improper STATE reporting when disk UNAVAIL due to corruption #4653

Closed
jimsalterjrs opened this Issue May 16, 2016 · 9 comments


jimsalterjrs commented May 16, 2016

Accidentally discovered today that zpool status improperly reports pool and vdev as ONLINE, not DEGRADED, when a disk has been failed entirely out of the vdev as UNAVAIL due to corrupt metadata.

http://jrs-s.net/2016/05/16/zfs-practicing-failures/

TL;DR - pool with two 2-disk mirror vdevs:

root@banshee:~# zpool status test
  pool: test
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    test        ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        nbd0    ONLINE       0     0     0
        nbd1    ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        nbd2    ONLINE       0     0     0
        nbd3    ONLINE       0     0     0

errors: No known data errors

Corrupt all blocks on disk nbd0, scrub, and check status:

root@banshee:~# pv < /dev/zero > /dev/nbd0
pv: write failed: No space left on device
root@banshee:~# zpool scrub test
root@banshee:~# zpool status test
  pool: test
 state: ONLINE
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Mon May 16 12:36:38 2016
config:

    NAME        STATE     READ WRITE CKSUM
    test        ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        nbd0    UNAVAIL      0     0 1.40K  corrupted data
        nbd1    ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        nbd2    ONLINE       0     0     0
        nbd3    ONLINE       0     0     0

errors: No known data errors

The human-readable status message for the pool is correct, but the STATE data for both pool test and vdev mirror-0 are showing ONLINE, where they should be showing DEGRADED.

Physically removing disk nbd0 does cause both pool and vdev to show DEGRADED status properly:

root@banshee:~# qemu-nbd -d /dev/nbd0
/dev/nbd0 disconnected
root@banshee:~# zpool scrub test
root@banshee:~# zpool status test
  pool: test
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Mon May 16 12:42:35 2016
config:

    NAME        STATE     READ WRITE CKSUM
    test        DEGRADED     0     0     0
      mirror-0  DEGRADED     0     0     0
        nbd0    UNAVAIL      0     0 1.40K  corrupted data
        nbd1    ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        nbd2    ONLINE       0     0     0
        nbd3    ONLINE       0     0     0

errors: No known data errors

I'm guessing this is a corner case that nobody really tested? Most automated alerting/monitoring systems are going to be looking at the STATE flags, not the human-readable error message. Could be bad if a disk blows out this way in production and nobody knows because the monitoring system never sends an alarm.
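To make the monitoring concern above concrete: alerting tools generally key off the machine-readable pool health rather than the human-readable status paragraph, which is exactly why this bug matters. A minimal sketch in Python (the alerting logic itself is an illustrative assumption about such a monitor, not part of ZFS; `zpool list -H -o health` is the standard way to read just the health column):

```python
import subprocess

def health_is_alertworthy(health: str) -> bool:
    """Treat any pool health other than ONLINE as alert-worthy."""
    return health.strip().upper() != "ONLINE"

def check_pool(pool: str) -> bool:
    # `zpool list -H -o health <pool>` prints only the health column,
    # e.g. "ONLINE" or "DEGRADED", with no header line.
    health = subprocess.run(
        ["zpool", "list", "-H", "-o", "health", pool],
        capture_output=True, text=True, check=True,
    ).stdout
    return health_is_alertworthy(health)
```

A monitor built this way would have stayed silent on the pool above: the health property tracked the (incorrect) ONLINE state, not the UNAVAIL vdev.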

dasjoe commented May 16, 2016

Out of interest, what does zpool status -x report for the degraded pool?

jimsalterjrs commented May 19, 2016

Out of interest, what does zpool status -x report for the degraded pool?

root@locutus:/data/test# zpool status -x test
  pool: test
 state: ONLINE
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Thu May 19 16:29:24 2016
config:

    NAME        STATE     READ WRITE CKSUM
    test        ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        nbd0    UNAVAIL      0     0     0  corrupted data
        nbd1    ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        nbd2    ONLINE       0     0     0
        nbd3    ONLINE       0     0     0

errors: No known data errors

Same output as without the -x.

At least it doesn't just say "pool is healthy". Still, not great.

joshuaimmanuel commented Mar 13, 2017

Likewise, when a disk in a raidz configuration is physically removed, zpool status still reports the removed disk as ONLINE, and zpool status -x reports "all pools are healthy".

Only after running zpool clear <pool-name> does zpool status report the disk as unavailable.

Similarly, when the drive is reconnected, the disk only comes back online after another zpool clear <pool-name>; until then, zpool status reports it as unavailable.

So, who is responsible for maintaining the disk status? I thought the zfs module would do it.

behlendorf commented Mar 13, 2017

@joshuaimmanuel the issue here is that the kernel module doesn't receive any notification of the drive removal until it attempts to perform some kind of IO to it; only then can it realize the drive was removed. The good news is that this has been addressed in master: the ZED now monitors udev device add/remove events for the system and manages the drives accordingly.
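The mechanism described here can be sketched in userspace: listen for kernel block-device uevents and react when a device backing a managed vdev disappears or reappears. The sketch below uses the third-party pyudev bindings, and the handler logic is an illustrative assumption, not ZED's actual implementation:

```python
def classify_event(action: str, devnode: str, managed_vdevs: set) -> str:
    """Decide how a ZED-like daemon might react to a udev block event."""
    if devnode not in managed_vdevs:
        return "ignore"        # not a device backing one of our vdevs
    if action == "remove":
        return "mark-unavail"  # drive pulled: fault the vdev now, not at next IO
    if action == "add":
        return "reattach"      # drive returned: bring the vdev back online
    return "ignore"

def monitor(managed_vdevs: set):
    import pyudev  # third-party libudev bindings, the same event source ZED watches

    mon = pyudev.Monitor.from_netlink(pyudev.Context())
    mon.filter_by(subsystem="block")
    for device in iter(mon.poll, None):
        reaction = classify_event(device.action, device.device_node, managed_vdevs)
        print(device.device_node, device.action, "->", reaction)
```

The key design point is the first branch of classify_event: without an event source like this, removal of an idle drive goes unnoticed until ZFS happens to issue IO to it.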

behlendorf commented Mar 14, 2017

Closing. This issue was resolved in master with changes to the ZED.

@behlendorf behlendorf closed this Mar 14, 2017

joshuaimmanuel commented Jun 7, 2017

@behlendorf I couldn't find this issue mentioned in v0.7.0-rc4. Is this fix available in this release?


behlendorf commented Jun 7, 2017

This functionality was merged in several parts to extend the ZED's drive management. It is all in 0.7.0-rc4, with the critical bit you're interested in merged in commit d02ca37. The ZED now actively monitors udev events (via libudev), so it will detect things like idle drive removal/addition.

If you have a chance, it would be great if you could try it out and open new issues if you discover problems.

tonyhutter commented Jun 7, 2017

Note that while zed can detect drive removals via udev, it doesn't currently do anything about them. That is, if zed sees a drive removed, it doesn't offline or fault the drive. That may be something we want to look into for future releases. Right now, though, the vdev will eventually fault when it is issued IO, which gives you the same result. I believe the "fault drive on bad IOs" action requires you to be running zed, so make sure you're running it.
