improper STATE reporting when disk UNAVAIL due to corruption #4653
Comments
Out of interest, what does `zpool status -x` show?
Same output as without the -x. At least it doesn't just say "pool is healthy". Still, not great.
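For context, `zpool status -x` is the terse form often used in scripted health checks; on a healthy system it prints a single line:

```
# zpool status -x only reports pools with problems; when everything is fine:
$ zpool status -x
all pools are healthy
```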
Likewise, when a disk in a raidz configuration is physically removed, the status still reports the removed disk as ONLINE; the state only updates after manually forcing I/O to the pool. Similarly, when reconnecting the drive, it is only recognized again after a similar manual step. So, who is responsible for maintaining the disk status? I thought the zfs module would do it.
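A minimal sketch of the behavior being described, assuming a three-disk raidz with illustrative device names (the exact commands in the original comment were lost in formatting):

```
# Hypothetical reproduction (pool and device names assumed):
zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd

# Physically pull /dev/sdd, then:
zpool status tank    # still reports the removed disk as ONLINE
zpool scrub tank     # force I/O to every member disk
zpool status tank    # only now does the removed disk show as failed
```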
@joshuaimmanuel the issue here is that the kernel module doesn't receive any notification of the drive removal until it attempts to perform some kind of IO to it. Only then can it realize the drive was removed. The good news is that this issue was addressed in master. The ZED now monitors udev device add/remove events for the system and manages the drives accordingly.
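For anyone curious what the ZED is reacting to, you can watch the same udev block-device events yourself (illustrative only; this is not how the ZED is configured):

```
# Stream udev events for block devices; plugging or unplugging a drive
# generates the add/remove events the ZED now subscribes to via libudev.
$ udevadm monitor --udev --subsystem-match=block
```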
@behlendorf Thanks
Closing. This issue was resolved in master with changes to the ZED. |
@behlendorf I couldn't find this issue mentioned in v0.7.0-rc4. Is this fix available in that release?
This functionality was merged in several parts that extend the ZED's drive management. This work is all in 0.7.0-rc4, with the critical bit you're interested in merged in commit d02ca37. The ZED now actively monitors udev events (via libudev), so it will detect things like idle drive removal/addition. If you have a chance, it would be great if you could try it out and open new issues if you discover problems.
Note that while zed can detect drive removals via udev, it doesn't currently do anything about it. That is, if zed sees a drive removed, it doesn't offline or fault the drive. That may be something we want to look into for future releases. Right now, though, the vdev will eventually fault when it is issued IO, which gives you the same result. I believe the "fault drive on bad IOs" action requires zed, so make sure you're running it.
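A quick way to confirm the daemon is actually running (a sketch; the unit name may vary by distro):

```
# On systemd-based distros the ZED usually ships as the zfs-zed unit:
$ systemctl status zfs-zed

# Or run it in the foreground while testing:
$ zed -F
```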
Accidentally discovered today that zpool status improperly reports pool and vdev as ONLINE, not DEGRADED, when a disk has been failed entirely out of the vdev as UNAVAIL due to corrupt metadata.
http://jrs-s.net/2016/05/16/zfs-practicing-failures/
TL;DR - pool with two 2-disk mirror vdevs:
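The linked post builds the pool on network block devices; a hypothetical reconstruction of that setup (device names assumed) looks like:

```
# Four nbd-backed disks arranged as two 2-disk mirror vdevs (names assumed):
zpool create test mirror /dev/nbd0 /dev/nbd1 mirror /dev/nbd2 /dev/nbd3
zpool status test
```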
Corrupt all blocks on disk nbd0, scrub, and check status:
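Roughly, the reproduction is (a sketch, assuming the nbd device names above):

```
# Overwrite every block of nbd0, labels included, so its metadata is corrupt:
dd if=/dev/urandom of=/dev/nbd0 bs=1M

# Scrub so ZFS actually reads the disk and notices, then inspect:
zpool scrub test
zpool status test
```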
The human-readable status message for the pool is correct, but the STATE data for both pool test and vdev mirror-0 are showing ONLINE, where they should be showing DEGRADED.
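To illustrate the mismatch, here is a mock-up of the shape of the report (not verbatim from the post; message text and counters are approximate):

```
  pool: test
 state: ONLINE          <-- should be DEGRADED
status: One or more devices could not be used because the label is
        missing or invalid.  Sufficient replicas exist for the pool
        to continue functioning in a degraded state.

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0   <-- should be DEGRADED
          mirror-0  ONLINE       0     0     0   <-- should be DEGRADED
            nbd0    UNAVAIL      0     0     0  corrupted data
            nbd1    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            nbd2    ONLINE       0     0     0
            nbd3    ONLINE       0     0     0
```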
Physically removing disk nbd0 does cause both pool and vdev to show DEGRADED status properly:
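For comparison, detaching the backing device entirely (a sketch, assuming the disks are attached via qemu-nbd as in the post):

```
# Disconnect the nbd device to simulate physically pulling the disk:
qemu-nbd -d /dev/nbd0

# After the next I/O (e.g. a scrub) the pool notices:
zpool scrub test
zpool status test    # pool and mirror-0 now correctly show DEGRADED
```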
I'm guessing this is a corner case that nobody really tested? Most automated alerting/monitoring systems are going to be looking at the STATE flags, not the human-readable error message. Could be bad if a disk blows out this way in production and nobody knows because the monitoring system never sends an alarm.
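To make the monitoring point concrete, here is a sketch of the kind of naive health probe many alerting systems use; because it keys on the pool's health/STATE property, it would never fire for this failure mode (pool name assumed):

```
#!/bin/sh
# Poll the pool's health property; this is the same STATE field that the
# issue shows stuck at ONLINE even with a disk failed out of the vdev.
health=$(zpool list -H -o health test)
if [ "$health" != "ONLINE" ]; then
    echo "ALERT: pool test is $health"
fi
```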