# © Copyright 2016 Hewlett Packard Enterprise Development LP

Feature: Identify failed drive

    As a ZFS administrator
    I want a failed drive's fault LED to turn on
    So I can quickly find the drive to replace

    The ZFS `zpool status -v` command reports these states for a drive:

    * DEGRADED
    * FAULTED
    * OFFLINE
    * ONLINE
    * REMOVED
    * UNAVAIL

    The drive has failed if it is in the DEGRADED, FAULTED, or UNAVAIL
    state. An ONLINE drive is working properly. An OFFLINE drive was taken
    offline by an administrator. A REMOVED drive was physically removed
    while the system was running.

    The ZFS Event Daemon (ZED) runs the io-spare.sh script when an I/O
    error occurs and the checksum-spare.sh script when a checksum error
    occurs. checksum-spare.sh is currently a symlink to io-spare.sh, so
    checksum and I/O errors actually run the same script.

    io-spare.sh moves a drive to the DEGRADED state when the number of
    checksum errors reaches ZED_SPARE_ON_CHECKSUM_ERRORS. ZFS will
    continue to use the drive when necessary to keep the pool available.

    io-spare.sh moves a drive to the FAULTED state when the number of I/O
    errors (the sum of its read and write errors) reaches
    ZED_SPARE_ON_IO_ERRORS. ZFS will prevent further use of the disk.

    A drive is placed in the UNAVAIL (unavailable) state when it cannot be
    opened. No existing script specifically handles this event.

    ZED_SPARE_ON_CHECKSUM_ERRORS and ZED_SPARE_ON_IO_ERRORS are defined in
    zed.rc, typically located in /etc/zfs/zed.d. They are not set by
    default (the assignments are commented out), so replacement with a hot
    spare is disabled by default. The existing code sets these values:

        ZED_SPARE_ON_CHECKSUM_ERRORS = 10
        ZED_SPARE_ON_IO_ERRORS = 1

    The fault LED must be turned on when a drive has failed. The fault LED
    must be turned off when the drive is physically replaced.
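    The thresholding above can be sketched as follows. This is a minimal
    illustration, not the actual io-spare.sh code: the function and
    parameter names are hypothetical, and the defaults mirror the values
    the existing code sets in zed.rc.

```python
# Sketch of the io-spare.sh thresholding described above; the function
# and parameter names are illustrative, not taken from the real script.

def next_state(read_errors, write_errors, checksum_errors,
               spare_on_io_errors=1, spare_on_checksum_errors=10):
    """Return the state a drive should move to, or None to leave it alone.

    Defaults mirror the values the existing code sets in zed.rc.
    """
    # I/O errors are the sum of read and write errors.
    io_errors = read_errors + write_errors
    if io_errors >= spare_on_io_errors:
        return "FAULTED"    # ZFS prevents further use of the disk
    if checksum_errors >= spare_on_checksum_errors:
        return "DEGRADED"   # ZFS keeps using the drive when necessary
    return None
```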
    The drive fault LED will be controlled using the open source library
    for storage management (libstoragemgmt)
    [https://github.com/libstorage/libstoragemgmt], since it provides a
    vendor-neutral API for controlling the LED (provided a Storage
    Enclosure Processor [SEP] is present) and it can take the device path
    directly (so no mapping to the enclosure and slot is required).

    Hewlett Packard Enterprise drives have a single Fault/UID LED which
    can be amber or blue and be on, off, or blinking. The lights are
    interpreted like this:

    Fault/UID LED    Interpretation
    ---------------  -----------------------------------------------------
    Off              Normal operation. The drive is online, offline, a
                     spare, or not configured as part of any array.
    Amber            Fault LED is on. A critical fault condition has been
                     identified and the drive has been placed offline.
    Blue             UID LED is on. Drive has been selected by a
                     management application.
    Flashing amber   Fault LED is flashing (1 Hz). Predictive failure
                     alert.
    Alternating      Fault LED and UID LED are on. Predictive failure
    amber and blue   alert and the drive has been selected by a management
                     application.
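    The intended LED handling can be sketched with libstoragemgmt's Python
    bindings (the `lsm` module). This is a minimal sketch, not the actual
    ZED integration: `fault_led_wanted` and `apply_fault_led` are
    hypothetical names, and the sketch assumes the
    `lsm.LocalDisk.fault_led_on`/`fault_led_off` calls, which take a
    device path directly.

```python
# Sketch of fault LED control via libstoragemgmt's Python bindings.
# Function names here are illustrative, not part of ZED or lsm.

FAILED_STATES = {"DEGRADED", "FAULTED", "UNAVAIL"}

def fault_led_wanted(zpool_state):
    """True if the fault LED should be on for this zpool drive state."""
    return zpool_state in FAILED_STATES

def apply_fault_led(disk_path, zpool_state):
    """Turn the fault LED on or off for the drive at disk_path (e.g. /dev/sdb).

    libstoragemgmt takes the device path directly, so no mapping to the
    enclosure and slot is required; a SEP must be present for the call
    to succeed.
    """
    import lsm  # libstoragemgmt Python bindings; imported lazily
    if fault_led_wanted(zpool_state):
        lsm.LocalDisk.fault_led_on(disk_path)
    else:
        lsm.LocalDisk.fault_led_off(disk_path)
```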
    Background:
        Given ZED is running
        And ZED_SPARE_ON_CHECKSUM_ERRORS = 10
        And ZED_SPARE_ON_IO_ERRORS = 2

    Scenario: Healthy pool
        Given a healthy raidz2 pool
        Then the fault LED should be off on all drives

    Scenario: Drive with a few checksum errors
        Given a healthy raidz2 pool
        When drive 1 has 9 checksum errors
        Then the fault LED should be off on all drives

    Scenario: Drive with too many checksum errors
        Given a healthy raidz2 pool
        When drive 1 has 10 checksum errors
        Then only drive 1 should have the fault LED on

    Scenario: Drive with an I/O error
        Given a healthy raidz2 pool
        When drive 5 has 1 I/O error
        Then the fault LED should be off on all drives

    Scenario: Drive with too many I/O errors is faulted
        Given a healthy raidz2 pool
        When drive 5 has 2 I/O errors
        Then only drive 5 should have the fault LED on

    Scenario: Drive with too many I/O errors is degraded
        Given a healthy raidz1 pool
        When drive 5 has 2 I/O errors
        Then only drive 5 should have the fault LED on

    Scenario: Failed drive is physically replaced
        Given a degraded raidz2 pool
        And drive 11 has 3 I/O errors
        When the failed drive 11 is physically replaced
        Then the fault LED should be off on all drives

    # It may not be possible to get this scenario to pass. If
    # libstoragemgmt cannot control the fault LED for a device which
    # cannot be opened, this scenario will be removed.
    Scenario: Unavailable drive
        Given a degraded raidz2 pool
        When drive 2 is unavailable
        Then only drive 2 should have the fault LED on