# © Copyright 2016 Hewlett Packard Enterprise Development LP

Feature: Identify failed drive

    As a ZFS administrator
    I want a failed drive's fault LED to turn on
    So I can quickly find the drive to replace

    The ZFS `zpool status -v` command reports these states for a drive:

    * DEGRADED
    * FAULTED
    * OFFLINE
    * ONLINE
    * REMOVED
    * UNAVAIL

    The drive has failed if it is in the DEGRADED, FAULTED, or UNAVAIL
    state. An ONLINE drive is working properly. An OFFLINE drive was taken
    offline by an administrator. A REMOVED drive was physically removed
    while the system was running.

    The ZFS Event Daemon (ZED) runs the io-spare.sh script when an I/O
    error occurs and the checksum-spare.sh script when a checksum error
    occurs. checksum-spare.sh is currently a symlink to io-spare.sh, so
    checksum and I/O errors actually run the same script.

    io-spare.sh moves a drive to the DEGRADED state when the number of
    checksum errors reaches ZED_SPARE_ON_CHECKSUM_ERRORS. ZFS will
    continue to use the drive when necessary to keep the pool available.

    io-spare.sh moves a drive to the FAULTED state when the number of I/O
    errors (the sum of its read and write errors) reaches
    ZED_SPARE_ON_IO_ERRORS. ZFS will prevent further use of the disk.

    A drive is placed in the UNAVAIL (unavailable) state when it cannot be
    opened. No existing script specifically handles this event.

    ZED_SPARE_ON_CHECKSUM_ERRORS and ZED_SPARE_ON_IO_ERRORS are defined in
    zed.rc, typically located in /etc/zfs/zed.d. They are not set by
    default (the assignments are commented out), so replacement with a hot
    spare is disabled by default. The existing code sets these values:

        ZED_SPARE_ON_CHECKSUM_ERRORS = 10
        ZED_SPARE_ON_IO_ERRORS = 1

    The fault LED must be turned on when a drive has failed. The fault LED
    must be turned off when the drive is physically replaced.
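    The thresholding above can be sketched as follows. This is a minimal
    illustration, not the actual io-spare.sh code: the function and
    parameter names are hypothetical, and the defaults mirror the values
    the existing code sets in zed.rc.

```python
# Sketch of the io-spare.sh thresholding described above; the function
# and parameter names are illustrative, not taken from the real script.

def next_state(read_errors, write_errors, checksum_errors,
               spare_on_io_errors=1, spare_on_checksum_errors=10):
    """Return the state a drive should move to, or None to leave it alone.

    Defaults mirror the values the existing code sets in zed.rc.
    """
    # I/O errors are the sum of read and write errors.
    io_errors = read_errors + write_errors
    if io_errors >= spare_on_io_errors:
        return "FAULTED"    # ZFS prevents further use of the disk
    if checksum_errors >= spare_on_checksum_errors:
        return "DEGRADED"   # ZFS keeps using the drive when necessary
    return None
```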
    The drive fault LED will be controlled using the open source library
    for storage management (libstoragemgmt)
    [https://github.com/libstorage/libstoragemgmt], since it provides a
    vendor-neutral API for controlling the LED (provided a Storage
    Enclosure Processor [SEP] is present) and it can take the device path
    directly (so no mapping to the enclosure and slot is required).

    Hewlett Packard Enterprise drives have a single Fault/UID LED which
    can be amber or blue and be on, off, or blinking. The lights are
    interpreted like this:

    Fault/UID LED    Interpretation
    ---------------  -----------------------------------------------------
    Off              Normal operation. The drive is online, offline, a
                     spare, or not configured as part of any array.
    Amber            Fault LED is on. A critical fault condition has been
                     identified and the drive has been placed offline.
    Blue             UID LED is on. Drive has been selected by a
                     management application.
    Flashing amber   Fault LED is flashing (1 Hz). Predictive failure
                     alert.
    Alternating      Fault LED and UID LED are on. Predictive failure
    amber and blue   alert and the drive has been selected by a management
                     application.
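    The intended LED handling can be sketched with libstoragemgmt's Python
    bindings (the `lsm` module). This is a minimal sketch, not the actual
    ZED integration: `fault_led_wanted` and `apply_fault_led` are
    hypothetical names, and the sketch assumes the
    `lsm.LocalDisk.fault_led_on`/`fault_led_off` calls, which take a
    device path directly.

```python
# Sketch of fault LED control via libstoragemgmt's Python bindings.
# Function names here are illustrative, not part of ZED or lsm.

FAILED_STATES = {"DEGRADED", "FAULTED", "UNAVAIL"}

def fault_led_wanted(zpool_state):
    """True if the fault LED should be on for this zpool drive state."""
    return zpool_state in FAILED_STATES

def apply_fault_led(disk_path, zpool_state):
    """Turn the fault LED on or off for the drive at disk_path (e.g. /dev/sdb).

    libstoragemgmt takes the device path directly, so no mapping to the
    enclosure and slot is required; a SEP must be present for the call
    to succeed.
    """
    import lsm  # libstoragemgmt Python bindings; imported lazily
    if fault_led_wanted(zpool_state):
        lsm.LocalDisk.fault_led_on(disk_path)
    else:
        lsm.LocalDisk.fault_led_off(disk_path)
```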
    Background:
        Given ZED is running
        And ZED_SPARE_ON_CHECKSUM_ERRORS = 10
        And ZED_SPARE_ON_IO_ERRORS = 2

    Scenario: Healthy pool
        Given a healthy raidz2 pool
        Then the fault LED should be off on all drives

    Scenario: Drive with a few checksum errors
        Given a healthy raidz2 pool
        When drive 1 has 9 checksum errors
        Then the fault LED should be off on all drives

    Scenario: Drive with too many checksum errors
        Given a healthy raidz2 pool
        When drive 1 has 10 checksum errors
        Then only drive 1 should have the fault LED on

    Scenario: Drive with an I/O error
        Given a healthy raidz2 pool
        When drive 5 has 1 I/O error
        Then the fault LED should be off on all drives

    Scenario: Drive with too many I/O errors is faulted
        Given a healthy raidz2 pool
        When drive 5 has 2 I/O errors
        Then only drive 5 should have the fault LED on

    Scenario: Drive with too many I/O errors is degraded
        Given a healthy raidz1 pool
        When drive 5 has 2 I/O errors
        Then only drive 5 should have the fault LED on

    Scenario: Failed drive is physically replaced
        Given a degraded raidz2 pool
        And drive 11 has 3 I/O errors
        When the failed drive 11 is physically replaced
        Then the fault LED should be off on all drives

    # It may not be possible to get this scenario to pass. If
    # libstoragemgmt cannot control the fault LED for a device which
    # cannot be opened, this scenario will be removed.
    Scenario: Unavailable drive
        Given a degraded raidz2 pool
        When drive 2 is unavailable
        Then only drive 2 should have the fault LED on