From d9027b519571e7e0d9733e7bcd9ca249ebb9b070 Mon Sep 17 00:00:00 2001
From: Stig Telfer
Date: Mon, 10 Oct 2022 22:37:03 +0100
Subject: [PATCH] Add details for Ceph drive replacement

The process for removing and replacing a Ceph drive is described.
Ceph-Ansible invocation for redeployment is variable between deployments.
---
 source/ceph_storage.rst                 |  9 ++--
 source/include/ceph_ansible.rst         | 44 ++++++++++++++++-
 source/include/ceph_troubleshooting.rst | 63 +++++++++++++++++++++++++
 3 files changed, 110 insertions(+), 6 deletions(-)

diff --git a/source/ceph_storage.rst b/source/ceph_storage.rst
index 241bb11..076d6db 100644
--- a/source/ceph_storage.rst
+++ b/source/ceph_storage.rst
@@ -16,6 +16,11 @@ Ceph Storage

    The Ceph deployment is not managed by StackHPC Ltd.

+Ceph Operations and Troubleshooting
+===================================
+
+.. include:: include/ceph_troubleshooting.rst
+
 .. ifconfig:: deployment['ceph_ansible']

    Ceph Ansible
@@ -23,7 +28,3 @@ Ceph Storage

    .. include:: include/ceph_ansible.rst

-   Ceph Troubleshooting
-   ====================
-
-   .. include:: include/ceph_troubleshooting.rst
diff --git a/source/include/ceph_ansible.rst b/source/include/ceph_ansible.rst
index f72a444..0212b42 100644
--- a/source/include/ceph_ansible.rst
+++ b/source/include/ceph_ansible.rst
@@ -1,5 +1,45 @@
 Making a Ceph-Ansible Checkout
-------------------------------
+==============================

 Invoking Ceph-Ansible
---------------------
+=====================
+
+Replacing a Failed Ceph Drive
+=============================
+
+Once an OSD has been identified as having a hardware failure,
+the affected drive will need to be replaced.
+
+.. note::
+
+   Hot-swapping a failed device will change the device enumeration,
+   and this could confuse the device addressing in Kayobe LVM
+   configuration.
+
+   In kayobe-config, use ``/dev/disk/by-path`` device references to
+   avoid this issue.
+
+   Alternatively, always reboot a server when swapping drives.
+
+If rebooting a Ceph node, first set ``noout`` to prevent excess data
+movement:
+
+.. code-block:: console
+
+   ceph# ceph osd set noout
+
+Apply LVM configuration using Kayobe for the replaced device (here on
+``storage-0``):
+
+.. code-block:: console
+
+   kayobe$ kayobe overcloud host configure -t lvm -kt none -l storage-0 -kl storage-0
+
+Before running Ceph-Ansible, also remove the vestigial state directory
+from ``/var/lib/ceph/osd`` for the purged OSD.
+
+Reapply Ceph-Ansible in the usual manner.
+
+.. note::
+
+   Ceph-Ansible runs can fail to complete if there are background
+   activities such as backfilling underway when the Ceph-Ansible
+   playbook is invoked.
diff --git a/source/include/ceph_troubleshooting.rst b/source/include/ceph_troubleshooting.rst
index 161abf7..7f61f7c 100644
--- a/source/include/ceph_troubleshooting.rst
+++ b/source/include/ceph_troubleshooting.rst
@@ -1,3 +1,66 @@
+Investigating a Failed Ceph Drive
+---------------------------------
+
+After deployment, a failed drive may cause OSD crashes in Ceph.
+If Ceph detects crashed OSDs, it will go into ``HEALTH_WARN`` state.
+Ceph can report details about the failed OSDs by running:
+
+.. code-block:: console
+
+   ceph# ceph health detail
+
+A failed OSD will also be reported as down by running:
+
+.. code-block:: console
+
+   ceph# ceph osd tree
+
+Note the ID of the failed OSD.
+
+The failed hardware device is logged by the Linux kernel:
+
+.. code-block:: console
+
+   storage-0# dmesg -T
+
+Cross-reference the hardware device and OSD ID to ensure they match.
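+
+One way to make this cross-reference - a sketch, assuming the OSDs were
+deployed with ``ceph-volume`` LVM - is to list each OSD ID alongside its
+backing logical volume and physical device on the storage node (on
+containerised deployments this may need to run inside an OSD container):
+
+.. code-block:: console
+
+   storage-0# ceph-volume lvm list
+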
+(Using ``pvs`` and ``lvs`` may also help to make this connection.)
+
+Removing a Failed Ceph Drive
+----------------------------
+
+If a drive is verified dead, stop and eject the OSD (e.g. ``osd.4``)
+from the cluster:
+
+.. code-block:: console
+
+   storage-0# systemctl stop ceph-osd@4.service
+   storage-0# systemctl disable ceph-osd@4.service
+   ceph# ceph osd out osd.4
+
+.. ifconfig:: deployment['ceph_ansible']
+
+   Before running Ceph-Ansible, also remove the vestigial state directory
+   from ``/var/lib/ceph/osd`` for the purged OSD, for example for OSD ID 4:
+
+   .. code-block:: console
+
+      storage-0# rm -rf /var/lib/ceph/osd/ceph-4
+
+Remove Ceph OSD state for the old OSD, here OSD ID ``4`` (all of the
+data will be backfilled when the drive is reintroduced):
+
+.. code-block:: console
+
+   ceph# ceph osd purge --yes-i-really-mean-it 4
+
+Unset ``noout`` for the OSDs once hardware maintenance - for example,
+the wait for a replacement disk - has concluded:
+
+.. code-block:: console
+
+   ceph# ceph osd unset noout
+
 Inspecting a Ceph Block Device for a VM
 ---------------------------------------