9 changes: 5 additions & 4 deletions source/ceph_storage.rst
@@ -16,14 +16,15 @@ Ceph Storage

The Ceph deployment is not managed by StackHPC Ltd.

Ceph Operations and Troubleshooting
===================================

.. include:: include/ceph_troubleshooting.rst

.. ifconfig:: deployment['ceph_ansible']

   Ceph Ansible
   ============

   .. include:: include/ceph_ansible.rst

44 changes: 42 additions & 2 deletions source/include/ceph_ansible.rst
@@ -1,5 +1,45 @@
Making a Ceph-Ansible Checkout
==============================

Invoking Ceph-Ansible
=====================

Replacing a Failed Ceph Drive
=============================

Once an OSD has been identified as having a hardware failure,
the affected drive will need to be replaced.

.. note::

   Hot-swapping a failed device will change the device enumeration
   and this could confuse the device addressing in Kayobe LVM
   configuration.

   In kayobe-config, use ``/dev/disk/by-path`` device references to
   avoid this issue.

   Alternatively, always reboot a server when swapping drives.
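
To enumerate the stable references for the devices currently present on a
host (the paths shown will differ between systems):

.. code-block:: console

   storage-0# ls -l /dev/disk/by-path/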

If rebooting a Ceph node, first set ``noout`` to prevent excess data
movement:

.. code-block:: console

   ceph# ceph osd set noout
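
Confirm the flag is set before proceeding (the ``flags`` line of
``ceph osd dump`` should now include ``noout``):

.. code-block:: console

   ceph# ceph osd dump | grep flags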

Apply LVM configuration using Kayobe for the replaced device (here on ``storage-0``):

.. code-block:: console

   kayobe$ kayobe overcloud host configure -t lvm -kt none -l storage-0 -kl storage-0
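
If the run succeeds, the replacement device should now be visible as an LVM
physical volume on the host:

.. code-block:: console

   storage-0# pvs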

Before running Ceph-Ansible, also remove the vestigial state directory
from ``/var/lib/ceph/osd`` for the purged OSD.
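
For example, for OSD ID 4:

.. code-block:: console

   storage-0# rm -rf /var/lib/ceph/osd/ceph-4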

Reapply Ceph-Ansible in the usual manner.

.. note::

   Ceph-Ansible runs can fail to complete if there are background activities
   such as backfilling underway when the Ceph-Ansible playbook is invoked.
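
One quick check for such background activity is the cluster status:

.. code-block:: console

   ceph# ceph -s
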
63 changes: 63 additions & 0 deletions source/include/ceph_troubleshooting.rst
@@ -1,3 +1,66 @@
Investigating a Failed Ceph Drive
---------------------------------

After deployment, a failed drive may cause OSD crashes in Ceph.
If Ceph detects crashed OSDs, it will go into the ``HEALTH_WARN`` state.
Ceph can report details about failed OSDs by running:

.. code-block:: console

   ceph# ceph health detail

A failed OSD will also be reported as down by running:

.. code-block:: console

   ceph# ceph osd tree

Note the ID of the failed OSD.

The failed hardware device is logged by the Linux kernel:

.. code-block:: console

   storage-0# dmesg -T

Cross-reference the hardware device and OSD ID to ensure they match
(using ``pvs`` and ``lvs`` may help make this connection).
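
For example, on a ``ceph-volume`` based deployment, the OSD ID is recorded
in the LVM tags of the OSD's logical volume (a sketch; exact tag names may
vary between Ceph releases):

.. code-block:: console

   storage-0# lvs -o lv_name,lv_tags,devices
   storage-0# ceph-volume lvm list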

Member: That guide is for ceph-ansible only, right?

Member: It does seem so; therefore I'm not really in a position to review it.

Member (Author): Yes - I tried to split according to general operations and
content specific to Ceph-Ansible. I'm already looking forward to the CephAdm
content!

Removing a Failed Ceph Drive
----------------------------

If a drive is verified dead, stop and eject the OSD (e.g. ``osd.4``)
from the cluster:

.. code-block:: console

   storage-0# systemctl stop ceph-osd@4.service
   storage-0# systemctl disable ceph-osd@4.service
   ceph# ceph osd out osd.4
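
The stopped OSD should now be reported as ``down`` in the output of
``ceph osd tree``:

.. code-block:: console

   ceph# ceph osd tree
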
.. ifconfig:: deployment['ceph_ansible']

   Before running Ceph-Ansible, also remove the vestigial state directory
   from ``/var/lib/ceph/osd`` for the purged OSD, for example for OSD ID 4:

   .. code-block:: console

      storage-0# rm -rf /var/lib/ceph/osd/ceph-4

Remove the Ceph OSD state for the old OSD, here OSD ID ``4`` (we will
backfill all the data when we reintroduce the drive):

.. code-block:: console

   ceph# ceph osd purge --yes-i-really-mean-it 4

Unset ``noout`` for OSDs when hardware maintenance has concluded, e.g.
while waiting for the replacement disk:

.. code-block:: console

   ceph# ceph osd unset noout
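
Confirm that the flag has been cleared and monitor the cluster while it
recovers:

.. code-block:: console

   ceph# ceph -s
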
Inspecting a Ceph Block Device for a VM
---------------------------------------
