Commit 50fb9ee: etcd instructions (#10)
jschaul committed Sep 30, 2019 (1 parent: a718b13)
Showing 1 changed file with 57 additions and 8 deletions: source/how-to/administrate/etcd.rst
This section only covers the bare minimum; for more information, see the `etcd documentation <https://etcd.io/docs/>`__.
How to see cluster health
~~~~~~~~~~~~~~~~~~~~~~~~~~

If the file ``/usr/local/bin/etcd-health.sh`` is available, you can run

.. code:: sh

   etcd-health.sh

which should produce output similar to::

   Cluster-Endpoints: https://127.0.0.1:2379
   cURL Command: curl -X GET https://127.0.0.1:2379/v2/members
   member 7c37f7dc10558fae is healthy: got healthy result from https://10.10.1.11:2379
   member cca4e6f315097b3b is healthy: got healthy result from https://10.10.1.10:2379
   member e767162297c84b1e is healthy: got healthy result from https://10.10.1.12:2379
   cluster is healthy

If that helper file is not available, create it with the following contents:

.. code:: bash

   #!/usr/bin/env bash
   # Check the health of every cluster member, authenticating with this
   # member's TLS certificate (named after the local hostname).
   HOST=$(hostname)
   etcdctl --endpoints https://127.0.0.1:2379 \
       --ca-file=/etc/ssl/etcd/ssl/ca.pem \
       --cert-file=/etc/ssl/etcd/ssl/member-$HOST.pem \
       --key-file=/etc/ssl/etcd/ssl/member-$HOST-key.pem \
       --debug cluster-health

and then make it executable: ``chmod +x /usr/local/bin/etcd-health.sh``.
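
To probe a single member rather than the whole cluster, etcd also serves a ``/health`` endpoint over HTTPS. A minimal sketch, reusing the certificate paths assumed by ``etcd-health.sh`` above (adjust to your deployment):

.. code:: sh

   # Ask one member directly whether it considers itself healthy.
   # Certificate paths are the same assumptions as in etcd-health.sh.
   curl --cacert /etc/ssl/etcd/ssl/ca.pem \
        --cert /etc/ssl/etcd/ssl/member-$(hostname).pem \
        --key /etc/ssl/etcd/ssl/member-$(hostname)-key.pem \
        https://127.0.0.1:2379/health

A healthy member typically answers with something like ``{"health": "true"}``.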

How to inspect tables and data manually
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
How to rolling-restart an etcd cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On each server one by one:

1. Check your cluster is healthy (see above).
2. Stop the process with ``systemctl stop etcd``. This should be safe, since etcd clients retry their operations against another endpoint if one becomes unavailable (see `this page <https://etcd.io/docs/v3.3.12/learning/client-architecture/>`__).
3. Perform whatever maintenance you need, if any.
4. Start the process again with ``systemctl start etcd``.
5. Wait for your cluster to be healthy again.
6. Do the same on the next server (see the scripted sketch below).
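
As an illustration only, the steps above could be scripted roughly as follows. The hostnames ``node1`` to ``node3``, passwordless SSH from a coordinating host, and the ``etcd-health.sh`` helper from the previous section are all assumptions, not part of the documented procedure:

.. code:: bash

   #!/usr/bin/env bash
   # Hypothetical rolling-restart sketch, run from a coordinating host.
   set -euo pipefail

   wait_healthy() {
       # Poll a node's etcd-health.sh until the whole cluster reports healthy.
       until ssh "$1" /usr/local/bin/etcd-health.sh 2>/dev/null | grep -q 'cluster is healthy'; do
           sleep 5
       done
   }

   for node in node1 node2 node3; do      # replace with your etcd hostnames
       wait_healthy "$node"               # 1. check the cluster is healthy
       ssh "$node" systemctl stop etcd    # 2. stop this member
       # 3. perform any per-node maintenance here
       ssh "$node" systemctl start etcd   # 4. start it again
       wait_healthy "$node"               # 5. wait until healthy again
   done                                   # 6. continue with the next server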


Troubleshooting
~~~~~~~~~~~~~~~~~~~~~~~~~~


How to recover from a single unhealthy etcd node after snapshot restore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After restoring an etcd machine from an earlier snapshot of its disk, the restored etcd member may be unable to rejoin the cluster.

Symptoms: the etcd process on the restored machine crashes on startup, and the other etcd nodes can't reach it::

   failed to check the health of member e767162297c84b1e on https://10.10.1.12:2379: Get https://10.10.1.12:2379/health: dial tcp 10.10.1.12:2379: getsockopt: connection refused
   member e767162297c84b1e is unreachable: [https://10.10.1.12:2379] are all unreachable

Logs from the crashing etcd::

   (...)
   Sep 25 09:27:05 node2 etcd[20288]: 2019-09-25 07:27:05.691409 I | raft: e767162297c84b1e [term: 28] received a MsgHeartbeat message with higher term from cca4e6f315097b3b [term: 30]
   Sep 25 09:27:05 node2 etcd[20288]: 2019-09-25 07:27:05.691620 I | raft: e767162297c84b1e became follower at term 30
   Sep 25 09:27:05 node2 etcd[20288]: 2019-09-25 07:27:05.692423 C | raft: tocommit(16152654) is out of range [lastIndex(16061986)]. Was the raft log corrupted, truncated, or lost?
   Sep 25 09:27:05 node2 etcd[20288]: panic: tocommit(16152654) is out of range [lastIndex(16061986)]. Was the raft log corrupted, truncated, or lost?
   Sep 25 09:27:05 node2 etcd[20288]: goroutine 90 [running]:
   (...)

To remediate the situation, do the following:

.. code:: sh

   TODO
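
One commonly used recovery for this failure mode is to replace the broken member: remove it from the cluster, move its stale data directory aside, and re-add it so it re-syncs from the leader. The following is a minimal sketch, assuming the member ID and URLs from the logs above, the TLS flags used by ``etcd-health.sh``, a peer port of 2380, and a data directory of ``/var/lib/etcd``; all of these are deployment-specific:

.. code:: sh

   # On a healthy node: remove the broken member (ID taken from the logs above).
   etcdctl --endpoints https://127.0.0.1:2379 \
       --ca-file=/etc/ssl/etcd/ssl/ca.pem \
       --cert-file=/etc/ssl/etcd/ssl/member-$(hostname).pem \
       --key-file=/etc/ssl/etcd/ssl/member-$(hostname)-key.pem \
       member remove e767162297c84b1e

   # On the broken node: stop etcd and move the stale data directory aside.
   # /var/lib/etcd is an assumed path; check your etcd configuration.
   systemctl stop etcd
   mv /var/lib/etcd /var/lib/etcd.bak

   # On a healthy node: re-add the member under its peer URL.
   etcdctl --endpoints https://127.0.0.1:2379 \
       --ca-file=/etc/ssl/etcd/ssl/ca.pem \
       --cert-file=/etc/ssl/etcd/ssl/member-$(hostname).pem \
       --key-file=/etc/ssl/etcd/ssl/member-$(hostname)-key.pem \
       member add node2 https://10.10.1.12:2380

   # On the broken node: make sure etcd is configured with
   # initial-cluster-state=existing, then start it so it joins as a
   # fresh member and syncs from the leader.
   systemctl start etcd

``member add`` prints the ``ETCD_INITIAL_CLUSTER*`` values the rejoining node must be started with; afterwards, verify with ``etcd-health.sh`` that all members report healthy again.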
