
Commit 7f450bf
Merge pull request etcd-io#10296 from gyuho/learner
docs: rename to "learner" (from "non-voting member")
2 parents: 99c933b + bfd9596
15 files changed: +32, -32 lines

docs/index.rst (2 additions, 2 deletions)

@@ -8,7 +8,7 @@ Still working in progress...

* :ref:`set-up`: setting up an etcd cluster.
* :ref:`operate`: operating an etcd cluster.
-* :ref:`server-non-voting-member`: describes etcd Non-voting member.
+* :ref:`server-learner`: describes etcd non-voting member, Learner
* :ref:`client-architecture`: describes etcd client components.
* :ref:`client-feature-matrix`: compares client features.

@@ -34,7 +34,7 @@ Still working in progress...

   :maxdepth: 2
   :caption: Architecture

-   server-non-voting-member
+   server-learner
   client-architecture

.. toctree::

docs/server-non-voting-member.rst renamed to docs/server-learner.rst (30 additions, 30 deletions)
@@ -1,8 +1,8 @@
-.. _server-non-voting-member:
+.. _server-learner:


-Non-voting Member
-#################
+Learner
+#######


:Authors:
@@ -19,93 +19,93 @@ Membership reconfiguration has been one of the biggest operational challenges. L

A newly joined etcd member starts with no data, and thus demands more updates from the leader until it catches up with the leader’s logs. The leader’s network is then more likely to be overloaded, blocking or dropping leader heartbeats to followers. In such a case, a follower may hit its election timeout and start a new leader election. That is, a cluster with a new member is more vulnerable to leader elections. Both the leader election and the subsequent update propagation to the new member are prone to causing periods of cluster unavailability (see *Figure 1*).

-.. image:: img/server-non-voting-member-figure-01.png
+.. image:: img/server-learner-figure-01.png
   :align: center
-   :alt: server-non-voting-member-figure-01
+   :alt: server-learner-figure-01

What if a network partition happens? It depends on the leader’s partition. If the leader still maintains the active quorum, the cluster continues to operate (see *Figure 2*).

-.. image:: img/server-non-voting-member-figure-02.png
+.. image:: img/server-learner-figure-02.png
   :align: center
-   :alt: server-non-voting-member-figure-02
+   :alt: server-learner-figure-02

What if the leader becomes isolated from the rest of the cluster? The leader monitors the progress of each follower. When the leader loses connectivity from the quorum, it reverts back to follower, which affects cluster availability (see *Figure 3*).

-.. image:: img/server-non-voting-member-figure-03.png
+.. image:: img/server-learner-figure-03.png
   :align: center
-   :alt: server-non-voting-member-figure-03
+   :alt: server-learner-figure-03

When a new node is added to a 3-node cluster, the cluster size becomes 4 and the quorum size becomes 3. What if a new node has joined the cluster, and then a network partition happens? It depends on which partition the new member ends up in. If the new node happens to be located in the same partition as the leader, the leader still maintains the active quorum of 3. No leader election happens, and cluster availability is not affected (see *Figure 4*).

-.. image:: img/server-non-voting-member-figure-04.png
+.. image:: img/server-learner-figure-04.png
   :align: center
-   :alt: server-non-voting-member-figure-04
+   :alt: server-learner-figure-04

If the cluster is partitioned 2-and-2, then neither partition maintains the quorum of 3. In this case, a leader election happens (see *Figure 5*).

-.. image:: img/server-non-voting-member-figure-05.png
+.. image:: img/server-learner-figure-05.png
   :align: center
-   :alt: server-non-voting-member-figure-05
+   :alt: server-learner-figure-05

What if a network partition happens first, and then a new member is added? A partitioned 3-node cluster already has one disconnected follower. When a new member is added, the quorum changes from 2 to 3. Now, this cluster has only 2 active nodes out of 4, thus losing quorum and starting a new leader election (see *Figure 6*).

-.. image:: img/server-non-voting-member-figure-06.png
+.. image:: img/server-learner-figure-06.png
   :align: center
-   :alt: server-non-voting-member-figure-06
+   :alt: server-learner-figure-06

Since the member add operation can change the size of the quorum, it is always recommended to “member remove” first when replacing an unhealthy node.
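
The quorum arithmetic behind the scenarios above can be sketched in a few lines (an illustration of Raft majority math, not etcd code):

```python
def quorum(cluster_size: int) -> int:
    """Majority quorum for a Raft cluster of the given size."""
    return cluster_size // 2 + 1

# A 3-node cluster needs 2 votes; adding a 4th member raises the quorum to 3.
assert quorum(3) == 2 and quorum(4) == 3

# Partition-then-add (Figure 6): a 3-node cluster with one disconnected
# follower has 2 active nodes. After "member add", the quorum becomes 3,
# but only 2 nodes are reachable, so the quorum is lost.
active_nodes = 2
assert active_nodes < quorum(4)
```

This is also why removing the unhealthy node first is safer: the quorum stays at 2 instead of being raised to 3 while a node is already down.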

Adding a new member to a 1-node cluster changes the quorum size to 2, immediately causing a leader election when the previous leader finds out the quorum is not active. This is because the “member add” operation is a 2-step process in which the user applies the “member add” command first, and then starts the new node process (see *Figure 7*).

-.. image:: img/server-non-voting-member-figure-07.png
+.. image:: img/server-learner-figure-07.png
   :align: center
-   :alt: server-non-voting-member-figure-07
+   :alt: server-learner-figure-07

An even worse case is when an added member is misconfigured. Membership reconfiguration is a two-step process: “etcdctl member add” and starting an etcd server process with the given peer URL. That is, the “member add” command is applied regardless of the URL, even when the URL value is invalid. If the first step is applied with invalid URLs, the second step cannot even start the new etcd. Once the cluster loses quorum, there is no way to revert the membership change (see *Figure 8*).

-.. image:: img/server-non-voting-member-figure-08.png
+.. image:: img/server-learner-figure-08.png
   :align: center
-   :alt: server-non-voting-member-figure-08
+   :alt: server-learner-figure-08

The same applies to a multi-node cluster. For example, the cluster has two members down (one failed, the other misconfigured) and two members up, but it now requires at least 3 votes to change the cluster membership (see *Figure 9*).

-.. image:: img/server-non-voting-member-figure-09.png
+.. image:: img/server-learner-figure-09.png
   :align: center
-   :alt: server-non-voting-member-figure-09
+   :alt: server-learner-figure-09

As seen above, a simple misconfiguration can fail the whole cluster into an inoperative state. In such a case, an operator needs to manually recreate the cluster with the ``etcd --force-new-cluster`` flag. As etcd has become a mission-critical service for Kubernetes, even the slightest outage may have significant impact on users. What can we do better to make such etcd operations easier? Among other things, leader election is most critical to cluster availability: can we make membership reconfiguration less disruptive by not changing the size of the quorum? Can a new node be idle, requesting only the minimum updates from the leader, until it catches up? Can membership misconfiguration always be reversible and handled in a more secure way (a wrong “member add” command should never fail the cluster)? Should a user have to worry about network topology when adding a new member? Can the member add API work regardless of the location of nodes and ongoing network partitions?

Raft Learner
============

-In order to mitigate such availability gaps in the previous section, `Raft §4.2.1 <https://ramcloud.stanford.edu/~ongaro/thesis.pdf>`_ introduces a new node state “Learner”, which joins the cluster as a non-voting member until it catches up to leader’s logs.
+In order to mitigate such availability gaps in the previous section, `Raft §4.2.1 <https://ramcloud.stanford.edu/~ongaro/thesis.pdf>`_ introduces a new node state “Learner”, which joins the cluster as a **non-voting member** until it catches up to leader’s logs.

Features in v3.4
----------------


An operator should do the minimum amount of work possible to add a new learner node. The ``member add --learner`` command adds a new learner, which joins the cluster as a non-voting member but still receives all data from the leader (see *Figure 10*).

-.. image:: img/server-non-voting-member-figure-10.png
+.. image:: img/server-learner-figure-10.png
   :align: center
-   :alt: server-non-voting-member-figure-10
+   :alt: server-learner-figure-10
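
The defining property of a learner, that it replicates the log but is excluded from the quorum, can be modeled with a small sketch (the ``Member`` type and ``quorum`` function here are hypothetical illustrations, not etcd’s implementation):

```python
from dataclasses import dataclass

@dataclass
class Member:
    name: str
    is_learner: bool = False

def quorum(members: list) -> int:
    """Quorum counts voting members only; learners are excluded."""
    voters = sum(1 for m in members if not m.is_learner)
    return voters // 2 + 1

cluster = [Member("a"), Member("b"), Member("c")]
assert quorum(cluster) == 2

# Adding a learner leaves the quorum unchanged, so the join is non-disruptive.
cluster.append(Member("d", is_learner=True))
assert quorum(cluster) == 2

# Promotion turns it into a voting member; only now does the quorum grow.
cluster[-1].is_learner = False
assert quorum(cluster) == 3
```

Contrast this with the earlier scenarios: a plain “member add” raises the quorum immediately, while a learner defers that cost until the operator chooses to promote.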

When a learner has caught up with the leader’s progress, the learner can be promoted to a voting member using the ``member promote`` API; it then counts towards the quorum (see *Figure 11*).

-.. image:: img/server-non-voting-member-figure-11.png
+.. image:: img/server-learner-figure-11.png
   :align: center
-   :alt: server-non-voting-member-figure-11
+   :alt: server-learner-figure-11

The etcd server validates a promote request to ensure operational safety. Only after its log has caught up to the leader’s can a learner be promoted to a voting member (see *Figure 12*).

-.. image:: img/server-non-voting-member-figure-12.png
+.. image:: img/server-learner-figure-12.png
   :align: center
-   :alt: server-non-voting-member-figure-12
+   :alt: server-learner-figure-12
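
One way to picture such a validation is a progress check comparing the learner’s replicated log position with the leader’s. The function below and its 90% threshold are illustrative assumptions, not etcd’s exact rule:

```python
def ready_to_promote(learner_match: int, leader_match: int,
                     threshold: float = 0.9) -> bool:
    """Allow promotion only when the learner's log index is close enough
    to the leader's. The 0.9 threshold is an illustrative assumption."""
    if leader_match == 0:
        return False  # no progress information yet: refuse to promote
    return learner_match >= leader_match * threshold

assert ready_to_promote(95, 100)       # nearly caught up: safe to promote
assert not ready_to_promote(40, 100)   # far behind: promotion rejected
```

The important property is that the check is server-side: an operator cannot accidentally promote a learner that would immediately widen the quorum without being able to vote usefully.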

A learner serves only as a standby node until promoted: leadership cannot be transferred to a learner, and a learner rejects client reads and writes (the client balancer should not route requests to a learner). This means a learner does not need to issue Read Index requests to the leader. This limitation simplifies the initial learner implementation in the v3.4 release (see *Figure 13*).

-.. image:: img/server-non-voting-member-figure-13.png
+.. image:: img/server-learner-figure-13.png
   :align: center
-   :alt: server-non-voting-member-figure-13
+   :alt: server-learner-figure-13

In addition, etcd limits the total number of learners that a cluster can have, to avoid overloading the leader with log replication. A learner never promotes itself. While etcd provides learner status information and safety checks, the cluster operator must make the final decision on whether to promote a learner.
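
Such a learner cap might be enforced at “member add” time, along the lines of the sketch below (the cap of 1 and the error type are illustrative assumptions, not etcd’s configured values):

```python
MAX_LEARNERS = 1  # illustrative cap, not etcd's actual configured value

class TooManyLearnersError(Exception):
    """Raised when adding a learner would exceed the cluster's cap."""

def add_member(members: list, name: str, is_learner: bool = False) -> None:
    """Append a (name, is_learner) pair, rejecting excess learners so the
    leader's replication load stays bounded."""
    if is_learner:
        learners = sum(1 for _, learner in members if learner)
        if learners >= MAX_LEARNERS:
            raise TooManyLearnersError(f"cluster already has {learners} learner(s)")
    members.append((name, is_learner))

cluster = [("a", False), ("b", False), ("c", False)]
add_member(cluster, "d", is_learner=True)   # first learner is accepted
try:
    add_member(cluster, "e", is_learner=True)
except TooManyLearnersError:
    pass  # second learner rejected under the illustrative cap
```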
