Hi etcd team,

I'm running a 3-node etcd cluster and noticed some behavior that I'm not sure is expected. I'd appreciate your feedback on whether this is normal or if I'm missing something.

Problem

When I fully shut down one of the etcd nodes (i.e., the whole machine is powered off or unreachable at the network level), I see the following error repeatedly in a very short period:

During this time, the cluster seems unstable and not fully functional.

However, if I only stop the etcd service on the node (but leave the machine itself reachable on the network), everything behaves normally: no errors, and the cluster stays "healthy" with one node degraded. If I stop the etcd service on the node and later shut down the machine, the same errors occur.

This seems counterintuitive: I would expect both cases to degrade the cluster in the same way (1 node down out of 3), but apparently the fully unreachable node causes more trouble.

Cluster Setup

I'm running 3 etcd nodes with the following configuration on each (example from one node):
System:
OS:
Etcd:
Is this behavior expected when a node is completely unreachable at the network level?

Thanks in advance for your help!

Best regards,
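The expectation in the question (one node down out of three should only degrade the cluster, not break it) rests on majority-quorum arithmetic. A minimal sketch of that arithmetic follows; it is an illustration only, not etcd code, and the function names are hypothetical.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of members that forms a majority."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def failure_tolerance(cluster_size: int) -> int:
    """Number of members that can be lost while a majority remains."""
    return cluster_size - quorum(cluster_size)


def has_quorum(cluster_size: int, reachable: int) -> bool:
    """True if the reachable members still form a majority."""
    return reachable >= quorum(cluster_size)


if __name__ == "__main__":
    # Print quorum size and failure tolerance for a few common cluster sizes.
    for n in (1, 3, 5):
        print(f"size={n} quorum={quorum(n)} tolerates={failure_tolerance(n)}")
```

By this arithmetic, a 3-node cluster with one member down still has two reachable members, which is a majority. Whether that member was stopped cleanly or became unreachable at the network level should not change the result.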
Replies: 2 comments
Hi @Kroseida - Thanks for your question. It looks like you are running etcd v3.5.0, which is a four-year-old release. That version is not recommended for production use; please refer to https://github.com/etcd-io/etcd/tree/main/CHANGELOG#v35-data-corruption-issue

We have fixed countless bugs and security issues in later versions of v3.5. Please upgrade to a later patch release as soon as possible and then report back if this issue persists, thanks.

Hi @jmhbnz,

Thanks for the heads-up! I went ahead and updated etcd to a more recent version, and that seems to have resolved the issue. The cluster remains stable now, even when a node is completely unreachable. I appreciate the support!

Best regards,