tikv should retry connecting to pd indefinitely #4500
Is your feature request related to a problem? Please describe:
In #3827 I requested that tikv retry the connection to pd for a few seconds and then give up.
However, as part of Jepsen testing, it was reported that clusters can fail to come online for hundreds of seconds.
Describe the feature you'd like:
I like the 300ms retry interval as-is, but the retries should continue indefinitely. To reduce log spam, my suggestion is to log only every nth failed attempt rather than every single one. A sketch of what I mean is below.
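For concreteness, here is a minimal sketch of the kind of loop I have in mind. The `connect_to_pd` helper is a hypothetical stand-in for the real PD client call, and the logging cadence is illustrative, not TiKV's actual API:

```rust
use std::thread::sleep;
use std::time::Duration;

// Hypothetical stand-in for the real PD client call; TiKV's actual
// function and signature differ. Here it simulates pd becoming
// reachable after 10 failed attempts.
fn connect_to_pd(attempt: u64) -> Result<(), String> {
    if attempt >= 10 {
        Ok(())
    } else {
        Err("pd unreachable".to_string())
    }
}

fn main() {
    let retry_interval = Duration::from_millis(300); // keep the existing interval
    let log_every_nth = 5; // in production this might be ~100 (one line per ~30s)

    let mut attempt: u64 = 0;
    loop {
        match connect_to_pd(attempt) {
            Ok(()) => {
                println!("connected to pd after {} attempts", attempt + 1);
                break;
            }
            Err(e) => {
                attempt += 1;
                // Throttle the log: report only every nth failure.
                if attempt % log_every_nth == 1 {
                    eprintln!("failed to connect to pd (attempt {}): {}", attempt, e);
                }
                sleep(retry_interval);
            }
        }
    }
}
```

The point is that the backoff stays at 300ms and only the logging is throttled, so a node reconnects within one interval of pd coming back.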
Describe alternatives you've considered:
The alternative is to increase the number of retries to a very high number (1000). Because a tikv server rejoining the cluster is a safe operation, there is no reason for a give-up threshold.
Teachability, Documentation, Adoption, Migration Strategy:
It should be straightforward. TiDB will require the same behavior as well.
We've already talked about this in chat, but for posterity, it looks like TiKV 2.1.7 will retry for up to 60 seconds on startup before crashing. This could lead to situations in which a finite sequence of failure and recovery events results in a cluster that stays partially or totally unavailable until an operator intervenes. For instance, in this Jepsen test, we kill all TiKV nodes, pause all PD nodes, restart all TiKV nodes, and, after 70 seconds, resume PD. All TiKV nodes crash, and the cluster remains down indefinitely: http://jepsen.io.s3.amazonaws.com/analyses/tidb-2.1.7/20190409T175516-tikv-start-without-pd.zip.
This behavior could surprise operators and reduce availability. For instance, some init systems start services on system boot, but don't automatically restart them. Others will restart a crashing service only a finite number of times before giving up. If a TiKV node is rebooted, it'll come back online automatically--so long as PD is available. But if a compound outage occurs, such that PD is not available to that node, that reboot could take that KV node offline indefinitely--even once PD comes back.
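To make the init-system point concrete, a systemd unit with a finite restart budget might look like the following sketch (binary path, flag, and limit values are assumptions for illustration, not what tidb-ansible ships). With these settings, systemd stops restarting the service after five failed starts within 100 seconds:

```ini
# Illustrative unit file only -- not the one shipped by tidb-ansible.
[Unit]
Description=TiKV server
# Give up if the service is started more than 5 times in 100 seconds.
StartLimitIntervalSec=100
StartLimitBurst=5

[Service]
# Assumed binary path and flag; adjust for a real deployment.
ExecStart=/usr/local/bin/tikv-server --config /etc/tikv/tikv.toml
Restart=on-failure
RestartSec=3

[Install]
WantedBy=multi-user.target
```

If TiKV retried pd indefinitely instead of crashing, a pd outage would never consume this restart budget in the first place.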
Having a continuous retry loop will allow TiKV (and, when implemented there, TiDB) to recover automatically from these sorts of problems, and reduce the number of states and sequences operators have to reason about during an outage--they can restart nodes in any order, concurrently, or in rapid sequence, without fear of winding up in nondeterministic failure states.
I don't really have a good answer for you, @aphyr.
Most folks using TiKV are using tidb-ansible or tidb-operator for their deployment. These provide systemd service files and manage some of the ordering of system startup, so this issue may be encountered relatively rarely in practice.
As more folks adopt our project and try it in different settings, new usability issues are discovered. Like this one! Thanks for raising it.