
tikv should retry connecting to pd indefinitely #4500

Closed · morgo opened this issue Apr 9, 2019 · 10 comments

@morgo commented Apr 9, 2019

Feature Request

Is your feature request related to a problem? Please describe:

In #3827 I requested retrying connectivity to PD for a few seconds, and then giving up.

However, as part of Jepsen testing, it was reported that clusters can fail to come online for hundreds of seconds.

Describe the feature you'd like:

I like the 300ms retry interval as-is, but the retries should continue indefinitely. To reduce log spam, I suggest logging only every nth connectivity attempt.

My suggestion for n is 10 (one log line roughly every 3 seconds).
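
Roughly, the retry loop I have in mind looks like the sketch below (function and constant names are illustrative, not TiKV's actual PD client API):

```rust
use std::thread;
use std::time::Duration;

// Illustrative values matching the proposal: 300ms between attempts,
// one log line per 10 attempts (roughly every 3 seconds).
const RETRY_INTERVAL: Duration = Duration::from_millis(300);
const LOG_EVERY_NTH: u64 = 10;

// Placeholder standing in for TiKV's real PD client call.
fn connect_to_pd() -> Result<(), String> {
    Err("PD unreachable".to_owned())
}

fn wait_for_pd() {
    let mut attempt: u64 = 0;
    loop {
        attempt += 1;
        match connect_to_pd() {
            Ok(()) => {
                println!("connected to PD after {} attempt(s)", attempt);
                return;
            }
            Err(e) => {
                // Log attempts 1, 11, 21, ... to keep log volume down.
                if attempt % LOG_EVERY_NTH == 1 {
                    eprintln!("failed to connect to PD (attempt {}): {}", attempt, e);
                }
                thread::sleep(RETRY_INTERVAL);
            }
        }
    }
}
```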

Describe alternatives you've considered:

The alternative is to increase the number of retries to a very high number (e.g. 1000). But because a TiKV server rejoining the cluster is a safe operation, there is no reason for a give-up threshold at all.

Teachability, Documentation, Adoption, Migration Strategy:

It should be straightforward. TiDB will require the same behavior as well.

@Hoverbear self-assigned this Apr 9, 2019

@Hoverbear (Member) commented Apr 9, 2019

I should have a PR for this soon. :)

@aphyr commented Apr 9, 2019

We've already talked about this in chat, but for posterity, it looks like TiKV 2.1.7 will retry for up to 60 seconds on startup before crashing. This could lead to situations in which a finite sequence of failure and recovery events results in a cluster that stays partially or totally unavailable until an operator intervenes. For instance, in this Jepsen test, we kill all TiKV nodes, pause all PD nodes, restart all TiKV nodes, and, after 70 seconds, resume PD. All TiKV nodes crash, and the cluster remains down indefinitely: http://jepsen.io.s3.amazonaws.com/analyses/tidb-2.1.7/20190409T175516-tikv-start-without-pd.zip.

[image: latency-raw plot from the Jepsen run]

This behavior could surprise operators and reduce availability. For instance, some init systems start services on system boot, but don't automatically restart them. Others will restart a crashing service only a finite number of times before giving up. If a TiKV node is rebooted, it'll come back online automatically--so long as PD is available. But if a compound outage occurs, such that PD is not available to that node, that reboot could take that KV node offline indefinitely--even once PD comes back.
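
As a concrete (and purely illustrative) example, systemd's default rate limiting stops restarting a unit like the following after a handful of rapid failures; the paths and limits here are hypothetical, not what tidb-ansible actually ships:

```ini
[Unit]
Description=TiKV
# systemd refuses further restarts if the unit fails more than
# 5 times within 10 seconds (these are the defaults); TiKV then
# stays down until an operator intervenes.
StartLimitIntervalSec=10
StartLimitBurst=5

[Service]
# Illustrative path; real deployment layouts may differ.
ExecStart=/usr/local/bin/tikv-server --config /etc/tikv/tikv.toml
Restart=on-failure
```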

Having a continuous retry loop will allow TiKV (and, when implemented there, TiDB) to recover automatically from these sorts of problems, and reduce the number of states and sequences operators have to reason about during an outage--they can restart nodes in any order, concurrently, or in rapid sequence, without fear of winding up in nondeterministic failure states.

@aphyr commented Apr 9, 2019

To reproduce this problem and confirm the behavior experimentally, you can check out Jepsen at commit abcc56e39a4b5bf6d477edccf59b57aea717e2ed and run something like:

```
lein run test --auto-retry-limit 0 --time-limit 400 --concurrency 2n -w bank --nemesis restart-kv-without-pd; and-bell
```

@Hoverbear (Member) commented Apr 9, 2019

Thank you for sharing this knowledge of best practices with us, @aphyr. :)

Maybe it's an appropriate topic to add to https://tikv.org/deep-dive/ as well! This seems like valuable information for our friends building other distributed systems.

@siddontang (Contributor) commented Apr 10, 2019

I think it is better to provide a configuration to control the retries rather than making them unlimited.
We also need to pay attention to the retry duration at service startup; it is too short, not only in TiKV but also in TiDB.

/cc @nolouch @overvenus

@siddontang (Contributor) commented Apr 10, 2019

We can provide a timeout configuration for startup, where a timeout of 0 means unlimited retries.
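
Something like the sketch below, for example (the names here are just for illustration, not real TiKV configuration keys):

```rust
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical startup settings; not actual TiKV config keys.
struct StartupConfig {
    /// Zero means "retry forever"; anything else is a hard deadline.
    pd_startup_timeout: Duration,
    retry_interval: Duration,
}

// Placeholder for the real PD connection attempt.
fn try_connect_to_pd() -> bool {
    false
}

fn wait_for_pd(cfg: &StartupConfig) -> Result<(), &'static str> {
    let start = Instant::now();
    loop {
        if try_connect_to_pd() {
            return Ok(());
        }
        // A zero timeout disables the deadline, i.e. unlimited retries.
        if cfg.pd_startup_timeout != Duration::from_secs(0)
            && start.elapsed() >= cfg.pd_startup_timeout
        {
            return Err("timed out waiting for PD on startup");
        }
        thread::sleep(cfg.retry_interval);
    }
}
```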

@aphyr commented Apr 10, 2019

Out of curiosity, what happens when TiKV loses its connection to PD during normal operation? Does it kill the process?

@Hoverbear (Member) commented Apr 10, 2019

TiKV will attempt to reconnect!

I suspect the behavior in this issue went unnoticed because in our deployments we typically provision PD first, so we rarely have problems with TiKV's initial connections.

@aphyr commented Apr 10, 2019

So... why would you want TiKV to recover from an unavailable PD automatically some of the time, but crash other times?

@Hoverbear (Member) commented Apr 10, 2019

I don't really have a good answer for you, @aphyr. 😢

Most folks using TiKV deploy with tidb-ansible or tidb-operator. These provide systemd service files and manage some of the ordering of system startup, so this issue is probably experienced relatively rarely.

As more folks adopt our project and try it in different settings, new usability issues are discovered. Like this one! Thanks for raising it.
