Errors when connecting to multiple validators with same chain-id #249
When two validator entries are configured for the same network, tmkms logs an error when (I assume) both validators attempt to propose a block. It also disconnects the late validator, which is undesirable for HA.
Related to tendermint/tendermint#3583, the Tendermint side is fairly vocal about non-deterministic signatures:
Assuming multiple active validators is to be supported, I believe the behavior should be modified to not disconnect.
It would probably also be beneficial to have a mechanism whereby KMS can tell Tendermint "I won't sign this block, but please don't panic" (or something to that effect).
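To make the "I won't sign this, but please don't panic" idea concrete, here is a minimal sketch of a reply type that separates a deliberate refusal from a real failure. Everything in it (`SignReply`, `ValidatorAction`, `handle_reply`) is a hypothetical name made up for illustration; it is not the existing tmkms or Tendermint wire protocol.

```rust
/// Hypothetical reply a signer could send back to a validator.
/// None of these names come from tmkms or Tendermint.
#[derive(Debug, PartialEq, Eq)]
pub enum SignReply {
    /// The request was signed; carries the signature bytes.
    Signed(Vec<u8>),
    /// The signer chose not to sign (e.g. it already signed at this
    /// height for another validator). Not an error.
    Declined { reason: String },
    /// Something actually went wrong (key unavailable, I/O error, ...).
    Failed { error: String },
}

/// What the validator side might do with each reply (also hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum ValidatorAction {
    Broadcast,
    SkipQuietly,
    Reconnect,
}

/// Map a reply to an action: only a real failure drops the connection.
pub fn handle_reply(reply: &SignReply) -> ValidatorAction {
    match reply {
        SignReply::Signed(_) => ValidatorAction::Broadcast,
        SignReply::Declined { .. } => ValidatorAction::SkipQuietly,
        SignReply::Failed { .. } => ValidatorAction::Reconnect,
    }
}
```

With something along these lines, the late validator would receive `Declined` and stay connected instead of being dropped.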
Some other related issues:
I'd agree this is the simplest first step to support an HA validator setup.
Before we start going down any particular path in this regard, I think it'd be good to have a rough high-level plan from the Tendermint team regarding how HA setups like this should work in general. Notably, this approach leaves little margin for error if there's ever a bug in the KMS's double signing detection.
Another thing to consider is having two KMS instances connecting to both validators. Right now that's uncoordinated, so I'd be a bit worried: if multiple validator instances deliver signing requests simultaneously to KMS processes that don't coordinate with each other, there's potential for double signing, particularly if the validator hosts are uncoordinated as well. Something needs to be in charge of determining which validator and/or KMS instances are active and signing at a given time.
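For reference, the kind of double signing check being relied on can be sketched as a guard that only signs at a strictly later (height, round, step) than the last one it signed. This is a simplified illustration under my own assumptions, not the actual tmkms implementation: `SignPoint` and `SignGuard` are hypothetical names, and a real check would typically also allow re-signing an identical message at the same point.

```rust
/// Position in consensus that a signature is bound to (hypothetical type).
/// Deriving `Ord` compares fields in order: height, then round, then step.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct SignPoint {
    pub height: u64,
    pub round: u32,
    pub step: u8,
}

/// Remembers the last point signed by this process (hypothetical type).
pub struct SignGuard {
    last: Option<SignPoint>,
}

impl SignGuard {
    pub fn new() -> Self {
        SignGuard { last: None }
    }

    /// Returns true and records `next` only if it is strictly later than
    /// the last signed point; otherwise refuses and leaves state unchanged.
    pub fn try_advance(&mut self, next: SignPoint) -> bool {
        match self.last {
            Some(prev) if next <= prev => false,
            _ => {
                self.last = Some(next);
                true
            }
        }
    }
}
```

Note that the state lives inside one process. Two KMS processes each holding their own `SignGuard` would both happily sign the same point, which is exactly the uncoordinated case described above.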
Some precedent for this sort of thing is Google's Certificate Transparency logs. Google's approach is to run 5 instances of each log in a georeplicated manner, and use their internal Chubby locking service (similar to ZooKeeper/etcd) to elect which one is active at a given time. That sort of approach seems safer to me. CT faces similar "double signing" risks (see this example of where things went wrong).
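As a rough illustration of the "elect one active instance" idea, here is a toy in-memory lease. It is only a stand-in for what a lock service like Chubby, ZooKeeper or etcd would provide across hosts; the `Lease` type, its methods, and the logical-tick clock are all assumptions made up for this sketch.

```rust
/// Toy lease: at most one holder at a time, valid until `expires_at`.
/// Time is a plain logical tick counter supplied by the caller.
pub struct Lease {
    holder: Option<(String, u64)>, // (instance id, expires_at)
}

impl Lease {
    pub fn new() -> Self {
        Lease { holder: None }
    }

    /// Acquire or renew the lease for `id`. Fails only while a different
    /// instance holds an unexpired lease.
    pub fn try_acquire(&mut self, id: &str, now: u64, ttl: u64) -> bool {
        let held_by_other = matches!(
            &self.holder,
            Some((h, exp)) if h.as_str() != id && now < *exp
        );
        if held_by_other {
            return false;
        }
        self.holder = Some((id.to_string(), now + ttl));
        true
    }

    /// True if `id` currently holds an unexpired lease, i.e. may sign.
    pub fn is_active(&self, id: &str, now: u64) -> bool {
        matches!(&self.holder, Some((h, exp)) if h.as_str() == id && now < *exp)
    }
}
```

An instance would sign only while `is_active` returns true for it, and a standby takes over once the previous holder's lease has expired without renewal.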
Our use-case is an active/passive KMS setup, with the active node connecting to two+ validators.
Specifically, we are not looking to run multiple active KMS instances, nor to connect multiple KMS processes to a single Tendermint instance.
Since I assume the KMS codebase will stabilize, it should be fairly straightforward to keep the KMS running continuously. The big advantage of Tendermint HA would be the ability to update validators without downtime, which I'd expect to be a much more frequent need.
I think this setup is a very good starting point of Tendermint HA and would greatly improve the validator