High availability using multiple TM nodes and multiple signers #29
I'm finding difficulty reasoning about the KMS<->Validator design. It seems that TM and KMS both support 1-M connections, but you still have to have a single point of failure, either a validator node or a KMS node.
Are you planning to support a solution for active/active configuration without a single point of failure (i.e., multiple validating nodes and multiple signer nodes)?
The text was updated successfully, but these errors were encountered:
See this issue for previous discussion on this topic:
The longer-term goal is definitely to support multiple KMS nodes, however distributed double signing prevention (#11) requires consensus on the current block height.
I've proposed an active/passive(/passive/etc) configuration, since double signing avoidance requires a total ordering over the signatures, to be achieved by (TBD) distributed consensus.
A true active-active configuration would generally imply a configuration that's partition-tolerant and capable of split-brain operation, e.g. active-active databases generally employ some form of eventual consistency. That doesn't work for a double signing defense though.
An active/passive(/passive) configuration still has no single-point-of-failure, though. That's the entire point of having hot spares.
It seems that if we use something like DynamoDB conditional updates (strongly consistent reads), we can have total ordering based on (Height, Round, Step) by only carrying out conditional upserts (Compare-And-Swap) to update the table with a new HRS before creating a new signature, if the condition failed then we don't sign at that HRS and wait for the next message to sign (this of course implies that the other validator already signed, if not and it crashed before signing the message then we miss that HRS entirely). I think this is a more economical solution than trying to maintain a quorum on the HRS (minimum 3 machines).
Of course if DynamoDB/store is down no signing is possible.
I personally don't think it makes sense to tether the KMS to a particular cloud vendor's product to solve this problem, and would rather pursue approaches that allow it to be isolated from the rest of the network. If nothing else, using a 3rd party cloud solution adds attack surface.
A more economical solution with fewer machines is to pursue traditional 2-node active/passive failover techniques. This avoids the need for a consensus algorithm as well.
I think the attack surface argument for this is equivalent to using AWS/Google/IBM to host sentry nodes, if someone already uses a cloud provider for sentry nodes having DynamoDB or a similar cloud storage service doesn't extend the attack surface. With active/passive there's still the hard problem of detecting a node failure, experiencing latency anomalies can bring up the passive validator while the primary is still alive. I'd rather have some active mechanism that eliminates the possibility of double signing altogether.
I'm willing to contribute the DynamoDB part to KMS, perhaps we can have it as an optional crate feature?
If we support some sort of pluggable system for double-signing prevention, it's going to be awhile. We're not even at MVP yet for the KMS (hopefully soon!), and that will definitely use active/standby with manual failover for launch. What more advanced double signing prevention looks like needs a bigger discussion (and this issue seems like an appropriate place to have it).
In regards to this:
...the threat models of Sentries vs KMS are quite different. Sentries effectively act as a "smart" proxy, and are untrusted from the perspective of the network, with the trust in the actual Validator nodes.
The threat model of the KMS, on the other hand, assumes the Validator is compromised or otherwise exhibiting byzantine behavior, and given that threat still needs to ensure that it won't double sign.
A provider module hierarchy, generic provider structure and traits to implement the API interface, that shouldn't be too bad. I'm happy to work on this too. Since any provider would have to have a key-value store and a CAS primitive as an operation on the store, regardless to how it scales or how it achieves consensus, the API should be quite simple. (e.g.,
I've been closely following the changes, there shouldn't be much work left for an MVP. Though we should have more tests.
I don't personally view the DynamoDB solution above as advanced, a Raft or some version of Paxos (e.g., ZAB) setup is advanced (and expensive). We should at least agree that people have to chose whether to provide that data store themselves, or outsource it to another provider, and that we are not trying to arrive at an ideal setup from one perspective or another. If you agree, I could start working on some storage provider API.
It is equivalent in the sense that compromise of enough sentries, can cause a split-brain for the validator they are trying to defend and have it sign malicious blocks. So compromise of a cloud provider's privileged account results in the same outcome for both the KMS (if it is dependent on cloud storage provider) and the sentry nodes. One just takes more skill than the other.
Validator: A node on the TM network that inaddition of carrying out the distributed protocol tasks defined in the TM protocol to facility consensus it is also tasked with voting using a CSP that all of the network's security boils down to. A gross simplification, but for all intended purposes equivalent from a security stand point. If you don't agree, please clarify.
KMS: A vanilla KMS is not concerned with the particulars of any application protocol, however since our KMS has to play an additional role which is to prevent accidental double signing we'll maintain this definition to refer to the tendermint/kms repository. If you don't agree, please clarify.
I believe that in order for the KMS to protect itself from a behaviorally compromised validator it must become itself a validator and actively participate on the network at large using the full protocol and not only a subset of it (i.e., signing votes). That of course in order for it to be able protect itself from byzantine actors on the TM network, which the validator is part of. Only in this case where the KMS threat model becomes equivalent to the validator itself.
According to the KMS definition above where it is only concerned with accidental and non-byzantine double-signing, the KMS should use a data store that is fault tolerant or byzantine fault tolerant (a choice validators get to make) to synchronize itself with other KMSs over signing data at TM's current HRS.
I think whether or not this functionality should be outsourced is a debatable point, as are KMS deployment models. And more than anything else I think talking about specific solutions at this stage is a bit premature. Before we talk about that, we should back up and talk about the actual requirements.
The deployment models of note, and compatible coordination mechanisms:
Personally I think of all of these options having out-of-the-box support for active/standby deployments and automated mechanisms to promote the standby to active (and deactivate the current active node) should be the priority. I think we can support 2 and 3 node deployments in this model simply, without the need for any coordination service or consensus protocol, and that this is currently going to be both the most pervasive deployment model and the most pressing need.
As for supporting external coordination services, I think if we're going to do it it'd be good to pick a single one to support initially before attempting to build any sort of pluggable backends or abstractions across them. I'm certainly not opposed to going this route entirely: my understanding is Google deploys Certificate Transparency logs in an active/standby/standby manner (with, I believe, 5 copies of the log available globally in different georeplicated datacenters, with one active at a time) and using Chubby as the coordination service to elect an active log.
I would look to Chubby-like services as having the feature set we'd probably want for coordinating the KMS: distributed locking and/or leader election, and consistent storage for the current block height. You seem to be proposing a much weaker model than this based on CAS alone. Perhaps it's okay, but I think it needs a lot of discussion before anyone runs off and implements anything.
All that said, I'm not sure DynamoDB is a good candidate for a coordination service for several reasons, most notably it will only work for AWS users, and not GCP or other cloud users, or bare metal users.
I think it's probably best to spend pre-launch time collecting requirements and comparing alternatives, rather than prescribing a solution and attempting to build some sort of multi-coordination service abstraction without knowing in advance what the other coordination backends might be. The way these services operate are quite different and I think trying to build an abstraction around them is both premature and probably deserves to be a crate in and of itself.
I disagree with the premise as active/passive is an equivalently hard problem to solve when compared to active/active.
Active/Passive deployments consist of a single active actor, N passive actors and most importantly, an active failure detection algorithm. An example algorithm would be composed of:
For (1) you need to have a highly available service, and for (2) you need a reliable and highly available method to communicate last HRS to the secondary.
I'm aware that a simple solution for (2) might be storing last signed HRS to disk on the primary and have it moved to the secondary before it becomes active, but it will not hold in all cases and assumes that you have efficient access to the primary even in failure which could not be the case (e.g., kernel panic).
So to avoid being locked out of the primary you'll have to store the last signed HRS at some location that is independent of the primary OS. But is this automated? Then it requires investing in hardware and secure communications infrastructure to reliably carry out. Communication failure of last HRS necessarily means either significantly adding to the downtime period and requires manual human intervention, or double-signing if the secondary doesn't wait for or experiences a stale read for the last HRS. If it is not automated, then some poor human will have to live with the idea that a single mistake from him/her can lead to multi-million dollar loss.
I believe that reliably and efficiently solving active/passive is not only equivalent to solving active/active in the general sense, it is specifically worse for our situation where we have to synchronize the shutdown sequence and reliably and efficiently communicate last HRS to a secondary. Compared to specifically coordinating (not implying a coordination service is required) two validators that never really need access to a shared resource (i.e., since each validator only needs to verify and sign HRS+1) the active/active problem is significantly reduced to a problem where no two validators need access to or sign at the same HRS.
You could argue that active/passive works and have last HRS stored on some storage service, but then you still need the monitoring service (and it being a critical component of the system) and a storage service that is independent of the primary OS and hardware. While for our specific needs in an active/active configuration, we only really need a highly available and consistent storage service and coordination is unnecessary as participating validators can never sign the same block, there's no shared resource that requires coordination. This is not a throughput problem where we need concurrency, we can do just fine without it in a lockless and coordinator-less way.
Example of storage service and algorithm guaranteeing no double-signing:
The only database property required in this scenario is a an atomic CAS operation, which is trivial and exists in many databases today (e.g., conditional updates or the INC operator or locks for interleaving transactions in an SQL database). This works if you have a master/replica setup or a ring database setup (how the database arrives at consensus is abstracted away) as long as the CAS operation is atomic.
We could now start addressing distributed database problems such as network partitions, but this is merely a database choice where we prefer a database that would prefer consistency through completely sacrificing availability or employing a consensus-quorum and fence away faulty replicas or nodes, but this becomes simply irrelevant or out of topic for the KMS, no coordination service, no required and critical monitoring system, less and simpler choices.
And to drive this home further, let's say that all of what I just said about active/active is wrong and we absolutely need a coordination service (I strongly believe we don't), so it is better to launch with active/passive and then worry about active/active later. We still need to figure out (2) to efficiently and reliably communicate last HRS and the algorithm I proposed above works for this purpose too as-is. In other words, an API for storage providers such as
I think introducing a coordination service could make sense if there were a single coordination service everyone could agree on and is happy to run.
I don't think introducing an abstract trait API to databases with CAS-like features is a good idea. It substantially complicates integration testing and means we must consider the behavior of all datastores involved. And all to synchronize a modicum of data...
Would you consider building a crate for this purpose out-of-tree and take on testing it and guaranteeing its behavior across multiple coordination services, enough to cover AWS, GCP, and bare metal use cases? Are you willing to maintain such a crate on an ongoing basis? If you can build a good crate for that specific purpose, we can certainly consider using it.
Otherwise I would strongly suggest picking a single service which is the least controversial least common denominator for this purpose. That might be something like etcd.
But I really think for launch, it makes the most sense to support an active/passive configuration with manual failover (which people can automate as they see fit).
State replication will be best effort but can manually be verified by an operator (or automatically verified/updated by a monitoring system).
I also think it's still very much worth asking if it makes sense for the KMS to rely on an external service at all.
and then you said..
etcd v3 supports KV transactions (a group of put/get/delete operations) that can be used to implement more than a CAS-like operation, it can be used to implement a fully fledged STM. The problem is that the etcd rust crate only supports v2 which doesn't have KV transactions, I don't think I can justify adding v3 support and then adding this integration to my people, but will circulate. Regardless, this still is perfectly implementable under the API I proposed above and it will still have to be an optional crate feature for those that prefer active/passive and all, right?
I don't see how it makes sense, manual failover is quite risky, and the more you try to make the manual failover less risky the larger the downtime period and the possibility of a downtime penalty. As my colleague put it
Sure, but implementing a distributed database/consensus/kv/X from scratch inside KMS?... There are tens of really good solutions for every distributed problem out there, with huge communities and even commercial support backing it. So...
We have 2-3 weeks to ship KMS. You are proposing:
Bare metal datacenters are particularly important point, as at this point in time validators are expected to be operating datacenters and using HSMs for key storage. Given that, I'm a bit confused why you want to run the KMS in the cloud to begin with. Some of us running the KMS in datacenters also have network segmentation in place such that the KMS is heavily isolated from a layer 3/4 perspective, meaning using a bare metal hosted KMS in conjunction with a cloud database would mean undoing a bunch of firewall rules that have been intentionally added for defense-in-depth purposes.
Regardless, we don't have the time to implement a complex new multi-backend database adapter for the KMS before launch. There are too many other pressing tasks like implementing the SecretConnection protocol and key management tasks like provisioning to be taken care of. We need to do the simplest thing possible for now, because even doing that we're in danger of the KMS being the last piece which delays the launch. We still need to integration test with validators, and for that I would definitely prefer fewer moving parts, especially for launch.
Of all the things you responded to in my last post, you completely ignored this:
You are proposing a complex and heavy handed solution which will be difficult to test. What work are you actually willing to do here? So far I heard "DynamoDB adapter and trait abstraction", but trait abstraction for what? It sounds like you want to write a trait which maps to the DynamoDB behavior, and then expect others to implement all of the other necessary backends to make it run other places than AWS.
I think it would make much more sense to actually pick the multiple backends we need to cover all of the KMS use cases, look at what APIs they have available, and try to implement an abstraction that covers all of them, then look at the resulting code and decide if it's actually a good idea.
There are robust implementations of consensus algorithms available for use in an embedded capacity inside of the KMS, most notably:
I think these are worth considering as an alternative to an external database. I think running something like this could be an optional feature in the event you have sufficient KMS nodes to run it.
All that said, I'm still not even convinced we need any solutions as heavy handed as any of this. The state the KMS needs to coordinate on is quite a bit different from a database. We only really need to ensure monotonicity, not implement a distributed database. I think it's worth taking the time to consider whether there are simpler solutions.
First of all, thanks for your work on this. My $0.02:
I think for expedite launch and to gain some operational experience, keeping it simple is important. At the same time, consideration should be given to the future.
On the topic of validator uptime, I am not convinced that adding in redundancy at the KMS level is the right place to do it. Assuming the redundant KMS runs on hardware isolated from the validator (gaiad), the validator itself would still be a single point of failure.
An idea I have been floating earlier is that of a "logical validator" concept, which I believe can also be made safe in the context of double signing (though I am not a consensus wizard). If done right, it should AFAICS be highly available and allow for simple and cheap deployment.
A logical validator means a number of gaiad instances signing with the same public key (via KMS or with locally stored key, doesn't matter). Let's imagine each such instance runs on its own dedicated hardware and call it a physical validator.
Each physical validator is assigned a priority, which is must be unique among all the physical validators that constitutes the logical validator.
All physical validators are active at the same time and behaves exactly as they do today. The only difference is that they include their unique priority when voting (the priority is signed as well).
Physical validators might observe transactions in a different order, essentially double signing. This is resolved at the consensus layer, by only considering the votes of the physical validator that has the highest priority and ignoring the rest of the votes from the same public key (logical validator).
To put this in active/passive terms: the physical validator with the highest priority, that manages to vote in a round, is thus considered the "active" part while the remaining physical validators (sharing same pubkey) are considered "passive". The existing consensus code thus takes care of failing over and there are no new dependencies introduced.
Double signing (with same priority and public key) should of course still result in slashing, but this should no longer by a risk of failing over.
Depending on the implementation details at the consensus layer, it seems to me that this might provide good availability while introducing little complexity.
It also lowers the barrier to entry to setup a HA validator, which should translate to more determistic block times overall.
Really appreciate this conversation but I suspect it's better served elsewhere and would recommend closing this issue. I think ideally we'd keep this KMS quite small and do only what it's intended to do: implement SocketPV and talk to multiple HSM backends. We can easily start to layer intermediaries on top so instead of TM<->KMS we have TM<->INSERT_HIGH_AVAILABILITY_SOLUTION_HERE<->KMS and members of the community can start investigating the most appropriate implementations of INSERT_HIGH_AVAILABILITY_SOLUTION_HERE
I wrote some thoughts about it here: tendermint/tendermint#1758 (comment)