Interactions between BFT time & unbonding period #2653
I'm not entirely sure whether this issue belongs in Tendermint or the SDK, since it really results from the interaction of the median BFT time calculation in Tendermint and the bonded-proof-of-stake model implemented in the SDK, so has to be considered with both in mind - putting it here for now.
Concern 1: Timewarp attack
I'm concerned that the current model of BFT time & the unbonding period substantially changes the Byzantine attack surface of the Cosmos Hub — in particular, it gives too much power to 34% (just more than a third) of the stake.
Presently, if we assume a time oracle, 34% of coordinating stake can:
Halting the chain - although unfortunate - is easily detectable by "humans watching the system" in practice, can easily be fixed by forking out the offending stake, and doesn't lead to double-spends for any other services connected to the chain or other blockchains connected over IBC.
The second attack is more problematic, but it requires complete network control (in practice difficult). Once complete network control breaks, double-sign proofs will be submitted from both forks to each other and the offending 34% will be slashed on both (or both will halt, but either case is attributable). Likewise for the current IBC model - proof-of-double-sign can be submitted to IBC contracts on the other chains and the contracts can immediately lock assets / prevent further value transfer.
However, with our current median BFT time plus the unbonding period which utilizes it, I think 34% of stake can do the following:
This attack requires no ability to partition the other 2/3 of validators, puts no stake at risk, and can happen in a matter of a few blocks before anyone notices. It is still attributable, but not in-protocol - governance would have to elect to slash the offending validators, which could be controversial, takes time, doesn't work with IBC, etc etc.
The SDK could choose not to check evidence timestamps like this, but then the 34% cabal could instead push the timestamp past the evidence rejection threshold at the Tendermint P2P layer.
In practice this seems like a much worse attack than either of the two above — it doesn't require network control, allows double-spending, isn't necessarily attributable or slashable, and happens almost instantly.
Concern 2: Inflationary incentives
Separate from the Byzantine case above, I think rational self-interested validators who are not explicitly colluding (which is our threat model) might be incentivized to lie about time.
What does the timestamp do in the Cosmos Hub incentive model? Two things:
In different cases I could see lying about time in both ways being rational, but I'm more concerned about the "fast time" case. Because timestamp controls inflation, stakers control their own rate of payment for staking on the network. As a validator - even one who isn't colluding at all - the later the timestamp I pick, the more the median slightly shifts and the (slightly) more I get paid. As a rational delegator, I'll vote for validators which pick later timestamps and increase (slightly) my rewards.
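To make the "fast time" incentive concrete, here is a minimal sketch assuming provisions are minted in proportion to the time elapsed between consecutive block timestamps. The function `provision`, the 7% rate, and the supply figure are hypothetical illustrations, not the SDK's actual mint logic.

```go
package main

import (
	"fmt"
	"time"
)

// year is the period over which the (assumed) annual inflation rate applies.
const year = 365 * 24 * time.Hour

// provision returns the tokens minted for one block under the assumption
// that minting is linear in the elapsed on-chain time since the last block.
// This is a hypothetical model, not the SDK's mint module.
func provision(supply, annualRate float64, elapsed time.Duration) float64 {
	return supply * annualRate * float64(elapsed) / float64(year)
}

func main() {
	const supply = 1e9 // illustrative total supply
	const rate = 0.07  // illustrative annual inflation rate

	honest := provision(supply, rate, 5*time.Second)  // real block interval
	shifted := provision(supply, rate, 10*time.Second) // median pushed 5s later

	fmt.Printf("honest timestamps:  %.4f tokens\n", honest)
	fmt.Printf("shifted timestamps: %.4f tokens\n", shifted)
}
```

Under this assumed model the payout scales directly with the claimed elapsed time, which is why even a small shift of the median is (slightly) profitable for stakers.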
In the otherwise-honest model (where the only "non-protocol-compliant" thing validators are doing is lying about time) this does still require 51% of stake to lie in this way to actually be a problem - otherwise the timestamp will just be too far ahead, but by a constant amount since the honest 51% control the median and are just setting their time from an external oracle. But since there's no punishment for lying and a (slight) benefit even as a single validator who changes only their action, I'm not sure we have sufficiently strong reasons to expect that 51% of stake would be consistently honest.
In the Byzantine model, the 34% attack - without double-signing - applies here as well: a 34% cabal can censor half the other votes, control the timestamp, and speed up the inflation rate by any factor they like. (this might be even worse because I think they can also selectively censor precommits, ref cosmos/cosmos-sdk#2522)
In general, it seems to me like we have not thought enough about the ramifications of utilizing a timestamp completely controlled by the validator set for core protocol security state machine logic. I think we:
Let me know if the above explanations are clear or if I missed anything.
referenced this issue
Oct 18, 2018
I don't think this addresses the main concern which is fundamentally:
So it seems like we should be considering a more hybrid approach that includes both time and number of blocks for things like inflationary rewards and max evidence age.
referenced this issue
Oct 22, 2018
I'm not sure what we're gaining with the hybrid approach vs just pure number of blocks.
Time is only adding something in that model if blocks are being produced unexpectedly fast, but 1/3 of validators aren't lying about time. However if we determined that the time based attack would happen under a rational 1/3 of validators, I'm not sure this is a valuable model, and perhaps we should just use pure block number.
Summary of the state of things:
Currently Tendermint uses a MaxAge parameter which determines how long evidence is valid for. This prevents spam from really old evidence. Evidence older than the MaxAge (ie. CurrentHeight - evidence.Height > MaxAge) is rejected.
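As a rough sketch of the height-based rule just quoted (the `Evidence` type and function name here are hypothetical stand-ins, not Tendermint's actual types):

```go
package main

import "fmt"

// Evidence is a hypothetical, stripped-down stand-in for a piece of
// double-sign evidence; only the height matters for this check.
type Evidence struct {
	Height int64
}

// isExpiredByHeight applies the rule from the discussion:
// CurrentHeight - evidence.Height > MaxAge means the evidence is rejected.
func isExpiredByHeight(currentHeight int64, ev Evidence, maxAge int64) bool {
	return currentHeight-ev.Height > maxAge
}

func main() {
	const maxAge = 10000 // blocks; illustrative value

	recent := Evidence{Height: 15000}
	old := Evidence{Height: 5000}

	fmt.Println("recent evidence expired:", isExpiredByHeight(20000, recent, maxAge))
	fmt.Println("old evidence expired:   ", isExpiredByHeight(20000, old, maxAge))
}
```

Note that this check depends only on block heights, so it is unaffected by whatever timestamps validators vote for.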
We're considering changing MaxAge to be time-based, rather than height based, because most other considerations in the state machine use time now that we have a BFT time. Instead of saying "Evidence older than 10,000 blocks is no longer valid", we'd say, "Evidence older than 2 weeks is no longer valid".
The problem is that, currently, block time is determined entirely by the median of the timestamps in the LastCommit. There is no other subjective element to ensure timestamps are "reasonable".
This means a +1/3 cabal of validators could manipulate the timestamp however they want. For instance, they could make the timestamp for block H+1 be one month after that for block H, then double sign for block H. When the evidence is published, it will be considered too old and will be ignored, because as far as the protocol can tell, it happened a month ago, even though it was just in the last block!
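A simplified sketch of why +1/3 is enough, assuming block time is a voting-power-weighted median over the votes in the commit and that the cabal can keep enough honest votes out of that commit (as described earlier in the thread). The `Vote` type and both functions are hypothetical, not Tendermint's implementation.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Vote is a hypothetical stand-in for a precommit: voting power plus timestamp.
type Vote struct {
	Power     int64
	Timestamp time.Time
}

// weightedMedian returns the first timestamp (in ascending order) at which
// the cumulative voting power reaches at least half of the commit's power.
func weightedMedian(votes []Vote) time.Time {
	sorted := append([]Vote(nil), votes...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].Timestamp.Before(sorted[j].Timestamp)
	})
	var total, acc int64
	for _, v := range sorted {
		total += v.Power
	}
	for _, v := range sorted {
		acc += v.Power
		if 2*acc >= total {
			return v.Timestamp
		}
	}
	return time.Time{}
}

// medianSkew reports how far the median lands ahead of honest time when the
// honest votes in the commit carry honestPower and the cabal's votes carry
// cabalPower with timestamps shifted forward by skew.
func medianSkew(honestPower, cabalPower int64, skew time.Duration) time.Duration {
	now := time.Date(2018, time.October, 22, 0, 0, 0, 0, time.UTC)
	commit := []Vote{
		{Power: honestPower, Timestamp: now},
		{Power: cabalPower, Timestamp: now.Add(skew)},
	}
	return weightedMedian(commit).Sub(now)
}

func main() {
	month := 30 * 24 * time.Hour
	// All honest votes included: the honest 66% control the median.
	fmt.Println("full commit skew:    ", medianSkew(66, 34, month))
	// Only 33% honest power makes it into the commit: the cabal's 34% wins.
	fmt.Println("censored commit skew:", medianSkew(33, 34, month))
}
```

The commit still carries +2/3 of total power in the second case, but within that commit the cabal holds the majority, so the median is theirs.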
There are two general approaches to solving this:
The hybrid idea would require that a piece of evidence is too old both in height and in time to be considered invalid. In that case, even if it looked like it was a month old, it would also have to be at least some number of blocks old to be considered invalid. This is a reasonably simple approach.
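The hybrid rule could look roughly like the following sketch. The parameter names `MaxAgeBlocks` and `MaxAgeDuration` and the example thresholds are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// day is a convenience constant for the examples below.
const day = 24 * time.Hour

// EvidenceParams holds hypothetical thresholds for the hybrid rule.
type EvidenceParams struct {
	MaxAgeBlocks   int64
	MaxAgeDuration time.Duration
}

// exampleParams uses illustrative values: 10,000 blocks and two weeks.
var exampleParams = EvidenceParams{MaxAgeBlocks: 10000, MaxAgeDuration: 14 * day}

// isExpiredHybrid rejects evidence only when it is too old by height AND by
// block time, so warping the timestamp alone cannot expire fresh evidence.
func isExpiredHybrid(ageBlocks int64, ageTime time.Duration, p EvidenceParams) bool {
	tooOldByHeight := ageBlocks > p.MaxAgeBlocks
	tooOldByTime := ageTime > p.MaxAgeDuration
	return tooOldByHeight && tooOldByTime
}

func main() {
	// Timewarp: one block old, but the chain claims a month has passed.
	fmt.Println("timewarped evidence expired:", isExpiredHybrid(1, 31*day, exampleParams))
	// Genuinely old evidence: past both thresholds.
	fmt.Println("old evidence expired:       ", isExpiredHybrid(20000, 15*day, exampleParams))
}
```

The cost of this design is that evidence stays valid for whichever of the two windows closes last, so the effective window can be longer than either threshold alone.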
There are two proposals for how to solve this using subjective time:
Proposer Based Time
In one proposal, the proposer of a block sets the timestamp according to their local clock, and other validators accept it if it's within some tight range of their own local clocks (ie. on the order of seconds).
This solves the previous problem, since a +1/3 cabal can no longer manipulate the timestamp however they want (they would need +2/3). However, this introduces new dependencies in the Tendermint software on synchronized clocks - if the clocks get out of sync, Tendermint would halt, and there's no built in mechanism to get the clocks re-synched, which means Tendermint nodes would have to depend on external clock synchronization services.
A solution to this could be to have a low frequency component of the consensus protocol that uses median timestamps to realign everyone's clocks. However, due to the complexity that would entail, and the desire to keep Tendermint free of timing assumptions, we've decided to postpone that for now.
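The acceptance rule for proposer-based time described above can be sketched as a comparison against the local clock. The function name, the fixed `refTime`, and the tolerance values are hypothetical; the same check with a tolerance on the order of hours would give a much looser subjective validity rule.

```go
package main

import (
	"fmt"
	"time"
)

// refTime stands in for a node's local clock reading so the example is
// deterministic; a real node would use its own clock.
var refTime = time.Date(2018, time.October, 22, 12, 0, 0, 0, time.UTC)

// acceptTimestamp reports whether a block timestamp is within tolerance of
// the node's local clock, in either direction.
func acceptTimestamp(blockTime, localNow time.Time, tolerance time.Duration) bool {
	skew := blockTime.Sub(localNow)
	if skew < 0 {
		skew = -skew
	}
	return skew <= tolerance
}

func main() {
	tight := 5 * time.Second // "on the order of seconds"
	loose := 2 * time.Hour   // "on the order of hours"

	honest := refTime.Add(2 * time.Second)
	warped := refTime.Add(30 * 24 * time.Hour)

	fmt.Println("honest, tight:", acceptTimestamp(honest, refTime, tight))
	fmt.Println("warped, tight:", acceptTimestamp(warped, refTime, tight))
	fmt.Println("warped, loose:", acceptTimestamp(warped, refTime, loose))
}
```

With a tight tolerance, a node whose clock has drifted by more than a few seconds would reject honest blocks, which is the liveness dependency on clock synchronization raised above.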
Node Based Rejection
An alternative solution which we can easily implement today is to continue with the median timestamps as is, but to add a loose subjective validity criterion - ie. nodes will only accept blocks if the timestamp is within some large range of their own local clocks (ie. on the order of hours). This has the following benefits:
After writing this up, it seems the hybrid approach and the node-based-rejection approach end up quite similar in the end. If anything, the hybrid approach seems safer, if it assumes minimum 1 block per minute, while the node-based-rejection might enable 1 block per hour (ie. if the +1/3 cabal increases the timestamp by an hour with each block).
I would like to explicitly note that there are additional security caveats in the node based rejection case. In the event of an attack, the next proposer can't honestly propose, because the time it would honestly propose is earlier than the previous block's time (and time is required to be monotonically increasing). Also if we assume a rational 2/3rds, then all the same attacks persist.
I still don't understand what the utility of time in the hybrid approach is, and why we wouldn't just go with a pure block number based approach. I get that we have nice properties when the validators are "playing nice", but why would we expect them to? We can put more faith in them to do so if there is additionally subjective validity.
Because of this, I think the design space ought to be:
I personally prefer 1).
I think we have some reason to expect the validators to "play nice" - the 1/3 timewarp attack is way less appealing if the most you can do is reduce the unbonding period by half (if we took a block-height evidence threshold on the order of half the expected time). Votes for timestamps are attributable, so it would be easy for anyone looking at the system to recognize what was going on and take appropriate action (maybe through governance, or through hard-forking out the offending validators).
That said, evidence height only prevents the 1/3 timewarp evidence attack (which is probably the worst one), a 1/3 stake cabal may have other reasons to speed up time (e.g. to create more inflation), and it would be nice in general for nodes to be able to check the on-chain time against some external reference. I think proposer-based time is appealing if we can resolve the liveness concerns - there may also be more radical points in the design space, perhaps some sort of timestamp commit-reveal scheme (where censorship of the reveals just results in halting), but they would also require consensus changes.
I'm in favor of the hybrid approach for launch, and further research into more complex protocol alterations afterwards.
It just means if the +1/3 cabal makes a median that's too far in the future, the chain will halt until people's clocks are within ~an hour of that median. That's not necessarily a bad thing.
Presumably there won't be a constant +1/3 cabal of attackers smashing on the timestamp. In that case, time gives us a more accurate evidence period.
I'm also realizing that maybe the significance of this attack is overstated. If the point is to advance the timestamp so they can double sign without being slashed, that doesn't make much sense, because a +1/3 can already prevent themselves from getting slashed by just not accepting blocks with the evidence ... the double signing is still attributable and they can still be forked out, in exactly the same way as if they +1/3 double-signed without the timestamp attack. So what makes the timestamp piece so important here?
That's true, I missed that. There are still other reasons for a 34% cabal to accelerate time (more inflation, perhaps), but invalidating evidence doesn't seem like a particularly compelling one when they could just censor it instead (although at least with accelerating time, as soon as they had accelerated past three weeks, they wouldn't need to censor blocks anymore, so maybe that's a minor advantage).
Maybe we can't do anything useful here prelaunch then.
I do wonder if there are other subjective validity conditions which incur a risk of liveness bugs but still make sense in practice - timestamp-based is one, non-inclusion of evidence might be another, or so might be a substantial difference (>> 1/2 or >> 2/3) between the expected proposer set for the past
Interesting. That kind of thinking once again harks back to the idea of some form of ABCI-based precheck on blocks before they are voted on. Very likely we'll adopt some form of that.
From a launch perspective then, should we just leave the MaxAge as Height and consider this "attack" a non-issue?
I'm not sure what this is proposing. The only interpretation I can grok subjects ourselves to a consensus failure // network split upon a timing attack, without actually solving the BFT time problem on the Tendermint side, which is what we need to do. We're mixing the short term concerns
This proposal for subjective time validity isn't vulnerable to liveness or safety failure w/ at least 1/3 honest validators; or, if it's broken, it can be fixed with minimal changes to make it so (my claim -- proof or counter-proof left to the reader), and it works just as well with ~hour granularity vs second granularity. It can work with globally sync'd clocks, and it can also work with adjustable clocks. One complaint I heard about this proposal is that it requires a globally sync'd clock, unlike current-Tendermint. That's not true. Either we care about global time, or we don't. If we care about global time, then we require nodes to have clocks synchronized with global time anyways (obviously). If we don't care about global time, then just keep a local time-offset by comparing blockchain time to local measured time. It still works with imperfect clocks that drift w/ some maximum bound, just like current-Tendermint.
The very simple and reasonable solution to evidence is to "kill" a validator consensus pubkey upon any evidence of double-signing. We were going to do this anyways in the SDK -- once a double-sign is detected, the validator's consensus address is no longer usable, and everyone must re-delegate away from that malfunctioning validator consensus pubkey. The validator operator can create another consensus pubkey and rebond, but delegators who aren't active in monitoring are not subjected to 100% slashing upon the theft of both the operator & consensus keys by a hacker, who can force repeated double-signing and thus slashing to 0. This is safer for delegators. This also makes sense for hardware HSM signers that securely generate the consensus privkey internal to the HSM -- if the device is faulty, the solution is to stop using the device (and thus the consensus pubkey), regardless of what block height or time it signed. Similarly, even without an HSM, if the consensus pubkey was hacked/leaked, then the solution is to stop using the pubkey, regardless. All this engineering around the validity of evidence IMO is unnecessary complexity.
On the Tendermint side, it just needs to keep track of dead validator consensus pubkeys. One evidence is sufficient forever; there is no need to consider the "validity" of evidence across time or block heights. The SDK/app side should also keep track of dead consensus pubkeys. If the SDK/app doesn't keep track of dead consensus pubkeys, then Tendermint can submit the same evidence again, or even panic (invalid application).
This also simplifies the SDK x/slashing logic. There's no need to keep track of which delegator "contributed" to a validator's double-signing attack. If you were delegating to a validator consensus pubkey which had ever double-signed, then you might as well get slashed for it. Anything else IMO is over-engineering.
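The "dead consensus pubkey" bookkeeping proposed above might be sketched like this. The `Tombstones` type and its methods are hypothetical and exist neither in Tendermint nor in the SDK; pubkeys are represented as plain strings for brevity.

```go
package main

import "fmt"

// Tombstones is a hypothetical record of consensus pubkeys that have ever
// been caught double-signing. Entries are never removed.
type Tombstones struct {
	dead map[string]struct{}
}

// NewTombstones returns an empty tombstone set.
func NewTombstones() *Tombstones {
	return &Tombstones{dead: make(map[string]struct{})}
}

// Kill marks a consensus pubkey as permanently dead. It returns true if the
// key was newly killed and false if it was already dead, so repeated
// evidence against the same key has no further effect.
func (t *Tombstones) Kill(pubKey string) bool {
	if _, ok := t.dead[pubKey]; ok {
		return false
	}
	t.dead[pubKey] = struct{}{}
	return true
}

// IsDead reports whether the pubkey has ever been tombstoned.
func (t *Tombstones) IsDead(pubKey string) bool {
	_, ok := t.dead[pubKey]
	return ok
}

func main() {
	ts := NewTombstones()
	fmt.Println("first evidence applied: ", ts.Kill("valconspub-A"))
	fmt.Println("second evidence applied:", ts.Kill("valconspub-A"))
	fmt.Println("key is dead:            ", ts.IsDead("valconspub-A"))
}
```

Because a key is killed at most once, evidence needs no age check in this sketch: old or duplicate evidence against a dead key is simply a no-op.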
Premise 1: We're going to address the BFT time problem on the Tendermint side.
Where to go from here:
The BFT time is solved by median timestamps. This is just about preventing +1/3 from making their timestamps arbitrarily far in the future by rejecting blocks that are ~hours ahead of our local clock.
This is about preventing DoS on the evidence reactor since we don't want to bother with old useless evidence (ie. it's older than the unbonding period).
After further consideration (see the other comments above), we realized the original issue was actually blown out of proportion and we don't really need to make any changes, except possibly changing the MaxAge from height to time so it matches the metric used for unbonding periods. So I think we should probably just close this issue, and either move forward with #2565 (ie. change MaxAge from height to time), or do nothing.
My point still stands. This
My comment addresses the DoS issue with radical simplicity. No need to bother with even defining what evidence is "valid". Any double-signing evidence is valid at any time.
I suggest that we discuss my proposal in depth, and leave this issue open to figure out the long term solution. Again, we should implement validator tombstones and radically simplify everything.
This is with +1/3 byzantine. Proposal-based time is equally vulnerable here (the +1/3 cabal can refuse to sign for blocks that don't have timestamps far in the future).
I don't think it's this simple. The question is: do we want Tendermint to work even if the clocks go out of sync. I would say yes. Adopting proposal-based time says no. My ultimate preference would be to find a clean way to support both options, so the user can choose what they want - or at least to be able to disable the subjective-time checks if they want to run Tendermint but don't care about time.
No doubt we're moving to proposal-based time eventually - we need to in order to support signature aggregation. But we should probably make it optional so Tendermint can still run in a clock-independent mode.
I didn't quite understand this initially, thanks for re-stating. Is there a proper write up of the tombstone idea? It sounds potentially quite elegant, though it would be a significant imposition on ABCI apps.
In any case, we should probably close this issue as it seems the initial attack was overblown, and open a new issue to discuss the tombstone idea.
This was referenced
Nov 14, 2018
Jae explained to me that this satisfies my
Are there any other unresolved problems here?