Interactions between BFT time & unbonding period #2653

Open
cwgoes opened this Issue Oct 17, 2018 · 14 comments


cwgoes commented Oct 17, 2018

I'm not entirely sure whether this issue belongs in Tendermint or the SDK, since it really results from the interaction of the median BFT time calculation in Tendermint and the bonded proof-of-stake model implemented in the SDK, so it has to be considered with both in mind - putting it here for now.

Concern 1: Timewarp attack

I'm concerned that the current model of BFT time & the unbonding period substantially changes the Byzantine attack surface of the Cosmos Hub — in particular, it gives too much power to 34% (just more than a third) of the stake.

Presently, if we assume a time oracle, 34% of coordinating stake can:

  • Halt the chain by refusing to sign blocks (no network control required, no stake at risk)
  • Network partition the other 2/3, double-sign and cause a fork (complete network control of other 2/3 required, stake at risk)

Halting the chain - although unfortunate - is easily detectable by "humans watching the system" in practice, can easily be fixed by forking out the offending stake, and doesn't lead to double-spends for any other services connected to the chain or other blockchains connected over IBC.

The second attack is more problematic, but it requires complete network control (in practice difficult). Once complete network control breaks, double-sign proofs will be submitted from both forks to each other and the offending 34% will be slashed on both (or both will halt, but either case is attributable). Likewise for the current IBC model - proof-of-double-sign can be submitted to IBC contracts on the other chains and the contracts can immediately lock assets / prevent further value transfer.

However, with our current median BFT time plus the unbonding period which utilizes it, I think 34% of stake can do the following:

  • Censor all other proposers so that the 34% cabal exclusively controls the included vote set of each block, and include only the votes from another ~1/3 of stake (totaling just over 2/3, so enough to commit blocks) - thus completely controlling the median timestamp, since the 34% will comprise over half of the included voting power in each block.
  • Double-sign a block at some height h, but wait to publish the signatures
  • In block h+1 or h+2, increase the timestamp by three weeks
  • Submit the double-signed block to an IBC connection or light client a few headers behind, and voila - double-spend with no punishment, because the SDK will reject the evidence as being too old

This attack requires no ability to partition the other 2/3 of validators, puts no stake at risk, and can happen in a matter of a few blocks before anyone notices. It is still attributable, but not in-protocol - governance would have to elect to slash the offending validators, which could be controversial, takes time, doesn't work with IBC, etc etc.

The SDK could skip checking evidence timestamps like this, but then the 34% cabal could instead push the timestamp past the evidence rejection threshold at the Tendermint P2P layer.

In practice this seems like a much worse attack than either of the two above — it doesn't require network control, allows double-spending, isn't necessarily attributable or slashable, and happens almost instantly.

Concern 2: Inflationary incentives

Separate from the Byzantine case above, I think rational self-interested validators who are not explicitly colluding (which is our threat model) might be incentivized to lie about time.

What does the timestamp do in the Cosmos Hub incentive model? Two things:

  • Timestamp controls the unbonding period - oldest age of valid evidence and how fast unbonding stake is unlocked
  • Timestamp controls inflation (the annual target inflation rate is applied incrementally every so often according to elapsed time)

In different cases I could see lying about time in both ways being rational, but I'm more concerned about the "fast time" case. Because timestamp controls inflation, stakers control their own rate of payment for staking on the network. As a validator - even one who isn't colluding at all - the later the timestamp I pick, the more the median slightly shifts and the (slightly) more I get paid. As a rational delegator, I'll vote for validators which pick later timestamps and increase (slightly) my rewards.

In the otherwise-honest model (where the only "non-protocol-compliant" thing validators are doing is lying about time) this does still require 51% of stake to lie in this way to actually be a problem - otherwise the timestamp will just be too far ahead, but by a constant amount since the honest 51% control the median and are just setting their time from an external oracle. But since there's no punishment for lying and a (slight) benefit even as a single validator who changes only their action, I'm not sure we have sufficiently strong reasons to expect that 51% of stake would be consistently honest.

In the Byzantine model, the 34% attack - without double-signing - applies here as well: a 34% cabal can censor half the other votes, control the timestamp, and speed up the inflation rate by any factor they like. (this might be even worse because I think they can also selectively censor precommits, ref cosmos/cosmos-sdk#2522)

In general, it seems to me like we have not thought enough about the ramifications of utilizing a timestamp completely controlled by the validator set for core protocol security state machine logic. I think we:

  1. should think more about it and sketch out the security model more concretely
  2. should consider using, or additionally using, a time metric which at least has some real-world logistical constraints - if we also require a minimum number of blocks for unbonding, for example, the 34% cabal attack is far less effective, even if that number of blocks is one we'd only expect to take half an unbonding period, since the cabal can't speed up the rate of block production

Let me know if the above explanations are clear or if I missed anything.

cc @ValarDragon @sunnya97 @ebuchman @milosevic

jaekwon commented Oct 17, 2018

Relevant: https://github.com/tendermint/tendermint/blob/jae/bft_time/docs/specification/new-spec/bft-time.md
I'd like to push for subjective time validity post-launch, & ideally otherwise keeping most of our BFT as is for now. On the SDK side at least (or on the Tendermint side) we still need to ensure time monotonicity to prevent logical issues that may arise from time moving backwards, so I propose that we create an issue for this and work on it after Game of Stakes is up.

ebuchman commented Oct 18, 2018

I'd like to push for subjective time validity post-launch

I don't think this addresses the main concern which is fundamentally:

In general, it seems to me like we have not thought enough about the ramifications of utilizing a timestamp completely controlled by the validator set for core protocol security state machine logic

So it seems like we should be considering a more hybrid approach that includes both time and number of blocks for things like inflationary rewards and max evidence age.

ValarDragon commented Oct 25, 2018

I'm not sure what we're gaining with the hybrid approach vs just pure number of blocks.

Time only adds something in that model if blocks are being produced unexpectedly fast but 1/3 of validators aren't lying about time. However, if we determined that the time-based attack would happen under a rational 1/3 of validators, I'm not sure this is a valuable model, and perhaps we should just use pure block number.

ebuchman commented Nov 14, 2018

Summary of the state of things:

Currently Tendermint is using a MaxAge parameter which determines how long evidence is valid for. This prevents spam from really old evidence. Evidence older than the MaxAge (ie. CurrentHeight - evidence.Height > MaxAge) is rejected.

We're considering changing MaxAge to be time-based, rather than height based, because most other considerations in the state machine use time now that we have a BFT time. Instead of saying "Evidence older than 10,000 blocks is no longer valid", we'd say, "Evidence older than 2 weeks is no longer valid".

Problem

The problem is that, currently, block time is determined entirely by the median of the timestamps in the LastCommit. There is no other subjective element to ensure timestamps are "reasonable".

This means a +1/3 cabal of validators could manipulate the timestamp however they want. For instance, they could make the timestamp for block H+1 be one month after that for block H, then double-sign for block H. When the evidence is published, it will be considered too old and will be ignored, because as far as the protocol can tell, it happened a month ago, even though it was just in the last block!

Solutions

There are two general approaches to solving this:

  • use a hybrid age that includes both height and time
  • use a subjective time validity

Hybrid

The hybrid idea would require that a piece of evidence be too old both in height and in time before it is considered invalid. In that case, even if the evidence looked like it was a month old, it would also have to be at least some number of blocks old before being rejected. This is a reasonably simple approach.

Subjective Time

There are two proposals for how to solve this using subjective time

Proposer Based Time

In one proposal, the proposer of a block sets the timestamp according to their local clock, and other validators accept it if it's within some tight range of their own local clocks (ie. on the order of seconds).

This solves the previous problem, since a +1/3 cabal can no longer manipulate the timestamp however they want (they would need +2/3). However, this introduces new dependencies in the Tendermint software on synchronized clocks - if the clocks get out of sync, Tendermint would halt, and there's no built-in mechanism to get the clocks re-synced, which means Tendermint nodes would have to depend on external clock synchronization services.

A solution to this could be to have a low-frequency component of the consensus protocol that uses median timestamps to realign everyone's clocks. However, due to the complexity that would entail, and the desire to keep Tendermint free of timing assumptions, we've decided to postpone that for now.

Node Based Rejection

An alternative solution which we can easily implement today is to continue with the median timestamps as is, but to add a loose subjective validity criteria - ie. nodes will only accept blocks if the timestamp is within some large range of their own local clocks (ie. on the order of hours). This has the following benefits:

  • much weaker subjective criteria - keeping clocks synced within ~hours is much easier and less concerning than keeping them synced within seconds
  • +1/3 cabal is prevented from arbitrarily tampering with the timestamp - the most they can advance it is ~ an hour per block, which gives the network time to respond to the attack.

Conclusion

After writing this up, it seems the hybrid approach and the node-based-rejection approach end up quite similar in the end. If anything, the hybrid approach seems safer, if it assumes a minimum of 1 block per minute, while node-based rejection might enable 1 block per hour (ie. if the +1/3 cabal increases the timestamp by an hour with each block).

ValarDragon commented Nov 14, 2018

I would like to explicitly note that there are additional security caveats in the node-based rejection case. In the event of an attack, the next proposer can't honestly propose, because the time it would honestly propose is earlier than the chain's already-warped time. (Time is required to be monotonically increasing.) Also, if we assume a rational 2/3rds, then all the same attacks persist.

I still don't understand what the utility of time in the hybrid approach is, and why we wouldn't just go with a pure block-number-based approach. I get that we have nice properties when the validators are "playing nice", but why would we expect them to? We can put more faith in them to do so if there is additionally subjective validity.

Because of this, I think the design space ought to be:

  1. block number only
  2. hybrid with node based rejection
  3. some idea not yet thought of

I personally prefer 1).

cwgoes commented Nov 14, 2018

I still don't understand what the utility of time in the hybrid approach is, and why we wouldn't just go with a pure block-number-based approach. I get that we have nice properties when the validators are "playing nice", but why would we expect them to? We can put more faith in them to do so if there is additionally subjective validity.

I think we have some reason to expect the validators to "play nice" - the 1/3 timewarp attack is way less appealing if the most you can do is reduce the unbonding period by half (if we took a block-height evidence threshold on the order of half the expected time). Votes for timestamps are attributable, so it would be easy for anyone looking at the system to recognize what was going on and take appropriate action (maybe through governance, or through hard-forking out the offending validators).

That said, evidence height only prevents the 1/3 timewarp evidence attack (which is probably the worst one), a 1/3 stake cabal may have other reasons to speed up time (e.g. to create more inflation), and it would be nice in general for nodes to be able to check the on-chain time against some external reference. I think proposer-based time is appealing if we can resolve the liveness concerns - there may also be more radical points in the design space, perhaps some sort of timestamp commit-reveal scheme (where censorship of the reveals just results in halting), but they would also require consensus changes.

I'm in favor of the hybrid approach for launch, and further research into more complex protocol alterations afterwards.

ebuchman commented Nov 14, 2018

In the event of an attack, the next proposer can't honestly propose, because the time it would honestly propose is earlier than the chain's already-warped time.

It just means if the +1/3 cabal makes a median that's too far in the future, the chain will halt until people's clocks are within ~an hour of that median. That's not necessarily a bad thing.

I still don't understand what the utility of time in the hybrid approach is, and why we wouldn't just go with a pure block number based approach

Presumably there won't be a constant +1/3 cabal of attackers smashing on the timestamp. In that case, time gives us a more accurate evidence period.

I'm also realizing that maybe the significance of this attack is overstated. If the point is to advance the timestamp so they can double sign without being slashed, that doesn't make much sense, because a +1/3 can already prevent themselves from getting slashed by just not accepting blocks with the evidence ... the double signing is still attributable and they can still be forked out, in exactly the same way as if they +1/3 double-signed without the timestamp attack. So what makes the timestamp piece so important here?

cwgoes commented Nov 14, 2018

I'm also realizing that maybe the significance of this attack is overstated. If the point is to advance the timestamp so they can double sign without being slashed, that doesn't make much sense, because a +1/3 can already prevent themselves from getting slashed by just not accepting blocks with the evidence ... the double signing is still attributable and they can still be forked out, in exactly the same way as if they +1/3 double-signed without the timestamp attack. So what makes the timestamp piece so important here?

That's true, I missed that. There are still other reasons for a 34% cabal to accelerate time (more inflation, perhaps), but invalidating evidence doesn't seem like a particularly compelling one when they could just censor it instead (although at least with accelerating time, as soon as they had accelerated past three weeks, they wouldn't need to censor blocks anymore, so maybe that's a minor advantage).

Maybe we can't do anything useful here prelaunch then.

I do wonder if there are other subjective validity conditions which incur a risk of liveness bugs but still make sense in practice - timestamp-based is one, non-inclusion of evidence might be another, or so might be a substantial difference (>> 1/2 or >> 2/3) between the expected proposer set for the past n blocks and the actual proposer set, since that would indicate either censorship or an adversarial networking environment, either of which might be cause to halt and manually debug.

ebuchman commented Nov 14, 2018

Interesting. That kind of thinking once again harks back to the idea of some form of ABCI-based precheck on blocks before they are voted on. Very likely we'll adopt some form of that.

From a launch perspective then, should we just leave the MaxAge as Height and consider this "attack" a non-issue?

jaekwon commented Nov 14, 2018

Node Based Rejection: nodes will only accept blocks if the timestamp is within some large range of their own local clocks (ie. on the order of hours).

I'm not sure what this is proposing. The only interpretation I can grok subjects us to a consensus failure // network split upon a timing attack, without actually solving the BFT time problem on the Tendermint side, which is what we need to do. We're mixing the short-term concerns of evidence handling w/ the long-term need for BFT time, so maybe it'll help to separate these two discussions.

much weaker subjective criteria - keeping clocks synced within ~hours is much easier and less concerning than keeping them synced within seconds

This proposal for subjective time validity isn't vulnerable to liveness or safety failure w/ at least 1/3 honest validators - or, if it is broken, it can be fixed with minimal changes to make it so (my claim -- proof or counter-proof left to the reader) - and it works just as well with ~hour granularity vs second granularity. It can work with globally sync'd clocks, and it can also work with adjustable clocks. One complaint I heard about this proposal is that it requires a globally sync'd clock, unlike current-Tendermint. That's not true. Either we care about global time, or we don't. If we care about global time, then we require nodes to have clocks synchronized with global time anyways (obviously). If we don't care about global time, then just keep a local time-offset by comparing blockchain time to local measured time. It still works with imperfect clocks that drift w/ some maximum bound, just like current-Tendermint.

The very simple and reasonable solution to evidence is to "kill" a validator consensus pubkey upon any evidence of double-signing. We were going to do this anyways in the SDK -- once a double-sign is detected, the validator's consensus address is no longer usable, and everyone must re-delegate away from that malfunctioning validator consensus pubkey. The validator operator can create another consensus pubkey and rebond, but delegators who aren't actively monitoring are not subjected to 100% slashing upon the theft of both the operator & consensus keys by a hacker, who could otherwise force repeated double-signing and thus slashing to 0. This is safer for delegators. This also makes sense for hardware HSM signers that securely generate the consensus privkey internal to the HSM -- if the device is faulty, the solution is to stop using the device (and thus the consensus pubkey), regardless of what block height or time it signed. Similarly, even without an HSM, if the consensus pubkey was hacked/leaked, then the solution is to stop using the pubkey, regardless. All this engineering around the validity of evidence IMO is unnecessary complexity.

On the Tendermint side, it just needs to keep track of dead validator consensus pubkeys. One piece of evidence is sufficient forever; there is no need to consider the "validity" of evidence across time or block heights. The SDK/app side should also keep track of dead consensus pubkeys. If the SDK/app doesn't keep track of dead consensus pubkeys, then Tendermint can submit the same evidence again, or even panic (invalid application).
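A minimal Go sketch of the tombstone bookkeeping (types and names are illustrative, not the SDK's actual API): once any double-sign evidence is seen for a consensus pubkey, the key is dead forever, and no evidence-age checks are needed.

```go
package main

import "fmt"

// tombstones maps consensus pubkeys to a permanent "dead" flag.
type tombstones map[string]bool

// handleEvidence processes double-sign evidence for a pubkey. The first
// evidence kills the key and slashes once; repeated evidence for a dead key
// is ignored, so delegators are never slashed to zero by repeated reports.
func (t tombstones) handleEvidence(pubkey string) (slash bool) {
	if t[pubkey] {
		return false // already dead: duplicate evidence is a no-op
	}
	t[pubkey] = true // kill the key forever; delegators must re-delegate
	return true
}

// canBond reports whether a pubkey may still be used for bonding.
func (t tombstones) canBond(pubkey string) bool {
	return !t[pubkey]
}

func main() {
	t := tombstones{}
	fmt.Println(t.handleEvidence("consensus-pubkey-A")) // true: slash once
	fmt.Println(t.handleEvidence("consensus-pubkey-A")) // false: no double slash
	fmt.Println(t.canBond("consensus-pubkey-A"))        // false: dead forever
}
```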

This also simplifies the logic on the SDK x/slashing logic. There's no need to keep track of which delegator "contributed" to a validator's double-signing attack. If you were delegating to a validator consensus pubkey which had ever double-signed, then you might as well get slashed for it. Anything else IMO is over-engineering.

Premise 1: We're going to address the BFT time problem on the Tendermint side.
Premise 2: "Subjective" node-based rejection is not a valid long-term solution.
Premise 3: Validator consensus pubkeys that ever double-sign are forever un-trustworthy.
Premise 4: Delegators that are offline for a long duration shouldn't be potentially slashed to 0 in the case of operator/consensus pubkey hacking, if a better solution is possible where they only get slashed once, and are forced to re-delegate.


Where to go from here:

  • For GoS, if validators launch a time-warp attack, we should detect it and slash all of them manually, e.g. prevent them from winning any rewards.
  • After GoS launch, we should create a tombstone for dead validator consensus pubkeys, and disallow bonding with those. We should simplify evidence on the Tendermint-side to allow double-signing evidence regardless of height or time.
  • Tendermint should implement subjective time validity. ;)
ebuchman commented Nov 14, 2018

I'm not sure what this is proposing. The only interpretation I can grok subjects us to a consensus failure // network split upon a timing attack, without actually solving the BFT time problem on the Tendermint side, which is what we need to do

The BFT time is solved by median timestamps. This is just about preventing +1/3 from making their timestamps arbitrarily far in the future by rejecting blocks that are ~hours ahead of our local clock.

there is no need to consider the "validity" of evidence across time or block heights.

This is about preventing DoS on the evidence reactor since we don't want to bother with old useless evidence (ie. its older than the unbonding period).

After further consideration (see the other comments above), we realized the original issue was actually blown out of proportion and we don't really need to make any changes, except possibly changing the MaxAge from height to time so it matches the metric used for unbonding periods. So I think we should probably just close this issue, and either move forward with #2565 (ie. change MaxAge from height to time), or do nothing.

jaekwon commented Nov 14, 2018

The BFT time is solved by median timestamps. This is just about preventing +1/3 from making their timestamps arbitrarily far in the future by rejecting blocks that are ~hours ahead of our local clock.

My point still stands. This subjects us to a consensus failure // network split upon a timing attack, without actually solving the BFT time problem on the Tendermint side. Rejecting a block means the BFT time as agreed by median timestamps is not working.

This is about preventing DoS on the evidence reactor since we don't want to bother with old useless evidence (ie. its older than the unbonding period).

My comment addresses the DoS issue with radical simplicity. No need to bother with even defining what evidence is "valid". Any double-signing evidence is valid at any time.

So I think we should probably just close this issue, and either move forward with #2565 (ie. change MaxAge from height to time), or do nothing.

I suggest that we discuss my proposal in depth, and leave this issue open to figure out the long term solution. Again, we should implement validator tombstones and radically simplify everything.

ebuchman commented Nov 14, 2018

Rejecting a block means the BFT time as agreed by median timestamps is not working.

This is with +1/3 byzantine. Proposer-based time is equally vulnerable here (the +1/3 cabal can refuse to sign blocks that don't have timestamps far in the future).

One complaint I heard about this proposal is that it requires a globally sync'd clock, unlike current-Tendermint. That's not true. Either we care about global time, or we don't.

I don't think it's this simple. The question is: do we want Tendermint to work even if the clocks go out of sync? I would say yes. Adopting proposer-based time says no. My ultimate preference would be to find a clean way to support both options, so the user can choose what they want - or at least to be able to disable the subjective-time checks if they want to run Tendermint but don't care about time.

No doubt we're moving to proposer-based time eventually - we need to in order to support signature aggregation. But we should probably make it optional so Tendermint can still run in a clock-independent mode.

Again, we should implement validator tombstones and radically simplify everything.

I didn't quite understand this initially, thanks for re-stating. Is there a proper write up of the tombstone idea? It sounds potentially quite elegant, though it would be a significant imposition on ABCI apps.

In any case, we should probably close this issue, as it seems the initial attack was overblown, and open a new issue to discuss the tombstone idea.

ebuchman commented Nov 14, 2018

If we don't care about global time, then just keep a local time-offset by comparing blockchain time to local measured time. It still works with imperfect clocks that drift w/ some maximum bound, just like current-Tendermint.

Jae explained to me that this satisfies "my ultimate preference would be to find a clean way to support both options, so the user can choose what they want". Awesome!

I opened #2839 to discuss the tombstone idea, and #2840 for proposer-based time. Both warrant further analysis.

Are there any other unresolved problems here?
