Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consensus: implement double sign risk reduction adr-051 #5147

Merged

Conversation

dongsam
Copy link
Contributor

@dongsam dongsam commented Jul 23, 2020

Implementation spec of Double Signing Risk Reduction ADR-51 by B-Harvest

  • Add DoubleSignCheckHeight config variable to ConsensusConfig for "How many blocks looks back to check existence of the node's consensus votes when before joining consensus"
  • Add consensus.double_sign_check_height to config.toml and tendermint node as flag for set DoubleSignCheckHeight
  • Set default consensus.double_sign_check_height to 0 ( it could be adjustable in this PR, disable when 0 )

Refs

  • Wrote tests
  • Updated CHANGELOG_PENDING.md
  • Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
  • Updated relevant documentation (docs/) and code comments
  • Re-reviewed Files changed in the Github PR explorer
  • Applied Appropriate Labels

@auto-comment
Copy link

auto-comment bot commented Jul 23, 2020

👋 Thanks for creating a PR!

Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.

  • Wrote tests
  • Updated CHANGELOG_PENDING.md
  • Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
  • Updated relevant documentation (docs/) and code comments
  • Re-reviewed Files changed in the Github PR explorer
  • Applied Appropriate Labels

Thank you for your contribution to Tendermint! 🚀

@codecov
Copy link

codecov bot commented Jul 23, 2020

Codecov Report

Merging #5147 into master will decrease coverage by 0.13%.
The diff coverage is 26.08%.

@@            Coverage Diff             @@
##           master    #5147      +/-   ##
==========================================
- Coverage   62.48%   62.34%   -0.14%     
==========================================
  Files         258      258              
  Lines       27129    27152      +23     
==========================================
- Hits        16951    16929      -22     
- Misses       8699     8735      +36     
- Partials     1479     1488       +9     
Impacted Files Coverage Δ
cmd/tendermint/commands/run_node.go 0.00% <0.00%> (ø)
config/toml.go 65.95% <ø> (ø)
consensus/state.go 71.70% <12.50%> (-0.86%) ⬇️
config/config.go 77.83% <100.00%> (+0.24%) ⬆️
privval/signer_listener_endpoint.go 78.26% <0.00%> (-10.87%) ⬇️
privval/signer_endpoint.go 73.68% <0.00%> (-7.90%) ⬇️
blockchain/v0/reactor.go 64.01% <0.00%> (-6.55%) ⬇️
privval/signer_server.go 91.11% <0.00%> (-4.45%) ⬇️
blockchain/v0/pool.go 78.66% <0.00%> (-4.15%) ⬇️
privval/socket_listeners.go 82.75% <0.00%> (-3.45%) ⬇️
... and 5 more

@dongsam dongsam force-pushed the ADR-51-double_sign_risk_reduction branch from e2a26b4 to 786890e Compare July 23, 2020 19:30
@dongsam dongsam marked this pull request as ready for review July 23, 2020 20:18
Copy link
Contributor

@melekes melekes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

cmd/tendermint/commands/run_node.go Outdated Show resolved Hide resolved
cmd/tendermint/commands/run_node.go Outdated Show resolved Hide resolved
config/toml.go Outdated Show resolved Hide resolved
config/toml.go Outdated Show resolved Hide resolved
config/toml.go Outdated Show resolved Hide resolved
consensus/state.go Outdated Show resolved Hide resolved
consensus/state.go Outdated Show resolved Hide resolved
consensus/state.go Outdated Show resolved Hide resolved
consensus/state.go Outdated Show resolved Hide resolved
for _, s := range lastCommit.Signatures {
if s.BlockIDFlag == types.BlockIDFlagCommit && bytes.Equal(s.ValidatorAddress, valAddr.Address()) {
cs.Logger.Error("Error checking double sign risk reduction logic", "err", ErrDoubleSignRiskReduction)
return ErrDoubleSignRiskReduction
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find a place where we panic...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when return error on there func (cs *State) OnStart() error, panic occurs in SwitchToConsensus

log example

I[2020-07-24|08:04:15.056] Started node                                 module=main nodeInfo="{ProtocolVersion:{P2P:9 Block:12 App:1} DefaultNodeID:fe0092406535cb943f5e15e3c6c33718089e1d3a ListenAddr:tcp://0.0.0.0:26656 Network:chain-hsa2Bm Version:0.33.6 Channels:40202122233038606100 Moniker:8CBB29C542D2D308 Other:{TxIndex:on RPCAddress:tcp://0.0.0.0:26657}}"
I[2020-07-24|08:04:15.367] Executed block                               module=state height=37 validTxs=0 invalidTxs=0
I[2020-07-24|08:04:15.367] Committed state                              module=state height=37 txs=0 appHash=0000000000000000
panic: Failed to start consensus state: double sign detected or restarted in DoubleSignCheckHeight

conS:
ConsensusState

conR:
ConsensusReactor

goroutine 45 [running]:
github.com/tendermint/tendermint/consensus.(*Reactor).SwitchToConsensus(0xc00025f880, 0xc, 0x1, 0xc0002a8440, 0x6, 0xc0002a8450, 0xc, 0x25, 0xc00258ebc0, 0x20, ...)
	github.com/tendermint/tendermint/consensus/reactor.go:127 +0x2cb
github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).poolRoutine(0xc000b44700, 0xc0024af800)
	github.com/tendermint/tendermint/blockchain/v0/reactor.go:319 +0x11be
created by github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).OnStart
	github.com/tendermint/tendermint/blockchain/v0/reactor.go:110 +0x89

config/toml.go Outdated Show resolved Hide resolved
consensus/state.go Outdated Show resolved Hide resolved
Copy link
Contributor

@tac0turtle tac0turtle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be nice if you could add a small section about this feature in the docs, possibly here: https://docs.tendermint.com/master/tendermint-core/validators.html

consensus/state.go Outdated Show resolved Hide resolved
consensus/state.go Outdated Show resolved Hide resolved
Copy link
Contributor

@melekes melekes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@tac0turtle
Copy link
Contributor

@zmanian @ebuchman @brapse do you have thoughts on this?

@erikgrinaker
Copy link
Contributor

I'm late to the game here, so don't let me block this, but I was asked for an opinion. Personally I feel like this misses too many corner cases, and that we probably don't want to weaken the incentives for operators to manage their keys and nodes properly in the first place. Ideally, a private key should only be used by a single node.

From what I understand, the primary use-case here is validator failover. Alternative solutions might be to e.g. allow multiple private keys per validator (in prioritized order) or use some sort of distributed lock service to make sure only a single validator replica is attempting to vote for a given height. I haven't given this much thought though, just a couple of ideas.

@tessr
Copy link
Contributor

tessr commented Aug 3, 2020

I'm late to the game here, so don't let me block this, but I was asked for an opinion.

Keys are the kind of thing that are worth getting right the first time, so I think it's OK if we take a step back just to double-check this approach. In this case, we probably should have looped in the folks tagged above at the ADR step of the process rather than at this step. But ah well - we're learning and getting better at that.

Going to tag @tarcieri on this one too, since I bet he might have an opinion on it 😉

@tarcieri
Copy link

tarcieri commented Aug 3, 2020

I'm a fan of a belt-and-suspenders approach here. There's been a lot of discussion of various ways to better mitigate double signing on the KMS side, but given slashing risk, it's nice to also improve checks on the validator side too.

The only thing I might additionally suggest here is a potential optional enhancement (perhaps a configurable one), would be a sort of "warm up" phase. Concretely, if you had two validators, one of them could sign only the odd numbered blocks, and another the even numbered ones (much like the similar trick for database sharding by primary key). Or for 3 validators, it'd be if their "validator index" matched (block_height % 3), (block_height % 3) + 1, (block_height % 3) + 2 respectively.

This would give some time for validators to signal to each other that they're online, and also a way of assigning a priority so if two (or more) validators do happen to go active at the same time, the lowest priority ones (perhaps the ones with the highest "validator index") can observe the highest priority one is signing, at that point deliberately stop signing, and the highest priority one can witness it before it begins signing every block.

All of that seems like a potential next step for this ADR/PR, which otherwise seems like a decent start.

@Hyung-bharvest
Copy link
Contributor

I'm a fan of a belt-and-suspenders approach here. There's been a lot of discussion of various ways to better mitigate double signing on the KMS side, but given slashing risk, it's nice to also improve checks on the validator side too.

The only thing I might additionally suggest here is a potential optional enhancement (perhaps a configurable one), would be a sort of "warm up" phase. Concretely, if you had two validators, one of them could sign only the odd numbered blocks, and another the even numbered ones (much like the similar trick for database sharding by primary key). Or for 3 validators, it'd be if their "validator index" matched (block_height % 3), (block_height % 3) + 1, (block_height % 3) + 2 respectively.

This would give some time for validators to signal to each other that they're online, and also a way of assigning a priority so if two (or more) validators do happen to go active at the same time, the lowest priority ones (perhaps the ones with the highest "validator index") can observe the highest priority one is signing, at that point deliberately stop signing, and the highest priority one can witness it before it begins signing every block.

All of that seems like a potential next step for this ADR/PR, which otherwise seems like a decent start.

Thanks for the opinion. We also agree that allowing multiple validators should be ultimate goal, but achieving it with very high certainty of non-double-signing for any edge case is quite a challenge for us. Because the result of double-signing is catastrophic, we hope the future solution should provide us 100% logical certainty that double-signing will not occur in any edge case.

We will continue to participate on the next step discussion!

@dongsam
Copy link
Contributor Author

dongsam commented Aug 10, 2020

I think Test Coverage / test-coverage-part-* doesn't seem deterministic in ci, is there anything else I need to do before Merge?

config/config.go Outdated Show resolved Hide resolved
@@ -366,6 +367,25 @@ func (cs *State) OnStart() error {
return err
}

// Double Signing Risk Reduction
Copy link
Contributor

@erikgrinaker erikgrinaker Aug 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it actually sufficient to only do this in OnStart()? If a node has e.g. fast sync disabled, or for whatever other reason isn't caught up with the chain (for example because block replay took a long time), then a different validator can submit votes for heights that this node doesn't know about yet. You should be able to reproduce this simply by starting a new node with a duplicate key and fast sync disabled.

I believe we will have to run this check again before casting our first vote, after consensus has caught up with the chain head, or something similar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, that would be a more secure way.
What do you think about adding check logic again in gossipDataRoutine(height and round matched) for that? and I could be set the DoubleSignCheckHeight as a flag to zero to make it work only once here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid I'm not very familiar with the intricacies of the consensus module, but I'd say the check should run right before we try to cast our first vote, probably in State somewhere, not Reactor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi. Erik. This is Hyung from B-Harvest. We are quite confident that the proposed additional check will satisfy raised concern about covering different route of risk.

Because this work has been staled, could you elaborate more detail on the codebase about your concerns? It will be helpful to understand your opinion.

We will wait several days before we move on with our proposed approach.

Thank you for your suggestion!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it won't work without fast-sync / state-sync. We can create a separate issue for improving it. I doubt there's a clear point in code where "right before we try to cast our first vote" happens, so it might be difficult to pin point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I'll create a separate issue for improving it after merged this fast-sync version

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale for use by stalebot label Aug 25, 2020
@@ -366,6 +367,25 @@ func (cs *State) OnStart() error {
return err
}

// Double Signing Risk Reduction
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we extract this into a function please?

@erikgrinaker erikgrinaker removed the stale for use by stalebot label Aug 25, 2020
@melekes
Copy link
Contributor

melekes commented Aug 25, 2020

@dongsam can you also rebase the PR agaist the latest master?

@dongsam
Copy link
Contributor Author

dongsam commented Aug 25, 2020

@dongsam can you also rebase the PR agaist the latest master?

@melekes of course, you mean rebase/squash with force-push? or merge upstream?

@melekes
Copy link
Contributor

melekes commented Aug 25, 2020

you mean rebase/squash with force-push? or merge upstream?

the former one

@dongsam dongsam force-pushed the ADR-51-double_sign_risk_reduction branch from 13b04b6 to d0c0bd8 Compare August 25, 2020 12:50
@melekes
Copy link
Contributor

melekes commented Aug 26, 2020

Hmm... github still says "This branch is out-of-date with the base branch"

@melekes melekes merged commit e30b125 into tendermint:master Aug 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants