Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consensus Failure when remote signer connection drops #4275

Closed
ljk662 opened this issue Dec 30, 2019 · 11 comments
Closed

Consensus Failure when remote signer connection drops #4275

ljk662 opened this issue Dec 30, 2019 · 11 comments
Assignees
Labels
T:bug Type Bug (Confirmed) T:validator Type: Validator related

Comments

@ljk662
Copy link

ljk662 commented Dec 30, 2019

Tendermint version

gaiad tendermint version
tendermint: 0.32.7
abci: 0.16.1
blockprotocol: 10
p2pprotocol: 7

Environment:
gaiad in alpine based docker container
tendermint/kms on Ubuntu 16.04

What happened:
Duplicate of closed issue: 2926. I am still experiencing this issue.

What you expected to happen:
If my KMS service temporarily disconnects (which seems to happen at least once a week), my gaiad instance experiences a consensus failure. Afterwards, it does not resolve/reconnect unless I restart my gaiad instance.

Have you tried the latest version: yes/no
I am using latest version of gaia (2.0.3) and tmkms (0.7.0)

How to reproduce it (as minimally and precisely as possible):
Start gaiad, start tmkms, shut off tmkms for a few seconds.

Logs (paste a small part showing an error (< 10 lines) or link a pastebin, gist, etc. containing more of the log file):

E[2019-12-29|19:14:16.928] SignerListener: Ping timeout                 module=privval
I[2019-12-29|19:14:17.473] Timed out                                    module=consensus dur=4.96053053s height=133946 round=0 step=RoundStepNewHeight
E[2019-12-29|19:14:19.929] SignerListener: Ping timeout                 module=privval
E[2019-12-29|19:14:22.929] SignerClient::GetPubKey                      module=privval err="endpoint connection timed out"
E[2019-12-29|19:14:22.929] CONSENSUS FAILURE!!!                         module=consensus err="runtime error: invalid memory address or nil pointer dereference" stack="goroutine 976 [running]:\nruntime/debug.Stack(0xc000eb9358, 0x102fa00, 0x1d768e0)\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x9d\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).receiveRoutine.func2(0xc000074700, 0x138f910)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:602 +0x57\npanic(0x102fa00, 0x1d768e0)\n\t/usr/local/go/src/runtime/panic.go:679 +0x1b2\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).signAddVote(0xc000074700, 0x20b02, 0xc003f97e00, 0x20, 0x20, 0x1, 0xc003f97e20, 0x20, 0x20, 0xa)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:1745 +0x585\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).enterPrecommit(0xc000074700, 0x20b3a, 0x0)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:1155 +0x12ec\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).addVote(0xc000074700, 0xc000d8b400, 0xc000a574d0, 0x28, 0xc000eb9a38, 0xbff162, 0x0)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:1663 +0xdfa\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).tryAddVote(0xc000074700, 0xc000d8b400, 0xc000a574d0, 0x28, 0x4922ea91d59af208, 0x109, 0x101)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:1527 +0x59\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).handleMsg(0xc000074700, 0x1545be0, 0xc004be4160, 0xc000a574d0, 0x28)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:691 +0x252\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).receiveRoutine(0xc000074700, 0x0)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:633 +0x6eb\ncreated by github.com/tendermint/tendermint/consensus.(*ConsensusState).OnStart\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:334 +0x13a\n"
I[2019-12-29|19:14:22.929] Stopping baseWAL                             module=consensus wal=/root/.gaiad/data/cs.wal/wal impl=baseWAL
I[2019-12-29|19:14:22.931] Stopping Group                               module=consensus wal=/root/.gaiad/data/cs.wal/wal impl=Group
E[2019-12-29|19:14:25.929] SignerListener: Ping timeout                 module=privval
E[2019-12-29|19:14:28.929] SignerListener: Ping timeout                 module=privval
E[2019-12-29|19:14:31.929] SignerListener: Ping timeout                 module=privval
E[2019-12-29|19:14:34.929] SignerListener: Ping timeout                 module=privval

Config (you can paste only the changes you've made):
n/a

node command runtime flags:
n/a

/dump_consensus_state output for consensus bugs
appears normal

Anything else we need to know:
I've read in previously related issues the consensus failure is related to the timeout_propose flag. I have left that at the default 3s value. Is this supposed to be tweaked when using a remote signer?

@melekes melekes added T:bug Type Bug (Confirmed) T:validator Type: Validator related labels Jan 2, 2020
@franono
Copy link

franono commented Apr 13, 2020

Have been experiencing the same CONSENSUS FAILURE!!! module=consensus err="runtime error: invalid memory address or nil pointer dereference issues with Tendermint v0.32.8 and v0.32.9 based chains, and KMS v0.7.1 and v0.7.2 signers.

@melekes melekes added the P:High label Apr 17, 2020
@alessio
Copy link
Contributor

alessio commented Apr 17, 2020

@franono can you attach here the whole log message you are displayed please?

@ljk662
Copy link
Author

ljk662 commented Apr 17, 2020

If it's the same as the one I experience above

E[2019-12-29|19:14:22.929] CONSENSUS FAILURE!!!                         module=consensus err="runtime error: invalid memory address or nil pointer dereference" stack="goroutine 976 [running]:\nruntime/debug.Stack(0xc000eb9358, 0x102fa00, 0x1d768e0)\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x9d\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).receiveRoutine.func2(0xc000074700, 0x138f910)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:602 +0x57\npanic(0x102fa00, 0x1d768e0)\n\t/usr/local/go/src/runtime/panic.go:679 +0x1b2\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).signAddVote(0xc000074700, 0x20b02, 0xc003f97e00, 0x20, 0x20, 0x1, 0xc003f97e20, 0x20, 0x20, 0xa)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:1745 +0x585\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).enterPrecommit(0xc000074700, 0x20b3a, 0x0)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:1155 +0x12ec\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).addVote(0xc000074700, 0xc000d8b400, 0xc000a574d0, 0x28, 0xc000eb9a38, 0xbff162, 0x0)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:1663 +0xdfa\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).tryAddVote(0xc000074700, 0xc000d8b400, 0xc000a574d0, 0x28, 0x4922ea91d59af208, 0x109, 0x101)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:1527 +0x59\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).handleMsg(0xc000074700, 0x1545be0, 0xc004be4160, 0xc000a574d0, 0x28)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:691 +0x252\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).receiveRoutine(0xc000074700, 0x0)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:633 +0x6eb\ncreated by github.com/tendermint/tendermint/consensus.(*ConsensusState).OnStart\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:334 +0x13a\n"

@melekes
Copy link
Contributor

melekes commented Apr 20, 2020

I believe this is fixed in the current master. PR: #4534

That said, we can still create a backport fix, in which GetPubKey() tries to get the public key indefinitely with some small timeout in between requests (say 100ms). What do folks think?

@melekes
Copy link
Contributor

melekes commented Apr 20, 2020

Refs #4707

@ljk662
Copy link
Author

ljk662 commented Apr 20, 2020

Backporting the fix would be great @melekes

@melekes melekes self-assigned this Apr 20, 2020
@melekes
Copy link
Contributor

melekes commented Apr 21, 2020

Afterwards, it does not resolve/reconnect unless I restart my gaiad instance.

by "I restart" you mean https://en.wikipedia.org/wiki/Process_supervision (systemd, runit, etc.) or you restart gaiad by hand?

@melekes
Copy link
Contributor

melekes commented Apr 21, 2020

@ljk662 want to try #4709 and see if that fixes your problem?

@franono
Copy link

franono commented Apr 27, 2020

I built Gaiad v2.0.8 using Tendermint branch release/v0.32.11 on a testnet validator which is using a KMS signer.

Disrupted the KMS signer by:

  • Restarting the KMS service multiple times consecutively
  • apt-get upgrade on the KMS host, which in our experience has sometimes caused gaiad to crash in the past

Throughout the tests, gaiad did not crash. It had a few signer pong timeouts however it continued signing once the KMS was back available.

All looks good, thanks!

@alessio
Copy link
Contributor

alessio commented Apr 27, 2020

@marbar3778 let's get on with respective releases for both cosmos-sdk and gaia!

tessr pushed a commit that referenced this issue Apr 29, 2020
* privval: retry GetPubKey, SignVote/Proposal indefinitely

Fixes #4275
@melekes
Copy link
Contributor

melekes commented May 5, 2020

Fixed in v0.32.11. Thanks everyone ❤️

Related issue on master: #4707

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T:bug Type Bug (Confirmed) T:validator Type: Validator related
Projects
None yet
Development

No branches or pull requests

4 participants