
Non-validating nodes getting "MConnection flush failed" "Connection failed @ sendRoutine" in P2P module #4925

Closed
njmurarka opened this issue May 30, 2020 · 12 comments
Assignees
Labels
C:p2p Component: P2P pkg

Comments


njmurarka commented May 30, 2020

Note:

I filed #4922, #4924, and #4925 separately, but they may well all be related. Apologies in advance if they turn out to be the same issue.

Tendermint version:

Tendermint Core Semantic Version: 0.33.3
P2P Protocol Version: 7
Block Protocol Version: 10

"node_info": {
    "protocol_version": {
      "p2p": "7",
      "block": "10",
      "app": "0"
    },
    "id": "dbc39feecf277f59b4b16ae277e8545c54ac244a",
    "listen_addr": "tcp://0.0.0.0:26656",
    "network": "bluzelle",
    "version": "0.33.3",
    "channels": "4020212223303800",
    "moniker": "daemon-sentry-3",
    "other": {
      "tx_index": "on",
      "rpc_address": "tcp://0.0.0.0:26657"
    }

ABCI app:

Cosmos SDK Version: v0.38.3

"application_version": {
    "name": "BluzelleService",
    "server_name": "blzd",
    "client_name": "blzcli",
    "version": "0.0.0-74-ge1ee575",
    "commit": "e1ee575051ad2ea18ef22fc6bf7a6fc904612a49",
    "build_tags": "ledger,faucet,cosmos-sdk v0.38.3",
    "go": "go version go1.14.3 linux/amd64"
  }

Big Dipper Explorer URL:

http://explorer.testnet.public.bluzelle.com:3000

Instructions to setup a similar node (I'd suggest just setting up a sentry):

https://github.com/bluzelle/curium/blob/devel/docs/public/buildvalidatorsentry.md

Access to genesis file for chain:

http://a.sentry.bluzellenet.bluzelle.com:1317/genesis.json

Sample command to get node info:

curl --location --request GET 'http://a.sentry.testnet.public.bluzelle.com:1317/node_info'

Discord channel invite (in case you want to live chat with me... I am Neeraj, one of the admins):

https://discord.gg/BbBZJZJ

Environment:

  • OS (e.g. from /etc/os-release):
Distributor ID: Ubuntu
Description:  Ubuntu 18.04.4 LTS
Release:  18.04
Codename: bionic
  • Install tools:

Using the Cosmos SDK. Otherwise, not sure what else to say here.

  • Others:

We are running a testnet with our CRUD database as one of the application modules, in COSMOS.

We currently (as of filing this issue) have 5 "sentries" and 3 validators. To be clear, the sentries have no voting power and are the only peers the validators talk to (the validators can also talk to each other). Furthermore, the validators are IP-firewalled so that they can only talk to the sentries and other validators. The sentries themselves keep the validator node IDs private.

Sentry hostnames:

a.sentry.testnet.public.bluzelle.com
b.sentry.testnet.public.bluzelle.com
c.sentry.testnet.public.bluzelle.com
d.sentry.testnet.public.bluzelle.com
e.sentry.testnet.public.bluzelle.com

I am not listing the validator hostnames, since they are inaccessible (due to the firewall) anyway.

The validators listen only on 26656, and only to validators and sentries. The sentries listen on 26656 and 26657, and each also runs the Cosmos REST server, listening on 1317.

We have opened our network to the public. Members of the public have set up sentries and validators of their own and are expected to use our five sentries as their P2P peers in config.toml.
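For context, a sentry/validator split like the one described above is typically expressed in each node's config.toml under the [p2p] section. The node IDs and addresses below are placeholders for illustration, not the real ones from this network:

```toml
# Sentry node ([p2p] section): gossips peers publicly via PEX,
# but never reveals the validator's node ID to other peers.
pex = true
persistent_peers = "<validator-node-id>@10.0.0.1:26656"
private_peer_ids = "<validator-node-id>"

# Validator node ([p2p] section): no peer exchange, talks only to its sentries.
pex = false
persistent_peers = "<sentry-node-id>@a.sentry.testnet.public.bluzelle.com:26656"
```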

What happened:

We are occasionally getting messages as follows on the sentries.

E[2020-05-30|20:58:40.767] MConnection flush failed                     module=p2p peer=87e575d7f765a965f0518fbb41d34f371afa0fda@3.7.204.136:40934 err="write tcp 172.17.0.2:26656->3.7.204.136:40934: write: connection reset by peer"
E[2020-05-30|20:58:40.767] Connection failed @ sendRoutine              module=p2p peer=87e575d7f765a965f0518fbb41d34f371afa0fda@3.7.204.136:40934 conn=MConn{3.7.204.136:40934} err="pong timeout"
E[2020-05-30|20:58:40.767] Stopping peer for error                      module=p2p peer="Peer{MConn{3.7.204.136:40934} 87e575d7f765a965f0518fbb41d34f371afa0fda in}" err="pong timeout"

Note that these messages only happen on our sentries. Our validators never show these messages.

Also note that, for brevity, I have removed the lines considered "clean" from the output above. The same sentries showing this "bad" output also emit the "clean" lines.

This could be related to #4922; in fact, the last two lines in the output above are the same as the output in #4922. I filed these separately since they seem related but still distinct. It could also be related to #4924, which I also just filed.

What you expected to happen:

We expect "clean" output, as so:

I[2020-05-30|21:53:04.286] Executed block                               module=state height=20416 validTxs=0 invalidTxs=2
I[2020-05-30|21:53:04.309] Committed state                              module=state height=20416 txs=2 appHash=80D70DC5FF062F34D3F79F15FC85CB367A5A7F9CF39B4EE6C1DC68E9F1958EA1

(The fact that invalidTxs is non-zero is the subject of another investigation)

Have you tried the latest version:

Not sure; I think so. Although, looking at the Tendermint GitHub, I see there are two minor versions newer than the one we are running.

How to reproduce it:

Difficult to answer. We never saw this when we were running a "small" chain with, say, fewer than 20 nodes (i.e. we were getting "clean" output). Once we opened up to the public and allowed anyone to join, we started to see this happen constantly.

I suppose if you set up a network like ours and used the same code, you might reproduce it. The good news is that it is happening right now and happens readily.

Logs:

Listed above.

Config:

No specific changes made to Tendermint.

node command runtime flags:

This is all running from within our daemon that was built with the COSMOS SDK.

/dump_consensus_state output for consensus bugs

Not sure how to do this.

Anything else we need to know:

Most details given above.

I did some searching ahead of time to see if I could resolve this myself. I found some related issues, but they were closed; yet I am seeing this now.

I have also filed some other issues that might be related to this one and have done my best to cross-reference them.


EG-easy commented May 31, 2020

I have the same error while using gaia v2.0.8.
Once I restart gaiad, it connects correctly, but within 1–2 days the node gets disconnected again.

@melekes melekes added the C:p2p Component: P2P pkg label Jun 1, 2020
@njmurarka (Author)

I have the same error while using gaia v2.0.8.
Once I restart gaiad, it connects correctly, but within 1–2 days the node gets disconnected again.

Interesting. Have you seen the issues I filed in #4922 and #4924?


EG-easy commented Jun 2, 2020

@njmurarka
Yup, I also get the same errors, like those below.

pong timeout

E[2020-05-31|08:08:29.987] Connection failed @ sendRoutine              module=p2p peer=ba3bacc714817218562f743178228f23678b2873@5.83.160.108:26656 conn=MConn{5.83.160.108:26656} err="pong timeout"
E[2020-05-31|08:08:29.987] Stopping peer for error                      module=p2p peer="Peer{MConn{5.83.160.108:26656} ba3bacc714817218562f743178228f23678b2873 out}" err="pong timeout"

MConnection flush failed

E[2020-05-31|08:24:54.712] MConnection flush failed                     module=p2p peer=63b249b792048fcdd0b300efccb3499f7f82c32e@34.90.74.63:41682 err="write tcp 45.32.115.133:26656->34.90.74.63:41682: use of closed network connection"
E[2020-05-31|08:24:54.713] Connection failed @ recvRoutine (reading byte) module=p2p peer=68dec7a6cd167711b198234cb2b8fbd8d59f895d@3.235.177.177:26656 conn=MConn{3.235.177.177:26656} err="read tcp 45.32.115.133:57486->3.235.177.177:26656: use of closed network connection"
E[2020-05-31|08:24:54.713] Stopping peer for error                      module=p2p peer="Peer{MConn{3.235.177.177:26656} 68dec7a6cd167711b198234cb2b8fbd8d59f895d out}" err="read tcp 45.32.115.133:57486->3.235.177.177:26656: use of closed network connection"

@erikgrinaker (Contributor)

Once I restart gaiad, it connects correctly, but within 1–2 days the node gets disconnected again.

@EG-easy Can you provide some more information about the networking between the nodes? Do you see the same behavior between two nodes on a local network?

melekes (Contributor) commented Jun 2, 2020

E[2020-05-30|20:58:40.767] MConnection flush failed                     module=p2p peer=87e575d7f765a965f0518fbb41d34f371afa0fda@3.7.204.136:40934 err="write tcp 172.17.0.2:26656->3.7.204.136:40934: write: connection reset by peer"
E[2020-05-30|20:58:40.767] Connection failed @ sendRoutine              module=p2p peer=87e575d7f765a965f0518fbb41d34f371afa0fda@3.7.204.136:40934 conn=MConn{3.7.204.136:40934} err="pong timeout"
E[2020-05-30|20:58:40.767] Stopping peer for error                      module=p2p peer="Peer{MConn{3.7.204.136:40934} 87e575d7f765a965f0518fbb41d34f371afa0fda in}" err="pong timeout"

OR

E[2020-05-31|08:24:54.712] MConnection flush failed                     module=p2p peer=63b249b792048fcdd0b300efccb3499f7f82c32e@34.90.74.63:41682 err="write tcp 45.32.115.133:26656->34.90.74.63:41682: use of closed network connection"
E[2020-05-31|08:24:54.713] Connection failed @ recvRoutine (reading byte) module=p2p peer=68dec7a6cd167711b198234cb2b8fbd8d59f895d@3.235.177.177:26656 conn=MConn{3.235.177.177:26656} err="read tcp 45.32.115.133:57486->3.235.177.177:26656: use of closed network connection"
E[2020-05-31|08:24:54.713] Stopping peer for error                      module=p2p peer="Peer{MConn{3.235.177.177:26656} 68dec7a6cd167711b198234cb2b8fbd8d59f895d out}" err="read tcp 45.32.115.133:57486->3.235.177.177:26656: use of closed network connection"

both signal that one of the nodes connected to this sentry has disconnected (due to a failure, restart, or something else) => it's not an error per se (i.e. there's nothing wrong with this sentry).

Is there something you'd like us to do here to help avoid confusion? Otherwise, I will close the issue.

melekes (Contributor) commented Jun 2, 2020

Will #4937 suffice?

We have some questions for you in 1 and 2. Feel free to answer them here or in the linked issues.


EG-easy commented Jun 2, 2020

@EG-easy Can you provide some more information about the networking between the nodes? Do you see the same behavior between two nodes on a local network?

@erikgrinaker
Thank you for your comment!
No, I see this on the cosmoshub-3 mainnet.

Once this sentry node failed to sync with other peers because of this disconnection, it could not catch up with the latest block height, even though it tried to reconnect many times (dialing failed).

Is this a normal situation on a public network?

And is the only solution to restart the node regularly? Once restarted, the node starts syncing and catches up to the latest height.

@erikgrinaker (Contributor)

Once this sentry node failed to sync with other peers because of this disconnection, it could not catch up with the latest block height, even though it tried to reconnect many times (dialing failed).

Is this a normal situation in a public network?

No, this doesn't sound right. However, it is unclear whether it is a software problem or a network/configuration problem. It sounds like it loses contact with all other peers? Do you see this on multiple sentries? How many peers do the sentries typically communicate with?


EG-easy commented Jun 3, 2020

No, this doesn't sound right. However, it is unclear whether it is a software problem or a network/configuration problem. It sounds like it loses contact with all other peers? Do you see this on multiple sentries? How many peers do the sentries typically communicate with?

@erikgrinaker
I checked the logs carefully and found the error below.

E[2020-06-02|21:16:47.997] CONSENSUS FAILURE!!!                         module=consensus err="open /root/.gaiad/data/application.db/76710877.ldb: too many open files" stack="goroutine 440 [running]:\nruntime/debug.Stack(0xc009bc7060, 0x103d880, 0xc00fc51bf0)\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x9d\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).receiveRoutine.func2(0xc013340000, 0x136b190)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.32.10/consensus/state.go:611 +0x57\npanic(0x103d880, 0xc00fc51bf0)\n\t/usr/local/go/src/runtime/panic.go:969 +0x166\ngithub.com/tendermint/tm-db.(*GoLevelDB).Get(0xc00052e1d8, 0xc01417ddd0, 0x2e, 0x30, 0x2e, 0xc01417ddd0, 0xd)\n\t/root/go/pkg/mod/github.com/tendermint/tm-db@v0.2.0/go_level_db.go:51 +0x133\ngithub.com/tendermint/tm-db.(*prefixDB).Get(0xc000dc3ef0, 0xc01417dda0, 0x21, 0x21, 0x0, 0x0, 0x0)\n\t/root/go/pkg/mod/github.com/tendermint/tm-db@v0.2.0/prefix_db.go:60 +0x16b\ngithub.com/tendermint/iavl.(*nodeDB).GetNode(0xc000dc68c0, 0xc010ffa000, 0x20, 0x20, 0x0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/nodedb.go:76 +0x2ae\ngithub.com/tendermint/iavl.(*Node).getLeftNode(...)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:349\ngithub.com/tendermint/iavl.(*Node).get(0xc0028bae70, 0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x1d, 0x0, 0x70, 0xc000dc68c0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:171 +0x211\ngithub.com/tendermint/iavl.(*Node).get(0xc0028bad10, 0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x1d, 0x1, 0x70, 0xc000dc68c0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:174 +0x269\ngithub.com/tendermint/iavl.(*Node).get(0xc0028babb0, 0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x1d, 0xffffffffffffffff, 0xc000dc68c0, 0xc000dc68c0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:174 +0x269\ngithub.com/tendermint/iavl.(*Node).get(0xc0028bab00, 0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x1d, 0x1, 0xc000dc68c0, 
0xc000dc68c0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:171 +0x1a6\ngithub.com/tendermint/iavl.(*Node).get(0xc009c871e0, 0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x1d, 0x1, 0xc000dc68c0, 0xc000dc68c0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:174 +0x269\ngithub.com/tendermint/iavl.(*Node).get(0xc00b66fad0, 0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x1d, 0x1, 0xc000dc68c0, 0xc000dc68c0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:174 +0x269\ngithub.com/tendermint/iavl.(*Node).get(0xc001dfcb00, 0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x1d, 0x1, 0xc000dc68c0, 0xc000dc68c0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:174 +0x269\ngithub.com/tendermint/iavl.(*Node).get(0xc002baf550, 0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x1d, 0xffffffffffffffff, 0xc000dc68c0, 0xc000dc68c0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:174 +0x269\ngithub.com/tendermint/iavl.(*Node).get(0xc002baf290, 0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x1d, 0xffffffffffffffff, 0xc000dc68c0, 0xc000dc68c0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:171 +0x1a6\ngithub.com/tendermint/iavl.(*Node).get(0xc002baef20, 0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x1d, 0x1, 0xc000dc68c0, 0xc000dc68c0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:171 +0x1a6\ngithub.com/tendermint/iavl.(*Node).get(0xc00cdfa4d0, 0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x1d, 0xffffffffffffffff, 0xc000dc68c0, 0xc000dc68c0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:174 +0x269\ngithub.com/tendermint/iavl.(*Node).get(0xc00cdfa160, 0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x1d, 0x1, 0xc000dc68c0, 0xc000dc68c0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:171 +0x1a6\ngithub.com/tendermint/iavl.(*Node).get(0xc00cdfa0b0, 0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x1d, 0x1, 0xc000dc68c0, 0xc000dc68c0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:174 
+0x269\ngithub.com/tendermint/iavl.(*Node).get(0xc00b797ef0, 0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x1d, 0x1, 0xc000dc68c0, 0xc000dc68c0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:174 +0x269\ngithub.com/tendermint/iavl.(*Node).get(0xc00b797ad0, 0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x1d, 0x1, 0xc000dc68c0, 0xc000dc68c0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:174 +0x269\ngithub.com/tendermint/iavl.(*Node).get(0xc009d30d10, 0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x1d, 0xffffffffffffffff, 0xc000dc68c0, 0xc000dc68c0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:174 +0x269\ngithub.com/tendermint/iavl.(*Node).get(0xc009d30c60, 0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x1d, 0x1, 0x0, 0xc000dc68c0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:171 +0x1a6\ngithub.com/tendermint/iavl.(*Node).get(0xc00b62f6b0, 0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x1d, 0xffffffffffffffff, 0x20, 0xc000dc68c0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:174 +0x269\ngithub.com/tendermint/iavl.(*Node).get(0xc00b62f600, 0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x1d, 0xffffffffffffffff, 0x0, 0x0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:171 +0x1a6\ngithub.com/tendermint/iavl.(*Node).get(0xc00d75e420, 0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x20, 0x41837b, 0xc009bc7c30, 0x16abc2ce)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/node.go:171 +0x1a6\ngithub.com/tendermint/iavl.(*ImmutableTree).Get(0xc00f321d20, 0xc01046fcc0, 0x1d, 0x20, 0x0, 0x0, 0x0, 0x0)\n\t/root/go/pkg/mod/github.com/tendermint/iavl@v0.12.4/immutable_tree.go:140 +0x5a\ngithub.com/cosmos/cosmos-sdk/store/iavl.(*Store).Get(0xc00060dce0, 0xc01046fcc0, 0x1d, 0x20, 0x1fb8b80, 0x0, 0x20)\n\t/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.37.9/store/iavl/store.go:171 +0x51\ngithub.com/cosmos/cosmos-sdk/store/cachekv.(*Store).Get(0xc0107c2480, 0xc01046fcc0, 0x1d, 0x20, 0x0, 0x0, 
0x0)\n\t/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.37.9/store/cachekv/store.go:60 +0x146\ngithub.com/cosmos/cosmos-sdk/store/gaskv.(*Store).Get(0xc0139afb00, 0xc01046fcc0, 0x1d, 0x20, 0xc01046fcc0, 0x1d, 0x20)\n\t/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.37.9/store/gaskv/store.go:38 +0x8a\ngithub.com/cosmos/cosmos-sdk/x/slashing.Keeper.getValidatorMissedBlockBitArray(0x153d2a0, 0xc000dc8ab0, 0xc0000de0e0, 0x1557d80, 0xc0000bbd90, 0xc0000de0e0, 0x153d2a0, 0xc000dc8ad0, 0x153d2e0, 0xc000dc8af0, ...)\n\t/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.37.9/x/slashing/signing_info.go:48 +0x122\ngithub.com/cosmos/cosmos-sdk/x/slashing.Keeper.HandleValidatorSignature(0x153d2a0, 0xc000dc8ab0, 0xc0000de0e0, 0x1557d80, 0xc0000bbd90, 0xc0000de0e0, 0x153d2a0, 0xc000dc8ad0, 0x153d2e0, 0xc000dc8af0, ...)\n\t/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.37.9/x/slashing/keeper.go:171 +0x40a\ngithub.com/cosmos/cosmos-sdk/x/slashing.BeginBlocker(0x154c120, 0xc000034058, 0x155b680, 0xc009ac4d40, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, ...)\n\t/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.37.9/x/slashing/abci.go:18 +0x16b\ngithub.com/cosmos/cosmos-sdk/x/slashing.AppModule.BeginBlock(...)\n\t/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.37.9/x/slashing/module.go:131\ngithub.com/cosmos/cosmos-sdk/types/module.(*Manager).BeginBlock(0xc0000defc0, 0x154c120, 0xc000034058, 0x155b680, 0xc009ac4d40, 0xa, 0x0, 0x0, 0x0, 0x0, ...)\n\t/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.37.9/types/module/module.go:290 +0x1ca\ngithub.com/cosmos/gaia/app.(*GaiaApp).BeginBlocker(...)\n\t/root/gaia/app/app.go:234\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).BeginBlock(0xc000520900, 0xc00ef329a0, 0x20, 0x20, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, ...)\n\t/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.37.9/baseapp/baseapp.go:675 +0x469\ngithub.com/tendermint/tendermint/abci/client.(*localClient).BeginBlockSync(0xc000601da0, 0xc00ef329a0, 0x20, 0x20, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, 
...)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.32.10/abci/client/local_client.go:231 +0xf8\ngithub.com/tendermint/tendermint/proxy.(*appConnConsensus).BeginBlockSync(0xc002949140, 0xc00ef329a0, 0x20, 0x20, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, ...)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.32.10/proxy/app_conn.go:69 +0x6b\ngithub.com/tendermint/tendermint/state.execBlockOnProxyApp(0x154cae0, 0xc013303b40, 0x1555c20, 0xc002949140, 0xc00000cd20, 0x155da80, 0xc0006081a8, 0x6, 0xc012934bb0, 0xb)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.32.10/state/execution.go:280 +0x3dd\ngithub.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock(0xc01286e850, 0xa, 0x0, 0xc012934b90, 0x6, 0xc012934bb0, 0xb, 0x205445, 0xb5b2d, 0xc00b346320, ...)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.32.10/state/execution.go:131 +0x17a\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).finalizeCommit(0xc013340000, 0x205446)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.32.10/consensus/state.go:1426 +0x928\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).tryFinalizeCommit(0xc013340000, 0x205446)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.32.10/consensus/state.go:1348 +0x383\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).enterCommit.func1(0xc013340000, 0x0, 0x205446)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.32.10/consensus/state.go:1283 +0x90\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).enterCommit(0xc013340000, 0x205446, 0x0)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.32.10/consensus/state.go:1320 +0x61e\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).addVote(0xc013340000, 0xc00957cdc0, 0xc0119cb080, 0x28, 0xfb7980, 0xc00b8579e8, 0x136c450)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.32.10/consensus/state.go:1793 +0xa39\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).tryAddVote(0xc013340000, 
0xc00957cdc0, 0xc0119cb080, 0x28, 0xc00975d200, 0xc00975d320, 0x30)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.32.10/consensus/state.go:1624 +0x59\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).handleMsg(0xc013340000, 0x152dfe0, 0xc0159c68c0, 0xc0119cb080, 0x28)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.32.10/consensus/state.go:707 +0x525\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).receiveRoutine(0xc013340000, 0x0)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.32.10/consensus/state.go:642 +0x709\ncreated by github.com/tendermint/tendermint/consensus.(*ConsensusState).OnStart\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.32.10/consensus/state.go:334 +0x13a\n"

After that log entry, I see the timeout errors shown above. So now I understand: once "too many open files" occurs, the node cannot connect with other peers, and if I restart the node, it starts connecting again.

I'm trying to raise the open-file limit; maybe that will solve my problem. Thanks!

melekes (Contributor) commented Jun 3, 2020

We should add the following to our docs: "Tendermint may open a lot of files during its operation, so it is recommended to increase the OS max open files limit. On Linux, you can do so by executing ulimit -n 4096. On Windows, there's no upper limit, so you're fine."

@erikgrinaker erikgrinaker self-assigned this Jun 3, 2020
@erikgrinaker (Contributor)

once "open too many files" occurs, then the node can not connect with other peers

This is correct. On UNIX-based systems, network connections count against the open file descriptor limit ("everything is a file"), so once the limit is reached no new connections can be opened. I'll update the docs.
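For reference, a sketch of checking and raising the limit on Linux; the exact persistence mechanism depends on your setup, and the systemd LimitNOFILE setting mentioned below is an assumption about how the node is managed:

```shell
# Show the current soft and hard open-file limits
ulimit -Sn
ulimit -Hn

# Raise the soft limit for this shell session, capped at the hard limit
hard=$(ulimit -Hn)
if [ "$hard" = "unlimited" ] || [ "$hard" -ge 4096 ]; then
  ulimit -n 4096
else
  ulimit -n "$hard"
fi

# Verify the new soft limit
ulimit -Sn
```

Note that ulimit only affects the current shell and its children; for a node managed by systemd, set LimitNOFILE=4096 (or higher) in the service unit instead.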

@erikgrinaker (Contributor)

I'll close this for now, since I don't believe there's an actual problem here outside of the excessive logging (#4937). Let us know if you find otherwise.

mergify bot pushed a commit that referenced this issue Jun 3, 2020
See e.g. #4925 (comment) for people getting hit by this. Should we document it elsewhere as well?
4 participants