Non-validating nodes getting "MConnection flush failed" "Connection failed @ sendRoutine" in P2P module #4925
Comments
I have the same error while using gaia v2.0.8.
@njmurarka pong timeout
MConnection flush failed
@EG-easy Can you provide some more information about the networking between the nodes? Do you see the same behavior between two nodes on a local network?
OR
both signal that one of the nodes connected to this sentry has disconnected (due to failure, restart, or something else), so it's not an error per se (i.e., there's nothing wrong with this sentry). Is there something you would like us to do here that would help avoid confusion? Otherwise, I will close the issue.
@erikgrinaker Once this sentry node failed to sync with other peers because of this disconnection, the node tried to reconnect many times without success. Is this a normal situation in a public network? And is the only solution to restart the node regularly?
No, this doesn't sound right. However, it is unclear if it is a software problem or a network/configuration problem. It sounds like it loses contact with all other peers? Do you see this on multiple sentries? How many peers do the sentries typically seem to communicate with?
@erikgrinaker
After the log, I can see the timeout error like the above, so now I understand that once "too many open files" occurs, the node cannot connect with other peers, and if I restart the node, it starts connecting again. I'm trying to raise the open-file limit; maybe that will solve my problem. Thanks!
we should add the following to our docs: "Tendermint may open a lot of files during its operation. That's why it's recommended to increase the OS max open files limit. On Linux, you can do so by executing
This is correct. On UNIX-based systems, network connections are considered open file sockets ("everything is a file"), so once the limit is reached no new connections can be opened. I'll update the docs.
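The advice above can be sketched as a shell session. This is a generic Linux/UNIX illustration, not something prescribed in this thread; the limit values and the `limits.conf` / systemd options mentioned in the comments are common conventions:

```shell
# Inspect the current per-process open-file limits for this shell.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "soft limit: $soft, hard limit: $hard"

# Raise the soft limit up to the hard limit for this session only.
if [ "$hard" != "unlimited" ]; then
  ulimit -Sn "$hard"
fi
echo "new soft limit: $(ulimit -Sn)"

# For a persistent change on Linux, typical options are:
#   /etc/security/limits.conf:   *  soft  nofile  65536
#   a systemd service unit:      LimitNOFILE=65536
```

Note that a session-level `ulimit` change does not survive a restart, which is why the persistent options matter for a long-running node.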
I'll close this for now, since I don't believe there's an actual problem here outside of the excessive logging (#4937). Let us know if you find otherwise.
See e.g. #4925 (comment) for people getting hit by this. Should we document it elsewhere as well?
Note:
I filed #4922, #4924, and #4925 separately, but they might well all be related. Apologies in advance if they turn out to be the same issue.
Tendermint version:
Tendermint Core Semantic Version: 0.33.3
P2P Protocol Version: 7
Block Protocol Version: 10
ABCI app:
Cosmos SDK Version: v0.38.3
Big Dipper Explorer URL:
Instructions to setup a similar node (I'd suggest just setting up a sentry):
Access to genesis file for chain:
Sample command to get node info:
Discord channel invite (in case you want to live chat with me... I am Neeraj one of the admins):
Environment:
Using the Cosmos SDK. Otherwise, not sure what else to say here.
We are running a testnet with our CRUD database as one of the application modules, on Cosmos.
We currently (as of filing this issue) have 5 "sentries" and 3 validators. To be clear, the sentries have no voting power and are the only peers the validators talk to (the validators can also talk to each other). Furthermore, the validators are IP-firewalled so they can only talk to the sentries and other validators. The sentries themselves keep the validator node IDs private.
Sentry hostnames:
I am not listing the validator hostnames, since they are inaccessible (due to the firewall) anyway.
The validators listen only on 26656, to validators and sentries. The sentries listen on 26656 and 26657, and each also runs the Cosmos REST server, listening on 1317.
We have opened our network to the public. Members of the public have set up sentries and validators of their own, and are expected to use our five sentries as their P2P peers in config.toml.
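The sentry/validator split described above maps onto Tendermint's config.toml P2P settings. A minimal sketch of a sentry's configuration, assuming a stock 0.33.x config (node IDs and addresses are placeholders, and the REST server on 1317 is configured separately via the Cosmos SDK, not here):

```toml
# config.toml on a sentry (illustrative values only)
[p2p]
laddr = "tcp://0.0.0.0:26656"             # public P2P listener
pex = true                                # sentries gossip peer addresses
persistent_peers = "<validator-node-id>@<validator-ip>:26656"
private_peer_ids = "<validator-node-id>"  # keep validator IDs out of gossip

[rpc]
laddr = "tcp://0.0.0.0:26657"             # RPC listener (sentries only)
```

The `private_peer_ids` setting is what keeps the validator node IDs private, as described above.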
What happened:
We are occasionally getting messages as follows on the sentries.
Note that these messages only happen on our sentries. Our validators never show these messages.
Also note I have removed lines that are considered "clean" from the above output, for brevity. The same sentries showing this "bad" output will also output the "clean" output lines.
This could be related to #4922 and actually, you will see that the last two lines in the above output are the same output from #4922. I filed these separately since they seem related but still different issues. Could also be related to #4924, that I also just filed.
What you expected to happen:
We expect "clean" output, as so:
(The fact that invalidTxs is non-zero is the subject of another investigation)
Have you tried the latest version:
Not sure. I think so, although looking at the Tendermint GitHub, I see there are two minor versions available that are newer than what we have.
How to reproduce it:
Difficult to answer this. We never saw this when we were running a "small" chain with, say, fewer than 20 nodes (i.e., we were getting "clean" output). Once we opened to the public and allowed anyone to join, we started to see this happen constantly.
I suppose if you set up a network like ours and used the same code, you might reproduce it. The good news is it is happening right now and readily happens.
Logs:
Listed above.
Config:
No specific changes made to Tendermint.
node command runtime flags:
This is all running from within our daemon that was built with the Cosmos SDK.
/dump_consensus_state output (for consensus bugs):
Not sure how to do this.
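For reference, /dump_consensus_state is served by the node's Tendermint RPC endpoint (port 26657 by default), so it can be fetched with curl. The local address here is an assumption; substitute a sentry's RPC address as appropriate:

```shell
# Query the consensus state dump from a locally running node's RPC.
# Falls back to a short notice if nothing is listening on 26657.
out=$(curl -s --max-time 2 http://localhost:26657/dump_consensus_state \
  || echo "no node listening on localhost:26657")
echo "$out"
```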
Anything else we need to know:
Most details given above.
I did some searching ahead of time to see if I could resolve this myself. I saw some issues related to it but they were closed, yet I am seeing this now.
I have also filed some other issues that might be related to this one and have done my best to cross-reference the related issues.