Non-validating nodes getting "MConnection flush failed" "Connection failed @ sendRoutine" in P2P module #4925
Comments
I have the same error while using gaia v2.0.8.
@njmurarka pong timeout
MConnection flush failed
@EG-easy Can you provide some more information about the networking between the nodes? Do you see the same behavior between two nodes on a local network?
OR
both signal that one of the nodes connected to this sentry has disconnected (due to failure, restart, or something else), so it's not an error per se (i.e., there's nothing wrong with this sentry). Is there something you would like us to do here that would help avoid confusion? Otherwise, I will close the issue.
@erikgrinaker Once this sentry node failed to sync with other peers because of this disconnection, the node tried to reconnect many times without success. Is this a normal situation in a public network? And is the only solution to restart the node regularly?
No, this doesn't sound right. However, it is unclear if it is a software problem or a network/configuration problem. It sounds like it loses contact with all other peers? Do you see this on multiple sentries? How many peers do the sentries typically seem to communicate with?
@erikgrinaker
After the log, I can see the timeout error like the above, so now I understand that once "too many open files" occurs, the node cannot connect with other peers, and if I restart the node, it starts connecting again. I'm trying to raise the open-file limit; maybe that will solve my problem. Thanks!
we should add the following to our docs: "Tendermint may open a lot of files during its operation. That's why it's recommended to increase the OS max open files limit. On Linux, you can do so by executing
This is correct. On UNIX-based systems, network connections are considered open file sockets ("everything is a file"), so once the limit is reached no new connections can be opened. I'll update the docs.
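The advice above can be sketched as a shell session. This is a generic Linux/UNIX illustration, not something prescribed in this thread; the limit values and the `limits.conf` / systemd options mentioned in the comments are common conventions:

```shell
# Inspect the current per-process open-file limits for this shell.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "soft limit: $soft, hard limit: $hard"

# Raise the soft limit up to the hard limit for this session only.
if [ "$hard" != "unlimited" ]; then
  ulimit -Sn "$hard"
fi
echo "new soft limit: $(ulimit -Sn)"

# For a persistent change on Linux, typical options are:
#   /etc/security/limits.conf:   *  soft  nofile  65536
#   a systemd service unit:      LimitNOFILE=65536
```

Note that a session-level `ulimit` change does not survive a restart, which is why the persistent options matter for a long-running node.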
I'll close this for now, since I don't believe there's an actual problem here outside of the excessive logging (#4937). Let us know if you find otherwise.
See e.g. #4925 (comment) for people getting hit by this. Should we document it elsewhere as well?
Note:
I filed #4922, #4924, and #4925 separately, but they might well all be related. Apologies in advance if they turn out to be the same issue.
Tendermint version:
Tendermint Core Semantic Version: 0.33.3
P2P Protocol Version: 7
Block Protocol Version: 10
ABCI app:
Cosmos SDK Version: v0.38.3
Big Dipper Explorer URL:
Instructions to setup a similar node (I'd suggest just setting up a sentry):
Access to genesis file for chain:
Sample command to get node info:
Discord channel invite (in case you want to live chat with me... I am Neeraj one of the admins):
Environment:
Using the Cosmos SDK. Otherwise, not sure what else to say here.
We are running a testnet with our CRUD database as one of the application modules, on Cosmos.
We currently (as of filing this issue) have 5 "sentries" and 3 validators. To be clear, the sentries have no voting power and are the only peers the validators talk to (the validators can also talk to each other). Furthermore, the validators are IP-firewalled so they can only talk to the sentries and other validators. The sentries themselves keep the validator node IDs private.
Sentry hostnames:
I am not listing the validator hostnames, since they are inaccessible (due to the firewall) anyway.
The validators listen only on 26656, to validators and sentries. The sentries listen on 26656 and 26657, and each also runs the Cosmos REST server, listening on 1317.
We have opened our network to the public. Members of the public have set up sentries and validators of their own, and are expected to use our five sentries as their P2P peers in config.toml.
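The sentry/validator split described above maps onto Tendermint's config.toml P2P settings. A minimal sketch of a sentry's configuration, assuming a stock 0.33.x config (node IDs and addresses are placeholders, and the REST server on 1317 is configured separately via the Cosmos SDK, not here):

```toml
# config.toml on a sentry (illustrative values only)
[p2p]
laddr = "tcp://0.0.0.0:26656"             # public P2P listener
pex = true                                # sentries gossip peer addresses
persistent_peers = "<validator-node-id>@<validator-ip>:26656"
private_peer_ids = "<validator-node-id>"  # keep validator IDs out of gossip

[rpc]
laddr = "tcp://0.0.0.0:26657"             # RPC listener (sentries only)
```

The `private_peer_ids` setting is what keeps the validator node IDs private, as described above.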
What happened:
We are occasionally getting messages as follows on the sentries.
Note that these messages only happen on our sentries. Our validators never show these messages.
Also note I have removed lines that are considered "clean" from the above output, for brevity. The same sentries showing this "bad" output will also output the "clean" output lines.
This could be related to #4922 and actually, you will see that the last two lines in the above output are the same output from #4922. I filed these separately since they seem related but still different issues. Could also be related to #4924, that I also just filed.
What you expected to happen:
We expect "clean" output, as so:
(The fact that invalidTxs is non-zero is the subject of another investigation)
Have you tried the latest version:
Not sure. I think so, although looking at the Tendermint GitHub, I see there are two minor versions available that are newer than what we have.
How to reproduce it:
Difficult to answer this. We never saw this when we were running a "small" chain with, say, fewer than 20 nodes (i.e., we were getting "clean" output). Once we opened to the public and allowed anyone to join, we started to see this happen constantly.
I suppose if you set up a network like ours and used the same code, you might reproduce it. The good news is it is happening right now and readily happens.
Logs:
Listed above.
Config:
No specific changes made to Tendermint.
node command runtime flags:
This is all running from within our daemon that was built with the Cosmos SDK.
/dump_consensus_state output (for consensus bugs):
Not sure how to do this.
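For reference, /dump_consensus_state is served by the node's Tendermint RPC endpoint (port 26657 by default), so it can be fetched with curl. The local address here is an assumption; substitute a sentry's RPC address as appropriate:

```shell
# Query the consensus state dump from a locally running node's RPC.
# Falls back to a short notice if nothing is listening on 26657.
out=$(curl -s --max-time 2 http://localhost:26657/dump_consensus_state \
  || echo "no node listening on localhost:26657")
echo "$out"
```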
Anything else we need to know:
Most details given above.
I did some searching ahead of time to see if I could resolve this myself. I saw some issues related to it but they were closed, yet I am seeing this now.
I have also filed some other issues that might be related to this one and have done my best to cross-reference the related issues.