p2p: persistent - redial if first dial fails #1404
Not sure if handshake failure should be handled too but my gut feeling says this is mainly for dial failures.
d5bdd2a to dc99bd9
Codecov Report
@@            Coverage Diff             @@
##           develop    #1404     +/-   ##
==========================================
- Coverage    62.22%   61.51%  -0.72%
==========================================
  Files          114      114
  Lines        11093    11004     -89
==========================================
- Hits          6903     6769    -134
- Misses        3547     3589     +42
- Partials       643      646      +3
dc99bd9 to 6516a3f
6516a3f to 5ef639f
p2p/peer.go
Outdated
@@ -87,6 +87,8 @@ func newPeer(pc peerConn, nodeInfo NodeInfo,
type PeerConfig struct {
	AuthEnc bool `mapstructure:"auth_enc"` // authenticated encryption

	Dial func(addr *NetAddress, config *PeerConfig) (net.Conn, error)
I'd rather not have functions inside config.
Maybe we can have a badAddr, which returns an error on DialTimeout. This would require introducing a NetAddress interface.
I think it would be better to set a flag on the remotePeer to allow it to reject the connection in its accept routine. We should try to avoid putting methods in config structs like this, and I'm not sure we want NetAddress to be an interface (though perhaps I could be convinced; it just feels like not the right way to solve the problem of writing these tests, even if it's a good idea after all).
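For illustration, here is a rough sketch of the "reject flag" idea from the comment above. It is not the PR's actual remotePeer: `testRemotePeer`, `SetReject`, and `probe` are hypothetical names, and the one-byte greeting merely stands in for a real handshake. One thing the sketch makes visible: closing a connection right after `Accept` does not make the TCP dial itself fail, since the kernel completes the connection first; the failure shows up on the first read, i.e. at handshake time.

```go
package main

import (
	"net"
	"sync/atomic"
	"time"
)

// testRemotePeer is a hypothetical test helper, not the type used in the PR.
type testRemotePeer struct {
	ln     net.Listener
	reject int32 // 1 means: close every accepted connection immediately
}

// startTestRemotePeer listens on an ephemeral loopback port and starts accepting.
func startTestRemotePeer() (*testRemotePeer, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	rp := &testRemotePeer{ln: ln}
	go rp.accept()
	return rp, nil
}

// SetReject toggles the reject flag read by the accept routine.
func (rp *testRemotePeer) SetReject(on bool) {
	var v int32
	if on {
		v = 1
	}
	atomic.StoreInt32(&rp.reject, v)
}

func (rp *testRemotePeer) Addr() string { return rp.ln.Addr().String() }

func (rp *testRemotePeer) Stop() { rp.ln.Close() }

func (rp *testRemotePeer) accept() {
	for {
		conn, err := rp.ln.Accept()
		if err != nil {
			return // listener was closed
		}
		if atomic.LoadInt32(&rp.reject) == 1 {
			conn.Close() // reject: drop the connection before any handshake
			continue
		}
		go func(c net.Conn) {
			defer c.Close()
			c.Write([]byte{0x01}) // stand-in for the first handshake byte
		}(conn)
	}
}

// probe dials addr and tries to read the one-byte greeting.
// It returns nil only if the greeting arrives.
func probe(addr string) error {
	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		return err
	}
	defer conn.Close()
	conn.SetReadDeadline(time.Now().Add(2 * time.Second))
	buf := make([]byte, 1)
	_, err = conn.Read(buf)
	return err
}
```

With the flag off, `probe(rp.Addr())` should return nil; after `rp.SetReject(true)` it should return a read error. A test that needs a genuine dial failure would probably have to stop the listener instead.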
Maybe we can have the Dial func somewhere else? It seems more convenient to have configurable dial behavior for testing; e.g., it could be used to simulate slow connections or timeouts.
Maybe Switch?
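To make the suggestion concrete, a minimal sketch of a dial hook living on a Switch-like type rather than in PeerConfig might look like the following. Everything here is an assumption for illustration: `testSwitch`, `dialFunc`, `failingDial`, and `errSimulatedDial` are hypothetical names, and addresses are plain strings rather than `*NetAddress`.

```go
package main

import (
	"errors"
	"net"
	"time"
)

// dialFunc is a hypothetical hook type; it does not exist in the PR.
type dialFunc func(addr string, timeout time.Duration) (net.Conn, error)

// testSwitch is a stand-in for the real Switch, reduced to the dial hook.
type testSwitch struct {
	dial dialFunc
}

// newTestSwitch wires in a real TCP dial by default.
func newTestSwitch() *testSwitch {
	return &testSwitch{
		dial: func(addr string, timeout time.Duration) (net.Conn, error) {
			return net.DialTimeout("tcp", addr, timeout)
		},
	}
}

// errSimulatedDial is returned by failingDial so tests can recognise it.
var errSimulatedDial = errors.New("simulated dial failure")

// failingDial never touches the network and always fails.
func failingDial(addr string, timeout time.Duration) (net.Conn, error) {
	return nil, errSimulatedDial
}

// connect dials through whatever hook is currently installed.
func (sw *testSwitch) connect(addr string) (net.Conn, error) {
	return sw.dial(addr, 3*time.Second)
}
```

A test would then set `sw.dial = failingDial` before calling `sw.connect(...)`; a hook that sleeps before returning could simulate slow connections in the same way.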
Thanks for the PR Javed!!
p2p/switch.go
Outdated
@@ -470,6 +476,7 @@ func (sw *Switch) addOutboundPeerWithConfig(addr *NetAddress, config *PeerConfig
	peerConn, err := newOutboundPeerConn(addr, config, persistent, sw.nodeKey.PrivKey)
	if err != nil {
		sw.Logger.Error("Failed to dial peer", "address", addr, "err", err)
		go sw.reconnectToPeer(addr)
We only want to do this if persistent == true.
Done.
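The requested change amounts to guarding the redial with the persistent flag. A simplified sketch, not the merged code: `outbound`, `addPeer`, and `errDialRefused` are hypothetical, the dial and reconnect steps are injected as plain functions, and addresses are strings instead of `*NetAddress`.

```go
package main

import "errors"

// errDialRefused is a made-up error used to simulate a failed first dial.
var errDialRefused = errors.New("simulated dial failure")

// outbound is a hypothetical, stripped-down stand-in for the Switch.
type outbound struct {
	dial      func(addr string) error // first dial attempt
	reconnect func(addr string)       // stand-in for sw.reconnectToPeer
}

// addPeer dials addr once. If the dial fails and the peer is persistent,
// it kicks off the reconnect routine in the background; non-persistent
// peers are simply reported as failed.
func (o *outbound) addPeer(addr string, persistent bool) error {
	err := o.dial(addr)
	if err != nil && persistent {
		go o.reconnect(addr)
	}
	return err
}
```

With a `dial` that always returns `errDialRefused`, calling `addPeer(addr, false)` should never invoke `reconnect`, while `addPeer(addr, true)` should invoke it exactly once.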
Yes, we want any kind of connection failure on a persistent peer, dial or otherwise, to cause a reconnect attempt.
7450cfd to 54adb79
Trying a fix using a flag instead of func.
Why is this PR still blocked?
@kidinamoto01 this is not an ideal fix; I'm not sure it's even a complete fix, since it doesn't handle handshake-related errors (see comments above). I'll be closing this PR if that's OK; I don't want to keep it lingering.
@tuxcanfly Thanks for the update, it's very helpful. Hope we can get this fixed soon O(∩_∩)O~
Hey @tuxcanfly, sorry for the delay here, but I think you're right - we only want the reconnect on the dial, not on the handshake.
On second thought, the handshake includes a timeout, which we might want to retry on, and even if we fail to connect due to compatibility issues, the remote persistent node might get restarted with the correct software, and then we can connect to it. So I think unless we do more work on the typing of errors, it's fine to just always run reconnectToPeer. Also note that in reconnectToPeer itself, we keep trying no matter what the error. I'll fix this up and merge, thanks!
Nope, what you've got is fine and straightforward. For now we'll only run the reconnect if we fail to dial the persistent peers. If we lose a connection, though, we'll keep retrying even if, e.g., the handshake fails.
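For readers following along, the "keep trying no matter what the error" behaviour described for reconnectToPeer can be sketched roughly as below. This is an illustration under assumptions, not the actual implementation: `retryUntilConnected`, its parameters, and the two simulated errors are hypothetical, and the real attempt counts and intervals are not stated in this thread.

```go
package main

import (
	"errors"
	"time"
)

// Simulated failures of different kinds; the loop treats them identically.
var (
	errSimulatedDial      = errors.New("simulated dial failure")
	errSimulatedHandshake = errors.New("simulated handshake failure")
)

// retryUntilConnected is a hypothetical stand-in for the retry loop.
// It calls connect up to maxAttempts times, sleeping interval between
// attempts, and returns the number of attempts made plus the last error.
func retryUntilConnected(connect func() error, maxAttempts int, interval time.Duration) (int, error) {
	if maxAttempts < 1 {
		return 0, errors.New("maxAttempts must be at least 1")
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = connect(); err == nil {
			return attempt, nil
		}
		// The error type is deliberately not inspected: dial failures,
		// timeouts, and handshake failures are all retried the same way.
		if attempt < maxAttempts {
			time.Sleep(interval)
		}
	}
	return maxAttempts, err
}
```

Distinguishing retryable from permanent errors, as mentioned above, would need typed errors and a check inside the loop; this sketch intentionally leaves that out.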
Merged plus some minor changes: 0cbbb61
Awesome! 💯
Fixes #1401