Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

p2p: persistent - redial if first dial fails #1404

Merged
merged 3 commits into from Apr 28, 2018

Conversation

tuxcanfly
Copy link
Contributor

@tuxcanfly tuxcanfly commented Apr 2, 2018

Fixes #1401

  • Updated all relevant documentation in docs
  • Updated all code comments where relevant
  • Wrote tests
  • Updated CHANGELOG.md

@tuxcanfly
Copy link
Contributor Author

Not sure if handshake failure should be handled too but my gut feeling says this is mainly for dial failures.

@tuxcanfly
Copy link
Contributor Author

tuxcanfly commented Apr 2, 2018

  • The tests are racy, a proper fix would probably require injecting the dialer to the functions.

@codecov-io
Copy link

codecov-io commented Apr 2, 2018

Codecov Report

Merging #1404 into develop will decrease coverage by 0.71%.
The diff coverage is 61.11%.

@@             Coverage Diff             @@
##           develop    #1404      +/-   ##
===========================================
- Coverage    62.22%   61.51%   -0.72%     
===========================================
  Files          114      114              
  Lines        11093    11004      -89     
===========================================
- Hits          6903     6769     -134     
- Misses        3547     3589      +42     
- Partials       643      646       +3
Impacted Files Coverage Δ
p2p/peer.go 65.03% <100%> (+0.74%) ⬆️
p2p/switch.go 51.18% <56.25%> (-2%) ⬇️
types/event_bus.go 0% <0%> (-36.24%) ⬇️
types/events.go 11.11% <0%> (-16.67%) ⬇️
p2p/pex/addrbook.go 63.82% <0%> (-5.27%) ⬇️
p2p/netaddress.go 69.23% <0%> (-2.47%) ⬇️
consensus/reactor.go 71.74% <0%> (-2.45%) ⬇️
types/part_set.go 63.23% <0%> (-1.16%) ⬇️
state/txindex/kv/kv.go 73.76% <0%> (-1.02%) ⬇️
... and 18 more

@melekes melekes changed the base branch from master to develop April 2, 2018 08:30
p2p/peer.go Outdated
@@ -87,6 +87,8 @@ func newPeer(pc peerConn, nodeInfo NodeInfo,
type PeerConfig struct {
AuthEnc bool `mapstructure:"auth_enc"` // authenticated encryption

Dial func(addr *NetAddress, config *PeerConfig) (net.Conn, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not have functions inside config.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can Have badAddr, which returns error on DialTimeout. This will require to inroduce NetAddress interface.

Copy link
Contributor

@ebuchman ebuchman Apr 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think better would be to set a flag on the remotePeer to allow it to reject the the connection in its accept routine. We should try to avoid putting methods in config structs like this, and I'm not sure we want the NetAddress to be an interface (though perhaps I could be convinced, just feels like not the right way to solve the problem of writing these tests, even if its after all a good idea).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can have a the Dial func somewhere else? It seems more convenient to have a configurable dial behavior for testing, for e.g it could be used for slow connections / timeouts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe Switch?

Copy link
Contributor

@ebuchman ebuchman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR Javed!!

p2p/switch.go Outdated
@@ -470,6 +476,7 @@ func (sw *Switch) addOutboundPeerWithConfig(addr *NetAddress, config *PeerConfig
peerConn, err := newOutboundPeerConn(addr, config, persistent, sw.nodeKey.PrivKey)
if err != nil {
sw.Logger.Error("Failed to dial peer", "address", addr, "err", err)
go sw.reconnectToPeer(addr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we only want to do this if the persistent == true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

p2p/peer.go Outdated
@@ -87,6 +87,8 @@ func newPeer(pc peerConn, nodeInfo NodeInfo,
type PeerConfig struct {
AuthEnc bool `mapstructure:"auth_enc"` // authenticated encryption

Dial func(addr *NetAddress, config *PeerConfig) (net.Conn, error)
Copy link
Contributor

@ebuchman ebuchman Apr 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think better would be to set a flag on the remotePeer to allow it to reject the the connection in its accept routine. We should try to avoid putting methods in config structs like this, and I'm not sure we want the NetAddress to be an interface (though perhaps I could be convinced, just feels like not the right way to solve the problem of writing these tests, even if its after all a good idea).

@ebuchman
Copy link
Contributor

ebuchman commented Apr 3, 2018

Not sure if handshake failure should be handled too but my gut feeling says this is mainly for dial failures.

Yes we want any kind of connection failure on a persistent peer, dial or otherwise, to cause a reconnect attempt

@tuxcanfly
Copy link
Contributor Author

@ebuchman I was wondering about what type of error would cause a handshake failure and would we want to have the same exponential backoff tracking as general failures (such as dial).

Talked to @jaekwon and he seemed to agree. I still need to read about the handshake process so I might be wrong.

@tuxcanfly
Copy link
Contributor Author

Trying a fix using a flag instead of func.

@kidinamoto01
Copy link
Contributor

why is this PR still blocked?

@tuxcanfly
Copy link
Contributor Author

@kidinamoto01 this it not ideal fix, I'm not sure if this even a complete fix since it doesn't handle handshake related errors (see comments above). I'll be closing this PR if that's OK, don't want to keep it lingering.

@kidinamoto01
Copy link
Contributor

@tuxcanfly Thanks for the update, it's very helpful. Hope we can get this fixed soon O(∩_∩)O~

@ebuchman
Copy link
Contributor

hey @tuxcanfly sorry for the delay here but I think you're right - we only want the reconnect on the dial, not on the handshake

@ebuchman
Copy link
Contributor

On second thought, the handshake includes a timeout, which we might want to try again on, and even if we fail to connect due to compatibility issues, the remote persistent node might get restarted with correct software and then we can connect to it.

So I think unless we do more work on the typing of errors, its fine to just always run reconnectToPeer. Also note that in reconnectToPeer itself, we keep trying no matter what the error. Ill fix this up and merge, thanks!

@ebuchman
Copy link
Contributor

Nope, what you've got is fine and straight forward. For now we'll only run the reconnect if we fail to dial the persistent peers. If we lost a connection though, we'll keep retrying even if eg the handshake fails

@ebuchman ebuchman merged commit 5d8767e into tendermint:develop Apr 28, 2018
@ebuchman
Copy link
Contributor

merged plus some minor changes: 0cbbb61

@tuxcanfly
Copy link
Contributor Author

Awesome! 💯

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants