Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

p2p: peer state init too late and pex message too soon #3634

Merged
merged 27 commits into from
May 27, 2019
Merged

Conversation

melekes
Copy link
Contributor

@melekes melekes commented May 7, 2019

Original issue:

fix two issue #3346 and
#3338
1、fix reconnection pex send too fast,
error is caused lastReceivedRequests is still
not deleted when a peer reconnected

2、Peer does not have a state yet. We set it in AddPeer.
We need an new interface before mconnection is started

Replaced #3361

  • Updated all relevant documentation in docs
  • Updated all code comments where relevant
  • Wrote tests
  • Updated CHANGELOG_PENDING.md

unclezoro and others added 8 commits April 26, 2019 18:55
Peer does not have a state yet. We set it in AddPeer.
We need an new interface before mconnection is started
fix reconnection pex send too fast,
error is caused lastReceivedRequests is still
not deleted when a peer reconnected
Co-Authored-By: guagualvcha <baifudong@lancai.cn>
Co-Authored-By: guagualvcha <baifudong@lancai.cn>
@codecov-io
Copy link

codecov-io commented May 7, 2019

Codecov Report

Merging #3634 into develop will increase coverage by 0.02%.
The diff coverage is 85%.

@@             Coverage Diff             @@
##           develop    #3634      +/-   ##
===========================================
+ Coverage    63.42%   63.45%   +0.02%     
===========================================
  Files          217      217              
  Lines        18169    18157      -12     
===========================================
- Hits         11524    11521       -3     
+ Misses        5684     5671      -13     
- Partials       961      965       +4
Impacted Files Coverage Δ
evidence/reactor.go 64.94% <ø> (+0.66%) ⬆️
p2p/test_util.go 63.97% <0%> (ø) ⬆️
p2p/base_reactor.go 81.81% <100%> (+21.81%) ⬆️
p2p/switch.go 70.86% <100%> (+2.13%) ⬆️
p2p/pex/pex_reactor.go 83.38% <100%> (-1.19%) ⬇️
consensus/reactor.go 70.77% <71.42%> (-2.08%) ⬇️
p2p/pex/errors.go 11.11% <0%> (-11.12%) ⬇️
proxy/client.go 25% <0%> (-0.81%) ⬇️
... and 9 more

p2p/base_reactor.go Outdated Show resolved Hide resolved
p2p/pex/pex_reactor.go Outdated Show resolved Hide resolved
consensus/reactor.go Outdated Show resolved Hide resolved
@melekes
Copy link
Contributor Author

melekes commented May 7, 2019

Ok, I think the correct fix for #3338 is to remove the peer from the Switch last (after we've stopped it and removed from all reactors). This will prevent it from reconnecting to us before we've removed it as it will be rejected by the filterPeer function.

p2p/base_reactor.go Outdated Show resolved Hide resolved
p2p/switch.go Show resolved Hide resolved
@melekes melekes self-assigned this May 7, 2019
CHANGELOG_PENDING.md Outdated Show resolved Hide resolved
CHANGELOG_PENDING.md Outdated Show resolved Hide resolved
@melekes melekes mentioned this pull request May 7, 2019
36 tasks
@melekes melekes marked this pull request as ready for review May 8, 2019 11:40
@melekes melekes requested review from ebuchman and xla as code owners May 8, 2019 11:40
}

// see stopAndRemovePeer
func TestSwitchInitPeerIsNotCalledBeforeRemovePeer(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test passes even on develop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test passes on develop because there's no InitPeer method
if you replace InitPeer with AddPeer
it will fail on develop

reactor.AddPeer(peer)
})
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test needs the InitPeer method to pass 👍

Copy link
Contributor

@liamsi liamsi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about the last test TestSwitchInitPeerIsNotCalledBeforeRemovePeer (not sure it is meant to test the changes but it passes without the changes in this PR). The rest of the changes LGTM. I would like a 2nd pair of eyes by someone more familiar with the touched packages though @ebuchman @xla @ancazamfir

@liamsi
Copy link
Contributor

liamsi commented May 8, 2019

Thanks for removing the timeouts 👍 The test in question paniced now though:

click to see log
=== RUN   TestSwitchInitPeerIsNotCalledBeforeRemovePeer
I[2019-05-08|14:10:17.511] Starting P2P Switch                          switch=1 impl="P2P Switch"
I[2019-05-08|14:10:17.519] Starting Peer                                switch=1 peer=a986846cedd66c558e89c05c6cea9a18af27c141@127.0.0.1:56950 impl="Peer{MConn{127.0.0.1:56950} a986846cedd66c558e89c05c6cea9a18af27c141 in}"
I[2019-05-08|14:10:17.519] Starting MConnection                         switch=1 peer=a986846cedd66c558e89c05c6cea9a18af27c141@127.0.0.1:56950 impl=MConn{127.0.0.1:56950}
I[2019-05-08|14:10:17.519] Added peer                                   switch=1 peer="Peer{MConn{127.0.0.1:56950} a986846cedd66c558e89c05c6cea9a18af27c141 in}"
E[2019-05-08|14:10:17.573] Connection failed @ recvRoutine (reading byte) switch=1 peer=a986846cedd66c558e89c05c6cea9a18af27c141@127.0.0.1:56950 conn=MConn{127.0.0.1:56950} err=EOF
I[2019-05-08|14:10:17.573] Stopping MConnection                         switch=1 peer=a986846cedd66c558e89c05c6cea9a18af27c141@127.0.0.1:56950 impl=MConn{127.0.0.1:56950}
E[2019-05-08|14:10:17.573] Stopping peer for error                      switch=1 peer="Peer{MConn{127.0.0.1:56950} a986846cedd66c558e89c05c6cea9a18af27c141 in}" err=EOF
I[2019-05-08|14:10:17.574] Stopping Peer                                switch=1 peer=a986846cedd66c558e89c05c6cea9a18af27c141@127.0.0.1:56950 impl="Peer{MConn{127.0.0.1:56950} a986846cedd66c558e89c05c6cea9a18af27c141 in}"
D[2019-05-08|14:10:17.574] Stopping MConnection (ignoring: already stopped) switch=1 peer=a986846cedd66c558e89c05c6cea9a18af27c141@127.0.0.1:56950 impl=MConn{127.0.0.1:56950}
E[2019-05-08|14:10:22.521] Stopping peer for error                      switch=1 peer=null err=test
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x78 pc=0xbee68c]

goroutine 736 [running]:
github.com/tendermint/tendermint/p2p.(*MultiplexTransport).Cleanup(0xc00064c900, 0x0, 0x0)
	/go/src/github.com/tendermint/tendermint/p2p/transport.go:304 +0x4c
github.com/tendermint/tendermint/p2p.(*Switch).stopAndRemovePeer(0xc00027e6c0, 0x0, 0x0, 0xc7f500, 0xf56860)
	/go/src/github.com/tendermint/tendermint/p2p/switch.go:327 +0xac
github.com/tendermint/tendermint/p2p.(*Switch).StopPeerForError(0xc00027e6c0, 0x0, 0x0, 0xc7f500, 0xf56860)
	/go/src/github.com/tendermint/tendermint/p2p/switch.go:300 +0x1e0
created by github.com/tendermint/tendermint/p2p.TestSwitchInitPeerIsNotCalledBeforeRemovePeer
	/go/src/github.com/tendermint/tendermint/p2p/switch_test.go:659 +0x574
FAIL	github.com/tendermint/tendermint/p2p	10.069s
Exited with code 1

Looks like with the last change introduced (or rather revealed) a non-deterministic test 🤔 (as waitUntilSwitchHasAtLeastNPeers repeatedly sleeps 250ms and checks if there are the number of expected peers.

@liamsi
Copy link
Contributor

liamsi commented May 8, 2019

I see the test TestSwitchInitPeerIsNotCalledBeforeRemovePeer passes because it only ensures that reactor.InitPeer is always only called after reactor.RemovePeer by checking if it was called at all. This passes on develop because there is no InitPeer method to call.

@melekes
Copy link
Contributor Author

melekes commented May 8, 2019

I brought back timeouts, sorry. Maybe there's a way to get rid of them, need to think about it.

Copy link
Contributor

@brapse brapse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By exposing InitPeer it appears that we are making the p2p.Switch responsible for lifecycle management of reactor specific peer state. Although this fixes #3346 and #3338. does it not expose us to the same class of bug? Would it possible to limit reactor specific peer lifecycle management to the current public interface (AddPeer, RemovePeer) to limit the burden on the Switch knowing what internal state the reactors need from peers?

@brapse
Copy link
Contributor

brapse commented May 27, 2019

Looking back over the previous iteration of this PR it appears that I'm a bit of a broken record. It looks like a better solution might be outside to scope of this particular issue and could be better serviced in a follow up PR. I've tried to capture my reservations in a separate issue #3679 such that at we can get the current bug fixed. 👍

Copy link
Contributor

@ebuchman ebuchman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted (eg. #3679) there's still lots of work to do here, but this should be good for now.

Thanks!

@ebuchman ebuchman merged commit bcf10d5 into develop May 27, 2019
@ebuchman ebuchman deleted the 3346-p2perror branch May 27, 2019 18:40
This was referenced May 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants