Conversation

@pompon0 (Contributor) commented Nov 7, 2025

Rewrite of the PeerManager logic to make it simple, predictable, and race-condition resistant. There are further improvements to be done, but we need to keep compatibility with the current p2p implementation. With this PR:

  • the scoring system will be scrapped entirely, in favor of separately managing persistent/unconditional/blocksync peers (trusted) and peers learned from the network (not trusted)
  • the peerstore will change its purpose: it used to store all known peer addresses; now it will store only addresses we successfully connected to. Its only goal will be to let a node reestablish its recent connections on restart (see the sketch below)
  • a node will only broadcast addresses of nodes it has successfully connected to (so that it no longer broadcasts unverified data)
  • the number of addresses per peer that the PeerManager keeps will be bounded (it is currently unbounded)
  • PeerManager will be an implementation detail of the Router.
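
A minimal sketch of the new peerstore role, using hypothetical names (record, peerStore, RedialOrder); the actual implementation lives in sei-tendermint/internal/p2p/peerdb.go and, per the byLastConnected excerpt further down, maintains an ordered index instead of sorting on demand:

```go
package peerstore

import (
	"sort"
	"time"
)

// record is written only after a successful connection, so the store never
// contains unverified addresses.
type record struct {
	NodeID        string
	Addr          string
	LastConnected time.Time
}

type peerStore struct {
	byNodeID map[string]record
}

// Connected records (or refreshes) the address we just connected to.
func (s *peerStore) Connected(id, addr string, now time.Time) {
	s.byNodeID[id] = record{NodeID: id, Addr: addr, LastConnected: now}
}

// RedialOrder returns the stored addresses, most recently connected first,
// which is the order a restarted node would try to reestablish connections.
func (s *peerStore) RedialOrder() []record {
	rs := make([]record, 0, len(s.byNodeID))
	for _, r := range s.byNodeID {
		rs = append(rs, r)
	}
	sort.Slice(rs, func(i, j int) bool {
		return rs[i].LastConnected.After(rs[j].LastConnected)
	})
	return rs
}
```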

@pompon0 pompon0 marked this pull request as draft November 7, 2025 19:03
@pompon0 pompon0 changed the title from "Gprusak disconnects" to "PeerManager rewrite" Nov 12, 2025
@pompon0 pompon0 marked this pull request as ready for review November 12, 2025 11:42
github-actions bot commented Nov 12, 2025

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | Nov 13, 2025, 3:15 PM |

codecov bot commented Nov 12, 2025

Codecov Report

❌ Patch coverage is 63.34340% with 364 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.20%. Comparing base (5ae1ad8) to head (4d788b9).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| sei-tendermint/libs/utils/testonly.go | 0.00% | 43 Missing ⚠️ |
| sei-tendermint/internal/p2p/peerdb.go | 59.61% | 23 Missing and 19 partials ⚠️ |
| sei-tendermint/internal/p2p/router.go | 68.21% | 32 Missing and 9 partials ⚠️ |
| sei-tendermint/internal/p2p/testonly.go | 44.92% | 36 Missing and 2 partials ⚠️ |
| sei-tendermint/internal/p2p/peermanager.go | 86.48% | 33 Missing and 2 partials ⚠️ |
| sei-tendermint/libs/utils/im/im.go | 0.00% | 33 Missing ⚠️ |
| sei-tendermint/libs/utils/tcp/tcp.go | 0.00% | 21 Missing ⚠️ |
| sei-tendermint/internal/p2p/transport.go | 58.69% | 14 Missing and 5 partials ⚠️ |
| sei-tendermint/internal/consensus/reactor.go | 34.61% | 4 Missing and 13 partials ⚠️ |
| sei-tendermint/node/setup.go | 58.53% | 14 Missing and 3 partials ⚠️ |

... and 16 more

❌ Your patch status has failed because the patch coverage (63.34%) is below the target coverage (70.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project status has failed because the head coverage (48.39%) is below the target coverage (50.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #2539      +/-   ##
==========================================
- Coverage   43.34%   43.20%   -0.14%
==========================================
  Files        1576     1579       +3
  Lines      137842   137345     -497
==========================================
- Hits        59744    59343     -401
+ Misses      72678    72624      -54
+ Partials     5420     5378      -42
```

| Flag | Coverage Δ |
| --- | --- |
| sei-chain | 31.10% <ø> (ø) |
| sei-cosmos | 52.55% <ø> (ø) |
| sei-db | 47.60% <ø> (ø) |
| sei-tendermint | 48.38% <63.34%> (-0.47%) ⬇️ |
| sei-wasmd | 46.27% <ø> (ø) |
| sei-wasmvm | 40.37% <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| sei-tendermint/internal/mempool/mempool.go | 67.84% <100.00%> (ø) |
| sei-tendermint/internal/mempool/types.go | 0.00% <ø> (ø) |
| sei-tendermint/internal/p2p/address.go | 90.74% <100.00%> (+0.74%) ⬆️ |
| sei-tendermint/internal/p2p/channel.go | 78.37% <100.00%> (-2.58%) ⬇️ |
| sei-tendermint/internal/p2p/metrics.gen.go | 15.71% <ø> (+0.12%) ⬆️ |
| sei-tendermint/internal/p2p/metrics.go | 100.00% <ø> (ø) |
| sei-tendermint/internal/rpc/core/env.go | 17.88% <ø> (ø) |
| sei-tendermint/internal/blocksync/pool.go | 81.30% <90.00%> (+2.61%) ⬆️ |
| sei-tendermint/internal/consensus/ticker.go | 97.29% <92.30%> (-2.71%) ⬇️ |
| sei-tendermint/internal/evidence/reactor.go | 71.30% <66.66%> (+1.81%) ⬆️ |

... and 23 more

... and 10 files with indirect coverage changes


@@ -540,26 +522,6 @@

Check warning (Code scanning / CodeQL): Calling the system time may be a possible source of non-determinism.
Comment on lines +131 to +133

```go
for _, r := range byNodeID {
	byLastConnected.ReplaceOrInsert(r)
}
```

Check warning (Code scanning / CodeQL): Iteration over map may be a possible source of non-determinism.
```go
if err := m.store.Set(peer); err != nil {
	return err
}
// Record the failure time.
now := time.Now()
```
Check warning (Code scanning / CodeQL): Calling the system time may be a possible source of non-determinism.
Comment on lines +306 to +315

```go
for id := range s.last {
	if _, ok := conns.Get(id); !ok {
		delete(s.last, id)
		update = PeerUpdate{
			NodeID: id,
			Status: PeerStatusDown,
		}
		return true
	}
```

```go
if _, ok := m.dynamicPrivatePeers[nodeAddr.NodeID]; ok {
	continue
}
```

Check warning (Code scanning / CodeQL): Iteration over map may be a possible source of non-determinism.
Comment on lines +506 to +508

```go
for id := range inner.persistentAddrs {
	ids = append(ids, id)
}
```

Check warning (Code scanning / CodeQL): Iteration over map may be a possible source of non-determinism.
Comment on lines +524 to +526

```go
for addr := range pa.addrs {
	addrs = append(addrs, addr)
}
```

Check warning (Code scanning / CodeQL): Iteration over map may be a possible source of non-determinism.
```go
for {
	for db := range r.peerDB.Lock() {
		// Mark connections as still available.
		now := time.Now()
```

Check warning (Code scanning / CodeQL): Calling the system time may be a possible source of non-determinism.
```go
if !ok || m == 0 {
	return 0
}
return float64(r.peerManager.Conns().Len()) / float64(m)
```

Check notice (Code scanning / CodeQL): Floating point arithmetic operations are not associative and a possible source of non-determinism.
```diff
 BlockSyncPeers string `mapstructure:"blocksync-peers"`

-// UPNP port forwarding
+// UPNP port forwarding. UNUSED
```
Collaborator:

since this is UNUSED, do we want to do something about it (e.g. remove it), or will that affect existing network compatibility? If we want to do so down the road, let's add a TODO here.

pompon0 (Contributor, Author):

just removing it will break config parsing afaiu. I don't have any long-term deprecation strategy here; just wanted to document the fact for clarity.
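
A sketch of the low-cost option being described: keep the field so existing config files still parse, while the comment marks it dead. The field and tag names mirror the classic Tendermint P2P config and are illustrative, not necessarily the exact code in this repo:

```go
package config

// P2PConfig fragment (illustrative).
type P2PConfig struct {
	BlockSyncPeers string `mapstructure:"blocksync-peers"`

	// Deprecated: UPNP port forwarding is UNUSED. The field is kept only so
	// that existing config files which still set it continue to parse.
	UPNP bool `mapstructure:"upnp"`
}
```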

```go
// * allow inbound conn to override outbound iff peerID > selfID.
// This resolves the situation when peers try to connect to each other
// at the same time.
oldDir := old.Info().DialAddr.IsPresent()
```
Collaborator:

shouldn't this be oldAddr or something instead of oldDir? (same for newDir)

pompon0 (Contributor, Author):

Dir is an abbreviation of "direction" (see the comment above). I'll expand the names to old/newDirection to make it clearer.
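
The tie-break works because both peers evaluate the mirrored comparison, so they agree on which of the two simultaneous connections survives. A minimal sketch under assumed types (direction reduced to an "outbound" boolean, string node IDs); the mirror case for a new outbound connection follows by symmetry and is my inference, not quoted from the PR:

```go
package p2p

// shouldReplace reports whether a newly established connection to the same
// peer should replace the existing one. A connection is "outbound" when we
// dialed it, i.e. its dial address is present.
func shouldReplace(oldOutbound, newOutbound bool, peerID, selfID string) bool {
	// Same direction: keep the existing connection.
	if oldOutbound == newOutbound {
		return false
	}
	// Inbound overrides outbound iff peerID > selfID.
	if !newOutbound {
		return peerID > selfID
	}
	// Mirror case: outbound overrides inbound iff peerID < selfID, so both
	// endpoints pick the same surviving connection.
	return peerID < selfID
}
```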

```go
// Add new peerAddrs if missing.
if !ok {
	// Prune some peer if maxPeers limit has been reached.
	if len(i.addrs) == i.options.maxPeers() {
```
Collaborator:

how do we intend this to behave if we have len(i.addrs) == maxPeers (e.g. 128) but none of the peers have failed? In this case we would just drop this peer; just want to make sure this is the intended behavior?

pompon0 (Contributor, Author):

yes, we just drop it. The rough idea here is that we want to dial every address we accept at least once. This doesn't give us any nice network properties; it just makes the semantics predictable (bounds the churn). I would like to move to some more useful algorithm soon (once I have time to implement it).
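
A sketch of the admission rule being discussed, with illustrative names (the real code lives in peerdb.go and tracks more state than a single failed flag): when the set is full, a failed entry can be pruned to make room; if every entry is healthy, the newcomer is dropped.

```go
package p2p

type addrInfo struct{ failed bool }

// add admits a new address into a bounded set. When the set is full it
// evicts one failed entry if any exists; if all entries are healthy, the
// newcomer is simply dropped, bounding churn.
func add(addrs map[string]addrInfo, addr string, maxPeers int) bool {
	if _, ok := addrs[addr]; ok {
		return true // already known
	}
	if len(addrs) == maxPeers {
		evicted := false
		for a, info := range addrs {
			if info.failed {
				delete(addrs, a)
				evicted = true
				break
			}
		}
		if !evicted {
			return false // full of healthy entries: drop the newcomer
		}
	}
	addrs[addr] = addrInfo{}
	return true
}
```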

```go
func NewAtomicSend[T any](value T) (w AtomicSend[T]) {
	w.ptr.Store(newVersion(value))
	// nolint:nakedret
	return
```
Collaborator:

why do we want the naked return and the nolint?

pompon0 (Contributor, Author) commented Nov 13, 2025:

AtomicSend is (intentionally) a no-copy object, because it embeds atomic.Pointer. A naked return is the canonical way of initializing no-copy objects (`return w` would already be a copy of w).
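
A simplified sketch of the pattern (the real code stores a versioned wrapper via newVersion, omitted here). sync/atomic's Pointer type carries a noCopy marker, so go vet's copylocks check flags values of such types being copied; initializing the named result in place avoids ever creating a second copy:

```go
package utils

import "sync/atomic"

// AtomicSend is a no-copy type: atomic.Pointer contains a noCopy marker,
// so copying a value of this type is flagged by `go vet`.
type AtomicSend[T any] struct {
	ptr atomic.Pointer[T]
}

// NewAtomicSend initializes the named result w in place and naked-returns
// it: the function body only ever touches the result slot itself. With an
// unnamed result, `var w AtomicSend[T]; ...; return w` would copy w into
// the result, defeating the no-copy intent.
func NewAtomicSend[T any](value T) (w AtomicSend[T]) {
	w.ptr.Store(&value)
	// nolint:nakedret
	return
}
```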

```go
// OnStop implements service.Service.
func (r *Reactor) OnStop() {}

func (r *Reactor) sendRoutine(ctx context.Context) error {
```
Collaborator:

Do we have any concerns about whether this will cause the network adapter to be saturated/blocked as we broadcast to a batch of peers (e.g. 100)? Is it possible that every 10 seconds we cause a slowdown of other network communications (e.g. consensus voting; maybe this specific message type isn't as problematic because of priority), or should it be fine?

pompon0 (Contributor, Author):

There are a couple of aspects to that:

  • sendRoutine just puts msgs into peers' outbound queues, which is non-blocking and doesn't do any IO (see the sketch below)
  • pex msgs have lower priority than consensus messages, so they will be deprioritized anyway
  • pex requests are very small, so it actually doesn't matter at all
  • pex responses are desynchronized due to different round-trip latency to different peers
  • eventually I would like to get rid of pex requests altogether (make peers proactively push pex responses) and make pex response sizes negligible (limit the size, perhaps sending just diffs)
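
A sketch of what "non-blocking enqueue" in the first bullet means, with illustrative names (the real router uses per-peer priority queues rather than bare channels): the broadcast hands messages off and returns immediately, and per-peer writer goroutines do the actual network IO.

```go
package p2p

// broadcast puts msg onto every peer's buffered outbound queue and returns
// without doing any IO; dedicated per-peer writers drain the queues.
func broadcast(queues map[string]chan []byte, msg []byte) {
	for _, q := range queues {
		select {
		case q <- msg: // buffered: handed off to the peer's writer
		default: // queue full: drop rather than block the send routine
		}
	}
}
```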

@udpatil (Collaborator) left a comment:

Overall LGTM, just had a few comments to improve my understanding further.

@pompon0 pompon0 enabled auto-merge (squash) November 13, 2025 15:15
@pompon0 pompon0 merged commit 0860fba into main Nov 13, 2025
44 of 49 checks passed
@pompon0 pompon0 deleted the gprusak-disconnects branch November 13, 2025 16:10
yzang2019 added a commit that referenced this pull request Nov 13, 2025