
feat(peer_db)!: more accurate peer stats per address #5142

Merged

Conversation

@stringhandler (Collaborator) commented Jan 26, 2023

Description

Removes a lot of duplicate stats and online/offline fields on peers.

Motivation

Due to a recent bug, wallets would get a new address every time they restarted. There was some handling of peers with multiple addresses, but in general only the latest address signed by the peer was used. This prevented a node from acting as a bridge between Tor and IP, or even IP and DNS. This PR controversially (I'm sure) removes the signed address and instead allows a node to try an address to see if the correct public key is on the other end. If so, the node can choose the address with the best latency to communicate with the peer.

IMO, this is no less secure, since the worst case is that a node can force another node to contact an incorrect address, only to find that the node at the other end has a different public key. This can already be done by generating a new peer id, taking the address, and signing it with the new peer id (i.e. a Sybil address).

Previously, a peer may have been banned for answering at an address with the wrong public key, but this can also be manipulated. Now, only the address is marked as bad.

Also reworked the general sorting of peer addresses, although in most previous cases there would only be one address.

Also reworked the order in which a peer tries its neighbours. To make it more predictable and to prevent looping, the connectivity service will try to find its neighbours at a strict distance from itself. Previously it took last_attempted_at or last_seen_at into account, which I am not convinced was working.

You'll note that there are some tokio improvements as well.

During testing, I noticed that the peer query was, and still is, extremely heavy. In tokio console, queries took on the order of 10s to 100s. Moving this to a block_in_place frees up the tokio worker threads. In a future PR I will improve this even further by removing unnecessary calls.

Also:

  • Removed dead code (OptionalBoundedExecutor) and replaced the re-export of runtime from tokio.
  • Used standard tokio spawns for BoundedExecutor instead of holding a runtime reference.

BREAKING CHANGE: Because the Peer struct has changed so dramatically, a migration is not really pragmatic. The peer_db for wallets and base nodes must be deleted before starting up.

@stringhandler stringhandler changed the title WIP - better stats on peer addresses feat!: more accurate peer stats per address Jan 27, 2023
@stringhandler stringhandler marked this pull request as ready for review January 27, 2023 14:37
@stringhandler stringhandler added the P-controversial Process - This PR or Issue is controversial and/or requires more attention than simpler issues label Jan 27, 2023
@SWvheerden (Collaborator) left a comment

This looks good; it brings much-needed address functionality.
I tried to mark all the commented-out code.

I do however think we should not remove address signing. I think each address needs to be signed. The reason is that unsigned addresses would allow a peer to poison the address set of some other peer by spamming fabricated addresses to the network, obfuscating the peer's true address and making it harder to connect to.
If we force peers to sign each address, then peers can only poison their own addresses and not those of other peers.
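The per-address signing shape being argued for could look roughly like this toy sketch. To stay std-only, the "signature" here is just a hash over a shared secret and the address; a real implementation must use an actual public-key signature scheme (e.g. Schnorr or ed25519) verified against the peer's public key, so only the key holder can produce valid signatures.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// TOY stand-in for a real signature: a hash of a secret and the address.
// A real scheme signs with a private key and verifies with the public key;
// this only illustrates signing each address individually.
fn sign_address(secret: &str, address: &str) -> u64 {
    let mut h = DefaultHasher::new();
    (secret, address).hash(&mut h);
    h.finish()
}

// Accept an address for a peer only if its signature checks out, so third
// parties cannot poison the peer's address set with fabricated entries.
fn verify_address(secret: &str, address: &str, sig: u64) -> bool {
    sign_address(secret, address) == sig
}

fn main() {
    let addr = "/onion3/abcd:18141";
    let sig = sign_address("peer-secret", addr);
    assert!(verify_address("peer-secret", addr, sig));
    // A fabricated address fails verification and is rejected.
    assert!(!verify_address("peer-secret", "/ip4/9.9.9.9/tcp/18141", sig));
    println!("per-address signatures verified");
}
```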

@@ -48,7 +48,7 @@ message Peer{
/// Features supported by the peer
uint64 features = 9;
/// Connection statics for the peer
google.protobuf.Timestamp last_connected_at = 10; /// Protocols supported by the peer. This should not be considered a definitive list of supported protocols and is
// google.protobuf.Timestamp last_connected_at = 10; /// Protocols supported by the peer. This should not be considered a definitive list of supported protocols and is
Collaborator

remove?

Comment on lines 109 to 111
// peer.connection_stats.set_connection_success();
peer.addresses.update_addresses(addresses);
// peer.set_offline(false);
Collaborator

remove comments?
Or put these lines back?

address,
dial_state.peer().node_id.short_str()
);
for address in addresses {
Collaborator

Would it not be possible to make this a parallel loop so it executes faster? This could get quite slow if the peer has multiple old addresses.

Collaborator

Perhaps with something like: rayon::par_iter

Collaborator Author

Yeah, I did consider that. In this case tokio would be better than rayon, because it would wait on the network dial; rayon would be a better choice if it were CPU bound. The dials are already parallelized in a few cases during a call to each peer, but you are correct that if a peer had many addresses it would take a while.

Some things to note though:

  1. The addresses for a peer are always sorted by their quality score, so the best addresses will be at the front.
  2. Addresses that have failed previously are moved to the back of the list, so will be tried after all others.

So from these, it makes sense to keep trying addresses in sequence and to exit on the first successfully dialled address.
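Points 1 and 2 can be sketched as the following toy model. The field names and dial closure are illustrative, not the PR's actual types: addresses are kept with the best quality score first and previously-failed addresses at the back, then dialled one at a time, stopping at the first success.

```rust
use std::cmp::Reverse;

// Illustrative stand-in for the PR's per-address record.
#[derive(Debug, Clone)]
struct PeerAddress {
    address: String,
    quality_score: i32,
    last_failed: bool,
}

fn order_addresses(addresses: &mut [PeerAddress]) {
    // false sorts before true, so non-failed addresses come first;
    // within each group, higher quality scores come first.
    addresses.sort_by_key(|a| (a.last_failed, Reverse(a.quality_score)));
}

// Try each address in order, returning the first that dials successfully.
fn dial_first_reachable(
    addresses: &[PeerAddress],
    dial: impl Fn(&str) -> bool,
) -> Option<String> {
    addresses.iter().find(|a| dial(&a.address)).map(|a| a.address.clone())
}

fn main() {
    let mut addrs = vec![
        PeerAddress { address: "/ip4/1.2.3.4/tcp/18189".into(), quality_score: 90, last_failed: true },
        PeerAddress { address: "/onion3/abcd:18189".into(), quality_score: 80, last_failed: false },
        PeerAddress { address: "/ip4/5.6.7.8/tcp/18189".into(), quality_score: 40, last_failed: false },
    ];
    order_addresses(&mut addrs);
    // Pretend only the 5.6.7.8 address is reachable: the higher-scored
    // onion address is tried (and fails) first, then we stop at 5.6.7.8.
    let reached = dial_first_reachable(&addrs, |a| a.contains("5.6.7.8"));
    println!("{reached:?}");
}
```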

@@ -472,17 +480,20 @@ impl ConnectivityManagerActor {
fn mark_peer_succeeded(&mut self, node_id: NodeId) {
let entry = self.get_connection_stat_mut(node_id);
entry.set_connection_success();
// self.peer_manager.mark_last_seen(node_id.clone(), Some(addr_used));
Collaborator

remove?

}

fn mark_peer_failed(&mut self, node_id: NodeId) -> usize {
let entry = self.get_connection_stat_mut(node_id);
entry.set_connection_failed();
// self.peer_manager.mark_peer_failed(node_id, addr_tried);
Collaborator

remove?

return false;
}

let is_excluded = excluded.contains(&peer.node_id);
if is_excluded {
excluded_count += 1;
// excluded_count += 1;
Collaborator

remove?

.unwrap_or(false)
{
connect_ineligable_count += 1;
// connect_ineligable_count += 1;
Collaborator

remove?

return false;
}

if connected.contains(&&peer.node_id) {
already_connected += 1;
// already_connected += 1;
Collaborator

remove?

Comment on lines 83 to 110
// let can_update = can_update && {
// // Update/insert peer if newer
// // unreachable panic: can_update is true only is identity_signature is present and valid
// // let new_dt = new_peer
// // .identity_signature
// // .as_ref()
// // .map(|i| i.updated_at())
// // .expect("unreachable panic");
//
// // Update if new_peer has newer timestamp than current_peer, and if the newer timestamp is after
// the // added date
// current_peer
// .identity_signature
// .as_ref()
// .map(|i| i.updated_at() < new_dt && (
// !current_peer.is_seed() ||
// current_peer.added_at < new_dt.naive_utc()))
// // If None, update to peer with valid signature
// .unwrap_or(true)
// };

// if !can_update {
// debug!(
// target: LOG_TARGET,
// "Peer `{}` already exists or is up to date and will not be updated", new_peer.node_id
// );
// return Ok(false);
// }
Collaborator

remove?

// current_peer
// .update_addresses(new_peer.addresses.into_vec())
// .set_features(new_peer.features);
// if let Some(sig) = new_peer.identity_signature {
Collaborator

I don't think we should remove this.

Collaborator Author

See the notes in the motivation about this

@sdbondi (Member) left a comment

The goals of this PR are to:

  1. improve currently incomplete support for multiple peer addresses (perhaps to allow support later for WebRTC addresses)
    • This has been greatly improved in the PR. Although more work needs to happen, I like the addition of an address score and the improvements in the collection of address stats.
  2. improve the performance and selection strategy for maintaining connections to neighbouring peers
    • As far as I can see, no improvements were made in this area; they are TODO for another PR.
    • IMO one way to improve this is by maintaining a list of neighbouring and available peers in the peer manager DB. This list can be used and maintained by the DhtConnectivity component.
  3. Improve the propagation of updated address information across the network
    • This may have been made worse by removing the ability to validate peer sync information.
    • Was there a confirmed bug that caused wallet addresses not to eventually update over the network over time?

Other non-trivial changes:

  1. Removal of PeerIdentitySignature

Previously a peer may have been banned for answering an address with the wrong public key, but this can also be manipulated.

Is this not only true because the peer no longer signs its addresses?

since the worst case is that a node can force another node to contact an address that is not correct

This removes the invariant that each assigned address was in fact provided by the peer, allowing another node to provide arbitrary addresses for a peer on their behalf.

As you said, a peer may provide other self-generated identities. I'd argue this is a distinct case from allowing a peer to modify another peer's address set, even if that is only to add addresses to the set.

Perhaps the worst case is wasting network resources (memory, time, disk, bandwidth), perhaps there are other attacks/abuses not thought of (is a coordinated DDoS possible?).

Could you provide more details on the specific reasons for removing this invariant?

I took a look into what libp2p does
https://github.com/libp2p/rust-libp2p/blob/d344406235c779922e6ffec85f0a94181f3efecc/core/src/peer_record.rs#L62

, but in general only the latest address signed by the peer was used

Minor correction: the address set is signed by the peer, not individual addresses. The handling of multiple addresses was primitive and even incorrect; this PR greatly improves it. This improvement is not precluded by the removal of signed address sets.

comms/core/src/connection_manager/dialer.rs (outdated; resolved)
@@ -363,6 +364,10 @@ impl DhtConnectivity {
}

async fn refresh_neighbour_pool(&mut self) -> Result<(), DhtConnectivityError> {
if self.is_refreshing_neighbour_pool {
Member

is_refreshing_neighbour_pool will always be false here. There is no concurrency in DhtConnectivity (it runs on a single thread at a time) so this function has to exit before it can be called again.

Collaborator Author

Cool, I will remove this. It's easy to see in tokio-console whether this is in fact the case, but I believe you are correct and will confirm.

.sort_by(PeerQuerySortBy::DistanceFromLastConnected(node_id))
// Fetch double here so that there is a bigger closest peer set that can be ordered by last seen
.limit(n * 2);
// TODO: Check if this logic is correct. At the moment I want them in the same order every time
Member

This is fine for the purposes of evaluating performance. The previous method would include a larger number of peers so that there is a better chance of selecting a peer that was recently available. This call was notably heavy and not an adequate or good strategy, a symptom of not having a good way to maintain an optimised/cached list of highly available neighbours in the peer manager (or some similar solution e.g. k-buckets).

Collaborator Author

My problem with the previous approach was that it could end up in a stale loop if the peers came online again before new peers were tried. This approach is more predictable and ensures that all the peers are tried. It may be a symptom of the current network state, but the majority of these nodes are actually offline.

Given that the offline data is more accurate now, it may be much of a muchness. I think I would however lobby for selecting the same n (instead of n*2) peers from the DB and returning them directly.
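The strict-distance selection under discussion amounts to ordering candidates purely by the XOR metric from our own node id, with no last_seen/last_attempted weighting, so the same n neighbours come back every time. A sketch, with node ids shortened to u64 (real node ids are byte arrays):

```rust
// Order candidate peers by XOR distance from our node id and keep the
// closest n. Deterministic: no recency information is mixed in.
fn neighbours(own_id: u64, mut candidates: Vec<u64>, n: usize) -> Vec<u64> {
    candidates.sort_by_key(|id| id ^ own_id); // XOR metric: smaller is closer
    candidates.truncate(n);
    candidates
}

fn main() {
    // With own_id = 0 the XOR distance is the id itself, so the two
    // numerically smallest candidates are the two nearest neighbours.
    let nearest = neighbours(0, vec![8, 1, 3], 2);
    println!("{nearest:?}");
}
```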

Member

n == 8 by default.
The goal was to fetch peers that we recently connected to, ordered by distance. When a peer comes back online, it will send a join message, which will be propagated to neighbours, and/or directly connect. This will update the last_seen which will put them back on the top of the list for dialing if ordered by last seen.

This change will select the same 8 peers (unless a nearer one joins), which may never come online again. Currently, "culling" of offline peers is a slow process: I think we have to attempt and fail to connect to them at least once, and then have not connected to them for 24 hours, before we remove them from the peer list entirely.

I think a good solution would be to maintain a list that can be loaded that has available peers. Essentially just optimising this peer query from an LMDB perspective.

Collaborator Author

In this PR, as soon as an address fails it is immediately marked offline. If all addresses are offline, the peer is offline
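That rule can be sketched with illustrative types (not the PR's actual structs): a failed dial immediately marks that address offline, and the peer as a whole is offline only once every one of its addresses is offline.

```rust
#[derive(Clone, Copy, PartialEq)]
enum AddressStatus {
    Online,
    Offline,
}

// A peer is offline only when it has addresses and all of them are offline.
fn peer_is_offline(addresses: &[AddressStatus]) -> bool {
    !addresses.is_empty() && addresses.iter().all(|s| *s == AddressStatus::Offline)
}

fn main() {
    let mut statuses = vec![AddressStatus::Online, AddressStatus::Online];
    statuses[0] = AddressStatus::Offline; // first dial failed
    assert!(!peer_is_offline(&statuses)); // one address still online
    statuses[1] = AddressStatus::Offline; // second dial failed too
    assert!(peer_is_offline(&statuses));
    println!("peer marked offline once all addresses failed");
}
```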

// .set_features(new_peer.features);
// if let Some(sig) = new_peer.identity_signature {
// current_peer.set_valid_identity_signature(sig);
// }
self.peer_manager.add_peer(current_peer).await?;
@sdbondi (Member) commented Jan 30, 2023

Noting that add_peer is changed to only append new addresses, to mitigate the fact that addresses are no longer signed. It also merges stats for addresses; however, that does not apply here because the peer sync message cannot and should not contain this data.

Should a remote peer F be allowed to add any number (1000s? 10,000s?) of addresses for any peer? If not, perhaps we should set a maximum, e.g. 10. If the remote peer F provides 10 new addresses for peer A, which already has 5 in our peer manager, we have no choice but to discard 5 of the addresses (at random?) provided by peer F. If peer A already has 10, we cannot add any addresses, as all of them could be false.

Comparison of happy path case:

Setup: Peer A has an updated address set, peer B is made aware of the change via an existing authenticated mechanism (DHT join, DHT discovery, direct connection). There are no malicious peers.

This PR:
Peer X later connects to peer B and appends the new addresses given by peer B for peer A. Peer X dials peer A using out-of-date addresses that have a higher score first, but eventually, after some time, uses one of the updated addresses, which have no score.

Previous with signed peer record:
Peer X later connects to peer B and sets the new addresses given by peer B for peer A. Peer X can be sure that peer B is giving the correct address set for peer A and therefore can use that address set (discarding addresses that it knows peer A is not using). Peer X dials peer A using the correct address set without having to first discover that the previous addresses are out of date.
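The capped, append-only merge suggested above might look like this sketch; MAX_ADDRESSES is an assumed value, not taken from the PR.

```rust
// Append-only, capped merge of addresses reported by a remote peer:
// reported addresses can only be appended (never replace known ones),
// duplicates are skipped, and the total is capped so a remote peer cannot
// flood another peer's record with thousands of fabricated entries.
const MAX_ADDRESSES: usize = 10;

fn merge_addresses(known: &mut Vec<String>, reported: Vec<String>) {
    for addr in reported {
        if known.len() >= MAX_ADDRESSES {
            break; // discard the rest: any of them could be false
        }
        if !known.contains(&addr) {
            known.push(addr);
        }
    }
}

fn main() {
    let mut known: Vec<String> = (0..9).map(|i| format!("addr{i}")).collect();
    merge_addresses(&mut known, vec!["new1".into(), "new2".into(), "new3".into()]);
    // Only one slot was free, so only one reported address is kept.
    println!("{} addresses kept", known.len());
}
```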

@stringhandler (Collaborator Author)

I see there is a lot of contention about removing the address signature, and it's probably not unfounded. I will look into adding the signed address back, but the structure will need to change. Currently it is difficult to reassemble the challenge when multiple addresses are signed, so I will change the identity message to sign each address individually, in a way that makes it easy to re-verify and store with each address, and see how that looks.

@hansieodendaal (Contributor)

This PR touches many things and is a bit big to comprehend what the actual system-wide effects would be without proper stress testing.

I agree that it is an improvement to mark an address as bad rather than ban the peer straight away (although I think a peer should eventually be medium-term banned if none of its addresses is working); however, I do not agree that addresses should no longer be signed.

Maybe we can implement just one thing from this PR at a time and evaluate it, leaving the changes with the biggest system impact for last?

@stringhandler stringhandler changed the title feat!: more accurate peer stats per address feat(peer_db)!: more accurate peer stats per address Feb 22, 2023
@stringhandler (Collaborator Author) left a comment

Self-reviewed and removed all commented-out code.

@stringhandler stringhandler merged commit fdad1c6 into tari-project:development Mar 3, 2023
@stringhandler stringhandler deleted the st-comms-addresses-data branch March 3, 2023 13:59
stringhandler pushed a commit that referenced this pull request Mar 8, 2023
### ⚠ BREAKING CHANGES

* **peer_db:** more accurate peer stats per address (5142)
* use consensus hashing API for validator node MMR (5207)
* **consensus:** add balanced binary merkle tree (5189)

### Features

* add favourite flag to contact ([5217](#5217)) ([0371b60](0371b60))
* add indexer config ([5210](#5210)) ([cf95601](cf95601))
* add merge proof for balanced binary merkle tree ([5193](#5193)) ([8962909](8962909))
* **consensus:** add balanced binary merkle tree ([5189](#5189)) ([8d34e8a](8d34e8a))
* log to base dir ([#5197](#5197)) ([5147b5c](5147b5c))
* **peer_db:** more accurate peer stats per address ([5142](#5142)) ([fdad1c6](fdad1c6))


### Bug Fixes

* add grpc commitment signature proto type ([5200](#5200)) ([d523f1e](d523f1e))
* peer seeds for esme/igor ([5202](#5202)) ([1bc226c](1bc226c))
* remove panics from merged BBMT verification ([5221](#5221)) ([a4c5fce](a4c5fce))
* source coverage ci failure ([5209](#5209)) ([80294a1](80294a1))
* use consensus hashing API for validator node MMR ([5207](#5207)) ([de28115](de28115))
* wallet reuse existing tor address ([5092](#5092)) ([576f44e](576f44e))
* **wallet:** avoids empty addresses in node identity ([5224](#5224)) ([1a66312](1a66312))
@hnidoaht-101 (Contributor)

Hi team, I know this PR was already merged. Just curious how @stringhandler used tokio-console to troubleshoot the long-running queries. Would it be possible to do a demo of it?

@stringhandler (Collaborator Author)

Sure, will try put something together

@ghpbot-tari-project ghpbot-tari-project added the P-acks_required Process - Requires more ACKs or utACKs label Mar 28, 2023
@hnidoaht-101 (Contributor)

Sure, will try put something together

Thanks a lot. 💯 I'll be waiting for it.

@hnidoaht-101 (Contributor) commented Mar 31, 2023

Sure, will try put something together

@stringhandler Can you just tag my name once you have something? Thank you.

@stringhandler (Collaborator Author)

will do

Labels
P-acks_required Process - Requires more ACKs or utACKs P-controversial Process - This PR or Issue is controversial and/or requires more attention than simpler issues

6 participants