
bug: discv5 not returning enough peers #2810

Closed

chaitanyaprem opened this issue Jun 14, 2024 · 18 comments
Labels: bug (Something isn't working), effort/weeks (Estimated to be completed in a few weeks)

Comments

@chaitanyaprem
Contributor

chaitanyaprem commented Jun 14, 2024

Problem

  • We noticed in the Status fleets that discv5 is not returning enough peers (20 peers in 15 minutes, even though the bootnodes themselves show 40-60 connected peers).
  • We also noticed that many of the returned peers have their rs and rsv values set to 0, which indicates they are not part of the Status cluster or TWN.
  • The same peers keep being returned.

Impact

This doesn't give Status Desktop enough of a chance to form a healthy mesh, as most instances are only connected to fleet nodes. The peer count never goes beyond 15-20, and I suspect those are all fleet peers.

More discussion regarding this: https://discord.com/channels/1110799176264056863/1239858809359306762/1250836260109422652

@richard-ramos wrote a discv5 test tool that constantly queries fleet nodes and prints the number of unique peers reported, along with their ENR and rs/rsv values.

To reproduce

Run the tool at https://github.com/waku-org/test-discv5:

  1. Clone the repo
  2. Build the Docker image with docker build -t discv5:latest .
  3. Run the tool, passing the fleet nodes as the discv5 nodes to query, using the command below

docker run --rm discv5:latest --bootnodes="enr:-QEKuECA0zhRJej2eaOoOPddNcYr7-5NdRwuoLCe2EE4wfEYkAZhFotg6Kkr8K15pMAGyUyt0smHkZCjLeld0BUzogNtAYJpZIJ2NIJpcISnYxMvim11bHRpYWRkcnO4WgAqNiVib290LTAxLmRvLWFtczMuc2hhcmRzLnRlc3Quc3RhdHVzLmltBnZfACw2JWJvb3QtMDEuZG8tYW1zMy5zaGFyZHMudGVzdC5zdGF0dXMuaW0GAbveA4Jyc40AEAUAAQAgAEAAgAEAiXNlY3AyNTZrMaEC3rRtFQSgc24uWewzXaxTY8hDAHB8sgnxr9k8Rjb5GeSDdGNwgnZfg3VkcIIjKIV3YWt1Mg0,enr:-QEcuEAX6Qk-vVAoJLxR4A_4UVogGhvQrqKW4DFKlf8MA1PmCjgowL-LBtSC9BLjXbb8gf42FdDHGtSjEvvWKD10erxqAYJpZIJ2NIJpcIQI2hdMim11bHRpYWRkcnO4bAAzNi5ib290LTAxLmFjLWNuLWhvbmdrb25nLWMuc2hhcmRzLnRlc3Quc3RhdHVzLmltBnZfADU2LmJvb3QtMDEuYWMtY24taG9uZ2tvbmctYy5zaGFyZHMudGVzdC5zdGF0dXMuaW0GAbveA4Jyc40AEAUAAQAgAEAAgAEAiXNlY3AyNTZrMaEDP7CbRk-YKJwOFFM4Z9ney0GPc7WPJaCwGkpNRyla7mCDdGNwgnZfg3VkcIIjKIV3YWt1Mg0,enr:-QEcuEAgXDqrYd_TrpUWtn3zmxZ9XPm7O3GS6lV7aMJJOTsbOAAeQwSd_eoHcCXqVzTUtwTyB4855qtbd8DARnExyqHPAYJpZIJ2NIJpcIQihw1Xim11bHRpYWRkcnO4bAAzNi5ib290LTAxLmdjLXVzLWNlbnRyYWwxLWEuc2hhcmRzLnRlc3Quc3RhdHVzLmltBnZfADU2LmJvb3QtMDEuZ2MtdXMtY2VudHJhbDEtYS5zaGFyZHMudGVzdC5zdGF0dXMuaW0GAbveA4Jyc40AEAUAAQAgAEAAgAEAiXNlY3AyNTZrMaECxjqgDQ0WyRSOilYU32DA5k_XNlDis3m1VdXkK9xM6kODdGNwgnZfg3VkcIIjKIV3YWt1Mg0" > output.txt

  4. Looking at the output logs, we can see a lot of peers returned with info like the following:
2 - NEW - enr:-OC4QLnoc6lfObAO4S45yaN-sbGGjPzLw5ujuTEMPdac-yY3Hf9PfNwgBof7GJSjdjHhe86O3Q8X8B5iRhB6sAZnqteGAY1auGaAgmlkgnY0gmlwhJ1agQ2KbXVsdGlhZGRyc7QAMgQv8so7BnZfpQMnACUIAhIhAgwDLi50TMXqxWTCnR_rjQP3Eeznjs645ZNBoRh5B3jhiXNlY3AyNTZrMaEC00rTXzQe5ukmXNABacx7gtZsEJjcMPWkgiaKtlxcIHGDdGNwgupgg3VkcIJu44R1ZHA2giMohXdha3UyAQ
peerID 16Uiu2HAm9eUBRE6NKCPrsVe41bbyQ4esomF6fW5KAE3J2yGwghZz
multiaddr: [/ip4/157.90.129.13/tcp/60000/p2p/16Uiu2HAm9eUBRE6NKCPrsVe41bbyQ4esomF6fW5KAE3J2yGwghZz /ip4/47.242.202.59/tcp/30303/p2p/16Uiu2HAkvEZgh3KLwhLwXg95e5ojM8XykJ4Kxi2T7hk22rnA7pJC/p2p-circuit/p2p/16Uiu2HAm9eUBRE6NKCPrsVe41bbyQ4esomF6fW5KAE3J2yGwghZz]
ip 157.90.129.13:60000
rs: field contains no value
rsv: field contains no value
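
For reference, here is a minimal sketch of how the rs/rsv check above could be done programmatically, assuming go-ethereum's enode/enr packages (the ENR string is a truncated placeholder and this is not the test tool's actual code):

```go
package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/p2p/enode"
	"github.com/ethereum/go-ethereum/p2p/enr"
)

func main() {
	// Truncated placeholder; paste a full ENR string from the tool's output here.
	node, err := enode.Parse(enode.ValidSchemes, "enr:-OC4QLnoc6lf...")
	if err != nil {
		panic(err)
	}

	// "rs" and "rsv" are optional Waku relay-sharding ENR fields. Load returns
	// an error when the key is absent, which is what the tool reports as
	// "field contains no value".
	for _, key := range []string{"rs", "rsv"} {
		var raw []byte
		if err := node.Record().Load(enr.WithEntry(key, &raw)); err != nil {
			fmt.Printf("%s: field contains no value\n", key)
		} else {
			fmt.Printf("%s: 0x%x\n", key, raw)
		}
	}
}
```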

Expected behavior

Peers stored should not belong to a different cluster. Also, it is taking a long time for peers to be returned:
running the above tool for 15 minutes yielded just 20 peers.

In contrast, when I ran a peerExchangeClient against the same fleet nodes, I saw close to 80 unique peers returned in 2-3 minutes.

nwaku version/commit hash

v0.28.0-2-ga96a6b94

Additional things to verify/confirm in nwaku

  • Verify whether the Waku discv5 protocol ID is being used to filter out peers that are not Waku nodes, as per https://rfc.vac.dev/waku/standards/core/33/discv5/
  • In case the metadata protocol disconnects a peer due to a cluster/shard mismatch, is it removed from the discv5 cache?
  • How long after a peer is deemed unreachable is it removed from the discv5 cache? (Low priority)
  • Why are peers not being cycled in discv5? Instead, the same peers keep being returned.
@chaitanyaprem chaitanyaprem added the bug Something isn't working label Jun 14, 2024
@gabrielmer gabrielmer self-assigned this Jun 14, 2024
@Ivansete-status
Collaborator

Thanks for submitting this @chaitanyaprem !
Is there a nwaku version where it behaved as expected?

Answering some points:

Verify whether the Waku discv5 protocol ID is being used to filter out peers that are not Waku nodes, as per https://rfc.vac.dev/waku/standards/core/33/discv5/

Taking the following setup as a reference:

Status-Desktop - 2.28.1 ==> status-go - 0.177.0 ==> go-waku commit dd81e1d469716328f05c05f0526de2adbcd9a4ea

https://github.com/waku-org/go-waku/blob/dd81e1d469716328f05c05f0526de2adbcd9a4ea/waku/v2/discv5/discover.go#L63C5-L63C15

On the other hand, nwaku uses the following:
https://github.com/status-im/nim-eth/blob/d66a29db7ca4372dba116928f979e92cb7f7661f/eth/p2p/discoveryv5/encoding.nim#L38

...which can lead to a wrong result in case of a mismatch: https://github.com/status-im/nim-eth/blob/d66a29db7ca4372dba116928f979e92cb7f7661f/eth/p2p/discoveryv5/encoding.nim#L350

So yes, it seems that this could cause an issue.
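
For illustration, here is a rough Go sketch of the kind of header check the linked nim-eth code performs (this is not the actual Nim code; the spec default protocol ID is "discv5", and builds can override that constant at compile time):

```go
package main

import (
	"bytes"
	"errors"
)

// Spec default; a build-time override (as discussed below) changes this constant,
// after which nodes built with different IDs simply cannot decode each other's packets.
var localProtocolID = []byte("discv5")

// validateProtocolID sketches the check done while decoding a discv5 packet
// header: if the protocol-id bytes do not match the local constant, the packet
// is dropped and the sender never enters the routing table.
func validateProtocolID(headerProtocolID []byte) error {
	if !bytes.Equal(headerProtocolID, localProtocolID) {
		return errors.New("discv5: invalid protocol id, packet dropped")
	}
	return nil
}
```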


  • In case the metadata protocol disconnects a peer due to a cluster/shard mismatch, is it removed from the discv5 cache?
  • How long after a peer is deemed unreachable is it removed from the discv5 cache? (Low priority)

We don't handle any cache in discv5, AFAIK.

( cc @gabrielmer @richard-ramos )

@richard-ramos
Member

nwaku overrides that value to d5waku in its various nim.cfg files via the compilation flag -d:discv5_protocol_id=d5waku.
https://github.com/waku-org/nwaku/blob/master/apps/wakunode2/nim.cfg#L2

@chaitanyaprem
Contributor Author

nwaku overrides that value to d5waku in its various nim.cfg files via the compilation flag -d:discv5_protocol_id=d5waku. https://github.com/waku-org/nwaku/blob/master/apps/wakunode2/nim.cfg#L2

So this means that at least the discv5 protocol ID matches in both the Go and Nim implementations of Waku.
Minor suggestion: why is this a compilation flag? Shouldn't it be the default value used in Waku?

@gabrielmer
Contributor

Looking at https://github.com/status-im/nim-eth/blob/d66a29db7ca4372dba116928f979e92cb7f7661f/eth/p2p/discoveryv5/encoding.nim#L34-L35 and the whole file in general, it does seem that nim-eth was designed so that the value is overridden via a compilation flag.

I can't find another place where it allows setting the value by passing a parameter. Given how all the logic is built around discv5_protocol_id as a global const, I'm not sure there's another way to do it from the Waku side.

@fryorcraken
Collaborator

fryorcraken commented Jun 21, 2024

Using https://github.com/waku-org/test-discv5, I found a lot of TWN nodes, or nodes with no rs or rsv but which do have a waku2 field set to relay.
Another example:

▶ ./waku-enr-decoder --enr=enr:-Kq4QHQhNnT5sQ11JquH6Tk3hYaOZhSXJtTw2S_C9arAOOh1aloWok6O6AhMde1_nFVWX3XyCQ-bMcrlZg6XNVe_WIABgmlkgnY0gmlwhDEMSqOKbXVsdGlhZGRyc4wACgQxDEqjBh9A3gOJc2VjcDI1NmsxoQLEQyKhM2tKAn_kXbylUNZ_QMZ4Rx6M-S7lZqUxJOm9k4N0Y3CCfJyDdWRwgiMohXdha3UyDQ
Decoded ENR:
seq: 1
signature: 0x74213674f9b10d7526ab87e9393785868e66149726d4f0d92fc2f5aac038e8756a5a16a24e8ee8084c75ed7f9c55565f75f2090f9b31cae5660e973557bf5880
peer-id:  16Uiu2HAm8doGEnyoGFn9sLN6XZkSzy81pxHT9MH6fgFmSnkG4unA
ipv4: field has no value
tcp-port: 31900
cluster-id: not available
shards: not available
Wakuv2 Protocols Supported:
/vac/waku/relay/2.0.0
/vac/waku/filter-subscribe/2.0.0-beta1
/vac/waku/lightpush/2.0.0-beta1
multiaddresses:
/ip4/49.12.74.163/tcp/31900/p2p/16Uiu2HAm8doGEnyoGFn9sLN6XZkSzy81pxHT9MH6fgFmSnkG4unA
/ip4/49.12.74.163/tcp/8000/wss/p2p/16Uiu2HAm8doGEnyoGFn9sLN6XZkSzy81pxHT9MH6fgFmSnkG4unA

@chaitanyaprem
Contributor Author

@gabrielmer I was wondering if we could add an admin REST API to query and print the discv5 and PX caches.
This could then be used on fleet nodes to get an idea of the cache contents at any point in time. It would help in debugging issues as well.

Also, I'm wondering whether some helpful discv5 metrics could be added, such as new peers discovered (which can be plotted over time), peers that are reachable, etc.
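
As a starting point, something along these lines (a minimal sketch assuming Prometheus-style instrumentation; the metric names are hypothetical and not existing nwaku or go-waku metrics):

```go
package discv5metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Hypothetical metric names, for illustration only.
	discoveredPeers = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "discv5_discovered_peers_total",
		Help: "Total number of previously unseen peers returned by discv5.",
	})
	reachablePeers = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "discv5_reachable_peers",
		Help: "Number of discv5-discovered peers that were successfully dialed.",
	})
)

func init() {
	prometheus.MustRegister(discoveredPeers, reachablePeers)
}

// RecordDiscovered would be called whenever the discovery loop yields a peer
// it has not seen before; plotting the counter over time gives the discovery rate.
func RecordDiscovered() { discoveredPeers.Inc() }
```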

@gabrielmer
Contributor

@gabrielmer I was wondering if we could add an admin REST API to query and print the discv5 and PX caches. This could then be used on fleet nodes to get an idea of the cache contents at any point in time. It would help in debugging issues as well.

Yes! I think we can add an admin endpoint with any info that might help.

For peer exchange, I understand that the cache you're referring to is the following:

enrCache*: seq[enr.Record]

For discv5, however, I'm not sure what you're referring to. We don't keep any internal cache; we just get random peers from discv5 and add them to the peer manager.

We can either return the peers we have saved that came from discv5 as a source, or get into discv5's inner workings, such as its routing table (or whatever cache you might be referring to), and return those. Could you point out more specifically which data structure you mean by the discv5 cache?
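
For context, a rough sketch of that pattern, written against go-ethereum's discv5 API for illustration (nwaku's Nim implementation differs in detail):

```go
package discovery

import (
	"github.com/ethereum/go-ethereum/p2p/discover"
	"github.com/ethereum/go-ethereum/p2p/enode"
)

// consumeDiscv5 sketches the flow described above: discv5 is consumed as an
// iterator over a random DHT walk, and the application decides which of the
// returned peers to keep. There is no separate "discv5 cache" beyond the
// routing table the library maintains internally.
func consumeDiscv5(listener *discover.UDPv5, keep func(*enode.Node) bool) {
	it := listener.RandomNodes() // nodes surfaced by the random walk
	defer it.Close()
	for it.Next() {
		if n := it.Node(); keep(n) {
			// e.g. inspect the waku2 / rs / rsv ENR fields in keep(),
			// then hand the accepted peer to the peer manager here.
		}
	}
}
```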

Also, wondering if there can be some discv5 metrics added that would be helpful such as new peers discovered (which can be plotted over time) and peers that are reachable etc.

Yes! Please specify the metrics you find useful and their exact definitions, and we can open an issue and get to it ASAP :))

@chaitanyaprem
Contributor Author

chaitanyaprem commented Jun 24, 2024

For discv5, however, I'm not sure what you're referring to. We don't keep any internal cache; we just get random peers from discv5 and add them to the peer manager.

We can either return the peers we have saved that came from discv5 as a source, or get into discv5's inner workings, such as its routing table (or whatever cache you might be referring to), and return those. Could you point out more specifically which data structure you mean by the discv5 cache?

discv5 uses a DHT internally, so there must be a local cache of the nodes discovered so far. Maybe you can check the discv5 code to see whether such a cache exists.
When a node queries the local node, discv5 must be returning peers that are cached somewhere, right?

@chaitanyaprem
Contributor Author

Yes! Please specify the metrics you find useful and their exact definitions, and we can open an issue and get to it ASAP :))

The DST team indicated a few items that may be useful; posting the Discord link:
https://discord.com/channels/1110799176264056863/1252582625055473674/1253848139841015930

@gabrielmer
Contributor

discv5 uses a DHT internally, so there must be a local cache of the nodes discovered so far. Maybe you can check the discv5 code to see whether such a cache exists. When a node queries the local node, discv5 must be returning peers that are cached somewhere, right?

Mmm, I might be mistaken, but I think what you're referring to is the routing table.

@Ivansete-status
Collaborator

In an nwaku PM session, we commented that @SionoiS could help us better understand how discv5 should work and whether we are using it properly.

@SionoiS
Contributor

SionoiS commented Jul 2, 2024

A couple of points I'd like to make:

  • Discv5 does not discriminate between fleets, versions, protocols used, or clusters/shards used.
  • Discv5 finds peers by randomly "walking" the DHT.
  • Discv5 stores only "close" peers; Waku stores all peers found if they match some conditions.

The Status fleet uses cluster 16; what is its ratio to the total number of peers in the DHT? Total network size is around ~1k nodes AFAIK. This problem is exactly why Waku uses a discovery network separate from Ethereum's. Discv5 is not built to find specific peers. Furthermore, the random walk is not even totally random; I suspect some "paths" are more likely than others, which results in the same peers being returned even more often.

I'm afraid none of those problems are bugs.
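
As a rough back-of-the-envelope illustration of that ratio (using figures quoted in this thread, so treat the numbers as assumptions rather than measurements): if roughly 40-60 of the ~1000 nodes in the DHT belong to the Status shards (the bootnode connection counts mentioned above), a uniform random walk only surfaces an in-cluster peer about 4-6% of the time:

    40 / 1000 ≈ 4%  →  ~4 in-cluster peers per 100 nodes walked
    60 / 1000 ≈ 6%  →  ~6 in-cluster peers per 100 nodes walked

So most peers surfaced by the walk would be expected to come from outside the Status cluster, consistent with the rs/rsv-less peers reported above.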

@richard-ramos
Member

Would it make sense for the Status fleet to use a different protocol ID than d5waku? That way the DHT would only contain 'Status' nodes.

@SionoiS
Contributor

SionoiS commented Jul 2, 2024

Would it make sense for the Status fleet to use a different protocol ID than d5waku? That way the DHT would only contain 'Status' nodes.

No need; just don't connect to outside nodes.

It may happen that nodes connect to the outside and "pollute" the Status fleet discovery network with other nodes; in that case, yes, another protocol ID would prevent that.

@richard-ramos
Member

Hm, I think this is already the case. We did see some nwaku nodes from other fleets (like waku.test).

@SionoiS
Contributor

SionoiS commented Jul 12, 2024

I investigated a bit; here are my findings:

  • The routing table is ordered by last seen. Only a PING or FINDNODE response updates this order, by placing the node that responded at the top.
  • An updated ENR is NOT eagerly pushed to other nodes. It would be best to minimize Waku-triggered ENR updates.
  • We could segregate the networks via protocol IDs: Status, Waku, etc…
  • We could take full control of what discv5 does if we want more control.
  • Do we need to enable ENR auto-update? Don't we already determine the ip:port of the local node?
  • The periodic PING is only done for nodes in a bucket, not for the replacement cache.
  • The discv5 auto-refresh loop is redundant with Waku's.
  • Buckets are split when full but never merged when empty. I'm not sure it matters.

Strategies we could try by taking full control, without modifying discv5 in the case of nwaku:

  • Filter peers by cluster/shard and prioritize the routing table for the supported cluster/shard.
    • Can fracture the network if filtering is too aggressive.
    • We can get as fancy as we want with the choice of nodes.
    • Light version of the composable DHT idea.
  • Create a new liveness check: ping nodes, but increase the time between pings for each success (see the sketch below).
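
A minimal sketch of that second strategy (illustrative Go; the interval bounds and names are assumptions, not an existing implementation):

```go
package liveness

import "time"

const (
	minPingInterval = 30 * time.Second
	maxPingInterval = 30 * time.Minute
)

// pingSchedule implements the idea above: after every successful ping the
// interval until the next ping grows, so long-lived, reliably reachable peers
// are probed less and less often, while any failure resets the interval so
// flaky peers are checked aggressively again.
type pingSchedule struct {
	interval time.Duration
}

func newPingSchedule() *pingSchedule {
	return &pingSchedule{interval: minPingInterval}
}

// next returns how long to wait before pinging this peer again.
func (p *pingSchedule) next(lastPingSucceeded bool) time.Duration {
	if lastPingSucceeded {
		p.interval *= 2
		if p.interval > maxPingInterval {
			p.interval = maxPingInterval
		}
	} else {
		p.interval = minPingInterval
	}
	return p.interval
}
```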

@Ivansete-status
Collaborator

@chaitanyaprem - in an nwaku PM session we discussed closing this, as we consider that discv5 is working correctly.
Kindly let us know if you agree to closing it :)

@SionoiS analyzed discv5 and didn't see issues there.

@gabrielmer gabrielmer added the effort/weeks Estimated to be completed in a few weeks label Jul 24, 2024
@chaitanyaprem
Contributor Author

@chaitanyaprem - in an nwaku PM session we discussed closing this, as we consider that discv5 is working correctly. Kindly let us know if you agree to closing it :)

@SionoiS analyzed discv5 and didn't see issues there.

Sure, thanks.
Once 2.30 is deployed we will have better metrics to look at and get an idea of what is happening.
We can open another specific issue if required.
