
feat(networking): integrate gossipsub scoring #1769

Merged · 11 commits · Jun 6, 2023

Conversation

@alrevuelta (Contributor) commented May 30, 2023

closes #1756

Changes:

  • Only allow handling messages containing a valid WakuMessage: a single handler type, no more handling seq[byte] and parsing. The Waku relay message handler now takes a WakuMessage directly.
  • Add a default validator to all subscribed topics, which enforces that the payload decodes into a valid WakuMessage. If it decodes, the message is accepted; otherwise it is rejected. With this, "upper layers" now always get a WakuMessage.
  • Integrate gossipsub scoring and trigger disconnections from bad peers.
  • Set safe default parameters for gossipsub scoring.
  • Add tests ensuring the feature is integrated correctly, verifying that disconnections based on score are triggered.
  • Change the in/out peer ratio, enforcing 50/50. The previous 10% share of outgoing peers seems to cause some issues in simulations; to be continued.
  • Tested on a sandbox machine with 200 nodes.
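The default-validator idea from the list above can be sketched roughly as follows. This is an illustrative Python sketch of the control flow only, not the actual Nim implementation; `decode_waku_message` is a stand-in for `WakuMessage.decode`:

```python
from enum import Enum

class ValidationResult(Enum):
    ACCEPT = "accept"
    REJECT = "reject"

def decode_waku_message(data: bytes):
    # Stand-in for WakuMessage.decode(): here we only require a
    # non-empty payload; the real decoder parses the protobuf fields.
    if not data:
        return None
    return {"payload": data}

def default_validator(pubsub_topic: str, data: bytes) -> ValidationResult:
    # Attached to every subscribed topic: accept only payloads that
    # decode into a valid WakuMessage, so upper layers never see raw bytes.
    msg = decode_waku_message(data)
    return ValidationResult.ACCEPT if msg is not None else ValidationResult.REJECT
```

With this in place, a handler registered on a topic only ever fires for messages that passed the decode step.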

@alrevuelta alrevuelta marked this pull request as ready for review June 1, 2023 07:21
@@ -264,7 +264,7 @@ proc start*(bridge: WakuBridge) {.async.} =

# Always mount relay for bridge.
# `triggerSelf` is false on a `bridge` to avoid duplicates
await bridge.nodev2.mountRelay(triggerSelf = false)
await bridge.nodev2.mountRelay()
Member:

Is the bridge.nodev2.wakuRelay.triggerSelf = false missing here?

Contributor Author:

indeed!

@alrevuelta (Contributor Author), Jun 2, 2023:

done b1df0dd

Collaborator:

I wonder if we could set triggerSelf = false inside mountRelay() itself so that we don't potentially miss it in other places.

Contributor Author:

For now I would say triggerSelf = true should be the default value; the bridge is a particular case.

@jm-clius (Contributor) left a comment:

Thanks. Minor nits below, but LGTM! Not approving yet, as I know you want to test the selected parameters first. One remaining uncertainty is the performance/memory cost of decoding every message twice, i.e. once for validation and once for handling. The assumption here is that it's negligible, but it may be worth double-checking in simulation.

fut.complete()
return fut
else:
return handler(pubsubTopic, decMsg.get)
Contributor:

Nit:

Suggested change
return handler(pubsubTopic, decMsg.get)
return handler(pubsubTopic, decMsg.get())

Contributor Author:

sure b1df0dd

debug "message decode failure", pubsubTopic=pubsubTopic, error=decodeRes.error
return
# rejects messages that are not WakuMessage
proc validator(pubsubTopic: string, message: messages.Message): Future[ValidationResult] {.async.} =
Contributor:

Do we need a separate validator for each subscribed pubsubTopic? Why not define this outside the subscribe block, in other words?

Contributor Author:

tldr: indeed, done b1df0dd

Longer version: related to your other comment, I had the validator inside because I was trying to optimize this, to avoid decoding the message twice (or as many times as we have validators). This optimization was similar to the one in nimbus, where the validator also acts as the handler. Meaning that inside the validator you also call the handler, and then you don't have to register the handler (just pass nil in subscribe) because the validator calls it for you. This way you don't decode the message both i) in the validation stage and ii) in the handler stage, just once.

something like:

proc validator(pubsubTopic: string, message: messages.Message): Future[ValidationResult] {.async.} =
  let msg = WakuMessage.decode(message.data)
  if msg.isOk():
    # the validator also invokes the registered handler,
    # so the message is decoded only once
    await yourHandler(pubsubTopic, msg.get())
    return ValidationResult.Accept
  return ValidationResult.Reject

But in the end I discarded this solution, as it has some implications. For example, when having multiple validators (RLN, signed topic, etc.), this hacked validator containing the handler would always need to run last, and this hack on top of a hack seemed too much.

I haven't done a detailed analysis of the impact of decoding the message twice, but I haven't seen a significant increase in any parameter in the simulations. So at this stage I would say it's over-optimizing. Leaving this idea here in case we want to implement it later on.

Contributor:

We have an issue somewhere (status-im/nimbus-eth2#3043) about adding a "user data" field to the Message type, to let the app decode a single time and store the decoded version there.
That will require some trickery since libp2p doesn't know the type of the user data, but feel free to pick that up

Contributor:

Ok, very interesting about both possible approaches to avoiding double decoding. Agree with doing what is simple and supported in libp2p for now.

)

# see: https://rfc.vac.dev/spec/29/#gossipsub-v10-parameters
const gossipsubParams = GossipSubParams(
Contributor:

Nit: any chance to make this and the other const PascalCase? I know this may lead to symbol clash, so feel free to ignore.

Contributor Author:

missed the comment, done b1df0dd


@alrevuelta (Contributor Author):

> Thanks. Minor nits below, but LGTM! Not approving as I know you want to test the selected parameters first. One more uncertainty is what the performance/memory knock would be if we decode all messages twice, i.e. for both validation and handling. Assumption here is negligible/nothing, but may be worth double checking in simulation.

@jm-clius thanks for the comments. Replied to this here. Indeed, it seems negligible; I left an idea in that comment in case we want to optimize this in the future.

@Ivansete-status (Collaborator) left a comment:

LGTM! This PR is very cool! I just have a doubt re the unsubscribe change.

discard await nodes[0].wakuRelay.publish(topic, urandom(1*(10^3)))

# long wait, must be higher than the configured decayInterval (how often score is updated)
await sleepAsync(20.seconds)
Collaborator:

I wonder if we somehow could enforce a shorter decayInterval so that we can have a faster test

Contributor Author:

Yep, this is a long delay, but I would like to leave it as is, so that we don't end up testing something different from how it behaves in reality.
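For context on why the test must out-wait the interval: gossipsub only refreshes (and decays) peer scores once per decayInterval, so a score-based disconnection can only be observed after at least one full interval has elapsed. A rough Python sketch of that timing relationship, with an illustrative interval value (not the actual configured one):

```python
# Peer scores are only refreshed once per decay interval, so a test
# observing a score-based disconnection must sleep at least that long.
DECAY_INTERVAL_SECONDS = 12.0   # hypothetical decayInterval value

def score_updates_observed(wait_seconds: float) -> int:
    # Number of scoring passes that fire while the test sleeps.
    return int(wait_seconds // DECAY_INTERVAL_SECONDS)
```

With these numbers, a 20 s sleep guarantees at least one scoring pass, while a 5 s sleep could see none, which is why shortening the wait without also shortening decayInterval would make the test flaky.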

tests/v2/waku_relay/test_wakunode_relay.nim (thread resolved)

let wakuRelay = node.wakuRelay
wakuRelay.unsubscribe(@[(topic, handler)])
wakuRelay.unsubscribe(topic)
Collaborator:

Sorry but I don't quite see why we don't need handler now.

Contributor Author:

Good catch, this was a leftover. I have replaced unsubscribeAll with unsubscribe, so whenever we now call unsubscribe, we unsubscribe all handlers from the topic, meaning we fully unsubscribe from it.

Unsubscribing individual handlers seemed very specific, and since we now wrap the handler to only accept a WakuMessage, it wasn't trivial to adapt.

If in the future we need to unsubscribe individual handlers, we can revisit this, but imho it would be better to incentivize people to use a single handler that does everything.

fixed dc70df9
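The changed semantics can be pictured as a topic-to-handlers map where unsubscribe drops the whole entry rather than a single handler. A minimal Python sketch with illustrative names (not the nim-waku API):

```python
# Sketch of the new semantics: unsubscribe() removes *all* handlers
# for a topic (the old unsubscribeAll), not one specific handler.
class RelaySubscriptions:
    def __init__(self):
        self.handlers: dict[str, list] = {}   # topic -> registered handlers

    def subscribe(self, topic: str, handler) -> None:
        self.handlers.setdefault(topic, []).append(handler)

    def unsubscribe(self, topic: str) -> None:
        # fully unsubscribe from the topic: every handler is dropped
        self.handlers.pop(topic, None)
```

Since no handler reference is needed to unsubscribe, the call site only has to know the topic, which is why the `handler` argument disappeared from the diff above.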


fanoutTTL: chronos.minutes(1),
seenTTL: chronos.minutes(2),

# no gossip is send to peers below this score
Collaborator:

tiny typo

Suggested change
# no gossip is send to peers below this score
# no gossip is sent to peers below this score

Contributor Author:

thanks dc70df9


# p6: penalizes peers sharing more than threshold ips
ipColocationFactorWeight: -50.0,
ipColocationFactorThreshold: 5.0,
Collaborator:

Does that implicitly cover the action we perform to prune peers with a big number of IPs?

Contributor Author:

This indeed lowers the peer score based on the IP (it kicks in with more than 5 peers per IP). But note that if 6 peers are behind the same IP and they perform great on the other scores, we won't really prune any peer.

Also note that this doesn't replace our IP-based peer pruning, since this score only applies to peers participating in gossipsub.
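For reference, the P6 penalty in the gossipsub v1.1 scoring spec is quadratic in the number of peers above the threshold, which is why a handful of otherwise well-behaved peers behind one IP can survive. A hedged Python sketch, using the weight and threshold from this PR's config and the squaring rule from the gossipsub v1.1 spec:

```python
# P6 (IP colocation) penalty, per gossipsub v1.1 scoring:
# peers sharing an IP beyond the threshold get a quadratic penalty.
IP_COLOCATION_FACTOR_WEIGHT = -50.0    # from this PR's parameters
IP_COLOCATION_FACTOR_THRESHOLD = 5.0

def p6_penalty(peers_on_same_ip: int) -> float:
    surplus = peers_on_same_ip - IP_COLOCATION_FACTOR_THRESHOLD
    if surplus <= 0:
        return 0.0   # at or below threshold: no penalty
    return IP_COLOCATION_FACTOR_WEIGHT * surplus ** 2
```

So 5 peers behind one IP incur no penalty, a 6th adds -50 to each colocated peer's score, and the penalty grows quadratically from there, eventually dominating positive score components.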

let decMsg = WakuMessage.decode(data)
if decMsg.isErr():
# fine if triggerSelf enabled, since validators are bypassed
error "failed to decode WakuMessage, validator passed a wrong message"
Collaborator:

Suggested change
error "failed to decode WakuMessage, validator passed a wrong message"
error "failed to decode WakuMessage, validator passed a wrong message", error = decMsg.error

Contributor Author:

thanks dc70df9


ok(w)
Collaborator:

Not a big deal at all, but I personally prefer being explicit and always using return. Just for newcomers to Nim :)

Suggested change
ok(w)
return ok(w)

Contributor Author:

yep, i actually prefer to be explicit with this
dc70df9

@alrevuelta (Contributor Author):

Did some simulations and everything seems fine. No disconnections were triggered and everything was stable:

  • network with 150 nodes, with discv5.
  • injecting valid messages at a constant rate.

However, when creating 20 attackers (nodes publishing wrong messages, affecting the p4 score), disconnections from them were triggered (as expected), but these nodes were not entirely kicked out of the network. I guess that's expected, since afaik nim-libp2p doesn't check the score before a peer tries to connect (meaning a peer gets a low score, is kicked out, connects again, is kicked out again, etc.). It would be better to entirely kick these peers out; perhaps out of scope for this PR, but something to take into account.

see libp2p_gossipsub_bad_score_disconnection metric:


proc unsubscribeAll*(w: WakuRelay, pubsubTopic: PubsubTopic) =
debug "unsubscribeAll", pubsubTopic=pubsubTopic

procCall GossipSub(w).unsubscribeAll(pubsubTopic)


proc publish*(w: WakuRelay, pubsubTopic: PubsubTopic, message: WakuMessage|seq[byte]): Future[int] {.async.} =
Contributor:

If the idea is to no longer return a seq[byte] in the subscription handler, only WakuMessages, then to be consistent you should not allow publishing a seq[byte] either, only WakuMessages.

Contributor Author:

yup, done dc70df9

let decMsg = WakuMessage.decode(data)
if decMsg.isErr():
# fine if triggerSelf enabled, since validators are bypassed
error "failed to decode WakuMessage, validator passed a wrong message"
@LNSD (Contributor), Jun 3, 2023:

This error level can be extremely noisy. Think of a "DoS" scenario. It will blow up the logging system.

Use a debug or a trace log level instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really: since we now enforce that every subscribed topic only gets valid WakuMessages in its handler, we should never enter if decMsg.isErr(), even in a DoS scenario. That's what the validator is for.

@@ -120,6 +179,12 @@ method stop*(w: WakuRelay) {.async.} =
debug "stop"
await procCall GossipSub(w).stop()

# rejects messages that are not WakuMessage
proc validator(pubsubTopic: string, message: messages.Message): Future[ValidationResult] {.async.} =
let msg = WakuMessage.decode(message.data)
Contributor:

Optimization note:

You don't need to decode a message to know if the message is correct. Although typically Protocol buffer libraries perform both at the same time when decoding, validation and deserialization can be performed independently. For example, using nim-libp2p's protobuf library, you can check if the required fields are present without allocating memory.
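The suggestion above can be illustrated with a minimal sketch that walks the protobuf wire format and records which field numbers are present, without building the decoded message. This is Python and purely illustrative (a real implementation would use nim-libp2p's protobuf reader), and it assumes the WakuMessage field numbering where payload is field 1 and contentTopic is field 2:

```python
def present_fields(data: bytes) -> set[int]:
    # Walk the protobuf wire format, collecting field numbers
    # without materializing any decoded values.
    fields, i = set(), 0
    while i < len(data):
        # read the varint field key
        key, shift = 0, 0
        while True:
            b = data[i]; i += 1
            key |= (b & 0x7F) << shift
            shift += 7
            if not (b & 0x80):
                break
        field_no, wire_type = key >> 3, key & 0x07
        fields.add(field_no)
        if wire_type == 0:            # varint value: skip its bytes
            while data[i] & 0x80:
                i += 1
            i += 1
        elif wire_type == 2:          # length-delimited: read length, skip
            length, shift = 0, 0
            while True:
                b = data[i]; i += 1
                length |= (b & 0x7F) << shift
                shift += 7
                if not (b & 0x80):
                    break
            i += length
        elif wire_type == 5:          # fixed32
            i += 4
        elif wire_type == 1:          # fixed64
            i += 8
        else:
            raise ValueError("unsupported wire type")
    return fields

def looks_like_waku_message(data: bytes) -> bool:
    # hypothetical check: require payload (1) and contentTopic (2)
    try:
        return {1, 2} <= present_fields(data)
    except (IndexError, ValueError):
        return False
```

No per-field allocation happens here: values are skipped, not copied, which is the essence of separating validation from deserialization.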

Contributor Author:

interesting, added a note here: dc70df9 for future optimizations.

@jm-clius (Contributor) left a comment:

LGTM, thanks!

@alrevuelta merged commit 34a9263 into master on Jun 6, 2023
15 checks passed
@alrevuelta deleted the integrate-libp2p-scoring branch on June 6, 2023 17:28