
doc: best practice to handle disconnection when sending messages over relay #921

Closed
fryorcraken opened this issue Nov 23, 2023 · 15 comments
Labels
documentation Improvements or additions to documentation

Comments

@fryorcraken
Collaborator

Background

The Waku Relay protocol does not provide feedback on message sending. The protocol is robust thanks to built-in redundancy and a scoring mechanism that ensures a node is connected to quality peers.

However, when the local node loses access to the internet, this redundancy no longer applies because no remote peer is reachable.

Details

Provide best practices for connection and peer management to help an application handle this scenario.

Most likely:

  • Discuss with the Vac p2p/Nimbus teams to understand the current best practices around gossipsub.
  • Clarify the tools available to applications: store, ping (ipfs), etc.
  • Propose one or several strategies to handle the scenario (documented or as a test case).
  • Highlight future Waku improvements that will assist with this topic, such as Store sync.

Acceptance Criteria

  • Mitigation strategies are proposed to the Status team and are implementable.

Notes

status-im/status-desktop#12813

@fryorcraken fryorcraken added the documentation Improvements or additions to documentation label Nov 23, 2023
@fryorcraken fryorcraken changed the title from "chore: best practice to handle disconnection when sending messages over relay" to "doc: best practice to handle disconnection when sending messages over relay" Nov 23, 2023
@fryorcraken
Collaborator Author

Some ideas:

  • Use regular ipfs/ping for some or all relay peers; this might need to be done per pubsub topic.
  • If all pings fail (for a given pubsub topic?), refer to the last successful ping to estimate the possible downtime.
  • The application needs to decide which messages are worth resending (e.g. an "online presence" message should probably not be rebroadcast).
  • Do a store query covering the plausible downtime to determine which messages were actually never sent.
  • Resend messages if applicable.

@chaitanyaprem chaitanyaprem self-assigned this Nov 23, 2023
@fryorcraken
Collaborator Author

fryorcraken commented Nov 28, 2023

Please also include a recommendation for messages not received between the moment the connection is lost and the moment the loss is detected.

@chaitanyaprem
Collaborator

chaitanyaprem commented Nov 29, 2023

Since Waku Relay uses the underlying libp2p gossipsub for communication, there is no way to know whether a message sent by the application has actually been sent out to at least one peer in the network. This is by gossipsub's design, which is essentially a fire-and-forget model.
In most scenarios this works well, as long as there are enough connections to peers interested in the pubsubTopic a message is being sent to. This ensures the message gets relayed and reaches all interested destinations.

Problem Areas

There can be abnormal cases where a message sent by the application is not sent out into the network. The following are a few:

No peers connected for pubsubTopic

There are no peers connected for the pubsubTopic we are sending the message to. This can be mitigated by specifying the minPeersToPublish config option while initializing Waku Relay.
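As a rough illustration, such a guard could look like the following when publishing through go-libp2p-pubsub directly. This is a minimal sketch; publishWithMinPeers and minPeers are illustrative names, not the exact go-waku option.

```go
package main

import (
	"context"
	"fmt"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

// publishWithMinPeers refuses to publish when fewer than minPeers peers are
// known for the topic, mirroring the minPeersToPublish safeguard described
// above. Names are illustrative, not the exact go-waku option.
func publishWithMinPeers(ctx context.Context, topic *pubsub.Topic, payload []byte, minPeers int) error {
	if peers := topic.ListPeers(); len(peers) < minPeers {
		return fmt.Errorf("only %d peer(s) on %s, need at least %d to publish", len(peers), topic.String(), minPeers)
	}
	return topic.Publish(ctx, payload)
}
```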

Network disconnections and their identification

Since gossipsub has no ack mechanism, it is hard to know whether a message has actually been sent out to the network or not. To identify network disconnections early, a TCP keepalive can be enabled on all connections with a timeout, marking a connection as down after 3 timeouts.

Per @richard-ramos: in status-go we set an aggressive check interval of 10s, so theoretically we should disconnect a peer within a maximum of 30s (we ping at 10s, it fails; we ping again at 20s, it fails; we ping a third time and it fails again; since this exceeds the maximum of 2 failures, we disconnect).
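For illustration, a minimal sketch of such a ping-based liveness check using go-libp2p's ping service, with the interval and failure threshold described above; the actual status-go keepalive logic may differ.

```go
package main

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/p2p/protocol/ping"
)

// watchPeer pings a peer every interval and closes the connection after
// maxFailures consecutive failures (10s interval, 3 failures ≈ 30s as above).
// A sketch only; status-go's actual keepalive logic may differ.
func watchPeer(ctx context.Context, h host.Host, p peer.ID, interval time.Duration, maxFailures int) {
	pinger := ping.NewPingService(h) // in practice, share one ping service per host
	failures := 0
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			pingCtx, cancel := context.WithTimeout(ctx, interval)
			res, ok := <-pinger.Ping(pingCtx, p)
			cancel()
			if !ok || res.Error != nil { // no result or ping error counts as a failure
				failures++
				if failures >= maxFailures {
					_ = h.Network().ClosePeer(p) // mark the connection as down
					return
				}
			} else {
				failures = 0
			}
		}
	}
}
```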

Few scenarios wrt network disconnection

  • The application has identified that the network is disconnected. Possible ways to handle this scenario:
    • Mark messages being sent by the user so that they can be resent. This is already done in the Status app.
    • Maintain a message cache and automatically retry once the network is reconnected.
  • The application has not yet identified that the network is disconnected.

Possible approach to reliably send messages via Waku relay/gossipsub

Approach-1:
Have an acknowledgement in application messages to confirm a message was delivered, e.g. this is already done for 1:1 messaging in the Status app.
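For illustration, a hypothetical sketch of such an application-level acknowledgement layer (all names below are made up, not Status code): each outgoing message is tracked until its ID is acknowledged, and resent once if no ack arrives in time.

```go
package main

import (
	"sync"
	"time"
)

// ackTracker is a hypothetical application-level acknowledgement layer:
// outgoing messages are remembered until the recipient echoes their ID back.
type ackTracker struct {
	mu      sync.Mutex
	pending map[string][]byte // messageID -> payload awaiting an ack
	resend  func(id string, payload []byte)
}

func newAckTracker(resend func(string, []byte)) *ackTracker {
	return &ackTracker{pending: make(map[string][]byte), resend: resend}
}

// Track registers an outgoing message and resends it once if no ack
// has arrived after the timeout.
func (a *ackTracker) Track(id string, payload []byte, timeout time.Duration) {
	a.mu.Lock()
	a.pending[id] = payload
	a.mu.Unlock()

	time.AfterFunc(timeout, func() {
		a.mu.Lock()
		p, stillPending := a.pending[id]
		a.mu.Unlock()
		if stillPending {
			a.resend(id, p)
		}
	})
}

// Ack is called when the peer confirms receipt of a message ID.
func (a *ackTracker) Ack(id string) {
	a.mu.Lock()
	delete(a.pending, id)
	a.mu.Unlock()
}
```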

Approach-2:
Where message ordering/sequencing is not guaranteed or is hard to coordinate (e.g. Status communities), the following approach could be taken (a sketch of steps 5-6 follows at the end of this comment):

  1. Enable TCP keep-alives (probably with an interval of 10 secs, which is a little aggressive) and mark connections as down after 3 failures, i.e. roughly 30 seconds.
  2. Cache the messages that are to be sent out in a local cache (for a period of 1 minute) per pubsubTopic.
  3. Periodically clear older messages from the cache, i.e. messages with a timestamp older than 1 minute, as they must have been delivered.
  4. In case a network disconnection is identified at t1, don't clear the messages from the cache; rather, wait until the network is connected again.
  5. Once the network is connected at t2, send a query to a store node with fromTimestamp set to the oldest message in the cache and toTimestamp set to t2 to fetch messages.
  6. Check whether the message IDs in the cache are present in the store-query response; if not, mark them as not sent and either automatically resend them or flag them to the user. Note: application logic has to be used to determine whether a message really needs to be resent (i.e. whether the message is still relevant after time t has passed).

Note: The TCP keep-alive interval can be increased, which would also increase the message cache size stored locally and the processing of the store query and response once the network is connected again. This is something that can be fine-tuned based on app behaviour.

Also note that this is a little inefficient, as the current Store protocol only supports querying messages by time range and the response includes complete messages.
This can be further improved once the Store protocol is enhanced to support querying by a list of message IDs, or once a store-sync request can be sent to compare the local and remote store (if a local store is enabled). cc @jm-clius, @ABresting: is this something being considered as part of the store-sync implementation?
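To make steps 5-6 above concrete, here is a rough sketch in Go. The store query, message cache and callback types are placeholders, not the actual go-waku API.

```go
package main

import (
	"context"
	"time"
)

// cachedMsg is a locally cached outgoing message (field names are hypothetical).
type cachedMsg struct {
	ID      string
	Payload []byte
	SentAt  time.Time
}

// storeQuery stands in for a Waku Store time-range query on one pubsubTopic;
// it returns the IDs of the messages the store node has for that window.
type storeQuery func(ctx context.Context, pubsubTopic string, from, to time.Time) (map[string]struct{}, error)

// resendMissing implements steps 5-6: after reconnecting at `now`, query the
// store from the oldest cached message onwards and resend anything the store
// never saw. Whether a stale message is still worth resending is an
// application decision, modelled here by stillRelevant.
func resendMissing(ctx context.Context, pubsubTopic string, cache []cachedMsg, now time.Time,
	query storeQuery, publish func(cachedMsg) error, stillRelevant func(cachedMsg) bool) error {

	if len(cache) == 0 {
		return nil
	}
	oldest := cache[0].SentAt
	for _, m := range cache {
		if m.SentAt.Before(oldest) {
			oldest = m.SentAt
		}
	}

	stored, err := query(ctx, pubsubTopic, oldest, now)
	if err != nil {
		return err
	}

	for _, m := range cache {
		if _, ok := stored[m.ID]; ok {
			continue // the message made it into the network
		}
		if !stillRelevant(m) {
			continue // e.g. an "online presence" message that is now stale
		}
		if err := publish(m); err != nil {
			return err
		}
	}
	return nil
}
```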

@jm-clius

This can be further improved once the Store protocol is enhanced to support querying by a list of message IDs, or once a store-sync request can be sent to compare the local and remote store (if a local store is enabled).

Indeed, we are certainly planning to extend Store to allow querying only for message IDs/hashes. A further step would be to allow for comparison with something akin to the IHAVE, IWANT mechanism in GossipSub. @ABresting, what could also be useful here is a lightweight DOYOUHAVE mechanism that allows the client to send a list of message hashes to the Store, with the Store responding with the subset of hashes it has stored.


I think your proposal re a short cache and using a store query to "resume" publishing after detecting a disconnect is a reasonable short term workaround.
I would suggest also investigating the impact of always using lightpush as the publishing mechanism, even for relay nodes. Currently the impact might be:

  1. the publishing node (lightpush client, but also a normal relay node) will receive a duplicate of all its messages on the Waku Relay mesh as the lightpush service nodes would not consider the publisher as the "origin" on the Waku Relay layer.
  2. the lightpush service node would be considered the first publisher of the message into the Relay network, in which case it will likely use flood publish (publish the message to all known peers subscribed to the topic). This would have a significant impact on the total bandwidth usage of the network.

I'd say (1) and (2) are both prohibitive in the long term, although there may be ways to minimise the impact (such as disabling flood publish on the relay layer or finding some way to ensure that on a gossipsub/Relay level the lightpush client does not receive a duplicate of the message it has just published).

@fryorcraken
Collaborator Author

Approach-2:

Looks good to me; it sounds reasonable to encourage the proposed solution as a first step.

Would be keen to better understand what Nimbus or other libp2p-gossipsub users do before going down the path of systematic lightpush usage, considering the caveats.

@chaitanyaprem
Collaborator

I think your proposal re a short cache and using a store query to "resume" publishing after detecting a disconnect is a reasonable short term workaround.

Yes, the idea is that this would be a short-term work-around which can either be enhanced or modified at a later stage.

@chaitanyaprem
Collaborator

chaitanyaprem commented Nov 30, 2023

Would be keen to better understand what Nimbus or other libp2p-gossipsub users do before going down the path of systematic lightpush usage, considering the caveats.

Per @arnetheduck, Nimbus has its own app-layer protocol to detect lost messages and remedy the situation.
@arnetheduck, can you point to a code reference or any documentation detailing how this is remedied in Nimbus?

The best practice is that the protocol that loses messages (in this case the status app) when using an unreliable message transport (in this case gossipsub) has its own in-protocol way to detect that messages are lost and provides the mechanism to remedy the situation - the question is a bit akin to asking whether there's a way to know if a network packet sent on the internet arrived at the destination - there is not and various mechanisms exist to work around this, all of which require application protocol support.

Gossipsub is not a reliable message transport (just like TCP and UDP aren't reliable message transports) - this needs a dedicated layer "somewhere" and that somewhere sits above gossipsub. 

Nimbus/ethereum faces the same problem and has in-protocol support for detecting lost messages (blocks in this case) and a recovery mechanism to re-request them on demand (not via gossipsub). 

Reference discussion in discord https://discord.com/channels/613988663034118151/636230707831767054/1179360108979949628

@chaitanyaprem
Collaborator

Please also include a recommendation for messages not received between the moment the connection is lost and the moment the loss is detected.

Possible approach to handle messages not received due to connection loss

  • Network loss detection would be the same as explained above.
  • Once a network disconnection has been detected and connectivity is restored, a query can be made to a store node covering the period from the lastReceivedMsg timestamp to the current timestamp.
  • Note that if a node is working with multiple shards/pubsubTopics, this query will have to be made per pubsubTopic, because the last message received on one shard may have a different timestamp than on another shard. Also, store nodes could serve only a specific shard, which means the query might have to be made to different store nodes.
  • Use the store query response to compare against messages already received and process the messages that are new in the store response (a sketch follows below).

Note: This approach will be needed whether Relay or Filter is used to receive messages.
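For illustration, a rough sketch of this receive-side recovery in Go; the store query signature and field names are placeholders rather than the actual go-waku Store API.

```go
package main

import (
	"context"
	"time"
)

// storedMsg stands in for a message returned by a Waku Store query.
type storedMsg struct {
	ID      string
	Payload []byte
}

// recoverMissed queries a store node for the window between the last message
// received on a shard and the reconnection time, then hands any message we
// have not seen before to the application.
func recoverMissed(ctx context.Context, pubsubTopic string, lastReceived, reconnectedAt time.Time,
	query func(ctx context.Context, pubsubTopic string, from, to time.Time) ([]storedMsg, error),
	alreadySeen func(id string) bool, process func(storedMsg)) error {

	msgs, err := query(ctx, pubsubTopic, lastReceived, reconnectedAt)
	if err != nil {
		return err
	}
	for _, m := range msgs {
		if alreadySeen(m.ID) {
			continue // already received over Relay/Filter before the disconnection
		}
		process(m)
	}
	return nil
}
```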

@chaitanyaprem
Collaborator

chaitanyaprem commented Nov 30, 2023

I would suggest also investigating the impact of always using lightpush as the publishing mechanism, even for relay nodes. Currently the impact might be:

1. the publishing node (lightpush client, but also a normal relay node) will receive a duplicate of all its messages on the Waku Relay mesh as the lightpush service nodes would not consider the publisher as the "origin" on the Waku Relay layer.

2. the lightpush service node would be considered the first publisher of the message into the Relay network, in which case it will likely use flood publish (publish the message to all known peers in the message). This would have a significant impact on the total bandwidth usage of the network.

I'd say (1) and (2) are both prohibitive in the long term, although there may be ways to minimise the impact (such as disabling flood publish on the relay layer or finding some way to ensure that on a gossipsub/Relay level the lightpush client does not receive a duplicate of the message it has just published).

This is also a good idea, considering lightpush has an acknowledgement for each message. But the built-in redundancy that Relay provides would need to be artificially replicated with lightpush. Also, if every node uses lightpush, we lose the redundancy and robustness of having other Status desktop apps deliver messages.
With lightpush, only the fleet delivers messages, making delivery dependent solely on the fleet.
IMO it would be better if we can solve the issues noticed during disconnections/no-peer situations with alternative approaches.

Also, it looks like Nimbus employs an application-level protocol to handle such issues with gossipsub. I also think we can look at other clients like Prysm to find out what mechanisms they use.

@arnetheduck

The key point from the discord discussions is that message loss in gossipsub is a predetermined outcome - you can improve delivery rates by employing various tricks like relying on IHAVE/IWANT, sending pings and pongs, changing timeouts and so on but all that is ultimately mostly pointless effort with small returns - the sooner this point is recognized, the sooner the problem can actually be solved.

Gossipsub is not a reliable transport - it does not have the features necessary to solve this problem - all the inventive mitigations ("detect offline and resend", "use ping", "use ipfs", "use an ack" etc) cited above are just that: mitigations and optimizations that merely kick the can down the road by providing some tiny improvement in some special case at high cost - at the end of the day, the protocol using gossipsub needs to have its own mechanism for detecting, and potentially dealing with message loss - "dealing with" might involve notifying the user that a message was lost and it might include a recovery/resend mechanism, but the important point here is that gossipsub on its own cannot solve this problem - a separate layer that contains a reliable messaging mechanism (sequence numbering / message dag building / eventual consistency protocol / crdt's / etc) is needed, and preferably one that is integrated with the E2E encryption used (which gossipsub also does not offer).

Gossipsub is really really good at not losing messages in general thanks to all the mechanisms of redundancy and resend that it already has - this is a problem because it makes users of it assume that it is perfect and actually contains a reliability mechanism. Perhaps the best feature that could be added to gossipsub / waku would be the option to deliberately drop 10% of all messages - this would force application developers to solve this problem early on in their design process - it would be hard to notice such a message loss rate in a protocol that includes a recovery mechanism, but one that doesn't would immediately and obviously be noticed.
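For what it's worth, such a deliberately lossy test mode could be as simple as the following sketch (purely hypothetical; not an existing gossipsub/Waku option).

```go
package main

import "math/rand"

// lossyPublish wraps a publish function and deliberately drops roughly
// dropRate (e.g. 0.10) of outgoing messages, so that missing application-level
// reliability handling becomes obvious during development.
// Purely a hypothetical testing aid, not an existing gossipsub/Waku option.
func lossyPublish(publish func([]byte) error, dropRate float64) func([]byte) error {
	return func(payload []byte) error {
		if rand.Float64() < dropRate {
			return nil // silently drop: the application must detect and recover on its own
		}
		return publish(payload)
	}
}
```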

Barring such drastic measures, next steps here include:

  • Documenting clearly the effects of an unreliable transport, ie that message loss is expected and that it is the responsibility of the application layer to solve this problem
    • It is important to make this point abundantly clear, because for some reason, it keeps getting lost in discussions: Reliability cannot be solved at the gossipsub layer - it needs a separate mechanism designed for that purpose.
  • Provide example solutions built on top of gossipsub of various well-known protocols that deal with this problem (sequence numbers, vector clocks and so on - there's plenty of literature on how this problem is solved over, say, TCP or UDP, and similar constraints apply to gossipsub in this regard); a minimal sequence-number sketch follows below
  • Provide examples of how to do the above within the E2E encryption mechanism - either in 1:1 or group chats
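As an illustration of the first of these, a minimal sequence-number tracker of the kind such application-layer protocols use (a hypothetical sketch, not Nimbus or Status code): each sender numbers its messages, and a gap in the sequence tells the receiver exactly which messages were lost and must be recovered outside gossipsub (e.g. via Store or a direct request).

```go
package main

import "fmt"

// seqTracker detects gaps in per-sender sequence numbers.
type seqTracker struct {
	next map[string]uint64 // sender ID -> next expected sequence number
}

func newSeqTracker() *seqTracker {
	return &seqTracker{next: make(map[string]uint64)}
}

// Observe records a received (sender, seq) pair and returns the sequence
// numbers that were skipped, i.e. the messages to recover.
func (t *seqTracker) Observe(sender string, seq uint64) []uint64 {
	expected := t.next[sender]
	var missing []uint64
	for s := expected; s < seq; s++ {
		missing = append(missing, s)
	}
	if seq >= expected {
		t.next[sender] = seq + 1
	}
	return missing
}

func main() {
	t := newSeqTracker()
	t.Observe("alice", 0)
	t.Observe("alice", 1)
	fmt.Println(t.Observe("alice", 4)) // [2 3] -> re-request these from the sender or a store node
}
```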

@chaitanyaprem
Collaborator

chaitanyaprem commented Dec 1, 2023

Weekly Update

  • achieved: provided a short-term approach to address message send/receive issues during disconnections; initiated discussions in the Nimbus and Vac channels to find out the approaches used by other protocols built on gossipsub

@chaitanyaprem
Collaborator

chaitanyaprem commented Dec 1, 2023

Based on the suggestion above by @arnetheduck and further discussions in the Discord channel, the following is a summary:

  • Waku Relay (and in turn gossipsub), like TCP or any other transport protocol, cannot guarantee 100% reliability, and applications or users should not assume that it does.
  • The approaches documented above (timeouts + using Store, or using lightpush nodes) only mitigate certain scenarios and cannot solve various other failures that can occur (at peer nodes, routers, etc.), especially in a trustless p2p network. So these can only be viewed as short-term solutions that reduce the chance of issues occurring.
  • The best practice is that the application/user of Waku Relay should have some sort of data integrity check (e2e synchronization) using a protocol of its own. This will ensure that all other cases and scenarios are addressed.

@kaichaosun
Contributor

kaichaosun commented Dec 19, 2023

Drafted a doc about the potential solutions: https://www.notion.so/Messages-Over-Waku-Relay-3ded1783ecc743a4b8d0f3fd3ccb306d

@kaichaosun
Contributor

Built a demo application using Store.Find() to retrieve a message and republish it if it's not found in the store.

@fryorcraken
Collaborator Author

Descoped; this is now part of waku-org/pm#184, to be implemented directly in go-waku/status-go.

@fryorcraken fryorcraken closed this as not planned Jun 5, 2024