
feat(lightpush): peer management for protocols #2003

Merged -- 20 commits merged into master from feat/peer-reconnection on Jun 19, 2024

Conversation

@danisharora099 (Collaborator) commented May 8, 2024

Problem

For Filter and LightPush, getPeers() is called every time a new LightPush request or a new subscription is created, which gives us a new set of peers to use for each request/subscription.

Further, we wish to discard a peer if it fails and renew it with another peer. This is not trivial to do if we rely on getting a new set of peers each time.

Solution

This PR tackles it for LightPush:

  • A list of peers of size numPeersToUse is constantly maintained for LightPush.
  • If a node fails for whatever reason, that peer is disconnected and a new peer is introduced to the list.
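The approach above can be sketched as a small fixed-size peer pool. This is an illustrative sketch only: the class and method names (`PeerPool`, `maintainPeers`, `renewPeer`) and the `getConnectedPeers` callback are hypothetical stand-ins, not the actual js-waku implementation.

```typescript
type PeerId = string;

// Hypothetical sketch of a maintained peer pool; in js-waku the role of
// getConnectedPeers would be played by the ConnectionManager (assumption).
class PeerPool {
  private peers: PeerId[] = [];

  constructor(
    private readonly numPeersToUse: number,
    private readonly getConnectedPeers: () => PeerId[]
  ) {}

  // Top up the pool until it holds numPeersToUse peers.
  maintainPeers(): void {
    const candidates = this.getConnectedPeers().filter(
      (p) => !this.peers.includes(p)
    );
    while (this.peers.length < this.numPeersToUse && candidates.length > 0) {
      this.peers.push(candidates.shift()!);
    }
  }

  // Drop a failed peer and immediately try to replace it. The failed peer
  // is assumed to have been disconnected, so getConnectedPeers no longer
  // returns it.
  renewPeer(failed: PeerId): void {
    this.peers = this.peers.filter((p) => p !== failed);
    this.maintainPeers();
  }

  get current(): readonly PeerId[] {
    return this.peers;
  }
}
```

The key property is that requests read from a stable pool rather than calling getPeers() anew each time, so a failed peer can be replaced without disturbing the rest of the pool.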

Notes

Contribution checklist:

  • covered by unit tests;
  • covered by e2e test;
  • add ! in title if breaks public API;

github-actions bot commented May 8, 2024

size-limit report 📦

| Path | Size | Loading time (3g) | Running time (snapdragon) | Total time |
| --- | --- | --- | --- | --- |
| Waku node | 183.21 KB (+0.87% 🔺) | 3.7 s (+0.87% 🔺) | 13.1 s (-21.67% 🔽) | 16.8 s |
| Waku Simple Light Node | 183.19 KB (+0.9% 🔺) | 3.7 s (+0.9% 🔺) | 21 s (+17.38% 🔺) | 24.7 s |
| ECIES encryption | 23.12 KB (0%) | 463 ms (0%) | 3.4 s (-40.64% 🔽) | 3.8 s |
| Symmetric encryption | 22.58 KB (0%) | 452 ms (0%) | 4 s (-31.2% 🔽) | 4.5 s |
| DNS discovery | 72.49 KB (0%) | 1.5 s (0%) | 7.7 s (-24.07% 🔽) | 9.1 s |
| Peer Exchange discovery | 74.15 KB (0%) | 1.5 s (0%) | 10.7 s (+37.17% 🔺) | 12.1 s |
| Local Peer Cache Discovery | 67.68 KB (0%) | 1.4 s (0%) | 11.5 s (-5.72% 🔽) | 12.9 s |
| Privacy preserving protocols | 38.87 KB (0%) | 778 ms (0%) | 8.5 s (-0.85% 🔽) | 9.2 s |
| Waku Filter | 112.58 KB (+0.6% 🔺) | 2.3 s (+0.6% 🔺) | 15.3 s (+17.96% 🔺) | 17.6 s |
| Waku LightPush | 111.1 KB (+0.65% 🔺) | 2.3 s (+0.65% 🔺) | 14.9 s (-15.58% 🔽) | 17.1 s |
| History retrieval protocols | 111.67 KB (+0.69% 🔺) | 2.3 s (+0.69% 🔺) | 24.2 s (+113.55% 🔺) | 26.4 s |
| Deterministic Message Hashing | 7.29 KB (0%) | 146 ms (0%) | 1.1 s (-37.26% 🔽) | 1.2 s |

@danisharora099 danisharora099 force-pushed the feat/peer-reconnection branch 2 times, most recently from 3b96c37 to b947aa9 on May 14, 2024 11:50
@danisharora099 danisharora099 marked this pull request as ready for review May 14, 2024 12:21
@danisharora099 danisharora099 requested a review from a team as a code owner May 14, 2024 12:21
packages/sdk/src/waku.ts (outdated, resolved)
```ts
@@ -315,7 +321,8 @@ export class StoreSDK extends BaseProtocolSDK implements IStoreSDK {
}

export function wakuStore(
  connectionManager: ConnectionManager,
```
Collaborator:
nit: let's include it in init and maybe make it of a type like ProtocolCreateOptions & ProtocolServices

Collaborator Author:

I like the idea. This is currently non-trivial because there are no types/interfaces for ConnectionManager in @waku/interfaces -- this becomes a nice follow-up for #1969.

```diff
@@ -43,15 +50,19 @@ class LightPushSDK extends BaseProtocolSDK implements ILightPushSDK {
   };
 }

-const peers = await this.protocol.getPeers();
-if (!peers.length) {
+const peersFound = await this.hasPeers();
```
Collaborator:

It seems to me that in this case we should get peers as fast as possible and only maintain the validity and quality of the pool in the background.

If we make it part of send or subscribe (or for Store), it increases the delay for the application built on top.

Instead, I think we should be lazy in this approach and make the consumer wait only if the service doesn't work (an error being thrown).

An edge case is when there is only one peer in the pool and that peer is bad. I believe this is a rare case that we should optimize for only after seeing it happen often for end users.
So we should measure it now: add a log line for later use in telemetry etc.; also a point for the dogfooding app we are preparing.

Collaborator Author:

I agree with your approach in theory, and that is how I implemented it.

However, this hasPeers() check is useful because we have a recurring interval that maintains peers, and we need to ensure that maintainPeers() was called at least once.

Looking at the function:

```ts
protected hasPeers = async (): Promise<boolean> => {
  let success = await this.maintainPeers();
  let attempts = 0;
  while (!success || this.peers.length === 0) {
    attempts++;
    success = await this.maintainPeers();
    if (attempts > 3) {
      if (this.peers.length === 0) {
        this.log.error("Failed to find peers to send message to");
        return false;
      } else {
        this.log.warn(
          `Found only ${this.peers.length} peers, expected ${this.numPeersToUse}`
        );
        return true;
      }
    }
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
  return true;
};
```

It will return true as soon as peers.length > 0.

Collaborator:

> and thus we need to ensure that maintainPeers() was called at least once

It seems it is initiated from the constructor, so there is no need to ensure that anymore, assuming the constructor completed successfully.

Let's not have blocking routines for ad-hoc operations; if we cannot send a message now, we should throw (maybe later we can implement re-sending of messages, but that is out of scope).

Collaborator Author:

I understand your POV and agree with it.

However, this is non-trivial: if we don't provide this reliability abstraction as part of the SDK's send() functionality, library consumers will have to implement it as a wrapper themselves.

For example: with the default interval of 30s, users would have to wait up to 30 seconds after node initialisation for newly connected peers to be registered.

Collaborator Author:

I have an idea:

Instead of relying solely on an interval, let's trigger maintainPeers() on a new peer:connect event.
This will significantly improve the efficiency of the routine.
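A minimal sketch of this idea: trigger the maintenance routine whenever a peer connects, instead of only on the interval. The `peer:connect` event name matches libp2p's, but the emitter below is a self-contained stand-in, not the real libp2p instance.

```typescript
type Listener = () => void;

// Tiny stand-in for the libp2p event bus, for illustration only.
class TinyEmitter {
  private listeners = new Map<string, Listener[]>();

  addEventListener(event: string, fn: Listener): void {
    const fns = this.listeners.get(event) ?? [];
    fns.push(fn);
    this.listeners.set(event, fns);
  }

  dispatch(event: string): void {
    for (const fn of this.listeners.get(event) ?? []) fn();
  }
}

let maintainCalls = 0;
const maintainPeers = () => {
  maintainCalls++; // in js-waku this would top up the peer pool
};

const libp2pEvents = new TinyEmitter();
// React to a new connection immediately, rather than waiting up to a
// full interval period for the next maintenance run.
libp2pEvents.addEventListener("peer:connect", maintainPeers);
```

The event does not replace the interval; it only shortens the delay between a peer connecting and the pool noticing it.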

@danisharora099 (Collaborator Author) commented Jun 3, 2024

Hm, that definitely improves how quickly maintenance updates.

I tested it out.

However, because it is non-blocking, send() will fail far more often when users try to send a lightpush request right after starting their node.

This is most reproducible in our tests, where we call lightpush.send() right after starting a node and connecting to peers (as will also happen in most user applications).

IMO it is a good default/reliability abstraction to automatically wait for peers IF none are available; if peers are available, well and good, and we don't wait.

Since this is specifically an SDK-level feature, I think it is a nice-to-have.
We can expose it as a user-facing autoRetry argument defaulting to true. @weboko
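The proposed behaviour could look roughly like this. The function shape, the `autoRetry` flag, and the helper parameters are hypothetical, not the shipped js-waku API:

```typescript
// Sketch: block waiting for peers only when the pool is empty; if peers
// are available, send immediately. With autoRetry disabled, an empty
// pool throws instead of waiting. All names here are illustrative.
async function sendWithAutoRetry(
  getPeerCount: () => number,
  waitForPeers: () => Promise<void>,
  doSend: () => Promise<void>,
  autoRetry = true
): Promise<void> {
  if (getPeerCount() === 0) {
    if (!autoRetry) {
      throw new Error("No peers available to send to");
    }
    // Only block when the pool is empty, e.g. right after node startup.
    await waitForPeers();
  }
  await doSend();
}
```

This matches the trade-off described above: the common case (peers already in the pool) pays no extra latency, and the startup case waits instead of failing, unless the consumer opts out.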

Collaborator Author:

Implemented some improvements to your comment here: 8e89b24

Wdyt?

Collaborator:

I like the idea of autoRetry, and we can enrich it within #2032 (feel free to take it when free, and let's discuss the exact design; we can change it from what I described there).

I also think it is a good approach to use peer:identify or some other event (maybe connected works for that purpose too).

But I would still not make the consumer wait for some other operation during .send unless it is explicitly configured.

```ts
private async findAdditionalPeers(numPeers: number): Promise<Peer[]> {
  this.log.info(`Finding ${numPeers} additional peers`);
  try {
    let newPeers = await this.core.getPeers({
```
@weboko (Collaborator) commented May 16, 2024

In general, getPeers from BaseProtocolCore will only return peers we are already connected to.
Is that sufficient? It seems that in some cases we need to initiate new connections, if the existing ones are not enough or not good.

Or do we rely on the connection manager to automatically add one after a connection is dropped?

Collaborator Author:

ConnectionManager keeps trying to connect to as many peers as we discover.

The upper limit on max connections is 100: https://github.com/libp2p/js-libp2p/blob/169c9d85e7c9cd65be964b5d08bd618d950f70ee/doc/LIMITS.md?plain=1#L39

This is more than enough. We can assume that when a connection is dropped, we can find new peers (if they were discovered), or we will connect to them as soon as we discover them.

There does not seem to be an apparent action js-waku can take to connect to new peers, other than what's already happening through discoveries.

Collaborator:

Actually, it seems to be 300.

I think we can document it somewhere, at least in a code comment, mentioning that the assumption is that the connection manager populates connected peers.

```ts
 * This is useful to make the process of finding and adding new peers more dynamic and efficient,
 * instead of relying on the interval only.
 */
this.core.addLibp2pEventListener("peer:identify", (evt) => {
```
Collaborator:

I am honestly a bit lost here, help please :)

  • when "peer:identify" fires, maintainPeers will be called;
  • but we already do this in startMaintainPeersInterval, and again later on the interval;

Why do we still need the event?

@danisharora099 (Collaborator Author) commented Jun 5, 2024

Though we discussed this in detail in the sync session earlier today, I'll recap here to document it:

Intervals run as a maintenance cycle for peers, fetching the state of active connections from the ConnectionManager. The ConnectionManager is responsible for actively connecting to nodes as they are discovered; the PeerManager is responsible for finding the relevant peers based on the multicodec and the configured pubsub topic for the protocol. Since the PeerManager, which runs these intervals, does not (and shouldn't) have this information, it effectively polls the ConnectionManager for it.

peer:identify is used to shorten the gap between the connection of a new node and the next interval run. It decreases the time it takes for the protocol to notice that a new peer is available for it.
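The polling half of this design can be sketched as follows. The function name mirrors startMaintainPeersInterval as mentioned in the thread, but the body is illustrative, not the actual js-waku code:

```typescript
// Sketch: run maintenance once at startup, then keep polling on the
// interval as a safety net. The peer:identify event (wired up elsewhere)
// only shortens the delay; the interval guarantees eventual progress.
function startMaintainPeersInterval(
  maintainPeers: () => void,
  intervalMs: number
): () => void {
  // Populate the pool immediately so callers don't have to wait a full
  // interval period after node initialisation.
  maintainPeers();
  const id = setInterval(maintainPeers, intervalMs);
  // Return a cleanup function to stop the cycle, e.g. on node shutdown.
  return () => clearInterval(id);
}
```

Running the routine once immediately and again on every tick is what makes the hasPeers() check above resolve quickly after startup.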

@weboko (Collaborator) left a comment

LGTM! We derived follow-ups; before merging, please make sure it works e2e within some of our examples.

@danisharora099 danisharora099 merged commit 93e78c3 into master Jun 19, 2024
11 checks passed
@danisharora099 danisharora099 deleted the feat/peer-reconnection branch June 19, 2024 05:52
@weboko weboko mentioned this pull request Jun 19, 2024