Skip to content
This repository has been archived by the owner before Nov 9, 2022. It is now read-only.

thaliNotificationClient doesn't emit when peers go away so thaliReplication doesn't know when to stop looking for them #734

yaronyg opened this issue Jun 9, 2016 · 3 comments


Copy link

yaronyg commented Jun 9, 2016

Right now thaliNotificationClient will only emit when a new peer is found or when an existing peer changes its network address. But it currently doesn't pass on any notifications of when a peer has disappeared. This is partially because we still don't really trust the logic we use to decide when a peer has disappeared and partially because we are trying to do less work for the incubation and this is a perf/battery issue more than a correctness issue.

So right now thaliPullReplicationFromNotification and really the thaliPeerPoolInterface have to decide how often to retry when connecting to a peer after a failure. They don't get any real hints that thaliMobile (via thaliNotificationClient) thinks the peer is gone. Fixing this is pretty easy so we should probably do it. If we trust the notification this will let us know that we can stop trying to connect to the peer. And even if we don't trust the notification it is a useful piece of data for a heurstic that the thaliPeerPoolInterface implementation can use to decide when to stop re-trying a connection.

Copy link
Member Author

yaronyg commented Oct 7, 2016

I really don't like the idea of making the thali replication logic depend on the notification logic to decide when a peer is gone. It makes sense for thali replication logic to depend on notification logic to know when a peer is present because we first have to identify if the peer is anyone we want to replicate with and that is what the notification logic exists to do. But deciding if the peer is gone isn't something that the notification logic has any magic about.

So I think we should connect thaliPullReplicationFromNotification to thaliMobile's peerAvailabilityChanged event and when we get an event that a peer is gone then we can kill any replication we have associated with that peer.

In thaliPullReplicationFromNotification its peerDictionary will only keep track of an event for as long as it is in created state. Once it is started or killed it is no longer tracked. So if we get a peerAvailabilityChanged peer not available event then we can only remove the associated replication event IFF it hasn't been started or killed.

This means that the thaliPeerPool manager has to also track thaliPeerAvailabilityChanged event and for any replication actions it hasn't called kill on it has to kill them since the replication action will no longer be monitored by thaliPullReplicationFromNotification once start is called on it.

The same isn't true however for thaliNotificationAction. thaliNotificationClient will track all thaliNotificationActions until they get a peer not available notification. So thaliPeerPool doesn't need to worry about thaliNotificationActions going away because thaliNotificationClient will kill an action (even if it has started) if the peer disappears.

Copy link
Member Author

yaronyg commented Oct 8, 2016

So a complication is that for the thali replication code or for the peer pool manager to do their thing they have to be able to associate a peer ID and connection type with a peer action. Now that should be easy since all peer actions, regardless of type, are required to provide that information. But it turns out we have a bug in thaliNotificationClient in that it doesn't send that info. What we have to do is update thaliNotificationClient.PeerAdvertisesDataForUs to include the peerID as a field.

Copy link
Member Author

yaronyg commented Oct 18, 2016

So I'm not even sure this is actually a problem.

Let's say we get a peerAvailabilityChanged event for some peer with peerAvailable = true. This will trigger a thaliNotificationAction which will try to connect to that peer. If the action fails then it will automatically retry. If it succeeds then it will cause a replicationAction to be created. Now the replication action has its own retry logic but if that logic eventually fails out then by default we do nothing. It's up to the peer policy manager to decide if it wants to retry again or if it wants to wait for another peerAvailabilityChanged event. But if we could get #733 then it would give us a strong idea if a retry on replication is even worth it. Since we would know that if there is a peerAvailable = false for the peer we will hear about it from #733 but to be fair we could just handle this by subscribing to peerAvailabilityChanged events directly and matching on the peerID. Which I rather like. And it's simpler.

@yaronyg yaronyg closed this as completed Nov 17, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
None yet

No branches or pull requests

4 participants