
Fixes race condition in #84 #89

Merged
2 commits merged into master on Sep 17, 2022

Conversation

@sr-gi (Member) commented Aug 11, 2022

The current implementation of the cln-plugin has a race condition: pending appointments may be missed if they are added while the Retrier is trying to send data to the tower.

The issue was discovered in #84, and goes as follows:

The Retrier used to load all the data from the database (in bulk) when starting a new retry and kept a copy of it to perform the retry. This is necessary to some extent, since we cannot hold a reference to the data across futures. Therefore, if any data was added after the pending appointments were loaded, it would be missed.

The solution goes through different stages:

  • First, the data is checked in memory and loaded one by one instead of in bulk.
  • Second, the channel shared between the WTClient and the Retrier now carries a (tower_id, locator) pair instead of just a tower_id, so the Retrier is aware of all the data sent to it. This means the Retrier cannot miss data as long as it checks back with the WTClient one locator at a time.
    • The Retrier still works in batches for each tower; however, it now holds a collection of pending_appointments so it can keep track of what is missing.
  • Finally, instead of simply iterating over the loaded data, we check that the pending_appointments for the given tower are empty before considering that all the data has been sent to the tower. If that is not the case, the newly added data is pulled and the retry continues until the condition holds (a sketch of this loop follows the list).
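Below is a minimal sketch of the loop described in the last bullet (not the plugin's actual code: Retrier, pending_appointments, send_appointment and the [u8; 16] locator type are placeholders/assumptions). The Retrier pulls locators one at a time and only considers the tower caught up once its pending set is observed empty, so data added mid-retry is picked up on a later pass.

use std::collections::HashSet;
use std::sync::Mutex;

type Locator = [u8; 16]; // stand-in for the real locator type

struct Retrier {
    // Locators fed to the Retrier; new ones can be inserted while a retry
    // for this tower is already in flight.
    pending_appointments: Mutex<HashSet<Locator>>,
}

impl Retrier {
    // Take a single pending locator, if any, instead of snapshotting the
    // whole list up front.
    fn pop_pending(&self) -> Option<Locator> {
        let mut pending = self.pending_appointments.lock().unwrap();
        let next = pending.iter().next().copied();
        if let Some(locator) = &next {
            pending.remove(locator);
        }
        next
    }

    async fn retry_tower(&self) -> Result<(), String> {
        // Re-check the set on every iteration, so anything added while the
        // previous appointment was being sent is still visited.
        while let Some(locator) = self.pop_pending() {
            self.send_appointment(locator).await?;
        }
        // Only once the set is observed empty is the tower considered caught up.
        Ok(())
    }

    async fn send_appointment(&self, _locator: Locator) -> Result<(), String> {
        Ok(()) // placeholder for the actual request to the tower
    }
}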

@sr-gi (Member, Author) commented Aug 11, 2022

Also note that this builds on top of #83, so mainly the last commit needs to be reviewed.

@sr-gi sr-gi added this to the v.0.1.2 milestone Aug 15, 2022
@mariocynicys (Collaborator) left a comment

I'm kinda lost in this.
Doesn't the retry_notify(ExponentialBackoff { ... }).await block the manage retry thread? So the tower being retried at the moment will miss all the appointments sent while it's being retried??
Besides that, when the tower is flagged as TemporaryUnreachable, we don't send pending appointments to it.

@@ -379,8 +387,10 @@ async fn on_commitment_revocation(
let mut state = plugin.state().lock().unwrap();
state.set_tower_status(tower_id, TowerStatus::TemporaryUnreachable);
state.add_pending_appointment(tower_id, &appointment);

state.unreachable_towers.send(tower_id).unwrap();
state
@mariocynicys (Collaborator)

But won't the TemporaryUnreachable towers miss any new appointments we send later?
So the race condition is still there?

@sr-gi (Member, Author)

You actually reviewed an old version of this where unreachable_towers was still holding only tower_id instead of (tower_id, locator). My bad for not flagging this as a draft.

self.wt_client
.lock()
.unwrap()
.set_tower_status(tower_id, crate::TowerStatus::TemporaryUnreachable);


@sr-gi sr-gi force-pushed the 84-unreachable-towers branch 3 times, most recently from e45ab73 to 5be19f2, on August 17, 2022 11:00
@sr-gi (Member, Author) commented Aug 17, 2022

I'm kinda lost in this. Doesn't the retry_notify(ExponentialBackoff { ... }).await block the manage retry thread? So the tower being retried at the moment will miss all the appointments sent while it's being retried?? Besides that, when the tower is flagged as TemporaryUnreachable, we don't send pending appointments to it.

Kind of, but I don't think it actually worked like that. retry_notify called add_appointment (formerly retry_tower), which in turn loaded the pending appointment list for the given tower from the database. This means that, if an appointment was appended to the database while the retrier was looping, that appointment would have been missed.

As I commented here, I think the issue comes from this still being a draft when you looked at it. Feel free to give it a look now; it may make more sense.
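For illustration, here is a rough sketch of the pre-fix flow described above. Every name in it is hypothetical (load_pending_appointments, send_appointment, the u64 tower id and [u8; 16] locator are stand-ins, not the plugin's real API): the pending list is snapshotted from the database once per attempt, so anything inserted afterwards is never visited during that attempt.

async fn retry_tower_old(tower_id: u64) -> Result<(), String> {
    // Bulk snapshot taken once at the start of the attempt.
    let pending = load_pending_appointments(tower_id);
    for locator in pending {
        // An appointment appended to the database at this point is not in
        // `pending` and is therefore missed for this whole attempt.
        send_appointment(tower_id, locator).await?;
    }
    Ok(())
}

fn load_pending_appointments(_tower_id: u64) -> Vec<[u8; 16]> {
    Vec::new() // placeholder for the database query
}

async fn send_appointment(_tower_id: u64, _locator: [u8; 16]) -> Result<(), String> {
    Ok(()) // placeholder for the network call to the tower
}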

@sr-gi sr-gi added the Seeking Code Review (review me pls), bug (Something isn't working) and cln-plugin (Stuff related to watchtower-plugin) labels Aug 23, 2022
@sr-gi sr-gi closed this Aug 29, 2022
@sr-gi sr-gi reopened this Aug 29, 2022
@sr-gi sr-gi added the hard to review (sharpen your review knife) label Aug 30, 2022
@mariocynicys (Collaborator) left a comment

I think I need to do another round of review since I might have missed some corner cases.

Also, one thing I see this approach struggle with is when a tower is unreachable for a long time and a user manually retries it. If the tower is still unreachable, the retrier will keep trying to send every single pending appointment to that tower and time out every time, causing the retrier to do no useful work for a long time.

I support having the channel send (tower, appointment) instead of just (tower), but I think the retrying should be based on (tower)s and not (tower, appointment) permutations.

@@ -111,7 +111,6 @@ def test_unreachable_watchtower(node_factory, bitcoind, teosd):
time.sleep(1)

assert l2.rpc.gettowerinfo(tower_id)["status"] == "reachable"
assert not l2.rpc.gettowerinfo(tower_id)["pending_appointments"]


def test_retry_watchtower(node_factory, bitcoind, teosd):
@mariocynicys (Collaborator)

Just to clarify, this test doesn't auto-retry because we have watchtower-max-retry-time = 0, right?
It tests manual retry.

@sr-gi (Member, Author)

You mean for test_retry_watchtower? That's the goal, yeah: make it so the retrier gives up straight away so we can manually retry.

Comment on lines 140 to -141
assert l2.rpc.gettowerinfo(tower_id)["status"] == "reachable"
assert not l2.rpc.gettowerinfo(tower_id)["pending_appointments"]
@mariocynicys (Collaborator)

I think assert l2.rpc.gettowerinfo(tower_id)["status"] == "reachable" is redundant, since we are waiting on it just two lines before.
I would support keeping assert not l2.rpc.gettowerinfo(tower_id)["pending_appointments"] instead.

@sr-gi (Member, Author)

I'll replace it with:

while l2.rpc.gettowerinfo(tower_id)["pending_appointments"]:
    time.sleep(1)

assert l2.rpc.gettowerinfo(tower_id)["status"] == "reachable"

This makes sure both that all pending appointments have been sent and that the tower is reachable afterwards.

Comment on lines 48 to 50
if !wt_client.towers.contains_key(&tower_id) {
continue;
}
@mariocynicys (Collaborator)

I think you wrote this before rebasing on master. It duplicates the logic of the if condition beneath it:

if wt_client.towers.get(&tower_id).is_none() {
    log::info!("Skipping retrying abandoned tower {}", tower_id);
    continue;
}

I assume using contains_key would have better performance though.

@sr-gi (Member, Author)

Yeah, I think some things got messed up after rebasing :S

wt_client.remove_pending_appointment(tower_id, appointment.locator);
AddAppointmentError::ApiError(e) => match e.error_code {
errors::INVALID_SIGNATURE_OR_SUBSCRIPTION_ERROR => {
log::warn!("There is a subscription issue with {}", tower_id);
@mariocynicys (Collaborator)

There is a case that would misclassify a tower's state; it goes as follows:
1- tower A was temp unreachable and was sent to be retried
2- tower A is waiting its turn to be retried and more appointments accumulate
3- tower A starts being retried and, halfway through sending all the pending appointments, it starts giving back subscription errors
Such a tower would have a temp unreachable status, and then unreachable, but would never be classified as having a subscription error.

AddAppointmentError::ApiError(e) => match e.error_code {
errors::INVALID_SIGNATURE_OR_SUBSCRIPTION_ERROR => {
log::warn!("There is a subscription issue with {}", tower_id);
return Err(Error::transient("Subscription error"));
@mariocynicys (Collaborator)

Why not make a subscription error permanent? The next trials will fail anyway.

@sr-gi (Member, Author)

I really cannot find any good reason why not
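For reference, a hedged sketch of what that change could look like, assuming the backoff crate's Error::permanent constructor (the counterpart of the Error::transient call shown above) and the log crate already used in the diff; the function name and u64 tower id are illustrative, not the plugin's real signature. A permanent error makes retry_notify give up immediately instead of backing off and retrying a call that is bound to fail again.

use backoff::Error;

fn handle_subscription_error(tower_id: u64) -> Result<(), Error<&'static str>> {
    log::warn!("There is a subscription issue with {}", tower_id);
    // Before: Err(Error::transient("Subscription error"))
    Err(Error::permanent("Subscription error"))
}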

watchtower-plugin/src/retrier.rs (outdated review comment, resolved)
.await;

let mut state = self.wt_client.lock().unwrap();
self.pending_appointments.lock().unwrap().remove(&tower_id);
@mariocynicys (Collaborator)

I think we have a race condition here. Its scenario would be:
1- Tower A was temp unreachable and was sent to be retried more than once (say for 3 appointments)
2- It comes to tower A's turn to be retried for the first appointment; it fails and is marked as unreachable (the problem here is that the appointment we just retried was never sent, yet it is also removed from the pending appointments)
3- Tower A gets retried again (for the second appointment) and it succeeds (maybe it/we had a temporary internet issue), marking it as reachable

In this scenario, the first appointment will never get sent to the tower (unless the user manually retries it).

@sr-gi (Member, Author)

Umm, I don't follow. Here the whole tower record is removed from pending_appointments. This only happens if the whole retry strategy fails, so the data in the retrier is wiped.

Comment on lines 57 to 74
self.add_pending_appointment(tower_id, locator);

log::info!("Retrying tower {}", tower_id);
match retry_notify(
let r = retry_notify(
ExponentialBackoff {
max_elapsed_time: Some(Duration::from_secs(self.max_elapsed_time_secs as u64)),
max_interval: Duration::from_secs(self.max_interval_time_secs as u64),
..ExponentialBackoff::default()
},
|| async { self.add_appointment(tower_id).await },
|| async { self.retry_tower(tower_id).await },
|err, _| {
log::warn!("Retry error happened with {}. {}", tower_id, err);
},
)
.await
{
.await;

let mut state = self.wt_client.lock().unwrap();
self.pending_appointments.lock().unwrap().remove(&tower_id);
@mariocynicys (Collaborator)

The missed appointment is added in line +57 and all the pending appointments for that tower are cleared in line +74.
This means there is only one appointment in the pending appointments at any given time?
Or maybe I am missing something :/

@sr-gi (Member, Author)

No, appointments get appended to a collection if the tower is already being retried. I just realized that something may have been messed up here, since we were creating a different async task for every (tower_id, locator) pair while there should only have been a single task per tower, and we should have been appending data to it.

@sr-gi (Member, Author) commented Sep 4, 2022

@meryacine let's see if this makes more sense now.

Looks like some things got messed up after rebasing several PRs over this one. I fixed it so there is only one task per tower; I don't know how it ended up being one task per (tower, locator) pair.

Now, if add_pending_appointment creates a new entry in the pending_appointments collection, a new task is created (this means the tower had no pending appointments, so we need to spin up a task to deal with them), whereas if add_pending_appointment appends the locator to an existing entry, no new task is created (the existing task will pick up that data at some point). See the sketch below.

Also, I've made subscription errors permanent. I feel like I had a good reason for them to be temporary, but I cannot find any at the moment, so that may not really be the case.
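A hedged sketch of the dispatch rule described above (the types and names below are placeholders, not the plugin's actual API): a new retry task is only spawned when the tower had no pending entry; otherwise the locator is appended and the already-running task picks it up on its next pass over the set.

use std::collections::{HashMap, HashSet};
use std::sync::{Arc, Mutex};

type TowerId = u64; // placeholder for the real tower id type
type Locator = [u8; 16]; // placeholder for the real locator type

#[derive(Default, Clone)]
struct RetryManager {
    pending_appointments: Arc<Mutex<HashMap<TowerId, HashSet<Locator>>>>,
}

impl RetryManager {
    fn add_pending_appointment(&self, tower_id: TowerId, locator: Locator) {
        // The lock covers both the check and the insertion, so a finishing
        // task cannot slip in between them.
        let mut pending = self.pending_appointments.lock().unwrap();
        match pending.get_mut(&tower_id) {
            Some(locators) => {
                // Tower already being retried: just extend its workload.
                locators.insert(locator);
            }
            None => {
                // First pending locator for this tower: record it and spin up
                // a task to drain the set (spawning elided in this sketch).
                pending.insert(tower_id, HashSet::from([locator]));
                // e.g. tokio::spawn(async move { /* retry loop for tower_id */ });
            }
        }
    }
}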

@mariocynicys (Collaborator)

Concept ACK ba2c444

.await;

let mut state = wt_client.lock().unwrap();
retriers.lock().unwrap().remove(&tower_id);
@mariocynicys (Collaborator)

This can introduce a race condition: if retry_notify has just finished and the main retry manager loop also received an appointment A for that tower that needs to be retried, the appointment might be added and then the retrier for that tower removed right after, deleting the newly added appointment in the process.

@sr-gi (Member, Author)

Indeed. I've been trying to find a way to fix it, but I'm unsure which sync primitive would work here; I haven't managed to do so so far :S

@sr-gi (Member, Author)

Not sure if this is the most elegant way, but I think cfdd3fb should have fixed the race condition

.pending_appointments
.lock()
.unwrap()
.insert(locator);
@mariocynicys (Collaborator)

Relating to the race condition above, this insert might get invoked after the tower has finished being retried, and thus never be retried at all.

@sr-gi (Member, Author)

I think this cannot happen though. Both this method and manage_retry lock retriers, so if the tower has finished retrying, the if branch would be hit instead of this one.

In any case, the aforementioned race condition needs fixing.

Comment on lines 502 to 505
state
.unreachable_towers
.send((tower_id, appointment.locator))
.unwrap();
@mariocynicys (Collaborator)

I missed that one before, but we should not send an unreachable tower (as opposed to a temporarily unreachable one) to the retrier, right?

@sr-gi (Member, Author) commented Sep 14, 2022

Yeah, I guess we shouldn't. That also covers subscription errors, btw.

So we should add them to pending (in the db) but not send them to the retrier if they are not temporarily unreachable.

@sr-gi (Member, Author)

Fixed it in 56ec2eb, alongside other minor things regarding the status. The only one worth mentioning is splitting is_unreachable in two: is_unreachable and is_temporary_unreachable, given that we now have a good use case for it.

@mariocynicys (Collaborator) left a comment

I think we fixed all the corner cases in here.

I won't be surprised to see more though, as this has come to be fairly complicated.

@sr-gi (Member, Author) commented Sep 15, 2022

I think we fixed all the corner cases in here.

I won't be surprised to see more though, as this has come to be fairly complicated.

Agreed. I think we should revisit the retrier at some point and try to simplify it.

Are you happy with the commits being squashed and this being merged?

@mariocynicys (Collaborator) commented Sep 15, 2022

Are you happy with the commits being squashed and this being merged?

Yeah, I think they are good to go now. But I have played with this PR a little bit in an attempt to simplify it. You might want to take a look at my branch first.

sr-gi and others added 2 commits September 16, 2022 23:44
This is an attempt to rework the retrier logic to simplify how it works
and make it less error prone. This is done by making the retry manager
object responsible for both:
1- adding new retriers and extending current ones
2- removing retriers when they finish their work

This way, we don't need a mutex to guard the retriers hashmap, and we are
sure that adding/extending retriers and removing them never happen at the
same time, because only the retry manager does it and not the individual
retriers (i.e. retriers can't remove themselves from the retriers hashmap).

The retry manager logic goes as follows:
1- drain the unreachable-towers channel until it's empty, and store the pending appointments (locators, to be exact) in the pending appointments set of each retrier.
2- remove any finished retriers (ones that succeeded and have no more pending appointments) and failed retriers (ones that failed to send their appointments).
3- start all the non-running retriers left after removing the failed and finished ones.

Retriers signal their status so that the retry manager can
determine which retriers to keep, which to remove, and which to re-start.

We also set the tower as unreachable when destroying the tower's retrier
and not after completing the backoff. This way the tower is not flagged as
unreachable until its retrier is destroyed, so a manual tower retry by the
user will fail with an error until the tower's retrier is destroyed.

If we were to set the unreachable status right after the backoff, manual user
retries might get discarded completely without an error: the retrier would set
the tower state to unreachable too early, allowing the user to perform manual
retries, but such a retry wouldn't be carried out, since the retry manager
will remove that retrier anyway as it failed to deliver its pending
appointments.
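As an illustration only, here is a rough sketch of the manager loop this message describes. All names and types below are placeholders (the real retrier keeps quite a bit more state), and the draining of the channel is represented by a plain vector argument.

use std::collections::{HashMap, HashSet};

type TowerId = u64; // placeholder for the real tower id type
type Locator = [u8; 16]; // placeholder for the real locator type

enum RetrierStatus {
    Running,  // a task is currently draining this retrier's pending set
    Finished, // succeeded and, when it reported, had nothing pending
    Failed,   // gave up; its tower is flagged unreachable when it is removed
    Idle,     // has pending work but no task running yet
}

struct Retrier {
    status: RetrierStatus,
    pending: HashSet<Locator>,
}

fn manage_retries(retriers: &mut HashMap<TowerId, Retrier>, drained: Vec<(TowerId, Locator)>) {
    // 1) Store everything drained from the unreachable-towers channel.
    for (tower_id, locator) in drained {
        retriers
            .entry(tower_id)
            .or_insert_with(|| Retrier { status: RetrierStatus::Idle, pending: HashSet::new() })
            .pending
            .insert(locator);
    }

    // 2) Remove failed retriers, plus finished ones that picked up no new work
    //    (marking their towers unreachable would happen here, at destruction time).
    retriers.retain(|_, r| match r.status {
        RetrierStatus::Failed => false,
        RetrierStatus::Finished => !r.pending.is_empty(),
        _ => true,
    });

    // 3) (Re)start whatever is left and not currently running.
    for retrier in retriers.values_mut() {
        if !matches!(retrier.status, RetrierStatus::Running) {
            retrier.status = RetrierStatus::Running; // actual task spawning elided
        }
    }
}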
@sr-gi sr-gi merged commit 3c19619 into talaia-labs:master Sep 17, 2022
sr-gi added a commit to sr-gi/rust-teos that referenced this pull request Sep 19, 2022
One thing I really disliked about talaia-labs#89 was having a method that cloned its caller
to be able to work around spawning a task inside it that called a method of the same class.

Turns out you can have self as Arc<Self>, which would completely prevent having to do such a thing,
plus it also reduces the number of things being cloned.
@sr-gi sr-gi removed the Seeking Code Review (review me pls) label Jan 10, 2023
Labels: bug (Something isn't working), cln-plugin (Stuff related to watchtower-plugin), hard to review (sharpen your review knife)
3 participants