
Fixes race condition in #84 #89

Merged
2 commits merged into master on Sep 17, 2022

Conversation

@sr-gi (Member) commented Aug 11, 2022

The current implementation of the cln-plugin has a race condition: pending appointments may be missed if they are added while the Retrier is trying to send data to the tower.

The issue was discovered in #84, and goes as follows:

The Retrier used to load all the data from the database (in bulk) when starting a new retry and kept a copy of it to perform the retry. This is necessary to some extent, since we cannot hold a reference to the data across futures. Therefore, if any data was added after the pending appointments were loaded, it would be missed.

The solution goes through different stages:

  • First, the data is checked in memory and loaded one by one instead of in bulk.
  • Second, the channel shared between the WTClient and the Retrier now carries a (tower_id, locator) pair instead of just a tower_id, so the Retrier is aware of all the data sent to it. This means the Retrier cannot miss data as long as it checks back with the WTClient one locator at a time.
    • The Retrier still works in batches for each tower; however, it now holds a collection of pending_appointments so it can keep track of what is missing.
  • Finally, instead of simply iterating over the loaded data, we check that the pending_appointments for the given tower are empty before considering that all the data has been sent to the tower. If that is not the case, the newly added data is pulled and the retry continues until the condition holds (a sketch of this loop follows the list).
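Below is a minimal sketch of the loop described in the last bullet (not the plugin's actual code: Retrier, pending_appointments, send_appointment and the [u8; 16] locator type are placeholders/assumptions). The Retrier pulls locators one at a time and only considers the tower caught up once its pending set is observed empty, so data added mid-retry is picked up on a later pass.

use std::collections::HashSet;
use std::sync::Mutex;

type Locator = [u8; 16]; // stand-in for the real locator type

struct Retrier {
    // Locators fed to the Retrier; new ones can be inserted while a retry
    // for this tower is already in flight.
    pending_appointments: Mutex<HashSet<Locator>>,
}

impl Retrier {
    // Take a single pending locator, if any, instead of snapshotting the
    // whole list up front.
    fn pop_pending(&self) -> Option<Locator> {
        let mut pending = self.pending_appointments.lock().unwrap();
        let next = pending.iter().next().copied();
        if let Some(locator) = &next {
            pending.remove(locator);
        }
        next
    }

    async fn retry_tower(&self) -> Result<(), String> {
        // Re-check the set on every iteration, so anything added while the
        // previous appointment was being sent is still visited.
        while let Some(locator) = self.pop_pending() {
            self.send_appointment(locator).await?;
        }
        // Only once the set is observed empty is the tower considered caught up.
        Ok(())
    }

    async fn send_appointment(&self, _locator: Locator) -> Result<(), String> {
        Ok(()) // placeholder for the actual request to the tower
    }
}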

@sr-gi (Member, Author) commented Aug 11, 2022

Also note that this builds on top of #83, so mainly the last commit needs to be reviewed.

@sr-gi sr-gi added this to the v.0.1.2 milestone Aug 15, 2022
@mariocynicys (Collaborator) left a comment

I'm kinda lost in this.
Doesn't the retry_notify(ExponentialBackoff { ... }).await block the manage retry thread? So the tower being retried at the moment will miss all the appointments sent while it's being retried??
Besides that, when the tower is flagged as TemporaryUnreachable, we don't send pending appointments to it.

@@ -379,8 +387,10 @@ async fn on_commitment_revocation(
let mut state = plugin.state().lock().unwrap();
state.set_tower_status(tower_id, TowerStatus::TemporaryUnreachable);
state.add_pending_appointment(tower_id, &appointment);

state.unreachable_towers.send(tower_id).unwrap();
state
@mariocynicys (Collaborator)

But won't the TemporaryUnreachable towers miss any new appointments we send later?
So the race condition is still there?

@sr-gi (Member, Author)

You actually reviewed an old version of this where unreachable_towers was still holding only tower_id instead of (tower_id, locator). My bad for not flagging this as a draft.

self.wt_client
.lock()
.unwrap()
.set_tower_status(tower_id, crate::TowerStatus::TemporaryUnreachable);


@sr-gi sr-gi force-pushed the 84-unreachable-towers branch 3 times, most recently from e45ab73 to 5be19f2, on August 17, 2022 11:00
@sr-gi (Member, Author) commented Aug 17, 2022

I'm kinda lost in this. Doesn't the retry_notify(ExponentialBackoff { ... }).await block the manage retry thread? So the tower being retried at the moment will miss all the appointments sent while it's being retried?? Besides that, when the tower is flagged as TemporaryUnreachable, we don't send pending appointments to it.

Kind of, but I don't think it actually worked like that. retry_notify called add_appointment (formerly retry_tower), which in turn loaded the pending appointment list for the given tower from the database. This means that, if an appointment was appended to the database while the retrier was looping, that appointment would have been missed.

As I commented here, I think the issue comes from this still being a draft when you looked at it. Feel free to give it a look now; it may make more sense.
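For illustration, here is a rough sketch of the pre-fix flow described above. Every name in it is hypothetical (load_pending_appointments, send_appointment, the u64 tower id and [u8; 16] locator are stand-ins, not the plugin's real API): the pending list is snapshotted from the database once per attempt, so anything inserted afterwards is never visited during that attempt.

async fn retry_tower_old(tower_id: u64) -> Result<(), String> {
    // Bulk snapshot taken once at the start of the attempt.
    let pending = load_pending_appointments(tower_id);
    for locator in pending {
        // An appointment appended to the database at this point is not in
        // `pending` and is therefore missed for this whole attempt.
        send_appointment(tower_id, locator).await?;
    }
    Ok(())
}

fn load_pending_appointments(_tower_id: u64) -> Vec<[u8; 16]> {
    Vec::new() // placeholder for the database query
}

async fn send_appointment(_tower_id: u64, _locator: [u8; 16]) -> Result<(), String> {
    Ok(()) // placeholder for the network call to the tower
}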

@sr-gi sr-gi added the Seeking Code Review (review me pls), bug (Something isn't working) and cln-plugin (Stuff related to watchtower-plugin) labels Aug 23, 2022
@sr-gi sr-gi closed this Aug 29, 2022
@sr-gi sr-gi reopened this Aug 29, 2022
@sr-gi sr-gi added the hard to review (sharpen your review knife) label Aug 30, 2022
@mariocynicys (Collaborator) left a comment

I think I need to do another round of review since I might have missed some corner cases.

Also, one thing I see this approach struggle with is when a tower is unreachable for a long time and a user manually retries it. If the tower is still unreachable, the retrier will keep trying to send every single pending appointment to that tower and time out every time, causing the retrier to do no useful work for a long time.

I support having the channel send (tower, appointment) instead of just (tower), but I think the retrying should be based on (tower)s and not (tower, appointment) permutations.

@@ -111,7 +111,6 @@ def test_unreachable_watchtower(node_factory, bitcoind, teosd):
time.sleep(1)

assert l2.rpc.gettowerinfo(tower_id)["status"] == "reachable"
assert not l2.rpc.gettowerinfo(tower_id)["pending_appointments"]


def test_retry_watchtower(node_factory, bitcoind, teosd):
@mariocynicys (Collaborator)

Just to clarify, this test doesn't auto-retry because we have watchtower-max-retry-time = 0, right?
It tests manual retry.

@sr-gi (Member, Author)

You mean for test_retry_watchtower? That's the goal, yeah: make it so the retrier gives up straight away so we can manually retry.

Comment on lines 140 to -141
assert l2.rpc.gettowerinfo(tower_id)["status"] == "reachable"
assert not l2.rpc.gettowerinfo(tower_id)["pending_appointments"]
@mariocynicys (Collaborator)

I think assert l2.rpc.gettowerinfo(tower_id)["status"] == "reachable" is redundant, since we are waiting on it just two lines before.
I would support keeping assert not l2.rpc.gettowerinfo(tower_id)["pending_appointments"] instead.

@sr-gi (Member, Author)

I'll replace it with:

while l2.rpc.gettowerinfo(tower_id)["pending_appointments"]:
    time.sleep(1)

assert l2.rpc.gettowerinfo(tower_id)["status"] == "reachable"

This makes sure both that all pending appointments have been sent and that the tower is reachable afterwards.

Comment on lines 48 to 50
if !wt_client.towers.contains_key(&tower_id) {
continue;
}
@mariocynicys (Collaborator)

I think you wrote this before rebasing on master. It duplicates the logic of the if condition beneath it:

if wt_client.towers.get(&tower_id).is_none() {
    log::info!("Skipping retrying abandoned tower {}", tower_id);
    continue;
}

I assume using contains_key would have better performance though.

@sr-gi (Member, Author)

Yeah, I think some things got messed up after rebasing :S

wt_client.remove_pending_appointment(tower_id, appointment.locator);
AddAppointmentError::ApiError(e) => match e.error_code {
errors::INVALID_SIGNATURE_OR_SUBSCRIPTION_ERROR => {
log::warn!("There is a subscription issue with {}", tower_id);
@mariocynicys (Collaborator)

There is a case that would misclassify a tower's state; it goes as follows:
1- tower A was temp unreachable and was sent to be retried
2- tower A is waiting its turn to be retried and more appointments accumulate
3- tower A starts being retried and, halfway through sending all the pending appointments, it starts giving back subscription errors
Such a tower would have a temp unreachable status, and then unreachable, but would never be classified as having a subscription error.

AddAppointmentError::ApiError(e) => match e.error_code {
errors::INVALID_SIGNATURE_OR_SUBSCRIPTION_ERROR => {
log::warn!("There is a subscription issue with {}", tower_id);
return Err(Error::transient("Subscription error"));
@mariocynicys (Collaborator)

Why not make a subscription error permanent? The next trials will fail anyway.

@sr-gi (Member, Author)

I really cannot find any good reason why not
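For reference, a hedged sketch of what that change could look like, assuming the backoff crate's Error::permanent constructor (the counterpart of the Error::transient call shown above) and the log crate already used in the diff; the function name and u64 tower id are illustrative, not the plugin's real signature. A permanent error makes retry_notify give up immediately instead of backing off and retrying a call that is bound to fail again.

use backoff::Error;

fn handle_subscription_error(tower_id: u64) -> Result<(), Error<&'static str>> {
    log::warn!("There is a subscription issue with {}", tower_id);
    // Before: Err(Error::transient("Subscription error"))
    Err(Error::permanent("Subscription error"))
}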

watchtower-plugin/src/retrier.rs (outdated review comment, resolved)
.await;

let mut state = self.wt_client.lock().unwrap();
self.pending_appointments.lock().unwrap().remove(&tower_id);
@mariocynicys (Collaborator)

I think we have a race condition here. Its scenario would be:
1- Tower A was temp unreachable and was sent to be retried more than once (say for 3 appointments)
2- It comes to tower A's turn to be retried for the first appointment; it fails and is marked as unreachable (the problem here is that the appointment we just retried was never sent, yet it is also removed from the pending appointments)
3- Tower A gets retried again (for the second appointment) and it succeeds (maybe it/we had a temporary internet issue), marking it as reachable

In this scenario, the first appointment will never get sent to the tower (unless the user manually retries it).

@sr-gi (Member, Author)

Umm, I don't follow. Here the whole tower record is removed from pending_appointments. This only happens if the whole retry strategy fails, so the data in the retrier is wiped.

Comment on lines 57 to 74
self.add_pending_appointment(tower_id, locator);

log::info!("Retrying tower {}", tower_id);
match retry_notify(
let r = retry_notify(
ExponentialBackoff {
max_elapsed_time: Some(Duration::from_secs(self.max_elapsed_time_secs as u64)),
max_interval: Duration::from_secs(self.max_interval_time_secs as u64),
..ExponentialBackoff::default()
},
|| async { self.add_appointment(tower_id).await },
|| async { self.retry_tower(tower_id).await },
|err, _| {
log::warn!("Retry error happened with {}. {}", tower_id, err);
},
)
.await
{
.await;

let mut state = self.wt_client.lock().unwrap();
self.pending_appointments.lock().unwrap().remove(&tower_id);
@mariocynicys (Collaborator)

The missed appointment is added in line +57 and all the pending appointments for that tower are cleared in line +74.
This means there is only one appointment in the pending appointments at any given time?
Or maybe I am missing something :/

@sr-gi (Member, Author)

No, appointments get appended to a collection if the tower is already being retried. I just realized that something may have been messed up here, since we were creating a different async task for every (tower_id, locator) pair while there should only have been a single task per tower, and we should have been appending data to it.

@sr-gi (Member, Author) commented Sep 4, 2022

@meryacine let's see if this makes more sense now.

Looks like some things got messed up after rebasing several PRs over this one. I fixed it so there is only one task per tower; I don't know how it ended up being one task per (tower, locator) pair.

Now, if add_pending_appointment creates a new entry in the pending_appointments collection, a new task is created (this means the tower had no pending appointments, so we need to spin up a task to deal with them), whereas if add_pending_appointment appends the locator to an existing entry, no new task is created (the existing task will pick up that data at some point). See the sketch below.

Also, I've made subscription errors permanent. I feel like I had a good reason for them to be temporary, but I cannot find any at the moment, so that may not really be the case.
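A hedged sketch of the dispatch rule described above (the types and names below are placeholders, not the plugin's actual API): a new retry task is only spawned when the tower had no pending entry; otherwise the locator is appended and the already-running task picks it up on its next pass over the set.

use std::collections::{HashMap, HashSet};
use std::sync::{Arc, Mutex};

type TowerId = u64; // placeholder for the real tower id type
type Locator = [u8; 16]; // placeholder for the real locator type

#[derive(Default, Clone)]
struct RetryManager {
    pending_appointments: Arc<Mutex<HashMap<TowerId, HashSet<Locator>>>>,
}

impl RetryManager {
    fn add_pending_appointment(&self, tower_id: TowerId, locator: Locator) {
        // The lock covers both the check and the insertion, so a finishing
        // task cannot slip in between them.
        let mut pending = self.pending_appointments.lock().unwrap();
        match pending.get_mut(&tower_id) {
            Some(locators) => {
                // Tower already being retried: just extend its workload.
                locators.insert(locator);
            }
            None => {
                // First pending locator for this tower: record it and spin up
                // a task to drain the set (spawning elided in this sketch).
                pending.insert(tower_id, HashSet::from([locator]));
                // e.g. tokio::spawn(async move { /* retry loop for tower_id */ });
            }
        }
    }
}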

@mariocynicys (Collaborator)

Concept ACK ba2c444

.await;

let mut state = wt_client.lock().unwrap();
retriers.lock().unwrap().remove(&tower_id);
@mariocynicys (Collaborator)

This can introduce a race condition: if retry_notify has just finished and the main retry manager loop also received an appointment A for that tower that needs to be retried, the appointment might be added and then the retrier for that tower removed right after, deleting the newly added appointment in the process.

@sr-gi (Member, Author)

Indeed. I've been trying to find a way to fix it, but I'm unsure which sync primitive would work here; I haven't managed to do so so far :S

@sr-gi (Member, Author)

Not sure if this is the most elegant way, but I think cfdd3fb should have fixed the race condition

.pending_appointments
.lock()
.unwrap()
.insert(locator);
@mariocynicys (Collaborator)

Relating to the race condition above, this insert might get invoked after the tower has finished being retried, and thus never be retried at all.

@sr-gi (Member, Author)

I think this cannot happen though. Both this method and manage_retry lock retriers, so if the tower has finished retrying, the if branch would be hit instead of this one.

In any case, the aforementioned race condition needs fixing.

Comment on lines 502 to 505
state
.unreachable_towers
.send((tower_id, appointment.locator))
.unwrap();
@mariocynicys (Collaborator)

I missed that one before, but we should not send an unreachable tower (as opposed to a temporarily unreachable one) to the retrier, right?

@sr-gi (Member, Author) commented Sep 14, 2022

Yeah, I guess we shouldn't. That also covers subscription errors, btw.

So we should add them to pending (in the db) but not send them to the retrier if they are not temporarily unreachable.

@sr-gi (Member, Author)

Fixed it in 56ec2eb, alongside other minor things regarding the status. The only one worth mentioning is splitting is_unreachable in two: is_unreachable and is_temporary_unreachable, given that we now have a good use case for it.

@mariocynicys (Collaborator) left a comment

I think we fixed all the corner cases in here.

I won't be surprised to see more though, as this has come to be fairly complicated.

@sr-gi (Member, Author) commented Sep 15, 2022

I think we fixed all the corner cases in here.

I won't be surprised to see more though, as this has come to be fairly complicated.

Agreed. I think we should revisit the retrier at some point and try to simplify it.

Are you happy with the commits being squashed and this being merged?

@mariocynicys (Collaborator) commented Sep 15, 2022

Are you happy with the commits being squashed and this being merged?

Yeah, I think they are good to go now. But I have played with this PR a little bit in an attempt to simplify it. You might want to take a look at my branch first.

sr-gi and others added 2 commits September 16, 2022 23:44
This is an attempt to rework the retrier logic to simplify how it works
and make it less error prone. This is done by making the retry manager
object responsible for both:
1- adding new retriers and extending current ones
2- removing retriers when they finish their work

This way, we don't need a mutex to guard the retriers hashmap, and we are
sure that adding/extending retriers and removing them never happen at the
same time, because only the retry manager does it and not the individual
retriers (i.e. retriers can't remove themselves from the retriers hashmap).

The retry manager logic goes as follows:
1- drain the unreachable-towers channel until it's empty, and store the pending appointments (locators, to be exact) in the pending appointments set of each retrier.
2- remove any finished retriers (ones that succeeded and have no more pending appointments) and failed retriers (ones that failed to send their appointments).
3- start all the non-running retriers left after removing the failed and finished ones.

Retriers signal their status so that the retry manager can
determine which retriers to keep, which to remove, and which to re-start.

We also set the tower as unreachable when destroying the tower's retrier
and not after completing the backoff. This way the tower is not flagged as
unreachable until its retrier is destroyed, so a manual tower retry by the
user will fail with an error until the tower's retrier is destroyed.

If we were to set the unreachable status right after the backoff, manual user
retries might get discarded completely without an error: the retrier would set
the tower state to unreachable too early, allowing the user to perform manual
retries, but such a retry wouldn't be carried out, since the retry manager
will remove that retrier anyway as it failed to deliver its pending
appointments.
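As an illustration only, here is a rough sketch of the manager loop this message describes. All names and types below are placeholders (the real retrier keeps quite a bit more state), and the draining of the channel is represented by a plain vector argument.

use std::collections::{HashMap, HashSet};

type TowerId = u64; // placeholder for the real tower id type
type Locator = [u8; 16]; // placeholder for the real locator type

enum RetrierStatus {
    Running,  // a task is currently draining this retrier's pending set
    Finished, // succeeded and, when it reported, had nothing pending
    Failed,   // gave up; its tower is flagged unreachable when it is removed
    Idle,     // has pending work but no task running yet
}

struct Retrier {
    status: RetrierStatus,
    pending: HashSet<Locator>,
}

fn manage_retries(retriers: &mut HashMap<TowerId, Retrier>, drained: Vec<(TowerId, Locator)>) {
    // 1) Store everything drained from the unreachable-towers channel.
    for (tower_id, locator) in drained {
        retriers
            .entry(tower_id)
            .or_insert_with(|| Retrier { status: RetrierStatus::Idle, pending: HashSet::new() })
            .pending
            .insert(locator);
    }

    // 2) Remove failed retriers, plus finished ones that picked up no new work
    //    (marking their towers unreachable would happen here, at destruction time).
    retriers.retain(|_, r| match r.status {
        RetrierStatus::Failed => false,
        RetrierStatus::Finished => !r.pending.is_empty(),
        _ => true,
    });

    // 3) (Re)start whatever is left and not currently running.
    for retrier in retriers.values_mut() {
        if !matches!(retrier.status, RetrierStatus::Running) {
            retrier.status = RetrierStatus::Running; // actual task spawning elided
        }
    }
}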
@sr-gi sr-gi merged commit 3c19619 into talaia-labs:master Sep 17, 2022
sr-gi added a commit to sr-gi/rust-teos that referenced this pull request Sep 19, 2022
One thing I really disliked about talaia-labs#89 was having a method that cloned its caller
to be able to work around spawning a task inside it that called a method of the same class.

Turns out you can have self as Arc<Self>, which would completely prevent having to do such a thing,
plus it also reduces the number of things being cloned.
@sr-gi sr-gi removed the Seeking Code Review (review me pls) label Jan 10, 2023
Labels: bug (Something isn't working), cln-plugin (Stuff related to watchtower-plugin), hard to review (sharpen your review knife)
3 participants