Fixes race condition in #84 #89
Conversation
Also note this builds on top of #83, so mainly the last commit needs to be reviewed.
Force-pushed from d520f72 to 1a2e5a7
I'm a bit lost here. Doesn't the retry_notify(ExponentialBackoff { ... }).await block the retry manager thread? If so, the tower currently being retried will miss all the appointments sent while it is being retried. Besides that, when the tower is flagged as TemporaryUnreachable, we don't send pending appointments to it.
@@ -379,8 +387,10 @@ async fn on_commitment_revocation(
    let mut state = plugin.state().lock().unwrap();
    state.set_tower_status(tower_id, TowerStatus::TemporaryUnreachable);
    state.add_pending_appointment(tower_id, &appointment);

    state.unreachable_towers.send(tower_id).unwrap();
    state
But won't the TemporaryUnreachable towers miss any new appointments we send later? So the race condition is still there?
You actually reviewed an old version of this where unreachable_towers was still holding only tower_id instead of (tower_id, locator). My bad for not flagging this as a draft.
watchtower-plugin/src/retrier.rs
Outdated
self.wt_client
    .lock()
    .unwrap()
    .set_tower_status(tower_id, crate::TowerStatus::TemporaryUnreachable);
This comment was marked as resolved.
Force-pushed from e45ab73 to 5be19f2
Kind of, but I don't think it actually worked like that. As I commented here, I think the issue comes from this still being a draft when you looked at it. Feel free to give it a look now; it may make more sense.
Force-pushed from 5be19f2 to 1ac0823
Force-pushed from 1ac0823 to f2e99fa
I think I need to do another round of review, since I might have missed some corner cases.
One thing this approach struggles with: when a tower has been unreachable for a long time and a user manually retries it, if the tower is still unreachable the retrier will keep trying to send every single pending appointment to that tower and time out each time, doing no useful work for a long while.
I support having the channel send (tower, appointment) instead of just (tower), but I think the retrying should be keyed by towers and not by (tower, appointment) pairs.
@@ -111,7 +111,6 @@ def test_unreachable_watchtower(node_factory, bitcoind, teosd):
    time.sleep(1)

    assert l2.rpc.gettowerinfo(tower_id)["status"] == "reachable"
    assert not l2.rpc.gettowerinfo(tower_id)["pending_appointments"]


def test_retry_watchtower(node_factory, bitcoind, teosd):
Just to clarify, this test doesn't auto-retry because we have watchtower-max-retry-time = 0, right? It tests manual retry.
You mean for test_retry_watchtower? That's the goal, yeah: make it so the retrier gives up straight away so we can manually retry.
assert l2.rpc.gettowerinfo(tower_id)["status"] == "reachable"
assert not l2.rpc.gettowerinfo(tower_id)["pending_appointments"]
I think assert l2.rpc.gettowerinfo(tower_id)["status"] == "reachable" is redundant, since we are waiting on it just two lines above. I would keep assert not l2.rpc.gettowerinfo(tower_id)["pending_appointments"] instead.
I'll replace it with:
while l2.rpc.gettowerinfo(tower_id)["pending_appointments"]:
    time.sleep(1)
assert l2.rpc.gettowerinfo(tower_id)["status"] == "reachable"
This makes sure both that all pending appointments have been sent and that the tower is reachable afterwards.
watchtower-plugin/src/retrier.rs
Outdated
if !wt_client.towers.contains_key(&tower_id) {
    continue;
}
I think you wrote this before rebasing on master. It duplicates the logic of the if condition beneath it:
if wt_client.towers.get(&tower_id).is_none() {
    log::info!("Skipping retrying abandoned tower {}", tower_id);
    continue;
}
I assume using contains_key would have slightly better performance though.
Yeah, I think some things got messed up after rebasing :S
wt_client.remove_pending_appointment(tower_id, appointment.locator);
AddAppointmentError::ApiError(e) => match e.error_code {
    errors::INVALID_SIGNATURE_OR_SUBSCRIPTION_ERROR => {
        log::warn!("There is a subscription issue with {}", tower_id);
There is a case that would misclassify a tower's state. It goes as follows:
1. Tower A was temporarily unreachable and was sent to be retried.
2. Tower A is waiting its turn to be retried and more appointments accumulate.
3. Tower A starts being retried and, halfway through sending all the pending appointments, it starts returning subscription errors.
Such a tower would go from temporarily unreachable to unreachable, but never be classified as having a subscription error.
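One way to avoid that misclassification (a hedged sketch with made-up names, not the plugin's actual API) is to latch any subscription error seen mid-batch and let it take precedence over the generic status once the backoff gives up:

```rust
// Hypothetical status classifier: `saw_subscription_error` would be
// latched while sending the batch of pending appointments.
#[derive(Debug, PartialEq, Clone, Copy)]
enum TowerStatus {
    Reachable,
    Unreachable,
    SubscriptionError,
}

fn final_status(saw_subscription_error: bool, retry_succeeded: bool) -> TowerStatus {
    if retry_succeeded {
        TowerStatus::Reachable
    } else if saw_subscription_error {
        // Keep the most specific failure cause instead of the generic one.
        TowerStatus::SubscriptionError
    } else {
        TowerStatus::Unreachable
    }
}

fn main() {
    // The scenario above: the retry fails halfway with subscription errors.
    assert_eq!(final_status(true, false), TowerStatus::SubscriptionError);
    // A plain timeout still ends up as Unreachable.
    assert_eq!(final_status(false, false), TowerStatus::Unreachable);
}
```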
watchtower-plugin/src/retrier.rs
Outdated
AddAppointmentError::ApiError(e) => match e.error_code {
    errors::INVALID_SIGNATURE_OR_SUBSCRIPTION_ERROR => {
        log::warn!("There is a subscription issue with {}", tower_id);
        return Err(Error::transient("Subscription error"));
Why not make a subscription error permanent? The next retries will fail anyway.
I really cannot find any good reason why not
watchtower-plugin/src/retrier.rs
Outdated
.await;

let mut state = self.wt_client.lock().unwrap();
self.pending_appointments.lock().unwrap().remove(&tower_id);
I think we have a race condition here. The scenario goes:
1. Tower A was temporarily unreachable and was sent to be retried more than once (say, for 3 appointments).
2. Tower A's turn comes to be retried for the first appointment; the retry fails and the tower is marked unreachable. The problem: the appointment we just retried was never sent, yet it is removed from the pending appointments.
3. Tower A gets retried again (for the second appointment) and succeeds (maybe there was a temporary network issue), marking it as reachable.
In this scenario, the first appointment will never be sent to the tower (unless the user manually retries it).
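One way to remove that hazard (a sketch under assumed toy names, not the actual change in this PR) is to drop only the locators the tower actually acknowledged, so a failed attempt cannot wipe undelivered appointments:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical toy identifiers standing in for the plugin's real types.
type TowerId = u32;
type Locator = u32;

// After a retry attempt, remove only what was delivered; keep the rest
// pending so a later (successful) retry can still pick it up.
fn on_retry_result(
    pending: &mut HashMap<TowerId, HashSet<Locator>>,
    tower: TowerId,
    delivered: &[Locator],
) {
    if let Some(set) = pending.get_mut(&tower) {
        for l in delivered {
            set.remove(l); // drop only acknowledged locators
        }
        if set.is_empty() {
            pending.remove(&tower); // this tower's retrier is done
        }
    }
}

fn main() {
    let mut pending = HashMap::new();
    pending.insert(1, HashSet::from([10, 20, 30]));
    // First retry delivered nothing (tower unreachable): nothing is lost.
    on_retry_result(&mut pending, 1, &[]);
    assert_eq!(pending[&1].len(), 3);
    // A later retry delivers two of the three.
    on_retry_result(&mut pending, 1, &[10, 20]);
    assert_eq!(pending[&1], HashSet::from([30]));
}
```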
Umm, I don't follow. Here the whole tower record is removed from pending_appointments. This only happens if the whole retry strategy fails, so the data in the retrier is wiped.
watchtower-plugin/src/retrier.rs
Outdated
self.add_pending_appointment(tower_id, locator);

log::info!("Retrying tower {}", tower_id);
match retry_notify(
let r = retry_notify(
    ExponentialBackoff {
        max_elapsed_time: Some(Duration::from_secs(self.max_elapsed_time_secs as u64)),
        max_interval: Duration::from_secs(self.max_interval_time_secs as u64),
        ..ExponentialBackoff::default()
    },
    || async { self.add_appointment(tower_id).await },
    || async { self.retry_tower(tower_id).await },
    |err, _| {
        log::warn!("Retry error happened with {}. {}", tower_id, err);
    },
)
.await
{
.await;

let mut state = self.wt_client.lock().unwrap();
self.pending_appointments.lock().unwrap().remove(&tower_id);
The missed appointment is added in line +57 and all the pending appointments for that tower are cleared in line +74.
This means there is only one appointment in the pending appointments at any given time?
Or maybe I am missing something :/
No, appointments get appended to a collection if the tower is already being retried. I just realized something may have gotten messed up here, since we were creating a separate async task for every (tower_id, locator) pair when there should only be a single task per tower, with data being appended to it.
Force-pushed from 75bb616 to 4fb7083
@meryacine let's see if this makes more sense now. Looks like some things got messed up after rebasing several PRs over this one. I fixed it so there is only one task per tower; I don't know how it ended up being one task per (tower, locator) pair. Also, I've made subscription errors permanent. I feel like I had a good reason for them to be temporary, but I cannot find one at the moment, so it may not really be the case.
Concept ACK
watchtower-plugin/src/retrier.rs
Outdated
.await;

let mut state = wt_client.lock().unwrap();
retriers.lock().unwrap().remove(&tower_id);
This can introduce a race condition: if retry_notify has just finished while the main retry manager loop receives an appointment A from that tower that needs to be retried, the appointment might be added and then the retrier for that tower removed right after, deleting the newly added appointment along the way.
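A sketch of one direction such a fix can take (illustrative names only, not the code from cfdd3fb): perform the "is this retrier finished?" check and the removal under the same lock that guards insertion, so an appointment can never land between the check and the removal:

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

// Hypothetical toy identifiers standing in for the plugin's real types.
type TowerId = u32;
type Locator = u32;

// Per-tower pending locators; add and remove both go through one mutex.
struct Retriers(Mutex<HashMap<TowerId, HashSet<Locator>>>);

impl Retriers {
    fn add(&self, tower: TowerId, locator: Locator) {
        self.0.lock().unwrap().entry(tower).or_default().insert(locator);
    }

    // Remove the retrier only if it is still empty at removal time;
    // if new work arrived meanwhile, keep it alive.
    fn remove_if_done(&self, tower: TowerId) -> bool {
        let mut map = self.0.lock().unwrap();
        let done = map.get(&tower).map_or(false, |set| set.is_empty());
        if done {
            map.remove(&tower);
        }
        done
    }
}

fn main() {
    let retriers = Retriers(Mutex::new(HashMap::from([(1, HashSet::new())])));
    // An appointment sneaks in before the removal check runs:
    retriers.add(1, 42);
    // The retrier is kept and the appointment survives.
    assert!(!retriers.remove_if_done(1));
    assert!(retriers.0.lock().unwrap()[&1].contains(&42));
}
```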
Indeed. I've been trying to find a way to fix it but I'm unsure what sync primitive would work here; haven't managed to find one so far :S
Not sure if this is the most elegant way, but I think cfdd3fb should have fixed the race condition
.pending_appointments
    .lock()
    .unwrap()
    .insert(locator);
Related to the race condition above, this insert might get invoked after the tower has finished being retried, with the locator thus never being retried at all.
I think this cannot happen though. Both this method and manage_retry lock retriers, so if the tower has finished retrying, the if branch would be hit instead of this one.
In any case, the aforementioned race condition needs fixing.
watchtower-plugin/src/main.rs
Outdated
state
    .unreachable_towers
    .send((tower_id, appointment.locator))
    .unwrap();
I missed that one before, but we should not send an unreachable tower (as opposed to temporarily unreachable) to the retrier, right?
Yeah, I guess we shouldn't. That also covers subscription errors, by the way. So we should add them to pending (in the db) but not send them to the retrier if they are not temporarily unreachable.
Fixed it in 56ec2eb alongside other minor things regarding the status. The only one worth mentioning is splitting is_unreachable in two: is_unreachable and is_temporary_unreachable, given we have a good use case for it now.
I think we fixed all the corner cases in here. I won't be surprised to see more though, as this has come to be fairly complicated.
Agreed. I think we should revisit the retrier at some point and try to simplify it. Are you happy with the commits being squashed and this being merged?
Yeah, I think they are good to go now. But I have played with this PR a little bit in an attempt to simplify it. You might want to take a look at my branch first.
This is an attempt to rework the retrier logic to simplify how it works and make it less error-prone. This is done by making the retry manager object responsible for both:
1. adding new retriers and extending current ones
2. removing retriers when they finish their work
This way, we don't need a mutex to guard the retriers hashmap, and we are sure that adding/extending retriers and removing them never happen at the same time, because only the retry manager does it and not individual retriers (i.e. retriers can't remove themselves from the retriers hashmap).
The retry manager logic goes as follows:
1. Drain the unreachable-towers channel until it's empty, storing the pending appointments (locators, to be exact) in the pending appointments set of each retrier.
2. Remove any finished retriers (ones that succeeded and have no more pending appointments) and failed retriers (ones that failed to send their appointments).
3. Start all the non-running retriers left after removing the failed and finished ones.
Retriers signal their status so the retry manager can determine which retriers to keep, which to remove, and which to restart.
We also set a tower as unreachable when destroying the tower's retrier, not after completing backoff. This keeps the tower flagged until its retrier is destroyed, so a manual retry by the user will fail with an error until then. If we were to set the unreachable status right after the backoff, manual user retries might get discarded without an error: the retrier would have set the tower state to unreachable too early, allowing the user to perform a manual retry, but that retry would never be carried out, since the retry manager would remove the retrier anyway as it failed to deliver its pending appointments.
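The three manager steps in the commit message above can be sketched sequentially (a hedged, synchronous toy with made-up names; the real manager is a long-running async loop spawning a task per retrier):

```rust
use std::collections::HashMap;
use std::sync::mpsc;

// Hypothetical toy identifiers standing in for the plugin's real types.
type TowerId = u32;
type Locator = u32;

#[derive(PartialEq, Debug, Clone, Copy)]
enum RetrierStatus { Idle, Running, Finished, Failed }

struct Retrier { pending: Vec<Locator>, status: RetrierStatus }

// One iteration of the manager: drain, prune, start.
fn manager_tick(
    rx: &mpsc::Receiver<(TowerId, Locator)>,
    retriers: &mut HashMap<TowerId, Retrier>,
) {
    // 1. Drain the unreachable-towers channel, extending existing retriers.
    while let Ok((tower, locator)) = rx.try_recv() {
        retriers
            .entry(tower)
            .or_insert(Retrier { pending: Vec::new(), status: RetrierStatus::Idle })
            .pending
            .push(locator);
    }
    // 2. Drop retriers that finished (succeeded, nothing pending) or failed.
    retriers.retain(|_, r| {
        let finished = r.status == RetrierStatus::Finished && r.pending.is_empty();
        let failed = r.status == RetrierStatus::Failed;
        !finished && !failed
    });
    // 3. Start whatever is left idle (the real code spawns tasks here).
    for r in retriers.values_mut() {
        if r.status == RetrierStatus::Idle {
            r.status = RetrierStatus::Running;
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let mut retriers = HashMap::new();
    tx.send((1, 10)).unwrap();
    tx.send((1, 20)).unwrap();
    manager_tick(&rx, &mut retriers);
    assert_eq!(retriers[&1].pending, vec![10, 20]);
    assert_eq!(retriers[&1].status, RetrierStatus::Running);
}
```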
Force-pushed from 12a64d8 to f6a60a9
One thing I really disliked about talaia-labs#89 was having a method that cloned its caller to be able to work around spawning a task inside it that called a method of the same class. Turns out you can have self as Arc<Self>, which completely prevents having to do such a thing, plus it also reduces the number of things being cloned.
The current implementation of the cln-plugin had a race condition where some pending appointments could be missed if they were added while the Retrier was trying to send some data to the tower. The issue was discovered in #84, and goes as follows:

The Retrier used to load all the data from the database (in bulk) when starting a new retry and stored a copy of it to do the retry. This is somewhat necessary since we cannot hold a reference to the data across futures. Therefore, if some data was added after the pending appointments were loaded, it would be missed.

The solution goes through different stages:
- The channel between the WTClient and the Retrier shares a (tower_id, locator) pair instead of simply a tower_id, so the Retrier is aware of all data sent to it. This means the Retrier can no longer miss data as long as it checks it back with the WTClient one locator at a time.
- The Retrier still works in batches for each tower; however, it now holds a collection of pending_appointments so it can keep track of what is missing.
- A retry is only considered successful once pending_appointments for the given tower is empty. If that is not the case, the newly added data is pulled and the retry continues until the condition holds.
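The first stage above can be illustrated minimally (assumed toy types; the real channel in the plugin is an async unbounded one): with (tower_id, locator) messages, anything added mid-retry is just another message on the channel and cannot be missed, unlike a bulk database load taken once at retry start.

```rust
use std::sync::mpsc;

// Hypothetical toy identifiers standing in for the plugin's real types.
type TowerId = u32;
type Locator = u32;

fn main() {
    let (tx, rx) = mpsc::channel::<(TowerId, Locator)>();
    // Data added while a retry is in flight is just another message.
    tx.send((7, 100)).unwrap();
    tx.send((7, 101)).unwrap();
    drop(tx); // close the channel so the iterator below terminates
    let received: Vec<(TowerId, Locator)> = rx.iter().collect();
    assert_eq!(received, vec![(7, 100), (7, 101)]);
}
```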