Rework the retrier
This is an attempt to rework the retrier logic to simplify how it works
and make it less error prone. This is done by making the retry manager
object responsible for both:
1- adding new retriers and extending current ones
2- removing retriers when they finish their work

This way, we don't need a mutex to guard the retriers hashmap, and we are
sure that adding/extending retriers and removing them never happen
at the same time, because only the retry manager does both and not the
single retriers (i.e. retriers can't remove themselves from the retriers
hashmap).

The retry manager logic goes as follows:
1- drain the unreachable towers channel until it's empty, and store the pending appointments (locators, to be exact) in the pending appointments set of each retrier.
2- remove any finished retriers (ones that succeeded and have no more pending appointments) and failed retriers (ones that failed to send their appointments).
3- start all the non-running retriers left after removing the failed and finished retriers.
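
The three steps above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the code from this commit: it uses `std::sync::mpsc` where the plugin uses tokio, and the names `TowerId`, `Locator`, `RetrierStatus`, the `Retrier` fields and `tick` are all hypothetical.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::mpsc::Receiver;

// Hypothetical stand-ins for the real tower id and appointment locator types.
type TowerId = u32;
type Locator = u64;

// Hypothetical status a retrier reports back to the manager.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RetrierStatus {
    Stopped, // not running: either never started or finished successfully
    Running, // currently retrying
    Failed,  // gave up sending its appointments
}

struct Retrier {
    status: RetrierStatus,
    pending_appointments: HashSet<Locator>,
}

struct RetryManager {
    unreachable_towers: Receiver<(TowerId, Locator)>,
    // Owned by the manager alone, so no mutex is needed around the map.
    retriers: HashMap<TowerId, Retrier>,
}

impl RetryManager {
    fn new(unreachable_towers: Receiver<(TowerId, Locator)>) -> Self {
        RetryManager { unreachable_towers, retriers: HashMap::new() }
    }

    // One pass of the manager loop (assumed to be called repeatedly).
    fn tick(&mut self) {
        // 1- Drain the channel, adding new retriers or extending existing ones.
        while let Ok((tower_id, locator)) = self.unreachable_towers.try_recv() {
            self.retriers
                .entry(tower_id)
                .or_insert_with(|| Retrier {
                    status: RetrierStatus::Stopped,
                    pending_appointments: HashSet::new(),
                })
                .pending_appointments
                .insert(locator);
        }

        // 2- Remove failed retriers and finished ones (stopped, nothing pending).
        self.retriers.retain(|_, retrier| match retrier.status {
            RetrierStatus::Failed => false,
            RetrierStatus::Stopped => !retrier.pending_appointments.is_empty(),
            RetrierStatus::Running => true,
        });

        // 3- Start every remaining retrier that is not running.
        for retrier in self.retriers.values_mut() {
            if retrier.status == RetrierStatus::Stopped {
                // A real retrier would spawn its retry task here.
                retrier.status = RetrierStatus::Running;
            }
        }
    }
}
```

Because only `tick` ever inserts into or removes from `retriers`, additions and removals cannot interleave, which is the property the commit message relies on to drop the mutex.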

Retriers will signal their status so that the retry manager can
determine which retriers to keep, which to remove, and which to restart.
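
One way a retrier could signal its status is through a small shared cell that its worker writes and the manager reads. The sketch below is only an assumption about the shape of such a mechanism: it uses a plain thread and `Arc<Mutex<_>>` rather than a tokio task, and `Retrier`, `RetrierStatus`, `start` and `status` are hypothetical names.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical status values a retrier reports.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RetrierStatus {
    Stopped, // idle: never started, or finished successfully
    Running, // worker is currently retrying
    Failed,  // worker gave up
}

struct Retrier {
    // Shared between the manager (reader) and the worker (writer).
    status: Arc<Mutex<RetrierStatus>>,
}

impl Retrier {
    fn new() -> Self {
        Retrier { status: Arc::new(Mutex::new(RetrierStatus::Stopped)) }
    }

    // What the manager calls to decide whether to keep, remove or restart.
    fn status(&self) -> RetrierStatus {
        *self.status.lock().unwrap()
    }

    // Runs `work` on a worker thread and records the outcome as the status.
    // The retrier never touches the manager's map, it only reports.
    fn start<F>(&self, work: F) -> thread::JoinHandle<()>
    where
        F: FnOnce() -> Result<(), String> + Send + 'static,
    {
        let status = Arc::clone(&self.status);
        *status.lock().unwrap() = RetrierStatus::Running;
        thread::spawn(move || {
            let outcome = work();
            *status.lock().unwrap() = match outcome {
                Ok(()) => RetrierStatus::Stopped,
                Err(_) => RetrierStatus::Failed,
            };
        })
    }
}
```

The mutex here guards a single retrier's status only. The retriers hashmap itself stays owned by the manager and needs no lock.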

We also set the tower as unreachable when destroying the tower's retrier
and not right after completing the backoff. This way the tower is only
marked as unreachable once its retrier is destroyed, so a manual tower
retry by the user will fail with an error until that happens.

If we were to set the tower status to unreachable right after the backoff,
manual user retries might get discarded completely without an error: the
retrier would set the tower state to unreachable too early, which allows
the user to perform a manual retry, but that retry would never get
carried out, since the retry manager removes that retrier anyway because
it failed to deliver its pending appointments.
mariocynicys authored and sr-gi committed Sep 16, 2022
1 parent c5f9c3a commit f6a60a9
Showing 3 changed files with 236 additions and 130 deletions.
4 changes: 2 additions & 2 deletions watchtower-plugin/src/main.rs
@@ -615,8 +615,8 @@ async fn main() -> Result<(), Error> {
         60
     };
     tokio::spawn(async move {
-        RetryManager::new(state_clone)
-            .manage_retry(max_elapsed_time, max_interval_time, rx)
+        RetryManager::new(state_clone, rx, max_elapsed_time, max_interval_time)
+            .manage_retry()
             .await
     });
     plugin.join().await