Commit
This is an attempt to rework the retrier logic to simplify how it works and make it less error prone. This is done by making the retry manager object responsible for both:
1- adding new retriers and extending current ones
2- removing retriers when they finish their work

This way, we don't need a mutex to guard the retriers hashmap, and we are sure that adding/extending retriers and removing them never happen at the same time, because only the retry manager does it and not individual retriers (i.e. retriers can't remove themselves from the retriers hashmap).

The retry manager logic goes as follows:
1- drain the unreachable towers channel until it's empty, and store the pending appointments (locators, to be exact) in the pending appointments set of each retrier.
2- remove any finished retriers (ones that succeeded and have no more pending appointments) and failed retriers (ones that failed to send their appointments).
3- start all the non-running retriers left after removing the failed and finished retriers.

Retriers signal their status so that the retry manager can determine which retriers to keep, which to remove, and which to restart.

We also set the tower as unreachable when destroying the tower's retrier, and not after completing the backoff. This way the tower is not marked as unreachable until its retrier is destroyed, so a manual tower retry by the user fails with an error until then. If we were to set the unreachable tower status right after the backoff, manual user retries might get discarded completely without an error: the retrier would set the tower state to unreachable too early, allowing the user to perform a manual retry, but that retry would never be carried out, since the retry manager removes that retrier anyway because it failed to deliver its pending appointments.
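The three retry manager steps described above can be sketched roughly as follows. This is a minimal, single-threaded illustration and not the code in this commit: `RetryManager`, `Retrier`, `RetrierStatus`, `tick`, and the `TowerId`/`Locator` aliases are hypothetical names, and "starting" a retrier is reduced to flipping its status instead of spawning any work.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::mpsc::Receiver;

// Hypothetical names for illustration; not the project's actual types.
type TowerId = u32;
type Locator = u64;

#[derive(Debug, Clone, Copy, PartialEq)]
enum RetrierStatus {
    Stopped, // not started yet, or finished its last round successfully
    Running, // currently retrying
    Failed,  // gave up on delivering its pending appointments
}

struct Retrier {
    status: RetrierStatus,
    pending: HashSet<Locator>,
}

struct RetryManager {
    unreachable_towers: Receiver<(TowerId, Locator)>,
    // Only the manager touches this map, so no mutex is needed around it.
    retriers: HashMap<TowerId, Retrier>,
    // Towers flagged unreachable once their failed retrier is destroyed.
    unreachable: HashSet<TowerId>,
}

impl RetryManager {
    fn new(unreachable_towers: Receiver<(TowerId, Locator)>) -> Self {
        RetryManager {
            unreachable_towers,
            retriers: HashMap::new(),
            unreachable: HashSet::new(),
        }
    }

    /// One pass of the manager loop; returns the towers whose retriers were removed.
    fn tick(&mut self) -> Vec<TowerId> {
        // 1- Drain the channel until it is empty, adding new retriers
        //    or extending the pending set of existing ones.
        while let Ok((tower, locator)) = self.unreachable_towers.try_recv() {
            self.retriers
                .entry(tower)
                .or_insert_with(|| Retrier {
                    status: RetrierStatus::Stopped,
                    pending: HashSet::new(),
                })
                .pending
                .insert(locator);
        }

        // 2- Remove finished and failed retriers. A failed tower is flagged
        //    unreachable here, at destruction time, not earlier.
        let unreachable = &mut self.unreachable;
        let mut removed = Vec::new();
        self.retriers.retain(|tower, retrier| {
            let failed = retrier.status == RetrierStatus::Failed;
            let finished =
                retrier.status == RetrierStatus::Stopped && retrier.pending.is_empty();
            if failed {
                unreachable.insert(*tower);
            }
            if failed || finished {
                removed.push(*tower);
            }
            !(failed || finished)
        });

        // 3- Start every remaining retrier that is not already running
        //    (real code would kick off the retry work here).
        for retrier in self.retriers.values_mut() {
            if retrier.status != RetrierStatus::Running {
                retrier.status = RetrierStatus::Running;
            }
        }
        removed
    }
}
```

In this sketch a retrier reports back only by changing its `status` and `pending` fields; the manager alone inserts into and removes from the `retriers` map, which is the property that makes the mutex unnecessary.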