Race condition between Minos and Necromancer #4109

Closed
dchristidis opened this issue Nov 6, 2020 · 0 comments · Fixed by #4111

Comments

@dchristidis
Contributor

Motivation

A replica that is declared bad after having previously been declared temporarily unavailable might be reported as recovered even though no request is ever created. This is more likely when the Necromancer has a backlog of bad replicas to process and the temporary unavailability expires during that period.

It’s also possible to reproduce this artificially:

  1. Stop all Necromancer instances.
  2. Declare a replica as temporarily unavailable. Wait until it is processed by Minos: the state in the replicas table transitions from AVAILABLE to TEMPORARY_UNAVAILABLE and a row with the same state appears in bad_replicas.
  3. Declare the replica as lost. Wait until it is processed by Minos: the state in the replicas table transitions from TEMPORARY_UNAVAILABLE to BAD and a new row with the same state appears in bad_replicas.
  4. Let the temporary unavailability expire naturally or manually update the expires_at column.
  5. Wait until the bad replica is processed by the Minos temporary expiration daemon: the first bad_replicas row is removed and the state in the replicas table transitions from BAD to AVAILABLE.
  6. Restart the Necromancers. The main loop works on the replicas table, where the replica is now AVAILABLE, so the replica is never picked up.
  7. Wait one hour or manually change the value of update_history_threshold so that list_bad_replicas_history() and update_bad_replicas_history() are called. There, the Necromancer sees a bad_replicas row with state BAD while the state in the replicas table is AVAILABLE. Consequently, the bad_replicas row transitions from BAD to RECOVERED without a request being created. Checkmate.
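
The sequence above can be sketched as a toy in-memory simulation. This is not Rucio code: the `Tables` class and the functions `minos_declare`, `minos_expire_temporary`, `necromancer_main_loop`, `necromancer_update_history` and `reproduce` are hypothetical stand-ins that only mirror the state transitions described in the steps, with plain Python containers in place of the real database tables.

```python
from dataclasses import dataclass, field


@dataclass
class Tables:
    """Hypothetical in-memory stand-in for the tables named in the steps."""
    replicas: dict = field(default_factory=dict)      # replica name -> state
    bad_replicas: list = field(default_factory=list)  # rows of [name, state]
    requests: list = field(default_factory=list)      # recovery requests created


def minos_declare(db, name, state):
    # Steps 2 and 3: Minos updates the replica state and adds a bad_replicas row.
    db.replicas[name] = state
    db.bad_replicas.append([name, state])


def minos_expire_temporary(db, name):
    # Step 5: the expiration daemon removes the TEMPORARY_UNAVAILABLE row and
    # resets the replica to AVAILABLE, overwriting the later BAD state.
    db.bad_replicas[:] = [
        row for row in db.bad_replicas
        if row != [name, "TEMPORARY_UNAVAILABLE"]
    ]
    db.replicas[name] = "AVAILABLE"


def necromancer_main_loop(db):
    # Step 6: the main loop only looks at the replicas table.
    for name, state in db.replicas.items():
        if state == "BAD":
            db.requests.append(name)


def necromancer_update_history(db):
    # Step 7: a BAD row whose replica is AVAILABLE is marked RECOVERED.
    for row in db.bad_replicas:
        name, state = row
        if state == "BAD" and db.replicas.get(name) == "AVAILABLE":
            row[1] = "RECOVERED"


def reproduce():
    db = Tables(replicas={"file1": "AVAILABLE"})
    minos_declare(db, "file1", "TEMPORARY_UNAVAILABLE")  # step 2
    minos_declare(db, "file1", "BAD")                    # step 3
    minos_expire_temporary(db, "file1")                  # steps 4 and 5
    necromancer_main_loop(db)                            # step 6
    necromancer_update_history(db)                       # step 7
    return db


if __name__ == "__main__":
    db = reproduce()
    print(db.replicas, db.bad_replicas, db.requests)
```

In this sketch the run ends with the replica AVAILABLE, its bad_replicas row RECOVERED, and an empty request list, which is the inconsistency reported here.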

Modification

Some discussion on how to handle such cases might be necessary.

cserf added a commit to cserf/rucio that referenced this issue Nov 9, 2020
bari12 added a commit that referenced this issue Nov 10, 2020
…_Minos_and_Necromancer

Race condition between Minos and Necromancer : Closes #4109
@bari12 bari12 added this to the 1.23.10 milestone Nov 10, 2020