
Fix the issues around long-running bag replications #993

Open
alexwlchan opened this issue Apr 11, 2022 · 1 comment
Labels
↪️ Bag replication 🐛 Bug Something isn't working

Comments

@alexwlchan
Contributor

alexwlchan commented Apr 11, 2022

This is what I suspect to be the root cause of wellcomecollection/platform#5484. I'm going to write it out with quite a lot of context, because (1) I don't know how quickly we'll implement this and (2) it'll be useful for somebody else to be able to understand it.

Context

What is the bag replicator?

The bag replicator is the service that copies a bag (a collection of files) from one storage location to another. It runs three times when storing a normal bag:

  • Copying from working storage to our warm S3 storage
  • Copying from working storage to our cold S3 storage
  • Copying from working storage to our cold Azure storage

It calls ListObjects on the working storage, then copies each object one by one from the source to the destination.
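For illustration, here's a minimal sketch of that list-then-copy pattern in Scala, assuming a reasonably recent AWS SDK for Java v2. The bucket and prefix names are placeholders, and the real replicator has more error handling and supports multiple storage providers.

```scala
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{CopyObjectRequest, ListObjectsV2Request}

import scala.jdk.CollectionConverters._

object ReplicateSketch {
  val s3: S3Client = S3Client.create()

  def replicatePrefix(srcBucket: String, srcPrefix: String,
                      dstBucket: String, dstPrefix: String): Unit = {
    val listRequest = ListObjectsV2Request.builder()
      .bucket(srcBucket)
      .prefix(srcPrefix)
      .build()

    // List everything under the source prefix, then copy each object one by one.
    s3.listObjectsV2Paginator(listRequest).contents().asScala.foreach { obj =>
      val dstKey = dstPrefix + obj.key().stripPrefix(srcPrefix)

      s3.copyObject(
        CopyObjectRequest.builder()
          .sourceBucket(srcBucket).sourceKey(obj.key())
          .destinationBucket(dstBucket).destinationKey(dstKey)
          .build()
      )
    }
  }
}
```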

How the bag replicator tries to do write-once operations

We want the bag replicator to write exactly once to the permanent storage location – so we can replicate an object, verify it, then be certain that it hasn't changed after it was verified.

We don't currently enable the S3 features that allow you to do that (Legal Holds, Object Lock, governance mode, etc.). We do enable legal holds in Azure.

So instead, the bag replicator checks for an existing object itself, and only writes the object if there isn't already an object at the destination. If there's already an object, it confirms it's the same as what it's about to replicate.

```mermaid
flowchart TD
    F[get object to replicate] --> DE{does the <br/>destination have<br/> this object?}
    DE-- yes -->DEY{are the source <br/>and destination <br/>objects the same?}
    DE-- no -->DEN["✅ replicate the object"]
    DEY-- yes -->DEYY["✅ nothing to do"]
    DEY-- no -->DEYN["⚠️ something has gone wrong"]
```

To prevent two processes doing this check simultaneously, the bag replicator will lock around the destination prefix – if it's got the lock, it should be the only thing working on this prefix.

This uses our Dynamo locking code, which has locks that expire after a certain period. (This is so applications that crash without releasing their locks don't gum up a process in perpetuity.)
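To make the shape of that concrete, here's a rough Scala sketch of acquiring an expiring lock with a single conditional DynamoDB write. The table name ("locks"), key schema and expiresAt attribute are assumptions for the sketch, not our actual locking library.

```scala
import java.time.{Duration, Instant}

import scala.jdk.CollectionConverters._

import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.{AttributeValue, ConditionalCheckFailedException, PutItemRequest}

object PrefixLockSketch {
  val dynamo: DynamoDbClient = DynamoDbClient.create()

  /** Returns true if we acquired the lock on `prefix`, false if somebody else holds it. */
  def acquireLock(prefix: String, expiresIn: Duration): Boolean = {
    val item = Map(
      "id"        -> AttributeValue.builder().s(prefix).build(),
      // Record when the lock should stop being honoured, so a crashed
      // process can't hold it forever.
      "expiresAt" -> AttributeValue.builder().n(
        Instant.now().plus(expiresIn).getEpochSecond.toString
      ).build()
    ).asJava

    try {
      dynamo.putItem(
        PutItemRequest.builder()
          .tableName("locks")
          .item(item)
          // Only succeed if nobody else has already written this lock row.
          .conditionExpression("attribute_not_exists(id)")
          .build()
      )
      true
    } catch {
      case _: ConditionalCheckFailedException => false
    }
  }
}
```

The real code also has to treat an expired lock as free (e.g. by conditioning on expiresAt, or using a DynamoDB TTL); that expiry behaviour is exactly what matters in the bug described below.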

S3 consistency and lazy checking

This approach had a bad interaction with the way S3 consistency used to work.

S3 used to be eventually consistent. If you called GetObject on a key that didn't exist yet, then wrote the object (with PutObject or CopyObject), then called GetObject again, the second GetObject might still return a 404. This could cause verification failures, like so:

```mermaid
sequenceDiagram
  participant replicator
  participant verifier
  participant S3

  replicator->>S3: GetObject(f)
  S3->>replicator: 404 Not Found

  replicator->>S3: CopyObject(f, destination)
  S3->>replicator: OK, done!

  replicator->>verifier: you can verify f
  verifier->>S3: GetObject(f)
  S3->>verifier: 404 Not Found (!)
```

To get around this, the replicator starts by looking for existing objects in the prefix using a ListObjects call. If there aren't any, it knows that nobody is going to write any new objects because it's holding the lock, so it can skip checking for each object in turn. This avoids querying any object individually and hitting this consistency issue.

S3 is now strongly consistent so we no longer need this mitigation, but we haven't removed it yet.
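To make the control flow concrete, here's a sketch of that "list first, then decide" logic. The helper functions passed in are hypothetical stand-ins for the replicator's real storage abstractions.

```scala
object LazyCheckSketch {
  def replicate(
    srcKeys: Seq[String],
    dstPrefix: String,
    listDestinationKeys: String => Seq[String],     // hypothetical helpers
    destinationHas: String => Boolean,
    sameChecksum: (String, String) => Boolean,
    copy: (String, String) => Unit
  ): Unit = {
    // We hold the prefix lock, so if the destination prefix is empty now,
    // nobody else can write to it -- skip the per-object existence checks.
    val destinationIsEmpty = listDestinationKeys(dstPrefix).isEmpty

    srcKeys.foreach { srcKey =>
      val dstKey = s"$dstPrefix/$srcKey"

      if (destinationIsEmpty || !destinationHas(dstKey)) {
        copy(srcKey, dstKey)
      } else if (!sameChecksum(srcKey, dstKey)) {
        throw new RuntimeException(s"Destination object $dstKey differs from the source!")
      }
      // else: already replicated and identical; nothing to do
    }
  }
}
```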

The problem

I think this is the rough sequence of events with the problem bag:

  1. The Azure replicator queue gets a message "Please replicate b32832631"

  2. Task A picks up the message. It acquires a lock on the destination prefix, sees there are no objects already there, starts replicating.

  3. For some reason, the task gets stuck (possibly because we're doing a lot of S3→Azure replication at the same time). It's still working, but it doesn't replicate any objects before the SQS timeout and prefix lock both expire.

  4. Task B picks up the message. It acquires a lock on the destination prefix, sees there are no objects already there, starts replicating. At this point, both A and B are trying to replicate into the same prefix, and both think they have exclusive rights to do so.

  5. Eventually the logjam clears, and both tasks are able to replicate objects into Azure.

  6. Because we have legal holds enabled on the Azure container, at some point both tasks encounter an object that the other task has written, and the Azure copy fails:

    This operation is not permitted as the blob is immutable due to one or more legal holds

Proposed changes

Changes I think we should definitely make:

  • Lock around individual destination objects, not the entire prefix. This problem became possible because the replicator kept working after its lock had expired. We could build a way to keep locks "fresh" and tie that to our SQS visibility timeout, but this is adding complexity to an already pretty-complicated bit of code.

    This means we'd spend more on DynamoDB locking, but individual locks would only have to last as long as it takes to replicate a single object. The likelihood of a lock expiring while work is ongoing drops dramatically.

  • Always look for an existing object before replicating. We only skip this check to work around an S3 limitation that no longer applies, so we could remove that code and simplify the replicator. It's less efficient (because we'll check for lots of objects that don't exist), but combined with the individual locking it gives us a stronger level of protection against overwrites. (There's a rough sketch of both of these changes after this list.)

    It might be an issue if we ever want to expand our replication locations to another object store that's eventually consistent, but (1) we have no plans to do that and (2) we can't be sure this hypothetical future store would have the same failure mode.
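Here's a rough sketch of how those two changes fit together: lock around a single destination key, always check for an existing object, then copy. As before, the helper functions are hypothetical stand-ins rather than our real interfaces.

```scala
object PerObjectReplicationSketch {
  def replicateObject(
    srcKey: String,
    dstKey: String,
    acquireLock: String => Boolean,        // lock on the destination *key*, not the prefix
    releaseLock: String => Unit,
    destinationHas: String => Boolean,     // always check, even if the prefix looked empty
    sameChecksum: (String, String) => Boolean,
    copy: (String, String) => Unit
  ): Either[String, Unit] =
    if (!acquireLock(dstKey)) {
      Left(s"Couldn't lock $dstKey; another process is replicating it")
    } else {
      try {
        if (!destinationHas(dstKey)) {
          copy(srcKey, dstKey)
          Right(())
        } else if (sameChecksum(srcKey, dstKey)) {
          Right(())  // already replicated and identical; nothing to do
        } else {
          Left(s"Destination object $dstKey differs from source $srcKey")
        }
      } finally {
        // The lock only needs to outlive a single object copy, so the window
        // in which it can expire mid-work is much smaller.
        releaseLock(dstKey)
      }
    }
}
```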

And a change I considered but don't think we should do:

  • Handle a 409 Blob Already Exists from Azure. With the first two changes this should become practically impossible. If it does happen, we should let it bubble up as an error so we can work out what's going wrong earlier in the replicator.
@alexwlchan
Contributor Author

When should we do this?

We shouldn't rush into this; the replicator/verifier is the most critical part of the storage service, and we don't want to introduce new bugs. The existing code has also been extensively tested in production, and we know it's pretty reliable.

We should fix this, because it is a bug, and it will become more likely to occur as we push more (and bigger) bags through the storage service.

Who should do this?

Another Scala developer pairing with me, as a way for them to get some experience with the storage service. It'd be better to wait and use this to do some knowledge sharing, than have me rush into writing a fix and miss this opportunity.
