
Could we replicate to other storage providers? #1006

Open · alexwlchan (Contributor) opened this issue Jun 10, 2022 · 0 comments

Currently the storage service supports replicating to two storage providers:

  • Amazon S3
  • Azure Blob Storage

We chose these because they're the two providers used by Wellcome, but it's designed to be extensible, e.g. to support Google Cloud Storage. What would it take?

These are some rough notes, not a comprehensive work list; they're meant to give a finger-in-the-air estimate.

Assumptions

Where it fits in the ingest process

This is where a new storage provider would sit in the ingest process:

graph LR
     PV[pre-replication<br/>verifier] --> R1[Replicator #1]
     A[...] --> PV
     PV --> R2[Replicator #2]
     PV --> R3[Replicator #3]
     R1 --> V1[Verifier #1]
     R2 --> V2[Verifier #2]
     R3 --> V3[Verifier #3]
     V1 --> RA[Replica aggregator]
     V2 --> RA
     V3 --> RA
     RA --> BR[Bag register]
     BR --> BT[Bag tagger]

Everything before the pre-replication verifier happens in S3; it doesn't care about replication locations.

  • The pre-replication verifier sends a message "please replicate bag X", which gets fanned out to as many replicators as exist
  • You'd need a new replicator that can copy objects from S3 to the new storage provider
  • You'd need a new verifier that can read objects in the new storage provider
  • The replica aggregator doesn't care as much about exact storage providers; it works in terms of "primary" and "secondary" replicas (alternatively: warm/cold, active/backup). It expects to see exactly one primary replica and N secondary replicas, where N is configurable in the app config (see the sketch after this list). It would need to be able to interpret a message from the bag verifier "I have verified a bag in location X in provider P", but that's it.
  • The bag register also doesn't care much about exact providers; it just needs to know how to serialise the provider location as JSON for the storage manifest
  • The bag tagger might need to care about the new location, depending on how you're using tags.
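
To make the aggregator's point of view concrete, here's a minimal Scala sketch. The type and field names are hypothetical, not the ones in the storage-service code; the point is that the aggregator only counts primaries and secondaries, and never inspects which provider a replica lives in.

sealed trait ReplicaLocation {
  def provider: String
  def prefix: String
}

case class PrimaryReplicaLocation(provider: String, prefix: String) extends ReplicaLocation
case class SecondaryReplicaLocation(provider: String, prefix: String) extends ReplicaLocation

case class AggregationResult(replicas: Seq[ReplicaLocation]) {
  // The aggregator doesn't care which providers the replicas are in; it only
  // checks that it has exactly one primary and the configured number of secondaries.
  def isComplete(expectedSecondaryCount: Int): Boolean =
    replicas.count(_.isInstanceOf[PrimaryReplicaLocation]) == 1 &&
      replicas.count(_.isInstanceOf[SecondaryReplicaLocation]) == expectedSecondaryCount
}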

App code: implement a generic storage provider trait

To allow extensibility, the Scala code in the services is designed around generic traits. Off the top of my head, this is the rough set of operations it uses:

import java.io.InputStream

trait StorageLocation {
  val namespace: String
  val key: String

  def join(location: StorageLocation, suffix: String): StorageLocation
}

trait StorageProvider {
  def get(location: StorageLocation): InputStream
  def put(location: StorageLocation, inputStream: InputStream): Unit

  // note: this is used for verification; we write checksum tags to the object after it's been verified.
  // if the storage provider doesn't support tags natively, these can be written to a sidecar database.
  def putTags(location: StorageLocation, tags: Map[String, String]): Unit

  def listPrefix(prefix: StorageLocation): List[StorageLocation]

  def copyFromS3(s3Location: StorageLocation, location: StorageLocation): Unit

  // e.g. timeouts are retryable, object not found isn't.
  // this is used by the storage service to retry when it's safe to do so, because often even big
  // storage providers get flaky under heavy load.
  def isErrorRetryableOrTerminal(error: Throwable): Boolean
}

We already have S3 and Azure implementations of these traits. Adding a third provider would require implementing this trait and plumbing it into the storage service apps that work with these locations/providers (see the sketch after this list), including:

  • the bag replicator (which replicates into the location)
  • the bag verifier (which verifies content written into the location)
  • the replica aggregator and bag register (which need to know they might get replicas in that location)
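
As a rough sketch of what that implementation work might look like, here's a hypothetical Google Cloud Storage provider written against the trait above, using the Google Cloud Storage Java client. The class name, constructor, and the choice to stub the remaining methods are illustrative assumptions, not the shape the real code would have to take.

import java.io.InputStream
import java.nio.channels.Channels

import com.google.cloud.storage.{BlobId, BlobInfo, Storage, StorageException}

// Hypothetical third-provider implementation of the rough StorageProvider trait above.
class GcsStorageProvider(storage: Storage) extends StorageProvider {
  override def get(location: StorageLocation): InputStream =
    Channels.newInputStream(
      storage.reader(BlobId.of(location.namespace, location.key))
    )

  override def put(location: StorageLocation, inputStream: InputStream): Unit = {
    val blobInfo = BlobInfo.newBuilder(BlobId.of(location.namespace, location.key)).build()
    storage.createFrom(blobInfo, inputStream)
  }

  // e.g. a network timeout is retryable; a missing object is terminal
  override def isErrorRetryableOrTerminal(error: Throwable): Boolean =
    error match {
      case e: StorageException => e.isRetryable
      case _                   => false
    }

  // putTags, listPrefix and copyFromS3 would be filled in the same way, using
  // object metadata, Storage.list(...) and a streaming copy from S3; stubbed here.
  override def putTags(location: StorageLocation, tags: Map[String, String]): Unit = ???
  override def listPrefix(prefix: StorageLocation): List[StorageLocation] = ???
  override def copyFromS3(s3Location: StorageLocation, location: StorageLocation): Unit = ???
}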

Note that some of this generic code isn't in the storage service repo, but in the storage library of the scala-libs repo.

Infra code: configuring the new replicator in Terraform

You'd need to modify the Terraform to plumb in the new replicator/verifier.

Ideally we'd do this in a way that minimised divergence from the existing Terraform modules, so it would be easy for users to stay in step with the core modules.

Time estimate

I think this would take months, not weeks.

  • It took months to add support for Azure replication.
  • Going from two providers to three should be faster than going from one to two, because we've removed a lot of the hard-coded S3 calls, but it might expose other ways in which the code isn't as extensible as we'd like.