Design question — multiple storage backends or locations #255

Open
@alexdutton

Description

As I understand it, the current design/implementation has the following:

  • a single swappable storage factory, configured by FILES_REST_STORAGE_FACTORY, with the provided implementation being pyfs_storage_factory. As far as I can tell, this implementation isn't actually PyFS-specific.
  • invenio-s3 has its own storage factory, which implies that it's not currently possible to have multiple storage backends in use at the same time, without creating a new storage factory that hands off to the right storage class.
  • potentially many storage class implementations (with the default configured by FILES_REST_DEFAULT_STORAGE_CLASS)
  • the storage class can apparently be specified/overridden via Bucket.default_storage_class and FileInstance.storage_class (both db.Column(db.String(1)) columns, which suggests this has never been used in practice? see source)
  • according to the entity relationship diagram, a bucket has a default location, which is used to construct the FileInstance.uri. The location is not used after that, and so file URIs for filesystem-stored files are absolute paths on disk.
  • the storage class is passed into the factory by the user of the models API. For example, invenio_files_rest.tasks:merge_multipartobject calls mp.merge_parts(), whose **kwargs get passed all the way through to FileInstance.storage(), where storage_class defaults to the one specified by FILES_REST_DEFAULT_STORAGE_CLASS. In this scenario, the storage_class on the FileInstance gets ignored.
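To make the last bullet concrete, here is a deliberately simplified, hypothetical sketch of that pass-through (not the actual invenio-files-rest source; names and signatures are illustrative only). The point is that when no storage_class kwarg survives the chain of **kwargs, the config default wins and the value persisted on the FileInstance row is never consulted:

```python
# Stand-in for the app config value; illustrative only.
FILES_REST_DEFAULT_STORAGE_CLASS = "S"


class FileInstance:
    """Hypothetical simplification of the model described above."""

    def __init__(self, uri, storage_class):
        self.uri = uri
        self.storage_class = storage_class  # persisted on the row...

    def storage(self, **kwargs):
        # ...but the factory only ever sees the kwarg or the config
        # default; self.storage_class is never consulted.
        return kwargs.pop("storage_class", FILES_REST_DEFAULT_STORAGE_CLASS)


fi = FileInstance(uri="/data/ab/file.bin", storage_class="A")
# merge_parts() forwards no storage_class kwarg, so "A" is silently ignored:
assert fi.storage() == "S"
```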

This seems a bit complicated, and doesn't yet meet the objectives of a tidy interface and support for multiple storage backends.

What I would have expected:

  • A Location defines a storage class and a base path, e.g. (abstractly) (pyfs, /some/path), (s3, /bucket-name/some-path) or (s3bucket, /some/path)
  • A Bucket provides a default location for new files, falling back to the global default location
  • A FileInstance has a Location and a path. The location defaults to the bucket's location at creation time. The base path from the location and the path from the file instance are combined and used by the storage backend defined on the location
  • Storage backends are declared using entrypoints, with the entrypoint name being used as the key for the storage backend on the Location model. This key→class mapping can be overridden in config
  • FileInstance.storage() takes no **kwargs, and creates the storage class instance based on the above. Maybe it's a (cached?) property. Its behaviour is contained in a function that replaces the storage class factory and can be overridden, but generally an implementer would leave this alone unless they want to do something special.
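The proposal above could be sketched roughly as follows. Everything here is illustrative, not existing invenio-files-rest API: the STORAGE_BACKENDS dict stands in for an entrypoint-derived, config-overridable registry, and the backend callables stand in for real storage classes:

```python
from dataclasses import dataclass
import posixpath


# Stand-ins for real storage backends; in the proposal these would be
# discovered via entrypoints, with the key→class mapping overridable
# in config.
def pyfs_backend(uri):
    return ("pyfs", uri)


def s3_backend(uri):
    return ("s3", uri)


STORAGE_BACKENDS = {"pyfs": pyfs_backend, "s3": s3_backend}


@dataclass
class Location:
    backend: str    # key into STORAGE_BACKENDS, stored on the Location row
    base_path: str  # e.g. "/some/path" or "/bucket-name/some-path"


@dataclass
class Bucket:
    default_location: Location


@dataclass
class FileInstance:
    location: Location  # defaults to the bucket's location at creation
    path: str           # path relative to the location's base path

    @property
    def storage(self):
        # Combine the location's base path with the instance path and hand
        # the result to the backend named on the location; no **kwargs.
        backend = STORAGE_BACKENDS[self.location.backend]
        return backend(posixpath.join(self.location.base_path, self.path))


loc = Location(backend="pyfs", base_path="/some/path")
fi = FileInstance(location=Bucket(loc).default_location, path="ab/cd/data.bin")
assert fi.storage == ("pyfs", "/some/path/ab/cd/data.bin")
```

Because the backend key lives on the Location rather than on each FileInstance, switching or adding backends is a matter of adding Location rows, and FileInstance.storage needs no per-call arguments.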

This issue came about because I was trying to work out how one would support hooking up to multiple S3 buckets. My assumption was that each S3 bucket would be a separate location, but it wasn't immediately clear how to integrate that into the existing framework.
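Under the proposed model, the multiple-S3-bucket case would reduce to one Location row per bucket, all sharing the same "s3" backend key; no extra storage factory would be needed. A hypothetical configuration sketch (all names illustrative):

```python
# One Location per S3 bucket, differing only in base_path; both resolve
# to the same "s3" backend via the key→class registry.
locations = {
    "articles": {"backend": "s3", "base_path": "/articles-bucket/files"},
    "datasets": {"backend": "s3", "base_path": "/datasets-bucket/files"},
}

# A Bucket whose files should land in the datasets S3 bucket would simply
# set its default_location to locations["datasets"].
assert locations["articles"]["backend"] == locations["datasets"]["backend"]
```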

Could we improve on how this all works before we have a public release?
