Content-addressable storage transformer (v3 protocol extension) #82

alimanfoo · 2020-06-26T12:40:48Z

This issue describes a concept for a zarr v3 protocol extension which enables content-addressable storage to be layered on top of any underlying store. It is a thought experiment only, not a concrete proposal. It elaborates on suggestions made in #76.

Goals:

Enable verification of store contents integrity
Enable reading from store state as at a specific point in time (time travel)

The protocol extension introduces a layer of indirection to the storage protocol. This can be thought of as a transformation layer which sits above the store and modifies the key/value store operations.

When attempting to store a given (key, value) pair, the storage transformer hashes the value, and the hash is used to obtain a content-addressable key.

E.g., if storing an encoded chunk value under key 'data/foo/bar/0.0', if the hash for the value is 'abcdefghijklmnopqrstuvwxyz', then the content-addressable key would be something like 'content/a/b/c/d/efghijklmnopqrstuvwxyz'. The way that the transformer generates content-addressable keys could be configured with depth, width and hash algorithm.

The transformer would then issue a request to the underlying store to set (content-addressable-key, value).
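As a rough illustration of how such a key could be derived, here is a minimal sketch. The function name `content_key`, the `prefix` parameter, and the use of a hex digest are assumptions made for illustration; only depth, width and hash algorithm are named as configuration in the text above.

```python
import hashlib


def content_key(value: bytes, algorithm: str = "sha256",
                depth: int = 4, width: int = 1,
                prefix: str = "content") -> str:
    """Derive a content-addressable key from the bytes of a value.

    Hypothetical helper: the first depth * width characters of the hex
    digest become directory levels, the rest becomes the final key segment.
    """
    digest = hashlib.new(algorithm, value).hexdigest()
    levels = [digest[i * width:(i + 1) * width] for i in range(depth)]
    remainder = digest[depth * width:]
    return "/".join([prefix, *levels, remainder])


key = content_key(b"encoded chunk bytes")
print(key)
```

With depth 4 and width 1 this gives the same layout as the 'content/a/b/c/d/...' example above.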

To keep track of the location of the content, a content metadata document would be created, which records the content-addressable key, together with any other metadata such as timestamp of creation. This could be a JSON document, and could record multiple versions of the content. E.g.:

[
    {
        "address": "content/a/b/c/d/efghijklmnopqrstuvwxyz",
        "timestamp": 1593171939
    },
    {
        "address": "content/z/y/x/w/vutsrqponmlkjihgfedcba",
        "timestamp": 32503680000
    },
    ...
]

This JSON document could be stored under the original key, in this case 'data/foo/bar/0.0'.

In other words, when the transformer receives a request set(key, value), it does the following:

Hash the value and generate a content-addressable-key
Call set(content-addressable-key, value) on the underlying store
Call get(key) on the underlying store to retrieve the content metadata document (if it exists, otherwise create a new one)
Update the content metadata document to append a new entry with the address and timestamp.
Call set(key, content-metadata-document) on the underlying store
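The write path above could be sketched roughly as follows, using a plain dict as a stand-in for the underlying store. The name `cas_set`, the `now` parameter, the fixed sha256 / depth 4 / width 1 layout, and the second timestamp are assumptions for illustration only.

```python
import hashlib
import json
import time


def cas_set(store: dict, key: str, value: bytes, now=None) -> str:
    """Sketch of the transformer's set(key, value) over a dict-like store."""
    # 1. Hash the value and generate a content-addressable key.
    digest = hashlib.sha256(value).hexdigest()
    address = "content/" + "/".join(digest[:4]) + "/" + digest[4:]
    # 2. Store the value under the content-addressable key.
    store[address] = value
    # 3. Retrieve the content metadata document, or start a new one.
    raw = store.get(key)
    document = json.loads(raw) if raw is not None else []
    # 4. Append a new entry with the address and timestamp.
    timestamp = int(time.time()) if now is None else now
    document.append({"address": address, "timestamp": timestamp})
    # 5. Store the updated document under the original key.
    store[key] = json.dumps(document).encode()
    return address


store = {}
first = cas_set(store, "data/foo/bar/0.0", b"version 1", now=1593171939)
second = cas_set(store, "data/foo/bar/0.0", b"version 2", now=1593175539)
document = json.loads(store["data/foo/bar/0.0"])
```

Note that the content is written before the metadata document is updated, so a failed content write never leaves a dangling entry in the document.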

When the transformation layer receives a request to get(key), it does the following:

Call get(key) on the underlying store to retrieve the content metadata document
Parse the content metadata document and identify the most recent content addressable key for the value
Call get(content-addressable-key) on the underlying store to retrieve the value
Return the value
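A minimal sketch of this read path, again over a plain dict. It adds an optional integrity check that re-hashes the value, which relates to the first goal but is not one of the steps listed above. The names `cas_get` and `demo_put`, the 'content/' prefix handling, and the second timestamp are assumptions for illustration.

```python
import hashlib
import json


def cas_get(store: dict, key: str) -> bytes:
    """Sketch of the transformer's get(key), with an integrity check."""
    # 1. Retrieve the content metadata document.
    document = json.loads(store[key])
    # 2. Identify the most recent content-addressable key.
    latest = max(document, key=lambda entry: entry["timestamp"])
    address = latest["address"]
    # 3. Retrieve the value from the underlying store.
    value = store[address]
    # Assumed extra step: the digest is the address without prefix and slashes.
    expected = address[len("content/"):].replace("/", "")
    if hashlib.sha256(value).hexdigest() != expected:
        raise ValueError(f"content at {address!r} does not match its hash")
    # 4. Return the value.
    return value


def demo_put(store: dict, key: str, value: bytes, timestamp: int) -> None:
    """Hypothetical helper that populates the demo store."""
    digest = hashlib.sha256(value).hexdigest()
    address = "content/" + "/".join(digest[:4]) + "/" + digest[4:]
    store[address] = value
    document = json.loads(store.get(key, b"[]"))
    document.append({"address": address, "timestamp": timestamp})
    store[key] = json.dumps(document).encode()


store = {}
demo_put(store, "data/foo/bar/0.0", b"version 1", 1593171939)
demo_put(store, "data/foo/bar/0.0", b"version 2", 1593175539)
value = cas_get(store, "data/foo/bar/0.0")
```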

The transformation layer could also expose an API to set the time state for reading. E.g., if time state is set to time T, then when the transformation layer receives a request to get(key), it does the following:

Call get(key) on the underlying store to retrieve the content metadata document
Parse the content metadata document and identify the most recent content addressable key for the value with a timestamp not later than T
Call get(content-addressable-key) on the underlying store to retrieve the value
Return the value
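The only step that differs from a normal read is the selection of the entry. A sketch of that selection, assuming the document is a list of address/timestamp entries as shown earlier; the function name `address_as_of`, the second timestamp, and the choice to raise `KeyError` when no version exists yet are assumptions.

```python
def address_as_of(document: list, t=None) -> str:
    """Return the address of the most recent entry with timestamp <= t.

    With t=None the most recent entry overall is returned.
    """
    candidates = [e for e in document if t is None or e["timestamp"] <= t]
    if not candidates:
        # Assumed behaviour: a key with no version at time t looks missing.
        raise KeyError("no content version at or before the requested time")
    return max(candidates, key=lambda e: e["timestamp"])["address"]


document = [
    {"address": "content/a/b/c/d/efghijklmnopqrstuvwxyz", "timestamp": 1593171939},
    {"address": "content/z/y/x/w/vutsrqponmlkjihgfedcba", "timestamp": 1593175539},
]

old = address_as_of(document, t=1593171939)
new = address_as_of(document)
```

Treating "no version yet" as a missing key seems the most natural fit for a key/value store interface, but other behaviours are possible.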

So that implementations can discover that the content-addressable storage transformer extension is in use, it could be declared in the zarr entry point metadata document, e.g., a zarr.json like:

{
    "zarr_format": "https://purl.org/zarr/spec/protocol/core/3.0",
    "metadata_encoding": "application/json",
    "extensions": [
        {
            "extension": "http://example.org/zarr/extension/content-addressable-storage-transformer",
            "must_understand": true,
            "configuration": {
                "algorithm": "sha256",
                "depth": 4,
                "width": 1
            }
        }
    ]
}

When a zarr v3 implementation opened a hierarchy using this extension, it could recognise that when parsing the entry point metadata document, and insert the appropriate store transformer if supported. I.e., a user opening such a hierarchy would not need to know that content-addressable storage was used, the implementation would discover that for itself.
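A sketch of how an implementation might perform that discovery, assuming the entry point document has the shape shown above. The function name `find_cas_configuration` is hypothetical, and the extension URL is the placeholder from the example, not a registered identifier.

```python
import json

# Placeholder URL taken from the example document above.
CAS_EXTENSION = (
    "http://example.org/zarr/extension/content-addressable-storage-transformer"
)


def find_cas_configuration(entry_point: bytes):
    """Return the transformer configuration, or None if the extension is absent."""
    metadata = json.loads(entry_point)
    for extension in metadata.get("extensions", []):
        if extension.get("extension") == CAS_EXTENSION:
            return extension.get("configuration", {})
    return None


zarr_json = json.dumps({
    "zarr_format": "https://purl.org/zarr/spec/protocol/core/3.0",
    "metadata_encoding": "application/json",
    "extensions": [{
        "extension": CAS_EXTENSION,
        "must_understand": True,
        "configuration": {"algorithm": "sha256", "depth": 4, "width": 1},
    }],
}).encode()

config = find_cas_configuration(zarr_json)
```

If a configuration is found, the implementation would wrap the underlying store in the transformer before handing it to the rest of the library.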

There are several potential advantages of this scheme:

The group and array metadata keys would exist in the store as normal, and so any functionality inspecting the keys to infer which groups and arrays are present in the hierarchy would still work unchanged. I.e., the transformation layer could just pass through the list, list_pre and list_dir operations to the underlying store and everything should work as normal.
The metadata that tracks the content locations would be decentralised, allowing all of the same degrees of parallelism that the normal store provides. I.e., chunks could be written in parallel, and arrays could be created in parallel, without requiring any locking or synchronisation.
Because the extension is a transformation layer, any type of underlying store could be used. It would also be fine to migrate data between different types of storage, e.g., create data on a local filesystem, then copy up to object storage.
This extension does provide a slightly stronger guarantee against the underlying store getting corrupted by partially-successful writes, because the content metadata documents would only ever get updated after a successful content write operation.

Notes:

This extension does not provide any additional support for coping with situations where two writers may be in contention for the same chunk. If two writers attempt to write the same chunk in parallel, both chunk values will end up getting stored separately under different content addresses, but the content metadata document could get overwritten by one of the writers, i.e., one of the content versions could fail to be recorded in the content metadata.
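The lost update can be reproduced deterministically by interleaving the steps of two writers by hand. This is a simulation over a plain dict, not real concurrency; the addresses, timestamps and the helper name `read_document` are hypothetical.

```python
import json


def read_document(store: dict, key: str) -> list:
    """Return the content metadata document for key, or an empty one."""
    return json.loads(store.get(key, b"[]"))


store = {}
key = "data/foo/bar/0.0"

# Both writers store their content; different addresses never collide.
store["content/aaaa"] = b"value from writer A"
store["content/bbbb"] = b"value from writer B"

# Both writers read the (still empty) document before either updates it.
doc_a = read_document(store, key)
doc_b = read_document(store, key)

# Writer A appends its entry and writes the document back.
doc_a.append({"address": "content/aaaa", "timestamp": 1593171939})
store[key] = json.dumps(doc_a).encode()

# Writer B does the same from its stale copy, overwriting A's entry.
doc_b.append({"address": "content/bbbb", "timestamp": 1593171939})
store[key] = json.dumps(doc_b).encode()

final = read_document(store, key)
```

Writer A's content is still present in the store, but no entry in the document refers to it any more.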

alimanfoo · 2020-06-26T12:41:31Z

Hi @jakirkham, is this the kind of thing you were thinking of?

cc @Carreau

Carreau · 2020-06-26T16:49:29Z

Why not create meta-keys of the form /path-to-chunk-timestamp whose content is chunk-hash, or do you want to avoid non-listable stores? If the store is listable, you can list('path-to-chunk-*') and retrieve the most recent.
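One possible reading of this alternative, sketched over a plain dict. The key layout, the zero-padding of timestamps, and the function names are assumptions for illustration; the prefix scan stands in for a store's list operation.

```python
import hashlib


def set_with_meta_key(store: dict, key: str, value: bytes, timestamp: int) -> None:
    """Write the content, then a per-version meta-key holding its hash."""
    digest = hashlib.sha256(value).hexdigest()
    store["content/" + digest] = value
    # Zero-padding makes lexicographic order match chronological order.
    store[f"{key}-{timestamp:020d}"] = digest.encode()


def get_latest(store: dict, key: str) -> bytes:
    """List the meta-keys for this chunk and follow the most recent one."""
    prefix = key + "-"
    versions = sorted(k for k in store if k.startswith(prefix))
    digest = store[versions[-1]].decode()
    return store["content/" + digest]


store = {}
set_with_meta_key(store, "data/foo/bar/0.0", b"version 1", 1593171939)
set_with_meta_key(store, "data/foo/bar/0.0", b"version 2", 1593175539)
latest = get_latest(store, "data/foo/bar/0.0")
```

In this sketch each write creates a new key rather than rewriting a shared document, at the cost of needing a listable store for reads.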

agstephens · 2020-08-19T08:46:01Z

@alimanfoo: your proposal for a "Content-addressable storage transformer" is really interesting.

In the global management of ESGF climate model data, we have the need to provide checksums for sets of netCDF files with version timestamp. Around the world different nodes may take a copy of a data set (set of netCDF files) and verify the content against a manifest of checksums.

An alternative world-view would be to put the data sets into Zarr, and to manage the versioning internally. Your approach looks like it would cope with that use case.

There are 3 potential advantages:

Efficiency: less duplication of storage (because you only have to change chunks that need to change); reduced bandwidth requirements to replicate the entire data set when changes occur.
In-place fixing: an xarray/zarr interface could allow updates to be applied; new versions could be encoded in some code required to update the Zarr.
Interoperability: content can be managed on disk or object store without modification.

Have you had any interest from others to take this forward?

Carreau · 2020-08-21T15:59:54Z

We are currently still focusing on spec v3. I don't believe content-addressable storage will be too hard to do on top of spec v3.

jakirkham · 2020-08-21T17:20:55Z

Thanks for writing this up @alimanfoo! 😄 Yeah this is the kind of thing I was thinking about.

Agree Matthias. This would be more intended as something on top of v3.

Also to your subpoint on listable stores, yeah the idea was to capture the full listing under one JSON object. So it tries to capture the use case that motivated things like consolidated metadata and avoid listing as a requirement. Though maybe there are some wrinkles we would need to work out here.

@agstephens, I think you have more-or-less captured the motivations behind such an extension. Though I proposed the idea, I clearly have not had time to push it forward myself. If this is something you'd be interested in exploring, that would be very helpful. 😄

alimanfoo · 2020-08-27T09:15:54Z

Thanks @agstephens for your comment, it's great to know this is potentially useful. As John and Matthias have said this will be easier to develop as an extension to the v3 core protocol, so we should probably stay focused on the v3 spec and implementations for the time being so we have a solid foundation to build on. But please do let us know if this extension is something you'd be interested in helping with at any point, either writing a proper spec or doing a prototype implementation.

alimanfoo · 2020-08-27T09:21:44Z

> Also to your subpoint on listable stores, yeah the idea was to capture the full listing under one JSON object. So it tries to capture the use case that motivated things like consolidated metadata and avoid listing as a requirement.

Hi @jakirkham, just to say that this proposal doesn't create a full listing of all metadata objects in a single JSON document, i.e., it doesn't replace consolidated metadata. The metadata would still be scattered into lots of separate objects, one for each node in the hierarchy. All it does is create a layer of indirection when writing or reading any object in the store, that allows you to read objects as at a given time point, and also verify objects haven't got corrupted. Hope that makes sense.

jakirkham · 2021-01-27T20:21:11Z

Also learned about zchunk recently, which may be relevant here.

shoyer · 2021-05-09T20:38:10Z

A content addressable storage layer like this was part of Mandoline, a precursor to Zarr that I used years ago:
https://github.com/TheClimateCorporation/mandoline

Use cases like "time-travel" were indeed the main motivation for this feature.

So I think this could be useful, but it's worth keeping in mind that the size of this metadata can add up for arrays with lots of chunks. For such cases, a separate database layer for keeping track of metadata might make sense -- the metadata is too small to make sense to put in the object stores typically used for array chunks, which are focused on throughput rather than latency.

agstephens · 2021-05-10T07:22:26Z

@shoyer your suggestion about separating out a database layer from the objects is really interesting. I have heard others (@rabernat) make similar suggestions about some files staying on POSIX.

In an idealised future where everything lives in the cloud you would need to make sure the database is as accessible as the objects themselves. So you would need an API that queried the DB, and I suppose the content could be consolidated into a single "metadata" construct to avoid time wasted in excessive queries.

joshmoore · 2021-05-13T09:06:07Z

FWIW, I've started some investigating with http://datalad.org which is based on git-annex. From my limited experience, that means you can choose whether you publish the data and/or the history to each remote.

d70-t · 2021-06-14T20:31:22Z

I am playing around a bit with zarr on top of IPFS and have a use case similar to what @agstephens wrote: distributing datasets globally where only some nodes store some of the data.
I just wanted to add some thoughts here. However, I don't know if it really fits here, as it's a slightly different approach: the content addressable part is part of the filesystem and not a functionality within zarr.

IPFS basically implements a global content addressable file system. Blocks are hashed, then compiled to files (which are a list of blocks). The files are hashed again and are compiled to directories, thus forming a Merkle tree. This structure maps very well to the tree structure of a zarr dataset.
In the end, the whole dataset is identified by its hash and if anything is changed, the hash of the whole thing changes. If a dataset (or a variable) were updated, it would be possible to record a reference back to the old version within the dataset's (or variable's) metadata.
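A toy illustration of the Merkle-tree idea, to show why a change to one chunk changes the root hash while unchanged chunks keep theirs. This is not IPFS's actual block format or CID encoding; the block size, the serialisation of hash lists, and all names are assumptions.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def file_hash(data: bytes, block_size: int = 4) -> str:
    """Hash fixed-size blocks, then hash the list of block hashes."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    block_hashes = [sha256_hex(block) for block in blocks]
    return sha256_hex("\n".join(block_hashes).encode())


def directory_hash(entries: dict) -> str:
    """Hash a directory from the sorted (name, hash) pairs of its children."""
    lines = [f"{name} {entries[name]}" for name in sorted(entries)]
    return sha256_hex("\n".join(lines).encode())


metadata_hash = file_hash(b"{}")

chunks_v1 = {"0.0": file_hash(b"chunk zero"), "0.1": file_hash(b"chunk one")}
root_v1 = directory_hash(
    {"zarr.json": metadata_hash, "data": directory_hash(chunks_v1)}
)

# Editing a single chunk changes its hash, its directory's hash, and the root.
chunks_v2 = {**chunks_v1, "0.1": file_hash(b"chunk one, edited")}
root_v2 = directory_hash(
    {"zarr.json": metadata_hash, "data": directory_hash(chunks_v2)}
)
```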

So by adding a content addressable layer below zarr instead of in the middle, I think the implementation could be a bit simpler. In fact, I think it already works quite ok with the current, unmodified version of zarr. The downside of using a content addressable layer below is probably that we won't benefit from all the different store implementations which are already there. We'd essentially have to use what IPFS (or whatever content addressable filesystem) can run on. So for now, I don't see which variant would turn out to be the better one in the end.

There's however one thing which I think is missing and which would be beneficial for both variants (storage transformer and zarr on content addressable storage): it might be very useful to find a way to expose a globally unique content-id (for datasets, variables and chunks) to the user API if such a thing is present in the underlying store. This would create some possibilities for optimization:

If I have two datasets defined on some coordinate grids and want to compute e.g. the difference of some data variables, I want to check if those variables are defined on the same grid, but I don't need the actual coordinate values to compute the difference. For the end result, I'd want to add the coordinate values to the dataset again. All these operations could be done without even downloading the coordinate values from a remote server, if I know the content-id.
If I want to enhance a dataset by adding more variables and then write a new dataset, I don't want to load and store the other, already existing variables.
If I want to apply fixes to the metadata, I still want to create a new dataset, but I don't want to copy all the data.
(this probably requires more flexible chunking) If I want to subset a dataset or concatenate multiple datasets into a larger one, I shouldn't need to read and write all the data; many operations should be possible using only content-ids of the data chunks.
(probably not directly related) Theoretically, if two datasets share some variables, it should be possible to only load them once into memory (like a read only memory map of the same file into multiple processes would work).

The end result of all of those operations shouldn't depend on the content-id being visible to higher level APIs or the user due to automatic deduplication. But in the cases listed above, a large amount of re-hashing and data transfers could be avoided by providing this information.

rabernat · 2021-07-21T18:35:35Z

We have just discovered IPFS and Filecoin and are now very interested in this conversation. Tagging @jbusecke, @cisaacstern. Perhaps some folks from Protocol Labs might be able to weigh in on this and help think about the Zarr / IPFS implementation.

Stefaan-V · 2021-07-27T17:15:43Z

Hey team, the Filecoin/IPFS community is more than happy to help here. @d70-t can you send me an email at collab@protocol.ai please and we can discuss over the phone. We can report back our findings to the team here.

martindurant · 2021-08-25T16:47:23Z

Sorry to be very late to this discussion. I wonder if ReferenceFileSystem provides the amount of indirection being talked about here. It is an fsspec implementation, so already works with zarr (v2!), and each file is either a short binary embedded in the reference structure (for metadata) or a link to some bytes range of some URL. The list of keys is, at its simplest, just a dictionary.

Similarly, name hashing and write mode could be implemented as an fsspec backend without need for explicit code in zarr or even an extension. Of course, you may argue that codifying this process as an extension is exactly the point of this thread.

d70-t · 2021-09-17T12:54:03Z

I just had another thought about the idea of making a content id available to higher level APIs and tracking it through computations:
Suppose there were a mechanism for, say, xarray to track that some array (or some part of an array) was moved unchanged from open_zarr through an arbitrary computation until to_zarr, but potentially became part of another dataset / group or obtained a different set of metadata. Then to_zarr could potentially create copy calls instead of write calls to the underlying filesystem. Those copy calls could then become either more efficient local copies (i.e. within a datacenter), or e.g. reflink (copy-on-write) copies on btrfs, XFS, Lustre, APFS etc., or content-links on content addressable storage systems.

So this wouldn't go as far as making the CID available to users (which would still be very good), but would already provide some of the benefits and that even for a larger set of underlying filesystems.

yarikoptic · 2021-10-08T14:41:27Z

@martindurant :

> Sorry to be very late to this discussion. I wonder if ReferenceFileSystem provides the amount of indirection ...

Sounds interesting and viable -- I wonder if you or someone else tried to come up with some prototypical implementation following that idea?

martindurant · 2021-10-08T14:45:05Z

We certainly are using ReferenceFileSystem and zarr to create virtual datasets consisting of binary blobs inside other files - and it works great! That's a mapping of path->(other path, offset, size), so similar but different to the discussion here - but adapting or making a new driver for content addressing inspired by that implementation should not be too hard.
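To make the path->(other path, offset, size) mapping concrete, here is a toy resolver in plain Python. It does not use or reproduce the fsspec API; the function name `read_reference`, the in-memory `blobs` dict standing in for remote files, and the example paths are all hypothetical.

```python
def read_reference(references: dict, blobs: dict, path: str) -> bytes:
    """Resolve a reference and return the bytes it points to."""
    target = references[path]
    if isinstance(target, bytes):
        # Short values (e.g. metadata) embedded directly in the references.
        return target
    other_path, offset, size = target
    return blobs[other_path][offset:offset + size]


# Stand-in for files that would normally live at some URL.
blobs = {"archive/file.nc": b"HEADERchunk-0chunk-1TRAILER"}

references = {
    "zarr.json": b"{}",
    "data/foo/0.0": ("archive/file.nc", 6, 7),
    "data/foo/0.1": ("archive/file.nc", 13, 7),
}

first_chunk = read_reference(references, blobs, "data/foo/0.0")
second_chunk = read_reference(references, blobs, "data/foo/0.1")
```

A content-addressing variant would presumably replace the (path, offset, size) triple with a content hash, which is the adaptation suggested above.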

alimanfoo mentioned this issue Jul 9, 2020

Multi-threading meta-data lookups zarr-developers/zarr-python#575

Closed

jakirkham mentioned this issue Oct 9, 2020

zarr pointer to existing files zarr-developers/zarr-python#631

Closed

yarikoptic mentioned this issue Oct 6, 2021

Second zarr design doc dandi/dandi-archive#552

Merged

d70-t mentioned this issue Nov 18, 2021

NOAA OISST Zarr is now on IPFS - next steps w/ Filecoin? pangeo-forge/roadmap#40

Open

jakirkham mentioned this issue Nov 18, 2021

Sharding Prototype I: implementation as translating Store zarr-developers/zarr-python#876

Closed

14 tasks

jakirkham mentioned this issue Dec 15, 2021

Sharding array chunks across hashed sub-directories #115

Open

rabernat mentioned this issue Dec 16, 2021

zarr validation and consistency checking zarr-developers/zarr-python#912

Closed

d70-t mentioned this issue Jan 19, 2022

checksums for chunks zarr-developers/zarr-python#392

Open

d70-t mentioned this issue Feb 4, 2022

Add Sharding Support zarr-developers/zarr-python#877

Closed

jstriebel mentioned this issue Feb 7, 2022

Allowing to add / Adding a sharding spec #127

Closed

jstriebel mentioned this issue Feb 24, 2022

Add Sharding v1.0 Spec & Storage Transformers to Zarr v3.0 #134

Merged

d70-t mentioned this issue Mar 9, 2022

identifying and referencing grids ugrid-conventions/ugrid-conventions#59

Open

jakirkham mentioned this issue Jun 16, 2022

idea: referencefilesystem as a way to prepend to zarr fsspec/kerchunk#179

Open

jhamman mentioned this issue Nov 9, 2022

Using kerchunk to reference large sets of netcdf4 files fsspec/kerchunk#240

Closed

jstriebel added the protocol-extension Protocol extension related issue label Nov 16, 2022

jakirkham mentioned this issue Nov 18, 2022

Proposal: Object versioning... #14

Open

jhamman mentioned this issue Feb 7, 2024

Manifest storage transformer #287

Open

jhamman mentioned this issue Mar 20, 2024

[v3] Design and implement storage transformer API zarr-developers/zarr-python#1718

Open

DahnJ mentioned this issue Jul 2, 2024

Incrementally-populated Zarr Arrays #300

Open
