
Versioned arrays #76

Open
alimanfoo opened this issue Sep 6, 2016 · 14 comments
Labels
protocol-extension Protocol extension related issue

Comments

@alimanfoo
Member

From @shoyer: What about indirection for chunks? The use case here is incrementally updated and/or versioned arrays (e.g., each day you update the last chunk with the latest data, without altering the rest of the data). In the typical way this is done, you hash each chunk and use hashes as keys in another store. The map from chunk indices to hashes is also something that you might want to store differently (e.g., in memory). I guess this would also be pretty easy to implement on top of the Storage abstraction.

@alimanfoo
Member Author

Hi @shoyer, can you explain this one a little more? Not fully getting it.

@shoyer

shoyer commented Sep 6, 2016

Let me clarify the use-case first.

In weather and climate science, it's common to have large array datasets that are continuously updated. For example, every hour we might combine data from weather stations, satellites and physical models to estimate air temperature at every location on a grid. This quickly adds up to a large amount of data, but each incremental update is small. Imagine a movie being written frame-by-frame.

If updates only ever added new chunks, we could simply write those chunks to new files and then overwrite the array metadata. But it's common to want to overwrite existing chunks, too, either because of another incremental update (a separate quality control check) or because we need to use another chunking scheme for efficient access (e.g., to get a weather history for one location, we don't want to use one chunk per hour). Hence, indirection is useful, to enable atomic writes and/or "time travel" (like git) to view the state of the database at some prior point in time.

@alimanfoo
Member Author

Thanks @shoyer. So trying to make this concrete, correct me if any of the following is wrong. You have an array of air temperature estimates e. The shape of this array is something like (gridx, gridy, time) where gridx and gridy are the number of divisions in some 2D spatial grid, and time is in hours. The size of the first two dimensions is fixed, but the size of third (time) dimension grows as more data are added. Every hour you append a new array of shape (gridx, gridy) to the third (time) dimension.

When designing the chunk shape for this array, you want to be able to view the state of the whole grid at some specific time, but you also want to be able to view a time series for a specific grid square or some region of the spatial grid. To make this possible, you would naturally chunk all three dimensions, e.g., chunks might be (gridx//100, gridy//100, 1) - each chunk covers 1/10000 of the spatial grid and a single time point.

If this is all there was to it, then querying the state of the grid at some point in time could be done by indexing the third (time) dimension, e.g., e[:, :, t], and querying the history of some region could be done by indexing the first and second dimensions, e.g., e[x1:x2, y1:y2, :]. However (and this is where I'm a bit uncertain) querying the state of the grid at some point in time is not as simple as indexing the third (time) dimension, because data may get modified after it is initially appended, e.g., to correct an error or improve an estimate. In this case, as well as indexing the array on one or more dimensions, you also want to ask what the state of the array was at some previous point in time.
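Just to make the basic (non-versioned) part concrete in code, something like this, with made-up grid sizes and assuming the usual zarr array API:

```python
import numpy as np
import zarr

gridx, gridy = 1000, 1000  # made-up grid size

# One hour of data to start with, chunked over space and time.
z = zarr.zeros(shape=(gridx, gridy, 1),
               chunks=(gridx // 100, gridy // 100, 1),
               dtype='f4')
z[:, :, 0] = np.random.rand(gridx, gridy)

# Every hour, append a new (gridx, gridy) slice along the time dimension.
new_hour = np.random.rand(gridx, gridy, 1).astype('f4')
z.append(new_hour, axis=2)

# State of the whole grid at some time t...
t = 1
snapshot = z[:, :, t]

# ...and the history of a spatial region.
history = z[100:200, 100:200, :]
```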

Have I got this right? cc @benjeffery

@shoyer

shoyer commented Sep 22, 2016

In this case, as well as indexing the array on one or more dimensions, you also want to ask what the state of the array was at some previous point in time.

Yes, I think you have this right.

To be clear, while the ability to view previous versions of an array (like git) is important for some use cases (like reproducibility), it's also essential for atomic updates.

In practice, you do not want to use chunking like (gridx//100, gridy//100, 1), because that makes it very expensive to do queries at one location across all times (which is a major use case in climate/weather science). Instead, you want to use a larger chunk size along time, e.g., (gridx//1000, gridy//1000, 100), which means incremental updates involve overwriting chunks. It's nice to be able to swap in a new version of an array all at once, instead of piecemeal.
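Back-of-the-envelope, with made-up sizes, the difference for that query is roughly:

```python
# Illustrative chunk-count arithmetic for one year of hourly data.
n_hours = 24 * 365  # 8760 time steps

# chunks of shape (gridx//100, gridy//100, 1): one chunk per time step,
# so a single-location time series touches 8760 chunks.
reads_per_timestep_chunks = n_hours

# chunks of shape (gridx//1000, gridy//1000, 100): 100 time steps per chunk,
# so the same query touches ceil(8760 / 100) = 88 chunks.
reads_larger_time_chunks = -(-n_hours // 100)
```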

We made use of both these features in Mandoline (https://github.com/TheClimateCorporation/mandoline), an array database written at my former employer.

@alimanfoo
Member Author

Ok, thanks. So the requirement for versioning, do you think that lives at the storage layer? I.e., does it make sense to think of this in terms of store classes that implement the basic MutableMapping interface for reading and writing chunks, but also provide some further methods related to versioning, e.g., committing versions, checking out versions, etc.? I.e., you do some writes on an array, then you call store.commit(), then do some more writes, call store.commit(), then if you want to view a previous state, call store.checkout(version), etc.?
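Very roughly, I'm imagining an interface like this (purely a sketch; the class and method names are hypothetical):

```python
from zarr.storage import DirectoryStore

class VersionedStore(DirectoryStore):
    """Hypothetical store: the usual MutableMapping interface for chunks,
    plus commit/checkout for versioning (sketch only, nothing implemented)."""

    def commit(self, message=''):
        # record the current state of all keys as a new version
        raise NotImplementedError

    def checkout(self, version):
        # make reads reflect the state of the store at `version`
        raise NotImplementedError

# intended usage:
# store = VersionedStore('data/example.zarr')
# z = zarr.open(store, mode='a', shape=(1000, 1000), chunks=(100, 100))
# z[:100, :100] = 42
# store.commit('first batch of updates')
# ...more writes...
# store.commit('second batch')
# store.checkout(version='...')  # view a previous state
```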


@shoyer

shoyer commented Sep 23, 2016

@alimanfoo Yes, I think it would make sense to implement this at the storage level.

alimanfoo changed the title from "Indirection for chunks" to "Versioned arrays" on Sep 26, 2016
@alimanfoo
Member Author

One simple way to implement this that immediately comes to mind would be to extend the DirectoryStore class and use git to provide versioning. We could even use GitPython to provide some convenience, like initialising the git repo when the store is first created, and providing a Python API to make commits, check out old versions, etc.
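Something along these lines, perhaps (untested sketch; assumes GitPython is installed):

```python
from zarr.storage import DirectoryStore
from git import Repo  # GitPython

class GitDirectoryStore(DirectoryStore):
    """DirectoryStore whose directory is kept under git (rough sketch)."""

    def __init__(self, path):
        super().__init__(path)
        try:
            self.repo = Repo(path)          # reuse an existing repo...
        except Exception:
            self.repo = Repo.init(path)     # ...or initialise one on first use

    def commit(self, message):
        # stage everything under the store directory and commit it
        self.repo.git.add(all=True)
        self.repo.index.commit(message)

    def checkout(self, version):
        # point the working tree at an earlier commit, tag or branch
        self.repo.git.checkout(version)

# store = GitDirectoryStore('data/versioned.zarr')
# z = zarr.open(store, mode='a', shape=(100, 100), chunks=(10, 10), dtype='i4')
# z[:] = 1
# store.commit('initial version')
# z[0, :] = 2
# store.commit('correct first row')
# store.checkout('HEAD~1')  # view the previous version
```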

@agstephens

Hi, my organisation works with climate simulations and they get very big (tera- to peta-scale). The data is all stored in netCDF format at present, and a single dataset is typically a set of netCDF files within a directory with a version label, such as v20190101.

A major issue for our data centres is the re-publication of data that has changed. However, some of the changes are very minor. They could be, for example, a change to:

  • a global attribute
  • a variable attribute
  • the name of a variable
  • a land/sea mask
  • a coordinate variable array
  • a small slice of large data array.

In any of the above cases, it is very costly for our community because:

  • the issue must be reported and documented
  • data must be re-generated by the scientists
  • new data must be transferred to the data centre
  • new data must be checked and ingested into the archive
  • a new version, even if only replacing a single attribute, doubles the size of a dataset

@alimanfoo and @shoyer: I am interested in the discussion you have had because I could see Zarr being an excellent solution to this issue. It could enable:

  1. New versions with minimal changes on the file system, saving space and quick to implement.
  2. Massive time savings for scientists and data centres: when the changes are known, they could be projected directly onto an existing data set rather than re-generating everything, provided the chunking structure is well described (see the sketch after this list).
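For example, I imagine the in-place fixes could look something like this (purely illustrative; the path and variable names are made up):

```python
import zarr

# Open an existing (hypothetical) dataset read/write.
root = zarr.open_group('archive/tas_v20190101.zarr', mode='r+')
tas = root['tas']

# Fix a global attribute: only the group's .zattrs is rewritten.
root.attrs['institution'] = 'Corrected Institution Name'

# Fix a variable attribute: only the variable's .zattrs is rewritten.
tas.attrs['units'] = 'K'

# Patch a small slice of a large data array: only the chunks that
# intersect the slice are rewritten on disk.
tas[0, 100:110, 200:210] = 273.15

# Rename a variable: keys are moved within the store, data is not copied.
root.move('tas', 'air_temperature')
```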

Do you know if anyone else is doing work looking at this issue with Zarr? Thanks

@Carreau
Contributor

Carreau commented Jun 6, 2020

@agstephens I think that would likely be possible using zarr; it also depends on whether you want zarr itself to be aware of versioned data or not. Even in its current state, only changed files/keys will be synchronized by smart tools, but you do need to scan the whole archive.

I can imagine that a store could have a transaction log recording all writes, allowing efficient queries of what has changed, so that only those keys need to be synced. Depending on how users access the data, it might be more or less difficult to implement this in a way that keeps the log consistent with the modifications.
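Something like this wrapper is roughly what I have in mind (sketch only; the log here is just an in-memory list):

```python
import time
from collections.abc import MutableMapping

class WriteLoggingStore(MutableMapping):
    """Wraps any zarr store and records which keys were written and when."""

    def __init__(self, inner):
        self.inner = inner
        self.log = []  # list of (timestamp, operation, key)

    def __getitem__(self, key):
        return self.inner[key]

    def __setitem__(self, key, value):
        self.inner[key] = value
        self.log.append((time.time(), 'set', key))

    def __delitem__(self, key):
        del self.inner[key]
        self.log.append((time.time(), 'del', key))

    def __iter__(self):
        return iter(self.inner)

    def __len__(self):
        return len(self.inner)

    def changed_since(self, t):
        # keys touched after time t, e.g. to drive an incremental sync
        return sorted({key for ts, op, key in self.log if ts >= t})
```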

I will also point out https://github.com/Quansight/versioned-hdf5

In any of the above cases, it is very costly for our community because:
[...]

How do you store the dataset(s): on a filesystem or in an object store? I thought that filesystems like ZFS had block deduplication, and that snapshot transfer was smart enough to limit itself to changed blocks?

I also seem to remember that https://docs.datproject.org/ was trying to solve some of those issues.

jakirkham transferred this issue from zarr-developers/zarr-python on Jun 6, 2020
@DennisHeimbigner

My 2 cents: the work on temporal databases may be relevant here. I am sure there are any number of papers in ACM Transactions on Database Systems. In terms of atomic update (probably not possible for S3), file system journaling may also be relevant.

@jakirkham
Member

I suppose one way to address this would be to design a store to include timestamps as part of paths (perhaps underneath chunks and .z* files). This would allow the changed parts to be written under a new timestamp without altering previous ones. When reading we can just read the latest timestamp and ignore older ones.

Though there should be the possibility of exploring revision history. This could be done by filtering timestamps based on the revision someone wants to revisit and then selecting the latest amongst those (these steps can be fused for simplicity and performance).
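Roughly, such a store could map each logical key onto timestamped keys in an underlying store, reading back the newest one by default (sketch only; the key layout is made up):

```python
import time
from collections.abc import MutableMapping

class TimestampedStore(MutableMapping):
    """Writes each key as 'key/<timestamp>' in an inner store and reads
    the latest timestamp back by default (illustrative sketch)."""

    def __init__(self, inner):
        self.inner = inner  # any MutableMapping, e.g. a dict or another store

    @staticmethod
    def _now():
        # zero-padded nanoseconds so timestamps sort lexicographically
        return format(time.time_ns(), '020d')

    def _versions(self, key):
        prefix = key + '/'
        return sorted(k for k in self.inner if k.startswith(prefix))

    def __setitem__(self, key, value):
        self.inner[key + '/' + self._now()] = value

    def __getitem__(self, key):
        versions = self._versions(key)
        if not versions:
            raise KeyError(key)
        return self.inner[versions[-1]]  # newest timestamp wins

    def get_as_of(self, key, timestamp):
        # revision history: latest version at or before `timestamp`
        older = [k for k in self._versions(key)
                 if k.rsplit('/', 1)[1] <= timestamp]
        if not older:
            raise KeyError(key)
        return self.inner[older[-1]]

    def __delitem__(self, key):
        versions = self._versions(key)
        if not versions:
            raise KeyError(key)
        for k in versions:
            del self.inner[k]

    def __iter__(self):
        seen = set()
        for k in self.inner:
            logical = k.rsplit('/', 1)[0]
            if logical not in seen:
                seen.add(logical)
                yield logical

    def __len__(self):
        return sum(1 for _ in self)
```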

@agstephens

@Carreau , @DennisHeimbigner , @jakirkham: thanks for your responses regarding the versioning discussion.

In response to @Carreau: the data is currently all stored in NetCDF4 on a POSIX file system (although we are experimenting with object store).

I'll do some more reading this week following the leads that you have sent. My instinct is that the approach suggested by @jakirkham is worth exploring - i.e. that the DirectoryStore class manages an extra layer that includes version directories.

@jakirkham
Member

@agstephens, if you do explore this versioning idea with say a subclass of DirectoryStore, it would be interesting to hear what you discover. Is that a viable path? What issues do you encounter in implementation? Are there any consequences from that approach we should be aware of/reflect on?

@joshmoore
Member

Oddly enough, our team started discussing the same requirement Friday, too. The initial use case would be for metadata, but I can certainly also see the benefit of the functionality originally described by @shoyer. In our case, one goal would be to have global public URLs for multiple versions of the same dataset available without needing another tool to ensure API stability of existing URLs. (An example of another tool would be git cloning and then changing the branch.)

On a filesystem this could be done fairly trivially, without extra storage, using symlinks. There's no general solution I know of for cloud storage, though. Following the above discussion, I too could see a way forward via Stores, but I'd like to raise a concern that this feels quite similar to the Nested and Consolidated stores, and that the issues in those two cases both led us to include those discriminators in the v3 spec.

Perhaps it's natural to start with an initial implementation and then choose to include that in the spec later, but I expect composition of the stores will be an almost immediate issue. (cf. zarr-developers/zarr-python#540)
