
Versioned arrays #76

Open
alimanfoo opened this issue Sep 6, 2016 · 14 comments
Labels
protocol-extension Protocol extension related issue

Comments

@alimanfoo
Member

From @shoyer: What about indirection for chunks? The use case here is incrementally updated and/or versioned arrays (e.g., each day you update the last chunk with the latest data, without altering the rest of the data). In the typical way this is done, you hash each chunk and use hashes as keys in another store. The map from chunk indices to hashes is also something that you might want to store differently (e.g., in memory). I guess this would also be pretty easy to implement on top of the Storage abstraction.

@alimanfoo
Member Author

Hi @shoyer, can you explain this one a little more? Not fully getting it.

@shoyer

shoyer commented Sep 6, 2016

Let me clarify the use-case first.

In weather and climate science, it's common to have large array datasets that are continuously updated. For example, every hour we might combine data from weather stations, satellites and physical models to estimate air temperature at every location on a grid. This quickly adds up to a large amount of data, but each incremental update is small. Imagine a movie being written frame-by-frame.

If updates only ever added new chunks, we could simply write those chunks to new files and then overwrite the array metadata. But it's common to want to overwrite existing chunks, too, either because of another incremental update (a separate quality control check) or because we need to use another chunking scheme for efficient access (e.g., to get a weather history for one location, we don't want to use one chunk per hour). Hence, indirection is useful, to enable atomic writes and/or "time travel" (like git) to view the state of the database at some prior point in time.

@alimanfoo
Member Author

Thanks @shoyer. So trying to make this concrete, correct me if any of the following is wrong. You have an array of air temperature estimates e. The shape of this array is something like (gridx, gridy, time) where gridx and gridy are the number of divisions in some 2D spatial grid, and time is in hours. The size of the first two dimensions is fixed, but the size of third (time) dimension grows as more data are added. Every hour you append a new array of shape (gridx, gridy) to the third (time) dimension.

When designing the chunk shape for this array, you want to be able to view the state of the whole grid at some specific time, but you also want to be able to view a time series for a specific grid square or some region of the spatial grid. To make this possible, you would naturally chunk all three dimensions, e.g., chunks might be (gridx//100, gridy//100, 1) - each chunk covers 1/10000 of the spatial grid and a single time point.

If this is all there was to it, then querying the state of the grid at some point in time could be done by indexing the third (time) dimension, e.g., e[:, :, t], and querying the history of some region could be done by indexing the first and second dimensions, e.g., e[x1:x2, y1:y2, :]. However (and this is where I'm a bit uncertain) querying the state of the grid at some point in time is not as simple as indexing the third (time) dimension, because data may get modified after it is initially appended, e.g., to correct an error or improve an estimate. In this case, as well as indexing the array on one or more dimensions, you also want to ask what the state of the array was at some previous point in time.
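Just to make the basic (non-versioned) part concrete in code, something like this, with made-up grid sizes and assuming the usual zarr array API:

```python
import numpy as np
import zarr

gridx, gridy = 1000, 1000  # made-up grid size

# One hour of data to start with, chunked over space and time.
z = zarr.zeros(shape=(gridx, gridy, 1),
               chunks=(gridx // 100, gridy // 100, 1),
               dtype='f4')
z[:, :, 0] = np.random.rand(gridx, gridy)

# Every hour, append a new (gridx, gridy) slice along the time dimension.
new_hour = np.random.rand(gridx, gridy, 1).astype('f4')
z.append(new_hour, axis=2)

# State of the whole grid at some time t...
t = 1
snapshot = z[:, :, t]

# ...and the history of a spatial region.
history = z[100:200, 100:200, :]
```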

Have I got this right? cc @benjeffery

@shoyer

shoyer commented Sep 22, 2016

In this case, as well as indexing the array on one or more dimensions, you also want to ask what the state of the array was at some previous point in time.

Yes, I think you have this right.

To be clear, while the ability to view previous versions of an array (like git) is important for some use cases (like reproducibility), it's also essential for atomic updates.

In practice, you do not want to use chunking like (gridx//100, gridy//100, 1), because that makes it very expensive to do queries at one location across all times (which is a major use case in climate/weather science). Instead, you want to use a larger chunk size along time, e.g., (gridx//1000, gridy//1000, 100), which means incremental updates involve overwriting chunks. It's nice to be able to swap in a new version of an array all at once, instead of piecemeal.
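Back-of-the-envelope, with made-up sizes, the difference for that query is roughly:

```python
# Illustrative chunk-count arithmetic for one year of hourly data.
n_hours = 24 * 365  # 8760 time steps

# chunks of shape (gridx//100, gridy//100, 1): one chunk per time step,
# so a single-location time series touches 8760 chunks.
reads_per_timestep_chunks = n_hours

# chunks of shape (gridx//1000, gridy//1000, 100): 100 time steps per chunk,
# so the same query touches ceil(8760 / 100) = 88 chunks.
reads_larger_time_chunks = -(-n_hours // 100)
```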

We made use of both these features in Mandoline (https://github.com/TheClimateCorporation/mandoline), an array database written at my former employer.

@alimanfoo
Member Author

Ok, thanks. So the requirement for versioning, do you think that lives at the storage layer? I.e., does it make sense to think of this in terms of store classes that implement the basic MutableMapping interface for reading and writing chunks, but also provide some further methods related to versioning, e.g., committing versions, checking out versions, etc.? I.e., you do some writes on an array, then you call store.commit(), then do some more writes, call store.commit(), then if you want to view a previous state, call store.checkout(version), etc.?
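Very roughly, I'm imagining an interface like this (purely a sketch; the class and method names are hypothetical):

```python
from zarr.storage import DirectoryStore

class VersionedStore(DirectoryStore):
    """Hypothetical store: the usual MutableMapping interface for chunks,
    plus commit/checkout for versioning (sketch only, nothing implemented)."""

    def commit(self, message=''):
        # record the current state of all keys as a new version
        raise NotImplementedError

    def checkout(self, version):
        # make reads reflect the state of the store at `version`
        raise NotImplementedError

# intended usage:
# store = VersionedStore('data/example.zarr')
# z = zarr.open(store, mode='a', shape=(1000, 1000), chunks=(100, 100))
# z[:100, :100] = 42
# store.commit('first batch of updates')
# ...more writes...
# store.commit('second batch')
# store.checkout(version='...')  # view a previous state
```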


@shoyer

shoyer commented Sep 23, 2016

@alimanfoo Yes, I think it would make sense to implement this at the storage level.

alimanfoo changed the title from "Indirection for chunks" to "Versioned arrays" on Sep 26, 2016
@alimanfoo
Member Author

One simple way to implement this that immediately comes to mind would be to extend the DirectoryStore class and use git to provide versioning. We could even use GitPython to provide some convenience, like initialising the git repo when the store is first created, and providing a Python API to make commits, check out old versions, etc.
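Something along these lines, perhaps (untested sketch; assumes GitPython is installed):

```python
from zarr.storage import DirectoryStore
from git import Repo  # GitPython

class GitDirectoryStore(DirectoryStore):
    """DirectoryStore whose directory is kept under git (rough sketch)."""

    def __init__(self, path):
        super().__init__(path)
        try:
            self.repo = Repo(path)          # reuse an existing repo...
        except Exception:
            self.repo = Repo.init(path)     # ...or initialise one on first use

    def commit(self, message):
        # stage everything under the store directory and commit it
        self.repo.git.add(all=True)
        self.repo.index.commit(message)

    def checkout(self, version):
        # point the working tree at an earlier commit, tag or branch
        self.repo.git.checkout(version)

# store = GitDirectoryStore('data/versioned.zarr')
# z = zarr.open(store, mode='a', shape=(100, 100), chunks=(10, 10), dtype='i4')
# z[:] = 1
# store.commit('initial version')
# z[0, :] = 2
# store.commit('correct first row')
# store.checkout('HEAD~1')  # view the previous version
```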

@agstephens

Hi, my organisation works with climate simulations and they get very big (tera- to peta-scale). The data is all stored in netCDF format at present, and a single dataset is typically a set of netCDF files within a directory with a version label, such as v20190101.

A major issue for our data centres is the re-publication of data that has changed. However, some of the changes are very minor. They could be, for example, a change to:

  • a global attribute
  • a variable attribute
  • the name of a variable
  • a land/sea mask
  • a coordinate variable array
  • a small slice of large data array.

In any of the above cases, it is very costly for our community because:

  • the issue must be reported and documented
  • data must be re-generated by the scientists
  • new data must be transferred to the data centre
  • new data must be checked and ingested into the archive
  • a new version, even if only replacing a single attribute, doubles the size of a dataset

@alimanfoo and @shoyer: I am interested in the discussion you have had because I could see Zarr being an excellent solution to this issue. It could enable:

  1. New versions with minimal changes on the file system, saving space and quick to implement.
  2. Massive time savings for scientists and data centres: when the changes are known, they could be projected directly onto an existing data set rather than re-generating everything, provided the chunking structure is well described (see the sketch after this list).
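For example, I imagine the in-place fixes could look something like this (purely illustrative; the path and variable names are made up):

```python
import zarr

# Open an existing (hypothetical) dataset read/write.
root = zarr.open_group('archive/tas_v20190101.zarr', mode='r+')
tas = root['tas']

# Fix a global attribute: only the group's .zattrs is rewritten.
root.attrs['institution'] = 'Corrected Institution Name'

# Fix a variable attribute: only the variable's .zattrs is rewritten.
tas.attrs['units'] = 'K'

# Patch a small slice of a large data array: only the chunks that
# intersect the slice are rewritten on disk.
tas[0, 100:110, 200:210] = 273.15

# Rename a variable: keys are moved within the store, data is not copied.
root.move('tas', 'air_temperature')
```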

Do you know if anyone else is doing work looking at this issue with Zarr? Thanks

@Carreau
Contributor

Carreau commented Jun 6, 2020

@agstephens I think that would likely be possible using zarr; it also depends on whether you want zarr itself to be aware of versioned data or not. Even in its current state, only changed files/keys will be synchronized by smart tools, but you do need to scan the whole archive.

I can imagine that a store could have a transaction log recording all writes, allowing efficient queries of what has changed, so that only those keys need to be synced. Depending on how users access the data, it might be more or less difficult to implement this in a way that keeps the log consistent with the modifications.
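Something like this wrapper is roughly what I have in mind (sketch only; the log here is just an in-memory list):

```python
import time
from collections.abc import MutableMapping

class WriteLoggingStore(MutableMapping):
    """Wraps any zarr store and records which keys were written and when."""

    def __init__(self, inner):
        self.inner = inner
        self.log = []  # list of (timestamp, operation, key)

    def __getitem__(self, key):
        return self.inner[key]

    def __setitem__(self, key, value):
        self.inner[key] = value
        self.log.append((time.time(), 'set', key))

    def __delitem__(self, key):
        del self.inner[key]
        self.log.append((time.time(), 'del', key))

    def __iter__(self):
        return iter(self.inner)

    def __len__(self):
        return len(self.inner)

    def changed_since(self, t):
        # keys touched after time t, e.g. to drive an incremental sync
        return sorted({key for ts, op, key in self.log if ts >= t})
```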

I will also point out https://github.com/Quansight/versioned-hdf5

In any of the above cases, it is very costly for our community because:
[...]

How do you store the dataset(s): on a filesystem or in an object store? I thought that filesystems like ZFS had block deduplication, and that snapshot transfer was smart enough to limit itself to changed blocks?

I also seem to remember that https://docs.datproject.org/ was trying to solve some of those issues.

jakirkham transferred this issue from zarr-developers/zarr-python on Jun 6, 2020
@DennisHeimbigner

My 2 cents: the work on temporal databases may be relevant here. I am sure there are any number of papers in ACM Transactions on Database Systems. In terms of atomic update (probably not possible for S3), file system journaling may also be relevant.

@jakirkham
Member

I suppose one way to address this would be to design a store to include timestamps as part of paths (perhaps underneath chunks and .z* files). This would allow the changed parts to be written under a new timestamp without altering previous ones. When reading we can just read the latest timestamp and ignore older ones.

Though there should be the possibility of exploring revision history. This could be done by filtering timestamps based on the revision someone wants to revisit and then selecting the latest amongst those (these steps can be fused for simplicity and performance).
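Roughly, such a store could map each logical key onto timestamped keys in an underlying store, reading back the newest one by default (sketch only; the key layout is made up):

```python
import time
from collections.abc import MutableMapping

class TimestampedStore(MutableMapping):
    """Writes each key as 'key/<timestamp>' in an inner store and reads
    the latest timestamp back by default (illustrative sketch)."""

    def __init__(self, inner):
        self.inner = inner  # any MutableMapping, e.g. a dict or another store

    @staticmethod
    def _now():
        # zero-padded nanoseconds so timestamps sort lexicographically
        return format(time.time_ns(), '020d')

    def _versions(self, key):
        prefix = key + '/'
        return sorted(k for k in self.inner if k.startswith(prefix))

    def __setitem__(self, key, value):
        self.inner[key + '/' + self._now()] = value

    def __getitem__(self, key):
        versions = self._versions(key)
        if not versions:
            raise KeyError(key)
        return self.inner[versions[-1]]  # newest timestamp wins

    def get_as_of(self, key, timestamp):
        # revision history: latest version at or before `timestamp`
        older = [k for k in self._versions(key)
                 if k.rsplit('/', 1)[1] <= timestamp]
        if not older:
            raise KeyError(key)
        return self.inner[older[-1]]

    def __delitem__(self, key):
        versions = self._versions(key)
        if not versions:
            raise KeyError(key)
        for k in versions:
            del self.inner[k]

    def __iter__(self):
        seen = set()
        for k in self.inner:
            logical = k.rsplit('/', 1)[0]
            if logical not in seen:
                seen.add(logical)
                yield logical

    def __len__(self):
        return sum(1 for _ in self)
```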

@agstephens

@Carreau , @DennisHeimbigner , @jakirkham: thanks for your responses regarding the versioning discussion.

In response to @Carreau: the data is currently all stored in NetCDF4 on a POSIX file system (although we are experimenting with object store).

I'll do some more reading this week following the leads that you have sent. My instinct is that the approach suggested by @jakirkham is worth exploring - i.e. that the DirectoryStore class manages an extra layer that includes version directories.

@jakirkham
Member

@agstephens, if you do explore this versioning idea with say a subclass of DirectoryStore, it would be interesting to hear what you discover. Is that a viable path? What issues do you encounter in implementation? Are there any consequences from that approach we should be aware of/reflect on?

@joshmoore
Member

Oddly enough, our team started discussing the same requirement Friday, too. The initial use case would be for metadata, but I can certainly also see the benefit of the functionality originally described by @shoyer. In our case, one goal would be to have global public URLs for multiple versions of the same dataset available without needing another tool to ensure API stability of existing URLs. (An example of another tool would be git cloning and then changing the branch.)

On a filesystem this could be done fairly trivially, without extra storage, using symlinks. There's no general solution I know of for cloud storage, though. Following the above discussion, I too could see a way forward via Stores, but I'd like to raise a concern that this feels quite similar to the Nested and Consolidated stores, and that the issues in those two cases both led us to include those discriminators in the v3 spec.

Perhaps it's natural to start with an initial implementation and then choose to include that in the spec later, but I expect composition of the stores will be an almost immediate issue. (cf. zarr-developers/zarr-python#540)
