Consider support for storing chunks in z-order #40

Open
shoyer opened this issue Jul 26, 2016 · 8 comments

Labels
enhancement New features or improvements

Comments

@shoyer
Contributor

shoyer commented Jul 26, 2016

Apparently, storing array data in z-order can result in significant I/O savings when indexing multi-dimensional arrays.

See http://bl.ocks.org/jaredwinick/5073432 (note that the graphic is interactive)
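For anyone unfamiliar with the idea, here is a minimal sketch of the 2-D z-order (Morton) mapping, using nothing beyond the standard library; the helper name is just for illustration, not anything in zarr:

```python
def interleave_bits(y, x, nbits=16):
    """Interleave the bits of y and x into a single Morton code."""
    code = 0
    for i in range(nbits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return code

# Nearby 2-D points land near each other in the 1-D ordering:
for y, x in [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (3, 3)]:
    print((y, x), interleave_bits(y, x))
# (0, 0) 0
# (0, 1) 1
# (1, 0) 2
# (1, 1) 3
# (2, 2) 12
# (3, 3) 15
```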

@alimanfoo
Member

That's a neat trick. However, unless I'm missing something (very possible), I think that simply chunking an array will produce a very similar effect. I.e., if you have an array with shape (100, 100) and you set the chunk shape to (10, 10), then a request to retrieve data for the range [5:15, 5:15] will only require accessing 4 chunks. A


@alimanfoo
Member

Sorry, accidentally sent the last comment incomplete. Basically I think that chunking gives you data locality when taking multidimensional slices of an array, depending on the chosen chunk shape. So it's not obvious to me how z-order could be used with chunked arrays to improve I/O.

Cc @benjeffery
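For concreteness, a minimal sketch (plain Python, no zarr API) counting the chunks touched by the (100, 100) example above:

```python
from itertools import product

def chunks_touched(starts, stops, chunk_shape):
    """Return the chunk indices overlapped by a hyperslab selection."""
    ranges = [
        range(start // c, (stop - 1) // c + 1)
        for start, stop, c in zip(starts, stops, chunk_shape)
    ]
    return list(product(*ranges))

# The [5:15, 5:15] slice of a (100, 100) array with (10, 10) chunks:
print(chunks_touched((5, 5), (15, 15), (10, 10)))
# [(0, 0), (0, 1), (1, 0), (1, 1)] -- exactly 4 chunks
```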


@shoyer
Contributor Author

shoyer commented Jul 31, 2016

This would be for storing data within chunks, assuming support for partially reading chunks (possible with many backends). Given the fixed overhead for accessing each individual chunk, there is a minimum chunk size below which it doesn't make sense to chunk any further (depending on details, perhaps somewhere in the range of 1e3 to 1e6 elements). Z-order within the chunk could yield significant benefits in that case.
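A rough back-of-envelope sketch of that tradeoff; the latency and bandwidth numbers are assumptions for illustration only, not measurements:

```python
latency = 1e-3     # seconds of fixed overhead per chunk access (assumed)
bandwidth = 200e6  # bytes/second sustained read (assumed)
itemsize = 8       # float64

def read_time(n_items, n_chunks):
    """Total time to read n_items spread across n_chunks."""
    return n_chunks * latency + n_items * itemsize / bandwidth

# Reading 1e6 items as one chunk vs. split into 1000 tiny chunks:
print(read_time(1_000_000, 1))     # ~0.041 s
print(read_time(1_000_000, 1000))  # ~1.04 s -- per-chunk overhead dominates
```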

@alimanfoo
Member

Ah OK. So what's the main use case? Taking a multidimensional slice where the requested region is entirely within a single chunk?

Just so I'm completely with you, could you expand a bit more on what you mean by the second sentence ("Given the overhead for accessing each due...")?


@shoyer
Contributor Author

shoyer commented Jul 31, 2016

Ah OK. So what's the main use case? Taking a multidimensional slice where the requested region is entirely within a single chunk?

A more common use case might be any slicing operation that is poorly aligned with the chunking, even if it involves multiple chunks. For example, suppose I have an array with chunks of size (100, 100, 100), and I want to index out a single point along the first two dimensions, e.g., x[i, j, :]. Instead of needing to read full chunks spanning the first two dimensions, I could read out much less on average.
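A minimal sketch of why within-chunk layout matters here, comparing the span of flat offsets touched by lines along different axes of a C-order (100, 100, 100) chunk; with z-order the locality would be more uniform across orientations:

```python
# Element strides for a C-order chunk of shape (100, 100, 100):
shape = (100, 100, 100)
strides = (shape[1] * shape[2], shape[2], 1)

def flat_span(points):
    """Min and max flat offsets covered by a set of index tuples."""
    offsets = [sum(i * s for i, s in zip(p, strides)) for p in points]
    return min(offsets), max(offsets)

line_last = [(5, 5, k) for k in range(100)]   # x[5, 5, :] within the chunk
line_first = [(i, 5, 5) for i in range(100)]  # x[:, 5, 5] within the chunk

print(flat_span(line_last))   # (50500, 50599) -- 100 contiguous elements
print(flat_span(line_first))  # (505, 990505) -- spans nearly the whole chunk
```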

I don't have a direct use case for this right now (since I'm not actually using zarr yet), but I can see lots of examples where this sort of thing might be useful.

Just so I'm completely with you, could you expand a bit more on what you mean by the second sentence ("Given the overhead for accessing each due...")?

Oops, I had a typo (fixed now). What I was referring to is that there is some fixed overhead associated with storing and manipulating each chunk in every storage and task scheduling system, which is why we don't chunk arrays into single values.

@alimanfoo
Member

Thanks Stephan. Just to add that this relates to how Zarr might make use of the blosc_getitem function to extract partial contents of a chunk. I know bcolz uses blosc_getitem, but there things are simpler because it only deals with 1-D arrays; so far I haven't had the brain space to figure out how to use it for multidimensional arrays.

@shoyer
Contributor Author

shoyer commented Aug 2, 2016

blosc_getitem would indeed be a good fit for this. We would need some method for converting a tuple of range selections into start and nitems arguments, and then loading the appropriate items into the output array.
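A minimal sketch of that mapping for the simple case where the selection is a single contiguous run in the chunk's C-order layout; the helper name is hypothetical, not an existing zarr function:

```python
def selection_to_getitem_args(selection, chunk_shape):
    """Map a tuple of slices to (start, nitems) for blosc_getitem,
    assuming the selection is one contiguous run in C order
    (i.e. all leading slices select a single point each)."""
    strides = []
    acc = 1
    for n in reversed(chunk_shape):  # build C-order element strides
        strides.append(acc)
        acc *= n
    strides.reverse()
    start = sum(s.start * st for s, st in zip(selection, strides))
    nitems = selection[-1].stop - selection[-1].start
    return start, nitems

# x[5, 5, 10:20] within a (100, 100, 100) chunk:
print(selection_to_getitem_args(
    (slice(5, 6), slice(5, 6), slice(10, 20)), (100, 100, 100)))
# (50510, 10)
```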

@alimanfoo
Member

I don't have bandwidth to explore this myself, but I'm very happy to discuss further if there is interest and someone else has time.

FWIW I think a natural starting point would be to add support for a codec operation analogous to blosc_getitem, i.e., decoding items from some sub-range of a chunk. This would need to be generalised and added as a method on the Codec class. The Codec abstract base class could provide a default implementation that falls back to decoding and then slicing the whole chunk, for codecs that cannot provide an efficient implementation. How this would work if there were filters (i.e., multiple codecs) on an array, I am uncertain.

Once this infrastructure was in place, I guess a Zorder codec class could be implemented as a pre-compression filter, providing the z-order transformation and mapping getitem calls down to the next level? Or maybe it's not that simple; I can't see the architecture immediately.
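A minimal sketch of what that interface might look like; the decode_partial name and signature are hypothetical, not part of any existing codec API:

```python
from abc import ABC, abstractmethod

class Codec(ABC):

    @abstractmethod
    def encode(self, buf):
        """Encode the data in buf, returning bytes."""

    @abstractmethod
    def decode(self, buf):
        """Decode the data in buf, returning bytes."""

    def decode_partial(self, buf, start, nitems, itemsize):
        """Decode nitems items starting at item offset start.

        Default implementation: decode the whole chunk, then slice.
        Codecs that can decode a sub-range efficiently (e.g. via
        blosc_getitem) would override this.
        """
        decoded = self.decode(buf)
        return decoded[start * itemsize:(start + nitems) * itemsize]
```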

Btw, in the next release of Zarr all code for compression and filter codecs will be factored out into a new numcodecs package and obtained via a dependency, so this issue may live more naturally there as a codec issue.

One other note, I think that although z-order might speed up access for small regions of an array, it would incur a cost for reading larger regions, because all read and write operations would need to pass through the z-order transformation.
