Consider support for storing chunks in z-order #40

Open
shoyer opened this issue Jul 26, 2016 · 8 comments

Labels
enhancement New features or improvements

Comments

@shoyer
Contributor

shoyer commented Jul 26, 2016

Apparently, storing array data in z-order can result in significant I/O savings when indexing multi-dimensional arrays.

See http://bl.ocks.org/jaredwinick/5073432 (note that the graphic is interactive)
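For anyone unfamiliar with the idea, here is a minimal sketch of the 2-D z-order (Morton) mapping, using nothing beyond the standard library; the helper name is just for illustration, not anything in zarr:

```python
def interleave_bits(y, x, nbits=16):
    """Interleave the bits of y and x into a single Morton code."""
    code = 0
    for i in range(nbits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return code

# Nearby 2-D points land near each other in the 1-D ordering:
for y, x in [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (3, 3)]:
    print((y, x), interleave_bits(y, x))
# (0, 0) 0
# (0, 1) 1
# (1, 0) 2
# (1, 1) 3
# (2, 2) 12
# (3, 3) 15
```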

@alimanfoo
Member

That's a neat trick. However, unless I'm missing something (very possible), I think that simply chunking an array will produce a very similar effect. I.e., if you have an array with shape (100, 100) and you set the chunk shape to (10, 10), then a request to retrieve data for the range [5:15, 5:15] will only require accessing 4 chunks. A


@alimanfoo
Member

Sorry, accidentally sent the last comment incomplete. Basically I think that chunking gives you data locality when taking multidimensional slices of an array, depending on the chosen chunk shape. So it's not obvious to me how z-order could be used with chunked arrays to improve I/O.

Cc @benjeffery
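For concreteness, a minimal sketch (plain Python, no zarr API) counting the chunks touched by the (100, 100) example above:

```python
from itertools import product

def chunks_touched(starts, stops, chunk_shape):
    """Return the chunk indices overlapped by a hyperslab selection."""
    ranges = [
        range(start // c, (stop - 1) // c + 1)
        for start, stop, c in zip(starts, stops, chunk_shape)
    ]
    return list(product(*ranges))

# The [5:15, 5:15] slice of a (100, 100) array with (10, 10) chunks:
print(chunks_touched((5, 5), (15, 15), (10, 10)))
# [(0, 0), (0, 1), (1, 0), (1, 1)] -- exactly 4 chunks
```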


@shoyer
Contributor Author

shoyer commented Jul 31, 2016

This would be for storing data within chunks, assuming support for partially reading chunks (possible with many backends). Given the fixed overhead for accessing each individual chunk, there is a minimum chunk size below which it doesn't make sense to chunk any further (depending on details, perhaps somewhere in the range of 1e3 to 1e6 elements). Z-order within the chunk could yield significant benefits in that case.
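A rough back-of-envelope sketch of that tradeoff; the latency and bandwidth numbers are assumptions for illustration only, not measurements:

```python
latency = 1e-3     # seconds of fixed overhead per chunk access (assumed)
bandwidth = 200e6  # bytes/second sustained read (assumed)
itemsize = 8       # float64

def read_time(n_items, n_chunks):
    """Total time to read n_items spread across n_chunks."""
    return n_chunks * latency + n_items * itemsize / bandwidth

# Reading 1e6 items as one chunk vs. split into 1000 tiny chunks:
print(read_time(1_000_000, 1))     # ~0.041 s
print(read_time(1_000_000, 1000))  # ~1.04 s -- per-chunk overhead dominates
```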

@alimanfoo
Member

Ah OK. So what's the main use case? Taking a multidimensional slice where the requested region is entirely within a single chunk?

Just so I'm completely with you, could you expand a bit more on what you mean by the second sentence ("Given the overhead for accessing each due...")?


@shoyer
Contributor Author

shoyer commented Jul 31, 2016

Ah OK. So what's the main use case? Taking a multidimensional slice where the requested region is entirely within a single chunk?

A more common use case might be any slicing operation that is poorly aligned with the chunking, even if it involves multiple chunks. For example, suppose I have an array with chunks of size (100, 100, 100), and I want to index out a single point along the first two dimensions, e.g., x[i, j, :]. Instead of needing to read full chunks spanning the first two dimensions, I could read out much less on average.
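A minimal sketch of why within-chunk layout matters here, comparing the span of flat offsets touched by lines along different axes of a C-order (100, 100, 100) chunk; with z-order the locality would be more uniform across orientations:

```python
# Element strides for a C-order chunk of shape (100, 100, 100):
shape = (100, 100, 100)
strides = (shape[1] * shape[2], shape[2], 1)

def flat_span(points):
    """Min and max flat offsets covered by a set of index tuples."""
    offsets = [sum(i * s for i, s in zip(p, strides)) for p in points]
    return min(offsets), max(offsets)

line_last = [(5, 5, k) for k in range(100)]   # x[5, 5, :] within the chunk
line_first = [(i, 5, 5) for i in range(100)]  # x[:, 5, 5] within the chunk

print(flat_span(line_last))   # (50500, 50599) -- 100 contiguous elements
print(flat_span(line_first))  # (505, 990505) -- spans nearly the whole chunk
```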

I don't have a direct use case for this right now (since I'm not actually using zarr yet), but I can see lots of examples where this sort of thing might be useful.

Just so I'm completely with you, could you expand a bit more on what you mean by the second sentence ("Given the overhead for accessing each due...")?

Oops, I had a typo (fixed now). What I was referring to is that there is some fixed overhead associated with storing and manipulating each chunk in every storage and task scheduling system, which is why we don't chunk arrays into single values.

@alimanfoo
Member

Thanks Stephan. Just to add that this relates to how Zarr might make use of the blosc_getitem function to extract partial contents of a chunk. I know bcolz uses blosc_getitem, but there things are simpler because it only deals with 1-D arrays; so far I haven't had the brain space to figure out how to use it for multidimensional arrays.

@shoyer
Contributor Author

shoyer commented Aug 2, 2016

blosc_getitem would indeed be a good fit for this. We would need some method for converting a tuple of range selections into start and nitems arguments, and then loading the appropriate items into the output array.
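A minimal sketch of that mapping for the simple case where the selection is a single contiguous run in the chunk's C-order layout; the helper name is hypothetical, not an existing zarr function:

```python
def selection_to_getitem_args(selection, chunk_shape):
    """Map a tuple of slices to (start, nitems) for blosc_getitem,
    assuming the selection is one contiguous run in C order
    (i.e. all leading slices select a single point each)."""
    strides = []
    acc = 1
    for n in reversed(chunk_shape):  # build C-order element strides
        strides.append(acc)
        acc *= n
    strides.reverse()
    start = sum(s.start * st for s, st in zip(selection, strides))
    nitems = selection[-1].stop - selection[-1].start
    return start, nitems

# x[5, 5, 10:20] within a (100, 100, 100) chunk:
print(selection_to_getitem_args(
    (slice(5, 6), slice(5, 6), slice(10, 20)), (100, 100, 100)))
# (50510, 10)
```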

@alimanfoo
Member

I don't have bandwidth to explore this myself, but I'm very happy to discuss further if there is interest and someone else has time.

FWIW I think a natural starting point would be to add support for a codec operation analogous to blosc_getitem, i.e., decoding items from some sub-range of a chunk. This would need to be generalised and added as a method on the Codec class. The Codec abstract base class could provide a default implementation that falls back to decoding and then slicing the whole chunk, for codecs that cannot provide an efficient implementation. How this would work if there were filters (i.e., multiple codecs) on an array, I am uncertain.

Once this infrastructure was in place, I guess a Zorder codec class could be implemented as a pre-compression filter, providing the z-order transformation and mapping getitem calls down to the next level? Or maybe it's not that simple; I can't see the architecture immediately.
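A minimal sketch of what that interface might look like; the decode_partial name and signature are hypothetical, not part of any existing codec API:

```python
from abc import ABC, abstractmethod

class Codec(ABC):

    @abstractmethod
    def encode(self, buf):
        """Encode the data in buf, returning bytes."""

    @abstractmethod
    def decode(self, buf):
        """Decode the data in buf, returning bytes."""

    def decode_partial(self, buf, start, nitems, itemsize):
        """Decode nitems items starting at item offset start.

        Default implementation: decode the whole chunk, then slice.
        Codecs that can decode a sub-range efficiently (e.g. via
        blosc_getitem) would override this.
        """
        decoded = self.decode(buf)
        return decoded[start * itemsize:(start + nitems) * itemsize]
```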

Btw, in the next release of Zarr all code for compression and filter codecs will be factored out into a new numcodecs package and obtained via a dependency, so this issue may live more naturally there as a codec issue.

One other note, I think that although z-order might speed up access for small regions of an array, it would incur a cost for reading larger regions, because all read and write operations would need to pass through the z-order transformation.
