When opening a Zarr store, the chunks='auto'` kwarg seems to be ignored #276

etienneschalk · 2023-11-11T15:08:50Z

Hi,

I noticed a discrepency between the behaviour of xarray's open_zarr and datatree's open_datatree with engine='zarr'.

I documented it in a pre-executed notebook available at https://github.com/etienneschalk/datatree-experimentation/blob/main/notebooks/bug-chunk-auto-not-considered.ipynb (the whole project can be cloned and executed locally if needed, it requires poetry)

To summarize:

Actual:

xarray's open_zarr
- No chunks kwarg: Stored chunks are used.
- With chunks='auto': Stored chunks are used.
datatree's open_datatree with engine='zarr'
- No chunks kwarg: No chunking performed.
- With chunks='auto': A chunk identical to the shape of the data is used. This means chunking is useless as there is only a single chunk representing the whole dataset

Expected:

I expected a similar behaviour from datatree as the one from xarray. Since Zarr is format that natively handle chunks, I would have expected that when opening a Zarr store with no chunks kwarg or chunks='auto', the stored chunks were to be used.

Thanks!

The text was updated successfully, but these errors were encountered:

TomNicholas · 2023-11-13T17:45:20Z

Thanks for raising this @etienneschalk !

open_datatree internally calls xarray.open_dataset (not xarray.open_zarr) but there are currently some differences between xarray.open_dataset and xarray.open_zarr. Does the behaviour of open_datatree differ from xarray.open_dataset?

Regardless, the behaviour you describe does sound desirable, so we should fix that somewhere in the stack.

cc @jhamman

eschalkargans · 2023-11-22T16:41:51Z

Hello @TomNicholas ,

Does the behaviour of open_datatree differ from xarray.open_dataset?

After testing, the behaviour of open_datatree is indeed identical to xarray's open_dataset:

xarray's open_dataset
- No chunks kwarg: No chunking is performed. ~~Stored chunks are used.~~
- With chunks='auto': A chunk identical to the shape of the data is used. This means chunking is useless as there is only a single chunk representing the whole dataset ~~Stored chunks are used.~~

which is consistent with your statement:

open_datatree internally calls xarray.open_dataset

In that case, I would suggest to:

Keep the behaviour of open_datatree, as it uses and is of the same family as open_dataset (open_{xarray's data structure} syntax)
▶️ Add a new datatree.open_zarr() function, with the same behaviour as xarray.open_zarr, maybe using it internally too. And updating the documentation of open_datatree to nudge users into using open_zarr instead if they want to use zarr

Do you think this would be a good idea?

Thanks!

TomNicholas · 2023-11-27T17:02:07Z

Thank you for testing!

Again, this is an upstream xarray issue. Datatree should follow whatever xarray's behaviour is.

Add a new datatree.open_zarr() function, with the same behaviour as xarray.open_zarr, maybe using it internally too.

This might be a good idea, but xarray currently has both open_zarr and open_dataset, and there is an unresolved discussion about whether to get rid of one in favour of the other...

etienneschalk · 2023-12-02T10:51:07Z

Hi @TomNicholas

This might be a good idea, but xarray currently has both open_zarr and open_dataset, and there is an unresolved discussion about whether to get rid of one in favour of the other...

So, this means, while this discussion is not settled, implementing an open_zarr in datatree might be a waste of effort, if I understand correctly, in the case where open_zarr would be integrated into the open_dataset. However, if this happens in upstream xarray, the correct behaviour of open_zarr should be kept, not the one of the existing open_dataset that does not handle chunks properly.

Do you have a link to this discussion, by any chance? I would be interested to learn more about this.

Thanks, have a nice day!

keewis · 2023-12-12T14:33:16Z

the difference between open_zarr and open_dataset is that for open_zarr "auto" (the default) translates to {} if a chunk manager is available (like dask or cubed) or None otherwise, which are then forwarded to open_dataset. For open_dataset, the default is None (no chunks), while {} is the same as for open_zarr and "auto" is dask's auto-chunking (see dask.array.Array.rechunk for more details). So in summary, open_zarr is a wrapper of open_dataset, with a different default for chunks and a different meaning for "auto".

I believe the whole "auto" is actually {} for open_zarr is just confusing, so maybe we should aim to harmonize this in xarray (like, switch the default immediately and emit a deprecation warning if "auto" is passed).

Edit: this means that to get the on-disk chunking you can use open_datatree(..., chunks={})

TomNicholas · 2023-12-12T14:48:37Z

Thanks @keewis .

I believe the whole "auto" is actually {} for open_zarr is just confusing, so maybe we should aim to harmonize this in xarray (like, switch the default immediately and emit a deprecation warning if "auto" is passed).

100%. This kind of thing really trips up users. Do we have an open issue for that in xarray or should we make one now?

keewis · 2023-12-12T14:52:33Z

I think the "deprecate open_zarr" issue should be fine to reuse for this: pydata/xarray#7495

etienneschalk · 2024-02-09T18:31:49Z

Thanks for the chunks={} tip! This is indeed the behaviour I expected.

This is really important when trying to open chunked large Zarr data with datatree to keep the original chunks.

I updated my test notebook: https://github.com/etienneschalk/datatree-experimentation/blob/main/notebooks/bug-chunk-auto-not-considered.ipynb section "With chunks={} kwarg 🆗"

TomNicholas added the IO Representation of particular file formats as trees label Nov 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

When opening a Zarr store, the chunks='auto'` kwarg seems to be ignored #276

When opening a Zarr store, the chunks='auto'` kwarg seems to be ignored #276

etienneschalk commented Nov 11, 2023

TomNicholas commented Nov 13, 2023

eschalkargans commented Nov 22, 2023

TomNicholas commented Nov 27, 2023

etienneschalk commented Dec 2, 2023

keewis commented Dec 12, 2023 •

edited

Loading

TomNicholas commented Dec 12, 2023

keewis commented Dec 12, 2023

etienneschalk commented Feb 9, 2024

When opening a Zarr store, the chunks='auto'` kwarg seems to be ignored #276

When opening a Zarr store, the chunks='auto'` kwarg seems to be ignored #276

Comments

etienneschalk commented Nov 11, 2023

TomNicholas commented Nov 13, 2023

eschalkargans commented Nov 22, 2023

TomNicholas commented Nov 27, 2023

etienneschalk commented Dec 2, 2023

keewis commented Dec 12, 2023 • edited Loading

TomNicholas commented Dec 12, 2023

keewis commented Dec 12, 2023

etienneschalk commented Feb 9, 2024

keewis commented Dec 12, 2023 •

edited

Loading