Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When opening a Zarr store, the chunks='auto'` kwarg seems to be ignored #276

Open
etienneschalk opened this issue Nov 11, 2023 · 8 comments
Labels
IO Representation of particular file formats as trees

Comments

@etienneschalk
Copy link
Contributor

Hi,

I noticed a discrepency between the behaviour of xarray's open_zarr and datatree's open_datatree with engine='zarr'.

I documented it in a pre-executed notebook available at https://github.com/etienneschalk/datatree-experimentation/blob/main/notebooks/bug-chunk-auto-not-considered.ipynb (the whole project can be cloned and executed locally if needed, it requires poetry)

To summarize:

Actual:

  • xarray's open_zarr
    • No chunks kwarg: Stored chunks are used.
    • With chunks='auto': Stored chunks are used.
  • datatree's open_datatree with engine='zarr'
    • No chunks kwarg: No chunking performed.
    • With chunks='auto': A chunk identical to the shape of the data is used. This means chunking is useless as there is only a single chunk representing the whole dataset

Expected:

I expected a similar behaviour from datatree as the one from xarray. Since Zarr is format that natively handle chunks, I would have expected that when opening a Zarr store with no chunks kwarg or chunks='auto', the stored chunks were to be used.

Thanks!

@TomNicholas TomNicholas added the IO Representation of particular file formats as trees label Nov 13, 2023
@TomNicholas
Copy link
Collaborator

Thanks for raising this @etienneschalk !

open_datatree internally calls xarray.open_dataset (not xarray.open_zarr) but there are currently some differences between xarray.open_dataset and xarray.open_zarr. Does the behaviour of open_datatree differ from xarray.open_dataset?

Regardless, the behaviour you describe does sound desirable, so we should fix that somewhere in the stack.

cc @jhamman

@eschalkargans
Copy link

Hello @TomNicholas ,

Does the behaviour of open_datatree differ from xarray.open_dataset?

After testing, the behaviour of open_datatree is indeed identical to xarray's open_dataset:

  • xarray's open_dataset
    • No chunks kwarg: No chunking is performed. Stored chunks are used.
    • With chunks='auto': A chunk identical to the shape of the data is used. This means chunking is useless as there is only a single chunk representing the whole dataset Stored chunks are used.

which is consistent with your statement:

open_datatree internally calls xarray.open_dataset

In that case, I would suggest to:

  • Keep the behaviour of open_datatree, as it uses and is of the same family as open_dataset (open_{xarray's data structure} syntax)
  • ▶️ Add a new datatree.open_zarr() function, with the same behaviour as xarray.open_zarr, maybe using it internally too. And updating the documentation of open_datatree to nudge users into using open_zarr instead if they want to use zarr

Do you think this would be a good idea?

Thanks!

@TomNicholas
Copy link
Collaborator

Thank you for testing!

Again, this is an upstream xarray issue. Datatree should follow whatever xarray's behaviour is.

Add a new datatree.open_zarr() function, with the same behaviour as xarray.open_zarr, maybe using it internally too.

This might be a good idea, but xarray currently has both open_zarr and open_dataset, and there is an unresolved discussion about whether to get rid of one in favour of the other...

@etienneschalk
Copy link
Contributor Author

Hi @TomNicholas

This might be a good idea, but xarray currently has both open_zarr and open_dataset, and there is an unresolved discussion about whether to get rid of one in favour of the other...

So, this means, while this discussion is not settled, implementing an open_zarr in datatree might be a waste of effort, if I understand correctly, in the case where open_zarr would be integrated into the open_dataset. However, if this happens in upstream xarray, the correct behaviour of open_zarr should be kept, not the one of the existing open_dataset that does not handle chunks properly.

Do you have a link to this discussion, by any chance? I would be interested to learn more about this.

Thanks, have a nice day!

@keewis
Copy link
Contributor

keewis commented Dec 12, 2023

the difference between open_zarr and open_dataset is that for open_zarr "auto" (the default) translates to {} if a chunk manager is available (like dask or cubed) or None otherwise, which are then forwarded to open_dataset. For open_dataset, the default is None (no chunks), while {} is the same as for open_zarr and "auto" is dask's auto-chunking (see dask.array.Array.rechunk for more details). So in summary, open_zarr is a wrapper of open_dataset, with a different default for chunks and a different meaning for "auto".

I believe the whole "auto" is actually {} for open_zarr is just confusing, so maybe we should aim to harmonize this in xarray (like, switch the default immediately and emit a deprecation warning if "auto" is passed).

Edit: this means that to get the on-disk chunking you can use open_datatree(..., chunks={})

@TomNicholas
Copy link
Collaborator

Thanks @keewis .

I believe the whole "auto" is actually {} for open_zarr is just confusing, so maybe we should aim to harmonize this in xarray (like, switch the default immediately and emit a deprecation warning if "auto" is passed).

100%. This kind of thing really trips up users. Do we have an open issue for that in xarray or should we make one now?

@keewis
Copy link
Contributor

keewis commented Dec 12, 2023

I think the "deprecate open_zarr" issue should be fine to reuse for this: pydata/xarray#7495

@etienneschalk
Copy link
Contributor Author

Thanks for the chunks={} tip! This is indeed the behaviour I expected.

This is really important when trying to open chunked large Zarr data with datatree to keep the original chunks.

I updated my test notebook: https://github.com/etienneschalk/datatree-experimentation/blob/main/notebooks/bug-chunk-auto-not-considered.ipynb section "With chunks={} kwarg 🆗"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Representation of particular file formats as trees
Projects
None yet
Development

No branches or pull requests

4 participants