
xarray.open_mzarr: open multiple zarr files (in parallel) #4003

Closed
wants to merge 58 commits

Conversation

@Mikejmnez (Contributor)

This is associated with #3995 and somewhat mentioned in #3668. Emulating xarray.open_mfdataset, it allows xarray to read multiple zarr files from a glob or a list of paths pointing to zarr stores. It follows the code for xarray.open_mfdataset, and the function's description is the same, since the two do the same thing for their respective file types.
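A hypothetical usage sketch (the signature is assumed to mirror xarray.open_mfdataset, per the description above; concat_dim and parallel are assumptions carried over from that function):

import xarray as xr

# A glob pattern, or an explicit list of paths to zarr stores:
ds = xr.open_mzarr(
    "directory*/subdirectory*/*",  # zarr stores are directories, so no file extension
    concat_dim="time",             # assumed kwarg, mirroring open_mfdataset
    parallel=True,                 # assumed kwarg: open each store in a dask task
)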

@pep8speaks commented Apr 25, 2020

Hello @Mikejmnez! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 514:13: F841 local variable 'overwrite_encoded_chunks' is assigned to but never used

Line 362:5: E303 too many blank lines (2)
Line 392:22: F821 undefined name 'get_chunk'
Line 396:22: F821 undefined name 'tokenize'
Line 399:16: F821 undefined name 'overwrite_encoded_chunks'

Line 1513:75: E225 missing whitespace around operator

Comment last updated at 2020-05-22 16:45:34 UTC

@dcherian (Contributor)

I think the better way to do this would be to add a kwarg to open_dataset that specifies the backend to use, e.g. xr.open_dataset(..., format="zarr").

This would then delegate to open_zarr or a new open_netcdf or open_rasterio as appropriate. Then open_mfdataset would just work for all these formats without requiring duplicate code.
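A minimal sketch of that dispatch, with stand-in openers (only open_zarr and open_rasterio existed at the time; the netcdf default shown here is hypothetical):

# Illustrative stand-ins for the real backend openers:
def _open_zarr(path, **kwargs): ...
def _open_rasterio(path, **kwargs): ...
def _open_netcdf(path, **kwargs): ...

def open_dataset(filename_or_obj, format=None, **kwargs):
    # Delegate to a backend-specific opener based on the `format` kwarg:
    if format == "zarr":
        return _open_zarr(filename_or_obj, **kwargs)
    if format == "rasterio":
        return _open_rasterio(filename_or_obj, **kwargs)
    return _open_netcdf(filename_or_obj, **kwargs)

(This is essentially what later landed, spelled engine="zarr" rather than format="zarr".)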

cc @pydata/xarray for thoughts.

@TomNicholas (Contributor)

+1 for having open_dataset and open_mfdataset as the main (ideally only) points of entry for users, which then delegate to different backend openers. That will keep the API neater, avoid duplicate code, and be easier to make into a completely general and extensible solution eventually.

@TomNicholas TomNicholas added this to In progress in Flexible Storage Backends via automation Apr 27, 2020
@Mikejmnez (Contributor, Author)

I like this approach (adding the capability to open_mfdataset to open multiple zarr files), as it is the easiest and cleanest. I considered it, and I am glad this is coming up because I wanted to hear different opinions. Two things influenced my decision to keep open_mzarr separate from open_mfdataset:

  1. Zarr stores are inherently different from netcdf files, which becomes more evident when opening multiple files given a glob path (paths='directory*/subdirectory*/*'). Zarr stores can be recognized as directories rather than files (as opposed to, e.g., paths='directory*/subdirectory*/*.nc'). This distinction comes into play when, for example, trying to open files (zarr vs netcdf) through intake-xarray. I know this is upstream behavior, but I think it needs to be considered, and it is my end goal in allowing xarray to read multiple zarr files (in parallel): to use intake-xarray to read them. The way intake-xarray opens files (zarr vs others) is likewise kept separate and uses different functions. That is,

For netcdf-files (intake-xarray/netcdf.py):

url_path = fsspec.open_local(paths, **kwargs)

which can interpret a glob path. The resulting url_path is then passed to xarray.open_mfdataset.

For zarr files (intake-xarray/xzarr.py):

url_path = fsspec.get_mapper(paths, **kwargs)

fsspec.get_mapper does not recognize glob paths, and fsspec.open_local, which does recognize globs, cannot detect zarr stores (as these are recognized as directories rather than files with a known extension); see the sketch after this list. I created an issue about this behavior at fsspec/filesystem_spec#286 (comment) (apologies, I am new at GitHub and don't know if this is the correct way to link issues across repositories).

  2. Zarr continues to be under development, and it appears its behavior will rely more heavily on fsspec in the future. I wonder if such future development is the reason why, even in xarray, open_zarr is kept in a different file from open_mfdataset, a similar split also being present in intake-xarray.
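A short sketch of the asymmetry described in point 1, using local paths (both functions are real fsspec API; the paths are illustrative):

import fsspec

# open_local expands glob patterns into a list of local file paths,
# which suits netcdf files with a known extension:
nc_files = fsspec.open_local("directory*/subdirectory*/*.nc")

# get_mapper returns a key-value mapper for exactly one store and does
# not expand globs, so it cannot discover many zarr stores at once:
store = fsspec.get_mapper("directory/subdirectory/store.zarr")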

I am extremely interested in what people think about xarray and intake-xarray compatibility/development when it comes to zarr files being read in parallel...

@dcherian mentioned this pull request May 5, 2020
@weiji14 (Contributor) left a comment

Hi, first of all, thanks for this! I've been using an earlier version of this branch for a couple of weeks now (at 4ce3007), which works fantastically. Just noticed that the newer version at 62893ab was broken though. This is my 'bad' attempt at helping move things forward a bit. Fingers crossed that this gets into the next xarray release 😄

if isinstance(chunks, int):
    chunks = dict.fromkeys(ds.dims, chunks)

variables = {k: backends.ZarrStore.open_group.maybe_chunk(k, v, chunks) for k, v in ds.variables.items()}
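For reference, the first two lines normalize an integer chunks argument into a per-dimension mapping; an illustrative evaluation (dimension names assumed):

# dict.fromkeys maps every dimension name to the same integer chunk size:
dict.fromkeys({"time": 365, "lat": 180}, 10)  # -> {"time": 10, "lat": 10}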

Suggested change:
- variables = {k: backends.ZarrStore.open_group.maybe_chunk(k, v, chunks) for k, v in ds.variables.items()}
+ variables = {k: store.maybe_chunk(k, v, chunks, overwrite_encoded_chunks) for k, v in ds.variables.items()}

@@ -356,6 +358,48 @@ def encode_variable(self, variable):
    def encode_attribute(self, a):
        return encode_zarr_attr_value(a)


def get_chunk(name, var, chunks):

Suggested change:
- def get_chunk(name, var, chunks):
+ def get_chunk(self, name, var, chunks):

        chunk_spec[dim] = chunks[dim]
    return chunk_spec

def maybe_chunk(name, var, chunks):

Suggested change:
- def maybe_chunk(name, var, chunks):
+ def maybe_chunk(self, name, var, chunks, overwrite_encoded_chunks):

Comment on lines +394 to +396
if (var.ndim > 0) and (chunk_spec is not None):
    # does this cause any data to be read?
    token2 = tokenize(name, var._data)

Suggested change:
- if (var.ndim > 0) and (chunk_spec is not None):
-     # does this cause any data to be read?
-     token2 = tokenize(name, var._data)
+ if (var.ndim > 0) and (chunk_spec is not None):
+     from dask.base import tokenize
+     # does this cause any data to be read?
+     token2 = tokenize(name, var._data)
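For context, dask.base.tokenize deterministically hashes its arguments; here it builds a unique, reproducible name for the chunked dask array. A tiny illustration (values assumed):

import numpy as np
from dask.base import tokenize

# Equal inputs yield equal tokens, so rebuilt graphs get stable keys:
assert tokenize("temperature", np.arange(4)) == tokenize("temperature", np.arange(4))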

Comment on lines +473 to +474
if isinstance(filename_or_obj, MutableMapping):
    if engine == 'zarr':

These conditions should be combined into a single if -- otherwise somebody could provide filename_or_obj as a mutable mapping with a different engine, which could result in undefined behavior.
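A minimal sketch of the combined check (the helper name is hypothetical; MutableMapping and engine come from the diff context):

from collections.abc import MutableMapping

def _is_zarr_store(filename_or_obj, engine):
    # Only treat a mapping as a zarr store when the zarr engine
    # was explicitly requested; anything else falls through.
    return isinstance(filename_or_obj, MutableMapping) and engine == "zarr"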

if engine == 'zarr':
    # on ZarrStore, mode='r', synchronizer=None, group=None,
    # consolidated=False.
    overwrite_encoded_chunks = backend_kwargs.pop("overwrite_encoded_chunks", None)

pop modifies a dict in place. It's best to make a copy of backend_kwargs first, to ensure that objects users supply as arguments to xarray functions aren't unexpectedly modified.
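A sketch of the suggested fix, shallow-copying before the pop (names from the diff):

# Copy first so popping mutates only our private dict, never the caller's:
backend_kwargs = {} if backend_kwargs is None else backend_kwargs.copy()
overwrite_encoded_chunks = backend_kwargs.pop("overwrite_encoded_chunks", None)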

@@ -383,6 +384,10 @@ def open_dataset(
        represented using ``np.datetime64[ns]`` objects. If False, always
        decode times to ``np.datetime64[ns]`` objects; if this is not possible
        raise an error.
    overwrite_encoded_chunks: bool, optional
        Whether to drop the zarr chunks encoded for each variable when a
        dataset is loaded with specified chunk sizes (default: False)

Could we add overwrite_encoded_chunks as a new argument in ZarrStore.open_group? That seems preferable to adding another highly backend-specific keyword onto open_dataset.

(To be honest, I'm not entirely sure why we need overwrite_encoded_chunks... it's not obvious from the description why it's useful)
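For readers following along, the use case implied by the docstring above: when a store is opened with dask chunks that differ from the chunks recorded in each variable's encoding, writing back can hit a chunk-layout mismatch, and dropping the encoded chunks sidesteps it. A hedged sketch (store paths are illustrative):

import xarray as xr

# Request dask chunks that differ from the chunks encoded on disk;
# overwrite_encoded_chunks drops the stale encoding["chunks"] entries:
ds = xr.open_zarr("source.zarr", chunks={"time": 10},
                  overwrite_encoded_chunks=True)
ds.to_zarr("rechunked.zarr")  # writes with the new chunking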

@weiji14 (Contributor) commented Jun 29, 2020

@Mikejmnez, do you mind if I pick up working on this branch? I'd be really keen to see it get into xarray 0.16, and then it will be possible to resolve the intake-xarray issue at intake/intake-xarray#70. Not sure if it's possible to get commit access here, or if I should just submit a PR to your fork, or maybe there's a better way? Edit: I've opened up a pull request to the fork.

weiji14 added commits to weiji14/xarray that referenced this pull request Jun 29, 2020
Don't pop the backend_kwargs dict as per pydata#4003 (comment), make a shallow copy of the backend_kwargs dictionary first. Also removed `overwrite_encoded_chunks` as a top level kwarg of `open_dataset`. Instead, pass it to `backend_kwargs` when using engine="zarr".
@shoyer (Member) commented Jun 30, 2020

@weiji14 could you kindly reopen your new pull request against the main xarray repository? Your pull request is currently in Mikejmnez/xarray

@weiji14 (Contributor) commented Jun 30, 2020

Sure, I can move it, but I just wanted to make sure @Mikejmnez gets the credit for this PR. Edit: moved to #4187.

@shoyer (Member) commented Jun 30, 2020

Sure, I can move it, but I just wanted to make sure @Mikejmnez gets the credit for this PR

Yes, absolutely! As long as you preserve his original commits and add yours on top of them, both of you will be credited in the Git history. If you're writing a release note in whats-new.rst about the feature, please include both of your names in the credits.

@Mikejmnez (Contributor, Author)

@weiji14 @shoyer Thank you guys! Sorry it has taken me so long to come back to this PR - I really meant to come back to it, but I got stuck with another, bigger PR that is part of my main research project. Anyway, the help is much appreciated, cheers!!

  • Since I am a novice at this, on my end, should I close this PR?

@keewis (Collaborator) commented Jun 30, 2020

Since I am a novice at this, on my end, should I close this PR?

don't worry about that: we can close this PR when we merge #4187

@shoyer shoyer closed this in #4187 Sep 22, 2020
Flexible Storage Backends automation moved this from In progress to Done Sep 22, 2020
shoyer pushed a commit that referenced this pull request Sep 22, 2020
* create def for multiple zarr files and added commentary/definition, which matches almost exactly that of ``xr.open_mfdataset``, but without ``engine``

* just as with ``xr.open_mfdataset``, identify the paths as local directory paths/strings

* added error if no path

* finished copying similar code from ``xr.open_mfdataset``

* remove blank lines

* fixed typo

* added ``xr.open_mzarr()`` to the list of available modules to call

* imported missing function

* imported missing glob

* imported function from backend.api

* imported function to facilitate mzarr

* correctly imported functions from core to mzarr

* imported to use on open_mzarr

* removed lock and autoclose since not taken by ``open_zarr``

* fixed typo

* class is not needed since zarr stores don't remain open

* removed old behavior

* set default

* listed open_mzarr

* removed unused imported function

* imported Path - hadn't before

* remove unnecessary comments

* modified comments

* isorted zarr

* isorted

* erased open_mzarr. Added capability to open_dataset to open zarr files

* removed imported but unused

* comment to `zarr` engine

* added chunking code from `open_zarr`

* remove import of ``open_mzarr``

* removed ``open_mzarr`` from top-level function

* missing return in nested function

* moved outside of nested function, had trouble with reading before assignment

* added missing argument associated with zarr stores to the definition of open_dataset

* isort zarr.py

* removed blank lines, fixed typo on `chunks`

* removed imported but unused

* restored conditional for `auto`

* removed imported but unused `dask.array`

* added capability for file_or_obj to be a MutableMapping, such as from `fsspec.get_mapper`, and thus compatible with `intake-xarray`

* moved to a different conditional since file_or_obj is a MutableMapping, not a str, Path, or AbstractDataStore

* isort api.py

* restored the option for when file_or_obj is a str, such as a URL.

* fixed relabel

* update open_dataset for zarr files

* remove open_zarr from tests, now open_dataset(engine=`zarr`)

* remove extra file, and raise deprecation warning on open_zarr

* added internal call to open_dataset from deprecated open_zarr

* defined engine=`zarr`

* correct argument for open_dataset

* pass arguments as backend_kwargs

* pass backend_kwargs as argument

* typo

* set `overwrite_encoded_chunks` as backend_kwargs

* do not pass as backend, use for chunking

* removed commented code

* moved definitions to zarr backends

* Ensure class functions have necessary variables

Was missing some 'self' and other kwarg variables. Also linted using black.

* Combine MutableMapping and Zarr engine condition

As per #4003 (comment).

* Pop out overwrite_encoded_chunks after shallow copy backend_kwargs dict

Don't pop the backend_kwargs dict as per #4003 (comment), make a shallow copy of the backend_kwargs dictionary first. Also removed `overwrite_encoded_chunks` as a top level kwarg of `open_dataset`. Instead, pass it to `backend_kwargs` when using engine="zarr".

* Fix some errors noticed by PEP8

* Reorganize code in backends api.py and actually test using engine zarr

Merge at 1977ba1 wasn't done very well. Reorganized the logic of the code to reduce the diff with xarray master, and ensure that the zarr backend tests actually have engine="zarr" in them.

* Add back missing decode_timedelta kwarg

* Add back a missing engine="zarr" to test_distributed.py

* Ensure conditional statements make sense

* Fix UnboundLocalError on 'chunks' referenced before assignment

Need to pass in chunks to maybe_decode_store, to resolve UnboundLocalError: local variable 'chunks' referenced before assignment.

* Run isort to fix import order

* Fix tests where kwargs needs to be inside of backend_kwargs dict now

Also temporarily silence deprecate_auto_chunk tests using pytest.raises(TypeError). May remove those fully later.

* Change open_zarr to open_dataset with engine="zarr" in io.rst

* Fix test_distributed by wrapping consolidated in backend_kwargs dict

Patches cb6d066.

* Ensure read-only mode when using open_dataset with engine="zarr"

* Turn chunks from "auto" to None if dask is not available

* Add back a missing else statement in maybe_chunk

* Allow xfail test_vectorized_indexing when has_dask

Instead of when not has_dask.

* Typo on chunks arg in open_dataset

* Fix ZeroDivisionError by adding back check that chunks is not False

Yet another if-statement that wasn't properly transferred from zarr.py to api.py.

* Fix a typo that was causing TypeError: 'method' object is not iterable

* Move the `if not chunks` block to after auto detect

Patches logic of 6fbeadf to fix errors when Dask is not installed.

* Revert "Allow xfail test_vectorized_indexing when has_dask"

This reverts commit aca2012.

* Temporarily xfail test_vectorized_indexing with or without dask

* Put zarr in open_mfdataset engine list

* Test open_mfdataset_manyfiles with engine zarr

Zarr objects are folders, which seems to cause issues with closing, so added a try-except to api.py to catch failures in f.close(). Some tests fail when chunks=None because a numpy array is returned instead of a dask array.

* Remember to set a ._file_obj when using Zarr

Yet another logic error fixed, resolves the try-except hack in b9a239e.

* Expect np.ndarray when using open_mfdataset on Zarr with chunks None

* Add an entry to what's new for open_mfdataset with Zarr engine

Plus a small formatting fix in open_mfdataset docstring

* Make zarr engine's custom chunk mechanism more in line with ds.chunk

Slightly edited the token name string to start with 'xarray' and include chunks in tokenize. Also replace the deprecated `_replace_vars_and_dims` method with just `_replace`.

* Workaround problem where dask arrays aren't returned when chunks is None

Revert 827e546 and work around it to get dask arrays by fixing some if-then logic in the code when `engine="zarr"` is involved. Things work fine when using chunks="auto", perhaps because the try `import dask.array` is needed to trigger loading into dask arrays? Also removed using chunks="auto" in some Zarr tests to simplify.

* Default to chunks="auto" for Zarr tests to fix test_vectorized_indexing

Revert hack in 6b99225 as test_vectorized_indexing now works on dask, specifically the negative slices test. It will still fail without dask, as was the behaviour before. Solution was to set `chunks="auto"` as the default when testing using `open_dataset` with `engine="zarr"`, similar to the default for `open_zarr`. Reverted some aspects of dce4e7c to ensure this `chunks="auto"` setting is visible throughout the Zarr test suite.

* Fix test by passing in chunk_store to backend_kwargs

* Revert "Change open_zarr to open_dataset with engine="zarr" in io.rst"

This reverts commit cd0b9ef.

* Remove open_zarr DeprecationWarning

Partially reverts b488363.

* Update open_dataset docstring to specify chunk options for zarr engine

* Let only chunks = None return non-chunked arrays

* Remove for-loop in test_manual_chunk since testing only one no_chunk

* Update open_dataset docstring to remove mention of chunks=None with Zarr

Co-authored-by: Miguel Jimenez-Urias <mjimen17@jh.edu>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
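To summarize where this commit history lands, a hedged sketch of the resulting API (paths, chunk values, and combine options are illustrative):

import xarray as xr

# A single zarr store through the generic entry point:
ds = xr.open_dataset("store.zarr", engine="zarr", chunks="auto")

# Multiple stores opened in parallel and combined, the goal of this PR:
dsm = xr.open_mfdataset("directory*/subdirectory*/*", engine="zarr",
                        combine="by_coords", parallel=True)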
Labels
topic-zarr Related to zarr storage library

10 participants