
xarray.open_mzarr: open multiple zarr files (in parallel) #4003

Closed
wants to merge 58 commits

Conversation

@Mikejmnez (Contributor)

This is associated with #3995 and somewhat mentioned in #3668. Emulating xarray.open_mfdataset, it allows xarray to read multiple zarr files from a glob or a list of paths pointing to zarr stores. It follows the code for xarray.open_mfdataset, and the function's description is the same, since the two do the same thing for their respective file types.
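A hypothetical usage sketch (the signature is assumed to mirror xarray.open_mfdataset, per the description above; concat_dim and parallel are assumptions carried over from that function):

import xarray as xr

# A glob pattern, or an explicit list of paths to zarr stores:
ds = xr.open_mzarr(
    "directory*/subdirectory*/*",  # zarr stores are directories, so no file extension
    concat_dim="time",             # assumed kwarg, mirroring open_mfdataset
    parallel=True,                 # assumed kwarg: open each store in a dask task
)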

@pep8speaks commented Apr 25, 2020

Hello @Mikejmnez! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 514:13: F841 local variable 'overwrite_encoded_chunks' is assigned to but never used

Line 362:5: E303 too many blank lines (2)
Line 392:22: F821 undefined name 'get_chunk'
Line 396:22: F821 undefined name 'tokenize'
Line 399:16: F821 undefined name 'overwrite_encoded_chunks'

Line 1513:75: E225 missing whitespace around operator

Comment last updated at 2020-05-22 16:45:34 UTC

@dcherian (Contributor)

I think the better way to do this would be to add a kwarg to open_dataset that specifies the backend to use, e.g. xr.open_dataset(..., format="zarr").

This would then delegate to open_zarr or a new open_netcdf or open_rasterio as appropriate. Then open_mfdataset would just work for all these formats without requiring duplicate code.
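A minimal sketch of that dispatch, with stand-in openers (only open_zarr and open_rasterio existed at the time; the netcdf default shown here is hypothetical):

# Illustrative stand-ins for the real backend openers:
def _open_zarr(path, **kwargs): ...
def _open_rasterio(path, **kwargs): ...
def _open_netcdf(path, **kwargs): ...

def open_dataset(filename_or_obj, format=None, **kwargs):
    # Delegate to a backend-specific opener based on the `format` kwarg:
    if format == "zarr":
        return _open_zarr(filename_or_obj, **kwargs)
    if format == "rasterio":
        return _open_rasterio(filename_or_obj, **kwargs)
    return _open_netcdf(filename_or_obj, **kwargs)

(This is essentially what later landed, spelled engine="zarr" rather than format="zarr".)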

cc @pydata/xarray for thoughts.

@TomNicholas (Contributor)

+1 for having open_dataset and open_mfdataset as the main (ideally only) points of entry for users, which then delegate to different backend openers. That will keep the API neater, avoid duplicate code, and be easier to make into a completely general and extensible solution eventually.

@TomNicholas TomNicholas added this to In progress in Flexible Storage Backends via automation Apr 27, 2020
@Mikejmnez (Contributor, Author)

I like this approach (adding the capability to open_mfdataset to open multiple zarr files), as it is the easiest and cleanest. I considered it, and I am glad this is coming up because I wanted to hear different opinions. Two things influenced my decision to keep open_mzarr separate from open_mfdataset:

  1. Zarr stores are inherently different from netcdf files, which becomes more evident when opening multiple files given a glob path (paths='directory*/subdirectory*/*'). Zarr stores can be recognized as directories rather than files (as opposed to, e.g., paths='directory*/subdirectory*/*.nc'). This distinction comes into play when, for example, trying to open files (zarr vs netcdf) through intake-xarray. I know this is upstream behavior, but I think it needs to be considered, and it is my end goal in allowing xarray to read multiple zarr files (in parallel): to use intake-xarray to read them. The way intake-xarray opens files (zarr vs others) is likewise kept separate and uses different functions. That is,

For netcdf-files (intake-xarray/netcdf.py):

url_path = fsspec.open_local(paths, **kwargs)

which can interpret a glob path. The resulting url_path is then passed to xarray.open_mfdataset.

For zarr files (intake-xarray/xzarr.py):

url_path = fsspec.get_mapper(paths, **kwargs)

fsspec.get_mapper does not recognize glob paths, and fsspec.open_local, which does recognize globs, cannot detect zarr stores (as these are recognized as directories rather than files with a known extension); see the sketch after this list. I created an issue about this behavior at fsspec/filesystem_spec#286 (comment) (apologies, I am new at GitHub and don't know if this is the correct way to link issues across repositories).

  2. Zarr continues to be under development, and it appears its behavior will rely more heavily on fsspec in the future. I wonder if such future development is the reason why, even in xarray, open_zarr is kept in a different file from open_mfdataset, a similar split also being present in intake-xarray.
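A short sketch of the asymmetry described in point 1, using local paths (both functions are real fsspec API; the paths are illustrative):

import fsspec

# open_local expands glob patterns into a list of local file paths,
# which suits netcdf files with a known extension:
nc_files = fsspec.open_local("directory*/subdirectory*/*.nc")

# get_mapper returns a key-value mapper for exactly one store and does
# not expand globs, so it cannot discover many zarr stores at once:
store = fsspec.get_mapper("directory/subdirectory/store.zarr")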

I am extremely interested in what people think about xarray and intake-xarray compatibility/development when it comes to zarr files being read in parallel...

@dcherian mentioned this pull request May 5, 2020
@weiji14 (Contributor) left a comment

Hi, first of all, thanks for this! I've been using an earlier version of this branch for a couple of weeks now (at 4ce3007), which works fantastically. Just noticed that the newer version at 62893ab was broken though. This is my 'bad' attempt at helping move things forward a bit. Fingers crossed that this gets into the next xarray release 😄

if isinstance(chunks, int):
    chunks = dict.fromkeys(ds.dims, chunks)

variables = {k: backends.ZarrStore.open_group.maybe_chunk(k, v, chunks) for k, v in ds.variables.items()}
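For reference, the first two lines normalize an integer chunks argument into a per-dimension mapping; an illustrative evaluation (dimension names assumed):

# dict.fromkeys maps every dimension name to the same integer chunk size:
dict.fromkeys({"time": 365, "lat": 180}, 10)  # -> {"time": 10, "lat": 10}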

Suggested change:
- variables = {k: backends.ZarrStore.open_group.maybe_chunk(k, v, chunks) for k, v in ds.variables.items()}
+ variables = {k: store.maybe_chunk(k, v, chunks, overwrite_encoded_chunks) for k, v in ds.variables.items()}

@@ -356,6 +358,48 @@ def encode_variable(self, variable):
    def encode_attribute(self, a):
        return encode_zarr_attr_value(a)


def get_chunk(name, var, chunks):

Suggested change:
- def get_chunk(name, var, chunks):
+ def get_chunk(self, name, var, chunks):

        chunk_spec[dim] = chunks[dim]
    return chunk_spec

def maybe_chunk(name, var, chunks):

Suggested change:
- def maybe_chunk(name, var, chunks):
+ def maybe_chunk(self, name, var, chunks, overwrite_encoded_chunks):

Comment on lines +394 to +396
if (var.ndim > 0) and (chunk_spec is not None):
    # does this cause any data to be read?
    token2 = tokenize(name, var._data)

Suggested change:
- if (var.ndim > 0) and (chunk_spec is not None):
-     # does this cause any data to be read?
-     token2 = tokenize(name, var._data)
+ if (var.ndim > 0) and (chunk_spec is not None):
+     from dask.base import tokenize
+     # does this cause any data to be read?
+     token2 = tokenize(name, var._data)
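For context, dask.base.tokenize deterministically hashes its arguments; here it builds a unique, reproducible name for the chunked dask array. A tiny illustration (values assumed):

import numpy as np
from dask.base import tokenize

# Equal inputs yield equal tokens, so rebuilt graphs get stable keys:
assert tokenize("temperature", np.arange(4)) == tokenize("temperature", np.arange(4))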

Comment on lines +473 to +474
if isinstance(filename_or_obj, MutableMapping):
    if engine == 'zarr':

These conditions should be combined into a single if -- otherwise somebody could provide filename_or_obj as a mutable mapping with a different engine, which could result in undefined behavior.
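A minimal sketch of the combined check (the helper name is hypothetical; MutableMapping and engine come from the diff context):

from collections.abc import MutableMapping

def _is_zarr_store(filename_or_obj, engine):
    # Only treat a mapping as a zarr store when the zarr engine
    # was explicitly requested; anything else falls through.
    return isinstance(filename_or_obj, MutableMapping) and engine == "zarr"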

if engine == 'zarr':
    # on ZarrStore, mode='r', synchronizer=None, group=None,
    # consolidated=False.
    overwrite_encoded_chunks = backend_kwargs.pop("overwrite_encoded_chunks", None)

pop modifies a dict in place. It's best to make a copy of backend_kwargs first, to ensure that objects users supply as arguments to xarray functions aren't unexpectedly modified.
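A sketch of the suggested fix, shallow-copying before the pop (names from the diff):

# Copy first so popping mutates only our private dict, never the caller's:
backend_kwargs = {} if backend_kwargs is None else backend_kwargs.copy()
overwrite_encoded_chunks = backend_kwargs.pop("overwrite_encoded_chunks", None)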

@@ -383,6 +384,10 @@ def open_dataset(
        represented using ``np.datetime64[ns]`` objects. If False, always
        decode times to ``np.datetime64[ns]`` objects; if this is not possible
        raise an error.
    overwrite_encoded_chunks: bool, optional
        Whether to drop the zarr chunks encoded for each variable when a
        dataset is loaded with specified chunk sizes (default: False)

Could we add overwrite_encoded_chunks as a new argument in ZarrStore.open_group? That seems preferable to adding another highly backend-specific keyword onto open_dataset.

(To be honest, I'm not entirely sure why we need overwrite_encoded_chunks... it's not obvious from the description why it's useful)
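For readers following along, the use case implied by the docstring above: when a store is opened with dask chunks that differ from the chunks recorded in each variable's encoding, writing back can hit a chunk-layout mismatch, and dropping the encoded chunks sidesteps it. A hedged sketch (store paths are illustrative):

import xarray as xr

# Request dask chunks that differ from the chunks encoded on disk;
# overwrite_encoded_chunks drops the stale encoding["chunks"] entries:
ds = xr.open_zarr("source.zarr", chunks={"time": 10},
                  overwrite_encoded_chunks=True)
ds.to_zarr("rechunked.zarr")  # writes with the new chunking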

@weiji14 (Contributor) commented Jun 29, 2020

@Mikejmnez, do you mind if I pick up working on this branch? I'd be really keen to see it get into xarray 0.16, and then it will be possible to resolve the intake-xarray issue at intake/intake-xarray#70. Not sure if it's possible to get commit access here, or if I should just submit a PR to your fork, or maybe there's a better way? Edit: I've opened up a pull request to the fork.

weiji14 added commits to weiji14/xarray that referenced this pull request Jun 29, 2020
Don't pop the backend_kwargs dict as per pydata#4003 (comment), make a shallow copy of the backend_kwargs dictionary first. Also removed `overwrite_encoded_chunks` as a top level kwarg of `open_dataset`. Instead, pass it to `backend_kwargs` when using engine="zarr".
@shoyer (Member) commented Jun 30, 2020

@weiji14 could you kindly reopen your new pull request against the main xarray repository? Your pull request is currently in Mikejmnez/xarray

@weiji14 (Contributor) commented Jun 30, 2020

Sure, I can move it, but I just wanted to make sure @Mikejmnez gets the credit for this PR. Edit: moved to #4187.

@shoyer (Member) commented Jun 30, 2020

Sure, I can move it, but I just wanted to make sure @Mikejmnez gets the credit for this PR

Yes, absolutely! As long as you preserve his original commits and add yours on top of them, both of you will be credited in the Git history. If you're writing a release note in whats-new.rst about the feature, please include both of your names in the credits.

@Mikejmnez (Contributor, Author)

@weiji14 @shoyer Thank you guys! Sorry it has taken me so long to come back to this PR - I really meant to come back to it, but I got stuck with another, bigger PR that is part of my main research project. Anyway, the help is much appreciated, cheers!!

  • Since I am a novice at this, on my end, should I close this PR?

@keewis (Collaborator) commented Jun 30, 2020

Since I am a novice at this, on my end, should I close this PR?

don't worry about that: we can close this PR when we merge #4187

@shoyer shoyer closed this in #4187 Sep 22, 2020
Flexible Storage Backends automation moved this from In progress to Done Sep 22, 2020
shoyer pushed a commit that referenced this pull request Sep 22, 2020
* create def for multiple zarr files and added commentary/definition, which matches almost exactly that of ``xr.open_mfdataset``, but without ``engine``

* just as with ``xr.open_mfdataset``, identify the paths as local directory paths/strings

* added error if no path

* finished copying similar code from ``xr.open_mfdataset``

* remove blank lines

* fixed typo

* added ``xr.open_mzarr()`` to the list of available modules to call

* imported missing function

* imported missing glob

* imported function from backend.api

* imported function to facilitate mzarr

* correctly imported functions from core to mzarr

* imported to use on open_mzarr

* removed lock and autoclose since not taken by ``open_zarr``

* fixed typo

* class is not needed since zarr stores don't remain open

* removed old behavior

* set default

* listed open_mzarr

* removed unused imported function

* imported Path - hadn't before

* remove unnecessary comments

* modified comments

* isorted zarr

* isorted

* erased open_mzarr. Added capability to open_dataset to open zarr files

* removed imported but unused

* comment to `zarr` engine

* added chunking code from `open_zarr`

* remove import of ``open_mzarr``

* removed ``open_mzarr`` from top-level function

* missing return in nested function

* moved outside of nested function, had trouble with reading before assignment

* added missing argument associated with zarr stores to the definition of open_dataset

* isort zarr.py

* removed blank lines, fixed typo on `chunks`

* removed imported but unused

* restored conditional for `auto`

* removed imported but unused `dask.array`

* added capability for file_or_obj to be a MutableMapping, such as from `fsspec.get_mapper`, and thus compatible with `intake-xarray`

* moved to a different conditional since file_or_obj is a MutableMapping, not a str, Path, or AbstractDataStore

* isort api.py

* restored the option for when file_or_obj is a str, such as a URL.

* fixed relabel

* update open_dataset for zarr files

* remove open_zarr from tests, now open_dataset(engine=`zarr`)

* remove extra file, and raise deprecation warning on open_zarr

* added internal call to open_dataset from deprecated open_zarr

* defined engine=`zarr`

* correct argument for open_dataset

* pass arguments as backend_kwargs

* pass backend_kwargs as argument

* typo

* set `overwrite_encoded_chunks` as backend_kwargs

* do not pass as backend, use for chunking

* removed commented code

* moved definitions to zarr backends

* Ensure class functions have necessary variables

Was missing some 'self' and other kwarg variables. Also linted using black.

* Combine MutableMapping and Zarr engine condition

As per #4003 (comment).

* Pop out overwrite_encoded_chunks after shallow copy backend_kwargs dict

Don't pop the backend_kwargs dict as per #4003 (comment), make a shallow copy of the backend_kwargs dictionary first. Also removed `overwrite_encoded_chunks` as a top level kwarg of `open_dataset`. Instead, pass it to `backend_kwargs` when using engine="zarr".

* Fix some errors noticed by PEP8

* Reorganize code in backends api.py and actually test using engine zarr

Merge at 1977ba1 wasn't done very well. Reorganized the logic of the code to reduce the diff with xarray master, and ensure that the zarr backend tests actually have engine="zarr" in them.

* Add back missing decode_timedelta kwarg

* Add back a missing engine="zarr" to test_distributed.py

* Ensure conditional statements make sense

* Fix UnboundLocalError on 'chunks' referenced before assignment

Need to pass in chunks to maybe_decode_store, to resolve UnboundLocalError: local variable 'chunks' referenced before assignment.

* Run isort to fix import order

* Fix tests where kwargs needs to be inside of backend_kwargs dict now

Also temporarily silence deprecate_auto_chunk tests using pytest.raises(TypeError). May remove those fully later.

* Change open_zarr to open_dataset with engine="zarr" in io.rst

* Fix test_distributed by wrapping consolidated in backend_kwargs dict

Patches cb6d066.

* Ensure read-only mode when using open_dataset with engine="zarr"

* Turn chunks from "auto" to None if dask is not available

* Add back a missing else statement in maybe_chunk

* Allow xfail test_vectorized_indexing when has_dask

Instead of when not has_dask.

* Typo on chunks arg in open_dataset

* Fix ZeroDivisionError by adding back check that chunks is not False

Yet another if-statement that wasn't properly transferred from zarr.py to api.py.

* Fix a typo that was causing TypeError: 'method' object is not iterable

* Move the `if not chunks` block to after auto detect

Patches logic of 6fbeadf to fix errors when Dask is not installed.

* Revert "Allow xfail test_vectorized_indexing when has_dask"

This reverts commit aca2012.

* Temporarily xfail test_vectorized_indexing with or without dask

* Put zarr in open_mfdataset engine list

* Test open_mfdataset_manyfiles with engine zarr

Zarr objects are folders, which seems to cause issues with closing, so added a try-except to api.py to catch failures in f.close(). Some tests fail when chunks=None because a numpy array is returned instead of a dask array.

* Remember to set a ._file_obj when using Zarr

Yet another logic error fixed, resolves the try-except hack in b9a239e.

* Expect np.ndarray when using open_mfdataset on Zarr with chunks None

* Add an entry to what's new for open_mfdataset with Zarr engine

Plus a small formatting fix in open_mfdataset docstring

* Make zarr engine's custom chunk mechanism more in line with ds.chunk

Slightly edited the token name string to start with 'xarray' and include chunks in tokenize. Also replace the deprecated `_replace_vars_and_dims` method with just `_replace`.

* Workaround problem where dask arrays aren't returned when chunks is None

Revert 827e546 and work around it to get dask arrays by fixing some if-then logic in the code when `engine="zarr"` is involved. Things work fine when using chunks="auto", perhaps because the try `import dask.array` is needed to trigger loading into dask arrays? Also removed using chunks="auto" in some Zarr tests to simplify.

* Default to chunks="auto" for Zarr tests to fix test_vectorized_indexing

Revert hack in 6b99225 as test_vectorized_indexing now works on dask, specifically the negative slices test. It will still fail without dask, as was the behaviour before. Solution was to set `chunks="auto"` as the default when testing using `open_dataset` with `engine="zarr"`, similar to the default for `open_zarr`. Reverted some aspects of dce4e7c to ensure this `chunks="auto"` setting is visible throughout the Zarr test suite.

* Fix test by passing in chunk_store to backend_kwargs

* Revert "Change open_zarr to open_dataset with engine="zarr" in io.rst"

This reverts commit cd0b9ef.

* Remove open_zarr DeprecationWarning

Partially reverts b488363.

* Update open_dataset docstring to specify chunk options for zarr engine

* Let only chunks = None return non-chunked arrays

* Remove for-loop in test_manual_chunk since testing only one no_chunk

* Update open_dataset docstring to remove mention of chunks=None with Zarr

Co-authored-by: Miguel Jimenez-Urias <mjimen17@jh.edu>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
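To summarize where this commit history lands, a hedged sketch of the resulting API (paths, chunk values, and combine options are illustrative):

import xarray as xr

# A single zarr store through the generic entry point:
ds = xr.open_dataset("store.zarr", engine="zarr", chunks="auto")

# Multiple stores opened in parallel and combined, the goal of this PR:
dsm = xr.open_mfdataset("directory*/subdirectory*/*", engine="zarr",
                        combine="by_coords", parallel=True)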
Labels
topic-zarr Related to zarr storage library

10 participants