`xr.open_zarr` is 3x slower than `zarr.open`, even at scale #9111

Comments
Thanks for opening this issue @max-sixty. Would you mind splitting the timing into the following two steps?

```python
%timeit ds = xr.open_zarr("test.zarr")["foo"]
%timeit ds.compute()

%timeit arr = zarr.open("test.zarr")["foo"]
%timeit arr[:]
```

I would like to know whether the difference is in the metadata loading step or in the fetching/decoding of chunks. I may also suggest avoiding dask chunks in the xarray case, just to narrow things down.
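For the "avoiding dask chunks" part, one way (a sketch; passing `chunks=None` skips dask and returns a lazily-indexed backend array, as used further down in this thread) would be:

```python
# Sketch: open without creating dask arrays, then time the eager load.
da_nodask = xr.open_zarr("test.zarr", chunks=None)["foo"]
%timeit da_nodask.compute()
```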
---

OK, no great updates with those, unfortunately. Slightly adjusting to:

```python
da = xr.open_zarr("test.zarr")["foo"]
arr = zarr.open("test.zarr")["foo"]

%timeit xr.open_zarr("test.zarr")["foo"]
%timeit da.compute()

%timeit zarr.open("test.zarr")["foo"]
%timeit arr[:]
```

...shows there's basically nothing in the metadata step — the array is hopefully big enough to amortize that.

If we adjust to avoid dask + chunks, we get the same result (assuming this is what you meant, lmk if not, probably this isn't the simplest way?):

```python
da = xr.DataArray(np.random.rand(10000, 10000), name="foo")
da.encoding = dict(
    chunks=(10000, 10000), preferred_chunks={"dim_0": 10000, "dim_1": 10000}
)
da.to_dataset().to_zarr("test.zarr", mode="w")

%timeit xr.open_zarr("test.zarr")["foo"].compute()
%timeit zarr.open("test.zarr")["foo"][:]
```
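An alternative that might be slightly simpler, sketched under the assumption that the goal is just a single on-disk chunk: pass the chunking via `to_zarr`'s `encoding` argument instead of mutating `da.encoding`.

```python
import numpy as np
import xarray as xr

# Sketch: write the whole array as one zarr chunk by passing encoding to to_zarr,
# rather than assigning da.encoding by hand.
da = xr.DataArray(np.random.rand(10000, 10000), name="foo")
da.to_dataset().to_zarr(
    "test.zarr",
    mode="w",
    encoding={"foo": {"chunks": (10000, 10000)}},
)
```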
---

Curious, I'm seeing quite similar results (comparable between xarray and zarr), but with slightly different versions. I'll update this post when I've used your exact set of versions.

UPDATE: I've run this with a bunch of xarray/zarr/dask versions and I'm not seeing the differences you were seeing above 🤷.

```python
da = xr.open_zarr("test.zarr", chunks=None)["foo"]  # no dask, just a lazy backend array
arr = zarr.open("test.zarr")["foo"]

%timeit xr.open_zarr("test.zarr")["foo"]
%timeit da.compute()

%timeit zarr.open("test.zarr")["foo"]
%timeit arr[:]
```
---

OK, so it depends on how we do the chunks:

```python
import numpy as np
import zarr
import xarray as xr
import dask

print(zarr.__version__, xr.__version__, dask.__version__)

da = xr.DataArray(np.random.rand(10000, 10000), name="foo")
da.to_dataset().to_zarr("default-chunks.zarr", mode="w")
da.to_dataset().chunk(None).to_zarr("no-chunks.zarr", mode="w")

for chunk_type in ["no-chunks", "default-chunks"]:
    print(f"\n# {chunk_type}")

    print("zarr read")
    %timeit zarr.open(f"{chunk_type}.zarr")["foo"][:]

    print("xarray default read")
    %timeit xr.open_zarr(f"{chunk_type}.zarr")["foo"].compute()

    print("xarray chunks=None read")
    %timeit xr.open_zarr(f"{chunk_type}.zarr", chunks=None)["foo"].compute()
```

Conclusion:
---

Yes, this is very likely. But this seems like a lot of overhead — ~2x even on a huge array. Here's an array 10x the size, with 10 chunks; it's 6.25s vs 3.12s. And this is with 10 chunks, so it should get some credit for the parallelization.

```python
import numpy as np
import zarr
import xarray as xr
import dask

print(zarr.__version__, xr.__version__, dask.__version__)

(
    xr.DataArray(np.random.rand(100000, 10000), name="foo")
    .to_dataset()
    .chunk(10000)
    .to_zarr("test.zarr", mode="w")
)

%timeit -n 1 -r 1 xr.open_zarr("test.zarr", chunks={}).compute()
%timeit -n 1 -r 1 xr.open_zarr("test.zarr", chunks=None).compute()
%timeit -n 1 -r 1 xr.open_zarr("test.zarr").compute()
%timeit -n 1 -r 1 zarr.open("test.zarr")["foo"][:]
```

...that said, possibly this is a dask issue?
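One way to poke at the dask hypothesis (a sketch, not something run here) is to force the single-threaded scheduler and see whether the gap vs. plain zarr changes:

```python
import dask
import xarray as xr

# If contention in the default threaded scheduler is part of the overhead,
# the synchronous scheduler should shift the timings noticeably.
with dask.config.set(scheduler="synchronous"):
    %timeit -n 1 -r 1 xr.open_zarr("test.zarr", chunks={}).compute()
```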
---

* Improve zarr chunks docs. Makes them more structured and consistent. I think it removes a mistake re the default chunks arg in `open_zarr` (it's not `None`, it's `auto`). Adds a comment re performance with `chunks=None`, closing #9111
Updated docs, so will close.
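For anyone landing here later, a rough sketch of how the `chunks` argument changes what `open_zarr` returns (based on the discussion above; the updated docs are the authoritative reference):

```python
import xarray as xr

# Default ("auto"): dask-backed variables, with dask's auto-chunking guided by
# the on-disk zarr chunks.
ds_auto = xr.open_zarr("test.zarr")

# chunks={}: dask-backed variables with one dask chunk per on-disk zarr chunk.
ds_dask = xr.open_zarr("test.zarr", chunks={})

# chunks=None: no dask at all; variables are lazily-indexed backend arrays,
# which avoids the overhead discussed in this issue.
ds_lazy = xr.open_zarr("test.zarr", chunks=None)
```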
---

What is your issue?

I'm doing some benchmarks on Xarray + Zarr vs. some other formats, and I get quite a surprising result — on a very simple array, xarray is adding a lot of overhead to reading a Zarr array.

Here's a quick script — super simple, just a single chunk. It's 800MB of data, so not some tiny array where reading a metadata JSON file or allocating an index is going to throw off the results.
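Roughly (a sketch reconstructed from the snippets later in the thread; the exact write path here is an assumption), the setup was along these lines:

```python
# Reconstruction, not the original script: write ~800MB as a single zarr chunk,
# then compare a full eager read through xarray vs. plain zarr.
import numpy as np
import xarray as xr
import zarr

da = xr.DataArray(np.random.rand(10000, 10000), name="foo")  # 10_000**2 * 8 bytes ≈ 800MB
da.to_dataset().chunk(None).to_zarr("test.zarr", mode="w")  # single chunk on disk

%timeit xr.open_zarr("test.zarr")["foo"].compute()
%timeit zarr.open("test.zarr")["foo"][:]
```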
So:
Having a quick look with `py-spy` suggests there might be some thread contention, but I'm not sure how much is really contention vs. idle threads waiting.

Making the array 10x bigger (with 10 chunks) reduces the relative difference, but it's still fairly large:
Any thoughts on what might be happening? Is the benchmark at least correct?