
Set write_empty_chunks to default to False #853

Merged

Conversation

@jni (Contributor) commented Oct 20, 2021

In #738, despite general agreement that most people expect write_empty_chunks to default to False, @d-v-b decided to go the safe, uncontroversial route to get the PR merged, and made the default value True. All this PR does is set the default to False.

Convenient links to discussion suggesting write_empty_chunks=False should be the default:

  • @meggart says: "In Zarr.jl, not writing empty chunks is already the default [...] so I would be in favor of changing the default behavior in Python as well."
  • @shoyer says: "it is not clear to me why you would not want to do this. I think this could be the new default behavior."
  • @martindurant says: "I am also surprised that this is not the default, and would agree that I can't see people wanting to use the current behaviour ever."

There were performance concerns around FSSpec's bulk deletion of items, but it seems they were addressed upstream.
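For illustration, a minimal sketch of what this change means in practice (assuming the zarr v2 API, where `zarr.open` accepts `write_empty_chunks`; the store path and shapes are just examples):

```python
import numpy as np
import zarr

# With write_empty_chunks=False (the new default), chunks that contain
# only the fill value are never materialized in the store.
z = zarr.open(
    "data/example.zarr", mode="w",
    shape=(1024, 1024), chunks=(256, 256),
    dtype="f8", fill_value=0,
)

z[:] = 0                      # every chunk equals the fill value
print(z.nchunks_initialized)  # 0 -- no chunk files were written

z[0, 0] = 1                   # only the touched chunk is stored
print(z.nchunks_initialized)  # 1
```

With the old default of True, the same `z[:] = 0` assignment would have written all 16 chunks to the store.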

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@d-v-b (Contributor) commented Oct 20, 2021

Sounds great to me :) Thanks for doing what I didn't have the courage to do.

@jakirkham (Member)

@joshmoore wdyt?

@rabernat (Contributor)

👍 from me. I think this is a much better default.

@jakirkham (Member)

There are some lines scattered throughout the tests like this:

```python
write_empty_chunks = kwargs.pop('write_empty_chunks', True)
```

Do we want to update those as well?

@jni (Contributor, Author) commented Nov 30, 2021

> Do we want to update those as well?

🤷 All that means is that those tests will keep running with write_empty_chunks=True, which is no longer the default; that might even be useful for historical reasons. Both True and False are explicitly tested, so I don't think we need to worry.

@jakirkham mentioned this pull request on Dec 1, 2021
@jni force-pushed the default-write-empty-chunks-false branch from 3e7dbea to 6b461a7 on December 2, 2021
@jni (Contributor, Author) commented Dec 2, 2021

This is now rebased on the latest master.

@joshmoore (Member)

Per #901, @d-v-b was working on a blurb for blog/release notes/docs so that users are maximally informed of the change.

@joshmoore (Member)

Capturing initial results from @d-v-b here: "users with 'dense' data will see a ~15% increase in write times".

@d-v-b (Contributor) commented Dec 7, 2021

N.B. that result doesn't include compression time. I will provide improved benchmarking code later today with a more realistic workflow.

@d-v-b (Contributor) commented Dec 7, 2021

Some interesting benchmarking results; see this repo for the code: https://github.com/d-v-b/zarr-benchmarks

Each test writes a 64 MB chunk 128 times, with write_empty_chunks drawn from {True, False} and the write value drawn from {array of 0s, array of random ints}, i.e. 4 different conditions. The plots show histograms whose x-axis is the write duration normalized to the median time required under the (write_empty_chunks=False, write_value=0) condition.

First, with no compressor:
[plot: histograms of normalized write durations, no compressor]

Next, with Blosc compression:
[plot: histograms of normalized write durations, Blosc]

And last, with GZip compression:
[plot: histograms of normalized write durations, GZip]

Takeaways:

  • The overhead of the emptiness check is visible with no compressor and with Blosc compression.
  • With Blosc, writing empty chunks is actually faster than performing an emptiness check and skipping the write.
  • With GZip, handling empty chunks is much faster with write_empty_chunks=False, and the overhead of the emptiness check is not evident when writing "dense" chunks.

Happy to hear suggestions for changing the benchmark / plotting, e.g. maybe we want to show absolute durations instead of relative ones.
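For reference, the timing loop looks roughly like this (a simplified sketch, not the actual code from the zarr-benchmarks repo; the store path and variable names are illustrative):

```python
import itertools
import time

import numpy as np
import zarr

# One 64 MB uint8 chunk, written 128 times under each of the four
# (write_empty_chunks, write_value) conditions described above.
chunk = 64 * 2**20  # chunk length in uint8 elements (64 MB)
empty = np.zeros(chunk, dtype="u1")
dense = np.random.randint(0, 255, size=chunk, dtype="u1")

results = {}
for wec, (name, value) in itertools.product(
        (True, False), (("empty", empty), ("dense", dense))):
    z = zarr.open(
        "data/bench.zarr", mode="w",
        shape=128 * chunk, chunks=chunk, dtype="u1",
        fill_value=0, compressor=None,  # swap in a numcodecs Blosc/GZip to test compression
        write_empty_chunks=wec,
    )
    durations = []
    for i in range(128):
        tic = time.perf_counter()
        z[i * chunk:(i + 1) * chunk] = value
        durations.append(time.perf_counter() - tic)
    results[(wec, name)] = durations
```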

@shoyer (Contributor) commented Dec 7, 2021

What sort of local disk are you using? I can imagine very different results for SSD vs HDD vs cloud object stores. Extra copies are never desirable, but in most cases I would guess the overhead of checking is acceptable.

@d-v-b (Contributor) commented Dec 8, 2021

@shoyer this is on an SSD, so pretty ideal in terms of storage latency.

@d-v-b (Contributor) commented Dec 14, 2021

@joshmoore I added content to the tutorial docs, let me know if you want more or an expanded release note.

I'm not sure the benchmarks are that illuminating or surprising, so maybe we don't really need to involve them?

@joshmoore (Member)

The location looks good. Could I suggest we try to get two or four simple writes with the different flags and inputs? Something like:

```python
data = np.random.randint(0, 255, size=shape, dtype=DTYPE)
arr = zarr.open(zarr.NestedDirectoryStore(store),
                write_empty_chunks=write_empty_chunks, ...)
arr[:] = 255
```

from https://github.com/d-v-b/zarr-benchmarks/blob/main/src/zarr-benchmarks/empty_chunks.py, with some form of marker ("15% slower") for someone to see at a glance.

> I'm not sure the benchmarks are that illuminating or surprising, so maybe we don't really need to involve them?

Agreed. Happy to have them in your repo. I find them pretty fascinating. ;)

@jakirkham (Member)

@d-v-b, did you have a chance to try the other flags Josh mentioned above? 🙂

@d-v-b (Contributor) commented Jan 5, 2022

I added a short benchmark example, but I can't build the docs for some reason (`tox -e docs` yields some inscrutable error from a different part of the codebase).

@d-v-b (Contributor) commented Jan 5, 2022

Oh, and there's a new test failure in an unrelated part of the codebase. Not sure what's going on there.

@joshmoore (Member)

Oy vey. That's a new year for you. The new failure is related to the Categorize filter. I don't remember anything that should impact it. Just starting to look at the docs failure.

(Thanks for the benchmark example!)

@joshmoore (Member)

Ok, @d-v-b, that should fix the docs build. Let's see what's failing on this run.

@jni (Contributor, Author) commented Jan 10, 2022

Hi both, I'm unclear about whether I need to do anything here, please ping me directly if you want me to update something. 😊

@joshmoore (Member)

Sorry for dropping the ball here. Don't think there's anything needed from your side, @jni. Just very unsure what's up with this failing test. Other than that, I'd very much like to get 2.11 out ASAP.

Commit message for the pushed fix:

The call to `np.any(array)` in zarr.util.all_equal triggers the following ValueError:

```
>       return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
E       ValueError: invalid literal for int() with base 10: 'baz'
```

Extending the catch block allows test_array_with_categorize_filter
to pass, but it's unclear if this points to a deeper issue.
@joshmoore (Member)

Ok. Pushed a proposed fix related to @d-v-b's all_equal.

```diff
@@ -670,7 +670,7 @@ def all_equal(value: Any, array: Any):
     # optimized to return on the first truthy value in `array`.
     try:
         return not np.any(array)
-    except TypeError:  # pragma: no cover
+    except (TypeError, ValueError):  # pragma: no cover
```
Review comment (Member):

What causes the ValueError? Is there a particular array type or value that np.any is raising on?

Review comment (Member):

It occurs when a category value (`baz`) gets passed to `int()`.
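For context, a minimal standalone repro of this failure mode (a hypothetical example, not code from the PR; the exception type varies by NumPy version, which is what the extended catch block accounts for):

```python
import numpy as np

# np.any() reduces with logical_or, which is not defined for unicode
# string arrays; depending on the NumPy version this surfaces as a
# TypeError or (as of NumPy 1.22) the ValueError seen above.
arr = np.array(['baz', 'baz', 'foo'], dtype='<U3')
try:
    np.any(arr)
except (TypeError, ValueError) as e:
    print(type(e).__name__, e)
```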

Review comment (Member):

From the failing build:

```
______________________ test_array_with_categorize_filter _______________________

    def test_array_with_categorize_filter():
    
        # setup
        data = np.random.choice(['foo', 'bar', 'baz'], size=100)
        flt = Categorize(dtype=data.dtype, labels=['foo', 'bar', 'baz'])
        filters = [flt]
    
        for compressor in compressors:
    
>           a = array(data, chunks=5, compressor=compressor, filters=filters)

zarr/tests/test_filters.py:172: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
zarr/creation.py:366: in array
    z[...] = data
zarr/core.py:1285: in __setitem__
    self.set_basic_selection(pure_selection, value, fields=fields)
zarr/core.py:1380: in set_basic_selection
    return self._set_basic_selection_nd(selection, value, fields=fields)
zarr/core.py:1680: in _set_basic_selection_nd
    self._set_selection(indexer, value, fields=fields)
zarr/core.py:1732: in _set_selection
    self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
zarr/core.py:1994: in _chunk_setitem
    self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
zarr/core.py:2002: in _chunk_setitem_nosync
    if (not self.write_empty_chunks) and all_equal(self.fill_value, cdata):
zarr/util.py:672: in all_equal
    return not np.any(array)
<__array_function__ internals>:180: in any
    ???
/usr/share/miniconda/envs/minimal/lib/python3.10/site-packages/numpy/core/fromnumeric.py:2395: in any
    return _wrapreduction(a, np.logical_or, 'any', axis, None, out,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

obj = array(['baz', 'baz', 'baz', 'bar', 'foo'], dtype='<U3')
ufunc = <ufunc 'logical_or'>, method = 'any', axis = None, dtype = None
out = None, kwargs = {'keepdims': <no value>, 'where': <no value>}
passkwargs = {}

    def _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs):
        passkwargs = {k: v for k, v in kwargs.items()
                      if v is not np._NoValue}
    
        if type(obj) is not mu.ndarray:
            try:
                reduction = getattr(obj, method)
            except AttributeError:
                pass
            else:
                # This branch is needed for reductions like any which don't
                # support a dtype.
                if dtype is not None:
                    return reduction(axis=axis, dtype=dtype, out=out, **passkwargs)
                else:
                    return reduction(axis=axis, out=out, **passkwargs)
    
>       return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
E       ValueError: invalid literal for int() with base 10: 'baz'

/usr/share/miniconda/envs/minimal/lib/python3.10/site-packages/numpy/core/fromnumeric.py:86: ValueError
```

Review comment (Member):

Interesting. Looks like the error message and type changed from NumPy 1.21 to 1.22. Don't think that was intentional, but could be wrong. Raised an issue (numpy/numpy#20898).

In any event, this workaround seems reasonable in the interim.

Review comment (Member):

Great, thanks! Once this is green, I'll move forward with 2.11.0.

@jakirkham (Member)

> Hi both, I'm unclear about whether I need to do anything here, please ping me directly if you want me to update something. 😊

If we can convince both you, @jni, and @d-v-b to write a few lines for the blogpost (#901), that would be fantastic! 😃

You can just post in that issue if you like.

@joshmoore (Member)

> If we can convince both you, @jni, and @d-v-b to write a few lines for the blogpost (#901)

e.g. I was just mentioning to @MSanKeys963 how cool the use case from https://forum.image.sc/t/data-store-and-library-backend-for-napari-plugin/61779 is ;)

(...still waiting on actions...)

@joshmoore (Member)

Trying to reopen to get tests to complete.

@joshmoore closed this on Jan 26, 2022
@joshmoore reopened this on Jan 26, 2022
@joshmoore (Member)

I've not been able to get the 3.8 and 3.9 tests to run on this PR yet (3-4 retries).

@jakirkham (Member)

Looks like some of the past commits in this PR (though not all) had the same issue.

@joshmoore (Member)

Oddly, py 3.7 is still spinning here compared to #951.

@joshmoore (Member)

Power cycling for CI.

@joshmoore closed this on Feb 4, 2022
@joshmoore reopened this on Feb 4, 2022
@joshmoore (Member)

Holy moly that was an adventure! Looks like we're good to move forward with 2.11. (🍺 🍸 or similar all around this weekend)
