
Set write_empty_chunks to default to False #853

Merged

Conversation

@jni (Contributor) commented Oct 20, 2021

In #738, despite general agreement that most people expect write_empty_chunks to default to False, @d-v-b decided to go the safe, uncontroversial route to get the PR merged, and made the default value True. All this PR does is set the default to False.

Convenient links to discussion suggesting write_empty_chunks=False should be the default:

  • @meggart says: "In Zarr.jl, not writing empty chunks is already the default [...] so I would be in favor of changing the default behavior in Python as well."
  • @shoyer says: "it is not clear to me why you would not want to do this. I think this could be the new default behavior."
  • @martindurant says: "I am also surprised that this is not the default, and would agree that I can't see people wanting to use the current behaviour ever."

There were performance concerns around FSSpec's bulk deletion of items, but it seems they were addressed upstream.
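For illustration, a minimal sketch of what this change means in practice (assuming the zarr v2 API, where `zarr.open` accepts `write_empty_chunks`; the store path and shapes are just examples):

```python
import numpy as np
import zarr

# With write_empty_chunks=False (the new default), chunks that contain
# only the fill value are never materialized in the store.
z = zarr.open(
    "data/example.zarr", mode="w",
    shape=(1024, 1024), chunks=(256, 256),
    dtype="f8", fill_value=0,
)

z[:] = 0                      # every chunk equals the fill value
print(z.nchunks_initialized)  # 0 -- no chunk files were written

z[0, 0] = 1                   # only the touched chunk is stored
print(z.nchunks_initialized)  # 1
```

With the old default of True, the same `z[:] = 0` assignment would have written all 16 chunks to the store.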

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@d-v-b (Contributor) commented Oct 20, 2021

Sounds great to me :) Thanks for doing what I didn't have the courage to do.

@jakirkham (Member)

@joshmoore wdyt?

@rabernat (Contributor)

👍 from me. I think this is a much better default.

@jakirkham (Member)

There are some lines scattered throughout the tests like this:

```python
write_empty_chunks = kwargs.pop('write_empty_chunks', True)
```

Do we want to update those as well?

@jni (Contributor, Author) commented Nov 30, 2021

> Do we want to update those as well?

🤷 All that means is that those tests will keep running with write_empty_chunks=True, which is no longer the default; that might even be useful for historical reasons. Both True and False are explicitly tested, so I don't think we need to worry.

@jakirkham mentioned this pull request on Dec 1, 2021
@jni force-pushed the default-write-empty-chunks-false branch from 3e7dbea to 6b461a7 on December 2, 2021
@jni (Contributor, Author) commented Dec 2, 2021

This is now rebased on the latest master.

@joshmoore (Member)

Per #901, @d-v-b was working on a blurb for blog/release notes/docs so that users are maximally informed of the change.

@joshmoore (Member)

Capturing initial results from @d-v-b here: "users with 'dense' data will see a ~15% increase in write times".

@d-v-b (Contributor) commented Dec 7, 2021

N.B. that result doesn't include compression time. I will provide improved benchmarking code later today with a more realistic workflow.

@d-v-b (Contributor) commented Dec 7, 2021

Some interesting benchmarking results; see this repo for the code: https://github.com/d-v-b/zarr-benchmarks

Each test writes a 64 MB chunk 128 times, with write_empty_chunks drawn from {True, False} and the write value drawn from {array of 0s, array of random ints}, i.e. 4 different conditions. The plots show histograms whose x-axis is the write duration normalized to the median time required under the (write_empty_chunks=False, write_value=0) condition.

First, with no compressor:
[plot: histograms of normalized write durations, no compressor]

Next, with Blosc compression:
[plot: histograms of normalized write durations, Blosc]

And last, with GZip compression:
[plot: histograms of normalized write durations, GZip]

Takeaways:

  • The overhead of the emptiness check is visible with no compressor and with Blosc compression.
  • With Blosc, writing empty chunks is actually faster than performing an emptiness check and skipping the write.
  • With GZip, handling empty chunks is much faster with write_empty_chunks=False, and the overhead of the emptiness check is not evident when writing "dense" chunks.

Happy to hear suggestions for changing the benchmark / plotting, e.g. maybe we want to show absolute durations instead of relative ones.
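For reference, the timing loop looks roughly like this (a simplified sketch, not the actual code from the zarr-benchmarks repo; the store path and variable names are illustrative):

```python
import itertools
import time

import numpy as np
import zarr

# One 64 MB uint8 chunk, written 128 times under each of the four
# (write_empty_chunks, write_value) conditions described above.
chunk = 64 * 2**20  # chunk length in uint8 elements (64 MB)
empty = np.zeros(chunk, dtype="u1")
dense = np.random.randint(0, 255, size=chunk, dtype="u1")

results = {}
for wec, (name, value) in itertools.product(
        (True, False), (("empty", empty), ("dense", dense))):
    z = zarr.open(
        "data/bench.zarr", mode="w",
        shape=128 * chunk, chunks=chunk, dtype="u1",
        fill_value=0, compressor=None,  # swap in a numcodecs Blosc/GZip to test compression
        write_empty_chunks=wec,
    )
    durations = []
    for i in range(128):
        tic = time.perf_counter()
        z[i * chunk:(i + 1) * chunk] = value
        durations.append(time.perf_counter() - tic)
    results[(wec, name)] = durations
```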

@shoyer (Contributor) commented Dec 7, 2021

What sort of local disk are you using? I can imagine very different results for SSD vs HDD vs cloud object stores. Extra copies are never desirable, but in most cases I would guess the overhead of checking is acceptable.

@d-v-b (Contributor) commented Dec 8, 2021

@shoyer this is on an SSD, so pretty ideal in terms of storage latency.

@d-v-b (Contributor) commented Dec 14, 2021

@joshmoore I added content to the tutorial docs, let me know if you want more or an expanded release note.

I'm not sure the benchmarks are that illuminating or surprising, so maybe we don't really need to involve them?

@joshmoore (Member)

The location looks good. Could I suggest we try to get two or four simple writes with the different flags and inputs? Something like:

```python
data = np.random.randint(0, 255, size=shape, dtype=DTYPE)
arr = zarr.open(zarr.NestedDirectoryStore(store),
                write_empty_chunks=write_empty_chunks, ...)
arr[:] = 255
```

from https://github.com/d-v-b/zarr-benchmarks/blob/main/src/zarr-benchmarks/empty_chunks.py, with some form of marker ("15% slower") for someone to see at a glance.

> I'm not sure the benchmarks are that illuminating or surprising, so maybe we don't really need to involve them?

Agreed. Happy to have them in your repo. I find them pretty fascinating. ;)

@jakirkham (Member)

@d-v-b, did you have a chance to try the other flags Josh mentioned above? 🙂

@d-v-b (Contributor) commented Jan 5, 2022

I added a short benchmark example, but I can't build the docs for some reason (`tox -e docs` yields some inscrutable error from a different part of the codebase).

@d-v-b (Contributor) commented Jan 5, 2022

Oh, and there's a new test failure in an unrelated part of the codebase. Not sure what's going on there.

@joshmoore (Member)

Oy vey. That's a new year for you. The new failure is related to the Categorize filter. I don't remember anything that should impact it. Just starting to look at the docs failure.

(Thanks for the benchmark example!)

@joshmoore (Member)

Ok, @d-v-b, that should fix the docs build. Let's see what's failing on this run.

@jni (Contributor, Author) commented Jan 10, 2022

Hi both, I'm unclear about whether I need to do anything here, please ping me directly if you want me to update something. 😊

@joshmoore (Member)

Sorry for dropping the ball here. Don't think there's anything needed from your side, @jni. Just very unsure what's up with this failing test. Other than that, I'd very much like to get 2.11 out ASAP.

Commit message for the pushed fix:

The call to `np.any(array)` in zarr.util.all_equal triggers the following ValueError:

```
>       return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
E       ValueError: invalid literal for int() with base 10: 'baz'
```

Extending the catch block allows test_array_with_categorize_filter
to pass, but it's unclear if this points to a deeper issue.
@joshmoore (Member)

Ok. Pushed a proposed fix related to @d-v-b's all_equal.

```diff
@@ -670,7 +670,7 @@ def all_equal(value: Any, array: Any):
     # optimized to return on the first truthy value in `array`.
     try:
         return not np.any(array)
-    except TypeError:  # pragma: no cover
+    except (TypeError, ValueError):  # pragma: no cover
```
Review comment (Member):

What causes the ValueError? Is there a particular array type or value that np.any is raising on?

Review comment (Member):

It occurs when a category value (`baz`) gets passed to `int()`.
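For context, a minimal standalone repro of this failure mode (a hypothetical example, not code from the PR; the exception type varies by NumPy version, which is what the extended catch block accounts for):

```python
import numpy as np

# np.any() reduces with logical_or, which is not defined for unicode
# string arrays; depending on the NumPy version this surfaces as a
# TypeError or (as of NumPy 1.22) the ValueError seen above.
arr = np.array(['baz', 'baz', 'foo'], dtype='<U3')
try:
    np.any(arr)
except (TypeError, ValueError) as e:
    print(type(e).__name__, e)
```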

Review comment (Member):

From the failing build:

```
______________________ test_array_with_categorize_filter _______________________

    def test_array_with_categorize_filter():
    
        # setup
        data = np.random.choice(['foo', 'bar', 'baz'], size=100)
        flt = Categorize(dtype=data.dtype, labels=['foo', 'bar', 'baz'])
        filters = [flt]
    
        for compressor in compressors:
    
>           a = array(data, chunks=5, compressor=compressor, filters=filters)

zarr/tests/test_filters.py:172: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
zarr/creation.py:366: in array
    z[...] = data
zarr/core.py:1285: in __setitem__
    self.set_basic_selection(pure_selection, value, fields=fields)
zarr/core.py:1380: in set_basic_selection
    return self._set_basic_selection_nd(selection, value, fields=fields)
zarr/core.py:1680: in _set_basic_selection_nd
    self._set_selection(indexer, value, fields=fields)
zarr/core.py:1732: in _set_selection
    self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
zarr/core.py:1994: in _chunk_setitem
    self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
zarr/core.py:2002: in _chunk_setitem_nosync
    if (not self.write_empty_chunks) and all_equal(self.fill_value, cdata):
zarr/util.py:672: in all_equal
    return not np.any(array)
<__array_function__ internals>:180: in any
    ???
/usr/share/miniconda/envs/minimal/lib/python3.10/site-packages/numpy/core/fromnumeric.py:2395: in any
    return _wrapreduction(a, np.logical_or, 'any', axis, None, out,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

obj = array(['baz', 'baz', 'baz', 'bar', 'foo'], dtype='<U3')
ufunc = <ufunc 'logical_or'>, method = 'any', axis = None, dtype = None
out = None, kwargs = {'keepdims': <no value>, 'where': <no value>}
passkwargs = {}

    def _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs):
        passkwargs = {k: v for k, v in kwargs.items()
                      if v is not np._NoValue}
    
        if type(obj) is not mu.ndarray:
            try:
                reduction = getattr(obj, method)
            except AttributeError:
                pass
            else:
                # This branch is needed for reductions like any which don't
                # support a dtype.
                if dtype is not None:
                    return reduction(axis=axis, dtype=dtype, out=out, **passkwargs)
                else:
                    return reduction(axis=axis, out=out, **passkwargs)
    
>       return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
E       ValueError: invalid literal for int() with base 10: 'baz'

/usr/share/miniconda/envs/minimal/lib/python3.10/site-packages/numpy/core/fromnumeric.py:86: ValueError
```

Review comment (Member):

Interesting. Looks like the error message and type changed from NumPy 1.21 to 1.22. Don't think that was intentional, but could be wrong. Raised an issue (numpy/numpy#20898).

In any event, this workaround seems reasonable in the interim.

Review comment (Member):

Great, thanks! Once this is green, I'll move forward with 2.11.0.

@jakirkham (Member)

> Hi both, I'm unclear about whether I need to do anything here, please ping me directly if you want me to update something. 😊

If we can convince both you, @jni, and @d-v-b to write a few lines for the blogpost (#901), that would be fantastic! 😃

You can just post in that issue if you like.

@joshmoore (Member)

> If we can convince both you, @jni, and @d-v-b to write a few lines for the blogpost (#901)

e.g. I was just mentioning to @MSanKeys963 how cool the use case from https://forum.image.sc/t/data-store-and-library-backend-for-napari-plugin/61779 is ;)

(...still waiting on actions...)

@joshmoore (Member)

Trying to reopen to get tests to complete.

@joshmoore closed this on Jan 26, 2022
@joshmoore reopened this on Jan 26, 2022
@joshmoore (Member)

I've not been able to get the 3.8 and 3.9 tests to run on this PR yet (3-4 retries).

@jakirkham (Member)

Looks like some of the past commits in this PR (though not all) had the same issue.

@joshmoore (Member)

Oddly, py 3.7 is still spinning here compared to #951.

@joshmoore (Member)

Power cycling for CI.

@joshmoore closed this on Feb 4, 2022
@joshmoore reopened this on Feb 4, 2022
@joshmoore (Member)

Holy moly that was an adventure! Looks like we're good to move forward with 2.11. (🍺 🍸 or similar all around this weekend)
