Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FSStore: use ensure_bytes() #1285

Merged
merged 6 commits into from
Jan 19, 2023

Conversation

madsbk
Copy link
Contributor

@madsbk madsbk commented Dec 6, 2022

Fixes #1279

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Dec 6, 2022
@madsbk madsbk changed the title setitem: use ensure_bytes FSStore: use ensure_bytes() Dec 6, 2022
@codecov
Copy link

codecov bot commented Dec 6, 2022

Codecov Report

Merging #1285 (cfab20c) into main (385b5d3) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main     #1285   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           35        35           
  Lines        14388     14404   +16     
=========================================
+ Hits         14388     14404   +16     
Impacted Files Coverage Δ
zarr/storage.py 100.00% <100.00%> (ø)
zarr/tests/test_core.py 100.00% <100.00%> (ø)
zarr/util.py 100.00% <100.00%> (ø)

@madsbk
Copy link
Contributor Author

madsbk commented Dec 6, 2022

@normanrz can you check if this fixes the issue?

@normanrz
Copy link
Contributor

normanrz commented Dec 6, 2022

Yes! That looks good to me. Thanks for the quick fix!

@rabernat
Copy link
Contributor

rabernat commented Dec 6, 2022

Thanks for this PR @madsbk! I recently hit a similar issue involving zero-dimensional arrays (which have compression turned off by default).

I read the discussion in #1279, and I am left wondering why we leave it up to the store to call ensure_bytes. The abstraction in Zarr is very clear: the store is a key-value store where the keys are strings and the values are bytes. Therefore, IMO, it should be up to zarr.core to ensure that the data it is sending to the store are always bytes, not the store implementation.

What do others think about this?

@madsbk
Copy link
Contributor Author

madsbk commented Dec 6, 2022

The abstraction in Zarr is very clear: the store is a key-value store where the keys are strings and the values are bytes.

We changed this with #934, now we also allow NDArrayLike instances. The motivation is to support memory objects such as device memory. E.g., KvikIO implements a GDSStore that use GDS to speedup IO between device and host memory.

Having said that, I was surprised that fsspec.FSMap didn't support F-ordered numpy arrays. It might be possible to fix the issue in fsspec.FSMap as well.

@madsbk madsbk marked this pull request as ready for review December 6, 2022 13:31
@rabernat
Copy link
Contributor

rabernat commented Dec 6, 2022

We changed this with #934, now we also allow NDArrayLike instances.

I see that this has been implemented in the zarr-python implementation. And I thank everyone who worked on this amazing feature! 🚀 But that doesn't mean it is consistent with what the spec says:

The store interface defines a set of operations involving keys and values. In the context of this interface, a key is any string... A value is a sequence of bytes.

It is certainly straightforward to convert an uncompressed ndarray into a sequence of bytes. So maybe I am overthinking this... Or maybe the spec needs to be updated to reflect these new implementation changes? 🤔 My only goal here is to make sure that the spec and the implementation are consistent with each other. Otherwise we risk interoperability with other zarr implementations.

If we are fine with saying "a value is a sequence of bytes or a raw uncompressed array" then I am fine with leaving it up to the store implementation whether to convert arrays to bytes.

@madsbk
Copy link
Contributor Author

madsbk commented Dec 6, 2022

Good point and I agree, the spec and implementation should match up!

I must admit I am reading "A value is a sequence of bytes" in a more general sense, not as a reference to the bytes class specifically, which will also exclude types such as bytearray or memoryview.

@jakirkham
Copy link
Member

Hey Mads, thanks for following up on this issue.

That said, don't think we should be making this change. In general it is already the expectation in zarr-python & Numcodecs that Python Buffer Protocol supporting objects are used (memoryviews, ndarrays, bytearrays, bytes, etc.). Part of the reason for this is it allows zero-copy operations on the data (like casting, reshaping, etc.). It also can support things like memory-mapping natively (as mmap supports the Python Buffer Protocol).

How to word this in the spec is nuanced. The spec itself is not meant to be Python specific, but to support other languages as well. They may not have bytearray or memoryview for example or if they do it may be named differently and/or behave differently. So think we should consider language agnostic wording. It sounds like this is how you understood it as well.

Perhaps others who found this wording in the spec confusing can chime in on how to clarify it. Maybe we can say "a value contains binary data"?

@madsbk
Copy link
Contributor Author

madsbk commented Dec 7, 2022

@jakirkham, I am not sure I understand. Are you saying that we shouldn't use ensure_bytes() in FSStore (this PR) or that the spec should stay language agnostic? 

I have no strong opinion about the specifications, but I think it is important that we allow NDArrayLike objects in values. If we move ensure_bytes() to the zarr.core level, we are effectively saying that a value is always of type bytes

@jakirkham
Copy link
Member

Could we call ensure_ndarray_like instead?

@normanrz
Copy link
Contributor

normanrz commented Dec 7, 2022

Could we call ensure_ndarray_like instead?

I'm not sure that would work. I think the trick is that at some point, tobytes needs to be called with value.tobytes(order='A') to preserve the order. https://numpy.org/doc/stable/reference/generated/numpy.ndarray.tobytes.html

@jakirkham
Copy link
Member

jakirkham commented Dec 8, 2022

There's ensure_contiguous_ndarray_like, which can flatten the array with respect to order

@jstriebel
Copy link
Member

Since most other stores use ensure_bytes atm, I think this fix is valid for the moment, and important to fix an existing bug. I agree, that there should be a discussion whether every store should do this or not, but this should not block this PR IMO.

Copy link
Contributor

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with this fix for now, since it seems consistent with what most other stores do.

We can revisit the question of whether this ensure_bytes logic should live elsewhere in the future.

@jakirkham
Copy link
Member

Sorry to be a stickler. The main reason I bring up ensure_bytes here is this typically incurs the cost of a copy (with additional memory being used). Given this is happening with multiple memory segments that are being written together, this could result in more significant memory usage. Especially as we start to rely more on batching multiple reads & writes. While it may not always be possible to avoid these copies, they could be deferred until absolutely necessary (with the resulting bytes objects quickly discarded) to keep memory usage as low as possible (since not all memory segments would needing copying at once). Hope that clarifies the concern a bit more.

zarr/storage.py Outdated Show resolved Hide resolved
@normanrz
Copy link
Contributor

Any chance we could get this merged and released?

@rabernat
Copy link
Contributor

I'd like to hear from @jakirkham - John, would you be you okay with addressing ensure_contiguous_ndarray_like vs. ensure_bytes in a follow-on PR (where we can also address it for all the other stores that call ensure_bytes)? Or do you think this needs to be resolved before we merge?

@normanrz
Copy link
Contributor

normanrz commented Dec 19, 2022

I think @jakirkham's latest suggestion is a straightforward compromise that could be incorporated as part of this PR.

@madsbk
Copy link
Contributor Author

madsbk commented Dec 19, 2022

I think @jakirkham's latest suggestion is a straightforward compromise that could be incorporated as part of this PR.

Done

@joshmoore
Copy link
Member

Re-opening to try to fix the codecov build.

@joshmoore joshmoore closed this Dec 21, 2022
@joshmoore joshmoore reopened this Dec 21, 2022
@joshmoore
Copy link
Member

Now green. @jakirkham: does this match your expectations?

If so, happy to get it in and prep a release during the holidays. (Help with the release notes will be appreciated!)

@rabernat
Copy link
Contributor

If we're doing a release, let's get #1304 in too.

@joshmoore
Copy link
Member

If we're doing a release, let's get #1304 in too.

see #1309 (comment)

Leaving this PR for @jakirkham to review.

Copy link
Member

@jakirkham jakirkham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Mads! 🙏

LGTM. Made a few comments that we can follow up on separately

@@ -696,3 +701,28 @@ def all_equal(value: Any, array: Any):
# using == raises warnings from numpy deprecated pattern, but
# using np.equal() raises type errors for structured dtypes...
return np.all(value == array)


def ensure_contiguous_ndarray_or_bytes(buf) -> Union[NDArrayLike, bytes]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to move this to numcodecs.compat, but that can be follow-on work

Comment on lines +1400 to +1403
values = {
self._normalize_key(key): ensure_contiguous_ndarray_or_bytes(val)
for key, val in values.items()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something we may consider is delaying the instantiation of the dict so these copies occur as these values are requested. This can be a follow-on though

@jakirkham jakirkham merged commit b9e9f5a into zarr-developers:main Jan 19, 2023
@joshmoore
Copy link
Member

@jakirkham: ok to get this in the 2.13.x series?

@madsbk: I'll get this out ASAP but might wait on any emergency needs for re: #1324

Suggestions for the docs/release.rst statement would be welcome. 🙏🏽

@jakirkham
Copy link
Member

Suggestions for the docs/release.rst statement would be welcome. 🙏🏽

Oops sorry should have asked for that here (or just added it). Submitted PR ( #1325 )

@madsbk madsbk deleted the FSStore_ensure_bytes branch January 20, 2023 09:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs release notes Automatically applied to PRs which haven't added release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reading/writing F-order data w/o compression is broken
6 participants