Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow disabling filling of missing chunks #489

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

willirath
Copy link

@willirath willirath commented Oct 22, 2019

This is a first stab at solving #486 by overriding filling of missing chunks.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • Test coverage is 100% (Coveralls passes)

Not sure about the following todo's:

  • New/modified features documented in docs/tutorial.rst
  • Docs build locally (e.g., run tox -e docs)
  • AppVeyor and Travis CI passes

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @willirath!

As is, Array._fill_missing_chunk only exists if Array.set_options() has been called (giving rise to the CI test failures). What do you think about adding

# initialize options
self.set_options()

to the bottom of Array.__init__ to initialize the default option values?

Also, it'd be great if there was a test to ensure the fill_missing_chunk= parameter results in the expected behavior

@willirath
Copy link
Author

I've added the initialization of the self._fill_missing_chunk attribute.

Regarding tests: Where should I add it? zarr/tests/test_core.py

I'd go for:

  • Ensure that _fill_missing_chunks is set after Array initialization.
  • Ensure that removing a chunk from a store (sufficient to use the default DictStore?) the KeyError
    • is raised for _fill_missing_chunks=False
    • is not raised otherwise

Regarding the structure: I'm not at all familiar with the internal design of zarr-python. Is this decentralized conditional raising really the way to go, or should this be abstracted away?

@joshmoore
Copy link
Member

Going to re-open to try to get travis green. Coveralls will stay red until a test is added.

@codecov
Copy link

codecov bot commented Feb 18, 2021

Codecov Report

Merging #489 (c7a2b16) into master (0fc6a1e) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #489   +/-   ##
=======================================
  Coverage   99.94%   99.94%           
=======================================
  Files          32       33    +1     
  Lines       11256    11285   +29     
=======================================
+ Hits        11250    11279   +29     
  Misses          6        6           
Impacted Files Coverage Δ
zarr/core.py 100.00% <100.00%> (ø)
zarr/tests/test_missing.py 100.00% <100.00%> (ø)

@joshmoore
Copy link
Member

Hi @willirath, I've updated this branch and all existing tests are passing. Are you still interested in taking it forward?

@willirath
Copy link
Author

Thanks for pinging me. I'm still interested. It'll take a few days, though.

@bolliger32
Copy link

@willirath just checking if you've had any time to work on this lately. This functionality would be super helpful and thanks for filing the PR! I'm happy to try to work on these tests if you'd like someone else to push it forward.

@pep8speaks
Copy link

pep8speaks commented Apr 21, 2021

Hello @willirath! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-03-07 22:22:58 UTC

@willirath willirath marked this pull request as ready for review April 21, 2021 19:12
@willirath
Copy link
Author

@bolliger32 I've found some time to continue with this today. (:+1: for pinging me again. Adding some urgency usually helps finishing this kind of work.)

@joshmoore and @jrbourbeau I've added two tests relying on the zarr.create.array function for creating arrays with a MemoryStore from which it's easy to pop chunks.

From my PoV, this is ready for review.

@bolliger32
Copy link

@willirath amazing! Many thanks! 🙏

@willirath
Copy link
Author

Pinging for review again.

@joshmoore
Copy link
Member

joshmoore commented Dec 23, 2021

Thanks for the ping, @willirath, definitely time. I personally don't foresee too much review capacity during the holidays, but will leave the tab open. Obviously, if anyone else could jump in, that'd be wonderful! 🎆

Just in case you missed them, #853 and the proceeding #738 from @d-v-b might be of interest.

@joshmoore
Copy link
Member

Ah, nice. With a clearer 2022 head, I notice just how complementary this is to @d-v-b's #753 and @jni's #853. In Zarr 2.11, the default will become to not serialize empty (i.e. fill_value_only) chunks. With this PR, the user can prevent empty chunks from being deserialized. 👍 I'm going to update the branch in order to trigger another round of tests.

(I slightly wonder if there's not a need to unify settings/options/arguments but that's likely out of scope)

@joshmoore
Copy link
Member

Still green after an update to 2.11.1.

Probably the biggest question from my side is what else should fall into the set_option category to know if the design is right.

@joshmoore
Copy link
Member

The more I come back to this, @willirath, the more I feel that either values like write_empty_chunks should also be part of set_options or fill_missing_chunks should be an __init__ argument like write_empty_chunks.

(This is orthogonal to whether these values should actually be .zarray metadata, and in that case, they might should be @property values like fill_value)

@joshmoore joshmoore added good-first-issue Good place to get started as a new contributor. and removed low-hanging-fruit labels Nov 23, 2022
@tomwhite
Copy link

The more I come back to this, @willirath, the more I feel that either values like write_empty_chunks should also be part of set_options or fill_missing_chunks should be an __init__ argument like write_empty_chunks.

I have a need for this feature (see also #486 (comment)). It seems that adding fill_missing_chunks as an __init__ argument rather than introducing a new set_options mechanism is the simplest way forward. Happy to provide a PR if no one else is working on it.

@d-v-b
Copy link
Contributor

d-v-b commented Feb 13, 2024

The more I come back to this, @willirath, the more I feel that either values like write_empty_chunks should also be part of set_options or fill_missing_chunks should be an __init__ argument like write_empty_chunks.

I have a need for this feature (see also #486 (comment)). It seems that adding fill_missing_chunks as an __init__ argument rather than introducing a new set_options mechanism is the simplest way forward. Happy to provide a PR if no one else is working on it.

Agreed with setting this parameter in __init__. We don't need to add set_options for this. PR would be welcome @tomwhite

@riley-brady
Copy link

riley-brady commented Mar 14, 2024

Bumping this if still relevant. We are having issues with the "missing rectangles" with S3, similar to pangeo-data/pangeo#691.

This seems to persist with using fsspec with fs.get_mapper("s3://*.zarr"), but seems to go away if we use s3fs.S3FileSystem and fs.get_mapper(...). It seems like a rate-limit issue on S3, just like with GCS. I am wondering if there was an upstream s3fs fix as well (as mentioned in the above issue thread for gcsfs).

It would be nice to fix upstream in Zarr if I'm understanding the issue correctly.


EDIT:

We were able to fix this issue on our AWS Sagemaker instances awhile back with using this in the header:

import boto3
import s3fs
session = boto3.Session()
credentials = session.get_credentials()
fs = s3fs.S3FileSystem(
    key=credentials.access_key,
    secret=credentials.secret_key,
    token=credentials.token,
)

I just fixed the most recent manifestation of this issue that was showing up on our AWS Batch jobs through Kedro with distributed.LocalCluster() on on-demand EC2 instances reading from S3. I used a similar setup as above, but was testing adding the skip_instance_cache as well. It seems to have fully resolved our issues.

session = boto3.Session()
credentials = session.get_credentials()
fs = s3fs.S3FileSystem(
    key=credentials.access_key,
    secret=credentials.secret_key,
    token=credentials.token,
    # This seems to help with writing 
    # https://github.com/pydata/xarray/issues/3831#issuecomment-1768393788
    skip_instance_cache=True,
)

The above code replaces the following setup we were using. I would assume fsspec calls s3fs under the hood when s3 is declared as the protocol. And I don't think this is a credential timeout issue since this was happening on 3-minute jobs (although many were being launched in parallel at once and targeting similar zarr stores, hence why I thin kit is a rate-limiting issue). So I'm surprised that just switching to s3fs fixes the problem for us.

fs = fsspec.filesystem(self._protocol, **self._storage_options)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good-first-issue Good place to get started as a new contributor.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants