Allow disabling filling of missing chunks #489

willirath · 2019-10-22T14:47:16Z

This is a first stab at solving #486 by overriding filling of missing chunks.

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
Changes documented in docs/release.rst
Test coverage is 100% (Coveralls passes)

Not sure about the following TODOs:

New/modified features documented in docs/tutorial.rst
Docs build locally (e.g., run tox -e docs)
AppVeyor and Travis CI passes

jrbourbeau

Thanks for the PR @willirath!

As is, Array._fill_missing_chunk only exists if Array.set_options() has been called (giving rise to the CI test failures). What do you think about adding

# initialize options
self.set_options()

to the bottom of Array.__init__ to initialize the default option values?

Also, it'd be great if there was a test to ensure the fill_missing_chunk= parameter results in the expected behavior
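A minimal sketch of the failure mode described above. `ToyArray` and its `init_options` flag are hypothetical stand-ins for illustration, not zarr's actual `Array` class. The private attribute is only created inside `set_options()`, so it is missing unless the constructor calls that method.

```python
class ToyArray:
    """Hypothetical stand-in for zarr's Array; not the real class."""

    def __init__(self, init_options=True):
        # init_options is an illustrative switch, not a real zarr parameter.
        if init_options:
            # initialize options with their default values
            self.set_options()

    def set_options(self, fill_missing_chunk=True):
        # The private attribute is created here and nowhere else.
        self._fill_missing_chunk = fill_missing_chunk


broken = ToyArray(init_options=False)  # constructor skips set_options()
fixed = ToyArray()                     # constructor calls set_options()

print(hasattr(broken, "_fill_missing_chunk"))
print(fixed._fill_missing_chunk)
```

With the call in `__init__`, any code path that reads `_fill_missing_chunk` finds the default value even if the user never calls `set_options()` explicitly.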

willirath · 2019-10-25T09:49:10Z

I've added the initialization of the self._fill_missing_chunk attribute.

Regarding tests: where should I add them? In zarr/tests/test_core.py?

I'd go for:

- Ensure that _fill_missing_chunks is set after Array initialization.
- Ensure that, after removing a chunk from a store (is the default DictStore sufficient?), the KeyError
  - is raised for _fill_missing_chunks=False
  - is not raised otherwise
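The two chunk-removal checks could be sketched roughly as follows. This uses a plain `dict` as the store and a hypothetical `read_chunk` helper instead of zarr's real classes, so it only illustrates the intended behaviour, not the actual test code in this PR.

```python
def read_chunk(store, key, fill_value, fill_missing_chunks=True):
    """Toy chunk reader (hypothetical; not zarr's implementation)."""
    try:
        return store[key]
    except KeyError:
        if fill_missing_chunks:
            # Default behaviour: silently substitute the fill value.
            return fill_value
        # Strict behaviour: let the KeyError propagate to the caller.
        raise


def test_missing_chunk_is_filled_by_default():
    store = {"0": [1, 2], "1": [3, 4]}
    store.pop("1")  # simulate a chunk that went missing
    assert read_chunk(store, "1", fill_value=[0, 0]) == [0, 0]


def test_missing_chunk_raises_when_filling_disabled():
    store = {"0": [1, 2], "1": [3, 4]}
    store.pop("1")
    try:
        read_chunk(store, "1", fill_value=[0, 0], fill_missing_chunks=False)
    except KeyError:
        pass
    else:
        raise AssertionError("expected KeyError for the missing chunk")


test_missing_chunk_is_filled_by_default()
test_missing_chunk_raises_when_filling_disabled()
```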

Regarding the structure: I'm not at all familiar with the internal design of zarr-python. Is this decentralized conditional raising really the way to go, or should this be abstracted away?

joshmoore · 2020-04-27T10:22:35Z

Going to re-open to try to get travis green. Coveralls will stay red until a test is added.

codecov · 2021-02-18T16:56:00Z

Codecov Report

Merging #489 (c7a2b16) into master (0fc6a1e) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #489   +/-   ##
=======================================
  Coverage   99.94%   99.94%           
=======================================
  Files          32       33    +1     
  Lines       11256    11285   +29     
=======================================
+ Hits        11250    11279   +29     
  Misses          6        6

Impacted Files               Coverage Δ
zarr/core.py                 100.00% <100.00%> (ø)
zarr/tests/test_missing.py   100.00% <100.00%> (ø)

joshmoore · 2021-02-18T17:01:22Z

Hi @willirath, I've updated this branch and all existing tests are passing. Are you still interested in taking it forward?

willirath · 2021-02-19T08:55:02Z

Thanks for pinging me. I'm still interested. It'll take a few days, though.

bolliger32 · 2021-04-20T23:03:02Z

@willirath just checking if you've had any time to work on this lately. This functionality would be super helpful and thanks for filing the PR! I'm happy to try to work on these tests if you'd like someone else to push it forward.

pep8speaks · 2021-04-21T09:43:35Z

Hello @willirath! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-03-07 22:22:58 UTC

willirath · 2021-04-21T19:22:14Z

@bolliger32 I've found some time to continue with this today. (:+1: for pinging me again. Adding some urgency usually helps finishing this kind of work.)

@joshmoore and @jrbourbeau I've added two tests relying on the zarr.create.array function for creating arrays with a MemoryStore from which it's easy to pop chunks.

From my PoV, this is ready for review.

bolliger32 · 2021-04-21T19:24:17Z

@willirath amazing! Many thanks! 🙏

willirath · 2021-12-23T11:39:02Z

Pinging for review again.

joshmoore · 2021-12-23T15:42:19Z

Thanks for the ping, @willirath, definitely time. I personally don't foresee too much review capacity during the holidays, but will leave the tab open. Obviously, if anyone else could jump in, that'd be wonderful! 🎆

Just in case you missed them, #853 and the preceding #738 from @d-v-b might be of interest.

joshmoore · 2022-01-06T12:17:45Z

Ah, nice. With a clearer 2022 head, I notice just how complementary this is to @d-v-b's #753 and @jni's #853. In Zarr 2.11, the default will become to not serialize empty (i.e. fill_value_only) chunks. With this PR, the user can prevent empty chunks from being deserialized. 👍 I'm going to update the branch in order to trigger another round of tests.

(I slightly wonder if there's not a need to unify settings/options/arguments but that's likely out of scope)

joshmoore · 2022-03-07T22:41:13Z

Still green after an update to 2.11.1.

Probably the biggest question from my side is what else should fall into the set_option category to know if the design is right.

joshmoore · 2022-05-04T11:18:27Z

The more I come back to this, @willirath, the more I feel that either values like write_empty_chunks should also be part of set_options or fill_missing_chunks should be an __init__ argument like write_empty_chunks.

(This is orthogonal to whether these values should actually be .zarray metadata; in that case, they should perhaps be @property values like fill_value.)

tomwhite · 2023-08-14T11:03:20Z

> The more I come back to this, @willirath, the more I feel that either values like write_empty_chunks should also be part of set_options or fill_missing_chunks should be an __init__ argument like write_empty_chunks.

I have a need for this feature (see also #486 (comment)). It seems that adding fill_missing_chunks as an __init__ argument rather than introducing a new set_options mechanism is the simplest way forward. Happy to provide a PR if no one else is working on it.

d-v-b · 2024-02-13T17:01:27Z

> > The more I come back to this, @willirath, the more I feel that either values like write_empty_chunks should also be part of set_options or fill_missing_chunks should be an __init__ argument like write_empty_chunks.
>
> I have a need for this feature (see also #486 (comment)). It seems that adding fill_missing_chunks as an __init__ argument rather than introducing a new set_options mechanism is the simplest way forward. Happy to provide a PR if no one else is working on it.

Agreed with setting this parameter in __init__. We don't need to add set_options for this. PR would be welcome @tomwhite
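A rough sketch of what the constructor-argument design could look like. `ToyArray`, `get_chunk`, and `set_chunk` are hypothetical names for illustration. Only the keyword names `write_empty_chunks` and `fill_missing_chunks` come from the discussion, and the behaviour shown is an assumption about the intent, not zarr's implementation.

```python
class ToyArray:
    """Hypothetical array wrapper; illustrates the __init__-argument design only."""

    def __init__(self, store, fill_value=0, write_empty_chunks=True,
                 fill_missing_chunks=True):
        self._store = store
        self._fill_value = fill_value
        # Both behaviours are fixed at construction time; no set_options().
        self._write_empty_chunks = write_empty_chunks
        self._fill_missing_chunks = fill_missing_chunks

    def set_chunk(self, key, value):
        if value == self._fill_value and not self._write_empty_chunks:
            # Leave chunks that only hold the fill value out of the store.
            self._store.pop(key, None)
        else:
            self._store[key] = value

    def get_chunk(self, key):
        if key in self._store:
            return self._store[key]
        if self._fill_missing_chunks:
            return self._fill_value
        raise KeyError(key)


lenient = ToyArray({"0": 7})
strict = ToyArray({"0": 7}, fill_missing_chunks=False)

print(lenient.get_chunk("1"))  # missing chunk is replaced by fill_value
print(strict.get_chunk("0"))   # present chunk is returned as stored
```

Keeping both switches as keyword arguments means the read-side and write-side options are configured in the same place, which is the consistency point raised earlier in the thread.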

riley-brady · 2024-03-14T22:16:13Z

Bumping this if still relevant. We are having issues with the "missing rectangles" with S3, similar to pangeo-data/pangeo#691.

This seems to persist with using fsspec with fs.get_mapper("s3://*.zarr"), but seems to go away if we use s3fs.S3FileSystem and fs.get_mapper(...). It seems like a rate-limit issue on S3, just like with GCS. I am wondering if there was an upstream s3fs fix as well (as mentioned in the above issue thread for gcsfs).

It would be nice to fix upstream in Zarr if I'm understanding the issue correctly.

EDIT:

We were able to fix this issue on our AWS Sagemaker instances a while back by using this in the header:

import boto3
import s3fs
session = boto3.Session()
credentials = session.get_credentials()
fs = s3fs.S3FileSystem(
    key=credentials.access_key,
    secret=credentials.secret_key,
    token=credentials.token,
)

I just fixed the most recent manifestation of this issue, which was showing up on our AWS Batch jobs run through Kedro with distributed.LocalCluster() on on-demand EC2 instances reading from S3. I used a setup similar to the one above, but also tested adding skip_instance_cache. It seems to have fully resolved our issues.

import boto3
import s3fs
session = boto3.Session()
credentials = session.get_credentials()
fs = s3fs.S3FileSystem(
    key=credentials.access_key,
    secret=credentials.secret_key,
    token=credentials.token,
    # This seems to help with writing
    # https://github.com/pydata/xarray/issues/3831#issuecomment-1768393788
    skip_instance_cache=True,
)

The above code replaces the following setup we were using. I would assume fsspec calls s3fs under the hood when s3 is declared as the protocol. I also don't think this is a credential timeout issue, since it was happening on 3-minute jobs (although many were being launched in parallel and targeting similar zarr stores, which is why I think it is a rate-limiting issue). So I'm surprised that just switching to s3fs fixes the problem for us.

fs = fsspec.filesystem(self._protocol, **self._storage_options)

willirath added 2 commits October 22, 2019 16:45

Allow disabling filling of missing chunks

0d2217b

Fix typo in docstring

c627aaf

jrbourbeau reviewed Oct 24, 2019

Ensure that options are always initialized

a73e248

joshmoore closed this Apr 27, 2020

joshmoore reopened this Apr 27, 2020

alimanfoo added the low-hanging-fruit label Sep 24, 2020

Merge branch 'master' into override-fill-missing-chunks

4eab3f2

bolliger32 mentioned this pull request Apr 21, 2021

allow you to raise error on missing zarr chunks with open_dataset/open_zarr pydata/xarray#5197

Closed

joshmoore mentioned this pull request Apr 21, 2021

How to prevent Zarr from returning NaN for missing chunks? #486

Open

willirath and others added 5 commits April 21, 2021 10:41

Add test for array with missing chunk

75139ad

Don't raise on setting items on missing chunks

b69460f

Also test for slicing

806e0be

Merge branch 'master' into override-fill-missing-chunks

decc5da

Satisfy linter

843410c

willirath and others added 7 commits April 21, 2021 11:44

Satisfy linter

4d5d532

Fix test logic

6690a6d

Revert black

690e097

Revert black

55ae353

Don't raise on setting items on missing chunks

5a42221

Add test for zero-dim array

0a33f44

Merge branch 'master' into override-fill-missing-chunks

892c9c4

willirath marked this pull request as ready for review April 21, 2021 19:12

willirath requested a review from jrbourbeau April 22, 2021 07:59

joshmoore added 2 commits January 6, 2022 13:17

Merge branch 'master' into override-fill-missing-chunks

14be119

Merge branch 'master' into override-fill-missing-chunks

c7a2b16

joshmoore added the good-first-issue label and removed the low-hanging-fruit label Nov 23, 2022

joshmoore mentioned this pull request Aug 25, 2023

Accessing restricted groups over s3 produces zero filled array #1504

Open

maxrjones mentioned this pull request Feb 13, 2024

Missing Data in Downscaled CMIP6 Datasets for China Region carbonplan/cmip6-downscaling#323

Open
